Use Python to download files from websites

Hello everyone,
I would like to share a few ways to use Python to download files from a website.
Files are usually returned when you click a link, but sometimes they are embedded in the page itself, for instance an image or a PDF.
We will use the BeautifulSoup library to parse web pages and make them easier to navigate, but the actual downloading is done by the urllib2 library, which ships with Python 2 (in Python 3 it was split into urllib.request and urllib.error).

Basics

First we will have a look at the urllib2 library. It opens web pages and files from the web using URLs.
To open an arbitrary URL, you can use

  import urllib2
  resp = urllib2.urlopen( 'http://www.testurl.com' )

The response is a file-like object: call read() on it to get the body sent back by the website, and info() to get the response headers.
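As a minimal sketch of what the response object offers, the snippet below reads the body and one header. It assumes Python 3, where urlopen lives in urllib.request instead of urllib2, and it opens a data: URL as a stand-in for a real site so that it runs without a network connection. With urllib2 on Python 2, read() and info() are used the same way.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Assumption: a data: URL stands in for a real address such as
# 'http://www.testurl.com', so the example needs no network access.
resp = urlopen('data:text/plain;charset=utf-8,hello%20world')

body = resp.read()       # raw bytes of the response body
headers = resp.info()    # header container, supports dict-style lookup
content_type = headers['Content-Type']

print(body)
print(content_type)
resp.close()
```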
Next, we will use the BeautifulSoup library to make the page easier to inspect. It is a simple library that takes the pain out of navigating the HTML of a web page. You can get it from here: http://www.crummy.com/software/BeautifulSoup/#Download
The library is sometimes tricky to install, so you can instead download the tarball directly from http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/, unpack it, and copy the bs4 folder into your project folder.
Then import the library into Python with

  from bs4 import BeautifulSoup

First, we will go through the basics of BeautifulSoup and how it makes navigating a web page’s code easy.
A soup is created from the content of the response returned by urllib2.

  soup = BeautifulSoup( resp.read(), 'html.parser' )

Now for some magic: you can easily query the soup by tag name. For instance, to find all hyperlinks, you can use

  links = soup.find_all( 'a' )  # links is a list-like ResultSet of all hyperlink tags

Getting the URL out of each tag in that list is as easy as:

  for link in links:  # Process each link and get its URL
      url = link.get( 'href' )
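
The href values collected this way are often relative paths, page anchors or missing altogether, so they usually need some cleaning before they can be passed to urlopen. The following is a minimal sketch of that step, assuming Python 3 (on Python 2 urljoin lives in the urlparse module). The helper name absolute_links and the sample values are made up for illustration.

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

def absolute_links(page_url, hrefs):
    """Return absolute URLs for the href values found on page_url."""
    result = []
    for href in hrefs:
        if not href:                      # link.get('href') may return None
            continue
        if href.startswith(('#', 'javascript:', 'mailto:')):
            continue                      # not something we can download
        result.append(urljoin(page_url, href))
    return result

# Hypothetical values, as they might come out of link.get('href')
hrefs = ['/files/report.pdf', 'images/logo.png', None, '#top',
         'http://example.org/data.csv']
urls = absolute_links('http://www.testurl.com/docs/index.html', hrefs)
for url in urls:
    print(url)
```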

Downloading files

Now let us see how to download files.
Case 1
The file is embedded in the page HTML. Take the example of a JPEG image embedded in a page.
We can first find the images in the page using BeautifulSoup:

  images = soup.find_all( 'img' )

You can get the URL of each image from the value of its ‘src’ attribute:

  for image in images:  # Process each image and get its URL
      filename = image.get( 'src' )

To fetch the file, request that URL (if ‘src’ is a relative path, it first has to be joined with the URL of the page):

  data = urllib2.urlopen( filename ).read()

The final step is saving the file:

  with open( "myfile.jpeg", "wb" ) as code:
      code.write( data )

And done!!!
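For larger files it can be better not to hold the whole body in memory. The sketch below, which assumes Python 3 (urllib.request in place of urllib2), streams the response to disk in chunks and derives a local file name from the URL instead of hard-coding "myfile.jpeg". The helper names filename_from_url and save_url, and the sample URL, are hypothetical.

```python
import posixpath
import shutil
from urllib.parse import urlparse, unquote
from urllib.request import urlopen  # urllib2.urlopen on Python 2

def filename_from_url(url, default='download'):
    """Use the last path segment of the URL as a local file name."""
    name = posixpath.basename(urlparse(url).path)
    return unquote(name) or default

def save_url(url, path, chunk_size=64 * 1024):
    """Stream the response body of url into the file at path."""
    with urlopen(url) as resp, open(path, 'wb') as out:
        shutil.copyfileobj(resp, out, chunk_size)

print(filename_from_url('http://www.testurl.com/img/My%20cat.jpeg?size=big'))
```

A call such as save_url( url, filename_from_url( url ) ) would then fetch and store one image.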
Case 2
There is another case, when the file is returned on clicking a link in a browser. In our case, it is not a click but a request made with

  res = urllib2.urlopen( url )

Now we need to identify whether the response is a file. How do we do that?
The response headers for a file are somewhat different from those for a web page. They include a line like

  Content-Disposition: attachment; filename="filename.extension"

We can access the response header using

  header = res.info()

and check whether the response has a Content-Disposition header in it.
It is as simple as doing

  if 'Content-Disposition' in str( header ):
      print( 'It is a file' )
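
The same check can be done without converting the headers to a string. This is a minimal sketch, assuming Python 3, where res.info() returns an http.client.HTTPMessage, a subclass of email.message.Message. The headers are built by hand here so that the example runs offline, and the helper name is_attachment is made up. Keep in mind that a server may send a file without this header, so the check is a hint rather than a guarantee.

```python
from email.message import Message

def is_attachment(headers):
    """True if the headers mark the response as a file to download."""
    disposition = headers.get('Content-Disposition', '')
    return disposition.strip().lower().startswith('attachment')

# Hand-built headers standing in for res.info(), so this runs offline
file_headers = Message()
file_headers['Content-Disposition'] = 'attachment; filename="filename.extension"'
page_headers = Message()
page_headers['Content-Type'] = 'text/html'

print(is_attachment(file_headers))
print(is_attachment(page_headers))
```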

Now to download and save it, we can proceed the same way as in the previous case, reading the body from the response first:

  with open( "myfile", "wb" ) as code:
      code.write( res.read() )

You can get the file name as well from the Content-Disposition header.
A single line of Python does that:

  filename = res.info()['Content-Disposition'].split( '=' )[-1].strip( '"' )

Basically, this line looks up the Content-Disposition value in the response headers and splits it at the ‘=’ sign. That gives a list of strings, and we are interested only in the last one, i.e. “filename.extension” with its quotes. We drop the quotes using .strip( '"' ) and we are done: we have the file name of the attachment.
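Splitting at ‘=’ assumes that the file name is the last parameter in the header. As a hedged alternative, assuming Python 3, the headers object returned by res.info() inherits get_filename() from email.message.Message, which parses the parameters properly. The sketch below builds the header by hand, with a made-up trailing size parameter, to compare both approaches offline.

```python
from email.message import Message

def attachment_name(headers):
    """Return the filename parameter of Content-Disposition, or None."""
    return headers.get_filename()

# Hand-built header standing in for res.info(); the size parameter is
# a hypothetical extra that a server might append after the file name.
headers = Message()
headers['Content-Disposition'] = 'attachment; filename="report.pdf"; size=1024'

naive = headers['Content-Disposition'].split('=')[-1].strip('"')
parsed = attachment_name(headers)
print(naive)    # the split approach picks up the last parameter instead
print(parsed)
```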
One important thing to note is that the file name may arrive as File%20name.txt (for File name.txt), because URLs are percent-encoded: characters that are not allowed in a URL, such as a space, are replaced by % followed by their hexadecimal ASCII code. See http://www.w3schools.com/tags/ref_urlencode.asp for more details.
This is easily fixed with

  import urllib
  filename = urllib.unquote( filename )
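
Since the file name comes from the server, it may be worth stripping any directory part after decoding, so that the file cannot end up outside the intended folder. A minimal sketch, assuming Python 3 (where unquote lives in urllib.parse); the helper name safe_filename and its default value are hypothetical.

```python
import posixpath
from urllib.parse import unquote  # urllib.unquote on Python 2

def safe_filename(raw, default='download'):
    """Decode %XX escapes and drop any directory part of the name."""
    name = unquote(raw).replace('\\', '/')
    return posixpath.basename(name) or default

print(safe_filename('File%20name.txt'))
print(safe_filename('..%2F..%2Fetc%2Fpasswd'))
```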

That’s all. We can now download and save files from websites using Python 🙂

9 Responses

  1. Hi, I don’t believe this code would work for a website that requires users to log in, right?
    I have a site I’ve logged into using mechanize, and then found a desired file using beautiful soup. The issue I’m having now is the file link on the website is actually a request like:
    https://www.somewebsite.com/a/document.html?key=2618380
    Not sure how to account for this, still new to python and these modules, and I’m realizing my understanding of headers and cookies and authorization schemes is not as deep as this requires.

  2. Raptors95 says:

    Hello Kunal,
    I’m having an issue with the “foreach links as link” command. Python is giving me a syntax error. I believe this is due to the data type of “links”. When I type in type(links) in the command window, I get the following message: “” Shouldn’t “links” be an array?

  3. Raptors95 says:

    Sorry for that, the message is “class ‘bs4.element.ResultSet'”

    • Kunal Grover says:

      The error is a syntax error, which means that it isn’t due to the data type. Actually, it is wrongly stated in this blog post.
      Python uses
      for i in all:
      Instead of
      foreach i in all:
      I will fix that, thanks for telling.

  4. AE says:

    Hi Kunal,
    The import package is urllib2. There is a typo. Please correct.
    Regards,
    AE.

  5. mp252 says:

    Would it be possible to download links, from an email?

  6. The code snippets and the examples are very explanatory. Thank you for the detailed article.

  1. July 15, 2014

    […] are libraries to look at. I have written about them here regarding some basic navigation http://crondev.wordpress.com/2014/06/15/use-python-to-download-files-from-websites/ Here we are looking at python ways of interacting with website […]
