Use python to download files from websites

by Kunal Grover · Published June 15, 2014 · Updated October 19, 2018


Hello everyone,
I would like to share different ways to use Python to download files from a website.
Usually files are returned when you click a link, but sometimes they are embedded in the page itself, for instance an image or a PDF.
We will use the BeautifulSoup library to parse web pages and make them easier to navigate, but the actual downloading is done by the urllib2 library, which ships with Python 2 by default.

Basics

First we will have a look at the urllib2 library. It opens web pages and files on the web by URL.
To open an arbitrary URL, you can use

  import urllib2
  resp = urllib2.urlopen( 'http://www.testurl.com' )

resp is the response object returned by urlopen; it holds the data the website sent back.
For now, we will use the BeautifulSoup library to inspect the web page. It is a simple library that makes navigating the HTML of a page much easier. You can get it from here: http://www.crummy.com/software/BeautifulSoup/#Download
The library can be tricky to install, so you can also download the tarball directly from http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/ , unpack it, and copy the bs4 folder into your project folder.
Import the library into Python with

  from bs4 import BeautifulSoup

First, we will go through the basics of BeautifulSoup and how it helps in navigating a web page's code.
A soup can be created from the response returned by urllib2.

  soup = BeautifulSoup( resp.read(), 'html.parser' )  # name the parser explicitly

Now for some magic: you can easily search the soup by tag. For instance, to find all hyperlinks, you can use

  links = soup.find_all( 'a' )  # links is a list of all hyperlink tags

Getting the URL from each element of that list is as easy as:

  for link in links:  # Process each link and get its URL
      url = link.get( 'href' )
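
The href values collected this way are often relative (for example /files/report.pdf) or missing altogether, so they usually have to be resolved against the page address before they can be opened. Here is a minimal sketch of that step. The helper name absolute_links and the sample values are made up for illustration, and the import fallback is there because urljoin moved from urlparse to urllib.parse in Python 3.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping tags that had no href."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical values standing in for the results of link.get( 'href' )
hrefs = ['/files/report.pdf', 'notes.txt', None, 'http://other.example/a.zip']
print(absolute_links('http://www.testurl.com/docs/index.html', hrefs))
```

The same idea applies to the 'src' values of images further down.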

Downloading files

Now let us see how to download files.
Case 1
The file is embedded in the page HTML. Take the example of a JPEG image embedded in the site.
We can first find the images in the page using BeautifulSoup:

  images = soup.find_all( 'img' )

You can get the URL of each image from the value of its 'src' attribute

  for image in images:      # Process each image and get its URL
      filename = image.get( 'src' )

To get the file, you need to do something like

  data = urllib2.urlopen( filename ).read()

The final step is saving the file

  with open( "myfile.jpeg", "wb" ) as code :
      code.write( data )

And done!!!
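
The snippet above always writes to "myfile.jpeg", so a loop over several images would overwrite the same file each time. One possible way around that is to take the local name from the image URL itself. This is only a sketch: filename_from_url and the fallback name download.bin are illustrative names, not part of urllib2 or BeautifulSoup.

```python
import posixpath

try:
    from urllib.parse import urlparse, unquote   # Python 3
except ImportError:
    from urlparse import urlparse                # Python 2
    from urllib import unquote


def filename_from_url(url, default='download.bin'):
    """Return the last path component of url, or default if there is none."""
    path = urlparse(url).path                    # drops the query string
    name = posixpath.basename(unquote(path))
    return name or default


print(filename_from_url('http://www.testurl.com/images/logo.jpeg?size=large'))
print(filename_from_url('http://www.testurl.com/'))
```

The returned name can then be passed to open() in place of the fixed string.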
Case 2
There is another case, where the file is returned when you click a link in a browser. In our case it is not a click but a request made with

  res = urllib2.urlopen( url )

Now, we need to identify that the response is a file. How do we do that?
When a server returns a file as a download, the response headers differ from those of a web page. They usually include a line like

  Content-Disposition: attachment; filename="filename.extension"

We can access the response headers using

  header = res.info()

and check whether they contain a Content-Disposition header.
It is as simple as doing

  if 'Content-Disposition' in str( header ):
      # It is a file
      pass
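
Searching the whole header text works, but it also matches when the words appear inside some other header's value. A slightly stricter variant looks the header up by name and checks that its type is "attachment". This is a sketch under assumptions: is_attachment is a made-up helper, and a plain dict stands in for the object returned by res.info(), which offers the same .get() lookup (case-insensitive on the real object, case-sensitive on a dict).

```python
def is_attachment(headers):
    """True if headers carry a Content-Disposition of type 'attachment'."""
    value = headers.get('Content-Disposition')
    if value is None:
        return False
    # The disposition type is the part before the first ';'
    return value.split(';')[0].strip().lower() == 'attachment'


# Plain dicts standing in for res.info()
print(is_attachment({'Content-Disposition': 'attachment; filename="a.txt"'}))
print(is_attachment({'Content-Type': 'text/html'}))
```

Note that this is narrower than the check above: a response marked "inline" is not treated as a file here.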

Now to download and save it, we can proceed the same way as in the last case

  with open( "myfile", "wb" ) as code :
      code.write( res.read() )  # write the body, not the response object
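
Reading the whole body at once keeps the entire file in memory, which can be a problem for large downloads. The response is a file-like object, so it can also be copied to disk in chunks. Below is a rough sketch of that approach. save_stream is a hypothetical helper, and io.BytesIO stands in for the response so the example runs without a network connection.

```python
import io
import os
import shutil
import tempfile


def save_stream(source, path, chunk_size=16 * 1024):
    """Copy a file-like object to path in chunks; return the size on disk."""
    with open(path, 'wb') as out:
        shutil.copyfileobj(source, out, chunk_size)
    return os.path.getsize(path)


# io.BytesIO plays the role of the object returned by urlopen
fake_response = io.BytesIO(b'example file contents')
target = os.path.join(tempfile.mkdtemp(), 'myfile')
print(save_stream(fake_response, target))
```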

You can also get the file name from the Content-Disposition header.
A short Python one-liner does that

  filename = res.info()['Content-Disposition'].split( '=' )[-1].strip( '"' )

Basically, the script looks up the Content-Disposition value in the response headers and splits it at the '=' sign. That gives a list of strings, and we are interested only in the last one, i.e. "filename.extension" with its quotes. We drop the quotes using .strip( '"' ), and we have the filename of the attachment.
One important thing to note is that the filename may arrive as File%20name.txt (for File name.txt), because URLs percent-encode characters such as spaces: a '%' followed by the character's hexadecimal code. See http://www.w3schools.com/tags/ref_urlencode.asp for more details.
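
Splitting at '=' takes whatever follows the last '=' in the header, which gives the wrong result if the server sends another parameter after the filename. As an alternative, the standard library's email.message.Message can parse the header value. The sketch below assumes a made-up helper name, filename_from_disposition, and sample header values.

```python
from email.message import Message


def filename_from_disposition(value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = value
    return msg.get_filename()


print(filename_from_disposition('attachment; filename="filename.extension"'))
print(filename_from_disposition('attachment; filename="a.txt"; size=10'))
```

get_filename() also removes the surrounding quotes, so no separate strip is needed.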
It can easily be fixed by

  import urllib  # in Python 2, unquote lives in urllib, not urllib2
  filename = urllib.unquote( filename )
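
Since the file name comes from the server, it may be worth dropping any directory part after decoding, so that a name like ..%2F..%2Fsecret.txt cannot end up outside the download folder. This is one possible sketch, not a complete sanitizer. safe_local_name is an illustrative name, and the import fallback covers Python 3, where unquote moved to urllib.parse.

```python
import posixpath

try:
    from urllib.parse import unquote   # Python 3
except ImportError:
    from urllib import unquote         # Python 2


def safe_local_name(raw_name):
    """Percent-decode raw_name and drop any directory components."""
    decoded = unquote(raw_name).replace('\\', '/')
    return posixpath.basename(decoded)


print(safe_local_name('File%20name.txt'))
print(safe_local_name('..%2F..%2Fsecret.txt'))
```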

That’s all and we can now download and save files from all websites using python 🙂


thepeglegpete says:

May 6, 2015 at 11:37 pm

Hi, I don’t believe this code would work for a website that requires users to log in, right?
I have a site I’ve logged into using mechanize, and then found a desired file using beautiful soup. The issue I’m having now is the file link on the website is actually a request like:
https://www.somewebsite.com/a/document.html?key=2618380
Not sure how to account for this, still new to python and these modules, and I’m realizing my understanding of headers and cookies and authorization schemes is not as deep as this requires.

- Kunal Grover says:
  
  May 7, 2015 at 12:58 am
  
  Actually it would. Have a look at https://github.com/kunalgrover05/IITM-Moodle-Downloader . There, I have used cookie-based authentication to make it possible. It is supported at the urllib2 level itself.
  Mechanize supports that too, since it behaves like a browser.
  
Raptors95 says:

May 16, 2015 at 1:17 am

Hello Kunal,
I’m having an issue with the “foreach links as link” command. Python is giving me a syntax error. I believe this is due to the data type of “links”. When I type in type(links) in the command window, I get the following message: “” Shouldn’t “links” be an array?

Raptors95 says:

May 16, 2015 at 1:19 am

Sorry for that, the message is “class ‘bs4.element.ResultSet'”

- Kunal Grover says:
  
  May 16, 2015 at 12:41 pm
  
  The error is a syntax error, which means it isn’t due to the data type. Actually, the loop is wrongly stated in this blog post.
  Python uses
  for i in all:
  Instead of
  foreach i in all:
  I will fix that, thanks for telling.
  
AE says:

December 6, 2016 at 9:23 pm

Hi Kunal,
The import package is urllib2. There is a typo. Please correct.
Regards,
AE.

mp252 says:

March 30, 2017 at 8:39 pm

Would it be possible to download links, from an email?

Python Examples says:

March 10, 2019 at 4:53 pm

The code snippets and the examples are very explanatory. Thank you for the detailed article.


