Use Python to download files from websites
Hello everyone,
I would like to share different ways to use Python to download files from a website.
Usually files are returned by clicking on links, but sometimes there may be embedded files as well, for instance an image or PDF embedded in a web page.
We will be using the extra BeautifulSoup library here to parse the webpages and make them easier to navigate, but the whole job is done by the urllib2 library, which is included by default in Python.
Basics
First we will have a look at the urllib2 library in Python. It allows opening webpages and files from the web using URLs.
To open an arbitrary url, you can use
import urllib2
resp = urllib2.urlopen( 'http://www.testurl.com' )
The response is a file-like object representing what the server returned.
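For example, here is a minimal sketch of what you can do with that object (the URL is just a placeholder):

import urllib2

resp = urllib2.urlopen( 'http://www.testurl.com' )  # placeholder url
print resp.geturl()    # the final url, after any redirects
print resp.info()      # the response headers
html = resp.read()     # the raw page content as a string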
Next, we will be using the BeautifulSoup library to view the webpage with ease. It is a very simple-to-use library that simplifies the task of navigating through the HTML of webpages. You can get the library from here: http://www.crummy.com/software/BeautifulSoup/#Download
The library sometimes becomes tricky to install and use, so you can directly get the tarball from: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/ and then unzip it and place the bs4 folder in your project folder to use it.
You need to import the library into python as
from bs4 import BeautifulSoup
First, we will go through the basics of BeautifulSoup and its use in easily navigating through a webpage’s code.
A soup can be created from the object returned by urllib2.
soup = BeautifulSoup( resp.read() )
Now is the time for some magic, you can easily process the soup using tags. For instance, to find all hyperlinks, you can use
links = soup.find_all( 'a' ) # links is a list of all hyperlink tags
Now getting the url from an object in the list is as easy as:
for link in links:
    # Process each link and get the url value
    url = link.get( 'href' )
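Putting the pieces together, here is a minimal sketch that prints every hyperlink on a page (the URL is a placeholder):

import urllib2
from bs4 import BeautifulSoup

resp = urllib2.urlopen( 'http://www.testurl.com' )  # placeholder url
soup = BeautifulSoup( resp.read() )

for link in soup.find_all( 'a' ):
    url = link.get( 'href' )    # None if the tag has no href attribute
    if url is not None:
        print url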
Downloading files
Now let us see how to download files.
Case 1
The file is embedded in the page HTML; let us take the example of a JPEG embedded in the site.
We can first find the images in the page easily using Beautiful Soup:
images = soup.find_all( 'img' )
You can get the url path for each image using the value of its ‘src’ attribute:
for image in images:
    # Process each image and get the url value
    filename = image.get( 'src' )
To get the file, you need to do something like
data = urllib2.urlopen( filename ).read()
The final step is saving the file
with open( "myfile.jpeg", "wb" ) as code : code.write( data )
And done!!!
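As a recap, here is the whole of Case 1 in one sketch. The page URL and output names are placeholders, and note that ‘src’ may be a relative path, so we resolve it against the page URL to be safe:

import urllib2
import urlparse
from bs4 import BeautifulSoup

page_url = 'http://www.testurl.com'   # placeholder url
soup = BeautifulSoup( urllib2.urlopen( page_url ).read() )

for i, image in enumerate( soup.find_all( 'img' ) ):
    src = image.get( 'src' )
    if src is None:
        continue
    # 'src' may be relative, so resolve it against the page url
    file_url = urlparse.urljoin( page_url, src )
    data = urllib2.urlopen( file_url ).read()
    with open( 'image%d.jpeg' % i, 'wb' ) as code:
        code.write( data )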
Case 2
There might be another case, when the file is returned on clicking a link in a browser. In our case, it wouldn’t be a click but a request using
res = urllib2.urlopen( url )
Now, we need to identify that the response is a file. How do we do that?
The response header is somewhat different for files than for webpages; it looks like
Content-Disposition: attachment; filename="filename.extension"
We can access the response header using
header = res.info()
and check whether the response has a Content-Disposition header in it.
It is as simple as doing
if 'Content-Disposition' in str( header ):
    # It is a file
Now to download and save it, we can proceed the same way as in the last case:
with open( "myfile", "wb" ) as code : code.write( res )
You can get the file name as well using the Content-Disposition header.
A simple line of Python does that:
filename = res.info()['Content-Disposition'].split( '=' )[-1].strip( '"' )
Basically, the line uses the response header to get the Content-Disposition value, and then we split it at the ‘=’ sign. So we have a list of strings and we are interested in only the last item, i.e. “filename.extension”. We drop the quotes by using .strip( '"' ).
and done, we get the filename of the attachment.
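For instance, assuming a hypothetical header value of attachment; filename="report.pdf", the steps work out like this:

header_value = 'attachment; filename="report.pdf"'   # hypothetical value
parts = header_value.split( '=' )    # ['attachment; filename', '"report.pdf"']
filename = parts[-1].strip( '"' )    # 'report.pdf'
print filename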
One important thing to note is that the filename may be in the form File%20name.txt (for File name.txt), since URLs are percent-encoded using ASCII characters. See http://www.w3schools.com/tags/ref_urlencode.asp for more details.
It can easily be fixed by
import urllib  # unquote lives in urllib, not urllib2
filename = urllib.unquote( filename )
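Putting Case 2 together, a minimal sketch (the URL is a placeholder):

import urllib
import urllib2

url = 'http://www.testurl.com/getfile?id=1'   # placeholder url
res = urllib2.urlopen( url )
header = res.info()

if 'Content-Disposition' in str( header ):
    # It is a file; recover its name from the header
    filename = header['Content-Disposition'].split( '=' )[-1].strip( '"' )
    filename = urllib.unquote( filename )
    with open( filename, 'wb' ) as code:
        code.write( res.read() )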
That’s all and we can now download and save files from all websites using python 🙂
Hi, I don’t believe this code would work for a website that requires users to login, right?
I have a site I’ve logged into using mechanize, and then found a desired file using beautiful soup. The issue I’m having now is the file link on the website is actually a request like:
https://www.somewebsite.com/a/document.html?key=2618380
Not sure how to account for this, still new to python and these modules, and I’m realizing my understanding of headers and cookies and authorization schemes is not as deep as this requires.
Actually it would. Have a look at https://github.com/kunalgrover05/IITM-Moodle-Downloader . Here, I have used cookie-based authentication to make it possible. It is actually supported at the urllib2 level itself.
Mechanize too supports that for sure, since it is equivalent to a browser.
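For reference, a rough sketch of cookie-based login with plain urllib2 (the login URL and form field names here are made up; the real ones depend on the site):

import cookielib
import urllib
import urllib2

# Cookie jar shared by every request made through this opener
cj = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor( cj ) )

# Hypothetical login form; the real field names depend on the site
login_data = urllib.urlencode( { 'username': 'user', 'password': 'pass' } )
opener.open( 'https://www.somewebsite.com/login', login_data )

# The session cookie is now sent automatically with later requests
res = opener.open( 'https://www.somewebsite.com/a/document.html?key=2618380' )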
Hello Kunal,
I’m having an issue with the “foreach links as link” command. Python is giving me a syntax error. I believe this is due to the data type of “links”. When I type in type(links) in the command window, I get the following message: “” Shouldn’t “links” be an array?
Sorry for that, the message is “class ‘bs4.element.ResultSet'”
The error is a syntax error, which means that it isn’t due to the data type. Actually, it is wrongly stated in this blog post.
Python uses
for i in all:
Instead of
foreach i in all:
I will fix that, thanks for telling.
Hi Kunal,
The import package is urllib2. There is a typo. Please correct.
Regards,
AE.
Would it be possible to download links from an email?
The code snippets and the examples are very explanatory. Thank you for the detailed article.