Python: automate navigation through websites
Hi! There are many cases where we need to do the same thing over and over again on websites: downloading episodes of a series manually, checking email every 10 minutes while waiting for an important message, or logging into your company or college website every day just for the sake of logging in. For the impatient people out there, Python comes to the rescue 🙂
Python and its libraries can be used very effectively to navigate through websites, fill in repetitive forms and a lot more. A lot of cool things can be built where the website lacks an API. Some examples: an automatic file downloader for a website, automated login and form filling, automatic ticket booking, and automatic payments using your credit or debit card. So here, I am going to discuss a few libraries and some tips for using them effectively.
For reading the content of websites, Beautiful Soup and urllib/urllib2 are the libraries to look at. I have written about basic navigation with them here: http://crondev.wordpress.com/2014/06/15/use-python-to-download-files-from-websites/
Here we are looking at Python ways of interacting with website forms. One of the more difficult tasks is logging in to websites.
Doing it the hard way: using urllib2
This is the more difficult and verbose way. We need cookielib to store cookies, along with urllib and urllib2 to build and send the requests. So, let's import them:
import cookielib
import urllib
import urllib2
Now, we need to create a cookie jar to keep all cookies.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
Next, we need to make a request to the login page using the username and password as login parameters.
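authentication_url = 'https://site/login/index.php'
# The keys must match the field names used by the site's login form
payload = {'username': username, 'password': password}
data = urllib.urlencode(payload)
req = urllib2.Request(authentication_url, data)
# Send the login request so that the cookie jar receives the session cookies
resp = urllib2.urlopen(req)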
Because the opener is installed globally, later urlopen calls send the stored cookies, so we can now open pages that require a login.
resp = urllib2.urlopen( 'https://site/link' )
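The snippets above use the Python 2 modules. As a rough sketch of the same cookie-jar flow on Python 3, assuming the standard-library replacements http.cookiejar, urllib.parse and urllib.request: the helper name build_login_session is made up for illustration, and the form field names username and password are carried over from the example above, so they may differ on a real site. Nothing is sent over the network until the opener is actually used.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_session(authentication_url, username, password):
    """Prepare an opener, a login request and a cookie jar (no network access)."""
    cookie_jar = http.cookiejar.CookieJar()
    # Every response opened through this opener can store cookies in the jar
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Field names are assumptions; check the login form of the real site
    payload = {"username": username, "password": password}
    # Python 3 expects the request body as bytes
    data = urllib.parse.urlencode(payload).encode("utf-8")
    login_request = urllib.request.Request(authentication_url, data=data)
    return opener, login_request, cookie_jar


# Usage (these two calls do hit the network, so they are left commented out):
# opener, login_request, cookie_jar = build_login_session(
#     "https://site/login/index.php", "my_user", "my_password")
# opener.open(login_request)                 # log in, cookies land in the jar
# page = opener.open("https://site/link")    # same opener, same cookies
```

Using the returned opener directly, rather than installing it globally, keeps the cookies tied to one object, which is handy if a script needs more than one session.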
Doing it the Mechanize way 🙂
Mechanize is a very powerful library that includes a browser of its own, so you don't need to worry about cookies as long as you keep using the same browser object. Here is a small walkthrough of how to use it.
Make a new browser object
from mechanize import Browser
br = Browser()
Open a page
page = br.open( 'https://example.com/examplepage' )
Mechanize has built-in functions for handling forms easily. Use one of the following:
br.select_form(nr=0)          # Select a form by its position; 0 is the first form on the page
br.select_form(name='form1')  # Select a form by its name, which can be found in the page source
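If reading the page source by hand gets tedious, the form names and field names can be listed with a few lines of standard-library code. This is only a sketch in Python 3: FormLister and list_forms are hypothetical helpers, not part of Mechanize, and the sample HTML is invented for the demonstration.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect each form's name and the names of the controls inside it."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form>, in page order
        self._current = None   # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"name": attrs.get("name"), "controls": []}
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            if attrs.get("name"):
                self._current["controls"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of {'name': ..., 'controls': [...]} dicts for the page."""
    parser = FormLister()
    parser.feed(html)
    return parser.forms


# Invented sample page, only for demonstration
sample = """
<form name="search"><input name="q"></form>
<form name="form1">
  <input name="userLogin"><input name="userPassword" type="password">
</form>
"""
for nr, form in enumerate(list_forms(sample)):
    print(nr, form["name"], form["controls"])
```

The position printed for each form is the value to pass as nr, and the printed name is the value to pass as name.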
Note that the browser object takes care of cookies automatically. Now that the form is selected, you can fill in the form fields:
br.form['userLogin'] = user_name
br.form['userPassword'] = password
You can also select radio options this way. Here '2' is the option ID. If the page does not specify one, radio buttons are numbered starting from 0 (inspect the element to check whether an ID is specified).
br.form[ 'radio-name' ] = [ '2' ]
Select drop-downs can be a bit tricky at times, but here is the way:
select1 = br.form.find_control('dropdown-name')
for item in select1.items:
    if item.name == 'required':
        item.selected = True
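The loop above does nothing when no item matches, which can hide a typo in the option name. One possible wrapper, written as a sketch: select_item is a hypothetical helper, and it relies only on the find_control, items, name and selected attributes already used in the snippet above.

```python
def select_item(form, control_name, item_name):
    """Select one item of a drop-down and fail loudly if it does not exist."""
    control = form.find_control(control_name)
    for item in control.items:
        if item.name == item_name:
            item.selected = True
            return item
    # No match: report what the page actually offers
    available = [item.name for item in control.items]
    raise LookupError(
        "no item %r in control %r; available items: %r"
        % (item_name, control_name, available)
    )
```

With a form already selected, the call would look like select_item(br.form, 'dropdown-name', 'required').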
Submit the form with all these parameters
br.submit()
You can now go directly to the next page. The Mechanize browser takes care of the cookies.
resp = br.open( 'https://nextpage.html' )
Note: this is an error that I ran into (it happens when the HTML of the site is missing closing tags somewhere):
mechanize._form.ParseError: nested FORMs
Here's the fix. Analyze the response and modify it into what Mechanize expects to see, i.e. fix the non-matching tags so that the response looks the way it should have. Then set the hardcoded, corrected HTML as the response for the page:
resp.set_data(hardcoded_resp)
br.set_response(resp)
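The fix above hardcodes the corrected page. When the only problem is a missing closing form tag, the repair could perhaps be automated along these lines. This is a sketch in Python 3 and not the fix I actually used: close_unclosed_forms and apply_fix are hypothetical helpers, the regular expression only looks at form tags, and it assumes that forms are never meant to be nested. The response is assumed to offer get_data alongside the set_data call shown above.

```python
import re

# Matches an opening <form ...> tag start or a closing </form> tag
_FORM_TAG = re.compile(r"<form\b|</form\s*>", re.IGNORECASE)


def close_unclosed_forms(html):
    """Insert </form> before any <form> that opens while another is still open."""
    pieces = []
    last = 0
    form_is_open = False
    for match in _FORM_TAG.finditer(html):
        is_opening = not match.group(0).startswith("</")
        if is_opening and form_is_open:
            # The previous form was never closed: close it right here
            pieces.append(html[last:match.start()])
            pieces.append("</form>")
            last = match.start()
        form_is_open = is_opening
    pieces.append(html[last:])
    return "".join(pieces)


def apply_fix(browser, response, fix=close_unclosed_forms):
    """Patch the page body and hand the patched response back to the browser."""
    body = response.get_data()
    if isinstance(body, bytes):
        # Assumes UTF-8; adjust to the encoding of the real page
        patched = fix(body.decode("utf-8", "replace")).encode("utf-8")
    else:
        patched = fix(body)
    response.set_data(patched)
    browser.set_response(response)
    return patched
```

With a page opened as resp = br.open(...), calling apply_fix(br, resp) before br.select_form would be the intended usage. Pages that are broken in other ways still need the manual approach.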
I will end this post here to keep it from getting too long. I will add more ways of doing this in the second part of this post.
Where's part 2, for the love of God?
Good question! I honestly don’t remember what I was going to write. Sorry for that.
You might find some things of interest here: https://crondev.wordpress.com/category/python/