Digging into Python: Part 1 | Web Scraping

Unless we actually start doing things, we will never learn anything. So we are going to write a simple program in Python. The first program we write will not be the usual "sum of two numbers" or "print the even numbers" exercise; we have been writing those same programs in C, C++, Java and so on since high school. As I said, Python is a very high-level language and simple to code in, so such programs would not take more than 5 lines. Instead, we are going to write something slightly more advanced, but still easy. This was the first program that I tried writing in Python, and it was really fun and interesting.

Things you will learn
  • How to write a function?
  • How to install a package?
  • How to import a package?
  • How to run a Python program from a file?

Web scraping with Python

Web scraping is the art of extracting useful information from a website by automating the process. What we are going to do is write a program that extracts all the links from a web page. For this we are using a package called mechanize. With mechanize we can create a browser object right from the CLI (command line interface), and using this browser object we can go to any website and read its contents.
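To give a feel of what mechanize looks like, here is a tiny sketch (it assumes the package is already installed, which we will do next, and uses example.com as a stand-in URL):

import mechanize
browser = mechanize.Browser()  # a browser object we can drive from code
browser.set_handle_robots(False)  # do not let robots.txt refuse the request
browser.open("http://example.com/")  # visit the page, just like typing a URL
print browser.title()  # title of the page we just opened
print browser.response().read()[:200]  # first 200 characters of the page's HTML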
To install the mechanize package, type the command below in the terminal of a Linux machine:
sudo pip install mechanize
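To check that the installation worked, you can try importing the package right from the terminal (an optional quick check); if the command below prints nothing, mechanize is available:
python -c "import mechanize"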
After installing the package we can start coding. The code is:
import mechanize
def linkminer():
    browser = mechanize.Browser(factory=mechanize.RobustFactory())  # create a browser object (RobustFactory copes better with messy HTML)
    browser.set_handle_robots(False)  # do not let robots.txt block the request
    browser.open("http://www.minerbots.blogspot.in/")  # you can give your url
    html = browser.response().readlines()  # read the entire page into a variable
    for link in browser.links():  # iterate over every link in the page
        print link.text, link.url  # print the link's text and its URL
linkminer()

   
You can see that this code is also only 9 lines of code 😎. In the first line we import the package. Then comes the function definition for mining the links. Inside the function we create a browser object, use it to open a link, and read the entire page into a variable. After that we start a loop that iterates over every link in the response, and inside the loop we print the text of each link and its URL.

You can write the code either directly in the interpreter or in a file and run it from the terminal. To run the program from a file, save the entire code in a file with a .py extension. Then you can run it by simply typing the command below in the Linux terminal. You can give just the file name like this only if your terminal's present working directory is the same as the directory containing the file; otherwise you have to give the absolute path to the file.
python filename.py
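For example, if you saved the code as linkminer.py in your home directory (the file name and path here are only an example), you could run it from any directory like this:
python /home/yourname/linkminer.py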


OUTPUT:

(the program prints the text and the URL of every link on the page, one per line)
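As a small exercise (this is just a sketch of mine, not part of the original program), you can make the function reusable by passing the URL as a parameter instead of hard-coding it:

import mechanize
def linkminer(url):
    browser = mechanize.Browser(factory=mechanize.RobustFactory())
    browser.set_handle_robots(False)  # ignore robots.txt, same as before
    browser.open(url)  # open whichever page was passed in
    for link in browser.links():
        print link.text, link.url
linkminer("http://www.minerbots.blogspot.in/")  # same page as above
linkminer("http://example.com/")  # or any other URL you like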
For more information related to web scraping and Python, you can visit my other blog, Minerbots.
