Python is a really powerful language, and with proper use of it anyone can build beautiful things. After studying Python I was really impressed by its power, and to be more specific, I love how easily we can scrape almost any website with it. Scraping is the process of extracting data from a website by parsing its HTML. So I learned the basics and started scraping many websites.
Recently I thought of creating something big through scraping, but I had no idea what to do. Then I came across the site of MP Transportation and realized that they have a lot of data on their website. The website is very simple: you open the site, enter your transport number details, and search. You then get the result for your vehicle, which includes its type, color, etc.
I created the script with Python 2.7, because with Python 3.x some of the modules I needed had weaker support. I decided to go for the 'last' search type, because with the other types I was facing some issues (maybe a site problem). This search requires a minimum of 4 characters, so I would have to try every input from 0000 to 9999, which makes around 10,000 requests. So yeah, it was that large.
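Enumerating that search space is just zero-padded formatting over a range (the form URL and field names in the comment below are placeholders, not the real site's form):

```python
# Build the full search space for the 'last' search type:
# every 4-digit value from 0000 to 9999.
inputs = ["%04d" % i for i in range(10000)]

# Each value would then be submitted to the search form, e.g. with
# requests.post(SEARCH_URL, data={"search_type": "last", "number": value})
# where SEARCH_URL and the field names are hypothetical stand-ins.

print(inputs[0], inputs[-1], len(inputs))  # 0000 9999 10000
```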
I created one program and started scraping, and with the 0000 input and the 'last' search type it scraped successfully, giving me 1700+ records. But the problem was that one request took 5 minutes to scrape. This happened because of server delay: it was not my script's fault, it was the server's problem to search that much data in its database. After realizing this I did some maths.
If 1 request takes = 5 minutes,
10,000 requests = 50,000 minutes = 833.33 hours = 35 days approx = 1 month 4 days
So in short I would need my laptop to run continuously for 1 month and 4 days, and trust me, it's a really bad idea to do so. But is it worth doing?
If 1 request gives approx 1,000 records,
10,000 requests = 10,000,000 records
So yeah, hypothetically, in 35 days I would be able to collect 10 million records.
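The estimates above can be checked in a few lines of Python (the per-request figures are the rough averages from this post, not exact measurements):

```python
# Back-of-the-envelope maths for the full scrape.
requests_total = 10000       # every 'last' input from 0000 to 9999
minutes_per_request = 5      # observed server delay per search
records_per_request = 1000   # rough average number of records per result

total_minutes = requests_total * minutes_per_request
total_days = total_minutes / 60.0 / 24.0
total_records = requests_total * records_per_request

print(total_minutes)            # 50000
print(round(total_days, 1))     # ~34.7, i.e. about 35 days
print(total_records)            # 10000000
```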
But still, being a programmer, we must do things as fast as possible, and to achieve this one thing was sure: I needed more power, memory, security, etc. I tried multiprocessing and multithreading, but they were not working as expected.
So the solution for this problem was getting my hands on some free servers. I started searching for free hosting companies that support Python, thinking I could deploy my script there. I tried pythonanywhere.com and Heroku with the help of the Flask framework, but with no success. I waited almost 15 days to decide what to do. Later I found scrapinghub.com, a site which lets you deploy spiders to the cloud and takes care of the rest, so I went for it and started learning it.
After that I learned how to use Scrapy and Scrapinghub, and I created another new program to scrape the website with the help of Scrapy spiders. The source code for this is at the end of this page.
Day 1 - 4,092,328 (4 million records in 17 hours)
In just 34 hours of scraping we collected the 10 million records estimated earlier. If we had tried to do this the old-fashioned way, on a laptop, it would have taken a month, so this was a real optimization.
Now what ?
The main question that arises is: what to do with the data? Which tools to use while analyzing it?
The size of our JSON files is huge. If we could convert the JSON files into a database it would be really great, but doing this will again require loads of time.
From JSON to Database
We can insert about 5 records per second,
so 10,000,000 records = 2,000,000 seconds = 33,333 minutes = 555 hours = 23 days.
Now that is simply not possible.
I even tried doing it through an SQL script, which is much better compared to the previous script, but it would still take approx 20 days.
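For the record, the usual way to speed up bulk inserts is to batch them in a single transaction rather than inserting row by row. A minimal sketch with Python's built-in sqlite3 module (the table layout and the plate-number format here are made up for illustration; the real scraped records have more fields):

```python
import sqlite3

# Hypothetical records as they might come out of the scraped JSON.
records = [("MP04AB%04d" % i, "car", "white") for i in range(1000)]

conn = sqlite3.connect(":memory:")  # use a file path for a real database
conn.execute("CREATE TABLE vehicles (number TEXT, type TEXT, color TEXT)")

# One transaction + executemany instead of one commit per row;
# this is typically orders of magnitude faster than 5 inserts/second.
with conn:
    conn.executemany("INSERT INTO vehicles VALUES (?, ?, ?)", records)

count = conn.execute("SELECT COUNT(*) FROM vehicles").fetchone()[0]
print(count)  # 1000
```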
So we will use the data in JSON format: load it into a Python script and do our maths there. Loading one file may take approx 10 minutes, but time is not the issue. The problem is that loading a JSON file in Python takes a lot of memory, I mean a lot, and since we are working on a normal laptop we need to think of something else. To avoid this problem I used the ijson module in Python. It's a really handy tool which iterates over the JSON data rather than loading it all at once. With this approach we sacrifice a little time again, but it's still worth it.