Wednesday, November 11, 2009

Developing a Crawler and Parser library

The internet holds a plethora of readily accessible information; the problem is that the average human being can't process large quantities of it in a short period of time. So I have taken up the task of developing a small library of functions that lets me automate downloading webpages and extracting data from them for analysis.
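In spirit, the core of the library boils down to two steps: fetch a page, then pull out the bits you care about. A minimal sketch of that idea in Python might look something like the following (the URL and the regular expression are placeholders for illustration, not the actual code I use):

    import re
    import urllib.request

    def fetch(url):
        """Download a page and return its HTML as text."""
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def extract(html, pattern):
        """Pull every match of a regular expression out of the page."""
        return re.findall(pattern, html)

    # Placeholder example: grab all outgoing links from a page.
    page = fetch("http://example.com/")
    print(extract(page, r'href="(http[^"]+)"'))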

This project of mine has been going on and off for the past year, starting with work in Matlab and then moving on to Python, which has the advantage of running multi-threaded programs. That is a must, since you are usually limited by a website's response time rather than by the processing capabilities of your CPU. Processing textual data can be ridiculously fast, even in an interpreted language like Python.
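Because each download spends most of its time waiting on the network, a small pool of worker threads keeps things moving. Here is a rough sketch of how that might be wired up (the URLs are placeholders, and this uses a generic thread pool rather than my actual crawler code):

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        """Download one page; the timeout keeps a slow site from stalling a worker."""
        with urllib.request.urlopen(url, timeout=30) as response:
            return url, response.read()

    # Placeholder list of pages to download.
    urls = ["http://example.com/page%d" % i for i in range(1, 11)]

    # Each request mostly waits on the network, so a handful of worker
    # threads finishes the batch far sooner than a sequential loop would.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, len(body), "bytes")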

Mashing up data from different sources is a fascinating experience, and I see some interesting opportunities in the future. The first application I built downloaded the prices of different CPUs from an online store and combined that data with published benchmark results from another site. With both pieces of information, I could easily calculate a price/performance ratio for every listed CPU and pick the one best suited to my computing requirements and price range. I no longer have to thumb through long benchmarking articles to figure out which CPU might be best for me.
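As a sketch of the idea, assume the crawler has already parsed prices and benchmark scores into two dictionaries keyed by CPU model (the names and numbers below are made up):

    # Hypothetical parsed data: prices from the store, scores from the benchmark site.
    prices = {"CPU A": 199.0, "CPU B": 310.0, "CPU C": 145.0}
    scores = {"CPU A": 2400, "CPU B": 3100, "CPU C": 1500}

    # Join the two sources on the model name and rank by performance per dollar.
    ratios = {
        model: scores[model] / prices[model]
        for model in prices if model in scores
    }
    for model, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
        print("%-6s %6.2f benchmark points per dollar" % (model, ratio))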

It gets even better: the process can be automated to the point where a program tells me when certain products have gone on sale and asks whether I would like to purchase them. This brings a new paradigm to shopping, since it becomes possible to set up an automated notification for when something you want falls into your price range. You no longer have to waste time reading advertisements, because a program does that for you.
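The notification step is conceptually simple: compare the freshly crawled prices against a watch list of target prices. A toy version, with made-up products and prices, might look like this (here the "notification" is just a print, but it could as easily be an email):

    # Hypothetical watch list: product name -> the price I'm willing to pay.
    watch_list = {"CPU B": 280.0, "SSD X": 90.0}

    def check_for_deals(current_prices, watch_list):
        """Return the watched products whose current price is at or below my limit."""
        return [
            (name, price)
            for name, price in current_prices.items()
            if name in watch_list and price <= watch_list[name]
        ]

    # current_prices would come from the crawler; these numbers are made up.
    current_prices = {"CPU B": 275.0, "SSD X": 99.0}
    for name, price in check_for_deals(current_prices, watch_list):
        print("Deal found: %s is now $%.2f" % (name, price))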

I am now ramping up the scale of this program to download larger data sets, mashing up fundamental data from publicly listed companies with their stock prices to see if I can develop a valuation model that determines the fair price of a stock. Instead of limiting myself to tracking small groups of stocks, I might be able to look at the stock market as a whole. Using a large data set to look at pricing distributions, I can determine with greater confidence what makes one stock expensive or cheap compared to another, or perhaps understand what is going on in the stock market as a whole, instead of relying on the filtered (and usually useless) information provided by most financial sites.
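To give a flavour of the kind of analysis I mean, here is a sketch using one simple yardstick, the price-to-earnings ratio, over a made-up handful of tickers; the real data set would of course come from the crawler, and the actual model would use more than one ratio:

    import statistics

    # Hypothetical scraped data: price and trailing earnings per share by ticker.
    fundamentals = {
        "AAA": {"price": 42.0, "eps": 3.5},
        "BBB": {"price": 12.0, "eps": 0.4},
        "CCC": {"price": 88.0, "eps": 9.1},
        "DDD": {"price": 25.0, "eps": 1.1},
    }

    # One simple fundamental yardstick: price-to-earnings ratio.
    pe = {t: d["price"] / d["eps"] for t, d in fundamentals.items() if d["eps"] > 0}

    # Compare each stock against the distribution rather than in isolation.
    median_pe = statistics.median(pe.values())
    for ticker, ratio in sorted(pe.items(), key=lambda item: item[1]):
        tag = "cheap" if ratio < median_pe else "expensive"
        print("%s  P/E %5.1f  (%s relative to the median of %.1f)"
              % (ticker, ratio, tag, median_pe))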

The power of this program is immense: I have been working on downloading financial reports from every listed company for the past 10 years. To get to this data, one must traverse several webpages before finally reaching the real numbers. In the past week I have already downloaded and parsed 180,000 webpages, and I will be downloading another 150,000 tonight while I sleep. This program has read and processed more pages than I will ever read in my entire lifetime, and I think that is amazing.
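The traversal itself is just a crawl that follows links in stages: an index page leads to company pages, which in turn lead to the report pages. A rough sketch, with placeholder URL patterns standing in for the real site's structure:

    import re
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    def crawl_reports(index_url):
        """Walk from a company index page, through each company page,
        down to the individual report pages."""
        index_html = fetch(index_url)
        # These href patterns are placeholders for whatever the real site uses.
        for company_path in re.findall(r'href="(/company/[^"]+)"', index_html):
            company_html = fetch("http://example.com" + company_path)
            for report_path in re.findall(r'href="(/report/[^"]+)"', company_html):
                yield "http://example.com" + report_path

    for url in crawl_reports("http://example.com/companies"):
        print("would download:", url)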

I have no idea what I am going to learn yet, but the prospect of looking at large data sets and interpreting them has me quite excited. I am looking forward to working on this little project.
