Wednesday, June 15, 2011

Building a web crawling platform

I've been doing some web crawling over the last few years. I started off with some really rudimentary pattern matching, but over the last two or three years I've become much more comfortable with text processing, to the point where I can programmatically edit chunks of text that I write. The power of really good text processing is amazing, especially when it comes to structured textual information.

I have spent countless hours trying to figure out the best way of writing programs to parse text, and thinking about programmatic methods of getting data out of the information that's out there. The more time one spends with text, the more patterns and strategies emerge that can be used over and over again to get at the information embedded in it. What I want is a simple framework that lets me quickly create parsers for whatever text document I'm after, and keep a library of them so I can stream data from a variety of websites; a rough sketch of the idea is below. Eventually, I might be able to make it easy enough that even non-programmers can write parsers, and that may have some interesting applications.
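To make that concrete, here is a minimal sketch of what I have in mind: a registry of small parser functions, one per site or document type, that can be looked up by name and run over raw text to stream out records. Everything here (the register_parser decorator, the parse helper, the sample pattern) is just a placeholder to illustrate the shape of the framework, not something I've actually built.

    import re

    # Hypothetical registry: maps a source name to its parser function.
    PARSERS = {}

    def register_parser(source):
        """Decorator that files a parser function under a source name."""
        def wrap(func):
            PARSERS[source] = func
            return func
        return wrap

    @register_parser("example-prices")
    def parse_prices(text):
        # Rudimentary pattern matching for lines like "Widget: $19.99".
        for m in re.finditer(r"(?P<item>\w+):\s*\$(?P<price>\d+\.\d{2})", text):
            yield {"item": m.group("item"), "price": float(m.group("price"))}

    def parse(source, text):
        """Look up the parser for a source and stream its records."""
        return PARSERS[source](text)

    if __name__ == "__main__":
        sample = "Widget: $19.99\nGadget: $5.00"
        for record in parse("example-prices", sample):
            print(record)

The point of the registry is that adding support for a new website is just writing one more small function, and the rest of the crawling pipeline never has to change.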
