Saturday, January 23, 2010

You're lucky if it's "garbage in, garbage out"

When it comes to code, you don't always have control over the data coming in. Usually I write code for my own use, where I know what kind of input to expect from myself or from other parts of the program, so things don't break quite as easily. But when you're dealing with information from the web, especially parsing data out of an HTML file, you don't have that luxury. Things can break at a moment's notice when someone decides to change the data format or a communication error leaves you with bad data.
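At the very least, the parser can notice when the page no longer looks the way it expects instead of blowing up somewhere far away. Here's a minimal sketch in Python; the tag layout and the price field are made up for illustration, not taken from any real site:

```python
import re

# Hypothetical layout: the page embeds a quote like <td class="price">12.34</td>.
PRICE_RE = re.compile(r'<td class="price">\s*([0-9]+(?:\.[0-9]+)?)\s*</td>')

def extract_price(html):
    """Return the quoted price, or None if the page doesn't look as expected."""
    match = PRICE_RE.search(html)
    if match is None:
        # The layout changed or the download was truncated, so report "no data"
        # instead of letting an exception surface later in the analysis.
        return None
    return float(match.group(1))
```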

I can have some primitive code up and running pretty quickly, but the program will crash at the first sign of bad input. If you're analyzing large data sets, the last thing you want is the code crashing partway through because of bad data. Graceful error handling is important, and also a pain in the ass to deal with.
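The approach I keep circling back to is wrapping the handling of each record so that one bad row gets logged and skipped rather than killing the whole run. A rough sketch, assuming a made-up "SYMBOL,PRICE" record format:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("analysis")

def parse_record(raw):
    # Hypothetical format: one "SYMBOL,PRICE" pair per record.
    symbol, price = raw.strip().split(",")
    return symbol, float(price)

def process_all(raw_records):
    """Parse every record, skipping and counting the ones that fail."""
    results, failures = [], 0
    for i, raw in enumerate(raw_records):
        try:
            results.append(parse_record(raw))
        except ValueError as exc:
            failures += 1
            log.warning("record %d skipped: %s", i, exc)
    if failures:
        log.warning("%d of %d records were bad", failures, len(raw_records))
    return results

# process_all(["IBM,122.39", "garbage", "GOOG,550.01"]) keeps two rows and
# logs a warning for the bad one instead of crashing.
```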

Firstly, there is a whole slew of potential errors that are hard to predict: a download connection could be reset, the available data could be incomplete, or the input format could contain errors. Then, if you're cross-referencing data from multiple sources, you have to make sure the data you want is actually available from every source, and who knows how things might break in the future.
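I don't have a complete answer for all of those, but the download side and the cross-referencing side can at least be made explicit. A sketch, with made-up URLs, retry counts, and source names standing in for the real thing:

```python
import time
import urllib.request
from urllib.error import URLError

def fetch(url, retries=3, delay=2.0):
    """Download a page, retrying a couple of times before giving up for good."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except URLError:
            if attempt == retries - 1:
                raise  # out of retries; let the caller decide what to do
            time.sleep(delay)

def cross_reference(sources):
    """Keep only the symbols that every source actually has data for."""
    # sources is a dict like {"source_a": {"IBM": ...}, "source_b": {"IBM": ...}}
    common = set.intersection(*(set(s) for s in sources.values()))
    return {sym: {name: s[sym] for name, s in sources.items()} for sym in common}
```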

Writing robust code that can detect errors and report back can easily double the effort required for a program. It's also hard to know what needs to be made more robust until things actually start breaking. I had a basic version of a stock analysis program working last year, but I was continually hacking parts of it to ignore bad data, and I generally resorted to an "execute and pray" kind of tactic just to see if I would get any results.

And so error handling and data integrity verification have become another pillar of good programming that I've only just come to appreciate, and I'll have to spend more time devising a good strategy for dealing with this kind of stuff.
