Thursday, June 16, 2011

Premature Optimization is Bad

The title almost sounds dirty, but still, one shouldn't get too excited and get ahead of oneself when it comes to implementing programming solutions. One of the biggest challenges in programming is understanding the full scope of the problem one is dealing with. When I say "problem," I don't mean that something is broken; I mean understanding the specification of the system one intends to build.

I am currently putting together a more advanced parser that can read HTML and extract values from it. With the advent of content management systems (CMS), most of the data displayed on the internet follows a specific layout depending on the page. Writing parsers can be a tedious process, as you need to do several things:

1.    Effectively isolate the block of text you want to analyze,
2.    Write a parser to target and extract the information you want to get, and
3.    Create a data structure to save that data
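
As a rough illustration of those three steps, here is a minimal sketch of the hand-written approach in Python. The page markup, the Product record and both function names are made up for this example; they are not from my actual parser, and a real page would need far more defensive code.

```python
import re
from dataclasses import dataclass

# Hypothetical CMS output; the markup and field names are invented for illustration.
PAGE = """
<html><body>
<div id="nav">Home | About</div>
<div class="product">
  <h2>Blue Widget</h2>
  <span class="price">$19.99</span>
</div>
</body></html>
"""


@dataclass
class Product:
    """Step 3: a data structure to hold the extracted values."""
    name: str
    price: float


def isolate_block(html: str) -> str:
    """Step 1: cut out the block of text we want to analyze."""
    match = re.search(r'<div class="product">(.*?)</div>', html, re.DOTALL)
    if match is None:
        raise ValueError("product block not found")
    return match.group(1)


def extract_product(block: str) -> Product:
    """Step 2: target and extract the individual values."""
    name = re.search(r"<h2>(.*?)</h2>", block, re.DOTALL)
    price = re.search(r'<span class="price">\$([\d.]+)</span>', block)
    if name is None or price is None:
        raise ValueError("block does not have the expected layout")
    return Product(name=name.group(1).strip(), price=float(price.group(1)))


product = extract_product(isolate_block(PAGE))
print(product)
```

Every new field means touching both the extraction code and the data structure, which is exactly the kind of coupling that makes this tedious.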

This process is time consuming because I have to write additional code to parse each different value and change the data structure to save everything. For every additional dependency in the code, the probability of error goes up exponentially (a gut feeling). The fewer dependencies there are between modules, the fewer errors you'll probably get. The best kind of code is code that automatically adapts itself to whatever you're doing... but I am going off on a tangent. Right... dealing with premature code optimization.

I have come up with the idea of using HTML templates instead of writing code to parse data out of an HTML block. Writing procedural code requires one to think of a strategy to get at the data wanted and then codify the process in a program. That means that for every block of HTML, one would need to write code to get at the data. I've done that before, and it can be a time-consuming process (although parsing libraries can make it somewhat easier).

Using templates, it becomes pretty easy to specify the structure of the text and target certain sections of the data using keyword markers. Editing text and converting it into templates can be somewhat tedious, but it is still far easier than writing code to extract data. The work required is just turning parts of the template into wild cards and adding keywords to the parts that hold the data. I really like this solution, so much that I decided to write code that would semi-automatically take HTML text and convert it into a template. It took me over a day to try and put something together... and I realized that I still hadn't fully understood the use cases of the HTML templates or the possible forms they can take. Just as important, the form of a template can have small variations that could cause the code to not work.
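
To make the template idea concrete, here is a minimal sketch in Python of one way it could work. The marker syntax ({{name}} for a keyword, {{*}} for a wild card) and the function names are assumptions made up for this example, not the format my engine actually uses. The sketch turns the template into a regular expression and ignores whitespace differences between template and page.

```python
import re
from typing import Optional

# Hypothetical marker syntax: {{keyword}} captures a value, {{*}} is a wild card.
MARKER = re.compile(r"\{\{\s*(\*|\w+)\s*\}\}")


def _literal(text: str) -> str:
    """Escape literal template text, letting any run of whitespace vary."""
    return r"\s*".join(re.escape(piece) for piece in re.split(r"\s+", text))


def compile_template(template: str) -> re.Pattern:
    """Turn a template into a regex with one named group per keyword."""
    parts = []
    pos = 0
    for marker in MARKER.finditer(template):
        parts.append(_literal(template[pos:marker.start()]))
        key = marker.group(1)
        # A wild card matches anything; a keyword matches and remembers it.
        parts.append(r".*?" if key == "*" else rf"(?P<{key}>.*?)")
        pos = marker.end()
    parts.append(_literal(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def extract(template: str, html: str) -> Optional[dict]:
    """Return the keyword values found in html, or None if nothing matches."""
    match = compile_template(template).search(html)
    return match.groupdict() if match else None


TEMPLATE = """
<div class="product">
  <h2>{{name}}</h2>
  {{*}}
  <span class="price">${{price}}</span>
</div>
"""

# Made-up page: different line breaks and an extra <img> tag the template skips.
PAGE = """
<html><body>
<div class="product"><h2>Blue Widget</h2>
  <img src="widget.png">
  <span class="price">$19.99</span>
</div>
</body></html>
"""

print(extract(TEMPLATE, PAGE))
```

Even this toy version shows where the small variations bite: it copes with changed whitespace, but a reordered attribute or a keyword used twice in one template would already break it.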

So here I was, spending a good chunk of a day trying to write code to optimize a process that I hadn't fully understood yet. And I have a feeling that some of the biggest failures of software projects come from a poor understanding of the use cases, which then forces extensive revisions to deal with unforeseen problems.

More important than making it work quickly is making it work in the first place. I ought to stick to that before putting effort into speeding up certain processes. Once I get this templating engine up and running, it'll be interesting to see what uses I can come up with for this code.
