300 Lines Lighter
For all the geeks out there who read 3athlete, I thought you might appreciate a look into its internals...
When I first started 3athlete, triathlon and RSS had not crossed paths yet. That meant I had to pre-process all of the HTML before it went into the database. I created a backend system that used little Perl plugins to parse the contents of various sites and dump everything into a database. Since I spent many years hacking away at ugly Perl code at my day job, this was pretty trivial using HTML::TokeParser. Below is an example of one of my first plugins:
$grabberConst{'source-url'}='http://www.usatriathlon.org/newsReleases.asp';
my $data=getContent($conf,$grabberConst{'source-url'});
my $stream = HTML::TokeParser->new( \$data ) or die $!;
# get the first two headlines
while (my $tag = $stream->get_tag('tr')) {
    my($title,$url,$description);
    if($tag->[1]{valign} && $tag->[1]{valign} eq 'top'){
        $tag = $stream->get_tag('a');
        $url = $tag->[1]{href};
        $title = $stream->get_trimmed_text('/a');
        $tag = $stream->get_tag('font'); #release date title
        $tag = $stream->get_tag('font'); #date
        $tag = $stream->get_tag('font');
        $description=$stream->get_trimmed_text('/font');
    }
    if($url && $title && $description){
        $grabberData{$url}{'title'}=$title;
        $grabberData{$url}{'description'}=$description;
    }
}
As the web has grown more semantic over the past few years, more and more site owners have figured out that RSS feeds translate into additional page views. That, in turn, has made my life easier and allowed me to start parsing XML instead of HTML. Well--almost.
Since switching to a native RSS/XML parser was not a necessity, I decided to keep the model of one plugin per site and treat the XML as HTML. That would look something like this to the tokenizer:
foreach my $sourceUrl (@sourceUrls){
    my $data=getContent($conf,$sourceUrl);
    my $stream = HTML::TokeParser->new( \$data ) or die $!;
    while (my $tag = $stream->get_tag('item')) {
        my $title = $stream->get_trimmed_text('/title');
        my $description = $stream->get_trimmed_text('/description');
        my $url = $stream->get_trimmed_text('/link');
        if($url && $title && $description){
            $grabberData{$url}{'title'}=$title;
            $grabberData{$url}{'description'}=$description;
            #print "$title\n$url\n$description\n\n\n";
        }
    }
}
This led to many complications, as my humble assumptions were proven wrong many times over. For example, what if a site owner changes their feed software and switches from title, link, description to title, description, link? (Hint: That site would go missing for a month.) Or, what if the content is wrapped in a CDATA section? (Hint: The CDATA would get converted to a comment.)
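Both failure modes come from reading fields by position. A minimal sketch of the alternative, looking each field up by element name and unwrapping CDATA, might look like the following. The helper names item_field and parse_items and the sample feed are hypothetical, not part of the 3athlete code, and regexes like these are only an illustration, not a substitute for a real XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of one named child element of an
# <item> body, or undef if the element is absent. Entities are not decoded.
sub item_field {
    my ($item_xml, $name) = @_;
    my ($text) = $item_xml =~ m{<\Q$name\E\b[^>]*>(.*?)</\Q$name\E\s*>}s;
    return undef unless defined $text;
    $text =~ s{<!\[CDATA\[(.*?)\]\]>}{$1}gs;   # keep CDATA content as plain text
    $text =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
    return $text;
}

# Hypothetical helper: split a feed into items and look up each field by
# name, so the order of title/link/description no longer matters.
sub parse_items {
    my ($xml) = @_;
    my @items;
    while ($xml =~ m{<item\b[^>]*>(.*?)</item\s*>}gs) {
        my $body = $1;
        push @items, {
            map { $_ => item_field($body, $_) } qw(title link description)
        };
    }
    return @items;
}

# Made-up sample feed: the two items use different element orders, and the
# second wraps its description in CDATA.
my $feed = <<'XML';
<rss version="2.0"><channel><title>Sample</title>
<item><title>Race report</title><link>http://example.com/a</link><description>First</description></item>
<item><title>Results</title><description><![CDATA[Second <b>bold</b>]]></description><link>http://example.com/b</link></item>
</channel></rss>
XML

for my $item (parse_items($feed)) {
    print "$item->{title}\n$item->{link}\n$item->{description}\n\n";
}
```

Because each field is fetched by name, reordering the elements inside an item changes nothing, and the CDATA wrapper is stripped instead of being mistaken for markup.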
Thankfully, I recently did some RSS parser work for planet.3athlete, since it only reads RSS feeds (and RSS's brother, Atom). Using the planet.3athlete code, I was able to quickly introduce a native RSS parser into the 3athlete backend and shrink the code base by 300 lines. Not bad for 30 minutes of tinkering over the weekend! The end result should be fewer "oops" moments on the home page and faster integration of new sites that have a feed, but only time will tell. I am sure the RSS parser will present its own issues.
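The post does not say which parser the planet.3athlete code uses. As a rough sketch only, assuming the CPAN module XML::RSS, a single generic grabber could replace the per-site plugins along these lines. The function name grab_feed is hypothetical, and XML::RSS covers RSS only, so Atom feeds would need something else.

```perl
use strict;
use warnings;
use XML::RSS;   # CPAN module; an assumption, the post does not name its parser

# Hypothetical generic grabber: parse one feed document and return a hash
# reference shaped like %grabberData: { url => { title, description } }.
sub grab_feed {
    my ($xml) = @_;
    my $rss = XML::RSS->new;
    $rss->parse($xml);   # dies if the document is not well-formed XML

    my %grabber_data;
    for my $item (@{ $rss->{items} }) {
        my ($url, $title, $description) = @{$item}{qw(link title description)};
        # Same completeness check as the tokenizer plugins
        next unless $url && $title && $description;
        $grabber_data{$url} = { title => $title, description => $description };
    }
    return \%grabber_data;
}
```

With the helpers from the plugins above, a call might read grab_feed(getContent($conf, $sourceUrl)). Element order and CDATA handling are then the parser's job rather than the plugin's.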
