Wednesday, March 18, 2009

» Freshmeat 3 project metadata scraping

Freshmeat 3 provides an invaluable service to the FOSS community by letting us keep track of new projects, and of new releases of existing projects, in a single central location (at least for most of them). That makes it especially useful to packagers.

Recently, the site's web interface was completely rewritten. In the process, the XML project metadata that was available on the previous version of the site was removed; it will be replaced with a JSON API. I don't really care whether it's JSON or XML, as long as it is machine readable, although XML is slightly easier to parse and process (e.g. with XPath). Over the past few years, I've been using those project descriptions (as XML) to fill RPM spec files automatically, instead of having to tediously copy/paste the usual annoying bits (Summary:, %description, URL:, License:, ...).

Until a JSON API is available, and out of sheer curiosity, I've written a little PHP command-line script that scrapes the project metadata from the HTML markup on the new site. It also resolves the redirect URLs for the sources and the website (by doing an HTTP HEAD request) and tries to format the description as paragraphs, using the command-line tool fmt.

If you want to have a look at how to do that, or even use the script as-is, here it is: fm3-scrape

And just for the record, here is the old script, which is broken now: ffxml
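For readers who would rather not read PHP, here is a rough Python sketch of the two post-processing steps mentioned above: resolving a redirect URL with HEAD requests, and re-wrapping the description into paragraphs before dropping it into the spec file fields. It is not a port of fm3-scrape. The function names, the `meta` dictionary layout, and the 72-column width are all assumptions made for illustration, and `textwrap` stands in for the fmt command.

```python
import http.client
import re
import textwrap
from urllib.parse import urljoin, urlsplit


def resolve_redirect(url, max_hops=5):
    """Follow redirects using HEAD requests only; return the last URL reached."""
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("HEAD", path)
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        # Stop as soon as the server no longer redirects us.
        if status not in (301, 302, 303, 307, 308) or not location:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return url


def format_description(text, width=72):
    """Re-wrap each blank-line-separated paragraph (a stand-in for fmt)."""
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs if p)


def spec_header(meta):
    """Build the spec file bits from a dict of scraped metadata (hypothetical layout)."""
    lines = [
        f"Summary: {meta['summary']}",
        f"License: {meta['license']}",
        f"URL: {meta['url']}",
        "",
        "%description",
        format_description(meta["description"]),
    ]
    return "\n".join(lines)
```

In use, one would pass the scraped website URL through `resolve_redirect()` first, store the result under `meta["url"]`, and then paste the output of `spec_header(meta)` into the spec file.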
