HTML Visual Structure to Content Tree

If you’ve ever wanted a “convenient” way to download a webpage and segment it into related content (or break it into a tree) the way a human so obviously would, here’s a Python module that does it. I originally wrote this in Perl for my search engine, Synaptic Search (which now seems like a really dumb name to me). This is a port to Python that leaks less memory and includes handles for manipulating and interacting with the content tree more easily. For instance, you can search for a particular word and get back the related content tree under it. It’s not always right, and there are still lots of features to add, but the concept is pretty powerful.

It basically downloads the HTML and all the related stylesheets to find the “average” distribution of what the text on a page looks like, and then follows the “visual” deviations of the text flow in standard English reading order to determine which content is “under” which headers, where it is segmented, and so on. It also includes a language detection module (a port of the Perl CPAN version) and Mark Pilgrim’s openanything module.
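To make the idea concrete, here is a minimal sketch of the nesting step. It is not the module’s actual code: it assumes the page has already been reduced to a list of (text, font size) pairs in reading order, it uses font size as the only “visual” signal, and the names Node, typical_size, build_tree and find are made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    size: float
    children: list = field(default_factory=list)


def typical_size(blocks):
    """Font size covering the most characters: a stand-in for the page's 'average' text."""
    weight = Counter()
    for text, size in blocks:
        weight[size] += len(text)
    return weight.most_common(1)[0][0]


def build_tree(blocks):
    """Nest each block under the nearest earlier block with a larger font size."""
    body = typical_size(blocks)
    root = Node("(root)", float("inf"))  # infinite size, so it is never popped
    stack = [root]  # chain of currently open headers
    for text, size in blocks:
        node = Node(text, size)
        if size <= body:
            # Ordinary text: attach to the current header, never becomes a parent.
            stack[-1].children.append(node)
            continue
        # Visual deviation (a header): close headers of the same or smaller size.
        while stack[-1].size <= size:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def find(node, word):
    """Depth-first search for the first node whose text contains word."""
    if word.lower() in node.text.lower():
        return node
    for child in node.children:
        hit = find(child, word)
        if hit is not None:
            return hit
    return None


if __name__ == "__main__":
    page = [
        ("Site Title", 32),
        ("A fairly long introduction paragraph for the page.", 12),
        ("News", 24),
        ("First item with enough body text to matter.", 12),
    ]
    section = find(build_tree(page), "News")
    print([child.text for child in section.children])
```

Searching for a word and printing the children of the matching node is the “return the related content tree under it” behaviour in miniature. A real page needs more signals than size alone (weight, color, spacing, position), which is where the stylesheets come in.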

At one point, I was going to include an OCR module to detect words inside images, but I think CSS has taken over for menu styling, leaving “text images” as a thing of 1999.

This is version 1.1.