HTML Visual Structure to Content Tree
If you’ve ever wanted a “convenient” way to download a webpage and segment it into related content (or break it into a tree), the way a human does at a glance, here’s a Python module that does that. I originally wrote it for my search engine, Synaptic Search (which now seems like a really dumb name to me).
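To give a rough idea of what "breaking a page into a tree" means, here is a minimal sketch using only Python's standard `html.parser`. It is not the module's actual code or API: the names `BLOCK_TAGS`, `Node`, `ContentTreeBuilder`, `build_content_tree`, and `outline` are all made up for illustration, and it groups text by tag nesting alone rather than by how the page looks when rendered.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that start a new segment in the tree.
BLOCK_TAGS = {
    "html", "body", "div", "section", "article", "nav", "header", "footer",
    "ul", "ol", "li", "table", "tr", "td", "p", "h1", "h2", "h3",
}


class Node:
    """One segment of the page: a tag, its own text, and child segments."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.text = []
        self.children = []


class ContentTreeBuilder(HTMLParser):
    """Collects text into a tree that mirrors block-level nesting."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            # Walk up to the nearest open node with this tag, so a missing
            # close tag somewhere below does not derail the whole tree.
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip:
            self.current.text.append(text)


def build_content_tree(html):
    """Parse an HTML string and return the root Node of the content tree."""
    builder = ContentTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per segment."""
    label = node.tag
    if node.text:
        label += ": " + " ".join(node.text)
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    page = (
        "<body><div><h1>Title</h1><p>Hello <b>world</b></p></div>"
        "<script>var x = 1;</script>"
        "<ul><li>A</li><li>B</li></ul></body>"
    )
    print("\n".join(outline(build_content_tree(page))))
```

The HTML string could just as well come from a download, for example `urllib.request.urlopen(url).read().decode()`. Inline tags such as `<b>` are deliberately ignored here so that their text stays with the enclosing block.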
At one point, I was going to include an OCR module to detect words inside images, but I think CSS has taken over for menu styles, leaving “text images” as a thing of 1999.
This is version 1.1.