View Single Post
Old May 17th, 2006, 11:41 AM   #9
DaWei
Resident Grouch
 
DaWei's Avatar
 
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
You could do it with a number of languages, including Python. Material coming from a web site is nothing more than material exchanged within the constraints of the http protocol. You may request the resource directly from the server and scan its contents, including parsing for HTML tags and entities. Among these would be links, of course. To be effective, you would need to know how to distinguish (logically, via content, or possibly position) what links constitute your trail. Once at the destination you would need to know what items of information on THAT page were relevant. I strongly suspect that, at this point, you're going to have to include human intervention, with its marvelous visual identification capabilities. Still, one could probably devise a presentation, maybe block the relevant information, and then revert to logical parsing of that. If so, one would then merely write out the file in a form (such as .csv) readable by Excel. Again, you probably see that it is not a trivial process. At least, it isn't unheard of. Google, for one, spiders sites, looking for relevant keywords, links, and other things.

Now, don't get mad: have you tried Ctrl-Break or ALT-F4?
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers
DaWei is offline   Reply With Quote