May 24th, 2006, 9:11 AM   #145
zem52887
Oh no, I thought I was done with the parsing... Should I write a new function for pulling the data out of the tables? Is this where the <td> and <tr> tags come in? All right, I don't think I need a new function, but I'm a little confused about how far I have to take the parsing. Is my ultimate goal just to pull out the text without any HTML tags?

If so, for the following table:
<table border="0" cellpadding="2" cellspacing="1" width="100%"><tr><td colspan="2"><font face="verdana" size="-2"><b>Contact Information</b></font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">
Address: </font></td><td bgcolor="white"><font face="arial" size="-1">
75, Quai d'Orsay<br />75007 Paris, France
 </font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Phone:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-55-55</font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Fax:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-54-65</font></td></tr></table>

Would I have to go through and write a statement like the one you originally posted:
Quote:
import re

# find "address" label
address = soup.firstText(re.compile("Address:"))

# find the first "tr" tag above the address label
tr = address.findParent("tr")

# print the second "td" that belongs to that "tr" tag:
print tr.fetch("td")[1]
Also, in some cases companies don't have a fax number or a phone number posted. Does this mean I'm going to have to write if-statements for each field as well?
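
To make the missing-field question concrete, here is a rough sketch of one way it might go. It is not the BeautifulSoup approach from the quote: it uses only the standard library's html.parser in Python 3, and the names RowCollector and contact_fields are made up for illustration. The idea is to read every two-cell row into a label/value dictionary, so a missing Fax or Phone row becomes a missing key instead of a separate if-statement per field.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(", ")  # keep multi-line addresses readable

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # collapse runs of whitespace left over from the markup
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def contact_fields(table_html):
    """Return {label: value} for every row that has exactly two cells."""
    parser = RowCollector()
    parser.feed(table_html)
    parser.close()
    fields = {}
    for row in parser.rows:
        if len(row) == 2:  # label cell + value cell; skips the header row
            fields[row[0].rstrip(":")] = row[1]
    return fields


# Cut-down version of the table above, with the Fax row left out on purpose.
sample = """<table><tr><td colspan="2"><b>Contact Information</b></td></tr>
<tr><td>Address: </td><td>75, Quai d'Orsay<br />75007 Paris, France</td></tr>
<tr><td>Phone:</td><td>+33-1-40-62-55-55</td></tr></table>"""

info = contact_fields(sample)
print(info.get("Phone", "n/a"))
print(info.get("Fax", "n/a"))  # falls back to "n/a" because there is no Fax row
```

With a dictionary like this, one .get() call with a default covers any field that a company left out.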

Also, now that I have isolated the table that contains the relevant data, how can I parse just that table instead of the whole page?
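
As a sketch of what I mean by parsing only the table: assuming the page is available as a plain string, one option might be to slice out the table around a known marker and feed just that slice to the parser. This is standard-library Python 3 only; table_around and TextOnly are hypothetical names, and the slicing is naive (it assumes lowercase tags and no nested tables). With BeautifulSoup I believe the equivalent would be to call the search methods on the table object instead of on soup, but I haven't confirmed that.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the non-empty text nodes and drop every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def table_around(page_html, marker):
    """Return the <table>...</table> slice that contains marker, or None.

    Naive on purpose: assumes lowercase tags and no nested tables.
    """
    hit = page_html.find(marker)
    if hit == -1:
        return None
    start = page_html.rfind("<table", 0, hit)
    end = page_html.find("</table>", hit)
    if start == -1 or end == -1:
        return None
    return page_html[start:end + len("</table>")]


# Made-up page: the contact table surrounded by unrelated markup.
page = ("<html><body><p>Unrelated intro</p>"
        "<table><tr><td>Contact Information</td></tr>"
        "<tr><td>Phone:</td><td>+33-1-40-62-55-55</td></tr></table>"
        "<p>Unrelated footer</p></body></html>")

fragment = table_around(page, "Contact Information")
parser = TextOnly()
parser.feed(fragment)  # only the table slice is parsed, not the whole page
parser.close()
print(parser.parts)
```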

Last edited by zem52887; May 24th, 2006 at 9:30 AM.