Oh no, I thought I was done with the parsing... Should I make a new function for pulling the data from the tables? Is this where the [td] and [tr] tags come in? All right, I don't think I need a new function, but I'm a little confused about how far I have to take the parsing. Is my ultimate goal just to pull the text without any HTML tags?
If so, for the following table:
<table border="0" cellpadding="2" cellspacing="1" width="100%"><tr><td colspan="2"><font face="verdana" size="-2"><b>Contact Information</b></font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">
Address: </font></td><td bgcolor="white"><font face="arial" size="-1">
75, Quai d'Orsay<br />75007 Paris, France
</font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Phone:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-55-55</font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Fax:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-54-65</font></td></tr></table>
Would I have to go through and write a statement like the one you originally posted:
Quote:
import re
# find "address" label
address = soup.firstText(re.compile("Address:"))
# find the closest "tr" tag enclosing the address label
tr = address.findParent("tr")
# print the second "td" cell that belongs to that "tr" row:
print tr.fetch("td")[1]
Also, in some cases companies don't have a fax number or a phone number posted. Does this mean I'm going to have to write if-statements as well?
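To make the question concrete, here is a rough sketch of one way around per-field if-statements: read every two-cell row into a dictionary of label to value, then ask for each field with a default. It uses Python's standard-library html.parser rather than BeautifulSoup so that it runs on its own. RowCollector, table_to_dict and the "not listed" default are names I made up for illustration, and it assumes the table has no nested tables.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(", ")  # keep the address lines apart

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_dict(html):
    """Map each label cell (without its colon) to its value cell."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    info = {}
    for row in parser.rows:
        if len(row) == 2:  # the one-cell heading row is skipped
            label, value = row
            info[label.rstrip(":")] = value
    return info


# A cut-down copy of the table above, with the fax row left out.
SAMPLE = (
    '<table><tr><td colspan="2"><b>Contact Information</b></td></tr>'
    '<tr><td><font face="arial">\nAddress: </font></td>'
    "<td><font face=\"arial\">\n75, Quai d'Orsay<br />75007 Paris, France\n"
    "</font></td></tr>"
    "<tr><td>Phone:</td><td>+33-1-40-62-55-55</td></tr></table>"
)

info = table_to_dict(SAMPLE)
fax = info.get("Fax", "not listed")  # missing field falls back to the default
```

With this shape, a company without a fax or phone row simply has no such key, and info.get() with a default covers it without a separate if-statement for each field.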
Also, now that I have isolated the table that contains the relevant data, how can I parse just that table instead of the whole page?
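My guess is that it comes down to handing the parser only the table's markup instead of the page's. Here is a minimal sketch of that idea, again with the standard library rather than BeautifulSoup. isolate_table, table_text and TextOnly are hypothetical names, and the string slicing assumes the table is found by its "Contact Information" heading and contains no nested tables.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep only the non-blank text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def isolate_table(page, marker="Contact Information"):
    """Return the <table>...</table> markup around marker, or None."""
    hit = page.find(marker)
    if hit == -1:
        return None
    # Last "<table" before the marker and first "</table>" after it.
    opens = [m.start() for m in re.finditer(r"<table\b", page[:hit], re.IGNORECASE)]
    close = re.search(r"</table\s*>", page[hit:], re.IGNORECASE)
    if not opens or close is None:
        return None
    return page[opens[-1]:hit + close.end()]


def table_text(page, marker="Contact Information"):
    """Parse only the isolated table and return its text chunks."""
    fragment = isolate_table(page, marker)
    if fragment is None:
        return []
    parser = TextOnly()
    parser.feed(fragment)  # the rest of the page is never parsed
    parser.close()
    return parser.chunks


# A made-up page with unrelated tables before and after the one we want.
PAGE = (
    "<html><body><table><tr><td>Nav</td></tr></table><p>x</p>"
    '<table border="0"><tr><td><b>Contact Information</b></td></tr>'
    "<tr><td>Phone:</td><td>+33-1-40-62-55-55</td></tr></table>"
    "<table><tr><td>Footer</td></tr></table></body></html>"
)

chunks = table_text(PAGE)
```

If I understand BeautifulSoup correctly, the equivalent there would be to call the search methods on the table object rather than on soup, but I haven't confirmed that.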