#141

Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5

Yep, you seem to be having trouble distinguishing between lists and single items.
Try to think of a list as a basket. The basket may contain, say, a single apple, but this doesn't mean that you can eat the basket. You have to take the apple out of the basket before eating it. A basket containing a single apple is not the same as a single apple on its own, just as a Python list containing one item is not the same as that single item on its own. If you have a list of BeautifulSoup search results, you can't work with the individual results until you pull them out of the list by index. In a sense, you keep trying to eat the basket, and that's where you run into problems. Realise the basket is different from the apples contained within, and you'll find life easier.
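
To make the basket analogy concrete, here is a minimal sketch in plain Python. The list `results` and its contents are made up for illustration; they only stand in for whatever a BeautifulSoup search might return, and the sketch is written for current Python 3 rather than the Python 2 used elsewhere in this thread.

```python
# A hypothetical stand-in for a list of search results (the "basket").
results = ["<table>Contact Information</table>"]

# The basket itself is a list, so string methods are not available on it.
print(type(results).__name__)      # list
print(hasattr(results, "upper"))   # False

# Take the apple out of the basket with an index before using it.
first = results[0]
print(type(first).__name__)        # str
print(first.upper())
```

The same holds for a list with one element or a hundred: index into it (or loop over it) first, then call methods on the item you get back.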

#142

Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3

Wow, what a fantastic analogy, and a life lesson too. So now that I have tested each of the pieces under:

def get_company_data(company_url):
    soup = BeautifulSoup(urlopen(company_url))
    #Company Profile
    profile = soup.fetchText(re.compile("Company Profile"))[2]
    companyprofile = profile.findNext("table")
    #Contact Information
    contact = soup.firstText(re.compile("Contact Information"))
    contacttable = contact.findParent("table")
    #Financial Highlights
    highlights = soup.firstText(re.compile("Highlights"))
    fhighlights = highlights.findParent("table")
    z = len(highlights)
    if z == 0:
        print "N/A"
    else:
        print fhighlights
    #Key People
    key = soup.firstText(re.compile("Key People"))
    keypeople = key.findParent("table")

and see that they work, I need to add in a return statement. Does this return statement go under def get_company_names, or after the for loops at the end of the script? Also, should it look something like:

return [companyprofile, contacttable, fhighlights, keypeople]

Okay, I'm excited that this is all coming together, but I don't really know where to go from here. I'm not ready to import into a CSV yet, am I? I don't know how to combine each of the functions with the for loops etc., and I could use some help formatting the return statement for def get_company_data. Thanks to everyone who's helped thus far.
Last edited by zem52887; May 23rd, 2006 at 3:01 PM.

#143

Hobbyist Programmer

Well, I tried to do a little problem solving and this is where I'm currently at:

def get_company_data(company_url):
    soup = BeautifulSoup(urlopen(company_url))
    #Company Profile
    profile = soup.fetchText(re.compile("Company Profile"))[2]
    companyprofile = profile.findNext("table")
    return [companyprofile]
    #Contact Information
    contact = soup.firstText(re.compile("Contact Information"))
    contacttable = contact.findParent("table")
    return [contacttable]
    #Financial Highlights
    highlights = soup.firstText(re.compile("Highlights"))
    fhighlights = highlights.findParent("table")
    z = len(highlights)
    if z == 0:
        return ["N/A"]
    else:
        return [fhighlights]
    #Key People
    key = soup.firstText(re.compile("Key People"))
    keypeople = key.findParent("table")
    return [keypeople]

for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)
    for company_urls in get_company_urls(company_index):
        company_data = get_company_data(company_url)
        print company_data
        sleep(1)

I was trying to figure out how to format the return statements by using get_company_urls etc., but I couldn't really come up with anything because we're using regex to locate data (rather than a link where we can access the tags "a" and 'href'). Thus, I wasn't really sure how to format it. Additionally, I tried to write the for statement so the program knows to loop the functions, but again I think I'm off by a bit. If someone wants to take a look and possibly point me in the right direction I'd be very grateful, as I'd like to get this exported to Excel, maybe later today. I was going through the Python tutorial and it reads:
Quote:

#144

Programming Guru

The return statement exits the function, so if you have several return statements that each return a single value, the function will exit at the first one it reaches and the rest will never run. Instead, place all the items in a list and return that list once, at the end of your function.
You also need to do some further parsing. You may have the right tables, but you still need to pull the right information out of them.
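
As a small illustration of that point about return, here is a sketch with two toy functions. The function names and the string values are invented for the example and do no scraping; it is written for current Python 3.

```python
def first_return_wins():
    # The function exits at the first return it reaches,
    # so the second return below never runs.
    return ["profile"]
    return ["contact"]


def collect_then_return():
    # Gather every value first, then return them together at the end.
    profile = "profile"
    contact = "contact"
    highlights = "N/A"
    return [profile, contact, highlights]


print(first_return_wins())     # ['profile']
print(collect_then_return())   # ['profile', 'contact', 'N/A']
```

The second shape is the one to aim for in get_company_data: compute each piece, then hand them all back in a single list.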

#145

Hobbyist Programmer

Oh no, I thought I was done with the parsing... Should I make a new function for pulling the data from the tables? Is this where the td and tr tags come in for parsing? All right, well, I don't think I need a new function, but I'm a little confused as to how far I have to take the parsing. Is my ultimate goal to just pull the text without any HTML tags?
If so, for the following table:

<table border="0" cellpadding="2" cellspacing="1" width="100%"><tr><td colspan="2"><font face="verdana" size="-2"><b>Contact Information</b></font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1"> Address: </font></td><td bgcolor="white"><font face="arial" size="-1"> 75, Quai d'Orsay<br />75007 Paris, France </font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Phone:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-55-55</font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Fax:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-54-65</font></td></tr></table>

Would I have to go through and write a statement like the one you originally posted:
Quote:
Also, now that I have isolated the table that contains the relevant data, how can I parse through that as opposed to the whole page?
Last edited by zem52887; May 24th, 2006 at 8:30 AM.

#146

Programming Guru

Hm. You could make some new functions, yes. Might be a good idea. You can call them from the get_company_data function. Perhaps something like:

def get_company_data(company_url):
    soup = BeautifulSoup(urlopen(company_url))
    profile_table = get_company_profile_table(soup)
    company_name = get_company_name(profile_table)
    # ...etc...
    contactinfo_table = get_company_contact_information(soup)
    address = get_company_address(contactinfo_table)
    # ...more of the same...
    return [company_name, address, ...]

Remember that you need to get the information into a state in which you can make a spreadsheet out of it. It's no good just having the table that contains the contact information; you have to pull the company name, the address and so forth from it. Your original post lists five pieces of data:
1) name
2) description
3) address
4) financial highlights (if there)
5) key people
You need to pull these out of the tables you've found so that you can construct the spreadsheet to import into Excel.
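
To show what "a state in which you can make a spreadsheet" might look like, here is a minimal sketch using Python's standard csv module (Python 3). The column order, the helper name rows_to_csv and the company name "Example Company" are assumptions for illustration; the address and phone number are taken from the table posted earlier in the thread.

```python
import csv
import io

# Hypothetical column order; the real script may use more columns.
HEADER = ["name", "address", "phone"]


def rows_to_csv(rows):
    """Return CSV text: a header row followed by one line per company."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# One flat list of plain strings per company, no HTML tags left in it.
rows = [
    ["Example Company", "75, Quai d'Orsay, 75007 Paris, France", "+33-1-40-62-55-55"],
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

The csv module quotes the address automatically because it contains commas, which is one reason to prefer it over joining the values by hand.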

#147

Programming Guru

Quote:
Quote:
With fetch("tr") and fetch("td") you can get specific cells from a table. So you could get the cell in the second row of the first column, or, with a list comprehension, you could get all of the text in a certain column or row, joined together.
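
Here is a rough sketch of both ideas, picking one cell and joining a column with a list comprehension. Note the swap: the thread uses BeautifulSoup's fetch, but so that this runs with no third-party install it uses the standard library's html.parser instead (Python 3). The class name TableCells and the trimmed-down table are assumptions; the phone and fax values come from the table posted above.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text nested inside other tags (such as <font>) still lands here.
        if self._in_cell:
            self.rows[-1][-1] += data


html = (
    "<table>"
    "<tr><td>Phone:</td><td>+33-1-40-62-55-55</td></tr>"
    "<tr><td>Fax:</td><td>+33-1-40-62-54-65</td></tr>"
    "</table>"
)
parser = TableCells()
parser.feed(html)
parser.close()

# A specific cell: second row, first column.
print(parser.rows[1][0])

# A list comprehension gathers all the text in the second column.
print(", ".join([row[1].strip() for row in parser.rows]))
```

With BeautifulSoup the same shape applies: get the rows, get the cells of each row, then index or loop to reach the text you want.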

#148

Hobbyist Programmer

Before I attempt financial highlights, I was wondering if you could check my progress... I've tested it and I think I've isolated the actual data as opposed to the table, but I just want to be sure.

soup = BeautifulSoup(urlopen(company_url))
#Company Profile - Table
profile = soup.fetchText(re.compile("Company Profile"))[2]
companyprofile = profile.findNext("table")
#Description
description = companyprofile.fetch("td")
#Contact Information - Table
contact = soup.firstText(re.compile("Contact Information"))
contacttable = contact.findParent("table")
#Address
address = contacttable.firstText(re.compile("Address:"))
add = address.findParent("tr")
#Phone
phone = contacttable.firstText(re.compile("Phone:"))
phonenumber = phone.findParent("tr")
#Fax
fax = contacttable.firstText(re.compile("Fax:"))
faxnumber = fax.findParent("tr")

Will the above code be enough to import into Excel?

#149

Hobbyist Programmer

Okay, well, I've been trying to do both the financial highlights and key people, and I've been struggling a lot. I'm having problems with the financial highlights because each company has something different posted. It's not the specific numbers that vary but the fields themselves: some have revenue posted and some don't, some have employees posted, some have employee growth, and so on. So I'm not sure how I can isolate it.
I'm having a similar problem with the key people, as some companies have invented positions, and some have one person acting as the CEO, CFO, President, Secretary etc., so I can't just fetch("CEO") like we did for the phone number and so on in the contact information. (Granted, I'm not sure if I'm right about that yet, but assuming I am, which is not necessarily a great assumption...) Anyone have any ideas on how to isolate the final two pieces of data?

#150

Programming Guru

Quote:
Quote: