Programming Forums
Old May 23rd, 2006, 2:44 PM   #141
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Yep, you seem to be having trouble distinguishing between lists and single items.

Try to think of a list as a basket. The basket may contain, say, a single apple, but this doesn't mean that you can eat the basket. You have to take the apple out of the basket before eating it.

A basket containing a single apple is not the same as a single apple on its own, just as a Python list containing one item is not the same as that single item on its own.

If you have a list of BeautifulSoup search results, you can't work with the individual results until you pull them out of the list with indices. In a sense, you keep trying to eat the basket, and that's where you run into problems. Realise the basket is different from the apples contained within, and you'll find life easier.
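The basket analogy can be sketched in plain Python. The `basket` list below is a made-up stand-in for a list of search results, not real BeautifulSoup output:

```python
# A hypothetical "basket": a list that happens to hold a single item.
basket = ["apple"]

# The list and the item inside it are different objects of different types.
is_same = (basket == "apple")   # a list never compares equal to a string

# Take the item out of the basket with an index before using it.
apple = basket[0]
shouted = apple.upper()         # string methods work on the item...

# ...but the list itself has no such method.
list_has_upper = hasattr(basket, "upper")
```

The same applies to a list of search results: index into it first (for example `results[0]`), then call search methods on the element you took out.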
Old May 23rd, 2006, 2:48 PM   #142
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Wow, what a fantastic analogy, and a life lesson too.

So now that I have tested each of the functions under:
def get_company_data(company_url):
        soup = BeautifulSoup(urlopen(company_url))
        #Company Profile
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        
        #Contact Information
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")

        #Financial Highlights
        highlights = soup.firstText(re.compile("Highlights"))
        fhighlights = highlights.findParent("table")

        z = len(highlights)
        if z == 0:
            print "N/A"

        else:
            print fhighlights

        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")

and see that they work, I need to add in a return statement. Does this return statement go inside def get_company_data, or after the for loops at the end of the script?

Also, should it look something like:
return [companyprofile, contacttable, fhighlights, keypeople]

Okay, I'm excited that this is all coming together, but I don't really know where to go from here. I'm not ready to import into a CSV yet, am I? I don't know how to combine each of the functions with the for-loops etc., and I could use some help formatting the return statement for get_company_data. Thanks to everyone who's helped thus far.

Last edited by zem52887; May 23rd, 2006 at 3:01 PM.
Old May 24th, 2006, 7:56 AM   #143
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Well I tried to do a little problem solving and this is what I'm currently at:

def get_company_data(company_url):
        soup = BeautifulSoup(urlopen(company_url))
        #Company Profile
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        return[companyprofile]
        
        #Contact Information
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        return[contacttable]
        
        #Financial Highlights
        highlights = soup.firstText(re.compile("Highlights"))
        fhighlights = highlights.findParent("table")

        z = len(highlights)
        if z == 0:
            return["N/A"]

        else:
            return[fhighlights]

        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")
        return[keypeople]
    
for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
    
        for company_url in get_company_urls(company_index):
               company_data = get_company_data(company_url)

               print company_data
               sleep(1)

I was trying to figure out how to format the return statements by looking at get_company_urls etc., but I couldn't really come up with anything, because here we're using regex to locate data (rather than a link where we can access the "a" tag and its 'href'). Thus, I wasn't really sure how to format it. Additionally, I tried to write the for-statement so the program knows to loop the functions, but again I think I'm off by a bit. If someone wants to take a look and possibly point me in the right direction I'd be very grateful, as I'd like to get this exported to Excel, maybe later today.

I was going through the python tutorial and it reads:
Quote:
The return statement returns with a value from a function.
With this in mind, I think I might be right with my return statements?
Old May 24th, 2006, 8:05 AM   #144
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
The return statement exits the function. Thus, you need to place all the items in a list and return them at the end of your function. If you have lots of return statements only returning one value, then it will exit the function when it comes to the first return statement it finds.
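A minimal sketch of the difference, using made-up placeholder strings rather than real page data. Both function names are hypothetical and exist only for this illustration:

```python
def early_returns():
    # Execution stops at the first return; the lines after it never run.
    return ["profile"]
    return ["contact"]   # unreachable


def single_return():
    # Collect every value first, then return them together at the end.
    profile = "profile"
    contact = "contact"
    highlights = "N/A"
    return [profile, contact, highlights]


first = early_returns()        # only the first value comes back
everything = single_return()   # all three values come back in one list
```

The "N/A" fallback fits the same pattern: decide the value inside the function, store it in a variable, and include that variable in the one list returned at the end.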

You also need to do some further parsing. You may have the right tables, but you still need to pull the right information out of them.
Old May 24th, 2006, 8:11 AM   #145
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Oh no, I thought I was done with the parsing... Should I make a new function for pulling the data from the tables? Is this where the [td] and [tr] tags come in for parsing? All right, well, I don't think I need a new function, but I'm a little confused as to how far I have to take the parsing. Is my ultimate goal to just pull the text without any HTML tags?

If so, for the following table:
<table border="0" cellpadding="2" cellspacing="1" width="100%"><tr><td colspan="2"><font face="verdana" size="-2"><b>Contact Information</b></font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">
Address: </font></td><td bgcolor="white"><font face="arial" size="-1">
75, Quai d'Orsay<br />75007 Paris, France
 </font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Phone:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-55-55</font></td></tr><tr valign="top"><td bgcolor="eeeeee"><font face="arial" size="-1">Fax:</font></td><td bgcolor="white"><font face="arial" size="-1">+33-1-40-62-54-65</font></td></tr></table>

Would I have to go through and write a statement like the one you originally posted:
Quote:
import re

# find "address" label
address = soup.firstText(re.compile("Address:"))

# find the first "tr" element above the address label
tr = address.findParent("tr")

# print the second "td" that belongs to that "tr" element:
print tr.fetch("td")[1]
Also, in some cases companies don't have a fax number or a phone number posted. Does this mean I'm going to have to write if-statements in addition?

Also, now that I have isolated the table that contains the relevant data, how can I parse through that as opposed to the whole page?

Last edited by zem52887; May 24th, 2006 at 8:30 AM.
Old May 24th, 2006, 8:29 AM   #146
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Hm. You could make some new functions, yes. Might be a good idea. You can call them from the get_company_data function. Perhaps something like:
def get_company_data(company_url):
	soup = BeautifulSoup(urlopen(company_url))

	profile_table = get_company_profile_table(soup)
	company_name = get_company_name(profile_table)
	# ...etc...

	contactinfo_table = get_company_contact_information(soup)
	address = get_company_address(contactinfo_table)
	# ...more of the same...

	return [company_name, address, ...]
Or you could create one big function. It's easier to work with smaller functions as a rule, however.

Remember that you need to get the information into a state in which you can make a spreadsheet out of it. It's no good just having the table that contains the contact information; you have to pull the company name, the address and so forth from it.

Your original post lists five pieces of data:

1) name
2) description
3) address
4) financial highlights (if there)
5) key people

You need to pull these out of the tables you've found so that you can construct the spreadsheet to import into Excel.
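As a rough sketch of where this is heading, one company could become one row of a CSV file that Excel can open. This uses Python's standard `csv` module and is written for current Python 3 rather than the Python 2 used in this thread. The company name, description and key person are invented placeholders; only the address comes from the contact table quoted earlier:

```python
import csv
import io

# Column headings matching the five pieces of data listed above.
header = ["name", "description", "address", "financial highlights", "key people"]

# One hypothetical company; every value except the address is made up.
row = [
    "Example Co",
    "Makes example products",
    "75, Quai d'Orsay, 75007 Paris, France",
    "N/A",
    "Jane Doe (CEO)",
]

# Write to an in-memory buffer; a real script would open a .csv file instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerow(row)
csv_text = buf.getvalue()
```

The writer quotes the address automatically because it contains commas, so it still lands in a single spreadsheet cell.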
Old May 24th, 2006, 8:35 AM   #147
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
Also, in some cases companies don't have a fax number or a phone number posted. Does this mean I'm going to have to write if-statements in addition?
Maybe. You might be able to say something like: "Get all text in the second column in the table that has the text 'Address:' in it."
Quote:
Originally Posted by zem52887
Also, now that I have isolated the table that contains the relevant data, how can I parse through that as opposed to the whole page?
Just treat it in the same way you would the whole page. contacttable.firstText() will only find text in the contact table, and contacttable.fetch("tr") will get all the table rows in the contact table (not the whole document).

With fetch("tr") and fetch("td") you can get specific cells from a table. So you could get a cell from the second row in the first column, or, with a list comprehension, you could get all of the text in a certain column or row, joined together.
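Here is a rough, self-contained sketch of the rows-and-cells idea. To keep it runnable without extra installs it swaps BeautifulSoup for the standard library's `html.parser` (Python 3), so the `TableText` class is a hypothetical stand-in: its `rows` list plays the role of fetch("tr"), and each inner list plays the role of fetch("td"). The HTML is a trimmed copy of the contact table posted above:

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per table row
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True
        elif tag == "br" and self.in_cell:
            self.rows[-1][-1] += ", "   # turn a line break into a separator

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


html = """<table><tr><td colspan="2"><b>Contact Information</b></td></tr>
<tr><td>Address: </td><td>
75, Quai d'Orsay<br />75007 Paris, France
 </td></tr>
<tr><td>Phone:</td><td>+33-1-40-62-55-55</td></tr>
<tr><td>Fax:</td><td>+33-1-40-62-54-65</td></tr></table>"""

parser = TableText()
parser.feed(html)
parser.close()

# List comprehension: all the text in the second column, then joined together.
second_column = [row[1].strip() for row in parser.rows if len(row) > 1]
joined = " | ".join(second_column)

# Label -> value lookup, so a missing label needs no special if-statement.
info = {row[0].strip().rstrip(":"): row[1].strip()
        for row in parser.rows if len(row) == 2}
phone = info.get("Phone", "N/A")
telex = info.get("Telex", "N/A")   # this table has no such row
```

With a label-to-value dictionary like `info`, a company that lacks a fax or phone row simply falls back to the default passed to `get`.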
Old May 24th, 2006, 9:20 AM   #148
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Before I attempt financial highlights, I was wondering if you could check my progress... I've tested it and I think I've isolated the actual data as opposed to the table, but I just want to be sure.

        soup = BeautifulSoup(urlopen(company_url))
        #Company Profile - Table
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        #Description
        description = companyprofile.fetch("td")
        
        #Contact Information - Table
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        #Address
        address = contacttable.firstText(re.compile("Address:"))
        add = address.findParent("tr")
        #Phone
        phone = contacttable.firstText(re.compile("Phone:"))
        phonenumber = phone.findParent("tr")
        #Fax
        fax = contacttable.firstText(re.compile("Fax:"))
        faxnumber = fax.findParent("tr")

Will the above code be enough to import into Excel?
Old May 24th, 2006, 10:12 AM   #149
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Okay, well, I've been trying to do both the financial highlights and key people and I've been struggling a lot. I'm having problems with the financial highlights because each company has something different posted (not the specific numbers, but the fields: some have revenue posted, some don't; some have employees posted, some have employee growth, etc.), so I'm not sure how I can isolate it.

I'm having a similar problem with the key people, as some companies have invented positions and some have one person acting as the CEO, CFO, President, Secretary etc., so I can't just fetch("CEO") like we did for contact information regarding phone etc... (granted, I'm not sure if I'm right about that yet, but assuming I am -- which is not necessarily a great assumption)...

Anyone have any ideas on how to isolate the final two pieces of data?
Old May 24th, 2006, 10:21 AM   #150
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
Will the above code be enough to import into Excel?
Perhaps. It depends on how formatted you want the Excel data to be. I suggest importing one row of data from a single company to test it, before trying to get all the companies.

Quote:
Originally Posted by zem52887
Anyone have any ideas on how to isolate the final two pieces of data?
I don't understand; as far as I can see, the "Key People" table and the "Financial Highlights" table have headings of "Key People" and "Financial Highlights" respectively. Why don't you just search for tables containing that text?
DaniWeb IT Discussion Community
All times are GMT -5. The time now is 5:48 PM.
