#201
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
What did you use to test the function? Presumably you took some URL and did:
print get_company_data(test_url)
#202
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Yes, indeed. (As well as importing the necessary libraries etc.)
#203
Programming Guru
Hm, well, I don't see anything obvious that would cause such a problem. When I get home, I'll try it out for myself.
#204
Hobbyist Programmer
Thanks... for the record, this is the script I attempted to test:
import re
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen("http://biz.yahoo.com/ic/135/135359.html")
soup = BeautifulSoup(html)

def get_company_data("http://biz.yahoo.com/ic/135/135359.html"):
    #Company Name
    name = soup.firstText(re.compile("Company Profile"))
    #Company Profile - Table
    profile = soup.fetchText(re.compile("Company Profile"))[2]
    companyprofile = profile.findNext("table")
    #Contact Information - Table
    contact = soup.firstText(re.compile("Contact Information"))
    contacttable = contact.findParent("table")
    #Financial Highlights - Table
    highlights = soup.firstText(re.compile("Highlights"))
    fhighlights = highlights.findParent("table")
    z = len(highlights)
    if z == 0:
        "N/A"
    else:
        fhighlights
    #Key People
    key = soup.firstText(re.compile("Key People"))
    keypeople = key.findParent("table")
    output = "<table>"
    output += "<tr>\n"
    output += "<td>" + name + "</td>"
    output += "<td>" + companyprofile + "</td>"
    output += "<td>" + contacttable + "</td>"
    output += "<td>" + z + "</td>"
    output += "<td>" + keypeople + "</td>"
    output += "</tr>"
    output += "</table>"
    return output

print get_company_data("http://biz.yahoo.com/ic/135/135359.html")
#205
Programming Guru
There are a number of things that are wrong in your above script. I've fixed the problems I can see, and highlighted the changes in red.
import re
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

def get_company_data(company_url):
    soup = BeautifulSoup(urlopen(company_url))
    #Company Name
    name = soup.firstText(re.compile("Company Profile"))
    #Company Profile - Table
    profile = soup.fetchText(re.compile("Company Profile"))[2]
    companyprofile = profile.findNext("table")
    #Contact Information - Table
    contact = soup.firstText(re.compile("Contact Information"))
    contacttable = contact.findParent("table")
    #Financial Highlights - Table
    highlights = soup.firstText(re.compile("Highlights"))
    fhighlights = highlights.findParent("table")
    if len(highlights) == 0:
        z = "N/A"
    else:
        z = fhighlights
    #Key People
    key = soup.firstText(re.compile("Key People"))
    keypeople = key.findParent("table")
    output = "<table>"
    output += "<tr>\n"
    output += "<td>" + name + "</td>"
    output += "<td>" + companyprofile + "</td>"
    output += "<td>" + contacttable + "</td>"
    output += "<td>" + z + "</td>"
    output += "<td>" + keypeople + "</td>"
    output += "</tr>"
    output += "</table>"
    return output

print get_company_data("http://biz.yahoo.com/ic/135/135359.html")
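A side note on the output section of the script above: it joins plain strings with whatever the lookups return, and a lookup may return a non-string object or nothing at all. Here is a minimal sketch of one way to make that step more forgiving, assuming each value can be rendered with str() and that None means "not found". It is written in Python 3 syntax, and `cell`, `build_row` and `FakeTag` are hypothetical names used only for illustration; they are not part of BeautifulSoup or of the script.

```python
# Sketch only: cell, build_row and FakeTag are hypothetical helpers.

class FakeTag(object):
    """Stand-in for a parsed table element; it only knows how to render itself."""
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


def cell(value):
    # None is treated as "lookup found nothing": emit a placeholder instead of failing
    if value is None:
        return "<td>N/A</td>"
    # str() converts non-string objects to text before concatenation
    return "<td>" + str(value) + "</td>"


def build_row(values):
    # One table, one row, one cell per value
    cells = "\n".join(cell(v) for v in values)
    return "<table><tr>\n" + cells + "\n</tr></table>"


row = build_row(["Acme Corp", FakeTag("<table><tr><td>profile</td></tr></table>"), None])
print(row)
```

Whether this removes the error in any particular script depends on what the lookups actually return, so treat it as something to try rather than a confirmed fix.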
#206
Hobbyist Programmer
Okay, I understand the changes. For some reason I thought I had to assign len(highlights) to a variable, as opposed to having it function on its own. That's why I was having problems deciding what to do with the "N/A" and "fhighlights": they needed to be assigned to a variable, not printed or just left hanging in space.
As for the company_url changes, does that need to be defined? If I'm running this function in isolation, it doesn't know what company_url is, no? Or are these proposed changes for the actual script, not the test script?
Finally, I seem to encounter this error a lot:
output += "<td>" + companyprofile + "</td>"
TypeError: 'NoneType' object is not callable
Could you remind me what it means so I can try troubleshooting it?
#207
Programming Guru
Quote:
Take this simple function:
def double(x):
    return x * 2

y = double(2) # x is 2 and y is 4
z = double(3) # x is 3 and z is 6
some_value = 5
print double(some_value) # x becomes equal to some_value (ie. 5)

Let me give another example, to show you what I mean:
x = 10
def foobar(x):
    print "Argument x =", x
    x = 7
    print "Argument x =", x
print "Global x =", x
foobar(16)
print "Global x =", x

Output:
Global x = 10
Argument x = 16
Argument x = 7
Global x = 10

Quote:
x = None
x()
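To make the last snippet a bit more concrete, here is a small sketch of how a name can end up holding None without anyone writing `x = None` explicitly, and what happens when it is then called. It uses Python 3 syntax (print as a function), and `handlers` and `describe_call` are hypothetical names made up for this illustration.

```python
# Sketch only: handlers and describe_call are hypothetical names.

def double(x):
    # x is a local name, bound to whatever the caller passes in
    return x * 2


def describe_call(obj):
    """Try to call obj and report what happened instead of crashing."""
    try:
        return "result: %r" % (obj(),)
    except TypeError as exc:
        return "TypeError: %s" % exc


handlers = {"hello": lambda: "hi"}
found = handlers.get("hello")      # a real function
missing = handlers.get("goodbye")  # dict.get returns None for an absent key

print(double(5))
print(describe_call(found))
print(describe_call(missing))
```

The point is only that the error describes the thing being called: somewhere a lookup handed back None, and the code then tried to use it as a function.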
#208
Hobbyist Programmer
Okay, I hate coming here without presenting any attempted solutions, but I really have no idea why I'm still getting this reference error. The function works fine for the Company Name portion, but the ensuing parts (contacttable, companyprofile, etc.) aren't working and I'm getting the aforementioned reference error. I'd really appreciate it if someone could hint at what to do.
I'd love the luxury of being able to sit here and problem-solve for the entire summer, as I fear the next project that they're going to assign me, but I'm on a non-binding deadline of sorts. By this I mean my superior would like this finished by this week. Am I going to get fired if it's not done? I doubt it. But it would probably look very good if I could complete it on time. I think he underestimates the difficulty of teaching oneself a computer language and writing a (relatively?) advanced script with it...
And as always, I want to thank everyone who's helped thus far, especially Arevos for all his time and effort.
Also, going back to the problem at hand, when I test it using the following:
output = [name, companyprofile, contacttable, z, keypeople]
return output
#209
Programming Guru
Can you produce the entire program so that this code can be looked at in context?
#210
Hobbyist Programmer
Of course, my apologies:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
from time import sleep
import re
import sys

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

def get_industry_urls(industry_page):
    soup = BeautifulSoup(urlopen(industry_page))
    links = soup.fetch("table")[7].fetch("a")
    return [a['href'] for a in links if a.string != "Alphabetical"]

def get_company_index(industry_url):
    soup = BeautifulSoup(urlopen(industry_url))
    index_link = soup.fetch("table")[11].fetch("a")[2]
    return index_link['href']

def get_company_urls(company_index):
    soup = BeautifulSoup(urlopen(company_index))
    urls = soup.fetch("table")[21].fetch("a")
    return [a['href'] for a in urls if "q?s" not in a['href'] and a.string != "Public" and a.string != "Private / Foreign"]

def get_company_data(company_urls):
    soup = BeautifulSoup(urlopen(company_urls))
    #Company Name
    name = soup.firstText(re.compile("Company Profile"))
    #Company Profile - Table
    profile = soup.fetchText(re.compile("Company Profile"))[2]
    companyprofile = profile.findNext("table")
    #Contact Information - Table
    contact = soup.firstText(re.compile("Contact Information"))
    contacttable = contact.findParent("table")
    #Financial Highlights - Table
    highlights = soup.firstText(re.compile("Highlights"))
    fhighlights = highlights.findParent("table")
    if len(highlights) == 0:
        z = "N/A"
    else:
        z = fhighlights
    #Key People
    key = soup.firstText(re.compile("Key People"))
    keypeople = key.findParent("table")
    output = "<table>"
    output += "<tr>\n"
    output += "<td>" + name + "</td>\n"
    output += "<td>" + companyprofile + "</td>\n"
    output += "<td>" + contacttable + "</td>\n"
    output += "<td>" + z + "</td>\n"
    output += "<td>" + keypeople + "</td>"
    output += "</tr>"
    output += "</table>"
    return output

for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)
    for company_urls in get_company_urls(company_index):
        print get_company_data(company_urls)
        sleep(1)
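For anyone trying to narrow down which page triggers the error in a loop like the one above, here is a rough sketch of one debugging approach: catch the exception per URL, keep the traceback, and carry on. It uses Python 3 syntax and no network access. `crawl` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for get_company_data, failing on purpose for one made-up URL.

```python
# Sketch only: crawl and fake_extract are hypothetical; the URLs are made up.
import traceback


def crawl(urls, extract):
    """Call extract(url) for each URL; collect results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append((url, extract(url)))
        except Exception:
            # Keep the full traceback text so the failing line can be inspected later
            failures.append((url, traceback.format_exc()))
    return results, failures


def fake_extract(url):
    # Stand-in for get_company_data; raises for one URL to simulate a bad page
    if url.endswith("bad.html"):
        raise TypeError("'NoneType' object is not callable")
    return "<table>%s</table>" % url


results, failures = crawl(["a.html", "bad.html", "c.html"], fake_extract)
print(len(results), "succeeded;", len(failures), "failed")
for url, tb in failures:
    # The last traceback line holds the exception type and message
    print(url, "->", tb.splitlines()[-1])
```

Catching every exception like this is only reasonable while debugging; once the cause is known, a narrower check is usually the better choice.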