#181 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Thanks. No, I understand the difference between functions and variables; I just misused the terminology. That said, I'd like to know that my script works before embedding the HTML in it. I should be able to print the data if my script is correct, right? I'd like to know that I have something done: that at the very least I can run the script and print the data. Otherwise it feels like a daunting task to start embedding the HTML when I'm not even sure the script works in the first place. Is this an okay approach? (Maybe not ideal, as I'll probably have to backtrack and delete some of the returns that I currently have, but for peace of mind?)
Last edited by zem52887; May 25th, 2006 at 10:07 AM. |
|
|
|
|
|
#182 | |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
|
|
|
|
|
|
|
#183 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
I want to make sure that each function works in conjunction with the others, though (including the loop at the end and everything). I'll increase the sleep to 20 so it runs for one company, and I'll manually quit afterwards.
|
|
|
|
|
|
#184 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Okay, well, I tried running the script and, not surprisingly, it didn't work.
    def get_company_data(company_urls):
        soup = BeautifulSoup(urlopen(company_urls))
        #Company Name
        title = soup.fetch("title")
        output = "<title>\n"
        #Company Profile - Table
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        output += "<table>" + companyprofile + "</table>\n"
        #Contact Information - Table
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        output += "<table>" + contacttable + "</table>\n"
        #Financial Highlights - Table
        highlights = soup.firstText(re.compile("Highlights"))
        fhighlights = highlights.findParent("table")
        z = len(highlights)
        if z == 0:
            "N/A"
        else:
            fhighlights
        output += "<table>" + z + "</table>\n"
        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")
        output += "<table>" + keypeople + "</table>\n"
        return output

    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        for company_urls in get_company_urls(company_index):
            print get_company_data(company_urls)
            sleep(1)

Since I tested each of the other functions (namely get_industry_urls, get_company_index, and get_company_urls), I can isolate the problem to either the get_company_data function or the for-loop. I personally think I have errors in both, but I'm not sure where. If I had to guess, my if-statement isn't formatted properly. My output variables are also probably a little mucked up. On second thought, the only things that are probably correct in get_company_data are the locations of the tables. Other than that, I could be very far off.
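On the if-statement guess: in the function above, the bare expressions `"N/A"` and `fhighlights` inside the two branches are evaluated and then thrown away, and `z` holds an integer (the result of `len`), so `"<table>" + z` raises a TypeError. A minimal sketch of what the branch probably intends, written in Python 3 syntax (the thread's code is Python 2). The helper name `highlights_cell` is hypothetical, and plain strings or `None` stand in for whatever the parser returns:

```python
def highlights_cell(table_html):
    """Return the table markup, or "N/A" when no table was found.

    `table_html` is a stand-in for the parser result: a string of
    markup, or None when the lookup found nothing (an assumption).
    """
    if not table_html:
        # Assign the fallback instead of leaving a bare "N/A" expression.
        cell = "N/A"
    else:
        # str() guards against concatenating a non-string object.
        cell = str(table_html)
    return cell


output = ""
output += highlights_cell("<table><tr><td>Revenue</td></tr></table>") + "\n"
output += highlights_cell(None) + "\n"
print(output)
```

The point is only that each branch has to assign to a name, and that everything added to `output` has to be a string.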
|
|
|
|
|
#185 |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Well, I'm not quite sure why you're using "<table>" tags. Isn't it more logical to embed each table in a single "<td>" cell, and have one company per row?
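A minimal sketch of that layout, in Python 3 syntax and with hypothetical helper names (`company_row`, `companies_table`): each scraped table goes into its own `<td>`, each company becomes one `<tr>`, and a single outer `<table>` wraps all the rows. The input strings are placeholders, not real scraped data.

```python
def company_row(name, tables):
    """Build one <tr> for a company: name first, then one <td> per table."""
    cells = ["<td>%s</td>" % name]
    cells += ["<td>%s</td>" % table for table in tables]
    return "<tr>" + "".join(cells) + "</tr>"


def companies_table(rows):
    """Wrap the per-company rows in a single outer table."""
    return "<table>\n" + "\n".join(rows) + "\n</table>"


# Placeholder markup standing in for the scraped tables.
rows = [
    company_row("Example Co", ["<table>profile</table>", "<table>contact</table>"]),
]
print(companies_table(rows))
```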
|
|
|
|
|
|
#186 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Oh, indeed. I was confused about the output line, but now I understand.
The following should work, no?

    def get_company_data(company_urls):
        soup = BeautifulSoup(urlopen(company_urls))
        #Company Name
        title = soup.fetch("title")
        output = title
        #Company Profile - Table
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        output += companyprofile
        #Contact Information - Table
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        output += contacttable
        #Financial Highlights - Table
        highlights = soup.firstText(re.compile("Highlights"))
        fhighlights = highlights.findParent("table")
        z = len(highlights)
        if z == 0:
            "N/A"
        else:
            fhighlights
        output += z
        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")
        output += keypeople
        return output

    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        for company_urls in get_company_urls(company_index):
            print get_company_data(company_urls)
            sleep(1)

        profile = soup.fetchText(re.compile("Company Profile"))[2]
    IndexError: list index out of range

... but when I test this part of the function on an individual company, it works fine.
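The `IndexError` means that, on at least one page the loop visited, the list returned by `fetchText` had fewer than three items, so `[2]` had nothing to return. A minimal sketch of a guard, in Python 3 syntax, with a hypothetical helper `nth_or_none` and plain lists standing in for the search results:

```python
def nth_or_none(items, n):
    """Return items[n], or None when the list is too short."""
    if len(items) > n:
        return items[n]
    return None


# Made-up stand-ins for the lists of matches found on two kinds of page.
matches_on_company_page = ["nav link", "heading", "profile section"]
matches_on_other_page = ["nav link"]

print(nth_or_none(matches_on_company_page, 2))  # profile section
print(nth_or_none(matches_on_other_page, 2))    # None
```

Inside `get_company_data`, a `None` result could be used to skip the page or to report its URL rather than crash.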
|
|
|
|
|
#187 | |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
I suggest putting in some debugging code. Have it print the URL of each company page it accesses, and perhaps the HTML, too.
|
|
|
|
|
|
#188 | |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Quote:
|
|
|
|
|
|
|
#189 |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
You can just put in an extra print line:

    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        for company_urls in get_company_urls(company_index):
            print company_urls
            print get_company_data(company_urls)
            sleep(1)

Or catch the error and print the offending URL:

    import sys
    # ...
    for company_urls in get_company_urls(company_index):
        try:
            print get_company_data(company_urls)
        except IndexError:
            # Print out offending URL and then exit program
            print "IndexError on URL:", company_urls
            sys.exit(1)
        sleep(1)
|
|
|
|
|
#190 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Thanks, Arevos, that's interesting stuff. I see where the program is going wrong. From the main link, "http://biz.yahoo.com/ic/ind_index.html", you can sort by either "public" or "private", and instead of going to the company link, the script is attempting to fetch the companyprofile information from that page rather than from the company URL. When I asked it to print company_urls, it displayed http://biz.yahoo.com/ic/112_cl_pub.html, which is the list of public agriculture companies. From there, it attempted to apply the get_company_data function. At least, that's what I think happened. Now how do we go about troubleshooting this issue?
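One possible direction, sketched in Python 3 syntax: filter listing pages out before calling `get_company_data`. The helper `is_listing_url` is hypothetical, and the `_cl_pub.html` suffix is an assumption taken from the single URL printed in the debug output; other listing pages (for example, the private-company lists) may follow a different pattern and would need checking by hand.

```python
def is_listing_url(url):
    """Guess whether a URL is a company listing page rather than a company page.

    Assumption: listing pages end in "_cl_pub.html", as the one URL
    observed in the debug output does.
    """
    return url.endswith("_cl_pub.html")


urls = [
    "http://biz.yahoo.com/ic/112_cl_pub.html",  # listing page seen in the output
    "http://example.com/company/placeholder",   # made-up stand-in for a company page
]
company_urls = [u for u in urls if not is_listing_url(u)]
print(company_urls)
```

A filter like this only hides the symptom. If get_company_urls is returning listing pages at all, that function is probably the better place to look.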
|
|