Programming Forums
Old May 25th, 2006, 9:46 AM   #181
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Thanks. No, I do understand the difference between functions and variables; I just misused the terminology. That said, I'd like to know that my script works before embedding the HTML in it. I should be able to print the data if my script is correct, right? I'd like to know that I have something working, so that at the very least I can run the script and print the data. Otherwise it feels daunting to start embedding the HTML when I'm not even sure the script works in the first place. Is this an okay approach? (Maybe not ideal, as I'll probably have to backtrack and delete some of the returns I currently have, but it would give me peace of mind.)

Last edited by zem52887; May 25th, 2006 at 10:07 AM.
Old May 25th, 2006, 10:07 AM   #182
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
Thanks. No, I do understand the difference between functions and variables; I just misused the terminology. That said, I'd like to know that my function works before embedding the HTML in it. I should be able to print the data if my function is correct, right?
Yep, printing the output is a good way of checking the function works, though I'd suggest testing the function on a single page, rather than running through the thousands of pages the main program fetches.
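A minimal sketch of the single-page check, written for current Python 3 rather than the Python 2 used in this thread. It runs one extraction step against a page held in a string, so nothing is fetched from the network. `SAMPLE_PAGE` and `extract_title` are hypothetical names, and the regular expression is only a stand-in for the BeautifulSoup calls in the real script.

```python
import re

# Hypothetical stand-in for the HTML of one saved company page.
SAMPLE_PAGE = """<html><head><title>Example Corp</title></head>
<body><table><tr><td>Company Profile</td></tr></table></body></html>"""


def extract_title(html):
    """Return the text inside the first <title> tag, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Print the result for one page before looping over every company.
print(extract_title(SAMPLE_PAGE))
```

Once the printed value looks right for one page, the same function can be dropped into the main loop.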
Old May 25th, 2006, 10:10 AM   #183
zem52887
I want to make sure that each function works in conjunction with the others, though (including the loop at the end and everything). I'll increase the sleep to 20 so it runs for one company, and I'll manually quit afterwards.
Old May 25th, 2006, 10:20 AM   #184
zem52887
Okay, well, I tried running the script and, not surprisingly, it didn't work.

def get_company_data(company_urls):
        soup = BeautifulSoup(urlopen(company_urls))

        
        #Company Name
        title = soup.fetch("title")
        output = "<title>\n"
        
        #Company Profile - Table
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        output += "<table>" + companyprofile + "</table>\n"   
            
        #Contact Information - Table
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        output += "<table>" + contacttable + "</table>\n"  
            
        #Financial Highlights - Table
        highlights = soup.firstText(re.compile("Highlights"))
        fhighlights = highlights.findParent("table")
        
        z = len(highlights)
        if z == 0:
            "N/A"

        else:
            fhighlights
        output += "<table>" + z + "</table>\n"
        
        
        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")
        output += "<table>" + keypeople + "</table>\n"
        
        return output
            
for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
    
        for company_urls in get_company_urls(company_index):
            print get_company_data(company_urls)
            sleep(1)

Since I tested each of the other functions (namely get_industry_urls, get_company_index, and get_company_urls), I can isolate the problem to either the get_company_data function or the for-loop. I personally think I have errors in both, but I'm not sure where.

If I had to guess, I think that my if-statement isn't formatted properly. My output variables are also probably a little mucked up. On second thought, the only thing that's probably correct in the get_company_data function is the locations of the tables. Other than that, I could be very far off.
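For what it's worth, here is a minimal Python 3 sketch of what the if-statement guess points at. A bare `"N/A"` or `fhighlights` on a line of its own is evaluated and then thrown away, so each branch has to assign its result to a name. Anything appended to the output string also has to be a string, not the integer returned by `len`. `format_section` and the sample markup are hypothetical and not part of the script above.

```python
def format_section(table_html):
    """Return the table markup as a string, or "N/A" when nothing was found."""
    if not table_html:          # None or "" means the section is missing
        section = "N/A"         # assign the fallback instead of just naming it
    else:
        section = str(table_html)
    return section


# Hypothetical fragments standing in for what the parser would return.
found = "<table><tr><td>Revenue</td></tr></table>"
missing = None

output = "<td>" + format_section(found) + "</td>"
output += "<td>" + format_section(missing) + "</td>"
print(output)
```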
Old May 25th, 2006, 10:28 AM   #185
Arevos
Well, I'm not quite sure why you're using "<table>" tags. Isn't it more logical to embed each table in a single "<td>" cell, and have one company per row?
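A rough Python 3 sketch of that layout, assuming each extracted piece is already available as a string of HTML or plain text. `company_row` and the sample values are hypothetical names used only for illustration.

```python
def company_row(cells):
    """Wrap each fragment in a <td> cell and the whole set in one <tr> row."""
    return "<tr>" + "".join("<td>" + str(cell) + "</td>" for cell in cells) + "</tr>"


# One row per company; each extracted table or value gets its own cell.
rows = [
    company_row(["Example Corp", "profile table here"]),
    company_row(["Sample Ltd", "N/A"]),
]
page = "<table>\n" + "\n".join(rows) + "\n</table>"
print(page)
```

With this shape, get_company_data would return one row, and only the outer program would write the enclosing `<table>` tags.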
Old May 25th, 2006, 10:31 AM   #186
zem52887
Oh, indeed. I was confused about the output line, but now I understand.
The following should work, no?

def get_company_data(company_urls):
        soup = BeautifulSoup(urlopen(company_urls))
        
        #Company Name
        title = soup.fetch("title")
        output = title
        
        #Company Profile - Table
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")
        output += companyprofile  
            
        #Contact Information - Table
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        output += contacttable  
            
        #Financial Highlights - Table
        highlights = soup.firstText(re.compile("Highlights"))
        fhighlights = highlights.findParent("table")
        
        z = len(highlights)
        if z == 0:
            "N/A"

        else:
            fhighlights

        output += z
        
        
        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")
        output += keypeople
        
        return output
            
for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
    
        for company_urls in get_company_urls(company_index):
            print get_company_data(company_urls)
            sleep(1)
But when I run the script it tells me that:
profile = soup.fetchText(re.compile("Company Profile"))[2]
IndexError: list index out of range

... but when I test this part of the function on an individual company, it works fine.
Old May 25th, 2006, 10:54 AM   #187
Arevos
Quote:
Originally Posted by zem52887
But when I run the script it tells me that:
profile = soup.fetchText(re.compile("Company Profile"))[2]
IndexError: list index out of range

... but when I test this part of the function on an individual company, it works fine.
Well, it's essentially telling you that there are less than three "Company Profile" pieces of text on the page, when you're trying to get the third.

I suggest putting some debugging code in. Get it to print out the URL of each company page it accesses, and perhaps try getting it to print out the HTML, too.
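To illustrate the cause of the IndexError, here is a small Python 3 sketch that checks how many matches exist before indexing into the list. It uses the standard `re` module on a plain string instead of BeautifulSoup. `nth_match` and `SAMPLE_TEXT` are hypothetical names.

```python
import re

# Hypothetical page text containing only two "Company Profile" matches.
SAMPLE_TEXT = "Company Profile | Overview | Company Profile"


def nth_match(pattern, text, n):
    """Return the n-th match (0-based) of pattern in text, or None if there are fewer."""
    matches = re.findall(pattern, text)
    if len(matches) <= n:
        return None             # indexing matches[n] here would raise IndexError
    return matches[n]


print(nth_match("Company Profile", SAMPLE_TEXT, 1))  # second match exists
print(nth_match("Company Profile", SAMPLE_TEXT, 2))  # third match does not
```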
Old May 25th, 2006, 11:03 AM   #188
zem52887
Quote:
Originally Posted by Arevos
Well, it's essentially telling you that there are less than three "Company Profile" pieces of text on the page, when you're trying to get the third.
Hm, that makes sense but is strange nonetheless. I've tested 25+ links and none of them had only two company profiles, but I guess it only takes one. How do I go about debugging and having it display the company URL? Is that a mode in Python, or do I edit my print line?
Old May 25th, 2006, 11:42 AM   #189
Arevos
You can just put in an extra print line:
for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
    
        for company_urls in get_company_urls(company_index):
            print company_urls
            print get_company_data(company_urls)
            sleep(1)
I also suggest you look into try-except blocks. It'll be in the tutorial under error handling. Essentially, you can "catch" errors and have code to handle them in some way:
import sys

# ...

for company_urls in get_company_urls(company_index):
    try:
        print get_company_data(company_urls)
    except IndexError:
        # Print out offending URL and then exit program
        print "IndexError on URL:", company_urls
        sys.exit(1)
    sleep(1)
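A variant of the same idea, sketched in Python 3: rather than exiting on the first error, it records each failing URL and carries on, so one run shows every problem page. This is an alternative to the snippet above and not a replacement for it. `collect` and `fake_fetch` are hypothetical names, with `fake_fetch` standing in for get_company_data.

```python
def collect(urls, fetch):
    """Call fetch on each URL; return (results, failed_urls) instead of exiting."""
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except IndexError:
            failed.append(url)  # remember the offending URL and keep going
    return results, failed


def fake_fetch(url):
    """Hypothetical stand-in for get_company_data that fails on some pages."""
    if "bad" in url:
        raise IndexError("list index out of range")
    return "data for " + url


results, failed = collect(["http://example.com/good", "http://example.com/bad"], fake_fetch)
print("fetched:", results)
print("failed:", failed)
```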
Old May 25th, 2006, 11:53 AM   #190
zem52887
Thanks, Arevos, that's interesting stuff. I see where the program is going wrong. From the main link:
"http://biz.yahoo.com/ic/ind_index.html"
you can sort by either "public" or "private", and instead of following the company link, the script is attempting to fetch the company profile information from that listing page rather than from the company URL. When I asked it to print company_urls, it displayed:
http://biz.yahoo.com/ic/112_cl_pub.html
which is the list of the public agriculture companies. From there, it attempted to apply the get_company_data function.

At least, that's what I think happened. Now how do we go about troubleshooting this issue?
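One possible first check, sketched in Python 3: filter the listing pages out of the URL list before calling get_company_data. The pattern is an assumption based only on the one URL printed above (it ends in `_cl_pub.html`), so other listing pages may be named differently. `is_listing_page` is a hypothetical helper, and the second URL is a made-up placeholder.

```python
import re

# Assumption: listing pages end in "_cl_<something>.html", like the URL above.
LISTING_PAGE = re.compile(r"_cl_\w+\.html$")


def is_listing_page(url):
    """Return True if the URL looks like a company listing, not a company page."""
    return LISTING_PAGE.search(url) is not None


urls = [
    "http://biz.yahoo.com/ic/112_cl_pub.html",   # the listing page from the output
    "http://example.com/company.html",           # placeholder company page
]
company_only = [url for url in urls if not is_listing_page(url)]
print(company_only)
```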