Programming Forums
Old Jun 1st, 2006, 1:33 PM   #231
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Arevos, thank you so much. It's been outputting all day and the load time is significantly faster. The cache is now over 200 MB! Who would've thought it could take up so much space? Ultimately, I think I'm going to open the output in Word and print from there. I just have to figure out how to keep the tables from getting separated in Word. But Word is nice in that I can select all, change the font, and format the table nicely. I think we're just about done then. Shall I post the completed code for anyone else tracking or using this thread to learn Python?
Old Jun 1st, 2006, 1:46 PM   #232
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
I think we're just about done then. Shall I post the completed code for anyone else tracking or using this thread to learn Python?
If you wish. Good to know it's just about finished, anyway.
Old Jun 6th, 2006, 2:54 PM   #233
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
For the record, here it is:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
from time import sleep
import re
import sys
import shelve

industry_page = "http://biz.yahoo.com/ic/ind_index.html"
cache = shelve.open("cache.dat")

def cached_urlopen(url):
	if not cache.has_key(url):
		cache[url] = urlopen(url).read()
	return cache[url]

def get_industry_urls(industry_page):
	soup  = BeautifulSoup(cached_urlopen(industry_page))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links if a.string != "Alphabetical"]

def get_company_index(industry_url):
        soup  = BeautifulSoup(cached_urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']
                
def get_company_urls(company_index):
        soup = BeautifulSoup(cached_urlopen(company_index))
        urls = soup.fetch("table")[21].fetch("a")        
        return [a['href'] for a in urls if "q?s" not in a['href'] and a.string != "Public" and a.string != "Private / Foreign"]

def get_company_data(company_urls):
        soup = BeautifulSoup(cached_urlopen(company_urls))
        
        #Company Name
        name = soup.firstText(re.compile("Company Profile")).replace("Company Profile - Yahoo! Finance", "")
                
        #Company Profile - Table
        profile = soup.fetchText(re.compile("Company Profile"))[2]
        companyprofile = profile.findNext("table")         
            
        #Contact Information - Table
        contact = soup.firstText(re.compile("Contact Information"))
        contacttable = contact.findParent("table")
        
            
        #Financial Highlights - Table
        #(firstText finds nothing on pages without this section, so test
        # the result before looking up its parent table)
        highlights = soup.firstText(re.compile("Highlights"))

        if not highlights:
            z = "N/A"

        else:
            z = highlights.findParent("table")
                      
        #Key People
        key = soup.firstText(re.compile("Key People"))
        keypeople = key.findParent("table")
    
        #Public/Private
        chart = soup.fetchText(re.compile("Chart"))

        if len(chart) == 0:
            q = "<b>Priv</b>"

        else:
            q = "Pub"
        
        output = "<table border = 1>"
        output += "<tr>\n"
        output += "<td width=\"10%\">""<b>" + str(name) + "</b>""</td>"
        output += "<td width=\"35%\">" + str(companyprofile) + "</td>"
        output += "<td width=\"19.75%\">" + str(contacttable) + "</td>"
        output += "<td width=\"19.75%\">" + str(z) + "</td>"
        output += "<td width=\"15%\">" + str(keypeople) + "</td>"
        output += "<td width=\"0.5%\">" + str(q) + "</td>"
        output += "</tr>"
        output += "</table>"
        return output
        
file = open("data.html", "w")

file.write("<table>\n")    # \n means add a newline

for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
          
        for company_urls in get_company_urls(company_index):
                # Build the row once and reuse it for the file and the console
                company_html = get_company_data(company_urls)
                file.write(company_html)
                print company_html

file.write("</table>\n")
file.close()
cache.close()    # close the shelve cache so it is flushed to disk

Happy parsing!
Old Jun 7th, 2006, 7:51 AM   #234
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
That's strange... I tried running the program from my home computer, and it gave me a NoneType error: get_industry_urls cannot fetch table 7, or something along those lines. The script is identical, so I don't really know why that happened. Any ideas?
Old Jun 7th, 2006, 8:53 AM   #235
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Maybe Yahoo! changed its page layout slightly since the cache was built? I presume your home computer doesn't have the cache file.
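One way to guard against a stale cache like this is to store a timestamp with each page and re-download anything older than some limit. What follows is only a minimal sketch, not part of the script in this thread: `cached_fetch`, `fetch`, `max_age_seconds` and `now` are names made up for illustration, it is written to run on current Python rather than the Python 2 used above, and it assumes a fresh cache file, because the existing one holds bare page strings rather than (timestamp, page) tuples.

```python
import time

def cached_fetch(cache, url, fetch, max_age_seconds, now=time.time):
    """Return the page for url, re-downloading it if the cached copy is too old.

    cache is any dict-like object (a plain dict or a shelve shelf),
    fetch is a function that takes a URL and returns the page text.
    """
    current = now()
    entry = cache.get(url)          # entry is (timestamp, page) or None
    if entry is None or current - entry[0] > max_age_seconds:
        entry = (current, fetch(url))
        cache[url] = entry
    return entry[1]
```

With the script's own names, `fetch` could be something like `lambda u: urlopen(u).read()` and `cache` the shelf opened from "cache.dat". The age limit is a judgment call; a day or two would probably be enough to notice a layout change without re-downloading everything on each run.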
Old Jun 7th, 2006, 10:49 AM   #236
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
It does not, indeed... but shouldn't it be able to build one?
Old Jun 7th, 2006, 10:56 AM   #237
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
It does not, indeed... but shouldn't it be able to build one?
What I mean is that if Yahoo! has changed its layout, then your home computer will be attempting to get data from the new layout, whilst your work machine will be getting data from the old layout in the cache.

The only way to be sure is to open up an interactive Python session with IDLE and test the tables:
>>> from urllib2 import urlopen
>>> from BeautifulSoup import BeautifulSoup
>>> industry_page = "http://biz.yahoo.com/ic/ind_index.html"
>>> soup  = BeautifulSoup(urlopen(industry_page))
>>> table = soup.fetch("table")[7]
>>> table.fetch("a")
...etc
Interestingly enough, this works fine for me. Try deleting your cache and trying again.
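If throwing away the whole cache is too painful after a day of downloading, it may be enough to drop just the page that is misbehaving. This is a rough sketch under that assumption: `forget_cached_page` is a made-up helper name, it is written for current Python, and it relies only on the standard `shelve` module that the script already uses.

```python
import shelve

def forget_cached_page(cache_path, url):
    """Remove one URL from the shelve cache so it is downloaded again next run.

    Returns True if the URL was cached, False if there was nothing to remove.
    """
    cache = shelve.open(cache_path)
    try:
        if url in cache:
            del cache[url]
            return True
        return False
    finally:
        cache.close()   # always close, so the deletion reaches the disk
```

For the script above, that would be a call along the lines of `forget_cached_page("cache.dat", industry_page)`, run while the main script is not holding the cache open.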
Old Jun 7th, 2006, 10:56 AM   #238
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
At home? Or at work? (I assume home. I've tried running it in different directories, so wouldn't this accomplish the same thing?)

... Yeah, I doubt Yahoo changed anything on the webpage. The program doesn't have to be in any particular folder on my computer, right? Like, it is still able to import BeautifulSoup if it's saved to the desktop, right? (And if it wasn't, I don't think I'd get a NoneType error.)

If Firefox is my default browser at home while IE is at work, would this have any impact?
Old Jun 7th, 2006, 12:29 PM   #239
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
At home? Or at work? (I assume home. I've tried running it in different directories, so wouldn't this accomplish the same thing?)
At home. And I'd try removing the cache, just to be sure.

Also, try using the interactive interpreter to track down the problem as I outlined in my previous post.

You also could just ignore the problem - if it works at your workplace, isn't that good enough?

As an aside, the get_industry_urls function works okay on my machine when I call it directly. I haven't tried running the entire program (would take too long).
Quote:
Originally Posted by zem52887
... Yeah, I doubt Yahoo changed anything on the webpage. The program doesn't have to be in any particular folder on my computer, right? Like, it is still able to import BeautifulSoup if it's saved to the desktop, right? (And if it wasn't, I don't think I'd get a NoneType error.)
As long as BeautifulSoup is installed.

Quote:
Originally Posted by zem52887
If Firefox is my default browser at home while IE is at work, would this have any impact?
No; it doesn't rely on the browser.
Old Jun 7th, 2006, 1:33 PM   #240
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Hm, I'll use the interactive interpreter and try the above suggestions. Herein lies the problem: by my calculations the program takes about 10 hours to run. I work from 8am to 5pm, and apparently the computer network gets shut down at night. I tried running it from the moment I got to work until the end of the day, but I encountered an error at 4pm (which I think was caused by a lack of memory). It's relatively computer-intensive, at least on an office computer, if you can imagine (when I'm trying to open other programs and surf the net, the system slows to a crawl), so it's hard to get the last couple of industries.

I have all the data through Education Services, and I've been manually putting each industry into a separate Word document as opposed to one 80 MB HTML file that takes 10 minutes to open. (Also because I now have nothing to do except make it exceptionally pretty, hah.) So I'd like to just run the program fully at home, untouched, to get the last few industries, or I'm going to try to edit the program so that instead of going through all the industries from get_industry_urls it starts at Education & Training Services (which would solve my problem).
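Starting partway through the industry list could look roughly like the sketch below. It is only an illustration under some assumptions: `resume_from` is a made-up helper, the (name, url) pairs and the example.com addresses are placeholders, and get_industry_urls as posted returns only the hrefs, so it would have to be changed to return the link text as well (for instance `(a.string, a['href'])`) before this could be used.

```python
from itertools import dropwhile

def resume_from(items, is_start):
    """Skip items until is_start(item) is true, then keep that item and the rest."""
    return list(dropwhile(lambda item: not is_start(item), items))

# Hypothetical (name, url) pairs standing in for the real industry list
industries = [
    ("Drug Manufacturers", "http://example.com/drugs"),
    ("Education & Training Services", "http://example.com/education"),
    ("Electric Utilities", "http://example.com/electric"),
]

remaining = resume_from(
    industries,
    lambda pair: pair[0] == "Education & Training Services",
)

for name, url in remaining:
    print(name)
```

If the name never matches, the result is an empty list, so a typo in the industry name would make the run finish immediately instead of starting from the top.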

Last edited by zem52887; Jun 7th, 2006 at 2:01 PM.
DaniWeb IT Discussion Community
All times are GMT -5.