#231
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Arevos, thank you so much. It's been outputting all day and the load time is significantly faster. The cache is now over 200 MB! Who would've thought it could take up so much space? Ultimately, I think I'm going to open the output in Word and print from there; I just have to figure out how to keep the tables from getting separated in Word. But Word is nice in that I can select all, change the font, and format the tables nicely. I think we're just about done, then. Shall I post the completed code for anyone else following this thread to learn Python?

#232
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:

#233
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
For the record, here it is:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
from time import sleep
import re
import sys
import shelve

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

# Persistent on-disk cache of downloaded pages, keyed by URL
cache = shelve.open("cache.dat")

def cached_urlopen(url):
    # Download each page only once; later calls read it from the cache
    if not cache.has_key(url):
        cache[url] = urlopen(url).read()
    return cache[url]

def get_industry_urls(industry_page):
    soup = BeautifulSoup(cached_urlopen(industry_page))
    links = soup.fetch("table")[7].fetch("a")
    return [a['href'] for a in links if a.string != "Alphabetical"]

def get_company_index(industry_url):
    soup = BeautifulSoup(cached_urlopen(industry_url))
    index_link = soup.fetch("table")[11].fetch("a")[2]
    return index_link['href']

def get_company_urls(company_index):
    soup = BeautifulSoup(cached_urlopen(company_index))
    urls = soup.fetch("table")[21].fetch("a")
    return [a['href'] for a in urls
            if "q?s" not in a['href']
            and a.string != "Public"
            and a.string != "Private / Foreign"]

def get_company_data(company_url):
    soup = BeautifulSoup(cached_urlopen(company_url))
    # Company name
    name = soup.firstText(re.compile("Company Profile")).replace("Company Profile - Yahoo! Finance", "")
    # Company profile table
    profile = soup.fetchText(re.compile("Company Profile"))[2]
    companyprofile = profile.findNext("table")
    # Contact information table
    contact = soup.firstText(re.compile("Contact Information"))
    contacttable = contact.findParent("table")
    # Financial highlights table; check for a match before calling findParent
    highlights = soup.firstText(re.compile("Highlights"))
    if not highlights:
        z = "N/A"
    else:
        z = highlights.findParent("table")
    # Key people table
    key = soup.firstText(re.compile("Key People"))
    keypeople = key.findParent("table")
    # Public/private: only public companies have a "Chart" link
    chart = soup.fetchText(re.compile("Chart"))
    if len(chart) == 0:
        q = "<b>Priv</b>"
    else:
        q = "Pub"
    output = "<table border = 1>"
    output += "<tr>\n"
    output += "<td width=\"10%\"><b>" + str(name) + "</b></td>"
    output += "<td width=\"35%\">" + str(companyprofile) + "</td>"
    output += "<td width=\"19.75%\">" + str(contacttable) + "</td>"
    output += "<td width=\"19.75%\">" + str(z) + "</td>"
    output += "<td width=\"15%\">" + str(keypeople) + "</td>"
    output += "<td width=\".5%\">" + str(q) + "</td>"
    output += "</tr>"
    output += "</table>"
    return output

outfile = open("data.html", "w")
outfile.write("<table>\n")  # \n means add a newline
for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)
    for company_url in get_company_urls(company_index):
        # Build the row once, then both save and display it
        data = get_company_data(company_url)
        outfile.write(data)
        print data
outfile.write("</table>\n")
outfile.close()
cache.close()
Happy parsing!

#234
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
That's strange... I tried running the program from my home computer, and it gave me a NoneType error: something along the lines of get_industry_urls not being able to fetch table 7. The script is identical, so I don't really know why that happened. Any ideas?

#235
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Maybe Yahoo! changed its page layout slightly since the cache was built? I presume your home computer doesn't have the cache file.

#236
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
It does not, indeed... but shouldn't it be able to build one?
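For what it's worth, here is a minimal sketch of why a missing cache file should not be a problem by itself: shelve.open creates a new, empty file when none exists, and a fetch-on-miss wrapper like the script's cached_urlopen then fills it in page by page. The sketch is written for Python 3 (the script in this thread is Python 2), and fake_fetch, cached_fetch, and the example.com URL are made-up stand-ins rather than anything from the original program.

```python
import os
import shelve
import tempfile

def cached_fetch(cache, url, fetch):
    """Return the page for url, calling fetch(url) only on a cache miss."""
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]

calls = []

def fake_fetch(url):
    # Hypothetical stand-in for urlopen(url).read(); records each "download"
    calls.append(url)
    return "<html>page for %s</html>" % url

# Start from a folder that has no cache file at all
cache_file = os.path.join(tempfile.mkdtemp(), "cache.dat")
cache = shelve.open(cache_file)  # creates a new, empty cache
first = cached_fetch(cache, "http://example.com/a", fake_fetch)
second = cached_fetch(cache, "http://example.com/a", fake_fetch)
cache.close()

print(len(calls))       # 1: only the first request was actually fetched
print(first == second)  # True: the second request came from the cache
```

So a fresh machine simply downloads everything live on the first run, which means it sees whatever the site serves today rather than what was cached earlier.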

#237
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
The only way to be sure is to open up an interactive Python session with IDLE and test the tables:
>>> from urllib2 import urlopen
>>> from BeautifulSoup import BeautifulSoup
>>> industry_page = "http://biz.yahoo.com/ic/ind_index.html"
>>> soup = BeautifulSoup(urlopen(industry_page))
>>> table = soup.fetch("table")[7]
>>> table.fetch("a")
...etc
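The same kind of check can be done in one pass by listing how many links each table contains, which makes it easier to spot whether the table of interest has moved to a different index. The following is only a sketch under some assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it is self-contained, it is written for Python 3, and the sample HTML and the names TableLinkCounter and links_per_table are invented for illustration.

```python
from html.parser import HTMLParser

class TableLinkCounter(HTMLParser):
    """Count the <a> tags inside each <table>, in document order."""

    def __init__(self):
        super().__init__()
        self.counts = []       # counts[i] = number of links in table i
        self.open_tables = []  # indexes of tables currently open (nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.counts.append(0)
            self.open_tables.append(len(self.counts) - 1)
        elif tag == "a":
            # A link counts for every table that encloses it
            for i in self.open_tables:
                self.counts[i] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()

def links_per_table(html):
    parser = TableLinkCounter()
    parser.feed(html)
    return parser.counts

# Made-up page: a banner table with no links, then a table with two links
sample = """
<table><tr><td>banner</td></tr></table>
<table><tr><td><a href="/x">X</a><a href="/y">Y</a></td></tr></table>
"""
for index, count in enumerate(links_per_table(sample)):
    print(index, count)
```

Running this against the live page at home and at work and comparing the two listings would show whether index 7 still points at the table full of industry links.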

#238
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
At home or at work? (I assume home. I've tried running it in different directories, so wouldn't this accomplish the same thing?)
...Yeah, I doubt Yahoo changed anything on the webpage. The program doesn't have to be in any particular folder on my computer, right? Like, it is still able to open BeautifulSoup if it's saved to the desktop, right? (And if it weren't, I don't think I'd get a NoneType error.) If Firefox is my default browser at home while IE is at work, would this have any impact?
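One thing that does depend on the folder: relative file names such as "cache.dat" and "data.html" are resolved against the current working directory, not against the script's location, so launching the script from a different directory starts with a new, empty cache. A small sketch of one way to pin such files next to the script follows; path_beside and the example path are hypothetical names, and the code is written for Python 3.

```python
import os

def path_beside(script_file, filename="cache.dat"):
    """Build an absolute path to filename in the same folder as script_file."""
    folder = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(folder, filename)

# In a real script you would pass __file__; a made-up path is used here
print(path_beside(os.path.join("projects", "scraper", "yahoo.py")))
```

The result could then be handed to shelve.open in place of the bare "cache.dat".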

#239
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
Also, try using the interactive interpreter to track down the problem, as I outlined in my previous post. You could also just ignore the problem: if it works at your workplace, isn't that good enough?
As an aside, the get_industry_urls function works okay on my machine when I call it directly. I haven't tried running the entire program (it would take too long).
Quote:
Quote:

#240
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Hm, I'll use the interactive interpreter and try the above suggestions. Herein lies the problem: by my calculations the program takes about 10 hours to run. I work from 8am to 5pm, and apparently the computer network gets shut down at night. I tried running it from the moment I got to work until the end of the day, but I encountered an error at 4pm (which I think was caused by a lack of memory). It's relatively resource intensive, at least on an office computer, so when I try to open other programs and surf the net the system slows to a crawl, which makes it hard to get the last couple of industries.
I have all the data through Education Services, and I've been manually putting each industry into a separate Word document as opposed to one 80 MB HTML file that takes 10 minutes to open. (Also because I now have nothing to do except make it exceptionally pretty, hah.) So I'd like to just run the program fully at home, untouched, to get the last few industries, or else edit the program so that instead of going through all the industries from get_industry_urls it starts at Education & Training Services (which would solve my problem).
Last edited by zem52887; Jun 7th, 2006 at 2:01 PM.
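Starting partway through the industry list, as described above, could be sketched with itertools.dropwhile, which skips items until a condition stops holding and then passes everything else through. This is only an illustration under some assumptions: the script's get_industry_urls returns bare hrefs, so it would first need to return the link text alongside each URL; the (name, url) pairs and the resume_from helper below are made up; and the code is written for Python 3.

```python
from itertools import dropwhile

def resume_from(items, start, key=lambda item: item):
    """Skip items until one whose key equals start; return it and the rest."""
    return list(dropwhile(lambda item: key(item) != start, items))

# Made-up (name, url) pairs standing in for the real industry list
industries = [
    ("Advertising Agencies", "http://example.com/adv"),
    ("Education & Training Services", "http://example.com/edu"),
    ("Electric Utilities", "http://example.com/elec"),
]

remaining = resume_from(industries, "Education & Training Services",
                        key=lambda pair: pair[0])
for name, url in remaining:
    print(name)
```

If the starting name is never found, the result is an empty list, so a misspelled name shows up as a run that does nothing rather than one that silently starts over from the top.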