#71
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup

    industry_url = "http://biz.yahoo.com/ic/110.html"

    def get_company_index(industry_url):
        soup = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link

    print get_company_index(industry_url)

How about them apples? *crosses fingers*

Edit: wait, I'm closer but not there yet. I have the link AND the company name, not just the link. Or does that not matter?

#72
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5

Nearly! Remember that you want the value of the "href" attribute, since that's where the URL is.
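
To make the difference between the whole link and its "href" value concrete, here is a small offline sketch. It is only an illustration: it uses Python 3's standard html.parser module rather than the BeautifulSoup calls used in this thread, and the LinkCollector class, the HTML snippet and the path in it are all made up rather than taken from Yahoo!.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up markup for illustration; not the real Yahoo! page.
SAMPLE = (
    '<table><tr><td>'
    '<a href="/example/company_index.html">Company Index</a>'
    '</td></tr></table>'
)

collector = LinkCollector()
collector.feed(SAMPLE)
href, text = collector.links[0]
print(href)  # the URL stored in the href attribute
print(text)  # the visible link text
```

The whole tag carries both pieces of information; the "href" attribute is the part you can pass on to urlopen.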

#73
Hobbyist Programmer

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup

    industry_url = "http://biz.yahoo.com/ic/110.html"

    def get_company_index(industry_url):
        soup = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']

    print get_company_index(industry_url)

However, that only works when industry_url is one specific industry. I need to add a for loop if I want it to work together with the get_industry_urls code that you provided, no?

#74
Programming Guru

Quote:
Create some experimental code that will print out the company index URL for each of the industries listed on Yahoo!.

#75
Hobbyist Programmer

Of course... I mean, this won't work without a for loop, right?

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup

    industry_page = "http://biz.yahoo.com/ic/ind_index.html"

    def get_industry_urls(industry_page):
        soup = BeautifulSoup(urlopen(industry_page))
        links = soup.fetch("table")[7].fetch("a")
        return [a['href'] for a in links]

    industry_url = "get_industry_urls(industry_page)"

    def get_company_index(industry_url):
        soup = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']

    print get_company_index(industry_url)

#76
Hobbyist Programmer

It's blinking, and the suspense is killing me. I don't know if it's doing anything or if it's just taking a while because it has to work through something like 36,000 links.

#77
Programming Guru

Yes, you need a for loop. This is because get_industry_urls returns a list of all the URLs for each industry listed on Yahoo!, and get_company_index returns the company index URL for a single industry URL.
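
A rough sketch of that list-versus-single-value relationship, written so it runs without touching the network. The two functions here are hypothetical stand-ins with the same names and shapes as the real ones (one returns a list, the other takes a single URL), and the example.invalid addresses are made up:

```python
def get_industry_urls(industry_page):
    """Stand-in: returns a LIST of industry URLs (hard-coded, no network)."""
    return [industry_page + "/110.html", industry_page + "/120.html"]


def get_company_index(industry_url):
    """Stand-in: takes ONE industry URL and returns ONE company index URL."""
    return industry_url.replace(".html", "_index.html")


industry_page = "http://example.invalid/ic"

# get_company_index expects a single URL string, so the list has to be
# walked one item at a time rather than passed in whole.
company_indexes = []
for industry_url in get_industry_urls(industry_page):
    company_indexes.append(get_company_index(industry_url))

for url in company_indexes:
    print(url)
```

The loop is the glue between the two: one call produces many URLs, and each of them needs its own call to the second function.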

#78
Hobbyist Programmer

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup

    industry_page = "http://biz.yahoo.com/ic/ind_index.html"

    def get_industry_urls(industry_page):
        soup = BeautifulSoup(urlopen(industry_page))
        links = soup.fetch("table")[7].fetch("a")
        return [a['href'] for a in links]

    industry_url = "get_industry_urls(industry_page)"

    def get_company_index(industry_url):
        soup = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']

    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        print get_company_index(industry_url)

Wait... is that better? This is absolutely amazing: it's listing out every single company index page. It's taking a while, but I can't complain.

Whoa, wait, it just turned red on me and I got a bunch of errors. *tear* It was going so nicely...

Hm, that's strange. It's listing out all the company index links, but the first link it lists is:

Don't know why that's happening.

Last edited by zem52887; May 19th, 2006 at 4:34 PM.

#79
Programming Guru

Debugging is the skill of getting your programs to work. One way to debug a piece of code is to go through it and work out what it's doing.

    industry_url = "get_industry_urls(industry_page)"

    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        print get_company_index(industry_url)

Since hammering Yahoo! with some more iterative searches probably isn't a good idea, I'll show you the answer:

    for industry_url in get_industry_urls(industry_page):
        print get_company_index(industry_url)

Also, Yahoo! seems to have a filter of some kind to stop people from DDoSing its site. Thus, we need to put in a delay between fetching each page. The sleep function does this well: sleep(1) will wait for 1 second, sleep(2) will wait for 2, and so forth:

    from time import sleep

    for industry_url in get_industry_urls(industry_page):
        print get_company_index(industry_url)
        sleep(1)
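
If you want the run to keep going when a single page fails, one possible variation is sketched below in Python 3. It is an assumption-laden example, not the code from this thread: fetch_all and fake_fetch are hypothetical names, fake_fetch stands in for get_company_index so that nothing is downloaded, and the thread does not say what the actual errors were.

```python
import time


def fetch_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    Returns (results, failures) so that one bad URL does not stop the run.
    """
    results = {}
    failures = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)  # pause between requests, not before the first one
        try:
            results[url] = fetch(url)
        except Exception as error:  # deliberately broad for a sketch
            failures[url] = error
    return results, failures


def fake_fetch(url):
    """Hypothetical stand-in for get_company_index; no network access."""
    if not url.startswith("http://"):
        raise ValueError("not an absolute URL: %r" % url)
    return url + "#index"


urls = [
    "http://example.invalid/a.html",
    "relative.html",
    "http://example.invalid/b.html",
]
results, failures = fetch_all(urls, fake_fetch, delay=0.01)
print(sorted(results))
print(sorted(failures))
```

Passing the sleep function in as a parameter is just a convenience: it lets you try the loop with a tiny delay, or with no real waiting at all, before pointing it at a live site.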

#80
Programming Guru

Ha! Looks like you figured most of it out on your own