Programming Forums
Old May 19th, 2006, 4:03 PM   #71
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
 
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup


industry_url = "http://biz.yahoo.com/ic/110.html"

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link

print get_company_index(industry_url)

How 'bout them apples? *crosses fingers*
Edit: wait, I'm closer but not there yet. I have the link AND the company name, not just the link. Or does that not matter?
Old May 19th, 2006, 4:04 PM   #72
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Nearly! Remember that you want the value of the "href" attribute, since that's where the URL is.
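To illustrate the difference between a whole link tag and the value of its "href" attribute, here is a minimal sketch. It is written for Python 3 and uses the standard library's html.parser instead of the BeautifulSoup version used in this thread; the sample HTML, the LinkCollector class, and the /example/ path are all made up for the illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects an (href, text) pair for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links.append((self._current_href, "".join(self._text_parts)))
            self._current_href = None


# Hypothetical snippet standing in for one cell of the Yahoo! table.
sample_html = (
    '<table><tr><td>'
    '<a href="/example/company_index.html">Company Index</a>'
    '</td></tr></table>'
)

collector = LinkCollector()
collector.feed(sample_html)
href, text = collector.links[0]
print(href)  # the URL part only
print(text)  # the visible link text
```

The tag carries both pieces of information; asking for the attribute picks out just the URL, which is what a later urlopen call needs.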
Old May 19th, 2006, 4:07 PM   #73
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup


industry_url = "http://biz.yahoo.com/ic/110.html"

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']

print get_company_index(industry_url)
about time?

However, that only works when industry_url is set to one specific industry...

I need to add in a for loop if I want it to work in conjunction with the get_industry_url code that you provided, no??
Old May 19th, 2006, 4:13 PM   #74
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
I need to add in a for loop if I want it to work in conjunction with the get_industry_url code that you provided, no??
Try it!

Create some experimental code that will print out the company index URL for each of the industries listed on Yahoo!.
Old May 19th, 2006, 4:13 PM   #75
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Of course... I mean, this won't work without a for loop, right?

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

def get_industry_urls(industry_page):
	soup  = BeautifulSoup(urlopen(industry_page))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links]

industry_url = "get_industry_urls(industry_page)"

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']

print get_company_index(industry_url)
Old May 19th, 2006, 4:16 PM   #76
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
It's blinking and the suspense is killing me. I don't know if it's doing anything or if it's just taking a while because it has to run for something like 36,000 links.
Old May 19th, 2006, 4:16 PM   #77
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Yes, you need a for loop. This is because get_industry_urls returns a list of all the URLs for each industry listed on Yahoo!, and get_company_index returns the company index URL for a single industry URL.
Old May 19th, 2006, 4:21 PM   #78
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

def get_industry_urls(industry_page):
	soup  = BeautifulSoup(urlopen(industry_page))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links]

industry_url = "get_industry_urls(industry_page)"

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return index_link['href']
    
for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)

    print get_company_index(industry_url)

wait... is that better?

This is absolutely amazing: it's listing out every single company index page... It's taking a while, but I can't complain. Whoa, wait, it just turned red on me and I got a bunch of errors.

*tear it was going so nicely...

hm that's strange... it's listing out all the company index links but the first link it lists is:

don't know why that's happening

Last edited by zem52887; May 19th, 2006 at 4:34 PM.
Old May 19th, 2006, 4:37 PM   #79
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Debugging is the skill of getting your programs to work. One way to debug a piece of code is to go through it and work out what it's doing.
industry_url = "get_industry_urls(industry_page)"
This sets the industry_url to hold a piece of text (or a string). Strings are denoted by quotation marks. I'll hazard a guess and say that this isn't what you want to do.
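A small sketch of the difference, written for Python 3. The get_industry_urls here is a stand-in that returns two made-up URLs instead of fetching anything, and example.invalid is a placeholder address.

```python
def get_industry_urls(industry_page):
    # Stand-in for the real function: no network access, just two fake URLs.
    return [industry_page + "/a.html", industry_page + "/b.html"]


industry_page = "http://example.invalid"

# With quotation marks: nothing is called, the variable holds the characters.
as_text = "get_industry_urls(industry_page)"

# Without quotation marks: the function runs and the variable holds its result.
as_call = get_industry_urls(industry_page)

print(type(as_text).__name__, as_text)
print(type(as_call).__name__, as_call)
```

Passing the quoted version to urlopen would ask it to open the literal text "get_industry_urls(industry_page)", which is not a URL.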

for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)
This piece of code goes through each of the industries in turn, and for each industry, it puts the company index into a variable called "company_index", overwriting the previous value.
print get_company_index(industry_url)
This code uses the last industry_url value and gets the company index of that.
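The overwriting can be seen without touching Yahoo! at all. This is a sketch in Python 3 with a stand-in get_company_index and three invented industry names; it only shows what ends up in the variables.

```python
def get_company_index(industry_url):
    # Stand-in: derives a fake company index from the industry URL.
    return industry_url + "/companies"


industry_urls = ["industry1", "industry2", "industry3"]  # invented values

# Assigning inside the loop and reading after it: each pass replaces the
# previous value, so only the result for the last industry survives.
for industry_url in industry_urls:
    company_index = get_company_index(industry_url)
after_loop = company_index

# Using the value inside the loop body: every industry is handled.
inside_loop = []
for industry_url in industry_urls:
    inside_loop.append(get_company_index(industry_url))

print(after_loop)
print(inside_loop)
```

Note that the loop variable industry_url also keeps its last value once the loop has finished, which is why a print placed after the loop still runs without error.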

Since hammering Yahoo! with some more iterative searches probably isn't a good idea, I'll show you the answer:
for industry_url in get_industry_urls(industry_page):
	print get_company_index(industry_url)
However, you have to understand the answer, otherwise it's of no use, so explain why it works.

Also, Yahoo! seems to have a filter of some kind to stop people from DDoSing its site. Thus, we need to put in a delay between fetching each page. The sleep function does this well: sleep(1) will wait for 1 second, sleep(2) will wait for 2, and so forth:
from time import sleep

for industry_url in get_industry_urls(industry_page):
	print get_company_index(industry_url)
	sleep(1)
Again, explain what this does.
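One way to try the delay without sending any requests is to swap the fetch for a stand-in. The sketch below is Python 3; fetch_all, fake_fetch, and the short 0.05 second delay are assumptions made for the demonstration, and unlike the loop above it pauses only between fetches rather than after every one.

```python
from time import monotonic, sleep


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every fetch except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


def fake_fetch(url):
    # Stand-in for get_company_index: no network access.
    return "index for " + url


started = monotonic()
pages = fetch_all(["a", "b", "c"], fake_fetch, delay_seconds=0.05)
elapsed = monotonic() - started

print(pages)
print("took about %.2f seconds" % elapsed)
```

With three URLs there are two pauses, so the run takes roughly twice the delay; with a few hundred industries and a 1 second delay the whole crawl takes a few minutes, which is the price of not hammering the server.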
Old May 19th, 2006, 4:37 PM   #80
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Ha! Looks like you figured most of it out on your own.
DaniWeb IT Discussion Community
All times are GMT -5. The time now is 5:47 PM.
