Programming Forums
Old May 19th, 2006, 10:49 AM   #51
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
ah wow nub move there, the beginning part is like a redirect kind of... I thought they were two separate entities. I need some time to digest the above post but I think I'll be able to get the hang of this.
Old May 19th, 2006, 10:56 AM   #52
coldDeath
Expert Programmer
 
Join Date: Aug 2005
Location: UK
Posts: 862
Rep Power: 3 coldDeath is on a distinguished road
Quote:
Originally Posted by zem52887
ah wow nub move there, the beginning part is like a redirect kind of... I thought they were two separate entities. I need some time to digest the above post but I think I'll be able to get the hang of this.
That's the idea. Sit down and think about it logically. Break it down into easier chunks and processes.

Also remember that you have access to the Python and Beautiful Soup documentation, so you can refer to those to see the capabilities of both.
__________________
Join us at #programmingforums @ irc.freenode.net!

My software never has bugs. It just develops random features.
Old May 19th, 2006, 11:49 AM   #53
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
The HTML below contains the link for "Company Index." I'm attempting to write the get_company_urls function, but I can't figure out how to isolate the company link. In the first example we were able to isolate the links with the following:

Quote:
This line gets all "a" tags in table 7, and gives this list of tags the name "links".

Code:
return [a['href'] for a in links]
This code takes the "href" attribute from each link, and constructs a new list. This list is returned from the function.
However, in that case we wanted every link in the table (if I remember correctly). So if we use the same script here, won't it return not only the Company Index link but also "Industry Browser" etc.?

<table
border=0
cellpadding=2
cellspacing=0
width=100%><tr><td
width=1%><img
src=http://us.i1.yimg.com/us.yimg.com/i/us/fi/03rd/selectorgray.gif></td><td
nowrap><font
face=arial
size=-1>
Summary
</font></td></tr><tr><td
width=1%>·</td><td
nowrap><a
href="http://us.rd.yahoo.com/finance/industry/morenews/moremod/*http://biz.yahoo.com/ic/news/112.html"><font
face=arial
size=-1>News</font></a></td></tr><tr><td
width=1%>·</td><td
nowrap><a
href="http://us.rd.yahoo.com/finance/industry/morell/moremod/*http://biz.yahoo.com/ic/ll/112pip.html"><font
face=arial
size=-1>Leaders
&amp;
Laggards</font></a></td></tr><tr><td
width=1%>·</td><td
nowrap><font
face=arial
size=-1>
<a href="http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/*http://biz.yahoo.com/ic/112_cl_all.html">Company Index</a>
</font></td></tr><tr><td
width=1%>·</td><td
nowrap><a
href="http://us.rd.yahoo.com/finance/industry/morecolist/moremod/*http://biz.yahoo.com/p/112conameu.html"><font
face=arial
size=-1>Industry
Browser</font></a></td></tr></table><table><tr><td
height=10></td></tr></table><table
border=0
width=100%
cellpadding=4
cellspacing=0><tr
bgcolor=556F93><td
valign=top
nowrap><font
face=verdana
size=-2
color=ffffff><b>Related
Industries</b></font></td></tr></table>

Thus, I can't figure out how to isolate the one link so that I can create a function which fetches it... I'm going to play around with it but if anyone has any suggestions they're welcome
Old May 19th, 2006, 12:20 PM   #54
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Okay, this is what I have so far, compiled pretty much from what you said in your posts, plus my attempt to define the function. I used the same format you used for getting the links even though I know it's wrong; I just wanted to see if I could get something (even if it was more than just the company_index links). I encountered some errors, so I think there are mistakes in the code beyond just the function definition.

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

def get_industry_urls(industry_page):
    soup  = BeautifulSoup(urlopen(industry_page))
    links = soup.fetch("table")[7].fetch("a")
    return [a['href'] for a in links]

industry_url = get_industry_urls(industry_page)

def get_company_index(industry_url):
    soup  = BeautifulSoup(urlopen(industry_url))
    company_links = soup.fetch("table")[9].fetch("a")[3]
    return [a['href'] for a in company_links]

company_url = get_company_index(industry_url)

for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)

    for company_url in get_company_index(company_index):
        print get_company_data(company_url)

is this kind of what it's supposed to look like?

Last edited by zem52887; May 19th, 2006 at 12:33 PM.
Old May 19th, 2006, 12:20 PM   #55
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
In the get_industry_urls function, there's this line:
links = soup.fetch("table")[7].fetch("a")
This line fetches all the links in table 7. Notice the [7]. This says you want only one item from the list.

A list contains multiple items. It can contain two, three, four or millions of items. But it can also contain just one, or no items whatsoever. It's still a list, regardless of how many items it has. In order to pull a single item from a list, you need to specify its index.

For instance, if you have a list with a single letter in it:
list = ["a"]
To get the letter out of the list, you have to specify its index. Because the list is only one item long, there is only one valid index, 0:
letter = list[0]
The same thing applies to your links. In order to get the first (and only) element out of a list of links:
single_link = soup.fetch("table")[7].fetch("a")[0]
BeautifulSoup provides a shortcut for this type of command, though. The following code is equivalent to the above code:
single_link = soup.fetch("table")[7].a
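To make the indexing point concrete, here is a minimal sketch that uses plain strings as stand-ins for the tags BeautifulSoup would return. The link names come from the menu HTML posted earlier in the thread; the variable names are made up for illustration, and the sketch uses Python 3 print syntax rather than the Python 2 syntax used elsewhere in this thread.

```python
# Stand-ins for the tags a fetch("a") call would return (an assumption:
# real tags are objects, but indexing a list of them works the same way).
links = ["News", "Leaders & Laggards", "Company Index", "Industry Browser"]

first = links[0]          # indexes start at 0, so this is the first item
company_index = links[2]  # the third item has index 2

single = ["a"]            # a list with one item is still a list
letter = single[0]        # 0 is the only valid index here

print(first, company_index, letter)
```

Counting from zero is the usual stumbling block: the Nth item in a list lives at index N-1.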
Old May 19th, 2006, 12:23 PM   #56
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
Also, I should add that if you're not dealing with a list, there's no need for a list comprehension:
return link['href']
List comprehensions are needed for the same reason you need for-loops: to apply the same operation to multiple elements. In this case, the list comprehension takes the "href" attribute from each element in the list. If you're only dealing with a single element, a list comprehension is not needed.
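A small side-by-side sketch of the two cases may help. Dictionaries stand in for tags here (an assumption made so the example runs without BeautifulSoup or a network connection), and the URLs are shortened, illustrative versions of the ones in the posted HTML. Python 3 syntax.

```python
# Dicts stand in for tags (assumption): like a BeautifulSoup tag,
# a dict lets you read a value with link["href"].
links = [
    {"href": "http://biz.yahoo.com/ic/news/112.html"},
    {"href": "http://biz.yahoo.com/ic/112_cl_all.html"},
]

# Many elements: the comprehension applies the same lookup to each one
# and builds a new list of strings.
all_urls = [a["href"] for a in links]

# One element: a plain lookup is enough, and the result is a single string.
link = links[1]
one_url = link["href"]

print(all_urls)
print(one_url)
```

The return type differs between the two: the first gives a list of URLs to loop over, the second gives one URL that can be passed straight to another function.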
Old May 19th, 2006, 12:25 PM   #57
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Okay, with that in mind let me amend my above post to see if I get any closer.
Old May 19th, 2006, 12:33 PM   #58
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
is this kind of what it's supposed to look like?
Nearly. Your problem is your unfamiliarity with functions and lists.

Think of a function as a mysterious machine. You put items into this machine, and get out other items. For instance, say you had a machine that painted everything red; it doesn't matter what you put into the machine, it'll always come out red.

In Python, this red-painting machine might look like this:
def paint_red(item):
	item.colour = "red"
And you might use this function like so:
# Create a cat called Percy
cat = Cat("Percy")
# Create a table
table = Table()
# Create a chair
chair = Chair()

paint_red(cat)
paint_red(table)
paint_red(chair)
Notice that it doesn't matter what goes into the function, it all gets treated the same.
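The snippet above assumes that Cat, Table and Chair classes already exist. A minimal runnable sketch, with hypothetical bare-bones classes and made-up starting colours added purely so the example executes (Python 3 syntax):

```python
class Cat:
    def __init__(self, name):
        self.name = name
        self.colour = "tabby"   # made-up starting colour


class Table:
    colour = "brown"            # made-up starting colour


class Chair:
    colour = "brown"            # made-up starting colour


def paint_red(item):
    # Works on any object that can take a "colour" attribute.
    item.colour = "red"


cat = Cat("Percy")
table = Table()
chair = Chair()

# The function treats every argument the same way.
for thing in (cat, table, chair):
    paint_red(thing)

print(cat.colour, table.colour, chair.colour)
```

The function never checks what kind of object it was given; it only relies on being able to set a colour on it.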

Another key point is the concept of references. The name of an object is independent of what it is. Or, to put it another way, a rose by any other name would still smell as sweet. If you called a chair a "foobar", then it would still have four legs and a back. Changing the name of something doesn't change what it is.

So, take the following code:
cat = Cat("Percy")
item = cat
This code tells Python that Percy the Cat is referenced by the "cat" variable. The "cat" variable is something that points to Percy, like a signpost. The "item" variable also points to Percy. You can call Percy a cat, or you can call him an item; it's still the same cat.
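One way to see this for yourself is Python's `is` operator, which checks whether two names point at the same object. A short sketch, with a hypothetical minimal Cat class added so it runs (Python 3 syntax):

```python
class Cat:
    def __init__(self, name):
        self.name = name


cat = Cat("Percy")
item = cat            # a second name for the same object, not a copy

print(item is cat)    # prints True: both names point at one object

item.colour = "red"   # change the object through one name...
print(cat.colour)     # prints red: the change shows through the other name
```

The same thing happens with function parameters: inside get_company_index(industry_url), the name industry_url is just a label for whatever value the caller passed in, whatever that value was called outside the function.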

Is that a little more understandable?
Old May 19th, 2006, 12:47 PM   #59
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Yes, thanks for the clarification. I personally prefer Dr. Seuss's Sneetch analogy, but red chairs are cool too... In any event, if lists start at zero, then I guess I'm really looking for the 8th table and the 2nd link within this table.

err edit: i should still use the 9th table, but the 2nd link no...

now i have to figure out what I renamed improperly, I was merely renaming so I would be able to figure out what I was inputting as opposed to a link, to make it a little more readable if you're unfamiliar with the links.
Old May 19th, 2006, 1:39 PM   #60
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
One nice thing about functions is that they can be tested without testing the whole program.

For instance, thanks to my web browser, I happen to know that the Yahoo! Agricultural Chemicals page is at "http://biz.yahoo.com/ic/112.html", and that the company index points to "http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/*http://biz.yahoo.com/ic/112_cl_all.html"

Armed with this information, I can write a quick test for the function:
print get_company_index("http://biz.yahoo.com/ic/112.html")
raw_input("Press enter to continue")
I can check that what's printed is the correct URL. If it is, then I know the function works.
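The same kind of check can be run without a network connection if the parsing step is tested against a saved piece of HTML. The sketch below is only an illustration under several assumptions: it swaps in the standard library's html.parser instead of BeautifulSoup so that it runs on a stock Python 3 install, find_link_by_text and LinkCollector are hypothetical names, and SAMPLE is a trimmed copy of the menu HTML posted earlier in the thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects a (text, href) pair for every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse newlines and repeated spaces in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def find_link_by_text(html, wanted):
    """Return the href of the first link whose text equals `wanted`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    for text, href in parser.links:
        if text == wanted:
            return href
    return None


# Trimmed copy of the menu HTML from earlier in the thread.
SAMPLE = '''
<table><tr><td><a href="http://biz.yahoo.com/ic/news/112.html"><font size=-1>News</font></a></td></tr>
<tr><td><font size=-1>
<a href="http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/*http://biz.yahoo.com/ic/112_cl_all.html">Company Index</a>
</font></td></tr></table>
'''

expected = "http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/*http://biz.yahoo.com/ic/112_cl_all.html"
result = find_link_by_text(SAMPLE, "Company Index")
assert result == expected, result
print("ok:", result)
```

Looking a link up by its visible text, rather than by its position in the table, is a design choice rather than what the thread's code does; it has the side benefit that the test keeps passing if the page gains or loses a menu entry.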

Another trick is to comment out code that you don't want executed. The main program loop isn't finished yet, so it'll throw an error if you run it. To stop this error occurring, you can comment out the unfinished code like so:

#for industry_url in get_industry_urls(industry_page):
#    company_index = get_company_index(industry_url)
#
#    for company_url in get_company_index(company_index):
#        print get_company_data(company_url)