#51 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
ah wow nub move there, the beginning part is like a redirect kind of... I thought they were two separate entities. I need some time to digest the above post but I think I'll be able to get the hang of this.
|
|
|
|
|
|
#52 | |
|
Expert Programmer
|
Quote:
Also remember that you have access to the Python and Beautiful Soup documentation, so you can refer to those to see the capabilities of both.
__________________
Join us at #programmingforums @ irc.freenode.net! My software never has bugs. It just develops random features.
|
|
|
|
|
|
|
#53 | |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
The HTML below contains the link for "Company Index." I'm attempting to write the get_company_urls definition, but I can't figure out how to isolate the company link. In the first example we were able to isolate the links by the following:
Quote:
<table border=0 cellpadding=2 cellspacing=0 width=100%>
<tr><td width=1%><img src=http://us.i1.yimg.com/us.yimg.com/i/us/fi/03rd/selectorgray.gif></td><td nowrap><font face=arial size=-1> Summary </font></td></tr>
<tr><td width=1%>·</td><td nowrap><a href="http://us.rd.yahoo.com/finance/industry/morenews/moremod/*http://biz.yahoo.com/ic/news/112.html"><font face=arial size=-1>News</font></a></td></tr>
<tr><td width=1%>·</td><td nowrap><a href="http://us.rd.yahoo.com/finance/industry/morell/moremod/*http://biz.yahoo.com/ic/ll/112pip.html"><font face=arial size=-1>Leaders & Laggards</font></a></td></tr>
<tr><td width=1%>·</td><td nowrap><font face=arial size=-1> <a href="http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/*http://biz.yahoo.com/ic/112_cl_all.html">Company Index</a> </font></td></tr>
<tr><td width=1%>·</td><td nowrap><a href="http://us.rd.yahoo.com/finance/industry/morecolist/moremod/*http://biz.yahoo.com/p/112conameu.html"><font face=arial size=-1>Industry Browser</font></a></td></tr>
</table>
<table><tr><td height=10></td></tr></table>
<table border=0 width=100% cellpadding=4 cellspacing=0><tr bgcolor=556F93><td valign=top nowrap><font face=verdana size=-2 color=ffffff><b>Related Industries</b></font></td></tr></table>
Thus, I can't figure out how to isolate the one link so that I can create a function which fetches it... I'm going to play around with it, but if anyone has any suggestions they're welcome.
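One way to isolate a single link is to match on its visible text rather than on its position in the table. The sketch below is only an illustration under a few assumptions: it is written for Python 3 and uses the standard library's html.parser rather than the Beautiful Soup calls used elsewhere in this thread; the names LinkTextFinder, find_link_by_text and strip_redirect are made up for this example; and it assumes the real destination is whatever follows the "*" in the redirect-style URL.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link_by_text(html, wanted):
    """Return the href of the first link whose text equals `wanted`, else None."""
    finder = LinkTextFinder()
    finder.feed(html)
    finder.close()
    for href, text in finder.links:
        if text == wanted:
            return href
    return None


def strip_redirect(url):
    """Keep only what follows the first '*' (assumed to be the real target)."""
    return url.split("*", 1)[1] if "*" in url else url


# Two rows copied from the table quoted above.
sample = (
    '<table><tr><td><a href="http://us.rd.yahoo.com/finance/industry/morenews/moremod/'
    '*http://biz.yahoo.com/ic/news/112.html"><font size=-1>News</font></a></td></tr>'
    '<tr><td><font size=-1><a href="http://us.rd.yahoo.com/finance/industry/morecoindex/'
    'moremod/*http://biz.yahoo.com/ic/112_cl_all.html">Company Index</a></font></td></tr>'
    '</table>'
)

company_index = find_link_by_text(sample, "Company Index")
print(company_index)
print(strip_redirect(company_index))   # http://biz.yahoo.com/ic/112_cl_all.html
```

Matching on the link text keeps working if the page adds or removes a row, which a fixed index such as [3] would not.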
|
|
|
|
|
|
#54 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
Okay, this is what I have so far, compiling pretty much just what you said in your posts, plus my attempt to define the function. I used the same format you used for getting the links even though I know it's wrong; I just wanted to see if I could get something (even if it was more than just company_index links). I encountered some errors, so I think there are mistakes in the code other than just the function definition.
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

def get_industry_urls(industry_page):
    soup = BeautifulSoup(urlopen(industry_page))
    links = soup.fetch("table")[7].fetch("a")
    return [a['href'] for a in links]

industry_url = get_industry_urls(industry_page)

def get_company_index(industry_url):
    soup = BeautifulSoup(urlopen(industry_url))
    company_links = soup.fetch("table")[9].fetch("a")[3]
    return [a['href'] for a in company_links]

company_url = get_company_index(industry_url)

for industry_url in get_industry_urls(industry_page):
    company_index = get_company_index(industry_url)

    for company_url in get_company_index(company_index):
        print get_company_data(company_url)

Is this kind of what it's supposed to look like?

Last edited by zem52887; May 19th, 2006 at 12:33 PM.
|
|
|
|
|
#55 |
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4
![]() |
In the get_industry_urls function, there's this line:
links = soup.fetch("table")[7].fetch("a")
A list contains multiple items. It can contain two, three, four or millions of items. But it can also contain just one, or no items whatsoever. It's still a list, regardless of how many items it has. In order to pull a single item from a list, you need to specify its index. For instance, if you have a list with a single letter in it:
letters = ["a"]
letter = letters[0]
The same applies to a list of links:
single_link = soup.fetch("table")[7].fetch("a")[0]
Beautiful Soup also has a shorthand for the first matching tag:
single_link = soup.fetch("table")[7].a
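To make the difference between "a list with one item" and "the item itself" concrete, here is a small sketch in Python 3. The strings are hypothetical stand-ins for the tags that fetch("a") would return; nothing here touches Beautiful Soup.

```python
# Hypothetical stand-in for the list returned by soup.fetch("table")[7].fetch("a").
links = ["news.html", "leaders.html", "index.html"]

single_item_list = links[2:3]   # slicing always gives back a list
single_item = links[2]          # indexing gives back the item itself

print(single_item_list)   # ['index.html']
print(single_item)        # index.html

# A list is still a list whether it holds zero, one or many items.
print(len([]), len(single_item_list), len(links))   # 0 1 3
```

Indexes count from zero, so links[2] is the third item.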
|
|
|
|
|
#56 |
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4
![]() |
Also, I should add that if you're not dealing with a list, there's no need for a list comprehension:
return link['href'] |
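A rough side-by-side of the two cases, as a Python 3 sketch. Plain dicts stand in for Beautiful Soup tags here (both support link["href"]), and the function names get_all_hrefs and get_one_href are invented for the illustration.

```python
def get_all_hrefs(links):
    """Many links in, many hrefs out: a list comprehension fits."""
    return [link["href"] for link in links]


def get_one_href(link):
    """One link in, one href out: just return the value."""
    return link["href"]


# Dicts used as hypothetical stand-ins for <a> tags.
links = [{"href": "news.html"}, {"href": "index.html"}]

print(get_all_hrefs(links))     # ['news.html', 'index.html']
print(get_one_href(links[1]))   # index.html
```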
|
|
|
|
|
#57 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
okay with that in mind let me amend my above post to see if I get any closer
|
|
|
|
|
|
#58 | |
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4
![]() |
Quote:
Think of a function as a mysterious machine. You put items into this machine, and get out other items. For instance, say you had a machine that painted everything red; it doesn't matter what you put into the machine, it'll always come out red. In Python, this red-painting machine might look like this:
def paint_red(item):
    item.colour = "red"
# Create a cat called Percy
cat = Cat("Percy")
# Create a table
table = Table()
# Create a chair
chair = Chair()
paint_red(cat)
paint_red(table)
paint_red(chair)
Another key point is the concept of references. The name of an object is independent of what it is. Or, to put it another way, a rose by any other name would still smell as sweet. If you called a chair a "foobar", then it would still have four legs and a back. Changing the name of something doesn't change what it is. So, take the following code:
cat = Cat("Percy")
item = cat
Here, item and cat are two names for the same object.
Is that a little more understandable?
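For anyone who wants to run the red-painting machine, here is a self-contained Python 3 sketch. The Cat and Table classes and their starting colours are assumptions added so the example executes; the original snippet leaves them undefined.

```python
class Cat:
    def __init__(self, name):
        self.name = name
        self.colour = "black"   # assumed starting colour


class Table:
    def __init__(self):
        self.colour = "brown"   # assumed starting colour


def paint_red(item):
    # Works on anything that can carry a `colour` attribute.
    item.colour = "red"


cat = Cat("Percy")
table = Table()

paint_red(cat)
paint_red(table)
print(cat.colour, table.colour)   # red red

# A second name refers to the same object; nothing is copied.
item = cat
print(item is cat)                # True
item.colour = "blue"
print(cat.colour)                 # blue
```

Inside paint_red, the parameter name item is just another label for whatever object was passed in, which is why the change is visible to the caller.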
|
|
|
|
|
|
#59 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
yes yes thanks for the clarification, I personally prefer dr. seuss's sneetch analogy but red chairs are cool too... in any event, if lists start at zero, then I guess I'm really looking for the 8th table and the 2nd link within this table.
err edit: i should still use the 9th table, but the 2nd link no... now i have to figure out what I renamed improperly, I was merely renaming so I would be able to figure out what I was inputting as opposed to a link, to make it a little more readable if you're unfamiliar with the links. |
|
|
|
|
|
#60 |
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4
![]() |
One nice thing about functions is that they can be tested without testing the whole program.
For instance, thanks to my web browser, I happen to know that the Yahoo! Agricultural Chemicals page is at "http://biz.yahoo.com/ic/112.html", and that the company index points to "http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/*http://biz.yahoo.com/ic/112_cl_all.html". Armed with this information, I can write a quick test for the function:
print get_company_index("http://biz.yahoo.com/ic/112.html")
raw_input("Press enter to continue")
Another trick is to comment out code that you don't want executed. The main program loop isn't finished yet, so it'll throw up an error if you run it. To stop this error occurring, you can comment out the unfinished code like so:
#for industry_url in get_industry_urls(industry_page):
#    company_index = get_company_index(industry_url)
#
#    for company_url in get_company_index(company_index):
#        print get_company_data(company_url)
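The same idea can be turned into a check that fails loudly when the answer is wrong. This is a Python 3 sketch, not the thread's actual function: it reads from a saved fragment of the page instead of the network (SAVED_PAGES and fetch are hypothetical helpers), and it uses a simple regular expression as a stand-in for the Beautiful Soup lookup.

```python
import re

# Saved fragment of the Agricultural Chemicals page, keyed by URL,
# so the test runs without a network connection (hypothetical fixture).
SAVED_PAGES = {
    "http://biz.yahoo.com/ic/112.html": (
        '<td nowrap><font face=arial size=-1> '
        '<a href="http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/'
        '*http://biz.yahoo.com/ic/112_cl_all.html">Company Index</a> </font></td>'
    ),
}


def fetch(url):
    """Stand-in for urlopen(url).read(): return the saved HTML for a URL."""
    return SAVED_PAGES[url]


def get_company_index(industry_url, fetch_page=fetch):
    """Return the href of the 'Company Index' link, or None if it is absent."""
    html = fetch_page(industry_url)
    match = re.search(r'<a href="([^"]+)">Company Index</a>', html)
    return match.group(1) if match else None


# Known input and known answer, both taken from the post above.
expected = ("http://us.rd.yahoo.com/finance/industry/morecoindex/moremod/"
            "*http://biz.yahoo.com/ic/112_cl_all.html")
actual = get_company_index("http://biz.yahoo.com/ic/112.html")

assert actual == expected, actual
print("get_company_index: ok")
```

Passing the page fetcher in as a parameter is a design choice made for this sketch: it lets the function be tested against saved HTML and later pointed at the real site.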
|
|
|