#21 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Wahoo, so I downloaded Python and am playing around with it (still at work; it let me install for some reason... watch, in a minute the IT guys will be in here carrying off my PC and giving me the boot). I tried the above command and so far so good!
I'm trying to find my way around Beautiful Soup but having a bit of trouble. It's not an installable client, is it? Anyway, sorry for the double post, I'm just excited that we might actually be able to do this! |
|
|
|
|
|
#22 |
|
Hobbyist Programmer
|
If you want to read a tutorial, here is a good one:
http://www.programmingforums.org/for...ead.php?t=1289 |
|
|
|
|
|
#23 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
everyone's contributing now, even the sarcastic one, good stuff thanks guys
|
|
|
|
|
|
#24 | |
|
Hobbyist Programmer
|
Quote:
|
|
|
|
|
|
|
#25 | |
|
Resident Grouch
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10
You might want to refer to your Python documentation; I don't know exactly what you have.
Quote:
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code. Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers |
|
|
|
|
|
|
#26 |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
I'm back. Onto Beautiful Soup. The example below imports the BeautifulSoup class, reads in the HTML from a Yahoo Business page, and then parses it into a BeautifulSoup object. An object is terminology for a grouping of data and functionality. A class is the template, or blueprint, of an object.
>>> from urllib2 import urlopen
>>> from BeautifulSoup import BeautifulSoup
>>> html = urlopen("http://biz.yahoo.com/ic/ind_index.html").read()
>>> soup = BeautifulSoup(html)
>>> soup.title.string
'Industry Index By Sector: Industry Center - Yahoo! Finance'
A quick peek at the HTML source with a web browser reveals that all the industry links are stored in a table. However, there are several tables in the page, and "soup.table" will return only the first one. This is where the fetch function comes in:
>>> tables = soup.fetch("table")
>>> tables[3]
<table style="clear:both; text-transform:uppercase; margin-top:5px;" border="0" width="100%" cellpadding="4" cellspacing="0"><tr bgcolor="EEEEEE"><td><font face="arial" size="+1"> <b>Industry Center</b> </font></td><!-- SpaceID=0 robot --> </tr></table>
>>> links = tables[7].fetch("a")
>>> for link in links:
...     title = link.string.replace("\n", " ")
...     url = link['href']
...     print title, "-", url
...
The second thing is the indentation, the tab (or spaces) before each line that is "inside" the for loop. This tells Python that these lines "belong" to the for loop. More on this can be found in the standard Python docs.
I hope that gives you a rough idea of how to proceed. My advice is to tackle each type of page in turn (the industry list page, the company list pages and the company information pages), creating a function for each. To handle the whole thing, one need only combine the separate functions. This won't be trivial, but with a bit of practice it shouldn't take long to get to grips with. If you have any problems, just bring them up on the forums.
Last edited by Arevos; May 17th, 2006 at 2:40 PM. |
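For anyone following along on a newer Python, here is a rough sketch of the same idea (collect every table, pick one by index, list its links) using only the standard library's html.parser instead of Beautiful Soup. The sample page, the TableLinkParser class and the links_in_table helper are all made up for illustration; the real Yahoo page layout will differ, and nested tables are handled only crudely.

```python
from html.parser import HTMLParser

# Made-up sample page: two tables, only the second one holds links.
SAMPLE_HTML = """
<html><head><title>Sample Index</title></head><body>
<table><tr><td><b>Header</b></td></tr></table>
<table>
  <tr><td><a href="/ic/100.html">Basic
Materials</a></td></tr>
  <tr><td><a href="/ic/200.html">Energy</a></td></tr>
</table>
</body></html>
"""


class TableLinkParser(HTMLParser):
    """Hypothetical helper: group (title, url) pairs by the table they sit in."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of (title, url) pairs per <table> seen
        self._depth = 0    # how many <table> tags are currently open
        self._href = None  # href of the link being read, if any
        self._text = []    # text pieces collected inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
            self._depth += 1
        elif tag == "a" and self._depth > 0:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._depth = max(0, self._depth - 1)
        elif tag == "a" and self._href is not None:
            # Collapse newlines and runs of spaces into single spaces.
            title = " ".join("".join(self._text).split())
            # Simplification: links go to the most recently opened table.
            self.tables[-1].append((title, self._href))
            self._href = None


def links_in_table(html, index):
    """Return the (title, url) pairs found in the table at `index`."""
    parser = TableLinkParser()
    parser.feed(html)
    parser.close()
    return parser.tables[index]


for title, url in links_in_table(SAMPLE_HTML, 1):
    print(title, "-", url)
```

Picking the table by index mirrors tables[7] in the post above. It is brittle if the page layout changes, so it is worth checking the result by eye before relying on it.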
|
|
|
|
|
#27 | |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
|
|
|
|
|
|
|
#28 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Wow, okay, haha, this is great stuff, Arevos. I really appreciate everything, and I wouldn't stray too far from your computer if you're not too busy tonight, because I can assure you I will have problems. These commands are slightly more complex than the "print" command, haha. I'm going to reread your post a few more times to get a better idea of how to proceed. Thanks for your help thus far.
Also, which Python program do you suggest I use: the command line or IDLE? |
|
|
|
|
|
#29 |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
I should also add that once your code starts to become complex, it's best to move away from the interactive interpreter. The code below is essentially the same as in my previous post. Put it in a text file and rename the file so it has the ".py" extension. Then double-click the file and the Python code will be executed automatically.
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

# Download the industry index page and parse it
html = urlopen("http://biz.yahoo.com/ic/ind_index.html")
soup = BeautifulSoup(html)

# The industry links are in the eighth table on the page
links = soup.fetch("table")[7].fetch("a")
for link in links:
    title = link.string.replace("\n", " ")
    print title, "-", link['href']
print

# Keep the console window open until enter is pressed
raw_input("Please press enter to quit...") |
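Once the script grows, the earlier advice about one function per page type might look roughly like the sketch below. It is written for Python 3 with only the standard library, and everything in it is an assumption made for illustration: the function names, the FAKE_SITE pages, and the idea that a company page's title is the "information" wanted. The download step is passed in as `fetch` so the layout can be tried without touching the network.

```python
import re
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Hypothetical helper: collect (text, href) for every link in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def page_links(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# One function per page type. These are placeholders: real pages would
# need their own rules for picking the right table or fields.
def industries(html):
    return page_links(html)


def companies(html):
    return page_links(html)


def company_details(html):
    # Placeholder: use the page title as the company "information".
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return match.group(1).strip() if match else None


def scrape(fetch, index_url):
    """Combine the three steps; `fetch` maps a URL to its HTML text."""
    report = {}
    for industry, industry_url in industries(fetch(index_url)):
        report[industry] = [
            company_details(fetch(company_url))
            for _name, company_url in companies(fetch(industry_url))
        ]
    return report


# Made-up pages standing in for the real site.
FAKE_SITE = {
    "/index": '<a href="/energy">Energy</a>',
    "/energy": '<a href="/acme">Acme Oil</a><a href="/bolt">Bolt Gas</a>',
    "/acme": "<title>Acme Oil profile</title>",
    "/bolt": "<title>Bolt Gas profile</title>",
}

print(scrape(FAKE_SITE.__getitem__, "/index"))
```

For real pages, `fetch` could be a small wrapper around urllib.request.urlopen that reads and decodes the response. Relative links would also need resolving against the page URL, which this sketch leaves out.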
|
|
|
|
|
#30 | ||
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
Quote:
|
||
|
|
|
|