I'm back. On to Beautiful Soup. The example below imports the BeautifulSoup class, reads in the HTML from a Yahoo Business page, and then parses it into a BeautifulSoup object. An object is a grouping of data and the functionality that works on that data. A class is the template, or blueprint, from which objects are created.
>>> from BeautifulSoup import BeautifulSoup
>>> html = urlopen("http://biz.yahoo.com/ic/ind_index.html").read()
>>> soup = BeautifulSoup(html)
Now that we have a BeautifulSoup object, we can access parts of the HTML. For instance, if we wanted the title of the page:
>>> soup.title.string
'Industry Index By Sector: Industry Center - Yahoo! Finance'
But in this case we don't want to know the title. Ideally, we want a list of all industry links on the page. Because Yahoo! HTML is rather messy, this requires some trial and error.
A quick peek at the HTML source with a web browser reveals that all the industry links are stored in a table. However, there are several tables in the page, and "soup.table" will return only the first one. This is where the fetch function comes in:
>>> tables = soup.fetch("table")
The above code puts all of the tables in a list. We can access particular elements in the list using indices. The following code returns the 4th table in the list (indices count from 0):
>>> tables[3]
<table style="clear:both; text-transform:uppercase; margin-top:5px;" border="0" width="100%" cellpadding="4" cellspacing="0"><tr bgcolor="EEEEEE"><td><font face="arial" size="+1">
<b>Industry Center</b>
</font></td><!-- SpaceID=0 robot -->
</tr></table>
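To make the zero-based counting concrete, here is a minimal sketch that uses a plain Python list of made-up strings in place of the real tables list (the name fake_tables and its contents are invented for illustration). It also uses enumerate, which is one convenient way to preview every item next to its index when hunting for the right table. Note that it is written in Python 3 syntax, so print is a function rather than the statement used elsewhere in this post.

```python
# Stand-in for the list returned by soup.fetch("table"); contents are made up.
fake_tables = ["header table", "nav table", "ad table", "Industry Center table"]

# Indices count from 0, so index 3 is the 4th item.
fourth = fake_tables[3]
print(fourth)

# Pair each item with its index and keep only the first 10 characters,
# which makes it easier to spot the item you are looking for.
preview = [(index, table[:10]) for index, table in enumerate(fake_tables)]
for index, snippet in preview:
    print(index, snippet)
```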
Through trial and error, it turns out that tables[7] contains the list of industries. Now that we have the correct table, we can fetch all links from it (links are denoted with the <a> tag):
>>> links = tables[7].fetch("a")
In Python, you can apply the same piece of code to each item in a list using a for-loop. In the example below, I'll print out the text of the link, and the URL it points to:
>>> for link in links:
...     title = link.string.replace("\n", " ")
...     url = link['href']
...     print title, "-", url
...
There are two things to note about the above code. The first is the replace function used. "\n" is a special code that denotes a new line. This replace function replaces all new lines with spaces, in order to make things look a bit better (browsers treat newlines as whitespace as well).
The second thing is the indentation: the leading whitespace (a tab or spaces) before each line that is "inside" the for loop. This tells Python that these lines "belong" to the for loop. More on this can be found in the standard Python docs.
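Both points can be tried out without Beautiful Soup at all. The following is a small sketch, assuming a made-up list of (text, URL) pairs in place of the real links, and written in Python 3 syntax (print as a function). The indented lines run once per item; the unindented line at the end runs once, after the loop has finished.

```python
# Made-up (text, href) pairs standing in for the links fetched from the table.
links = [("Basic\nMaterials", "/ic/100.html"), ("Conglomerates", "/ic/200.html")]

lines = []
for text, url in links:
    # Indented: these lines belong to the for loop.
    title = text.replace("\n", " ")  # every newline becomes a space
    lines.append(title + " - " + url)

# Not indented: this runs once, after the loop is done.
print("\n".join(lines))
```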
I hope that gives you a rough idea of how to proceed. My advice is to tackle each type of page in turn (the industry list page, the company list pages and the company information pages), creating a function for each. To handle the whole thing, you then only need to combine the separate functions.
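As a rough sketch of that one-function-per-page structure, the code below wires two such functions together. Everything in it is an assumption for illustration: it uses the newer bs4 package (Beautiful Soup 4) with Python 3, where find_all does the job that fetch does above; the function names, URLs and page contents are made up; and the two parse functions simply grab every link, whereas real ones would first pick out the right table. The company information page is left out to keep it short.

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4 (assumed; this post uses version 3)


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text().replace("\n", " "), a["href"])
            for a in soup.find_all("a", href=True)]


def parse_industry_list(html):
    """Industry list page -> (industry name, company-list URL) pairs."""
    return extract_links(html)


def parse_company_list(html):
    """Company list page -> (company name, company-info URL) pairs."""
    return extract_links(html)


def collect_companies(fetch_page, start_url):
    """Combine the per-page functions into {industry: [company names]}."""
    companies = {}
    for industry, url in parse_industry_list(fetch_page(start_url)):
        company_page = fetch_page(url)
        companies[industry] = [name for name, _ in parse_company_list(company_page)]
    return companies


# Made-up pages keyed by made-up URLs, so the sketch runs without a network.
FAKE_PAGES = {
    "/industries": '<a href="/ind/1">Chemicals</a><a href="/ind/2">Steel</a>',
    "/ind/1": '<a href="/co/a">Acme Chemical</a>',
    "/ind/2": '<a href="/co/b">Basic Steel</a><a href="/co/c">Core Iron</a>',
}

# Looking a URL up in the dictionary stands in for downloading the page.
result = collect_companies(FAKE_PAGES.__getitem__, "/industries")
print(result)
```

Passing the page-fetching function in as an argument keeps the parsing code separate from the downloading code, so each parse function can be tried on a saved copy of a page before it is pointed at the live site.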
This won't be trivial, but with a bit of practice, it shouldn't take long to get to grips with. If you have any problems, just bring them up on the forums.