Programming Forums
Old May 17th, 2006, 2:05 PM   #21
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
wahoo so I downloaded python and am playing around with it (still at work, it let me install for some reason... watch in a minute IT guys will be in here carrying off my PC and giving me the boot) and tried the above command and so far so good!

trying to find my way around beautiful soup but having a bit of trouble, it's not an installable client is it? anyways sorry for the double post I'm just excited that we might be able to actually do this!
Old May 17th, 2006, 2:05 PM   #22
demon101
Hobbyist Programmer
 
demon101's Avatar
 
Join Date: Mar 2006
Location: westboro, ohio
Posts: 160
Rep Power: 0 demon101 is an unknown quantity at this point
Send a message via Yahoo to demon101
If you want to read a tutorial, here is a good one:


http://www.programmingforums.org/for...ead.php?t=1289
__________________
Demon101 Production's

Code Forums
Old May 17th, 2006, 2:11 PM   #23
zem52887
everyone's contributing now, even the sarcastic one, good stuff thanks guys
Old May 17th, 2006, 2:14 PM   #24
demon101
Quote:
Originally Posted by zem52887
everyone's contributing now, even the sarcastic one, good stuff thanks guys
I have only been doing Python for about a week.
Old May 17th, 2006, 2:24 PM   #25
DaWei
Resident Grouch
 
DaWei's Avatar
 
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
You might want to refer to your Python documentation; I don't know exactly which version you have.
Quote:
Originally Posted by Python 2.4 Docs
13. Structured Markup Processing Tools
Python supports a variety of modules to work with various forms of structured data markup. This includes modules to work with the Standard Generalized Markup Language (SGML) and the Hypertext Markup Language (HTML), and several interfaces for working with the Extensible Markup Language (XML).
It is important to note that modules in the xml package require that there be at least one SAX-compliant XML parser available. Starting with Python 2.3, the Expat parser is included with Python, so the xml.parsers.expat module will always be available. You may still want to be aware of the PyXML add-on package; that package provides an extended set of XML libraries for Python.
The documentation for the xml.dom and xml.sax packages are the definition of the Python bindings for the DOM and SAX interfaces.
HTMLParser: A simple parser that can handle HTML and XHTML.
sgmllib: Only as much of an SGML parser as needed to parse HTML.
htmllib: A parser for HTML documents.
htmlentitydefs: Definitions of HTML general entities.
xml.parsers.expat: An interface to the Expat non-validating XML parser.
xml.dom: Document Object Model API for Python.
xml.dom.minidom: Lightweight Document Object Model (DOM) implementation.
xml.dom.pulldom: Support for building partial DOM trees from SAX events.
xml.sax: Package containing SAX2 base classes and convenience functions.
xml.sax.handler: Base classes for SAX event handlers.
xml.sax.saxutils: Convenience functions and classes for use with SAX.
xml.sax.xmlreader: Interface which SAX-compliant XML parsers must implement.
xmllib: A parser for XML documents.
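As a rough illustration of the first module in that list, here is a minimal sketch that uses the standard library parser to pull out a page title. It is written for Python 3, where the module is called html.parser (in Python 2.4 it is HTMLParser). The TitleParser class, the get_title function and the sample HTML string are all made up for this example.

```python
from html.parser import HTMLParser  # this module is named "HTMLParser" in Python 2.x


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; tag names arrive in lower case
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Called for text between tags; keep it only while inside <title>
        if self.in_title:
            self.title += data


def get_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Made-up sample page, so the sketch runs without a network connection
sample = "<html><head><title>Industry Index</title></head><body><p>Hi</p></body></html>"
print(get_title(sample))
```

The parser is event driven: you override the handler methods you care about and keep whatever state you need on the object.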
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers
Old May 17th, 2006, 2:25 PM   #26
Arevos
Programming Guru
 
Arevos's Avatar
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
I'm back. Onto Beautiful Soup. The example below imports the BeautifulSoup class, reads in the HTML from a Yahoo Business page, and then parses it into a BeautifulSoup object. An object is terminology for a grouping of data and functionality. A class is the template, or blueprint, of an object.
>>> from urllib2 import urlopen
>>> from BeautifulSoup import BeautifulSoup
>>> html = urlopen("http://biz.yahoo.com/ic/ind_index.html").read()
>>> soup = BeautifulSoup(html)
Now that we have a BeautifulSoup object, we can access parts of the HTML. For instance, if we wanted the title of the page:
>>> soup.title.string
'Industry Index By Sector: Industry Center - Yahoo! Finance'
But in this case we don't want to know the title. Ideally, we want a list of all industry links on the page. Because Yahoo! HTML is rather messy, this requires some trial and error.

A quick peek at the HTML source with a web browser reveals that all the industry links are stored in a table. However, there are several tables in the page, and "soup.table" will return only the first one. This is where the fetch function comes in:
>>> tables = soup.fetch("table")
The above code puts all of the tables in a list. We can access particular elements in the list using indices. The following code returns the 4th table in the list (indices count from 0):
>>> tables[3]
<table style="clear:both; text-transform:uppercase; margin-top:5px;" border="0" width="100%" cellpadding="4" cellspacing="0"><tr bgcolor="EEEEEE"><td><font face="arial" size="+1">
<b>Industry Center</b>
</font></td><!-- SpaceID=0 robot -->
</tr></table>
Through trial and error, it turns out that tables[7] contains the list of industries. Now that we have the correct table, we can fetch all the links from it (links are denoted by the <a> tag):
>>> links = tables[7].fetch("a")
In Python, you can apply the same piece of code to each item in a list using a for-loop. In the example below, I'll print out the text of the link, and the URL it points to:
>>> for link in links:
...     title = link.string.replace("\n", " ")
...     url   = link['href']
...     print title, "-", url
...
There are two things to note about the above code. The first is the replace function used. "\n" is a special code that denotes a new line. This replace function replaces all new lines with spaces, in order to make things look a bit better (browsers treat newlines as whitespace as well).

The second thing is the indentation, the tab before each line that is "inside" the for loop. This tells Python that these lines "belong" to the for loop. More on this can be found in the standard Python docs.
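For anyone who wants to try the "pick a table, then loop over its links" idea without installing anything, here is a rough sketch using only the standard library (Python 3 syntax). It is not BeautifulSoup's API: the TableLinkParser class, the sample HTML and the URLs in it are invented for illustration, and the real Yahoo page has many more tables.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Hypothetical helper: records the (text, href) links found in each <table>."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of links per table, in page order
        self.open_tables = []  # indices of the tables we are currently inside
        self.href = None       # href of the <a> tag currently open, if any
        self.text = []         # pieces of text seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
            self.open_tables.append(len(self.tables) - 1)
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()
        elif tag == "a" and self.href is not None:
            # Replace newlines with spaces, as in the for-loop above
            title = "".join(self.text).replace("\n", " ")
            # A link belongs to every table that encloses it
            for index in self.open_tables:
                self.tables[index].append((title, self.href))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


# Made-up sample: the first table is a heading, the second holds the links
sample = """
<table><tr><td><b>Industry Center</b></td></tr></table>
<table>
  <tr><td><a href="/ic/1.html">Basic
Materials</a></td></tr>
  <tr><td><a href="/ic/2.html">Technology</a></td></tr>
</table>
"""

parser = TableLinkParser()
parser.feed(sample)
links = parser.tables[1]  # the second table; indices count from 0
for title, url in links:
    # These two indented lines "belong" to the for loop
    print(title, "-", url)
```

With BeautifulSoup installed you would not need any of the class above; it is only here to show what the table index and the loop are doing.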

I hope that gives you a rough idea of how to proceed. My advice is to tackle each type of page in turn (the industry list page, the company list pages and the company information pages), creating a function for each. To handle the whole thing, you need only combine the separate functions.
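To make the "one function per page type" idea a bit more concrete, here is a skeleton of how the pieces might fit together (Python 3 syntax). Everything in it is an assumption for illustration: the function names, the load argument and the FAKE_SITE dictionary, which stands in for downloading and parsing real pages so that the sketch runs on its own.

```python
# Stand-in for the real site: maps a made-up URL to already-parsed page data
FAKE_SITE = {
    "index": {"industries": ["ind/steel", "ind/software"]},
    "ind/steel": {"companies": ["co/acme"]},
    "ind/software": {"companies": ["co/initech", "co/hooli"]},
    "co/acme": {"name": "Acme Steel"},
    "co/initech": {"name": "Initech"},
    "co/hooli": {"name": "Hooli"},
}


def get_industry_urls(load, index_url):
    """Industry list page -> list of industry page URLs."""
    return load(index_url)["industries"]


def get_company_urls(load, industry_url):
    """Company list page -> list of company page URLs."""
    return load(industry_url)["companies"]


def get_company_info(load, company_url):
    """Company information page -> dictionary of details."""
    return load(company_url)


def scrape_all(load, index_url):
    """Combine the three functions: walk industries, then companies."""
    results = []
    for industry_url in get_industry_urls(load, index_url):
        for company_url in get_company_urls(load, industry_url):
            results.append(get_company_info(load, company_url))
    return results


# In a real script, load would download the URL and parse the HTML
companies = scrape_all(FAKE_SITE.__getitem__, "index")
for company in companies:
    print(company["name"])
```

Passing load in as an argument means each function can be tried out on fake data first and pointed at the real pages later.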

This won't be trivial, but with a bit of practice, it shouldn't take long to get to grips with. If you have any problems, just bring them up on the forums.

Last edited by Arevos; May 17th, 2006 at 2:40 PM.
Old May 17th, 2006, 2:30 PM   #27
Arevos
Quote:
Originally Posted by zem52887
trying to find my way around beautiful soup but having a bit of trouble, it's not an installable client is it? anyways sorry for the double post I'm just excited that we might be able to actually do this!
Yeah, BeautifulSoup doesn't appear to have a Windows installer like some Python libraries do. But it's not too hard to install. Download the ".py" file from here, and put it in C:\Python24\Lib (assuming you installed Python to C:\Python24, of course!). Hopefully, that should do it.
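A quick way to check whether the file ended up somewhere Python can see it is to try importing it. The sketch below is just one way to do that; is_importable is a made-up helper name, and the code should run unchanged on both old and new Python versions.

```python
import sys


def is_importable(module_name):
    """Return True if Python can import a module with this name."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# Prints True once BeautifulSoup.py is in a folder on the search path
print(is_importable("BeautifulSoup"))

# These are the folders Python searches when you write "import ..."
for folder in sys.path:
    print(folder)
```

If the first line printed is False, compare the folder you copied the file into against the list of folders printed below it.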
Old May 17th, 2006, 2:40 PM   #28
zem52887
wow okay haha this is great stuff arevos I really appreciate everything and I wouldn't stray too far from your computer if you're not too busy tonight because I can assure you I will have problems. These commands are slightly more complex than the "print" command haha. I'm going to reread your post a few more times and try and get a better idea on how to proceed, thanks for your help thus far

also, which python program do you suggest I use? the command line? or the IDLE?
Old May 17th, 2006, 2:49 PM   #29
Arevos
I should also add that once your code starts to become complex, it's best to move away from the interactive interpreter. The code below is essentially the same as the code in my earlier post. Put it in a text file and rename the file so it has the ".py" extension. Then double-click the file and the Python code will be executed automatically.
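from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

# Download the page and read its HTML into a string
html = urlopen("http://biz.yahoo.com/ic/ind_index.html").read()
soup = BeautifulSoup(html)

# The industry links are in the 8th table on the page (index 7)
links = soup.fetch("table")[7].fetch("a")

for link in links:
    # Replace newlines in the link text with spaces before printing
    title = link.string.replace("\n", " ")
    print title, "-", link['href']

print
raw_input("Please press enter to quit...")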
The raw_input at the end makes sure that the CMD window does not close down immediately, but gives you time to read the output.
Old May 17th, 2006, 2:54 PM   #30
Arevos
Quote:
Originally Posted by zem52887
These commands are slightly more complex than the "print" command haha.
Programming is a skill that is difficult to get the hang of without any experience. Take it slow, experiment, and post if you find yourself in difficulty. I can't guarantee I'll be around all the time, but there are others who can help you, and referring back to the Python tutorial and reference documents will also help.

Quote:
Originally Posted by zem52887
also, which python program do you suggest I use? the command line? or the IDLE?
IDLE's not bad, because it also contains a text editor with nice formatting. You can open a .py file from the File -> Open menu. I believe there's a File -> New option as well.
All times are GMT -5. The time now is 8:57 AM.