Programming Forums
Old May 23rd, 2006, 9:53 AM   #121
Arevos
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Quote:
Originally Posted by zem52887
Specifically, what does the following code mean:
fetchText(text, recursive, limit)
It means that fetchText takes at most three arguments. The "text" argument is the text (or regular expression) to search for. The "recursive" argument can be True or False; it is True by default, which means it will search the entire parsed HTML document. If it is False, it will only search the surface, i.e. the direct children of the element you call it on. The "limit" argument caps how many matches the function returns. I believe "limit" defaults to no limit at all, so every match comes back.

However, all you have to worry about is the text argument. The other two will default to sensible values.

I'd advise using firstText, though. It's like fetchText, but only returns the first result it finds. If you're searching for a unique piece of text, it is the better fit.

Quote:
Originally Posted by zem52887
I've been examining the HTML on Yahoo!'s site and I think the best way to search for regex would be to avoid using the [td] and [tr] tags and instead, use [table] tags. So my goal is to search something like:
contact = soup.firstText(re.compile("Contact Information"))
contacttable = address.findParent("table")
contacttable.fetch("/table")[0]

Ultimately, I want to fetch the table tags that surround the words "Contact Information." However, when I use compile, I get "list out of range" error, so I think, but might be wrong, that I need to use the re.search function because I'm not dealing with a list?
re.compile is used to tell Beautiful Soup that it's dealing with a regular expression, rather than a plain piece of text. re.compile turns a properly formatted string into a regular expression object, and Beautiful Soup checks for that object type to decide how to match.
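
To make the distinction concrete, here is a small sketch using only the standard re module (written for Python 3, unlike the Python 2 snippets elsewhere in this thread). The matches helper is hypothetical: it only illustrates how a library could treat a plain string differently from a compiled pattern, and it is not Beautiful Soup's actual code.

```python
import re

def matches(criterion, text):
    """Hypothetical helper: exact comparison for a str, search for a compiled pattern."""
    if isinstance(criterion, re.Pattern):
        return criterion.search(text) is not None
    return criterion == text

pattern = re.compile("Contact Information")
cell = "  Contact Information for Yahoo! Inc.  "

plain_result = matches("Contact Information", cell)  # exact comparison fails
regex_result = matches(pattern, cell)                # pattern search succeeds
print(plain_result, regex_result)
```

Under that assumption, a plain string has to equal the whole piece of text, while a compiled pattern only has to be found somewhere inside it.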

I think your problem is that you're not giving BeautifulSoup enough credit. The last line is unneeded. Try:
print contacttable
And you should see what I mean.

The reason for this is that we are not dealing with raw HTML, but with what is known as a DOM tree. In programming, a tree is a data structure, or a way of storing data. A family tree is a good example of this sort of data structure.

I'll give you a further example. Take this slice of HTML:
<div>
    <b>Hello</b> World
</div>
The corresponding DOM tree would look something like this:
    div
    /  \
   b   [World]
   |
[Hello]
Thus, there are no end-tags in a DOM tree. Instead, there are merely relationships. Going back to the family tree analogy, the [Hello] text is the grandchild of the "div" element, and the child of the "b" element.

When BeautifulSoup talks about parents and children, think of it in terms of a family tree, with the "html" tag being the ultimate ancestor (or the root).
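
As a rough illustration of the same idea, the sketch below builds a tiny tree from that HTML slice using Python 3's standard html.parser module. The Node and TreeBuilder classes and the find_parent method are made up for this example; they mimic the parent/child relationships described above but are not how BeautifulSoup is implemented.

```python
from html.parser import HTMLParser

class Node:
    """One element or text node in a tiny tree (hypothetical, for illustration)."""
    def __init__(self, name, parent=None, text=None):
        self.name = name
        self.parent = parent
        self.text = text
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def find_parent(self, name):
        """Walk upwards until an ancestor with the given tag name is found."""
        node = self.parent
        while node is not None and node.name != name:
            node = node.parent
        return node

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        self.current = Node(tag, parent=self.current)

    def handle_endtag(self, tag):
        # An end tag only moves us back up; it is not stored in the tree.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            Node("[text]", parent=self.current, text=data.strip())

builder = TreeBuilder()
builder.feed("<div>\n    <b>Hello</b> World\n</div>")
builder.close()

div = builder.root.children[0]
hello = div.children[0].children[0]
print(hello.text, "->", hello.parent.name, "->", hello.find_parent("div").name)
```

Note that the end tags never appear as nodes. They only tell the builder to step back up to the parent, which is why asking for the surrounding table element is enough to get the whole table.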
Old May 23rd, 2006, 10:02 AM   #122
zem52887
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Indeed, I thought that I would have to also get the end tag but apparently not. BeautifulSoup is incredible. Wow, so it seems the only difficult one will be the Financial Highlights because they're not on every company. So I've been reading up on if statements so I think one will be necessary for that piece of data, no?

Also one more quickie regarding regex, since "Company Profile" appears more than once, but is constant, I need to format the code such that it fetches me the table surrounding the 3rd "Company Profile" expression, would I format it as follows:
profile = soup.firstText(re.compile("Company Profile"))
companyprofile = profile.findParent("table")[2]
print companyprofile

on second look, I'm trying to get the 3rd "Company Profile" but the second table that follows it, so is this where sibling commands come in?

hm, is this where I'm encountering a problem:
<table
cellpadding=0
cellspacing=0
width=100%>

do I need to give that to BeautifulSoup?
Old May 23rd, 2006, 10:09 AM   #123
Arevos
Close, but not quite. "findParent" finds the first parent that matches a certain tag. It does not return a list, so indices (like [2]) won't work on it.

Instead, you need to find the 3rd matching piece of text. This means you have to use fetchText (which gets a list of all matches) instead of firstText (which gets the first match), and apply an index to that.

Or, to put it another way:
soup.firstText("Company")
Is the same as:
soup.fetchText("Company")[0]
Hopefully, you should be able to guess now what code you need.
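
The relationship between the two functions can be sketched without BeautifulSoup at all. The fetch_text and first_text helpers below are hypothetical stand-ins that work on a plain list of strings (Python 3, standard library only); the sample strings are invented.

```python
import re

def fetch_text(strings, pattern):
    """Return every string that matches, as a list (possibly empty)."""
    return [s for s in strings if pattern.search(s)]

def first_text(strings, pattern):
    """Return only the first match, or None when nothing matches."""
    for s in strings:
        if pattern.search(s):
            return s
    return None

page_strings = [
    "Company Profile",
    "Home",
    "Company Profile (cont.)",
    "Company Profile: Summary",
]
pattern = re.compile("Company Profile")

all_matches = fetch_text(page_strings, pattern)
same_first = first_text(page_strings, pattern) == all_matches[0]
missing = first_text(page_strings, re.compile("Financial Highlights"))
print(len(all_matches), same_first, missing)
```

One difference shows up when nothing matches: in this sketch the first-match helper returns None, whereas putting [0] on an empty list raises an IndexError. I believe the real firstText and fetchText differ in the same way, but check that against your version.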
Old May 23rd, 2006, 10:20 AM   #124
zem52887
soup = BeautifulSoup(html)
profile = soup.fetchText(re.compile("Company Profile"))[2]
companyprofile = profile.findNext("table")
jackpot... I think I'm getting the hang of this.

on to Financial Highlights, this one seems like it's going to be a bit more complex than the other types of data grabbing. Since some companies have financial data posted and others don't, it seems an if-statement is necessary (to me at least) however, the examples I've seen are all like the following:
>>> x = int(raw_input("Please enter an integer: "))
>>> if x < 0:
...      x = 0
...      print 'Negative changed to zero'
... elif x == 0:
...      print 'Zero'
... elif x == 1:
...      print 'Single'
... else:
...      print 'More'

since we're dealing with text and not a numerical value, I obviously can't just put:
if Financial Highlights "exist" 
then...
hah, any helpful hints? or am I underestimating BeautifulSoup and if there's no financial highlights will it just skip over it? I assume I have to address it in some way/shape/form.

Last edited by zem52887; May 23rd, 2006 at 10:38 AM.
Old May 23rd, 2006, 10:45 AM   #125
Arevos
Well, presumably, if BeautifulSoup's fetchText function returns an empty list (an empty list is []), then it has found no text that says "Financial Highlights". So you could do something like:

highlights = soup.fetchText(re.compile("Financial Highlights"))

if highlights == []:
    # There are no financial highlights
    pass  # placeholder so the block is valid Python; put your handling here
(Note the double equals (==). This tells Python you want to compare two values, rather than assign one value to another.)
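
One thing to watch with this check: it has to happen before you index into the list, because [0] on an empty list raises an IndexError. A minimal Python 3 sketch of that ordering, using a hypothetical first_or_none helper and plain lists in place of real fetchText results:

```python
def first_or_none(matches):
    """Return the first match, or None if the list is empty."""
    if not matches:  # for a list, same outcome as `matches == []`
        return None
    return matches[0]

found = first_or_none(["Financial Highlights"])
absent = first_or_none([])
print(found, absent)
```

Writing `if not matches:` is the more common spelling of the empty-list test; it behaves the same as `== []` for lists.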
Old May 23rd, 2006, 10:47 AM   #126
zem52887
ah indeed. Was thinking about it too concretely.
Old May 23rd, 2006, 10:52 AM   #127
zem52887
How does this look:
#Financial Highlights
highlights = soup.fetchText(re.compile("Highlights"))[0]
if highlights = [0]
    financialhighlights = highlights.findParent("table")
elif highlights == []
I need to tell Python what to do if highlights doesn't exist, since I'm not really familiar with how this is going to be exported to excel, if I want a blank cell in excel when there are no financial highlights what should I direct python to do? Can I tell it to print "No Data Available" or something and will that translate to excel?
Old May 23rd, 2006, 10:59 AM   #128
Arevos
Use the len function to find out how long a list is. e.g.
len(highlights) == 1
(You can also do len(x) == 0, which is equivalent to x == [])

Or, you could do it thus:
if highlights == []:
    # highlights doesn't exist
    pass  # placeholder; replace with your own code
else:
    # highlights does exist
    pass  # placeholder; replace with your own code
Look up the difference between "else" and "elif".
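
As a small illustration of both points, here is a sketch in Python 3. The describe function and its return strings are made up for this example. An elif branch carries its own condition, while else catches everything the earlier branches did not.

```python
def describe(highlights):
    """Classify a list of matches by its length."""
    if len(highlights) == 0:
        return "N/A"
    elif len(highlights) == 1:
        return "one match"
    else:
        return "several matches"

results = [describe([]), describe(["a"]), describe(["a", "b"])]
print(results)
```

If you only care about "present" versus "absent", the two-branch if/else form is all you need.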
Old May 23rd, 2006, 11:02 AM   #129
zem52887
I'm not familiar with the len function. I'll read about it, but quickly before I do: is it necessary? The word "highlights" only appears on a company page once, so my logic is that if it's there then we want it to fetch the table around it, and if it is not there, then we want it to display an "N/A" in the excel cell. We're not dealing with multiple highlights, so do we need to implement the above?
Old May 23rd, 2006, 11:02 AM   #130
Arevos
Quote:
Originally Posted by zem52887
I need to tell Python what to do if highlights doesn't exist, since I'm not really familiar with how this is going to be exported to excel, if I want a blank cell in excel when there are no financial highlights what should I direct python to do? Can I tell it to print "No Data Available" or something and will that translate to excel?
I'll give you a quick example of how to use Python to create a CSV when I get home in about 40 to 50 minutes. You might want to look up how to read/write to files in the tutorial in the meantime.
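
In the meantime, here is a rough idea of the shape such a script could take. It is only a sketch in Python 3 using the standard csv module, not the finished example: the write_rows function, the column names, and the sample companies are all invented for illustration. Excel can open CSV files directly, so writing the text "N/A" into a cell is enough to mark missing data.

```python
import csv
import io

def write_rows(out, companies):
    """Write one CSV row per company, substituting 'N/A' for missing data."""
    writer = csv.writer(out)
    writer.writerow(["Company", "Financial Highlights"])
    for name, highlights in companies:
        writer.writerow([name, highlights if highlights else "N/A"])

# Made-up sample data; None stands for a page with no highlights table.
companies = [("Example Corp", "Revenue: 10M"), ("Sample Ltd", None)]

# In-memory file for the demo; for a real file use open("out.csv", "w", newline="").
buffer = io.StringIO()
write_rows(buffer, companies)
csv_text = buffer.getvalue()
print(csv_text)
```

Leaving the cell as an empty string instead of "N/A" would give a genuinely blank cell in Excel, if that is preferred.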
DaniWeb IT Discussion Community
All times are GMT -5.