#121
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Quote:
However, all you have to worry about is the text argument. The other two will default to sensible values. I'd advise using firstText, though. It's like fetchText, but only returns the first result it finds. If you're searching for a unique piece of text, it is the better fit.
Quote:
I think your problem is that you're not giving BeautifulSoup enough credit. The last line is unneeded. Try:
print contacttable
The reason for this is that we are not dealing with raw HTML, but with what is known as a DOM tree. In programming, a tree is a data structure, or a way of storing data. A family tree is a good example of this sort of data structure. I'll give you a further example. Take this slice of HTML:
<div>
<b>Hello</b> World
</div>
As a tree, it looks like this:
     div
    /   \
   b   [World]
   |
[Hello]
When BeautifulSoup talks about parents and children, think of it in terms of a family tree, with the "html" tag being the ultimate ancestor (or the root).
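To make the family-tree idea concrete, here is a minimal sketch that builds the same parent/child structure for the HTML slice above. It is not how BeautifulSoup does it internally; it assumes Python 3 and uses only the standard library's html.parser, and the names Node, TreeBuilder, build_tree and show are made up for this illustration.

```python
from html.parser import HTMLParser


class Node:
    """One node of the tree: either a tag or a piece of text (hypothetical class)."""

    def __init__(self, name, parent=None):
        self.name = name        # tag name, or "[text]" for text nodes
        self.parent = parent    # None for the root
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple parent/child tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node     # descend into the new tag

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent   # climb back up

    def handle_data(self, data):
        text = data.strip()
        if text:                # skip whitespace-only text
            self.current.children.append(Node("[%s]" % text, parent=self.current))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def show(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(show(child, depth + 1))
    return lines


tree = build_tree("<div>\n<b>Hello</b> World\n</div>")
div = tree.children[0]
print("\n".join(show(div)))
```

Each node keeps a link to its parent and a list of its children, which is the kind of relationship that calls like findParent walk along.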
#122
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Indeed, I thought that I would also have to get the end tag, but apparently not. BeautifulSoup is incredible. Wow, so it seems the only difficult one will be the Financial Highlights, because they're not on every company's page. I've been reading up on if statements, and I think one will be necessary for that piece of data, no?
Also, one more quickie regarding regex. Since "Company Profile" appears more than once, but is constant, I need to format the code so that it fetches me the table surrounding the 3rd "Company Profile" expression. Would I format it as follows:
profile = soup.firstText(re.compile("Company Profile"))
companyprofile = profile.findParent("table")[2]
print companyprofile
On second look, I'm trying to get the 3rd "Company Profile", but the second table that follows it, so is this where sibling commands come in? Hm, is this where I'm encountering a problem:
<table cellpadding=0 cellspacing=0 width=100%>
Do I need to give that to BeautifulSoup?
#123
Programming Guru
Close, but not quite. "findParent" finds the first parent that matches a certain tag. It does not return a list, so indices (like [2]) won't work on it.
Instead, you need to find the 3rd matching piece of text. This means you have to use fetchText (which gets a list of all matches) instead of firstText (which gets the first match), and apply an index to that. Or, to put it another way, these two give the same result:
soup.firstText("Company")
soup.fetchText("Company")[0]
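The relationship between the two calls can be sketched without BeautifulSoup at all. The example below assumes Python 3 and uses a plain list of strings in place of a parsed page; fetch_text and first_text are hypothetical stand-ins, not the library's functions, and returning None when nothing matches is an assumption of this sketch.

```python
import re


def fetch_text(texts, pattern):
    """Hypothetical stand-in for fetchText: every string that matches."""
    return [t for t in texts if re.search(pattern, t)]


def first_text(texts, pattern):
    """Hypothetical stand-in for firstText: the first match, or None."""
    matches = fetch_text(texts, pattern)
    return matches[0] if matches else None


# Made-up page text, in document order.
page_texts = [
    "Company Profile (menu)",
    "Contact",
    "Company Profile (heading)",
    "Company Profile (table title)",
]

all_matches = fetch_text(page_texts, "Company Profile")

# The first match is the same either way.
print(first_text(page_texts, "Company Profile") == all_matches[0])

# The third match needs the list form and index 2 (indices start at 0).
print(all_matches[2])
```

The index goes on the list of matches, which is why the third occurrence is reached with [2] on the fetchText-style result.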
#124
Hobbyist Programmer
soup = BeautifulSoup(html)
profile = soup.fetchText(re.compile("Company Profile"))[2]
companyprofile = profile.findNext("table")
On to Financial Highlights. This one seems like it's going to be a bit more complex than the other types of data grabbing. Since some companies have financial data posted and others don't, it seems an if statement is necessary (to me at least). However, the examples I've seen are all like the following:
>>> x = int(raw_input("Please enter an integer: "))
>>> if x < 0:
...     x = 0
...     print 'Negative changed to zero'
... elif x == 0:
...     print 'Zero'
... elif x == 1:
...     print 'Single'
... else:
...     print 'More'
Since we're dealing with text and not a numerical value, I obviously can't just put: if Financial Highlights "exist" then...
Last edited by zem52887; May 23rd, 2006 at 10:38 AM.
#125
Programming Guru
Well, presumably, if BeautifulSoup's fetchText function returns an empty list (an empty list is []), then it has found no text that says "Financial Highlights". So you could do something like:
highlights = soup.fetchText(re.compile("Financial Highlights"))
if highlights == []:
    # There are no financial highlights
    pass
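Here is a small self-contained sketch of the same test, assuming Python 3 and a plain list of strings standing in for the page text. The helper names find_highlights and describe, and the sample data, are invented for the illustration.

```python
import re


def find_highlights(texts):
    """Hypothetical helper: every string that mentions Financial Highlights."""
    pattern = re.compile("Financial Highlights")
    return [t for t in texts if pattern.search(t)]


def describe(texts):
    highlights = find_highlights(texts)
    if highlights == []:
        # Nothing matched, so this company has no financial highlights.
        return "no financial highlights"
    return "found: " + highlights[0]


# Made-up text from two company pages.
with_data = ["Company Profile", "Financial Highlights 2005"]
without_data = ["Company Profile", "Contact"]

print(describe(with_data))
print(describe(without_data))
```

The if statement does not need a number to test; comparing the list of matches against [] gives it a true or false answer to branch on.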
#126
Hobbyist Programmer
Ah, indeed. I was thinking about it too concretely.
#127
Hobbyist Programmer
How does this look:
#Financial Highlights
highlights = soup.fetchText(re.compile("Highlights"))[0]
if highlights = [0]
    financialhighlights = highlights.findParent("table")
elif highlights == []
#128
Programming Guru
Use the len function to find out how long a list is, e.g.
len(highlights) == 1
Or, you could do it thus:
if highlights == []:
    # highlights doesn't exist
    pass
else:
    # highlights does exist
    pass
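As a rough sketch (Python 3, plain lists, made-up sample data), the checks above all answer the same question. It also shows why the check should come before any [0]: indexing an empty list raises IndexError, so the list of matches has to be tested first. The function name emptiness_checks is hypothetical.

```python
def emptiness_checks(matches):
    """Return the same answer computed three ways: is the list empty?"""
    return (len(matches) == 0, matches == [], not matches)


found = ["Financial Highlights"]
missing = []

print(emptiness_checks(found))
print(emptiness_checks(missing))

# Taking [0] before checking fails when there are no matches.
try:
    first = missing[0]
except IndexError:
    first = None
print(first)
```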
#129
Hobbyist Programmer
I'm not familiar with the len function. I'll read about it, but quickly before I do: is it necessary? The word "highlights" only appears on a company page once, so my logic is that if it's there, then we want it to fetch the table around it, and if it is not there, then we want it to display an "N/A" in the Excel cell. We're not dealing with multiple highlights, so do we need to implement the above?
#130
Programming Guru
Quote: