Quote:
|
Originally Posted by zem52887
Specifically, what does the following code mean:
fetchText(text, recursive, limit)
|
It means that fetchText takes up to three arguments. The "text" argument is the text (or regular expression) to search for. The "recursive" argument can be True or False; it is True by default, which means the search covers everything nested below the starting point (the entire parsed HTML document if you call it on the soup itself). If it is False, only the direct children are searched. The "limit" argument caps the number of matches returned, rather than the depth of the search. I believe the default means no cap at all.
However, all you have to worry about is the text argument. The other two will default to sensible values.
I'd advise using firstText, though. It's like fetchText, but it returns only the first result it finds. If you're searching for a unique piece of text, it's the better fit.
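Here is a rough sketch of how the three arguments behave. It assumes a current Beautiful Soup release (the bs4 package), where, as far as I know, find_all(string=...) and find(string=...) play the roles of fetchText and firstText; the HTML snippet is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: two nested strings and one string directly inside <div>.
html = "<div><p>spam one</p><p>spam two</p>spam three</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div
pattern = re.compile("spam")

# Default: recursive search, no cap on the number of matches.
all_hits = div.find_all(string=pattern)

# limit caps how many matches come back, not how deep the search goes.
limited = div.find_all(string=pattern, limit=2)

# recursive=False looks only at the direct children of <div>.
shallow = div.find_all(string=pattern, recursive=False)

# find() returns just the first match instead of a list.
first = div.find(string=pattern)

print(all_hits)
print(limited)
print(shallow)
print(first)
```

On the older release discussed in this thread the method names differ, but the idea is the same: pass the text, and leave the other two arguments alone unless you need them.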
Quote:
|
Originally Posted by zem52887
I've been examining the HTML on Yahoo!'s site and I think the best way to search for regex would be to avoid using the [td] and [tr] tags and instead, use [table] tags. So my goal is to search something like:
contact = soup.firstText(re.compile("Contact Information"))
contacttable = address.findParent("table")
contacttable.fetch("/table")[0]
Ultimately, I want to fetch the table tags that surround the words "Contact Information." However, when I use compile, I get "list out of range" error, so I think, but might be wrong, that I need to use the re.search function because I'm not dealing with a list?
|
re.compile tells Beautiful Soup that it's dealing with a regular expression rather than a plain piece of text. It turns a properly formatted string into a regular expression object, and that object is how Beautiful Soup knows to do a regular expression match.
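To make the difference concrete, here is a small sketch using only Python's standard re module. The sample string is invented; the point is just that a compiled pattern is a different kind of object from a plain string, and that it can match in the middle of a longer piece of text.

```python
import re

plain = "Contact Information"
pattern = re.compile("Contact Information")

# A plain string is just a str; the compiled pattern has matching methods.
is_plain_str = isinstance(plain, str)
pattern_is_str = isinstance(pattern, str)

# Invented sample text with extra characters around the phrase.
sample = "  Contact Information:  "

# A plain comparison fails because the text is not an exact match...
exact_match = (sample == plain)

# ...while the regular expression finds the phrase inside the text.
regex_match = pattern.search(sample) is not None

print(is_plain_str, pattern_is_str)
print(exact_match, regex_match)
```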
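I think your problem is that you're not giving BeautifulSoup enough credit. The last line is unneeded: findParent("table") already hands you the whole table. Try printing contacttable right after the first two lines, and you should see what I mean.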
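A minimal sketch of the idea, assuming a current Beautiful Soup release (bs4), where find(string=...) and find_parent(...) are, as far as I know, the modern spellings of firstText and findParent. The table below is invented and is not Yahoo!'s actual markup.

```python
import re
from bs4 import BeautifulSoup

# Invented stand-in for the page being scraped.
html = """
<table id="outer">
  <tr><td><b>Contact Information</b></td></tr>
  <tr><td>123 Example Street</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the text node, then walk up to the enclosing <table>.
contact = soup.find(string=re.compile("Contact Information"))
contact_table = contact.find_parent("table")

# The result is already the whole table element.
print(contact_table.name)
print(contact_table.get_text(" ", strip=True))

# There is no tag named "/table", so this search comes back empty,
# and indexing the empty result with [0] raises an IndexError.
closing = contact_table.find_all("/table")
print(len(closing))
```

That empty result is most likely where the "list out of range" error comes from, rather than from re.compile.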
The reason for this is that we are not dealing with raw HTML, but with what is known as a DOM tree. In programming, a tree is a data structure, or a way of storing data. A family tree is a good example of this sort of data structure.
I'll give you a further example. Take this slice of HTML:
<div>
<b>Hello</b> World
</div>
The corresponding DOM tree would look something like this:
     div
    /   \
   b   [World]
   |
[Hello]
Thus, there are no end-tags in a DOM tree. Instead, there are merely relationships. Going back to the family tree analogy, the [Hello] text is the grandchild of the "div" element, and the child of the "b" element.
When BeautifulSoup talks about parents and children, think of it in terms of a family tree, with the "html" tag being the ultimate ancestor (or the root).
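If you want to poke at such a tree yourself, here is a small sketch using xml.dom.minidom from Python's standard library instead of BeautifulSoup (my substitution, purely to show the parent and child links). The snippet is written on one line so that the line breaks don't show up as extra whitespace text nodes.

```python
from xml.dom import minidom

# Same snippet as above, collapsed onto one line.
doc = minidom.parseString("<div><b>Hello</b> World</div>")

div = doc.documentElement   # root of this small tree
b = div.firstChild          # the <b> element, first child of <div>
hello = b.firstChild        # the text node inside <b>
world = div.lastChild       # the text node that follows <b>

# <div> has two children: the <b> element and a text node.
child_names = [child.nodeName for child in div.childNodes]
print(child_names)

# Walk upward from [Hello]: its parent is <b>, its grandparent is <div>.
print(hello.data, "->", hello.parentNode.tagName, "->", hello.parentNode.parentNode.tagName)
print(repr(world.data))
```

Notice that nothing in the tree corresponds to </b> or </div>; the closing tags only told the parser where each element ends.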