Programming Forums
Old May 22nd, 2006, 10:06 AM   #101
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Quote:
Originally Posted by zem52887
Or do you mean I should merely select a link to test it myself, rather than using the general Yahoo! index link, which then goes through the functions?
Yep. Generally speaking, programmers create a function and feed it test data (e.g. an appropriate URL) in order to test that it works.
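As a minimal sketch of that idea, the snippet below feeds a link-extracting function a small, fixed piece of HTML instead of a live page, so the expected answer is known in advance. The function name `extract_links`, the `LinkCollector` helper and the sample markup are made up for illustration, and it uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Hypothetical function under test: return all link targets in html."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Known input with a known answer, so a wrong result is easy to spot.
SAMPLE = '<p><a href="/ic/135/135359.html">Acme</a> <a href="/q?s=ACME">Quote</a></p>'

links = extract_links(SAMPLE)
assert links == ["/ic/135/135359.html", "/q?s=ACME"]
print(links)
```

Once the function behaves on the fixed sample, it can be pointed at a single real URL before running it over the whole index.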

Quote:
Originally Posted by zem52887
Yeah, I forgot about the for-loops and what they're "for"... and you were right. The script is working now; however, it is retrieving not only the company_urls but, in some cases, the link for a company quote page, and in some cases non-existent links (it seems).
It could be that you need a different table. Remember that in HTML, tables can exist inside other tables. Thus, it's possible that within table[1] exists table[6], which contains more specific data. Try some of the other tables on the page and see if there aren't any that can narrow down your search.
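To illustrate why the choice of table matters, here is a small sketch with made-up markup in which one layout table wraps two inner tables. It uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup (an assumption on my part; ElementTree only accepts well-formed markup, whereas real pages usually need a forgiving parser). The table ids and cell contents are invented.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <table id="layout">
    <tr><td>
      <table id="nav"><tr><td>Home</td></tr></table>
    </td></tr>
    <tr><td>
      <table id="companies"><tr><td>Acme</td><td>Globex</td></tr></table>
    </td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(PAGE)

# iter() walks the whole tree in document order, so nested tables
# are listed right after the table that contains them.
tables = list(root.iter("table"))
for index, table in enumerate(tables):
    # Skip cells whose own text is only whitespace (the wrapper cells).
    cells = [td.text for td in table.iter("td") if td.text and td.text.strip()]
    print(index, table.get("id"), cells)
```

The outer table reports every cell on the page because the inner tables live inside it, which is one way extra links can sneak into a result. Picking the innermost table that still holds the data narrows the search.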
Old May 22nd, 2006, 10:06 AM   #102
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
ah gotcha will do.
Old May 22nd, 2006, 10:23 AM   #103
zem52887
Arevos, you truly are a miracle worker. I was grabbing the wrong table, and now get_company_urls is working flawlessly, just as you had suggested.

On to get_company_data. Again, I'm not sure how to pull the data, because there's more than I need in a given table. Is it possible to fetch certain parts of the text, as we did with links by fetching "a" followed by "href" (in the case of list comprehensions... naturally)?

(Posts 92 & 93 explain this problem more thoroughly, in case anyone missed those... I got a little ahead of myself and thus things got out of order.)
Old May 22nd, 2006, 10:28 AM   #104
Arevos
Yep; in theory you could do something like:
soup.fetch("table")[5].fetch("tr")[2].fetch("td")[1].string
That would get you the contents of the cell on the 3rd row down and 2nd column across, in the sixth table on the page (indices start at 0, so [5] is the sixth table). The fetch method works with any tag.

(The .string at the end gets the contents of the tag, just as ['href'] gets the href attribute of the tag)
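Here is a runnable sketch of the same chained lookup on a tiny made-up page. It swaps in the standard library's `xml.etree.ElementTree` for BeautifulSoup (my substitution, and it only works on well-formed markup): `.text` plays the role of `.string` and `.get("href")` the role of `['href']`. The company names and paths are invented.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <table><tr><td>menu</td></tr></table>
  <table>
    <tr><td>Name</td><td>Link</td></tr>
    <tr><td>Acme</td><td><a href="/acme.html">profile</a></td></tr>
    <tr><td>Globex</td><td><a href="/globex.html">profile</a></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(PAGE)
tables = list(root.iter("table"))

# Indices start at 0: [1] is the second table, [2] its third row,
# [0] the first cell in that row.
cell = tables[1].findall("tr")[2].findall("td")[0]
print(cell.text)          # text content of the tag

# Attributes are read from the tag itself rather than from its contents.
link = tables[1].findall("tr")[2].findall("td")[1].find("a")
print(link.get("href"))
```

One difference worth keeping in mind: ElementTree's `findall("tr")` only looks at direct children, while the thread's fetch searches everything nested inside the tag.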
Old May 22nd, 2006, 10:30 AM   #105
zem52887
excellent, I'll try that thank you once again
Old May 22nd, 2006, 10:40 AM   #106
zem52887
Okay, this is going to take a little bit of time and effort, boo. Is there a way to use BeautifulSoup to search for a given tag within a particular table, as opposed to the whole site? For our purposes, the company data (along with unnecessary data) is contained within the 3rd table. I want to isolate this table so I can search for [td] tags and pick out the bits of data that are useful to me. Is there a way to do this? I apologize for asking so many questions, but I haven't been able to find a thorough tutorial that explains how to do the stuff we're doing here (i.e. incorporating HTML and BeautifulSoup into Python coding).

edit:
I think I may have gotten it, would the following work to isolate table 3?
html = urlopen("http://biz.yahoo.com/ic/135/135359.html").read("table")[3]

hm, not that but something along those lines...

this perhaps:
>>> html = urlopen("http://biz.yahoo.com/ic/135/135359.html").read()
>>> soup = BeautifulSoup(html)
>>> soup.fetch("table")[3]

my only question is that when I begin to search for td tags, will it refer to the link or just to that table?

I could search using this:
soup.fetch("table")[3].fetch("td")[5]
but if there's a way to just search table three without having to fetch it before searching for the td tag that'd be a little easier, no?

Last edited by zem52887; May 22nd, 2006 at 10:57 AM.
Old May 22nd, 2006, 10:57 AM   #107
Arevos
I'm not quite sure what you're asking. This code results in all the tables on the page:
soup.fetch("table")
And this code results in all the links in table 2:
soup.fetch("table")[2].fetch("a")
Similarly, this code gets all the "td" tags in table 2:
soup.fetch("table")[2].fetch("td")
It might help to give you a quick run-through of object syntax. Take the following code:
chair.colour
This code refers to the "colour" property of the "chair" object.
chair.chopup()
This code calls the "chopup" method that belongs to the chair object. Thus, you can think of "." as meaning "belongs to". The "colour" property belongs to the chair object, as does the "chopup" method. In programming lingo, they are members of the chair object.
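For anyone who wants to try the chair example, here is one way it could be written out as a small Python class. The `pieces` attribute and the `parts` parameter are my own additions to give `chopup` something to do.

```python
class Chair:
    """A toy object with one property and one method."""

    def __init__(self, colour):
        self.colour = colour   # property: data that belongs to the chair
        self.pieces = 1        # hypothetical extra state for the example

    def chopup(self, parts=4):
        """Method: an action that belongs to the chair."""
        self.pieces = parts
        return self.pieces


chair = Chair("red")
print(chair.colour)    # reads the property
print(chair.chopup())  # calls the method
```

Both lines after the class use the same dot: one reads something the chair has, the other asks the chair to do something.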

Let's go back to BeautifulSoup:
links = soup.fetch("table")[3].fetch("a")
It may be more obvious what this does if I expand it out a little:
all_tables = soup.fetch("table")
table3 = all_tables[3]
links_in_table3 = table3.fetch("a")
Old May 22nd, 2006, 11:00 AM   #108
zem52887
Right, so to clarify: if I fetch table 3 and then search for a [td] tag, does it only search within table 3 because that's what I called it on, or does it search for the [td] tag in the entire site that urlopen opened?
Old May 22nd, 2006, 11:04 AM   #109
Arevos
It only searches within table 3. If you want to search the whole site, use:
soup.fetch("td")
But if you just want to search a subsection (say, table 3), use:
soup.fetch("table")[3].fetch("td")
Indeed, it makes more sense if you think of it from a "what belongs to what" perspective. The first "fetch" belongs to the "soup" object. The second "fetch" belongs to the table 3 object. Thus, the first fetch searches the whole of the site, whilst the second just searches table 3.
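A quick way to convince yourself of this is to compare the two searches on a small made-up page. The sketch below uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup (my substitution; it needs well-formed markup), with `iter("td")` standing in for `fetch("td")`. The cell contents are invented.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <table><tr><td>a</td><td>b</td></tr></table>
  <table><tr><td>c</td></tr></table>
  <table><tr><td>d</td></tr></table>
  <table><tr><td>e</td><td>f</td></tr></table>
</body></html>
"""

root = ET.fromstring(PAGE)

# This search belongs to the whole document.
all_cells = [td.text for td in root.iter("td")]

# This search belongs to the table at index 3 only.
table3 = list(root.iter("table"))[3]
table3_cells = [td.text for td in table3.iter("td")]

print(all_cells)
print(table3_cells)
```

The first list holds every cell on the page, while the second holds only the cells inside the one table the search was called on.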
Old May 22nd, 2006, 11:19 AM   #110
zem52887
*heart pounding* Please, someone mollify my fears. I don't think the tables are constant from company to company. By this I mean that, since some companies have more links or additional sectors, the Description/Financial Information/Contact Info/Key People aren't in exactly the same place for every company. Please assure me that there is another way to get this information from the website other than searching by [td] tags, because I'm so close... it would be such a shame if this is the only way to locate the pertinent information and this script goes to waste after Arevos's (and my) hard work.

Hm, I'm looking at the HTML again and there seem to be many tables within tables... I'm really struggling to decipher this data page.

But maybe there is some hope after all...

Perhaps parsing using keywords (is this possible)? For instance, instead of searching for a [td] or [table] etc., could we just search for the phrase "Financial Highlights"?
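Something like this is what I have in mind, as a rough sketch on invented markup. It assumes the label and its value sit in neighbouring cells of the same row, which may well not match the real page. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so it runs on its own, and the helper name `cell_after_label` is made up.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <table>
    <tr><td>Company Description</td><td>Makes widgets.</td></tr>
    <tr><td>Financial Highlights</td><td>Revenue: 10M</td></tr>
  </table>
</body></html>
"""


def cell_after_label(root, label):
    """Return the text of the cell that follows the cell containing label."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Stop one short of the end so there is always a next cell.
        for position, cell in enumerate(cells[:-1]):
            if label in (cell.text or ""):
                return cells[position + 1].text
    return None  # label not found anywhere on the page


root = ET.fromstring(PAGE)
print(cell_after_label(root, "Financial Highlights"))
```

Searching by label like this would not care which table number the data ends up in, only that the phrase appears somewhere on the page.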
DaniWeb IT Discussion Community
All times are GMT -5.
