Programming Forums
Old May 22nd, 2006, 12:27 PM   #111
DaWei
Resident Grouch
 
 
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
Your time has not been wasted, it just may not pay off to the extent you were (unrealistically) expecting. It is unlikely that diverse people are going to build their site to your expectations. They aren't in the business of doing your work for you. Just for shits and giggles, let me quote one of my previous posts:
Quote:
To be effective, you would need to know how to distinguish (logically, via content, or possibly position) what links constitute your trail. Once at the destination you would need to know what items of information on THAT page were relevant. I strongly suspect that, at this point, you're going to have to include human intervention, with its marvelous visual identification capabilities.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers
Old May 22nd, 2006, 12:32 PM   #112
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Quote:
Originally Posted by DaWei
Your time has not been wasted, it just may not pay off to the extent you were (unrealistically) expecting. It is unlikely that diverse people are going to build their site to your expectations. They aren't in the business of doing your work for you. Just for shits and giggles, let me quote one of my previous posts:
whoa whoa whoa DaWei, not to be rude as I am a newcomer to these forums, but your above post is very offensive. In regards to,
Quote:
They aren't in the business of doing your work for you
that just doesn't make any sense in addition to being offensive. It should be pretty obvious that I never asked anyone to "do my work for me," and it's a ridiculous notion to assume that I expected the web designers to tailor-make their website for me. The point of this is to try and utilize what I've been given to make a retarded task (that being the copying-pasting) manageable.

So please refrain from "I told you so" replies on this thread. They are neither productive nor necessary. Thank you. Instead, try offering a suggestion or two, maybe this last bit of data parsing could be do-able with another language... anything but "I told you so" and a lecture on a thread where people have dedicated a lot of time and effort, not only me but more seasoned/respected (*cough* Arevos *cough*) members from this forum.

On a side note, when I read your replies to various threads, they always seem to be "read the manual," lrn2post, lecturing the ops. While I can agree there is a time and a place for that, it should be clear that a thread with 100 replies and 1000+ views is not it. IMHO pick your spots: a spammer/troll who's never going to post here again, that's one thing. This is an entirely different case, and as such, why not try coming up with a solution instead of reiterating the likely scenario that I'm effed?

Last edited by zem52887; May 22nd, 2006 at 12:57 PM.
Old May 22nd, 2006, 1:02 PM   #113
Arevos
Programming Guru
 
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Python is a flexible language, and BeautifulSoup a fairly flexible parser. If the tables differ from company to company, then the task becomes more difficult, but not impossible.

I'll take a look around and see if I can come up with anything to do with searching.
Old May 22nd, 2006, 1:04 PM   #114
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
Your confidence is as always refreshing, welcomed, and appreciated.
Old May 22nd, 2006, 1:15 PM   #115
DaWei
Resident Grouch
 
 
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
If you took the post as offensive, then I'll lay it at the door of frustration, since you obviously aren't that stupid. You said:
Quote:
I don't think the tables are constant from company to company. By this I mean that since some companies have more links or additional sectors, the Description/Financial Information/Contact Info/Key People aren't in the same exact place on every company.
Of course they're not. They didn't undertake their work with you and your requirements in mind.

The quote of my previous post was not an "I told you so, nananananaaaaaaaana." It was a "you were forewarned to expect difficulties."

Now, if you care to persist in your recriminations, you go right ahead. You can't deny that someone IS doing a lot of your work for you. That's okay, for you were learning, but now you're coming along and implying that it may be "toss in the towel" time, and that Arevos' work will go to waste.

If you don't care for my posts, that's just tough titty. I don't plan to stop saying what I feel like saying. If you want to denigrate my efforts on this forum, you go right ahead. You can also kick your dam' cat when you get home, and slug the wife. I can't stop you.
Old May 22nd, 2006, 1:16 PM   #116
Arevos
Programming Guru
 
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
BeautifulSoup is nicer and more powerful than I thought. It supports text searching through regular expressions and bidirectional navigation of search results.

The way I suggest solving this would be to do something like this:
1. Find a piece of text that is predictably near to the results you want
2. Using this piece of text as a starting point, navigate to the correct results

I've got some dinner cooking, so I have to go. However, here's something quick to start you off:
import re

# assumes "soup" is an already-parsed BeautifulSoup object for the page

# find the "Address:" label text
address = soup.firstText(re.compile("Address:"))

# find the nearest "tr" tag that encloses the address label
tr = address.findParent("tr")

# print the second "td" cell that belongs to that "tr" tag
print tr.fetch("td")[1]
I suggest looking up "regular expression tutorial" on Google, or something similar, and also reading through the BeautifulSoup documentation.
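
A minimal sketch of the same two steps in current Python, for anyone reading along without the 2006-era BeautifulSoup API to hand. It swaps in the standard library's xml.etree.ElementTree so that it runs without extra packages, which means it only copes with well-formed markup; real pages would most likely still need BeautifulSoup. SAMPLE, cell_next_to and the sample address are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; real pages will be messier.
SAMPLE = """
<table>
  <tr><td>Phone:</td><td>555-0100</td></tr>
  <tr><td>Address:</td><td>1 Example Street</td></tr>
</table>
"""


def cell_next_to(root, label_pattern):
    """Return the text of the second <td> in the row whose label matches."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Step 1: find an element whose text matches the label pattern.
    label = next(
        (el for el in root.iter() if el.text and label_pattern.search(el.text)),
        None,
    )
    if label is None:
        return None

    # Step 2: walk up to the enclosing <tr>, then take its second <td>.
    row = label
    while row is not None and row.tag != "tr":
        row = parent_of.get(row)
    if row is None:
        return None
    cells = row.findall("td")
    return cells[1].text if len(cells) > 1 else None


root = ET.fromstring(SAMPLE)
address = cell_next_to(root, re.compile("Address:"))
print(address)
```

Returning None when the label or the second cell is missing keeps a page with a different layout from raising an error halfway through a long run.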
Old May 22nd, 2006, 2:21 PM   #117
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
will do thanks for the tip

edit: I spoke to my friend who's pretty good with this stuff, and he thinks that if we could import everything into an Excel document, one could use "if" statement macros to get the information we need from it. any truth to this, and is it a viable option? or would we be better off using regex?
Old May 22nd, 2006, 2:33 PM   #118
Arevos
Programming Guru
 
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Well, yes... One could use Excel's if statements for a task like this, but it seems akin to using a pickaxe to fix a broken down car. Sure, it could be done, but it isn't really the best tool for the job.

Excel has functions and macros to manipulate data in its spreadsheets, but it wasn't designed to be a full programming language. I'm also unaware of any HTML parsing functionality in Excel that would really be necessary for any project involving the extraction of data from web pages.

I don't like to dismiss options out of hand, but I think I'd be fairly safe in saying that a programming language such as Python is more suited to these sorts of tasks than a spreadsheet application.
Old May 22nd, 2006, 2:36 PM   #119
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
fair enough, back to learning regex and mastering beautifulsoup.
Old May 23rd, 2006, 9:30 AM   #120
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
arite I've been reading about regex and going through the documentation and I'm starting to get it a little bit but I'm still not sure what some of the stuff means.
Specifically, what does the following code mean:
fetchText(text, recursive, limit)

I've been examining the HTML on Yahoo!'s site and I think the best way to search for regex would be to avoid using the [td] and [tr] tags and instead, use [table] tags. So my goal is to search something like:
contact = soup.firstText(re.compile("Contact Information"))
contacttable = address.findParent("table")
contacttable.fetch("/table")[0]

Ultimately, I want to fetch the table tags that surround the words "Contact Information." However, when I use compile, I get a "list index out of range" error, so I think, but might be wrong, that I need to use the re.search function because I'm not dealing with a list? Or do I need to use compile because we're going to perform this function 36,000 times? If that's the case, then I'm formatting it wrong, and do I need to add another argument?
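
For what it's worth, a small sketch of how re.compile and re.search relate, and of where a "list index out of range" error typically comes from. The sample text and variable names are made up, and this is plain Python rather than anything specific to BeautifulSoup.

```python
import re

# Hypothetical sample text standing in for a page's contents.
page_text = "Company Profile ... Contact Information ... Key People"

# re.compile only builds a reusable pattern object; it does not search anything.
pattern = re.compile("Contact Information")

# These two calls find the same match. The compiled form is simply convenient
# when the same pattern is reused many times.
compiled_match = pattern.search(page_text)
direct_match = re.search("Contact Information", page_text)

# Indexing an empty list is what raises "list index out of range".
no_results = []
try:
    no_results[0]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__

print(compiled_match.group(0), direct_match.span(), error_name)
```

If that holds here, the error would come from the [0] applied to an empty result list rather than from compile itself, though that is only a guess from the snippet alone.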
DaniWeb IT Discussion Community
All times are GMT -5. The time now is 3:57 AM.
