Programming Forums
Old May 30th, 2006, 5:26 PM   #221
Arevos
Quote:
Originally Posted by zem52887
at the bottom of the page. Private companies do not have such a table so would it be possible to use regex again to find the word "analyst ratings" on a company page and then we could do something with that... or something along those lines?
Sure. One could do something like:
ratings = soup.fetchText(re.compile("analyst ratings"))

if ratings == []:
    public_private = "Private"
else:
    public_private = "Public"
Then you could just put the public_private variable in its own column in the output, no?
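As a rough sketch of the column idea, here is a standard-library-only version written for Python 3. It assumes you already have the page text as a string; the names `classify_company` and `company_row` and the sample text are made up for illustration, and the case-insensitive match is my assumption in case the page capitalises the phrase differently.

```python
import re
from html import escape

# Hypothetical pattern: matches the phrase regardless of capitalisation.
RATINGS_PATTERN = re.compile("analyst ratings", re.IGNORECASE)

def classify_company(page_text):
    """Return "Public" if the page mentions analyst ratings, else "Private"."""
    if RATINGS_PATTERN.search(page_text):
        return "Public"
    return "Private"

def company_row(name, page_text):
    """Build one HTML table row with the public/private flag in its own column."""
    status = classify_company(page_text)
    return "<tr><td>%s</td><td>%s</td></tr>" % (escape(name), status)

if __name__ == "__main__":
    # Made-up sample data, not taken from the real pages.
    print(company_row("Example Corp", "Analyst Ratings: Buy"))
    print(company_row("Smith & Sons", "No ratings table on this page"))
```

The name is escaped so that an ampersand in a company name does not break the output HTML.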

Quote:
Originally Posted by zem52887
Additionally for #3 I realized once it's output to the HTML I can simply replace "Company Profile - Yahoo! Finance" with a blank (in notepad)... no? Or is that a cop-out and I should implement it in the code? Especially if I have the next few days...
Search and replace can be done fairly easily with Python as well:
text = text.replace("Company Profile - Yahoo! Finance", "")
Old May 31st, 2006, 9:18 AM   #222
zem52887
I got the search and replace working and now I'm trying to incorporate the public/private column into my table. However, I keep getting empty brackets being returned or the word "Null," so I'm not sure what I'm doing wrong.
chart = soup.fetchText(re.compile("chart"))
print len(chart)

I've tried this code for the following links:
http://biz.yahoo.com/ic/135/135359.html (no chart)
http://biz.yahoo.com/ic/47/47852.html (with chart)

The first link doesn't have the word chart on it, while the second does. However, in both cases, it prints "0". I can't figure out why...
Old May 31st, 2006, 9:31 AM   #223
zem52887
Also, is there a way to have Python output the data into a new HTML document after each industry is complete? Or is there a way to have it pause after each industry? Basically, my goal is to keep each industry separate from the others somehow, so my boss doesn't have 1200 papers to sort through.

And finally:
Is there a way to prevent it from printing one table on two separate pages? Such that, if a table is extending on to the following page, it won't print half on one and half on the other? Since it's from an HTML it seems like this wouldn't be possible, but I figure it's worth asking.

And, I'm trying to minimize the amount of paper used to print, so I'm trying to maximize the way this is formatted. If anyone has any suggestions on how I could change the format to make it fit more per page, landscape perhaps? Please let me know. / How can I make the font size smaller...

I can't add a "<font size = 1>" tag in the table, because each individual table with the "company description" (which takes up the most space) has a font size tag embedded in it that gets fetched with the text. Thus, do I need to do a replace tag but with font size?

companyprofile = profile.findNext("table").replace("font face = arial size = -1","font face = arial size = 1")

However, that puts "null" in the table. Is that not the right font tag to replace? Or is it that I'm trying to replace a tag, and that line of code is attempting to replace the words "font face", which don't appear on the site, only embedded in the HTML? If the latter is the case, how do you replace tags as opposed to text?

Last edited by zem52887; May 31st, 2006 at 10:00 AM.
Old May 31st, 2006, 11:01 AM   #224
Arevos
Quote:
Originally Posted by zem52887
I got the search and replace working and now I'm trying to incorporate the public/private column into my table. However, I keep getting empty brackets being returned or the word "Null," so I'm not sure what I'm doing wrong.
By default, regular expressions are case sensitive; you're searching for "chart", when on the page there is only "Chart". To make the regular expression ignore case, you need the re.IGNORECASE flag:
re.compile("chart", re.IGNORECASE)
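To see the difference in isolation, here is a small Python 3 sketch using only the `re` module. The sample string is made up; it just stands in for text that contains "Chart" with a capital C.

```python
import re

page_text = "Chart: 1d 5d 3m 1y"  # made-up sample text

# Case-sensitive (the default): lowercase "chart" does not match "Chart".
case_sensitive = re.compile("chart")
print(len(case_sensitive.findall(page_text)))

# With re.IGNORECASE the same pattern matches any capitalisation.
ignore_case = re.compile("chart", re.IGNORECASE)
print(len(ignore_case.findall(page_text)))
```

The first count is 0 and the second is 1, which is the same symptom as the "0" you were seeing.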
Old May 31st, 2006, 11:04 AM   #225
zem52887
Oh wow, I cannot believe that. Heh, I tested it in isolation and it works; let's see if it'll work in the function.
Old May 31st, 2006, 11:07 AM   #226
Arevos
Quote:
Originally Posted by zem52887
And finally:
Is there a way to prevent it from printing one table on two separate pages? Such that, if a table is extending on to the following page, it won't print half on one and half on the other? Since it's from an HTML it seems like this wouldn't be possible, but I figure it's worth asking.
Not that I know of, but there might be some browser-specific way of doing it. Maybe IE or Firefox has some special custom tag or CSS option which prevents a table from being broken up across pages. It seems a common enough thing to want to do, so there may be an option to do it. If I happen across anything like it, I'll mention it, but I recommend running some searches on it yourself.
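For what it's worth, one candidate to test: CSS 2.1 defines a `page-break-inside` property, and `page-break-inside: avoid` asks the browser not to split an element across printed pages. Browser support has historically been uneven, so treat this as something to try rather than a guarantee. A Python 3 sketch that wraps already-generated table HTML in a page carrying that rule (the function name `printable_page` and the default title are hypothetical):

```python
# Print-only rule: ask the browser to keep each table on one page if it can.
PRINT_CSS = """
@media print {
    table { page-break-inside: avoid; }
}
"""

def printable_page(tables_html, title="Company Profiles"):
    """Wrap table markup in a full HTML page that carries the print rule."""
    return (
        "<html><head><title>%s</title>\n"
        "<style type=\"text/css\">%s</style>\n"
        "</head><body>\n%s\n</body></html>" % (title, PRINT_CSS, tables_html)
    )

if __name__ == "__main__":
    print(printable_page("<table><tr><td>Example</td></tr></table>"))
```

A table that is taller than a single page will still have to break somewhere, whatever the rule says.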

Quote:
Originally Posted by zem52887
And, I'm trying to minimize the amount of paper used to print, so I'm trying to maximize the way this is formatted. If anyone has any suggestions on how I could change the format to make it fit more per page, landscape perhaps? Please let me know. / How can I make the font size smaller...
Find a tutorial on CSS stylesheets. Stylesheets can be used to alter the appearance of HTML without editing the individual tags. For instance, I could have a stylesheet which had:
td { font-size: 8pt; }
And then all the text in "td" tags would be 8pt in size. I won't go into any more detail here; there are thousands of tutorials and such online that will do a much better job than me of explaining how CSS works.
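As a sketch of how that could be wired into the script, the Python 3 function below inserts a style block into HTML you have already generated. The name `add_stylesheet` is made up, and the extra `td font` selector is my assumption for dealing with the `<font size>` tags that come along with the fetched company descriptions; a stylesheet rule should take precedence over the size set by the font tag, but check it in your browser.

```python
import re

# Style block to inject; "td font" also targets text wrapped in <font> tags.
STYLE_BLOCK = (
    '<style type="text/css">\n'
    "td, td font { font-size: 8pt; }\n"
    "</style>\n"
)

HEAD_CLOSE = re.compile("</head>", re.IGNORECASE)

def add_stylesheet(html_text):
    """Insert the style block just before </head>, or prepend it if there is no head."""
    match = HEAD_CLOSE.search(html_text)
    if match is None:
        return STYLE_BLOCK + html_text
    return html_text[:match.start()] + STYLE_BLOCK + html_text[match.start():]

if __name__ == "__main__":
    print(add_stylesheet("<html><head></head><body></body></html>"))
```

This way the fetched tables are left untouched and only the output document changes.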
Old May 31st, 2006, 11:11 AM   #227
zem52887
thanks much.
Old May 31st, 2006, 4:24 PM   #228
megamind5005
Just to throw in an idea: I remember you originally wanted to put all your data in a spreadsheet. If I'm not wrong, you can open an HTML document you have created and copy the massive tables into Excel cells. You can use Edit > Paste Special if it doesn't work at first. Then you or your boss can do a number of things:
1. Manipulate the numerical data much more easily, e.g. graphs.
2. Mould the actual layout (like the different industries being separated) by inserting rows etc. (I'm sure you're Excel-literate enough.)
3. If you feel like it, copy the Excel tables back into, oh I dunno, Word and have them printed out any way you like, or into a WYSIWYG HTML editor like Dreamweaver and have them in HTML again.

Or something like that. But I imagine a spreadsheet would be more useful in the long run, even if it is just to be printed out.

-A
Old Jun 1st, 2006, 8:34 AM   #229
zem52887
yeah megamind I've been playing around trying to import the html tables into excel and the formatting gets kinda screwed up. I have been able to import them into word though, and from word I can easily edit the font and style etc. so I'm going to try and manipulate the tables that way for now.

However, I ran the program yesterday when I left work, figuring it would be finished when I got back. But I think they shut down the computer systems or something at night, because it only got half done, and my computer was logged out even though I put a note on it not to log me out. So I was wondering if there's a way to pause the program and resume it. Or, even better, pause it and resume into a different document, seeing as this one is 50 MB!
Old Jun 1st, 2006, 10:10 AM   #230
Arevos
Hm. An interesting problem. Yes, it would be possible to "pause" the program, but it would take some fairly complex reworking to achieve it.

The easiest way, to my mind, to make the program more efficient is to cache the company HTML that is downloaded, much as a browser would. I'll explain more about that in a moment.

This can be achieved with the "shelve" module, which allows you to store data to a file. We can use this to create a cache that stores data in a file (in this case, cache.dat):
import shelve

cache = shelve.open("cache.dat")
Next, we need to create a cached_urlopen function. I'll show you the code, then go through it line by line:
def cached_urlopen(url):
	if not cache.has_key(url):
		cache[url] = urlopen(url).read()
	return cache[url]
It's a pretty small function, but it is extremely powerful. I'll explain what it does:
	if not cache.has_key(url):
This line of code checks to see if the URL is not cached. If there is no record of the URL, then we go to the next line:
		cache[url] = urlopen(url).read()
This line downloads the HTML from the URL, and then places it into the cache. The cache associates a URL with the HTML it receives.
	return cache[url]
This final line of code queries the cache and pulls out the HTML associated with the URL.

The shelve module keeps this cache in a file. When you change the cache, such as by adding a new URL to it, the shelve module writes the change to that file (call cache.sync() or cache.close() to be sure everything has been flushed to disk). This makes it persistent: it remembers the values written to it even after the program itself closes.

Caching has the advantage of speed. It's far faster to fetch a file from a cache on your computer than it is to download it from the net.

The disadvantage to caching is that it doesn't take into account changes to the data. If Yahoo changed all its data, the cache would still give you the old data. You'd have to delete the cache file in order to force it to get the new data Yahoo offers.
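If deleting the whole file ever becomes a nuisance, one alternative (not what the function above does, just a variation) is to store the download time next to the HTML and refetch entries that are too old. A Python 3 sketch, where `fetch_with_expiry`, the one-week limit, and the `fetch` parameter are all assumptions of mine:

```python
import time

MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # assumed limit: entries older than a week are stale

def fetch_with_expiry(cache, url, fetch, now=None):
    """Return cached HTML for url, refetching when the entry is missing or stale."""
    if now is None:
        now = time.time()
    if url in cache:
        stored_at, html = cache[url]
        if now - stored_at < MAX_AGE_SECONDS:
            return html  # still fresh, no download needed
    # Missing or stale: download again and record when we did so.
    html = fetch(url)
    cache[url] = (now, html)
    return html
```

Here `cache` can be a plain dict or an object returned by shelve.open, and `fetch` is whatever function downloads a URL and returns its HTML.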

However, I suspect the company information (at least the information you're interested in) doesn't change very often. Thus, I'd consider a cache to be rather useful indeed. For instance, if your computer was turned off halfway through, the cache would remember what URLs you've fetched, and would refer to the local disk instead of refetching them. This would make subsequent runs a lot faster.

And how to use this cached_urlopen function? Easy; just replace the urlopen functions with cached_urlopen instead:
def get_company_data(url):
	soup = BeautifulSoup(cached_urlopen(url))
	...
Do this for all of your functions, and you should notice performance increases once the cache starts to fill.
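For anyone running this on Python 3, where has_key() no longer exists and urlopen lives in urllib.request, a roughly equivalent sketch follows. The wrapper `make_cached_urlopen` and its `fetch` parameter are my own additions so the download step can be swapped out; they are not part of the code above.

```python
import shelve
from urllib.request import urlopen

def download(url):
    """Default fetcher: download the page body as bytes."""
    return urlopen(url).read()

def make_cached_urlopen(cache, fetch=download):
    """Return a function that downloads each URL at most once per cache."""
    def cached_urlopen(url):
        if url not in cache:      # replaces cache.has_key(url)
            cache[url] = fetch(url)
            cache.sync()          # flush now, so an interrupted run keeps this page
        return cache[url]
    return cached_urlopen

# Typical use (performs a real download, so it is left as a comment):
#   cache = shelve.open("cache.dat")
#   cached_urlopen = make_cached_urlopen(cache)
#   html = cached_urlopen("http://biz.yahoo.com/ic/47/47852.html")
#   cache.close()
```

Calling sync() after each new page costs a little speed, but it means a run that gets cut off overnight keeps everything it had fetched up to that point.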
DaniWeb IT Discussion Community
Copyright ©2007 DaniWeb® LLC