![]() |
|
![]() |
|
|
Thread Tools | Display Modes |
|
|
#221 | ||
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
![]() |
Quote:
ratings = soup.fetchText(re.compile("analyst ratings"))
if ratings == []:
public_private = "Private"
else:
public_private = "Public"Quote:
text = text.replace("Company Profile - Yahoo! Finance", "") |
||
|
|
|
|
|
#222 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
I got the search and replace working and now I'm trying to incorporate the public/private column into my table. However, I keep getting empty brackets being returned or the word "Null," so I'm not sure what I'm doing wrong.
chart = soup.fetchText(re.compile("chart"))
print len(chart)I've tried this code for the following links: http://biz.yahoo.com/ic/135/135359.html (no chart) http://biz.yahoo.com/ic/47/47852.html (with chart) The first link doesn't have the word chart on it, while the second does. However, in both cases, it prints "0". I can't figure out why... |
|
|
|
|
|
#223 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
Also, is there a way to have python output the data into a new HTML document after each industry is complete? Or is there a way to have it stop pause after each industry? Basically my goal is to have either each industry separate from one another some how so my boss doesn't have 1200 papers, for ease of organization.
And finally: Is there a way to prevent it from printing one table on two separate pages? Such that, if a table is extending on to the following page, it won't print half on one and half on the other? Since it's from an HTML it seems like this wouldn't be possible, but I figure it's worth asking. And, I'm trying to minimize the amount of paper used to print, so I'm trying to maximize the way this is formatted. If anyone has any suggestions on how I could change the format to make it fit more per page, landscape perhaps? Please let me know. / How can I make the font size smaller... I can't add a "<font size = 1>" tag in the table, because each individual table with the "company description" (which takes up the most space) has a font size tag embedded in it that gets fetched with the text. Thus, do I need to do a replace tag but with font size? companyprofile = profile.findNext("table").replace("font face = arial size = -1","font face = arial size = 1")however that puts "null" in the table? Is that not the right font tag I'm trying to replace? Or is that I'm trying to replace a tag, and that line of code is trying attempting to replace the words font face which don't appear on the site, only in embedded in the HTML? If the latter is the case, how do you replace tags as opposed to text? Last edited by zem52887; May 31st, 2006 at 10:00 AM. |
|
|
|
|
|
#224 | |
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
![]() |
Quote:
re.compile("chart", re.IGNORECASE) |
|
|
|
|
|
|
#225 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
oh wow, I cannot believe that. Heh tested it in isolation and it works let's see if it'll work in the function.
|
|
|
|
|
|
#226 | ||
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
![]() |
Quote:
Quote:
td { font-size: 8pt; } |
||
|
|
|
|
|
#227 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
thanks much.
|
|
|
|
|
|
#228 |
|
Programmer
Join Date: Dec 2004
Location: UK
Posts: 53
Rep Power: 4
![]() |
Just to throw in an idea, I remember you originally wanted to put all your data in a spreadhseet ... now if I'm not wrong, you can open a HTML document you have created and copy the massive tables into Excel cells. You can use Edit > Paste Special .. if it doesn't work at first. Then you/your boss can do a number of things:
1. manipulate the numerical data much more easily e.g. graphs etc 2. you can mould the actual layout (like the different industries being separated) by inserting rows etc. (I'm sure your Excel-literate enough) 3. You can, if you feel like it, copy the Excel tables back into, oh i dunno, Word and have them printed out anyway you like, or into a WYSIWYG html editor like dreamweaver and have them in the HTML again. Or something like that. But I imagine a spreadhseet would be more useful in the long run, even if it is just to be printed out. -A
__________________
Tetris is so unrealistic |
|
|
|
|
|
#229 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
![]() |
yeah megamind I've been playing around trying to import the html tables into excel and the formatting gets kinda screwed up. I have been able to import them into word though, and from word I can easily edit the font and style etc. so I'm going to try and manipulate the tables that way for now.
However, I ran the program yesterday when I left working figuring when I got back it would be finished. But, I think they shut down the computer systems or something at night because it only got half done, and my computer was logged out even though I put a note to not log me out on it. Thus, I was wondering if there's a way to be able to pause the program and resume. Or even better scenario, pause it, and be able to resume to a different document, seeing as how this one is 50mb! |
|
|
|
|
|
#230 |
|
Programming Guru
![]() Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
![]() |
Hm. An interesting problem. Yes, it would be possible do "pause" the program, but it would take some fairly complex reworking in order to achieve it.
The most easiet way in my mind to make the program more efficient, is to cache the company HTML that is downloaded, much as a browser would do. I'll explain more about that in a moment. This can be achieved with the "shelve" module, which allows you to store data to a file. We can use this to create a cache that stores data in a file (in this case, cache.dat): import shelve
cache = shelve.open("cache.dat")def cached_urlopen(url): if not cache.has_key(url): cache[url] = urlopen(url).read() return cache[url] if not cache.has_key(url): cache[url] = urlopen(url).read() return cache[url] The shelve module syncs this cache to a file. Everytime you make changes to the cache, such as adding a new URL to it, the shelve module writes these changes to a file. This makes it persistant - it remembers the values written to it even when the program itself closes. Caching has the advantage of speed. It's far faster to fetch a file from a cache on your computer, than it is to download it from the net. The disadvantage to caching is that it doesn't take into account changes to the data. If Yahoo changed all it's data, the cache would still give you the old data. You'd have to delete the cache file in order to force it to get the new data Yahoo offers. However, I suspect the company information (at least the information you're interested in) doesn't change very often. Thus, I'd consider a cache to be rather useful indeed. For instance, if your computer was turned off halfway through, the cache would remember what URLs you've fetched, and would refer to the local disk instead of refetching them. This would make subsequent runs a lot faster. And how to use this cached_urlopen function? Easy; just replace the urlopen functions with cached_urlopen instead: def get_company_data(url): soup = BeautifulSoup(cached_urlopen(url)) ... |
|
|
|
![]() |
| Bookmarks |
| Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
| Thread Tools | |
| Display Modes | |
|
|