Programming Forums
Jun 8th, 2006, 7:06 PM   #241
Cerulean
Professional Programmer
 
As for things slowing to a crawl: why not call time.sleep periodically inside the intensive loops? That frees up resources for other processes, so your computer doesn't become unusable.
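For illustration, here is a rough sketch of what a periodic pause might look like. It is written for Python 3 (the thread's own code is Python 2), and process_items, pause_every and pause_seconds are made-up names for this example, not anything from the program under discussion:

```python
import time

def process_items(items, work, pause_every=100, pause_seconds=0.01):
    """Apply work() to each item, sleeping briefly every pause_every items."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(work(item))
        if count % pause_every == 0:
            # Give other processes a chance to run.
            time.sleep(pause_seconds)
    return results

# Hypothetical workload: square the numbers 0..999, pausing every 100 items.
squares = process_items(range(1000), lambda n: n * n)
```

How often and how long to pause is a judgment call; the values above are placeholders to tune against the real workload.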
Jun 9th, 2006, 11:33 AM   #242
zem52887
Hobbyist Programmer
 
I'm unfamiliar with that bit of code Cerulean, could you elaborate a bit as to what it does?

Also, is there a way to have the program check the cache and skip any company links already found there, so that it adds only new data to the HTML file instead of building a new data.html from scratch, starting with the data I already have?
Jun 9th, 2006, 4:14 PM   #243
Arevos
Programming Guru
 
Quote:
Originally Posted by zem52887
I'm unfamiliar with that bit of code Cerulean, could you elaborate a bit as to what it does?
You already use it (or at least, it was in your code at one point - it seems to have vanished from your finished version). You can use sleep like this:
import time
time.sleep(10)    # sleep for 10 seconds
Or like this:
from time import sleep
sleep(10)         # sleep for 10 seconds
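Note the difference between "import" and "from ... import ...": with the first form you call the function through the module name (time.sleep), while the second brings the function's name directly into your script (sleep).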

The same holds true for any module:
import some_module
some_module.some_function()

from some_module import some_function
some_function()
Quote:
Originally Posted by zem52887
Also, is there a way to have the program check the cache and skip any company links already found there, so that it adds only new data to the HTML file instead of building a new data.html from scratch, starting with the data I already have?
Maybe something like:
        for company_url in get_company_urls(company_index):
                # Only fetch URLs that are NOT already in the cache
                if not cache.has_key(company_url):
                        # Fetch once and reuse the result
                        data = get_company_data(company_url)
                        file.write(data)
                        print data

                        # And remember to pause so the server isn't overloaded:
                        sleep(1)
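As a self-contained variation on the same idea, here is a rough Python 3 sketch. It assumes the cache behaves like a dictionary keyed by URL (which the has_key call suggests, but I can't confirm from this thread). write_new_companies and the fetch argument are invented for the example; fetch stands in for get_company_data:

```python
import io
import time

def write_new_companies(company_urls, cache, fetch, out, pause_seconds=1.0):
    """Write data only for URLs not already in cache; return the URLs fetched."""
    fetched = []
    for url in company_urls:
        if url in cache:            # Python 3 spelling of cache.has_key(url)
            continue                # already have this one, so skip it
        data = fetch(url)           # fetch once and reuse the result
        cache[url] = data           # remember it for the next run
        out.write(data)
        fetched.append(url)
        time.sleep(pause_seconds)   # pause so the server isn't overloaded
    return fetched

# Hypothetical demo: a fake fetch function and an in-memory "file".
demo_out = io.StringIO()
demo_cache = {"http://example.com/a": "<tr><td>A</td></tr>\n"}
demo_urls = ["http://example.com/a", "http://example.com/b"]
new_urls = write_new_companies(
    demo_urls, demo_cache, lambda url: "<tr><td>%s</td></tr>\n" % url,
    demo_out, pause_seconds=0)
```

To add to an existing data.html rather than replace it, the file would need to be opened with mode "a" instead of "w", since "w" truncates the file. The <table> and </table> lines would then need separate handling so they aren't written again on each run.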
Jun 13th, 2006, 8:20 AM   #244
zem52887
Hobbyist Programmer
 
So I ran the program again, and got an error on the American College of Beirut, again. I was curious how I would add the debugging code to the following:
file = open("data.html", "w")

file.write("<table>\n")    # \n means add a newline

for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
          
        for company_urls in get_company_urls(company_index):
                if cache.has_key(company_urls):
                    file.write(get_company_data(company_urls))
                    print get_company_data(company_urls)
                                      
                    
file.write("</table>\n")
file.close()

and rather than sys.exit, can I just have it skip over any errors?
Jun 13th, 2006, 10:12 AM   #245
zem52887
Hobbyist Programmer
 
update:
I made a little workaround that involves a bit of manual work, but at least I don't have to wait for the program to redo all the URLs it has already done. I edited the code and have been running the program once for each industry. It's a little bit of work, but it gets the job done (and the data files aren't 800 MB).
Jun 13th, 2006, 12:42 PM   #246
Arevos
Programming Guru
 
Quote:
Originally Posted by zem52887
So I ran the program again, and got an error on the American College of Beirut, again. I was curious how I would add the debugging code to the following:
Just put in some print statements, e.g.:
print company_urls
Quote:
Originally Posted by zem52887
and rather than sys.exit, can I just have it skip over any errors?
Yes. You can tell Python to handle an error (or an exception, to be more accurate) with a try/except block:
from urllib2 import urlopen, HTTPError

...

try:
    file.write(get_company_data(company_urls))
except HTTPError:
    print "HTTP error occurred for '%s'" % company_urls
    print """Without this block, Python would just quit.
Instead, we've overridden the default behaviour, and we print
this message instead of quitting. You could even put in some
code to let Python try the URL again, to see if it works if we
do it a second time."""
Also, I don't see the sleep function anywhere in your loop. Where is it? That could be the cause of your errors: trying to get too much information from the website too fast.
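For what it's worth, the "try the URL again" idea could be sketched roughly as follows. This is Python 3, where HTTPError lives in urllib.error rather than urllib2. fetch_with_retry, attempts and pause_seconds are names made up for this example, and fetch stands in for whatever function actually downloads the page (such as get_company_data):

```python
import time
from urllib.error import HTTPError  # Python 3 home of urllib2's HTTPError

def fetch_with_retry(fetch, url, attempts=3, pause_seconds=1.0):
    """Call fetch(url); on HTTPError, pause and retry up to `attempts` times.

    Returns the fetched data, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except HTTPError as error:
            print("HTTP error %s for '%s' (attempt %d of %d)"
                  % (error.code, url, attempt, attempts))
            if attempt < attempts:
                # Give the server a moment before trying again.
                time.sleep(pause_seconds)
    return None
```

In the loop, the caller would check for None and simply move on to the next URL, which gives the "skip over any errors" behaviour without calling sys.exit.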