Programming Forums
Jan 19th, 2007, 9:24 PM   #11
bulio
Ok, I got it working. Here's what I have:

import sys
from os import path
from urllib import urlopen
from urlparse import urlsplit
from BeautifulSoup import BeautifulSoup
from httplib import InvalidURL

# raw string, so the backslashes in the Windows path are taken literally
savedir = r'E:\Documents and Settings\Mark-James McDougall\Desktop\DTA'

url = 'http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=%s'

main_url = 'http://bombingscience.com/'


for i in range(0, 526):

  this_url = url % i

  try:
    soup = BeautifulSoup(urlopen(this_url))

  except InvalidURL, e:
    print 'url <%s> did not open: %s' % (this_url, e)
    sys.exit(1)

  for img in soup.findAll('img'):
    src = img['src']

    # if it's from the ad server, let's ignore this image
    if 'adserver' in src:
      print 'This looks like an ad, skipping: %s' % src
      continue

    # build an absolute URL for the image
    if not src.startswith('http://'):
      image_url = main_url + src.strip('/')
    else:
      image_url = src

    try:
      image = urlopen(image_url).read()
      relative_path = urlsplit(src)[2]

      # the last path component becomes the local file name
      filename = relative_path.split('/')[-1]

      open(path.join(savedir, filename), 'wb').write(image)
      print 'got %s successfully' % image_url

    except IOError, e:
      print 'could not open this image: <%s>' % image_url

However, it seems that while a bunch of images get downloaded and some work properly, others don't display on my PC. The file name and size are there, but no image.

Any idea why?
Jan 20th, 2007, 6:57 AM   #12
Arevos
Quote:
Originally Posted by bulio
However, it seems that while a bunch of images get downloaded and some work properly, others don't display on my PC. The file name and size are there, but no image.

Any idea why?

I'm not sure. Have you tried visiting the URLs that failed with a browser?

The only thing I can think of is that perhaps the relative URLs aren't being correctly parsed. Try replacing these lines:
    if not src.startswith('http://'):
      image_url = main_url + src.strip('/')
    else:
      image_url = src
With this:
    image_url = urljoin(this_url, src)
And add urljoin to the list of functions you import from urlparse:
    from urlparse import urlsplit, urljoin
urljoin will turn any relative link into an absolute one, whilst leaving absolute URLs intact, e.g.
    >>> urljoin("http://www.foo.com", "bar/foobar.png")
    'http://www.foo.com/bar/foobar.png'
    >>> urljoin("http://www.foo.com", "http://www.world.com/bar/foobar.png")
    'http://www.world.com/bar/foobar.png'
Your code does much the same thing, but there may be URLs it falls over on, giving an incorrect final URL. For example, a src without a leading slash is relative to the directory of the page it appears on, whereas your code always appends it to main_url. urljoin should work correctly for any valid URL. Whether this is the problem, I'm not sure, but it's the only thing I can currently think of.
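To illustrate where the hand-rolled join and urljoin can disagree, here is a small sketch. It is written for Python 3, where urljoin lives in urllib.parse rather than urlparse. The src values are made up for illustration, and manual_join is a hypothetical stand-in for the startswith/strip logic in the posted script.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

main_url = 'http://bombingscience.com/'
this_url = 'http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=0'


def manual_join(src):
    """Hypothetical stand-in for the hand-rolled logic in the posted script."""
    if not src.startswith('http://'):
        return main_url + src.strip('/')
    return src


# Made-up src values covering three common cases.
samples = [
    '/uploads/pic.jpg',             # relative to the site root
    'uploads/pic.jpg',              # relative to the page's own directory
    'https://example.com/pic.jpg',  # absolute, but not http://
]

for src in samples:
    print(src)
    print('  manual :', manual_join(src))
    print('  urljoin:', urljoin(this_url, src))
```

Only the first sample gives the same result both ways. The second resolves under /graffitiforum/ with urljoin, and the third is left untouched by urljoin but gets main_url glued to the front by the manual version.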

Also, you have:
    for i in range(0, 526):
Which will go up by increments of 1 each time. The URLs, however, go up in increments of 15 (e.g. st=0, st=15, st=30...), so you want:
    for i in range(0, 526, 15):
Whilst either should work, with a step of 1 you're probably iterating over the same posts multiple times.
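A quick way to see the difference is to count the requests each loop would make. This is only a sketch in Python 3 syntax; the idea that each distinct page starts at a multiple of 15 is taken from the st=0, st=15, st=30 pattern above.

```python
url = 'http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=%s'

every_offset = range(0, 526)       # st=0, 1, 2, ... 525
page_offsets = range(0, 526, 15)   # st=0, 15, 30, ... 525

print(len(every_offset))           # 526 requests
print(len(page_offsets))           # 36 requests
print(list(page_offsets)[:4])      # [0, 15, 30, 45]
print(url % page_offsets[-1])      # the last page requested (st=525)
```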
Jan 20th, 2007, 11:08 AM   #13
Duck
(You could use HTTrack to download the website, or parts of it; then you'd have all the image files etc.)
Jan 21st, 2007, 12:24 PM   #14
bulio
Arevos, your code works great! Now if only I could find out how to download only the images, not the signatures or avatars.
Jan 21st, 2007, 2:15 PM   #15
bulio
Oh, and finally, what would I need to change if I wanted to begin downloading images from, say, page 300 of a forum thread?

I'm assuming I'd change:

    for i in range(0, 526, 15):
To this:
    for i in range(4500, 526, 15):

Since st=4500 would be the 300th page.

Also, it seems like around 40-60 images didn't get downloaded. Any idea why some are getting downloaded with no problem, but some aren't?

Last edited by bulio; Jan 21st, 2007 at 2:35 PM.
Jan 21st, 2007, 3:01 PM   #16
Arevos
The arguments in range are (start, end, increment). The first argument is the starting number, the second is the number after the ending number, and the third is the increment.

So range(0, 526, 15) will go from 0 to 525 in increments of 15. range(4500, 526, 15) won't do anything, since the end (526) is less than the start (4500). To start from a later page, the end has to be raised as well, to one more than the last st value you want.
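As a sketch of how a later starting point could be expressed: the page_offsets helper below is hypothetical, and it assumes 15 posts per page with page 1 at st=0, which matches the st=0, st=15, st=30 pattern earlier in the thread. It uses Python 3 syntax.

```python
POSTS_PER_PAGE = 15  # assumption, taken from the st=0, 15, 30... pattern


def page_offsets(first_page, last_page, per_page=POSTS_PER_PAGE):
    """Return the st values for first_page..last_page inclusive (1-based)."""
    start = (first_page - 1) * per_page
    stop = (last_page - 1) * per_page + 1  # range() excludes the stop value
    return range(start, stop, per_page)


print(list(range(4500, 526, 15)))    # [] because the start is already past the end
print(list(page_offsets(1, 3)))      # [0, 15, 30]
print(list(page_offsets(300, 302)))  # [4485, 4500, 4515]
```

Under that assumption, page 300 would begin at st=4485, and st=4500 would be page 301.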
Jan 21st, 2007, 3:14 PM   #17
bulio
I don't understand why some images are getting downloaded fine, while others aren't getting downloaded at all. For example:

http://bombingscience.com/graffitifo...ic=4900&st=450

Not one of those images was downloaded.
Jan 21st, 2007, 4:45 PM   #18
Arevos
Did the page load correctly or did it time out? Did the program say the images were being downloaded?

Perhaps if you farmed out the functionality to a function, then you could call it just for page 450. Something like:
    for i in range(0, 526, 15):
      save_images(url % i)
And then if you wanted to check a particular page:
    save_images(url % 450)
I tested it on 450, and it seems to work fine. Perhaps the page is timing out, or maybe the server doesn't like you requesting so many pages so quickly. Perhaps put a time.sleep call at the end of the loop to ensure that it waits a few seconds before requesting the next page.
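Here is a minimal sketch of what such a save_images function might look like. It is written for Python 3 and swaps BeautifulSoup for the standard library's html.parser so that it is self-contained. The ImgCollector class, the fetch parameter, the save_thread wrapper, and the timeout and delay values are all assumptions for illustration, not part of the original script.

```python
import os
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def http_fetch(url, timeout=30):
    """Default fetcher: return the raw bytes at url, giving up after timeout seconds."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_images(page_url, savedir, fetch=http_fetch):
    """Save every non-ad image on page_url into savedir; return the saved paths."""
    parser = ImgCollector()
    parser.feed(fetch(page_url).decode('utf-8', errors='replace'))
    saved = []
    for src in parser.sources:
        if 'adserver' in src:
            continue  # same ad filter as the original script
        image_url = urljoin(page_url, src)
        filename = urlsplit(image_url).path.split('/')[-1]
        if not filename:
            continue  # nothing usable to name the file after
        try:
            data = fetch(image_url)
        except OSError as e:
            print('could not open this image: <%s> (%s)' % (image_url, e))
            continue
        target = os.path.join(savedir, filename)
        with open(target, 'wb') as f:
            f.write(data)
        saved.append(target)
    return saved


def save_thread(url_template, offsets, savedir, delay=2.0, fetch=http_fetch):
    """Call save_images for each st offset, pausing between pages."""
    for st in offsets:
        save_images(url_template % st, savedir, fetch=fetch)
        time.sleep(delay)  # wait a little before asking for the next page
```

With something like this in place, a single page can be retried on its own, e.g. save_images(url % 450, savedir), and the returned list shows which files were actually written for that page.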
DaniWeb IT Discussion Community
All times are GMT -5.