Programming Forums

bulio Jan 18th, 2007 9:21 PM

How could I do this?
 
Hi everyone,

I visit a forum that I like to save images from. People post images there, and threads usually span anywhere from 10 to 700 pages. The thread I currently want to get images from is this one:

http://bombingscience.com/graffitifo...opic=4900&st=0

I was wondering which programming language would be best suited (and easiest) for downloading the .jpg images. I would like the program to recursively go through each page of the thread and download the images to a folder, omitting signatures and website images.

Can anyone point me in the right direction on how I might go about doing this?

Thanks

Indigno Jan 18th, 2007 9:47 PM

You could get a crawler and set it up to download every image from that page. I don't know specifics, but that may point you in the right direction as far as googling goes.

Serinth Jan 18th, 2007 11:16 PM

I think wget has a recursive option (-r), plus an -A option that lets you specify which file types you want to download.

For example:

wget -r -l1 --no-parent -A.gif http://www.server.com/dir/

bl00dninja Jan 19th, 2007 2:39 AM

Microsoft published a neat book called "Programming Bots, Spiders, and Intelligent Agents in Visual C++". I'm using it and its libraries for a school project right now. It would simplify the process, for $50.00 or whatever a used copy costs on Amazon.

Arevos Jan 19th, 2007 4:20 AM

Using a program like wget or curl would be easiest. Other than that, Python or Perl would probably be a good choice of language for this sort of work. Visual C++ strikes me as overkill for something a scripting language could accomplish in a fifth of the time.

bulio Jan 19th, 2007 1:30 PM

Quote:

Originally Posted by Serinth (Post 122841)
I think wget has a recursive option (-r), plus an -A option that lets you specify which file types you want to download.

For example:

wget -r -l1 --no-parent -A.gif http://www.server.com/dir/

That will probably work, but will it follow all the posts, or only do one page at a time?

Arevos Jan 19th, 2007 1:50 PM

Since the pages of the forum have predictable URLs, you could do something like:

for i in $(seq 0 15 525); do
    wget -A.jpg,.gif,.png -r -l1 "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=$i"
done


bulio Jan 19th, 2007 2:12 PM

I'm not running Linux, though; I'm using Windows 2000 with wget for Windows. Or do you mean I should do this in Python or Perl?

Arevos Jan 19th, 2007 6:52 PM

Well, in Python it wouldn't be dissimilar. Perhaps:

  1. from os import path
  2. from urllib import urlopen
  3. from urlparse import urlsplit
  4. from BeautifulSoup import BeautifulSoup
  5.  
  6. savedir = ...
  7.  
  8. url = "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=%d"
  9.  
  10. for i in range(0, 526, 15):
  11.     soup = BeautifulSoup(urlopen(url % i))
  12.  
  13.     for img in soup.findall('img'):
  14.         src = img['src']
  15.  
  16.         relative_path = urlsplit(src)[2]
  17.  
  18.         filename = relative_path.split('/')[-1]
  19.  
  20.         image = urlopen(src).read()
  21.  
  22.         open(path.join(savedir, filename), 'wb').write(image)

Just make sure you have the BeautifulSoup.py file in the same directory as your script.

Note that I haven't tried the above script in full. Probably needs some tweaking.
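
For readers on Python 3: the script above uses Python 2 module names (urllib.urlopen, urlparse) and the standalone BeautifulSoup module. A rough standard-library-only equivalent might look like the sketch below. The names ImgCollector, image_urls and download_thread are made up for illustration, the UTF-8 decoding is an assumption about the forum's pages, and filtering by file extension is only a crude stand-in for "omit signatures and website images". Like the original, it has not been run against the forum itself.

```python
from html.parser import HTMLParser
from os import path
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

# URL pattern and offsets (0 to 525 in steps of 15) are taken from the script above.
PAGE_URL = "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=%d"


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(html, page_url, extensions=(".jpg", ".jpeg")):
    """Return absolute URLs of images whose path ends with one of extensions."""
    collector = ImgCollector()
    collector.feed(html)
    urls = []
    for src in collector.sources:
        absolute = urljoin(page_url, src)  # resolves relative src values
        if urlsplit(absolute).path.lower().endswith(extensions):
            urls.append(absolute)
    return urls


def download_thread(savedir, offsets=range(0, 526, 15)):
    """Fetch each page of the thread and save its matching images into savedir."""
    for offset in offsets:
        page_url = PAGE_URL % offset
        with urlopen(page_url) as response:
            # Assumption: pages decode acceptably as UTF-8.
            html = response.read().decode("utf-8", errors="replace")
        for url in image_urls(html, page_url):
            filename = urlsplit(url).path.split("/")[-1]
            with urlopen(url) as image, open(path.join(savedir, filename), "wb") as out:
                out.write(image.read())
```

Calling download_thread with an existing folder path would start the download. The parsing step is kept in its own function so it can be tried on a saved copy of one page before anything touches the network.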

bulio Jan 19th, 2007 8:27 PM

Arevos, the script looks great, but I'm getting this error:

E:\Documents and Settings\Mark-James McDougall\Desktop\Script>grabber.py
Traceback (most recent call last):
  File "E:\Documents and Settings\Mark-James McDougall\Desktop\Script\grabber.py", line 13, in <module>
    for img in soup.findall('img'):
TypeError: 'NoneType' object is not callable
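
A likely cause, going by the traceback: the BeautifulSoup 3 method is spelled findAll, with a capital A. BeautifulSoup treats an unknown attribute name as a request for a child tag of that name and returns None when there is no such tag, so soup.findall evaluates to None and the call fails. Changing line 13 to use soup.findAll('img') should get past this particular error. The toy class below only illustrates the mechanism; Node and find_all are hypothetical names, and this is not BeautifulSoup's actual implementation.

```python
class Node:
    """Toy stand-in for a parsed tag; NOT BeautifulSoup's real implementation."""

    def __init__(self, children=None):
        self.children = children or {}

    def find_all(self, name):
        """Explicit search method, playing the role of findAll."""
        return [child for key, child in self.children.items() if key == name]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: treat the name as a
        # child tag and return None when there is no such child.
        return self.children.get(name)


soup = Node({"img": Node()})

# The correctly spelled method is found by normal lookup and works.
images = soup.find_all("img")

# The misspelled name falls through to __getattr__, which returns None,
# so the call is really None("img").
try:
    soup.findall("img")
except TypeError as error:
    message = str(error)
```

The same pattern explains why the error says 'NoneType' rather than complaining about a missing attribute: the lookup itself succeeds, it just returns None.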


