Programming Forums
Old Jan 18th, 2007, 9:21 PM   #1
bulio
Hobbyist Programmer
Join Date: Jul 2004
Location: Location
Posts: 138
Rep Power: 5 bulio is on a distinguished road
How could I do this?

Hi everyone,

I visit a forum that I like to save images from. People post images there, and threads usually span anywhere from 10 to 700 pages. The thread I would currently like to get images from is this one:

http://bombingscience.com/graffitifo...opic=4900&st=0

I was wondering which programming language would be best suited (and easiest) for downloading the .jpg images. I would like the program to recursively go through each thread page and download the images to a folder, omitting signatures and the website's own images.

Can anyone point me in the right direction on how I might go about doing this?

Thanks
Old Jan 18th, 2007, 9:47 PM   #2
Indigno
Professional Programmer
Join Date: Dec 2005
Location: Anywhere non-productive
Posts: 267
Rep Power: 0 Indigno is an unknown quantity at this point
You could get a crawler and set it up to download every image from that page. I don't know specifics, but that may point you in the right direction as far as googling goes.
Old Jan 18th, 2007, 11:16 PM   #3
Serinth
Programmer
Join Date: Sep 2005
Posts: 50
Rep Power: 3 Serinth is on a distinguished road
I think wget has a recursive option (-r), along with an -A option that lets you specify the file types you want to download.

e.g.
 wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
Old Jan 19th, 2007, 2:39 AM   #4
bl00dninja
Programming Guru
Join Date: Oct 2004
Location: namespace std
Posts: 1,246
Rep Power: 5 bl00dninja is on a distinguished road
Microsoft published a neat book called "Programming Bots, Spiders, and Intelligent Agents in Visual C++". I'm using it and its libraries for a school project right now. It would simplify the process for $50.00, or whatever they charge for a used copy on Amazon.
Old Jan 19th, 2007, 4:20 AM   #5
Arevos
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5 Arevos is on a distinguished road
Using a program like wget or curl would be easiest. Failing that, Python or Perl would probably be a good choice of language for this sort of work. Visual C++ strikes me as overkill for something a scripting language could accomplish in a fifth of the time.
Old Jan 19th, 2007, 1:30 PM   #6
bulio
Quote:
Originally Posted by Serinth
I think wget has a recursive option (-r), along with an -A option that lets you specify the file types you want to download.

e.g.
 wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
That will probably work, but will it follow all the posts, or only do one page at a time?
Old Jan 19th, 2007, 1:50 PM   #7
Arevos
Since the pages of the forum have predictable URLs, you could do something like:

for i in $(seq 0 15 525); do
    wget -A.jpg,.gif,.png -r -l1 "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=$i"
done
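For anyone without a Unix shell, the same page offsets can be generated in a few lines of Python. This is only a sketch: the helper name `page_urls` is made up for illustration, and the step of 15 and the last offset of 525 are taken from the loop above rather than checked against the forum.

```python
# Sketch: build the URL of every page in the thread, mirroring `seq 0 15 525`.
# `page_urls` is a hypothetical helper name, not part of any library.

TOPIC_URL = "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=%d"


def page_urls(template=TOPIC_URL, last_offset=525, step=15):
    """Return one URL per thread page, for offsets 0, step, ..., last_offset."""
    # range() excludes its stop value, so add 1 to include last_offset itself.
    return [template % offset for offset in range(0, last_offset + 1, step)]


if __name__ == "__main__":
    for page in page_urls():
        print(page)
```

Each printed URL could then be handed to wget one at a time, or fetched directly from Python.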
Old Jan 19th, 2007, 2:12 PM   #8
bulio
Quote:
Originally Posted by Arevos
Since the pages of the forum have predictable URLs, you could do something like:

for i in $(seq 0 15 525); do
    wget -A.jpg,.gif,.png -r -l1 "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=$i"
done
I'm not running Linux though; I'm using Windows 2000 with wget for Windows. Or do you mean I should do this in Python or Perl?
Old Jan 19th, 2007, 6:52 PM   #9
Arevos
Well, in Python it wouldn't be dissimilar. Perhaps:
from os import path
from urllib import urlopen
from urlparse import urlsplit
from BeautifulSoup import BeautifulSoup

savedir = ...

url = "http://bombingscience.com/graffitiforum/index.php?showtopic=4900&st=%d"

# Page offsets: st=0, 15, 30, ..., 525.
for i in range(0, 526, 15):
    soup = BeautifulSoup(urlopen(url % i))

    for img in soup.findall('img'):
        src = img['src']

        # Use the last component of the URL path as the local filename.
        relative_path = urlsplit(src)[2]
        filename = relative_path.split('/')[-1]

        # Download the image and write it out in binary mode.
        image = urlopen(src).read()
        open(path.join(savedir, filename), 'wb').write(image)
Just make sure you have the BeautifulSoup.py file in the same directory as your script.

Note that I haven't tried the above script in full. It probably needs some tweaking.
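The script above saves every image on each page, whereas the original request was for .jpg files only, without the website's own graphics. One possible way to filter is sketched below, under clearly stated assumptions: it uses the Python 3 standard library (`html.parser`, `urllib.parse`) rather than BeautifulSoup and the Python 2 modules in the post; the names `ImgCollector` and `wanted_images` are invented for illustration; and it guesses that any image served from the forum's own host is a "website image". It makes no attempt to detect signatures, since that depends on the forum's markup, which isn't shown in this thread.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urljoin, urlsplit


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def wanted_images(html, page_url, extensions=(".jpg", ".jpeg")):
    """Return (absolute_url, filename) pairs for images worth saving.

    Assumption: anything served from the same host as the page is a
    "website image" (logo, smiley, avatar) and is skipped.
    """
    collector = ImgCollector()
    collector.feed(html)

    forum_host = urlsplit(page_url).netloc
    results = []
    for src in collector.sources:
        absolute = urljoin(page_url, src)   # resolve relative src values
        parts = urlsplit(absolute)
        filename = basename(parts.path)     # path only, query string ignored
        if parts.netloc == forum_host:
            continue                        # hosted by the forum itself
        if not filename.lower().endswith(extensions):
            continue                        # not a .jpg/.jpeg file
        results.append((absolute, filename))
    return results
```

Each returned pair could then be fetched with `urllib.request.urlopen` and written to `path.join(savedir, filename)`, as in the script above.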
Old Jan 19th, 2007, 8:27 PM   #10
bulio
Arevos, the script looks great, but I'm getting the error:

E:\Documents and Settings\Mark-James McDougall\Desktop\Script>grabber.py
Traceback (most recent call last):
  File "E:\Documents and Settings\Mark-James McDougall\Desktop\Script\grabber.py", line 13, in <module>
    for img in soup.findall('img'):
TypeError: 'NoneType' object is not callable
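
A likely cause, assuming the BeautifulSoup 3 module that the script imports: the method is spelled `findAll`, with a capital A, and BeautifulSoup treats any attribute it doesn't recognise as a lookup for a child tag of that name. `soup.findall` therefore searches the page for a `<findall>` tag, finds none, and evaluates to None, which the script then tries to call. Changing line 13 to `soup.findAll('img')` should get past this error (in the later bs4 package the preferred spelling is `find_all`). The class below is not BeautifulSoup; `FakeSoup` is a made-up stand-in that only imitates that attribute fallback, so the failure can be reproduced without installing anything.

```python
class FakeSoup:
    """Hypothetical stand-in that mimics BeautifulSoup's attribute fallback."""

    def __init__(self, tags):
        # `tags` maps a tag name to a list of attribute dicts, e.g.
        # {"img": [{"src": "a.jpg"}]}.
        self._tags = tags

    def findAll(self, name):
        """Return every tag with the given name (empty list if none)."""
        return self._tags.get(name, [])

    def find(self, name):
        """Return the first tag with the given name, or None."""
        matches = self.findAll(name)
        return matches[0] if matches else None

    def __getattr__(self, name):
        # Only reached for attributes that do not otherwise exist.
        if name.startswith("_"):
            raise AttributeError(name)
        # soup.findall is treated like soup.find("findall").
        return self.find(name)


soup = FakeSoup({"img": [{"src": "a.jpg"}, {"src": "b.jpg"}]})

try:
    soup.findall("img")        # wrong capitalisation: None gets called
except TypeError as exc:
    error_message = str(exc)

sources = [img["src"] for img in soup.findAll("img")]  # correct spelling
```

With the misspelled name the call fails with the same kind of TypeError as in the traceback, while the correctly spelled method returns both image tags.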
DaniWeb IT Discussion Community
All times are GMT -5.