Programming Forums
Old Oct 30th, 2006, 4:10 AM   #1
public2
Newbie
 
Join Date: Aug 2006
Posts: 13
Rep Power: 0 public2 is on a distinguished road
urllib and save pictures

Hi.

This is my first post, and it's about an assignment I have at my college.

An overall description:
We have to write a function that takes one argument, the URL. Then we have to search the HTML code for any pictures, and to do that I will search for <img and src tags.

I can do all of that, but then we have to save the pictures locally on my hard drive and make a collage with all the pictures in it. My obstacle right now is the saving part.

For testing the script, I'm using this code:
def getImageUrl(urlstring):
  import urllib
  connection=urllib.urlopen(urlstring)
  picture = connection.read()
  connection.close()
  curloc = picture.find("img")
  if curloc <> -1:
    picloc = picture.find("<src", curloc)
    picstart = picture.rfind(">",0,picloc)
    #writefile.open(picture,"wt")
    pic = open(picture, 'wb').read
    picture = urllib.urlopen(urlstring)
    pic.write(picture)
    pic.close()
  else:
    print "There is no pictures in this URL"

I know my code isn't optimized, but I just can't seem to find the function that will save my pictures...

Thanks in advance.
Greetings
Public2
Old Oct 30th, 2006, 4:38 AM   #2
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
You're on the right track, but there are three problems that I can see with your code. Firstly, you appear to be looking for a 'src' tag, when it's an attribute. Secondly, you're trying to open a file named picture, where picture is a variable containing your HTML page. Thirdly, you're not getting the URL of the image, you're getting the URL of the page again.

Whenever I'm doing any work with HTML in Python, I use Beautiful Soup. It's wonderfully easy to use, and comes as a single py file, so it's really rather good.

Using Beautiful Soup, your function might look like:
from urllib import urlopen
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

def downloadImagesFrom(urlstring):
    soup = BeautifulSoup(urlopen(urlstring))
    image_number = 1
    for img in soup.findAll("img"):
        if "src" in img.attrMap:
            image_url = urljoin(urlstring, img['src'])

            file = open(str(image_number), "wb")
            file.write(urlopen(image_url).read())
            file.close()

            image_number += 1
The above code just writes the images to numerically named files in the current directory. You may wish to do something more sophisticated.
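As one possible step towards "something more sophisticated", here is a minimal sketch that builds a file name from the image URL, so the saved files keep their extension. The helper name filename_for and the example.com URLs are made up for illustration, and the try/except import is only there so the snippet runs on Python 3 as well as the Python 2 used in this thread:

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def filename_for(image_url, image_number):
    """Hypothetical helper: return a name such as '003_logo.gif'."""
    path = urlparse(image_url).path   # path only, without query string or fragment
    base = posixpath.basename(path)   # last path component, may be empty
    if not base:
        base = "image"                # fallback when the URL ends in "/"
    # The number prefix keeps names unique even if two URLs share a base name.
    return "%03d_%s" % (image_number, base)


print(filename_for("http://example.com/pics/logo.gif?size=2", 3))
```

This only tidies the name. It does not check that the result is a legal file name on every operating system, so treat it as a starting point.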
Old Oct 30th, 2006, 10:27 AM   #3
public2
Hey Arevos.

Thanks for your answer. I just have one problem: I don't think we are allowed to import external code like BeautifulSoup.

My code can detect that there are pictures in the HTML code, but I just can't seem to save them to my hard drive. I'll try to make the code work, but it is more difficult than I thought it would be.
Old Oct 30th, 2006, 10:51 AM   #4
Arevos
If you've already got the "src" attribute, you can just use the innermost block of the previous code:
from urllib import urlopen
from urlparse import urljoin

def saveImage(pageUrl, src, savePath):
    # Resolve a relative src against the URL of the page it came from.
    image_url = urljoin(pageUrl, src)

    file = open(savePath, "wb")
    file.write(urlopen(image_url).read())
    file.close()
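The urljoin call is what turns a relative src into a URL that urlopen can fetch. A small sketch of how it resolves a few kinds of src values. The page URL and the src values are invented for illustration, and the try/except import only exists so the snippet runs on Python 3 as well as the Python 2 used in this thread:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

page_url = "http://example.com/gallery/index.html"  # hypothetical page

# Hypothetical src values: relative, root-relative, parent-relative, absolute.
sources = [
    "pics/a.jpg",
    "/img/b.gif",
    "../d.jpg",
    "http://other.example.org/c.png",
]

resolved = [urljoin(page_url, src) for src in sources]
for src, url in zip(sources, resolved):
    print("%s -> %s" % (src, url))
```

An absolute src comes back unchanged, so it is safe to pass every src through urljoin without checking first.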
By the way, you seem to be using str.find when re.findall might be a better choice:
import re
from urllib import urlopen

imageRe = re.compile('<\s*img.*?src\s*=\s*"(.*?)".*?>', re.IGNORECASE)

def findImages(pageUrl):
    return imageRe.findall(urlopen(pageUrl).read())
Regular expressions are rather useful for parsing text, and are included in the Python standard library.
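To see what such a pattern returns without touching the network, it can be run against a string. This is only a sketch: the sample HTML is invented, and the names image_re and sample are not from the code above.

```python
import re

# Same pattern as above, written as a raw string.
image_re = re.compile(r'<\s*img.*?src\s*=\s*"(.*?)".*?>', re.IGNORECASE)

# Invented sample markup: mixed case, extra attributes, spaces around "=".
sample = ('<p><IMG alt="a" SRC="pics/a.jpg">'
          '<img src = "http://example.com/b.gif" width="3"></p>')

found = image_re.findall(sample)
print(found)  # ['pics/a.jpg', 'http://example.com/b.gif']
```

With a single capture group, findall returns just the captured src values. The pattern only recognises double-quoted values, so a src written with single quotes or no quotes would need a wider pattern.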
Old Oct 31st, 2006, 3:06 PM   #5
public2
Hey again.

I finally finished my assignment, and thought I would post the code here. It turned out that we had to write most of the code in Jython, so some of the modules couldn't be used, but I managed anyway. Here is the complete code:

import urllib
from urlparse import urljoin
import random

def makeCollageFromUrl(urlString):    
    listOfImages = getImagesUrl(urlString)
    imageNames = []
    for imageUrl in listOfImages:
        filename = saveImage(imageUrl)
        imageNames.append(filename)
    width = 640
    height = 480
    picture = makeEmptyPicture(width,height)
    for imageName in imageNames:
        p = makePicture(imageName)
        if p.getWidth()<width and p.getHeight()<height:
            copyPictureToPicture(p,picture,random.randint(0,width-p.getWidth()),random.randint(0,height-p.getHeight()),0.5)
    
    picture.show()
    writePictureTo(picture,r"C:\HTMLCollage.jpg")

def getImagesUrl(urlString):
  connection=urllib.urlopen(urlString)
  getPictures = connection.read()
  connection.close()
  executeIndex = 0
  PicHTMLlist = []
  while getPictures.find("<img",executeIndex) <> -1:
    currentPicIndex = getPictures.find("<img",executeIndex)
    currentSrcIndex = getPictures.find("src=",currentPicIndex)
    nxtIndex = getPictures.find(">",currentSrcIndex)
    executeIndex = nxtIndex
    if getPictures.find("http",currentSrcIndex,nxtIndex)!=-1:
        end = getPictures.find(" ",currentSrcIndex,nxtIndex)
        currentPic = getPictures[currentSrcIndex+4:end]
        currentPic = currentPic.replace('"'," ")
        currentPic = currentPic.replace("'"," ")
        repCurrPic = currentPic.lstrip()
        repCurrPic = repCurrPic.rstrip()
        if repCurrPic.rfind(".jpg") != -1 or repCurrPic.rfind(".gif") != -1:
            PicHTMLlist.append(repCurrPic)
  return PicHTMLlist
  
def saveImage(urlString):
    connection = urllib.urlopen(urlString)
    getPictures = connection.read()
    connection.close()
    sepIndex = urlString.rfind("/")
    filnavn = urlString[(sepIndex+1):]
    file = open(filnavn,"wb")
    file.write(getPictures)
    file.close()
    return filnavn

def copyPictureToPicture(sourcePic,targetPic,offsetX,offsetY, blend):
  for x in range(1,sourcePic.getWidth()+1):
    for y in range(1,sourcePic.getHeight()+1):
      color = sourcePic.getPixel(x,y).getColor()
      targetPixel = targetPic.getPixel(x+offsetX,y+offsetY)
      targetColor = targetPixel.getColor()
      targetPixel.setRed(int(color.getRed()*blend+targetColor.getRed()*blend))
      targetPixel.setGreen(int(color.getGreen()*blend+targetColor.getGreen()*blend))
      targetPixel.setBlue(int(color.getBlue()*blend+targetColor.getBlue()*blend))
There might be a few words in Danish (filnavn means "filename"), but most of it is in English. Thanks for your help, Arevos.

Have a great evening.

Greetings Public2

Last edited by public2; Oct 31st, 2006 at 4:03 PM.
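A note for anyone reusing the str.find approach from getImagesUrl: that loop looks for the next "src=" anywhere after "<img", so an <img> tag without a src appears to get paired with a later tag's src or to end the scan early. A minimal sketch of the same string-only technique that keeps each search inside one tag. The function name find_image_sources and the sample markup are made up, and it assumes plain ASCII markup with no ">" inside attribute values:

```python
def find_image_sources(html):
    """Return the src values of <img> tags using only str.find and slicing."""
    sources = []
    lowered = html.lower()  # assumes lowercasing keeps the length (true for ASCII)
    position = 0
    while True:
        tag_start = lowered.find("<img", position)
        if tag_start == -1:
            break                      # no more <img> tags
        tag_end = lowered.find(">", tag_start)
        if tag_end == -1:
            break                      # unterminated tag: stop scanning
        position = tag_end + 1         # always move past the current tag
        src_at = lowered.find("src=", tag_start, tag_end)
        if src_at == -1:
            continue                   # <img> without src: skip it
        value = html[src_at + 4:tag_end].strip()
        if value[:1] in ('"', "'"):
            # Quoted value: take everything up to the matching quote.
            closing = value.find(value[0], 1)
            value = value[1:closing] if closing != -1 else value[1:]
        else:
            # Unquoted value: take everything up to the first whitespace.
            parts = value.split()
            value = parts[0] if parts else ""
        if value:
            sources.append(value)
    return sources


sample = ('<IMG alt="x"><img src="a.jpg" width=1>'
          "<img src='b.gif'/><img src=c.png>")
print(find_image_sources(sample))  # ['a.jpg', 'b.gif', 'c.png']
```

Unlike getImagesUrl, this returns relative src values too, so the results would still need urljoin before downloading. Filtering by extension is left out.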
DaniWeb IT Discussion Community
All times are GMT -5. The time now is 10:16 PM.
