Old Oct 17th, 2006, 4:08 AM   #21
Arevos
Quote:
Originally Posted by nytrokiss View Post
Sure. I want it to download a web page and grab all the product URLs on the page. If there is a second page, I want it to go to the second page and grab them there as well, and then return a list of URLs. My issue is that it won't return when I want it to!
I'd advise trying Beautiful Soup for this sort of thing. It's an HTML parser library, and it's available as a single .py file.

For instance, if you want to print a list of all links (the href and the text of the link) on the page:
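# Python 2, with BeautifulSoup.py on the import path
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen(some_url)
soup = BeautifulSoup(html)
links = soup.fetch('a')
for link in links:
    # skip anchors that have no href attribute
    if "href" in link.attrMap:
        print link['href'], ":", link.string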
(The if statement checks whether the link tag has an href attribute, since not all link tags do. If you're fairly sure the page will never contain a link without an href attribute, you can skip the check.)
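
To see why that check matters, here is a minimal sketch of the same idea in Python 3. Note that it uses the standard library's html.parser instead of Beautiful Soup (a swapped-in technique, so it runs without installing anything). The LinkCollector class and the SAMPLE page are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags that have an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if it has one
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; named anchors have no href
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample page: the first anchor is a named target with no href.
SAMPLE = '<a name="top"></a><a href="/p/1">Widget</a> <a href="/p/2">Gadget</a>'

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
for href, text in collector.links:
    print(href, ":", text)
```

The named anchor is silently skipped, so only the two links that actually point somewhere are printed.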

Anyway, maybe for what you're doing, you could do something like:
from urlparse import urljoin

def get_links(url):
    soup = BeautifulSoup(urlopen(url))
    links = [a for a in soup.fetch('a') if "href" in a.attrMap]
    urls = [a["href"] for a in links if a.string != "Next"]
    # named next_links so the built-in next() is not shadowed
    next_links = [a for a in links if a.string == "Next"]
    if next_links:
        # the href may be relative, so resolve it against the current page
        urls += get_links(urljoin(url, next_links[0]["href"]))
    return urls
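
One thing to watch with the recursive version: if a "Next" link ever points back at a page you've already visited, the function never returns. Here's a rough Python 3 sketch of an iterative variant that remembers visited pages and resolves relative hrefs with urljoin. The page_links argument is a hypothetical callable standing in for whatever fetches and parses a page, and FAKE_SITE is made-up data so the example runs without a network connection.

```python
from urllib.parse import urljoin


def collect_urls(start_url, page_links, next_text="Next"):
    """Follow "Next" links from start_url and return every other href found.

    page_links(url) is a hypothetical callable returning (href, text) pairs.
    """
    urls = []
    seen = set()    # pages already visited, to guarantee termination
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)
        next_url = None
        for href, text in page_links(url):
            absolute = urljoin(url, href)   # resolve relative hrefs
            if text.strip() == next_text:
                next_url = absolute
            else:
                urls.append(absolute)
        url = next_url
    return urls


# Made-up two-page site; page 2's "Next" link loops back to page 1.
FAKE_SITE = {
    "http://example.com/list?page=1": [("/p/1", "Widget"), ("list?page=2", "Next")],
    "http://example.com/list?page=2": [("/p/2", "Gadget"), ("list?page=1", "Next")],
}

result = collect_urls("http://example.com/list?page=1", FAKE_SITE.__getitem__)
print(result)
```

Because the second page links back to the first, a version without the seen set would loop forever here. With it, the function stops as soon as it is asked to revisit a page.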