Quote:
Originally Posted by nytrokiss
sure i want it to download a webpage grab all the product url's on a page if there is a second page i want it to go to the second page and grab them there also and then return a list of url's and my issue is that it won't return when i want it to!
I'd advise trying Beautiful Soup for this sort of thing. It's an HTML parser library, available as a single .py file.
For instance, if you want to print a list of all links (the href and the text of the link) on the page:
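from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen(some_url)  # some_url: address of the page to scan
soup = BeautifulSoup(html)
links = soup.fetch('a')  # every <a> tag on the page
for link in links:
    if "href" in link.attrMap:
        print link['href'], ":", link.string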
(The if statement checks whether the link tag has an href attribute, since not all link tags do. If you're fairly sure there will never be a link on the page without an href attribute, you can skip the check.)
Anyway, maybe for what you're doing, you could do something like:
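from urlparse import urljoin

def get_links(url):
    soup = BeautifulSoup(urlopen(url))
    links = [a for a in soup.fetch('a') if "href" in a.attrMap]
    urls = [a["href"] for a in links if a.string != "Next"]
    next_links = [a for a in links if a.string == "Next"]
    if next_links:
        # Resolve the "Next" href against the current page in case it is relative
        urls += get_links(urljoin(url, next_links[0]["href"]))
    return urls  # outside the if, so the function returns on the last page too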
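If you'd rather not depend on Beautiful Soup, here is a rough sketch of the same idea using only the Python 3 standard library. It is not the Beautiful Soup approach above: it swaps in html.parser, uses a loop instead of recursion, and keeps a set of visited pages so a "Next" link that points back to an earlier page can't trap it. The names LinkCollector, fetch, next_text and max_pages are made up for this example, and it assumes the pagination link's text is exactly "Next" and that pages are UTF-8.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def get_links(start_url, fetch=fetch, next_text="Next", max_pages=50):
    """Follow "Next" links from start_url and return every other href found."""
    urls, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        parser.close()
        next_url = None
        for href, text in parser.links:
            absolute = urljoin(url, href)  # resolve relative hrefs
            if text == next_text:
                next_url = next_url or absolute  # keep the first "Next" link
            else:
                urls.append(absolute)
        url = next_url
    return urls
```

Passing fetch in as a parameter lets you try it against canned HTML before pointing it at a real site, e.g. get_links(start, fetch=my_pages.__getitem__) where my_pages is a dict of URL to HTML. You would still need to filter the result down to product URLs, since this returns every link that isn't the "Next" link.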