Programming Forums
Old May 17th, 2006, 4:20 PM   #41
zem52887
heh no not looking to pawn the work off on others, gonna try and learn some programming and impress the boss while I'm at it.
Old May 17th, 2006, 4:28 PM   #42
Arevos
Oh, I see; I misunderstood you. The "links" variable already contains all of the links as a list. A list contains multiple values, and you can use a for-loop to apply code to each element in a list:
for link in links:
	html = urlopen(link['href'])
	# do things to the HTML from the link
However, if you go down that route, you're going to end up with a huge set of for-loops. This is where functions come in. Take a look at the following code:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

def get_industry_urls(url):
	soup  = BeautifulSoup(urlopen(url))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links]

urls = get_industry_urls("http://biz.yahoo.com/ic/ind_index.html")
Here I define a function called get_industry_urls. It returns the URLs of all the industry pages linked from the page at the URL you pass in.

I'll take you through this line by line:
def get_industry_urls(url):
This is the start of the function. The "def" keyword tells Python we're defining a function. The "get_industry_urls" is the name of the function. The "(url)" part says that this function takes one argument called "url". An argument is a value that is passed into a function.
	soup  = BeautifulSoup(urlopen(url))
Here, we open the URL passed into the function, and use this to create a new BeautifulSoup object, which is called "soup".
	links = soup.fetch("table")[7].fetch("a")
This line gets all the "a" tags in the table at index 7 (the eighth table on the page, since list indexes start at 0), and gives this list of tags the name "links".
	return [a['href'] for a in links]
This code takes the "href" attribute from each link, and constructs a new list. This list is returned from the function.
urls = get_industry_urls("http://biz.yahoo.com/ic/ind_index.html")
This is how the function is used. We pass in a single argument ("http://biz.yahoo.com/ic/ind_index.html"), and out comes a list of urls from the href attribute of each link.
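For readers on Python 3, where urllib2 and the old BeautifulSoup import are not available, here is a minimal sketch of the same idea using only the standard library's html.parser. It is not the code from the post above: the class TableLinkParser and the function get_links_in_table are hypothetical names, it works on an HTML string rather than a URL, and unlike a['href'] it quietly skips "a" tags that have no href attribute.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect href values of <a> tags inside the table at a given index."""

    def __init__(self, table_index):
        super().__init__()
        self.table_index = table_index
        self.tables_seen = 0   # number of <table> start tags seen so far
        self.open_tables = []  # indexes of the tables we are currently inside
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Tables are numbered in document order, starting at 0.
            self.open_tables.append(self.tables_seen)
            self.tables_seen += 1
        elif tag == "a" and self.table_index in self.open_tables:
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()


def get_links_in_table(html, table_index):
    """Return the href of every link found inside the chosen table."""
    parser = TableLinkParser(table_index)
    parser.feed(html)
    return parser.hrefs


# A tiny made-up page with two tables; we want the links in the second one.
sample = (
    "<table><tr><td><a href='skip.html'>skip</a></td></tr></table>"
    "<table><tr><td><a href='a.html'>A</a> <a href='b.html'>B</a></td></tr></table>"
)
print(get_links_in_table(sample, 1))
```

To use it on a live page you would first download the HTML as text (for example with urllib.request.urlopen) and pass it in along with the table index, which is 7 in the post above.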

Read through the Python tutorial and get a feel for the language. I leave it up to you to create the "get_company_urls" and "get_company_information" functions, but if you get stuck, post up here again. However, try to refer to the tutorial and the Python reference as much as you can. You can usually find answers in there faster than a reply in a forum.
Old May 17th, 2006, 4:30 PM   #43
zem52887
deal. thank you so much.
Old May 17th, 2006, 4:35 PM   #44
hervens48
hey, what about my offer
Old May 19th, 2006, 8:55 AM   #45
zem52887
I was lucky enough to be able to take Thursday off, but now I'm back and trying to learn this again... more to come
Old May 19th, 2006, 9:46 AM   #46
zem52887
okay so now this thread is in the proper section. For some reason (perhaps because this thread has been moved) I cannot edit my original post. Anyone who is willing should feel free to help with this task. If you're trying to learn Python (like me), I think that completing this task together will teach us all a lot more than just "print" and the really basic commands. Arevos has been nice enough to pretty much provide a walkthrough, so we have a great starting point. Rather than having him do all the work, I think it'll be helpful to work on this together and try to answer each other's questions about the code. So if anyone has any questions or wants to add anything, feel free; it'll be greatly appreciated.
Old May 19th, 2006, 9:50 AM   #47
DaWei
Posts are only editable for 30 minutes. This prevents people who don't know that edits should add thoughts (fixing typos is another matter) from destroying the meaning of a thread by deleting or totally changing the content of their posts.
Old May 19th, 2006, 9:52 AM   #48
zem52887
fair enough... congrats on 3500 posts

on another note: does anyone know of any good Python/BeautifulSoup tutorials that are more applicable to the type of stuff we're doing here with HTML and whatnot?
Old May 19th, 2006, 10:20 AM   #49
zem52887
okay, so I took an individual industry page and used the above code to parse it and find which table contains the "company index" link. The link was in tables[9]. To make sure it is always in tables[9], I tried a few different industries, and it held for all of them. So now I know where the company index link is, and I have to write a function that fetches this link from table 9 for all of the industries... I'm gonna try and write something and I'd appreciate it if anyone could comment on what I do wrong.

okay my problem is that I don't know how to fetch from all the links. In the above code posted by Arevos, he gets all the industry links by parsing http://biz.yahoo.com/ic/ind_index.html. I know which table contains the "company index" link; I just don't know how to fetch it for every single industry, because it has to be fetched from multiple pages.

additionally when we use 'href' to fetch in the above code, we get a list of links that look like this:
<a href="http://us.rd.yahoo.com/finance/industry/industryindex/912/*http://biz.yahoo.com/ic/912.html">

I'm not really sure what the first part of the link is, or whether we need it at all. Is it possible to fetch only the links that contain "biz"? If you try to access the us.rd.yahoo link, the page doesn't exist, so I'm not sure if it matters at all, but the links we need are the ones that contain "biz". That would give us a full list of the industries without any nonexistent links.

sorry for being such a noob, this is probably completely irrelevant.

Last edited by zem52887; May 19th, 2006 at 10:46 AM.
Old May 19th, 2006, 11:43 AM   #50
Arevos
This is where loops come in. Loops are there for instances in which you need repetition. In your case, you want to find the company index link for each industry.

However, you're approaching the problem in the wrong direction. You're taking a linear approach to the problem, i.e. first find the industries, then find the company index for each industry, etc. This isn't the best way of tackling the problem. The skill of programming is to take in the problem as a whole, and divide it into smaller and smaller pieces.

I realise I'm being rather vague, so I'll explain further with an example.

You already have a function that can get a list of URLs from the Yahoo! business industry listing (at http://biz.yahoo.com/ic/ind_index.html). What you now need is the company index for each industry URL. This is a form of repetition; you need to do something (getting the company index), for a list of items (the industry URLs). Whenever you need to repeat a task for many items, a for loop is needed.

For loops are fairly straightforward to get to grips with. Take a look at the following piece of code:
for number in [5, 4, 3, 2, 1]:
	print "There are", number, "green bottles, sitting on the wall."
The output from this piece of code is:
There are 5 green bottles, sitting on the wall.
There are 4 green bottles, sitting on the wall.
There are 3 green bottles, sitting on the wall.
There are 2 green bottles, sitting on the wall.
There are 1 green bottles, sitting on the wall.
Hopefully, you can see how the same piece of code has been applied to each number in the list.
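A for loop can also build up a new list as it goes, which is what the list comprehension in get_industry_urls does in a single line. Here is a small sketch of the two forms side by side. It uses a made-up list of dictionaries in place of real BeautifulSoup tags, and the example.com addresses are placeholders.

```python
# Stand-ins for parsed <a> tags; each one exposes its attributes by key.
links = [
    {"href": "http://example.com/a.html"},
    {"href": "http://example.com/b.html"},
]

# Long form: start with an empty list and append inside the loop.
urls_from_loop = []
for a in links:
    urls_from_loop.append(a["href"])

# Short form: a list comprehension that builds the same list.
urls_from_comprehension = [a["href"] for a in links]

print(urls_from_loop == urls_from_comprehension)
```

Both forms visit the links in order, so the two lists come out identical.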

Now take a look at this piece of code:
industry_page = "http://biz.yahoo.com/ic/ind_index.html"

for industry_url in get_industry_urls(industry_page):
	company_index = get_company_index(industry_url)

	for company_url in get_company_urls(company_index):
		print get_company_data(company_url)
In programming, loops like this are called iteration. You'll notice that there are two for-loops, one inside the other.

This code is a little difficult to explain, so I'll take it bit by bit. Here is the first line of the outer loop:
for industry_url in get_industry_urls(industry_page):
Here, we are telling Python that we want to do something for each of the industry URLs we get from the industry page. The URLs are just like the numbers in the previous example; they're just a list of values. We want to take each value in turn, and do something with it.
	company_index = get_company_index(industry_url)
This next piece of code finds the company index URL in the industry page. It uses the "get_company_index" function. This function has not been written, but it should be clear what it does from its name.
	for company_url in get_company_urls(company_index):
Now that we have the company index URL, we can use this to get a list of URLs for individual companies. The get_company_urls function is used for this, which again, has not yet been written.
		print get_company_data(company_url)
Finally, we print out the company data. Again, the get_company_data function has not been written. Also, in the finished software, you will want it to output to a file, rather than print, but this is a minor concern.
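To see the skeleton run end to end before the real functions exist, one option is to fill them in with stand-ins that return canned values. In the sketch below every function body, every example.com address, and the helper write_company_rows are placeholders, not the real scraping code. It is written for Python 3 and writes CSV rows to a file-like object instead of printing.

```python
import csv
import io


def get_industry_urls(industry_page):
    # Placeholder: the real version parses industry_page for links.
    return ["http://example.com/ic/1.html", "http://example.com/ic/2.html"]


def get_company_index(industry_url):
    # Placeholder: the real version finds the "company index" link.
    return industry_url.replace(".html", "_index.html")


def get_company_urls(company_index):
    # Placeholder: the real version parses the company index page.
    return [company_index + "?company=a", company_index + "?company=b"]


def get_company_data(company_url):
    # Placeholder: the real version scrapes the company page.
    return ["Company at " + company_url, company_url]


def write_company_rows(industry_page, out_file):
    """Run the nested loops and write one CSV row per company."""
    writer = csv.writer(out_file)
    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        for company_url in get_company_urls(company_index):
            writer.writerow(get_company_data(company_url))


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_company_rows("http://example.com/ic/ind_index.html", buffer)
print(buffer.getvalue())
```

To write a real file, pass in open("companies.csv", "w", newline="") instead of the StringIO buffer. Once the loop structure works with the stand-ins, each placeholder can be swapped for real parsing code one at a time.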


I cannot stress too much how important it is to understand how to use loops in Python to handle tasks that are repetitive. The above code should provide a template, a skeleton to hang your code off. When you understand what the loop is doing, you can then define the currently nonexistent functions referenced by the loop.

If I'm not being clear, just say which parts you're having difficulty with. Unfortunately, it's often hard for a person with some experience to understand the problems an inexperienced beginner has. Programming is very hard, so don't get frustrated. Indeed, you're doing very well for a beginner so far.

By the way, links like:
"http://us.rd.yahoo.com/finance/industry/industryindex/912/*http://biz.yahoo.com/ic/912.html"
are perfectly valid. Try sticking it in your browser if you're unsure.
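If you would still rather work with the plain biz.yahoo.com addresses, one possible approach is to cut the link at the "*". This sketch assumes every redirect link has the same shape as the single example in this thread (a tracking prefix, then "*", then the real URL), which has not been checked against the whole page. The name direct_url is a hypothetical helper.

```python
def direct_url(href):
    """Return the part of href after the first '*', or href itself if there is none."""
    prefix, star, target = href.partition("*")
    return target if star else href


hrefs = [
    "http://us.rd.yahoo.com/finance/industry/industryindex/912/*http://biz.yahoo.com/ic/912.html",
    "http://biz.yahoo.com/ic/912.html",
]

# Keep only links that mention biz.yahoo.com, and strip any redirect prefix.
biz_urls = [direct_url(h) for h in hrefs if "biz.yahoo.com" in h]
print(biz_urls)
```

Links without a "*" pass through unchanged, so the helper is safe to apply to every href in the list.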
DaniWeb IT Discussion Community
All times are GMT -5. The time now is 3:45 AM.
