Programming Forums
Old May 19th, 2006, 2:15 PM   #61
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
One step at a time. There are a couple of lines in your code that suggest that you don't understand the meaning behind the code.

I think the most important concept you're missing is scope. In Python, some variables are local. This means that they don't exist outside of where they are created. They exist only in a small part of the program.

Take the function below:
def get_industry_urls(industry_page):
	soup  = BeautifulSoup(urlopen(industry_page))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links]
The variable called "soup" is declared inside the function. In Python, you declare a variable by assigning it a value with the "=" operator. Because "soup" is defined inside the function, it is local to that function. It does not exist outside of the boundaries in which it was created.
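
Here is a minimal sketch of what "local" means in practice. The names make_greeting and message are made up for illustration and do not come from your program; the code is written so that it should run on either Python 2 or Python 3.

```python
def make_greeting(name):
    # "message" is created inside the function, so it is local to it
    message = "Hello, " + name
    return message

greeting = make_greeting("zem")    # the return value is how the result gets out

try:
    message                        # look up the local name from outside the function
    leaked = True
except NameError:
    leaked = False                 # the name does not exist out here

print(greeting)
print(leaked)
```

The lookup of message outside the function fails with a NameError, which is Python's way of saying that no variable with that name exists at that point.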

The same thing happens with "industry_page". When this function is called, there is not one, but two variables called "industry_page". One exists outside the function, and is global. One exists inside the function, and is local. If this sounds confusing, that's because it is. That's why it makes sense to give the local industry_page a different name:
def get_industry_urls(url):
	soup  = BeautifulSoup(urlopen(url))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links]
Try to think of functions as self-contained pieces of code. Functions should avoid affecting variables outside, except through the "return" statement.
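
As a rough sketch of the "two variables with the same name" situation, the example below uses a made-up function called shorten and a placeholder address. Neither is part of your program; they are only there to show that changing the local variable leaves the global one alone.

```python
url = "http://example.com/outside"      # global variable (placeholder address)

def shorten(url):
    # this "url" is the parameter: a separate, local variable
    url = url.replace("http://", "")    # rebinds the local name only
    return url

result = shorten(url)

print(result)   # the local value, handed back through "return"
print(url)      # the global variable, unchanged by the call
```

The only thing that crosses the boundary of the function here is the return value, which is why the result has to be stored in a variable by the caller.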

Because of this, the line below does nothing in your program. You can remove it without altering your program's flow:
industry_url = get_industry_urls(industry_page)
Going back to functions: if you want to test the function you have made, create a new text file with .py as the extension and put some test code in it, like so:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        company_links = soup.fetch("table")[11].fetch("a")[3]
        return [a['href'] for a in company_links]

print get_company_index("http://biz.yahoo.com/ic/112.html")
raw_input("Press enter to continue...")
When this piece of code works, then you know the function works, and you can put it back into your main program.
Old May 19th, 2006, 2:19 PM   #62
zem52887
Hobbyist Programmer
 
Join Date: May 2006
Posts: 127
Rep Power: 3 zem52887 is on a distinguished road
so that piece of code will work in isolation?
Old May 19th, 2006, 2:19 PM   #63
Arevos
Also, think about what I said before. Do you really need a list comprehension, if company_links contains a single value?

And think about what the function is doing. "get_company_index" gets the company index from one page. The for-loop that calls it handles the repetition.

However, you're almost there, and you're making progress. Programming is a hard skill to get, but once you have the basics, once you get a feel for the "zen", if you like, things suddenly become a lot clearer.

The problem is getting that initial understanding, which is terribly hard.
Old May 19th, 2006, 2:21 PM   #64
Arevos
Quote:
Originally Posted by zem52887
so that piece of code will work in isolation?
Exactly! That's what a function is (or should be). That's why functions are so powerful, because they're not tied down to a single program.

Think of functions like tools. A chainsaw can be used for many purposes, many times over. A bottle-opener is more specific, but it too can be used in many locations.
Old May 19th, 2006, 2:23 PM   #65
zem52887
okay I see where you're going... I'm still a bit confused about what the purpose of the return function is. I guess I was trying to replicate your initial function because that's the only one that has worked thus far, which included a list comprehension, I'm going to look at the code some more.

going back to the function, if we replace the actual hyperlink with a value such as "industry_page" then it won't work, correct, we need to include that definition in order for it to understand what we're referring to... (basic but I'm just checking)
Old May 19th, 2006, 2:52 PM   #66
Arevos
Quote:
Originally Posted by zem52887
okay I see where you're going... I'm still a bit confused about what the purpose of the return function is.
The return keyword tells the function what to output. There's an example below that should make it clear. Notice that they are identical except for the location of the "return":
def foo():
	return 1 + 2
	2 + 3

def bar():
	1 + 2
	return 2 + 3

x = foo()	# x will be 3
y = bar()	# y will be 5
In foo(), the function outputs (or returns) 1 + 2 (which is 3), whilst bar() returns 2 + 3 (which is 5, of course). In foo(), the line after the return never runs, because a function stops as soon as it returns.

Functions take in arguments, and return a value. The arguments are the input, the return value is the output.

(Incidentally, functions can have no arguments, and they can have no return value. However, most functions have either a return value or arguments, and often both.)
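
To make those cases concrete, here is a small sketch. The function names add, shout and early are made up for this example, and the code is written so that it should run on Python 2 or 3.

```python
def add(a, b):
    return a + b          # arguments in, return value out

def shout():
    print("no arguments, no return statement")

def early():
    return "first"
    return "second"       # never reached: the function stopped at the first return

total = add(1, 2)
nothing = shout()         # a function without "return" hands back None
which = early()

print(total)
print(nothing)
print(which)
```

A function with no return statement still gives something back to the caller: the special value None.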

Quote:
Originally Posted by zem52887
I guess I was trying to replicate your initial function because that's the only one that has worked thus far, which included a list comprehension, I'm going to look at the code some more.
Looking at other people's code is a good way to learn, but not to copy. At the end of the day you have to understand what's going on, otherwise you'll just get into a mess.

Quote:
Originally Posted by zem52887
going back to the function, if we replace the actual hyperlink with a value such as "industry_page" then it won't work, correct, we need to include that definition in order for it to understand what we're referring to... (basic but I'm just checking)
If I'm understanding you correctly, then yes. If you pass cat into a function, then cat needs to be defined.

However, the exception to this is the argument names in a function definition. If I have the function:
def double(x):
	return x * 2
Then x doesn't have to be defined. x is defined when you call the function:
print double(6)
In this case, x becomes 6 when the function is called.
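
A slightly longer sketch of the same idea follows. double is the function from above; the variable names cats and dogs are made up for illustration.

```python
def double(x):
    return x * 2

cats = 4                     # defined before it is passed in
a = double(6)                # inside this call, x is 6
b = double(cats)             # inside this call, x is 4; the outside name does not matter
c = double("ab")             # x can be a string too: "ab" * 2 is "abab"

try:
    double(dogs)             # "dogs" was never defined, so this fails before the call happens
    failed = False
except NameError:
    failed = True

print(a)
print(b)
print(c)
print(failed)
```

So the value you pass in has to exist, but the name x itself only comes into being when the function is called.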
Old May 19th, 2006, 2:58 PM   #67
zem52887
okay, so I tried to get the basic framework of how this works (I haven't figured out exactly which table etc each link is contained) but how does this look:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

industry_page = "http://biz.yahoo.com/ic/ind_index.html"

def get_industry_urls(industry_page):
	soup  = BeautifulSoup(urlopen(industry_page))
	links = soup.fetch("table")[7].fetch("a")
	return [a['href'] for a in links]

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[10].fetch("a")[3] #not sure which table, looks like the 4th link though  
        return [a for a in index_link]

#def get_company_urls(company_index):
        #soup = BeautifulSoup(urlopen(company_index))
        #urls = soup.fetch("table")[12].fetch("a") #not sure of this table either       
        #return[a for a in urls] 

#def get_company_data(company_url)
        #soup = BeautifulSoup(urlopen(company_url))
        #data[0] = soup.fetch("table")[?] #figure out these tables
        #data[1] = soup.fetch("table")[?]
        #return data 
        
#for industry_url in get_industry_urls(industry_page):
    	#company_index = get_company_index(industry_url)

	#for company_url in get_company_urls(company_index):
            #print get_company_data(company_url) #(well output but you know...)

edit: i'm not sure if this makes sense - does the first for loop create the variable industry_url and store a value into it to be passed to get_company_index?

Last edited by zem52887; May 19th, 2006 at 3:13 PM.
Old May 19th, 2006, 3:15 PM   #68
Arevos
Quote:
Originally Posted by zem52887
edit: i'm not sure if this makes sense - does the first for loop create the variable industry_url and store a value into it to be passed to get_company_index?
Essentially, yes. Take the following example code:
numbers = [1, 2, 3]

for n in numbers:
	print 2 * n
The above code is equivalent to the following:
numbers = [1, 2, 3]

n = numbers[0]
print 2 * n
n = numbers[1]
print 2 * n
n = numbers[2]
print 2 * n
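
The same pattern applies when the loop hands each value to a function. In the sketch below, get_names and describe are made-up stand-ins (they do not fetch anything from the web); they only show how the loop variable gets a new value on every pass and is then passed along.

```python
def get_names():
    # hypothetical stand-in for a function that returns a list
    return ["oil", "gas", "coal"]

def describe(name):
    # hypothetical stand-in for a function that handles one item
    return "industry: " + name

results = []
for name in get_names():              # "name" is assigned each item in turn
    results.append(describe(name))    # and passed to the function, one call per item

print(results)
```

The function that handles one item never needs to know it is being called from a loop.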
I'll also give you a quick runthrough of list comprehensions:

Let's say you have a number, say 5, and you want to double it. You could do something like this:
x = 5 * 2
That's fair enough, but what if you wanted to double every number in a list? That's where list comprehensions come in:
numbers = [1, 2, 3, 4, 5]
doubled = [x * 2 for x in numbers]

# doubled equals [2, 4, 6, 8, 10]
With a list comprehension, you can do the same thing to each item in a list. It's a lot like a for-loop; indeed, the two are very much related. You can use a for loop to do the same thing as a list comprehension:
numbers = [1, 2, 3, 4, 5]
doubled = []	# start with a blank list
for x in numbers:
	doubled.append(x * 2)	# add x * 2 onto the new "doubled" list
List comprehensions are just a shortcut.

The key thing is that they only apply to lists (or, more generally, to anything you can loop over). With get_industry_urls, a list of urls is returned: one for each industry. With get_company_index, only one url is returned: the url of the company index for a particular industry.
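
Here is a small sketch of the difference between working on a list and working on a single item. It uses plain strings and a placeholder address rather than anything from BeautifulSoup, so treat it as an illustration of the idea only.

```python
pages = ["a.html", "b.html"]        # a list: one item per page
one_page = "a.html"                 # a single item

full = ["http://example.com/" + p for p in pages]   # comprehension: one result per item
single = "http://example.com/" + one_page           # single item: no loop needed

# Looping over the single string instead walks through its characters
oops = [c for c in one_page]

print(full)
print(single)
print(oops)
```

Python will happily loop over a single value if it can, but the result is rarely what you wanted.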
Old May 19th, 2006, 3:39 PM   #69
zem52887
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup


industry_url = "http://biz.yahoo.com/ic/110.html"

def get_company_index(industry_url):
        soup  = BeautifulSoup(urlopen(industry_url))
        index_link = soup.fetch("table")[11].fetch("a")[2]
        return [a for a in index_link]

print get_company_index(industry_url)

it works! it works!

edit: kinda works... it's returning the name of the link, and not the link itself...

when I add in 'href' then it thinks that it's an index and gives me errors... I clearly am missing something but at least I know where company index is located, I just have to figure out how to get it to return the link itself, as opposed to the name of the link.
Old May 19th, 2006, 3:58 PM   #70
Arevos
You're using a list comprehension when you don't need one, because you're dealing with a single item and not a list!

Think about what the list comprehension in my get_industry_urls function did, and how you'd apply that to a single item.
DaniWeb IT Discussion Community
All times are GMT -5.
