#41 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
heh no not looking to pawn the work off on others, gonna try and learn some programming and impress the boss while I'm at it.
|
|
|
|
|
|
#42 |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Oh, I see; I misunderstood you. The "links" variable already contains all of the links as a list. A list contains multiple values, and you can use a for-loop to apply code to each element in a list:
    for link in links:
        html = urlopen(link['href'])
        # do things to the HTML from the link

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup

    def get_industry_urls(url):
        soup = BeautifulSoup(urlopen(url))
        links = soup.fetch("table")[7].fetch("a")
        return [a['href'] for a in links]

    urls = get_industry_urls("http://biz.yahoo.com/ic/ind_index.html")

I'll take you through this line by line:

    def get_industry_urls(url):
        soup = BeautifulSoup(urlopen(url))
        links = soup.fetch("table")[7].fetch("a")
        return [a['href'] for a in links]

    urls = get_industry_urls("http://biz.yahoo.com/ic/ind_index.html")

Read through the Python tutorial and get a feel for the language. I leave it up to you to create the "get_company_urls" and "get_company_information" functions, but if you get stuck, post up here again. However, try to refer to the tutorial and the Python reference as much as you can. You can usually find answers in there faster than a reply in a forum.
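For anyone following along on Python 3, where urllib2 and the old BeautifulSoup module are not available, here is a rough sketch of the same idea: pick one table by position and collect the href of every link inside it. It swaps BeautifulSoup for the standard library's html.parser and works on a small made-up HTML string instead of the live Yahoo! page, so the names TableLinkCollector, get_industry_urls_from_html and SAMPLE_HTML (including the 913 link) are illustrative assumptions, not part of the code in this thread.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside the table at `table_index`.

    Tables are counted in document order, nested tables included.
    """

    def __init__(self, table_index):
        super().__init__()
        self.table_index = table_index
        self.tables_seen = 0  # number of <table> start tags seen so far
        self.depth = 0        # nesting depth inside the target table (0 = outside)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth > 0:
                self.depth += 1   # a nested table inside the target table
            elif self.tables_seen == self.table_index:
                self.depth = 1    # entering the target table
            self.tables_seen += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


# Made-up page: table 0 is navigation, table 1 holds the industry links.
SAMPLE_HTML = """
<table><tr><td><a href="/nav">Navigation</a></td></tr></table>
<table>
  <tr><td><a href="http://biz.yahoo.com/ic/912.html">Industry 912</a></td></tr>
  <tr><td><a href="http://biz.yahoo.com/ic/913.html">Industry 913</a></td></tr>
</table>
"""


def get_industry_urls_from_html(html, table_index):
    parser = TableLinkCollector(table_index)
    parser.feed(html)
    return parser.hrefs


urls = get_industry_urls_from_html(SAMPLE_HTML, 1)
for url in urls:
    print(url)
```

On a real page you would download the HTML first and pass in whichever table index you found by inspection (7 in the code above); the sample here only has two tables.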
|
|
|
|
|
#43 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
deal. thank you so much.
|
|
|
|
|
|
#44 |
|
Hobbyist Programmer
|
hey, what about my offer
|
|
|
|
|
#45 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
I was lucky enough to be able to take Thursday off, but now I'm back and trying to learn this again... more to come
|
|
|
|
|
|
#46 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Okay, so now this thread is in the proper section. For some reason (perhaps because this thread has been moved) I cannot edit my original post. Anyone who is willing should feel free to help with this task. If you're trying to learn Python (like myself), I feel that if we can collectively complete this task we'll all learn a lot more than merely "print" and the really basic commands. Arevos has been nice enough to pretty much provide a walkthrough, so we have a great starting point. Rather than having him do all the work, I think it'll be helpful to work on this together and try to answer each other's questions regarding the coding. So if anyone has any questions or wants to add anything, feel free, it'll be greatly appreciated.
|
|
|
|
|
|
#47 |
|
Resident Grouch
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10
Posts are only editable for 30 minutes. This prevents people who don't know how to edit by addition of thoughts (typos are another matter) from raping all meaning of a thread by deleting or totally changing the content of their posts.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code. Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers |
|
|
|
|
|
#48 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
fair enough... congrats on 3500 posts
on another note: does anyone know of any good Python/BeautifulSoup tutorials that are more applicable to the type of stuff we're doing here with HTML and whatnot?
|
|
|
|
|
#49 |
|
Hobbyist Programmer
Join Date: May 2006
Posts: 127
Rep Power: 3
Okay, so I took an individual industry and used the above code to parse through it to find which table contained the "company index" link. It was in tables[9], so I wanted to make sure that's true for all of them. I tried a few different industries and it held. So now I know where the company index link is, and I have to make a function that fetches this link from table 9 for all of the industries... I'm gonna try and write something and I'd appreciate it if anyone could comment on what I do wrong.
Okay, my problem is that I don't know how to fetch from all the links. In the above code posted by Arevos, he gets all the industry links by parsing through this http://biz.yahoo.com/ic/ind_index.html link. I know which table contains the "company index" link, I just don't know how to fetch it from every single industry, because it needs to be fetched from multiple links. Additionally, when we use 'href' to fetch in the above code, we get a list of links that look like this: <a href="http://us.rd.yahoo.com/finance/industry/industryindex/912/*http://biz.yahoo.com/ic/912.html"> I'm not really sure what the first part of the link is, or if we need it at all. Is it possible to fetch only the links that contain "biz"? If you try to access the us.rd.yahoo link, the page doesn't exist, so I'm not sure if it matters at all, but the links we need are the ones that contain "biz". That would give us a full list of the industries without any non-existent links. Sorry for being such a noob, this is probably completely irrelevant.
Last edited by zem52887; May 19th, 2006 at 10:46 AM.
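On the question of picking out the "biz" links, here is a minimal Python 3 sketch. It assumes (nothing in this thread confirms it) that the part after the `*` in a us.rd.yahoo.com link is the real destination. The helper name extract_target and the example.com link are made up for illustration.

```python
def extract_target(href):
    """Return the part of href after the first '*', or href unchanged if there is none."""
    _prefix, star, target = href.partition("*")
    return target if star else href


hrefs = [
    "http://us.rd.yahoo.com/finance/industry/industryindex/912/*http://biz.yahoo.com/ic/912.html",
    "http://www.example.com/some/other/page.html",  # hypothetical non-industry link
]

# Strip the redirect prefix, then keep only links pointing at biz.yahoo.com.
targets = [extract_target(h) for h in hrefs]
biz_links = [t for t in targets if "biz.yahoo.com" in t]

for link in biz_links:
    print(link)
```

Whether this filtering is needed at all depends on whether the full redirect links work when fetched, so treat it as an option rather than a required step.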
|
|
|
|
|
#50 |
|
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
This is where loops come in. Loops are there for instances in which you need repetition. In your case, you want to find the company index link for each industry.
However, you're approaching the problem in the wrong direction. You're taking a linear approach to the problem, i.e. first find the industries, then find the company index for each industry, etc. This isn't the best way of tackling the problem. The skill of programming is to take in the problem as a whole, and divide it into smaller and smaller pieces.

I realise I'm being rather vague, so I'll explain further with an example. You already have a function that can get a list of URLs from the Yahoo! business industry listing (at http://biz.yahoo.com/ic/ind_index.html). What you now need is the company index for each industry URL. This is a form of repetition; you need to do something (getting the company index) for a list of items (the industry URLs). Whenever you need to repeat a task for many items, a for loop is needed.

For loops are fairly straightforward to get to grips with. Take a look at the following piece of code:

    for number in [5, 4, 3, 2, 1]:
        print "There are", number, "green bottles, sitting on the wall."

Running it prints:

    There are 5 green bottles, sitting on the wall.
    There are 4 green bottles, sitting on the wall.
    There are 3 green bottles, sitting on the wall.
    There are 2 green bottles, sitting on the wall.
    There are 1 green bottles, sitting on the wall.

Now take a look at this piece of code:

    industry_page = "http://biz.yahoo.com/ic/ind_index.html"
    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        for company_url in get_company_urls(company_index):
            print get_company_data(company_url)

This code is a little difficult to explain, so I'll take it bit by bit. Here is the first line of the outer loop:

    for industry_url in get_industry_urls(industry_page):
        company_index = get_company_index(industry_url)
        for company_url in get_company_urls(company_index):
            print get_company_data(company_url)

I cannot stress too much how important it is to understand how to use loops in Python to handle tasks that are repetitive.
The above code should provide a template, a skeleton to hang your code off. When you understand what the loop is doing, you can then define the currently non-existent functions referenced by the loop. If I'm not being clear, just say which parts you're having difficulty with. Unfortunately, it's often hard for a person with some experience to understand the problems an inexperienced beginner has. Programming is very hard, so don't get frustrated. Indeed, you're doing very well for a beginner so far.
By the way, links like "http://us.rd.yahoo.com/finance/industry/industryindex/912/*http://biz.yahoo.com/ic/912.html" are perfectly valid. Try sticking it in your browser if you're unsure.
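One way to get a feel for the loop skeleton before writing any scraping code is to run it against stand-in functions. The sketch below is Python 3 and every function body in it is a placeholder that returns made-up strings (such as "industry-A"); only the function names and the shape of the nested loop come from the post above.

```python
def get_industry_urls(industry_page):
    # Stub: a real version would download industry_page and pull out the links.
    return ["industry-A", "industry-B"]


def get_company_index(industry_url):
    # Stub: a real version would find the "company index" link on the industry page.
    return industry_url + "/company-index"


def get_company_urls(company_index):
    # Stub: a real version would list every company link on the index page.
    return [company_index + "/company-1", company_index + "/company-2"]


def get_company_data(company_url):
    # Stub: a real version would scrape the company page itself.
    return "data for " + company_url


industry_page = "http://biz.yahoo.com/ic/ind_index.html"
results = []
for industry_url in get_industry_urls(industry_page):        # outer loop: once per industry
    company_index = get_company_index(industry_url)
    for company_url in get_company_urls(company_index):      # inner loop: once per company
        results.append(get_company_data(company_url))

for line in results:
    print(line)
```

With two fake industries and two fake companies each, the inner body runs four times. Replacing the stubs one at a time with real code lets you check each piece on its own.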
|
|
|