Programming Forums
Old Jul 10th, 2005, 12:49 PM   #1
sluglicker
Newbie
 
Join Date: Jul 2005
Posts: 3
Rep Power: 0 sluglicker is on a distinguished road
Web Crawler

Does anyone know where I can find a good comprehensive spider/crawler tutorial for C++? I've been looking for a long time and can't locate much. There's some stuff for VB and Java, and of course tons of stuff for Perl, but not for C++. I'm new to programming and have chosen C++ as my first language, so the other ones don't help me much. This will be my first real project, so I need a thorough guide. I did a search here already and found one post that gave a few quick pointers, and that did help somewhat. Thanks for your help.
Old Jul 10th, 2005, 2:12 PM   #2
DaWei
Resident Grouch
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
When you learn a language, you have to learn the language PLUS the rationale behind the design of whatever you choose to program. I would suggest that, initially, you would be better off writing something that doesn't require you to simultaneously learn the ins and outs of a separate technical subject as complex as how the internet works. To do that properly you will be reading so many RFCs that your C++ book will be buried 10 feet down.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Old Jul 10th, 2005, 5:16 PM   #3
Cerulean
Professional Programmer
Join Date: Apr 2005
Location: London, England
Posts: 459
Rep Power: 4 Cerulean is on a distinguished road
Crawlers are easy peasy if you can use regular expressions (or long manual searching) and can download files via HTTP. This is where C++ is a little shoddy, as it doesn't provide either of these out of the box (without a library, you have to write the HTTP code yourself on top of the C socket API). The only bot I've written in C++ used Qt (QHttp and QRegExp) to handle both, so it wasn't too bad. I recommend you find a library for those.
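
As a rough illustration of the regular-expression half, here is a minimal sketch using std::regex, which arrived with C++11 (well after this thread was written). The helper name extract_hrefs is made up for this example, and the pattern only handles quoted href values; a real HTML parser would be more robust.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every href="..." or href='...'
// attribute found in an HTML string. A regex is a blunt tool here: it knows
// nothing about comments, scripts, or unquoted attribute values.
std::vector<std::string> extract_hrefs(const std::string& html) {
    // Group 1 captures the opening quote character, group 2 the value.
    // The backreference \1 requires the same quote character to close it.
    static const std::regex href_re(
        R"re(href\s*=\s*(["'])(.*?)\1)re", std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        links.push_back((*it)[2].str());
    }
    return links;
}
```

Calling extract_hrefs on the text of a downloaded page returns the raw attribute values, which may still be relative and need fixing up before they can be fetched.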

The steps are fairly simple:
1. Create a list that will hold your URLs
2. Download the starting page and parse it for URLs (checking anchor and link tags for the value of the href attribute, images for src, etc.). Then fix those URLs for direct use by converting relative references into absolute ones (e.g. on a page http://foo.org/bar/blah.html, an anchor tag with an href value of "foo.html" gives the new URL http://foo.org/bar/foo.html). Take all of the URLs and put them in the list.
3. Repeat 2 for every URL in the list, popping off the URL you just parsed for links as you go along.
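
The URL fix-up in step 2 could be sketched roughly as follows. The name resolve_url is hypothetical, and this is a deliberately simplified version: it assumes the base URL is absolute and ignores "../", "./", "//host" links, query strings and fragments, all of which a full resolver would have to handle.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turns a link found on the page `base_url` into an
// absolute URL. Simplified on purpose; see the assumptions listed above.
std::string resolve_url(const std::string& base_url, const std::string& link) {
    // A link that already carries a scheme is used as-is.
    if (link.find("://") != std::string::npos) return link;

    const std::size_t scheme_end = base_url.find("://");
    if (scheme_end == std::string::npos) return link;  // base is not absolute

    // The first '/' after "scheme://" marks where the path begins.
    const std::size_t path_start = base_url.find('/', scheme_end + 3);
    // "scheme://host" part; the whole string when the base has no path.
    const std::string origin = base_url.substr(0, path_start);

    // Root-relative link such as "/index.html".
    if (!link.empty() && link[0] == '/') return origin + link;

    // Base has no path at all, e.g. "http://foo.org".
    if (path_start == std::string::npos) return origin + "/" + link;

    // Otherwise resolve against the directory of the base page.
    const std::size_t last_slash = base_url.rfind('/');
    return base_url.substr(0, last_slash + 1) + link;
}
```

With the example from step 2, resolve_url("http://foo.org/bar/blah.html", "foo.html") gives "http://foo.org/bar/foo.html".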

See how long your bot can go for :-). I started at google.com and let it run for two days before I had to stop it.
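
A crawl like this never runs out of links on its own, and pages link back to each other, so in practice the loop wants a record of URLs already seen and some page limit. Here is a minimal sketch of the three steps with both added. The names crawl and LinkFetcher are made up for illustration, and the downloading and parsing are hidden behind a callback that is assumed to return absolute URLs, so no particular HTTP library is implied.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical callback type: downloads `url`, parses it, and returns the
// absolute URLs it links to (step 2 of the list above).
using LinkFetcher =
    std::function<std::vector<std::string>(const std::string&)>;

// Breadth-first crawl from `start`, visiting at most `max_pages` pages.
// Returns the URLs in the order they were visited.
std::vector<std::string> crawl(const std::string& start,
                               const LinkFetcher& fetch_links,
                               std::size_t max_pages) {
    std::deque<std::string> frontier{start};  // step 1: the list of URLs
    std::set<std::string> seen{start};        // every URL ever queued
    std::vector<std::string> visited;

    while (!frontier.empty() && visited.size() < max_pages) {
        // Step 3: pop the next URL off the list.
        const std::string url = frontier.front();
        frontier.pop_front();
        visited.push_back(url);

        // Step 2: queue each link that has not been queued before.
        for (const std::string& link : fetch_links(url)) {
            if (seen.insert(link).second) frontier.push_back(link);
        }
    }
    return visited;
}
```

Keeping the fetcher as a parameter also makes the loop easy to try out against a fake in-memory "site" before pointing it at the real web.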
Old Jul 11th, 2005, 3:23 AM   #4
sluglicker
Quote:
Originally Posted by sluglicker
Does anyone know where I can find a good comprehensive spider/crawler tutorial for C++?
Thanks for the feedback, but that doesn't answer my question. The good news is I already know TCP/IP, UDP, HTTP and HTML; I've studied "C++ Programming Fundamentals" by Chuck Easttom, which covers all the basics such as arrays, strings, exception handling, pointers, classes, various OOP concepts, basic data structures and algorithms, and an intro to Visual C++ (I'm using 6.0). If you don't know the answer to my question, that's OK. If anyone knows where I can find a good comprehensive spider/crawler tutorial for C++, it would be greatly appreciated.
Old Jul 11th, 2005, 11:39 PM   #5
sluglicker
Solved

I found a good one for PHP that goes into great detail.
http://www.searchlore.org/phpregexspider.htm
Close enough. Thanks.

DaniWeb IT Discussion Community
All times are GMT -5.