Programming Forums
Dec 13th, 2004, 7:47 PM   #1
on_auc
Howdy,

I want to know how search engines such as Google and WebCrawler really work, as I am learning PHP programming and want to write a search engine.
I have read around 10 websites, found on Google, about “how search engines work”, and not a single one of them makes it clear whether it is the spider, the index, or the search software that does the ranking according to the engine’s ranking algorithm.
All they ever say is that a search engine has 3 pieces of software:
a) the spider
b) the index
c) the search system (search box, template, etc.)
The spiders crawl the web collecting webpages and forward them to the index, and then the search software searches the index for the sought keywords/phrases.
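
The three parts listed above can be sketched roughly as follows. This is a minimal Python sketch, not how Google or any real engine is built: the `PAGES` dictionary, the URLs, and the function names `spider`, `build_index`, and `search` are all made up for illustration, and the index here is a word-to-pages lookup table rather than a copy of the raw HTML. In this version the ranking (a simple count of matching words) is done by the search step, which is only one of the possible designs.

```python
from collections import defaultdict

# Hypothetical stand-in for the web: URL -> page text. A real spider would
# fetch these over HTTP; here they are hard-coded so the sketch runs offline.
PAGES = {
    "a.example/php": "php tutorial for beginners learn php",
    "b.example/search": "how a search engine works spider index",
    "c.example/mix": "php search engine tutorial",
}

def spider(pages):
    """Part a) the spider: hand each (url, text) pair to the indexer."""
    for url, text in pages.items():
        yield url, text

def build_index(crawled):
    """Part b) the index: map each word to {url: occurrences of that word}."""
    index = defaultdict(dict)
    for url, text in crawled:
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Part c) the search system: look up the words, then rank the hits."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Ranking happens here, at query time (an assumption of this sketch).
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(spider(PAGES))
results = search(index, "php tutorial")
```

Here `results` lists `a.example/php` before `c.example/mix`, because the first page contains the query words three times in total and the second only twice.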
Also, some say that the spiders copy the whole website into the index. In other words, there are 2 copies of a website: one residing on the website owner’s webserver and the other residing in the search engine’s index.
So now, from all this, I can only assume 3 possibilities for how a search engine works:

1.
The spider does not do the ranking according to any algorithm.
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
The index is nothing but a big text file (.txt, .html) on the search engine’s webserver that keeps a full copy (HTML code) of each website.
The search system, when searching and finding links in the index, does the ranking according to the search engine’s ranking algorithm.
This means neither the spider nor the index is responsible for the ranking, because these 2 parts of the search engine are not taught the ranking algorithm.

OR

2.
The spider does the ranking according to the search engine’s ranking algorithm.
It visits a website, grabs all its HTML code (copies the website) and finally dumps the HTML code into the index. When it dumps the copies of websites, it ranks them according to the search engine’s algorithm.
The index is nothing but a big text file (.txt, .html) on the search engine’s webserver that keeps a full copy (HTML code) of each website.
The search system, when searching and finding links in the index, does not do any ranking, because that has already been done by the spider when dumping the data into the index.
This means the spider is responsible for the ranking, and neither the index nor the search system is, because these 2 parts of the search engine are not taught the ranking algorithm.

OR

3.
The spider does not do the ranking according to any algorithm.
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
The index is not only a big text file (.txt, .html) on the search engine’s webserver that keeps a full copy (HTML code) of each website, but also the system that does the ranking.
When it receives data from the spider, it ranks the links in its database according to the search engine’s ranking algorithm.
The search system, when searching and finding links in the index, does not do any ranking.
Frankly, all it does is output a copy of certain parts of the index onto the searcher’s screen.
This means neither the spider nor the search system is responsible for the ranking, because these 2 parts of the search engine are not taught the ranking algorithm.
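
To make assumption 3 concrete, here is a minimal Python sketch of an index that does the ranking when pages are added, so that the search step only reads back a stored list. Everything in it is hypothetical: the `PAGE_SCORE` values, the `CRAWLED` data, and the function names are invented for illustration and do not describe any real search engine.

```python
# Hypothetical per-page scores computed once, when pages enter the index.
PAGE_SCORE = {"a.example": 0.9, "b.example": 0.4, "c.example": 0.7}

# Hypothetical crawl output: word -> pages containing it (unordered).
CRAWLED = {
    "php": ["b.example", "a.example", "c.example"],
    "spider": ["b.example", "c.example"],
}

def build_ranked_index(crawled, page_score):
    """Sort each word's pages by score while indexing, best first."""
    return {
        word: sorted(urls, key=lambda url: -page_score[url])
        for word, urls in crawled.items()
    }

def search(index, word, limit=10):
    """No ranking here: return the first entries of the stored list."""
    return index.get(word, [])[:limit]

ranked_index = build_ranked_index(CRAWLED, PAGE_SCORE)
top = search(ranked_index, "php", limit=2)
```

Here `top` is `["a.example", "c.example"]`. Note that this only handles one-word queries; a query with several words would still have to combine the stored lists somehow at search time.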


So, which of the 3 assumptions above is correct?
__________________
I would like to come up with my own "compression algorithm" and teach it to the browsers, so you can show streaming videos and lengthy animations from your website without losing an arm and a leg on bandwidth.
Dec 13th, 2004, 10:48 PM   #2
tempest
I haven't read through your whole post... but the reason search engine rankings are so accurate and useful is that they have had time for enough users to click on the relevant results (clicks are tracked, and the most clicks means top position), which makes the rankings well rounded.
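
As a rough illustration of the "most clicks means top position" idea described in this post, here is a minimal Python sketch. It only shows the idea as stated here and is not a claim about how any real search engine ranks pages; the `click_log` data and the `rank_by_clicks` function are hypothetical.

```python
from collections import Counter

# Hypothetical click log: each entry is the URL a searcher clicked.
click_log = [
    "a.example", "b.example", "a.example",
    "c.example", "a.example", "b.example",
]

def rank_by_clicks(candidates, clicks):
    """Order candidate URLs by recorded clicks, most-clicked first."""
    counts = Counter(clicks)
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(candidates, key=lambda url: (-counts[url], url))

ranking = rank_by_clicks(
    ["c.example", "b.example", "a.example", "d.example"], click_log
)
```

With this log, `ranking` comes out as `a.example`, `b.example`, `c.example`, `d.example`. A page that has never been clicked, like `d.example`, always ends up last under this scheme.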

The spider is just the thing that crawls the internet looking for webpages to cache. Google keeps an index of the webpages it finds and updates them periodically. The spider is given a few key websites that have a lot of links, and all of the links on all of those websites are analyzed and tracked to "milk" out more websites... The spiders are continually given websites that are expected to lead to more material, and eventually they can stop doing that because most of the websites have been found...
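
The seed-and-follow-links process described above can be sketched as a breadth-first traversal. This is a minimal Python sketch under stated assumptions: the `LINKS` dictionary is a hypothetical in-memory link graph standing in for the web (no pages are actually downloaded), and the `crawl` function name is made up for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "hub.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["hub.example"],
    "c.example": [],
    "island.example": ["a.example"],  # no other page links here
}

def crawl(seeds, links):
    """Breadth-first crawl: visit seeds, then every page reachable from them."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return order

visited = crawl(["hub.example"], LINKS)
```

Starting from the single seed `hub.example`, the crawl visits `hub.example`, `a.example`, `b.example`, and `c.example`, then stops once no new links turn up. `island.example` is never found because nothing links to it, which is why the choice of seed sites matters.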

Anything else you want to know?
Dec 15th, 2004, 1:02 PM   #3
Overmind
Quote:
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
I don't think they could keep a copy of 8,058,044,651 web pages.
Dec 15th, 2004, 4:52 PM   #4
Ooble
Then what does the Google cache do?
Dec 15th, 2004, 5:16 PM   #5
Overmind
Oh, I see now... I didn't know about that.
Dec 15th, 2004, 5:26 PM   #6
Ooble
Sure, it's definitely highly compressed, but that's petabytes of info. Where do they store it?
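
For a sense of scale, here is a back-of-the-envelope Python calculation using the page count quoted earlier in the thread. The average stored sizes per page are pure assumptions chosen for illustration, not measured values, and the result is the size of a single copy with no replication.

```python
PAGES = 8_058_044_651  # page count quoted earlier in the thread

def storage_terabytes(pages, avg_bytes_per_page):
    """Total storage in terabytes (1 TB = 10**12 bytes)."""
    return pages * avg_bytes_per_page / 10**12

# Assumed average stored sizes per page, in KB (1 KB = 1000 bytes).
for avg_kb in (10, 50, 125):
    tb = storage_terabytes(PAGES, avg_kb * 1000)
    print(f"{avg_kb:>4} KB/page -> {tb:,.0f} TB")
```

Under these assumptions the totals are about 81 TB, 403 TB, and 1,007 TB, so a single copy only reaches a petabyte if the stored size averages roughly 125 KB per page.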
Dec 15th, 2004, 5:35 PM   #7
Overmind
Good question... it must cost a lot to keep all that running 24/7 o_O

All times are GMT -5.