PageRank explained for mere mortals
The main reason Google became the most popular search engine is that it was able to provide the most relevant results for users' searches.
Google's search results are sorted using an algorithm named PageRank. It takes into account the number of incoming and outgoing links of each page.
I spent most of last year reading Google papers such as The Anatomy of a Large-Scale Hypertextual Web Search Engine, The PageRank Citation Ranking: Bringing Order to the Web, Efficient Crawling Through URL Ordering, and several others. In this article I will focus on explaining, as simply as possible, The PageRank Citation Ranking: Bringing Order to the Web. That paper took me a long time to understand, because last year I was in my first year at university and did not have enough mathematical background for it (and I still don't). The simple definition of PageRank is this:
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.
The first thing you should know is that PageRank is a measure of web pages, not of web sites. For example, www.foo.bar.com and www.foo.bar.com/page2.html each have their own PageRank. In practice the two values are usually the same or similar, because such pages usually link to each other.
Now let's look at PageRank in more depth.
Basically, PageRank tries to simulate a web user surfing the web at random. The more easily the random surfer lands on a page, the more important that page is. How can that be? It is simple: real users reach most pages by following links from other pages. The random surfer simulates browsing by following the links that point to a page. Most ranking schemes stop there, but PageRank also weighs each link. For example, if page A points to page B and B has no other incoming link, the weight of B is the weight of A divided by the number of outgoing links A has.
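The random surfer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-page graph `LINKS`, the function name `random_surf` and the fixed seed are all made up for the example, and every page is assumed to have at least one outgoing link.

```python
import random
from collections import Counter

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def random_surf(links, start, steps, seed=0):
    """Follow random links for `steps` clicks and return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        # Pick one of the current page's out links with equal probability.
        page = rng.choice(links[page])
    return {p: visits[p] / steps for p in links}

frequencies = random_surf(LINKS, start="A", steps=100_000)
for page, share in sorted(frequencies.items()):
    print(page, round(share, 3))
```

On this toy graph the surfer should spend roughly 40% of its clicks on A, 40% on C and 20% on B, because B only receives half of A's weight while C receives the other half plus all of B's.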
This is the basic mathematical expression of PageRank:
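For reference, the simplified ranking in The PageRank Citation Ranking paper can be written as below. The notation is the paper's, not this article's: R(u) is the rank of page u, B_u is the set of pages that link to u, N_v is the number of links going out of page v, and c is a normalization constant.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page v splits its own rank evenly across its N_v outgoing links, which is the same idea as the page A and page B example: if A is the only page linking to B, then B receives R(A) divided by the number of links on A.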
This is the PageRank mathematical expression for mere mortals:
A page is important if one or more important pages point to it.
How can we know which pages are important and which are not before computing PageRank?
The answer is quite simple: we don't know. What we do is give every page an initial value. The PageRank calculation must then be repeated several times; Google ran it 53 times in their first experiment. At the start there is no difference between the pages' PageRank, but after several iterations the more important pages have a higher PageRank.
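A minimal sketch of this repeated calculation, without any damping factor, could look like the following. The graph `LINKS` and the function name `iterate_ranks` are hypothetical, and the sketch assumes every page has at least one outgoing link; pages without out links would need extra handling.

```python
# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def iterate_ranks(links, iterations=53):
    """Repeat the simple (undamped) PageRank update `iterations` times."""
    # Initial assignment: every page starts with the same share.
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = ranks[page] / len(targets)  # split evenly over out links
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

print(iterate_ranks(LINKS, iterations=0))   # all pages equal before iterating
print(iterate_ranks(LINKS, iterations=53))  # ranks after the repeated passes
```

With zero iterations every page holds one third of the total. After the repeated passes, A and C settle near 0.4 and B near 0.2 on this graph, so the pages are no longer equal.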
PageRank algorithm including damping factor
The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85.[5]
The damping factor is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents in the collection) and this term is then added to the product of (the damping factor and the sum of the incoming PageRank scores).
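Written as a formula, the sentence above reads as follows. The symbols are assumptions borrowed from the usual presentation of the algorithm: T_1 to T_n are the pages that link to page A, C(T_i) is the number of links going out of T_i, d is the damping factor, and N is the number of documents in the collection.

```latex
PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}

% Variation in which the first term is divided by the number of documents:
PR(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}
```

With d = 0.85, a page with no incoming links at all still gets 0.15 in the first form, or 0.15 / N in the second.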
My test implementation
I've developed a basic implementation of PageRank that can be downloaded freely. It is written in PHP and MySQL and is hosted on the PHPClasses site.
Basically, this class calculates the PageRank of a set of links. It receives as parameters the ID of a page and the ID of the page it links to. The calculation is quite simple: every link is a row in MySQL, and the rows are processed one by one.
When all the rows have been processed, the resulting weights are saved.
The process is repeated as many times as needed, 53 times by default.
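The same row-by-row idea can be sketched in Python. This is not the PHP class itself, only an approximation of the steps described above: the function name `pagerank_from_rows` and the sample rows are hypothetical, an in-memory list stands in for the MySQL table, and the damping factor of 0.85 and the form `(1 - d) + d * sum` are assumptions taken from the damping factor section.

```python
from collections import defaultdict

def pagerank_from_rows(rows, damping=0.85, iterations=53):
    """Compute PageRank from (page_id, linked_id) rows, one row per link."""
    pages = {page_id for row in rows for page_id in row}
    out_count = defaultdict(int)
    for source, _target in rows:
        out_count[source] += 1

    # Initial assignment: every page starts with the same weight.
    weights = {page: 1.0 for page in pages}
    for _ in range(iterations):
        incoming = defaultdict(float)
        # Process the links row by row, as a table scan would.
        for source, target in rows:
            incoming[target] += weights[source] / out_count[source]
        # Save the new weights only after all rows are processed.
        weights = {
            page: (1 - damping) + damping * incoming[page] for page in pages
        }
    return weights

# Hypothetical link rows: (ID of the page, ID of the page it links to).
ROWS = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

ranking = sorted(pagerank_from_rows(ROWS).items(), key=lambda kv: -kv[1])
for page, weight in ranking:
    print(page, round(weight, 4))
```

In this sample page 3 comes out on top because three pages link to it, and page 4, which nobody links to, stays at the minimum of 1 - 0.85 = 0.15.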
The class also includes a test that I wrote.
In my test, on a Sempron 1.8 GHz with 512 MB of RAM, it took about one hour to calculate the PageRank of 1,000,000 web pages from the English Wikipedia (only 60,000 of them were crawled; the rest are links that had not been downloaded yet), starting from a single page, http://en.wikipedia.org/wiki/Linus_Torvalds.
Here is the top of the PageRank:
+----------+-----------------------------------------------------------+----------+
| position | url                                                       | PageRank |
+----------+-----------------------------------------------------------+----------+
|        1 | http://en.wikipedia.org/wiki/Special:Upload               |  66.1543 |
|        2 | http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer |    65.72 |
|        3 | http://en.wikipedia.org/wiki/Wikipedia:Featured_articles  |  64.5116 |
|        4 | http://en.wikipedia.org/wiki/Wikipedia:Contact_us         |  64.2818 |
|        5 | http://en.wikipedia.org/wiki/Special:Recentchanges        |   64.025 |
|        6 | http://en.wikipedia.org/wiki/Help:Contents                |  63.5814 |
|        7 | http://en.wikipedia.org/wiki/Wikipedia:About              |  63.4931 |
|        8 | http://en.wikipedia.org/wiki/Wikipedia:Community_Portal   |  63.4072 |
|        9 | http://en.wikipedia.org/wiki/Special:Specialpages         |  63.3888 |
|       10 | http://en.wikipedia.org/wiki/Portal:Current_events        |   62.964 |
+----------+-----------------------------------------------------------+----------+
A comparison between operating systems:
+----------+---------------------------------------------+----------+
| position | url                                         | PageRank |
+----------+---------------------------------------------+----------+
|      233 | http://en.wikipedia.org/wiki/Linux          |  1.26486 |
|      410 | http://en.wikipedia.org/wiki/Microsoft      | 0.950259 |
|      428 | http://en.wikipedia.org/wiki/Unix           | 0.919783 |
|     1853 | http://en.wikipedia.org/wiki/FreeBSD        | 0.405697 |
|     2235 | http://en.wikipedia.org/wiki/Category:Unix  |  0.35416 |
|     2665 | http://en.wikipedia.org/wiki/Mac_OS         | 0.312736 |
|     5680 | http://en.wikipedia.org/wiki/Category:Linux | 0.243818 |
+----------+---------------------------------------------+----------+
A comparison between computer people:
+----------+-----------------------------------------------+----------+
| position | url                                           | PageRank |
+----------+-----------------------------------------------+----------+
|      717 | http://en.wikipedia.org/wiki/Linus_Torvalds   | 0.626179 |
|     1595 | http://en.wikipedia.org/wiki/Richard_Stallman | 0.451587 |
|     4636 | http://en.wikipedia.org/wiki/Bill_Gates       | 0.267769 |
+----------+-----------------------------------------------+----------+
A comparison between web search engines:
+----------+-------------------------------------------------------------+----------+
| position | url                                                         | PageRank |
+----------+-------------------------------------------------------------+----------+
|      194 | http://en.wikipedia.org/wiki/Google                         |  1.30925 |
|    10956 | http://en.wikipedia.org/wiki/MSN                            | 0.193957 |
|    64674 | http://en.wikipedia.org/wiki/List_of_acquisitions_by_Google |  0.15752 |
+----------+-------------------------------------------------------------+----------+
These results are calculated automatically by the class; they were not adjusted to match our preferences. The PageRank would be more useful if the whole of Wikipedia were downloaded.
For better performance, the memory limit in the test file is set to 68M; if you have a good machine you can give it more RAM.