Saturday, 19 November 2016

PageRank - Google's Algorithm ( The heart of google )

What is it?

PageRank is an algorithm used primarily for rating the popularity of web pages. Although one may initially think 'Page' refers to the web page, it actually refers to the inventor, Larry Page (a founder of Google). This is why you'll find the capital 'P' wherever you find a reference to the algorithm.
The basic assumption is that the more inbound links a web page has across the web, the more valid the content the page contains. It's a crowd-sourcing algorithm of sorts; it relies on everybody online to create a network of links to different web pages and classifies the validity of each page based on its overall popularity.

Applications

The most obvious application of PageRank is the Google search engine. In fact, much of the company's initial success may be attributed to the effectiveness of PageRank in organizing search results on Google.
It is important to note, however, that it is not only web pages which may use PageRank. Any data which can be modeled as a directional graph can be analyzed with PageRank.

The Math

Consider a directional graph, where web pages are modeled as nodes, and the links are modeled as vertices. Each vertex represents a link from one page to another.
While it may be tempting to simply count the total number of links pointing to each web page to discover which page is the most popular, PageRank is a bit more sophisticated.
We begin by assigning each node a score between zero and one. This score represents a probability distribution. If you are unfamiliar with probability distributions, don't worry. This example will make sense regardless.
The initial score is:
PR(N) = 1/n
This reads as "The PageRank of each node is one divided by the total number of nodes in the network." Each node starts with the same initial score.
Let's use an example network with five nodes: A, B, C, D, and E. In this case, each node has an initial score of 0.2. Now imagine a link:
A -> C
This would transfer all of A's scores to C, leaving A with 0 and C with 0.4.
Imagine A with two outbound links:
A -> C
A -> D
In this case, A would still have 0, but C and D would split the 0.2 from A. This leaves D with 0.3 and C with 0.3.
Be aware that this is an example of a simple Page Rank algorithm. More advanced implementations will use tools such as a damping factor.

Input Description

This is a simple implementation of the PageRank algorithm. It accepts an input in the form of an int[][] or a String which can define the d value and the number of loops. The default d value is 0.85, and the default number of loops is 20

page 0 links to pages 1,2,3
Page 1 links to page 0
Page 2 links to page 0
Page 3 links to pages 0,4,5,6,7
page 4 has no links
page 5 has no links
page 6 has no links
page 7 has no links

Sample Input

[[1,2,3],[0],[0],[0,4,5,6,7],[],[],[],[]]

Sample Output


[0.9142543787391981,0.4089334112608257,0.4089334112608257,0.4089334112608257,0.21947767079447258,0.21947767079447258,0.21947767079447258,0.21947767079447258]

Share:

0 comments:

Post a Comment