On Part I, assignment II of COMP2521

hardrain

About Post

Create on: 2019-10-27 19:51:01 UTC

Update on: 2019-10-28 05:51:36 UTC

Words: 882

Part I (pagerank.c)

The first part of this assignment required us to implement a program to calculate the weighted pagerank of a set of dummy web pages (actually txt files) contains 'hyperlinks' (filename) to others.

Firstly, we need to implement a graph ADT which can represents the links between the files in the collection.txt, you may noticed that edges in the ADT should have direction, which represents there is a link between (a -> b), not (b -> a).

Basically, your graph ADT should have the following APIs (in LinkGhaph.h):

// Create new graph
Graph newGraph ();

//Destroy a graph
void freeGraph (Graph g);

// Create edge from src to dst
void newEdge (Graph g, char* src, char* dst);

// Check whether there is an edge between src and dst
int isEdge (Graph g, char* src, char* dst);

// Get the number of pages in the Graph
size_t nV (Graph g);

// How many links pointed to dst?
int countIn (Graph g, char* dst);

// How many links initalized from src?
int countOut (Graph g, char* src);

After that, we need functions to calculate W(eight)-in and W(eight)-out for each edge, which appeared in the so-called 'Weighted Page Rank'.

The logic for calculating the W-in is:

weightIn (start, end) {
    sum_start_in = 0
    for (each page in graph) 
        if (there is anedge from start to this page && it's not a self-link*)
            sum_start_in + countIn(this page)

    weight_out = countIn(end) / sum_start_in
}

While logic for calculating W-out is:

weightOut (start, end) {
    sum_start_out = 0
    for (each page in graph)
        if (there is an edge from start to this page && it's not a self-link) 
            sum_start_out + countOut(this page)

    weight_out = countOut(end) / sum_start_out
}

But, please be notice that countOut(page) can return 0 if page is a dead end (has no link to other pages), in such case, you should use 0.5 instead of the returned 0 from countOut(page). This affects the implementation of W-out algorithm.

* self-link: a link point to itself, e.g. url21 in file url21.txt


If everything in the previous chap goes well, you may start implement the pagerank algorithm.

This algorithm includes iteration. So, we have to have 2 arrays of double to carry the PR value before and after each iteration.

double pr_b[number_of_pages];
double pr_a[number_of_pages];

You will find that you can't initialize an array with a variable size (number_of_pages). The above are just pseudo code, you may use double* with malloc() to create a dynamical 'array'.

The initial PR of each page are same, which equals to (number of pages)^-1, initialize the pr_b array with it.

create a counter for iterations, and initialize the differential value* with the input diffPR:

int n_iters = 0
double diff = diffPR

Then we start the loop:

while (n_iters < maxIterations && diff >= diffPR) {
    // Calculate PR for each page
    for (each page as end) {
        for (each page as start) 
            if (there is an edge from start to end && it's not a self-link)
                calculate sum = weightOut(start, end) * weightIn(start, end) * pr_b[start]
        pr_a[end] = ((1 - d) / (number of pages)) + d * sum
    }

    // Calculate diff
    diff = 0
    for (each page) 
        diff += |pr_a[page] - pr_b[page]|

    // Assign the pr_a to pr_b (for next iteration)
    for (each page)
        pr_b[page] = pr_a[page]

    // Add 1 to iteration counter
    n_iters += 1
}

After the loop ends, return the pr_a array as result.

* d: The damp factor from the input of the function.

* diff: PR value of a page may change between iterations (it will approach a value after 'infinite' iterations), diff is the sum of this difference of all the pages. The input diffPR sets a threshold of this value, if it's to low, then the iteration stops because we can assume that the PR no longer changes significantly with more iterations.


Then, you should print the result obtained from previous chap in following order.

  1. pagerank (in descending order)
  2. number of outlinks (in descending order, if 1 are same for multiple results)
  3. url string (in ascending order, if both 1 and 2 are same for multiple results)
  4. there is no 4 because url never duplicate (unless something goes wrong with the input)

A linked-list which sort the nodes at insertion may be very helpful. We have done such an ADT in assignment I, which takes 2 parameters (are they (double)tf-idf and (char*)url?), now we need something similar but taking 3 parameters, including a double, an int and a char*.


Hints

  1. Always cast int, uint to double when doing division, or you may get incorrect result.
  2. Use fabs() for absolute value of 2 floats. Learn more by man 3 fabs.
  3. Your graph ADT may also required to convert the name of url to the internal variable of its index (e.g. a integer as the url ID).
  4. You can get some idea from your assignment I for implementing a linked-list ADT which sorting the nodes at inserting. This can be helpful for output a sorted page rank list.

TBD for part II


Comments:

There is no comment yet...

Comment as visitor:

We'll never share your email with anyone else.