Introduction

Often when working with collections of data, such as documents or entries in a database, operations are applied to each individual item. Traditionally this is done in Xill using a foreach loop, but the built-in pipeline functions can make these tasks more concise and easier to read. Using these functions, one can build a "pipeline" in which each data item is passed to a series of functions in succession, each of which transforms or otherwise processes the item.

This tutorial explains the basics of the pipeline functions and walks through a sample project to show how some of them can be used.

Using the pipeline functions

Most pipeline functions take a function argument along with other arguments, such as an iterator.

foreach<function>(iterator, ...);

The function argument (which should be the name of a function defined elsewhere in the Xill code) is passed in the diamond operator (<>), and any other arguments are passed as they would be to any other function.
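As a minimal sketch of this calling convention (assuming, for illustration, that map also accepts a plain LIST as its iterator argument), the following defines a function and passes it to map in the diamond operator:

/*
 * Doubles a single number; passed to map below.
 */
function double(x) {
    return x * 2;
}

// double goes inside the diamond operator, the data as a regular argument
var doubled = map<double>([1, 2, 3]);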

Pipeline functions example

In this example we use data from the Wikispeedia game. In this game each user is given two Wikipedia pages and has to get from one page to the other using only the links on the pages. Each successfully completed path results in one line in the data file:

015245d773376aab	1366730828	175	14th_century;Italy;Roman_Catholic_Church;HIV;Ronald_Reagan;President_of_the_United_States;John_F._Kennedy	3

These tab-separated values represent:

  • the hashed IP address of the user;
  • the timestamp;
  • the time taken to go from the first to the last page;
  • the path, separated by semicolons (where < represents a back click);
  • the rating given by the user (where NULL means no rating was given, 1 is very easy and 5 is very hard).

In this example we'd like to know the average number of back clicks for each rating given by the user. The full code and the data are included at the bottom of this tutorial.

We expect the number of back clicks to be highest for a rating of 5, since users probably experience a path as more difficult when they constantly have to backtrack. To find out whether this is true, we first have to read the data, which is in tab-separated value form, and then parse it into something more useful.

/*
 * Used for filtering out comments and whitespaces from the data
 */
function isData(line) {
    var isComment = String.startsWith(line, "#");
    var isEmpty = line=="";
    return !(isComment || isEmpty);
}

var data = File.openRead("paths_finished.tsv");
var dataIterator = Stream.iterate(data);

var dataLines = filter<isData>(dataIterator);

Here we open the data file and transform it into an iterator, which is what the pipeline functions operate on. Each item in this iterator is one line from the data file. We then use the filter function to discard all non-data lines, keeping only the lines that are neither comments nor empty.

We then parse each line into an OBJECT, which makes further operations easier.

/*
 * Parses one line from the data file into an OBJECT
 */
function parseData(line) {
    var split = String.split(line, "\t");
    var path = String.split(split[3], ";");
    return {
        "ip":split[0],
        "timestamp":split[1],
        "duration":split[2],
        "path":path,
        "rating":split[4]
    };
}

var parsedData = map<parseData>(dataLines);

Note that here we use map, which transforms each item in the original iterator one-to-one into a new item.

Next we use filter again to remove all data items that do not have a rating (i.e., items whose rating is the string NULL).

/*
 * Used for filtering out lines without ratings
 */
function hasRating(data) {
    return data.rating != "NULL";
}

var dataWithRatings = filter<hasRating>(parsedData);

To count the number of back clicks, we can count the number of occurrences of < in the path. We do this for each individual item, again using map.

/*
 * Count the backclicks for one finished path
 */
function countBackClicks(data) {
    var backClicks = 0;
    foreach (click in data.path) {
        if (click == "<") {
            backClicks++;
        }
    }
    return {"rating":data.rating, "backClicks":backClicks};
}

var countedBackClicks = map<countBackClicks>(dataWithRatings);

To finish up, we combine the counts into a single OBJECT using the reduce function, which combines data one item at a time.

/*
 * Add the back clicks of a data item to the appropriate rating
 */
function sumBackClicks(counts, data) {
    counts.backClicks[data.rating] += data.backClicks;
    counts.numRatings[data.rating] += 1;
    return counts;
}

// Number of back clicks per rating
var initialBackClicks = {1:0,2:0,3:0,4:0,5:0};
// Number of completed paths per rating
var initialNumRatings = {1:0,2:0,3:0,4:0,5:0};
var counts = reduce<sumBackClicks>({"backClicks":initialBackClicks, "numRatings":initialNumRatings}, countedBackClicks);

In this case we start with a count of 0 for all ratings; the sumBackClicks function adds the back click count of each completed path to the rating the user gave, while simultaneously counting the number of completed paths per rating.
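In isolation, reduce takes an initial accumulator value and an accumulator function, and folds the items in one at a time. A minimal sketch (assuming, for illustration, that reduce also accepts a plain LIST as its iterable argument):

/*
 * Accumulator function: reduce calls this once per item,
 * passing the running total and the current item.
 */
function add(total, item) {
    return total + item;
}

// Starts at 0, then folds in 1, 2, 3 and 4 one at a time
var sum = reduce<add>(0, [1, 2, 3, 4]);
System.print(sum);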

We now calculate the resulting averages and print them.

foreach (rating,count in counts.backClicks) {
    System.print("Average back clicks for rating " :: rating :: ": " :: count/counts.numRatings[rating]);
}

This results in the following output:

Average back clicks for rating 1: 0.12140014048232264
Average back clicks for rating 2: 0.2457916287534122
Average back clicks for rating 3: 0.49056603773584906
Average back clicks for rating 4: 1.007278020378457
Average back clicks for rating 5: 1.8350604490500864 

And indeed we see that the average number of back clicks is highest for a rating of 5.

References

  • Jure Leskovec and Andrej Krevl: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data, 2014.
  • Robert West and Jure Leskovec: Human Wayfinding in Information Networks. 21st International World Wide Web Conference (WWW), 2012.
  • Robert West, Joelle Pineau, and Doina Precup: Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts. 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009.