New R Package: NetworkInference


27 Mar 2017

Introduction

This R package provides an implementation of the netinf algorithm (see here for more information and the original C++ implementation). Given a set of events that spread between a set of nodes the algorithm infers the most likely stable diffusion network that is underlying the diffusion process. The example used in this tutorial, is policies being adopted by US states. The algorithm is used to infer the most likely diffusion pathways for policies in the US. Which states are most likely inspired by which other states to implement certain policies?

Installation

You can install the package conveniently from CRAN:

install.packages("NetworkInference")

The latest development version can be installed from (bug reports go here, too) github:

#install.packages(devtools)
devtools::install_github('desmarais-lab/NetworkInference')

Tutorial

This is a quick tutorial to get started with the package. For more detailed information on the algorithm and functionality of the package see the package documentation.

netinf infers the optimal diffusion network from a set of nodes and a number of so called cascades. A cascade is a series of events occurring at a specified time. For this demo we will use the data presented in Desmarais et al. (2015). In this paper Desmarais et al. infer a latent network for policy diffusion based on adoption of 187 policies in the US states. In this case a node in the network is a state, a cascade refers to a specific policy and an event is the adoption of said policy in a state.

Preparing the Data

The data for Desmarais et al. (2015) is available in the package:

library(NetworkInference)
# Load the `policies` dataset (?policies for details).
data(policies)
state_names <- rownames(policies)

In this case the data is stored in a matrix. Each row corresponds to a state and each column corresponds to a policy. The cell entries indicate the year a state adopted a policy (or NA in case the policy was never adopted). Let’s look at the upper left corner of this matrix:

policies[1:7, 1:7]
  equalpay conacchwy soil fish consgsoil airpol forest
CT 1949 1939 1945 1867 NA 1959 1901
ME 1949 1939 1941 1878 NA 1954 1891
MA 1945 1943 1945 1865 NA 1954 1904
NH 1947 1943 1945 1864 NA 1955 1893
RI 1946 1937 1943 1879 NA 1956 1909
VT NA 1955 1939 1867 NA 1959 1904

Most functionality of the NetworkInference package is based on the cascades data format. So before starting with the analysis we have to transform our data to such an object (other formats for the input data, such as three column data frames with a cascade id column (‘long format’), are supported, too. See ?as_cascade_long for more info if you data looks differently.).

policy_cascades <- as_cascade_wide(policies, node_names = state_names)

The cascades data type is basically a list containing all the data stored in three separate objects.

Visually Inspecting Cascades

It’s always good practice to visually inspect the data before working with it. The NetworkInference package provides functionality to visualize the cascade data.

The function summary.cascades() provides quick summary statistics on the cascade data:

summary(policy_cascades)

## # cascades: 187
## # nodes: 50
## # nodes in cascades: 50
## # possible edges: 2450
## 
## Summary statistics for node length and ties:
##         length  ties
## Min.     10.00  1.00
## 1st Qu.  23.00 11.00
## Median   33.00 19.00
## Mean     33.13 20.18
## 3rd Qu.  46.00 29.00
## Max.     50.00 45.00

The first four lines provide the number of cascades, the number of nodes in the system, the number of nodes involved in cascades (there might be states that we don’t have diffusion data on, but we still want them represented in the dataset) and the possible number of edges in a potential diffusion network (a diffusion edge between nodes u and v only makes sense if there is at least one cascade in which u experiences an event before v). In this example there are 187 policies and 50 states. Each state is involved in at least one policy cascade and a fully connected diffusion network would have 2450 edges.

It also provides summary statistics on the distribution of the cascade lengths (number of nodes involved in each cascade) and the number of ties in the cascades (two nodes experiencing the same event at the same time). For our example, we can see that the ‘smallest’ policy was adopted by 10 states and the ‘largest’ by all 50 states. From the tie summaries we see that there is at least one policy that was adopted by 45 states in the same year.

The plot() method allows to plot cascades with varying degrees of detail. The argument label_nodes (TRUE/FALSE) provides node labels which require more space but provide more detail. The argument selection allows to pick a subset of cascades to visualize in case there are too many to plot. If label_nodes is set to FALSE each event is depicted by a dot, which allows to visualize more cascades simultaneously.

Let’s first look at the visualization with labels. Here we plot two cascades, selected by their name:

cascade_ids <- colnames(policies)
selection <- cascade_ids[c(16, 186)]
plot(policy_cascades, label_nodes = TRUE, selection = selection)

We can also plot more cascades with less detail:

selection <- cascade_ids[5:15]
plot(policy_cascades, label_nodes = FALSE, selection = selection)

This produces a ‘violin plot’ for each cascade with the single diffusion events overplotted as dots. As we already saw in the previous visualization, the policy data has a lot of ties (i.e. many states adopted a policy in the same year) which is indicated by the areas of higher density in the violin plot.

Inferring the Latent Diffusion Network

The netinf algorithm is implemented in the netinf() function. Besides the data, the function takes three parameters.

trans_mod specifies the transition model, or the parametric model according to which the times between diffusion events are distributed. Currently two distributions are available, the exponential and the Rayleigh distribution. For this example we choose the exponential density.

lambda is the scale parameter for the respective distribution.

n_edges specifies how many edges should be inferred. Best practice is to choose a high number of edges first and then look for a drop-off in gained model fit for each added edge. Then we can rerun the algorithm with a lower number of edges. See @gomez2010inferring and @desmarais2015persistent for guidance on choosing this parameter. The number of inferred edges has to be lower than the maximum number of possible edges. An edge u->v is only possible if in at least one cascade u experiences an event before v. This means, that the maximum number of edges depends on the data. The function count_possible_edges() allows us to compute the maximum number of edges in advance:

npe <- count_possible_edges(cascades)
npe

## [1] 648

Let’s run the algorithm with a larger number of edges to see where the improvement drops off significantly:

results <- netinf(policy_cascades, trans_mod = "exponential", n_edges = 100, 
                  lambda = 1)

Let’s take a look at the output of the algorithm. The output is a dataframe containing the inferred latent network in the form of an edgelist:

head(results)
origin_node destination_node improvement
VA TX 1067
CA NC 1043
CA IL 1040
OR KY 1031
WA CO 1022
CA MD 1011

Each row corresponds to a directed edge. The first column indicates the origin node, the second the destination node. The third column displays the gain in model fit from each added edge. Note that the best fitting network would be a fully connected graph, i.e. a diffusion edge between all nodes. However, since we want to infer a sparse network, a model that captures the important diffusion pathways we need to regularize by constraining the number of edges in the network. In order to find a good cutoff, it is good to visualize the gain to check if we can find a sudden drop-off. There is a generic plot method to inspect the results. If more tweaking is required, the results are a dataframe so it should be easy for the more experienced users to make your own plot. With type = "improvement" the improvement from each edge can be plotted.

plot(results, type = "improvement")

After inspecting the improvements, the model can be re-run with the desired number of edges. We choose (arbitrarily) 25 here:

diffusion_network <- netinf(policy_cascades, trans_mod = "exponential", 
                            n_edges = 25, lambda = 1)

In order to produce a quick visualization of the resulting diffusion network we can use the plot method again, this time with type = "network". Note that in order to use this functionality the igraph package has to be installed.

#install.packages('igraph')
plot(diffusion_network, type = "network")

If additional tweaking of the plot is desired, the network can be visualized using igraph explicitly. We refer you you to the igraph documentation for details on how to customize the plot.

library(igraph)
g <- graph_from_data_frame(d = results[, 1:2])
    plot(g, edge.arrow.size=.3, vertex.color = "grey70")