Networks are a 'hot topic' in various disciplines, including mathematics, economics, sociology, political science, statistics and computer science.
Given the diversity of the disciplines involved it comes as no surprise that the terminology used by various scientists does not always align well.
Yet all investigations of networks are somehow built upon graph theory, a branch of discrete mathematics.
Therefore, to understand the following elaborations on how to implement and study networks in Python, a basic understanding of graph theory is very helpful. I have started to write a script introducing the basics here.
Yet I can also recommend the following sources/textbooks:
Some comments on the terminology before we move on: the terms 'network theory' and 'graph theory' are not well distinguished. Here we will use the term graph theory when we refer to the mathematical theory of graphs, where the latter are understood as mathematical objects.
The term 'network theory' will be used when we refer to the theory of networks, where the latter are understood as the relationships between real world entities.
Although this is the only distinction I can somehow make sense of, be aware that the literature uses both terms in various ways, and it is not always straightforward to infer which meaning is intended.
As for most particular application areas, networks are best studied using the classes and functions provided by a specialized module. For the area of networks, a number of different libraries are available:
All of them come with particular advantages and disadvantages.
Here we will use the library networkx
, which is written in pure Python and which is particularly well-suited for beginners.
Yet, all of what you will learn in the context of networkx will also help you in working with the other libraries.
As for most other common libraries there exists a particular import convention for networkx:
import networkx as nx
# Not necessary, but will be used later:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
First we will learn how we create graph objects, which we then can analyze further. As we know from theory, a simple graph $\mathcal{G}$ consists at least of a number of nodes and a number of edges. If not otherwise stated we will only think about simple graphs, and extend the thus acquired techniques to more complex graphs later.
We will consider the following three possibilities of creating simple graphs:
This is usually the least attractive option, but it is useful to get some intuition about the graph objects as provided by networkx
.
We could proceed by first creating an empty graph and then adding vertices and edges:
g = nx.Graph() # create an empty graph
g.add_node("FirstNode")
g.add_edge(1, 2)
g.edges()
g.nodes()
As we can see, we first created a single node called 'FirstNode'.
By adding an edge between the previously non-existent nodes '1' and '2', networkx
automatically added the missing nodes to the graph.
Although networkx
is not well suited for visualization, we might have a look at our new graph:
nx.draw(g, with_labels=True)
We can also create an empty graph, and then fill it with vertices and edges according to an 'edge list':
list_of_edges = [(1,2), (4,6), (2,5)]
g = nx.Graph()
g.add_edges_from(list_of_edges)
print(g.edges())
print(g.nodes())
nx.draw(g, with_labels=True)
It is useful to keep in mind that vertices can be almost anything (any hashable object): numbers, strings, instances of a class we have created before, or hashable collections of data. This makes it easy, for example, to program agent-based models with an explicit network structure.
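For instance, here is a minimal sketch with a made-up `Agent` class whose instances serve directly as vertices:

```python
import networkx as nx

# Hypothetical minimal agent class -- any hashable object can serve as a node,
# and class instances are hashable by default.
class Agent:
    def __init__(self, name, wealth):
        self.name = name
        self.wealth = wealth

a, b = Agent("Ann", 100), Agent("Bob", 50)

g = nx.Graph()
g.add_edge(a, b)  # the two Agent instances become vertices

# The attributes of the agents remain accessible through the graph:
total_wealth = sum(v.wealth for v in g.nodes())
print(total_wealth)  # 150
```

This pattern keeps model state and network structure in one place, which is convenient for agent-based simulations.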
There are a vast number of algorithms that produce particular graphs.
Many of these algorithms are implemented in networkx
and can be used to create new graphs.
There are basically two types of generator functions: deterministic ones, which always return the same graph for a given parametrization, and stochastic ones, whose output varies between calls.
There are a number of functions that help you to create new graphs from existing ones.
For example, the complement of a graph $\mathcal{G}$ is usually denoted by $\mathcal{G}^C$ and has the same vertices, but all edges that were missing in $\mathcal{G}$ are present in $\mathcal{G}^C$.
We can get $\mathcal{G}^C$ the following way:
g_c = nx.complement(g)
plt.subplot(121)
nx.draw(g,node_size=150,node_color='#A0CBE2', with_labels=True)
plt.title("The original graph")
plt.subplot(122)
nx.draw(g_c,node_size=150,node_color='#A0CBE2', with_labels=True)
plt.title("The complement graph")
We can also combine graphs, e.g. via the function compose
:
g_full = nx.compose(g, g_c)
nx.draw(g_full, node_size=150,node_color='#A0CBE2', with_labels=True) # obviously yields a complete graph
There are a number of famous graphs that can be created using particular generator functions.
For example, the graph below is a representation of the marriage relationships between famous Florentine families in the 15th century:
g_famous = nx.florentine_families_graph()
nx.draw(g_famous, node_size=150,node_color='#A0CBE2', with_labels=True)
There are some special graph generating algorithms that always produce the same graph for the same parametrization.
For example, the function cycle_graph
takes one argument, the number of vertices, and then returns a cycle:
g_cycle = nx.cycle_graph(10)
nx.draw(g_cycle, node_size=150,node_color='#A0CBE2', with_labels=False)
It is also possible to generate graphs using a stochastic algorithm. Therefore, every instance produced by an algorithm, even for the same parametrization, will yield a different graph.
One of the most famous examples is the Erdos-Renyi graph. Here, for a fixed number of vertices, every edge exists with a given probability $p$.
Here you see two graphs that both result from the same parametrization (i.e. 10 vertices, and $p=25\%$):
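Such a pair of realizations can be produced as follows (the seeds are arbitrary and only make the example reproducible):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Two Erdos-Renyi graphs with identical parametrization (n=10, p=0.25);
# because edge creation is random, the two realizations typically differ.
g_1 = nx.gnp_random_graph(10, 0.25, seed=1)
g_2 = nx.gnp_random_graph(10, 0.25, seed=2)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
nx.draw(g_1, ax=axes[0], node_size=150, node_color='#A0CBE2')
nx.draw(g_2, ax=axes[1], node_size=150, node_color='#A0CBE2')
```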
Stochastic graph models are widely used to test hypotheses about the mechanisms underlying graph generation in the real world: by comparing random graphs with those observed in reality we can make statements about how likely a certain structure is the outcome of a particular random process.
For example, Erdos-Renyi graphs have certain particular properties, such as a Poisson-like degree distribution. Social networks in the real world do not have such a degree distribution. We can thus conclude that real-world social networks are not the outcome of a purely random edge-creation process.
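As a rough numerical sketch of this property (the parameters below are arbitrary): for an Erdos-Renyi graph the expected degree is $p(n-1)$, and the degree distribution is concentrated around this mean, without the heavy tail typical of real social networks:

```python
import networkx as nx
import numpy as np

# Assumed parameters for illustration; expected degree is p*(n-1) = 9.99
n, p = 1000, 0.01
g_er = nx.gnp_random_graph(n, p, seed=123)

degrees = np.array([d for _, d in g_er.degree()])
print(degrees.mean())  # close to p*(n-1)
print(degrees.max())   # stays moderate: no hubs as in heavy-tailed networks
```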
Finally, we want to create graphs from data. There are many different formats, which you might use to store your graphs. For example, you can store the vertex and edge list in a text document, read this into python and then create the graph as illustrated above.
Yet this is not practical, particularly for large graphs; special file formats for graphs are a better option. Here we will read a graph stored in the widely used and very practical GML format.
You can download the data here. I assume it is stored in a folder called 'output' in your working directory:
file_dir = "output/expl_graph.gml"
ex_graph = nx.read_gml(file_dir)
Conversely, if you want to save a graph in the GML format you can do so using the following command:
# nx.write_gml(ex_graph, "output/test.gml")
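Here is a minimal, self-contained roundtrip sketch (using a temporary file so nothing in your working directory is touched); note that read_gml returns node labels as strings by default:

```python
import os
import tempfile
import networkx as nx

g = nx.cycle_graph(5)

# Write to a temporary file and read it back to verify the roundtrip.
path = os.path.join(tempfile.mkdtemp(), "test.gml")
nx.write_gml(g, path)
g_back = nx.read_gml(path)

# read_gml uses the stored labels, which are strings, as node names.
print(sorted(g_back.nodes()))  # ['0', '1', '2', '3', '4']
```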
Suppose you have created a graph by either of the means we have encountered so far. Now you want to compute some of the standard descriptive statistics you have encountered in your theoretical lessons.
For the further work we will use the following graph (which can be downloaded here):
file_dir = "output/expl_graph.gml"
sim_graph = nx.read_gml(file_dir) # a simple graph
In practice, the best way to compute a measure is to search for the corresponding function in the Networkx documentation.
Here we will go through a number of common descriptive statistics used for network analysis. We will distinguish between information on the graph level (which yields a single number for the whole graph) and information on the vertex level (which yields a single number for every vertex, and corresponds to a distribution on the graph level).
For the formal definitions it is assumed that $\boldsymbol{A}$ is the adjacency matrix of the graph, with $a_{ij}=1$ if $\langle v_i, v_j \rangle\in E(G)$ and $a_{ij}=0$ otherwise.
Number of edges and vertices
The number of vertices
$$n=\#V(G)$$
is given by the method number_of_nodes
or the function nx.number_of_nodes
:
sim_graph.number_of_nodes()
nx.number_of_nodes(sim_graph)
Alternatively, get the list of nodes using the method nodes
and compute the length of the list.
This is usually less efficient, in particular for larger graphs:
len(sim_graph.nodes())
The number of edges $$m=\frac{1}{2}\sum_{ij}\boldsymbol{A}_{ij}$$ is computed accordingly:
sim_graph.number_of_edges()
nx.number_of_edges(sim_graph)
Connectedness
A graph is connected if every node can be reached from any other node, i.e.
$$d(v_i, v_j)\neq \infty \quad\forall\quad v_i, v_j\in V(G)$$.
This can be checked using the function nx.is_connected
:
nx.is_connected(sim_graph)
Diameter
The diameter of a graph is the greatest distance between any two vertices, i.e.:
$$diam(G)=\max\{d(v_i, v_j)\}_{i,j\in V(G)}$$.
It is provided by the function nx.diameter
(which takes quite some time for larger graphs).
nx.diameter(sim_graph)
Average path length
The average shortest path length is just the average of all the distances in the graph, i.e.:
$$\bar{d}(G)=\frac{1}{n(n-1)}\sum_{i\neq j}d(v_i, v_j)$$
It can be computed using the function nx.average_shortest_path_length
.
nx.average_shortest_path_length(sim_graph)
Density
The density tells us how many of the possible edges in the graph are actually present:
$$\rho=\frac{2m}{n(n-1)}$$
It is computed using the function nx.density
:
nx.density(sim_graph)
Transitivity
The transitivity measures the clustering of the graph by relating the number of triangles to the number of connected triples in the graph:
$$\tau=\frac{\sum_{ijk}a_{ij}a_{jk}a_{ki}}{\sum_{i}\delta(v_i)(\delta(v_i)-1)}$$
It is simply given by the function nx.transitivity
:
nx.transitivity(sim_graph)
Clustering coefficient
Another measure for the clustering in a graph is the average clustering coefficient:
$$CC(G)=\frac{1}{n}\sum_i cc(v_i)$$
where $cc$ is computed for every single vertex $v_i$ via:
$$cc(v_i)=\frac{2\, T(v_i)}{\delta(v_i)(\delta(v_i)-1)}$$
where $T(v_i)$ is the number of triangles containing $v_i$ and $\delta(v_i)$ is the degree of $v_i$ (i.e. the number of adjacent vertices, see below).
The clustering coefficient is provided for by the function nx.average_clustering
:
nx.average_clustering(sim_graph)
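Transitivity and the average clustering coefficient are related but generally not equal, as a small toy graph (a triangle with one pendant vertex, chosen for illustration) shows:

```python
import networkx as nx

# A triangle (1,2,3) with a pendant vertex 4 attached to vertex 3.
g = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4)])

# The two clustering measures weight vertices differently:
print(nx.transitivity(g))        # 3 * triangles / triples = 3/5
print(nx.average_clustering(g))  # (1 + 1 + 1/3 + 0) / 4 = 7/12
```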
Average degree
The average degree is just the average of all vertex degrees in the graph:
$$\bar{\delta}(G)=\frac{2m}{n}$$
While there is no dedicated networkx function for it, its computation is straightforward:
np.mean([i[1] for i in sim_graph.degree()])
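As a quick check of the identity $\bar{\delta}(G)=2m/n$ above, here is a sketch on a cycle graph, where every vertex has degree 2:

```python
import networkx as nx
import numpy as np

g = nx.cycle_graph(10)  # every vertex in a cycle has exactly two neighbors

mean_degree = np.mean([d for _, d in g.degree()])
print(mean_degree)                                    # 2.0
print(2 * g.number_of_edges() / g.number_of_nodes())  # 2.0, the same by the identity
```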
Degree distribution
More interesting is often the degree distribution, which can be obtained from the method degree
, and which can then best be illustrated using a histogram:
degs = [i[1] for i in sim_graph.degree()]
fig, ax = plt.subplots(figsize=(12, 9))
ax.spines["top"].set_visible(False) # Remove plot frame line on the top
ax.spines["right"].set_visible(False) # Remove plot frame line on the right
ax.get_xaxis().tick_bottom() # Remove ticks on the bottom
ax.get_yaxis().tick_left() # Remove the ticks on the left
ax.hist(degs, color="#3F5D7D", bins='auto');
As you can see, there are only very few vertices with high degrees, and many with relatively small degrees -- a typical result for social networks. In such cases, we say that the degree distribution is heavy-tailed.
Neighborhood
The neighborhood of vertex $v_i$ is the set of all vertices that are adjacent to $v_i$.
It is given by the function nx.all_neighbors
, which takes two arguments: first the graph, then the vertex:
list(nx.all_neighbors(sim_graph, "1")) # all_neighbors returns an iterator, so we wrap it in list() to show all neighbors
Connectedness
Two vertices are connected if there is a path between them, i.e.:
$$d(v_i, v_j)\neq \infty$$
To investigate this, we use the function nx.has_path
, where the first argument is the graph and the second and third arguments are the two vertices in question:
nx.has_path(sim_graph, "0", "1")
Distance between two vertices
In an unweighted graph, the distance between two vertices is the length of the shortest path between them. In a weighted graph, it is usually the smallest sum of edge weights along any path between the two vertices.
It is computed via the function nx.shortest_path
, which takes three arguments: first, the graph, second, the origin vertex, and third, the destination vertex:
nx.shortest_path(sim_graph, "0", "1")
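Note that nx.shortest_path returns the sequence of vertices along the path; if you only need the distance itself, the related function nx.shortest_path_length returns it directly. A small sketch on a cycle graph:

```python
import networkx as nx

g = nx.cycle_graph(6)

# shortest_path returns the vertex sequence along one shortest path ...
path = nx.shortest_path(g, 0, 3)
print(path)  # e.g. [0, 1, 2, 3]

# ... while shortest_path_length directly returns the distance d(v_i, v_j).
print(nx.shortest_path_length(g, 0, 3))  # 3
```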
Clustering coefficient
The clustering coefficient of a single vertex informs us about how well the neighbors of the vertex are themselves connected. The maximum clustering is achieved if all neighbors of the vertex are neighbors of each other as well:
$$cc(v_i)=\frac{2\, T(v_i)}{\delta(v_i)(\delta(v_i)-1)}$$
where $T(v_i)$ is the number of triangles containing $v_i$. It is calculated using the function nx.clustering
, which takes as a first argument the graph in question, and the vertex as a second argument:
nx.clustering(sim_graph, "0")
Degree
The degree of $v_i$ is just the number of adjacent vertices.
In a weighted network we usually speak of strength, which is the sum of the weights of all incident edges.
In directed networks one might distinguish between in- and out-degree, where the former only counts edges that are pointing towards $v_i$.
sim_graph.degree("0")
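Here is a small sketch of the directed and weighted variants mentioned above (the toy graphs are made up for illustration):

```python
import networkx as nx

# Directed example: in- and out-degree are distinguished.
dg = nx.DiGraph([(1, 2), (3, 2), (2, 4)])
print(dg.in_degree(2))   # 2: two edges point towards vertex 2
print(dg.out_degree(2))  # 1: one edge points away from vertex 2

# Weighted example: the strength is the sum of incident edge weights.
wg = nx.Graph()
wg.add_edge(1, 2, weight=0.5)
wg.add_edge(1, 3, weight=2.0)
print(wg.degree(1, weight="weight"))  # 2.5
```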
Eigenvector Centrality
The centrality of a vertex measures its structural importance -- although keep in mind that the adequacy of this link ultimately depends on how exactly you define 'importance'.
Of course, the simplest measure of the influence of a vertex could be its degree as defined above, but in practice degree often turns out to be of limited informative value. More informative is a second-order measure, in which the connections of a vertex are weighted by the degrees of the vertices it connects to. In other words: a connection to a vertex that itself has many connections counts more than a connection to a vertex with few connections. This logic leads to the concept of eigenvector centrality, because the recursive calculation of importance can be expressed as the problem of finding an eigenvector. There are different ways of calculating this; one of the most well-known is the PageRank algorithm, which underlies the Google search engine and determines which websites are ranked highest in the search results.
In networkx, we can get the eigenvector centrality of a vertex using the function eigenvector_centrality
, which takes one mandatory argument, the graph, and returns a dictionary containing the eigenvector centrality for each vertex in the graph.
This might then be visualized using a histogram:
eigen_centralities = nx.eigenvector_centrality(sim_graph).values()
fig, ax = plt.subplots(figsize=(12, 9))
ax.spines["top"].set_visible(False) # Remove plot frame line on the top
ax.spines["right"].set_visible(False) # Remove plot frame line on the right
ax.get_xaxis().tick_bottom() # Remove ticks on the bottom
ax.get_yaxis().tick_left() # Remove the ticks on the left
ax.set_title("Eigenvector centrality")
ax.hist(eigen_centralities, color="#3F5D7D", bins='auto');
By the way, centrality and its impact on influence (a measure of dynamic importance) is a very interesting area of social research. For example, in communication networks such as Twitter, it is not so important what degree a single node has. More important is the number of connections the neighbors of a vertex have. In the Twitter example: a strong 'opinion maker' or 'influencer' is not necessarily characterized by having many followers. She is characterized by the fact that her followers have many followers.
You might check out this funny video illustrating the idea (I stole this one from Aaron Clauset, and it's actually from this TED talk).
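Since the PageRank algorithm was mentioned above, here is a minimal sketch using networkx's built-in karate club graph; like eigenvector_centrality, nx.pagerank returns a dictionary of scores:

```python
import networkx as nx

g = nx.karate_club_graph()

# PageRank maps each vertex to a score; the scores are normalized to sum to 1.
pr = nx.pagerank(g)
print(round(sum(pr.values()), 6))  # 1.0
print(max(pr, key=pr.get))         # the vertex with the highest PageRank score
```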
Betweenness Centrality
Another interpretation of structural importance is the following: a vertex is important when it connects two large communities, which would remain unconnected if the vertex was not there.
For example, a scientist might be structurally important if she links two otherwise unconnected research communities.
This can be measured using the concept of betweenness centrality.
This can be done using the function betweenness_centrality
, which works similarly to the one for eigenvector centrality:
between_centralities = nx.betweenness_centrality(sim_graph).values()
fig, ax = plt.subplots(figsize=(12, 9))
ax.spines["top"].set_visible(False) # Remove plot frame line on the top
ax.spines["right"].set_visible(False) # Remove plot frame line on the right
ax.get_xaxis().tick_bottom() # Remove ticks on the bottom
ax.get_yaxis().tick_left() # Remove the ticks on the left
ax.set_title("Betweenness centrality")
ax.hist(between_centralities, color="#3F5D7D", bins='auto');
Although graph visualization is one of the weaknesses of networkx, some simple visualizations are possible.
If you really want to make nice visualizations you should look for specific software packages dedicated to this task.
For now we just take a simple look at how we can plot networkx graphs using matplotlib
.
For more inspiration, you might want to have a look at the examples in the networkx visualization gallery.
The relevant function here is draw_networkx
or one of its many aliases.
The most commonly used arguments to the function are:
- with_labels: can be True or False, depending on whether you want the labels to be drawn
- pos: the layout according to which the vertices should be arranged in the figure (see here for a small overview)
- ax: can be an axes object as produced by matplotlib; useful if you want to draw several plots within one figure
- node_size: either a number specifying the vertex size for all vertices, or a list with one entry per vertex; useful if you want to draw vertices according to their degree, for example
- node_color: the color for the vertices
- width: the line width for the edges
- style: the style for the edges (e.g. dashed, solid, or dotted)
- font_size and font_color: specify size and color for the fonts
- alpha: sets the transparency

As an example, we visualize the pre-defined graph of Zachary's Karate Club, which represents the friendship network among the members of a karate club, and we plot the labels, choose a nice color for the nodes (using the very useful color picker), and set all the other keywords, except pos:
soc_graph = nx.karate_club_graph()
fig, ax = plt.subplots(figsize=(12, 9))
nx.draw(soc_graph,
with_labels=True,
ax=ax,
node_size=75,
node_color='#b3d1ff',
edge_color="#b3e6cc",
width =2.0,
style="dashed",
font_size=10.0,
font_color="#002966",
alpha=0.75)
ax.set_title("Zachary's Karate Club friendship network")
Here are two representations, one where node sizes depend on degree, the other where they depend on eigenvector centrality:
fig, axes = plt.subplots(2,1, figsize=(12, 9))
nx.draw(soc_graph,
with_labels=False,
pos=nx.spring_layout(soc_graph),
ax=axes[0],
node_size=[i[1]*10 for i in soc_graph.degree()],
node_color='#b3d1ff',
edge_color="grey",
alpha=0.95)
axes[0].set_title("Node size dependent on degree")
nx.draw(soc_graph,
with_labels=False,
pos=nx.spring_layout(soc_graph),
ax=axes[1],
node_size=[i*1500 for i in nx.eigenvector_centrality(soc_graph).values()],
node_color='#b3d1ff',
edge_color="grey",
alpha=0.95)
axes[1].set_title("Node size dependent on eigenvector centrality");
Visualizing graphs is very difficult. The following figure illustrates how different the same graph can look if different layouts are used.
fig, axes = plt.subplots(2,1, figsize=(12, 9))
nx.draw(soc_graph,
with_labels=True,
pos=nx.spring_layout(soc_graph),
ax=axes[0],
node_size=75,
node_color='#b3d1ff',
edge_color="grey",
width =2.0,
style="dashed",
font_size=10.0,
font_color="#002966",
alpha=0.95)
axes[0].set_title("The networks using Spring-Layout")
nx.draw(soc_graph,
with_labels=True,
pos=nx.kamada_kawai_layout(soc_graph),
ax=axes[1],
node_size=75,
node_color='#b3d1ff',
edge_color="grey",
width =2.0,
style="dashed",
font_size=10.0,
font_color="#002966",
alpha=0.95)
axes[1].set_title("The network using the Kamada-Kawai Layout");