.. _chap-NetworkX-intro:

=================================
 Introduction to NetworkX Graphs
=================================

.. sectionauthor:: Malcolm Smith <msmith.malcolmsmith@gmail.com>

[status: first draft]


Motivation, Prerequisites, Plan
===============================

.. rubric:: Motivation

As you go about your life, you will often take for granted the vast
networks you are a part of. From your local art club (or your
workplace, or school, etc.) to the entire world, we all live in
relation to one another. To make sense in rigorous terms of this,
mathematicians use what is known as network theory. Understanding and
using it is a very valuable skill to have, as it allows easy visual
understanding of any sort of data or network you're working with.

Situations in which it might arise include modeling economic markets,
where interactions are recorded between people, or in A.I. research,
in which a neural network (a system designed to emulate the way our
brains work) is a very specialized case of a mathematical network. In
general, network theory has a wide range on applications, and having a
basic understanding of it will be helpful to anyone who wants to gain
a deeper insight into their data.

.. rubric:: Prerequisites

* The 10-hour "serious programming" course.
* The "Data files and first plots" mini-course in
  :numref:`chap-data-files-and-first-plots`.
* Having NetworkX and Matplotlib, a very powerful and versatile
  programming library, installed. You can install them with:
  
  .. code-block:: console

	$ sudo apt install pip
	$ pip install --user --upgrade networkx[default]
	$ pip install --user --upgrade matplotlib
	
* For the later parts of the course, you need wget and gzip. You may
  need to download or upgrade wget and gzip, and you can make sure you
  have the latest versions with:

.. code-block:: console

   $ sudo apt install wget gzip
	
.. rubric:: Plan

We're going to start off with learning what a network is with the
Bridges of Konigsberg, as well as the basics of NetworkX, such as
adding nodes and edges as well as graphing the results. We will then
move on to directed and weighted networks, which will come into play
for the final step. After learning these functions, the next thing we
will do is download and extract the data from an online database. we
will analyze the data next, trying to see patterns in the distribution
of the data.


First Steps
===========

Our first step is to make sure all the required libraries are
installed properly.

Open a python 3 prompt and check to see if you properly downloaded
NetworkX and Matplotlib.

.. code-block:: pycon
   
   $ python3
   >>> import networkx as nx
   >>> G = nx.Graph()
   >>> import matplotlib.pyplot as plt
   >>> fig, ax = plt.subplots()

Don't worry about what that means right now; just know it's working if
there isn't an error message.

After getting the libraries squared away, let us explore what a
mathematical network actually is, and some familiar examples. A
network is really a simple object, with lines, or edges, going between
points, or nodes. The famous mathematical problem known as the Bridges
of Konigsberg (which prompted Euler's early work in graph theory) is
easily represented as a network. For those who don't know of the
Bridges of Konigsberg, here is a picture of it:

.. _fig_konisberg_bridges:

.. figure:: bridges.*
   :width: 60%

   The Bridges of Konigsberg.

This can be represented as a network with 4 nodes and 7 edges. The
edges, as you may have guessed, are the bridges, and the nodes are the
landmasses. One thing to note about this is that there are two sets of
repeat bridges. This does not disqualify it from being a network, but
it isn't a type of network we will see in this chapter. There is a
section where we have edges going between the same nodes twice, but
that's in a network in which direction matters, and thus the edges are
not identical. In this case, there are two sets of identical edges, a
case we will not discuss in this chapter.

Next we're going to make a very simple network using NetworkX. To do
this, let us first walk through some basic NetworkX commands.

* ``some_network = nx.Graph()`` will create a network called
  some_network.
* ``some_network.add_node(some_node)`` adds the node some_node to the
  network some_network.
* ``some_network.add_nodes_from([node1, node2, node3])`` will add all
  nodes inside the array in the input to some_network. It can also be
  used with another network.
* ``some_network.add_edge(node1, node2)`` will add an edge from node1
  to node2 in some_network. Note that if node1 or node2 is not in the
  network the command will add them.
* ``some_network.add_edges_from([[node1, node2], [node3, node2]])``
  will add the dges contained in the nested array. It can also be used
  with another network. Like ``.add_edge()``, it will add any nodes it
  contains that are not already present in the network.
* ``some_network.nodes`` will return a tuple with all the nodes in
  some_network in it.
* ``some_network.edges`` will return an array with all the edges in
  some_network as tuples inside it.
* ``nx.draw(some_network)`` will allow matplotlib to draw some_network
  in its current state.
 
With that out of the way, let us create a simple network. To start, we
will use a network with 5 nodes and 7 edges.

.. code-block:: pycon

   >>> import networkx as nx
   >>> import matplotlib.pyplot as plt
   >>> G = nx.Graph()
   >>> G.add_nodes_from(range(5))
   >>> G.add_edges_from([[0,1],[1,2],[2,3],[3,4],[4,0],[0,3],[1,4]])
   >>> nx.draw(G, with_labels=True)
   >>> plt.show()

In this case, the with_labels keyword in the second to last line tells
the program to draw the network with the numbers on the nodes. If
you've done every thing correctly, an image like
:numref:`fig-sample-network` should pop up:

.. _fig-sample-network:

.. figure:: simple_network.*

   First visualization of network.

This doesn't look very nice, though. To fix this, we need a bit more
code:

.. code-block:: pycon

   >>> fig, ax = plt.subplots() # necessary for using a title on the graph
   >>> nx.draw_circular(G, ax=ax, with_labels=True)
   >>> ax.set_title('Simple network')
   >>> plt.show()

This should produce the following:

.. _fig-named-network:

.. figure:: circular_network.*

   Network with title.


Directed and Weighted Networks
==============================

Now that we understand the basics of using NetworkX, let us use it for
something slightly more complex. Using a directed network is also the
next step towards making the complex network at the end.

You can use the same code as before, but replace ``G = nx.Graph()``
with ``G = nx.DiGraph()``. That should produce something like
:numref:`fig-directed-network`:

.. _fig-directed-network:

.. figure:: Directed_network.*

   Arrows show the direction of the edge.

To add weights, we can make a new graph in a very similar manner:

.. code-block:: pycon

   >>> G.clear()
   >>> G = nx.DiGraph()
   >>> G.add_nodes_from(range(5))
   >>> G.add_weighted_edges_from([[0,1,2],[1,2,1],[2,3,1],[3,4,3],[4,0,10],[0,3,0.5],[1,4,0.1]])
   >>> fig, ax = plt.subplots()
   >>> nx.draw(G, ax=ax, with_labels=True)
   >>> ax.set_title('Weighted and Directed Network')
   >>> plt.show()

The ``.add_weighted_edges_from()`` function is very similar to
``.add_edges_from()``, but it take an array of arrays with three
elements, rather than two. The third is the weight, which must be a
positive number.

.. _fig-directed-and-weighted-network:

.. figure:: W_and_D_network.*

   The higher the weight, the shorter the line between nodes.
	    
As you can see, the high weight edge from 4 to 0 is by far the
shortest in the image, followed by 0 to 1 and 3 to 4. 0 to 3 and 1 to
4 should be far and away the longest, but because the network has
certain constraints placed on its shape, they end up being more or
less the same length as 1 to 2 and 2 to 3.

Now, the next question is obvious. What happens when two edges going
opposite directions between the same node are weighted? For instance,
what would happen if we added an edge going from 0 to 4 with a weight
of 0.1? The code to add this is ``G.add_edge(0, 4, weight=0.1)``. Note
that because we are just adding one edge, we don't have access to the
``.add_weighted_edges_from()`` function and have to manually specificy
that we want to add a weight. The resulting graph is pictured in
:numref:`fig-bidirectional-weighted-network`:

.. _fig-bidirectional-weighted-network:

.. figure:: BWN.*

   A network with a double edge between nodes 0 and 4.

While the [0,4] edge is slightly longer than before, the real change
is in all the other lengths of edges. This is because the
``nx.draw()`` function models the edges as springs, and now there are
two springs in that space, which affects the other springs around
it. To explain what this means, imagine a simple scenario where you
have two points (with no mass) connected by a spring. If you let it
come to a resting state, those two points will be some distance
apart. If we add a thrid point, though, and connect it to the other
two points with springs as well, something new will happen. Unless we
choose the precise parameters for the springs (in the case of NetworkX
these parameters come from the weight), no matter what at least one of
the springs will be either stretched or compressed. This is, in
effect, how NetworkX models networks. This is computationally
intensive, as you might imagine, but as we will see later it can give
valuable insights into network structure.

Using Larger Data Sets
======================

Now that most of the basics are out of the way, let us write a script
that can get useful information from large data sets with thousands of
nodes. If you have a powerful computer, you should be able to graph
the whole network with little trouble. However, if your computer is
less powerful, you should use the optional code that samples a smaller
version of the data set.

To start, we need a dataset. The one we will be using is a network of
bitcoin users rating how trustworthy they find another person. It's
directed and weighted, with some edges that go both directions. The
page where the data is downloadable is here:
https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html , but don't
feel the need to download the data right now as we will be doing that
in a later step.

First, you're going to want to make a directory to house the data. To
do this open a new terminal and run:

.. code-block:: console

   $ mkdir NetworkX-intro
   $ cd NetworkX-intro

Of course, feel free to call your directory whatever you
want. However, in the future, we will reference as NetworkX-intro/ ,
so make sure you change that if you have a different directory
name. From now on, make sure you save your python files to this
directory as well. Then, download and extract the data with:

.. code-block:: console

   $ wget https://snap.stanford.edu/data/soc-sign-bitcoinalpha.csv.gz
   $ gunzip soc-sign-bitcoinalpha.csv.gz

If, for whatever reason, Stanford ever takes down their database of
datasets, the same dataset can be downloaded (to be put in the wget
command) from the wayback machine archive at `this link
<http://web.archive.org/web/20231227041646/https://snap.stanford.edu/data/soc-sign-bitcoinalpha.csv.gz>`_
.

If you run ``ls`` now your directiory should contain the file
``soc-sign-bitcoinalpha.csv`` . Now open your prefered programming editor
and make a new python file. we will call it graph_bitcoin_data.py. At
the top, import the necessary libraries with:

.. code-block:: python

   import networkx as nx
   import matplotlib.pyplot as plt

Now we can attempt to extract the data from the .csv file. To do this,
we can use the following code:

.. code-block:: python

   import networkx as nx
   import matplotlib.pyplot as plt

   with open('soc-sign-bitcoinalpha.csv') as bitcoin_data_unformatted:
       for row in bitcoin_data_unformatted:
           print(row)

This should output a huge list of arrays with strings inside them,
something like ``['1234,4321,1,1234567890']``. Now, let us put this in
a network called ``bitcoin_network`` with that third column as
weights:

.. code-block:: python

   with open('soc-sign-bitcoinalpha.csv') as bitcoin_data_unformatted:
       for line in bitcoin_data_unformatted:
           words = line.split(',')
           row = []
	   for i in range(len(words)):
               row.append(int(words[i].strip()))
           bitcoin_network.add_edge(row[0], row[1], weight=row[2])

If you run the program now, it should produce a huge list of arrays
containing numbers, something like ``[1234, 4321, 1,
1234567890]``. Now that we have it properply formatted, let us talk
about what this data actually is. The first and second numbers
represent users of the bitcoin network, and that's going to be an edge
in our network later on. The third number is an integer rating
from -10 to 10 of how much the first person trusts the second. This
will be used as a weight in our network. The fourth is just a
timestamp that we will ignore. Now, let us put all this data into a
network.

Before moving on to the final program, we should address the size of
the file. 21,000 edges is a LOT. This means it's hard for your
computer to render (unless you happen to have a high end gaming
computer or something similar). This means some of the edges have to
be removed. We could iterate through them and remove them at random,
but a more efficient method would be to remove some nodes. This is
because removing a node will also remove all the edges attached to it,
and there are a lot less node to run through. Some example code below
illustrates how we might remove 1500 nodes from a network:

.. code-block:: python

   import random

   nodes = list(bitcoin_network.nodes)
   random.shuffle(nodes)
   for i in range(1500):
       network.remove_node(nodes[i])

Combining all of this, plus a ``main()`` function to tie it all
together, we get listing :numref:`lst-visualize-large`:

.. _lst-visualize-large:

.. literalinclude:: visualize_large_network.py
   :language: python
   :caption: visualize_large_network.py - show data with some number of nodes removed

This should produce, after a couple seconds of waiting, something that
looks like :numref:`fig-simple-bitcoin-network`:

.. _fig-simple-bitcoin-network:

.. figure:: simple_bitcoin_network.*

   A stripped-down visualization of the data.

As you may notice, not all of the nodes are connected with edges to
anything. This is because as we remove the nodes they were attached
to, they edges disappear but the nodes stay. This problem get less and
less noticable as you remove less nodes, as can be seen below with no
nodes removed in :numref:`fig-full-bitcoin-network`:

.. _fig-full-bitcoin-network:

.. figure:: full_bitcoin_network.*

   All the data, graphed.

As you can see, the form is fundamentally the same, so don't worry if
you have disconnected nodes. Now, let us analyze what you can learn
from this figure. It resembles a hub-and-spoke diagram, with a very
concentrated center and edges radiating outward (in practice, many of
those "outward" lines go in both directions, but that isn't super
important for this). From what we learned about the way NetworkX
graphs weighted networks, we know that the low-weight (and in this
case, low-trust) edges are longer than the high-weight ones. This
means that in this diagram, those that were found to be untrustworthy
tended to only trade once or twice. There are exceptions- for example
that node in the lower right with the thick, dark line coming out of
it- but for the most part it's a core of trustworthy people trading
within themselves, and occasionally an untrustworthy newcomer will
trade once or twice before not coming back. And you can see that when
the members of the outer shell traded with each other, those lines are
by far the longest- indicating that they mutually distrusted each
other and the line got longer because of it.

Analyzing with Python
=====================

Now, let us do some analysis of this data. We started analyzing how
the form of the graph showed how people trusted one another in the
previous section. Now, let us use python to do that for us.

To start, we will graph the average trust of every node. Note that you
do not need to remove any nodes from this dataset because the computer
has to do a lot less work per node. We're going to do this by
calculating the average weight of a node's predecessors, then sorting
that list and graphing it. This can be accomplished with:

.. literalinclude:: show_avg_trust.py
   :language: python
   :caption: show_avg_trust.py - create a line graph of each user's average trust.

When run, this function should produce something similar to
:numref:`fig-avg-trust`:

.. _fig-avg-trust:

.. figure:: average_trust.*

   A line graph of each user's average trust.

Next, let us compare number of interactions with trust. This should be
fairly simple, we can create a list of numbers of interaction and then
sort it using the average trust array as a key. This can be
accomplished by changing the ``show_avg_trust()`` function to:

.. code-block:: python

	    def show_avg_trust(filepath):
		trust_dict = {}
		interactions_dict = {}
		avg_trusts = []
		num_interactions_list = []

		Data = convert_data_to_network(filepath)
		for node in Data.nodes:
		    avg_trust = get_avg_trust(Data, node)
		    if avg_trust != None:
		    num_interactions = len(list(Data.adj[node]))
		    trust_dict[node] = avg_trust
		    interactions_dict[node] = num_interactions

		node_order = sorted(trust_dict, key=trust_dict.get, reverse=True)
		for node in node_order:
		    avg_trusts.append(trust_dict[node])
		    num_interactions_list.append(interactions_dict[node])

		fig, axs = plt.subplots(2)
		axs[0].plot(range(len(avg_trusts)),avg_trusts)
		axs[1].plot(range(len(num_interactions_list)), num_interactions_list)
		plt.show()

When run, this will produce :numref:`fig-trust-and-interactions`:

.. _fig-trust-and-interactions:

.. figure:: trust_with_interactions.*

   The average trust (top) and the number of interactions that node
   has (bottom).

There's a lot to be learned from this figure, but something we should
look out for first: Given that the flat line at the trust of 1 is
perfectly flat, it's safe to assume something isn't normal with this
group. It's possible that if a user gave no input for trust, it simply
gave a trust of one. This also explains the oddly low number of
interactions in this range, as it's unlikely that someone who traded a
lot would never get rated.

With that out of the way, let us move on to the results. As one would
expect, the extreme ends had very low numbers of interactions. This
can be explained by the fact that it's difficult to get a perfect 10
or -10, and even a single user giving you a different rating would
change your average. If we ignore the dead zones caused by the 1 flat
and other similar flats (from nodes 750 to 1000, for example), the
data looks almost normally distributed. Again, upon reflection, this
makes sense; the more you trade, the more likely that your going to be
near the middle of the trust spectrum. Of course this is not a perfect
rule, and the user with the most trade actually has a trust rating
around 1.5 instead of 0. However, the general pattern is fairly easy
to understand.

Moving on to our final analysis, let us make a bar graph with trust on
the x axis. To do this, we will make a function that will divide the
data up into some number of baskets, like so:

.. literalinclude:: show_avg_trust_bar.py
   :language: python
   :caption: show_avg_trust_bar.py - creates a bar graph of data with specified baskets.

When run, this produces :numref:`fig-trust-bar`:

.. _fig-trust-bar:

.. figure:: trust_bar.*

   A bar graph of bitcoin trust levels with 20 baskets.

At first glance, it may appear that this data is also skewed by the
plateau at 1. However, if you look closely, there are around 800 nodes
in the 1 - 2 trust rating basket, and almost 2000 in the plateau. The
reason for the nodes not being included in the data lies in line 15 of
``show_avg_trust_bar()``; neither operator has a '=' in it, meaning if
a node has exactly the value of one of the cutoffs, the set won't
include it. This is negligible for good data, because the chances of a
trust score landing on exactly an integer are so low it doesn't affect
the data. However, the bad data all equals exactly one. This means the
bar graph just skips over it and ignores it.

Moving on to an actual analysis of this graph, it looks like a normal
distribution with a very low standard deviation, with a mean somewhere
between 1 and 2. It's not perfect, for example it trails off faster to
the right than left, possibly indicating a general atmosphere of
distrust on this website. This shape can be explained by considering
most people on the website only traded a couple times, and it's hard
to get a reputation as trustworthy or untrustworthy that fast. Of
course, the extreme edges are also mainly from people who traded once
or twice, for the reason that someone might decide that they found
someone really trustworth or untrustworthy. That's just human nature
messing with the data, though, and the outliers don't even appear on
the bar graph.

Conclusion
==========

In this mini course, we learned to use NetworkX by making a drawing
simple graphs. We then moved on to downloading and exploring a larger
data set, one from an early bitcoin network. It contained directed and
weighted edges, over 21,000 of them.

We then tried to develop an intuition for interpreting what the graph
could indicate about the data, and then did more rigorous analysis
with various matplotlib plotting routines. By the end, you hopefully
have a better understanding of both NetworkX and data analysis with
NetworkX.

It is important to note that in the final section we moved away from
directly using NetworkX, and used it as an intermediary in the
analysis steps. However, to even try to do the analysis, NetworkX is
an invaluable tool. It gives a rough idea of what you can analyze, and
also give insights into the dataset's structure.