Lecture 8-9: Graphs

A graph is a set of vertices connected with edges. Every edge connects two vertices; in undirected graphs the connection is two-way, while in directed graphs the connection goes from the source vertex to the target vertex. We also consider graphs where edges have additional associated values (usually called weights or costs); in the algorithms presented here we always assume that the costs are greater than 0. We usually denote the number of vertices by $n$ and the number of edges by $m$; we use these conventions when giving the computational complexity of our algorithms.

Why are graphs useful?

For a data scientist, probably the most natural examples of graphs to consider are social networks. The vertices are people, we have an edge between $A$ and $B$ if they are friends on Facebook (undirected graph), or we have an edge from $A$ to $B$ if $A$ follows $B$ on Twitter (directed graph).

Road networks are another important example. We would probably model these as weighted graphs -- the weight could be e.g., the time necessary to travel on the given road.

Graphs are also an extremely useful abstract notion. Many computational problems can be stated in terms of graphs, and solved using graph algorithms. For example, this video shows an algorithm playing Manic Miner and solving its levels optimally, i.e., in the shortest amount of time. This is done by modelling every possible state of the game (how much time has passed, where Miner Willy is at the moment, which keys he has already collected) as a vertex of a graph. For each state, we consider all the possible keys that the player could press, and create edges to the states obtained from such keypresses. Solving the level optimally then boils down to finding the shortest path in this graph.

How to represent graphs?

The visualization shows an example graph, and two ways of representing it inside our program: the adjacency matrix and adjacency lists. Adjacency lists are a more compressed representation, and thus they lead to more efficient algorithms in most cases.
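
For concreteness, here is a small sketch of both representations in Python (the graph below is a made-up example, not the one from the visualization):

# a small directed graph on 4 vertices (example data)
N = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# adjacency matrix: adj_matrix[v][w] is True iff there is an edge from v to w
# this takes O(n^2) memory regardless of the number of edges
adj_matrix = [[False] * N for v in range(N)]
for (s, t) in edges:
  adj_matrix[s][t] = True

# adjacency lists: adj_list[v] is the list of targets of all edges going out of v
# this takes O(n+m) memory, so it is more compact for sparse graphs
adj_list = [[] for v in range(N)]
for (s, t) in edges:
  adj_list[s].append(t)

With adjacency lists, going over all the edges leaving a vertex takes time proportional to their number, which is exactly what the traversal algorithms below need.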

Shortest/cheapest path algorithms

A path in a graph from vertex $s$ to vertex $t$ is a sequence of vertices $v_0, \ldots, v_l$ such that $v_0=s$, $v_l=t$, and there is an edge from $v_i$ to $v_{i+1}$ for every $i<l$. In unweighted graphs, we are usually interested in the shortest path (i.e., one where $l$ is the smallest); in weighted graphs, we want to know the cheapest (lightest) path, i.e., one such that the total weight (cost) of all the edges on this path is the smallest. In social network analysis, this tells us how distant two people are; in a road network, this tells us how to reach $t$ from $s$ in the cheapest way. The visualization shows how the BFS, DFS, and Dijkstra algorithms work step-by-step.

Unweighted graphs: Breadth First Search (BFS)

We will compute the shortest paths from a given source vertex $s$ to every vertex $w$. For unweighted graphs, we can do this with a very simple algorithm, called Breadth First Search (BFS). The idea of BFS is that first we find all the vertices at distance 1, then we try their outgoing edges to find all the vertices at distance 2, then we try their outgoing edges to find all the vertices at distance 3, and so on. We use a queue: we can insert objects into a queue, and later remove them; the first item added is the first item removed, the second item added is the second item removed, and so on. The time complexity is $O(n+m)$, and the memory complexity is $O(n)$.

A similar algorithm is Depth First Search (DFS), where we use a stack instead of a queue. A stack is a "last-in-first-out" structure: when we remove an element, we remove the one which was inserted most recently. Thus, we find a vertex at distance 1, then we go along one of its edges to find another vertex, then along one of its edges to find yet another unvisited vertex, and so on. The time and memory complexity of DFS is the same as for BFS. While DFS will not compute the shortest paths correctly, both BFS and DFS can be used to check which pairs of vertices are connected (by any path). However, DFS has its specific uses in more complex graph algorithms, and it might be easier to implement in some situations (a stack is a simpler structure than a queue, and we could also use recursion instead of an explicit stack structure).
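
A minimal sketch of BFS in Python could look as follows (assuming the adjacency list representation adj_list as in the sketch above; the names bfs and dist are chosen just for this example):

from collections import deque

def bfs(adj_list, s):
  # dist[v] is the length of the shortest path from s to v, or None if v has not been reached yet
  dist = [None] * len(adj_list)
  dist[s] = 0
  queue = deque([s])
  while queue:
    v = queue.popleft()        # first in, first out: vertices closest to s are processed first
    for w in adj_list[v]:
      if dist[w] is None:      # w has not been visited yet
        dist[w] = dist[v] + 1
        queue.append(w)
  return dist

To obtain a DFS-style traversal, it is enough to replace popleft() with pop(), which removes the most recently inserted element -- but then the computed values are no longer shortest distances.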

Weighted graphs: Dijkstra's Algorithm

Now, consider a weighted graph, where we want to compute the cheapest path from a given $s$ to every vertex $w$. BFS will not compute these correctly -- it finds the shortest path, which is not necessarily the cheapest one. We could fix BFS by inserting a vertex $v$ into the queue again whenever we find a path to $v$ which is cheaper than the one we already had; however, this method yields a potentially much less efficient algorithm (it is possible that we will have to insert every vertex many times -- even exponentially many times).

Instead, we have to adjust the idea of BFS so that it works with weights correctly. BFS worked because we considered the vertices in order of their distance, from $s$ itself to the vertices farthest away from $s$. We need to do the same, but ordering by the cost of the cheapest path found so far (call it $dist[v]$) instead of simply the number of edges. The question remains how to find the $v$ for which $dist[v]$ is the smallest. The heap data structure, which we have used in HeapSort, is perfect for this: we insert pairs $(-dist[v], v)$ into the heap, and since the element in the root is always the one with the greatest key, i.e., the greatest $-dist[v]$, the root gives us the vertex with the smallest $dist[v]$. We will have to add an element to the heap at most $m$ times (since the number of elements in the heap never exceeds $m$, this takes time $O(\log m)$ each) and to remove the maximum also at most $m$ times (again, $O(\log m)$ each). Thus, the total running time is $O(n + m \log m)$.

A more careful implementation will, in case the vertex is already in the heap, reduce its key (moving it up the heap) instead of adding the same vertex again -- this reduces the running time to $O(n + m \log n)$; however, this is usually not worth it: first, edges usually do not repeat, so $m < n^2$ and there is no real difference between $\log n$ and $\log m$; second, the implementation is much more complex (in particular, common implementations such as priority_queue in C++ and heapq in Python do not allow changing keys easily).

Additional note: time complexity of the Dijkstra algorithm can be reduced to $O(m+n \log n)$ by using Fibonacci heaps instead of common binary heaps.
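
A minimal sketch of the simple (re-insertion) variant, using heapq, could look as follows. Note that heapq is a min-heap, so we can push the pairs $(dist[v], v)$ directly instead of negating the keys as described above; the representation of the graph (adjacency lists of (target, cost) pairs) is just one possible choice for this example.

import heapq

def dijkstra(adj_list, s):
  # here adj_list[v] is assumed to be a list of pairs (w, c): an edge from v to w with cost c
  INF = 1000000000
  dist = [INF] * len(adj_list)
  dist[s] = 0
  heap = [(0, s)]
  while heap:
    d, v = heapq.heappop(heap)     # the pair with the smallest known distance
    if d > dist[v]:
      continue                     # an outdated copy of v: a cheaper path has been found already
    for (w, c) in adj_list[v]:
      if dist[v] + c < dist[w]:
        dist[w] = dist[v] + c
        heapq.heappush(heap, (dist[w], w))
  return dist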

From everywhere to everywhere: Floyd-Warshall algorithm

In this algorithm, we compute the cheapest paths from every vertex $s$ to every vertex $t$.
# the number of vertices
N = 5

# it is sufficient to have the edges as one big list (source, target, cost)
edges = [(0, 1, 10), (0, 1, 8), (1, 2, 15), (0, 2, 22), (2, 3, 40), (3, 4, 8), (4,0, 100)]

# infinity -- denotes that we have not found any path yet. Actually, we use a very large number (one that is surely greater than the cost of any actual shortest path)
INF = 1000000000

# At all times, delta[i][j] will be the upper bound on the cost of the 
# cheapest path from i to j.

# Initially, we do not know about any paths...
delta = [ [INF] * N for v in range(N)]

# We know that we can use our edges. We use min to deal with multiple edges between a pair of vertices
for (s,t,c) in edges:
  delta[s][t] = min(delta[s][t], c)

# We also know the distance from a vertex to itself is always 0
for v in range(N):
  delta[v][v] = 0

for k in range(N):
  for i in range(N):
    for j in range(N):
      # if delta[i][k] + delta[k][j] (the cost of the best known path from i to j going through k)
      # is smaller than delta[i][j] (the cost of the best known path from i to j),
      # replace delta[i][j]
      delta[i][j] = min(delta[i][j], delta[i][k] + delta[k][j])

# for every (s,t), delta[s][t] will be the cost of the cheapest path from s to t now, or INF if no path exists at all.
The order of the $k$, $i$, $j$ loops is important. The outermost loop has the following invariant: for each $i$, $j$, $delta[i][j]$ is the cost of the cheapest path from $i$ to $j$ among the paths which only go through vertices already considered as $k$ on the way. (This algorithm is often considered a dynamic programming algorithm -- after each $k$ we have a partial solution.) The invariant holds before any $k$ was considered (i.e., we know only about the direct edges). It is sufficient to prove that, if the invariant holds before an iteration, it also holds after the iteration (we then know that, after the loop ends, $delta[i][j]$ is the cost of the cheapest path from $i$ to $j$ without any restrictions). The cheapest path from $i$ to $j$ which may only go through vertices $0, 1, \ldots, k$ either does not actually use $k$ (then $delta[i][j]$ already accounted for this path before iteration $k$), or it does, which means that it is a combination of some path from $i$ to $k$ going only through $0,\ldots,k-1$ and some path from $k$ to $j$ going only through $0,\ldots,k-1$ -- the cost of the cheapest such combination is $delta[i][k]+delta[k][j]$, so our algorithm will find it.

The Floyd-Warshall algorithm runs in time $O(n^3+m)$ and memory $O(n^2)$. We could also run the Dijkstra algorithm $n$ times -- the running time of a simple implementation is then $O(n^2 + nm \log n)$. The Dijkstra algorithm wins for sparse graphs (i.e., ones where $m$ is relatively small), while Floyd-Warshall wins for dense graphs (ones where $m$ is close to $n^2$). A big advantage of the Floyd-Warshall algorithm is its simplicity -- we have to remember the correct order of the $k$, $i$, $j$ loops, though. Unfortunately, because of the rather high complexity, it cannot be used for very large graphs.

Minimal Spanning Tree problem

Suppose we have a weighted undirected graph that is connected, i.e., a path from every vertex to every other vertex exists. We want to choose a subset of edges such that the chosen edges still connect every vertex to every other vertex, and the total cost of the chosen edges is as small as possible. The resulting graph will have no cycles (i.e., it will be a tree) -- if it had a cycle, it would be possible to remove one of the edges on the cycle, and the graph would still be connected. Such a tree is called a minimal spanning tree (MST).

Kruskal Algorithm

For simplicity assume that every edge has a different cost.

Consider the cheapest edge $e$ in the whole graph. It is easy to see that we can safely choose this edge. Indeed, a spanning tree $T$ which does not include this edge cannot be minimal -- adding $e$ to $T$ yields a cycle; if we remove another edge from this cycle, we get a graph which is still connected and whose total cost is smaller than that of $T$.

We repeat this for every edge in the graph, ordered from the cheapest to the most expensive. More precisely, for every edge, we check whether its two endpoints are already connected -- if not, we add it (by a reasoning similar to the one above, we can do this safely).

(General note: algorithms which always take the most promising choice are called greedy algorithms. In general, greedy algorithms can be correct or not; also, proving the correctness of a correct greedy algorithm is often difficult.)

We only need a way to quickly tell whether two vertices $v$ and $w$ are connected. The disjoint set data structure, also called the find-union data structure, solves this problem efficiently. It has three operations: Init($n$), which creates $n$ singleton sets $\{0\}, \{1\}, \ldots, \{n-1\}$; Find($i$), which returns the representative of the set containing $i$ (so $v$ and $w$ are connected iff Find($v$) equals Find($w$)); and Union($i$, $j$), which merges the sets containing $i$ and $j$. We can use a Find-Union structure in the Kruskal algorithm in the following way:
fu = init(N)

mst = []

# consider the edges from the cheapest to the most expensive
for (c,s,t) in sorted((c,s,t) for (s,t,c) in edges):
  # take the edge only if its endpoints are not connected yet
  if find(fu,s) != find(fu,t):
    mst.append((s,t,c))
    union(fu,s,t)

Applications of MST

If the original graph contains the costs of connecting vertices with potential new roads, the MST can be used to find a cheapest road network which allows its users to reach every considered location. Likewise, it could be used to find an optimal communication network.

It can also be applied in taxonomy -- if our vertices are species (for example, species of plants), and the edge cost between $v$ and $w$ corresponds to the difference between these two species (the more similar they are, the smaller the edge cost), the constructed MST (especially when constructed with the Kruskal algorithm) lets us categorize our species into a taxonomical classification, and we could conjecture that the evolutionary tree for our group of species is similar to the constructed tree. This method also has applications in data science -- vertices are our observations, and edge costs correspond to differences between them; since the Kruskal algorithm groups similar observations together, it can be used as a simple clustering algorithm.
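
As a sketch of this clustering idea (a hypothetical helper, not part of the lecture code -- it reuses the init, find and union operations implemented below), we can simply stop the Kruskal algorithm once the desired number of groups remains:

# single-linkage clustering sketch: run the Kruskal algorithm, but stop
# once only the desired number of groups remains
# (the edges would contain the pairwise dissimilarities between observations)
def cluster(n, edges, groups_wanted):
  fu = init(n)
  groups = n
  for (c, s, t) in sorted((c, s, t) for (s, t, c) in edges):
    if groups == groups_wanted:
      break
    if find(fu, s) != find(fu, t):
      union(fu, s, t)
      groups -= 1
  # observations i and j end up in the same cluster iff find(fu, i) == find(fu, j)
  return [find(fu, v) for v in range(n)]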

Find-Union structure: implementation

The Find-Union structure consists of an array $fu$. Every set has a representative -- one of its elements; the Find($i$) operation will always return the representative of the set containing $i$. The idea is that $fu[i]$ tells us where to look when looking for the representative of $i$; we could visualize this information as an arrow which points from $i$ to $fu[i]$. If $fu[i]$ equals $i$, then $i$ itself is the representative; otherwise, $fu[i]$ could be the representative or not -- we might have to check $fu[fu[i]]$, and so on. Union works by taking the representatives of $i$ and $j$, and adding a new arrow between them.
def init(n):
  # initially every element is in its own singleton set, and is its own representative
  return [i for i in range(n)]

def find(fu, i):
  # follow the arrows until we reach the representative
  if fu[i] == i:
    return i
  else:
    return find(fu, fu[i])

def union(fu, i, j):
  # add an arrow from the representative of i to the representative of j
  fu[find(fu, i)] = find(fu, j)
This implementation is not very efficient -- it is quite easy to find a sequence of Union operations which yields the following structure: $0 \leftarrow 1 \leftarrow 2 \leftarrow 3 \leftarrow 4 \leftarrow \ldots \leftarrow n-1$ (i.e., $fu[0]$ equals 0, and $fu[i]=i-1$ for other values of $i$). In this case, Find($k$) runs in time $O(k)$ -- since the Kruskal algorithm calls Find $m$ times, it may run in time $O(nm)$. However, there are two optimizations of the idea above: union by rank (when merging, always add the arrow from the representative of the shallower tree to the representative of the deeper one) and path compression (after Find($i$), redirect the arrows of all the elements visited on the way so that they point directly at the representative). When we do only union by rank, the time complexity of each Find/Union operation drops to $O(\log n)$. When we do only path compression, the amortized time complexity drops to $O(\log n)$. When we do both, the amortized time complexity drops to $O(\alpha(n))$, where $\alpha(n)$ is the inverse Ackermann function. This is a very slowly growing function, which can be considered constant for every practical purpose.

The iterated logarithm of $x$, denoted $\log^* x$, is the number of (binary) logarithms necessary to bring $x$ to 1 or below. For example, for 65536 we have $65536=2^{16} \rightarrow 16=2^4 \rightarrow 4=2^2 \rightarrow 2=2^1 \rightarrow 1$, so $\log^* 65536 = 4$. Since $\log^* 2^{65536} = 5$, we can assume that $\log^* n \leq 5$ for any practical value of $n$. We could also consider the doubly iterated logarithm (by iterating $\log^*$ in the same way), the triply iterated logarithm, etc., for even slower growing functions. The inverse Ackermann function grows slower than any iterated logarithm -- $\alpha(n)$ is roughly the smallest $k$ such that the $k$-iterated logarithm of $n$ is not greater than $k$.
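
A possible sketch of both optimizations, keeping the same init/find/union interface as above (bundling the two arrays into a dictionary is just one way to store the extra rank information):

def init(n):
  # parent[i] tells us where to look for the representative of i;
  # rank[i] is an upper bound on the depth of the tree rooted at i
  return {'parent': list(range(n)), 'rank': [0] * n}

def find(fu, i):
  if fu['parent'][i] != i:
    # path compression: after finding the representative, point i directly at it
    fu['parent'][i] = find(fu, fu['parent'][i])
  return fu['parent'][i]

def union(fu, i, j):
  ri, rj = find(fu, i), find(fu, j)
  if ri == rj:
    return
  # union by rank: attach the tree with the smaller rank below the other one
  if fu['rank'][ri] < fu['rank'][rj]:
    ri, rj = rj, ri
  fu['parent'][rj] = ri
  if fu['rank'][ri] == fu['rank'][rj]:
    fu['rank'][ri] += 1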

Therefore, the Kruskal algorithm works in time $O(m \log m + m \log^* n)$ and memory $O(n)$. The $O(m \log m)$ part comes from sorting; if the array of edges is already sorted, or can be sorted with an efficient algorithm such as CountSort, the time complexity drops to $O(m \log^* n)$.