Lecture 7-8: Data Structures
Data structures are ways of arranging data. In most data structures there is some important property that defines how data is arranged.
This property has to be satisfied after every operation
(this property is called the invariant of the structure -- similarly to loop invariants, which must be satisfied after every iteration of the loop).
So far we have seen the following data structures:
- arrays, which can be unsorted (no invariant) or sorted (the invariant here is that the array must be sorted after every operation),
or fixed size (a contiguous block of memory that cannot grow) or dynamic (we can append elements to it, like the Python list)
- complete binary heap (as used in HeapSort)
In an unsorted array, we can add a new element very quickly, but it takes very long to find anything.
In a sorted array, we can find any given element very quickly (the invariant helps us); however, as we have seen in InsertionSort, adding a new element
takes $O(n)$ time in the worst case (we need to spend extra time to keep the invariant satisfied). This is a common tradeoff in algorithmics (as well as in real life): it takes more time to
make the data more ordered, but then it takes less time to search them!
In this lecture we will explain amortized analysis (a useful technique of computing the total time complexity of a number of operations on a data structure), and
show some data structures that can be used as dictionaries.
Amortized analysis
A typical situation in algorithms using various data structures is that we perform many operations on these
data structures, and, while some of these operations take a comparatively long time to perform, the data
structure is designed in such a way that such slow calls are very rare, and the total running time is
good. Amortized analysis is a tool to deal with such a situation. The idea is similar to that of amortization in bookkeeping, hence the name.
Idea: We assign amortized time complexity to each operation on our data structure in such a way that,
when we consider a sequence of operations, the total "real" time complexity of this sequence will never exceed the
sum of the given amortized time complexities, even though the real time of each individual operation might
exceed its amortized time.
Example: according to TimeComplexity on the Python wiki,
"amortized worst case" complexity of appending an item to a Python list is O(1). This means that, if you perform $k$ append operations, they will take
$O(k)$ time in total. However, it is possible that one individual append can take more than O(1) time by itself.
A useful method of computing the amortized time complexity is the potential method. After each operation our data structure has
its potential $\Phi(S)$, which is a non-negative number depending on its state $S$. If an operation $O$ changes the state from
$S$ to $S'$, we calculate the amortized time complexity of this operation as $a(O) = t(O) + \Phi(S') - \Phi(S)$, where $t(O)$ is the real time.
Intuitively -- the current potential is our savings account, and $a(O)$ is the number of credits we are given to perform the operation $O$.
These credits can either be used right away to pay for the computation (the $t(O)$ part), or they interact with our savings account (the $\Phi(S') - \Phi(S)$ part):
we either deposit unused credits into the account, or withdraw from our savings to cover the extra cost when $t(O) > a(O)$. The potential starts at 0, so this agrees
with the idea given above -- if we perform a sequence of operations $O_1, \ldots, O_k$, we obtain $a(O_1) + \ldots + a(O_k)$ credits in total, and every
unit of time cost used by each $O_i$ had to be paid for, either from $a(O_i)$ or from the savings from an earlier operation.
Example: binary counter
A binary counter has $n$ binary digits. It starts with the value of 0, and each increment operation adds
1 to its value. Thus, our binary counter will have the following values after each consecutive operation:
0000
0001
0010
0011
0100
0101
0110
0111
1000
(The whole idea works with a decimal counter too, but a binary counter is easier to present.)
The increment operation looks at all the digits starting from the last one, until it finds one with the value of 0. That 0 is changed to 1, and all the 1s on the way
are changed to 0.
The worst case time complexity of the increment operation is $O(n)$ -- potentially we could have changed all the digits (going from 0111 to 1000). However, the
amortized time complexity is just $O(1)$. This can be shown using the potential method, where the potential of our binary counter is the number of 1s. Let our elementary
operation be changing the value of one bit. Then, the amortized complexity of each increment is just 2:
value  bits  potential
    0  0000  0
    1  0001  1
    2  0010  1
    3  0011  2
    4  0100  1
    5  0101  2
    6  0110  2
    7  0111  3
    8  1000  1
- For our first increment (from 0 to 1), we get two credits. We use one of them to change 0 to 1, and we save the other one for later, by putting it in our potential.
- For our second increment (from 1 to 2), we get two credits. We use them to change two bits.
- For our third increment (from 2 to 3), we get two credits. Again, we use one of them to change 0 to 1, and we put the other one in our potential.
- For our fourth increment (from 3 to 4), the two credits we get are not sufficient, but we have two credits in our account. We use the two saved credits to change 1s to 0s.
One of the two new credits is used to change 0 to 1, and the other one is saved.
- For our eighth increment (from 7 to 8), the two credits we get are not sufficient, but we have three credits in our account. We use the three saved credits to change 1s to 0s.
One of the two new credits is used to change 0 to 1, and the other one is saved.
In general -- whenever we change a bit from 1 to 0, this uses up one credit, but also reduces the potential by 1, so we can use our savings to pay for this operation.
We use the two new credits to change the next bit from 0 to 1 -- one for the actual operation, and one to increase the potential.
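To make the credit accounting concrete, here is a small sketch in Python (the function names are illustrative, not from any library) that increments an n-bit counter, returns the real cost of each increment, and checks that the amortized cost $t(O) + \Phi(S') - \Phi(S)$ is always 2:
    def increment(bits):
        """Add 1 to the counter; bits[-1] is the least significant bit.
        Returns the number of bit flips (the real cost t)."""
        flips = 0
        i = len(bits) - 1
        while i >= 0 and bits[i] == 1:   # change the trailing 1s to 0s
            bits[i] = 0
            flips += 1
            i -= 1
        if i >= 0:                       # change the first 0 found to 1
            bits[i] = 1
            flips += 1
        return flips

    def potential(bits):                 # Phi = number of 1s
        return sum(bits)

    bits = [0, 0, 0, 0]
    for step in range(1, 9):
        old_phi = potential(bits)
        t = increment(bits)
        print(step, bits, "real cost:", t,
              "amortized:", t + potential(bits) - old_phi)   # always prints 2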
Example: dynamic array
We have already mentioned that Python lists and C++ vectors store their values in a block of consecutive cells in the memory. Thus, for example,
a vector with $n$ elements will have them stored in memory cells numbered $k, \ldots, k+(n-1)$, for some $k$. But what to do when we want to append a new element to
our list? The memory cell $k+n$ could be already used for something else!
To solve this problem we have to allocate a new block of memory where all the values will fit. If we allocate a new block of size $n+1$, we will need to move all the $n$ old values
to it, which requires $O(n)$ time. Not a good choice -- potentially we will have to perform such a move whenever we append a new element, and appending in $O(n)$ time is not
very efficient.
A better solution is to reserve space for the new elements which will potentially come in the future. We allocate a new block of size $2n$ instead of $n+1$. This way,
the next $n-1$ append operations will be very fast, $O(1)$ each. Then we will have to move all the elements again -- an expensive operation, but the next $2n-1$ operations will be
fast. An array growing in this way is usually called a dynamic array; this technique is used both in Python lists and C++ vectors.
As we can see, most of the operations take time $O(1)$, but occasionally a very expensive operation occurs. We can use the potential method to show that the amortized time
complexity of the append operation is $O(1)$. Let the potential $\Phi(S)$ be $2n-m$, where $n$ is the number of elements in the dynamic array and $m$ is the size of the allocated block (the capacity).
- When there is still free space in the block ($n < m$), we use $O(1)$ time to put the appended value in the next memory cell, and the potential increases by 2 -- the total amortized cost is $O(1)$.
- When the block is full ($m = n$), we allocate a new block of size $2m$ and move all the $n$ elements there -- real cost $O(n)$. However, the potential drops
from $n$ to 0, which is enough to cover the cost of this.
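Here is a minimal sketch of the doubling strategy (the class below is illustrative; CPython's list actually over-allocates by a smaller factor, but the analysis is the same):
    class DynamicArray:
        def __init__(self):
            self.block = [None]              # the allocated block, of size m
            self.n = 0                       # number of elements actually stored

        def append(self, value):
            if self.n == len(self.block):    # block full (m = n): allocate a block of size 2m
                new_block = [None] * (2 * len(self.block))
                for i in range(self.n):      # move all n elements -- the O(n) step
                    new_block[i] = self.block[i]
                self.block = new_block
            self.block[self.n] = value       # the cheap O(1) step
            self.n += 1

        def get(self, i):
            if not 0 <= i < self.n:
                raise IndexError(i)
            return self.block[i]

    a = DynamicArray()
    for k in range(10):
        a.append(k * k)
    print(a.get(3), a.n, len(a.block))       # 9 10 16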
Conclusion
For most applications, we can simply ignore the difference between amortized and real time -- thus, for example, if we design an algorithm that performs multiple append operations on
Python lists, and we assume that this operation simply takes time $O(1)$, it will be alright -- even if some rare operations take longer, this does not matter, since
they will be "amortized". One situation where the difference matters is when we want to, say, respond to user input in real time -- if some of their queries take longer than expected
(e.g., because we needed to resize our lists at that particular time), they will notice it, even though the average time is alright!
Dictionaries
A dictionary (aka associative array) is a data structure $D$ which is basically a set of key-value pairs (where keys do not repeat), with the following basic operations:
- Init($D$) -- create a new dictionary,
- Add($D$, $k$, $v$) (aka Insert) -- insert a new key-value pair (possibly removing an older key-value pair with key $k$),
- Check($D$, $k$) -- check if $D$ contains a key $k$,
- Get($D$, $k$) (aka Lookup) -- get the value associated to the key $k$ in $D$.
These are the basic operations -- in the last lecture we have shown how to use a dictionary for an algorithm based on memoization, and we used all of these
operations. In some cases, more operations might be useful -- e.g. Remove($D$, $k$), which removes the key-value pair with key $k$ from the dictionary. We will concentrate on the ones
above, though. Although Python's built-in dictionary type works great for this purpose, it is actually based on rather complex algorithms. This part
of the lecture will give possible implementations of a dictionary.
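For reference, this is what the four operations look like with Python's built-in dict (used here only to illustrate the interface; the rest of this section is about how such a structure can be implemented):
    D = {}                       # Init
    D[10] = 1                    # Add(D, 10, 1)
    D[20] = 4                    # Add(D, 20, 4)
    D[10] = 100                  # Add again: the older pair with key 10 is replaced
    print(10 in D, 30 in D)      # Check: prints True False
    print(D[20])                 # Get: prints 4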
Note: See here for a visualization of Solutions (2) to (6).
Solution (1): Array indexed by keys
Suppose that our keys are $0, \ldots, M-1$. We simply use an array a (e.g., a C++ vector or a Python list) of length $M$. If the key $k$ is currently not in the dictionary,
then a[k] is None; otherwise, a[k] is the value associated with $k$. What are the time and memory complexities of each operation?
- Memory complexity is $O(M)$.
- Init takes $O(M)$ time.
- Add, Check, and Get take $O(1)$ time.
This solution is practical only when $M$ is small. Otherwise, although Add, Check and Get are all very fast, our structure takes a lot of memory, and also a lot of time to initialize.
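A minimal sketch of this solution, assuming the keys are the integers $0, \ldots, M-1$ and that None is never used as a value (so it can mark empty cells):
    class DirectAddressDict:
        def __init__(self, M):               # Init: O(M) time and memory
            self.a = [None] * M

        def add(self, k, v):                 # Add: O(1)
            self.a[k] = v

        def check(self, k):                  # Check: O(1)
            return self.a[k] is not None

        def get(self, k):                    # Get: O(1)
            return self.a[k]

    d = DirectAddressDict(100)
    d.add(42, "answer")
    print(d.check(42), d.check(7), d.get(42))    # True False answer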
Solution (2): Unsorted array of key-value pairs
In this solution, we simply store a list of key-value pairs in our array. For example, a dictionary with value 10 at key 10, and value 30 at key 20, could be stored as
[(10,10), (20,30)], or also as [(20,30), (10,10)]. What are the complexities?
- Memory complexity is $O(n)$ where $n$ is the number of key-value pairs.
- Init takes $O(1)$.
- Add takes $O(1)$ (more precisely, amortized $O(1)$) -- we simply append the element to the end.
- To perform Check and Get, we need to go through the whole array, which takes time $O(n)$.
Thus, although Add is very fast, the Check and Get operations are very slow. Typically, we perform a large number of operations of every type (Add and Check/Get), so we would like
both operation types to be fast -- if one is extremely fast and the other one is slow, it is not sufficient for us.
(To be more precise: Add is so fast only if we know that the key is not yet in the dictionary; if it is, and we want to change it, we have to search for it.)
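A minimal sketch of the unsorted-array dictionary (Add appends blindly, so it is $O(1)$ only under the assumption that the key is not present yet):
    class UnsortedArrayDict:
        def __init__(self):                  # Init: O(1)
            self.pairs = []

        def add(self, k, v):                 # Add: amortized O(1) for a new key
            self.pairs.append((k, v))

        def check(self, k):                  # Check: O(n)
            return any(key == k for key, _ in self.pairs)

        def get(self, k):                    # Get: O(n)
            for key, value in self.pairs:
                if key == k:
                    return value
            return None

    d = UnsortedArrayDict()
    d.add(10, 10)
    d.add(20, 30)
    print(d.check(20), d.get(10))            # True 10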
Solution (3): Sorted array of key-value pairs
Could we reduce the time of Check/Get to $O(\log n)$ by using binary search? Not in the last solution -- since Add always appends new elements to the end of our
array, it is not kept sorted. What if we use a sorted array instead?
- Memory complexity is $O(n)$ where $n$ is the number of key-value pairs.
- Init takes $O(1)$.
- Check and Get work in $O(\log n)$, using binary search.
- Unfortunately, Add is slow. Although we can replace the value (once we know where it is) in $O(1)$, or append a new value to the end in (amortized) $O(1)$, it is difficult to
insert a new element somewhere in the middle -- we need to push all the later elements to the right. Therefore, Add needs time $O(n)$.
We could use a combination of (2) and (3) in the cases where we first add all the elements to the dictionary, and then perform only lookups -- we append to the end ($O(1)$), then sort the whole
array (in $O(n \log n)$), and then use quick lookups ($O(\log n)$ each). However, if we interleave lookups and insertions, we will need to use a better solution.
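A sketch of the sorted-array dictionary using the standard bisect module for binary search (insort also shows why Add is $O(n)$: it has to shift elements to make room):
    from bisect import bisect_left, insort

    class SortedArrayDict:
        def __init__(self):                  # Init: O(1)
            self.pairs = []                  # list of (key, value), sorted by key

        def add(self, k, v):                 # Add: O(n) because of the shift
            i = bisect_left(self.pairs, (k,))
            if i < len(self.pairs) and self.pairs[i][0] == k:
                self.pairs[i] = (k, v)       # replace the value of an existing key
            else:
                insort(self.pairs, (k, v))

        def check(self, k):                  # Check: O(log n)
            i = bisect_left(self.pairs, (k,))
            return i < len(self.pairs) and self.pairs[i][0] == k

        def get(self, k):                    # Get: O(log n)
            i = bisect_left(self.pairs, (k,))
            if i < len(self.pairs) and self.pairs[i][0] == k:
                return self.pairs[i][1]
            return None

    d = SortedArrayDict()
    for k, v in [(20, 30), (10, 10), (15, 7)]:
        d.add(k, v)
    print(d.check(15), d.get(20))            # True 30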
This is a common tradeoff in algorithmics (and also in real life) -- if our data is arranged in an orderly way, it is easy to search for something, however, we have to spend some time to
make it ordered. Thus, we need to find the correct balance between chaos and order. Neither solution (2) nor (3) achieves the correct balance.
Solution (4): A linked list
The problem in solution (2) was that it is difficult to insert an element in the middle of a list. A linked list (commonly called just list -- do not confuse with
the Python list, which is a dynamic array rather than a linked list) is a data structure which allows quick insertions.
We could store the elements in our array as triples: (key, value, next), where next is the index of the next element (it "links" to the next element), or None for the last element.
This allows us to insert new elements quickly. For example, suppose we have: [(10, 1, 2), (30, 9, 3), (20, 4, 1), (50, 25, None)]. a[0] is the first element -- thus, we have 1 at key 10;
then, we have a[2], or (20, 4); then, we have a[1], or (30, 9); then a[3], or (50, 25); and this is the last element.
Suppose we want to insert a new element (40,16) between a[1]=(30,9) and a[a[1][2]] = a[3] = (50,25) -- to do this, we append (40,16,3) as a[4], and change a[1][2] to 4.
We can use this method to add an element in O(1), but only if we know where to insert it. However, our sorted list is now arranged in memory rather chaotically -- and searching will again
take O(n). Hence, linked lists are not useful as dictionaries (however, they are useful for other purposes).
(Note: the (key, value, next) representation, where next is an index into the array, is specific to this presentation -- in most programming languages, you would rather store a 'pointer' or 'reference' to the next element.
In Python, this could look like [10, 1, [20, 4, [30, 9, [50, 25, None]]]], since Python treats lists as references. However, the low-level representation is similar to the (key, value, next) representation,
except that next is an actual address of a memory cell rather than an array index. For many applications, it is also useful to have a link to the previous element.)
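A minimal sketch of the index-based linked list from the example above (each element is a mutable list [key, value, next], where next is the index of the next element, or None):
    nodes = [[10, 1, 2], [30, 9, 3], [20, 4, 1], [50, 25, None]]
    head = 0                                 # index of the first element

    def traverse(nodes, head):               # follow the links: O(n)
        i = head
        while i is not None:
            key, value, nxt = nodes[i]
            print(key, value)
            i = nxt

    def insert_after(nodes, i, key, value):
        """Insert a new (key, value) right after nodes[i] -- O(1) once i is known."""
        nodes.append([key, value, nodes[i][2]])   # the new node points where i pointed
        nodes[i][2] = len(nodes) - 1              # i now points to the new node

    insert_after(nodes, 1, 40, 16)           # insert (40,16) between (30,9) and (50,25)
    traverse(nodes, head)                    # prints 10 1, 20 4, 30 9, 40 16, 50 25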
Solution (5): A binary search tree (BST)
How to combine some of the ideas in Solution (4) with binary search? Instead of arranging our items in a linked list, we arrange them into a tree. Just as every item
in a linked list had a link to the 'next' element, every element in a tree will have a link to its 'left' and 'right' child. (For most applications it is also useful to have a link
to the parent.) For example, here is a tree containing ($k$, $k$) for k = 1..7:
| (4,4)
| / \
| (2,2) (6,6)
| / \ / \
| (1,1) (3,3) (5,5) (7,7)
Every element to the left of the root has a smaller key than the root, and every element to the right has a greater key. The same rule holds for the subtrees -- for example,
every element in the right subtree of (2,2) has a key greater than 2 (but smaller than 4). We can perform lookups by going down our tree -- at every step we know where to go. We can also
do insertions easily -- we simply insert the new element in the location where we looked for it and found nothing.
The BST above could be created by inserting elements e.g. in the following order: 4, 6, 2, 3, 7, 5, 1. Using the (key, value, left, right) representation, we have:
- After inserting (4,4): [(4,4,None,None)]
- After inserting (6,6): [(4,4,None,1), (6,6,None,None)]
- After inserting (2,2): [(4,4,2,1), (6,6,None,None), (2,2,None,None)]
- After inserting (3,3): [(4,4,2,1), (6,6,None,None), (2,2,None,3), (3,3,None,None)]
- And so on...
What is the complexity?
- Memory complexity is again $O(n)$.
- Init takes $O(1)$.
- Insert and lookup take time proportional to the length of the search path.
A BST is quite similar to the simplest implementation of QuickSort -- it works very nicely if we insert elements randomly, as the expected path length is $O(\log n)$ in this case;
however, if we insert them from smallest to greatest key (or, from greatest key to smallest key), the tree will basically degenerate into a list, and both insertions and lookup take
time $O(n)$. Therefore a BST is a good solution only if we know that the elements are inserted in a somewhat random order.
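A minimal sketch of BST insert and lookup using the (key, value, left, right) representation from the trace above (left and right are indices into the array, or None; no balancing is done, so the worst case is $O(n)$):
    def bst_insert(tree, key, value):
        if not tree:                                 # empty tree: the new node is the root
            tree.append([key, value, None, None])
            return
        i = 0
        while True:
            if key == tree[i][0]:
                tree[i][1] = value                   # key already present: replace the value
                return
            side = 2 if key < tree[i][0] else 3      # 2 = left link, 3 = right link
            if tree[i][side] is None:                # found an empty spot: attach here
                tree.append([key, value, None, None])
                tree[i][side] = len(tree) - 1
                return
            i = tree[i][side]

    def bst_get(tree, key):
        i = 0 if tree else None
        while i is not None:
            if key == tree[i][0]:
                return tree[i][1]
            i = tree[i][2] if key < tree[i][0] else tree[i][3]
        return None

    tree = []
    for k in [4, 6, 2, 3, 7, 5, 1]:
        bst_insert(tree, k, k)
    print(bst_get(tree, 5), bst_get(tree, 8))        # 5 None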
Solution (6): A self-balancing binary search tree
As mentioned, binary search trees have $O(n)$ worst-case time complexity. What if we could detect this worst case, and rebuild the tree when it happens, so that we always obtain path
lengths $O(\log n)$? Self-balancing trees are based on this idea.
There are many efficient self-balancing algorithms -- we will sketch the AVL trees, which were historically the first, and probably still the easiest. Let $h(v)$ be
the height of the subtree with root $v$. In an AVL tree, for every vertex $v$ we store $h(v)$ (in addition to key, value, left and right link). We also keep the following invariant:
for each vertex $v$, the difference between $h(l)$ and $h(r)$, where $l$ and $r$ are the left and right children of $v$, is at most 1. (We assume that a vertex with no children has
height 1, and if one of the children does not exist, it is assumed to have height 0 -- thus, the other child, if it exists, must have height 1.)
It can be shown that, if we have $n$ elements in an AVL tree, its height will be $O(\log n)$. Thus, lookups can be done in $O(\log n)$.
When we add a new element $w$ to an AVL tree, we need to update $h(v)$ for each vertex from the root to $w$. If we do this like in a BST, it is possible that we break our invariant --
for example, if we add the elements 1, 2, 3 in that order, then the element 1 (root) will have $h(1) = 3$, a right child with $h(r)=2$ and no left child ($h(l)=0$). This can be solved
with a rotation -- we rebuild the subtree in the following way:
|        |                        |
|        1                        2
|       / \                      / \
|      X   2         =>         1   3
|          / \                 / \ / \
|         Y   3               X  Y Z  T
|           / \
|          Z   T
We need to go from $w$ back to the root, fixing the values of $h(v)$ on the way, and possibly performing a rotation to the left or right. Since this path has length $O(\log n)$,
insertion also works in $O(\log n)$. To sum up:
- Memory complexity is again $O(n)$.
- Init takes $O(1)$.
- Insert and lookup take time $O(\log n)$.
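Here is a small sketch of the rotation step only (illustrative names, not a complete AVL implementation): nodes store key, value, the left and right links and the height $h(v)$, and rotate_left performs the rotation from the diagram above, returning the new root of the subtree so the caller can re-attach it:
    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None
            self.height = 1                  # a single vertex has height 1

    def height(v):                           # a missing child has height 0
        return v.height if v is not None else 0

    def update_height(v):
        v.height = 1 + max(height(v.left), height(v.right))

    def rotate_left(v):
        """Rotate the subtree rooted at v to the left; its right child r
        becomes the new root, and r's old left subtree moves under v."""
        r = v.right
        v.right = r.left
        r.left = v
        update_height(v)                     # v is now below r, so fix its height first
        update_height(r)
        return r                             # the caller re-attaches r in place of v

    # The 1, 2, 3 example from above: the degenerate chain becomes balanced.
    root = Node(1, 1)
    root.right = Node(2, 2)
    root.right.right = Node(3, 3)
    update_height(root.right)
    update_height(root)
    root = rotate_left(root)
    print(root.key, root.left.key, root.right.key, root.height)   # 2 1 3 2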
(Note: the AVL trees are more commonly described not by giving $h(v)$ for each vertex $v$, but rather by giving the balance factor of each vertex -- which corresponds
to $h(r)-h(l)$. The balance factor of every vertex in a correctly balanced AVL is -1, 0, or +1; it may temporarily become -2 or +2 before a rotation.)
Other self-balancing search trees include red-black trees, splay trees, and (non-binary) B-trees. Another very simple self-balancing structure is the treap
(treap = tree + heap) -- every element gets a random priority, and elements with a high priority are rotated to the top. This works similarly to a BST with randomly inserted elements
(but of course we might get long paths if we are unlucky).
Solution (7): A hash table
Solutions (2) to (6) can be seen as one line of development -- we refine our data structure further and further. Hash tables go in a different direction -- they are instead based on Solution (1).
The problem with Solution (1) was that we needed as much memory as we have keys. For example, if our keys are numbers from 0 to 1000000000000, we need to allocate an array of size
1000000000000 -- we can rarely afford that much memory. But what if we allocate an array with just $N$ cells, for example, $N=1000$, and use the cell $h(k) = k \bmod N$ to store the information
about key $k$? The value of $h(k)$ is called a hash, and an array indexed by hash values is called a hash table.
This solution has one big problem -- the possibility of a collision: two keys $k_1$ and $k_2$ have $h(k_1) = h(k_2)$, and thus go to the same cell in our array. We can solve
this problem in the following way: we don't put a single key-value pair in $a[i]$, but rather all the inserted key-value pairs whose hash value is $i$. We need a sub-dictionary
to store these key-value pairs; we could use any of the data structures from points (2-6) for this purpose -- let's use the simplest of them, i.e., an unsorted dynamic array.
How efficient is this? If we are lucky, we will insert roughly $n/N$ elements into each position. Therefore, the "lucky case" complexity is:
- Memory complexity is $O(N+n)$.
- Init takes $O(N)$.
- Insert and lookup take time $O(n/N)$.
But will we be lucky? If we are using the hash function $h(k) = k \bmod N$, then we are likely to get unlucky -- for example, if every key inserted is a multiple of 1000, then
all of them will go to the same cell. However, we should be fine if our hash function is sufficiently "random" -- then we are lucky with a very high probability.
There is a large body of theory on creating good hash functions -- they need to be as "random" as possible, while remaining quick to compute.
What $N$ should we pick? For best results, it should be greater than $n$ ($N>2n$ sounds good), but not too large.
We could use roughly the same trick as in the dynamic array: we try to keep the invariant that $N>2n$; when we insert enough elements that this relation no longer holds,
we allocate a new hashtable, doubling the value of $N$. This way, we obtain the following complexity (if we are lucky and don't get many collisions):
- Memory complexity is $O(n)$,
- Init, insert, and lookup all take time $O(1)$.
Everything is as good as possible -- under the assumption that we are lucky. But we usually are...
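A minimal sketch of a chaining hash table that keeps roughly the invariant $N > 2n$ by doubling (Python's built-in hash is used as the hash function; the class and its names are illustrative):
    class ChainedHashTable:
        def __init__(self, N=8):
            self.N = N
            self.a = [[] for _ in range(N)]  # a[i] holds all pairs whose hash is i
            self.n = 0

        def _bucket(self, k):
            return self.a[hash(k) % self.N]

        def add(self, k, v):
            bucket = self._bucket(k)
            for i, (key, _) in enumerate(bucket):
                if key == k:
                    bucket[i] = (k, v)       # replace the value of an existing key
                    return
            bucket.append((k, v))
            self.n += 1
            if 2 * self.n > self.N:          # table more than half full: rebuild
                self._grow()

        def _grow(self):
            old = self.a
            self.N *= 2
            self.a = [[] for _ in range(self.N)]
            for bucket in old:               # re-insert everything: O(n), but rare
                for k, v in bucket:
                    self._bucket(k).append((k, v))

        def check(self, k):
            return any(key == k for key, _ in self._bucket(k))

        def get(self, k):
            for key, v in self._bucket(k):
                if key == k:
                    return v
            return None

    h = ChainedHashTable()
    for k in range(100):
        h.add(k, k * k)
    print(h.check(17), h.get(17), h.N)       # True 289 256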
Conclusion
Luckily, in most modern programming languages we don't need to implement these complicated data structures by ourselves. In C++, we have the type map for
dictionaries based on red-black trees (a type of self-balancing BSTs), and unordered_map for dictionaries based on hash tables. While self-balancing BSTs
have worse time complexities, they have two advantages: (1) they work well even in the worst case, (2) they allow for some additional operations. For example,
after finding a given key in the dictionary, we can easily (in $O(\log n)$) find the next/previous key in order -- due to the random nature of hash values, this
is impossible in a hash table (hence the name, unordered_map).
Python dictionaries are based on hash tables. This is because the two advantages of self-balancing BSTs mentioned above are almost never relevant in practice, and the designers of Python
have decided to keep it simple. In the rare case where we would prefer a self-balancing BST, this can be achieved by using a third-party library.
In general, there is rarely a need to implement a hash table or a self-balancing BST on our own. One situation where implementing our own AVL tree (or another type of
self-balancing BST) is useful is augmentation -- we extend the tree with additional abilities. The idea of these augmentations is that for every vertex $v$ we also
store some kind of general information about all the elements in the subtree rooted at $v$. For example, if for every vertex $v$ we remember the number of the
elements in the subtree rooted at $v$, we can get to a vertex $w$ which is the $k$-th successor of $v$ in time $O(\log k)$; the normal AVL only allows doing this in $O(k)$.
Try calling the function hash in Python to compute the hash value of numbers and strings. It returns a large number -- this number is then taken modulo the current
size of the hash table. Python uses a different method of resolving collisions than the one explained above: we put only one (key, value) pair in each cell of the hash table;
in case of a conflict, we use the full hash value again to compute another position, and try to insert/search there. We repeat until we either find the key or reach
an empty location. (Of course, this works only if the size of the hash table is greater than the number of elements in it, and is efficient only if it is significantly greater --
this is the reason why we have chosen $N>2n$.)
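A quick way to see this in action (assuming CPython; string hashes are randomized per run, integer hashes are not):
    import sys
    print(sys.hash_info.modulus)    # on 64-bit CPython: 2305843009213693951 == 2**61 - 1
    print(hash(12345))              # small non-negative integers hash to themselves
    print(hash("algorithms"))       # a large number that changes between runs
    print(hash("algorithms") % 8)   # what a table with 8 cells would actually use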
Some extra experiments that can be done with the Python hash function:
- Set $l=2**62-2$. If we compute hash(k*l) for various values of $k$, we always obtain 0. Try to insert keys k*l, for k=1..1000000, into a Python dictionary --
it will be exceptionally slow, due to the number of collisions.
- In newer versions of Python, you can also notice that hash values of strings are different each time you run Python -- this is to protect against "denial of service" attacks, where an attacker
tries to overload a server (and crash it) by sending artificially generated queries whose keys all have equal hash values.