Lecture 10: Hard Problems
When solving an algorithmic problem, we often run into the following natural question: is our algorithm optimal?
We present known results of this kind, and some examples of problems which are known to be hard.
We have already seen some results of this kind:
- In Lecture 4 we learned that sorting based on comparisons cannot be
done faster than $O(n \log n)$. However, maybe we could sort faster if our algorithm is
not based on comparisons? CountSort shows that, for some data, yes! (See the sketch after this list.)
- One of the earlier exercises was about showing that a problem cannot be solved faster than $O(n)$.
Consider searching for a value
in an unsorted array, for example. If we did it faster than $O(n)$, this would mean that we have not looked at the whole
array, so our algorithm cannot be correct. Thus, this problem cannot be solved faster than $O(n)$. A similar reasoning
works for many other problems.
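To make the first point concrete, here is a minimal sketch of counting sort (our own illustration in Python; the lecture's CountSort may differ in details). It sorts small non-negative integers without ever comparing two elements to each other:

```python
def count_sort(arr, max_value):
    """Sort non-negative integers <= max_value in O(n + max_value) time,
    without comparing elements to each other."""
    counts = [0] * (max_value + 1)
    for x in arr:                      # count how often each value occurs
        counts[x] += 1
    result = []
    for value, c in enumerate(counts):
        result.extend([value] * c)     # emit each value as often as it occurred
    return result

print(count_sort([3, 1, 4, 1, 5], max_value=5))  # [1, 1, 3, 4, 5]
```

This beats the $O(n \log n)$ bound only because it exploits the structure of the keys; for arbitrary data accessed through comparisons, the bound still applies.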
While these results were relatively simple, in general, it turns out to be very difficult to show that something is
hard. Still, there is a large class of problems (called NP-complete problems) for which, even if we have no proof,
we have good evidence that they cannot be solved efficiently.
A bit of formalism
We will need a bit more formalism to clearly present the ideas in this lecture. We have defined computational problems
in the first lectures somewhat intuitively; we will define decision problems a bit more formally.
Let $\Sigma$ be a finite set called an
alphabet, and $\Sigma^*$ be the set of finite sequences of symbols
from $\Sigma$. Think of files in a computer -- all files in a computer are sequences of bytes, so $\Sigma$ is
the set of possible values of a byte (0..255) and $\Sigma^*$ is a set of possible files. Alternatively, our
files could be already loaded into memory, and then $\Sigma^*$ would represent the sequences of bytes in our
RAM. In theoretical computer science we abstract away from such low-level technical details, so $\Sigma$ could be any alphabet,
not necessarily the set of bytes.
We will consider decision problems: our algorithm takes an element of $\Sigma^*$ (e.g., it reads a file), and
decides whether an answer to some problem is YES or NO. Thus, a decision problem can be modeled as a subset of $\Sigma^*$
(the set of input files for which our algorithm should answer YES).
For example, consider the problem of reachability in a graph (in the given graph $G$, can we reach vertex 2 from vertex 1?).
We choose some format of representing a graph in a file (e.g. a list of all edges: "1,5;1,4;4,3;3,2"). Then, REACHABILITY is the set
of encodings of graphs where 2 is reachable from 1.
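As an illustration, here is a sketch of a decision procedure for this encoding (our own code; the function name and format details are just one possible choice). The underlying algorithm is an ordinary BFS:

```python
from collections import deque

def decide_reachability(encoding: str) -> bool:
    """Decide REACHABILITY for an edge-list encoding like "1,5;1,4;4,3;3,2":
    is vertex 2 reachable from vertex 1?"""
    neighbours = {}
    for part in encoding.split(';'):           # parse the "file"
        u, v = map(int, part.split(','))
        neighbours.setdefault(u, []).append(v)
    seen, queue = {1}, deque([1])              # standard BFS from vertex 1
    while queue:
        u = queue.popleft()
        for v in neighbours.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return 2 in seen

print(decide_reachability("1,5;1,4;4,3;3,2"))  # True: 1 -> 4 -> 3 -> 2
```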
What is the purpose of this formalization?
- Decision problems (with binary answers) may look simpler than the more general problems, but actually, most of the
algorithmic hardness is already present in decision problems. Restricting ourselves to decision problems will make things
simpler without losing much.
- Representing all kinds of inputs as sequences of characters ("files") allows us to precisely say what the size of the
given input is, and to study complexity as a function of this size.
Time Hierarchy Theorem
The Time Hierarchy Theorem says: if $f$, $g$ are functions that are time constructible and $g$ is significantly greater than $f$,
then there exists a decision problem $L \subseteq \Sigma^*$ which can be solved in $O(g)$ but not in $O(f)$.
A function $f$ is
time constructible iff there exists an algorithm which, on every input of size $n$, runs for exactly $f(n)$ steps.
We do not provide the precise meaning of "significantly greater", as the formulas are a bit sophisticated and may depend on the
model of computation used (e.g. in
Wikipedia this theorem is stated
for Turing machines, not for Random Access Machines). We will just give some corollaries:
- for every $k$ there exists a problem which can be solved in polynomial time, but not in $O(n^k)$. (*)
- there exists a problem which can be solved in exponential time, but not in polynomial time.
However, that's basically all we know!
We do not know any specific,
satisfying example of a computational problem which can be solved in polynomial time,
but
cannot be solved in linear time (and we have a proof of that)! While we have (*), the decision problem from
the Time Hierarchy Theorem is not really satisfying -- it is rather one specifically constructed as a problem which can be
solved in $O(g)$ but not in $O(f)$, not something we would care about otherwise.
One might wonder about the theorem that sorting requires time $\Omega(n \log n)$ -- yes, but only when we restrict ourselves
to algorithms based on comparisons. If we can do more, we only know that sorting requires time $\Omega(n)$! (Also, sorting
is not a decision problem, but this is not really relevant here.)
Complexity class P
P (or
PTIME) is the set ("complexity class") of all decision problems which can be solved in time polynomial in the length of the input.
For example, the Reachability problem mentioned above can be solved in polynomial time (BFS runs in linear time).
The class P is usually considered by complexity theorists to contain the problems which humans can actually solve in practice. From
a practical standpoint, this assumption may be considered a bit doubtful. It is clear that, for input of size $n=1000$,
running time $O(n^3)$ is feasible, while $O(2^n)$ is not. What about $O(n^{20})$ (allowed by P) versus $O(1.001^n)$ (not allowed by P)?
What about quantum computers (which potentially could solve in polynomial time problems which our RAMs cannot)? What about
randomized algorithms (not allowed by P, but there exist algorithms which yield the correct answer with probability $1-(1/2)^{100}$,
which is good enough for all practical uses)? However, despite these shortcomings, the simplicity of the definition of PTIME
yields a very elegant theory.
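The following back-of-the-envelope computation (our own illustration, assuming a machine that performs $10^9$ elementary operations per second) makes the doubt concrete: at $n = 1000$, the polynomial $n^{20}$ is hopeless, while the exponential $1.001^n$ is instantaneous.

```python
# Rough running times for input size n = 1000, assuming 10^9 ops/second.
n = 1000
for label, ops in [("n^3", n ** 3),            # in P, feasible
                   ("2^n", 2 ** n),            # not in P, infeasible
                   ("n^20", n ** 20),          # in P, yet infeasible
                   ("1.001^n", 1.001 ** n)]:   # not in P, yet trivial
    print(f"{label:>8}: ~{ops:.2e} ops, ~{ops / 1e9:.2e} seconds")
```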
Problems believed to not be in P
Here we present some problems which are believed to not be in P:
- Knapsack Problem: given a knapsack with capacity $M$, a value $V$, and $k$ items, the $i$-th of which has size $s_i$ and value $p_i$,
is it possible to fill our knapsack with some of these items (not exceeding its capacity) in such a way that the total value
of these items is at least $V$? (A sketch of the dynamic programming algorithm follows this list.)
Knapsack Problem was already covered in the lecture about Dynamic Programming. We have shown
an algorithm running in time $O(Mk)$. However, our "file representation" allows all the numbers to be quite high: in a file of length 1000,
each of our numbers on the input, including $M$, could be easily 100 digits long. Therefore, our dynamic programming algorithm, in general,
runs in time exponential in the size of the input. We have formulated our problem in the form "can we get a value of $V$ or more" instead of
"what is the highest value we can achieve" to obtain a decision problem.
- Factorization: given two numbers $n$ and $a$, does $n$ have a factor greater than $a$ and less than $n$?
The idea of this problem is that we want to find the prime decomposition of $n$, for example, $1001 = 7 \cdot 11 \cdot 13$; we ask about factors greater
than $a$ to put our problem in the decision format. This example may not be natural for data scientists, but it is important for two
reasons: (a) the hardness of this problem has practical application -- cryptography algorithms (e.g., used in Internet banking to prevent
other people from reading the customer's communication with the bank) are often based on the hardness of problems similar to this;
(b) we cannot simply try all the possible factors -- our number $n$ can be hundreds or thousands of digits long (and in the cryptography
applications, it is!)
- Hamiltonian Circuit Problem: given a graph $G$, does there exist a cyclic path which goes through every vertex exactly once? (Such a path is called
a "Hamiltonian circuit".)
- Travelling Salesman Problem (TSP): given a weighted graph $G$, where every edge has a travelling cost, and a number $K$, does there exist a cyclic path
which visits all the vertices, and whose total cost does not exceed $K$? One can find lots of applications for this problem, and yet, it is hard
to solve.
- Boolean Satisfiability Problem (SAT): given a Boolean formula (e.g. (a or b) and (not a or not b) and (a or not b)), does there
exist an assignment of true/false values to all the variables such that the formula is true? (The formula above is true when we assign true to a
and false to b.) The importance of this problem will be explained later.
- Minesweeper: a toy problem based on the well-known Minesweeper puzzle.
We are given a rectangular board with some numbers, each number gives the number of mines in adjacent cells; the positions of mines themselves are not given.
Is the given board consistent, i.e., does there exist an arrangement of mines which is consistent with the given numbers?
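For the Knapsack Problem, here is a sketch of the $O(Mk)$ dynamic programming algorithm promised above (our own reconstruction of the standard algorithm from the Dynamic Programming lecture). Note that the running time depends on the numeric value of $M$, not on the number of digits used to write $M$ down -- this is why it is exponential in the size of the input file:

```python
def knapsack_decision(M, V, items):
    """Decide Knapsack: can a subset of items with total size at most M
    reach total value at least V?  items is a list of (size, value) pairs.
    Runs in O(M * k) time -- pseudo-polynomial."""
    best = [0] * (M + 1)    # best[c] = max value achievable with total size <= c
    for size, value in items:
        for c in range(M, size - 1, -1):   # go downwards: each item used once
            best[c] = max(best[c], best[c - size] + value)
    return best[M] >= V

# Sizes 4 and 6 fit into capacity 10 with total value 5 + 8 = 13 >= 12.
print(knapsack_decision(M=10, V=12, items=[(5, 6), (4, 5), (6, 8)]))  # True
```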
Reductions
For any of the problems above, we do not know whether it really cannot be solved in polynomial time. At first glance, it may appear that these open problems are independent --
maybe in fact we could solve, say, SAT quickly, but not TSP. However, this is not really the case.
Suppose we could solve the Travelling Salesman Problem in polynomial time. Then we could also solve the Hamiltonian Circuit Problem in polynomial time.
Why? Well, the algorithm works as follows. We are given a graph $G$ with $n$ vertices. Suppose that each edge has a travelling cost of $1$. Can we visit
all the vertices at a total cost of $n$? If yes, then the way we do this must be a Hamiltonian circuit; if not, then no Hamiltonian circuit exists. So, by solving the
Travelling Salesman Problem in polynomial time, we have also solved the Hamiltonian Circuit problem in polynomial time!
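In code, this reduction is just a few lines. Here `tsp_solver` stands for a hypothetical black box deciding TSP (no polynomial-time one is known, of course); everything else is our own naming:

```python
def has_hamiltonian_circuit(n, edges, tsp_solver):
    """Decide Hamiltonian Circuit on a graph with vertices 0..n-1 and edge
    list `edges`, using a hypothetical black box tsp_solver(weighted_edges,
    n, K) that decides whether a cyclic path visiting all n vertices with
    total cost <= K exists."""
    # The reduction: give every edge of the graph travelling cost 1 ...
    weighted_edges = [(u, v, 1) for (u, v) in edges]
    # ... and ask for a tour of total cost at most n.  Such a tour uses
    # exactly n edges, hence visits every vertex exactly once, i.e. it is
    # a Hamiltonian circuit; conversely, a Hamiltonian circuit costs n.
    return tsp_solver(weighted_edges, n, n)
```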
This process is a common approach in mathematics/computer science/data science -- when we approach a new problem, we try to
reduce it to a problem we already know how to solve
(there are even
some jokes about that). Here, we are using a specific kind of reduction.
A
polynomial time many-one reduction from problem $L_1$ to problem $L_2$ is a function $f: \Sigma^* \rightarrow \Sigma^*$, computable in polynomial time,
such that $w \in L_1$ iff $f(w) \in L_2$ (i.e., for every input $w$, the expected answer for input $w$ in problem $L_1$ is YES iff the expected answer for input $f(w)$ in problem
$L_2$ is YES). We say that $L_1$ is reducible to $L_2$ (notation: $L_1 \leq L_2$) if such a reduction exists. Above, we have shown that the Hamiltonian Circuit problem
can be reduced to TSP: the reduction $f$ transforms the graph given on the input simply by adding weight 1 to every edge. In general, reductions can be more sophisticated, but they
still have to run in polynomial time. Note that if $L_1 \leq L_2$ and $L_2$ can be solved in polynomial time, then so can $L_1$.
We can also reduce:
- Any of the problems above to the Boolean Satisfiability Problem (SAT). The practical usefulness of the SAT problem comes from the fact that many problems can be easily reduced to it.
- SAT to the Minesweeper problem. So, given a Boolean formula $\phi$, we can (in polynomial time) construct a Minesweeper board which is consistent iff $\phi$ is satisfiable.
- SAT to the Hamiltonian Circuit problem.
- SAT to the Knapsack problem.
Note that if $L_1 \leq L_2$ and $L_2 \leq L_3$, then $L_1 \leq L_3$. Therefore, any of the problems above can be reduced not only to SAT, but also to Minesweeper or to Knapsack problem.
If we manage to solve any of these problems (SAT, TSP, Knapsack, Minesweeper, Hamiltonian Circuit) in polynomial time, then we know that all of them can be solved in polynomial time!
Complexity Class NP
All the decision problems above can be formulated in the following form: for the given input $x$, does there exist a "solution" $y$? While we do not know any fast algorithm to tell us
whether a solution exists, in each case we can easily tell whether a solution is correct.
The complexity class
NP is the set of decision problems which can be formulated in the following form: for the given input $x$, does there exist a witness $y$, such that
$P(x,y)$, where $P$ is a property that can be checked in polynomial time?
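For SAT this is particularly transparent: the input $x$ is the formula, the witness $y$ is an assignment, and checking $P(x,y)$ takes linear time. A minimal sketch, with our own encoding of formulas as lists of clauses:

```python
def check_sat_witness(formula, assignment):
    """The polynomial-time property P(x, y) for SAT: does the assignment y
    satisfy the formula x?  A formula is a list of clauses; each clause is
    a list of (variable, sign) literals, sign=False meaning negated."""
    return all(any(assignment[var] == sign for var, sign in clause)
               for clause in formula)

# (a or b) and (not a or not b) and (a or not b) -- the example from above
formula = [[('a', True), ('b', True)],
           [('a', False), ('b', False)],
           [('a', True), ('b', False)]]
print(check_sat_witness(formula, {'a': True, 'b': False}))  # True
```

Finding a satisfying assignment seems to require trying exponentially many candidates; verifying a given one is easy. This asymmetry is exactly what NP captures.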
Note: the name NP may suggest to some people that it means "Non-Polynomial". In fact, it does not mean "Non-Polynomial", but "Non-Deterministic Polynomial": if we "non-deterministically"
guess the answer, its correctness can be checked in polynomial time. (The relation of this to philosophical "non-determinism" is historical.)
NP-completeness
A decision problem $L$ is
NP-hard iff any NP problem can be reduced to $L$. This means that such problems are "harder than", or at least "as hard as", the whole class NP.
A decision problem is
NP-complete iff it is NP-hard and it is in NP itself. Therefore, the NP-complete problems are the problems which are the hardest in the class NP.
We said above that each of the problems we mentioned can be reduced to SAT. In fact, it can be shown that every problem in NP can be reduced to SAT; thus, SAT is NP-complete.
Since SAT can be reduced, e.g., to the Minesweeper problem, and Minesweeper is also in NP, the Minesweeper problem is NP-complete too.
Thus, if one manages to solve any NP-complete problem (e.g., Minesweeper) in polynomial time, this means that any problem in NP can be solved in polynomial time, and thus, P=NP. If
P does not equal NP, then NP-complete problems cannot be solved in polynomial time. The factorization problem is an example of a problem which is in NP, but is not known to be
NP-complete; thus, while we do not know a polynomial algorithm for this problem yet, it is possible that we can factorize numbers in polynomial time, but Minesweeper cannot be
solved in polynomial time. The P ?= NP problem is the most important open problem in computer science (with a million dollar prize for solving it, although due to its practical importance
in algorithmics and cryptography, it may actually be worth way more than \$1,000,000).
Picture
This picture shows the relations between complexity classes and problems. The intuition here is that, the higher the problem is in the picture, the harder it is.
- The saturated blue part is the complexity class P. It includes problems that can be solved in practice, such as sorting, our homeworks, reachability (can be solved using BFS),
and shortest paths (does a path of weight $\leq d$ exist in the graph -- this can be solved using Dijkstra's algorithm).
- The next semi-ellipse is the complexity class NP. For these problems, we can verify the 'yes' answer in polynomial time.
It includes Factorization, Hamiltonian Circuit, TSP, Minesweeper (Mines), Knapsack (Knap). Note that NP includes P!
- The next semi-ellipse is all the decision problems that can be solved (in finite time). We do not discuss them here, but evaluating positions in games like Chess or Go
(can we still win?) is a good example; another problem is QBF, which is similar to SAT but asks about formulas with quantifiers. There are also problems which are impossible to solve,
like the Halting Problem.
- The arrows are reductions. If we can solve the problem A using an algorithm for problem B, there is an arrow from A to B.
For example, we can solve Reachability by finding out whether a path of finite length exists in the graph -- so Reachability reduces to Shortest Paths.
(In other words, we can use Dijkstra's algorithm to find this out, although it generally solves a harder problem.) Likewise, there is an arrow from Hamiltonian circuit to the TSP
problem, because Hamiltonian circuit can be solved if we know how to solve TSP.
- The horizontal line in the middle separates NP-hard problems: problems above that line are NP-hard. Every problem in NP reduces to them.
- NP-complete problems are ones which are NP-hard and in NP. Thus, Hamiltonian circuit and TSP are NP-complete, but Factorization is not known to be (it is in NP, but it is
not known whether all NP problems can be reduced to it, and probably they cannot), and QBF is also not known to be NP-complete (it is harder than SAT and thus NP-hard, but it is not known to be in NP,
and probably it is not).
- The picture is drawn according to the current knowledge. It is still possible that P=NP (making the whole picture much simpler), Factorization is NP-complete, or QBF is in NP.