AlgoDS: Lecture 1

How to learn?

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms -- a good reference on algorithms. A rather big book; it covers both the easier subjects (such as those we will be talking about in this course) and some of the more advanced ones.
Internet resources, for example smurf.mimuw.edu.pl (in Polish)
A good way to get practical experience is to look into competitive programming. This is a kind of mental sport in which one gets an algorithmic problem, has to design an algorithm, and implement it in a chosen programming language. Competitive programming sites let you submit your solution and have it tested automatically. One can compete against others (usually under a time limit) or use the sites for educational purposes. Some sites:
- HackerRank -- many educational problems
- CodeForces -- currently the biggest community

Grading rules

What programming languages will we be using?

Pseudocode -- when we want to present the idea of an algorithm quickly without giving the implementation details
Python and C++ -- comparison:
- Python is very popular in Data Science, C++ is very popular among algorithmists (statistics)
- It is quite easy to create very efficient programs in C++ -- computations requiring high performance may be possible in Python, but they are usually done with an external library (probably written in C++ or a similar language).
- Python is believed to be very easy, while C++ is commonly believed to be hard and complex -- however, there are people who think otherwise.
- The easiest ways to use C++ on Windows are to install Code::Blocks, or to use an online compiler such as ideone.
- C++ is an evolving language: it started when Bjarne Stroustrup added object oriented programming to C (a very low-level language by modern standards); after that, templates and the standard library were added. There were major new additions in 2011, 2014, and 2017, with more planned for 2020; these additions make many things much easier. Unfortunately, it is hard to find a good tutorial on modern C++ -- many of them concentrate too much on features which are not that useful for us, such as pointers (a low-level and rather difficult concept which was essential in C, but can be mostly avoided in modern C++) or object oriented programming (useful for creating some kinds of software, but not really relevant for us). Learn C++ is a popular tutorial, but it also has these problems.
- See below for a more detailed comparison.

Example problem: "guess the number" puzzle

I have a secret integer x from 0 to 1000. You can ask questions of the form "is x≤...?". How can you find the value of x?

The idea of the solution: first, ask whether x≤500. After we receive the answer, we either know that 0≤x≤500, or that 501≤x≤1000. We ask about the middle number of the new interval, again splitting the interval in half. After about 10 questions we know the value of x exactly.

This algorithm can be described by the following fragment of a C++ program:

int l = 0;
int r = 1000;
// l and r denote our current bounds: we know that l <= x <= r
while(l<r) {
  // we will ask about the number in the middle
  int m = (l+r)/2;
  // note: this is an integer division, thus e.g., for l=0, r=1 we get m=1/2=0
  if(x <= m)
    r = m;
  else
    l = m+1;
  }

After this fragment executes, both the variables l and r will be equal to x.
Although the idea of the algorithm is very easy, implementing it precisely enough so that it can be executed by a computer is not that straightforward. For example, if one writes l = m instead of l = m+1, the algorithm will not work correctly: suppose that, at some point, l=0, r=1, and x=1. We compute m=0, and since x <= m is false, we set l to m, which does not change anything -- thus, we will keep asking about x <= 0 forever!
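
The puzzle can also be simulated to see how many questions are really needed, and to watch the faulty variant get stuck. The following is only a sketch in Python: the function name guess, the buggy flag, and the max_questions safety limit are made up for this illustration and are not part of the fragment above.

```python
def guess(x, lo=0, hi=1000, buggy=False, max_questions=100):
    """Find x in [lo, hi] by asking only 'is x <= m?'.

    Returns (value found, number of questions asked).
    If the loop stops making progress, returns (None, max_questions).
    """
    l, r = lo, hi
    questions = 0
    while l < r:
        if questions == max_questions:
            return None, questions      # gave up: the bounds no longer change
        m = (l + r) // 2                # integer division, as in the C++ fragment
        questions += 1
        if x <= m:                      # the answer is "yes"
            r = m
        else:                           # the answer is "no", so x >= m+1
            l = m if buggy else m + 1   # the buggy variant forgets the +1
    return l, questions


# Try every possible secret number and record the worst case.
worst = max(guess(x)[1] for x in range(1001))
print("questions needed in the worst case:", worst)

# The faulty variant never finishes for l=0, r=1, x=1.
print(guess(1, lo=0, hi=1, buggy=True))
```

The worst case over all 1001 possible values of x is 10 questions, which agrees with the estimate given earlier.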

How to check that the program is correct then?

Our loop has a so-called invariant: at each moment we have l<=x<=r.
- The invariant is true when we first enter the loop -- as we know that 0<=x<=1000 and l and r have exactly these values.
- We have to check that, if the invariant is true, it is still true after one iteration of the loop. For this, we have to check both cases (x<=m or otherwise).
- Therefore, the invariant will have to still be true when we exit the loop. At this point, we know that l<=x<=r (the invariant) and that l>=r (since the loop exits), therefore both l and r must be equal to x.
This shows that, if the program ends, it gives the correct answer. We also have to show that the program will always end. For this, consider the value of r-l. This is an integer which gets smaller and smaller with each iteration (we have to check the two cases again), and never drops below 0, so the program will eventually have to finish.
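
The same argument can be written down as assertions that are checked while the loop runs. This is a sketch for experimenting, not a replacement for the proof: the name guess_checked is made up here, and the assertions only test the values of x that we actually try.

```python
def guess_checked(x, lo=0, hi=1000):
    """Binary search for x with the invariant and termination argument asserted."""
    assert lo <= x <= hi                # what we know before the loop
    l, r = lo, hi
    while l < r:
        assert l <= x <= r              # the invariant holds when an iteration starts
        gap_before = r - l
        m = (l + r) // 2
        if x <= m:
            r = m
        else:
            l = m + 1
        assert l <= x <= r              # ... and still holds after the iteration
        assert 0 <= r - l < gap_before  # r-l strictly decreases and stays >= 0
    # The loop has exited, so l >= r; together with the invariant this gives l == r == x.
    assert l == r == x
    return l


for x in range(1001):
    guess_checked(x)
print("all assertions passed")
```

If l = m+1 is changed to l = m, the assertion about r-l fails, which points directly at the step of the proof that breaks.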

Application

The last problem was presented as a puzzle to show that algorithmics is not necessarily about computers: algorithms are useful in general, whether or not a computer is involved. However, the algorithm shown, known as binary search, has lots of applications in programming.

For example, suppose we have an array of integers a[0], a[1], ..., a[n-1], and we know that it is non-decreasing. Given v, we want to find the smallest x such that a[x] ≥ v; if all the integers in our array are smaller than v, we should return n. How to do this?

We use the same general idea as in the puzzle, but now instead of checking whether x <= m, we check whether a[m] >= v. We also start with r=n instead of 1000. We prove the correctness of our program in exactly the same way, except that the invariant now says that a[i]<v for each valid i smaller than l, and a[i]>=v for each valid i greater than or equal to r.

The complete program

Below is an example of a complete C++ program which uses this:

#include <iostream>
#include <vector>

// We separate our binary search as a function. For the array a, we use the type std::vector<int> from
// the standard library. 

int bin_search(const std::vector<int>& a, int v) {

  // The & sign denotes that 'a' is a reference, i.e., bin_search has access to a std::vector<int> which
  // is declared elsewhere. If & is omitted, the function makes its own copy of the vector given to it,
  // which takes lots of time -- thus it is important not to forget the & sign.

  // 'const' signifies that the function will not change the value of a. 
  
  int l = 0;
  int r = a.size();
  while(l<r) {
    int m = (l+r)/2;
    if(a[m] >= v)
      r = m;
    else
      l = m+1;
    }
  return r;
  }


int main() {
  // we construct an example array a:
  std::vector<int> a;
  for(int i=0; i<1000; i++) 
    a.push_back(i*i);
  
  int v;
  // keep answering queries until the input ends or something that is not a number is entered
  while(std::cin >> v) {
    std::cout << "The answer is: " << bin_search(a,v) << "\n";
    }

  return 0;
  }

And a Python version:

def bin_search(a, v):
  l = 0
  r = len(a)
  while l<r:
    m = (l+r)//2
    if a[m] >= v:
      r = m
    else:
      l = m+1
  return r

a = [i*i for i in range(1000)]

while 1:
  v = int(input())
  print("The answer is: "+str(bin_search(a,v)))
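
One way to gain some confidence in the Python version is to compare it with bisect_left from Python's standard bisect module, which also returns the smallest index x such that a[x] >= v (or len(a) if there is none). The sketch below repeats bin_search so that it can be run on its own; the random test sizes and the fixed seed are arbitrary choices made for this illustration.

```python
import random
from bisect import bisect_left


def bin_search(a, v):
    # same function as above, repeated so this snippet is self-contained
    l = 0
    r = len(a)
    while l < r:
        m = (l + r) // 2
        if a[m] >= v:
            r = m
        else:
            l = m + 1
    return r


random.seed(0)  # fixed seed, so every run checks the same cases
for _ in range(1000):
    n = random.randint(0, 20)
    a = sorted(random.randint(0, 30) for _ in range(n))  # non-decreasing, with repeats
    v = random.randint(-1, 31)                            # may be below or above all values
    assert bin_search(a, v) == bisect_left(a, v)
print("bin_search agrees with bisect_left")
```

Small arrays with repeated values are used on purpose: mistakes such as l = m instead of l = m+1 tend to show up on arrays of length 1 or 2.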

The important differences between the C++ and Python versions:

C++ is statically typed, while Python is dynamically typed. This means that the C++ programmer has to specify the type of each variable, which is not necessary in Python. This makes the Python program shorter; however, in Python, if you make a type error (for example, you forget the int or str which convert strings to integers and back), you will not know about it until you actually run the program and it breaks when it reaches the offending line. In C++, since the compiler knows all the types, it reports an error right away during compilation. Another advantage is that we do not have to store the type of the variable in memory, nor check it at run time -- i.e., a C++ program uses both memory and the processor much more efficiently. (In many cases it is possible to declare a variable as auto to make the compiler determine the type by itself, and to use templates to create functions which work on many different types -- without losing the advantages of static type checking mentioned above.)
The C++ programmer decides whether to pass a by reference or not, while the Python programmer just writes a. Python does the right thing by default in this particular case, but sometimes you might want a reference when Python does not allow it, or be surprised when Python uses a reference where you wanted a copy (try running x = [0]; y = [x for a in range(5)]; y[0][0] = 1; print(y)). Declaring the parameter as const helps the programmer (it says that this variable is an input that is not changed) and may help the compiler generate more efficient code.
std:: is a namespace containing all the types and functions from the standard library. It exists for backwards compatibility: if a programmer defined their own type called vector before std::vector existed, their program will still work. (Most C programs will still work in modern C++, while Python 2 programs are likely not to work in Python 3.) It is possible to write using namespace std; to avoid writing std:: each time, but then a program could stop compiling when new names are added to std.
In C++, the structure of the program (what is inside the while loop, and what is not) comes from the braces. It is a good style to use indentation, but it is not strictly necessary. In Python, there are no braces, and the structure comes from indentation.
In C++ you need a bit more code (e.g. the main function); on the other hand, this part will be the same in almost every program.
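
Two of the points above (type errors that show up only at run time, and references that appear where a copy was expected) can be tried out directly in Python. This is a small sketch; the function report and the variable names are made up for the illustration.

```python
# 1. Dynamic typing: a type error is reported only when the faulty line actually runs.
def report(answer, verbose):
    if verbose:
        # Bug on purpose: str(answer) was forgotten, so a str and an int are added.
        return "The answer is: " + answer
    return "ok"


print(report(5, False))            # works: the faulty line is never reached
try:
    report(5, True)
except TypeError as error:
    print("failed only at run time:", error)

# 2. References: the list comprehension stores the same list five times.
x = [0]
shared = [x for _ in range(5)]
shared[0][0] = 1
print(shared)                      # [[1], [1], [1], [1], [1]]

# Asking for a copy explicitly gives five independent lists.
z = [0]
copies = [list(z) for _ in range(5)]
copies[0][0] = 1
print(copies)                      # [[1], [0], [0], [0], [0]]
```

In C++, the first mistake would be rejected by the compiler, and the second choice (reference or copy) is spelled out in the declaration.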

The syntax of the for loop and input/output might be a bit weird -- C++11 allows nicer constructs, but they are not yet in the standard library (probably C++ programmers are too used to this to care). It is possible to make it nicer if you write or use a library such as my easy.cpp:

#include "easy.cpp"
using namespace easy;

int bin_search(const vector<int>& a, int v) {
  int l = 0;
  int r = a.size();
  while(l<r) {
    int m = (l+r)/2;
    if(a[m] >= v)
      r = m;
    else
      l = m+1;
    }
  return r;
  }

int main() {
  vector<int> a;
  for(int i: ints(1000)) 
    a.push_back(i*i);
  
  while(true) {
    int v;
    read(v);
    writeln("The answer is: ", bin_search(a,v));
    }

  return 0;
  }

The importance of using efficient algorithms

A simpler algorithm for the last problem is the linear search, in which you simply look for the value of x from beginning to end:

int x = 0;
while(x < n && a[x] < v) x++;

How important is using the faster algorithm, compared to using a faster computer, or a faster programming language? Consider the case of n=1000000000. In the '80s, in BASIC, the interpreter could do about 1000 simple operations per second. Therefore, the binary search (30 iterations) would take about 50 ms, while the linear search would take about 12 days. On the other hand, C++ on a modern computer can do about 1000000000 operations per second -- thus, the linear search will take about 1 second, while the binary search takes about 50 nanoseconds.
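
The estimates above can be reproduced with a few lines of arithmetic. The sketch below assumes one simple operation per loop iteration, which is a simplification made here; the figures in the text presumably allow for slightly more work per iteration, so the binary search times come out a little smaller, but of the same order of magnitude.

```python
import math

n = 1_000_000_000
binary_iterations = math.ceil(math.log2(n))  # number of halvings needed, about 30
linear_iterations = n                        # worst case: look at every element

basic_1980s = 1_000            # operations per second (figure from the text)
cpp_modern = 1_000_000_000     # operations per second (figure from the text)


def seconds(iterations, ops_per_second):
    # assumption: one simple operation per iteration
    return iterations / ops_per_second


print("binary search iterations:", binary_iterations)
print("BASIC, linear search (days):", seconds(linear_iterations, basic_1980s) / 86_400)
print("BASIC, binary search (ms):", seconds(binary_iterations, basic_1980s) * 1_000)
print("C++, linear search (s):", seconds(linear_iterations, cpp_modern))
print("C++, binary search (ns):", seconds(binary_iterations, cpp_modern) * 1e9)
```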

As this example shows, using a faster algorithm is more important than using a faster computer or programming language. An efficient algorithm will be faster than a bad one, even when given a significant technological disadvantage! And even if answering one query per second does not sound that bad, we would usually use binary search as a part of a bigger system, which needs to find a value in an array many times. Coders with no knowledge of algorithmics would just implement the linear search (thinking it should be fast enough), find out that their program works slowly, and then get amazed when told about binary search (even though it is a very simple algorithm).

The number of iterations is the binary logarithm of n for binary search, and n itself for linear search. To get the actual running time, we have to multiply this by a constant, which depends on the technology used and on how many elementary operations we do in each iteration -- for example, if we handled the case of x=m separately, the number of operations per iteration would be different. While we could save time by changing this, or by improving our technology, the difference between log(n) and n is the most significant. Therefore, in our course, we will be most concerned with making the order of growth as small as possible. This will be explained in more detail in later lectures.
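
The difference between n and log(n) can also be observed by counting loop iterations directly. The sketch below is in Python; the function names linear_steps and binary_steps and the chosen values of n are made up for this illustration. It searches for a value larger than every element, which is the worst case for the linear search.

```python
import math


def linear_steps(a, v):
    """Linear search; returns (answer, number of loop iterations)."""
    x = 0
    steps = 0
    while x < len(a) and a[x] < v:
        x += 1
        steps += 1
    return x, steps


def binary_steps(a, v):
    """Binary search; returns (answer, number of loop iterations)."""
    l, r, steps = 0, len(a), 0
    while l < r:
        steps += 1
        m = (l + r) // 2
        if a[m] >= v:
            r = m
        else:
            l = m + 1
    return r, steps


for n in (10, 1000, 100000):
    a = list(range(n))
    v = n  # larger than every element, so both searches return n
    print(n, linear_steps(a, v)[1], binary_steps(a, v)[1],
          "log2(n) is about", round(math.log2(n), 1))
```

Multiplying n by 100 multiplies the linear count by 100, while the binary count only grows by about 7.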