Selected topics in combinatorics revolving around Vapnik-Chervonenkis dimension
Spring 2023
Lecturer: Szymon Toruńczyk
What is this lecture about?
A big portion of the lectures will be concerned with the Vapnik-Chervonenkis dimension. This fundamental notion has numerous applications in computational learning theory, combinatorics, logic, computational geometry, and theoretical computer science. We will develop the foundations of the theory and study its applications.
Definition: A set system consists of a domain X (a set) and a family F⊆P(X) of subsets of X. For Y⊂X, write F∩Y for the set system with domain Y and family {F∩Y:F∈F}.
Remark A set system is the same as a hypergraph.
Examples:
open intervals: domain R, family {(a,b):a,b∈R,a<b}
rectangles: domain R2, family {(a,b)×(c,d):a,b,c,d∈R,a<b,c<d}
discs: domain R², family {D_r(p) : p∈R², r>0}
convex sets in R2
higher-dimensional analogues in Rd,
neighborhoods in graphs
Usually X=⋃F; in that case we just mention F.
Concepts in machine learning. Goal: given a labelled sample, find a reasonable hypothesis (concept). A labelled sample is the same as a set Y and a set F∩Y, for some concept F∈F. The goal is to "guess" the concept F, or something close to F.
In the area of geometric range searching, we fix a set system of geometric shapes in Rd, like the ones in the examples above. The sets in F are called ranges. Given a finite set X⊂Rd, the task is to compute a data structure that allows to quickly answer queries of the form: for a given range F∈F, what is #(F∩X)?
VC-dimension and shatter function
The goal is to define a measure of complexity of a set system. Machine learning intuition: understand which concept classes are learnable. When does Ockham's razor work well?
Definition (VC-dimension) [Vapnik, Chervonenkis, 71]
Let F be a set system with domain X. A set Y⊆X is shattered by F if F∩Y = 2^Y.
The VC-dimension of F is the maximum size d of a set Y⊆X shattered by F, or ∞
if arbitrarily large sets are shattered.
Definition (Shatter function).
Let F be a set system on X. Define
π_F(m) = max_{Y⊆X, |Y|≤m} |F∩Y|.
Examples
half-spaces in Rd have VC-dim at most d+1 (see exercises)
revisit examples above.
Observation If F has infinite VC-dimension, then π_F(m) = 2^m for all m.
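As an illustration of these definitions, the shatter function and the VC-dimension of a small finite set system can be computed by brute force. The following Python sketch is ours (not part of the lecture) and is only feasible for tiny domains.

```python
from itertools import combinations

def traces(family, Y):
    """All distinct sets of the form F ∩ Y, for F in the family."""
    return {frozenset(F) & frozenset(Y) for F in family}

def is_shattered(family, Y):
    return len(traces(family, Y)) == 2 ** len(Y)

def shatter_function(domain, family, m):
    """pi_F(m): maximum of |F ∩ Y| over subsets Y of size at most m."""
    return max(len(traces(family, Y))
               for k in range(m + 1)
               for Y in combinations(domain, k))

def vc_dimension(domain, family):
    """Largest size of a subset of the domain shattered by the family."""
    return max(len(Y)
               for k in range(len(domain) + 1)
               for Y in combinations(domain, k)
               if is_shattered(family, Y))

# Intervals restricted to a 4-element domain: VC-dimension 2.
domain = [1, 2, 3, 4]
intervals = [set(range(a, b + 1)) for a in domain for b in domain if a <= b]
print(vc_dimension(domain, intervals))                               # 2
print([shatter_function(domain, intervals, m) for m in range(5)])    # [1, 2, 4, 7, 10]
```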
Definition (Dual set system).
Let F be a set system with domain X.
The dual systemF∗ is the system with domain F
and family {x∗:x∈X}, where x∗={F∈F:x∈F}⊂F.
Another view on set systems: bipartite graphs (L,R,E), E⊆L×R. (Multiplicities are ignored).
Dual: (R,L,E⁻¹). In particular, (F∗)∗≅F,
up to removing twins (elements in the domain that belong to
exactly the same sets).
The shatter function π_{F∗} is denoted π*_F,
and is called the dual shatter function of F.
VC(F∗) is denoted VC∗(F),
and is called the dual VC-dimension of F.
VC(F∗) is the maximal number of sets
in F such that every cell in the Venn diagram
is occupied by some element of the domain.
πF∗(m) is the maximal number of
occupied cells in the Venn diagram of ≤m sets from F.
Example If F is the family of half-planes in R2, then πF∗(m) is the maximal number of regions
into which m half-planes can partition the plane.
For example, πF∗(2)=4 and πF∗(3)=7.
Exercise Prove that π*_F(m) = (m choose 2) + m + 1.
Exercise If VC(F)≤d then VC∗(F) < 2^{d+1}.
In all examples, we have either a polynomial,
or an exponential shatter function – nothing in between.
Sauer-Shelah-Perles Lemma. [Sauer (72), Shelah and Perles (72), Vapnik-Chervonenkis (71)]
Let F be a set system on n elements of VC-dimension at most d.
Then ∣F∣ ≤ (n choose 0) + … + (n choose d) =: Φ_d(n).
It follows that for all m,
π_F(m) ≤ Φ_d(m) ≤ (em/d)^d ≤ O(m^d) (the middle inequality for m ≥ d).
Note. This bound is tight. Consider the set system F of all ≤d-element subsets of an infinite set X. Clearly πF(m)=Φd(m).
Proof.
Let X be the domain of the set system,
and pick any x∈X.
Denote F′:=F∩(X−{x}).
How much does F′ differ from F?
There is a natural mapping from F to F′, namely F↦F−{x}.
The preimage of a set G∈F′ under this mapping consists of exactly two sets,
if both G∈F and G∪{x}∈F,
and of exactly one set otherwise.
Denote M_x := {G∈F : x∉G and G∪{x}∈F}. It follows that
∣F∣ = ∣F′∣ + ∣M_x∣.
Note that both F′ and Mx are set systems with domain X−{x} of size n−1.
The key observation is that Mx has VC-dimension at most d−1.
Indeed, if A⊆X−{x} is shattered in Mx, then A∪{x} is shattered in F.
By induction on n and d we have that:
∣F∣ = ∣F′∣ + ∣M_x∣ ≤ Φ_d(n−1) + Φ_{d−1}(n−1).
Note that Φ_d(n−1) + Φ_{d−1}(n−1) ≤ Φ_d(n),
since we can surjectively map each subset A of [n] with ∣A∣≤d
to A (a subset of [n−1] of size ≤d) if n∉A,
and to A−{n} (a subset of [n−1] of size ≤d−1) if n∈A. ■
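The bound can also be checked experimentally on small random set systems; the following self-contained Python sketch (ours, purely illustrative) computes Φ_d and compares it with the size of the family.

```python
import random
from itertools import combinations
from math import comb

def phi(d, n):
    """Phi_d(n) = C(n,0) + ... + C(n,d)."""
    return sum(comb(n, i) for i in range(d + 1))

def vc_dim(domain, family):
    """Brute-force VC-dimension of a finite set system."""
    def shattered(Y):
        return len({frozenset(F) & frozenset(Y) for F in family}) == 2 ** len(Y)
    return max(k for k in range(len(domain) + 1)
               if any(shattered(Y) for Y in combinations(domain, k)))

random.seed(0)
domain = list(range(8))
for _ in range(20):
    family = {frozenset(x for x in domain if random.random() < 0.5) for _ in range(40)}
    assert len(family) <= phi(vc_dim(domain, family), len(domain))
print("Sauer-Shelah-Perles bound holds on all samples")
```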
Exercise
For every element v in the domain, define a (partial) matching M_v^F on F by M_v^F = {{A,B} : A,B∈F, A△B={v}}. Show that if for every subset X of the domain there is v∈X
with ∣M_v^{F∩X}∣ ≤ k, then ∣F∣ ≤ O(k⋅n).
Set systems in logical structures
Let M be a logical structure and ϕ(x̄;ȳ) be a formula.
For a tuple b̄∈M^ȳ
denote
ϕ(M;b̄) := {ā∈M^x̄ : ϕ(ā,b̄) holds in M}.
Consider the set system F_ϕ
with domain M^x̄ and sets ϕ(M;b̄), for b̄∈M^ȳ.
Examples
(R,+,⋅,≤,0,1), formula
ϕ(x;y1,y2)≡y1≤x≤y2.
Then Fϕ is the set system of closed intervals on R.
A graph G=(V,E) is seen as a logical structure
with a binary relation E, denoting adjacency.
A first-order formula is a formula using ∃,∀,∨,∧,¬, where the quantifiers range over elements of the domain.
Example ϕ(x,y) ≡ ∃z. y = x + z²
is equivalent in (R,+,⋅,≤,0,1) to x≤y.
So ≤ is definable in (R,+,⋅).
The relation y = e^x is not definable in (R,+,⋅).
Example
The formula δ1(x,y)≡(x=y)∨E(x,y)
expresses that dist(x,y)≤1,
and the formula δ2(x,y)≡x=y∨E(x,y)∨∃z.E(x,z)∧E(z,y) expresses that dist(x,y)≤2,
and similarly we can define δr(x,y) expressing dist(x,y)≤r, for any fixed number r.
There is no first-order formula that expresses the property `x and y are at a finite distance' in all graphs.
This can be expressed by a formula of monadic second-order logic (MSO).
Theorem (Shelah, 71) Let M be a logical structure.
The following conditions are equivalent:
every first-order formula ϕ(xˉ;yˉ) has bounded VC-dimension in M,
every first-order formula ϕ(x;yˉ) has bounded VC-dimension in M (here x is a singleton).
If the above conditions hold, then M is called NIP.
Example (R,+,⋅,≤) is NIP.
We use the result of Tarski:
Theorem (Tarski, 40) The structure (R,+,⋅,≤,0,1) has quantifier elimination: every first-order formula ϕ(x1,…,xk) is equivalent to a formula without quantifiers.
Such formulas are just boolean combinations of sets of the form
{(a₁,…,a_k)∈R^k : p(a₁,…,a_k) ≥ 0}
or
{(a₁,…,a_k)∈R^k : p(a₁,…,a_k) = 0},
where p(x₁,…,x_k) is a polynomial with integer coefficients (for instance, x₁²+x₂²−5).
Corollary (Tarski, 30)
For every first-order formula ϕ(x;y1,…,yk)
and parameters b1,…,bk∈R,
the set ϕ(R;b₁,…,b_k) ⊆ R is a union of at most k_ϕ intervals, where k_ϕ depends only on ϕ (and not on b₁,…,b_k).
Proof. We may assume that ϕ is a boolean combination of ℓ sets as above. Let d be the maximum degree of the polynomials. Each polynomial of degree d in one free variable defines a set that is a union of at most d+1 intervals.
A boolean combination of ℓ such sets is a union of at most k_ϕ := O(ℓ⋅d) intervals.
Corollary The structure (R,⋅,+,≤,0,1) is NIP.
Proof.
The family of all sets that are unions of k intervals cannot shatter a set of 2k+1 points. The conclusion follows from the result of Shelah.
Definition A structure M with a total order < and other relations/functions is o-minimal if
for every first-order formula ϕ(x;y1,…,yk)
and parameters b1,…,bk, the set ϕ(M;b1,…,bk) is a union of finitely many intervals.
It is known that this implies that then it is a union of at most nϕ intervals, for some nϕ that depends only on ϕ.
Corollary Every o-minimal structure is NIP.
Examples of o-minimal structures
(R,+,⋅,≤,0,1)
(R,+,⋅,≤,e^x,0,1) (Wilkie's theorem)
Other examples of NIP structures
abelian groups
Presburger arithmetic (Z,+,<).
Classes of structures
Let C be a class of logical structures.
For example, the class of all planar graphs.
A formula ϕ(xˉ;yˉ) defines
a set system FϕM in each structure M∈C.
We say that ϕ has VC-dimension ≤d in C
if each of the set systems F_ϕ^M has VC-dimension at most d.
The shatter function π_{ϕ,C}(m)
is defined as the maximum of π_{F_ϕ^M}(m), over all M∈C.
A class C is NIP if every formula ϕ(xˉ;yˉ)
has bounded VC-dimension on C.
Example
Let C be a class of graphs and ϕ(x,y)≡E(x,y).
Then π_{E,C}(m) is the following quantity:
the maximum, over all graphs G∈C and sets A⊆V(G) with ∣A∣≤m,
of the number of distinct sets of the form N(v)∩A, for v∈V(G).
If C is the class of bipartite graphs, then π_{E,C}(m)=2^m.
If C is the class of planar graphs, then π_{E,C}(m)=O(m).
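For a concrete feel for the quantity π_{E,C}(m), one can compute the number of distinct neighbourhood traces on small vertex subsets by brute force. The Python sketch below is ours (the grid graph is just an example of a planar graph).

```python
from itertools import combinations

def neighbourhood_traces(adj, A):
    """Number of distinct sets N(v) ∩ A, over all vertices v."""
    return len({frozenset(w for w in adj[v] if w in A) for v in adj})

def pi_E(adj, m):
    """max over |A| = m of the number of neighbourhood traces on A.
    (The maximum over |A| <= m is attained at |A| = m, since traces only refine.)"""
    return max(neighbourhood_traces(adj, set(A)) for A in combinations(adj, m))

def grid(n):
    """Adjacency sets of the n x n grid graph, a planar graph."""
    adj = {(i, j): set() for i in range(n) for j in range(n)}
    for (i, j) in adj:
        for (di, dj) in ((0, 1), (1, 0)):
            if (i + di, j + dj) in adj:
                adj[(i, j)].add((i + di, j + dj))
                adj[(i + di, j + dj)].add((i, j))
    return adj

print([pi_E(grid(4), m) for m in range(1, 6)])   # grows roughly linearly in m
```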
Fact The class C of planar graphs is NIP.
[Podewski, Ziegler, 1978] Moreover, for every fixed first-order formula ϕ(x;y),
we have π_{ϕ,C}(m)=O(m).
This holds more generally for all classes C with bounded expansion [Pilipczuk, Siebertz, Toruńczyk, 2018]
and for classes of bounded twin-width [Przybyszewski, 2022].
Classes with bounded expansion include classes of bounded tree-width,
classes of graphs that can be embedded in a fixed surface,
classes of graphs with bounded maximum degree,
and classes that exclude a fixed minor.
They are contained in nowhere dense classes.
π_{ϕ,C}(m)=O(m^{1+ε}) holds for every nowhere dense class C and fixed ε>0 [PST, 2018].
VC-density
Definition (VC-density).
The VC-density of a set system F is the infimum
of all r∈R such that π_F(m)=O(m^r).
Similarly, we define the VC-density of a formula ϕ(xˉ;yˉ) in a class C of structures.
Example Every formula ϕ(xˉ;y) (with y singleton) has VC-density 1 on the class of planar graphs, and more generally, on all nowhere dense classes.
Theorem (Assouad, 1983)
The VC-density of any set system F is either 0, or a real number ≥1.
For every real α≥1 there is a set system F with VC-density α.
We will construct a family with VC-density 3/2. See Chernikov's notes, Example 2.7. For this we use the following.
Theorem (Kovari, Sos, Turan, 1954)
Every K_{2,2}-free bipartite graph with n+n vertices
has O(n^{3/2}) edges.
Example
Fix a prime number p. For every power q of p there is a field Fq with q elements.
Consider the set system Fq where:
the domain consists of points in F_q² and lines in F_q² (solutions of equations y=ax+b),
the family consists of two-element sets {p,ℓ} where ℓ is a line and p is a point in ℓ.
We claim that π_{F_q}(m)=O(m^{3/2}). To see this, pick an m-element subset A of the domain.
The restriction Fq∩A
consists of:
the empty set (1 element),
singletons (m elements),
and of pairs {p,ℓ} such that p,ℓ∈A and p∈ℓ.
The number of pairs is the number of edges in the bipartite graph GA
whose parts are the points p∈A, and the lines ℓ∈A, and edges denote incidence. The graph GA is K2,2-free,
so it has at most O(m^{3/2}) edges.
Altogether, ∣F_q∩A∣ ≤ 1 + m + O(m^{3/2}) = O(m^{3/2}).
On the other hand, ∣F_q∣ ≥ q³, and
there are at most 2q² points and lines.
Therefore, ∣F_q∣ ≥ Ω((∣points∣+∣lines∣)^{3/2}).
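The construction can be carried out explicitly for a small prime; the following Python sketch (ours, with q=7) builds the incidence family and verifies K_{2,2}-freeness, i.e. that two distinct lines share at most one point.

```python
q = 7                                                   # a small prime
points = [(x, y) for x in range(q) for y in range(q)]
lines = [(a, b) for a in range(q) for b in range(q)]    # the line y = a*x + b

def incident(point, line):
    (x, y), (a, b) = point, line
    return y % q == (a * x + b) % q

# The family: incident point-line pairs {p, l}.
family = [(p, l) for p in points for l in lines if incident(p, l)]
print(len(family), len(points) + len(lines))            # q^3 = 343 pairs, 2*q^2 = 98 domain elements

# K_{2,2}-freeness: any two distinct lines have at most one common point.
for l1 in lines:
    for l2 in lines:
        if l1 != l2:
            assert sum(1 for p in points if incident(p, l1) and incident(p, l2)) <= 1
```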
VC-theorem and PAC learning
Let D be a fixed domain.
For a multiset S⊂D and a set F⊂D,
write Av_S(F) to denote the fraction of the points in S that belong to F:
Av_S(F) = #{s∈S : s∈F} / ∣S∣.
Definition (ε-approximation).
Fix a probability measure μ on D.
A multiset S is an ε-approximation for μ on a (measurable) set F⊂D if
∣μ(F)−AvS(F)∣≤ε.
A multiset S is an ε-approximation for μ on a set family F with domain D, if S is an ε-approximation for μ on F, for each F∈F.
Recall the weak law of large numbers:
Theorem (Weak law of large numbers)
Fix a probability measure μ on a domain D and a measurable set F⊂D.
For every ε>0 and number n, a randomly and independently selected sequence S of n elements of D
is an ε-approximation for μ on F with probability at least
1 − 1/(4nε²).
The VC-theorem is a uniform version, for an entire family F of small VC-dimension. We assume here that the domain is finite. This assumption can be removed at the cost of some additional assumptions.
Theorem (VC-theorem)
Fix a probability measure P on a finite domain D and a family F of subsets of D.
For every ε>0 and number n, a randomly and independently selected sequence of n elements of D
is an ε-approximation for P on F with probability at least
1 − 8·π_F(n)·exp(−nε²/32).
For the proof, see e.g. Chernikov's lecture notes.
Corollary
For every fixed d∈N and ε>0 there is a number
N(d,ε) such that
if F has VC-dimension d and P is a probability distribution on D,
then there exists an ε-approximation for P on F
of size at most N(d,ε).
Moreover, we can take
N(d,ε) = O((d/ε²)·log(d/ε)).
Proof of Corollary.
By Sauer-Shelah-Perles we have that π_F(n) ≤ O(n^d).
The key point is that the value in the VC-theorem,
O(n^d)/exp(nε²/32),
converges rapidly to 0 as n→∞.
In particular, we can find N(d,ε) large enough so that this number
is smaller than any fixed constant c>0. A more careful analysis shows that
O((d/ε²)·log(d/ε))
suffices. See Chernikov's lecture notes. □
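The statement is easy to test empirically: draw an i.i.d. sample and measure the worst discrepancy over a family of sets. The following Python sketch (ours; the family of initial segments and all parameters are arbitrary choices) illustrates the definition of an ε-approximation.

```python
import random

random.seed(1)
n, eps = 10_000, 0.05
family = [set(range(1, k + 1)) for k in range(0, n + 1, 500)]   # a few initial segments of {1,...,n}
sample = [random.randint(1, n) for _ in range(2_000)]           # i.i.d. uniform multiset

def average(S, F):
    """Av_S(F): fraction of sample points lying in F."""
    return sum(1 for s in S if s in F) / len(S)

# Worst discrepancy |mu(F) - Av_S(F)| over the family, with mu the uniform measure.
worst = max(abs(len(F) / n - average(sample, F)) for F in family)
print(worst, worst <= eps)   # with high probability the sample is an eps-approximation
```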
Let μ be a probability distribution on (a σ-algebra on) D and let F be a system of (measurable) subsets of D.
For ε>0, a set S⊂D is called an ε-net
for μ on F,
if every set F∈F with μ(F)≥ε
has a nonempty intersection with S.
Lemma Every ε-approximation is an ε-net.
Proof. For every F∈F, if S has an empty intersection with F
then Av_S(F)=0. If S is an ε-approximation, this implies that μ(F)≤ε. □
Corollary
For every fixed d∈N and ε>0 there is a number
N(d,ε) such that
if F has VC-dimension d and P is a probability distribution on D,
then there exists an ε-net for P on F
of size at most N(d,ε). In fact, a random (with probability P) sample of N(d,ε) elements of D is an ε-net, with probability at least 1−ε.
Moreover, we can take
N(d,ε) = O((d/ε²)·log(d/ε)).
In fact, for ε-nets (as opposed to ε-approximations) one can improve the bound above and get
N(d,ε) = O((d/ε)·log(d/ε)).
Although we will use
this bound in the sequel, the improvement
over the previous bound will not be relevant to us.
PAC learning
A concept is a function f:D→[0,1].
A concept class is a family of concepts on D.
We restrict to {0,1}-valued functions (aka sets),
which can be seen as a classification problem.
So now concept classes are just set families.
A labeled sample is a multiset S⊂D with a partition S=S+∪S−.
Denote by S_N(D) the set of all labeled samples of size N.
A learning map is a function f that maps every N-sample S∈S_N(D) to a generalization f(S)⊂D.
Ideally, f(S) should be consistent with S, that is, S+=S∩f(S) and S−=S∖f(S), but this does not need to happen (it is even desirable to allow errors in the labelling of S).
Intuition See the papaya intuition from Understanding Machine Learning.
Overfitting The naive choice is to pick f(S) to be S⁺, which merely memorizes the sample.
To avoid this, we restrict to some family F
of concepts that we are willing to consider,
and require that the range of f should be contained in F.
Empirical Risk Minimization is the learning approach in which
we pick f(S) to be any concept C∈F
that minimizes the number of errors on the training set,
that is, for which the number of points of S that belong to
S⁺△C is the smallest possible.
Here F is a predefined set family (representing our knowledge of the world).
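A minimal sketch of Empirical Risk Minimization over a finite concept class (here, threshold concepts on {0,…,100}; the names and the example are ours, purely illustrative):

```python
def erm(concepts, S_plus, S_minus):
    """Return a concept from the class minimizing the number of labelling errors."""
    def errors(C):
        return sum(1 for x in S_plus if x not in C) + sum(1 for x in S_minus if x in C)
    return min(concepts, key=errors)

concepts = [set(range(t)) for t in range(101)]        # thresholds {0, ..., t-1}
S_plus, S_minus = {3, 10, 17}, {40, 55, 90}
best = erm(concepts, S_plus, S_minus)
print(max(best))    # 17: the smallest threshold concept consistent with the sample
```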
PAC learnability
Say that F is PAC-learnable if
for every ε,δ>0 there exists N=N(ε,δ)
and a learning map f:SN(D)→2D such that
for every concept C∈F and probability distribution μ on D,
if we select N elements of D independently with distribution μ each
and label them with respect to C, obtaining a labeled sample S=S⁺⊎S⁻,
then we will have that μ(f(S)△C)≤ε with probability
at least 1−δ.
If moreover f always returns concepts in F, then we say that F is
properly PAC learnable.
We say that F is consistently PAC learnable if every learning map f:S_N(D)→F
such that f(S) is always a concept in F that agrees with S has the above property.
Theorem (Fundamental theorem of PAC learning).
Let F be a concept class.
The following conditions are equivalent:
F is PAC learnable,
VC(F) is finite.
Remark There are some untold hidden assumptions in this statement.
Either we should assume some measurability conditions, similar to the
ones in the VC theorem for infinite domains, or we should assume that the domain is finite.
Then the equivalence of the two implications in the theorem should be read as follows. Bottom-up implication: the sample complexity
N(ε,δ) can be bounded in terms of ε,δ and on the VC-dimension of the family F.
Top-down implication: the VC-dimension of F can be bounded in terms of the function N(⋅,⋅). More precisely, we
can bound it in terms of N(1/3,1/3).
Proof
For the bottom-up implication, we may assume for simplicity
(by taking the min) that δ=ε.
Let N=N(d,ε) be the number obtained from the ε-net Corollary.
The learning map f:S_N(D)→F is defined so that f(S⁺,S⁻) is any concept C∈F that contains S⁺ and is disjoint from S⁻.
To prove that this learning map f has the required properties,
we consider the set system
F△C:={F△C:F∈F},
where △ denotes the symmetric difference of sets.
It was shown in the exercises that F△C has the same VC-dimension as F.
By the ε-net corollary, a random sample of N elements of D is an ε-net for the family F△C,
with probability at least 1−ε.
This easily implies that the learning map f has the required property.
The top-down implication was shown in the exercises.
See also Chernikov's notes.
Transversals and Helly-type properties
Transversals
Definition Fix a set system F on a domain D.
A transversal T is a set T⊆D that intersects every F∈F.
The transversal number of F is the smallest size of a transversal, and is denoted τ(F).
A packing G is a set G⊆F of pairwise disjoint sets.
The packing number is the largest size of a packing, and is denoted ν(F).
Clearly τ(F)≥ν(F).
It is NP-hard to compute τ(F) and ν(F).
For this reason, we sometimes use the following, approximate notions.
Definition Fix a set system F on a finite domain D.
A fractional transversal T is a function f:D→[0,1]
such that for every F∈F we have
∑_{x∈F} f(x) ≥ 1.
Its total weight is ∑_{x∈D} f(x).
The fractional transversal number of F is the infimum
of the total weights of all fractional transversals, and is denoted τ∗(F).
A fractional packing G is a function g:F→[0,1]
such that for every x∈D we have
∑_{F∈F : x∈F} g(F) ≤ 1.
Its total weight is ∑_{F∈F} g(F).
The fractional packing number is the supremum of the total weights of all fractional packings, and is denoted ν∗(F).
The following lemma is straightforward.
Lemma For every set system F on a finite domain D we have:
τ∗(F)≥ν∗(F)
The following fact is a consequence (or restatement) of the duality theorem for linear programs. It is an easy consequence of Farkas' lemma.
Fact For every set system F on a finite domain D we have
τ∗(F)=ν∗(F).
Moreover, this number is rational, computable in polynomial time,
and is attained both by a
fractional transversal f:D→[0,1] with rational values, and by a
fractional packing g:F→[0,1] with rational values.
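Concretely, τ* can be computed by solving the linear program above. A small Python sketch, assuming scipy is available (the example set system is ours):

```python
import numpy as np
from scipy.optimize import linprog

domain = list(range(6))
family = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]

c = np.ones(len(domain))                                          # minimize sum_x f(x)
A_ub = np.array([[-1.0 if x in F else 0.0 for x in domain] for F in family])
b_ub = -np.ones(len(family))                                      # i.e. sum_{x in F} f(x) >= 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(domain))
print(res.fun)   # tau*(F); by the duality above this also equals nu*(F)
```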
Corollary
For every set system F on a finite domain D with VCdim(F)=d, we have:
τ(F) ≤ O(d·τ*(F)·log τ*(F)).
Proof Follows from the existence of ε-nets
of size O(d·(1/ε)·log(1/ε)).
Namely, pick a fractional transversal f:D→[0,1]
of total weight r,
and normalize it to get a probability measure μ
with μ(x)=f(x)/r.
For a set F∈F
we have that ∑_{x∈F} f(x) ≥ 1 since f is a fractional transversal, so μ(F) ≥ 1/r.
Let ε=1/r.
By the ε-net theorem
there is an ε-net T for μ on F
of size O(d)·r·log r. Since μ(F) ≥ 1/r = ε for all F∈F and T is an ε-net, it follows that
T intersects every F∈F.
Helly-type theorems
Fact (Helly's Theorem)Fix d≥1.
Let F be a finite family of convex sets in Rd.
If every d+1 sets in F have a point in common,
then F has a point in common.
Definition A family G has the k-Helly property
if for every finite subfamily F⊂G the following holds: if every k sets in F have a point in common, then F has a point in common.
For instance, the family of convex subsets of Rd has the (d+1)-Helly property.
Families of bounded VC-dimension don't necessarily have the k-Helly property, for any finite k.
For example, for an infinite set V, the family of all sets of the form V−{a}, for a∈V, has VC-dimension 1, but does not have the k-Helly property for any k.
However, families of bounded VC-dimension satisfy a related property.
Say that a family F has the (p,q)-property
if among every p sets of F,
some q have a point in common.
For example, the k-Helly property is the same as the (k,k)-property.
Note that the conclusion in Helly's theorem is that F has a transversal of size 1, that is τ(F)≤1. The following is a generalization.
Fact (Alon,Kleitman). Let p,q,d be integers with p≥q≥d+1.
Then there is a number N=N(p,q,d) such that for every finite family F of convex subsets of Rd,
if F has the (p,q)-property, then τ(F)≤N.
Now this statement generalizes to families of bounded VC-dimension,
and can be derived from the proof of Alon and Kleitman, as observed by Matousek.
Fact (Alon,Kleitman and Matousek). Let p,q,d be integers with p≥q≥d+1.
Then there is a number N=N(p,q,d) such that for every finite family F with VC-dim∗(F)<d+1,
if F has the (p,q)-property, then τ(F)≤N.
Recall that VC-dim∗(F) is the dual VC-dimension of F,
defined as the VC-dimension of F∗.
Equivalently, it is the maximal number k of subsets in F such that
every cell in the Venn diagram of those sets is nonempty (occupied by some element in the domain).
We reprove a duality result which is a corollary
of the (p,q)-theorem.
We present a self-contained proof.
Let E⊂A×B be a binary relation.
For a∈A denote E(a):={b∈B∣(a,b)∈E}
and for b∈B denote E∗(b):={a∈A∣(a,b)∈E}.
Define the VC-dimension of a binary relation E as the maximum of the VC-dimension of two set systems:
(B,{E(a)∣a∈A}) and (A,{E∗(b)∣b∈B}).
Here's the duality result.
Theorem (Corollary of the (p,q)-theorem)
For every d there is a number k≤O(d) with the following property.
Let E⊂A×B be a binary relation with VCdim(E)≤d,
with A,B finite.
Then at least one of two cases holds:
there is a set B′⊂B with ∣B′∣≤k
such that for all a∈A we have E(a)∩B′≠∅
(that is, B′ dominates A wrt E), or
there is a set A′⊂A with ∣A′∣≤k
such that for all b∈B we have E∗(b)∩A′≠A′
(that is, A′ dominates B wrt ¬E).
We first state a corollary of von Neumann's minimax theorem (also of Farkas' lemma, the Hahn-Banach theorem, or of the strong duality theorem for linear programming).
Theorem (Minimax Theorem)
Let E⊂A×B be a binary relation with A,B finite, and let α∈R.
Then exactly one of the following two cases holds:
there is some probability distribution ν on B such that
ν(E(a))≥α holds for all a∈A
there is some probability distribution μ on A such that
μ(E∗(b))<α
holds for all b∈B.
Proof of Corollary of the (p,q)-theorem
Set α:=1/2 and ε:=1/3;
the point is that 0<α±ε<1.
Fix d∈N, and let k be the number from
the VC-theorem, with k ≤ O((d/ε²)·log(1/ε)) = O(d).
By the Minimax Theorem applied to α=1/2,
one of the two cases below holds.
Case 1: There
is some probability distribution ν on B
such that ν(E(a))≥1/2 holds for all a∈A.
By the VC-theorem applied to ε=1/3, there is a multiset
B′⊂B with ∣B′∣≤k
such that
∣Av_{B′}(E(a)) − ν(E(a))∣ ≤ 1/3 for all a∈A.
Pick a∈A.
Since ν(E(a))≥1/2,
we have that AvB′(E(a))>0,
so B′∩E(a) is nonempty.
Therefore, the first condition in the statement is satisfied.
Case 2: There is some probability distribution
μ on A such that μ(E∗(b))<1/2 holds for all b∈B.
By the VC-theorem again, there is a multiset A′⊂A with ∣A′∣≤k such that
∣Av_{A′}(E∗(b)) − μ(E∗(b))∣ ≤ 1/3 for all b∈B.
Pick b∈B.
Since μ(E∗(b))<1/2,
we have that AvA′(E∗(b))<1,
so A′ is not contained in E∗(b). Therefore, the second condition in the statement is satisfied.
■
Fix a set family F on a finite domain D.
Let Sd(F) denote the set of labeled samples of size
d, that is, pairs S:=(S+,S−) of multisets of elements of D with ∣S+∣+∣S−∣=d,
such that there exists some F∈F containing S+ and disjoint from S−.
For a labeled sample S and set Z⊂D, let S[Z] denote
the sample (S⁺∩Z, S⁻∩Z).
If S and T are two samples, we say that T is a sub-sample of S
if T+⊂S+ and T−⊂S−.
Let Sω(F) denote the union of all Sd(F), for d≥0.
Fix a family F of VC-dimension d, for some fixed d.
A sample compression scheme for F, of size k, consists of two
mappings: a compression mapping, and a reconstruction mapping,
defined as follows.
Compression
The compression map maps each
Y:=(Y+,Y−)∈Sω(F)
to a pair (Z,w), where
Z⊂Y is a sub-sample of Y of size k and w∈{0,1}^k is a bit-sequence describing some auxiliary information.
Reconstruction
A reconstruction function maps a given pair (Z,w), where Z∈S_k(F) and w∈{0,1}^k, to a concept C⊆D.
We require that the composition of the two mappings – start with Y∈S_ω(F), compress it to some (Z,w), reconstruct some concept C – returns a concept C⊆D that contains Y⁺ and is disjoint from Y⁻.
Recall that VCdim∗(F)
is the dual VC-dimension of F.
Theorem Let F be a set system on a finite domain, and d=VCdim(F) and d∗=VCdim∗(F). Then F has a sample compression scheme of size O(d⋅d∗).
Proof
By the VC-theorem, there is s=O(d) and a learning map
f:S_s(F)→F such that for all F∈F
and probability distributions μ on D there is Z⊂supp(μ):={v∈D∣μ(v)>0} with ∣Z∣≤s
such that
μ(F△f(Z∩F, Z−F)) ≤ 1/3.
Let (Y+,Y−)∈Sω(F), and let Y=Y+∪Y−.
The key lemma is the following.
Lemma
There are T ≤ O(d∗) sets Z₁,…,Z_T ⊆ Y⁺∪Y⁻, each of size at most s,
such that for every x∈Y⁺∪Y⁻ we have:
x∈Y⁺ ⟺ #{t∈{1,…,T} : x∈f(Y[Z_t])} > T/2.
With this lemma,
obtaining the compression scheme is straightforward, as we now describe.
Compression
We assume an arbitrary total order on D is fixed.
To compress (Y+,Y−),
we map it to (Z,w) where
Z is the union of the sets as in the statement of the lemma,
and w is a bit-sequence of length T,
whose i-th bit indicates whether the i-th element of Z (in the fixed order on D) belongs to Zi.
This needs to be fixed slightly
Decompression
Given (Z,w), we first compute the sets Z1,…,ZT,
using Z and the bit-sequence w.
We then define a concept C as the set of all x∈D
such that
#{t∈{1,…,T} : x∈f(Y[Z_t])} > T/2
(note that Y[Z_t]=Z[Z_t] since Z_t⊆Z, so this is computable from (Z,w)).
It follows immediately from the lemma that this is a sample compression scheme of size O(dd∗).
We now prove the lemma.
Proof of lemma
Denote
G := G_Y = {f(Y⁺∩Z, Y⁻∩Z) : Z⊆Y, ∣Z∣≤s} ⊆ F.
For a concept C∈G and element x∈Y+∪Y−,
say that C is correct on x
if x∈C⟺x∈Y+ holds.
By definition of f,
for every probability distribution μ on Y there is
G∈G such that
μ({x∈Y∣G is correct on x})≥2/3.
By the Minimax theorem,
there is a probability distribution ν on G
such that for every x∈Y we have
ν({G∈G : G is correct on x}) ≥ 2/3.
By the VC-theorem,
there is a 1/8-approximation
for ν on G∗:
a multiset H⊂G of size T≤O(d∗) such that
for all x∈Y we have:
#{H∈H : H is correct on x}/#H ≥ ν({G∈G : G is correct on x}) − 1/8 ≥ 2/3 − 1/8 > 1/2.
Taking Z₁,…,Z_T to be sets such that f(Y[Z_t]) enumerates the elements of H,
this proves the lemma, and concludes the theorem.■
Haussler's packing theorem
Fix a domain D and a probability distribution P on D.
For two sets X,Y⊆D, write distP(X,Y)
to denote the measure P[X△Y] of the symmetric difference of X and Y.
This defines a metric on 2^D.
Fix ε>0.
We are interested in the problem of packing a maximal number of pairwise disjoint (closed) balls of radius ε in this metric.
Equivalently, of finding the maximal size of
a 2ε-separated subset S⊆F,
that is, a set S whose elements have pairwise distance larger than 2ε.
Example. Suppose we are packing discs of radius 0<ε<1 with center in the n×n square in R2.
For a disc D with center in [0,n]×[0,n],
the area of D∩([0,n]×[0,n]) is at least πε²/4 = Ω(ε²).
Thus, we can pack at most O((n/ε)²) disjoint discs.
Haussler's packing theorem says that for set systems F of VC-dimension d, the behaviour is similar as in the d-dimensional Euclidean space.
Theorem
Let d>1 and C be constants, and let F be a set system on a finite domain D whose primal shatter function satisfies π_F(m) ≤ C·m^d for all m∈N. Let 0<ε<1 and fix a probability distribution on D. If F is ε-separated then ∣F∣ ≤ O(1/ε^d).
Corollary
Let F be a set system on a
finite domain D and d=VCdim(F).
Let k be a positive integer.
If ∣A△B∣ ≥ k for all distinct A,B∈F
then ∣F∣ ≤ O((n/k)^d), where n=∣D∣.
This follows from the theorem
applied to ε=k/n and the uniform distribution on D,
by the Sauer-Shelah-Perles lemma. Conversely, for k=1, the Corollary recovers the SSP lemma, although with worse hidden constants.
We will prove the following lemma, which quickly yields the theorem.
Lemma (∗). Fix 0<δ<1 and ε>0.
Let F be an ε-separated set system on D, and let μ
be a probability distribution on D. Fix m = 2·VCdim(F)/(εδ)
and let S be a sample of m independently chosen elements of D according to the distribution μ.
Then
#F ≤ E_S[#(F∣S)] / (1−δ).
In particular, if π_F(m) ≤ C·m^d for all m,
we have that
#F ≤ C·m^d/(1−δ) = C·(2·VCdim(F)/(εδ))^d/(1−δ)
holds for all 0<δ<1.
The minimum, over 0<δ<1, of
1/((1−δ)·δ^d) is attained for δ=d/(d+1),
and is at most e(d+1).
Hence we get
#F ≤ C·e(d+1)·(2·VCdim(F)/ε)^d ≤ O(1/ε^d),
which proves the theorem. It therefore remains to prove the lemma.
For a set S⊆D and F∈F denote by
F∣S the labeled sample F∣S:=(S+,S−)=(F∩S,S−F).
Recall that, given ε>0, the VC-theorem implies the existence of a learning function
f:S_m(F)→2^D,
where m = O(d·(1/ε)·log(1/ε)),
such that for every F∈F and random sample S of m elements of D
chosen independently with a given probability distribution μ,
we have that
μ[f(F∣S)△F] < ε.
Using this, it is easy (and instructive) to prove Lemma (∗) with a weaker bound,
where instead of m = 2·VCdim(F)/(εδ)
we use m = O(d·(1/ε)·log(1/ε)).
This would yield the bound
#F ≤ O((1/ε)·log(1/ε))^d,
which was obtained originally by Dudley.
[To obtain Dudley's bound, proceed as in the proof Lemma (∗) presented below, but instead of the learning function
obtained in the proposition below, use the learning function
obtained from the VC-theorem.]
In order to obtain the improved bound as stated in Lemma (∗), we need a learning function
with the number of sample elements of the form m = O_δ(1/ε),
rather than m = O_δ((1/ε)·log(1/ε)). This might come at a worse dependency on δ, which we don't care about, because we consider δ to be fixed, say δ=1/2. This is achieved by the following proposition.
Proposition Let F be a set system on a domain D.
Fix m = 2·VCdim(F)/(εδ).
There is a "learning function"
f:Sm(F)→2D such that the following holds.
For every F∈F and random
sample S of m elements of D chosen independently with a given probability distribution μ,
we have that
μ[f(F∣S)△F] < ε/2
holds with probability at least 1−δ over the sample.
The following lemma is key.
Given a set system F,
consider its Hamming graph H_F,
with vertices F and edges connecting two sets F,G∈F
if and only if ∣F△G∣=1.
The edge density of a graph (V,E) is ∣E∣/∣V∣.
Lemma The Hamming graph HF of F has edge density at most VCdim(F).
The proof of the lemma is by repeating the proof of the Sauer-Shelah-Perles lemma and observing that it also yields this conclusion. See also Geometric Discrepancy by Matousek, Lemma 5.15.
This implies that the same holds for every induced subgraph of HF.
Fact
Let G be a graph such that every induced subgraph H of G
has edge density at most d. Then there is an orientation of the edges of G
such that every node has outdegree at most d.
Remark This is very easy to prove with 2d instead of d:
repeatedly remove a vertex v of degree at most 2d,
and orient its edges towards the remaining vertices.
This can be improved to d using Hall's marriage theorem
(see Alon, Tarsi '92).
Corollary
There is an orientation of the edges of HF
such that every node has outdegree at most VCdim(F).
Proof of Proposition
Let d0=VCdim(F).
Fix a probability distribution μ on D.
We will show that there is a function f:S_m(F)→2^D
such that E_S[μ(f(F∣S)△F)] ≤ d₀/(m+1) for every F∈F. By Markov's inequality, since m = 2d₀/(εδ), this implies that μ(f(F∣S)△F) < ε/2 holds with probability at least 1−δ, as required.
We proceed to defining the learning function f.
Fix a set S={s1,…,sm}.
Suppose that we are given the labelling F∣S, for some F∈F, and we
want to guess the label of s0, for some s0∈D.
Let S′:=S∪{s0}.
Fix an orientation of H_{F∣S′} with out-degree at most d₀ = VCdim(F).
If there is only one G∈F∣S′ that agrees with F∣S,
then G=F and we label s0 according to G.
If there are two concepts G,H∈F∣S′ that agree with F∣S then G and H are adjacent in HF∣S′.
We label s0 according to the one which is the out-neighbor of the other.
This way, by ranging s₀ over all elements of D,
we define a concept f(F∣S), based on the labeled sample F∣S.
We argue that for all F∈F, we have:
E_S[μ(f(F∣S)△F)] ≤ d₀/(m+1) < d₀/m.
Fix F∈F.
We have that
E_S[μ(f(F∣S)△F)] = (1/(m+1)) · E_{S′}[ #{i : s_i ∈ H_i△F} ],
where
S′={s₀,…,s_m} is a random sample of size m+1,
and we define
m+1 concepts H₀,…,H_m, by leaving the i-th element s_i out and applying
f as defined above to the resulting labeled sample of size m.
We have:
#{i : s_i ∈ H_i△F} ≤ out-deg(F∣S′) ≤ d₀,
since H_i mislabels s_i only when the concept of F∣S′ used to label s_i is an out-neighbour of F∣S′, and distinct indices i yield distinct out-neighbours.
This gives the required bound d₀/(m+1) < d₀/m.
■
We now prove Lemma (∗), recalled below.
Lemma (∗). Fix 0<δ<1 and ε>0.
Let F be an ε-separated set system on D, and let μ
be a probability distribution on D. Let m = 2·VCdim(F)/(εδ)
and let S be a sample of m independently chosen elements of D according to the distribution μ.
Then
#F ≤ E_S[#(F∣S)] / (1−δ).
Proof
For F∈F and a set S of m elements,
say that F is well approximated on S
if μ(f(F∣S)△F)<ε/2, where f is the learning function from the proposition.
We first show that for any fixed set S of size m,
#{F∈F : F is well approximated on S} ≤ #(F∣S).
To prove this, suppose that F,G∈F are distinct and F∣S=G∣S.
Let F^=f(F∣S) and G^=f(G∣S).
Since F is ε-separated we have ε≤μ(F△G).
As F^=G^, by the triangle inequality we have:
ε≤μ(F△G)≤μ(F△F^)+μ(G△G^)
so one of those must be at least ε/2.
So at most one of F,G is well approximated on S,
which proves the inequality above.
The above inequality can be written as follows, where F∈F is chosen uniformly at random:
P_F[F is well approximated on S] ≤ #(F∣S)/#F.
In particular, if S is a random sample of m elements chosen independently with any given distribution μ, we have:
E_S P_F[F is well approximated on S] ≤ E_S[#(F∣S)]/#F,
On the other hand, by the proposition,
for every fixed F∈F
we have that F is well approximated on a random sample S of m elements with probability at least 1−δ.
In particular,
E_F P_S[F is well approximated on S] ≥ 1−δ.
As the left-hand sides in the two inequalities are equal,
we get that 1−δ ≤ E_S[#(F∣S)]/#F,
which yields the lemma.■
Zarankiewicz problem and incidence geometry
The Zarankiewicz problem asks about the largest number of edges in a bipartite graph
with parts of sizes m and n that
avoids a fixed biclique K_{s,t} as a subgraph.
Let z(m,n;s,t) denote this number.
The Kővári–Sós–Turán theorem gives an upper bound.
Example
We show that there is a K_{2,2}-free graph matching the bound
O(n^{3/2}). Thus, for s=t=2, the bound
is optimal.
Let p be a prime and Fp be the field with p elements.
Let P be the set of p² points in F_p²,
and L be the set of p² lines in F_p²
of the form {(x,y)∈F_p² : y=ax+b}, for some a,b∈F_p.
Since every line contains p points, there are p³ point-line incidences. This graph excludes K_{2,2}, since two distinct lines have at most one point in common.
Thus we have a bipartite K_{2,2}-free graph with n=2p² vertices and Θ(n^{3/2}) edges, matching
the Kővári–Sós–Turán bound. This holds for all n since for large enough n there is a prime between n−n^{1/3} and n
(this is a very deep result).□
It turns out that for bipartite graphs arising as incidence graphs
of points and lines in the plane, we get better bounds.
Given a collection P of points and L of lines in R2,
let I(P;L) denote the number of pairs (p,ℓ)∈P×L
such that p∈ℓ.
Let I(m;n) denote the maximum of I(P;L), over all m-element P and n-element L.
Theorem (Szemerédi-Trotter, 1983).
I(m;n) ≤ O(m^{2/3}n^{2/3} + m + n),
and this bound is asymptotically tight.
The bound for m ≥ n² or n ≥ m² follows from the Kővári–Sós–Turán theorem. We follow a proof for the balanced case,
of m=n, based on Matoušek, Lectures on Discrete Geometry, Chapter 4.5. This uses the Cutting Lemma for lines in the plane.
We now prove a generalization of the Kővári–Sós–Turán
bound, for graphs of bounded VC-dimension. We follow this paper.
Theorem 1 (Fox, Pach, Sheffer, Suk, Zahl 2017).
Let G=(A,B,E) be a bipartite graph with ∣A∣=m,∣B∣=n and such that
the set system G={N(b)∣b∈B} satisfies π_G(z) ≤ c·z^d for some constants c,d and all z.
Then, if G is Kt,t-free, we have
∣E∣ ≤ O_{c,d,t}(m·n^{1−1/d} + n).
The Kővári–Sós–Turán bound is the same, with 1−1/t instead of 1−1/d.
Corollary. Fix d1,d2,c∈N.
Let G=(A,B,E), where
A⊂R^{d₁} and B⊂R^{d₂}
are sets of size m,n respectively,
and E⊆A×B is defined by a boolean combination of c polynomial equalities and inequalities in d₁+d₂ variables of degree at most c. If G is K_{t,t}-free, then
∣E∣ ≤ O_{d₁,d₂,c,t}(m·n^{1−1/d₂} + n).
From the Corollary (for d1=d2=2) and a cutting lemma for algebraic curves in the plane, one can derive the following generalization of the Szemeredi-Trotter result.
Theorem 2.
Let G=(A,B,E), where
A⊂R2 and B⊂R2
are sets of size m,n respectively,
and E⊆A×B is defined by a boolean combination of c polynomial equalities and inequalities with 4 variables of degree at most c. If G is Kt,t-free, then
∣E∣ ≤ O_{c,t}(m^{2/3}n^{2/3} + m + n).
We only prove Theorem 1.
This follows easily from the following generalization of Haussler's theorem.
Say that a family F of sets on an n-element domain is
(k,ε)-separated, where ε>0,
if for every k sets F1,…,Fk∈F
we have
∣(F1∪⋯∪Fk)−(F1∩⋯∩Fk)∣≥εn.
In other words, for every tuple F₁,…,F_k∈F, for at least an ε-fraction of the elements x of the domain, the membership of x is not the same in all the F_i's.
So a (2,ε)-separated family is the same as an ε-separated family from the previous lecture
(for the counting measure).
Recall that Haussler's theorem says that if F is
an ε-separated family with π_F(m)=O(m^d),
then ∣F∣ ≤ O_d((1/ε)^d).
We now prove the same result for (k,ε)-separated families.
Theorem 3 Fix k∈N and ε,d>0.
Let F be a (k,ε)-separated family
with π_F(m)=O(m^d).
Then ∣F∣ ≤ O_{k,d}((1/ε)^d).
The proof of Haussler's theorem relied on the following.
Proposition. Let ε,δ>0 and m = VCdim(F)/(εδ).
There is a "learning function"
f:Sm(F)→2D such that
for every F∈F and random
sample S of m elements of D chosen independently with a given probability distribution μ,
we have that
μ[f(F∣S)△F]<ε
holds with probability at least 1−δ over the sample.
We now prove Theorem 3.
Proof.
Let F be a (k,ε)-separated family
with π_F(m)=O(m^d).
Fix a parameter 0<δ<1.
Apply the proposition to ε/k in place of ε, obtaining
m = k·VCdim(F)/(εδ) and a learning function f:S_m(F)→2^D. Let μ denote the normalized counting measure on the domain, with μ(F)=∣F∣/n.
Say that a set F∈F is well-approximated on a set S
of size m
if μ[f(F∣S)△F]≤ε/k.
The Proposition says that every F is well-approximated on a random S with probability at least 1−δ.
Therefore, for a random m-tuple S, the expected number of sets F that are well-approximated on S is at least #F·(1−δ):
1−δ ≤ (1/#F)·∑_{F∈F} P_S[F is well approximated on S] = (1/#F)·E_S[#{F∈F : F is well approximated on S}].
On the other hand, for a given S,
there are at most (k−1)⋅#(F∣S)
sets that are well approximated on S.
Indeed, for any k-tuple of sets F1,…,Fk
with the same restriction to S, not all of them are well approximated on S. Let K=f(F1∣S).
Then (F₁∪…∪F_k)−(F₁∩⋯∩F_k) ⊆ (F₁△K)∪⋯∪(F_k△K): indeed, if x lies in some F_i but not in some F_j, then x∈F_i△K if x∉K, and x∈F_j△K if x∈K.
So if all of F₁,…,F_k were well approximated on S, then the right-hand side would have measure at most k·(ε/k)=ε,
and so would the left-hand side,
contradicting the assumption that F is (k,ε)-separated.
Hence, for a fixed set P∈F∣S, there are at most k−1
sets in F that restrict to P
and are well approximated on S.
Therefore, the number of sets F that are well-approximated on a given S is at most (k−1)⋅#(F∣S).
We get that
the expected number X of sets F that are well-approximated on S satisfies
#F·(1−δ) ≤ X ≤ (k−1)·E_S[#(F∣S)] ≤ O_k(π_F(m)).
Since δ is a fixed constant and m = O_{k,d}(1/ε),
this proves #F ≤ O_k(π_F(m)) ≤ O_{k,d}((1/ε)^d),
as required. ■
We now prove Theorem 1.
Theorem 1
Let G=(A,B,E) be a bipartite graph with ∣A∣=m,∣B∣=n and such that
the set system G={N(b)∣b∈B} satisfies πG(z)≤czd for some constants c,d and all z.
Then, if G is Kt,t-free, we have
∣E∣≤Oc,d,t(mn1−1/d+n).
Proof
Claim. There is Δ = O(m/n^{1/d})
and a set B′⊆B with ∣B′∣ = t
such that at most Δ elements a∈A
are non-homogeneous towards B′ (that is, adjacent to some but not all vertices of B′).
Proof.
Let c be the multiplicative factor from Theorem 3,
and fix a constant b₀ to be determined below.
Suppose that for every t-element set B′⊆B,
at least Δ := b₀·m/n^{1/d} elements a∈A are non-homogeneous towards B′.
Then G is (t, b₀/n^{1/d})-separated,
so by Theorem 3, we have ∣G∣ ≤ c·(n^{1/d}/b₀)^d = n·c/b₀^d,
whereas ∣G∣ ≥ n. Picking b₀ so that c < b₀^d we get a contradiction. Thus, we may take Δ = b₀·m/n^{1/d}.□
Since G is K_{t,t}-free and at most Δ elements of A are non-homogeneous towards B′, it follows that every vertex b∈B′
has degree at most Δ+t in G.
We may remove any such vertex b from B and repeat the argument.
Therefore, ∣E∣ ≤ n·(Δ+t) = O(m·n^{1−1/d} + n), as required.
■
Matchings with small crossing number
We follow Geometric Discrepancy, Jiri Matousek, Chapter 5.4.
We prove Theorem 5.17 from there:
Theorem
Let F be a set system on a 2n-element domain D,
with π_F(m) ≤ c·m^d for some constants c,d>1.
Then there exists a perfect matching on D
whose crossing number is at most O_{c,d}(n^{1−1/d} + log n).
Here, the crossing number of a graph (V,E) with respect to F
is the maximum, over all F∈F, of the number of pairs {u,v}∈E with u∈F and v∉F.
Corollary
Let F be a set system on an n-element domain D,
with π_F(m) ≤ c·m^d for some constants c,d≥1.
Then there exists a path on D
whose crossing number is at most O_{c,d}(n^{1−1/d} + log n).
The proof of the corollary (using the theorem) is left as an exercise.
Geometric range searching
Geometric range searching is an area of computational geometry,
which stimulated a large research effort in the mid 80's-mid 90's. Many of the results that we have seen in the lecture so far were stimulated by problems in this area (e.g. ε-nets, cuttings, packing lemma, matchings with low crossing numbers). This area has deep connections with important problems in mathematics and in computer science.
See Geometric range searching by Jiri Matousek for a survey.
In this area, a set system (D,F) is called a range space, and the sets in F are called ranges.
Typical range spaces considered in this area are:
(halfspace range searching) half-spaces in Rd
(orthogonal range searching) axis-parallel boxes in Rd
(simplex range searching) simplices in R^d
discs in Rd.
For a fixed range space (D,F), consider the following problem. We are given a finite set P⊂D of points,
and our task is to compute a data structure that allows to answer the following queries: given a range F∈F:
(range emptiness) is F∩P nonempty?
(range counting) compute #(F∩P),
(range reporting) enumerate F∩P.
Those problems are motivated by practical applications, e.g. in databases.
A common generalization of the above problems asks about sums in a commutative semigroup (S,+). Here we assume that P is given together with a weight function w:P→S,
and the data structure allows to answer queries: given a range F∈F, compute the weighted sum
∑_{v∈F∩P} w(v).
This generalizes the above, by taking:
(range emptiness) S=({0,1},∨), w(v)=1 for v∈P,
(range counting) S=(N,+), w(v)=1 for v∈P,
(range reporting/enumeration) S=(2^D,∪), w(v)={v} for v∈P.
For semigroups like in the first two cases, we assume the unit cost model, where semigroup elements can be represented in constant space and operations take constant time. For the last semigroup, this is an unreasonable assumption, and we need to pick different representations of sets (such as lists or arrays).
Intervals on the line
Let us first consider a very simple example of range spaces:
the set of intervals on the line R.
Let P⊂R be a set of n points, and assume that they are sorted.
For the range emptiness and counting problems, we can use a very simple solution:
we precompute in time O(n),
for every a∈P, the number n_a of elements of P that are ≤ a.
Then #([a,b]∩P) is
the difference n_{b′}−n_{a′}, where b′ is the largest element of P that is ≤ b and a′ is the predecessor of a in P (with n_{a′}=0 if a has no predecessor).
The query time is O(logn) if we allow arbitrary intervals [a,b] as queries, and O(1) if we only consider intervals with endpoints in P.
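In code, the prefix counts n_a are simply positions in the sorted array, so a range-count query amounts to two binary searches. A minimal Python sketch (ours):

```python
from bisect import bisect_left, bisect_right

P = sorted([3, 7, 7, 12, 20, 31, 44])          # the (sorted) point set

def count_in_interval(a, b):
    """#([a, b] ∩ P) in O(log n) time; sorted positions play the role of the counts n_a."""
    return bisect_right(P, b) - bisect_left(P, a)

print(count_in_interval(7, 31))                 # 5: the points 7, 7, 12, 20, 31
```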
The above solution also works for computing weights in some semigroups -- specifically, those which embed into a group.
But it does not
work for all semigroups, for instance, for (N,max).
For range reporting, we just store the elements of P in a doubly-linked list.
The query time is still O(logn) or O(1)
if we agree to represent outputs as enumerators,
and O(logn+∣output∣) or O(∣output∣) if we wish to output the elements.
Note that we can always have a trivial solution that
precomputes all possible answers in time O(n2), and answers queries in time O(logn) or O(1). This brute-force approach also works for many other range search problems (such as for axis-parallel boxes), and the game is minimize the memory usage and/or pre-processing time, while having small query time.
The following solution works for all semigroups (for the interval range space).
In the preprocessing phase, construct a balanced (rooted, ordered) binary tree T, whose leaves are the points of P, and the usual left-to-right order on the leaves agrees with the usual order on P. Hence, T has height Θ(logn).
We will use T as a binary search tree. With each node v of T we associate a canonical interval I_v=[a,b], where a and b are the smallest and largest leaves below v.
Now, given any interval I=[a,b] (where a,b∈R),
we can find using binary search two points a′,b′∈P
such that I∩P=[a′,b′]∩P. Moreover, I∩P is a disjoint union of O(logn) canonical intervals, each intersected with P. Those canonical intervals can be computed while performing the binary search, in time O(logn).
We can precompute the total weight w(Iv∩P) of every canonical interval Iv in time O(n), while constructing the tree T. Since the total weight of I∩P is the sum of the total weights of the canonical intervals (as the union is disjoint), we can compute it in time O(logn), given I.
Note that this reasoning works in every semigroup in the unit cost model. The preprocessing time is O(n) (or O(nlogn) if the points are not sorted) and query time is O(logn).
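The canonical-interval decomposition is exactly what a segment tree implements. The sketch below is ours (it assumes a commutative semigroup given as a Python callable): it builds the tree with O(n) semigroup operations and answers a query by combining O(log n) canonical nodes.

```python
from bisect import bisect_left, bisect_right

class SemigroupRangeQuery:
    """Segment tree over the sorted points; works for any commutative semigroup op."""
    def __init__(self, points, weights, op):
        order = sorted(range(len(points)), key=lambda i: points[i])
        self.xs = [points[i] for i in order]
        self.op = op
        self.size = 1
        while self.size < len(points):
            self.size *= 2
        self.tree = [None] * (2 * self.size)
        for i, j in enumerate(order):                 # leaves hold the weights, in point order
            self.tree[self.size + i] = weights[j]
        for i in range(self.size - 1, 0, -1):         # internal nodes hold canonical-interval totals
            l, r = self.tree[2 * i], self.tree[2 * i + 1]
            self.tree[i] = l if r is None else (r if l is None else op(l, r))

    def query(self, a, b):
        """Total weight of the points in [a, b], or None if there are none."""
        lo = bisect_left(self.xs, a) + self.size
        hi = bisect_right(self.xs, b) + self.size
        res = None
        while lo < hi:                                # standard bottom-up traversal
            if lo & 1:
                res = self.tree[lo] if res is None else self.op(res, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = self.tree[hi] if res is None else self.op(res, self.tree[hi])
            lo //= 2
            hi //= 2
        return res

rq = SemigroupRangeQuery([5, 1, 9, 3, 7], [50, 10, 90, 30, 70], min)
print(rq.query(2, 8))   # 30: the minimum weight among the points 3, 5, 7
```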
Example
As mentioned, for (N,+), or semigroups
that embed in groups, we may improve the query time to O(1), assuming the queried intervals are of the form [a,b] for a,b∈P (otherwise we must spend Ω(logn) time even for emptiness).
Another semigroup where we can achieve some speedup
is (N∪{+∞},min).
This can be formulated in terms of range minimum queries.
We assume the RAM model in the following,
and that each input number fits into a single word.
Theorem
Given a list a1,…,an of numbers,
we can compute in time O(n) a data structure that
allows to answer the range minimum queries in time O(1):
given 1≤i≤j≤n, output argmin(a_i,…,a_j).
This result depends quite delicately on the computation model.
If we assume the pointer machine model
(where we can follow pointers in constant time, but cannot manipulate those pointers), then
there is a lower bound which says that the above is not possible -- there is an Ω(log log n) lower bound for query answering (Harel, Tarjan, 1984).
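For comparison, a sparse table gives O(n log n) preprocessing and O(1) queries with a few lines of code; this is not the O(n) structure from the theorem, just a simpler variant (the sketch and names below are ours).

```python
def build_sparse_table(a):
    """table[k][i] = index of the minimum of a[i : i + 2**k]."""
    table = [list(range(len(a)))]
    k = 1
    while (1 << k) <= len(a):
        prev = table[k - 1]
        row = []
        for i in range(len(a) - (1 << k) + 1):
            l, r = prev[i], prev[i + (1 << (k - 1))]
            row.append(l if a[l] <= a[r] else r)
        table.append(row)
        k += 1
    return table

def range_argmin(a, table, i, j):
    """Index of the minimum of a[i..j] (inclusive), in O(1) time."""
    k = (j - i + 1).bit_length() - 1
    l, r = table[k][i], table[k][j - (1 << k) + 1]
    return l if a[l] <= a[r] else r

a = [5, 2, 8, 1, 9, 3]
t = build_sparse_table(a)
print(range_argmin(a, t, 1, 4))   # 3: the position of the value 1
```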
Simplex range searching
In all the considered geometric range search problems,
the range space on Rd is fixed, and d is considered to be a small, fixed constant.
Simplex range searching appears to be an important
case to which many other geometric range search problems can often be reduced. Below, Õ(⋅) hides a factor (log n)^{O(1)}.
Informal statement. Let P⊂R^d be an n-point set and m be a parameter with n ≤ m ≤ n^d.
The simplex range search problem and the halfspace range search problem can be solved with preprocessing time and space Õ(m) and query time Õ(n/m^{1/d}).
Moreover, the query time is optimal, given the preprocessing time and space m.
For example, on one extreme, for m=n
we get preprocessing time Õ(n) and query time Õ(n^{1−1/d}), which is necessary. On the other extreme, for m=n^d we get
preprocessing time Õ(n^d) and query time Õ(1).
Logarithmic query time
Consider the halfspace range problem.
Recall that for an n-point set in R^d, we have O(n^d) different sets P∩F, for different half-spaces F.
The solution is to precompute the answers for all possible sets P∩F, and then answer queries in time O(1).
However, to implement this, we also need to be able to find P∩F, given F.
This can be reduced (by duality) to the point location problem:
Given a subdivision of Rd into convex cells, construct a data structure that allows to quickly find the cell
containing any given point.
For partitions of R^d given by hyperplane arrangements, there is a solution with O(n^d) preprocessing and O(log n) query time (Chazelle 1993).
This relies on cuttings.
Recall:
Theorem (Chazelle, Friedman)
Let H be a set of n hyperplanes in R^d and let r be a parameter, with 1<r≤n. There is a (1/r)-cutting of R^d into O(r^d) simplices with disjoint interiors, covering R^d, such that no simplex is intersected by more than n/r hyperplanes.
Linear space
Consider the halfspace range problem in two dimensions.
That is, the ranges are half-planes in R2.
Early solutions
Early solutions are based on the ham-sandwich theorem,
and on partition trees.
Suppose for simplicity that the set P has size n which is a power of 4. Find a line ℓ that divides P into two sets of size n/2. By the ham-sandwich theorem, there is a line ℓ′ such that each of the four regions A₁,…,A₄ formed by ℓ and ℓ′ contains n/4 points.
In each of those regions A_i, proceed similarly, finding four regions A_{ij}, each with n/4² elements of P, and so on.
This way, we get a rooted tree T of branching 4, which can be used to solve the halfplane range problem.
The observation is that for any halfplane H and any node of our tree T, at least one of the four associated regions is not cut by the boundary of H, and so is either contained in H, or disjoint from H.
Therefore, we report ∅ or the precomputed set A∩P for this region A, and proceed recursively with the remaining three regions.
From this we get a query time
T(n) ≤ C + 3T(n/4),
which resolves to T(n) = O(n^{log₄ 3}) ≈ n^{0.79}.
Spanning trees with low crossing numbers
The work of Haussler, Welzl, and Chazelle has led to a huge improvement over earlier approaches based on ideas similar to the one above.
Recall the result
from the previous lecture/tutorial:
Fact
Let F be a set system on an n-element domain D,
with π_F(m) ≤ c·m^d for some constants c,d>1.
Then there exists a path on D
whose crossing number is at most O(n^{1−1/d}).
In our case (for halfplanes), we have π_F(m) ≤ O(m²), so we get a path p₁−p₂−…−p_n on P with crossing number O(√n).
Moreover, such a path can be effectively computed (our construction was randomized, but it can be derandomized).
Therefore, a given halfplane H partitions our point set P
into O(√n) sets P₁,…,P_k,
where each part is either contained in H or is disjoint from H, and forms an interval in our path.
Therefore, we have reduced the problem
to O(√n) interval problems,
which can be solved jointly in time
Õ(√n).
Note there is a caveat in the above solution:
given a halfplane H we need to be able to efficiently compute the edges of the path that are cut by H.
This can be solved efficiently for d=2 and d=3 (Chazelle, Welzl 1992),
but for higher d, this seems as difficult as the original problem.
Simplicial partitions
For higher d, the solution was given by Matousek (1992),
based on cuttings.
See Matousek, Geometric range searching, p. 20-21.
Using simplicial partitions, we can construct a partition tree similarly as earlier.
Now the recursion is:
T(n) ≤ f(r) + κ·T(2n/r),
where κ = O(r^{1−1/d}) is the crossing number of the simplicial partitions and f(r) is the computation cost spent in each node of the tree, with f(r) ≤ O(r).
We must find an optimal value of r.
If we take a large constant for r, we get:
T(n) ≤ O(n^{1−1/d+o_r(1)}).
By choosing r as a function of n appropriately, we get
T(n) ≤ Õ(n^{1−1/d})
and pre-processing time O(n log n).
Szemeredi Regularity Lemma
Let G be a graph and A,B⊂V(G).
Let E(A,B) be the set of edges with one endpoint in A and one endpoint in B,
and define the density between A and B as
d(A,B) = ∣E(A,B)∣ / (∣A∣·∣B∣).
Say that the pair (A,B) (where A and B are not necessarily disjoint)
is ε-regular if for every A′⊂A and B′⊂B,
if ∣A′∣≥ε∣A∣ and ∣B′∣≥ε∣B∣, then
∣d(A′,B′)−d(A,B)∣≤ε.
Say that a partition P of V(G)
is ε-regular if
∑_{(A,B)} ∣A∣·∣B∣ / n² ≤ ε,
where the sum ranges over all pairs (A,B)∈P² that are not ε-regular (possibly A=B).
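The definition of an ε-regular pair can be tested directly, though only on very small sets, since it quantifies over all sufficiently large subsets. A brute-force Python sketch (ours; the parity graph below is a standard example of an irregular pair):

```python
from itertools import chain, combinations

def density(E, A, B):
    return sum(1 for a in A for b in B if (a, b) in E or (b, a) in E) / (len(A) * len(B))

def subsets(S):
    return chain.from_iterable(combinations(S, k) for k in range(1, len(S) + 1))

def is_eps_regular(E, A, B, eps):
    d = density(E, A, B)
    return all(abs(density(E, A1, B1) - d) <= eps
               for A1 in subsets(A) if len(A1) >= eps * len(A)
               for B1 in subsets(B) if len(B1) >= eps * len(B))

A, B = [0, 1, 2, 3], [4, 5, 6, 7]
E = {(a, b) for a in A for b in B if (a + b) % 2 == 0}     # "parity" bipartite graph
print(density(E, A, B), is_eps_regular(E, A, B, 0.3))      # 0.5 False
```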
Theorem (Szemeredi regularity lemma)
For every ε>0 there is a number N such that for every graph G there is an ε-regular partition of V(G) with at most N parts.
For notational convenience, we state (and prove) the
result in a more general setting.
We will consider graphs that are weighted in two ways:
instead of an edge set E ⊆ (V choose 2), we may have a symmetric function E:V×V→[0,1],
instead of counting with respect to the counting measure, we can have a probabilistic measure (distribution) μ on V,
μ:V→[0,1] with μ(V)=1,
where we write μ(A) for ∑a∈Aμ(a).
Given (V,E,μ) as above,
we may define:
the edge density between two sets A,B⊂V as
the expected value of E(a,b), over a∈A and b∈B sampled independently with the distribution μ conditioned on A and on B, respectively:
d(A,B) = (1/(μ(A)·μ(B))) · ∑_{a∈A, b∈B} μ(a)·μ(b)·E(a,b).
With this notion, we can define the notion of an ε-regular pair A,B as previously, and
say that a partition P of V is ε-regular if
∑_{(A,B)} μ(A)·μ(B) ≤ ε,
where the sum ranges over pairs (A,B)∈P² that are not ε-regular.
Thus, the probability that a pair (a,b)∈V² (sampled with μ×μ) belongs to an ε-regular pair (A,B)∈P² is at least 1−ε.
In the following, fix V,E and μ as above.
Our goal is to prove that V has an
ε-regular
partition with Oε(1) parts.
We will use the following inequality.
Fact Let s₁,…,s_k ≥ 0 and μ₁,…,μ_k > 0.
Then
(∑_i s_i)² / (∑_i μ_i) ≤ ∑_i s_i²/μ_i.
Proof Follows from the Cauchy-Schwartz inequality
(∑_i a_i·b_i)² ≤ ∑_i a_i² ⋅ ∑_i b_i²
by taking a_i = s_i/√μ_i and b_i = √μ_i.□
Lemma Let A,B⊆V,
let 𝒜 be a partition of A and ℬ be a partition of B.
Then
d(A,B)²·μ(A)·μ(B) ≤ ∑_{A′∈𝒜, B′∈ℬ} d(A′,B′)²·μ(A′)·μ(B′).
For a partition P of V(G) define the potential of P
as
potential(P) := ∑_{A,B∈P} d(A,B)²·μ(A)·μ(B).
Lemma
If Q is a refinement of P then
potential(P)≤potential(Q).
Proof
Follows from the previous lemma, by considering all pairs
A,B∈P and adding up the inequalities
(where for A and B we take the restrictions of Q to A and B respectively).□
Lemma We have:
0≤potential(P)≤1.
Proof
Because ∑_{A,B∈P} μ(A)·μ(B) = 1
and 0≤d(A,B)≤1 for all A,B.□
Lemma If a pair (A,B) is not ε-regular then there are partitions
A=A₁⊎A₂ and B=B₁⊎B₂ such that
∑_{i,j∈{1,2}} d(A_i,B_j)²·μ(A_i)·μ(B_j) ≥ d(A,B)²·μ(A)·μ(B) + ε⁴·μ(A)·μ(B).
Lemma
Suppose P is not ε-regular. Then there is a refinement Q of P such that ∣Q∣ ≤ ∣P∣·4^{∣P∣} and
potential(Q) ≥ potential(P) + ε⁵.
Proof
Let I={{A,B}:A,B∈P,(A,B) not ε-regular}
be the set of irregular pairs in P.
For each {A,B}∈I consider the partitions A=A₁^{AB}⊎A₂^{AB} and B=B₁^{AB}⊎B₂^{AB} from the previous lemma, so that
∑_{i,j∈{1,2}} d(A_i^{AB},B_j^{AB})²·μ(A_i^{AB})·μ(B_j^{AB}) ≥ d(A,B)²·μ(A)·μ(B) + ε⁴·μ(A)·μ(B).
Consider the refinement Q of P,
which is the partition into the cells of the Venn diagram of the family
of sets P ∪ ⋃_{{A,B}∈I} {A₁^{AB}, A₂^{AB}, B₁^{AB}, B₂^{AB}}.
Then every A∈P is refined into at most 2^{2∣P∣} parts,
so ∣Q∣ ≤ ∣P∣·2^{2∣P∣} = ∣P∣·4^{∣P∣}.
For all {A,B}∈I, the previous two lemmas give that the contribution to potential(Q) of the parts refining A and B exceeds d(A,B)²·μ(A)·μ(B) by at least ε⁴·μ(A)·μ(B). Summing over all pairs, we get
potential(Q) ≥ potential(P) + ε⁴·∑_{{A,B}∈I} μ(A)·μ(B).
The sum in the last expression is ≥ ε, since P is not ε-regular. Therefore we get
potential(Q) ≥ potential(P) + ε⁵.□
Proof of Szemeredi regularity lemma.
Fix ε>0 and let G be a graph.
Let P0 be the partition of V(G) with one part,
and for i≥1, if Pi−1 is not already ε-regular, let
Pi be the refinement of Pi−1 obtained
from the previous lemma.
Then ∣Pi∣≤f(i) for some fixed function f
(which is bounded by a tower of 4's of height polynomial in i),
and potential(P_i) ≥ i·ε⁵.
Since potential(P_i) ≤ 1 for all i,
we have that the sequence P₀,P₁,… may have length at most 1/ε⁵. Therefore, the last element of this sequence is an ε-regular partition
of size at most f(1/ε⁵). □
A stronger statement
We now formulate and prove a slightly stronger statement, which is convenient in applications. The statement says that
all the parts have equal size, apart from one part,
there are at least m parts, where m is a given parameter.
Theorem (variant of Szemeredi regularity lemma)
For every ε>0 and m∈N there is a number N=N(ε,m) such that for every graph G there is an ε-regular partition P of V(G) with m≤∣P∣≤N and such that all parts of P, apart from one, have equal size.
The proof is very similar to the previous proof,
but in each step of the refinement process we additionally
reorganize the obtained partition to obtain a partition into parts of equal size
(apart from one).
Regularity for graphs of bounded VC-dimension
There are a couple of improvements of the regularity lemma that one could wish to have:
improve the bound on the number N of parts, which in the proof above is a tower of exponentials of height polynomial in 1/ε
remove irregular pairs
have all pairs (apart from a small set of irregular pairs) with density ≤ε or ≥1−ε
have all pairs (apart from a small set of irregular pairs) with density 0 or ≥1.
It was proven by Gowers that the function N cannot be improved in general
(it needs to be a tower of exponentials of height polynomial in 1/ε). Also, we will see that one cannot get rid of the irregular pairs.
We will see that for graphs of bounded VC-dimension, we can get both the first and third improvement.
We will also see that for stable graphs, we can furthermore get the second improvement.
Say that a pair A,B of sets of vertices is ε-homogeneous
if d(A,B)≤ε or d(A,B)≥1−ε.
Lemma
If A,B is ε³-homogeneous then A,B is ε-regular.
Proof Exercise.
We will prove the following.
Theorem Fix reals c,d≥1 and 0<ε<1/4. Let G be a graph with
π_{N(G)}(m) ≤ c·m^d for all m∈N.
There is a partition P of V(G) of size
O(1/ε^{2d+1}), with all parts of equal size (±1),
such that at most ε·∣P∣² pairs (A,B)∈P²
are not ε-homogeneous.
For a pair A,B of subsets of V(G),
let T_{AB} be the set of triples (a,b,b′)∈A×B×B
such that ab∈E(G) and ab′∉E(G).
Lemma
For a pair A,B of subsets of V(G) with ∣A∣=∣B∣=m that is not ε-homogeneous,
we have that
∣T_{AB}∣ + ∣T_{BA}∣ ≥ ε(1−ε)·m³.
Proof
Let ε_A = ∣T_{AB}∣/m³ and ε_B = ∣T_{BA}∣/m³.
Our goal is to prove that ε_A + ε_B ≥ ε(1−ε).
Pick a,a′∈A and b,b′∈B independently and randomly (with uniform measure).
Then
2ε(1−ε) ≤ P[ab∈E xor a′b′∈E],
since ε ≤ d(A,B) ≤ 1−ε.
Note that the event ab∈E xor a′b′∈E implies
that at least one of the two following events holds:
ab∈E xor a′b∈E, or
a′b∈E xor a′b′∈E.
These events have probabilities 2ε_B and 2ε_A, respectively. Hence, by the union bound, we have
that 2ε(1−ε) ≤ P[ab∈E xor a′b′∈E] ≤ 2(ε_A+ε_B), which gives the required conclusion.□
Proof of the theorem
Let n := |V(G)| and let δ := ε²n/16.
Claim.
There is a partition Q of V(G) such that |Q| ≤ O((n/δ)^d)
and |N(u)△N(v)| ≤ 2δ for all u,v∈V(G) that belong to the same part of Q.
Namely, greedily find v_1,v_2,…∈V(G) such that |N(v_i)△N(v_j)| ≥ δ for all i≠j, until no longer possible. Haussler's packing theorem says that this will stop after at most O((n/δ)^d) steps.
For every v∈V(G)
there is some i such that |N(v)△N(v_i)| ≤ δ.
Therefore, there is a partition A_1,A_2,… of V(G) of size at most
O((n/δ)^d) such that any two vertices u,v in the same part satisfy |N(u)△N(v)| ≤ 2δ, by the triangle inequality. This proves the claim.
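The greedy construction in this claim can be sketched in code. The following is only an illustration, not part of the notes: the graph is assumed to be given as a dict adj mapping each vertex to its set of neighbours, and delta plays the role of δ.

    def packing_partition(adj, delta):
        # Greedily pick centers whose neighbourhoods are pairwise delta-separated;
        # Haussler's packing theorem bounds the number of centers by O((n/delta)^d)
        # when the shatter function of the neighbourhood set system is O(m^d).
        centers = []
        for v in adj:
            if all(len(adj[v] ^ adj[c]) >= delta for c in centers):
                centers.append(v)
        # By maximality, every vertex is within symmetric difference < delta of some
        # center; grouping vertices by a closest center gives the partition Q, and two
        # vertices in the same part differ on at most 2*delta neighbours.
        parts = {c: [] for c in centers}
        for v in adj:
            closest = min(centers, key=lambda c: len(adj[v] ^ adj[c]))
            parts[closest].append(v)
        return list(parts.values())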
Let K := 16|Q|/ε.
Let P be a partition of V(G) into K parts of size n/K
such that each part of P, apart from at most |Q| parts, is contained in some part of Q. This can be achieved by greedily chopping each part of Q into parts of size n/K, and rearranging the leftovers into parts of size n/K.
Let R denote those parts of P that are not contained in any part of Q, so that |R| ≤ |Q|.
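The chop-and-rearrange step can be sketched similarly; again this is only an illustration and not the notes' own construction. Here Q is a list of lists of vertices and size plays the role of n/K.

    def equalize(Q, size):
        # Chop each part of Q into chunks of the given size; leftover vertices
        # (fewer than `size` of them per part of Q) are collected and regrouped.
        P, leftovers = [], []
        for part in Q:
            full = len(part) - len(part) % size
            P.extend(part[i:i + size] for i in range(0, full, size))
            leftovers.extend(part[full:])
        # The regrouped leftovers are the parts not contained in a single part of Q
        # (the family R); there are at most |Q| of them, and only the last one may
        # have size different from `size`.
        R = [leftovers[i:i + size] for i in range(0, len(leftovers), size)]
        return P + R, R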
Let I = {(A,B) : A,B∈P−R, (A,B) is not ε-homogeneous}.
We will prove that |I| ≤ εK²/3.
This will prove the theorem,
since:
the number of pairs (A,B)∈P² such that (A,B) is not ε-homogeneous is at most |I| + 2|R||P| ≤ εK²/3 + 2|Q|·K = εK²/3 + εK²/8 ≤ εK².
the number of parts of P is K = 16|Q|/ε ≤ O((n/δ)^d)/ε = O((16/ε²)^d)/ε ≤ O(1/ε^{2d+1}).
It remains to prove the bound on |I|.
Let T=⋃{TAB:(A,B)∈I}.
Recall that if v,v′ are in the same part of P−R, then
|N(v)△N(v′)| ≤ 2δ.
Moreover, there are at most K such parts, each of size n/K.
It follows that |T| ≤ 2δ·(n/K)²·K = ε²n³/(8K).
On the other hand, by the lemma we proved earlier,
|T_AB| + |T_BA| ≥ ε(1−ε)·(n/K)³
for every (A,B)∈I. Since (A,B)∈I if and only if (B,A)∈I, summing over I gives
ε²n³/(8K) ≥ |T| ≥ ∑_{(A,B)∈I} |T_AB| ≥ (|I|/2)·ε(1−ε)·(n/K)³.
Rearranging, and using ε<1/4 (so that 1−ε>3/4),
we get |I| ≤ εK²/(4(1−ε)) ≤ εK²/3.
□
Stable regularity
A ladder in a graph G of length k
consists of two sequences a1,…,ak and b1,…,bk
of vertices of G, such that
a_i b_j ∈ E(G) ⟺ i ≤ j.
Note that we pose no requirements on the adjacencies between ai and aj, nor between bi and bj.
The ladder index of a graph G is the maximal length of a ladder in G.
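To make the definition concrete, here is a small check (not from the notes) of whether two sequences of vertices form a ladder; the graph is again given as a dict of neighbour sets.

    def is_ladder(adj, a, b):
        # a and b are lists of vertices of the same length k; they form a ladder
        # iff a[i] and b[j] are adjacent exactly when i <= j.
        k = len(a)
        return all((b[j] in adj[a[i]]) == (i <= j)
                   for i in range(k) for j in range(k))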
Our goal is to prove that graphs G with ladder index bounded by a fixed constant ℓ
enjoy a particularly well-behaved regularity lemma.
Theorem (Malliaris, Shelah, 2011)
For every ε>0 and ℓ∈N there is a number N=N(ε,ℓ)
such that every sufficiently large graph G of ladder index at most ℓ
has a partition of its vertex set into at most N
parts of equal size (±1), such that:
there are no irregular pairs,
every pair is ε-regular with
density ≤ε or ≥(1−ε)
the number of parts of the partition is at most polynomial in 1/ε (the degree of the polynomial depends on ℓ).
We first develop a key tool for working with graphs of bounded ladder index.
Branching index
A tree configuration of depth d in G
is a complete binary tree T of depth d
with inner nodes labeled by distinct vertices of G
and leaves labeled by distinct vertices of G,
with the following property.
For any inner node u and its descendant leaf v,
uv∈E(G)⟺v is a right descendant of u.
Here, when writing uv∈E(G) we implicitly refer to the corresponding vertices of G.
Therefore, u is adjacent (in G) to all its right descendant leaves, and non-adjacent to all its left descendant leaves.
The largest depth d of a tree configuration in G is the branching index of G.
The branching index and the ladder index can be bounded in terms of each other.
Lemma If G contains a ladder of length ℓ then
G has branching index at least log2(ℓ).
Proof
Suppose we have a ladder a_1,…,a_ℓ, b_1,…,b_ℓ.
Construct a balanced binary tree whose leaves, from left to right,
are labeled with b_1,…,b_ℓ,
and where every inner node u is labelled by a_i,
where b_i is the left-most leaf that is a descendant of the right
child of u. This is a tree configuration of depth at least log₂(ℓ).
□
Proposition If G has a tree configuration of depth 2^{k+1}−2
then G contains a ladder of length k.
A set A of inner nodes of T contains a tree configuration of depth a
if there is a subset A′⊆A
such that A′, ordered by the ancestor relation,
is isomorphic to the set of inner nodes of a complete binary tree of depth a.
Note that if this is the case, then one obtains a tree configuration of depth a whose inner nodes are the elements of A′ and whose leaves can be chosen among the leaves of T.
Lemma
If T is a tree configuration of depth a+b
and the inner nodes of T are partitioned into two sets A⊎B,
then either A contains a tree configuration of depth a,
or B contains a tree configuration of depth b.
Proof of the Proposition
For ℓ = 0,…,k, a configuration C_ℓ consists of vertices a_1,…,a_ℓ and b_1,…,b_ℓ of G,
a tree configuration T of depth 2^{k+1−ℓ}−2,
and a position 0≤q≤ℓ,
such that:
a_1,…,a_ℓ, b_1,…,b_ℓ form a ladder of length ℓ,
every inner node b of T is adjacent in G to all vertices a_1,…,a_{q−1} and non-adjacent to all vertices a_q,…,a_ℓ,
every leaf a of T is non-adjacent in G to all vertices b_1,…,b_{q−1} and adjacent to all vertices b_q,…,b_ℓ.
We prove that there is a configuration Cℓ,
by induction on ℓ=0,1,…,k.
We have a configuration C0 by assumption.
The existence of a configuration Ck yields the conclusion of the proposition.
Suppose we have a configuration Cℓ as above;
we prove that then we have a configuration Cℓ+1.
Let b be the root of T, and b′ be its right child.
Fix a leaf a of T which is a right descendant of b′.
Let T0 denote the subtree of T rooted at the left child of b′.
Let N denote
the set of inner nodes of T0 that are neighbors of a in G,
and let N′ denote the set of inner nodes of T0 that are non-neighbors of a in G.
By the previous lemma, either N or N′ contains a tree configuration of depth 2^{k−ℓ}−2.
This is because
2·(2^{k−ℓ}−2) = 2^{k−ℓ+1}−4 = depth(T)−2 = depth(T0).
Suppose first that N contains a tree configuration of depth 2^{k−ℓ}−2.
By the note above, we can then find a tree configuration T′ of depth 2^{k−ℓ}−2
whose inner nodes belong to N and whose leaves are leaves of T0.
We obtain a configuration C_{ℓ+1} by inserting a into the first sequence and b′ into the second sequence at position q, replacing T by T′, and increasing the position to q+1:
the inner nodes of T′ lie in N, so they are adjacent to a, and, being inner nodes of T, they behave as required towards a_1,…,a_ℓ;
the leaves of T′ are leaves of T0, hence left descendants of b′, so they are non-adjacent to b′ and, being leaves of T, they behave as required towards b_1,…,b_ℓ;
finally ab′∈E(G), a is a leaf of T and b′ is an inner node of T, so the extended sequences still form a ladder.
Otherwise, by the previous lemma, the set N′
of inner nodes of T0 that are non-neighbors of a
contains a tree configuration T′ of depth 2^{k−ℓ}−2, again with leaves among the leaves of T0.
In this case we obtain a configuration C_{ℓ+1} by inserting a and b at position q, replacing T by T′, and keeping the position q; the verification is analogous, using that ab∈E(G), that the leaves of T0 are right descendants of b and hence adjacent to b, and that the inner nodes of T′ are non-adjacent to a.
In both cases we obtain a configuration C_{ℓ+1}, completing the induction. □
We now introduce the notions of good and excellent sets, used in the proof of the theorem of Malliaris and Shelah. Fix 0 < ε < 1/2.
A set A of vertices of a graph G is ε-good
if for every vertex v of G,
either ∣N(v)∩A∣≤ε∣A∣,
or ∣N(v)∩A∣≥(1−ε)∣A∣.
Thus, every vertex v has an `opinion' about A,
defined as opinion(v,A) := 0 in the first case and opinion(v,A) := 1 in the second case.
For example, every singleton set is clearly ε-good,
but it is not obvious that large ε-good sets exist.
We say that a set A is ε-excellent
if for every ε-good set B,
we have that all but at most ε∣A∣ vertices of A
have the same opinion about B.
That is, there is a set XA,B⊂A with ∣XA,B∣≤ε∣A∣
such that opinion(v,B) is equal to some constant c∈{0,1}, for all v∈A−XA,B.
We then define opinion(A,B)∈{0,1} as c.
Note that every ε-excellent set is ε-good
(take B to be singletons).
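These definitions involve nested quantifiers, so a brute-force sketch may help fix them; it is not from the notes and is only sensible for very small graphs. The graph is a dict of neighbour sets, sets of vertices are Python sets, and, since quantifying over all ε-good sets B is infeasible, is_excellent only checks A against a supplied family of candidate sets.

    def opinion(adj, v, A, eps):
        # 0 if v sees at most an eps-fraction of A, 1 if at least a (1-eps)-fraction,
        # None otherwise (the last case cannot occur when A is eps-good).
        deg = len(adj[v] & A)
        if deg <= eps * len(A):
            return 0
        if deg >= (1 - eps) * len(A):
            return 1
        return None

    def is_good(adj, A, eps):
        # A is eps-good if every vertex of the graph has an opinion about A.
        return all(opinion(adj, v, A, eps) is not None for v in adj)

    def is_excellent(adj, A, eps, candidates):
        # A is eps-excellent if, for every eps-good B (here: among the candidates),
        # all but at most eps*|A| vertices of A share the same opinion about B.
        for B in candidates:
            if not is_good(adj, B, eps):
                continue
            ones = sum(opinion(adj, v, B, eps) == 1 for v in A)
            if min(ones, len(A) - ones) > eps * len(A):
                return False
        return True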
A stronger version of the Malliaris-Shelah theorem is the following.
Theorem (Malliaris, Shelah, 2011)
For every ε>0 and ℓ∈N there is a number N=N(ε,ℓ)
such that every sufficiently large graph G of ladder index at most ℓ
has a partition of its vertex set into at most N
parts of equal size (±1), such that:
every part is ε-excellent,
the number of parts of the partition is at most polynomial in 1/ε (the degree of the polynomial depends on ℓ).
This statement implies the previous one,
as we have the following lemma.
Lemma
For every 0<ε<1/2, setting δ=ε/2,
every pair (A,B) of δ-excellent sets
is ε-regular, with edge density ≤ε
or ≥1−ε.
Now, the crux of the proof of the theorem is the following lemma, stating that large ε-excellent sets can be found inside any set: every set A contains an ε-excellent subset A′ whose size is at least a constant fraction of |A| (the constant depending on ε and d).
Lemma
Suppose d is the maximal depth of a tree configuration in G,
and ε < 1/2^d. For every A⊂V(G)
there exists an ε-excellent subset A′⊂A with |A′| ≥ ε^{d−1}·|A|.
Proof
Level by level, for each node v of the complete binary tree of depth d+1, we label v by two sets, A_v and B_v,
where:
at the root r, A_r = A,
B_v is an ε-good set witnessing that A_v is not ε-excellent,
at any node v with children v0, v1, the partition A_{v0} ⊎ A_{v1} of A_v
is given by the opinions of its elements about B_v.
Note that |A_{vi}| ≥ ε|A_v| by construction, since B_v witnesses that A_v is not ε-excellent.
Therefore, |A_v| ≥ |A|·ε^s for nodes v at distance s from the root.
(If at some node v no witness B_v exists, then A_v is ε-excellent and we may take A′ = A_v.)
If we can find such a labelling up to the last level,
we get a contradiction as follows.
First, label each leaf v by any element of A_v.
Now consider an inner node u and a leaf descendant v of u. Since B_u is ε-good, the label of v has an opinion about B_u, and this opinion is determined by which child of u the leaf v lies below. Let X_{uv}⊂B_u denote the exceptions to that opinion, so that |X_{uv}| ≤ ε|B_u|. Let X_u = ⋃_v X_{uv} be the union over all leaf descendants v of u. Then |X_u| < |B_u|, because ε < 1/2^d. Label u by any element of B_u−X_u; this element is adjacent to the label of a leaf descendant v if and only if the opinion of that label about B_u is 1.
Taking, at each inner node, the opinion-1 side to be the right child, we obtain a tree configuration of depth d+1, a contradiction. □
The rest of the proof of the regularity theorem of Malliaris and Shelah
roughly proceeds by starting with R=V(G) and repeatedly (for i=1,2,…) finding a maximal-size subset A_i of R that is ε-excellent, then replacing R by R−A_i, until R becomes sufficiently small.
We then add R to A_1, which does not spoil its ε-excellence,
since A_1 is very large relative to R.
There are some additional difficulties that need to be overcome in order to get a partition into parts of equal size; see the paper of Malliaris and Shelah for more details.
Permutation classes
Sources: articles by Wojciech Przybyszewski,
part 1
and part 2.