Invariants

This lecture covers an algorithmic concept you are already familiar with: invariants. What I will teach you is how to maintain a number of invariants that provide you with useful data structures. But I cannot teach you all the useful invariants... so throughout the class keep an eye on the larger picture: how to define your own invariants, and when they might be useful.
So what is an invariant? A property or feature that is "CONSTANT, UNCHANGING; specifically : unchanged by specified mathematical or physical operations or transformations" (Merriam-Webster). What we are going to talk about are additional invariants you can add to a data structure that make it somehow 'better.' Well, what is better?

SORTED LISTS:

Let us start with the simplest one: What is the cost of looking up the smallest element in a list?

Right, order n. What's the easiest improvement? Keep the list sorted. Wow, look, we have just improved efficiency! We went from O(n) to constant-time lookups! Hey everyone, get rid of your normal lists, we're going to use sorted ones from now on! What's the problem?

Because it makes insertions O(n)! Invariants take work to maintain. In this case, the trade off is quite simple. Do we want fast insertions, or fast minimum element lookup? If we look at the smallest element a lot, and insert rarely, it is worthwhile to keep our lists sorted.
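
To make the trade-off concrete, here is a minimal sketch of a list that maintains the sorted invariant. The helper names insert_sorted and smallest are made up for illustration; bisect.insort is the standard-library call doing the work.

```python
import bisect

def insert_sorted(items, value):
    """Insert value while keeping items sorted.

    Finding the slot is O(log n), but shifting the later elements is O(n).
    """
    bisect.insort(items, value)

def smallest(items):
    """With the sorted invariant, the minimum is always the first element: O(1)."""
    return items[0]

numbers = []
for value in [5, 2, 9, 1, 7]:
    insert_sorted(numbers, value)
```

After the loop, numbers is sorted and smallest(numbers) just reads the first slot. All the work was paid at insertion time.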

Alright. That's a pretty boring example, because lists have a very restrictive structure. Let's look at a generalization of lists: trees. What is the common invariant for trees?


TREE ORDER:


Tree order: Everything in the left child is less than the node, everything in the right child is greater than the node. What does this gain us?

We expect that finding an arbitrary element is now O(log n): we can throw out half the nodes at every comparison. What is the cost? Well, you can have a tree that doesn't maintain this invariant. In that tree insertion is constant time: make the new element the new head. With tree order, insertions become O(log n) too. But since lookups get a lot quicker, and O(log n) is much smaller than O(n), this is almost always a good thing. What about deletions?

I have this tree:

              7
            /   \
          3       8
        /   \       \
      1       5       9
     / \     / \       \
    0   2   4   6       10

So let us go, case by case, and try to delete elements. Deleting 9 is not hard at all. What about deleting 8? Well, replace it with its only child, 9. That isn't hard either.
What if I want to delete 3? That seems a little harder, but since we have the amazing intuition of counting, we can probably think of a replacement. We look for the in-order predecessor or in-order successor: 2 and 4. Notice that each of them has at most one child, and in this case none.
Behold, our deletion algorithm.
Find the node we want to delete.
If it is a leaf, delete it.
If it has one child, delete it and then replace it with its child.
If it has two children, find its in-order successor by going right once, and then left until we cannot go left any more.
This successor can have only a right child: copy the successor over the node and delete the original successor.
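
The steps above can be sketched roughly as follows. This is one possible rendering, not the only one: the Node class and the insert, delete and in_order helpers are names invented here, and the tree is the eleven-element example from above.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(node, key):
    """Plain tree-ordered insertion; returns the (possibly new) subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node

def delete(node, key):
    """Remove key from the subtree rooted at node; returns the new subtree root."""
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    elif node.left is None:       # a leaf, or a node with only a right child
        return node.right
    elif node.right is None:      # a node with only a left child
        return node.left
    else:
        # Two children: go right once, then left as far as possible.
        successor = node.right
        while successor.left is not None:
            successor = successor.left
        node.key = successor.key                          # copy successor over the node
        node.right = delete(node.right, successor.key)    # delete the original successor
    return node

def in_order(node):
    """Keys in sorted order; handy for checking that tree order survived."""
    return [] if node is None else in_order(node.left) + [node.key] + in_order(node.right)

root = None
for key in [7, 3, 8, 1, 5, 9, 0, 2, 4, 6, 10]:
    root = insert(root, key)
root = delete(root, 3)
```

Deleting 3 copies its in-order successor, 4, into its place, and the old leaf 4 disappears from under 5.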


You've been learning about worst-case behavior... what is the worst case for a tree-ordered tree? Well, let's run the insertion algorithm on this set. We happen to have sorted the list first, but we want quick random access, so we put it in the tree. What happens...

1 2 3 4 5 6

Well, we make the first node

1

Then we insert 2, and a 3, and so on.
1 - 2 - 3 - 4 - 5 - 6

Oh noes! We have O(n) lookups! What happened to our beautiful tree! It looks like a slow, cumbersome list with a bunch of extra pointers hanging around and destroying our cache locality.
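
A small sketch makes the damage measurable. The helpers below (insert, height, build) are illustrative names, and height here counts the nodes on the longest root-to-leaf path, which is an assumption of this sketch rather than a definition from the lecture.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(node, key):
    """Tree-ordered insertion with no balancing at all."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def height(node):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

sorted_height = height(build([1, 2, 3, 4, 5, 6]))     # every node hangs off the right
shuffled_height = height(build([4, 2, 6, 1, 3, 5]))   # same keys, friendlier order
```

The sorted input gives a tree as tall as the list is long (6), while the same keys in a luckier order give height 3.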

We need an /additional/ invariant: balance. This means our tree has two invariants: it is tree-ordered, and it is balanced. What does balance mean: well, the joy of invariants is you can define them however you want. This is constructive mathematics!

The simplest invariant is height balance. The height of a tree is the length of the longest path from the root down to a leaf. In a height-balanced tree, the two children of every node differ in height by at most one. We could also consider weight-balanced trees.

Turns out that, for our purposes, height balancing is easier. So, how do we /maintain/ this invariant? Well, we can always build a balanced tree in O(n) time, so why don't we just keep track of how balanced it is, and when it gets unbalanced, rebuild it?

How quick is our rebuilding then? Well, every once in a while you'll get O(n) rebuilds... how often can this happen? This introduces the concept of amortization, which we'll go over next time Seth teaches. For now, all I want to think about is what happens in our worst case: we balance the tree, insert very /few/ extra nodes, and have to recreate it all over again. This is too slow. But, maybe, we don't have to recreate /all/ of it over again. Maybe if we only rebuild the affected areas... well, the size of the affected area can't be larger than twice the number of things we have put in, so that should work, but it is getting kind of complicated.

Rotations

Alright, so surely by now you've realized that one set of numbers can have multiple properly tree-ordered trees. Is there a relation among them? An operation we can use to balance them? Let's play...

Let's say you have this unbalanced tree:

        8
       / \
      5   9
     / \
    2   7
   / \
  0   3

And I want to make it a little more balanced... I want, namely, to decrease the height of the tree starting at 5, and increase the height of the tree starting at 9. Well, let's do a little trick and pull 5 up...

       5
     /   \
    2     8
   / \   / \
  0   3 7   9

Doesn't that look better? And, hey what if the situation were reversed:


    2
   / \
  0   5
     / \
    3   8
       / \
      7   9

Again we can pull the five up. These are called rotations. Notice that a left rotation and a right rotation are inverses of each other. In general, they will have the form

      A
     / \
    B   C
   / \
  D   E

    ^
    |
    |
    v

    B
   / \
  D   A
     / \
    E   C

From top to bottom is a rotate right (the left child B is pulled up); from bottom to top is a rotate left.
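
Here is a minimal sketch of the two rotations on the A/B/C/D/E picture. It assumes the convention used in the rebalancing rule below, where a right rotation pulls the left child up to cure a left side that is too high. The Node class and function names are invented for this example.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(a):
    """Pull a's left child B up; A becomes B's right child and adopts E."""
    b = a.left
    a.left = b.right    # E moves from under B to under A
    b.right = a
    return b            # B is the new root of this subtree

def rotate_left(b):
    """Inverse of rotate_right: pull b's right child A up."""
    a = b.right
    b.right = a.left    # E moves back under B
    a.left = b
    return a            # A is the new root of this subtree

def in_order(node):
    return [] if node is None else in_order(node.left) + [node.key] + in_order(node.right)

# The unbalanced example from above: 8 at the root, with 5 and 9 below it.
tree = Node(8, Node(5, Node(2, Node(0), Node(3)), Node(7)), Node(9))
rotated = rotate_right(tree)
```

After the rotation 5 is the root, 8 is its right child, and 7 has moved over to become 8's left child. The in-order sequence is unchanged, which is exactly why rotations are safe to use for rebalancing.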


Congratulations, you have a data structure: AVL trees.

Insert and delete as normal. As you go up, if there is a height imbalance of size two, rotate. If the left tree is too high, then rotate right. If the right tree is too high, then rotate left. Hey, there, does that really work? Can anyone spot the hard case, why this might not work?

Well, sure, if D is the reason the tree at B is too high, then the above works.
But what if E is the reason B is so high... then it seems that we cannot fix
things! Because the path B-A-E is just as high! OH NOES! What can we do!

Well, let's look at an example

    5
   /
  3
   \
    4

This is not height-balanced: the left tree of 5 has depth 2, the right depth 0. So we want to rotate right, and we get

  3
   \
    5
   /
  4

Hmm. That doesn't seem very good.


Hey, don't we know a way to shift balance around? Let's rotate B LEFT, which moves the extra height from the inside (E's side) to the outside (D's side), and then rotate A right. This does work, and is called a double rotation: 'left right' here, or 'right left' in the mirror-image case. Now this is the real joy of theory: I get to tell you how something works, and all the complications of implementing it are left up to your imagination.

    5
   /
  3
   \
    4

    |
    V

      5
     /
    4
   /
  3

    |
    V

    4
   / \
  3   5

Notice we need to rotate at most twice every step, so we do no more work balancing than we did inserting! So our cost of this extra invariant, balancing, was no more than the cost of our first invariant, being tree-ordered. So in some sense we get this for free.
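
Since the implementation was left to your imagination, here is one way it might look for insertion only. Treat it as a sketch under stated assumptions: each node caches its height, a single node has height 1, and the names (update, rebalance, insert) are made up here.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None
        self.height = 1   # assumption of this sketch: a single node has height 1

def height(node):
    return node.height if node else 0

def update(node):
    node.height = 1 + max(height(node.left), height(node.right))

def rotate_right(a):
    b = a.left
    a.left, b.right = b.right, a
    update(a)
    update(b)
    return b

def rotate_left(b):
    a = b.right
    b.right, a.left = a.left, b
    update(b)
    update(a)
    return a

def rebalance(node):
    """Restore height balance at node, assuming both subtrees are already balanced."""
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                                   # left side too high
        if height(node.left.left) < height(node.left.right):
            node.left = rotate_left(node.left)        # hard case: rotate the child first
        return rotate_right(node)
    if balance < -1:                                  # right side too high
        if height(node.right.right) < height(node.right.left):
            node.right = rotate_right(node.right)     # mirror-image hard case
        return rotate_left(node)
    return node

def insert(node, key):
    """Insert as normal, then rebalance every node on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return rebalance(node)

root = None
for key in [5, 3, 4]:       # the hard case from the example above
    root = insert(root, key)
```

Inserting 5, 3, 4 triggers the double rotation and leaves 4 at the root with 3 and 5 as its children.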

The AVL tree is named after its two inventors, G.M. Adelson-Velsky and E.M.  Landis, who published it in their 1962 paper "An algorithm for the organization of information."

HEAP ORDER:
Now I mentioned that tree order is an invariant... this should immediately make you wonder if there are other interesting invariants. Remember sorted order for lists: the smallest element is at the head. Let's consider a similar order for trees, where the smallest element is at the root. Heap order: every child is larger than its parent.

With heap order, we have O(1) smallest element lookups, O(log n) insertions and deletions, but O(n) search time! The invariants we want to maintain depend on how we want to use the data structure. Those of you who want to use a hashtable for everything, take note... there is no magic bullet.
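
As a quick illustration of those costs, Python's standard heapq module keeps a plain list in heap order. The values here are arbitrary.

```python
import heapq

heap = []
for value in [7, 3, 8, 1, 5]:
    heapq.heappush(heap, value)   # O(log n): the new value sifts up toward the root

smallest = heap[0]                # O(1): the root is always the minimum
found = 8 in heap                 # O(n): heap order says nothing about where 8 lives
removed = heapq.heappop(heap)     # O(log n): remove the minimum, then restore heap order
```

The membership test has to scan the whole list, which is the O(n) search cost mentioned above.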

TREAPS:
So, I'm sure you are all overjoyed to learn about heap order. I mean, who doesn't love math for the sake of math? But I suppose you'll actually be looking for an application. Well, AVL trees turn out to be kind of slow and annoying, and as I mentioned a pain to implement. You have to perform quite a few rotations, in a complicated scheme. What if we could ensure we were usually balanced? Unfortunately, worst-cases do happen. But maybe I can mix up the tree in some kind of random order....

Alright, each time I insert something, I'm going to give it a random priority between zero and one.  I will keep those numbers in heap-order, and the original elements in tree order.

So suppose I take that sorted list, 1 through six, and add them in. I might get the following treap... get it, it's a tree, and a heap, so it's a treap...

key        1    2    3    4    5    6    7    8    9   10
priority  .27  .78  .57  .95  .53  .92  .63  .98  .85  .67

    1
   .27
      \
        5
       .53
      /    \
    3        6
   .57      .92
   /  \
  2    4
 .78  .95

Now let us try to insert 7: pretty simple to maintain tree order, let's just put that sucker in there:

    1
   .27
      \
        5
       .53
      /    \
    3        6
   .57      .92
   /  \        \
  2    4        7
 .78  .95      .63

Well, hey, that's not in heap order. So how do we move it up... well, obviously we have to make seven a parent of 6. And we have an operation that swaps parents:  Rotations! Yes, treaps will also use rotations:
   
    1
   .27
      \
        5
       .53
      /    \
    3        7
   .57      .63
   /  \     /
  2    4   6
 .78  .95 .92

But, I hear you cry, Seth! That makes the tree LESS BALANCED! Well, it does, but look at what we insert next... eight, nine, and ten. Eight fits in with no rotation at all, and with the priorities above nine and ten each need only a single rotation.

These are called treaps, and once again by combining two invariants we have obtained a much more efficient data structure.
"Invented by Cecilia R. Aragon and Raimund G. Seidel in 1989, though the authors credit Jean Vuillemin with studying essentially the same data structure in 1980."

This doesn't have the same worst-case behavior: we COULD get very unlucky with our random number generator. But that's unlikely, and because we have a random number generator, we know /how/ unlikely it is: this is called randomized analysis, and we won't be covering it in this course. It turns out, however, that with treaps your expected balance is just as good as AVL trees, and you expect, on average, fewer than TWO rotations per insertion. Regardless of how many items are in your treap! That's kind of nifty.
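
Treap insertion is short enough to sketch in full. The names below are invented for illustration, and the optional priority argument exists only so the example can replay the priorities from the table above instead of drawing random ones.

```python
import random

class Node:
    def __init__(self, key, priority):
        self.key, self.priority = key, priority
        self.left, self.right = None, None

def rotate_right(a):
    b = a.left
    a.left, b.right = b.right, a
    return b

def rotate_left(b):
    a = b.right
    b.right, a.left = a.left, b
    return a

def insert(node, key, priority=None):
    """Insert by tree order on key, then rotate up while heap order is violated."""
    if node is None:
        return Node(key, random.random() if priority is None else priority)
    if key < node.key:
        node.left = insert(node.left, key, priority)
        if node.left.priority < node.priority:     # child beats parent: pull it up
            node = rotate_right(node)
    else:
        node.right = insert(node.right, key, priority)
        if node.right.priority < node.priority:
            node = rotate_left(node)
    return node

# Replay keys 1 through 7 with the priorities from the table above.
priorities = {1: .27, 2: .78, 3: .57, 4: .95, 5: .53, 6: .92, 7: .63}
root = None
for key, priority in priorities.items():
    root = insert(root, key, priority)
```

With these priorities the result has 1 at the root, 5 as its right child, 3 and 7 below 5, and 6 as the left child of 7. That matches the picture after the rotation.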


B-TREES

So. I am a functional programmer. I love lists. I love trees. But I will tell you here and now that lists and trees are very, very, slow. How is this possible, they have log(n) lookup and delete and insertions, how can that be slow? In fact, for sizeable chunks of data searching in linear time through an array can be faster than going through a tree! What is going on here?

Disk access is slow! Very slow, thousands and thousands and thousands of times slower than memory access, which is hundreds of times slower than register access. We have lost locality with our trees: we create things, rotate them, move them to their in-order successor's place, shuffle them all around. This is horrible, and, I posit, unnecessary. We have been growing our trees by making them deeper... but how else might we insert elements into a tree?

We have been using binary trees. But why binary? Why bring in /one/ element from memory at a time? What a horrible idea. Imagine that our block size, b, is a thousand and twenty-four elements. Let's bring in five hundred and eleven at a time. How many children might there be? Well, any value might be between any two other values, or smaller, or greater, so there are five hundred and twelve. Hey, this is a nice size. Five hundred and eleven elements, five hundred and twelve pointers... just about our block size with one left over for the parent pointer.

I'm going to imagine that our block size is ten. I know, it's a boring block size, but I am only human. So a block size of ten means that we have a sweet spot when we have four elements and five pointers per node.

So let's put up a simple tree.

|2 4 6 8|

That's a nice tree. Our pointers are null at the moment, but we have room to grow. Let's insert 5.

|2 4 5 6 8|


Wait a minute. Now we are larger than a block, that's no good. So anyone have an idea of what to do?

Well, hey, we can split this, right?

   |5|
|2 4|  |6 8|

The key insight here is that we are now allowed to have high-arity nodes. Why make them all the same? Let's merge them when they get small, and split them when they get large.

What I have just shown is a 3-5 B-Tree, and it has two invariants:
1) Every node, except for the root, has between 3 and 5 pointers, and thus between 2 and 4 values to separate them.
2) The tree is perfectly balanced.

Now, that's nice and fun, but life isn't generally so easy. Let's try to do a generalized insertion algorithm.

So let's take our example tree, and insert 3. It has to go between 2 and 4. Note that the leaf still has only 3 values, and so we don't have to work to maintain our invariant. Let's insert 1, doing much the same thing, and then zero.

          |5|
|0 1 2 3 4|   |6 8|

Aha! Now we have to split again. Find the median element and promote it! But make sure you put it IN the root, right next to five! If you make it a child of 5, then you'll break the first invariant.

       |2    5|
|0 1|   |3 4|  |6 8|

Oooh, look at that. Nicely maintained. Notice what happened to the parent of the leaf we split: its size increased by one. This might make the parent too big. But we can always split the parent, and thus make our grandparent too big, and perhaps eventually we have to split the root. Notice when we split the root, we end up with a new root of size 1... the only time we might have a small node.
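
The insertion walk-through can be sketched as below. This is a simplified illustration, not a full B-tree: it only enforces the upper bound (at most MAX_KEYS values per node), and the names MAX_KEYS, Node, insert_into and insert are assumptions of this sketch.

```python
MAX_KEYS = 4   # assumed capacity: four values and five pointers, as in the example

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []    # stays empty for a leaf

def insert_into(node, key):
    """Insert key below node. Returns (median, right_node) if node split, else None."""
    index = 0
    while index < len(node.keys) and key > node.keys[index]:
        index += 1
    if not node.children:                 # leaf: place the value directly
        node.keys.insert(index, key)
    else:
        split = insert_into(node.children[index], key)
        if split:                         # the child split: absorb its median here
            median, right = split
            node.keys.insert(index, median)
            node.children.insert(index + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    middle = len(node.keys) // 2          # too big: split around the median
    median = node.keys[middle]
    right = Node(node.keys[middle + 1:], node.children[middle + 1:])
    node.keys = node.keys[:middle]
    node.children = node.children[:middle + 1]
    return median, right

def insert(root, key):
    """Insert key and return the root, which changes only when the old root splits."""
    split = insert_into(root, key)
    if split:
        median, right = split
        return Node([median], [root, right])
    return root

root = Node()
for key in [2, 4, 6, 8, 5, 3, 1, 0]:      # the sequence from the walk-through
    root = insert(root, key)
```

Replaying the walk-through ends with the root holding 2 and 5 above three leaves: 0 1, then 3 4, then 6 8. The tree only ever gets taller when the root itself splits, which is why all leaves stay at the same depth.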

For deletion, take a look at the handout with solutions.

The four cases considered are:
1) Deletion from a leaf node with no rebalancing.
2) Deletion from a leaf node when the left and right siblings are both of size 2 (one must merge, losing a separator).
3) Deletion from an internal node: find the in-order successor or predecessor!
4) Deletion from an internal node when both sides of the separator have minimal size: the recursive case!

My, this is rather complicated. Now, assuming that our tree is some fixed size, 2-4, 255-512, whatever... the cost of doing a lookup, insertion, or deletion is still O(log n).

EXTERNAL MEMORY MODEL


Now you might be saying this is a lot of work when our expected cost is STILL log(n). But recall that we, in practice, expected better locality.  This does not represent a gap between theory and practice. This is, in fact, another realm of theory. This is the external memory model, where we track the number of memory accesses, not the number of comparisons or operations. We will also not be covering that here, but for a b-2b tree, the number of memory accesses is reduced to log-base-b of n.

These B-trees, by the way, were invented by Rudolf Bayer and Ed McCreight in the early 1970's.

Now, I want to give you a final thought... I've been drawing these blocks as arrays. But insertions into arrays are slow. We have better structures, don't we, like.. say... trees? I will give you a teaser:
when your blocks are of size b, use sqrt-b arity trees to store your blocks.  This is the key insight in cache-oblivious balanced trees, which are optimal no matter HOW big your block is. But we really don't have time for that.

HUFFMAN ENCODINGS

Alright. So I have presented you with trees, and all along we've been assuming that we'll use them to hold sets. But there are other uses, other invariants. Right now I'm going to cover an important one, known as Huffman encodings. Project 2 is out, so pay attention.

First I'll present a problem. Then I'll give you a vision, a dream of a solution. Then you are going to tell me how to do this.

Alright. So we are encoding characters in ASCII. Everything has the same length. But we use 'e' a lot more than 'black box #42', so why is the cost for encoding an 'e' the same as for 'black box #42'? I offer the conceit that it is not! That we can do /better/!

So we now have the concept that we'll use more bits for some letters than for others. Can anyone see the problem with this?

Yeah... what happens when we say 'e' is '1', but 'z' is '101'? And, worse, 'f' is '01'. Now is '101' z or ef? We don't know! We need an /invariant/ over our code! We need the code to be prefix free! For any encoding 'x', there is no proper prefix of 'x' that is a valid encoding of another letter.
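
The prefix-free invariant is easy to check mechanically. Here is a small sketch; the function name is_prefix_free is made up, and the two dictionaries are the broken code from this paragraph and the five-letter code used in the example that follows.

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of a different letter's codeword."""
    words = sorted(codes.values())
    # After sorting, a word that is a prefix of any other word is immediately
    # followed by a word that starts with it, so adjacent pairs are enough.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

ambiguous = {"e": "1", "z": "101", "f": "01"}
unambiguous = {"a": "000", "b": "001", "c": "10", "d": "01", "e": "11"}
```

The first code fails the check because '1' is a prefix of '101'; the second passes.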

Let's do a simple example:

Letter        a     b     c     d     e     Sum
Probability   0.10  0.15  0.30  0.16  0.29  = 1.00
Code          000   001   10    01    11
Cost/Letter   3     3     2     2     2
Total Cost    0.30  0.45  0.60  0.32  0.58  = 2.25
Frequency     1/8   1/8   1/4   1/4   1/4   = 1.00


Notice that as we read a letter's code, we split up the possibilities with each bit. Is the first bit a 0 or 1? If it is a 0, then we know the letter is an a, b, or d. If it is a 1, we know it is a c or e. Does this seem familiar to anyone? This is a tree!

        (a|b|c|d|e)
   (a|b|d)       (c|e)
(a|b)    d      c     e
a     b

Look! EVERYTHING IS TREES! XML: trees. Codes: trees. File systems: trees. Sets, mappings, heaps....trees, trees, trees!

Alright. Excuse me. I get kind of excited about trees.

Alright. So now I have, perhaps, convinced you that a prefix-free code is a tree! Well, we want the /best/ prefix-free code... sadly, that is not unique.  However, we can still find a tree for which there is no better!  

Alright. So, let us take a, b, c, d, and e. These are our /leaf/ nodes.

Take the two lowest ones, put them together, and sum up their probabilities. Okay, now let's take the next two lowest, add them together. And then we get the smallest two, add them. And the smallest two, and, wow, look, we have that tree!

So I know you have seen these already, but I want to prove to you that these Huffman trees are optimal. And the reason I want to prove this to you is that you are going to /beat/ Huffman encodings on this project, and someone had better tell me why that is possible.

Huffman Algorithm

Given an alphabet E and a probability function p:
  1. Start with the set {(c, p(c)) | c in E}
  2. While the set contains at least two elements:
    1. Remove from the set the two elements (a,c1) and (b,c2) with the smallest c1 and c2.
    2. Add to the set the element ((a,b), c1+c2).
  3. Take the final element of the set (x,c). Construct the tree t(x) as follows:
    1. If x is a character c in E, then it is the singleton node c.
    2. If x is a pair (y,z), then recursively let t(y) and t(z) be the trees for y and z; x is then the node (y|z) with t(y) as the left child and t(z) as the right child.
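
One way to realize this algorithm is with a priority queue standing in for "the set". The sketch below uses the standard heapq module; the function name huffman_codes, the tie-breaking counter, and the choice of 0 for left and 1 for right are assumptions made here, not part of the algorithm as stated.

```python
import heapq
from itertools import count

def huffman_codes(probabilities):
    """Build a Huffman tree from {character: probability}; return {character: code}."""
    tiebreak = count()   # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), char) for char, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)      # the two least probable elements
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):            # internal node (y|z)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"        # a one-letter alphabet still needs a bit
    walk(tree, "")
    return codes

probabilities = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
codes = huffman_codes(probabilities)
cost = sum(p * len(codes[char]) for char, p in probabilities.items())
```

On the five-letter example this gives three-bit codes for a and b and two-bit codes for c, d and e, for an expected cost of 2.25 bits per letter. The exact bit patterns depend on which child is called left, so they need not match the table letter for letter.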

Proof of optimality of Huffman encoding.

Definitions

For a character x, p(x) is the probability of that character.
For a tree T and a character x, d(x,T) is the depth of x in that tree, and the length of the code of that character.
For a tree T, d(T) is the maximum depth of any character in the tree.
For a tree T and a character x, c(x,T) is p(x)*d(x,T).
For a tree T, the cost c(T) is the sum of the costs of all characters.

Proof

For any alphabet E and probability function p: the tree T produced by the Huffman algorithm is such that given any other tree T' c(T) ≤ c(T').

Note: Huffman trees are COMPLETE, in that every node has either 0 or 2 children.

Lemma: For a given alphabet E, where x and y are the two characters with the lowest probabilities p(x) and p(y), there is an optimal tree T in which d(x,T) = d(y,T) = d(T) and x and y are siblings.

Sublemma: For an optimal tree T, d(T)=d(x,T)=d(y,T)
Proof: If this is not the case, then either x or y (without loss of generality, assume x) has a depth that is not the maximum depth in the tree. Since Huffman trees are complete there must be two characters with the maximum depth in the tree. This means there is a third character, z, with larger depth than x. But then the amount that x and z contribute to the cost of the tree, d(x,T) * p(x) + d(z,T) * p(z), is greater than the amount contributed if you swap them. Thus the tree cannot be optimal.
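
The swap step can be written out explicitly. Using the definitions above, and writing d_x for d(x,T) and so on (a shorthand introduced only for this note), the saving from swapping x and z is:

```latex
\begin{aligned}
\underbrace{d_x\,p(x) + d_z\,p(z)}_{\text{before the swap}}
 \;-\; \underbrace{\bigl(d_z\,p(x) + d_x\,p(z)\bigr)}_{\text{after the swap}}
 &= (d_z - d_x)\,\bigl(p(z) - p(x)\bigr) \;\ge\; 0,
\end{aligned}
\qquad \text{since } d_z > d_x \text{ and } p(z) \ge p(x).
```

The saving is strictly positive whenever p(z) > p(x). If the two probabilities happen to tie, the swap costs nothing, so it still yields an optimal tree in which x sits at the maximum depth.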

Sublemma: For every optimal tree T, there is a corresponding optimal tree T' such that x and y are siblings (share an immediate parent).
Proof: If x and y are not siblings, that means there are more than two characters with depth d(T). Moving these characters around does not change their depth, and thus does not change the cost of the tree. Thus we can swap the sibling of x with y, and make x and y siblings.

Lemma: Given a Huffman tree T and another tree T' (for alphabet E), c(T) ≤ c(T').

Proof by induction on the size of the alphabet:
1 or 2 characters: all complete trees, and thus Huffman trees, are optimal.
n characters: Consider the two lowest-probability characters, x and z. Using the Huffman algorithm, they will be joined, in the first step, into the (x|z) tree.
Imagine the alphabet with x and z removed, and xz inserted with their combined probability. By the induction hypothesis, the Huffman algorithm will produce an optimal tree T for this alphabet.
Look at the cost of any tree T over this condensed alphabet. We will create from this tree the derivative tree T' where we replace the character xz with the characters x and z, and replace the leaf xz with the appropriate tree (x|z).
The cost of T', relative to T, is:
c(T') = c(T) - d(xz,T)p(xz) + d(x,T')p(x) + d(z,T')p(z)
c(T') = c(T) - d(xz,T)p(xz) + (d(xz,T)+1)p(x) + (d(xz,T)+1)p(z)
c(T') = c(T) - d(xz,T)(p(x)+p(z)) + (d(xz,T)+1)p(x) + (d(xz,T)+1)p(z)
c(T') = c(T) + p(x) + p(z)
The change in cost does not vary based on the structure of the tree. Thus, if T is optimal, T' is optimal.