Disjoint Sets

(lecture notes by Zoran Budimlić, 2002-02-25; updated for comp314 2008 by dwallach/dsandler)

Introduction

Sometimes we want to partition a group of N distinct objects into a collection of disjoint sets. Important operations on disjoint sets are finding which set a given element belongs to, and creating a union of two sets (why is intersection not very important for disjoint sets?).

A set will be identified by an object representative. That’s one of the objects that belongs to the set. Sometimes can be arbitrary, sometimes it has to be a certain (i.e. smallest or biggest) object from the set. The only condition is that it has to be the same object every time we ask about it on the given set.

Important operations:

MAKE-SET(x)
make a set out of object x. x shouldn’t be in any other set. X will become representative of the newly created set.
UNION(x,y)
make a set that is a union of sets to which x and y belong. After union, sets of x and y are destroyed.
FIND-SET(x)
find the representative of the set to which x belongs.

An example of disjoint set application

A very simple algorithm to find connected components on an undirected graph. We will discuss finding strongly connected components on a directed graph, a somewhat harder problem, next week.

After we are done constructing the connected components, testing whether two elements are in the same component involves just finding their coresponding representatives.

	CONNECTED-COMPONENTS(G)
	1.    for each vertex v in V
	2.        MAKE-SET(v)
	3.    for each edge (u,v) in E
	4.        if FIND-SET(u) != FIND-SET(v)
	5.            UNION(u,v)

	SAME-COMPONENT(u,v)
	1.    return (FIND-SET(u) == FIND-SET(v))

Linked-list implementation

A very simple disjoint set implementation can be done using linked lists.

1st approach: Just make a list of all the elements. The last element is the representative. MAKE-SET takes O(1) time, so does the UNION. FIND-SET takes O(n) amortized time.

2nd approach: Make a list of all the elements. The first element is the representative, every element has a pointer back to the representative. MAKE-SET takes O(1), so does the FIND-SET. UNION now takes O(n) amortized time because we have to update the representative pointers in the list we are appending, and because we can come up with a scenario where we always append a longer list onto a short one.

3rd approach: Same as the 2nd approach, but we keep a length of the list with the list, and always append a shorter list to a longer one. That means that we need to update, at most, half of the pointers on each UNION. MAKE-SET and FIND-SET are still O(1), but UNION is more complicated. Rather than examining any one UNION operation, we need to look at the total cost of n of them, which turns out to be O(n log n). Thus, the amortized cost of a single UNION operation is O(log n).

Proof Sketch: Look at how many times an object's representative pointer can be updated. If an object is in a list of length k and ends up being updated, then the resulting list must be at least 2 k in size. How many times can you double the length of the list before you have all n elements in it? Log (base 2).

Disjoint-set forests

If we represent sets as trees where the root is the representative and every node points to its parent while the root points to itself, and apply a couple of heuristics, we can get asymptotically faster disjoint-set implementation.

MAKE-SET is simple. It just creates a tree with one node. FIND-SET starts at the element and traverses the tree upwards until it gets to the root, which it returns as the representative. UNION just links the root of one tree to the other one.

Union by rank.

When performing UNION, if we make the root of the larger tree be the root of the resulting tree (similar to the 3rd approach for linked-list implementation), we can save some time. We won’t actually compute the size of the tree (as in the number of elements), but the rank, which is an upper bound on the height of the tree. In union-by-rank, the tree with a smaller rank is made to point to the root of the tree with a higher rank.

Path compression

When doing FIND-SET, we can speed-up any subsequent queries on the find path by compressing the path: we make each node on the path to the root point directly to the root. Any subsequent queries on any of the nodes on the path will be done in just one step.

	MAKE-SET(x)
	1.    p[x] = x
	2.    rank[x] = 0

	UNION(x,y)
	1.    LINK(FIND-SET(x), FIND-SET(y))

	LINK(x,y)
	1.    if (rank[x] > rank[y])
	2.       p[y] = x
	3.    else 
	4.       p[x] = y
	5.       if (rank[x]==rank[y])
	6.          rank[y] = rank[y]+1

	FIND-SET(x)
	1.    if (x != p[x])
	2.       p[x] = FIND-SET(p[x])
	3.    return p[x]

When we use either union by rank or path compression, the running time is improved. When we use both heuristics, the running time is further improved to O(mα(n)) where m is the total number of operations, and n is the number of MAKE-SET operations. α(n) is a very slowly growing function, which has a value of at most 4 for any conceivable value of n.