Some applications involve grouping *n* distinct elements into a collection of disjoint sets. Two important operations are then finding which set a given element belongs to and uniting two sets. This chapter explores methods for maintaining a data structure that supports these operations.

A **disjoint-set data structure*** *maintains a collection* S *= {*S*_{1}*, S*_{2}*, . . . ,S _{k}*} of disjoint dynamic sets. Each set is identified by a

As in the other dynamic-set implementations we have studied, each element of a set is represented by an object. Letting *x* denote an object, we wish to support the following operations.

MAKE-SET(*x*) creates a new set whose only member (and thus representative) is pointed to by *x*. Since the sets are disjoint, we require that *x *not already be in a set.

FIND-SET(*x*) returns a pointer to the representative of the (unique) set containing *x*.

One of the many applications of disjoint-set data structures arises in determining the connected components of an undirected graph (see Section 5.4). Figure 22.1(a), for example, shows a graph with four connected components.

The procedure CONNECTED-COMPONENTS that follows uses the disjoint-set operations to compute the connected components of a graph. Once CONNECTED-COMPONENTS has been run as a preprocessing step, the procedure SAME-COMPONENT answers queries about whether two vertices are in the same connected component.^{1 }(The set of vertices of a graph *G *is denoted by *V*[*G*], and the set of edges is denoted by *E*[*G*].)

CONNECTED-COMPONENTS(G)

1foreach vertexvV[G]

2doMAKE-SET(v)

3foreach edge (u,v)E[G]

4doifFIND-SET(u)FIND-SET(v)

5thenUNION(u,v)

Edge processed Collection of disjoint sets

------------------------------------------------------------------------------

initial sets{a} {b} {c} {d} {e} {f} {g} {h} {i} {j}

(b,d) {a} {b,d} {c} {e} {f} {g} {h} {i} {j}

(e,g) {a} {b,d} {c} {e,g} {f} {h} {i} {j}

(a,c) {a,c} {b,d} {e,g} {f} {h} {i} {j}

(h,i) {a,c} {b,d} {e,g} {f} {h,i} {j}

(a,b) {a,b,c,d} {e,g} {f} {h,i} {j}

(e,f) {a,b,c,d} {e,f,g} {h,i} {j}

(b,c) {a,b,c,d} {e,f,g} {h,i} {j}

(b)

SAME-COMPONENT(u,v)

1ifFIND-SET(u) = FIND SET(v)

2then returnTRUE

3else returnFALSE

A simple way to implement a disjoint-set data structure is to represent each set by a linked list. The first object in each linked list serves as its set's representative. Each object in the linked list contains a set member, a pointer to the object containing the next set member, and a pointer back to the representative. Figure 22.2(a) shows two sets. Within each linked list, the objects may appear in any order (subject to our assumption that the first object in each list is the representative).

With this linked-list representation, both MAKE-SET and FIND-SET are easy, requiring *O*(1) time. To carry out MAKE-SET(*x*), we create a new linked list whose only object is *x*. For FIND-SET(*x*), we just return the pointer from *x* back to the representative.

The total time spent is therefore (*n+q*^{2}), which is(*m*^{2})* *since *n* = (*m*)* *and *q* = (*m*). Thus, on the average, each operation requires (*m*) time. That is, the amortized time of an operation is (*m*).

The above implementation of the UNION procedure requires an average of (*m*) time per call because we may be appending a longer list onto a shorter list; we must update the pointer to the representative for each member of the longer list. Suppose instead that each representative also includes the length of the list (which is easily maintained) and that we always append the smaller list onto the longer, with ties broken arbitrarily. With this simple * weighted-union heuristic*, a single UNION operation can still take (

Write pseudocode for MAKE-SET, FIND-SET, and UNION using the linked-list representation and the weighted-union heuristic. Assume that each object *x* has attributes *rep*[*x*] pointing to the representative of the set containing *x*, *last*[*x*] pointing to the last object in the linked list containing *x*, and *size*[*x*] giving the size of the set containing *x*. Your pseudocode can assume that *last*[*x*] and *size*[*x*] are correct only if *x* is a representative.

1fori1to16

2doMAKE-SET(x)_{i}

3fori1to15by2

4doUNION(x,_{i}x+1)_{i}

5fori 1to13by4

6doUNION(x,_{i}x+2)_{i}

7 UNION(x_{1},x_{5})

8 UNION(x_{11},x_{13})

9 UNION (x_{1},x_{10})

10 FIND-SET(x_{2})

11 FIND-SET(x_{9})

Argue on the basis of Theorem 22.1 that we can obtain amortized time bounds of *O*(1) for MAKE-SET and FIND-SET and *O*(1g *n*) for UNION using the linked-list representation and the weighted-union heuristic.

In a faster implementation of disjoint sets, we represent sets by rooted trees, with each node containing one member and each tree representing one set. In a * disjoint-set forest*, illustrated in Figure 22.4(a), each member points only to its parent. The root of each tree contains the representative and is its own parent. As we shall see, although the straightforward algorithms that use this representation are no faster than ones that use the linked-list representation, by introducing two heuristics--"union by rank" and "path compression"--we can achieve the asymptotically fastest disjoint-set data structure known.

We perform the three disjoint-set operations as follows. A MAKE-SET operation simply creates a tree with just one node. We perform a FIND-SET operation by chasing parent pointers until we find the root of the tree. The nodes visited on this path toward the root constitute the * find path*. A UNION operation, shown in Figure 22.4(b), causes the root of one tree to point to the root of the other.

The first heuristic, * union by rank*, is similar to the weighted-union heuristic we used with the linked-list representation. The idea is to make the root of the tree with fewer nodes point to the root of the tree with more nodes. Rather than explicitly keeping track of the size of the subtree rooted at each node, we shall use an approach that eases the analysis. For each node, we maintain a

The second heuristic, * path compression*, is also quite simple and very effective. As shown in Figure 22.5, we use it during FIND-SET operations to make each node on the find path point directly to the root. Path compression does not change any ranks.

MAKE-SET(x)

1p[x]x

2rank[x] 0

UNION(x,y)

1 LINK(FIND-SET(x), FIND-SET(y))

LINK(x,y)

1ifrank[x] >rank[y]

2thenp[y]x

3elsep[x]y

4ifrank[x] =rank[y]

5thenrank[y]rank[y] + 1

The FIND-SET procedure with path compression is quite simple.

FIND-SET(x)

1ifxp[x]

2thenp[x] FIND-SET(p[x])

3returnp[x]

The FIND-SET procedure is a * two-pass method*: it makes one pass up the find path to find the root, and it makes a second pass back down the find path to update each node so that it points directly to the root. Each call of FIND-SET(

Do Exercise 22.2-2 using a disjoint-set forest with union by rank and path compression.

Write a nonrecursive version of FIND-SET with path compression.

To understand Ackermann's function and its inverse , it helps to have a notation for repeated exponentiation. For an integer *i* 0, the expression

stands for the function *g*(*i*), defined recursively by

Recall the definition of the function 1*g** in terms of the functions lg^{(i)} defined for integer *i* 0:

The 1g* function is essentially the inverse of repeated exponentiation:

We are now ready to show Ackermann's function, which is defined for integers *i*, *j* 1 by

A(1,j) = 2for^{j}j1 ,

A(i,1) =A(i- 1,2) fori2 ,

A(i,j) =A(i- 1,A(i,j- 1)) fori,j2 .

Figure 22.6 shows the value of the function for small values of *i* and *j*.

We define the inverse of Ackermann's function by^{2}

(m,n) =min{i1:A(i,m/n) > lgn} .

In the remainder of this section, we prove an *O*(*m* 1g^{*} *n*) bound on the running time of the disjoint-set operations with union by rank and path compression. In order to prove this bound, we first prove some simple properties of ranks.

We define size(*x*) to be the number of nodes in the tree rooted at node *x*, including node *x* itself.

For all tree roots *x*, size(*x*) 2* ^{rank}*[

size'(y) = size(x) + size(y)

2[^{rank}x] + 2[^{rank}y]

2[^{rank}y]

= 2'[^{rank}y]^{.}

No ranks or sizes change for any nodes other than *y*.

If *rank*[*x*] = *rank*[*y*], node *y* is again the root of the new tree, and

size'(y) = size(x) + size(y)

2[^{rank}x] + 2[^{rank}y]

= 2[^{rank}y]+ 1

= 2'[^{rank}y]^{ . }

For any integer *r* 0, there are at most *n*/2* ^{r}* nodes of rank

Every node has rank at most 1g *n*.

We shall use the aggregate method of amortized analysis (see Section 18.1) to prove the *O*(*m* 1g^{*} *n*) time bound. In performing the amortized analysis, it is convenient to assume that we invoke the LINK operation rather than the UNION operation. That is, since the parameters of the LINK procedure are pointers to two roots, we assume that the appropriate FIND-SET operations are performed if necessary. The following lemma shows that even if we count the extra FIND-SET operations, the asymptotic running time remains unchanged.

Then, for *j* = 0, 1, . . . , lg^{*} *n* - 1, the *j*th block consists of the set of ranks

{B(j- 1) + 1,B(j- 1) + 2, . . . ,B(j)} .

For *j* = 0, this sum evaluates to

N(0) =n/2^{0}+n/2^{1}

= 3n/2

= 3n/2B(0).

Thus, *N*(*j*) 3*n*/2*B*(*j*) for all integers *j* 0.

* Proof *Immediate from Theorem 22.7 and Lemma 22.6.

For each node *x*, how many bits are necessary to store size(*x*)? How about *rank*[*x*]?

The * off-line minimum problem* asks us to maintain a dynamic set

4, 8, E, 3, E, 9, 2, 6, E, E, E, 1, 7, E, 5.

Fill in the correct values in the *extracted* array.

I_{1},E,I_{2},E,I_{3},...,I_{m},E,I_{m+1 },

OFF-LINE-MINIMUM(m,n)

1fori1ton

2dodeterminejsuch thatiKj

3ifjm+ 1

4thenextracted[j]i

5 letlbe the smallest value greater thanj

for which setKexists_{l}

6K_{l}KjK, destroying_{l}K_{j}

7returnextracted

* b.* Argue that the array

In the * depth-determination problem,* we maintain a forest

MAKE-TREE(*v*) creates a tree whose only node is *v*.

FIND-DEPTH(*v*) returns the depth of node *v* within its tree.

GRAFT(*r*, *v*) makes node *r*, which is assumed to be the root of a tree, become the child of node *v*, which is assumed to be in a different tree than *r* but may or may not itself be a root.

**b****.** Give an implementation of MAKE-TREE.

22-3 Tarjan's off-line least-common-ancestors algorithm

The * least common ancestor* of two nodes

LCA(u)

1 MAKE-SET(u)

2ancestor[FIND-SET(u)]u

3foreach childvofuinT

4doLCA(v)

5 UNION(u,v)

6ancestor[FIND-SET(u)]u

7color[u] BLACK

8foreach nodevsuch that {u,v}P

9do ifcolor[v] = BLACK

10thenprint "The least common ancestor of"

u"and"v"is"ancestor[FIND-SET(v)]

**a****.** Argue that line 10 is executed exactly once for each pair {*u*, *v*} *P.*

**c****. **Prove that LCA correctly prints the least common ancestor of *u* and *v* for each pair {*u ,v*} *P*.

Many of the important results for disjoint-set data struckltures are due at least in part to R. E. Tarjan. The upper bound of *O*(*m* (*m,n*)) was first given by Tarjan[186, 188]. The *O*(*m *1g* *n*) upper bound was proven earlier by Hopcroft and Ullman[4, 103]. Tarjan and van Leeuwen [190] discuss variants on the path-compression heuristic, including "one-pass methods," which sometimes offer better constant factors in their performance than do two-pass methods. Gabow and Tarjan [76] show that in certain applications, the disjoint-set operations can be made to run in *O*(*m*) time.