Credits: Princeton’s Algorithms, Part I
Union Find is a classic problem in computer science, which entails forming groups of nodes and determining if two nodes are in the same group.
Formally stated, you have n nodes (or sites).
union(n1, n2)
creates a connection between nodes
n1
and n2
. connected(n1, n2)
determines whether the two nodes have any path between them. For
example, running union(n1, n2)
and
union(n2, n3
causes connected(n1, n3)
to
return true
.
This means that the relationship of being “connected” is actually an equivalence relation and has the following properties:
We also observe that this insight allows us to simplify our problem.
Instead of thinking of specific connections between nodes, we instead
just need to pay attention to connected components of
nodes. Any union(n1, n2)
operation simply joins
the that connected components n1
and n2
are
members of into a larger connected component.
connected(n1, n2)
just checks if n1
and
n2
are in the same connected component. Our problem
description never specified that we need to find the actual path between
two nodes! Just that we need to know if such a path exists.
As a heuristic, simplifying the problem and discarding unnecessary information leads to better algorithmic results.
A more general term for this problem is the dynamic connectivity problem.
For convenience, we will label our n nodes 0 to n-1. This is so their key can be used as an index to a basic array. For other key data types, hash maps or whatever else can be used instead.
This approach is considered to be an eager approach for reasons we will see later.
We first initialize an array id[]
of size n,
such that each value is its own index.
int[] id = new int[n];
for (int i = 0; i < n; i++) {
[i] = i;
id}
When we call union(n1, n2)
find all the indexes with the
same value as n2
and set them to the value at
n1
.
public void union(int p, int q) {
int pid = id[p];
int qid = id[q];
for (int i = 0; i < id.length; i++) {
if (id[i] = pid) id[i] = qid;
}
}
public boolean connected(int p, int q) {
return id[p] == id[q];
}
This approach is considered eager because it does the work
upfront when union
is executed.
connected
, which is an operation that typically happens
after union
performs far less “work”
comparatively.
However, we don’t actually know when/if a connected
operation will be performed ahead of time, so the extra work that
union
is doing could be going to waste.
For raw algorithmic performance, lazy approaches tend to outperform eager approaches.
Also, this implementation has a worst case runtime of O(n)
for the union
operation, which is not efficient
enough to scale.
Big-Oh isn’t everything, emperical testing, yada yada. Point is, we can make this faster.
With this insight, let’s try to shift some of the work over to the
connected
operation to create a lazy approach.
We initialize our id[]
array in the same way we did for
Quick-find.
int[] id = new int[n]
for (int i = 0; i < n; i++) {
[i] = i;
id}
However, rather than the id[n]
denoting the connected
component that n
belongs to, id[n]
now points
to the parent of n
. If
id[n] == n
, n
is considered to be the root of
that connected component. All members of a connected component
share a root. Now, to join two connected components, we only
need to modify the value of one of the root.
private int root(int i) {
while (i != id[i]) i = id[i];
return i;
}
public void union(int p, int q) {
int i = root(p);
int j = root(q);
// set the root of i to j. implicitly sets the connected component to be "under" j.
[i] = j
id}
public boolean connected(int p, int q) {
return root(p) == root(q);
}
This is better, but still not good enough. In most cases, the
root
operation runs in O(log n) time. But in the
worst case, the tree can become “tall and skinny”, so the real worst
case runtime is O(n).
This runtime is actually worse overall than the Quick-find
implementation as both the union
and connected
methods make use of the root
method, so now both of those
operations have a worst case runtime of O(n).
BUT, this approach leaves more room for improvement.
Before, we arbitrary set the the new of the connected component. We
can instead keep track of the size of each subtree and make sure that
the larger subtree becomes the new root. This actually
forces the worst case run time of root
down to O(log n). The proof and implementation is left as an
exercise for the reader.
When we traverse up the tree via the root
method, we
might as well compress the path of each node for future use. A simple
way to do this is set the parent of each node we touch to it’s
grandparent. This halves the length of the path from each node to it’s
root.
private int root(int i) {
while (i != id[i]) {
[i] = id[id[i]];
id= id[i];
i }
return i;
}
An even more powerful path compression technique is to directly set the parent each node we traverse to the root of its connected component. However, in practice, this optimization doesn’t actually do much better compared to the basic path compression.
Side note: the more powerful path compression is actually a form of dynamic programming!
private int root(int p) {
if (id[p] == p) return p;
int r = root(id[p]);
[p] = r;
idreturn r;
}