Breadth-first search on a graph

One question we might ask about a graph is how few edges we need to traverse to find a path from one vertex to another. In other words, assuming that some path exists from vertex x to vertex y, find a path from x to y that has the fewest edges.

One application is the famous Kevin Bacon game, which you will implement in Lab Assignment 3. Consider a movie actor, say Renée Adorée, a silent film actress from the 1920s. She appeared in a film with Bessie Love, who made a movie with Eli Wallach, who acted in a film with Kevin Bacon, and so we say that Renée Adorée's "Kevin Bacon number" is 3. If we were to make a graph where the vertices represent actors and we put an edge between vertices if their actors ever made a movie together, then we can use breadth-first search to find anyone's Kevin Bacon number: the minimum number of actors between a given actor and Kevin Bacon.

Mathematicians have a similar concept, called an Erdős number. Paul Erdős was a great Hungarian mathematician. His Erdős number is 0. Mathematicians who coauthored a paper with Erdős have an Erdős number of 1. Mathematicians other than Erdős who coauthored a paper with those whose Erdős number is 1 have an Erdős number of 2. In general, a mathematician who coauthored a paper with someone whose Erdős number is k and did not coauthor a paper with anyone whose Erdős number is less than k has an Erdős number of k + 1. Believe it or not, there is even such a thing as an Erdős-Bacon number, which is the sum of the Erdős and Bacon numbers. A handful of people, including Erdős himself, have finite Erdős-Bacon numbers.

I'll describe breadth-first search, which we abbreviate as BFS, for you, but I'll give you no code. That's because you'll be implementing BFS in Lab Assignment 3.

We perform BFS from a given start vertex s. We sometimes call s the root vertex. We can either designate a specific goal vertex, or we can just search to all vertices that are reachable from the start vertex (i.e., there exist paths from the start vertex to the vertices). We want to determine, for each vertex x that the search encounters, an unweighted shortest path from s to x, that is, a path from s to x with the minimum length (fewest edges). Later in the course, when we consider weighted graphs, we'll extend the notion of shortest paths to account for edge weights.

We usually record two pieces of information about each vertex x:

  1. The length of an unweighted shortest path from s to x, as the number of edges.

  2. A back-pointer from x, which is the vertex that immediately precedes x on a shortest path from s to x. For example, let's look at that graph again:

    In this graph, an unweighted shortest path from vertex D to vertex A is D, I, C, A, and so A would have a back-pointer to C, C would have a back-pointer to I, and I would have a back-pointer to D. Because vertex D is the start vertex, no vertex precedes it. We can say that its back-pointer is null (depending on the type of the back-pointer), or we can say that D is its own back-pointer, so that a vertex that is its own back-pointer must be the start vertex.

    This unweighted shortest path is not unique in this graph. Another unweighted shortest path from vertex D to vertex A is D, B, E, A. In this path, A would have a back-pointer to E, E would have a back-pointer to B, B would have a back-pointer to D, and D's back-pointer would be None.

I like to think of BFS in terms of sending waves of runners over the graph in discrete timesteps. We'll focus on undirected graphs. I'll call the start vertex s, and at first I won't designate a goal vertex.

  • Record the "distance" from the start vertex s to itself as 0, and record the back-pointer for s as null.
  • At time 0, send out runners from s by sending out one runner over each edge incident on s. (So that the number of runners sent out at time 0 equals the degree of s.)
  • Each runner arrives at a vertex adjacent to s at time 1. For each of these vertices, record its distance from s as 1 and its back-pointer as s.
  • At time 1, send out runners from each vertex whose distance from s is 1. As we did with s at time 0, send out one runner along each edge from each of these vertices.
  • Each runner will arrive at another vertex at time 2. Let's say that the runner left vertex x at time 1 and arrives at vertex y at time 2. Vertex y could have already been visited by a runner (perhaps it is s or perhaps it is another vertex that was visited at time 1). Or vertex y might not have been visited before. If vertex y had already been visited, forget about it; we already know its distance from s and its back-pointer. But if vertex y had not already been visited, record its distance from s as 2, and record its back-pointer as x (the vertex that the runner came from).
  • At time 2, send out runners from each vertex whose distance from s is 2. As before, send out one runner along each edge from each of these vertices.
  • Each runner will arrive at another vertex at time 3. If the vertex has been visited before, forget about it. Otherwise, record its distance from s as 3, and record its back-pointer as the vertex from which the runner came.
  • Continue in this way until all vertices that are reachable from the start vertex have been visited.

If there is a specific goal vertex, you can stop this process once it has been visited.

In the above graph, let's designate vertex D as the start vertex, so that D's distance is 0 and its back-pointer is null. At time 1, runners visit vertices G, B, and I, and so each of these vertices gets a distance of 1 and D as their back-pointers. At time 2, runners from G, B, and I visit vertices D, G, B, I, F, E, C, and H. D, G, B, and I have already been visited, but F, E, C, and H all get a distance of 2. F and E get B as their back-pointer, and C and H get I as their back-pointer. At time 3, runners from F, E, C, and H visit vertices B, A, and I. B and I have already been visited, but A gets a distance of 3, and its back-pointer is set arbitrarily to either of E and C (it doesn't matter which, since either E or C precedes A on an unweighted shortest path from D). Now all vertices have been visited, and the breadth-first search terminates.

To determine the vertices on a shortest path, we use the back-pointers to get the vertices on a shortest path in reverse order. (But you know how easy it is to reverse a list, and sometimes you're fine with knowing the path in reverse order, such as when you want to display it.) Let's look at the vertices on an unweighted shortest path from D to A. We know that A is reachable from D because it has a back-pointer. So A is on the path, and it is preceded by its back-pointer. Let's say that A's back-pointer is E (which I chose arbitrarily over C so that I wouldn't get a reversed path that spells "ACID"). So E is the next vertex on the reversed path. E's back-pointer is B, and so B is the next vertex on the reversed path. B's back-pointer is D, and so D is the next vertex on the reversed path. D's back-pointer is null, and so we are done constructing the reversed path: it contains, in order, vertices A, E, B, and D.

Implementing BFS

You can't implement BFS by following the above description exactly. That's because you can't send out lots of runners at exactly the same time. You can do only one thing at a time, and so you couldn't send runners out along all edges incident on vertices G, B, and I all at the same time. You need to simulate doing that by sending out one runner at a time.

Let's define the frontier of the BFS as the set of vertices that have been visited by runners but have yet to have runners leave them. If you send out runners one at a time, at any moment, the vertices in the frontier are all at either some distance i or distance i + 1 from the start vertex. For example, suppose that we have sent out runners from vertices D, G, and B so that they are no longer in the frontier. The frontier consists of vertex I, at distance 1, and vertices F and E, at distance 2.

One key to implementing BFS is to treat the frontier properly. In particular, you want to maintain it as a FIFO queue.

Here is "pseudocode" (i.e., not real Java code) for BFS:

queue = new Queue();
queue.enqueue(s);  // initialize the queue to contain only the start vertex s
distance of s = 0;
while (!queue.isEmpty()) {
  x = queue.dequeue(); {  // remove vertex x from the queue
  for (each vertex y that is adjacent to x) {
    if (y has not been visited yet) {
      y's distance = x's distance + 1;
      y's back-pointer = x;
      queue.enqueue(y);   // insert y into the queue
    }
  }
}

One property of BFS is that at any time, the distances of vertices on the queue are all either d or d + 1, for some nonnegative integer d, with all the vertices at distance d preceding those at distance d + 1. We can show this property by induction on the number of enqueue and dequeue operations. At first, we have just one vertex on the queue, and its distance is 0. Now let's consider the two operations, with dequeue first. If the property holds before dequeueing, then surely removing a vertex from the queue cannot cause a vioation. Now let's consider enqueueing. Suppose that the most recently dequeued vertex had distance d. Because we assume that the property holds, after dequeueing, the first vertex in the queue must have distance d or d + 1. Any vertex enqueued has distance d + 1 (1 greater than the distance of the dequeued vertex), and because enqueued vertices follow all others in the queue, the property holds after enqueueing.

Now we have a couple more questions to answer:

  • How do we represent the distance and back-pointer for each vertex?
  • How do we know whether a vertex has been visited?

Representing vertex information

Again, you have a couple of choices. If you create a class, say Vertex, to represent what you need to know about a vertex, it can have whatever instance variables you like. So a Vertex object could have instance variables distance, backPointer, and even visited. Of course, you could make visited unnecessary by initializing distance to Integer.MAX_VALUE and saying that the vertex has been visited only if its distance variable is not Integer.MAX_VALUE. (Of course, this idea works only if it's impossible for an actual distance to be Integer.MAX_VALUE.) In Lab Assignment 3, you do not necessarily need to keep track of distances.

This approach is nice because it encapsulates what you'd like to know about a vertex within its object. But this approach has a downside: you need to initialize the instance variables for every vertex before you commence send out runners from the start vertex. You wouldn't want to use, say, the visited value from a previous BFS, right? Ditto for the distance value.

There's another way: use a Map. If you want to store back-pointers, store them in a Map, where the key is a reference to a Vertex object and the value is a reference to the Vertex object for its immediate predecessor on an unweighted shortest path from the start vertex. You can use this same Map to record whether a vertex has been visited, because you can say that a vertex has been visited if and only if its Vertex object appears in the Map. You can use the containsKey method from the Map interface to determine whether the Map contains a reference to particular Vertex object. The beauty of using a Map is that initializing is E-Z: just make a new Map. There is one other detail you need to take care of: the start vertex. It has no predecessor, but you want to record that it has been visited initially. So you'll need to store a reference to its Vertex object in the Map, and you can just use null as the corresponding value.

There is yet another way: store the breadth-first tree as a separate directed graph. Each directed edge (u, v) is equivalent to u's back-pointer being v. Looked at another way, each directed edge is oriented on a path toward the start vertex. Every vertex has at most one leaving edge in a breadth-first tree, since every vertex has at most one back-pointer.

How long does BFS take? We visit each vertex and each edge at most once. For a graph with n vertices and m edges, the running time is O(n + m).

You can think of BFS like a "layered" search, where we visit the start vertex at distance 0, then the next "layer" of vertices at distance 1, then the next layer at distance 2, and so on. We really need the queue to keep track of which vertex we search from next.