Random Shuffles

Shuffling a collection of items is a surprisingly frequent task in programming: essentially, it comes up whenever a known input must be processed in random order. What is more, there is a delightful, three-line algorithm to accomplish this task correctly, in-place, and in optimal time. Unfortunately, this simple three-line solution seems to be insufficiently known, leading to various, less-than-optimal ad-hoc alternatives being used in practice — but that is entirely unnecessary!

The Algorithm

The basic algorithm consists of a single loop over the elements in the input array. At each step, we pick an element at random from the as-yet unvisited “tail” of the array, and exchange it with the current element. Crucially, the current element is always considered part of the “tail”. In Python, it looks like this:

for i in range( data ):
    j = random.randrange( i, len(data) )
	data[i], data[j] = data[j], data[i]

These three lines (or the idea behind them) should be committed to memory!

Notice the use of the random.randrange( start, stop ) function, which returns a random index precisely from the required range: i <= j < len(data).

To see that this algorithm produces each possible permutation with equal probability, it is enough to realize that:

  • the first element is selected randomly from all possible choices
  • the algorithm is recursive (the implementation isn’t, but the algorithm is).

Because each element is touched once, the algorithm has the “best possible” time complexity. The input sequence is shuffled in place, without the need for extra storage.

The algorithm goes back to the early 20th century statistics pioneers Ronald Fisher and Frank Yates, with practical computer implementations being provided later by Richard Durstenfeld and Donald Knuth. See its Wikipedia entry.

Reading from a Stream

The above Wikipedia article also mentions a fascinating variant, wherein a randomly shuffled array is built up from an input stream, even when the total number of elements in the stream is not known. For each value read from the stream, the array is extended by one element, and a random location within the newly extended array is found for the latest element.

data = []
while True:
    elem = input()              # raises EOFError when done
	data.append( elem )         # add new elem at end, maybe shuffle later!
	i = random.randrange( 0, len(data) )
	if i != len(data)-1:        # if i is not last element in data: exchange
		data[i], data[-1] = data[-1], data[i]