How to Select a Random Subset from a List

Reservoir sampling provides an easy and efficient way to do this. Here's an example implementation in PHP.

Suppose you have a website that displays product information provided by a supplier in an XML data feed, and you’d like to select ten products at random from the feed to feature on the site. A PHP programmer might start with an approach like this:

// Load and parse the list of products
$product_list = simplexml_load_file('https://supplier.example.com/products.xml');

// Fetch data for all the products as an array
$products = $product_list->xpath('product');

// Reorder the products randomly
shuffle($products);

// Select the first ten products from this randomly ordered list
$products = array_slice($products, 0, 10);

This works, but it’s pretty inefficient. Even though we’re interested in data for only a handful of products, the code parses and holds in memory a representation of the entire product list, which it then shuffles—again in its entirety. After all this processing, most of the list is discarded in the final step.

A simple approach like this might be acceptable when the number of products is very small, but as the product list grows, so will the processing time and memory consumed (and mostly wasted) by this implementation. What’s a better way to do this?

Algorithm R

Reservoir-sampling algorithms are a family of algorithms designed to perform efficiently this exact task. The most common and straightforward example is known as Algorithm R, credited to Alan G. Waterman in Donald Knuth’s The Art of Computer Programming, Volume 2 and in a 1985 paper on the subject by Jeffrey S. Vitter.

Algorithm R generates a random subset from a list in only a single pass through it. As it processes each element in constant time, its overall runtime is O(n) and increases linearly with the length of its input. However, because the algorithm loads only a single element in each iteration, its memory usage remains constant no matter how many elements the list contains—a significant improvement over the approach above.

Here’s how the algorithm works:

Suppose our input consists of a set S of N elements, where N is very large, unknown or both. We want to select at random a subset of n elements from S, where typically n is much smaller than N. For generality, we’ll assume S is too large to fit in memory and is read sequentially from an external source.

Algorithm R starts by defining a reservoir of elements and populating it with the first n elements read from S. At this point, the reservoir holds a true random subset of n elements selected with equal probability from those read so far. (Clearly, if Nn, the subset must include every element of S.)

If N > n, then as the algorithm reads each additional element from S it decides at random whether or not this new element should replace one already in the reservoir. Let t represent the number of elements read so far, including this current, candidate element. If the reservoir already contains a true random sample, each of its elements must have been selected with equal probability n ÷ (t - 1). To maintain this property, once element t is read, it should be selected for placement in the reservoir with probability n ÷ t.

This implies every element already in the reservoir has a probability of being replaced of (1 ÷ n) × (n ÷ t) or 1 ÷ t, and a probability of surviving the iteration (not being replaced) of 1 - (1 ÷ t) or (t - 1) ÷ t.

At the end of the t‘th iteration, the probability any particular element encountered so far will be in the reservoir is equal to the product of the probability it was in the reservoir in the previous iteration and the probability it survives this current iteration. From the foregoing, this equals (n ÷ (t - 1)) × ((t - 1) ÷ t) or n ÷ t, which implies that if the reservoir held a true random sample at the end of iteration (t - 1), it will continue to hold one after subsequent iteration t.

It follows by induction that at the end of every iteration, the reservoir contains a true random sample of the elements read so far from S. Therefore, when the end of the input is reached the reservoir holds a true random sample of all the elements of S, at which point Algorithm R returns the reservoir’s contents to its caller.

Implementation in PHP

Here’s an implementation of Algorithm R in PHP, written using for instead of while loops to isolate the steps taken at each iteration:

 1 /**
 2  * Selects a random subset of elements from an input set.
 3  *
 4  * @param Iterator $input The input from which to select elements.
 5  * @param int $n The number of elements to select.
 6  * @return array An array containing the randomly selected elements.
 7  */
 8 function reservoir_sample(Iterator $input, int $n): array {
 9     $reservoir = [];
10 
11     // Fill the reservoir from the input
12     for ($t = 1; $input->valid() and $t <= $n; $input->next(), $t++)
13         $reservoir[$t - 1] = $input->current();
14 
15     // With any remaining input, replace elements in the reservoir with gradually
16     // decreasing probability
17     for (; $input->valid(); $input->next(), $t++) {
18         $j = rand(1, $t);
19         if ($j <= $n)
20             $reservoir[$j - 1] = $input->current();
21     }
22 
23     return $reservoir;
24 }

Accepting an Iterator for its input allows this function to read data sequentially from any of a variety of sources without requiring the data to already exist in memory.

Lines 18–20 contain the logic the algorithm uses to decide whether to replace an element in the reservoir. It selects a random whole number j between 1 and t, the number of elements read so far, inclusive. If j is less than or equal to n, the size of the reservoir, it replaces element j of the reservoir with the element just read from the input. Thus the new element is placed in the reservoir with probability n ÷ t, as required.

This has the convenient side effect that for any element past the first n, not just its presence but its position in the reservoir is decided at random. This means there is usually little need to shuffle the resulting array; its elements will already be (mostly) in random order relative to the input.

Example: A Random View of Toronto

Returning to our original question, A Random View of Toronto is a simple Web page that demonstrates using reservoir sampling to select efficiently a random subset of elements from an XML document. The source code is included below. It uses the Traffic Camera dataset published by Toronto’s Transportation Services, which provides information about the traffic cameras located at various points across the city.

On each load, the page randomly selects four traffic cameras from the dataset and displays their image, giving visitors a “random view” of the streets of Toronto.

To use this code, copy it to a file named index.php on your computer (or download the gist). From a terminal window, change the current directory to the folder containing the file and run

php -S 127.0.0.1:8416

Then open http://localhost:8416 in your Web browser. (Users of CentOS and Fedora will first need to install the “php-xml” package for the page to work.)

The code defines a generator, traffic_cameras, that uses an instance of XMLReader to read successive Camera elements from the dataset and yield them, one at a time, to its caller. An invocation of this generator is passed as the input to reservoir_sample. Even though in this case the dataset is fairly small, this minimizes the use of memory as no more than a single node from the XML document is ever loaded at a time.

<?php

/**
 * Displays images from a random selection of traffic cameras located in Toronto, Ontario,
 * Canada.
 *
 * Demonstrates using a reservoir-sampling algorithm to efficiently select a random subset
 * of nodes from an XML document, specifically the Traffic Camera dataset provided by
 * Toronto's Transportation Services.
 *
 * To use this file, download it to a location on your computer. From a terminal window,
 * change the current directory to the download folder and run
 *
 *   php -S 127.0.0.1:8416
 *
 * then open "http://localhost:8416" in your Web browser.
 *
 * Users of Red Hat Linux distributions, including CentOS and Fedora, will first need to
 * install the "php-xml" package if it is not already present on their system.
 *
 * For a description of the reservoir-sampling algorithm, Algorithm R, see:
 *
 *   https://simonsouth.ca/posts/select-random-subset-from-list/
 *
 * For information about the Traffic Camera dataset, see:
 *
 *   http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=9525a89f92491410VgnVCM10000071d60f89RCRD
 *
 * @author Simon South <simon@simonsouth.ca>
 * @license https://unlicense.org/ Unlicense
 */

/** The online location of Transporation Services' Traffic Camera dataset */
define('DATASET_URL',
       'http://opendata.toronto.ca/transportation/tmc/rescucameraimages/Data/' .
       'tmcearthcameras.xml');

/** The path to the local, cached copy, of the dataset */
define('DATASET_LOCAL_PATH', './tmcearthcameras.xml');

/** The online location of the folder containing traffic-camera images */
define('CAMERA_IMAGE_PATH',
       'http://opendata.toronto.ca/transportation/tmc/rescucameraimages/CameraImages/');

/** The number of cameras to display on the page */
define('CAMERA_COUNT', 4);


/**
 * Selects a random subset of elements from an input set.
 *
 * @param Iterator $input The input from which to select elements.
 * @param int $n The number of elements to select.
 * @return array An array containing the randomly selected elements.
 */
function reservoir_sample(Iterator $input, int $n): array {
    $reservoir = [];

    // Fill the reservoir from the input
    for ($t = 1; $input->valid() and $t <= $n; $input->next(), $t++)
        $reservoir[$t - 1] = $input->current();

    // With any remaining input, replace elements in the reservoir with gradually
    // decreasing probability
    for (; $input->valid(); $input->next(), $t++) {
        $j = rand(1, $t);
        if ($j <= $n)
            $reservoir[$j - 1] = $input->current();
    }

    return $reservoir;
}

/**
 * Emits traffic-camera data as objects, one camera at a time, from Transportation
 * Services' Traffic Camera XML dataset.
 *
 * @param string $dataset_path The path to the dataset.
 * @throws RuntimeException if the dataset can't be opened for reading.
 */
function traffic_cameras(string $dataset_path) {
    if (($input = XMLReader::open($dataset_path)) === FALSE)
        throw new RuntimeException("Can't open dataset at \"$dataset_path\" for reading");

    // Seek to the first "Camera" element in the document
    while ($input->read() === TRUE and
           !($input->nodeType == XML_ELEMENT_NODE and $input->localName == 'Camera'));

    // Read and emit every "Camera" element as an object
    while (($camera_node = @$input->expand()) !== FALSE) {
        $camera = new stdClass();
        foreach ($camera_node->childNodes as $child_node) {
            if ($child_node->nodeType == XML_ELEMENT_NODE)
                $camera->{strtolower($child_node->localName)} = $child_node->nodeValue;
        }

        // Ignore any camera that does not have a number assigned, which we need in order
        // to generate the URL of its image
        if (!isset($camera->number))
            continue;

        // Reformat the camera's name for a nicer display
        $camera->name = ucwords(strtolower($camera->name));
        $camera->name = str_replace("Fgg", "Gardiner Expy", $camera->name);
        $camera->name = preg_replace_callback("/Dvp|Cp|Cn|O'c/",
                                              function ($matches) {
                                                  return strtoupper($matches[0]);
                                              },
                                              $camera->name);

        // Add a "url" property that points to the camera's current image
        $camera->url = CAMERA_IMAGE_PATH . 'loc' . $camera->number . '.jpg';

        // Emit this camera
        yield $camera;

        // Continue with the next camera
        $input->next('Camera');
    }

    $input->close();
}


// Check to see if we have a cached copy of the Traffic Camera dataset available locally
// and, if not, try to create one
$cached_dataset_available = file_exists(DATASET_LOCAL_PATH);
if (!$cached_dataset_available) {
    $io_error =
              ($input = fopen(DATASET_URL, 'r')) === FALSE or
              ($output = fopen(DATASET_LOCAL_PATH, 'w')) === FALSE;

    while (!$io_error && !feof($input)) {
        $io_error =
                  ($data = fread($input, 1024 * 16)) === FALSE or
                  fwrite($output, $data) === FALSE;
    }

    $cached_dataset_available = !$io_error;
}

// Use the cached copy of the Traffic Camera dataset, if it's available; otherwise read
// directly from the online feed
$dataset_path = $cached_dataset_available? DATASET_LOCAL_PATH: DATASET_URL;

// Select for display a random subset of cameras from the feed
$selected_cameras = reservoir_sample(traffic_cameras($dataset_path), CAMERA_COUNT);

?>
<!doctype html>
<html lang="en-CA">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=450">
    <title>A Random View of Toronto - Toronto Traffic Cameras</title>
    <style type="text/css">
      body {
          background-color: white;
          color: black;
          font-family: sans-serif;
          margin: 0 auto;
          max-width: 825px;
          padding: 1em;
      }

      header {
          border-bottom: 1px solid black;
          margin-bottom: 5ex;
      }

      header h1 {
          margin-bottom: 0;
      }

      main {
          text-align: center;
      }

      .camera {
          display: inline-block;
          margin: 0 5px 3ex 5px;
          max-width: 400px;
          vertical-align: top;
      }

      .camera figcaption {
          margin-top: 1ex;
      }

      footer {
          border-top: 1px solid black;
          font-size: smaller;
          margin-top: 2ex;
          padding-top: 1ex;
      }

      footer p {
          margin: 1ex auto;
      }
    </style>
  </head>

  <body>
    <header>
      <h1>A Random View of Toronto</h1>
    </header>

    <main>
      <?php foreach ($selected_cameras as $camera): ?>
      <figure class="camera">
        <img src="<?php echo $camera->url; ?>"
             alt="View from traffic camera at <?php echo $camera->name; ?>"
             width="400"
             height="225">
        <figcaption>
          <a href="https://www.google.ca/maps/?ll=<?php echo $camera->latitude; ?>,<?php echo $camera->longitude; ?>"
             target="_blank"><?php echo $camera->name ?></a>
        </figcaption>
      </figure>
      <?php endforeach; ?>
    </main>

    <footer>
      <p>This page demonstrates the reservoir-sampling algorithm described in
        <a href="https://simonsouth.ca/posts/select-random-subset-from-list/"
           target="_blank"><i>How to Select a Random Subset from a List</i></a>.</p>

      <p>Data provided by
      <a href="http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=e10086195a7c1410VgnVCM10000071d60f89RCRD"
         target="_blank">Transportation Services</a> and licensed under the
      <a href="http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=4a37e03bb8d1e310VgnVCM10000071d60f89RCRD"
         target="_blank">Open Government License – Toronto</a>.</p>
    </footer>
    </footer>
  </body>
</html>
Simon South is a senior software engineer in Toronto, Ontario who builds high-performance Web backends using PHP and the Phalcon framework.