Labeled Faces in the Wild (LFW) Datasets

lfw_people_dataset(
  root = tempdir(),
  transform = NULL,
  split = "original",
  target_transform = NULL,
  download = FALSE
)

lfw_pairs_dataset(
  root = tempdir(),
  train = TRUE,
  transform = NULL,
  split = "original",
  target_transform = NULL,
  download = FALSE
)

Arguments

root

Root directory for dataset storage. The dataset will be stored under root/lfw_people or root/lfw_pairs.

transform

Optional. A function that takes an image and returns a transformed version (e.g., normalization, cropping).

split

Which version of the dataset to use. One of "original" or "funneled". Defaults to "original".

target_transform

Optional. A function that transforms the label.

download

Logical. If TRUE, downloads the dataset to root/. If the dataset is already present, download is skipped.

train

Logical, used only by lfw_pairs_dataset. If TRUE (the default), loads the training split (pairsDevTrain.txt); if FALSE, loads the test split (pairsDevTest.txt).
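As a minimal sketch of the kind of functions that could be supplied as transform and target_transform, the helpers below work on a synthetic H x W x 3 array and on an integer pair label. The names center_crop and pair_label_to_logical are hypothetical and not part of the package, and the 224-pixel crop size is an arbitrary choice for illustration.

```r
# Hypothetical helper: crop the central `size` x `size` window of an
# H x W x 3 array (the image layout described under Value).
center_crop <- function(img, size = 224L) {
  h <- dim(img)[1]
  w <- dim(img)[2]
  stopifnot(size <= h, size <= w)
  top <- (h - size) %/% 2L + 1L
  left <- (w - size) %/% 2L + 1L
  img[top:(top + size - 1L), left:(left + size - 1L), , drop = FALSE]
}

# Hypothetical helper for pair labels: 1 (same person) becomes TRUE,
# 2 (different people) becomes FALSE.
pair_label_to_logical <- function(y) {
  y == 1L
}

# Synthetic stand-in for one 250 x 250 RGB image, so no download is needed.
img <- array(runif(250 * 250 * 3), dim = c(250, 250, 3))
cropped <- center_crop(img)
dim(cropped)  # 224 224 3
```

Under these assumptions, center_crop could be passed as transform to lfw_people_dataset() and pair_label_to_logical as target_transform to lfw_pairs_dataset(). How transform is applied to the two images of a pair is not specified here, so that case is left out of the sketch.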

Value

A torch dataset object of class lfw_people_dataset or lfw_pairs_dataset. Each element is a named list with:

  • x:

    • For lfw_people_dataset: a H x W x 3 numeric array representing a single RGB image.

    • For lfw_pairs_dataset: a list of two H x W x 3 numeric arrays representing a pair of RGB images.

  • y:

    • For lfw_people_dataset: an integer index from 1 to the number of identities in the dataset.

    • For lfw_pairs_dataset: 1 if the pair shows the same person, 2 if different people.
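To make the element structure concrete, the sketch below builds a few synthetic items shaped like the lfw_people_dataset elements described above and stacks them into one batch array. The helper stack_images is hypothetical, and the channel-first N x 3 x H x W layout is an assumption about what a downstream model expects, not something this page specifies.

```r
# Hypothetical helper: stack a list of H x W x 3 arrays into a single
# N x 3 x H x W array (batch dimension first, channels second).
stack_images <- function(images) {
  dims <- dim(images[[1]])
  same_shape <- vapply(images, function(im) identical(dim(im), dims), logical(1))
  stopifnot(all(same_shape))
  # Concatenate along a new last axis: H x W x 3 x N
  batch <- array(unlist(images), dim = c(dims, length(images)))
  # Reorder axes to N x 3 x H x W
  aperm(batch, c(4, 3, 1, 2))
}

# Synthetic items with the same shape as the documented elements:
# x is a 250 x 250 x 3 array, y is an integer label.
items <- lapply(1:4, function(i) {
  list(x = array(i, dim = c(250, 250, 3)), y = i)
})

batch_x <- stack_images(lapply(items, function(it) it$x))
batch_y <- vapply(items, function(it) it$y, integer(1))
dim(batch_x)  # 4 3 250 250
```

In a real pipeline a data loader would normally take care of batching; the point here is only how the documented H x W x 3 layout maps onto a batch.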

Details

The LFW dataset collection provides facial images for evaluating face recognition systems. It includes two variants:

  • lfw_people_dataset: A multi-class classification dataset where each image is labelled by person identity.

  • lfw_pairs_dataset: A face verification dataset containing image pairs with binary labels (same or different person).

This R implementation of the LFW dataset is based on the fetch_lfw_people() and fetch_lfw_pairs() functions from the scikit-learn library, but deviates in a few key aspects due to dataset availability and R API conventions:

  • The color and resize arguments from Python are not directly exposed. Instead, all images are RGB with a fixed size of 250x250.

  • The subset argument of fetch_lfw_pairs() in Python (train, test, or 10_folds) is simplified to a train logical flag in R. The 10_folds subset is not supported, as the original protocol files are unavailable or incompatible with clean separation of image-label pairs.

  • The split parameter in R does not select a train/test partition. It controls which version of the images to use: "original" (unaligned) or "funneled" (aligned using funneling). The funneled images are geometrically normalized, which gives better alignment and typically improves the performance of face recognition models.

  • The dataset is downloaded from Figshare, which hosts the same files referenced in scikit-learn's dataset utilities.

Dataset sizes:

  • lfw_people_dataset: 13,233 images of 5,749 identities (the same for the "original" and "funneled" versions)

  • lfw_pairs_dataset:

    • Training split (train = TRUE): 2,200 image pairs

    • Test split (train = FALSE): 1,000 image pairs
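The pair counts above follow from the layout of the LFW development protocol files: the first line gives a count N, followed by N matched lines (name, image index, image index) and N mismatched lines (name, index, name, index), tab-separated, for 2N pairs in total. The sketch below parses that layout from a small synthetic example. The function parse_pairs, the relative file-name pattern, and the sample lines are assumptions for illustration; the package's internal parsing may differ.

```r
# Hypothetical parser for an LFW development pairs file.
# Assumed format: line 1 is N, then N matched lines (name, idx1, idx2)
# and N mismatched lines (name1, idx1, name2, idx2), tab-separated.
parse_pairs <- function(lines) {
  n <- as.integer(lines[1])
  fields <- strsplit(lines[-1], "\t", fixed = TRUE)
  stopifnot(length(fields) == 2L * n)

  # Assumed file layout: <name>/<name>_<4-digit index>.jpg
  img_file <- function(name, idx) {
    sprintf("%s/%s_%04d.jpg", name, name, as.integer(idx))
  }

  rows <- lapply(fields, function(f) {
    if (length(f) == 3L) {
      # Same person: label 1
      data.frame(img1 = img_file(f[1], f[2]), img2 = img_file(f[1], f[3]),
                 y = 1L, stringsAsFactors = FALSE)
    } else if (length(f) == 4L) {
      # Different people: label 2
      data.frame(img1 = img_file(f[1], f[2]), img2 = img_file(f[3], f[4]),
                 y = 2L, stringsAsFactors = FALSE)
    } else {
      stop("Unexpected number of fields in pairs line")
    }
  })
  do.call(rbind, rows)
}

# Synthetic example with N = 2 (person names other than Aaron_Eckhart are made up).
example_lines <- c(
  "2",
  "Aaron_Eckhart\t1\t2",
  "Person_B\t3\t4",
  "Aaron_Eckhart\t1\tPerson_B\t1",
  "Person_C\t2\tPerson_D\t5"
)
pairs <- parse_pairs(example_lines)
nrow(pairs)  # 4
pairs$y      # 1 1 2 2
```

With N = 1,100 this layout gives the 2,200 training pairs listed above, and N = 500 gives the 1,000 test pairs.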

Examples

if (FALSE) { # \dontrun{
# Load data for LFW People Dataset
lfw <- lfw_people_dataset(download = TRUE)
first_item <- lfw[1]
first_item$x  # RGB image
first_item$y  # Label index
lfw$classes[first_item$y]  # person's name (e.g., "Aaron_Eckhart")

# Load training data for LFW Pairs Dataset
lfw <- lfw_pairs_dataset(download = TRUE, train = TRUE)
first_item <- lfw[1]
first_item$x  # List of 2 RGB Images
first_item$x[[1]]  # RGB Image
first_item$x[[2]]  # RGB Image
first_item$y  # Label index
lfw$classes[first_item$y]  # Class Name (e.g., "Same" or "Different")

# Load test data for LFW Pairs Dataset
lfw <- lfw_pairs_dataset(download = TRUE, train = FALSE)
first_item <- lfw[1]
first_item$x  # List of 2 RGB Images
first_item$x[[1]]  # RGB Image
first_item$x[[2]]  # RGB Image
first_item$y  # Label index
lfw$classes[first_item$y]  # Class Name (e.g., "Same" or "Different")
} # }