These models implement the three-stage Multi-task Cascaded Convolutional Networks (MTCNN) architecture from the paper Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks.
model_facenet_pnet(pretrained = TRUE, progress = FALSE, ...)
model_facenet_rnet(pretrained = TRUE, progress = FALSE, ...)
model_facenet_onet(pretrained = TRUE, progress = FALSE, ...)
model_mtcnn(pretrained = TRUE, progress = TRUE, ...)
model_inception_resnet_v1(
pretrained = NULL,
classify = FALSE,
num_classes = 10,
dropout_prob = 0.6,
...
)
pretrained (bool): If TRUE, returns a model with weights pretrained for face detection. For model_inception_resnet_v1(), a character value instead: "vggface2" or "casia-webface" selects the pretrained weights, and NULL (the default) leaves the model randomly initialized.
progress (bool): If TRUE, displays a progress bar of the download to stderr.
...: Other parameters passed to the model implementation.
classify: Logical, whether to include the classification head. Default is FALSE.
num_classes: Integer, number of output classes for classification. Default is 10.
dropout_prob: Numeric, dropout probability applied before classification. Default is 0.6.
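For instance, the classification head is enabled at construction time (a minimal sketch; the 100-class head is illustrative, not a pretrained configuration):

# Default: embedding model, outputs (N, 512) unit-norm embeddings
embedder <- model_inception_resnet_v1()

# With a classification head: outputs (N, num_classes) logits
# (num_classes = 100 is an illustrative value)
classifier <- model_inception_resnet_v1(classify = TRUE, num_classes = 100)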
model_mtcnn() returns a named list with three elements:

boxes: A tensor of shape (N, 4) with bounding box coordinates [x1, y1, x2, y2].

landmarks: A tensor of shape (N, 10) with the (x, y) coordinates of 5 facial landmarks: left eye, right eye, nose, left mouth corner, right mouth corner.

cls: A tensor of shape (N, 2) with face classification probabilities (face / non-face). Column 1 is the non-face (background) probability; column 2 is the face probability, which is the value to threshold when filtering detections (see the sketch below).

Here, N is the number of detected faces in the input image.
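As a sketch, detections can be filtered by thresholding the face probability in column 2 of cls (the 0.9 cutoff is an illustrative choice, not a library default):

mtcnn <- model_mtcnn(pretrained = TRUE)
mtcnn$eval()
image_tensor <- torch_randn(1, 3, 224, 224) # stand-in; use a real photo in practice
out <- mtcnn(image_tensor)
# Keep detections whose face probability exceeds the cutoff
keep <- out$cls[, 2] > 0.9
boxes_kept <- out$boxes[keep, ]
landmarks_kept <- out$landmarks[keep, ]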
model_inception_resnet_v1() returns a tensor whose shape depends on the classify argument:

When classify = FALSE (the default): a tensor of shape (N, 512), where each row is a normalized embedding vector (L2 norm = 1). These 512-dimensional FaceNet embeddings can be compared using cosine similarity or Euclidean distance for face verification and clustering, as sketched below.

When classify = TRUE: a tensor of shape (N, num_classes) containing class logits.
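For example, verification can compare two embeddings with cosine similarity, which for unit-norm vectors reduces to a dot product (a minimal sketch; the 160×160 crop size and 0.7 decision threshold are illustrative):

model <- model_inception_resnet_v1(pretrained = "vggface2")
model$eval()
crops <- torch_randn(2, 3, 160, 160) # stand-in for two aligned face crops
emb <- model(crops)                  # (2, 512), rows have unit L2 norm
cos_sim <- torch_dot(emb[1, ], emb[2, ])
same_person <- as.numeric(cos_sim) > 0.7 # illustrative threshold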
MTCNN detects faces and facial landmarks in an image through a coarse-to-fine pipeline:
PNet (Proposal Network): Generates candidate face bounding boxes at multiple scales.
RNet (Refine Network): Refines candidate boxes, rejecting false positives.
ONet (Output Network): Produces final bounding boxes and 5-point facial landmarks.
| Model | Input Size | Parameters | File Size | Outputs | Notes |
|-------|----------------|------------|-----------|-------------------------------|-----------------------------------|
| PNet | ≥12×12 (fully conv) | ~3k | 30 kB | 2-class face prob + bbox reg | Fully convolutional, sliding-window stage |
| RNet | 24×24 | ~30k | 400 kB | 2-class face prob + bbox reg | Dense layers; rejects false positives |
| ONet | 48×48 | ~100k | 2 MB | 2-class prob + bbox + 5-point | Landmark detection stage |
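Because PNet is fully convolutional, the cascade begins by running it over an image pyramid; mtcnn() performs this internally, but the idea can be sketched directly (random input for illustration):

pnet <- model_facenet_pnet(pretrained = TRUE)
pnet$eval()
img <- torch_randn(1, 3, 240, 240)
for (scale in c(1, 0.5, 0.25)) {
  side <- as.integer(240 * scale)
  scaled <- nnf_interpolate(img, size = c(side, side))
  # Each pass yields per-location face probabilities and box regressions
  out <- pnet(scaled)
}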
Inception-ResNet-v1 is a convolutional neural network architecture combining Inception modules with residual connections, designed for face recognition tasks. The model achieves high accuracy on standard face verification benchmarks such as LFW (Labeled Faces in the Wild).
| Weights | LFW Accuracy | File Size |
|----------------|--------------|-----------|
| CASIA-Webface | 99.05% | 111 MB |
| VGGFace2 | 99.65% | 107 MB |
The CASIA-Webface pretrained weights provide strong baseline accuracy.
The VGGFace2 pretrained weights achieve higher accuracy, benefiting from a larger, more diverse dataset.
model_facenet_pnet(): PNet (Proposal Network), a small fully-convolutional network for candidate face box generation.

model_facenet_rnet(): RNet (Refine Network), a medium CNN with dense layers that refines candidate boxes and rejects false positives.

model_facenet_onet(): ONet (Output Network), a deeper CNN that outputs final bounding boxes and 5 facial landmark points.

model_mtcnn(): MTCNN (Multi-task Cascaded Convolutional Networks), face detection and alignment using a cascade of the three networks above.

model_inception_resnet_v1(): Inception-ResNet-v1, a high-accuracy face recognition model combining Inception modules with residual connections, with pretrained weights available for the VGGFace2 and CASIA-Webface datasets.
if (FALSE) { # \dontrun{
# Example usage of PNet
model_pnet <- model_facenet_pnet(pretrained = TRUE)
model_pnet$eval()
input_pnet <- torch_randn(1, 3, 224, 224)
output_pnet <- model_pnet(input_pnet)
output_pnet
# Example usage of RNet
model_rnet <- model_facenet_rnet(pretrained = TRUE)
model_rnet$eval()
input_rnet <- torch_randn(1, 3, 24, 24)
output_rnet <- model_rnet(input_rnet)
output_rnet
# Example usage of ONet
model_onet <- model_facenet_onet(pretrained = TRUE)
model_onet$eval()
input_onet <- torch_randn(1, 3, 48, 48)
output_onet <- model_onet(input_onet)
output_onet
# Example usage of MTCNN
mtcnn <- model_mtcnn(pretrained = TRUE)
mtcnn$eval()
image_tensor <- torch_randn(1, 3, 224, 224)
out <- mtcnn(image_tensor)
out
# Load an image from the web
wmc <- "https://upload.wikimedia.org/wikipedia/commons/"
url <- "b/b4/Catherine_Bell_200101233d_hr_%28cropped%29.jpg"
img <- base_loader(paste0(wmc, url))
# Convert to torch tensor [C, H, W] normalized
input <- transform_to_tensor(img) # [C, H, W]
batch <- input$unsqueeze(1) # [1, C, H, W]
# Load pretrained model
model <- model_inception_resnet_v1(pretrained = "vggface2")
model$eval()
output <- model(batch)
output
# Example usage of Inception-ResNet-v1 with CASIA-Webface Weights
model <- model_inception_resnet_v1(pretrained = "casia-webface")
model$eval()
output <- model(batch)
output
} # }