{torch}, {tabnet} and hands-on deep learning

Workshop at Rencontres R 2025

Christophe Regouby

Agenda

Getting started with torch

{torch}

The mlverse

{tabnet}

  • {tabnet} for regression with missing values

  • {tabnet} for hierarchical classification

  • {tabnet} for imbalanced binary classification

GPT2 with R

Fine-tuning GPT2 in French with LoRA

An image classifier by fine-tuning ResNext50

Getting started with torch

License



This content is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Checklist


R & RStudio / a comfortable IDE installed?

     I have R 4.5.0 and RStudio 2025.05.0 build 496

Is {torch} installed?

     torch::torch_is_installed()
     Your system is ready!

A {torch} accelerator?

     torch::backends_xxxx_is_available()
     Your system has power!
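The `xxxx` placeholder stands for a backend name. A minimal sketch of the concrete checks, assuming the backend helpers currently exported by {torch}:

library(torch)
# NVIDIA GPU (CUDA) and its cuDNN kernels
cuda_is_available()
backends_cudnn_is_available()
# Apple Silicon GPU (Metal Performance Shaders)
backends_mps_is_available()
# MKL acceleration on CPU
backends_mkl_is_available()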

Other resources

{torch}

{torch}: why reinvent the wheel?

  • easy and frugal installation on CPU, GPU, MPS, …

  • the comfort of RStudio to develop, debug and visualize

  • the comfort of R's 1-based indexing (see the short example after this list)

  • the quality of the Posit AI blog posts

  • the package ecosystem

  • plenty of opportunities to contribute
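A quick illustration of that 1-based indexing (a minimal sketch):

library(torch)
x <- torch_tensor(c(10, 20, 30))
x[1]     # first element, as in base R (PyTorch would say x[0])
x[2:3]   # slices include both endpoints, as in base R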

Installation

Standard
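A minimal sketch of the standard installation path:

# install the R package from CRAN
install.packages("torch")
# then download the libtorch and lantern binaries
torch::install_torch()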

Advanced

Expert: machine without internet access

Expert: debugging

Sys.setenv(TORCH_INSTALL_DEBUG = 1)
install_torch()

?install_torch()

The software stack

Tensor manipulation

library(torch)
tt <- torch_rand(2, 3, 4)
tt
torch_tensor
(1,.,.) = 
  0.8514  0.5955  0.8392  0.5652
  0.2030  0.9233  0.7504  0.8816
  0.1172  0.0588  0.5395  0.9065

(2,.,.) = 
  0.3927  0.8916  0.1552  0.7698
  0.3152  0.2461  0.6334  0.5722
  0.2397  0.7746  0.9792  0.2495
[ CPUFloatType{2,3,4} ]

tt[, 2:N, ]
torch_tensor
(1,.,.) = 
  0.2030  0.9233  0.7504  0.8816
  0.1172  0.0588  0.5395  0.9065

(2,.,.) = 
  0.3152  0.2461  0.6334  0.5722
  0.2397  0.7746  0.9792  0.2495
[ CPUFloatType{2,2,4} ]
tt[1, 2:N, ]
torch_tensor
 0.2030  0.9233  0.7504  0.8816
 0.1172  0.0588  0.5395  0.9065
[ CPUFloatType{2,4} ]
tt[1:1, 2:N, ]
torch_tensor
(1,.,.) = 
  0.2030  0.9233  0.7504  0.8816
  0.1172  0.0588  0.5395  0.9065
[ CPUFloatType{1,2,4} ]
torch_squeeze(tt[1:1, 2:N, ])
torch_tensor
 0.2030  0.9233  0.7504  0.8816
 0.1172  0.0588  0.5395  0.9065
[ CPUFloatType{2,4} ]

tt$to(device = "cpu")
torch_tensor
(1,.,.) = 
  0.8514  0.5955  0.8392  0.5652
  0.2030  0.9233  0.7504  0.8816
  0.1172  0.0588  0.5395  0.9065

(2,.,.) = 
  0.3927  0.8916  0.1552  0.7698
  0.3152  0.2461  0.6334  0.5722
  0.2397  0.7746  0.9792  0.2495
[ CPUFloatType{2,3,4} ]
as.array(tt$to(device = "cpu"))
, , 1

          [,1]      [,2]      [,3]
[1,] 0.8514396 0.2029930 0.1172432
[2,] 0.3926740 0.3151631 0.2396535

, , 2

          [,1]      [,2]       [,3]
[1,] 0.5955127 0.9232679 0.05882704
[2,] 0.8915685 0.2460586 0.77457345

, , 3

          [,1]      [,2]      [,3]
[1,] 0.8392107 0.7504073 0.5394922
[2,] 0.1552180 0.6334370 0.9791724

, , 4

          [,1]      [,2]      [,3]
[1,] 0.5651808 0.8816227 0.9065086
[2,] 0.7697918 0.5721706 0.2495310
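Going the other way, from an R array to a tensor, works just as directly (a minimal sketch):

a <- array(runif(24), dim = c(2, 3, 4))
ta <- torch_tensor(a)           # dimensions are preserved
ta$shape
torch_tensor(a, dtype = torch_float32(), device = "cpu")  # dtype / device chosen at creation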

Your turn: exercises

Installation: 00_installation.R

02:00

Exercise: 01_exercice.

05:00

nn_modules: building a network

net <- nn_module(
  "Net",
  initialize = function(num_class) {
    self$conv1 <- nn_conv2d(1, 32, 3, 1)
    self$conv2 <- nn_conv2d(32, 64, 3, 1)
    self$dropout1 <- nn_dropout(0.25)
    self$dropout2 <- nn_dropout(0.5)
    self$fc1 <- nn_linear(9216, 128)
    self$fc2 <- nn_linear(128, num_class)
  },
  forward = function(x) {
    x %>% 
      self$conv1() %>% 
      nnf_relu() %>% 
      self$conv2() %>% 
      nnf_relu() %>% 
      nnf_max_pool2d(2) %>% 
      self$dropout1() %>% 
      torch_flatten(start_dim = 2) %>% 
      self$fc1() %>% 
      nnf_relu() %>% 
      self$dropout2() %>% 
      self$fc2()
  }
)
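To check the plumbing, we can instantiate the module and push a dummy batch through it; a small sketch assuming MNIST-sized 28×28 grayscale images, which is what the 9216 inputs of fc1 correspond to:

model <- net(num_class = 10)
x <- torch_randn(64, 1, 28, 28)   # a fake batch of 64 one-channel 28x28 images
model(x)$shape                    # 64 x 10: one score per class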

Training a network with luz

library(luz)

fitted <- net %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_ignite_adam,
    metrics = list(
      luz_metric_accuracy()
    )
  ) %>%
  set_hparams(num_class = 10) %>% 
  set_opt_hparams(lr = 0.003) %>%
  fit(train_dl, epochs = 10, valid_data = test_dl)
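train_dl and test_dl are assumed to already exist. A minimal sketch of how they could be built, assuming the MNIST dataset shipped with {torchvision}:

library(torch)
library(torchvision)

train_ds <- mnist_dataset("data", train = TRUE, download = TRUE,
                          transform = transform_to_tensor)
test_ds  <- mnist_dataset("data", train = FALSE,
                          transform = transform_to_tensor)

train_dl <- dataloader(train_ds, batch_size = 128, shuffle = TRUE)
test_dl  <- dataloader(test_ds, batch_size = 128)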

Exercise: 02_exercice.

05:00

mlverse

A universe of 📦 dedicated to {torch}

mlverse packages / packages using torch


A universe of 📦 in French

The packages available in French

package         messages   help¹              vignettes
{torch}
{torchvision}              /torchvision.fr
{tabnet}                   /tabnet.fr
{luz}                      /luz.fr            cregouby.github.io/luz.fr/
{hfhub}                    /hfhub.fr
{tok}                      /tok.fr
{safetensors}              /safetensors.fr
{minhub}²                  /minhub.fr
# messages in French
Sys.setLanguage(lang = "fr")
library(torchvision)
transform_normalize(torch::torch_rand(c(3,5,5)), 3, 0)

# help pages in French
Sys.setenv(LANGUAGE="fr")
library(torchvision.fr)
library(torchvision)
?transform_normalize

{tabnet}

v0.7.0 is on CRAN

How it works

Integrated usage within tidymodels

Dataset

library(tidymodels, quietly = TRUE)
data("ames", package = "modeldata")
str(ames)
tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame)
 $ MS_SubClass       : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ...
 $ MS_Zoning         : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ Lot_Frontage      : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
 $ Lot_Area          : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
 $ Street            : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley             : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Lot_Shape         : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ...
 $ Land_Contour      : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
 $ Utilities         : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Lot_Config        : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
 $ Land_Slope        : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood      : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ...
 $ Condition_1       : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ Condition_2       : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Bldg_Type         : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
 $ House_Style       : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ...
 $ Overall_Cond      : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ...
 $ Year_Built        : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
 $ Year_Remod_Add    : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
 $ Roof_Style        : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
 $ Roof_Matl         : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior_1st      : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
 $ Exterior_2nd      : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
 $ Mas_Vnr_Type      : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...
 $ Mas_Vnr_Area      : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
 $ Exter_Cond        : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation        : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
 $ Bsmt_Cond         : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...
 $ Bsmt_Exposure     : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...
 $ BsmtFin_Type_1    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ...
 $ BsmtFin_SF_1      : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ...
 $ BsmtFin_Type_2    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ...
 $ BsmtFin_SF_2      : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
 $ Bsmt_Unf_SF       : num [1:2930] 441 270 406 1045 137 ...
 $ Total_Bsmt_SF     : num [1:2930] 1080 882 1329 2110 928 ...
 $ Heating           : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Heating_QC        : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ...
 $ Central_Air       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical        : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ First_Flr_SF      : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
 $ Second_Flr_SF     : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
 $ Gr_Liv_Area       : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
 $ Bsmt_Full_Bath    : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
 $ Bsmt_Half_Bath    : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
 $ Full_Bath         : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
 $ Half_Bath         : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
 $ Bedroom_AbvGr     : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
 $ Kitchen_AbvGr     : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
 $ TotRms_AbvGrd     : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
 $ Functional        : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Fireplaces        : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
 $ Garage_Type       : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Garage_Finish     : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ...
 $ Garage_Cars       : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
 $ Garage_Area       : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
 $ Garage_Cond       : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Paved_Drive       : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...
 $ Wood_Deck_SF      : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
 $ Open_Porch_SF     : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
 $ Enclosed_Porch    : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
 $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
 $ Screen_Porch      : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
 $ Pool_Area         : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
 $ Pool_QC           : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Fence             : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ...
 $ Misc_Feature      : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...
 $ Misc_Val          : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
 $ Mo_Sold           : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
 $ Year_Sold         : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ Sale_Type         : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Sale_Condition    : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Sale_Price        : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
 $ Longitude         : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ...
 $ Latitude          : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ...

Recipe

ames <- ames |> mutate(Sale_Price = log10(Sale_Price))
ames_rec <- recipe(Sale_Price ~ ., data=ames) |> 
  step_normalize(all_numeric(), -all_outcomes()) 

Pre-training

library(tabnet)
ames_pretrain <- tabnet_pretrain(
  ames_rec, data = ames, epochs = 50, cat_emb_dim = 1,
  valid_split = 0.2, verbose = TRUE,
  early_stopping_patience = 3L,
  early_stopping_tolerance = 1e-4
)
# model diagnostic
autoplot(ames_pretrain)

Training

ames_fit <- tabnet_fit(ames_rec, data = ames, tabnet_model = ames_pretrain,
                       epochs = 50, cat_emb_dim = 1,
                       valid_split = 0.2, verbose = TRUE, batch_size = 2930)
# model diagnostic
autoplot(ames_fit)

Prediction

predict(ames_fit, ames)
# A tibble: 2,930 × 1
   .pred
   <dbl>
 1  4.68
 2  4.35
 3  4.66
 4  4.19
 5  4.41
 6  4.48
 7  5.01
 8  4.80
 9  4.82
10  4.38
# ℹ 2,920 more rows
metrics <- metric_set(rmse, rsq, ccc)
cbind(ames, predict(ames_fit, ames)) |> 
  metrics(Sale_Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.881 
2 rsq     standard      0.0754
3 ccc     standard      0.0423
# variable importance
vip::vip(ames_fit)

Explainability

ames_explain <- tabnet::tabnet_explain(ames_fit, ames)
# variable importance
autoplot(ames_explain, quantile = 0.99)

Your turn: exercise 03

Complete 03_exercise to practice tabnet model training.

07:00

{tabnet} for missing values

Back to the Ames dataset

  • tensors cannot contain missing values.

  • ames conveniently comes with no missing values.

What is the pool area when there is no pool?

data("ames", package = "modeldata")
qplot(ames$Mas_Vnr_Area)

How can the model capture this distribution?

What if we apply the same idea to all the columns?

Code
col_with_zero_as_na <- ames |>  
  select(where(is.numeric)) |>  
  select(matches("_SF|Area|Misc_Val|[Pp]orch$")) |> 
  summarise_each(min) |> 
  select_if(~.x==0) |> 
  names()
ames_missing <- ames |> mutate_at(col_with_zero_as_na, na_if, 0) |> 
  mutate_at("Alley", na_if, "No_Alley_Access")  |>  
  mutate_at("Fence", na_if, "No_Fence") |> 
  mutate_at(c("Garage_Cond", "Garage_Finish"), na_if, "No_Garage") |> 
  mutate_at(c("Bsmt_Exposure", "BsmtFin_Type_1", "BsmtFin_Type_2"), na_if, "No_Basement")

visdat::vis_miss(ames_missing)

Recipe

ames_missing <- ames_missing |> mutate(Sale_Price = log10(Sale_Price))
ames_missing_rec <- recipe(Sale_Price ~ ., data=ames_missing) |> 
  step_normalize(all_numeric(), -all_outcomes()) 

Pre-training

library(tabnet)
ames_missing_pretrain <- tabnet_pretrain(
  ames_missing_rec, data = ames_missing, epochs = 50, cat_emb_dim = 1, valid_split = 0.2,
  verbose = TRUE, early_stopping_patience = 3L, early_stopping_tolerance = 1e-4
)
# model diagnostic
autoplot(ames_missing_pretrain)

Training

Code
ames_missing_fit <- tabnet_fit(
  ames_missing_rec, data = ames_missing,
  tabnet_model = ames_missing_pretrain,
  epochs = 50, cat_emb_dim = 1, valid_split = 0.2,
  verbose = TRUE, batch_size = 2930,
  early_stopping_patience = 5L,
  early_stopping_tolerance = 1e-4
)
# model diagnostic
autoplot(ames_missing_fit)

Prediction

predict(ames_missing_fit, ames_missing)
# A tibble: 2,930 × 1
   .pred
   <dbl>
 1  5.36
 2  3.49
 3  4.07
 4  5.48
 5  3.99
 6  4.26
 7  4.81
 8  5.33
 9  5.82
10  3.72
# ℹ 2,920 more rows
metrics <- metric_set(rmse, rsq, ccc)
cbind(ames_missing, predict(ames_missing_fit, ames_missing)) |> 
  metrics(Sale_Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      1.04  
2 rsq     standard      0.0118
3 ccc     standard      0.0222

Variable importance

# original ames
vip_color(ames_pretrain, col_with_missings)

vip_color(ames_fit, col_with_missings)

# ames with missing values
vip_color(ames_missing_pretrain, col_with_missings)

vip_color(ames_missing_fit, col_with_missings)

Explainability

ames_missing_explain <- tabnet::tabnet_explain(ames_missing_fit, ames_missing)
# variable importance
autoplot(ames_missing_explain, quantile = 0.99, type="step")

{tabnet} with a hierarchical outcome()

  • {tabnet} accepts categorical outcomes, including multi-label multi-class.

  • what if we could enforce a constraint between the classes of the different labels?

  • the dataset must be a data.tree node (data.tree::as.Node())

    • convert the training set and the test set with as.Node() before the tabnet_ functions
    • convert back with node_to_df()
  • new in version 0.5.0

Example with starwars

library(data.tree)
data(starwars, package = "dplyr")
head(starwars, 4)
# A tibble: 4 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

We build the outcome variable as a string with / separators in a "pathString" variable (this first attempt is wrong)

starwars_tree <- starwars |> 
  mutate(pathString = paste("StarWars_characters", species, sex, `name`, sep = "/"))  |> 
  as.Node()
print(starwars_tree, "name","height", "mass", "eye_color", limit = 8)
                           levelName                    name height mass
1  StarWars_characters                   StarWars_characters      4   NA
2   ¦--Human                                           Human      3   NA
3   ¦   ¦--male                                         male      2   NA
4   ¦   ¦   ¦--Luke Skywalker                 Luke Skywalker      1   77
5   ¦   ¦   ¦--Darth Vader                       Darth Vader      1  136
6   ¦   ¦   ¦--Owen Lars                           Owen Lars      1  120
7   ¦   ¦   ¦--Biggs Darklighter           Biggs Darklighter      1   84
8   ¦   ¦   °--... 22 nodes w/ 0 sub   ... 22 nodes w/ 0 sub      1   NA
9   ¦   °--... 1 nodes w/ 31 sub       ... 1 nodes w/ 31 sub      1   NA
10  °--... 37 nodes w/ 123 sub       ... 37 nodes w/ 123 sub      1   NA
   eye_color
1           
2           
3           
4       blue
5     yellow
6       blue
7      brown
8           
9           
10          

But with rules on names and types

  • do not use {data.tree} internal names:
    • name and height are forbidden
    • like all the names in NODE_RESERVED_NAMES_CONST. (They would be dropped during the conversion.)
  • no factor() columns
  • no column named level_*
  • the last level of the hierarchy must be the individual (hence a unique id)
  • the hierarchy must have a root

Correct construction of the "pathString" outcome variable

starwars_tree <- starwars |>
  rename(`_name` = "name", `_height` = "height") |> 
  mutate(pathString = paste("StarWars_characters", species, sex, `_name`, sep = "/"))  |> 
  as.Node()
print(starwars_tree, "name", "_name","_height", "mass", "eye_color", limit = 8)
                           levelName                    name             _name
1  StarWars_characters                   StarWars_characters                  
2   ¦--Human                                           Human                  
3   ¦   ¦--male                                         male                  
4   ¦   ¦   ¦--Luke Skywalker                 Luke Skywalker    Luke Skywalker
5   ¦   ¦   ¦--Darth Vader                       Darth Vader       Darth Vader
6   ¦   ¦   ¦--Owen Lars                           Owen Lars         Owen Lars
7   ¦   ¦   ¦--Biggs Darklighter           Biggs Darklighter Biggs Darklighter
8   ¦   ¦   °--... 22 nodes w/ 0 sub   ... 22 nodes w/ 0 sub                  
9   ¦   °--... 1 nodes w/ 31 sub       ... 1 nodes w/ 31 sub                  
10  °--... 37 nodes w/ 123 sub       ... 37 nodes w/ 123 sub                  
   _height mass eye_color
1       NA   NA          
2       NA   NA          
3       NA   NA          
4      172   77      blue
5      202  136    yellow
6      178  120      blue
7      183   84     brown
8       NA   NA          
9       NA   NA          
10      NA   NA          

Initial split and construction

starwars has list() columns that need to be unnested

starw_split <- starwars |> 
  tidyr::unnest_longer(films) |> 
  tidyr::unnest_longer(vehicles, keep_empty = TRUE) |> 
  tidyr::unnest_longer(starships, keep_empty = TRUE) |> 
  initial_split( prop = .8, strata = "species")
starwars_train_tree <- starw_split |> 
  training() |>
  rename(`_name` = "name", `_height` = "height") |>
  rowid_to_column() |>
  mutate(pathString = paste("StarWars_characters", species, sex, rowid, sep = "/")) |>
  # remove outcomes labels from predictors
  select(-species, -sex, -`_name`, -rowid) |>
  # turn it as hierarchical Node
  as.Node()

starwars_test_tree <- starw_split |>
  testing() |>
  rename(`_name` = "name", `_height` = "height") |>
  rowid_to_column() |>
  mutate(pathString = paste("StarWars_characters", species, sex, rowid, sep = "/")) |>
  select(-species, -sex, -`_name`, -rowid) |>
  as.Node()

The Node's $attributesAll will be the predictors:

starwars_train_tree$attributesAll
 [1] "_height"    "birth_year" "eye_color"  "films"      "gender"    
 [6] "hair_color" "homeworld"  "mass"       "skin_color" "starships" 
[11] "vehicles"  

Model training

config <- tabnet_config(
  decision_width = 8,
  attention_width = 8,
  num_steps = 3,
  penalty = .003,
  cat_emb_dim = 2,
  valid_split = 0.2,
  learn_rate = 1e-3,
  lr_scheduler = "reduce_on_plateau",
  early_stopping_monitor = "valid_loss",
  early_stopping_patience = 4,
  verbose = FALSE
)

starw_model <- tabnet_fit(starwars_train_tree, config = config, epochs = 75, checkpoint_epochs = 15)

Diagnostics

autoplot(starw_model)

vip::vip(starw_model)

Inference with the hierarchical model

starwars_hat <- bind_cols(
    predict(starw_model, starwars_test_tree),
    node_to_df(starwars_test_tree)$y
  )
tail(starwars_hat, n = 5)

{tabnet} for imbalanced binary classification

Let's take the lending_club dataset from the {modeldata} package

data("lending_club", package = "modeldata")
class_ratio <- lending_club |> 
  summarise(sum( Class == "good") / sum( Class == "bad")) |> 
  pull() 

class_ratio
[1] 18.06576

Let's see how tabnet fares against XGBoost on this task
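The chunks below rely on train, test and a case-weight column case_wts that are built elsewhere in the workshop material. A possible construction, assuming {tidymodels} importance weights (the exact weighting scheme used in the deck may differ):

# weight the minority class by the imbalance ratio (assumption)
lending_wtd <- lending_club |>
  mutate(case_wts = if_else(Class == "bad", class_ratio, 1),
         case_wts = hardhat::importance_weights(case_wts))

lc_split <- initial_split(lending_wtd, prop = 0.8, strata = Class)
train <- training(lc_split)
test  <- testing(lc_split)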

tab_rec <- train |>
  recipe() |>
  update_role(
    Class, 
    new_role = "outcome") |>
  update_role(
    -has_role(c("outcome", "id", "case_weights")), 
    new_role = "predictor")

tab_mod <- tabnet(epochs = 100, 
                  loss = tabnet::nn_aum_loss, 
                  learn_rate = 0.02) |> 
  set_engine("torch") |> 
  set_mode("classification")

tab_wf <- workflow() |> 
  add_model(tab_mod) |> 
  add_recipe(tab_rec) |> 
  add_case_weights(case_wts)
xgb_rec <- tab_rec |> 
  step_dummy(term, sub_grade, addr_state, verification_status, emp_length)

xgb_mod <- boost_tree(trees = 100, tree_depth = 10, learn_rate = 0.1, mtry = 1, sample_size = 0.8) |> 
  set_engine("xgboost") |> 
  set_mode("classification")

xgb_wf <- workflow() |> 
  add_model(xgb_mod) |> 
  add_recipe(xgb_rec) |> 
  add_case_weights(case_wts)


tab_fit <- tab_wf |>
  fit(train)

tab_test <- tab_fit |>
  augment(test)

tab_test |> 
  pr_curve(
    Class, .pred_good, 
    case_weights = case_wts) |> 
  autoplot() 

Tabnet, with case weights, ROC_AUM loss

tab_fit_bce <- tab_wf_bce |>
  fit(train)

tab_test <- tab_fit_bce |>
  augment(test)

tab_test |> 
  pr_curve(
    Class, .pred_good, 
    case_weights = case_wts) |> 
  autoplot() 

Tabnet, with case weights, BCE loss

xgb_fit <- xgb_wf |>
  fit(train)

xgb_test <- xgb_fit |>
  augment(test)

xgb_test |>
  pr_curve(
    Class, .pred_good,
    case_weights = case_wts) |>
  autoplot() 

XGBoost, with case weights


GPT2 with R

based on 4 packages: {minhub}, {hfhub}, {tok}, {safetensors}

  • {minhub}: a repository of classic neural-network architectures for {torch}
  • {hfhub}: access to pre-trained model downloads from the Hugging Face hub
  • {tok}: an R wrapper around the Hugging Face tokenizers
  • {safetensors}: saving and reading tensor data in the .safetensors format

Downloading the model and its weights

library(minhub)
library(zeallot)

identifier <- "gpt2"
revision <- "e7da7f2"
# instantiate model and load Hugging Face weights
model <- gpt2_from_pretrained(identifier, revision)
# load matching tokenizer
tok <- tok::tokenizer$from_pretrained(identifier)
model$eval()

Tokenizing the sentence

text = paste("✨ Quel plaisir de participer aux  ateliers de R-toulouse !✨",
             "Vivement la proch" )

idx <- torch_tensor(tok$encode(text)$ids)$view(c(1, -1))
idx
torch_tensor
Columns 1 to 11 26486    101   2264    417    458  15152    343    390   1344   9346  27506

Columns 12 to 22   220    379    417   3183    390    371     12     83   2852   1076   5145

Columns 23 to 29 26486    101  29237    434   8591    386    354
[ CPULongType{1,29} ]

Generating a continuation

Generation is an iterative process: each prediction of the model is appended to the prompt, which keeps growing.

Let's append 30 tokens to it:

prompt_length <- idx$size(-1)

for (i in 1:30) { # decide on maximal length of output sequence
  # obtain next prediction (raw score)
  with_no_grad({
    logits <- model(idx + 1L)
  })
  last_logits <- logits[, -1, ]
  # pick highest scores (how many is up to you)
  c(prob, ind) %<-% last_logits$topk(50)
  last_logits <- torch_full_like(last_logits, -Inf)$scatter_(-1, ind, prob)
  # convert to probabilities
  probs <- nnf_softmax(last_logits, dim = -1)
  # probabilistic sampling
  id_next <- torch_multinomial(probs, num_samples = 1) - 1L
  # stop if end of sequence predicted
  if (id_next$item() == 0) {
    break
  }
  # append prediction to prompt
  idx <- torch_cat(list(idx, id_next), dim = 2)
}

Decoding the tokens of the result

tok$decode(as.integer(idx))
[1] "✨ Quel plaisir de participer aux  ateliers de R-toulouse !✨ Vivement la proché dans la voiture établée !✨ de décembre qui se était sécurant de pré"

Fine-tuning with LoRA

Are LLMs dispossessing the data scientist?

  • ever-larger networks mean prohibitively expensive training runs

  • the promise that the next release will fix the weaknesses observed

  • the reference dataset is hard to build

LoRA to the rescue

Low Rank Adaptation

Method

The problem of fine-tuning a neural network can be expressed by finding a \(\Delta \Theta\) that minimizes \(L(X, y; \Theta_0 + \Delta\Theta)\) where \(L\) is a loss function, \(X\) and \(y\) are the data and \(\Theta_0\) the weights from a pre-trained model.

We learn the parameters \(\Delta \Theta\) with dimension \(|\Delta \Theta|\) equal to \(|\Theta_0|\). When \(|\Theta_0|\) is very large, such as in large-scale pre-trained models, finding \(\Delta \Theta\) becomes computationally challenging. Also, for each task you need to learn a new \(\Delta \Theta\) parameter set, making it even more challenging to deploy fine-tuned models if you have more than a few specific tasks. LoRA proposes using an approximation \(\Delta \Phi \approx \Delta \Theta\) with \(|\Delta \Phi| \ll |\Delta \Theta|\). The observation is that neural nets have many dense layers performing matrix multiplication, and while they typically have full rank during pre-training, when adapting to a specific task the weight updates will have a low “intrinsic dimension”.

A simple matrix decomposition is applied for each weight matrix update \(\Delta \theta \in \Delta \Theta\). Considering \(\Delta \theta_i \in \mathbb{R}^{d \times k}\) the update for the \(i\)th weight in the network, LoRA approximates it with:

\[\Delta \theta_i \approx \Delta \phi_i = BA\] where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\) and the rank \(r \ll \min(d, k)\). Thus instead of learning \(d \times k\) parameters we now need to learn \((d + k) \times r\), which is easily a lot smaller given the multiplicative aspect. In practice, \(\Delta \theta_i\) is scaled by \(\frac{\alpha}{r}\) before being added to \(\theta_i\), which can be interpreted as a ‘learning rate’ for the LoRA update.

LoRA does not increase inference latency, as once fine tuning is done, you can simply update the weights in \(\Theta\) by adding their respective \(\Delta \theta \approx \Delta \phi\). It also makes it simpler to deploy multiple task specific models on top of one large model, as \(|\Delta \Phi|\) is much smaller than \(|\Delta \Theta|\).

Implementation with torch

We simulate a dataset following a \(y = X \theta\) model, with \(\theta \in \mathbb{R}^{1001 \times 1000}\).

library(torch)

n <- 10000
d_in <- 1001
d_out <- 1000

thetas <- torch_randn(d_in, d_out)

X <- torch_randn(n, d_in)
y <- torch_matmul(X, thetas)

We train a model to estimate \(\theta\): this will be our pre-trained base model.

Code
lin_model <- nn_linear(d_in, d_out, bias = FALSE)
Code
train <- function(model, X, y, batch_size = 128, epochs = 100) {
  opt <- optim_ignite_adam(model$parameters)

  for (epoch in 1:epochs) {
    for(i in seq_len(n/batch_size)) {
      idx <- sample.int(n, size = batch_size)
      loss <- nnf_mse_loss(model(X[idx,]), y[idx])
      
      with_no_grad({
        opt$zero_grad()
        loss$backward()
        opt$step()  
      })
    }
    
    if (epoch %% 10 == 0) {
      with_no_grad({
        loss <- nnf_mse_loss(model(X), y)
      })
      cat("[", epoch, "] Loss:", loss$item(), "\n")
    }
  }
}

We train the model

Code
train(lin_model, X, y)
[ 10 ] Loss: 576.6039 
[ 20 ] Loss: 311.8771 
[ 30 ] Loss: 154.8159 
[ 40 ] Loss: 68.2778 
[ 50 ] Loss: 25.5456 
[ 60 ] Loss: 7.554851 
[ 70 ] Loss: 1.582172 
[ 80 ] Loss: 0.2017887 
[ 90 ] Loss: 0.01323973 
[ 100 ] Loss: 0.000453501 

We simulate a different data distribution by applying a transformation to \(\theta\)

Code
thetas2 <- thetas + 1

X2 <- torch_randn(n, d_in)
y2 <- torch_matmul(X2, thetas2)

On this data, the model gives poor results:

nnf_mse_loss(lin_model(X2), y2)
torch_tensor
988.92
[ CPUFloatType{} ][ grad_fn = <MseLossBackward0> ]

The new LoRA module adds itself on top of the linear model, whose weights we freeze, with tensors A and B of inner dimension \(r\)

lora_nn_linear <- nn_module(
  initialize = function(linear, r = 16L, alpha = 1) {
    self$linear <- linear
    
    # the parameters of the initial linear model are 'frozen', i.e. not tracked
    # by autograd. They are just constants.
    purrr::walk(self$linear$parameters, \(x) x$requires_grad_(FALSE))
    
    # the low-rank parameters to train (shortcut here, specific to our lin_model)
    self$A <- nn_parameter(torch_randn(linear$in_features, r))
    self$B <- nn_parameter(torch_zeros(r, linear$out_features))
    
    # the scaling constant
    self$scaling <- alpha / r
  },
  forward = function(x) {
    # the modified `forward` adds the LoRA term x (A B) * scaling to the output of the initial model
    self$linear(x) + torch_matmul(x, torch_matmul(self$A, self$B) * self$scaling)
  }
)

Let's try a LoRA with \(r = 1\), i.e. A and B are vectors

lora <- lora_nn_linear(lin_model, r = 1L)

Training the LoRA on the new distribution

train(lora, X2, y2)
[ 10 ] Loss: 803.9484 
[ 20 ] Loss: 480.2847 
[ 30 ] Loss: 252.2604 
[ 40 ] Loss: 117.0122 
[ 50 ] Loss: 45.98793 
[ 60 ] Loss: 14.2972 
[ 70 ] Loss: 3.188789 
[ 80 ] Loss: 0.4264118 
[ 90 ] Loss: 0.02679148 
[ 100 ] Loss: 0.001070438 

The tensor \(\Delta \theta\) should ideally be constant, equal to 1:

delta_theta <- torch_matmul(lora$A, lora$B) * lora$scaling
delta_theta[1:5, 1:5]
#> torch_tensor
#> 1.0005  1.0005  1.0005  1.0005  1.0005
#> 1.0000  1.0000  1.0000  1.0000  1.0000
#> 0.9958  0.9958  0.9958  0.9958  0.9958
#> 0.9993  0.9993  0.9993  0.9993  0.9993
#> 0.9997  0.9997  0.9997  0.9997  0.9997
#>[ CPUFloatType{5,5} ][ grad_fn = <SliceBackward0> ]

To reduce inference time, a trick is to add the LoRA weights directly into the model weights with the in-place $add_() method. This way we go from two sequential forward passes to a single one.

with_no_grad({
  lin_model$weight$add_(delta_theta$t())  
})
torch_tensor
Columns 1 to 6-1.3702e+00 -1.8230e+00 -6.2842e-02 -6.6632e-01  3.9145e-01 -1.0849e+00
-9.9452e-01  2.3281e+00 -1.2232e+00  1.9496e+00 -2.2651e-01  3.4276e-01
-9.3249e-01  2.4470e-01 -7.7541e-01 -1.7575e+00  1.9786e+00 -1.7312e+00
 1.8013e+00  8.8431e-02  1.0654e+00 -6.3846e-01 -7.8318e-01  1.6281e+00
-3.7756e-01 -1.6199e+00  1.2010e+00 -1.4953e+00 -1.3473e+00  2.9494e-01
 9.4640e-01  7.6874e-01  1.1691e+00 -1.2790e+00 -1.7773e+00  1.8248e+00
-1.7454e+00  4.9157e-01 -1.1067e+00  5.3387e-01  8.1825e-02  3.7023e-01
-4.9610e-02  6.9243e-01 -1.4972e+00  6.9678e-01  1.3679e+00  3.0947e-01
 1.5678e+00  3.0892e-01  1.2925e+00 -1.3465e+00 -2.4546e-01  1.4780e+00
 9.1523e-01  8.0792e-02 -9.2377e-01  1.0088e+00  4.2277e-01 -1.5816e-01
 6.5593e-01  6.5353e-02  5.1238e-01  6.6861e-01 -4.7130e-01 -6.9400e-01
 1.0499e+00 -5.8894e-01  5.3813e-01  3.4786e-01 -8.1234e-01  9.3049e-01
-1.3125e-01 -7.7277e-01 -1.7689e+00  4.4351e-01 -1.6387e+00 -8.4184e-01
 4.9112e-01 -4.4472e-01 -8.3900e-02  2.5632e+00 -2.1624e+00  9.9819e-02
-1.4026e+00  1.2139e+00  8.6326e-01  7.5578e-01  5.9698e-01  1.9664e+00
 1.8184e+00 -1.4854e+00  4.6389e-01 -6.3541e-01 -3.5622e-02 -1.1756e+00
 9.7948e-02 -1.3066e+00 -7.6465e-01 -5.8135e-01  1.7723e+00  3.1507e-01
-6.8449e-01  1.4971e+00  8.5841e-02  1.4874e+00 -2.4287e+00 -7.7290e-01
-1.0146e+00 -9.2046e-01  1.4259e+00 -4.3127e-02  1.1053e+00 -5.5399e-01
-1.7334e+00 -1.5949e+00 -3.6605e-01  9.1769e-01 -4.8194e-01 -1.9422e+00
 1.1083e+00  3.4509e-01  3.5148e-03  6.2509e-01 -1.7053e+00  1.0438e-01
-6.2546e-01  7.2093e-01 -8.1211e-01 -8.2500e-01 -1.2463e+00 -9.3799e-01
-7.3798e-01  4.7272e-01  9.2503e-01  9.1160e-02  6.2472e-01 -1.3352e+00
-2.5224e-01 -1.3423e+00 -1.5197e+00  1.1242e+00  6.8534e-02 -1.7901e+00
-2.6883e-01  6.6717e-01  7.4457e-01 -5.4357e-01  2.2816e-01  7.0499e-02
-1.1835e-01 -1.9279e-01  6.0010e-01 -7.8544e-02 -6.1630e-01  5.0806e-01
 1.4728e+00  1.8682e-01  1.2522e+00 -9.1607e-01  1.0065e+00 -4.9821e-01
-1.1942e+00  9.1282e-01  1.2018e+00 -1.9340e+00  3.8941e-01 -1.0429e+00
 3.5208e-01 -4.7015e-01  2.6418e+00  5.7486e-01 -1.1825e+00  7.4574e-01
 1.5374e-01 -5.1457e-01 -7.9212e-01  1.8305e-01  9.3734e-01  1.1209e+00
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{1000,1001} ]

What is the performance on the new distribution?

nnf_mse_loss(lin_model(X2), y2)
#> torch_tensor
#> 0.0012366
#> [ CPUFloatType{} ]

A domain-specific image classifier by fine-tuning ResNext50

a plankton dataset


library(torch)
library(torchvision)
train_ds <- image_folder_dataset(
  "Data_mining_with_R/data/Training set",
  transform = function(x) {
    x |> transform_to_tensor() |> 
      transform_resize(c(224, 224))
  })
test_ds <- image_folder_dataset(
  "Data_mining_with_R/data/Test sets",
  transform = function(x) {
    x |> transform_to_tensor() |> 
      transform_resize(c(224, 224))
  })


train_ds$.length()
[1] 2873
test_ds$.length()
[1] 4866

Pre-trained ResNext

resnext <- model_resnext50_32x4d(pretrained = TRUE)
resnext
An `nn_module` containing 25,028,904 parameters.

── Modules ─────────────────────────────────────────────────────────────────────
• conv1: <nn_conv2d> #9,408 parameters
• bn1: <nn_batch_norm2d> #128 parameters
• relu: <nn_relu> #0 parameters
• maxpool: <nn_max_pool2d> #0 parameters
• layer1: <nn_sequential> #205,824 parameters
• layer2: <nn_sequential> #1,197,056 parameters
• layer3: <nn_sequential> #7,022,592 parameters
• layer4: <nn_sequential> #14,544,896 parameters
• avgpool: <nn_adaptive_avg_pool2d> #0 parameters
• fc: <nn_linear> #2,049,000 parameters

Your turn: exercise 04

Complete 04_exercise to practice ResNext model fine-tuning.

07:00

Retraining the classification head

let's explore how the pre-trained network transforms the images

resnext_embedding <- resnext |> 
  nn_prune_head(1)
resnext_embedding
An `nn_module` containing 22,979,904 parameters.

── Modules ─────────────────────────────────────────────────────────────────────
• conv1: <nn_conv2d> #9,408 parameters
• bn1: <nn_batch_norm2d> #128 parameters
• relu: <nn_relu> #0 parameters
• maxpool: <nn_max_pool2d> #0 parameters
• layer1: <nn_sequential> #205,824 parameters
• layer2: <nn_sequential> #1,197,056 parameters
• layer3: <nn_sequential> #7,022,592 parameters
• layer4: <nn_sequential> #14,544,896 parameters
• avgpool: <nn_adaptive_avg_pool2d> #0 parameters
train_dl <- dataloader(train_ds, batch_size = 1)
iter <- dataloader_make_iter(train_dl)
img_class <- dataloader_next(iter)

resnext_embedding(img_class[[1]])$squeeze()
torch_tensor
 0.4956
 0.4499
 0.4132
 0.4217
 0.4663
 0.4511
 0.5266
 0.3839
 0.4600
 0.4451
 0.4153
 0.4536
 0.4079
 0.3616
 0.4893
 0.4373
 0.4254
 0.5138
 0.4923
 0.4400
 0.4237
 0.3764
 0.4559
 0.5589
 0.4151
 0.3832
 0.4681
 0.4256
 0.4372
 0.3886
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{2048} ][ grad_fn = <SqueezeBackward0> ]

an embedding tensor for the image, of size 2048… (in reality [1, 2048, 1, 1])

First step: a 6-class classification

resnext_classifier <- nn_module(
  "Resnext_classifier",
  initialize = function(num_class) {
    self$fc <- nn_linear(2048, num_class)
  },
  forward = function(x) {
    x %>% 
      resnext_embedding() %>% 
      torch_squeeze(4) %>%
      torch_squeeze(3) %>%
      self$fc()
  }
)
library(luz)
train_dl <- dataloader(train_ds, batch_size = 10)
test_dl <- dataloader(test_ds, batch_size = 10)

fitted <- resnext_classifier %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_ignite_adam,
    metrics = list(
      luz_metric_accuracy()
    )
  ) %>%
  set_hparams(num_class = 6) %>% 
  set_opt_hparams(lr = 0.003) %>%
  fit(train_dl, epochs = 3, valid_data = test_dl, verbose = TRUE)
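Because resnext_embedding is not registered with self$, its weights are not among the classifier's parameters, so the optimizer leaves them untouched; to also skip their gradient computation, they can be frozen explicitly (a small sketch):

# freeze every parameter of the pre-trained backbone
purrr::walk(
  resnext_embedding$parameters,
  function(p) p$requires_grad_(FALSE)
)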

Your turn: exercise 05

Complete 05_exercise to practice ResNext model fine-tuning.

07:00