{torch} {tabnet} and friends

HappyR workshop, September 2024

Christophe Regouby

Agenda

Getting started

{torch}

mlverse

{tabnet}

{tabnet} for regression with missing values

{tabnet} for hierarchical classification

GPT2 with R

Fine-tuning GPT2 in French with a LoRA

An image classifier with ResNext50 fine-tuning

Getting started

Licensing



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Checklist


R & RStudio installed?

     I’m on R 4.4.1 and RStudio 2024.08.0 build 351

{torch} installed?

     torch::torch_is_installed()
     Your system is ready!

Any {torch} device?

     torch::backends_xxxx_is_available()
     Your system has power!
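For example (a non-exhaustive sketch; which checks are relevant depends on your hardware and {torch} version):

```{r}
#| eval: false
library(torch)
cuda_is_available()          # NVIDIA GPU via CUDA
backends_mps_is_available()  # Apple Silicon GPU via MPS
backends_mkl_is_available()  # Intel MKL acceleration on CPU
```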

Other resources

{torch}

{torch}: why reinvent the wheel?

-   easy installation on CPU, GPU, MPS, ...

-   lightweight installation

-   the comfort of RStudio for developing, debugging, visualizing

-   the comfort of R's 1-based indexing

-   the quality of the Posit AI blog posts

-   the package ecosystem

-   plenty of opportunities to contribute

    ![](images/clipboard-872398520.png)

Installation

Standard
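A minimal sketch of the standard route, assuming a plain CPU setup (the backend binaries are downloaded on first use):

```{r}
#| eval: false
install.packages("torch")
library(torch)
# download the libtorch / lantern binaries if not already present
install_torch()
```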

Advanced

Expert: air-gapped server

Expert: debugging

Sys.setenv(TORCH_INSTALL_DEBUG = 1)
install_torch()

?install_torch()

The software stack

Tensor manipulation

library(torch)
tt <- torch_rand(2, 3, 4)
tt
torch_tensor
(1,.,.) = 
  0.5113  0.7240  0.2231  0.7171
  0.2707  0.1827  0.6711  0.1541
  0.6131  0.9882  0.3075  0.5069

(2,.,.) = 
  0.3541  0.0129  0.2919  0.0758
  0.7849  0.1660  0.5629  0.1750
  0.4279  0.3991  0.9372  0.5699
[ CPUFloatType{2,3,4} ]

tt[, 2:N, ]
torch_tensor
(1,.,.) = 
  0.2707  0.1827  0.6711  0.1541
  0.6131  0.9882  0.3075  0.5069

(2,.,.) = 
  0.7849  0.1660  0.5629  0.1750
  0.4279  0.3991  0.9372  0.5699
[ CPUFloatType{2,2,4} ]
tt[1, 2:N, ]
torch_tensor
 0.2707  0.1827  0.6711  0.1541
 0.6131  0.9882  0.3075  0.5069
[ CPUFloatType{2,4} ]
tt[1:1, 2:N, ]
torch_tensor
(1,.,.) = 
  0.2707  0.1827  0.6711  0.1541
  0.6131  0.9882  0.3075  0.5069
[ CPUFloatType{1,2,4} ]
torch_squeeze(tt[1:1, 2:N, ])
torch_tensor
 0.2707  0.1827  0.6711  0.1541
 0.6131  0.9882  0.3075  0.5069
[ CPUFloatType{2,4} ]

Your turn, exercise 01

Installation: `00_installation.R`

02:00

Exercise: `01_exercice.R`

05:00

mlverse

A universe of 📦 dedicated to {torch}

A universe of 📦 in French

Packages available in French

| package       | messages | help / vignettes        |
|---------------|----------|-------------------------|
| {torch}       | 1        |                         |
| {torchvision} |          | cregouby/torchvision.fr |
| {tabnet}      |          | cregouby/tabnet.fr      |
| {luz}         |          | cregouby/luz.fr         |
| {hfhub}       |          | cregouby/hfhub.fr       |
| {tok}         |          | cregouby/tok.fr         |
| {safetensors} |          | cregouby/safetensors.fr |
Sys.setLanguage(lang = "fr")
library(torchvision)
# the error message is now displayed in French (std = 0 is invalid on purpose)
transform_normalize(torch::torch_rand(c(3, 5, 5)), mean = 3, std = 0)
Sys.setenv(LANGUAGE = "fr")
library(torchvision.fr)  # companion package providing translated help pages
library(torchvision)
# the help page now shows up in French
?transform_normalize

{tabnet}

v0.6.0 is on CRAN

How it works

Integrated use within tidymodels
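A minimal sketch of that integration, assuming the `tabnet()` parsnip specification shipped with {tabnet} (engine `"torch"`) and the `ames_rec` recipe defined a few slides later:

```{r}
#| eval: false
library(tidymodels)
library(tabnet)

# parsnip model specification
tabnet_spec <- tabnet(epochs = 50, batch_size = 128) |>
  set_engine("torch") |>
  set_mode("regression")

# combine the recipe and the model in a workflow, then fit
tabnet_wflow <- workflow() |>
  add_recipe(ames_rec) |>   # defined on the following slides
  add_model(tabnet_spec)

tabnet_fitted <- tabnet_wflow |> fit(data = ames)
```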

Dataset

library(tidymodels, quietly = TRUE)
data("ames", package = "modeldata")
str(ames)
tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame)
 $ MS_SubClass       : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ...
 $ MS_Zoning         : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ Lot_Frontage      : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
 $ Lot_Area          : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
 $ Street            : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley             : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Lot_Shape         : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ...
 $ Land_Contour      : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
 $ Utilities         : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Lot_Config        : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
 $ Land_Slope        : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood      : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ...
 $ Condition_1       : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ Condition_2       : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Bldg_Type         : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
 $ House_Style       : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ...
 $ Overall_Cond      : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ...
 $ Year_Built        : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
 $ Year_Remod_Add    : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
 $ Roof_Style        : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
 $ Roof_Matl         : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior_1st      : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
 $ Exterior_2nd      : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
 $ Mas_Vnr_Type      : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...
 $ Mas_Vnr_Area      : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
 $ Exter_Cond        : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation        : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
 $ Bsmt_Cond         : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...
 $ Bsmt_Exposure     : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...
 $ BsmtFin_Type_1    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ...
 $ BsmtFin_SF_1      : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ...
 $ BsmtFin_Type_2    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ...
 $ BsmtFin_SF_2      : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
 $ Bsmt_Unf_SF       : num [1:2930] 441 270 406 1045 137 ...
 $ Total_Bsmt_SF     : num [1:2930] 1080 882 1329 2110 928 ...
 $ Heating           : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Heating_QC        : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ...
 $ Central_Air       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical        : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ First_Flr_SF      : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
 $ Second_Flr_SF     : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
 $ Gr_Liv_Area       : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
 $ Bsmt_Full_Bath    : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
 $ Bsmt_Half_Bath    : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
 $ Full_Bath         : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
 $ Half_Bath         : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
 $ Bedroom_AbvGr     : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
 $ Kitchen_AbvGr     : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
 $ TotRms_AbvGrd     : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
 $ Functional        : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Fireplaces        : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
 $ Garage_Type       : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Garage_Finish     : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ...
 $ Garage_Cars       : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
 $ Garage_Area       : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
 $ Garage_Cond       : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Paved_Drive       : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...
 $ Wood_Deck_SF      : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
 $ Open_Porch_SF     : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
 $ Enclosed_Porch    : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
 $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
 $ Screen_Porch      : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
 $ Pool_Area         : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
 $ Pool_QC           : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Fence             : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ...
 $ Misc_Feature      : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...
 $ Misc_Val          : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
 $ Mo_Sold           : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
 $ Year_Sold         : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ Sale_Type         : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Sale_Condition    : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Sale_Price        : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
 $ Longitude         : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ...
 $ Latitude          : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ...

Recipe

ames <- ames |> mutate(Sale_Price = log10(Sale_Price))
ames_rec <- recipe(Sale_Price ~ ., data=ames) |> 
  step_normalize(all_numeric(), -all_outcomes()) 

Pre-training

library(tabnet)
ames_pretrain <- tabnet_pretrain(
  ames_rec, data=ames,  epoch=50, cat_emb_dim = 1,
  valid_split = 0.2, verbose=TRUE, 
  early_stopping_patience = 3L, 
  early_stopping_tolerance = 1e-4
)
# model diagnostic
autoplot(ames_pretrain)

Training

ames_fit <- tabnet_fit(ames_rec, data=ames,  tabnet_model = ames_pretrain, 
                       epoch=50, cat_emb_dim = 1, 
                       valid_split = 0.2, verbose=TRUE, batch_size = 2930, 
                       early_stopping_patience = 5L, 
                       early_stopping_tolerance = 1e-4)
# model diagnostic
autoplot(ames_fit)

Prediction

predict(ames_fit, ames)
# A tibble: 2,930 × 1
   .pred
   <dbl>
 1  2.14
 2  3.97
 3  1.76
 4  2.92
 5  3.86
 6  3.05
 7  3.36
 8  2.26
 9  2.05
10  2.24
# ℹ 2,920 more rows
metrics <- metric_set(rmse, rsq, ccc)
cbind(ames, predict(ames_fit, ames)) |> 
  metrics(Sale_Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      2.39  
2 rsq     standard      0.0461
3 ccc     standard     -0.0114
# variable importance
vip::vip(ames_fit)

Explainability

ames_explain <- tabnet::tabnet_explain(ames_fit, ames)
# variable importance
autoplot(ames_explain, quantile = 0.99)

Your turn, exercise 02

Complete 02_exercise to practice tabnet model training.

07:00

{tabnet} for missing values

Back to the Ames dataset

  • tensors cannot contain missing values.

  • ames conveniently comes without any missing values.

What is the pool area when there is no pool?

data("ames", package = "modeldata")
qplot(ames$Mas_Vnr_Area)

How can the model capture this distribution?

What if we apply this to all the columns?

col_with_zero_as_na <- ames |>
  select(where(is.numeric)) |>
  select(matches("_SF|Area|Misc_Val|[Pp]orch$")) |>
  summarise(across(everything(), min)) |>
  select(where(~ .x == 0)) |>
  names()
ames_missing <- ames |>
  mutate(across(all_of(col_with_zero_as_na), ~ na_if(.x, 0))) |>
  mutate(Alley = na_if(Alley, "No_Alley_Access")) |>
  mutate(Fence = na_if(Fence, "No_Fence")) |>
  mutate(across(c(Garage_Cond, Garage_Finish), ~ na_if(.x, "No_Garage"))) |>
  mutate(across(c(Bsmt_Exposure, BsmtFin_Type_1, BsmtFin_Type_2), ~ na_if(.x, "No_Basement")))

visdat::vis_miss(ames_missing)

Recipe

ames_missing <- ames_missing |> mutate(Sale_Price = log10(Sale_Price))
ames_missing_rec <- recipe(Sale_Price ~ ., data=ames_missing) |> 
  step_normalize(all_numeric(), -all_outcomes()) 

Pre-training

library(tabnet)
ames_missing_pretrain <- tabnet_pretrain(
  ames_missing_rec, data=ames_missing,  epoch=50, cat_emb_dim = 1,  valid_split = 0.2,
  verbose=TRUE,   early_stopping_patience = 3L,   early_stopping_tolerance = 1e-4
)
# model diagnostic
autoplot(ames_missing_pretrain)

Training

ames_missing_fit <- tabnet_fit(
  ames_missing_rec,   data = ames_missing,
  tabnet_model = ames_missing_pretrain,
  epoch = 50,  cat_emb_dim = 1,  valid_split = 0.2,
  verbose = TRUE, batch_size = 2930,
  early_stopping_patience = 5L,
  early_stopping_tolerance = 1e-4
)
# model diagnostic
autoplot(ames_missing_fit)

Prediction

predict(ames_missing_fit, ames_missing)
# A tibble: 2,930 × 1
   .pred
   <dbl>
 1  3.41
 2  5.07
 3  2.64
 4  3.29
 5  3.32
 6  3.09
 7  4.50
 8  2.02
 9  2.58
10  2.56
# ℹ 2,920 more rows
metrics <- metric_set(rmse, rsq, ccc)
cbind(ames_missing, predict(ames_missing_fit, ames_missing)) |> 
  metrics(Sale_Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      2.00  
2 rsq     standard      0.0286
3 ccc     standard     -0.0150

Variable importance

# original ames
vip_color(ames_pretrain, col_with_missings)

vip_color(ames_fit, col_with_missings)

# ames with missing values
vip_color(ames_missing_pretrain, col_with_missings)

vip_color(ames_missing_fit, col_with_missings)
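`vip_color()` and `col_with_missings` come from the workshop setup code and are not shown here. A purely hypothetical re-implementation, assuming the helper plots variable importance while highlighting the columns that were recoded to NA:

```{r}
#| eval: false
# hypothetical sketch, not the workshop's actual helper
vip_color <- function(fit, highlight_cols, n = 20) {
  vip::vi(fit) |>
    dplyr::slice_max(Importance, n = n) |>
    dplyr::mutate(recoded = Variable %in% highlight_cols) |>
    ggplot2::ggplot(ggplot2::aes(reorder(Variable, Importance), Importance, fill = recoded)) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::labs(x = NULL, fill = "recoded as NA")
}
```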

Explainability

ames_missing_explain <- tabnet::tabnet_explain(ames_missing_fit, ames_missing)
# variable importance
autoplot(ames_missing_explain, quantile = 0.99, type="step")

{tabnet} with a hierarchical outcome

  • {tabnet} accepts categorical outcome variables, multi-label and multi-class.

  • what if we could enforce a constraint between the classes of the different labels?

  • the dataset must be a data.tree node, built with data.tree::as.Node()

    • convert the training set and the test set with as.Node() before calling the tabnet_ functions
    • convert back with node_to_df()
  • new in version 0.5.0

Example with starwars

library(data.tree)
data(starwars, package = "dplyr")
head(starwars, 4)
# A tibble: 4 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

We build the outcome variable as a string with / separators in a "pathString" column (wrong first attempt)

starwars_tree <- starwars |> 
  mutate(pathString = paste("StarWars_characters", species, sex, `name`, sep = "/"))  |> 
  as.Node()
print(starwars_tree, "name","height", "mass", "eye_color", limit = 8)
                           levelName                    name height mass
1  StarWars_characters                   StarWars_characters      4   NA
2   ¦--Human                                           Human      3   NA
3   ¦   ¦--male                                         male      2   NA
4   ¦   ¦   ¦--Luke Skywalker                 Luke Skywalker      1   77
5   ¦   ¦   ¦--Darth Vader                       Darth Vader      1  136
6   ¦   ¦   ¦--Owen Lars                           Owen Lars      1  120
7   ¦   ¦   ¦--Biggs Darklighter           Biggs Darklighter      1   84
8   ¦   ¦   °--... 22 nodes w/ 0 sub   ... 22 nodes w/ 0 sub      1   NA
9   ¦   °--... 1 nodes w/ 31 sub       ... 1 nodes w/ 31 sub      1   NA
10  °--... 37 nodes w/ 123 sub       ... 37 nodes w/ 123 sub      1   NA
   eye_color
1           
2           
3           
4       blue
5     yellow
6       blue
7      brown
8           
9           
10          

But there are rules on names and types

  • do not use the {data.tree} internal names:
    • name and height are forbidden
    • like all the names in NODE_RESERVED_NAMES_CONST. (They would be dropped during the conversion.)
  • no factor() columns
  • no column named level_*
  • the last level of the hierarchy must be the individual (hence a unique id)
  • the hierarchy must have a root

Correct construction of the "pathString" outcome variable

starwars_tree <- starwars |>
  rename(`_name` = "name", `_height` = "height") |> 
  mutate(pathString = paste("StarWars_characters", species, sex, `_name`, sep = "/"))  |> 
  as.Node()
print(starwars_tree, "name", "_name","_height", "mass", "eye_color", limit = 8)
                           levelName                    name             _name
1  StarWars_characters                   StarWars_characters                  
2   ¦--Human                                           Human                  
3   ¦   ¦--male                                         male                  
4   ¦   ¦   ¦--Luke Skywalker                 Luke Skywalker    Luke Skywalker
5   ¦   ¦   ¦--Darth Vader                       Darth Vader       Darth Vader
6   ¦   ¦   ¦--Owen Lars                           Owen Lars         Owen Lars
7   ¦   ¦   ¦--Biggs Darklighter           Biggs Darklighter Biggs Darklighter
8   ¦   ¦   °--... 22 nodes w/ 0 sub   ... 22 nodes w/ 0 sub                  
9   ¦   °--... 1 nodes w/ 31 sub       ... 1 nodes w/ 31 sub                  
10  °--... 37 nodes w/ 123 sub       ... 37 nodes w/ 123 sub                  
   _height mass eye_color
1       NA   NA          
2       NA   NA          
3       NA   NA          
4      172   77      blue
5      202  136    yellow
6      178  120      blue
7      183   84     brown
8       NA   NA          
9       NA   NA          
10      NA   NA          

Initial split and construction

starwars has list() columns that must be unnested

starw_split <- starwars |> 
  tidyr::unnest_longer(films) |> 
  tidyr::unnest_longer(vehicles, keep_empty = TRUE) |> 
  tidyr::unnest_longer(starships, keep_empty = TRUE) |> 
  initial_split( prop = .8, strata = "species")
starwars_train_tree <- starw_split |> 
  training() |>
  rename(`_name` = "name", `_height` = "height") |>
  rowid_to_column() |>
  mutate(pathString = paste("StarWars_characters", species, sex, rowid, sep = "/")) |>
  # remove outcomes labels from predictors
  select(-species, -sex, -`_name`, -rowid) |>
  # turn it as hierarchical Node
  as.Node()

starwars_test_tree <- starw_split |>
  testing() |>
  rename(`_name` = "name", `_height` = "height") |>
  rowid_to_column() |>
  mutate(pathString = paste("StarWars_characters", species, sex, rowid, sep = "/")) |>
  select(-species, -sex, -`_name`, -rowid) |>
  as.Node()

The Node's $attributesAll will be the predictors:

starwars_train_tree$attributesAll
 [1] "_height"    "birth_year" "eye_color"  "films"      "gender"    
 [6] "hair_color" "homeworld"  "mass"       "skin_color" "starships" 
[11] "vehicles"  

Model training

```{r}
#| echo: true
#| label: "starwars fit"

config <- tabnet_config(
  decision_width = 8,
  attention_width = 8,
  num_steps = 3,
  penalty = .003,
  cat_emb_dim = 2,
  valid_split = 0.2,
  learn_rate = 1e-3,
  lr_scheduler = "reduce_on_plateau",
  early_stopping_monitor = "valid_loss",
  early_stopping_patience = 4,
  verbose = FALSE
)

starw_model <- tabnet_fit(starwars_train_tree, config = config, epoch = 75, checkpoint_epochs = 15)
```

Diagnostics

```{r}
#| echo: true
#| label: "starwars diag"

autoplot(starw_model)
```
```{r}
#| echo: true
#| label: "starwars vip"

vip::vip(starw_model)
```

Inference with the hierarchical model

```{r}
#| echo: true
#| label: "starwars inference"
starwars_hat <- bind_cols(
    predict(starw_model, starwars_test_tree),
    node_to_df(starwars_test_tree)$y
  )
tail(starwars_hat, n = 5)
```

GPT2 with R

based on 4 packages: {minhub}, {hfhub}, {tok}, {safetensors}

  • {minhub}: a repository of classic neural network architectures for {torch}
  • {hfhub}: access to pre-trained model downloads from the Hugging Face hub
  • {tok}: an R wrapper around the Hugging Face tokenizers
  • {safetensors}: reading and writing tensor data in the .safetensors format

Downloading the model and its weights

library(minhub)
identifier <- "gpt2"
revision <- "e7da7f2"
# instantiate model and load Hugging Face weights
model <- gpt2_from_pretrained(identifier, revision)
# load matching tokenizer
tok <- tok::tokenizer$from_pretrained(identifier)
model$eval()

Tokenizing the sentence

text <- paste("Quel plaisir de participer aux ateliers HappyR !",
              "Vivement le prochain évènement")

# encode the text into token ids, then shape it as a (1, n_tokens) batch
idx <- torch_tensor(tok$encode(text)$ids)$view(c(1, -1))
idx

Generating from the prompt

Generation is an iterative process: each prediction of the model is appended to the prompt, which keeps growing.

Let's add 30 tokens to it:

prompt_length <- idx$size(-1)

for (i in 1:30) { # decide on maximal length of output sequence
  # obtain next prediction (raw score)
  with_no_grad({
    logits <- model(idx + 1L) # +1 because tokenizer ids are 0-based while torch for R indexes from 1
  })
  last_logits <- logits[, -1, ]
  # pick highest scores (how many is up to you)
  c(prob, ind) %<-% last_logits$topk(50)
  last_logits <- torch_full_like(last_logits, -Inf)$scatter_(-1, ind, prob)
  # convert to probabilities
  probs <- nnf_softmax(last_logits, dim = -1)
  # probabilistic sampling
  id_next <- torch_multinomial(probs, num_samples = 1) - 1L
  # stop if end of sequence predicted
  if (id_next$item() == 0) {
    break
  }
  # append prediction to prompt
  idx <- torch_cat(list(idx, id_next), dim = 2)
}

Decoding the resulting tokens

tok$decode(as.integer(idx))

Fine-tuning with LoRA

Are LLMs dispossessing the data scientist?

  • ever larger networks imply prohibitively expensive training runs

  • the promise that the next version will fix the weaknesses

  • the reference dataset is hard to put together

LoRA to the rescue

Low Rank Adaptation

Method

The problem of fine-tuning a neural network can be expressed by finding a \(\Delta \Theta\) that minimizes \(L(X, y; \Theta_0 + \Delta\Theta)\) where \(L\) is a loss function, \(X\) and \(y\) are the data and \(\Theta_0\) the weights from a pre-trained model.

We learn the parameters \(\Delta \Theta\), whose dimension \(|\Delta \Theta|\) equals \(|\Theta_0|\). When \(|\Theta_0|\) is very large, such as in large-scale pre-trained models, finding \(\Delta \Theta\) becomes computationally challenging. Also, for each task you need to learn a new \(\Delta \Theta\) parameter set, making it even more challenging to deploy fine-tuned models if you have more than a few specific tasks. LoRA proposes using an approximation \(\Delta \Phi \approx \Delta \Theta\) with \(|\Delta \Phi| \ll |\Delta \Theta|\). The observation is that neural nets have many dense layers performing matrix multiplication, and while they typically have full rank during pre-training, when adapting to a specific task the weight updates will have a low “intrinsic dimension”.

A simple matrix decomposition is applied for each weight matrix update \(\Delta \theta \in \Delta \Theta\). Considering \(\Delta \theta_i \in \mathbb{R}^{d \times k}\) the update for the \(i\)th weight in the network, LoRA approximates it with:

\[\Delta \theta_i \approx \Delta \phi_i = BA\] where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\) and the rank \(r \ll \min(d, k)\). Thus instead of learning \(d \times k\) parameters we now need to learn \((d + k) \times r\), which is much smaller since the count is additive in \(d\) and \(k\) rather than multiplicative. In practice, \(\Delta \theta_i\) is scaled by \(\frac{\alpha}{r}\) before being added to \(\theta_i\), which can be interpreted as a ‘learning rate’ for the LoRA update.

LoRA does not increase inference latency: once fine-tuning is done, you can simply update the weights in \(\Theta\) by adding their respective \(\Delta \theta \approx \Delta \phi\). It also makes it simpler to deploy multiple task-specific models on top of one large model, as \(|\Delta \Phi|\) is much smaller than \(|\Delta \Theta|\).
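To make the parameter savings concrete, a tiny worked example using the dimensions of the toy model in the next section (d = 1001, k = 1000) and a hypothetical rank r = 2:

```{r}
#| eval: false
d <- 1001; k <- 1000; r <- 2

d * k        # full fine-tuning: 1,001,000 parameters to learn
(d + k) * r  # LoRA update (B is d x r, A is r x k): 4,002 parameters
```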

Implementation with torch

We simulate a dataset following a \(y = X \theta\) model, with \(\theta \in \mathbb{R}^{1001 \times 1000}\).

library(torch)

n <- 10000
d_in <- 1001
d_out <- 1000

thetas <- torch_randn(d_in, d_out)

X <- torch_randn(n, d_in)
y <- torch_matmul(X, thetas)

We train a model to estimate \(\theta\): this is our pre-trained model.

model <- nn_linear(d_in, d_out, bias = FALSE)
train <- function(model, X, y, batch_size = 128, epochs = 100) {
  opt <- optim_adam(model$parameters)

  for (epoch in 1:epochs) {
    for(i in seq_len(n/batch_size)) {
      idx <- sample.int(n, size = batch_size)
      loss <- nnf_mse_loss(model(X[idx,]), y[idx])
      
      with_no_grad({
        opt$zero_grad()
        loss$backward()
        opt$step()  
      })
    }
    
    if (epoch %% 10 == 0) {
      with_no_grad({
        loss <- nnf_mse_loss(model(X), y)
      })
      cat("[", epoch, "] Loss:", loss$item(), "\n")
    }
  }
}

We train the model

train(model, X, y)
[ 10 ] Loss: 576.074 
[ 20 ] Loss: 311.5254 
[ 30 ] Loss: 154.5969 
[ 40 ] Loss: 68.1488 
[ 50 ] Loss: 25.47743 
[ 60 ] Loss: 7.52371 
[ 70 ] Loss: 1.574381 
[ 80 ] Loss: 0.2002759 
[ 90 ] Loss: 0.01300205 
[ 100 ] Loss: 0.0004028342 

We simulate a different data distribution by applying a transformation to \(\theta\)

thetas2 <- thetas + 1

X2 <- torch_randn(n, d_in)
y2 <- torch_matmul(X2, thetas2)
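A minimal sketch of the LoRA idea applied to this toy model, assuming the pre-trained weights are frozen and only the low-rank matrices A and B are learned on the new distribution:

```{r}
#| eval: false
lora_linear <- nn_module(
  "lora_linear",
  initialize = function(pretrained, r = 1, alpha = 1) {
    self$pretrained <- pretrained
    # freeze the pre-trained weights: only A and B get gradients
    for (p in self$pretrained$parameters) p$requires_grad_(FALSE)
    d <- pretrained$weight$shape[2]   # input dimension
    k <- pretrained$weight$shape[1]   # output dimension
    self$A <- nn_parameter(torch_randn(d, r) * 0.01)
    self$B <- nn_parameter(torch_zeros(r, k))  # zero init: the update starts at zero
    self$scaling <- alpha / r
  },
  forward = function(x) {
    self$pretrained(x) +
      torch_matmul(torch_matmul(x, self$A), self$B) * self$scaling
  }
)

# wrap the model trained above and fine-tune only the adapter on the shifted data
model_lora <- lora_linear(model, r = 1)
train(model_lora, X2, y2)
```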

A domain-specific image classifier by fine-tuning ResNext50