# Evaluate by ID/group

When we have groups of observations (e.g. multiple rows per participant ID), we are sometimes more interested in the overall prediction for the group than in the individual observation-level predictions.

Say we have a dataset with 10 observations per participant and a model that predicts whether a participant has an autism diagnosis or not. While the model makes a prediction for each of the 10 observations, it is really the overall prediction for the participant that we are interested in.

`evaluate()` has two approaches to performing the evaluation at the ID level: averaging and voting.

## Averaging

In averaging, we simply average the predicted probabilities for the participant. This is the default approach as it maintains information about how certain our model is about its class prediction. That is, if all observations have a 60% predicted probability of an autism diagnosis, that should be considered differently than 90%.
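
For intuition, here is a minimal `dplyr` sketch of the averaging idea. `evaluate()` handles this for us, so the `preds` tibble and its column names below are purely illustrative:

```r
library(dplyr)

# Hypothetical predicted probabilities: two participants, three observations each
preds <- tibble(
  participant = rep(c(1, 2), each = 3),
  probability = c(0.62, 0.55, 0.58, 0.91, 0.88, 0.93)
)

# Average the probabilities per participant
# and apply a 0.5 cutoff to get the predicted class
preds %>%
  group_by(participant) %>%
  summarise(probability = mean(probability)) %>%
  mutate(`predicted class` = as.integer(probability > 0.5))
```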

## Voting

In voting, we simply count the predictions of each outcome class and assign the class with the most predictions to the participant.

If 7 out of 10 observations are predicted as having no autism diagnosis, that becomes the prediction for the participant.
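
Again for intuition only, a minimal sketch of majority voting; `evaluate()` implements this internally, so the `preds` tibble here is hypothetical:

```r
library(dplyr)

# Hypothetical predicted classes: 7 of 10 observations are class 0
preds <- tibble(
  participant = rep(1, 10),
  `predicted class` = c(rep(0, 7), rep(1, 3))
)

# Assign each participant their most frequent predicted class
# (table() counts the classes; which.max() picks the winner)
preds %>%
  group_by(participant) %>%
  summarise(`predicted class` = names(which.max(table(`predicted class`))))
```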

## ID evaluation with `evaluate()`

We will use the simple `participant.scores` dataset, as it has 3 rows per participant and a `diagnosis` column that we can evaluate predictions against. Let's add predicted probabilities and diagnoses and have a look at the first rows:

```r
library(cvms)
library(knitr)  # kable()
library(dplyr)
set.seed(74)

# Prepare dataset
data <- participant.scores %>% as_tibble()

# Add probabilities and predicted classes
data[["probability"]] <- runif(nrow(data))
data[["predicted diagnosis"]] <- ifelse(data[["probability"]] > 0.5, 1, 0)
```


| participant | age | diagnosis | score | session | probability | predicted diagnosis |
|:------------|----:|----------:|------:|--------:|------------:|--------------------:|
| 1 | 20 | 1 | 10 | 1 | 0.7046162 | 1 |
| 1 | 20 | 1 | 24 | 2 | 0.4800045 | 0 |
| 1 | 20 | 1 | 45 | 3 | 0.1960176 | 0 |
| 2 | 23 | 0 | 24 | 1 | 0.9369707 | 1 |
| 2 | 23 | 0 | 40 | 2 | 0.8698302 | 1 |
| 2 | 23 | 0 | 67 | 3 | 0.2140318 | 0 |
| 3 | 27 | 1 | 15 | 1 | 0.0240853 | 0 |
| 3 | 27 | 1 | 30 | 2 | 0.8547959 | 1 |
| 3 | 27 | 1 | 40 | 3 | 0.7027153 | 1 |
| 4 | 21 | 0 | 35 | 1 | 0.9579817 | 1 |

We tell `evaluate()` to aggregate the predictions by the `participant` column with the mean (averaging) method.

Note: It is assumed that the target class is constant within the IDs, i.e. that a participant has the same diagnosis in all their observations.
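
If in doubt, a quick check like the following (illustrative only, not part of `evaluate()`) can confirm that the assumption holds:

```r
# Sanity check: each participant should have exactly one unique diagnosis
data %>%
  group_by(participant) %>%
  summarise(num_diagnoses = n_distinct(diagnosis)) %>%
  filter(num_diagnoses > 1)  # should return zero rows
```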

```r
ev <- evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "probability",
  id_col = "participant",
  id_method = "mean",
  type = "binomial"
)
```

```r
ev
```

```
## # A tibble: 1 x 18
##   Balanced Accur…    F1 Sensitivity Specificity Pos Pred Value
##             <dbl> <dbl>       <dbl>       <dbl>          <dbl>
## 1           0.292 0.364       0.333        0.25            0.4
## # … with 13 more variables: Neg Pred Value <dbl>, AUC <dbl>, Lower
## #   CI <dbl>, Upper CI <dbl>, Kappa <dbl>, MCC <dbl>, Detection
## #   Rate <dbl>, Detection Prevalence <dbl>, Prevalence <dbl>,
## #   Predictions <list>, ROC <list>, Confusion Matrix <list>, Positive
## #   Class <chr>
```


The `Predictions` column contains the averaged predictions:

```r
ev$Predictions[[1]] %>% kable()
```

| Target | Prediction | SD | Predicted Class | participant | id_method |
|-------:|-----------:|---------:|----------------:|------------:|:----------|
| 1 | 0.4602128 | 0.2548762 | 0 | 1 | mean |
| 0 | 0.6736109 | 0.3994204 | 1 | 2 | mean |
| 1 | 0.5271988 | 0.4422946 | 1 | 3 | mean |
| 0 | 0.7974576 | 0.1703448 | 1 | 4 | mean |
| 1 | 0.5887699 | 0.4738221 | 1 | 5 | mean |
| 1 | 0.3526630 | 0.2302525 | 0 | 6 | mean |
| 1 | 0.2333758 | 0.1913763 | 0 | 7 | mean |
| 1 | 0.3956015 | 0.3207379 | 0 | 8 | mean |
| 0 | 0.3374361 | 0.0304785 | 0 | 9 | mean |
| 0 | 0.5988969 | 0.0675830 | 1 | 10 | mean |

Let's plot the confusion matrix as well:

```r
# Note: If ev had multiple rows, we would have to pass
# ev$`Confusion Matrix`[[1]] to plot the first row's confusion matrix
plot_confusion_matrix(ev)
```

We can have a better look at the metrics:

```r
ev_metrics <- select_metrics(ev)
ev_metrics %>% select(1:9) %>% kable(digits = 5)
```

| Balanced Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC | Lower CI | Upper CI |
|------------------:|---:|------------:|------------:|---------------:|---------------:|----:|---------:|---------:|
| 0.29167 | 0.36364 | 0.33333 | 0.25 | 0.4 | 0.2 | 0.20833 | 0 | 0.62475 |
```r
ev_metrics %>% select(10:14) %>% kable(digits = 5)
```

| Kappa | MCC | Detection Rate | Detection Prevalence | Prevalence |
|------:|----:|---------------:|---------------------:|-----------:|
| -0.4 | -0.40825 | 0.2 | 0.5 | 0.6 |
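
Since the evaluation output is a regular tibble, a single metric can also be pulled out directly:

```r
# Extract a single metric from the evaluation tibble
ev[["Balanced Accuracy"]]  # 0.29167, as in the table above
```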

### Using voting

We can use the majority (voting) method for the ID aggregation instead:

```r
ev_2 <- evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "probability",
  id_col = "participant",
  id_method = "majority",
  type = "binomial"
)
```

```r
ev_2
```

```
## # A tibble: 1 x 18
##   Balanced Accur…    F1 Sensitivity Specificity Pos Pred Value
##             <dbl> <dbl>       <dbl>       <dbl>          <dbl>
## 1           0.292 0.364       0.333        0.25            0.4
## # … with 13 more variables: Neg Pred Value <dbl>, AUC <dbl>, Lower
## #   CI <dbl>, Upper CI <dbl>, Kappa <dbl>, MCC <dbl>, Detection
## #   Rate <dbl>, Detection Prevalence <dbl>, Prevalence <dbl>,
## #   Predictions <list>, ROC <list>, Confusion Matrix <list>, Positive
## #   Class <chr>
```


Now the `Predictions` column looks as follows:

```r
ev_2$Predictions[[1]] %>% kable()
```

| Target | Prediction | Predicted Class | participant | id_method |
|-------:|-----------:|----------------:|------------:|:----------|
| 1 | 0 | 0 | 1 | majority |
| 0 | 1 | 1 | 2 | majority |
| 1 | 1 | 1 | 3 | majority |
| 0 | 1 | 1 | 4 | majority |
| 1 | 1 | 1 | 5 | majority |
| 1 | 0 | 0 | 6 | majority |
| 1 | 0 | 0 | 7 | majority |
| 1 | 0 | 0 | 8 | majority |
| 0 | 0 | 0 | 9 | majority |
| 0 | 1 | 1 | 10 | majority |

In this case, the `Predicted Class` column is identical to that in the averaging approach. We just don't have the probabilities to tell us how sure the model is about that prediction.

## Per model

If we have predictions from multiple models, we can group the data frame and get the results per model. Let's duplicate the dataset and change the predictions. We then combine the datasets and add a `model` column indicating which of the data frames each observation came from:

```r
# Duplicate data frame
data_2 <- data

# Change the probabilities and predicted classes
data_2[["probability"]] <- runif(nrow(data))
data_2[["predicted diagnosis"]] <- ifelse(data_2[["probability"]] > 0.5, 1, 0)

# Combine the two data frames
data_multi <- dplyr::bind_rows(data, data_2, .id = "model")

data_multi
```

```
## # A tibble: 60 x 8
##    model participant   age diagnosis score session probability predicted diagn…
##    <chr> <fct>       <dbl>     <dbl> <dbl>   <int>       <dbl>            <dbl>
##  1 1     1              20         1    10       1      0.705                 1
##  2 1     1              20         1    24       2      0.480                 0
##  3 1     1              20         1    45       3      0.196                 0
##  4 1     2              23         0    24       1      0.937                 1
##  5 1     2              23         0    40       2      0.870                 1
##  6 1     2              23         0    67       3      0.214                 0
##  7 1     3              27         1    15       1      0.0241                0
##  8 1     3              27         1    30       2      0.855                 1
##  9 1     3              27         1    40       3      0.703                 1
## 10 1     4              21         0    35       1      0.958                 1
## # … with 50 more rows
```

We can now group the data frame by the `model` column and run the evaluation again:

```r
ev_3 <- data_multi %>%
  dplyr::group_by(model) %>%
  evaluate(
    target_col = "diagnosis",
    prediction_cols = "probability",
    id_col = "participant",
    id_method = "mean",
    type = "binomial"
  )

ev_3
```

```
## # A tibble: 2 x 19
##   model Balanced Accur…    F1 Sensitivity Specificity Pos Pred Value
##   <chr>           <dbl> <dbl>       <dbl>       <dbl>          <dbl>
## 1 1               0.292 0.364       0.333        0.25            0.4
## 2 2               0.375 0.5         0.5          0.25            0.5
## # … with 13 more variables: Neg Pred Value <dbl>, AUC <dbl>, Lower
## #   CI <dbl>, Upper CI <dbl>, Kappa <dbl>, MCC <dbl>, Detection
## #   Rate <dbl>, Detection Prevalence <dbl>, Prevalence <dbl>,
## #   Predictions <list>, ROC <list>, Confusion Matrix <list>, Positive
## #   Class <chr>
```

The predictions for the second model look as follows:

```r
ev_3$Predictions[[2]] %>% kable()
```

| model | Target | Prediction | SD | Predicted Class | participant | id_method |
|:------|-------:|-----------:|---------:|----------------:|------------:|:----------|
| 2 | 1 | 0.3302017 | 0.3002763 | 0 | 1 | mean |
| 2 | 0 | 0.6040242 | 0.2854935 | 1 | 2 | mean |
| 2 | 1 | 0.7342651 | 0.2653166 | 1 | 3 | mean |
| 2 | 0 | 0.6383918 | 0.3799305 | 1 | 4 | mean |
| 2 | 1 | 0.4551732 | 0.3417810 | 0 | 5 | mean |
| 2 | 1 | 0.6808281 | 0.3626166 | 1 | 6 | mean |
| 2 | 1 | 0.4536740 | 0.3784584 | 0 | 7 | mean |
| 2 | 1 | 0.6281501 | 0.4506029 | 1 | 8 | mean |
| 2 | 0 | 0.7000411 | 0.1490745 | 1 | 9 | mean |
| 2 | 0 | 0.4630344 | 0.4344227 | 0 | 10 | mean |

## In 'gaussian' evaluation

This kind of ID aggregation is also available for the `'gaussian'` evaluation (e.g. for linear regression models), although only with the averaging approach. Again, it is assumed that the target value is constant within each ID, like the `age` column in our dataset.

We add a predicted age column to our initial dataset:

data[["predicted age"]] <- sample(20:45, size = 30, replace = TRUE)


We evaluate the predicted age, aggregated by participant:

```r
ev_4 <- evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "predicted age",
  id_col = "participant",
  id_method = "mean",
  type = "gaussian"
)
```

```r
ev_4
```

```
## # A tibble: 1 x 7
##    RMSE   MAE NRMSE(IQR)  RRSE   RAE RMSLE Predictions
##   <dbl> <dbl>      <dbl> <dbl> <dbl> <dbl> <list>
## 1  10.3   8.7      0.984  1.48  1.45 0.340 <tibble [10 × 5]>
```


The `Predictions` column looks as follows:

```r
ev_4$Predictions[[1]] %>% kable()
```

| Target | Prediction | SD | participant | id_method |
|-------:|-----------:|----------:|------------:|:----------|
| 20 | 35.66667 | 8.326664 | 1 | mean |
| 23 | 33.33333 | 10.214369 | 2 | mean |
| 27 | 35.33333 | 5.686241 | 3 | mean |
| 21 | 30.00000 | 4.582576 | 4 | mean |
| 32 | 28.66667 | 5.507570 | 5 | mean |
| 31 | 43.33333 | 1.154700 | 6 | mean |
| 43 | 39.00000 | 5.196152 | 7 | mean |
| 21 | 40.33333 | 2.516611 | 8 | mean |
| 34 | 35.33333 | 5.507570 | 9 | mean |
| 32 | 35.33333 | 7.571878 | 10 | mean |

On average, we predict participant 1 to have an age of 35.67.
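
As a quick sanity check (nothing `evaluate()` requires), we can reproduce both that average and the RMSE above by aggregating manually:

```r
# Average the predicted ages per participant
# and compare to the (constant) target age
aggregated <- data %>%
  group_by(participant) %>%
  summarise(
    target = first(age),
    prediction = mean(`predicted age`)
  )

aggregated$prediction[[1]]                                 # ~35.67 for participant 1
sqrt(mean((aggregated$target - aggregated$prediction)^2))  # RMSE, ~10.3
```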

## Results for each ID

If our targets are not constant within the IDs, we might instead be interested in the ID-level evaluation, e.g. how well the model predicted the score for each of the participants.

We add a predicted score column to our dataset:

data[["predicted score"]] <- round(runif(30, 10, 81))


Now, we group the data frame by the `participant` column and evaluate the predicted scores:

```r
data %>%
  dplyr::group_by(participant) %>%
  evaluate(
    target_col = "score",
    prediction_cols = "predicted score",
    type = "gaussian"
  )
```

```
## # A tibble: 10 x 8
##    participant  RMSE   MAE NRMSE(IQR)  RRSE   RAE RMSLE Predictions
##    <fct>       <dbl> <dbl>      <dbl> <dbl> <dbl> <dbl> <list>
##  1 1           13.8  13.7       0.787 0.957 1.10  0.683 <tibble [3 × 3]>
##  2 2           32.4  26.3       1.50  1.82  1.69  0.946 <tibble [3 × 3]>
##  3 3           12.8  10.7       1.03  1.25  1.2   0.549 <tibble [3 × 3]>
##  4 4            9.15  7.67      0.425 0.513 0.486 0.154 <tibble [3 × 3]>
##  5 5           24.1  17.3       1.27  1.47  1.15  0.566 <tibble [3 × 3]>
##  6 6           34.2  33.3       4.27  5.12  5.56  0.895 <tibble [3 × 3]>
##  7 7           44.7  40         2.98  3.45  3.33  1.21  <tibble [3 × 3]>
##  8 8            9.80  8         0.700 0.854 0.818 0.306 <tibble [3 × 3]>
##  9 9           22.6  21.3       1.37  1.66  1.81  0.447 <tibble [3 × 3]>
## 10 10          29.3  28         1.13  1.38  1.62  0.556 <tibble [3 × 3]>
```


Participant 4 has the lowest prediction error while participant 7 has the highest.

This approach is similar to what the `most_challenging()` function does:

```r
# Extract the ~20% of observations with the highest prediction error
most_challenging(
  data = data,
  type = "gaussian",
  obs_id_col = "participant",
  target_col = "score",
  prediction_cols = "predicted score",
  threshold = 0.20
)
```

```
## # A tibble: 2 x 4
##   participant   MAE  RMSE    >=
##   <fct>       <dbl> <dbl> <dbl>
## 1 7            40    44.7  32.7
## 2 6            33.3  34.2  32.7
```
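
The `>=` column holds the error cutoff: with `threshold = 0.20`, the participants whose error is at or above the 80th percentile are returned. A rough `dplyr` sketch of the same filtering (assuming RMSE is the ranking metric; `most_challenging()` may differ in its details):

```r
# Rough sketch: rank participants by RMSE and keep those at or
# above the 80th percentile error cutoff (~32.7 here)
data %>%
  group_by(participant) %>%
  summarise(RMSE = sqrt(mean((score - `predicted score`)^2))) %>%
  filter(RMSE >= quantile(RMSE, 1 - 0.20))
```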