# Evaluate your Training
After you have validated your training and found an epoch you want to check in more detail, the `evaluate_model.py` script comes into play. It evaluates the given model with the samples defined in `test_files` and writes the results to a `results/` folder inside the model folder. The results can then be visualised/plotted using the `plotting_umami.py` script. A detailed explanation of how to use this script is given here.
## Config
The important part for the evaluation is the `evaluation_settings` section, which holds all options used to evaluate your model. The different options are shown and explained in the following example. Some options are only available for specific taggers; these can be found in their respective subsections.
```yaml
evaluation_settings:
  # Number of jets used for evaluation
  n_jets: 3e5

  # Define taggers that are used for comparison in evaluate_model
  # This can be a list or a string for only one tagger
  tagger: ["rnnip", "DL1r"]

  # Define fc values for the taggers
  frac_values_comp:
    {
      "rnnip": {"cjets": 0.07, "ujets": 0.93},
      "DL1r": {"cjets": 0.018, "ujets": 0.982},
    }

  # Charm fraction value used for evaluation of the trained model
  frac_values: {"cjets": 0.005, "ujets": 0.995}

  # Working point used in the evaluation
  working_point: 0.77
```
| Options | Data Type | Necessary, Optional | Explanation |
|---|---|---|---|
| `results_filename_extension` | `str` | Optional | String which is added to the filenames of the several files created when evaluating. This allows re-evaluating without overwriting old results. Make sure you specify the `evaluation_file` when plotting the corresponding results, otherwise the plotting script will look for files without the extension. |
| `n_jets` | `int` | Necessary | Number of jets per sample used for evaluation. |
| `tagger` | `list` | Necessary | List of taggers used for comparison. This needs to be a list of `str` or a single `str`. The names of the taggers must be the same as in the evaluation file. For example, if the DL1d probabilities in the test samples are called `DL1dLoose20210607_pb`, the name you need to add to the list is `DL1dLoose20210607`. |
| `frac_values_comp` | `dict` | Necessary | `dict` with the fraction values for the comparison taggers. For all flavours (except the main flavour), you need to add values here which add up to one. |
| `frac_values` | `dict` | Necessary | `dict` with the fraction values for the freshly trained tagger. For all flavours (except the main flavour), you need to add values here which add up to one. |
| `working_point` | `float` | Necessary | Working point which is used in the evaluation. In the evaluation step, this is the value used for the fraction scan. |
| `eff_min` | `float` | Optional | Minimal main class efficiency considered for the ROC. |
| `eff_max` | `float` | Optional | Maximal main class efficiency considered for the ROC. |
| `frac_step` | `float` | Optional | Step size of the fraction value scan. Please keep in mind that the fractions given to the background classes need to add up to one! All combinations that do not add up to one are ignored. If you choose a combination of `frac_min`, `frac_max` and `frac_step` where the fractions of the background classes never add up to one, you will get an error while running `evaluate_model.py`. |
| `frac_min` | `float` | Optional | Minimal fraction value which is set for a background class in the fraction scan. |
| `frac_max` | `float` | Optional | Maximal fraction value which is set for a background class in the fraction scan. |
| `add_eval_variables` | `list` | Optional | A list of available variables which are to be added to the evaluation files. With this, variables can be added for the variable vs efficiency/rejection plots. |
| `eval_batch_size` | `int` | Optional | Number of jets used per batch for the evaluation of the training. If not given, the batch size from `nn_structure` is used. |
| `extra_classes_to_evaluate` | `list` | Optional | List with jet flavours that are also loaded for evaluation although the tagger was not trained on these classes. With this option, you can test the tagger's behaviour for classes it wasn't trained on. Note: this must be a list, and you only need to add extra classes that are not in `class_labels`! You also need to add an entry with value `0` to `frac_values` for each class in this list so the calculation of the discriminants works. |
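As an illustration of how the optional settings could be combined, a hedged sketch is shown below. The concrete values, the filename suffix and the `taujets` extra class are purely illustrative examples, not recommendations; remember that only fraction combinations that add up to one are evaluated in the scan.

```yaml
evaluation_settings:
  n_jets: 3e5
  tagger: ["rnnip", "DL1r"]
  frac_values_comp:
    {
      "rnnip": {"cjets": 0.07, "ujets": 0.93},
      "DL1r": {"cjets": 0.018, "ujets": 0.982},
    }
  working_point: 0.77

  # Optional: suffix added to the result filenames so that previous
  # results are not overwritten when re-evaluating
  results_filename_extension: "_fraction_scan"

  # Optional: efficiency range considered for the ROC curves
  eff_min: 0.49
  eff_max: 1.0

  # Optional: scan the background fractions in steps of 0.01. Only
  # combinations whose fractions add up to one are evaluated.
  frac_min: 0.01
  frac_max: 0.99
  frac_step: 0.01

  # Optional: also evaluate a class the tagger was not trained on.
  # Each extra class needs an entry with value 0 in frac_values.
  extra_classes_to_evaluate: ["taujets"]
  frac_values: {"cjets": 0.005, "ujets": 0.995, "taujets": 0}
```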
### DIPS
```yaml
# Decide, if the Saliency maps are calculated or not.
calculate_saliency: True
```
| Options | Data Type | Necessary, Optional | Explanation |
|---|---|---|---|
| `calculate_saliency` | `bool` | Optional | Decide if the saliency maps are calculated or not. This takes a lot of time and resources! |
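A minimal sketch of how this could look together with the general options is given below. It assumes that `calculate_saliency` sits inside the `evaluation_settings` section alongside the other options (as suggested by this subsection), and the values are only illustrative.

```yaml
evaluation_settings:
  n_jets: 3e5
  working_point: 0.77
  # DIPS only: also compute the saliency maps
  # (takes a lot of time and resources)
  calculate_saliency: True
```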
## Running the Evaluation
After the config is prepared, switch to the `umami/umami` folder and run `evaluate_model.py` by executing the following command:

```bash
evaluate_model.py -c <path to train config file> -e <epoch to evaluate>
```

The `-e` option allows you to define which epoch of the training is to be evaluated.
**Note:** Depending on the number of jets used for evaluation, this can take some time to process! Also, the use of a GPU for evaluation is highly recommended to reduce the execution time.
## Evaluate only the taggers inside the .h5 files (without a freshly trained model)
Although the UMAMI framework is made to evaluate and plot the results of trainings of the taggers living inside of it, it can also evaluate and plot taggers that are already present in the files coming from the training-dataset-dumper. These tagger results come from LWTNN models which are used to evaluate the jets in the derivations. The training-dataset-dumper applies these taggers and dumps the output probabilities for the different classes into the output .h5 files. These probabilities can be read by the `evaluate_model.py` script and evaluated like a freshly trained model.

To evaluate only these already-present taggers, there is a specific config file in the examples called `evaluate_comp_taggers.yaml`, which is shown here:
```yaml
# Set foldername (aka modelname)
model_name: Eval_results

# Set the option to evaluate a freshly trained model to False
evaluate_trained_model: False

# Defining templates for the variable cuts
.variable_cuts_ttbar: &variable_cuts_ttbar
  variable_cuts:
    - pt_btagJes:
        operator: "<="
        condition: 2.5e5

.variable_cuts_zpext: &variable_cuts_zpext
  variable_cuts:
    - pt_btagJes:
        operator: ">"
        condition: 2.5e5

test_files:
  ttbar_r21:
    path: <path>/<to>/<preprocessed>/<samples>/ttbar_r21_test_file.h5
    <<: *variable_cuts_ttbar

  ttbar_r22:
    path: <path>/<to>/<preprocessed>/<samples>/ttbar_r22_test_file.h5
    <<: *variable_cuts_ttbar

  zpext_r21:
    path: <path>/<to>/<preprocessed>/<samples>/zpext_r21_test_file.h5
    <<: *variable_cuts_zpext

  zpext_r22:
    path: <path>/<to>/<preprocessed>/<samples>/zpext_r22_test_file.h5
    <<: *variable_cuts_zpext

# Values for the neural network
nn_structure:
  # Use evaluated tagger scores in h5 file and not trained model
  tagger: None

  # Define which classes are used for training
  # These are defined in the global_config
  class_labels: ["ujets", "cjets", "bjets"]

  # Main class which is to be tagged
  main_class: "bjets"

# Plotting settings for training metrics plots.
# Those are not used here. Only when running plotting_epoch_performance.py
validation_settings:

# Eval parameters for validation evaluation while training
evaluation_settings:
  # Number of jets used for validation
  n_jets: 3e5

  # Number of jets per batch used for evaluation
  eval_batch_size: 15_000

  # Define taggers that are used for comparison in evaluate_model
  # This can be a list or a string for only one tagger
  tagger: ["rnnip", "DL1r"]

  # Define fc values for the taggers
  frac_values_comp:
    {
      "rnnip": {"cjets": 0.07, "ujets": 0.93},
      "DL1r": {"cjets": 0.018, "ujets": 0.982},
    }

  # Charm fraction value used for evaluation of the trained model
  frac_values: {"cjets": 0.018, "ujets": 0.982}

  # Working point used in the evaluation
  working_point: 0.77
```
Most of the options are similar to the ones already explained; many are missing here simply because they are not needed. The new ones are explained in the following table:
| Options | Data Type | Necessary, Optional | Explanation |
|---|---|---|---|
| `evaluate_trained_model` | `bool` | Necessary | This option enables/disables the evaluation of a freshly trained model. By default, this value is `True`, but if you want to evaluate only taggers already present in the .h5 files, you need to set this option to `False`! |
Now you can simply run the `evaluate_model.py` script as described in the section above, but without the `-e` option. The command looks like this:

```bash
evaluate_model.py -c <path to train config file>
```

`evaluate_model.py` will now produce a results file like the one from the "regular" usage of the script, with the difference that only the taggers you defined in `tagger` are present in the file and no freshly trained tagger. An explanation of how to plot the results is given here.
## Explaining the importance of features with SHAPley (only for DL1*)
SHAPley is a framework that helps you understand how the training of your machine-learning model is affected by the input variables, or in other words, from which variables your model possibly learns the most. SHAPley is for now only usable when evaluating a DL1* version. You can run it by executing the command

```bash
evaluate_model.py -c <path to train config file> -e <epoch to evaluate> -s shapley
```

which will output a beeswarm plot into `modelname/plots/`. Each dot in this plot represents one whole set of features (i.e. one jet). Dots are stacked vertically once there is no horizontal space left, to indicate density. The colour map tells you what the actual value was that entered the model. The SHAP value is essentially calculated by removing features, letting the model make a prediction, and then observing what happens when the features are introduced again. Doing this over all possible combinations gives an estimate of each feature's impact on the model. This is what the x-axis (SHAP value) tells you: the average(!) contribution of a variable to the output node you are interested in (default is the output node for b-jets). In practice, large magnitudes (which is also how these plots are ordered by default in umami) are desirable, as they give the model a better possibility to discriminate. Features with large negative SHAP values therefore help the model to better reject jets, whereas features with large positive SHAP values help the model learn that these are most probably jets from the category of interest. If you want to know more about SHAPley values, here is a talk from one of our FTAG algorithm meetings.
You have some options to play with in the `evaluation_settings` section of the DL1r-PFlow-Training-config.yaml, shown here:
```yaml
# some properties for the feature importance explanation with SHAPley
shapley:
  # Over how many full sets of features it should calculate over.
  # Corresponds to the dots in the beeswarm plot.
  # 200 takes like 10-15 min for DL1r on a 32 core-cpu
  feature_sets: 200

  # defines which of the model outputs (flavour) you want to explain
  # Must be an entry from class_labels! You can also give a list of multiple flavours
  flavour: "bjets"

  # You can also choose if you want to plot the magnitude of feature
  # importance for all output nodes (flavours) in another plot. This
  # will give you a bar plot of the mean SHAP value magnitudes.
  bool_all_flavor_plot: False

  # as this takes much longer you can average the feature_sets to a
  # smaller set, 50 is a good choice for DL1r
  averaged_sets: 50

  # [11,11] works well for DL1r
  plot_size: [11, 11]
```
The options are explained here:
| Options | Data Type | Necessary, Optional | Explanation |
|---|---|---|---|
| `shapley` | `dict` | Optional | `dict` with the options for the feature importance explanation with SHAPley. |
| `feature_sets` | `int` | Optional | Over how many full sets of features the calculation runs. Corresponds to the dots in the beeswarm plot. 200 takes about 10-15 min for DL1r on a 32-core CPU. |
| `flavour` | `str` or `list` | Optional | Defines which of the model outputs (flavour) you want to explain. Must be an entry from `class_labels`. You can also give a list of multiple flavours. |
| `bool_all_flavor_plot` | `bool` | Optional | You can also choose to plot the magnitude of feature importance for all output nodes (flavours) in another plot. This will give you a bar plot of the mean SHAPley value magnitudes. |
| `averaged_sets` | `int` | Optional | As this takes much longer, you can average the `feature_sets` down to a smaller set; 50 is a good choice for DL1r. |
| `plot_size` | `list` | Optional | Figure size of the SHAPley plot. This is a list with `[width, height]`. |
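As a hedged example of these options, the sketch below shows how one might explain several output nodes at once and additionally request the bar plot of the mean SHAP magnitudes; the values are purely illustrative and not a recommendation.

```yaml
shapley:
  # 200 feature sets, averaged down to 50 to keep the runtime manageable
  feature_sets: 200
  averaged_sets: 50
  # explain both the b-jet and the c-jet output nodes
  flavour: ["bjets", "cjets"]
  # additionally produce the bar plot of the mean SHAP value
  # magnitudes for all output nodes
  bool_all_flavor_plot: True
  plot_size: [11, 11]
```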