# Ntuple preparation
After the ntuple production (training-dataset-dumper), the first step of the preprocessing is the preparation of the different flavour files. In this step, the different flavours that are to be used for the training are extracted from the `.h5` files and written into separate files. While extracting the jets, different cuts are applied and the splitting into training/validation/test is done.
## Config file
The preprocessing is configured using `.yaml` config files. We start with some general options that are needed by multiple preprocessing steps and should be set at the very beginning of the preprocessing:
```yaml
outfile_name: *outfile_name
# outfile name for the validation sample
outfile_name_validation: *outfile_name_validation
# Name of the plot
plot_name: PFlow_ext-hybrid
# Define the plot type (like pdf or png) for the preprocessing plots created
plot_type: "pdf"
# Define, if you want to use the ATLAS Logo
use_atlas_tag: True
# Define, what's the text behind the ATLAS Logo
atlas_first_tag: "Simulation Internal"
# Label under the ATLAS logo on the preprocessing plots
atlas_second_tag: "$\\sqrt{s}=13$ TeV, PFlow jets"
# include sample categories in the plots legends before resampling
legend_sample_category: True
# Variable dict which is used for scaling and shifting
var_file: *var_file
# Dictfile for the scaling and shifting (json)
dict_file: *dict_file
```
| Setting | Type | Explanation |
|---------|------|-------------|
| `outfile_name` | `str` | Name of the output file of the preprocessing. The different steps will append some info to this name to produce their output files (like `_resampled`). |
| `outfile_name_validation` | `str` | Name of the validation output file of the preprocessing. The different steps will append some info to this name to produce their output files (like `_resampled`). |
| `plot_name` | `str` | Defines the names of the control plots which are produced in the preprocessing. |
| `plot_type` | `str` | Defines the filetype in which the preprocessing plots are saved. Default is `"pdf"`. |
| `use_atlas_tag` | `bool` | Define if you want to have the ATLAS logo at the top left of the plot. |
| `atlas_first_tag` | `str` | Define the text after the ATLAS logo. By default `"Simulation Internal"`. |
| `atlas_second_tag` | `str` | Defines the label in the control plots which are made in the preprocessing. This is the text which is written under "ATLAS". |
| `legend_sample_category` | `bool` | Whether to include sample categories in the legends of plots before resampling. Set to `False` for boosted tagging. |
| `var_file` | `str` | Path to the variable dict which is used. Default configs can be found here. |
| `dict_file` | `str` | Full path (with filename) to the scale dict. This must be a `.json` file! |
For the plot-related variables, Umami also supports all available options of PUMA.
The `var_file` and `dict_file` options are normally set in the `Preprocessing-parameters.yaml` file. A snapshot of these two variables is shown here:
```yaml
# List of variables for training (yaml)
.var_file: &var_file <path_place_holder>/umami/umami/configs/Dips_Variables.yaml
# Dictfile for the scaling and shifting (json)
.dict_file: &dict_file !join [*base_path, /scale_dict.json]
```
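The `!join` tag used here is a custom YAML tag that concatenates the elements of a sequence into one string. As a minimal sketch (this is an assumption about how it works, not Umami's actual implementation), such a constructor can be registered with PyYAML like this:

```python
import yaml

# Minimal sketch (assumption, not Umami's actual code) of a custom
# "!join" YAML tag that concatenates the elements of a sequence.
def join_constructor(loader, node):
    seq = loader.construct_sequence(node)
    return "".join(str(item) for item in seq)

yaml.SafeLoader.add_constructor("!join", join_constructor)

# Hypothetical config fragment mirroring the snippet above
doc = """
.base_path: &base_path /my/base/path
dict_file: !join [*base_path, /scale_dict.json]
"""
cfg = yaml.safe_load(doc)
print(cfg["dict_file"])  # /my/base/path/scale_dict.json
```

This is why the anchors (`&base_path`) defined in the parameter file can be reused to build full output paths.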
For the preparation step, we also need some more parts of the preprocessing config, which are described in the following sections.
## Preprocessing Parameters
```yaml
parameters: !include Preprocessing-parameters.yaml
```
This line specifies where the ntuples which are used are stored and where to save the output of the preprocessing. You can find an example file here. In the following, the options from the `Preprocessing-parameters.yaml` file which are needed for the preparation step will be explained:
```yaml
# Path where the ntuples are saved
ntuple_path: &ntuple_path <path_place_holder>/ntuples/
# Base path where to store preprocessing results
.base_path: &base_path <base_path_place_holder>
# Path where the hybrid samples will be saved
sample_path: &sample_path !join [*base_path, /prepared]
# Path where the merged and ready-to-train samples are saved
file_path: &file_path !join [*base_path, /preprocessed]
```
| Setting | Type | Explanation |
|---------|------|-------------|
| `ntuple_path` | `str` | The path where the ntuples are stored. This is the folder where the different process folders with the `.h5` files are stored (the folder with the `user.*` folders from `rucio`). |
| `sample_path` | `str` | The path where the prepared samples will be stored (the output files of this preprocessing step). |
## Cut Templates
```yaml
cut_parameters: !include Preprocessing-cut_parameters.yaml
```

The included `Preprocessing-cut_parameters.yaml` file looks like this:

```yaml
# Defining anchor with outlier cuts that are used over and over again
.outlier_cuts: &outlier_cuts
  - JetFitterSecondaryVertex_mass:
      operator: <
      condition: 25000
      NaNcheck: True
  - JetFitterSecondaryVertex_energy:
      operator: <
      condition: 1e8
      NaNcheck: True
  - JetFitter_deltaR:
      operator: <
      condition: 0.6
      NaNcheck: True

# Defining yaml anchors to be used later, avoiding duplication
.cuts_template_training_ttbar: &cuts_template_training_ttbar
  cuts:
    - eventNumber:
        operator: mod_10_<=
        condition: 7
    - pt_btagJes:
        operator: "<="
        condition: 2.5e5
    - *outlier_cuts

.cuts_template_training_zprime: &cuts_template_training_zprime
  cuts:
    - eventNumber:
        operator: mod_10_<=
        condition: 7
    - pt_btagJes:
        operator: ">"
        condition: 2.5e5
    - *outlier_cuts

.cuts_template_validation: &cuts_template_validation
  cuts:
    - eventNumber:
        operator: mod_10_==
        condition: 8
    - *outlier_cuts

.cuts_template_validation_ttbar_hybrid: &cuts_template_validation_ttbar_hybrid
  cuts:
    - eventNumber:
        operator: mod_10_==
        condition: 8
    - pt_btagJes:
        operator: "<="
        condition: 2.5e5
    - *outlier_cuts

.cuts_template_validation_zprime_hybrid: &cuts_template_validation_zprime_hybrid
  cuts:
    - eventNumber:
        operator: mod_10_==
        condition: 8
    - pt_btagJes:
        operator: ">"
        condition: 2.5e5
    - *outlier_cuts

.cuts_template_testing: &cuts_template_testing
  cuts:
    - eventNumber:
        operator: mod_10_==
        condition: 9
    - *outlier_cuts
```
The cuts defined in this section are templates for the cuts of the different flavours for $t\bar{t}$/$Z'$. The `training_ttbar` and `training_zprime` templates select the jets which are used for training, while the validation/testing templates do the same for validation and testing.

The cuts which are to be applied can be defined in these templates. For example, we can define a cut on the `eventNumber` with a modulo operator: `mod_10_<=` with `condition: 7` keeps all jets whose `eventNumber` modulo 10 is smaller than or equal to 7. With this specific cut on the `eventNumber`, we split the $t\bar{t}$/$Z'$ samples into train/validation/test and ensure no jet is used twice. With the templates shown above, $\frac{8}{10}$ of the jets are used for training, $\frac{1}{10}$ for validation and $\frac{1}{10}$ for testing.
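To make the splitting concrete, here is a small illustration (plain Python, not Umami code) of how the `mod_10` cuts from the templates above partition jets by `eventNumber` without any overlap:

```python
# Illustration (not Umami code): how the event-number modulo cuts
# from the templates above split jets into train/validation/test.
def split(event_number: int) -> str:
    if event_number % 10 <= 7:    # mod_10_<= with condition: 7
        return "train"
    if event_number % 10 == 8:    # mod_10_== with condition: 8
        return "validation"
    return "test"                 # mod_10_== with condition: 9

# Count the resulting fractions over many event numbers
counts = {"train": 0, "validation": 0, "test": 0}
for n in range(10_000):
    counts[split(n)] += 1
print(counts)  # {'train': 8000, 'validation': 1000, 'test': 1000}
```

Every jet ends up in exactly one of the three categories, so no jet is used twice.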
Another cut which can be applied is `pt_btagJes`, a cut on the jet $p_T$. It is defined in the same way via `operator` and `condition`. In the default case, we want $t\bar{t}$ jets for the $p_T$ region from $20\,\text{GeV}$ to $250\,\text{GeV}$ and $Z'$ jets for the region above $250\,\text{GeV}$.
**Nested cuts on the same variable**

It is also possible to apply nested cuts on the same variable, e.g. like this:
```yaml
.cuts_template_zprime_train: &cuts_template_zprime_train
  cuts:
    - eventNumber:
        operator: mod_6_<=
        condition: 3
    - pt_btagJes:
        operator: ">="
        condition: 2.5e5
    - pt_btagJes:
        operator: "<="
        condition: 3e6
```
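Multiple cuts on the same variable combine with a logical AND, so the two `pt_btagJes` cuts above select a $p_T$ window. A minimal sketch of the resulting selection (hypothetical helper, not Umami code; `pt_btagJes` is in MeV):

```python
def passes_zprime_train_pt(pt_btagjes: float) -> bool:
    # The two nested cuts combine with AND, selecting the window
    # 2.5e5 <= pt_btagJes <= 3e6 (MeV), i.e. 250 GeV to 3 TeV.
    return 2.5e5 <= pt_btagjes <= 3e6

print(passes_zprime_train_pt(5e5))  # True: inside the window
print(passes_zprime_train_pt(1e5))  # False: below the lower cut
```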
## File- and Flavour Preparation
```yaml
# Path to the .h5 files from the h5 dumper.
input_h5:
  ttbar:
    path: *ntuple_path
    file_pattern: user.alfroch.410470.btagTraining.e6337_s3126_r10201_p3985.EMPFlow.2021-07-28-T130145-R11969_output.h5/*.h5
  zprime:
    path: *ntuple_path
    file_pattern: user.alfroch.427081.btagTraining.e6928_e5984_s3126_r10201_r10210_p3985.EMPFlow.2021-07-28-T130145-R11969_output.h5/*.h5
```
In the `preparation` section, different options need to be set and files/flavours defined. The options that need to be set are given in the following table:
| Setting | Type | Explanation |
|---------|------|-------------|
| `batchsize` | `int` | Number of jets that are loaded per iteration step from the `.h5` files. This avoids loading the whole `.h5` file at once, which could exhaust the available RAM. This number can be adjusted to the amount of RAM that is available. |
| `input_h5` | `dict` | The dict with the file types which are used in the preprocessing. Here, `ttbar` and `zprime` are the internal names of these files. Both are also dicts. |
| `path` | `str` | Dict entry of `ttbar` and `zprime`. This gives the path to the folder where the process folders are stored (this is the same as `ntuple_path` in `Preprocessing-parameters.yaml`). |
| `file_pattern` | `str` | Dict entry of `ttbar` and `zprime`. This is the specific path to the `.h5` files of the process. The path and file pattern are merged in the script to form the global path to the `.h5` files. Wildcards are supported! |
| `randomise` | `bool` | Optional setting to randomise the samples which are read in. Can be useful if you have several data-taking campaigns and you want a representative sample; especially important for validation and testing. (A random seed is set to maintain reproducibility.) By default `False`. |
| `jets_name` | `str` | Optional setting for the name of the jet collection. The default is `"jets"`. After the preparation step, the jet collection will always be called `"jets"`. |
| `collection_name` | `str` | Optional setting to define a top level in which the jet and track collections are stored. By default, this is not set. If you have, for example, `"collection_1/jets"`, this option would be `"collection_1"`. After the preparation step, the jet and track collections will always be at the top level. |
In the example above, we specify the paths for the `ttbar` and `zprime` ntuples. Since we define them there, we can then use these ntuples in the `samples` section. So if you want to use e.g. Z+jets ntuples for bb-jets, define the corresponding `zjets` entry in the `input_h5` section before using it in the `samples` section.
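How `path` and `file_pattern` combine can be sketched with a plain glob. The sketch below uses hypothetical file names in a temporary directory (the real `.h5` files come from the dumper), but the pattern resolution works the same way:

```python
import tempfile
from glob import glob
from pathlib import Path

# Hypothetical stand-in for ntuple_path, populated with fake rucio output
ntuple_path = Path(tempfile.mkdtemp())
dataset_dir = ntuple_path / "user.example.410470.btagTraining_output.h5"
dataset_dir.mkdir()
for i in range(3):
    (dataset_dir / f"part_{i}.h5").touch()

# path and file_pattern are merged into one glob expression;
# wildcards may appear in both the folder and file parts.
file_pattern = "user.example.*_output.h5/*.h5"
files = sorted(glob(str(ntuple_path / file_pattern)))
print(len(files))  # 3
```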
```yaml
training_ttbar_bjets:
  type: ttbar
  category: bjets
  n_jets: 10e6
  <<: *cuts_template_training_ttbar
  output_name: !join [*sample_path, /bjets_training_ttbar_PFlow.h5]

training_ttbar_cjets:
  type: ttbar
  category: cjets
  # Number of c jets available in MC16d
  n_jets: 10e6
  <<: *cuts_template_training_ttbar
  output_name: !join [*sample_path, /cjets_training_ttbar_PFlow.h5]

training_ttbar_ujets:
  type: ttbar
  category: ujets
  n_jets: 10e6
  <<: *cuts_template_training_ttbar
  output_name: !join [*sample_path, /ujets_training_ttbar_PFlow.h5]

training_ttbar_taujets:
  type: ttbar
  category: taujets
  n_jets: 10e6
  <<: *cuts_template_training_ttbar
  output_name: !join [*sample_path, /taujets_training_ttbar_PFlow.h5]

training_zprime_bjets:
  type: zprime
  category: bjets
  n_jets: 10e6
  <<: *cuts_template_training_zprime
  output_name: !join [*sample_path, /bjets_training_zprime_PFlow.h5]

training_zprime_cjets:
  type: zprime
  category: cjets
  n_jets: 10e6
  <<: *cuts_template_training_zprime
  output_name: !join [*sample_path, /cjets_training_zprime_PFlow.h5]

training_zprime_ujets:
  type: zprime
  category: ujets
  n_jets: 10e6
  <<: *cuts_template_training_zprime
  output_name: !join [*sample_path, /ujets_training_zprime_PFlow.h5]

training_zprime_taujets:
  type: zprime
  category: taujets
  n_jets: 10e6
  <<: *cuts_template_training_zprime
  output_name: !join [*sample_path, /taujets_training_zprime_PFlow.h5]

validation_ttbar:
  type: ttbar
  category: inclusive
  n_jets: 4e6
  <<: *cuts_template_validation
  output_name: !join [*sample_path, /inclusive_validation_ttbar_PFlow.h5]

validation_ttbar_bjets:
  type: ttbar
  category: bjets
  n_jets: 4e6
  <<: *cuts_template_validation_ttbar_hybrid
  output_name: !join [*sample_path, /bjets_validation_ttbar_PFlow.h5]

validation_ttbar_cjets:
  type: ttbar
  category: cjets
  n_jets: 4e6
  <<: *cuts_template_validation_ttbar_hybrid
  output_name: !join [*sample_path, /cjets_validation_ttbar_PFlow.h5]

validation_ttbar_ujets:
  type: ttbar
  category: ujets
  n_jets: 4e6
  <<: *cuts_template_validation_ttbar_hybrid
  output_name: !join [*sample_path, /ujets_validation_ttbar_PFlow.h5]

validation_ttbar_taujets:
  type: ttbar
  category: taujets
  n_jets: 4e6
  <<: *cuts_template_validation_ttbar_hybrid
  output_name: !join [*sample_path, /taujets_validation_ttbar_PFlow.h5]

validation_zprime:
  type: zprime
  category: inclusive
  n_jets: 4e6
  <<: *cuts_template_validation
  output_name: !join [*sample_path, /inclusive_validation_zprime_PFlow.h5]

validation_zprime_bjets:
  type: zprime
  category: bjets
  n_jets: 4e6
  <<: *cuts_template_validation_zprime_hybrid
  output_name: !join [*sample_path, /bjets_validation_zprime_PFlow.h5]

validation_zprime_cjets:
  type: zprime
  category: cjets
  n_jets: 4e6
  <<: *cuts_template_validation_zprime_hybrid
  output_name: !join [*sample_path, /cjets_validation_zprime_PFlow.h5]

validation_zprime_ujets:
  type: zprime
  category: ujets
  n_jets: 4e6
  <<: *cuts_template_validation_zprime_hybrid
  output_name: !join [*sample_path, /ujets_validation_zprime_PFlow.h5]

validation_zprime_taujets:
  type: zprime
  category: taujets
  n_jets: 4e6
  <<: *cuts_template_validation_zprime_hybrid
  output_name: !join [*sample_path, /taujets_validation_zprime_PFlow.h5]

testing_ttbar:
  type: ttbar
  category: inclusive
  n_jets: 4e6
  <<: *cuts_template_testing
  output_name: !join [*sample_path, /inclusive_testing_ttbar_PFlow.h5]

testing_zprime:
  type: zprime
  category: inclusive
  n_jets: 4e6
  <<: *cuts_template_testing
  output_name: !join [*sample_path, /inclusive_testing_zprime_PFlow.h5]
```
The last part is the exact splitting of the flavours. In `samples`, you define for each of $t\bar{t}$/$Z'$ and training/validation/testing the flavours you want to use. In the example case, these samples are stored in another yaml file called `Preprocessing-samples.yaml` to keep the config file a bit smaller, but you can also simply add them directly to the config file.

The samples are defined as dicts with the following options:
| Setting | Type | Explanation |
|---------|------|-------------|
| `type` | `str` | Type of process that this file will be. |
| `category` | `str` | This defines the flavour that will be extracted into this file. You can either use a flavour like `bjets` or `inclusive`, which will use all jets regardless of their flavour. |
| `n_jets` | `int` | Number of jets you want for this specific flavour. If not specified, it is set to 4M. |
| `cuts` | `list` | A list of cuts that are applied. In the default case, this is added via templates which are included with `<<:`. |
| `output_name` | `str` | Name of the output file where the prepared sample will be stored. |
Note: The `n_jets` value should be as high as possible for the training files! It is just the number of jets for this flavour which are extracted from the `.h5` files coming from the dumper. The resampling algorithm uses these samples to get the jets for building the final training sample, but it only uses as many as needed! Only for the validation and testing files we suggest using something around `4e6` (otherwise the loading later on takes quite some time).
**Create samples list automatically**

If you don't want to define all the different samples one by one, you can also use the `create_preprocessing_samples.py` script. To use it, you just need to adapt it to your needs:
```python
categories = ["ujets", "cjets", "bjets"]
sample_types = ["ttbar", "zprime"]
n_jets = {
    "training": int(10e6),
    "validation": int(4e6),
    "testing": int(4e6),
}
```
- `categories` (`list`): List with the flavours to extract.
- `sample_types` (`list`): List with the sample types you want to use.
- `n_jets` (`dict`): Dict with the number of jets which are to be extracted from the `.h5` files for the different usages of the samples (the keys must be training/validation/testing! You can't rename them!).
This will create the content of the `samples` dict of the preprocessing config file. The different training samples, i.e. `training_ttbar_bjets` etc., and also the validation (`validation_ttbar` and `validation_zprime`) and testing samples (`testing_ttbar` and `testing_zprime`) will be created. In addition, the flavour-separated validation files (e.g. `validation_ttbar_bjets`) needed for creating the hybrid validation sample are also prepared. Which cut template is used is determined by the name of the cut template: for the `training` case of `ttbar`, it must be named `.cuts_template_training_ttbar`.
To add your file to the preprocessing config, you can simply `!include` it like the preprocessing parameters. Just replace the `samples` section, with the different samples defined in it, with:

```yaml
samples: !include <Path to your samples yaml file>
```
## Run the Preparation
To run the preparation step, switch to the `umami/umami/` folder and run the following command:

```bash
preprocessing.py --config <path to config file> --prepare
```
The preprocessing will prepare the selected samples in the order in which they are defined in the `samples` section. This is one of the longest steps of the preprocessing if it is not parallelised. You can parallelise it by running one sample per job, specifying which sample is to be prepared.
For example, to run the sample preparation for the training b-jet sample `training_ttbar_bjets`, which has been defined in the config file in the `preparation: samples:` block, execute:

```bash
preprocessing.py --config <path to config file> --prepare --sample training_ttbar_bjets
```
The result of these commands are the prepared samples, which are ready for resampling. Please keep in mind that the validation and testing files are also prepared in this step. You can run them separately in the same way as `training_ttbar_bjets`, for example with the command:

```bash
preprocessing.py --config <path to config file> --prepare --sample testing_ttbar
```
## Ntuple Preparation for VR track jets
The preparation of variable-radius (VR) track jet input files for training, validation, and testing datasets is very similar to the workflow described above for PFlow jets. The main difference is that a special set of config files is used, which accounts for the different hybrid sample composition when using VR track jets. The main differences to PFlow jets are:

- implementation of the VR track jet overlap removal
- only the four leading jets in $p_T$ are used for the $t\bar{t}$ sample
- only the two leading jets in $p_T$ are used for the $Z'$ sample
- the transition from $t\bar{t}$ to $Z'$ occurs in a region in $p_T$ and not at a fixed $p_T$ cut