Preprocessing
In the preparation and preprocessing section of this tutorial, we focus on converting and preparing the data for our machine learning model. This step involves organizing and cleaning the data, as well as applying suitable preprocessing techniques to ensure that the data is in the optimal format for effective model training.
Prerequisites#
Step 1 Download the JetClass dataset. For the purpose of this tutorial, we only use a subset of the dataset.
mkdir -p data/jetclass/train_10M && cd data/jetclass/train_10M
wget https://zenodo.org/records/6619768/files/JetClass_Pythia_train_100M_part0.tar
tar xvf JetClass_Pythia_train_100M_part0.tar
Step 2 Convert the JetClass dataset to the Umami hdf5 format. We will use the R10TruthLabel_R22v1
variable to encode the different target classes. Note that for the following script it is assumed that you have installed the following python packages.
pip install uproot h5py numpy lz4 xxhash
You can create a script convert_to_umami
with the following content and execute it to create the input file for Umami.
convert_to_umami.py
script
import uproot
import h5py
import numpy as np
import os
import fnmatch
# Directory containing the ROOT files
root_dir = "data/jetclass/train_10M"
output_dir = "data/jetclass/ntuples"
output_file = "combined_data.h5"
# File patterns and their corresponding class labels
file_patterns = {
"HToBB_*.root": 11,
"HToCC_*.root": 12,
"TTBar_*.root": 1,
}
# Define the dtype for the structured array
dtype = np.dtype([
("pt", np.float16),
("eta", np.float16),
("mass", np.float16),
("R10TruthLabel_R22v1", np.int32)
])
# List to store data
all_data = []
# Process each file pattern
for pattern, class_label in file_patterns.items():
for filename in os.listdir(root_dir):
if fnmatch.fnmatch(filename, pattern):
file_path = os.path.join(root_dir, filename)
with uproot.open(file_path) as file:
tree = file["tree"]
# Extract relevant branches
pt = tree["jet_pt"].array(library="np").astype(np.float16)
eta = tree["jet_eta"].array(library="np").astype(np.float16)
mass = tree["jet_sdmass"].array(library="np").astype(np.float16)
# Create structured array
data = np.empty(len(pt), dtype=dtype)
data["pt"] = pt
data["eta"] = eta
data["mass"] = mass
data["R10TruthLabel_R22v1"] = class_label
all_data.append(data)
# Combine all data
combined_data = np.concatenate(all_data)
# Shuffle the combined data
np.random.shuffle(combined_data)
# Write data to HDF5 file
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, output_file)
with h5py.File(output_path, "w") as h5file:
h5file.create_dataset("jets", data=combined_data)
print(f"Data written to {output_path}")
As a result, you should have now a directory data/jetclass/
which contains the converted dataset as a h5 file in data/jetclass/ntuples/combined_data.h5
.
Preprocessing#
The preprocessing is steered via configuration files. The relevant files are located in examples/tutorial_jetclass/
. Please have a look at the files
Preprocessing-config.yaml
: main preprocessing filesPreprocessing-parameters.yaml
: custom parameters specific to your computing site (you might need to modify this if you deviated from the previous instructions)Preprocessing-samples.yaml
: samples which correspond to the classes in the trainingPreprocessing-cut_parameters.yaml
: can contain commonly used definitions for selection cuts applied to datasets (not used in this tutorial)
Step 1 Run the preparation step. This step is needed to preprare the different selected samples.
preprocessing.py --config examples/tutorial_jetclass/Preprocessing-config.yaml --prepare
data/jetclass/preprocessed/
and data/jetclass/hybrids/
which contain the output of the preparation step.
Step 2 Run the resampling step. This step uses count
-based undersampling to achieve balanced training classes. You will create both a training sample and a hybrid validation sample.
preprocessing.py --config examples/tutorial_jetclass/Preprocessing-config.yaml --resampling
preprocessing.py --config examples/tutorial_jetclass/Preprocessing-config.yaml --resampling --hybrid_validation
Step 3 Run the scale + shift step. This step processes the training variables to normalise the range of the independent variables.
preprocessing.py --config examples/tutorial_jetclass/Preprocessing-config.yaml --scaling
Step 4 Run the writing step. This step writes the training dataset to disk. You will also write the validation sample to disk.
preprocessing.py --config examples/tutorial_jetclass/Preprocessing-config.yaml --write
preprocessing.py --config examples/tutorial_jetclass/Preprocessing-config.yaml --write --hybrid_validation
You now have finished all steps of the preprocessing stage. The result is provided in data/jetclass/preprocessed/
. Have a look at the plots in data/jetclass/preprocessed/plots
In the next part you will run a training of a fully connected deep neural network to classify the jets.