Multivariate Methods

To suppress the combinatorial background while maintaining a high signal efficiency, a multivariate selection based on a Boosted Decision Tree (BDT) is applied. Optimizing this BDT represents the main body of work for this analysis, and the optimization process is described in this section, preceded by a brief introduction to multivariate analysis (MVA) techniques.

Generally, MVA algorithms take as input a signal sample, a background sample, and a set of discriminating variables. The algorithm then classifies regions of the variable space as signal-like or background-like, depending on the properties of the respective training data. MVAs can exploit correlations between the input variables and thus outperform the traditional method of applying rectangular cuts. The output of the MVA is a single variable, denoted BDT_G in the plots, that quantifies how signal-like a given event is.

A decision tree has a structure very similar to that of a flowchart. The first decision node receives the full training sample and applies a requirement on one of the discriminating variables, choosing the variable and cut value that give the best signal-background separation. This splits the sample into two sub-samples, which are passed on to the subsequent nodes, where the procedure is repeated. The process continues until either the maximum number of requirements, as set by the user, has been applied, or no events are left. The terminal nodes are classified as signal or background, depending on which contribution dominates. If an event is misclassified by a decision tree, it is assigned a higher weight, so that the next decision tree is more likely to classify it correctly. This procedure is called boosting, and gives Boosted Decision Trees their name.
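As an illustration of the boosting procedure, the following is a minimal AdaBoost-style training sketch, written in Python with scikit-learn rather than with the toolkit actually used for this analysis; the inputs X (event features) and y (labels, +1 for signal and -1 for background) are hypothetical placeholders.

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bdt(X, y, n_trees=100, max_depth=3):
    """AdaBoost-style training: X = event features, y = +1/-1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)             # start from uniform event weights
    trees, alphas = [], []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)     # one-dimensional cut per node
        miss = tree.predict(X) != y         # misclassified events
        err = np.clip(np.sum(w[miss]), 1e-10, 1 - 1e-10)
        alpha = np.log((1.0 - err) / err)   # weight of this tree in the forest
        w *= np.exp(alpha * miss)           # boost misclassified events
        w /= w.sum()
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas
\end{verbatim}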

The training and reweighting of the training sample are repeated until a forest of decision trees is obtained; the forest is then combined into a single classifier by taking the weighted average of the individual tree responses. Because they are based on simple one-dimensional optimizations, BDTs are largely insensitive to the inclusion of poorly discriminating input variables. As a caveat, this may limit their performance relative to more sophisticated MVA methods. A total of 16 BDTs were trained and used for the studied decay modes.
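Continuing the sketch above, the weighted average of the forest can be expressed as follows; the resulting per-event response plays the role of the single classifier output described here.

\begin{verbatim}
def bdt_response(trees, alphas, X):
    """Weighted average of the individual tree responses, in [-1, +1]."""
    votes = np.array([a * t.predict(X) for t, a in zip(trees, alphas)])
    return votes.sum(axis=0) / np.sum(alphas)   # signal-like events -> +1
\end{verbatim}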

The preliminary histograms section describes the process through which the initial set of variables used to train each BDT is selected. Listing all of these variables here would over-saturate the text; a complete inventory of the initial training variables is instead provided for each decay alongside this section. The initial variable sets are refined through an iterative process, guided by two factors. First, the final training set should contain only weakly correlated variables, so as to mitigate overtraining; the correlations between the different quantities are illustrated through correlation matrices. Second, the ratio of signal efficiency to background rejection should be high, since this leads to a higher precision in the $CP$ asymmetry measurement. This trade-off is quantified through so-called receiver operating characteristic (ROC) curves.
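As a sketch of how these two diagnostics can be produced, assuming the hypothetical arrays X_sig (signal training variables), X_test, and y_test, together with the functions from the previous sketches:

\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_curve, auc

# Linear correlation matrix of the candidate variables (signal sample).
corr = np.corrcoef(X_sig, rowvar=False)      # X_sig: (n_events, n_vars)

# ROC curve of a trained BDT, evaluated on a held-out test sample.
scores = bdt_response(trees, alphas, X_test)
eff_bkg, eff_sig, _ = roc_curve(y_test, scores, pos_label=1)
bkg_rejection = 1.0 - eff_bkg
print("area under ROC curve:", auc(eff_bkg, eff_sig))
\end{verbatim}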

The two criteria counterbalance one another. If many variables are eliminated so as to guarantee minimal overtraining, the ratio of signal efficiency to background rejection decreases; conversely, a high ratio is obtained by including many variables, some of which are bound to have large correlation coefficients and thus produce an overtrained BDT. Therefore, each of the 16 BDTs was trained using 15 distinct variable combinations. All the resulting correlation matrices, ROC curves, and overtraining figures ($\sim 100$ in total) accompany this section. From an overall comparison of the plots obtained for each of the 16 BDTs, the 13th variable set was determined to give the best training variables for each decay. Furthermore, the input variables undergo an additional procedure to reduce correlations, in the form of transformations applied to the training data.
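One common choice of such a transformation, shown here purely as an illustration and not necessarily the one used in the analysis, is linear decorrelation via the Cholesky factor of the covariance matrix; the X_train array is again a hypothetical placeholder.

\begin{verbatim}
import numpy as np

def decorrelate(X):
    """Rotate the inputs with the inverse Cholesky factor of their
    covariance matrix, so the transformed variables are uncorrelated."""
    cov = np.cov(X, rowvar=False)       # cov = L @ L.T
    L = np.linalg.cholesky(cov)
    return X @ np.linalg.inv(L).T       # covariance of result ~ identity

X_train_dec = decorrelate(X_train)
\end{verbatim}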

Finally, the $k$-folding technique is used to mitigate potential performance issues and biases that small data sets might introduce in the BDT training, by increasing the effective statistics. Withholding a subset of the signal and background samples is essential for post-training performance validation. Ideally, the training and testing samples would be of equal size; however, as stated above, this might hinder the performance for small data sets. To avoid this complication, $k = 5$ BDTs are trained for each variable combination in each decay chain, each time withholding a different 20% of the sample for testing. The performance validation is then done by evaluating each BDT on its excluded data. This procedure allows the majority of the signal and background samples to be used for training. Finally, the five BDT responses are merged into a single analysis data set.
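A minimal sketch of this $k$-folding scheme, reusing the hypothetical train_bdt and bdt_response functions from the earlier sketches, could look as follows.

\begin{verbatim}
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
response = np.empty(len(y))
for train_idx, test_idx in kf.split(X):
    # train on 80% of the sample ...
    trees, alphas = train_bdt(X[train_idx], y[train_idx])
    # ... and evaluate on the 20% the BDT never saw during training
    response[test_idx] = bdt_response(trees, alphas, X[test_idx])
# 'response' now holds an unbiased BDT output for every event,
# ready to be merged into a single analysis data set.
\end{verbatim}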