Set up code
```bash
# Grid UI and CMS software environment
source /cvmfs/grid.cern.ch/emi3ui-latest/etc/profile.d/setup-ui-example.sh
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch
source $VO_CMS_SW_DIR/cmsset_default.sh
export SCRAM_ARCH=slc7_amd64_gcc700

# Set up a CMSSW working area with the nanoAOD package
cmsrel CMSSW_10_2_15
cd CMSSW_10_2_15/src
cmsenv
git cms-init
git cms-addpkg PhysicsTools/NanoAOD
scram b -j 8

# Get the exercise code
git clone https://github.com/cms-physics-object-school/LongExerciseTauID
cd LongExerciseTauID

# Generate the nanoAOD production configuration
# (--no_exec only writes the config file, it does not run it)
cmsDriver.py myNanoProdMc2018 -s NANO --mc --eventcontent NANOAODSIM --datatier NANOAODSIM --no_exec --conditions 102X_upgrade2018_realistic_v19 --era Run2_2018,run2_nanoAOD_102Xv1 --customise_commands="process.add_(cms.Service('InitRootHandlers', EnableIMT = cms.untracked.bool(False)))"

# Copy the input miniAOD samples and symlink them into the working directory
scp -r lxplus.cern.ch:/afs/cern.ch/work/j/jbechtel/public/CMSPOS_2019/TauID .
cp -rs TauID/*root .
```
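The `cmsDriver.py` call only generates the configuration file (`myNanoProdMc2018_NANO.py`); it does not run it. Below is a minimal sketch of how to point the generated config at one of the copied samples and limit it to 100 events. The input and output file names are assumptions (based on the `TTToSemileptonic_MiniAOD.root` naming); adapt them to the files you actually symlinked:

```python
# Edit (or append at the end of) the generated myNanoProdMc2018_NANO.py.
# File names below are assumptions -- use the samples you symlinked.
process.maxEvents.input = cms.untracked.int32(100)
process.source.fileNames = cms.untracked.vstring(
    'file:GluGluHToTauTau_MiniAOD.root'   # or the QCD miniAOD file
)
process.NANOAODSIMoutput.fileName = cms.untracked.string(
    'GluGluHToTauTau_NanoAOD.root'
)
```

Then run it with `cmsRun myNanoProdMc2018_NANO.py`.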
Tasks:
- Run the nanoAOD script on the QCD and GluGluHToTauTau miniAOD files for 100 events (the config sketch above shows one way to set the input file and event count).
- Inspect the output files with `python ../PhysicsTools/NanoAOD/test/inspectNanoFile.py -d output_file.html output_file.root`.
- In nanoAOD, not all information about the tau leptons is kept; for example, the impact parameter of the tau track with respect to the primary vertex, and its significance, is lost. We can customize the nanoAOD production to add these (and potentially more) parameters back. The variables written out to nanoAOD for tau leptons are defined in `../PhysicsTools/NanoAOD/python/taus_cff.py`. Add the variables `ip3d` and `ip3d_Sig` to the output file (see the `taus_cff.py` sketch after the task list). You can add more if you want; the variables available in a `pat::Tau` object are defined here: https://github.com/cms-sw/cmssw/blob/CMSSW_10_2_X/DataFormats/PatCandidates/interface/Tau.h
- Process 100 events and check that the new variables are properly written out (a quick branch check is sketched after the task list).
- Find variables which you suspect have discriminating power in separating jets misidentified as hadronic tau leptons from genuine hadronic tau leptons. You can add the variables to `plot.py` and check in the resulting plots which distributions differ between the two samples.
- Choose a set of variables in which you suspect strong discriminating power. Remember that the goal is a general discriminator between real and fake tau leptons, not between $H\rightarrow\tau\tau$ and QCD in general! Which variables (even if discriminating) should you therefore not use?
- Add your variables to `train.py` and train the neural network. Apply the model to your validation data, plot the ROC curve, and calculate the area under the curve (AUC); an evaluation sketch is given after the task list.
- Draw the output scores of your classifier for the background and signal events of the validation data.
- Find an optimal working point (i.e. the value of the NN score above which you consider a tau lepton to be genuine) for your classifier. Remember that the cross section of QCD events (jet production) is orders of magnitude higher than that of processes producing genuine tau leptons. What does this imply for your working point?
- Give the working point at which the misidentification probability (false positive rate) is at most 1% (see the working-point sketch after the task list). What efficiency can you achieve at this point?
- Run `python analyze.py`. This produces a plot `ranking.png` in which the input variables of the NN are ranked according to their influence on the NN output. Did you expect the variables to be ranked as they are?
- The file `TTToSemileptonic_MiniAOD.root` contains $t\bar{t}\rightarrow \ell \tau_{h}$ events, which means they contain both many jets (from the $t\bar{t}$ pair production) and genuine hadronic tau leptons. We want to separate the misidentified and true tau leptons using our classifier.
- Process the file to nanoAOD, adding all variables you need as input for your NN classifier.
- Apply the NN classifier to the $t\bar{t}$ events and plot the ROC curve as well as the distribution of NN scores. You can obtain the truth information about the reconstructed taus by using `Tau_genPartFlav==0` for misidentified taus and `Tau_genPartFlav==5` for true hadronic taus (see the labelling sketch after the task list). How does the performance compare to your validation sample?
- How can you explain the difference in performance? Hint: plot the $p_{T}$ distribution of genuine and fake taus from the $t\bar{t}$ sample.
- How can you recover the performance to make the classifier more applicable to $t\bar{t}$ events?
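A minimal sketch of the `taus_cff.py` customization: the `Var(...)` entries below follow the pattern of the existing tau variables in `PhysicsTools/NanoAOD/python/taus_cff.py`, and the accessors `ip3d()` and `ip3d_Sig()` come from `pat::Tau` (see the `Tau.h` link above); treat the doc strings and precision as placeholders:

```python
# Inside the variables = cms.PSet(...) of the tau table producer in
# PhysicsTools/NanoAOD/python/taus_cff.py, add two entries alongside
# the existing ones (sketch; doc strings and precision are placeholders):
ip3d = Var("ip3d()", float,
           doc="3D impact parameter of the tau track w.r.t. the PV",
           precision=10),
ip3d_Sig = Var("ip3d_Sig()", float,
               doc="significance of the 3D impact parameter",
               precision=10),
```

After editing, recompile with `scram b` before re-running the production; the new branches should then appear as `Tau_ip3d` and `Tau_ip3d_Sig`.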
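A quick check that the new variables were written out, sketched with PyROOT (the file name is a placeholder for your nanoAOD output):

```python
import ROOT

# "nano.root" is a placeholder for your nanoAOD output file.
f = ROOT.TFile.Open("nano.root")
events = f.Get("Events")
for name in ("Tau_ip3d", "Tau_ip3d_Sig"):
    print(name, "found" if events.GetBranch(name) else "MISSING")
# Print the values for the first 10 events alongside the tau pT
events.Scan("Tau_pt:Tau_ip3d:Tau_ip3d_Sig", "", "", 10)
f.Close()
```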
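For the ROC curve, AUC, and score distributions, a sketch using scikit-learn and matplotlib; `model`, `x_val`, and `y_val` are assumed to be the trained network and validation arrays from your `train.py` setup (label 1 for genuine taus, 0 for misidentified jets):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Per-tau classifier scores on the validation set
scores = model.predict(x_val).ravel()
fpr, tpr, thresholds = roc_curve(y_val, scores)
print("AUC = %.3f" % auc(fpr, tpr))

# ROC curve
plt.figure()
plt.plot(fpr, tpr)
plt.xlabel("false positive rate (jet misidentification)")
plt.ylabel("true positive rate (tau efficiency)")
plt.savefig("roc.png")

# Output score distributions for signal and background
plt.figure()
bins = np.linspace(0.0, 1.0, 40)
plt.hist(scores[y_val == 1], bins=bins, histtype="step", density=True, label="genuine taus")
plt.hist(scores[y_val == 0], bins=bins, histtype="step", density=True, label="misidentified jets")
plt.xlabel("NN output score")
plt.legend()
plt.savefig("scores.png")
```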
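For the working-point task, a sketch that reuses `fpr`, `tpr`, and `thresholds` from the `roc_curve` call above to find the tightest score cut with a misidentification probability of at most 1%:

```python
import numpy as np

# roc_curve returns fpr in increasing order; take the last point
# that still satisfies fpr <= 1%.
idx = np.searchsorted(fpr, 0.01, side="right") - 1
print("working point: NN score > %.3f" % thresholds[idx])
print("misID rate = %.4f, tau efficiency = %.3f" % (fpr[idx], tpr[idx]))
```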
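For the $t\bar{t}$ application, a sketch of how the truth labels could be built from the gen-matching flag, here using uproot; the file name is a placeholder, and values 1-4 of `Tau_genPartFlav` (taus matched to electrons or muons) are dropped since the exercise compares only misidentified jets (0) and genuine hadronic taus (5):

```python
import numpy as np
import uproot

# "TTToSemileptonic_NanoAOD.root" is a placeholder for your output file.
tree = uproot.open("TTToSemileptonic_NanoAOD.root")["Events"]
arrs = tree.arrays(["Tau_genPartFlav", "Tau_pt"], library="np")

# Flatten the jagged per-event arrays into one entry per tau
flav = np.concatenate(arrs["Tau_genPartFlav"])
keep = (flav == 0) | (flav == 5)
labels = (flav[keep] == 5).astype(int)  # 1 = genuine hadronic tau
# Build the same input-variable matrix as in train.py for these taus,
# apply model.predict, and feed (labels, scores) to roc_curve as above.
```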
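And for the hint about the performance difference, a sketch continuing from the arrays built above that compares the $p_{T}$ spectra of genuine and fake taus in the $t\bar{t}$ sample (the binning is an arbitrary choice):

```python
import matplotlib.pyplot as plt

pt = np.concatenate(arrs["Tau_pt"])[keep]
bins = np.linspace(20.0, 200.0, 37)
plt.figure()
plt.hist(pt[labels == 1], bins=bins, histtype="step", density=True, label="genuine taus")
plt.hist(pt[labels == 0], bins=bins, histtype="step", density=True, label="misidentified jets")
plt.xlabel(r"tau $p_{T}$ [GeV]")
plt.legend()
plt.savefig("tau_pt_ttbar.png")
```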