Preprocessing

DannyWeitekamp edited this page Aug 23, 2016 · 7 revisions

# Preprocessing Pandas Tables -> numpy Training Data

Table of contents:

  1. ObjectProfile
  2. preprocessFromPandas_label_dir_pairs
  3. [Examples](https://github.com/DannyWeitekamp/CMS_SURF_2016/wiki/Preprocessing#examples)

## ObjectProfile

An object containing the processing instructions for each type of observable object.

Arguments:

  • name - The name of the data type (e.g. Electron, Photon, EFlowTrack)
  • max_size - The maximum number of objects to use in training
  • sort_columns - Which columns to sort on (see pandas.DataFrame.sort, replaced by pandas.DataFrame.sort_values in newer pandas releases)
  • sort_ascending - Whether each column is sorted in ascending or descending order (see pandas.DataFrame.sort)
  • query - A selection query string to apply before truncating the data (see pandas.DataFrame.query)
  • shuffle - Whether or not to shuffle the data
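
The arguments above can be read as a small per-object pipeline. The following is only a minimal sketch of that idea in plain pandas, not the library's implementation: `apply_profile_sketch` is a hypothetical helper, and the order of the steps (query, then shuffle, then sort, then truncate) as well as the zero-padding up to `max_size` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def apply_profile_sketch(df, max_size, sort_columns=None, sort_ascending=True,
                         query=None, shuffle=False, seed=0):
    """Hypothetical stand-in for what an ObjectProfile might describe."""
    if query is not None:
        # Select rows before anything is truncated
        df = df.query(query)
    if shuffle:
        # Random row order (assumed to happen before any sorting)
        df = df.sample(frac=1, random_state=seed)
    if sort_columns is not None:
        # sort_values is the current name of the old DataFrame.sort
        df = df.sort_values(by=sort_columns, ascending=sort_ascending)
    # Keep at most max_size objects
    df = df.head(max_size)
    # Assumption: pad with zeros so every result has the same shape
    out = np.zeros((max_size, df.shape[1]))
    out[:len(df)] = df.values
    return out


# Toy table with two observables for four objects
tracks = pd.DataFrame({"PT_ET": [5.0, 40.0, 12.0, 0.5],
                       "Eta": [0.1, -1.2, 2.0, 0.3]})
arr = apply_profile_sketch(tracks, max_size=3, sort_columns=["PT_ET"],
                           sort_ascending=False, query="PT_ET > 1.0")
print(arr)
```

Here the query drops the softest object, and the remaining three are ordered by descending `PT_ET` before being written into a fixed-size array.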

## preprocessFromPandas_label_dir_pairs

Gets training data from folders of pandas tables.

Arguments:

  • label_dir_pairs - A list of tuples of the form (label, directory), where each directory contains tables that all hold data of the same event type.
  • start - Where to start reading (as if all of the files were part of one long list)
  • num_samples - The number of samples to read
  • object_profiles - A list of ObjectProfile(s), one for each type of observable object, describing its preprocessing steps. The order of the ObjectProfiles in this list dictates the order of the input X list.
  • observ_types - The column headers for the data to be read from the pandas table

Returns: Training data with its corresponding labels (X_train, Y_train)
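
To make the meaning of `start` and `num_samples` concrete, here is a rough sketch of how a global sample range can be mapped onto several files treated as one long list. `slices_for_range` is a hypothetical helper written for this page, and the file lengths are made up. Whether the library applies the range per label directory or across all directories is not stated above, so treat this purely as an illustration of the indexing idea.

```python
def slices_for_range(file_lengths, start, num_samples):
    """Map the global range [start, start + num_samples) onto per-file row ranges.

    Returns a list of (file_index, first_row, last_row_exclusive) tuples.
    """
    stop = start + num_samples
    result = []
    offset = 0  # global index of the first row of the current file
    for index, length in enumerate(file_lengths):
        lo = max(start, offset)
        hi = min(stop, offset + length)
        if lo < hi:
            # This file overlaps the requested range
            result.append((index, lo - offset, hi - offset))
        offset += length
    return result


# Three files of 100 samples each; read 100 samples starting at sample 150
pieces = slices_for_range([100, 100, 100], start=150, num_samples=100)
print(pieces)  # [(1, 50, 100), (2, 0, 50)]
```

The first file is skipped entirely, the second contributes its last 50 rows, and the third contributes its first 50.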

## Examples

    # Assumes ObjectProfile and preprocessFromPandas_label_dir_pairs have
    # already been imported from this repository.

    # Column headers to read from each pandas table
    observ_types = ["E/c", "Px", "Py", "Pz", "Charge", "PT_ET", "Eta", "Phi", "Dxy_Ehad_Eem"]
    sample_start = 0
    num_samples = 10000

    # One profile per object type, given as (name, max_size);
    # the EFlow collections are sorted by descending PT_ET.
    object_profiles = [
        ObjectProfile("Electron", 5),
        ObjectProfile("MuonTight", 5),
        ObjectProfile("Photon", 25),
        ObjectProfile("MissingET", 1),
        ObjectProfile("EFlowPhoton", 1000, sort_columns=["PT_ET"], sort_ascending=False),  # 1300
        ObjectProfile("EFlowNeutralHadron", 1000, sort_columns=["PT_ET"], sort_ascending=False),  # 1000
        ObjectProfile("EFlowTrack", 1000, sort_columns=["PT_ET"], sort_ascending=False),  # 1050
    ]

    # (label, directory) pairs; each directory holds tables of a single event type
    label_dir_pairs = [
        ("ttbar", "/data/shared/Delphes/ttbar_lepFilter_13TeV/pandas_unjoined/"),
        ("wjet", "/data/shared/Delphes/wjets_lepFilter_13TeV/pandas_unjoined/"),
        ("qcd", "/data/shared/Delphes/qcd_lepFilter_13TeV/pandas_unjoined/"),
    ]

    X_train, y_train = preprocessFromPandas_label_dir_pairs(
        label_dir_pairs, sample_start, num_samples, object_profiles, observ_types)
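
Since the order of the ObjectProfiles dictates the order of the X list, it can help to picture the layout of the result. The sketch below builds placeholder arrays only. It assumes, without confirmation from the library, that each entry of the X list has shape (number of events, max_size, number of observables); `profile_sizes`, `n_events`, and `X_sketch` are names invented for this illustration.

```python
import numpy as np

observ_types = ["E/c", "Px", "Py", "Pz", "Charge", "PT_ET", "Eta", "Phi", "Dxy_Ehad_Eem"]

# (name, max_size) for the first few profiles from the example
profile_sizes = [("Electron", 5), ("MuonTight", 5), ("Photon", 25), ("MissingET", 1)]

n_events = 4  # tiny stand-in for the real number of samples

# One placeholder array per profile, in the same order as the profiles
X_sketch = [np.zeros((n_events, max_size, len(observ_types)))
            for _, max_size in profile_sizes]

for (name, _), block in zip(profile_sizes, X_sketch):
    print(name, block.shape)
```

Under that assumption, the entry for "Photon" would hold up to 25 photons per event, each described by the 9 observables.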