Preprocessing
DannyWeitekamp edited this page Aug 23, 2016
·
7 revisions
#Preprocessing Pandas Tables -> numpy Training Data
- ObjectProfile
- preprocessFromPandas_label_dir_pairs
- [Examples](https://github.com/DannyWeitekamp/CMS_SURF_2016/wiki/Preprocessing#examples)
##ObjectProfile
An object containing processing instructions for each observable object.
Arguments:
- name - The name of the data type (e.g. Electron, Photon, EFlowTrack)
- max_size - The maximum number of objects to use in training
- sort_columns - The columns to sort on (see pandas.DataFrame.sort, named sort_values in newer pandas releases)
- sort_ascending - Whether each column is sorted ascending or descending (see pandas.DataFrame.sort)
- query - A selection query string applied before the data is truncated (see pandas.DataFrame.query)
- shuffle - Whether or not to shuffle the data
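To make the arguments above concrete, here is a minimal sketch of how one profile's instructions could be applied to a single table of objects. It is not the library's implementation: `apply_profile` is a hypothetical helper, and the order of steps (query, sort, optional shuffle, truncate to `max_size`) and the zero padding of short tables are assumptions. Only "query before truncating" comes from the description above.

```python
import numpy as np
import pandas as pd


def apply_profile(objects, observ_types, max_size, sort_columns=None,
                  sort_ascending=True, query=None, shuffle=False, seed=None):
    """Hypothetical helper: turn one table of objects into a fixed-size array."""
    if query is not None:
        # Select rows first, so truncation only sees objects that pass the cut
        objects = objects.query(query)
    if sort_columns is not None:
        # sort_values is the newer name for the old DataFrame.sort
        objects = objects.sort_values(by=sort_columns, ascending=sort_ascending)
    if shuffle:
        # Assumption: shuffling happens after sorting and therefore overrides it
        objects = objects.sample(frac=1, random_state=seed)
    # Keep only the requested columns and at most max_size rows
    values = objects[observ_types].to_numpy(dtype=float)[:max_size]
    # Assumption: shorter tables are zero-padded up to max_size
    padded = np.zeros((max_size, len(observ_types)))
    padded[:len(values)] = values
    return padded


# Toy table standing in for the objects of one event
tracks = pd.DataFrame({"PT_ET": [5.0, 40.0, 12.0, 0.5],
                       "Eta": [0.1, -1.2, 2.0, 0.3]})
x = apply_profile(tracks, ["PT_ET", "Eta"], max_size=4,
                  sort_columns=["PT_ET"], sort_ascending=False,
                  query="PT_ET > 1")
```

Here the query drops the softest track, the remaining three are ordered by descending `PT_ET`, and the fourth row of `x` is padding.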
##preprocessFromPandas_label_dir_pairs
Gets training data from folders of pandas tables.
Arguments:
- label_dir_pairs - A list of tuples of the form (label, directory), where each directory contains tables whose data are all of the same event type
- start - Where to start reading (as if all of the files were part of one long list)
- num_samples - The number of samples to read
- object_profiles - A list of ObjectProfile(s), one for each type of observable object, describing its preprocessing steps. The order of the ObjectProfiles in this list dictates the order of the input X list.
- observ_types - The column headers for the data to be read from the pandas table
Returns:
Training data with its corresponding labels, as the tuple
(X_train, Y_train)
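The `start` and `num_samples` arguments treat the files as one long list. A minimal sketch of that indexing idea follows. `plan_reads` and the file lengths are hypothetical, and whether the library applies the range per directory or across all directories is not stated here.

```python
def plan_reads(file_lengths, start, num_samples):
    """Map the global range [start, start + num_samples) onto per-file row slices.

    Returns a list of (file_index, local_start, local_stop) tuples.
    """
    plan = []
    offset = 0                      # global index of the current file's first row
    stop = start + num_samples      # global index one past the last wanted row
    for index, length in enumerate(file_lengths):
        lo = max(start, offset)
        hi = min(stop, offset + length)
        if lo < hi:                 # this file overlaps the requested range
            plan.append((index, lo - offset, hi - offset))
        offset += length
    return plan


# Three files holding 100, 100 and 50 rows; read 80 rows starting at row 150
plan = plan_reads([100, 100, 50], start=150, num_samples=80)
```

With these numbers the first file is skipped entirely, rows 50 to 99 come from the second file, and rows 0 to 29 come from the third.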
##Examples
    observ_types = ['E/c', 'Px', 'Py', 'Pz', 'Charge', "PT_ET", "Eta", "Phi", "Dxy_Ehad_Eem"]
    sample_start = 0
    num_samples = 10000
    object_profiles = [ObjectProfile("Electron", 5),
                       ObjectProfile("MuonTight", 5),
                       ObjectProfile("Photon", 25),
                       ObjectProfile("MissingET", 1),
                       ObjectProfile("EFlowPhoton", 1000, sort_columns=["PT_ET"], sort_ascending=False),  #1300
                       ObjectProfile("EFlowNeutralHadron", 1000, sort_columns=["PT_ET"], sort_ascending=False),  #1000
                       ObjectProfile("EFlowTrack", 1000, sort_columns=["PT_ET"], sort_ascending=False)]  #1050
    label_dir_pairs = [
        ("ttbar", "/data/shared/Delphes/ttbar_lepFilter_13TeV/pandas_unjoined/"),
        ("wjet", "/data/shared/Delphes/wjets_lepFilter_13TeV/pandas_unjoined/"),
        ("qcd", "/data/shared/Delphes/qcd_lepFilter_13TeV/pandas_unjoined/"),
    ]
    X_train, y_train = preprocessFromPandas_label_dir_pairs(label_dir_pairs, sample_start, num_samples,
                                                            object_profiles, observ_types)
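Since the order of `object_profiles` dictates the order of the X list, each returned input can be paired with its profile name for a quick check. The sketch below uses dummy arrays in place of real output. The per-input shape (events, max_size, number of observables) is an assumption, and `describe_inputs` is a hypothetical helper.

```python
import numpy as np

# (name, max_size) for the first three profiles in the example above
profile_sizes = [("Electron", 5), ("MuonTight", 5), ("Photon", 25)]
n_observables = 9   # length of observ_types in the example above
n_events = 4        # arbitrary number of dummy events

# Dummy stand-in for X_train: one array per profile, in profile order
# (assumed shape: events x max_size x observables)
X_dummy = [np.zeros((n_events, size, n_observables)) for _, size in profile_sizes]


def describe_inputs(profile_sizes, X):
    """Pair each profile name with the shape of the matching entry of X."""
    return {name: array.shape for (name, _), array in zip(profile_sizes, X)}


shapes = describe_inputs(profile_sizes, X_dummy)
for name, shape in shapes.items():
    print(name, shape)
```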