Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to combine filters for Samples and Regions #307

Open
nhartman94 opened this issue Dec 3, 2021 · 3 comments · May be fixed by #310
Open

Option to combine filters for Samples and Regions #307

nhartman94 opened this issue Dec 3, 2021 · 3 comments · May be fixed by #310
Labels
config Affects configuration schema enhancement New feature or request

Comments

@nhartman94
Copy link

Hi @alexander-held ,

I watched your pyhep tutorial recently, and was super enthusiastic by how streamlined this package is!

I started playing with it for my analysis, and I wanted to try a study where I wanted to apply some different filters for the individual samples as well as an analysis categorization to define a regions.

I can see where to modify the _filter class to implement an "&" the two filters if both the Region and Sample Filters are provided, but I was wondering, would it maybe be useful to submit a pull request in case others are interested in this functionality in the future? Or am I not correctly understanding how I should be using these Filters?

Thanks so much!

Best,
Nicole

@nhartman94 nhartman94 changed the title Separate filters for Samples and Regions Option to combine filters for Samples and Regions Dec 3, 2021
@alexander-held alexander-held added config Affects configuration schema enhancement New feature or request labels Dec 6, 2021
@alexander-held
Copy link
Member

Hi Nicole,

Thanks for bringing this up! The current limitation is indeed quite limiting, and it would be great to have something a bit more flexible. The general idea behind the implementation of the filter is to have something flexible that does not require a lot of different configuration options to work. At the moment it works like this:

  • a filter can be set per region,
  • if a filter is also specified per sample, the filter at the sample level completely overrides the setting at the region level,
  • if a filter is specified for a systematic variation in an up/down template, that filter also completely overrides anything at the region/sample level.

The one convenient thing about this is that there's only a single option, and (I believe) this is in principle general enough to implement any use case, albeit it can get very clunky. If you have for example an extra filter you want to apply per sample, and also multiple regions with their own filter, you would have to split the sample up into many different samples (one per region) and implement the combined region + sample filter for each of the new per-region versions of the sample. The behavior at the moment is strictly overriding the higher-level filter, with no possibility to append.

I think for practical use cases it would be useful for systematics to be able to both override filters and append to them, and similarly for all things at a lower level (general -> region -> sample -> systematic) to both be able to override and append. Here are some thoughts of what could be useful to have:

  • a generic filter setting at the highest level (under General) to apply to everything (ideally with the possibility to override per region/sample/systematic?),
  • a region-specific filter (ideally with the possibility to override per sample/systematic?),
  • a sample-specific filter (ideally with the possibility to override per systematic?),
  • a systematic-specific filter that allows to append to the other filters (besides allowing for overriding them too).

I am not sure how to best go about providing something flexible that can still be reasonably easy to understand and does not require a ton of different settings to remember. Any ideas for a design here are much appreciated!

For the specific case outlined above, we could think about splitting Filter into something like RegionFilter and SampleFilter and combine them both, and then also support both of these settings in the systematic templates with the possibility to override them both individually (and we could also add a high-level Filter at the General level too maybe). That might be a good short term solution.

To also support appending to a filter (e.g. at the systematic level), I think a slightly different approach might be needed. With YAML configs there's probably some way with anchors to do that, and it could also be done when building the config with Python. Maybe there's a better way though?

@alexander-held
Copy link
Member

alexander-held commented Dec 6, 2021

One way to implement this that may be quite general and possibly also fairly readable is the following. We stick with a single option Filter, which takes a dictionary instead of a string. The dictionary keys are user-defined, and the values are strings (the actual filter instructions). When building a filter for a given region, sample, and systematic variation (or the nominal template), the dictionaries from the general level, the region level, the sample level, and the systematic template (if not nominal) are combined. If a key is used more than once, the value is taken from the lowest-level implementation (general -> region -> sample -> systematic). Otherwise all values are logically added together.

That would allow users to specify themselves filters that can be overridden (and overriding then happens by using the same key again later on), and to append to existing filters (using a new key). Here is a schematic example:

General:
  Filter: {"general": "jet_pt > 50"}

Region:
  - Name: "SR"
    Filter: {"SR_filter": "nLeptons == 4"}

Sample:
  - Name: "ttbar"
    Filter: {"ttbar_filter": "DSID == 410000", "general": "jet_pt > 40"}

  - Name: "Z+jets"

The actual strings used here for the keys in the dictionaries don't matter, the only thing that matters is whether two keys are the same or not. For ttbar, the filter would be (jet_pt > 40)*(nLeptons == 4)*(DSID == 410000) (the General filter with the cut of 50 was overridden), for Z+jets it would be (jet_pt > 50)*(nLeptons == 4).

Edit: the dictionaries can also be written differently, e.g. like this

Filter:
  ttbar_filter: "DSID == 410000"
  general: "jet_pt > 40"

which may further enhance readability.

This approach clashes in one aspect with the current config design: at the moment, keys are known in advance and specified by the JSON schema. This would allow the introduction of arbitrary keys for this specific option. To avoid this, the filter could be a list of dictionaries, with each dictionary in that list having a Name and a Filter option (or similar, probably not ideal to re-use Filter):

Filter:
  - Name: "ttbar_filter"
    Filter: "DSID == 410000"
  - Name: "general"
    Filter: "jet_pt > 40"

That results in a more consistent format at the cost of being more verbose.

@alexander-held
Copy link
Member

@nhartman94 #310 implements a first attempt at a dict-based approach. Filter is now called Filters, and takes either a dictionary or a list of dictionaries. Those dictionaries need a Name property (with an arbitrary string) and the actual Filter (a string). When constructing a filter for a given template histograms, all filters with different names are combined, and for filters with the same name the filter at the lowest level is chosen (systematics override everything, samples override regions). I plan to also add a way to specify a filter at the highest level in the general block, but that's not in #310 yet. If you have the chance to give this a try, any feedback about the approach would be much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
config Affects configuration schema enhancement New feature or request
Projects
None yet
2 participants