LambdaTransformer (Scaper compatibility) #127

Open
wants to merge 21 commits into
base: main
from

Conversation

beasteers
Contributor

What does this implement/fix? Explain your changes.

This adds a general purpose transformer that can be used to load and transform arbitrary observations with Pump.

I built this for the purpose of extracting Scaper annotations, but it's not Scaper specific.

Here's a super simple example:

trans = pumpp.task.LambdaTransformer(
        name='scaper', namespace='scaper',
        fields=['snr', 'label'], 
        query={'role': 'foreground'})

Assuming a 5 second sample with a single 1 second event 2 seconds in, the data dict would look like this:

{
    'scaper/snr': [np.nan, np.nan, 10, np.nan, np.nan],
    'scaper/label': ['', '', 'something', '', '']
}
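As a rough illustration of where those values come from, the sketch below fills per-frame lists from a list of event dicts. It assumes one frame per second, and both the helper `frame_events` and the event layout (`time`, `duration`, `value`) are hypothetical stand-ins, not pumpp API; the real framing would be driven by the pump's sample rate and hop length.

```python
import numpy as np

def frame_events(events, duration, fill, namespace, frame_rate=1.0):
    """Hypothetical helper (not pumpp API): spread event values over frames.

    `fill` maps each field name to the value used for frames with no event.
    """
    n_frames = int(round(duration * frame_rate))
    # Start every field with its fill value in every frame
    data = {f"{namespace}/{field}": [empty] * n_frames
            for field, empty in fill.items()}
    for event in events:
        start = int(np.floor(event['time'] * frame_rate))
        stop = int(np.ceil((event['time'] + event['duration']) * frame_rate))
        # Write the event's values into each frame it covers
        for i in range(start, min(stop, n_frames)):
            for field in fill:
                data[f"{namespace}/{field}"][i] = event['value'][field]
    return data

# A 5 second sample with one 1 second event starting at 2 seconds
events = [{'time': 2.0, 'duration': 1.0,
           'value': {'snr': 10, 'label': 'something'}}]
data = frame_events(events, duration=5.0,
                    fill={'snr': np.nan, 'label': ''}, namespace='scaper')
```

With these assumptions, only the frame at index 2 carries the event's values and every other frame keeps its fill value.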

Filtering Observations

query accepts a wide range of values. Its type should roughly mirror the type of the observation value, and it is applied recursively through dicts and lists. If the observation value is a dict and you want to filter on its keys, pass a dict whose keys match the value's keys. If the value is a list and you want element-wise conditions, pass a list. If the value is a single string, pass a string. You can also pass a set to check membership for hashable types.

At any level, you can use a callable instead, and it will be passed the data at that level.

The query fails if any condition is False.

Here are some valid query examples:

query={'label': 'AAA|BBB'} # either AAA or BBB
query={'label': lambda label: 'A' in label, 'role': 'foreground'} # arbitrary condition
query={'pitch_shift': lambda pitch: pitch and pitch > 3} # whatever you want
query={'pitch_shift': {1, 2}} # pitch_shift is in set
query=lambda d: d['pitch_shift'] == -3 or 'Thunk' in d['label'] # callable gets the full dict

# Or say observation values are just strings
query=lambda label: 'Dog' in label
query={'Dog bark', 'Hum', 'Honk'} # label is in set
query='[^0-9]+'

# Or a list field of shape (None, 4,)
query=[5, 6, lambda x: x // 2 < 25, {8, 9, 10}] # mix and match
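To make the recursion concrete, here is a minimal sketch of a matcher with the behavior described above. It is not the PR's implementation: the function name `matches` is hypothetical, and treating string queries as regular expressions that must match the whole value (`re.fullmatch`) is an assumption, as is requiring list queries to have the same length as the value.

```python
import re

def matches(query, value):
    """Hypothetical matcher (not the PR's code): does `value` satisfy `query`?"""
    if callable(query):
        # A callable receives whatever data sits at this level
        return bool(query(value))
    if isinstance(query, dict):
        # Every queried key must be present and its sub-query must match
        return isinstance(value, dict) and all(
            key in value and matches(sub, value[key])
            for key, sub in query.items())
    if isinstance(query, (set, frozenset)):
        # Membership test; `value` must be hashable
        return value in query
    if isinstance(query, (list, tuple)):
        # Element-wise conditions (assumption: lengths must agree)
        return (isinstance(value, (list, tuple))
                and len(query) == len(value)
                and all(matches(q, v) for q, v in zip(query, value)))
    if isinstance(query, str):
        # Assumption: strings are regex patterns matched against the whole value
        return isinstance(value, str) and re.fullmatch(query, value) is not None
    # Anything else falls back to plain equality
    return query == value

obs = {'label': 'AAA', 'role': 'foreground', 'pitch_shift': 2}
ok_regex = matches({'label': 'AAA|BBB'}, obs)
ok_set = matches({'pitch_shift': {1, 2}}, obs)
bad_role = matches({'label': lambda label: 'A' in label, 'role': 'background'}, obs)
ok_list = matches([5, 6, lambda x: x // 2 < 25, {8, 9, 10}], [5, 6, 40, 9])
```

Because a dict query is an `all(...)` over its keys, a single failing condition (such as `role` above) rejects the whole observation.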

Aggregating interval windows

You can also control how the events in each frame are combined. reduce(x) is called once per hop window with a list of all the events that fall within that window.

trans = pumpp.task.LambdaTransformer(
        name='scaper', namespace='scaper',
        fields=[
            'label', # from schema
            ('mean_time_stretch', (None,1), np.float64), # custom field
            ('all_time_stretch', (None,None), np.float64), # variable number of events
        ], 
        query={'role': 'foreground'},
        multi=True, reduce=lambda events: {
            'label': ','.join(set(e['label'] for e in events)),
            'mean_time_stretch': np.mean([e['time_stretch'] for e in events]),
            'all_time_stretch': [e['time_stretch'] for e in events],
        })
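The windowing can be pictured with the sketch below, which collects the events overlapping each hop window and hands that list to `reduce`. The helper `reduce_by_window`, the event layout, and the overlap rule (any partial overlap counts) are assumptions for illustration; the PR may treat window edges and empty windows differently.

```python
import numpy as np

def reduce_by_window(events, duration, hop, reduce):
    """Hypothetical helper: call `reduce` on the events overlapping each hop window."""
    n_windows = int(np.ceil(duration / hop))
    out = []
    for i in range(n_windows):
        start, end = i * hop, (i + 1) * hop
        # Keep events whose [time, time + duration) span overlaps [start, end)
        window = [e for e in events
                  if e['time'] < end and e['time'] + e['duration'] > start]
        out.append(reduce(window))
    return out

events = [
    {'time': 0.5, 'duration': 1.0, 'label': 'a', 'time_stretch': 1.0},
    {'time': 1.2, 'duration': 0.5, 'label': 'b', 'time_stretch': 2.0},
]

def mean_stretch(window):
    # Guard against empty windows, where np.mean would warn and return NaN
    if not window:
        return np.nan
    return float(np.mean([e['time_stretch'] for e in window]))

stretches = reduce_by_window(events, duration=3.0, hop=1.0, reduce=mean_stretch)
```

Here the first window sees only event `a`, the second sees both events, and the third is empty, so a `reduce` function generally needs to handle an empty list.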

Real-life usage

And finally, here's how I'm currently using it:

pumpp.task.LambdaTransformer(
    'scaper', 'scaper',
    ['snr', 'label', 'source_file', ('fault', (None, 1), np.bool_)],
    query={'label': fault_label},
    reduce=lambda e: dict(e or {}, fault=e and ~np.isnan(e['snr']))
)

Any other comments?

As of right now, the all_time_stretch field won't work with a slicer because every None dimension is interpreted as a time dimension. I see how this makes sense for the structure transformer, but I'm not sure how to reconcile it with returning array values. Maybe it's never really necessary, but part of me thinks it would be a nice option to have (returning an array for each interval) if we want to support as many use cases as possible.

This could also probably use some more safeguards preventing ppl from doing bad things, but atm I'm not sure what those would be so for now, I think it's okay to leave things open ended.

@beasteers
Contributor Author

Is there anything else in this PR that needs high-level commentary before i dig in for a proper CR?

I don't think so? Let me know if things need clarification

@beasteers
Contributor Author

beasteers commented Oct 8, 2019

I divided it into separate commits after getting low-key shamed during the marl meeting. 😝

(Justin has told me that I should squash commits when contributing. my bad)

@beasteers
Contributor Author

One thing I want to add is a static parameter that returns a single value for the entire annotation. This would be useful for extracting the background source_file, for example.

@beasteers
Contributor Author

It'd also be good to be able to gather arbitrary sandbox data as well. I'm not sure if this fits in the scope of this transformer or if it'd be better to create a simpler, dedicated transformer for that purpose.

@bmcfee bmcfee added this to the 0.6.0 milestone Apr 14, 2022
@codecov-commenter

codecov-commenter commented Apr 14, 2022


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.71%. Comparing base (4a67bdf) to head (c588f58).
Report is 5 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #127      +/-   ##
==========================================
+ Coverage   99.69%   99.71%   +0.02%     
==========================================
  Files          22       24       +2     
  Lines        1299     1425     +126     
==========================================
+ Hits         1295     1421     +126     
  Misses          4        4              
Flag Coverage Δ
unittests 99.71% <100.00%> (+0.02%) ⬆️


@bmcfee bmcfee removed this from the 0.6.0 milestone Apr 14, 2022