Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alternate AnnotationData backend #149

Merged
merged 32 commits into from
May 19, 2017
Merged

alternate AnnotationData backend #149

merged 32 commits into from
May 19, 2017

Conversation

bmcfee
Copy link
Contributor

@bmcfee bmcfee commented May 9, 2017

This is work in progress, but implements an alternate backend for annotation data.

TODO:

  • trim
  • slice
  • deep serialization (eg for vector format)
  • pretty printing
  • purge all pandas references
  • settle on final object hierarchy / naming conventions
  • update converter scripts and documentation / examples

@bmcfee
Copy link
Contributor Author

bmcfee commented May 10, 2017

Okay, this PR is now functional in that all tests pass again. So if you never touched a JamsFrame object directly in application code (and there are few reasons to do so), then everything should just work.
That said, the design is up for discussion though, particularly seeking input from @justinsalamon and @ejhumphrey.

The new implementation replaces the JamsFrame object by an AnnotationData object, which at the schema level covers both DenseObservation and SparseObservationList, though not using JObjects because I don't think we should have sandbox support in the observation list.

Under the hood, AnnotationData encapsulates the observations with an ordered (by time) list of sparse observations. Observations are implemented as named tuples (time, duration, value, confidence), again not using JObject so that we can preserve slots and field ordering.

This namedtuple implementation also makes individual observations immutable, which I see as a Good Thing. If you want to modify observations, it's best to make a new annotation object and re-insert them after modification. Note that this is the reverse of how the DataFrame implementation worked: the individual observations there were mutable (it's easy to modify entries in-place), but the container itself was much more rigid (adding and deleting observations was difficult). I think the mutable-container/immutable-observation model makes a lot more sense for the common cases (e.g., adding observations to an existing annotation).

You can iterate over observations in the usual way:

for obs in ann.data:
    print(obs.time, obs.duration, obs.value, obs.confidence)

@justinsalamon and I kicked around the idea of having an export-to-dataframe method, which should be pretty trivial to cook up. Similarly, we might want to consider adding other mir_eval-style converters, such as to_event_values for instantaneous observations.

Pretty-printing is the last major hurdle, but I think we can get over that fairly easily.

@bmcfee bmcfee changed the title [WIP] alternate AnnotationData backend [RFC] alternate AnnotationData backend May 10, 2017
@bmcfee bmcfee added this to the 0.3.0 milestone May 10, 2017
jams/core.py Outdated

Fields must be `['time', 'duration', 'value', 'confidence']`.
class AnnotationData(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honest question: does inheriting directly from SortedListWithKey buy us anything? This object owning an obs object feels like an unnecessary abstraction, but I'm not sure if this would result in a weird interface.

thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, I'm thinking it's a bad idea to extend from other peoples' classes because it makes the API encapsulation boundary porous.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough ... I guess JamsFrame inherited from DataFrame, but maybe that was half the issue?

if we're not going to inherit from it, I'd propose making obs some level of private (dunders?) so that users are discouraged from using it as part of the API. However, I don't like having to wrap a bunch of pre-existing functionality ... SortedListWithKey is quite fully loaded.

Also worth mentioning here: What gets added to an AnnotationData object? Observations or the fields of an observation? or, said differently, who is responsible for creating the Observation object, the user or the AnnotationData object? because, if it's the former, I think that's an argument for inheritance.

related pop quiz: in Jams, who creates an Annotation when adding it to a JAMS object, the user or the owning object? (because we should follow that paradigm for consistency)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're not going to inherit from it, I'd propose making obs some level of private (dunders?) so that users are discouraged from using it as part of the API.

Yeah, I'd be down with that. I'd say sunder not dunder, ie data._obs.

Also worth mentioning here: What gets added to an AnnotationData object? Observations or the fields of an observation? or, said differently, who is responsible for creating the Observation object, the user or the AnnotationData object? because, if it's the former, I think that's an argument for inheritance.

It depends: if you're adding a bunch of records at once, it's typically in deserialized form (ie after a json parse), so it'd be list-of-dict/dict-of-list in.

The core add routine takes the fields for a single observation record and packs them in an Observation for adding.

I don't see what this has to do with inheritance though?

related pop quiz: in Jams, who creates an Annotation when adding it to a JAMS object, the user or the owning object? (because we should follow that paradigm for consistency)

Should we? I see the AnnotationData class as relatively private, because it diverges somewhat from the schema. Annotation provides an interface for adding observations, and that's all a user should need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this:

Another option is to drop AnnotationData entirely, keep ann.data as a sorted container object, and bubble up the rest of the functionality for adding observations, iterations, etc to Annotation. This would be a more radical change in the API than I'd originally intended, but it seems like maybe a good idea provided we're willing to break stuff anyway.

jams/core.py Outdated
return '<{}: {:d} observations>'.format(self.__class__.__name__,
len(self))

def __iter__(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example, we wouldn't need to implement this method, or L601

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or L604, i think?

jams/core.py Outdated
frame = frame[cls.fields()]
def add_observation(self, time=None, duration=None, value=None,
confidence=None):
idx = self.obs.bisect_key(time)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if nothing else, I think we can use SortedListWithKey directly (docs). I poked around, and calling add sorts items in-line, so self.obs.add(Observation(...)) should work.

additionally, i think if AnnotationData is a sublcass of SortedListWithKey, then we might subclass add directly? and maybe call it ObservationArray? idk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that. Maybe I did it wrong, but it doesn't allow you to insert items in a way that violates sort order.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, you can't insert or append if it violates order, but add should sort.

jams/core.py Outdated
value=values,
confidence=confidences)
else:
return [dict(time=o.time,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if Observation is an object rather than a NamedTuple, might we get some of this for free?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of what? I mean, we can always do dict(obs._asdict()), but that doesn't solve the dict-of-lists/list-of-dicts problem resulting from sparse/dense observation lists.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"some of what": the serialization waltz, but yea, sparse vs dense makes this tricky.

jams/core.py Outdated
@@ -1016,10 +892,10 @@ def trim(self, start_time, end_time, strict=False):
# We do this rather than copying and directly manipulating the
# annotation' data frame (which might be faster) since this way trim is
# independent of the internal data representation.
for idx, obs in self.data.iterrows():
for obs in self.data.obs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be keen to simplify this and make the object directly iterable, by a subclass or whatever

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait, it is. I'd remove .obs to dig in the preferred interaction paradigm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about Annotation or AnnotationData?

The latter is iterable (wrapping the sorted list); the former is not and wasn't previously. We could make it iterable, but that sounds like a change better suited to a separate PR.

@@ -137,12 +140,38 @@ def can_convert(annotation, target_namespace):
return False


def pop_data(annotation):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like this should be a method of AnnotationData?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just had that thought too. Will migrate in a bit.

@ejhumphrey
Copy link
Collaborator

well, first off, 💯 for a PR / RFC that deletes more code than it adds.

I do actually like that this gets us away from pandas (maybe as an optional dependency?), but we only trade dependencies. Is the development velocity on sortedcontainers lower than pandas, and that's why it's a win?

question: didn't pandas provide some neat wins for muda? shifting annotations / observations in time and whatnot?

@bmcfee
Copy link
Contributor Author

bmcfee commented May 10, 2017

I do actually like that this gets us away from pandas (maybe as an optional dependency?), but we only trade dependencies. Is the development velocity on sortedcontainers lower than pandas, and that's why it's a win?

I think we still need pandas as a dependency for loading lab/csv files (utils, import scripts, etc). But it's good to get the hacky dataframe dependencies out from under the jams object hierarchy.

Sortedcontainers is a much simpler project, and seems reasonably mature at this point.

question: didn't pandas provide some neat wins for muda? shifting annotations / observations in time and whatnot?

It did, but that's not a huge deal. It's better IMO to make observations immutable and put a little more friction into annotation manipulation. Muda will need a little retrofitting, but it's probably about an afternoon's worth of work.

@bmcfee
Copy link
Contributor Author

bmcfee commented May 10, 2017

Note: coverage is down here because I added html rendering for AnnotationData. I don't think unit tests are critical on that function.

@justinsalamon
Copy link
Contributor

Note: coverage is down here because I added html rendering for AnnotationData.

Would it be a good/bad idea, since we still have the pandas dependency anyhow, to do viz by exporting to a dataframe and then using the pandas built in viz? This could be nice in that you could use the variety of viz options offered by the pandas API for dataframes (e.g. peak(), head(), etc.)

@bmcfee
Copy link
Contributor Author

bmcfee commented May 10, 2017

Would it be a good/bad idea, since we still have the pandas dependency anyhow, to do viz by exporting to a dataframe and then using the pandas built in viz?

I don't think we should rely on it, but it's always an option. The dataframe converter is done anyway, so you can do

df = ann.data.to_dataframe()
df.head(10)

etc.

@bmcfee
Copy link
Contributor Author

bmcfee commented May 13, 2017

Amending my previous statement: I don't think we can consistently make Annotation implement the Collection ABC because it would change the semantics inherited from JObject for __len__ and __contains__, which both apply to the object's __dict__. OTOH there's no problem in just implementing __iter__.

@bmcfee bmcfee mentioned this pull request May 13, 2017
@bmcfee
Copy link
Contributor Author

bmcfee commented May 15, 2017

@ejhumphrey I think this one's stable for eyeballs again.

Coverage drop is primarily due to html rendering not being covered, but I'd prefer to push that to #93 and implement it across the board.

@bmcfee bmcfee changed the title [RFC] alternate AnnotationData backend [CR needed] alternate AnnotationData backend May 17, 2017
@@ -159,7 +165,13 @@ def note_midi_to_hz(annotation):
'''Convert a pitch_midi annotation to pitch_hz'''

annotation.namespace = 'note_hz'
annotation.data.value = 440 * (2.0 ** ((annotation.data.value - 69.0)/12.0))
data = annotation.pop_data()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can neither articulate why nor propose something "better", but the semantics of this feel weird for some reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you object to the operation itself, or just the naming?

'value': None,
'confdence': None}
# Bypass the safety checks in append
ann.data.add(Observation(time=data['time'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks so much nicer

@ejhumphrey
Copy link
Collaborator

Couple thoughts, minor apologies if / when relevance is a stretch:

  • Not concerned with the coverage drop; does coveralls have a setting that says "don't worry about drops if the total coverage is over a certain level" (say, 98?)
  • It has since occurred to me that all of my other projects using jams have things like for idx, obs in ann.data.iterrows() ... is this a common pattern (or am I weird); and if it is common, are we throwing deprecation warnings for this usage? or is it gone outright?

@bmcfee
Copy link
Contributor Author

bmcfee commented May 19, 2017

Not concerned with the coverage drop; does coveralls have a setting that says "don't worry about drops if the total coverage is over a certain level" (say, 98?)

I'd prefer to keep the strict coverage check in place, just so that we're aware of it at all times.

It has since occurred to me that all of my other projects using jams have things like for idx, obs in ann.data.iterrows() ... is this a common pattern (or am I weird); and if it is common, are we throwing deprecation warnings for this usage? or is it gone outright?

Gone outright: any previous access to the data field is not guaranteed to work in general. We throw deprecation warnings for some things in 0.2.3, but it was too much of a mess to do it in full generality without also flagging internal access that was going to change anyway.

The idx, obs pattern can be recuvered by

for idx, obs in enumerate(ann):
    ...

though there's almost never a good reason to want the index itself.

@bmcfee
Copy link
Contributor Author

bmcfee commented May 19, 2017

Hashed out in person with @ejhumphrey -- this one's good to go.

@bmcfee bmcfee merged commit 237d7c1 into master May 19, 2017
@bmcfee bmcfee changed the title [CR needed] alternate AnnotationData backend alternate AnnotationData backend May 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants