Project starter #1

Open
thorwhalen opened this issue Jun 18, 2021 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

@thorwhalen
Member

thorwhalen commented Jun 18, 2021

Previous (old) issue: i2mint/py2store#40 on the subject.

We'll restrict ourselves, for now, to the following types of data (in order of priority):

  • fixed-rate fvs (i.e. fixed size numerical vectors).
  • fixed-rate snips (i.e. uni-dimensional sequences of non-negative integers, so we can use uint8, uint16, uint32...)
  • multi-dimensional non-regular numerical timeseries (here, should use one channel to represent timestamps)

These cover most of the pragmatic ground, but it would be useful to have a few data converters to easily cast data to the forms accepted by the serializers. For this we need to (1) decide what our "normal" input format is and (2) set things up so we can add a growing number of types we'll translate from. For the normal input format, I'd say the first argument is always an iterable of numbers or of fixed-size sequences (say, tuples or lists) of numbers. Additionally, there'll be some optional arguments to describe things about the data (to avoid having to peek) and other metadata (that might not be the data's concern, but is the serializer's).
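To make the "normal" input format concrete, here is a minimal sketch of a converter that turns an iterable of numbers, or of fixed-size sequences of numbers, into uniform tuples. The name `normalize_rows` and the choice of tuples as the row type are assumptions for illustration, not something already decided in this issue.

```python
from numbers import Number
from typing import Iterable, Iterator, Sequence, Tuple, Union

Row = Union[Number, Sequence[Number]]


def normalize_rows(data: Iterable[Row]) -> Iterator[Tuple[Number, ...]]:
    """Yield each item as a tuple and check that all rows have the same length.

    Hypothetical helper: a bare number becomes a 1-tuple, a sequence becomes
    a tuple. The values themselves are not checked for being numeric.
    """
    n_channels = None
    for item in data:
        row = (item,) if isinstance(item, Number) else tuple(item)
        if n_channels is None:
            n_channels = len(row)  # the first row fixes the expected size
        elif len(row) != n_channels:
            raise ValueError(
                f"expected {n_channels} values per row, got {len(row)}"
            )
        yield row
```

Because it is a generator, it works on streams as well as on lists, and the size check only happens as rows are consumed.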

Serializers should go from (data, [meta]) to bytes, and deserializers from bytes to (data, [meta]).
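As a rough sketch of that contract for fixed-size rows, the standard `struct` module is enough. The function names and the `n_channels` / `fmt_char` parameters are assumptions; here the meta is passed out-of-band on both sides rather than embedded in the bytes, which a real deserializer returning (data, [meta]) would need to handle.

```python
import struct
from typing import Iterable, List, Sequence, Tuple


def serialize(rows: Iterable[Sequence], n_channels: int, fmt_char: str = "d") -> bytes:
    """Pack fixed-size rows into bytes, one little-endian struct per row.

    fmt_char is a struct format character, e.g. "d" (float64) for fvs
    or "H" (uint16) for snips.
    """
    row_struct = struct.Struct(f"<{n_channels}{fmt_char}")
    return b"".join(row_struct.pack(*row) for row in rows)


def deserialize(b: bytes, n_channels: int, fmt_char: str = "d") -> List[Tuple]:
    """Inverse of serialize: unpack bytes into a list of row tuples."""
    row_struct = struct.Struct(f"<{n_channels}{fmt_char}")
    return list(row_struct.iter_unpack(b))
```

For example, `deserialize(serialize(rows, 2), 2)` gives back the rows as tuples, and snips can be stored with `fmt_char="H"` at two bytes per value.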

The persistence (saving the bytes) is a separate concern.

Within the persistence concern is a concern of indexing -- namely according to time. Here, we need to tackle aspects such as block storage:

  • dividing into pieces, and bringing pieces together to recreate the whole
  • external and internal indexing to be able to satisfy interval queries (give me the data between bt and tt timestamps)
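The two aspects above can be sketched together on in-memory data. This is a toy illustration under stated assumptions: rows are `(timestamp, value, ...)` tuples sorted by timestamp, the interval is half-open (`bt <= timestamp < tt`), and the function names are hypothetical. The "external index" here is just the list of first timestamps of each block.

```python
from bisect import bisect_left, bisect_right


def split_into_blocks(rows, block_size):
    """Divide a list of rows into consecutive blocks of at most block_size rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def join_blocks(blocks):
    """Bring the pieces back together to recreate the whole."""
    return [row for block in blocks for row in block]


def query_interval(blocks, bt, tt):
    """Return the rows whose timestamp (row[0]) satisfies bt <= timestamp < tt."""
    # External index: the first timestamp of every block.
    block_starts = [block[0][0] for block in blocks]
    # First block that can contain bt, and first block starting at or after tt.
    first = max(bisect_right(block_starts, bt) - 1, 0)
    last = bisect_left(block_starts, tt)
    # Internal filtering: only the selected blocks are scanned.
    return [
        row
        for block in blocks[first:last]
        for row in block
        if bt <= row[0] < tt
    ]
```

Only the blocks that can overlap the interval are opened, which is the point of keeping an index outside the blocks.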

Note: When timeseries are created by fixed-step chunkers, the data rate is determined by the chunk step (NOT the chunk size).
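A small sketch of why the step sets the rate: the number of chunks produced depends on how far the window advances each time, not on how wide it is. The names `chk_size`, `chk_step` and `sample_rate` are assumptions used for illustration.

```python
def fixed_step_chunks(seq, chk_size, chk_step):
    """Yield windows of chk_size items, advancing chk_step items each time."""
    for start in range(0, len(seq) - chk_size + 1, chk_step):
        yield seq[start:start + chk_size]


def chunk_rate(sample_rate, chk_step):
    """Chunks per second produced from a signal sampled at sample_rate."""
    return sample_rate / chk_step
```

With ten samples and a step of 2, a window of size 4 gives four chunks and a window of size 2 gives five: nearly the same count, driven by the step, with the size only affecting the last edge.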

Session-block storage

data_provider = get_timeseries(src, bt, tt)

# alternative:
my_data = mk_timeseries_interval_accessor(src, ...)
data_provider = my_data[bt:tt]

# and then use...
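One way the `my_data[bt:tt]` form could be supported is a thin wrapper that translates a slice into a call to an interval function. `IntervalAccessor` and `mk_list_source` are hypothetical names, and the in-memory source only stands in for whatever `src` really is; this is a sketch, not the design of `mk_timeseries_interval_accessor`.

```python
class IntervalAccessor:
    """Expose get_interval(bt, tt) through slice syntax: accessor[bt:tt]."""

    def __init__(self, get_interval):
        self._get_interval = get_interval

    def __getitem__(self, k):
        if not isinstance(k, slice) or k.step is not None:
            raise TypeError("expected a bt:tt slice without a step")
        # k.start / k.stop are None when the bound is left open, e.g. acc[:tt]
        return self._get_interval(k.start, k.stop)


def mk_list_source(rows):
    """Toy source: rows are (timestamp, ...) tuples held in memory."""

    def get_interval(bt, tt):
        return [
            row
            for row in rows
            if (bt is None or row[0] >= bt) and (tt is None or row[0] < tt)
        ]

    return get_interval
```

Usage would look like `IntervalAccessor(mk_list_source(rows))[1:2]`, which returns the rows with timestamps in `[1, 2)`.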

Resources to check out

Would be good to find some resources on codec design and formats. Something describing a principled approach so we don't have to reinvent the wheel, and we can use the words the community uses to describe things (such as "codec", "container", "header", etc). Here are a few things:

It seems IFF/TLV could be used for headers. But it seems wasteful to use this for our chunks/frames, since we are only dealing with fixed structure in our case.
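For reference, a generic type-length-value record can be sketched with `struct` as below. The field widths (uint16 type, uint32 length, little-endian) are arbitrary assumptions and do not follow the IFF layout; the point is only to show the 6 bytes of overhead per record, which is why it suits a header better than every fixed-size frame.

```python
import struct

# Assumed layout: uint16 type tag, uint32 value length, little-endian, no padding.
TL = struct.Struct("<HI")


def tlv_pack(tag: int, value: bytes) -> bytes:
    """Encode one record as type, length, then the raw value bytes."""
    return TL.pack(tag, len(value)) + value


def tlv_unpack_all(b: bytes):
    """Decode a buffer made only of TLV records into a list of (tag, value)."""
    records = []
    offset = 0
    while offset < len(b):
        tag, length = TL.unpack_from(b, offset)
        offset += TL.size
        records.append((tag, b[offset:offset + length]))
        offset += length
    return records
```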

Builtins

Might want to check this codec module out.

And for sure, we'll need the struct module.

The following probably use the above (not sure though):

wave

chunk

More like this in this list

Third party

soundfile

our stuff

Peek at an iterator -- useful when you want to see what format its elements have (without consuming it)
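A common way to peek, sketched here with `itertools.chain` (this may differ from the helper referred to above): take the first element, then hand back an iterator that still yields it.

```python
from itertools import chain


def peek(iterable):
    """Return (first_item, iterator) where iterator still yields first_item."""
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("cannot peek at an empty iterable") from None
    return first, chain([first], it)
```

For instance, `first, rows = peek(rows)` followed by `len(first)` gives the number of channels of multi-channel rows, and `rows` can still be consumed from the start.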

Somewhat related modules of ours:
https://github.com/otosense/hear/blob/3a87757d3fd094c8834c6f10e845d5a45592e026/hear/regular_panel_data.py
https://github.com/i2mint/py2store/blob/813bb853a28be1eef6454b9d7d9be5ddb1f9b7b1/py2store/utils/affine_conversion.py
https://github.com/i2mint/py2store/blob/6d525784c9212a4839c42dfbc4fb9d427b959811/py2store/utils/timeseries_caching.py

https://github.com/otosense/hear/blob/3a87757d3fd094c8834c6f10e845d5a45592e026/hear/session_block_stores.py
https://github.com/otosense/hear/blob/3a87757d3fd094c8834c6f10e845d5a45592e026/hear/stores.py

┆Issue is synchronized with this Asana task by Unito

@thorwhalen thorwhalen added the documentation Improvements or additions to documentation label Jun 18, 2021
@owen7lloyd
Collaborator

Notes on things to do now:

  • struct.iter_unpack - does it do anything for us (might chunk things for us)
  • comments / uncle bob refactoring
  • look at codec, what is it about, how can we use it
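On the first point, a quick sketch of what `struct.iter_unpack` does: it walks a buffer in steps of the struct's size and yields one tuple per step, so it effectively chunks fixed-size frames for us. The two-channel int16 frame format is just an example.

```python
import struct

# Example frame: two little-endian int16 channels, 4 bytes per frame.
frame = struct.Struct("<2h")
raw = frame.pack(1, -1) + frame.pack(2, -2) + frame.pack(3, -3)

# One tuple per 4-byte frame, produced lazily.
frames = list(frame.iter_unpack(raw))

# The buffer length must be a multiple of frame.size, otherwise
# iter_unpack raises struct.error.
try:
    list(frame.iter_unpack(raw[:-1]))
    partial_buffer_ok = True
except struct.error:
    partial_buffer_ok = False
```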

Looking forward:

  • replicate the main functionality of soundfile (i.e. read and write of audio)
  • add struct to the dicts in util
  • find the name of this (fixed size of chunks with fixed size of bytes)
  • try looking at raw video format
  • implicitly make functionality easier for user:
    • use peek to make the interface easier (find the number of channels, format, etc.)
      • function takes first sample and decides which format to apply (user can override)
  • deal with rows of a dataframe (list of dicts)
  • serialize the meta into a header
  • express the meta as a normalized dict, pickle it, and use the pickle as the header;
    use a type and length prefix to determine the header's length when deserializing (pickle varies in length)
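The pickled-header idea in the last item could look roughly like this. The 4-byte type tag `b"PKL0"`, the uint32 length field and the function names are assumptions made for the sketch. Note that `pickle.loads` should only be used on bytes from a trusted source.

```python
import pickle
import struct

# Assumed prefix: 4-byte type tag followed by a uint32 header length.
PREFIX = struct.Struct("<4sI")
PICKLE_TAG = b"PKL0"  # hypothetical tag meaning "header is a pickled dict"


def add_header(meta: dict, payload: bytes) -> bytes:
    """Prepend a pickled meta dict, preceded by its type tag and length."""
    header = pickle.dumps(meta)
    return PREFIX.pack(PICKLE_TAG, len(header)) + header + payload


def split_header(b: bytes):
    """Return (meta, payload) from bytes produced by add_header."""
    tag, header_len = PREFIX.unpack_from(b, 0)
    if tag != PICKLE_TAG:
        raise ValueError(f"unknown header type: {tag!r}")
    start = PREFIX.size
    meta = pickle.loads(b[start:start + header_len])
    return meta, b[start + header_len:]
```

The length field is what lets the reader skip a header of any size and land exactly on the fixed-structure payload.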


2 participants