Use EArrays instead of Tables in the HDF5 file #11

Open
rossant opened this issue Apr 10, 2013 · 6 comments

@rossant
Member

rossant commented Apr 10, 2013

@thesamovar @shabnamkadir @kdharris101 Currently, the new file format implemented in SpikeDetekt does not exactly match what is described here (https://github.com/klusta-team/spikedetekt/blob/master/docs/fileformat.md). These notes should be updated, but some decisions need to be made first.

A potential issue with the current format concerns the features. We will want the option to show all spikes at once in the FeatureView. With the really large datasets (hundreds of channels) that will be available soon, it won't be possible to load the whole feature matrix (Nspikes x Nfeatures) at once because it won't fit in memory. We'll want to load just an Nspikes x 2 array with the two features we're interested in. I don't think that is possible with the current file structure, where there is a table with Nspikes rows and a Features column whose datatype is an Nfeatures-long vector.
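To give a sense of scale, here is a back-of-the-envelope calculation. All the numbers are hypothetical (they are not taken from any real recording): 10 million spikes, 300 channels with 3 features each, stored as float32.

```python
# Rough memory estimate; every number below is an assumption for illustration.
n_spikes = 10_000_000      # hypothetical spike count
n_features = 3 * 300       # hypothetical: 3 features per channel, 300 channels
bytes_per_value = 4        # float32

# Whole feature matrix (Nspikes x Nfeatures) versus just two columns (Nspikes x 2).
full_bytes = n_spikes * n_features * bytes_per_value
pair_bytes = n_spikes * 2 * bytes_per_value

print(f"full matrix: {full_bytes / 1e9:.1f} GB")   # 36.0 GB
print(f"two columns: {pair_bytes / 1e6:.1f} MB")   # 80.0 MB
```

Under these assumptions the full matrix is far beyond typical RAM, while the two columns needed for one FeatureView projection are easy to hold in memory.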

A solution would be to have a dedicated array in the file, called features, of size Nspikes x Nfeatures. With a chunk shape (http://pytables.github.io/usersguide/optimization.html) equal to Nspikes x 1, it will be quite efficient to read just two arbitrary columns of this array.
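A minimal sketch of this idea with PyTables, under a few assumptions: the file name, the sizes, and the random data are made up, and since an HDF5 chunk shape is fixed when the array is created, the sketch uses a fixed number of rows per chunk (`chunk_rows`, a hypothetical value) rather than literally Nspikes x 1.

```python
import os
import tempfile

import numpy as np
import tables

n_spikes = 10_000     # hypothetical, small enough to run quickly
n_features = 12       # hypothetical
chunk_rows = 4096     # hypothetical rows per chunk; each chunk holds one column

path = os.path.join(tempfile.mkdtemp(), "features_sketch.h5")

# Fake feature matrix standing in for real SpikeDetekt output.
rng = np.random.default_rng(0)
data = rng.standard_normal((n_spikes, n_features)).astype(np.float32)

with tables.open_file(path, mode="w") as h5:
    # Extendable along the first (spike) dimension, column-oriented chunks.
    features = h5.create_earray(
        h5.root,
        "features",
        atom=tables.Float32Atom(),
        shape=(0, n_features),
        chunkshape=(chunk_rows, 1),
    )
    features.append(data)

with tables.open_file(path, mode="r") as h5:
    features = h5.root.features
    i, j = 3, 7  # the two features to display
    # Each slice only touches the chunks belonging to that column.
    pair = np.column_stack([features[:, i], features[:, j]])

print(pair.shape)
```

With this layout, reading one column touches roughly Nspikes / chunk_rows chunks instead of the whole matrix. The best value for `chunk_rows` would have to be measured on real data.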

More generally, the advantage of using a Table with one row per spike, instead of independent arrays (features, masks...), should be spelled out somewhere. Both tables and arrays can have an extendable dimension.

@kdharris101
Member

Hi all,

To repeat this point: Cyrille and I were talking this morning. We are not sure why the HDF5 format uses tables rather than arrays, and as Cyrille explained, it seems that arrays will allow KlustaViewa to scale much better to the very large datasets that will be coming in a few years. So we are considering switching the format to have a separate array for the features, spikes, spike times, etc. This would mean dropping tables altogether.

Dan, I think I remember a discussion in which you explained why tables were the better choice. But I can’t remember the reason. What was it?

k


@thesamovar
Member

The discussion we had before was about whether you can efficiently append to an array, which is necessary for the main loop of SpikeDetekt because you don't know in advance how many spikes you will have. You can definitely append to tables, no problem. At the time we thought you couldn't append to arrays, but what Cyrille writes suggests that he's found a way to do it ("Both tables and arrays can have an extendable dimension."), in which case I think arrays would be preferable to tables.

@rossant
Member Author

rossant commented Apr 10, 2013

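@thesamovar @kdharris101 Indeed, arrays can be extendable in HDF5. In PyTables, one needs to use EArrays (http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class), which can be extended along exactly one dimension.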

If everyone agrees I can update the code to use arrays instead of tables.
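As a rough sketch of what appending could look like when the number of spikes is not known in advance: the array names (`spike_times`, `masks`), the batch sizes, and the `detect_batch` helper are all hypothetical stand-ins, not the actual SpikeDetekt code.

```python
import os
import tempfile

import numpy as np
import tables

n_channels = 4  # hypothetical


def detect_batch(rng, n_spikes):
    """Hypothetical stand-in for one iteration of the detection loop."""
    times = np.sort(rng.integers(0, 20_000, size=n_spikes)).astype(np.int64)
    masks = rng.random((n_spikes, n_channels)).astype(np.float32)
    return times, masks


path = os.path.join(tempfile.mkdtemp(), "earray_append_sketch.h5")
rng = np.random.default_rng(0)
batch_sizes = [5, 12, 3]  # hypothetical; unknown in advance in a real run

with tables.open_file(path, mode="w") as h5:
    # A 0 in the shape marks the single extendable dimension of an EArray.
    spike_times = h5.create_earray(
        h5.root, "spike_times", atom=tables.Int64Atom(), shape=(0,)
    )
    masks = h5.create_earray(
        h5.root, "masks", atom=tables.Float32Atom(), shape=(0, n_channels)
    )
    for n in batch_sizes:
        t, m = detect_batch(rng, n)
        spike_times.append(t)  # grows along the spike dimension
        masks.append(m)

    total_spikes = int(spike_times.nrows)
    masks_shape = tuple(int(s) for s in masks.shape)

print(total_spikes, masks_shape)
```

One thing to keep in mind with independent arrays is that every array has to be appended to in the same loop iteration, otherwise the row indices no longer line up across arrays.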

@kdharris101
Member

I agree!


@thesamovar
Member

Me too!

@ghost ghost assigned rossant Apr 10, 2013
@shabnamkadir
Member

I agree too.

Shabnam

