Use EArrays instead of Tables in the HDF5 file #11
Comments
Hi all,

To repeat this point: Cyrille and I were talking this morning. We are not sure why the HDF5 format uses tables rather than arrays, and as Cyrille explained, it seems that arrays will allow KlustaViewa to scale much better to the very large data sets that will be coming in a few years. So we are considering switching the format to have a separate array for the features, spikes, spike times, etc. This would mean dropping tables altogether.

Dan, I think I remember a discussion in which you explained why tables were the better choice, but I can't remember the reason. What was it?

k
The discussion we had before was about whether you can efficiently append to an array, which is necessary for the main loop of SpikeDetekt because you don't know in advance how many spikes you will have. You can definitely append to tables, no problem. At the time we thought you couldn't append to arrays, but what Cyrille writes suggests that he's found a way to do so ("Both tables and arrays can have an extendable dimension."), in which case I think arrays would be preferable to tables.
@thesamovar @kdharris101 Indeed, arrays can be extendable in HDF5 along one dimension at most. In PyTables, one needs to use EArrays (http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class). If everyone agrees, I can update the code to use arrays instead of tables.
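For readers unfamiliar with EArrays, here is a minimal sketch of the append pattern discussed above, assuming a recent PyTables (`open_file`, `create_earray`) and NumPy. The file name, the feature count, and the `detect_in_batches` generator are hypothetical stand-ins and are not taken from SpikeDetekt. The `0` in the shape marks the single dimension along which the EArray can grow.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables

N_FEATURES = 12  # hypothetical number of features per spike


def detect_in_batches(n_batches, rng):
    """Hypothetical stand-in for a detection loop.

    Yields blocks of features whose length is not known in advance.
    """
    for _ in range(n_batches):
        n_spikes = int(rng.integers(1, 50))
        yield rng.standard_normal((n_spikes, N_FEATURES)).astype(np.float32)


def write_features(path, batches):
    """Append every batch to an EArray that grows along its first dimension."""
    total = 0
    with tables.open_file(path, mode="w") as h5:
        features = h5.create_earray(
            h5.root,
            "features",
            atom=tables.Float32Atom(),
            shape=(0, N_FEATURES),  # 0 marks the extendable dimension
        )
        for block in batches:
            features.append(block)
            total += block.shape[0]
    return total


rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "features_demo.h5")
n_written = write_features(path, detect_in_batches(5, rng))

# Reopen the file and check how many rows were stored.
with tables.open_file(path, mode="r") as h5:
    stored_shape = h5.root.features.shape

print(n_written, stored_shape)
```

The total number of spikes never has to be known up front: each call to `append` extends the array by the rows in that block.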
I agree!
Me too!
I agree too.

Shabnam (replying to Dan Goodman's message of Wed, Apr 10, 2013, 3:51 PM)
@thesamovar @shabnamkadir @kdharris101 Currently, the new file format implemented in SpikeDetekt does not exactly match what is described here: https://github.com/klusta-team/spikedetekt/blob/master/docs/fileformat.md. These notes should be updated, but some decisions need to be made first.
A potential issue with the current format concerns the features. We will want the option to show all spikes at once in the FeatureView. With the really large datasets (hundreds of channels) that will be available soon, it won't be possible to load the whole feature matrix (Nspikes x Nfeatures) at once because it won't fit in memory. We'll want to load just an Nspikes x 2 array with the two features we're interested in. I don't think it's possible to do that with the current file structure, where there's a table with Nspikes rows and a Features column whose datatype is an Nfeatures-long vector.
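To make the memory argument concrete, here is a rough back-of-the-envelope calculation. All the numbers are hypothetical assumptions chosen only for illustration (10 million spikes, 300 channels with 3 features each, 4-byte floats); none of them come from an actual dataset.

```python
def matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Size in bytes of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value


# Hypothetical sizes, for illustration only.
n_spikes = 10_000_000
n_features = 300 * 3  # assumed: 300 channels, 3 features per channel

full_matrix = matrix_bytes(n_spikes, n_features)
two_columns = matrix_bytes(n_spikes, 2)

print(f"full matrix: {full_matrix / 1e9:.1f} GB")
print(f"two columns: {two_columns / 1e6:.1f} MB")
```

Under these assumptions the full matrix needs tens of gigabytes, while the two columns needed for one FeatureView projection need tens of megabytes.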
A solution would be to have a special array in the file, called `features`, of size Nspikes x Nfeatures. With a chunk shape (http://pytables.github.io/usersguide/optimization.html) equal to Nspikes x 1, it will be quite efficient to read just two random columns of this array.

More generally, the advantage of using a Table with one row per spike, instead of independent arrays (`features`, `masks`...), should be stated explicitly somewhere. Both tables and arrays can have an extendable dimension.
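The column-wise read proposed above could look roughly like the following sketch, again assuming a recent PyTables and NumPy. The array sizes, the file name, and the helper `read_two_features` are hypothetical. One caveat: when the first dimension is extendable, Nspikes is not known when the array is created, so the sketch uses a fixed chunk length of 512 rows with a chunk width of one column as an approximation of the "Nspikes x 1" idea.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables

# Small hypothetical sizes so the example runs quickly.
n_spikes, n_features = 2000, 16
data = np.arange(n_spikes * n_features, dtype=np.float32).reshape(
    n_spikes, n_features
)

path = os.path.join(tempfile.mkdtemp(), "columns_demo.h5")
with tables.open_file(path, mode="w") as h5:
    features = h5.create_earray(
        h5.root,
        "features",
        atom=tables.Float32Atom(),
        shape=(0, n_features),  # extendable along the spike dimension
        chunkshape=(512, 1),    # each chunk holds values from one column only
    )
    features.append(data)


def read_two_features(path, i, j):
    """Read only columns i and j, returning an (Nspikes, 2) array."""
    with tables.open_file(path, mode="r") as h5:
        arr = h5.root.features
        # Each slice touches only the chunks belonging to that column.
        return np.column_stack([arr[:, i], arr[:, j]])


pair = read_two_features(path, 3, 11)
print(pair.shape)
```

Whether single-column chunks are the best trade-off depends on how the file is written and read in practice, since appending one spike at a time then touches one chunk per feature.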