single file binary lindi #88
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

    @@ Coverage Diff @@
    ## main #88 +/- ##
    ==========================================
    - Coverage 79.43% 78.23% -1.21%
    ==========================================
    Files 30 33 +3
    Lines 2256 2600 +344
    ==========================================
    + Hits 1792 2034 +242
    - Misses 464 566 +102
HDF5 does support plugins (https://pypi.org/project/hdf5plugin/), but the plugins are implemented in C/C++.
I think this is possible via HDF5 virtual datasets: https://docs.hdfgroup.org/archive/support/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html
I'm wondering whether it would be easier to embed all this data in an HDF5 file as a container format, rather than using a fully custom binary. I.e., you'd have a dataset for the LINDI JSON and then any binary blocks could be stored as datasets. If the datasets are stored without chunking then they are still directly addressable via memory offset and readable via memmap. I.e., you wouldn't use all the fancy features of HDF5 (chunking, compression etc.) but you would have the advantage of still having a self-describing file, rather than a fully custom binary format.
That's an interesting idea. The trick would be how to embed the RFS in the HDF5 file in such a way that it would be easy to extract without needing to use the HDF5 driver. Ironically, we only need to attach two integers to the HDF5 file somehow to give the start and end byte of the RFS, and then we're off to the races. Let me think about how that might be done.
According to the HDF5 spec: "The superblock may begin at certain predefined offsets within the HDF5 file, allowing a block of unspecified content for users to place additional information at the beginning (and end) of the HDF5 file without limiting the HDF5 Library’s ability to manage the objects within the file itself. This feature was designed to accommodate wrapping an HDF5 file in another file format or adding descriptive information to an HDF5 file without requiring the modification of the actual file’s information. The superblock is located by searching for the HDF5 format signature at byte offset 0, byte offset 512, and at successive locations in the file, each a multiple of two of the previous location; in other words, at these byte offsets: 0, 512, 1024, 2048, and so on." So you should be able to place additional information (like byte offsets) at the beginning of the file and still use HDF5.
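To make the quoted search rule concrete, here is a minimal sketch in plain Python (no HDF5 library). The 8-byte signature is the one given in the HDF5 file format specification; the function name `find_superblock_offset`, the `max_offset` cap, and the simulated file are illustrative assumptions, and this is not how the HDF5 library itself is implemented.

```python
import io

# 8-byte HDF5 format signature from the file format specification.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_superblock_offset(stream, max_offset=1 << 30):
    """Return the byte offset of the HDF5 signature, or None if not found.

    Candidate offsets are 0, 512, 1024, 2048, ... as described in the spec.
    `max_offset` is an arbitrary cap chosen for this sketch.
    """
    offset = 0
    while offset <= max_offset:
        stream.seek(offset)
        candidate = stream.read(len(HDF5_SIGNATURE))
        if len(candidate) < len(HDF5_SIGNATURE):
            return None  # ran past the end of the file
        if candidate == HDF5_SIGNATURE:
            return offset
        offset = 512 if offset == 0 else offset * 2
    return None


# Simulated file: 1024 bytes of user content, then the start of a fake superblock.
fake_file = io.BytesIO(b"my offsets".ljust(1024, b"\x00") + HDF5_SIGNATURE + b"\x00" * 64)
found_at = find_superblock_offset(fake_file)
print(found_at)
```

Everything before the returned offset is the region the spec leaves free for user content.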
Cool. Now I'm reading through the rest of that document and I don't see where to put said unspecified data. Are you able to make sense of it?
Following text is generated by ChatGPT

Here are the steps to create and modify the user block of an HDF5 file using h5py.

Step 1: Creating an HDF5 File with a User Block

When creating a new HDF5 file, you can specify the size of the user block using the userblock_size parameter.

    import h5py

    # Specify the size of the user block (must be a power of 2 and at least 512 bytes)
    userblock_size = 512

    # Create a new HDF5 file with the specified user block size
    with h5py.File('example.h5', 'w', userblock_size=userblock_size) as f:
        # Create some datasets or groups if needed
        f.create_dataset('dataset', data=[1, 2, 3])

    print("HDF5 file with user block created.")

Step 2: Writing to the User Block

To write data to the user block, you need to open the file in binary mode and write directly to the beginning of the file.

    # Open the file in binary mode to write to the user block
    with open('example.h5', 'r+b') as f:
        # Write some data to the user block
        user_block_data = b'This is some user block data.'
        f.write(user_block_data)

    print("Data written to the user block.")

Step 3: Reading from the User Block

To read data from the user block, you again open the file in binary mode and read the desired number of bytes from the beginning of the file.

    # Open the file in binary mode to read from the user block
    with open('example.h5', 'rb') as f:
        # Read the whole user block (userblock_size is defined in Step 1)
        user_block_data = f.read(userblock_size)

    # Strip the trailing zero padding before decoding
    print("Data read from the user block:", user_block_data.rstrip(b'\x00').decode('utf-8'))

Step 4: Modifying the User Block

To modify the user block, you can overwrite the desired portion of the user block by seeking to the appropriate position and writing the new data.

    # Open the file in binary mode to modify the user block
    with open('example.h5', 'r+b') as f:
        # Seek to the beginning of the user block
        f.seek(0)
        # Write new data to the user block
        new_user_block_data = b'Updated user block data.'
        f.write(new_user_block_data)

    print("User block data modified.")

Important Considerations

By following these steps, you can create and modify the user block of an HDF5 file using h5py.
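Tying this back to the "two integers" idea earlier in the thread, here is a rough sketch of storing a start and end byte in the reserved region at the top of the file and reading them back without any HDF5 library. The layout (two little-endian unsigned 64-bit integers at byte 0), the helper names, and the example offsets are all assumptions made for illustration, and a plain file with a zeroed 512-byte prefix stands in for a real HDF5 file with a user block.

```python
import os
import struct
import tempfile

USERBLOCK_SIZE = 512    # assumed size of the reserved region at the start of the file
OFFSETS_FORMAT = "<QQ"  # hypothetical layout: two little-endian unsigned 64-bit integers


def write_rfs_range(path, start, end):
    """Overwrite the first 16 bytes of the reserved region with (start, end)."""
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(struct.pack(OFFSETS_FORMAT, start, end))


def read_rfs_range(path):
    """Read (start, end) back using only the standard library."""
    with open(path, "rb") as f:
        return struct.unpack(OFFSETS_FORMAT, f.read(struct.calcsize(OFFSETS_FORMAT)))


# Stand-in for an HDF5 file: a zeroed 512-byte prefix followed by other content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * USERBLOCK_SIZE + b"rest of file")
    demo_path = tmp.name

write_rfs_range(demo_path, 4096, 4737)  # made-up start and end bytes
rfs_range = read_rfs_range(demo_path)
file_size = os.path.getsize(demo_path)
os.remove(demo_path)
print(rfs_range)
```

Because the write stays within the reserved prefix, the rest of the file is untouched and its size does not change.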
Closing in favor of #89
(built on #84)
Motivation
As we have discussed, there are advantages to file.nwb.lindi.json being in JSON format: it can be parsed from many different languages and tools. But there are some important limitations
So I was thinking, it would be nice to have a nwb.lindi binary file (no JSON) that has the reference file system (RFS) embedded in it as JSON, in a way that is as easy as possible to parse out (of course it won't be as easy as a true .json file), while at the same time allowing binary blobs to be appended to the .lindi file, with the RFS referring to the file itself for those chunks.
Wait, isn't this reinventing HDF5?
Well there are some important drawbacks of HDF5
Each of these is a deal breaker for me, especially 2 and 3.
So here's a simple example script that shows how this could work:
What happens? A binary .lindi file is created and it doesn't depend on any staging area or other chunks.
Getting into the weeds, here's what the top of the test.lindi file looks like
followed by a bunch of zero bytes.
When reading the file, we recognize it as "lindi1", lindi binary format. We see that the reference file system is embedded in the file at location 1024 with a size of 641. So then with a second request we can get the entire rfs just like we are reading a JSON. Here's what that looks like:
You can see there are references to binary chunks - and the "." for the URL means that it's referring to locations within the file itself.
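As a rough illustration of how a reader might resolve such a self-reference, the sketch below assumes the reference entries follow the common reference-file-system convention of `[url, offset, length]` triples (with plain strings for inline values). The function name `read_chunk` and the toy container are hypothetical and not taken from the LINDI code.

```python
import io


def read_chunk(container, rfs, key):
    """Return the bytes for `key`, reading from `container` when the URL is "."."""
    ref = rfs["refs"][key]
    if isinstance(ref, str):
        return ref.encode("utf-8")  # inline value stored directly in the JSON
    url, offset, length = ref
    if url != ".":
        raise NotImplementedError("remote URLs are outside the scope of this sketch")
    container.seek(offset)
    return container.read(length)


# Toy container: 16 bytes of padding followed by one 8-byte chunk.
container = io.BytesIO(b"\x00" * 16 + b"ABCDEFGH")
rfs = {"refs": {".zgroup": '{"zarr_format": 2}', "data/0": [".", 16, 8]}}

chunk = read_chunk(container, rfs, "data/0")
print(chunk)
```

The only special case compared with an ordinary reference file system is the `"."` check, which redirects the read to the file that contains the RFS itself.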
So when it comes time to write a new chunk, it is appended to the end of the file, and the reference file system is updated to point to that new chunk. On each file flush (or when file closes), the RFS is rewritten in the file and the top header is updated accordingly. But what if the RFS becomes too large and no longer fits in the pre-allocated padded space? Then a new space is allocated and appended at the end of the file, and the previous RFS is replaced by all zeros (to avoid confusion).
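The append-and-rewrite cycle described above can be modeled in a few lines. This is a toy in-memory sketch, not the real format: the actual header layout is not reproduced here, so the JSON header with `format`, `rfs_start`, and `rfs_size` fields, the class name `LindiSketch`, the 1024-byte initial RFS region, and the "double the size on relocation" policy are all assumptions made for illustration.

```python
import io
import json

HEADER_SIZE = 1024        # the RFS starts at byte 1024 in the example above
INITIAL_RFS_SPACE = 1024  # assumed size of the first padded RFS region


class LindiSketch:
    """Toy in-memory model of the append-and-rewrite scheme (not the real format)."""

    def __init__(self):
        self.buf = io.BytesIO()
        self.rfs = {"refs": {}}
        self.rfs_start = HEADER_SIZE
        self.rfs_space = INITIAL_RFS_SPACE
        self.buf.write(b"\x00" * (HEADER_SIZE + INITIAL_RFS_SPACE))
        self.flush()

    def append_chunk(self, key, data):
        """Append a blob at the end of the file and reference it with URL "."."""
        offset = self.buf.seek(0, io.SEEK_END)
        self.buf.write(data)
        self.rfs["refs"][key] = [".", offset, len(data)]

    def flush(self):
        """Rewrite the RFS in place, or relocate it to the end if it no longer fits."""
        rfs_bytes = json.dumps(self.rfs).encode("utf-8")
        if len(rfs_bytes) > self.rfs_space:
            # Zero out the old region, then allocate a larger one at the end.
            self.buf.seek(self.rfs_start)
            self.buf.write(b"\x00" * self.rfs_space)
            self.rfs_start = self.buf.seek(0, io.SEEK_END)
            self.rfs_space = 2 * len(rfs_bytes)
        self.buf.seek(self.rfs_start)
        self.buf.write(rfs_bytes.ljust(self.rfs_space, b"\x00"))
        # Hypothetical header: small JSON object padded with zero bytes.
        header = {"format": "lindi1", "rfs_start": self.rfs_start, "rfs_size": len(rfs_bytes)}
        self.buf.seek(0)
        self.buf.write(json.dumps(header).encode("utf-8").ljust(HEADER_SIZE, b"\x00"))

    def load_rfs(self):
        """Read the header, then the RFS, the way a reader would."""
        raw = self.buf.getvalue()
        header = json.loads(raw[:HEADER_SIZE].rstrip(b"\x00"))
        start, size = header["rfs_start"], header["rfs_size"]
        return header, json.loads(raw[start:start + size])


f = LindiSketch()
f.append_chunk("data/0", b"\x01" * 100)
f.flush()
header_small, rfs_small = f.load_rfs()

# Add enough references to overflow the padded region and force a relocation.
for i in range(1, 200):
    f.append_chunk(f"data/{i}", b"\x02" * 10)
f.flush()
header_big, rfs_big = f.load_rfs()
print(header_small["rfs_start"], header_big["rfs_start"])
```

Existing chunk offsets never change when the RFS moves, which is why only the header and the RFS region need rewriting on flush.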
Datasets can be deleted, but it should be noted that there is no mechanism to actually free up that space in the file.
I did explore other options before inventing this format, specifically tar, zip, and parquet. None of these met all the needed criteria.
Happy to get your feedback @rly