Feature request: read headers only #37
In general I very much like this idea. I've never used this feature myself, but I've been told that LabRecorder supports adding a new stream after recording has already started, which would put the header for that stream somewhere in the middle of the file. Do you think you can accommodate that?
Thanks for the input @cboulay. I agree with this idea: headers can be in any part of the file.

In my very first approach, I used the `_scan_forward` function to move between chunk boundaries (which is why I filed #34). It worked well for my case because my headers are always at the beginning of the file. However, if a header came anywhere between two boundaries, I would miss it.

The second approach, the one that I am proposing above, accommodates your use case: the file pointer is actually moved to the *beginning of the next chunk*.

Please correct me if I'm wrong: if one adds a new stream after the recording has already started, LabRecorder would write a new chunk with tag 2 (StreamHeader) containing that stream's header info. However, this does not work if the StreamHeader chunk is *embedded* inside the data chunk of another stream; I believe that case would lead to a corrupted file. In other words, are the threads that write to the file protected by some mutex? I'll try reading the LabRecorder source code to confirm this.
It would be good to have a utility that can parse XDF files and manipulate just the meta-data. This is definitely pointing towards BIDS compatibility.

But in the LSL world, streams come and go, and this arbitrariness is part of its design; I would not try to restructure it.

Is there any reason to deal with the binary files directly in this context? If I were to make an xdf2bids tool (for example), I would read the whole data file into Matlab/Python first and parse the returned structures/dictionaries to create the new meta-data, sidecar files, data files, etc. This takes a bit more time, but in the end it is a more flexible and easier way to deal with the data.

Maybe I'm misunderstanding your problem, but I can't actually imagine a scenario in which a header chunk would be embedded in a data chunk. The write process is always protected by a mutex lock; otherwise, LabRecorder could never write a non-corrupt file. Chunks are always written one after another. In practice, headers get written at the beginning, but it is certainly the case that LabRecorder will start to write data chunks before the meta-data for all the requested streams has been written, as streams get picked up. I should also say that in the 5 years I have been using LSL, I have only come across one corrupt data file (in the sense that the binary coding couldn't be parsed, due to some ambiguity in the chunk header codes).
I am not saying it would be useful to have headers embedded in data. I was just commenting on (and probably misunderstanding) Chadwick's remark about having new streams in the middle of the file. As I commented before, this case is handled by my proposal, since I am following the specification: each chunk is read (at the very minimum its size and tag), but the data part is ignored when it's not a header chunk.
I've implemented this functionality in parse_xdf (currently only in this Gist: https://gist.github.com/cbrnr/7164a12868bc7616056587a41f51e92d, which will maybe later become part of some package, depending on where it fits best).
Hi Clemens,

I also have a fork on my side with this feature, along with other issues, implemented. Have a look so we can synchronize and create a PR.

Cheers,
David
Yes, we should definitely consolidate our efforts; it doesn't make sense if everyone develops the same features over and over. I saw that you really only read the headers, without the XML in the stream header. In my version, I do read the XML and parse certain things, but not the raw data samples. I found that it is this step that takes the time; everything else is very fast. Furthermore, you introduced an option to select an XML-to-dict converter: did you have problematic use cases with the default one?
I agree with you on a consolidated effort.

I don't really understand why you think that I only read the headers without the XML in the stream. That is not what's going on in my commit from Feb 26th: dojeda@e209638

I read your gist: how do you propose to integrate it with pyxdf?

Concerning the XML-to-dict converter, I don't have a specific problem, but more of a disagreement on how this information is decoded, and how this makes the code that uses it less readable and especially inconvenient.

Finally, I am still not filing a PR for the issues that I filed a month ago, because only one of them got a bit of traction (this issue in particular), and I still have a new feature to implement concerning keeping the original timestamps even when there is a clock sync and dejittering operation. I really need this feature because I have found some files where the timestamps have some nasty jitter; while the dejitter algorithm does a great job of reconstructing a sensible timeline, we need to apply our own correction in some cases (i.e., we want to use pyxdf dejittering most of the time, and some other times we want to roll our own correction).
Alright, I missed the StreamHeader (2) part.
Good question, I don't know yet. We could either modify `load_xdf` or keep the header-only reader as a separate function. Regarding the XML parser, I think this should be discussed in a separate issue. Same for the original timestamps. Since this repo has moved to https://github.com/xdf-modules/xdf-Python, we should also continue our discussion of this issue there.
I didn't know this repo was moved elsewhere... I have been filing issues for a couple of months now and did not find any information regarding this.
@cboulay is the current maintainer AFAIK
It hasn't moved yet. I made a pull request proposing the move, and for that pull request to work I had to copy the code into the component submodules. Please take a look at the PR and let me know what you think. So far I've only received a little bit of feedback, but it's been positive!

Note that sccn/xdf remains and is still the home. Any documents related to the XDF file spec will stay here. What has moved is the Python code, the Matlab code, and the new addition of the C++ libxdf, though the Python and Matlab code are still linked here via submodules. This separation is desirable for several reasons. There are also some conveniences for build systems and continuous integration, but those are relatively minor as long as the only 'testable' part is Python.
I am sorry for filing so many issues at once. What I really wanted was to implement the feature explained in this issue, but while doing so I found the problems/discussion points filed in #34, #35, and #36.
The feature I would like to propose is an extension to `load_xdf` that is significantly faster but only reads the headers.

My use case is the following: I am reading and importing all my XDF files into another database/filesystem, and I wanted to save the information present in the headers in a database (hence my discussion in #36). The problem is that I have a lot of files, and some of them are big (about 3 GB; probably a recording that was started and then forgotten, but it could also be a long recording session).
The way I plan to implement this is to use the tag that identifies the chunk type and the chunk size in bytes (https://github.com/sccn/xdf/wiki/Specifications#chunk). When the tag is not of type FileHeader (1) or StreamHeader (2), I will move the file pointer to the beginning of the next chunk.
I managed to achieve this with the following code:
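Something along these lines (a minimal sketch rather than the exact code; the helper name `read_xdf_headers` and the returned structure are illustrative, but the variable-length integer and tag layout follow the chunk spec linked above):

```python
import struct
import xml.etree.ElementTree as ET

def read_xdf_headers(filename):
    """Collect FileHeader (1) and StreamHeader (2) chunks, seeking past
    everything else (Samples, ClockOffset, Boundary, StreamFooter)."""
    headers = []
    with open(filename, "rb") as f:
        if f.read(4) != b"XDF:":
            raise IOError("not a valid XDF file")
        while True:
            # Each chunk starts with a variable-length integer: one byte
            # giving the number of length bytes (1, 4, or 8), then the
            # length itself (little-endian). The length counts the 2-byte
            # tag plus the chunk content.
            nbytes = f.read(1)
            if not nbytes:
                break  # end of file
            chunk_len = int.from_bytes(f.read(nbytes[0]), "little")
            tag = struct.unpack("<H", f.read(2))[0]
            if tag == 1:  # FileHeader: content is just XML
                headers.append((None, ET.fromstring(f.read(chunk_len - 2))))
            elif tag == 2:  # StreamHeader: 4-byte stream ID, then XML
                stream_id = struct.unpack("<I", f.read(4))[0]
                xml = f.read(chunk_len - 6)  # minus tag and stream ID
                headers.append((stream_id, ET.fromstring(xml)))
            else:
                # Not a header: jump to the beginning of the next chunk.
                f.seek(chunk_len - 2, 1)
    return headers
```

Only a few bytes per non-header chunk are actually read; everything else is a relative seek, which is what makes the header-only pass so much faster than a full parse.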
With this modification, I can cut down the read time of a 3 GB file from two and a half minutes to 5 seconds. Considering I have several hundred files (not that big, though), I am saving quite a lot of time.
If you guys agree with this modification, I would gladly make a PR for it.