Skip empty datasets #192
base: master
Conversation
Thanks! I wonder if any of these places should warn that there's missing data where the index says there should be data.
I'd like to get @philsmt's thoughts on this. The idea of using .select(..., require_all=True) is that you know the selected data all exists for the same set of trains, so you can get arrays and use them against one another easily. But that only looks at the indexes. So this would mean that even after using require_all, you could get arrays missing some or all trains.
Of course, in that case it would break either way, but delaying the error might make it harder to understand.
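A minimal sketch of the require_all behaviour being discussed, assuming the extra_data package; the run path and source/key names below are placeholders, not taken from this PR:

from extra_data import RunDirectory

run = RunDirectory('/path/to/run')  # placeholder run directory
sel = run.select([
    ('SA1_XTD2_XGM/XGM/DOOCS:output', 'data.intensityTD'),
    ('SPB_XTD9_XGM/XGM/DOOCS:output', 'data.intensityTD'),
], require_all=True)

# require_all=True keeps only trains where the indexes say *both* keys have
# data, so these arrays are expected to line up train-for-train.
xgm_a = sel.get_array('SA1_XTD2_XGM/XGM/DOOCS:output', 'data.intensityTD')
xgm_b = sel.get_array('SPB_XTD9_XGM/XGM/DOOCS:output', 'data.intensityTD')

# The concern above: if a dataset is empty despite its index entries, one of
# these arrays could still come back shorter than the other, or empty.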
Are there any other places where this could be problematic? I can also add a check in
Thanks for working on this! I'm not fully comfortable, however, with "just" treating it silently and returning no data, but at the minimum raise a
The situation of an empty dataset but supposed records in INDEX represents an abnormal situation - as opposed to actually having a pipeline source sending no data. I think we should go with the assumption that this was not intended (of course not sending data may be unintended too, but that is beyond our control). Given the ubiquity of these files, we should be able to handle it once access goes beyond the INDEX section and into INSTRUMENT, but definitely make the user aware of the underlying corrupt structure.
Hum, I don't think I agree here. Yes, this is a problem and the file is wrongly written. But this problem is identified, and effectively the result of having an empty dataset (even with a filled index) is the same as a data source not sending data. I don't think the user should care about these differences; raising an exception because there's a mismatch between the dataset array and the index, but not raising when both the index and the dataset are empty, would be confusing.
that is updated now, and would return an empty
Are you sure with
I don't think it will be confusing. Let me try to explain the situation from an operator's perspective: I understand your point that the difference is academic from the file's point of view, but it's not from the point of view of the scientist who tried to record the data they saw on the screen. And raising a warning that the data is absent due to some misconfiguration beyond the experimentalist's control is a way to allow for fixing this in time. If we could be 100% sure it is always caught by automatic validation (or even better, the DAQ 😠), I'd be fine with just handling it implicitly. But we're far away from that.
Would it be an acceptable compromise to drop it with a warning from methods which retrieve data from multiple sources (like
That sounds like a good idea to me. I don't want to restrict the functionality when we could handle it, but for
Ah, I ran into this problem only for slow data, and in that case I don't think you'd see a difference with lsxfel (for slow data).
I agree this is important, but I still don't think this is the right place for it. You see it as a beamline operator, but more generally, users don't have a clue what a DAQ or Karabo is, or what can cause this or that issue, and I am not sure we should make their code deal with the various caveats of our files. I see DataCollection as an interface to data in files: if there's no data for a key, then it returns nothing, whatever the reason. If we want to understand why there is no data, we should use other tools. Also note that this kind of issue is seen in old files (2019); I don't think any new file has that kind of problem.
Ah, maybe we set our focus on different kinds of data in our discussion. I think the changes will affect both, though.

For control data, quite likely this problem only occurs with older files; I've never seen it myself. Honestly, this problem is even worse for control data, as it can be assumed to be present for every train whenever a source is present in the files. The state of "source-existing-but-no-control-data" is an exceptional state that should not occur at all.

For fast data, the problem is still there, alive and kicking. It's still comparably easy to trigger, even if most instruments have burned their hands by now and are somewhat aware to watch out for it.

And I made a poor choice of words with "operator". I was actually referring to the "user operators", i.e. those users present or watching the experiment remotely, looking at data previews to see whether they satisfy their needs and triggering runs. Those roles are combined during commissioning, but in a user beamtime it's not the instrument staff deciding which data to take. You're right that the far-off PhD student tasked with analyzing the data may not see a difference.

I'm not advocating for sabotaging EXtra-data here, but pointing out the exceptional state. The user may well ignore it, but there's a chance it helps in diagnosing this problem, which is really hard for a non-expert to understand. We should not intentionally assume ignorance on our users' part.
That is bad, but there's actually something even worse: slow data updates not being seen by the DAQ, so the values in the files are wrong. And that is still happening (I saw it on my last DOC shift) :(
Ah, I'm actually an idiot. So, regarding this merge request: I feel this discussion goes beyond the scope of this PR. I merely want to handle another case of a malformed file that breaks the API, without changing the API behaviour.
The same cause (no data for a source at a given train) leads to different results, and as a user I am confused by that. We can discuss a modification of the API for that, but that is a larger change and would require consistency. I can imagine, for example, raising an error if a selection returns an empty collection (that would be a more abstract case that includes this specific issue).
I'm somewhat unsure about this now. The case where the DAQ doesn't record any data, but still makes index entries as though it did is a DAQ bug - whereas simple missing data is not. We shouldn't prevent people retrieving other data that was correctly recorded, but I think this has gone too far in the direction of hiding a serious problem. If users specifically try to get data which is affected by this problem, I think they should get an error, so it's clear that the files are broken. So I'm currently thinking:
# Remove any trains previously selected, for which this
# selected source and key group has no data.
train_ids = np.intersect1d(train_ids, source_tids)
train_ids = np.intersect1d(train_ids, key_tids)
This code now needs to:
- Find all keys for control sources (which are often more numerous than instrument keys, IIRC), where the selection keeps all keys.
- Do one intersection per key, rather than one per source (/source group). I found in Speed up _check_source_conflicts() #183 that NumPy set operations are not that efficient for working with many small sets. (I appreciate that it avoids a set union per file, though).
So I am concerned that this could be markedly slower in some circumstances (probably when selecting sources with many keys) than the code it replaces.
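To illustrate the concern (this is not the PR's actual code, and the arrays are made up): the change swaps one intersection per source group for one intersection per key, so a control source with many keys triggers many small NumPy set operations.

import numpy as np

train_ids = np.arange(1000, 2000)               # trains selected so far (made up)
tids_per_key = [np.arange(1000, 1990)] * 500    # hypothetical per-key train IDs

# Before: one intersection per source / source group
# train_ids = np.intersect1d(train_ids, source_tids)

# After: one intersection per key - hundreds of small set operations
# for a control source with hundreds of keys
for key_tids in tids_per_key:
    train_ids = np.intersect1d(train_ids, key_tids)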
It is slower, but I don't think that's a problem here. Selections are not an operation you'd do very often, and you'd also not select many sources.
When testing with this branch, a selection always took less than 1 ms.
I also realised while looking into something related that
OK, I'm still not convinced this is right, but I seem to be in the minority. So be it.
I'm not entirely convinced of what I'm saying either - I see the argument that missing data is missing data and it should appear the same regardless of how it came to be missing. But if we can't reach a consensus, I lean towards keeping more things as they are. We might decide to make more changes later - whereas if we change now and then want to change back, that's extra confusion.
@tmichela @philsmt: I don't suppose either of you have shifted your thinking on this? 🙂 I'm stuck in the middle: I can see the argument that users don't really care why missing data is missing, but I also don't want to let serious DAQ bugs pass silently. Maybe warnings are the answer? Indicate that there's a problem with the files, but give results back as if it were regular missing data?
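A rough sketch of that warn-but-treat-as-missing option; the warning class and helper function here are hypothetical, not part of EXtra-data:

import warnings

class MalformedFileWarning(UserWarning):
    """INDEX claims entries for a key, but the dataset itself is empty."""

def warn_if_empty_dataset(source, key, index_count, dataset_len):
    # Warn about the broken file, then let callers treat it as missing data.
    if index_count > 0 and dataset_len == 0:
        warnings.warn(
            f"{source}/{key}: INDEX records {index_count} entries but the "
            f"dataset is empty; treating it as missing data",
            MalformedFileWarning,
        )

warn_if_empty_dataset('SPB_XTD9_XGM/XGM/DOOCS', 'beamPosition.ixPos.value', 120, 0)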
Fixes #189
I simplified the case where we skip a dataset when it is completely empty in a file, as it is the only scenario I've seen so far where we had the index filled but no data.
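For reference, a rough sketch of the malformed state this targets (the file name and dataset paths are made up, and the INDEX layout is simplified): index entries present, dataset empty.

import h5py

with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
    counts = f['INDEX/SPB_XTD9_XGM/XGM/DOOCS/count'][:]
    values = f['CONTROL/SPB_XTD9_XGM/XGM/DOOCS/beamPosition/ixPos/value']
    if counts.sum() > 0 and values.shape[0] == 0:
        # The index claims data for some trains, but the dataset is empty:
        # with this change, such a dataset is skipped (treated as no data).
        print('Empty dataset despite non-empty index')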