Missing values (nulls) in identifiers and no unique key check #54
Comments
I think there are cases where nulls are required in parts of the getData() stream computation, for instance with an outer join, I believe. The spec states this somewhere and specifies that the result of the overall join operation, i.e. what ends up in the VTL dataset variable, should be checked for duplicates somehow. On the other hand, if all the input streams are valid (without duplicates), I don't think most VTL operations will generate duplicates, so we could put this logic only in the operations that are problematic (union and outer join, I believe). Union already has some logic for this. Checking for duplicates is memory-consuming; I thought of using a Bloom filter and two passes to solve this.
Yes, the working dataset can contain nulls in ICs, but the end dataset cannot (VTL 1.1 User Manual, lines 3249-3257). The case I have seen is that the duplicates exist in the input dataset (i.e. the input dataset is not functionally integral, VTL 1.1 User Manual, lines 2269-2292), so I think this is something we need to fix globally, not only for specific operations, don't you think? If I understand your idea correctly, we would put the hash of all the ICs of a given Data Point into the Bloom filter. But for the next Data Point, when we ask the filter whether it contains the new hash, the filter could answer "maybe", and we need to know for sure. Can you also elaborate on the two passes?
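For reference, here is one possible reading of the "Bloom filter and two passes" idea, as a minimal Python sketch rather than anything taken from the project. The first pass uses the filter only to collect candidate keys (true duplicates plus any false positives); the second pass counts those candidates exactly, which turns the filter's "maybe" into a definite answer. The names `BloomFilter`, `find_duplicate_keys`, `read_data_points` and `id_names` are hypothetical, and the sketch assumes the data can be streamed twice.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical helper, not a library API)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely not seen"; True only means "maybe seen".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def find_duplicate_keys(read_data_points, id_names, bloom=None):
    """Return the set of identifier tuples that occur more than once.

    read_data_points: callable returning a fresh iterator of dict-like
    Data Points each time it is called (assumption: two passes are possible).
    """
    if bloom is None:
        bloom = BloomFilter()

    # Pass 1: keys the filter has "maybe" seen before become candidates.
    candidates = set()
    for dp in read_data_points():
        key = tuple(dp[name] for name in id_names)
        if bloom.might_contain(key):
            candidates.add(key)
        else:
            bloom.add(key)

    # Pass 2: count only the candidates exactly, discarding false positives.
    counts = Counter()
    for dp in read_data_points():
        key = tuple(dp[name] for name in id_names)
        if key in candidates:
            counts[key] += 1
    return {key for key, n in counts.items() if n > 1}
```

Because a Bloom filter never gives a false negative, every repeated key is flagged in the first pass, so the exact set only has to hold the candidates instead of every key in the dataset.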
The implementation allows null values in identifier components whereas specification 1.1 of the VTL language says:
There is also no unique key check. In other words, it is possible to have two or more Data Points that have the same values in their Identifier Components.
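To make the two missing checks concrete, here is a minimal Python sketch of what a validation of the final dataset could look like. It is an illustration only and not the project's code: `check_identifiers`, `data_points` and `id_names` are hypothetical names, Data Points are assumed to be dict-like, and the whole key set is kept in memory.

```python
def check_identifiers(data_points, id_names):
    """Return (row indices with a null identifier, set of repeated keys)."""
    null_rows = []
    seen = set()
    duplicates = set()
    for index, dp in enumerate(data_points):
        key = tuple(dp.get(name) for name in id_names)
        if any(value is None for value in key):
            # A null in an Identifier Component: report the row and skip it.
            null_rows.append(index)
            continue
        if key in seen:
            # Same identifier values as an earlier Data Point.
            duplicates.add(key)
        else:
            seen.add(key)
    return null_rows, duplicates
```

A dataset would pass only if both returned collections are empty.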