Missing values (nulls) in identifiers and no unique key check #54

Open
pawbu opened this issue Dec 20, 2017 · 3 comments

Comments


pawbu commented Dec 20, 2017

The implementation allows null values in identifier components whereas specification 1.1 of the VTL language says:

The “missing” value is not allowed for the Identifier Components, in order to ensure that the Data are always identifiable.

There is also no unique key check. In other words, it is possible to have two or more Data Points that have the same values in all Identifier Components.
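To make the two missing checks concrete, here is a minimal Python sketch. It is not the project's code: the function name `check_identifiers` and the representation of a Data Point as a dictionary are assumptions made purely for illustration.

```python
from typing import Any, Hashable, Iterable, List, Mapping, Sequence, Set, Tuple


def check_identifiers(
    data_points: Iterable[Mapping[str, Any]],
    identifiers: Sequence[str],
) -> List[str]:
    """Return human-readable violations; an empty list means the data set is valid.

    Hypothetical helper: a Data Point is modelled as a dict and `identifiers`
    lists the names of its Identifier Components.
    """
    violations: List[str] = []
    seen: Set[Tuple[Hashable, ...]] = set()
    for position, point in enumerate(data_points):
        key = tuple(point.get(name) for name in identifiers)
        # Rule 1: no "missing" value in any Identifier Component.
        if any(value is None for value in key):
            violations.append(f"data point {position}: null identifier in {key}")
            continue
        # Rule 2: the combination of Identifier Components must be unique.
        if key in seen:
            violations.append(f"data point {position}: duplicate key {key}")
        seen.add(key)
    return violations
```

Note that this keeps every key in memory, which is exactly the cost a streaming implementation would want to avoid.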


pawbu commented Dec 20, 2017

@hadrienk @eivindgi Do you have any thoughts on this?

pawbu changed the title from "Missing values (nulls) in identifiers and no unique key" to "Missing values (nulls) in identifiers and no unique key check" on Dec 20, 2017
@hadrienk

I think there are cases where nulls are required in some parts of the getData() stream computation, outer join for instance, I believe. The spec states this somewhere, specifying that the result of the overall join operation (i.e. what ends up in the VTL dataset variable) should be checked for duplicates somehow.

On the other hand, if all the input streams are valid (without duplicates), I don't think most VTL operations will generate duplicates, so we could put this logic only in the operations that are problematic (union and outer join, I believe).
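A toy Python example of why union is one of the problematic operations: two inputs that are each free of duplicates can still share a key. The helper names and the plain concatenating `naive_union` are assumptions for illustration only, not how the implementation or the spec defines union.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def has_duplicate_keys(rows: Sequence[Row], identifiers: Sequence[str]) -> bool:
    """True if two rows share the same values in all identifier columns."""
    keys = [tuple(row[name] for name in identifiers) for row in rows]
    return len(keys) != len(set(keys))


def naive_union(left: Sequence[Row], right: Sequence[Row]) -> List[Row]:
    """Hypothetical union that simply concatenates its inputs."""
    return list(left) + list(right)


identifiers = ["country", "year"]
a = [{"country": "NO", "year": 2016, "value": 1},
     {"country": "NO", "year": 2017, "value": 2}]
b = [{"country": "NO", "year": 2017, "value": 3},
     {"country": "SE", "year": 2017, "value": 4}]

# Each input is valid on its own, but the concatenation repeats ("NO", 2017).
print(has_duplicate_keys(a, identifiers), has_duplicate_keys(b, identifiers))
print(has_duplicate_keys(naive_union(a, b), identifiers))
```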

Union already has some sort of logic for this. Checking for duplicates is memory-consuming; I thought of using a Bloom filter and two passes to solve this.


pawbu commented Dec 20, 2017

Yes, the working dataset can contain nulls in ICs, but the end dataset cannot (VTL 1.1 User Manual, lines 3249-3257).

The case I have seen is that the duplicates exist in the input dataset (i.e. the input dataset is not functionally integral, VTL 1.1 User Manual, lines 2269-2292), so I think this is something we need to fix globally, not only for specific operations, don't you think?

If I understand your idea correctly, we would put the hash of all the ICs of a given Data Point into the Bloom filter. But for the next Data Point, when we ask the filter whether it contains the new hash, the filter could answer "maybe", and we need to know for sure. Can you also elaborate on the two passes?
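One possible reading of the two-pass idea, sketched in Python so the "maybe" problem is concrete. This is a guess at what was meant, not a description of the implementation: the `BloomFilter` class, `find_duplicate_keys`, and the assumption that the source can be read twice (`read_keys` is called once per pass) are all hypothetical. Pass 1 collects only the keys for which the filter answers "maybe"; pass 2 counts those candidates exactly, which discards the false positives.

```python
import hashlib
from collections import Counter
from typing import Callable, Hashable, Iterable, Iterator, Set, Tuple

Key = Tuple[Hashable, ...]


class BloomFilter:
    """Tiny illustrative Bloom filter; sizes are arbitrary assumptions."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: Key) -> Iterator[int]:
        data = repr(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size_bits

    def add_and_check(self, key: Key) -> bool:
        """Add the key; return True if it was *possibly* present before."""
        possibly_present = True
        for position in self._positions(key):
            byte, bit = divmod(position, 8)
            if not self.bits[byte] & (1 << bit):
                possibly_present = False
                self.bits[byte] |= 1 << bit
        return possibly_present


def find_duplicate_keys(
    read_keys: Callable[[], Iterable[Key]], size_bits: int = 1 << 16
) -> Set[Key]:
    bloom = BloomFilter(size_bits=size_bits)
    # Pass 1: remember only keys the filter reports as "maybe seen".
    # A true duplicate always lands here; so may some false positives.
    candidates = {key for key in read_keys() if bloom.add_and_check(key)}
    # Pass 2: exact counting, restricted to the (hopefully small) candidate set.
    counts = Counter(key for key in read_keys() if key in candidates)
    return {key for key, count in counts.items() if count > 1}
```

The exact set in memory is then bounded by the number of candidates rather than by the number of Data Points, at the price of reading the input twice.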
