Missing values (nulls) in identifiers and no unique key check #54

pawbu · 2017-12-20T08:20:12Z

The implementation allows null values in identifier components, whereas the VTL 1.1 specification says:

The “missing” value is not allowed for the Identifier Components, in order to ensure that the Data are always identifiable.

There is also no unique key check. In other words, it is possible to have two or more Data Points with the same values in their Identifier Components.
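A minimal sketch of the two checks described above, assuming each Data Point is reduced to a list of its identifier values. `IdentifierCheck` and `validate` are hypothetical names used only for illustration; they are not part of the implementation:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical helper, not part of the project's API.
final class IdentifierCheck {

    /**
     * Returns a description of every problem found; the list is empty when
     * all keys are non-null and unique. Each inner list holds only the
     * identifier component values of one Data Point.
     */
    static List<String> validate(List<List<Object>> identifierRows) {
        List<String> problems = new ArrayList<>();
        Set<List<Object>> seen = new HashSet<>();
        for (int i = 0; i < identifierRows.size(); i++) {
            List<Object> key = identifierRows.get(i);
            // Rule 1: no "missing" value in any identifier component.
            if (key.stream().anyMatch(Objects::isNull)) {
                problems.add("row " + i + ": null identifier");
                continue;
            }
            // Rule 2: the combination of identifier values must be unique.
            // Copy the key so later mutation of the row cannot corrupt the set.
            if (!seen.add(new ArrayList<>(key))) {
                problems.add("row " + i + ": duplicate key " + key);
            }
        }
        return problems;
    }
}
```

This version keeps every key in memory, so it only illustrates the rules; it does not address the cost of checking large datasets.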

pawbu · 2017-12-20T08:20:48Z

@hadrienk @eivindgi Do you have any thoughts on this?

hadrienk · 2017-12-20T09:30:42Z

I think there are cases where nulls are required in some parts of the getData() stream computation, with outer join for instance, I believe. The spec states this somewhere, specifying that the result of the overall join operation, i.e. what ends up in the VTL dataset variable, should be checked for duplicates somehow.

On the other hand, if all the streams are valid (without duplicates), I don't think most VTL operations will generate duplicates, so we can put this logic only in the operations that are problematic (union and outer join, I believe).

Union has some sort of logic for this. Checking for duplicates is memory-consuming; I thought of using a Bloom filter and two passes to solve this.

pawbu · 2017-12-20T10:37:56Z

Yes, the working dataset can contain nulls in ICs, but the end dataset cannot (VTL 1.1 User Manual, lines 3249-3257).

The case I have seen is that the duplicates exist in the input dataset (i.e. the input dataset is not functionally integral, VTL 1.1 User Manual, lines 2269-2292), so I think this is something we need to fix globally, not only for specific operations, don't you think?

If I understand your idea correctly, we would put the hash of all the ICs of a given Data Point into the Bloom filter. But for the next Data Point, when we ask the filter whether it contains the new hash, the filter could answer "maybe", and we need to know for sure. Can you also elaborate on the two passes?
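For illustration, here is one way a two-pass check could resolve the "maybe" answers. This is only a guess at what was meant, not a design confirmed in this thread. It assumes the Data Points can be read twice, and every name in it (`TwoPassDuplicateFinder`, `TinyBloomFilter`, `findDuplicates`) is hypothetical. The first pass collects keys for which the filter answers "maybe"; the second pass counts only those candidates exactly, so false positives are discarded:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names exist in the project.
final class TwoPassDuplicateFinder {

    /** Minimal Bloom filter: several bit positions derived from one hash code. */
    static final class TinyBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        TinyBloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        private int index(Object key, int i) {
            int h = key.hashCode();
            int step = (h >>> 16) | 1; // odd step for the i-th probe
            return Math.floorMod(h + i * step, size);
        }

        /** false means "definitely not seen"; true means "maybe seen". */
        boolean mightContain(Object key) {
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }

        void put(Object key) {
            for (int i = 0; i < hashCount; i++) {
                bits.set(index(key, i));
            }
        }
    }

    /** Returns the identifier keys that occur more than once. */
    static Set<List<Object>> findDuplicates(Iterable<List<Object>> keys, int filterBits) {
        // Pass 1: a "maybe" answer marks the key as a candidate duplicate.
        // A real duplicate always gets "maybe" on its second occurrence,
        // because a Bloom filter has no false negatives.
        TinyBloomFilter filter = new TinyBloomFilter(filterBits, 3);
        Set<List<Object>> candidates = new HashSet<>();
        for (List<Object> key : keys) {
            if (filter.mightContain(key)) {
                candidates.add(key);
            } else {
                filter.put(key);
            }
        }
        // Pass 2: count only the candidates exactly; false positives
        // end up with a count of 1 and are dropped.
        Map<List<Object>, Integer> counts = new HashMap<>();
        for (List<Object> key : keys) {
            if (candidates.contains(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        Set<List<Object>> duplicates = new HashSet<>();
        for (Map.Entry<List<Object>, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }
}
```

Under these assumptions, the memory used beyond the filter's bits grows with the number of candidates rather than with the number of Data Points, at the cost of reading the data twice. An undersized filter produces more candidates but does not change the result.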

pawbu changed the title ~~Missing values (nulls) in identifiers and no unique key~~ Missing values (nulls) in identifiers and no unique key check Dec 20, 2017
