Fix record conversion for Arrays #591

Open
wants to merge 1 commit into develop
Commits on Nov 16, 2022

  1. Fix record conversion for Arrays

    Issue summary: I cannot use the Wrangler, or any other XML plugin provided, for an (a priori) simple use case which consists of importing (nested/repeated) XML data (that has repeated columns, i.e. JSON arrays) to any sink.
    
    Steps to reproduce:
    1. Create a pipeline GCS -> Wrangler -> any sink (with the input path in GCS set as a runtime variable).
    
    2. Use the following sample to create the output schema (with the xml-to-json transform) and run the pipeline with this file.
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <MyRoot>
        <SomeField>
            <Total>65.95</Total>
            <Total>3.98</Total>
            <Total TotalType="FinalTotal">65.95</Total>
        </SomeField>
        <Timer>
            <StartTimestamp>2022-10-03T11:01:48</StartTimestamp>
        </Timer>
    </MyRoot>
    
    3. Observe that the pipeline is successful.
    
    4. Change the source to a new file:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <MyRoot>
        <SomeField>
            <Total>65.95</Total>
        </SomeField>
        <Timer>
            <StartTimestamp>2022-10-03T11:01:48</StartTimestamp>
        </Timer>
    </MyRoot>
    
    5. Observe that the pipeline fails with the "Unable to decode array 'body_MyRoot_SomeField'" error (the sketch below illustrates why the record shapes differ).
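    
    A minimal sketch of why the two files produce differently shaped records. It uses the org.json library directly rather than the actual Wrangler xml-to-json directive, so the exact output layout is an assumption; the point is that repeated elements come back as a JSON array while a single occurrence comes back as a plain value, so a schema inferred from the first file no longer matches the second.
    
    import org.json.JSONObject;
    import org.json.XML;
    
    public class XmlToJsonShapeDemo {
      public static void main(String[] args) throws Exception {
        // First file: <Total> appears three times under <SomeField>.
        String repeated =
            "<MyRoot><SomeField>"
          + "<Total>65.95</Total><Total>3.98</Total>"
          + "<Total TotalType=\"FinalTotal\">65.95</Total>"
          + "</SomeField></MyRoot>";
        // Second file: <Total> appears only once.
        String single =
            "<MyRoot><SomeField><Total>65.95</Total></SomeField></MyRoot>";
    
        // Repeated elements are converted to a JSON array, so the inferred
        // schema expects an array.
        System.out.println(XML.toJSONObject(repeated).toString(2));
        // A single element is converted to a plain value, so decoding it against
        // a schema that expects an array fails ("Unable to decode array ...").
        System.out.println(XML.toJSONObject(single).toString(2));
      }
    }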
    
    Why this PR? Because there is no general way to know whether an XML document contains repeated columns or not, so everything should be expected to be repeated.
    
    Why I think it's a good idea to do that in the standard CDAP code:
    1. Correct me if I'm wrong, but RecordConvertor.java is meant to convert the input runtime data to match the output schema. It is NOT meant to "VALIDATE the input against the output schema".
    2. An array is a "high level" data type: it is always filled with elements that have a type of their own (or it contains no elements, in which case there is no issue in the first place). Wrapping the value with Collections.singletonList(object) is therefore the "array equivalent" of Double.parseDouble(value), which is already in this code; in both cases we simply cast the input to match the output schema (see the sketch below).
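    
    A hedged sketch of the proposed behaviour. The class and method names here are hypothetical (this is not the actual RecordConvertor change), but it shows the idea: when the output schema expects an array and the runtime value is a single element, wrap it in a singleton list instead of failing.
    
    import java.util.Collections;
    import java.util.List;
    
    public final class ArrayCoercion {
    
      private ArrayCoercion() { }
    
      // Coerce a runtime value to the array type declared in the output schema.
      @SuppressWarnings("unchecked")
      public static List<Object> toList(Object object) {
        if (object instanceof List) {
          // Already array-shaped: pass it through unchanged.
          return (List<Object>) object;
        }
        // Single element: wrap it, the "array equivalent" of Double.parseDouble(value).
        return Collections.singletonList(object);
      }
    }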
    tgmof authored Nov 16, 2022
    Commit bb08ab6