DataFlow Activity Support #140

Closed
deenairn opened this issue Nov 7, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@deenairn
Contributor

deenairn commented Nov 7, 2024

The framework looks like it does a great job of testing individual activities in Pipelines.

Feature Request
However, the ADF DataFlow Activity can contain quite complex transformations. It would be great if there were a way to substitute the DataFlow data source with a fixed string and then test the transformation against it (JSON for JSON data types, an array of arrays for delimited text or database types, XML for XML data types, etc.), so that you can write a reasonable unit test for complex data transformations.

For example, for a DataFlow that reads JSON from Blob Storage and writes delimited text to Blob Storage, you could do something like this:

Set up the activity with simple JSON like:

[ { "name": "Donald", "age": 21 }]

Then assert that it returns what you expect by checking against a CSV like:

Name,Age
Donald,21
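
To make the shape of such a test concrete, here is a minimal sketch in plain Python. It does not execute a real DataFlow: `run_data_flow_stub` is a hypothetical stand-in written only for illustration, and nothing like it exists in the framework today. The point is only the pattern of a fixed JSON source going in and an expected CSV being asserted on the way out.

```python
import csv
import io
import json


def run_data_flow_stub(source_json: str) -> str:
    """Hypothetical stand-in for a data flow: JSON rows in, CSV text out."""
    rows = json.loads(source_json)
    # Capitalise the column names, mirroring the Name/Age header above.
    columns = [key.capitalize() for key in rows[0]]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow(row.values())
    return buffer.getvalue()


def test_json_source_becomes_csv_sink() -> None:
    # Fixed string substituted for the real Blob Storage source.
    source = '[ { "name": "Donald", "age": 21 }]'
    expected = "Name,Age\nDonald,21\n"
    assert run_data_flow_stub(source) == expected


test_json_source_becomes_csv_sink()
```

In a real implementation the stub would be replaced by whatever evaluates the DataFlow definition; the test body itself would stay about this small.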

arjendev added the enhancement (New feature or request) label on Nov 25, 2024

@arjendev
Collaborator

Hi @deenairn,

Thank you for the feature request! In the coming weeks I'll set aside some time to play with DataFlow and see if it makes sense to add such a feature to the framework.

@deenairn
Contributor Author

deenairn commented Dec 6, 2024

I think it would really enhance the value of the framework, but I did start thinking about it and realised this isn't a task I can just do in my spare time as it could get complicated. Happy to help if I can though.

@arjendev
Collaborator

@deenairn, could you share an example DataFlow JSON definition with us?

@deenairn
Contributor Author

@arjendev - this is a trivial example of an ADF data flow that queries the REST API of the public OData service at https://services.odata.org/TripPinRESTierService. See the docs at: https://www.odata.org/odata-services/

Input: JSON, multiple layers of nesting
Output: Blob, tabular CSV

There's literally no logic here; it just flattens a bit of JSON data into tabular CSV data.

In the simplest case, you could provide a bit of JSON that could have properties multiple layers deep, then check that it's reformatted to a CSV (or any tabular output) as expected.
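
As a rough illustration of that simplest case, the sketch below flattens nested JSON into tabular CSV in plain Python and checks the result. It is not the data flow engine and does not reproduce DataFlow flatten semantics (for instance, it does not unroll arrays). The `flatten` and `to_csv` helpers, the dotted column names and the sample record are all assumptions made up for this example.

```python
import csv
import io
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Collapse nested dicts into one level, joining keys with dots."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_csv(records: list[dict[str, Any]]) -> str:
    """Flatten every record and render the rows as CSV text."""
    rows = [flatten(record) for record in records]
    # Preserve first-seen column order across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up input with properties several layers deep.
nested = [{"name": "Donald", "address": {"city": {"name": "Springfield"}}}]
assert to_csv(nested) == "name,address.city.name\nDonald,Springfield\n"
```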

Once this sort of functionality is in place, you could focus on more logic-heavy cases (for example, data that is filtered or reshaped under certain data flow conditions).

Does this make sense?

REST.json

@arjendev
Collaborator

@deenairn, yes, thank you for sharing!

I am not familiar with Data Flow Script (DFS) myself, but it seems to be a language that abstracts away the underlying Spark SQL runtime. Unfortunately, the Data Factory Testing Framework can currently only evaluate the Data Factory Expression Language (DFEL), which is built on top of the Logic Apps Expression Language. A testing framework for DFS would therefore require a completely different approach, I am afraid. I am also not sure whether the DFS language is publicly available.

Another approach would be to rebuild the language in Python, as we did before for the DFEL in this testing framework. However, given that DFS is more complicated than DFEL and has a larger number of functions (which are themselves more complicated), I am afraid it would be a lot of work.
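
To give a feel for what "rebuilding the language in Python" involves, here is a toy sketch. It is not how the framework evaluates DFEL and it is not a DFS parser: the registry, the tuple-based expression tree and the `$column` reference convention are all hypothetical, and the three function names are only chosen to resemble data flow expression functions. Every supported function needs its own hand-written Python equivalent, which is where the effort would go.

```python
from typing import Any, Callable

# Hypothetical registry: each expression function is reimplemented by hand.
FUNCTIONS: dict[str, Callable[..., Any]] = {
    "upper": lambda s: s.upper(),
    "concat": lambda *parts: "".join(parts),
    "toInteger": lambda s: int(s),
}


def evaluate(node: Any, row: dict[str, Any]) -> Any:
    """Evaluate a pre-parsed expression tree against one row of data."""
    if isinstance(node, tuple):
        # A call is written as ("function_name", arg1, arg2, ...).
        name, *args = node
        return FUNCTIONS[name](*(evaluate(arg, row) for arg in args))
    if isinstance(node, str) and node.startswith("$"):
        # "$column" is this sketch's own notation for a column reference.
        return row[node[1:]]
    return node  # anything else is treated as a literal


row = {"name": "Donald", "age": "21"}
expression = ("concat", ("upper", "$name"), " is ", "$age")
assert evaluate(expression, row) == "DONALD is 21"
```

A real implementation would also need a parser for the script text, type handling and the full function catalogue, none of which is shown here.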

@deenairn
Contributor Author

@arjendev - that's a shame; it would be a real help for our project. I did consider what it might take to implement it and realised it would be a lot of work, unless I misunderstood what was necessary. I still think it would be a great addition if it were possible.

@arjendev
Collaborator

@deenairn - we've discussed this with the v-team working on this framework and, unfortunately, we decided not to pursue DataFlow support further at this time. If that changes in the future, we will let you know with a reply on this issue.

Thank you for your time in identifying some issues and coming up with feature suggestions, appreciated!
