Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-http plugin #59

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

VictorD-Veolia
Copy link

Add a transform-type plugin that perform an HTTP request for every input record received. The URL used for the request can be dynamically modified depending on input record value through a placeholder mechanism.

@bdmogal
Copy link
Contributor

bdmogal commented Jun 30, 2021

Hi @VictorD-Veolia ,

Thanks for creating this pull request. Apologies for the delay Could you please create a short document with the use cases for this, but more importantly, how this plugin addresses performance? Making HTTP requests for every record could be quite expensive and that was the reason why this functionality wasn't built into CDF yet. If you can elaborate on how this is handled, that would be great.

Thanks,
Bhooshan

@VictorD-Veolia
Copy link
Author

Hi @bdmogal,

The typical use case for this feature is for API that works with two calls.

  • The first call to the API return a list of object (ex: GET [MY_ENDPOINT]/users)
  • The second returns details for one of the objects. (ex: GET [MY_ENDPOINT]/user/[USER_ID])

If we want to retrieve the details of all users, we will need to do a GET to the first API, using the existing HTTP Source plugin, and then a second HTTP Get for every user of the retrieved list, using this multi-HTTP plugin.

This usecase is pretty common, for example with the 360learning API : https://api.360learning.com/#24c393ca-8e38-4ba7-a55e-0894c308d72d.

For the performance part, I have added an optional throttling feature to avoid overwhelming fragile APIs. This feature allows the user to limit the number of calls per seconds made with the plugin.

@sagarrbatchu
Copy link

i was wondering if there was any interest towards getting this merged in ? Definitely a very useful feature our team at LiveRamp could leverage as well

Copy link

@tivv tivv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach won't work nicely for long lists as transform is not supposed to produce huge output for every input record. This will produce OOMs as Emitter is a push-interface, so it effectively caches everything that is emitted for given record.

@wyzhang
Copy link

wyzhang commented Dec 10, 2021

Some more relevant details:
emitter caches input in memory io/cdap/cdap/etl/common/DefaultEmitter.java:34

@tivv
Copy link

tivv commented Dec 10, 2021

Also from the architecture standpoint I don't think this is the proper way to go forward. Basically, the intent is to have a dataset as a list of parameters to a source. This should not be done on plugin level as we don't want to have a Transform variant of each Source. Such capability should be provided on ETL platform level

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants