Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choosing an ETL/ELT Framework for Modules #53

Open
Patowhiz opened this issue Jun 12, 2024 · 0 comments
Open

Choosing an ETL/ELT Framework for Modules #53

Patowhiz opened this issue Jun 12, 2024 · 0 comments

Comments

@Patowhiz
Copy link
Collaborator

Overview

Our project includes modules that will require common Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) operations. We have the option to use the Arquero JavaScript library or rely on languages that offer powerful ETL/ELT tools, such as SQL, Python, or R.

I propose we rely on SQL for our ETL/ELT operations for the following reasons:

  1. Platform Agnosticism: SQL is supported by numerous powerful analytical engines, ensuring our solution remains platform-agnostic.
  2. Ease of Use: SQL is relatively easier to learn and use compared to more programmatic approaches. This allows more team members, including those with lower coding skills, to contribute effectively.
  3. Performance and Efficiency: Analytical engines are optimized for handling large datasets, and their implementations use tools written in lower-level languages that guarantee optimal performance and efficiency.

After evaluating multiple analytical engines such as Apache Spark, Presto, and DuckDB, I recommend using DuckDB for most of our ETL/ELT operations. DuckDB is in-process OLAP database that excels in a single-node environment, offering high performance for our current user needs. DuckDB is also very well supported by NodeJS, Python and R. This means that we can integrate it in our API and in future in our analytics and visualisations processes that may use Python or R.

In the future, if our users require analysis that can only be efficiently performed in distributed environments, we can transition to Apache Spark, which also provides robust support for ETL/ELT operations in distributed settings using SQL.

Request for Comments

I invite feedback on this proposal. Your insights and suggestions will be invaluable.

@Patowhiz Patowhiz changed the title Choosing an ETL/ELT Framework for Our Modules Choosing an ETL/ELT Framework for Modules Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant