Data Engineer Exam Solution #3

Open
wants to merge 30 commits into
base: master
928a95e
Initialize meltano
mkdlt Jan 13, 2022
ccfedeb
Complete data generation task
mkdlt Jan 13, 2022
be6204a
Move original README text
mkdlt Jan 13, 2022
2e701fa
Add tools, ELT notes
mkdlt Jan 13, 2022
670a979
Add links, formatting
mkdlt Jan 13, 2022
1a17874
Add extractor for flat files
mkdlt Jan 13, 2022
2783ce0
Add extractor config
mkdlt Jan 13, 2022
c57bf56
Add loader for postgres
mkdlt Jan 13, 2022
fb9bf55
Fix authentication error
mkdlt Jan 13, 2022
6d34213
Add extraction and loading notes
mkdlt Jan 13, 2022
e936076
Add dbt as transformer
mkdlt Jan 13, 2022
ddfb768
Merge branch 'master' of github.com:mkdlt/dataengineer_test
mkdlt Jan 13, 2022
efc0b41
Implement star schema
mkdlt Jan 13, 2022
51c7686
Add star schema ERD
mkdlt Jan 13, 2022
d95ccad
Add notes on star schema and dbt
mkdlt Jan 13, 2022
5fce20c
Add commands for transform process
mkdlt Jan 13, 2022
4a031e8
Add dbt_date package for date dimension table
mkdlt Jan 13, 2022
9ccf2e2
Fix dbt target schema and config paths
mkdlt Jan 13, 2022
9b7a20a
Add reports
mkdlt Jan 13, 2022
3e0f4ef
Add notes on reporting, deployment, architecture
mkdlt Jan 13, 2022
123b189
Add architecture notes
mkdlt Jan 13, 2022
3c21119
Trim answers
mkdlt Jan 13, 2022
1376ab0
Fix star schema reference
mkdlt Jan 13, 2022
eb5cd3a
Add Airflow as orchestrator
mkdlt Jan 13, 2022
44304cf
Fix Airflow config
mkdlt Jan 13, 2022
9f7caad
Fix typo
mkdlt Jan 13, 2022
95f2230
Fix tools list
mkdlt Jan 13, 2022
724f1b5
Add step by step instructions
mkdlt Jan 13, 2022
cb32bdd
Fix report references
mkdlt Jan 13, 2022
a72f700
Fix instructions
mkdlt Jan 14, 2022
Add links, formatting
mkdlt authored Jan 13, 2022
commit 670a979f7d565dc43290a11e0862d550c74911a5
32 changes: 16 additions & 16 deletions README.md
@@ -1,11 +1,11 @@
# Data Engineer Interview Test

## Tools and technologies
- Meltano - basically a very neat (open source) way to package:
- Singer taps and targets
- dbt
- Airflow
- Apache Superset - "open source Tableau" (also comes with a useful SQL editor for ad hoc queries)
- [Meltano](https://meltano.com/) - very convenient open source tool for building pipelines with:
- [Singer](https://www.singer.io/) taps and targets - ready-to-use extract and load scripts
- [dbt](https://www.getdbt.com/product/what-is-dbt/) - transform data with simple SELECT statements and Jinja templating
- Apache Airflow - the one and only
- [Apache Superset](https://superset.apache.org/) - "open source Tableau" (also comes with a useful SQL editor for ad hoc queries)
- PostgreSQL
- Docker and docker-compose

@@ -19,21 +19,21 @@ To get the bonus points, please encoded the file with the instructions were used

**Done ✅**. File was encoded in base64.

2. Code you scripts to load the data into a database.
>2. Code you scripts to load the data into a database.

3. Design a star schema model which the data should flow.
>3. Design a star schema model which the data should flow.

4. Build your process to load the data into the star schema
>4. Build your process to load the data into the star schema

**Bonus** point:
- add a fields to classify the customer account balance in 3 groups
- add revenue per line item
- convert the dates to be distributed over the last 2 years
>**Bonus** point:
>- add a fields to classify the customer account balance in 3 groups
>- add revenue per line item
>- convert the dates to be distributed over the last 2 years

5. How to schedule this process to run multiple times per day?
>5. How to schedule this process to run multiple times per day?

**Bonus**: What to do if the data arrives in random order and times via streaming?
>**Bonus**: What to do if the data arrives in random order and times via streaming?

6. How to deploy this code?
>6. How to deploy this code?

**Bonus**: Can you make it to run on a container like process (Docker)?
>**Bonus**: Can you make it to run on a container like process (Docker)?
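
For reference, the three bonus items quoted under task 4 could be sketched in Python roughly as follows. This is only an illustration, not the code in this PR: the function names, the balance thresholds, the 730-day window, and the revenue formula (price times one minus discount) are all assumptions, since the diff does not show the column names or the actual dbt models.

```python
from datetime import date, timedelta
from decimal import Decimal


def balance_group(
    balance: Decimal,
    low: Decimal = Decimal("0"),
    high: Decimal = Decimal("5000"),
) -> str:
    """Classify an account balance into one of three groups.

    The thresholds and group names are hypothetical placeholders.
    """
    if balance < low:
        return "negative"
    if balance < high:
        return "standard"
    return "high"


def line_item_revenue(extended_price: Decimal, discount: Decimal) -> Decimal:
    """Revenue for one line item, assuming revenue = price * (1 - discount)."""
    return extended_price * (Decimal("1") - discount)


def rescale_date(d: date, src_min: date, src_max: date, today: date) -> date:
    """Map d linearly from [src_min, src_max] onto the two years ending at today."""
    window_days = 730  # assumed length of "the last 2 years"
    span = (src_max - src_min).days
    if span == 0:
        # Degenerate source range: every date collapses onto today.
        return today
    offset = (d - src_min).days * window_days // span
    return today - timedelta(days=window_days) + timedelta(days=offset)
```

In a dbt project the same logic would more likely live in SQL (a `CASE` expression for the groups and a computed column for revenue); the Python version is just easier to read in isolation.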