Rfix/pipeline v2 extract comments #68
Conversation
## default and explicitly configured pipelines
When `dlt` is imported, a default pipeline is automatically created. That pipeline is configured via configuration providers (i.e. `config.toml` or env variables - see [secrets_and_config.md](secrets_and_config.md)). If no configuration is present, default values will be used.

1. the name of the pipeline, the name of the default schema (if not overridden by the source extractor function) and the default dataset (in destination) are set to the **current module name**, which in 99% of cases is the name of the executing Python script
If we default the destination (dataset/schema), we need to make sure we communicate that somehow, to avoid "where did the data go?" moments.
I would log when this happens, and perhaps the default could include our dlt branding in the name for multiple reasons (easy to find, branding).
100% agree. I want to add user-readable logs that tell exactly where the data went. The default name will always start with the dlt_ prefix (I will update the info above).
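To make the module-name defaulting and the dlt_ prefix concrete, here is a minimal sketch; the source and destination imports are placeholders borrowed from the examples further down, not final API:

```python
# quickstart.py -- illustrative only
import dlt
from dlt.destinations import bigquery  # hypothetical import path for the destination module
from taktile_source import taktile_data  # hypothetical source extractor

# No explicit dlt.pipeline() call: the default pipeline created when `dlt` was
# imported is used. The pipeline name, default schema name and destination
# dataset would all derive from this module name ("quickstart"), with the
# dataset carrying the dlt_ prefix discussed above so it is easy to find.
taktile_data.run(destination=bigquery)
```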
2. the working directory of the pipeline will be **OS temporary folder/pipeline name**
Do users ever need to interact with this? If so, I would give an option for setting the path.
If the runner does not have permission, can the user configure a new folder (can be hacky), or do they need to fix permissions?
Yes, they can set the working dir. Also, I decided to change /tmp
to ~/.dlt/pipelines/<name>
so it is easier to implement the CLI. I will put the correct info here.
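For reference, a sketch of overriding the working directory explicitly, e.g. to point at a mounted volume the runner is allowed to write to; the parameter names follow the option list below and are a proposal, not final API:

```python
import dlt

# Override the default ~/.dlt/pipelines/<name> location (illustrative values).
dlt.pipeline(
    pipeline_name="taktile",
    working_dir="/mnt/state/dlt/taktile",
)
```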
3. the logging level will be **INFO**
can we/should we log schema changes as warnings?
Or should we collect them in the run summary like the errors?
I would push them to Slack :) as logs are for troubleshooting, not daily reading.
Let's gather ideas for end-user log messages in one place.
A pipeline can be explicitly created and configured via `dlt.pipeline()`, which returns a `Pipeline` object. All parameters are optional. If no parameter is provided, the default pipeline is returned. Here's the list of options; all of them are configurable.
1. pipeline_name - default as above
2. working_dir - default as above
3. pipeline_secret - for deterministic hashing - default is a random number
Deterministic ids by default have more value IMO than randomizing by default.
I would expect them to stay deterministic between runs, so I can later de-duplicate on the hash, which would surely be cheaper than de-duplicating on all columns. For example, Stripe might return data from X until the last activity. One would then request from that last activity including its timestamp, because there may be a new activity within the same second that came later.
I would also expect them to be deterministic between local/prod envs, as there are cases where the engineer might run a fix/migration etc. from local.
Let's have a hardcoded secret value for the deterministic hashing, and rename it to hashing salt or seed.
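Independent of dlt internals, the deterministic-id idea boils down to salted content hashing; a rough sketch of what "same input, same id, in every environment" means (the salt name and hashing scheme are illustrative, not dlt's actual implementation):

```python
import hashlib
import json

HASHING_SALT = "dlt-fixed-salt"  # hardcoded seed as suggested above (illustrative)

def row_id(row: dict) -> str:
    # Same row content always produces the same id, locally and in prod,
    # so later de-duplication can compare the hash instead of all columns.
    payload = HASHING_SALT + json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

assert row_id({"a": 1}) == row_id({"a": 1})
```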
4. destination - the imported destination module or module name (we accept strings so they can be configured) - default is None
5. import_schema_path - default is None
I can imagine the default path would ideally be at the extractor script location, and it would not be exported by default.
So I guess this can be achieved by defaulting to None and accepting a relative path parameter; perhaps this relative path can be relative to the extractor.
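A sketch of the "relative to the extractor script" idea, assuming `import_schema_path` stays a plain path parameter as in the option list above (the folder name is made up):

```python
import os
import dlt

# Resolve the schema folder relative to this extractor script, then pass it
# explicitly; nothing is exported unless the user opts in.
schema_dir = os.path.join(os.path.dirname(__file__), "schemas")

dlt.pipeline(
    pipeline_name="taktile",
    import_schema_path=schema_dir,
)
```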
**Pipeline working directory should be preserved between the runs - if possible**

If the working directory is not preserved:
1. the auto-evolved schema is reset to the initial one. Schema evolution is deterministic, so it should not be a problem - just time wasted comparing schemas on each run
I think we should allow storing schemas on S3 or similar, or perhaps we offer schema hosting as a paid service.
Or they could be stored at the destination in binary form or something similar.
2. the current schemas with all the recent updates
3. the pipeline and source state files.

**Pipeline working directory should be preserved between the runs - if possible**
This likely will not happen in any non-basic data engineering setup.
To scale pipelines, most companies with some maturity send their jobs to k8s/Cloud Run/Airflow.
the `run` and `load` return information on the loaded packages: to which datasets, the list of jobs etc. Let me think about what the content should be.

> `load` is atomic if SQL transformations, i.e. in `dbt`, and all the SQL queries take into account only committed `load_ids`. It is certainly possible - we did it for RASA but it requires some work... Maybe we implement a fully atomic staging at some point in the loader.
atomic staging is the way to go IMO
The "load id" thing for RASA is useful for incremental processing but can be implemented without our help.
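A sketch of the "only committed load_ids" pattern from the note above; the `_dlt_loads` table and its columns are assumptions here, not a confirmed schema:

```python
# Downstream SQL (e.g. a dbt model) reads only rows whose load_id has been
# marked as committed, which is what makes `load` look atomic to consumers.
COMMITTED_ONLY = """
SELECT e.*
FROM events AS e
JOIN _dlt_loads AS l
  ON l.load_id = e._dlt_load_id
WHERE l.status = 0  -- 0 = fully committed load (assumed convention)
"""
```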
# the `run` command below will create the default pipeline and use it to load data
# I only want logs from the resources present in taktile_data
taktile_data.select("logs").run(destination=bigquery)
Do we want `select('endpoint')` or do we want `endpoints = ['endpoint']`?
I think the latter is more explicit.
So you would do it like this:
`taktile_data.endpoints = ["logs"]`
`taktile_data.run()`
Hmm, you are probably right. The current interface is not very intuitive.
dlt.pipeline(name="pipe", destination=bigquery, dataset="extract_1")
# use dlt secrets directly to get api key
# no parameters needed to run - we configured destination and dataset already
data(dlt.secrets["api_key"]).run()
Interesting, but I think we want to parametrize the changing stuff and freeze names in the function calls?
this one I do not understand?