
OSC-Platform-031: Secret management for proprietary data ingestion #81

Open · caldeirav opened this issue Oct 7, 2021 · 12 comments

@caldeirav (Contributor)

We currently have two issues with secret management for AWS S3 buckets:

  • The first concerns ingestion: some public sources can all be made accessible through one set of secrets on redhat-osc-physical-landing-647521352890, but other data sources (LSEG, Urgentem, etc.) should not be readable by every ingestion pipeline (i.e. we need one set of secrets per data source, since those sources are ingested by independent pipelines that live in different GitHub repos).
  • The second concerns secret distribution: sending secrets around for people to paste into their credentials.env is not great. Ideally we need a secret store from which the required secret can be retrieved at pipeline runtime, based on the credentials of the pipeline user (or the process user, once we automate it in prod).

Preferably we want a secret management solution that works with a secretless broker, so the process is seamless for developers; for example, with the Conjur secret store:
https://github.com/cyberark/secretless-broker
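
As a point of reference for what runtime retrieval could look like, here is a minimal sketch assuming a HashiCorp Vault KV v2 store with one entry per data source; the mount point, secret paths, and source names are illustrative, not existing configuration:

```python
# Minimal sketch: fetch per-data-source S3 credentials at pipeline runtime.
# Mount point, secret paths, and source names are hypothetical placeholders.
import boto3
import hvac

def s3_client_for_source(source: str):
    # Authenticate with whatever identity the pipeline runs under
    # (VAULT_ADDR / VAULT_TOKEN from the environment in this sketch).
    vault = hvac.Client()
    secret = vault.secrets.kv.v2.read_secret_version(
        mount_point="osc",               # hypothetical KV v2 mount
        path=f"ingestion/{source}",      # one entry per data source, e.g. "urgentem"
    )["data"]["data"]
    return boto3.client(
        "s3",
        aws_access_key_id=secret["aws_access_key_id"],
        aws_secret_access_key=secret["aws_secret_access_key"],
    )

# Each ingestion pipeline only requests the entry for its own source.
s3 = s3_client_for_source("urgentem")
```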

@eoriorda commented May 9, 2022

See whether this can be part of Operate First.

@eoriorda moved this from In Progress to Todo in Data Commons Platform on Jul 18, 2022
@HeatherAck (Contributor) commented Oct 31, 2022

ODH team to provide guidance on secret management (Landon Smith), not just for notebooks; Heather to ask on the ODH support channel in Slack. This issue is to be broken up: key management requires overall architecture and planning. A discussion/meeting is needed; @HeatherAck to set it up.

@HeatherAck (Contributor)

@HeatherAck to schedule for the week of 14-Nov.

@HeatherAck (Contributor)

Meeting planned for 8-Dec; will mark this as blocked until that date.

@HeatherAck (Contributor) commented Dec 16, 2022

We prefer to use Airflow as the scheduler, since it handles more complex data pipelines.

(1) @redmikhail to see whether we can inject secrets into an Airflow pipeline.
(2) Need to verify that HashiCorp Vault will work with Airflow (if not, we will need to pick a different key store, e.g. https://github.com/cyberark/secretless-broker). We need to define the full set of components and the integration between them, and also determine the impact on users of moving away from the Operate First key store. A sketch of the Vault/Airflow integration is below.

Other factors to consider: OpenMetadata and Airflow are tightly coupled. Consider externalizing Airflow to enable other functionality. See also #243.
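
For reference, the apache-airflow-providers-hashicorp package ships a Vault secrets backend, so this part looks feasible; a minimal sketch of a task resolving a Trino connection from Vault once that backend is configured (the Vault URL, mount point, and connection id are placeholders):

```python
# Sketch only: assumes apache-airflow-providers-hashicorp is installed and the
# Airflow deployment sets (values are illustrative):
#   AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#   AIRFLOW__SECRETS__BACKEND_KWARGS={"url": "https://vault.example.com",
#       "mount_point": "airflow", "connections_path": "connections"}
from airflow.hooks.base import BaseHook

def fetch_trino_credentials():
    # With the Vault backend enabled, get_connection() resolves the connection
    # from Vault (airflow/connections/trino_default) instead of the metadata DB.
    conn = BaseHook.get_connection("trino_default")  # placeholder connection id
    return conn.login, conn.password
```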

@HeatherAck (Contributor) commented Dec 16, 2022

(3) Need to know the use cases where the keys will be used.

  • @caldeirav to provide the use case for the data pipeline.
  • @ryanaslett and @MightyNerdEric to identify other operational use cases where keys are needed [Kubeflow (e.g. ML data extraction), Argo CD].

(4) Need to establish different rules/restrictions for CL2 to ensure sandbox development and testing are not slowed down; CL3 will be the stable cluster.

(5) Document the policy and its use.

@caldeirav moved this from Todo to In Progress in Data Commons Platform on Dec 16, 2022
@HeatherAck (Contributor) commented Dec 19, 2022

@redmikhail is still investigating (1) above.

Re: (3) above: customize the plugin; certificates need to be updated (expiring 1-Jan). There is an issue with cert-manager (bug to fix), so manual updates are required every 3 months; see operate-first/apps#1998.

  • Need to list/document automated pipelines/processes: @MightyNerdEric to look at what is already in Vault (a small listing sketch follows this comment), @redmikhail will look at Kubeflow.

  • Plan to start with AWS keys (S3 buckets, Airflow, Inception, OpenMetadata, any deployed application).

  • Plus seek developer feedback (Jupyter notebooks).
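
As a starting point for that inventory, a minimal sketch of listing what is stored under a Vault KV v2 mount with the hvac client (the mount point and path are assumptions, not actual configuration):

```python
# Sketch: list the entries currently stored under a Vault KV v2 mount so they
# can be documented. Mount point and prefix are hypothetical.
import hvac

vault = hvac.Client()  # uses VAULT_ADDR / VAULT_TOKEN from the environment
listing = vault.secrets.kv.v2.list_secrets(mount_point="osc", path="")
for key in listing["data"]["keys"]:
    print(key)
```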

@HeatherAck (Contributor)

@bryonbaker will help with HashiCorp. @redmikhail to reply to the email; still validating use with Airflow. @HeatherAck to ensure a ticket is created in LF.

@bryonbaker

I have started a PoC of HashiCorp Vault and various means of injecting and accessing secrets, but I would like to get a broader view of the team's needs to make sure I come up with the right solution. Who is the best source of requirements?

@caldeirav (Contributor, Author)

Here is a list of the secrets currently required for data pipeline ingestion:

  • S3 bucket credentials for the data landing zone (accessible through SFTP) as well as the data buckets for Parquet / Iceberg
  • Pachyderm credentials
  • Trino credentials
  • dbt profiles (YAML which includes the Trino JWT token)

Ideally the secrets should be injected into Airflow at runtime, including the generation of a JWT token for Trino access, which is used for both data ingestion and reads (probably the most complex requirement). A sketch of runtime injection of the dbt profile is below.

We also need to check whether we can automate the execution of OpenMetadata at the end of the data pipeline, which may require secrets for the OpenMetadata admin account (automation not tested yet; we need to open a dedicated issue for this).
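
One possible shape for the runtime injection of the dbt profile, assuming the Trino JWT is kept in Vault and the profile is rendered just before the dbt step runs; all mount points, profile names, and hosts below are illustrative:

```python
# Sketch: render a dbt profiles.yml at task runtime from a secret fetched out of
# Vault, instead of committing credentials. All names and paths are placeholders.
import os
import hvac
import yaml

def write_dbt_profile(profile_dir: str = "/tmp/dbt") -> None:
    vault = hvac.Client()  # VAULT_ADDR / VAULT_TOKEN (or k8s auth) from the environment
    jwt = vault.secrets.kv.v2.read_secret_version(
        mount_point="osc", path="trino/ingest"      # hypothetical location of the JWT
    )["data"]["data"]["jwt"]

    profile = {
        "osc_ingest": {                              # hypothetical dbt profile name
            "target": "dev",
            "outputs": {
                "dev": {
                    "type": "trino",
                    "method": "jwt",
                    "jwt_token": jwt,
                    "host": "trino.example.com",     # placeholder host
                    "port": 443,
                    "database": "osc_datacommons",
                    "schema": "ingest",
                }
            },
        }
    }
    os.makedirs(profile_dir, exist_ok=True)
    with open(os.path.join(profile_dir, "profiles.yml"), "w") as f:
        yaml.safe_dump(profile, f)
```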

@redmikhail

Just to add to the list above, we also need to manage:

  • OpenShift platform-specific secrets (certificates and so on)
  • ODH JupyterHub notebooks (S3 buckets accessed from code), Trino credentials
  • Kubeflow pipelines allow specifying keys of a Kubernetes secret to access S3 buckets, but Elyra does not have that capability

While maybe not perfect, we have a solution that is already being used for cases where the corresponding component/application consumes Kubernetes secrets: the External Secrets Operator (https://external-secrets.io/). It syncs entries in an external KMS (including HashiCorp Vault) into Kubernetes secrets. It is fairly lightweight and non-invasive (no sidecar containers and so on). Some of the use cases mentioned above may already be covered; however, if the target application does not allow easy access to a Kubernetes secret, we may need a different way of implementing it. A sketch of what the ExternalSecret resource looks like follows.
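
For illustration, this is roughly what an ExternalSecret could look like, expressed as a Python dict applied with the Kubernetes client; the store name, namespace, and Vault paths are placeholders and assume a ClusterSecretStore pointing at Vault already exists:

```python
# Sketch: define an ExternalSecret that asks the External Secrets Operator to sync
# a Vault KV entry into a plain Kubernetes Secret. Store name, namespace, paths,
# and keys are placeholders.
from kubernetes import client, config

external_secret = {
    "apiVersion": "external-secrets.io/v1beta1",
    "kind": "ExternalSecret",
    "metadata": {"name": "urgentem-s3", "namespace": "osc-ingest"},
    "spec": {
        "refreshInterval": "1h",
        "secretStoreRef": {"name": "vault-backend", "kind": "ClusterSecretStore"},
        "target": {"name": "urgentem-s3"},  # resulting Kubernetes Secret
        "data": [
            {
                "secretKey": "AWS_ACCESS_KEY_ID",
                "remoteRef": {"key": "osc/ingestion/urgentem", "property": "aws_access_key_id"},
            },
            {
                "secretKey": "AWS_SECRET_ACCESS_KEY",
                "remoteRef": {"key": "osc/ingestion/urgentem", "property": "aws_secret_access_key"},
            },
        ],
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="external-secrets.io",
    version="v1beta1",
    namespace="osc-ingest",
    plural="externalsecrets",
    body=external_secret,
)
```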

@bryonbaker commented Jan 11, 2023

Is there any reason we could not use HashiCorp Vault with Kubernetes service accounts to give all pods in a namespace access to a set of secrets? We can do this either via injection or by calling Vault directly.

I know Airflow lets you assign attributes to the scheduled pods, so that would enable injection; a sketch is below.
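
A minimal sketch of that injection route, assuming the Vault Agent Injector is installed in the cluster and a Vault Kubernetes-auth role is bound to the pipeline's service account; the role name, secret path, and image are placeholders:

```python
# Sketch: have the Vault Agent Injector mount a secret into an Airflow-scheduled pod
# by annotating it. Role name, Vault path, and image are hypothetical.
# (Import path depends on the cncf.kubernetes provider version.)
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

ingest_task = KubernetesPodOperator(
    task_id="ingest_urgentem",
    name="ingest-urgentem",
    namespace="osc-ingest",
    image="quay.io/example/ingest:latest",          # placeholder image
    service_account_name="osc-ingest-pipeline",     # bound to a Vault k8s-auth role
    annotations={
        "vault.hashicorp.com/agent-inject": "true",
        "vault.hashicorp.com/role": "osc-ingest",   # hypothetical Vault role
        # Renders the secret to /vault/secrets/urgentem-s3 inside the pod.
        "vault.hashicorp.com/agent-inject-secret-urgentem-s3": "osc/data/ingestion/urgentem",
    },
    cmds=["python", "ingest.py"],
)
```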
