
OSC-Platform-031: Secret management for proprietary data ingestion #81

Open · caldeirav opened this issue Oct 7, 2021 · 12 comments

@caldeirav (Contributor)

We currently have two issues with secret management for AWS S3 buckets:

  • The first concerns ingestion: some public sources can all be made accessible through one set of secrets on redhat-osc-physical-landing-647521352890, but other data sources (LSEG, Urgentem, etc.) should not be readable by every ingestion pipeline (i.e. we need one set of secrets per data source, since those sources are ingested by independent pipelines that live in different GitHub repos).
  • The second concerns secret distribution: sending secrets around for people to paste into their credentials.env is not great. Ideally we need a secret store from which the required secret can be retrieved at pipeline runtime, based on the credentials of the pipeline user (or the process user, once we automate it in prod).

Preferably we want a secret management solution that works with a secretless broker, so the process is seamless for developers; for example, with the Conjur secret store:
https://github.com/cyberark/secretless-broker
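
As a point of reference for what runtime retrieval could look like, here is a minimal sketch assuming a HashiCorp Vault KV v2 store with one entry per data source; the mount point, secret paths, and source names are illustrative, not existing configuration:

```python
# Minimal sketch: fetch per-data-source S3 credentials at pipeline runtime.
# Mount point, secret paths, and source names are hypothetical placeholders.
import boto3
import hvac

def s3_client_for_source(source: str):
    # Authenticate with whatever identity the pipeline runs under
    # (VAULT_ADDR / VAULT_TOKEN from the environment in this sketch).
    vault = hvac.Client()
    secret = vault.secrets.kv.v2.read_secret_version(
        mount_point="osc",               # hypothetical KV v2 mount
        path=f"ingestion/{source}",      # one entry per data source, e.g. "urgentem"
    )["data"]["data"]
    return boto3.client(
        "s3",
        aws_access_key_id=secret["aws_access_key_id"],
        aws_secret_access_key=secret["aws_secret_access_key"],
    )

# Each ingestion pipeline only requests the entry for its own source.
s3 = s3_client_for_source("urgentem")
```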

@eoriorda commented May 9, 2022

See whether this can be part of Operate First.

@eoriorda moved this from In Progress to Todo in Data Commons Platform on Jul 18, 2022
@HeatherAck (Contributor) commented Oct 31, 2022

ODH team to provide guidance on secret management (Landon Smith), not just for notebooks; Heather to ask on the ODH support channel in Slack. This issue is to be broken up: key management requires overall architecture and planning. A discussion/meeting is needed; @HeatherAck to set it up.

@HeatherAck (Contributor)

@HeatherAck to schedule for the week of 14-Nov.

@HeatherAck (Contributor)

Meeting planned for 8-Dec; will mark this as blocked until that date.

@HeatherAck (Contributor) commented Dec 16, 2022

We prefer to use Airflow as the scheduler, since it handles more complex data pipelines.

(1) @redmikhail to see whether we can inject secrets into an Airflow pipeline.
(2) Need to verify that HashiCorp Vault will work with Airflow (if not, we will need to pick a different key store, e.g. https://github.com/cyberark/secretless-broker). We need to define the full set of components and the integration between them, and also determine the impact on users of moving away from the Operate First key store. A sketch of the Vault/Airflow integration is below.

Other factors to consider: OpenMetadata and Airflow are tightly coupled. Consider externalizing Airflow to enable other functionality. See also #243.
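
For reference, the apache-airflow-providers-hashicorp package ships a Vault secrets backend, so this part looks feasible; a minimal sketch of a task resolving a Trino connection from Vault once that backend is configured (the Vault URL, mount point, and connection id are placeholders):

```python
# Sketch only: assumes apache-airflow-providers-hashicorp is installed and the
# Airflow deployment sets (values are illustrative):
#   AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#   AIRFLOW__SECRETS__BACKEND_KWARGS={"url": "https://vault.example.com",
#       "mount_point": "airflow", "connections_path": "connections"}
from airflow.hooks.base import BaseHook

def fetch_trino_credentials():
    # With the Vault backend enabled, get_connection() resolves the connection
    # from Vault (airflow/connections/trino_default) instead of the metadata DB.
    conn = BaseHook.get_connection("trino_default")  # placeholder connection id
    return conn.login, conn.password
```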

@HeatherAck (Contributor) commented Dec 16, 2022

(3) Need to know the use cases where the keys will be used.

  • @caldeirav to provide the use case for the data pipeline.
  • @ryanaslett and @MightyNerdEric to identify other operational use cases where keys are needed [Kubeflow (e.g. ML data extraction), Argo CD].

(4) Need to establish different rules/restrictions for CL2 to ensure sandbox development and testing are not slowed down; CL3 will be the stable cluster.

(5) Document the policy and its use.

@caldeirav moved this from Todo to In Progress in Data Commons Platform on Dec 16, 2022
@HeatherAck (Contributor) commented Dec 19, 2022

@redmikhail is still investigating (1) above.

Re: (3) above: customize the plugin; certificates need to be updated (expiring 1-Jan). There is an issue with cert-manager (bug to fix), so manual updates are required every 3 months; see operate-first/apps#1998.

  • Need to list/document automated pipelines/processes: @MightyNerdEric to look at what is already in Vault (a small listing sketch follows this comment), @redmikhail will look at Kubeflow.

  • Plan to start with AWS keys (S3 buckets, Airflow, Inception, OpenMetadata, any deployed application).

  • Plus seek developer feedback (Jupyter notebooks).
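
As a starting point for that inventory, a minimal sketch of listing what is stored under a Vault KV v2 mount with the hvac client (the mount point and path are assumptions, not actual configuration):

```python
# Sketch: list the entries currently stored under a Vault KV v2 mount so they
# can be documented. Mount point and prefix are hypothetical.
import hvac

vault = hvac.Client()  # uses VAULT_ADDR / VAULT_TOKEN from the environment
listing = vault.secrets.kv.v2.list_secrets(mount_point="osc", path="")
for key in listing["data"]["keys"]:
    print(key)
```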

@HeatherAck (Contributor)

@bryonbaker will help with HashiCorp. @redmikhail to reply to the email; still validating use with Airflow. @HeatherAck to ensure a ticket is created in LF.

@bryonbaker

I have started a PoC of HashiCorp Vault and various means of injecting and accessing secrets, but I would like to get a broader view of the team's needs to make sure I come up with the right solution. Who is the best source of requirements?

@caldeirav (Contributor, Author)

Here is a list of the secrets currently required for data pipeline ingestion:

  • S3 bucket credentials for the data landing zone (accessible through SFTP) as well as the data buckets for Parquet / Iceberg
  • Pachyderm credentials
  • Trino credentials
  • dbt profiles (YAML which includes the Trino JWT token)

Ideally the secrets should be injected into Airflow at runtime, including the generation of a JWT token for Trino access, which is used for both data ingestion and reads (probably the most complex requirement). A sketch of runtime injection of the dbt profile is below.

We also need to check whether we can automate the execution of OpenMetadata at the end of the data pipeline, which may require secrets for the OpenMetadata admin account (automation not tested yet; we need to open a dedicated issue for this).
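
One possible shape for the runtime injection of the dbt profile, assuming the Trino JWT is kept in Vault and the profile is rendered just before the dbt step runs; all mount points, profile names, and hosts below are illustrative:

```python
# Sketch: render a dbt profiles.yml at task runtime from a secret fetched out of
# Vault, instead of committing credentials. All names and paths are placeholders.
import os
import hvac
import yaml

def write_dbt_profile(profile_dir: str = "/tmp/dbt") -> None:
    vault = hvac.Client()  # VAULT_ADDR / VAULT_TOKEN (or k8s auth) from the environment
    jwt = vault.secrets.kv.v2.read_secret_version(
        mount_point="osc", path="trino/ingest"      # hypothetical location of the JWT
    )["data"]["data"]["jwt"]

    profile = {
        "osc_ingest": {                              # hypothetical dbt profile name
            "target": "dev",
            "outputs": {
                "dev": {
                    "type": "trino",
                    "method": "jwt",
                    "jwt_token": jwt,
                    "host": "trino.example.com",     # placeholder host
                    "port": 443,
                    "database": "osc_datacommons",
                    "schema": "ingest",
                }
            },
        }
    }
    os.makedirs(profile_dir, exist_ok=True)
    with open(os.path.join(profile_dir, "profiles.yml"), "w") as f:
        yaml.safe_dump(profile, f)
```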

@redmikhail

Just to add to the list above, we also need to manage:

  • OpenShift platform-specific secrets (certificates and so on)
  • ODH JupyterHub notebooks (S3 buckets accessed from code), Trino credentials
  • Kubeflow pipelines allow specifying keys of a Kubernetes secret to access S3 buckets, but Elyra does not have that capability

While maybe not perfect, we have a solution that is already being used for cases where the corresponding component/application consumes Kubernetes secrets: the External Secrets Operator (https://external-secrets.io/). It syncs entries in an external KMS (including HashiCorp Vault) into Kubernetes secrets. It is fairly lightweight and non-invasive (no sidecar containers and so on). Some of the use cases mentioned above may already be covered; however, if the target application does not allow easy access to a Kubernetes secret, we may need a different way of implementing it. A sketch of what the ExternalSecret resource looks like follows.
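
For illustration, this is roughly what an ExternalSecret could look like, expressed as a Python dict applied with the Kubernetes client; the store name, namespace, and Vault paths are placeholders and assume a ClusterSecretStore pointing at Vault already exists:

```python
# Sketch: define an ExternalSecret that asks the External Secrets Operator to sync
# a Vault KV entry into a plain Kubernetes Secret. Store name, namespace, paths,
# and keys are placeholders.
from kubernetes import client, config

external_secret = {
    "apiVersion": "external-secrets.io/v1beta1",
    "kind": "ExternalSecret",
    "metadata": {"name": "urgentem-s3", "namespace": "osc-ingest"},
    "spec": {
        "refreshInterval": "1h",
        "secretStoreRef": {"name": "vault-backend", "kind": "ClusterSecretStore"},
        "target": {"name": "urgentem-s3"},  # resulting Kubernetes Secret
        "data": [
            {
                "secretKey": "AWS_ACCESS_KEY_ID",
                "remoteRef": {"key": "osc/ingestion/urgentem", "property": "aws_access_key_id"},
            },
            {
                "secretKey": "AWS_SECRET_ACCESS_KEY",
                "remoteRef": {"key": "osc/ingestion/urgentem", "property": "aws_secret_access_key"},
            },
        ],
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="external-secrets.io",
    version="v1beta1",
    namespace="osc-ingest",
    plural="externalsecrets",
    body=external_secret,
)
```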

@bryonbaker commented Jan 11, 2023

Is there any reason we could not use HashiCorp Vault with Kubernetes service accounts to give all pods in a namespace access to a set of secrets? We can do this either via injection or by calling Vault directly.

I know Airflow lets you assign attributes to the scheduled pods, so that would enable injection; a sketch is below.
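
A minimal sketch of that injection route, assuming the Vault Agent Injector is installed in the cluster and a Vault Kubernetes-auth role is bound to the pipeline's service account; the role name, secret path, and image are placeholders:

```python
# Sketch: have the Vault Agent Injector mount a secret into an Airflow-scheduled pod
# by annotating it. Role name, Vault path, and image are hypothetical.
# (Import path depends on the cncf.kubernetes provider version.)
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

ingest_task = KubernetesPodOperator(
    task_id="ingest_urgentem",
    name="ingest-urgentem",
    namespace="osc-ingest",
    image="quay.io/example/ingest:latest",          # placeholder image
    service_account_name="osc-ingest-pipeline",     # bound to a Vault k8s-auth role
    annotations={
        "vault.hashicorp.com/agent-inject": "true",
        "vault.hashicorp.com/role": "osc-ingest",   # hypothetical Vault role
        # Renders the secret to /vault/secrets/urgentem-s3 inside the pod.
        "vault.hashicorp.com/agent-inject-secret-urgentem-s3": "osc/data/ingestion/urgentem",
    },
    cmds=["python", "ingest.py"],
)
```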
