Describe the bug
I ran into an issue when deploying GX to our AWS S3 bucket using AWS Glue. I found it hard to set up the YAML configuration for S3, so I had to build the configuration manually using this code:
# Function to update yaml manually
def update_yml_config(context, s3_client, bucket_name='my-bucket', key='great-expectations/great_expectations.yml'):
    config_dict = context.variables.config.to_json_dict()
    if 'data_context_id' in config_dict:
        config_dict['data_context_id'] = str(config_dict['data_context_id'])
    if 'fluent_datasources' in config_dict:
        for ds in config_dict['fluent_datasources'].values():
            if 'connection_string' in ds:
                ds['connection_string'] = "secret|${aws_secrets_manager_arn}|redshift_connection_string"
    config_yaml = yaml.dump(config_dict, default_flow_style=False)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=key,
        Body=config_yaml,
        ContentType="text/yaml"
    )
# Get default context
context = gx.get_context()
config_dict = context.variables.config
# Updating config
config_dict["stores"] = {
    "expectations_store": {
        "class_name": "ExpectationsStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "my-bucket",
            "prefix": "great-expectations/expectations/"
        }
    },
    "validation_results_store": {
        "class_name": "ValidationResultsStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "my-bucket",
            "prefix": "great-expectations/validations/"
        }
    },
    "checkpoint_store": {
        "class_name": "CheckpointStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "my-bucket",
            "prefix": "great-expectations/checkpoints/"
        }
    },
    "validation_definition_store": {
        "class_name": "ValidationDefinitionStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "my-bucket",
            "prefix": "great-expectations/validation_definitions/"
        }
    }
}
# Update context config to point to config_variables
config_dict["config_variables_file_path"] = "great-expectations/uncommitted/config_variables.yml"
# Creating config variables yaml
config_variables = {
    "my_aws_creds": "secret|arn:aws:secretsmanager|redshift_connection_string"
}
# Create data docs site
site_config = {
    "class_name": "SiteBuilder",
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": "my-bucket",
        "prefix": "great-expectations/data_docs/",
    },
}
context.variables.config.data_docs_sites = {
    "local_site": site_config,
    "s3_site": site_config
}
s3_client.put_object(
    Bucket='my-bucket',
    Key='great-expectations/uncommitted/config_variables.yml',
    Body=yaml.dump(config_variables),
    ContentType="text/yaml"
)
# Dump config to yaml files
context_config = context.config
config_dict = context_config.to_dict()
config_yaml = yaml.dump(config_dict, default_flow_style=False)
s3_client.put_object(
    Bucket='my-bucket',
    Key='great-expectations/great_expectations.yml',
    Body=config_yaml,
    ContentType="text/yaml"
)
update_yml_config(context, s3_client)
# Read the config back from S3. The dump above does not generate the data_context_id
# automatically; running the code below and saving the YAML again does generate it.
response = s3_client.get_object(
    Bucket='my-bucket',
    Key='great-expectations/great_expectations.yml'
)
config_yaml = response['Body'].read().decode('utf-8')
config_dict = yaml.safe_load(config_yaml)
context_config = DataContextConfig(**config_dict)
context = gx.get_context(project_config=context_config)
update_yml_config(context, s3_client)
# Load config_variables.yml into the Data Context
config_vars_response = s3_client.get_object(
    Bucket="my-bucket",
    Key="great-expectations/uncommitted/config_variables.yml"
)
config_vars = yaml.safe_load(config_vars_response["Body"].read())
context.config_variables.update(config_vars)  # Inject variables
It successfully stores the YAML file at the given path, along with the folder structure and the config_variables. To load the config, I need to read the YAML files manually, for both the GX config and the config variables, like this:
Thanks for your detailed report. This behavior is expected: GX requires a local project structure and does not currently support using an S3 path as the full project directory. By default, we initialize in a temporary directory (e.g., /tmp/tmp3scyd_fn). However, we do support configuring individual stores (e.g., expectations_store, validation_results_store, checkpoint_store) to use S3 by specifying an s3:// path in the store_backend configuration.
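As a rough illustration of the per-store S3 configuration, the four near-identical store entries from the report could be generated by a small helper. This is only a sketch: `build_s3_stores` and `STORE_LAYOUT` are hypothetical names, not part of GX, and the class names, bucket, and prefixes are copied from the config in the report rather than checked against a particular GX release.

```python
from typing import Dict, Tuple

# Store name -> (store class name, sub-folder), copied from the report's config.
STORE_LAYOUT: Dict[str, Tuple[str, str]] = {
    "expectations_store": ("ExpectationsStore", "expectations"),
    "validation_results_store": ("ValidationResultsStore", "validations"),
    "checkpoint_store": ("CheckpointStore", "checkpoints"),
    "validation_definition_store": ("ValidationDefinitionStore", "validation_definitions"),
}


def build_s3_stores(bucket: str, root_prefix: str = "great-expectations") -> Dict[str, dict]:
    """Hypothetical helper: build a `stores` dict whose backends all point at one S3 bucket."""
    root = root_prefix.strip("/")
    return {
        store_name: {
            "class_name": class_name,
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": bucket,
                "prefix": f"{root}/{subdir}/",
            },
        }
        for store_name, (class_name, subdir) in STORE_LAYOUT.items()
    }


stores = build_s3_stores("my-bucket")
```

The result could then be assigned the same way the report does it (`config_dict["stores"] = stores`), which keeps the bucket name and root prefix in one place.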
I understand that this setup adds complexity, but I believe your current approach is the best way with our existing structure. I’ll share your feedback about streamlining deployments with my team, though I can't guarantee any immediate changes. Please check back on this issue for further updates.
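If the manual approach stays in place, the connection-string masking inside `update_yml_config` can be pulled out into a pure function, so it can be checked without S3 or a GX context. The sketch below assumes the config dict has the same shape as in the report (`fluent_datasources` mapping names to dicts); `mask_connection_strings` is a hypothetical name, and the placeholder string is copied verbatim from the report.

```python
import copy

# Placeholder copied from the report; it is meant to be resolved later via AWS Secrets Manager.
SECRET_PLACEHOLDER = "secret|${aws_secrets_manager_arn}|redshift_connection_string"


def mask_connection_strings(config_dict, placeholder=SECRET_PLACEHOLDER):
    """Hypothetical helper: return a copy of the config with every datasource
    connection string replaced by a secret placeholder."""
    masked = copy.deepcopy(config_dict)  # leave the caller's dict untouched
    for datasource in (masked.get("fluent_datasources") or {}).values():
        if "connection_string" in datasource:
            datasource["connection_string"] = placeholder
    return masked


# Made-up example config, for illustration only.
example_config = {
    "fluent_datasources": {
        "redshift": {
            "type": "redshift",
            "connection_string": "redshift+psycopg2://user:password@host:5439/db",
        }
    }
}
masked_config = mask_connection_strings(example_config)
```

Working on a deep copy means the live context's config keeps its real connection string, while only the copy that gets dumped to YAML carries the placeholder.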
I once tried to run this code:
But the result shows that GX read the config from the tmp files instead of the project folder, like this log:
I hope deploying GX can be made simpler in the future. Thank you.
Environment:
Additional context
Here's my full code that I need to execute to run the data validation in AWS Glue after setting up the config:
Here's my config for reference: