# Data Transfer from Amazon S3 Glacier Vaults to Amazon S3 - Developer Guide

## Transfer Workflow Sequence Diagram

The sequence diagram below illustrates the sequence of events, starting with the customer executing a Systems Manager automation document to launch a transfer. It shows the internal components, workflows, and resource dependencies involved in transferring a Glacier vault to S3.

![Data transfer from Glacier vaults to S3 - Sequence diagram](./sequence_diagram.png)

## How the Solution Works

The *Data Transfer from Amazon S3 Glacier Vaults to Amazon S3* Solution uses *Systems Manager* automation documents as entry points, providing a user-friendly interface for initiating transfer workflows. After the Solution is deployed, two automation documents are available to capture all necessary inputs from users: `LaunchAutomationRunbook` and `ResumeAutomationRunbook`.

* `LaunchAutomationRunbook` initiates a new orchestrator workflow to transfer archives from a Glacier vault to an S3 bucket.
* `ResumeAutomationRunbook` initiates an orchestrator workflow to resume a partially completed transfer from a Glacier vault to an S3 bucket, for example after a failure during the initial transfer or an intentional stop by the customer.
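
For illustration, a transfer can also be launched programmatically by executing the runbook with the AWS SDK rather than the console. The sketch below is hypothetical: the parameter names (`VaultName`, `DestinationBucket`) are placeholders, and the actual inputs are defined by the deployed automation document.

```python
# Hypothetical sketch: start a transfer by executing the LaunchAutomationRunbook
# automation document via boto3. Parameter names are illustrative only.
import boto3

ssm = boto3.client("ssm")

response = ssm.start_automation_execution(
    DocumentName="LaunchAutomationRunbook",  # created when the Solution is deployed
    Parameters={
        # Placeholder parameter names -- the deployed runbook defines the real ones.
        "VaultName": ["my-glacier-vault"],
        "DestinationBucket": ["my-destination-bucket"],
    },
)
print("Automation execution id:", response["AutomationExecutionId"])
```
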
### Orchestrator workflow

Once users provide the necessary inputs and execute the automation document, the main orchestrator Step Function, `OrchestratorStateMachine`, is triggered. This orchestrator Step Function consists of a series of steps designed to manage the entire transfer process:

1. Metadata about the entire transfer is stored in the `GlacierObjectRetrieval` DynamoDB table (sketched below).
2. A nested Step Function, `ArchivesStatusCleanupStateMachine`, is triggered to clean up any outdated archive statuses left in DynamoDB by a partial or previously failed transfer.
3. Next, a nested Step Function, `InventoryRetrievalStateMachine`, is triggered to retrieve the *Inventory* file. Refer to [Vault Inventory](https://docs.aws.amazon.com/amazonglacier/latest/dev/vault-inventory.html).
4. Once the *Inventory* file download completes, two *EventBridge* rules are configured to periodically trigger the nested Step Functions `ExtendDownloadWindowStateMachine` and `CloudWatchDashboardUpdateStateMachine`.
5. Subsequently, a nested Step Function, `InitiateRetrievalStateMachine`, is triggered; it iterates through the archive records listed in the downloaded *Inventory* file and initiates a retrieval job for each archive.
6. Afterward, another *EventBridge* rule is configured to periodically trigger the `CompletionChecker` Lambda function, which checks whether the `ArchiveRetrieval` workflow has completed.
7. Finally, the workflow enters an asynchronous wait, pausing until the `ArchiveRetrieval` workflow concludes, before the nested `CleanupStateMachine` Step Function is triggered.
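
As a rough sketch of step 1, the transfer-level metadata record might be written along the following lines. The table name comes from this guide, but the key and attribute names (`pk`, `sk`, `vault_name`, and so on) are assumptions for illustration, not the Solution's actual schema.

```python
# Minimal sketch of persisting transfer-level metadata (orchestrator step 1).
# Key and attribute names are assumed, not taken from the Solution's schema.
import boto3
from datetime import datetime, timezone

table = boto3.resource("dynamodb").Table("GlacierObjectRetrieval")

table.put_item(
    Item={
        "pk": "workflow#my-transfer-run",   # assumed partition key
        "sk": "meta",                       # assumed sort key
        "vault_name": "my-glacier-vault",
        "destination_bucket": "my-destination-bucket",
        "started_at": datetime.now(timezone.utc).isoformat(),
        "status": "STARTED",
    }
)
```
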
### Inventory Retrieval workflow

The `InventoryRetrievalStateMachine` Step Function's purpose is to retrieve the *Inventory* file of a specific vault, which is then used in subsequent steps to transfer the archive records it contains.
The `InventoryRetrievalStateMachine` Step Function consists of a series of steps designed to manage the retrieval and downloading of the *Inventory* file:

1. `InventoryRetrievalStateMachine` includes a choice step to bypass the *Inventory* download if the *Inventory* has already been provided by the user.
2. The `RetrieveInventoryInitiateJob` Lambda function is invoked to initiate an *Inventory* retrieval job (sketched below).
3. Subsequently, the workflow enters an asynchronous wait, pausing until the *Inventory* retrieval job is completed and ready for downloading.
4. When Glacier sends a job completion event to the `AsyncFacilitatorTopic` SNS topic, which is configured to deliver messages to the `NotificationsQueue` SQS queue, the queue triggers the `AsyncFacilitator` Lambda function to unlock the asynchronous wait, allowing the workflow to proceed with the *Inventory* download.
5. The `InventoryChunkDownload` Lambda function is invoked within a Distributed Map, with each iteration downloading a chunk of the *Inventory* file.
6. Once all chunks are downloaded, the `InventoryValidation` Lambda function validates the downloaded *Inventory* file and stores it in the `InventoryBucket` S3 bucket under the *original_inventory* prefix.
7. After that, a Glue job is generated based on a predefined Glue workflow graph definition.
8. The Glue job filters out archives larger than 5 TB, sorts the remaining archives by creation date, parses descriptions to extract archive names, handles duplicates, and partitions the *Inventory* into predefined partitions stored in the `InventoryBucket` S3 bucket under the *sorted_inventory* prefix.
9. Finally, the `SendAnonymizedStats` Lambda function is invoked to send anonymized operational metrics.
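
Step 2 corresponds to Glacier's `InitiateJob` API with an inventory-retrieval job type. A minimal sketch, with placeholder vault and topic names:

```python
# Sketch of initiating an inventory-retrieval job (step 2) with boto3.
# The vault name and SNS topic ARN are placeholders.
import boto3

glacier = boto3.client("glacier")

response = glacier.initiate_job(
    accountId="-",  # "-" means the account owning the credentials
    vaultName="my-glacier-vault",
    jobParameters={
        "Type": "inventory-retrieval",
        "Format": "CSV",
        "SNSTopic": "arn:aws:sns:us-east-1:111122223333:AsyncFacilitatorTopic",
    },
)
print("Inventory retrieval job id:", response["jobId"])
```
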
### Initiate Archives Retrieval workflow

The `InitiateRetrievalStateMachine` Step Function's purpose is to initiate a retrieval job for every archive listed in the *Inventory* file. A Distributed Map with locked concurrency is used to iterate over the ordered inventory stored in S3, and the `InitiateArchiveRetrieval` Lambda function is invoked to process batches of 100 archives. `InitiateRetrievalStateMachine` processes one portion of the archives in the *Inventory* file at a time to control the request rate. It then calculates a timeout value based on the size of the processed portion and the time taken to initiate the retrieval jobs, ensuring that the total number of archive jobs initiated within a 24-hour period stays below Glacier's daily quota.

1. For each partition, DynamoDB partition metadata is reset; this metadata captures the total number of archive jobs requested within the partition and the time taken to initiate them.
2. Two nested Distributed Maps are then used: the outer map iterates through the S3 CSV partition files located in the `InventoryBucket` S3 bucket under the *sorted_inventory* prefix, while the inner map iterates through the individual archive records within each file.
3. To ensure archives are requested in order, concurrency for the Distributed Maps is set to 1.
4. The `InitiateArchiveRetrieval` Lambda function is invoked within the inner Distributed Map with a batch size of 100, enabling it to concurrently initiate 100 archive jobs and achieve nearly 100 transactions per second (TPS).
5. Upon completing the requests for all archives within a partition, the `CalculateTimeout` Lambda function is invoked to calculate the wait time before initiating requests for the next partition.
6. The timeout value is calculated from the partition size, Glacier's daily quota, and the time elapsed while initiating the requests (see the sketch below).
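
The exact formula is not documented here, but the intent of steps 5-6 can be illustrated as: give each partition a time budget proportional to its share of the daily quota, then wait out whatever part of that budget was not consumed while initiating the jobs. A hedged sketch:

```python
# Rough sketch of the CalculateTimeout idea (steps 5-6). This illustrates the
# intent -- keeping initiated jobs per 24 hours under the Glacier daily quota --
# and is not the Solution's actual formula.
def calculate_timeout_seconds(partition_size: int,
                              daily_quota: int,
                              elapsed_seconds: float) -> float:
    """Return how long to wait before starting the next partition."""
    seconds_per_day = 24 * 60 * 60
    budget = (partition_size / daily_quota) * seconds_per_day  # this partition's time slot
    return max(0.0, budget - elapsed_seconds)

# Example: a 10,000-archive partition against a 100,000-jobs/day quota that took
# 600 seconds to request -> wait roughly 8,040 seconds before the next partition.
print(calculate_timeout_seconds(10_000, 100_000, 600))
```
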
### Archive Retrieval workflow

Archive retrieval follows an event-driven architecture:

1. When a job is ready, Glacier sends a notification to the `AsyncFacilitatorTopic` SNS topic. This SNS topic is set up to deliver messages to the `NotificationsQueue` SQS queue, indicating the completion of the job.
2. The `NotificationsQueue` SQS queue triggers the `NotificationsProcessor` Lambda function, which processes the notifications and performs the initial steps of archive retrieval, including starting a multipart upload and calculating the number and sizes of chunks.
3. The `NotificationsProcessor` Lambda function then places messages for chunk retrieval in the `ChunksRetrievalQueue` SQS queue for further processing.
4. The `ChunksRetrievalQueue` SQS queue triggers the `ChunkRetrieval` Lambda function to retrieve an archive's chunk.
   1. First, the `ChunkRetrieval` Lambda function downloads the chunk from Glacier.
   2. Then, it uploads a multipart upload part to the S3 destination bucket (see the sketch after this list).
   3. After a new chunk is downloaded, the chunk metadata is stored in the `GlacierObjectRetrieval` DynamoDB table.
   4. The `ChunkRetrieval` Lambda function verifies whether all chunks for a particular archive have been processed. If so, it inserts an event into the `ValidationQueue` SQS queue to trigger the `ArchiveValidation` Lambda function.
5. The `ArchiveValidation` Lambda function conducts hash validation and integrity checks before closing the S3 multipart upload.
6. The `MetricTable` DynamoDB stream invokes the `MetricsProcessor` Lambda function to update transfer process metrics in the `MetricTable` DynamoDB table.
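
Steps 4.1 and 4.2 correspond to Glacier's `GetJobOutput` call (with a byte range) followed by an S3 `UploadPart` call. A minimal sketch, with placeholder identifiers:

```python
# Sketch of ChunkRetrieval's core work: download one chunk of a staged archive
# from Glacier, then upload it as one part of an S3 multipart upload.
# The job id, upload id, byte range, and names are placeholders.
import boto3

glacier = boto3.client("glacier")
s3 = boto3.client("s3")

# 1. Download the chunk from Glacier (the archive-retrieval job must be ready).
chunk = glacier.get_job_output(
    accountId="-",
    vaultName="my-glacier-vault",
    jobId="EXAMPLE-JOB-ID",
    range="bytes=0-8388607",  # first 8 MiB of the archive
)["body"].read()

# 2. Upload the chunk as one part of the multipart upload in the destination bucket.
part = s3.upload_part(
    Bucket="my-destination-bucket",
    Key="my-archive-name",
    PartNumber=1,
    UploadId="EXAMPLE-UPLOAD-ID",
    Body=chunk,
)
print("Uploaded part ETag:", part["ETag"])
```
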
### Archives Status Cleanup workflow

The `ArchivesStatusCleanupStateMachine` Step Function's purpose is to update any outdated archive statuses in DynamoDB to a terminated status, in particular when resuming a partial transfer or a previously failed transfer attempt. The `ArchivesStatusCleanupStateMachine` Step Function is triggered before a transfer workflow is initiated.
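
As a hedged illustration of what updating a single outdated status could look like (the Solution's real key schema, attribute names, and status values are not documented here):

```python
# Illustrative only: mark one stale archive record as terminated before a resume.
# Key names, attribute names, and status values are assumptions.
import boto3

table = boto3.resource("dynamodb").Table("GlacierObjectRetrieval")

table.update_item(
    Key={"pk": "archive#EXAMPLE-ARCHIVE-ID", "sk": "meta"},
    UpdateExpression="SET #s = :terminated",
    ExpressionAttributeNames={"#s": "retrieval_status"},
    ExpressionAttributeValues={":terminated": "TERMINATED"},
)
```
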
### Download Window extension workflow

The `ExtendDownloadWindowStateMachine` Step Function's purpose is to account for the 24-hour download window of each archive. At predefined intervals, staged archives are checked to see how much time is left in their download window. If the remaining time is less than five hours, another retrieval job is initiated (there is no additional cost to re-stage already staged archives).

The `ExtendDownloadWindowStateMachine` Step Function consists of a series of steps designed to manage the extension process for the download window of staged archives nearing expiration:

1. First, the `ArchivesNeedingWindowExtension` Lambda function is invoked to query DynamoDB, capturing all archives that are already staged but set to expire within the next 5 hours (sketched below).
2. The same Lambda function then generates a JSON file listing the archives that need a download window extension and writes this file to the inventory bucket.
3. Next, a Distributed Map iterates over the generated JSON file and batch-processes these archives.
4. Within this Distributed Map, the `ExtendDownloadInitiateRetrieval` Lambda function is invoked to batch-initiate retrieval requests for the archives included in the generated JSON file.
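
Step 1 amounts to a time-bounded query against DynamoDB. A sketch under assumed index and attribute names (`status-expiry-index`, `retrieval_status`, `expires_at`):

```python
# Sketch of finding staged archives whose download window expires within 5 hours.
# The index name and attribute names are assumptions for illustration.
import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("GlacierObjectRetrieval")
cutoff = int(time.time()) + 5 * 60 * 60  # now + 5 hours, in epoch seconds

response = table.query(
    IndexName="status-expiry-index",  # assumed GSI on (retrieval_status, expires_at)
    KeyConditionExpression=Key("retrieval_status").eq("STAGED")
    & Key("expires_at").lt(cutoff),
)
expiring_soon = response["Items"]
print(f"{len(expiring_soon)} archives need their download window extended")
```
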
### CloudWatch Dashboard Update workflow

The `CloudWatchDashboardUpdateStateMachine` Step Function's purpose is to update the Solution's custom CloudWatch dashboard, `CloudWatchDashboard`, every 5 minutes, refreshing it with the most recently collected metrics.

The `CloudWatchDashboardUpdateStateMachine` Step Function consists of a series of steps designed to refresh the dashboard with the recently collected metrics:

1. Query the `MetricTable` DynamoDB table.
2. Call the *PutMetricData* API to publish those metrics for the CloudWatch dashboard.
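
Step 2 maps to CloudWatch's `PutMetricData` API. A minimal sketch with a placeholder namespace and metric names:

```python
# Sketch of publishing transfer metrics so the custom dashboard can display them.
# The namespace and metric names are placeholders, not necessarily the Solution's.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="GlacierToS3Transfer",
    MetricData=[
        {"MetricName": "DownloadedArchives", "Value": 12345, "Unit": "Count"},
        {"MetricName": "DownloadedBytes", "Value": 9.8e11, "Unit": "Bytes"},
    ],
)
```
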
### Completion Checker workflow

The `CompletionChecker` Lambda function's purpose is to periodically verify the completion status of the `ArchiveRetrieval` workflow. It does this by checking the `MetricTable` DynamoDB table and comparing the count of downloaded archives with the total number of archives originally in the vault. When the counts are equal, it concludes the asynchronous wait and proceeds to the `CleanupStateMachine` Step Function.
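
A common way to conclude such an asynchronous wait in Step Functions is the task-token callback pattern; the sketch below assumes that mechanism (the guide does not state it explicitly) and uses placeholder attribute names.

```python
# Sketch of a completion check: compare downloaded vs. total archive counts in
# MetricTable and, if they match, release the orchestrator's asynchronous wait.
# Attribute names and the task-token mechanism are assumptions for illustration.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

metrics = dynamodb.Table("MetricTable").get_item(Key={"pk": "transfer#my-run"})["Item"]

if metrics["downloaded_archives"] == metrics["total_archives"]:
    sfn.send_task_success(
        taskToken=metrics["orchestrator_task_token"],  # assumed to be stored when the wait began
        output=json.dumps({"status": "ARCHIVE_RETRIEVAL_COMPLETED"}),
    )
```
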
### Cleanup workflow

The `CleanupStateMachine` Step Function's purpose is to perform post-transfer cleanup tasks. This includes removing all incomplete multipart uploads and stopping all *EventBridge* rules so that the periodic workflows are no longer triggered after the transfer terminates.
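
For illustration, these two cleanup tasks map to the S3 multipart-upload APIs and the EventBridge `DisableRule` API. The bucket and rule names below are placeholders:

```python
# Sketch of post-transfer cleanup: abort incomplete multipart uploads in the
# destination bucket and disable the periodic EventBridge rules.
# The bucket name and rule names are placeholders.
import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

# 1. Abort any multipart uploads that never completed.
uploads = s3.list_multipart_uploads(Bucket="my-destination-bucket").get("Uploads", [])
for upload in uploads:
    s3.abort_multipart_upload(
        Bucket="my-destination-bucket",
        Key=upload["Key"],
        UploadId=upload["UploadId"],
    )

# 2. Stop the periodic workflows from firing after the transfer ends.
for rule_name in ["ExtendDownloadWindowRule", "DashboardUpdateRule", "CompletionCheckerRule"]:
    events.disable_rule(Name=rule_name)
```
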
The solution manifest included alongside the guide:

```yaml
---
id: SO0293 # Solution Id
name: data-transfer-from-amazon-s3-glacier-vaults-to-amazon-s3 # trademarked name
version: v1.1.1 # current version of the solution. Used to verify template headers
cloudformation_templates: # This list should match the AWS CloudFormation templates section of the IG
  - template: data-transfer-from-amazon-s3-glacier-vaults-to-amazon-s3.template
    main_template: true
build_environment:
  build_image: 'aws/codebuild/standard:7.0' # Options include: 'aws/codebuild/standard:5.0','aws/codebuild/standard:6.0','aws/codebuild/standard:7.0','aws/codebuild/amazonlinux2-x86_64-standard:4.0','aws/codebuild/amazonlinux2-x86_64-standard:5.0'
```