AWS Comprehend DLP support #2

Merged
merged 4 commits into from
Aug 4, 2024
26 changes: 25 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "redacter"
version = "0.1.2"
version = "0.2.0"
edition = "2021"
authors = ["Abdulla Abdurakhmanov <[email protected]>"]
license = "Apache-2.0"
@@ -47,6 +47,7 @@ tempfile = "3"
csv-async = { version = "1", default-features = false, features = ["tokio", "tokio-stream"] }
aws-config = { version = "1", features = ["behavior-version-latest"] }
aws-sdk-s3 = { version = "1" }
aws-sdk-comprehend = { version = "1" }


[dev-dependencies]
27 changes: 22 additions & 5 deletions README.md
@@ -24,6 +24,8 @@ Google Cloud Platform's DLP API.
* text, html, json files
* structured data table files (csv)
* images (jpeg, png, bmp, gif)
* AWS Comprehend PII redaction for text files.
* ... more DLP providers can be added in the future.
* **CLI:** Easy-to-use command-line interface for streamlined workflows.
* Built with Rust to ensure speed, safety, and reliability.

@@ -56,7 +58,7 @@ Options:
-f, --filename-filter <FILENAME_FILTER>
Filter by name using glob patterns such as *.txt
-d, --redact <REDACT>
Redacter type [possible values: gcp-dlp]
Redacter type [possible values: gcp-dlp, aws-comprehend-dlp]
--gcp-project-id <GCP_PROJECT_ID>
GCP project id that will be used to redact and bill API calls
--allow-unsupported-copies
@@ -65,6 +67,8 @@ Options:
Disable CSV headers (if they are not present)
--csv-delimiter <CSV_DELIMITER>
CSV delimiter (default is ',')
--aws-region <AWS_REGION>
AWS region for AWS Comprehend DLP redacter
-h, --help
Print help
```
@@ -73,17 +77,27 @@ DLP is optional and should be enabled with `--redact` (`-d`) option.
Without DLP enabled, the tool will copy all files without redaction.
With DLP enabled, the tool will redact files based on the DLP model and skip unsupported files.

To be able to use GCP DLP you need to authenticate using `gcloud auth application-default login` or provide a service
account key using `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

Source/destination can be a local file or directory, or a file in GCS, S3, or a zip archive:

- Local file: `/tmp/file.txt` or `/tmp` for whole directory recursive copy
- GCS: `gs://bucket/file.txt` or `gs://bucket/test-dir/` for whole directory recursive copy
- S3: `s3://bucket/file.txt` or `s3://bucket/test-dir/` for whole directory recursive copy
- Zip archive: `zip://tmp/archive.zip`
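
The four location forms above can be told apart by their prefix alone. The following is a minimal sketch of that idea, not the tool's actual detection code: the `Location` enum and the `classify` and `split_bucket` functions are hypothetical names introduced only for illustration.

```rust
/// Hypothetical classification of a source/destination argument.
/// These names are illustrative and do not come from the redacter code base.
#[derive(Debug, PartialEq)]
enum Location<'a> {
    Gcs { bucket: &'a str, object: &'a str },
    S3 { bucket: &'a str, object: &'a str },
    Zip { path: &'a str },
    Local { path: &'a str },
}

/// Splits "bucket/some/object" into ("bucket", "some/object").
/// A bare bucket name yields an empty object part.
fn split_bucket(rest: &str) -> (&str, &str) {
    match rest.split_once('/') {
        Some((bucket, object)) => (bucket, object),
        None => (rest, ""),
    }
}

/// Picks a location kind from the URL-like prefix; anything without a
/// recognised prefix is treated as a local path.
fn classify<'a>(arg: &'a str) -> Location<'a> {
    if let Some(rest) = arg.strip_prefix("gs://") {
        let (bucket, object) = split_bucket(rest);
        Location::Gcs { bucket, object }
    } else if let Some(rest) = arg.strip_prefix("s3://") {
        let (bucket, object) = split_bucket(rest);
        Location::S3 { bucket, object }
    } else if let Some(path) = arg.strip_prefix("zip://") {
        Location::Zip { path }
    } else {
        Location::Local { path: arg }
    }
}
```

In this sketch an object name ending in `/` (as in `s3://bucket/test-dir/`) is what would mark a whole-directory copy.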

### Examples:
## DLP redacters

### Google Cloud Platform DLP

To be able to use GCP DLP you need to authenticate using `gcloud auth application-default login` or provide a service
account key using `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

### AWS Comprehend DLP

To be able to use AWS Comprehend DLP you need to authenticate using `aws configure` or provide a service account.
To provide an AWS region use `--aws-region` option since AWS Comprehend may not be available in all regions.
AWS Comprehend DLP is only available for unstructured text files.

## Examples:

```sh
# Copy and redact a file from local filesystem to GCS
@@ -120,6 +134,9 @@ and/or by size:
- The accuracy of redaction depends on the DLP model, so don't rely on it as the only security measure.
- The tool was mostly designed to redact files internally. Using it in public environments without proper
security measures and manual review is not recommended.
- Integrity of the files is not guaranteed due to DLP implementation specifics. Some formats, such as
HTML/XML/JSON, may be corrupted after redaction since they are treated as text.
- Use it at your own risk. The author is not responsible for any data loss or security breaches.

## Licence
11 changes: 11 additions & 0 deletions src/args.rs
@@ -40,6 +40,7 @@ pub enum CliCommand {
#[derive(ValueEnum, Debug, Clone)]
pub enum RedacterType {
GcpDlp,
AwsComprehendDlp,
}

impl std::str::FromStr for RedacterType {
@@ -48,6 +49,7 @@ impl std::str::FromStr for RedacterType {
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"gcp-dlp" => Ok(RedacterType::GcpDlp),
"aws-comprehend-dlp" => Ok(RedacterType::AwsComprehendDlp),
_ => Err(format!("Unknown redacter type: {}", s)),
}
}
@@ -57,6 +59,7 @@ impl Display for RedacterType {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
RedacterType::GcpDlp => write!(f, "gcp-dlp"),
RedacterType::AwsComprehendDlp => write!(f, "aws-comprehend-dlp"),
}
}
}
@@ -89,6 +92,9 @@ pub struct RedacterArgs {

#[arg(long, help = "CSV delimiter (default is ',')")]
pub csv_delimiter: Option<char>,

#[arg(long, help = "AWS region for AWS Comprehend DLP redacter")]
pub aws_region: Option<String>,
}

impl TryInto<RedacterOptions> for RedacterArgs {
@@ -104,6 +110,11 @@ impl TryInto<RedacterOptions> for RedacterArgs {
message: "GCP project id is required for GCP DLP redacter".to_string(),
}),
},
Some(RedacterType::AwsComprehendDlp) => Ok(RedacterProviderOptions::AwsComprehendDlp(
crate::redacters::AwsComprehendDlpRedacterOptions {
region: self.aws_region.map(aws_config::Region::new),
},
)),
None => Err(AppError::RedacterConfigError {
message: "Redacter type is required".to_string(),
}),
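
The `FromStr` and `Display` implementations in `src/args.rs` each need a matching arm for every new variant, so the two spellings must stay in sync. The sketch below is a condensed, std-only copy of that pattern (the real enum also derives clap's `ValueEnum`, omitted here); the `PartialEq` derive is an addition made only so the round trip can be asserted.

```rust
use std::fmt::{self, Display};
use std::str::FromStr;

// Condensed copy of the enum from the diff; PartialEq is added for the checks.
#[derive(Debug, Clone, PartialEq)]
enum RedacterType {
    GcpDlp,
    AwsComprehendDlp,
}

impl FromStr for RedacterType {
    type Err = String;

    // Maps the CLI spelling back to a variant.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "gcp-dlp" => Ok(RedacterType::GcpDlp),
            "aws-comprehend-dlp" => Ok(RedacterType::AwsComprehendDlp),
            _ => Err(format!("Unknown redacter type: {}", s)),
        }
    }
}

impl Display for RedacterType {
    // Produces the same spelling that from_str accepts.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RedacterType::GcpDlp => write!(f, "gcp-dlp"),
            RedacterType::AwsComprehendDlp => write!(f, "aws-comprehend-dlp"),
        }
    }
}

/// Returns true when printing a variant and parsing it back yields the same variant.
fn round_trips(t: &RedacterType) -> bool {
    t.to_string().parse::<RedacterType>().as_ref() == Ok(t)
}
```

A check like `round_trips` over every variant is one cheap way to catch a variant that was added to one `match` but not the other.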
6 changes: 4 additions & 2 deletions src/commands/copy_command.rs
@@ -3,7 +3,7 @@ use crate::filesystems::{
AbsoluteFilePath, DetectFileSystem, FileMatcher, FileMatcherResult, FileSystemConnection,
FileSystemRef,
};
use crate::redacters::{Redacter, RedacterOptions, Redacters};
use crate::redacters::{RedactSupportedOptions, Redacter, RedacterOptions, Redacters};
use crate::reporter::AppReporter;
use crate::AppResult;
use console::{Style, Term};
@@ -228,7 +228,9 @@ async fn redact_upload_file<'a, SFS: FileSystemConnection<'a>, DFS: FileSystemCo
dest_file_ref: &FileSystemRef,
redacter: &impl Redacter,
) -> AppResult<TransferFileResult> {
if redacter.is_redact_supported(dest_file_ref).await? {
if redacter.redact_supported_options(dest_file_ref).await?
!= RedactSupportedOptions::Unsupported
{
match redacter.redact_stream(source_reader, dest_file_ref).await {
Ok(redacted_reader) => {
destination_fs
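
The change in `src/commands/copy_command.rs` swaps a boolean `is_redact_supported` check for a three-state `RedactSupportedOptions` value compared against `Unsupported`. A minimal, std-only sketch of that decision follows. The variant names come from the diff; `supported_options` is a hypothetical stand-in that matches on plain MIME strings, whereas the real code calls `Redacters::is_mime_text` and `Redacters::is_mime_table`, whose definitions are not shown in this PR. `should_redact` is also an illustrative name.

```rust
// Variant names as they appear in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RedactSupportedOptions {
    Supported,
    SupportedAsText,
    Unsupported,
}

/// Hypothetical stand-in for the media-type checks of a text-only redacter:
/// CSV tables can only be handled as plain text, other text types are
/// supported directly, everything else is unsupported.
fn supported_options(media_type: Option<&str>) -> RedactSupportedOptions {
    match media_type {
        Some("text/csv") => RedactSupportedOptions::SupportedAsText,
        Some(mt) if mt.starts_with("text/") => RedactSupportedOptions::Supported,
        _ => RedactSupportedOptions::Unsupported,
    }
}

/// Mirrors the `!= Unsupported` comparison used before calling `redact_stream`.
fn should_redact(options: RedactSupportedOptions) -> bool {
    options != RedactSupportedOptions::Unsupported
}
```

Compared with a boolean, the extra `SupportedAsText` state lets a caller distinguish native support from the fallback where structured formats are redacted as raw text.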
1 change: 0 additions & 1 deletion src/filesystems/aws_s3.rs
@@ -23,7 +23,6 @@ impl<'a> AwsS3FileSystem<'a> {
let shared_config = aws_config::load_from_env().await;
let (bucket_name, object_name) = Self::parse_s3_path(path)?;
let is_dir = object_name.ends_with('/');
println!("Bucket: {}, Object: {}", bucket_name, object_name);
let client = aws_sdk_s3::Client::new(&shared_config);

Ok(AwsS3FileSystem {
175 changes: 175 additions & 0 deletions src/redacters/aws_comprehend.rs
@@ -0,0 +1,175 @@
use crate::errors::AppError;
use crate::filesystems::FileSystemRef;
use crate::redacters::{
RedactSupportedOptions, Redacter, RedacterDataItem, RedacterDataItemContent, RedacterOptions,
Redacters,
};
use crate::reporter::AppReporter;
use crate::AppResult;
use aws_config::Region;
use rvstruct::ValueStruct;

#[derive(Debug, Clone)]
pub struct AwsComprehendDlpRedacterOptions {
pub region: Option<Region>,
}

#[derive(Clone)]
pub struct AwsComprehendDlpRedacter<'a> {
client: aws_sdk_comprehend::Client,
redacter_options: RedacterOptions,
reporter: &'a AppReporter<'a>,
}

impl<'a> AwsComprehendDlpRedacter<'a> {
pub async fn new(
redacter_options: RedacterOptions,
aws_dlp_options: AwsComprehendDlpRedacterOptions,
reporter: &'a AppReporter<'a>,
) -> AppResult<Self> {
let region_provider = aws_config::meta::region::RegionProviderChain::first_try(
aws_dlp_options.region.clone(),
)
.or_default_provider();
let shared_config = aws_config::from_env().region(region_provider).load().await;
let client = aws_sdk_comprehend::Client::new(&shared_config);
Ok(Self {
client,
redacter_options,
reporter,
})
}

pub async fn redact_text_file(
&self,
input: RedacterDataItem,
) -> AppResult<RedacterDataItemContent> {
self.reporter.report(format!(
"Redacting a text file: {} ({:?})",
input.file_ref.relative_path.value(),
input.file_ref.media_type
))?;
let text_content = match input.content {
RedacterDataItemContent::Value(content) => Ok(content),
_ => Err(AppError::SystemError {
message: "Unsupported item for text redacting".to_string(),
}),
}?;

let aws_request = self
.client
.detect_pii_entities()
.language_code(aws_sdk_comprehend::types::LanguageCode::En)
.text(text_content.clone());

let result = aws_request.send().await?;
let redacted_content = result.entities.iter().fold(text_content, |acc, entity| {
entity.iter().fold(acc, |acc, entity| {
match (entity.begin_offset, entity.end_offset) {
(Some(start), Some(end)) => [
acc[..start as usize].to_string(),
"X".repeat((end - start) as usize),
acc[end as usize..].to_string(),
]
.concat(),
(Some(start), None) => {
acc[..start as usize].to_string()
+ "X".repeat(acc.len() - start as usize).as_str()
}
(None, Some(end)) => {
["X".repeat(end as usize), acc[end as usize..].to_string()].concat()
}
_ => acc,
}
})
});
Ok(RedacterDataItemContent::Value(redacted_content))
}
}

impl<'a> Redacter for AwsComprehendDlpRedacter<'a> {
async fn redact(&self, input: RedacterDataItem) -> AppResult<RedacterDataItemContent> {
match &input.content {
RedacterDataItemContent::Value(_) => self.redact_text_file(input).await,
RedacterDataItemContent::Table { .. } | RedacterDataItemContent::Image { .. } => {
Err(AppError::SystemError {
message: "Attempt to redact an unsupported table or image type".to_string(),
})
}
}
}

async fn redact_supported_options(
&self,
file_ref: &FileSystemRef,
) -> AppResult<RedactSupportedOptions> {
Ok(match file_ref.media_type.as_ref() {
Some(media_type) if Redacters::is_mime_text(media_type) => {
RedactSupportedOptions::Supported
}
Some(media_type) if Redacters::is_mime_table(media_type) => {
RedactSupportedOptions::SupportedAsText
}
_ => RedactSupportedOptions::Unsupported,
})
}

fn options(&self) -> &RedacterOptions {
&self.redacter_options
}
}

#[allow(unused_imports)]
mod tests {
use super::*;
use crate::redacters::RedacterProviderOptions;
use console::Term;

#[tokio::test]
#[cfg_attr(not(feature = "ci-aws"), ignore)]
async fn redact_text_file_test() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
let term = Term::stdout();
let reporter: AppReporter = AppReporter::from(&term);
let test_aws_region = std::env::var("TEST_AWS_REGION").expect("TEST_AWS_REGION required");
let test_content = "Hello, John";

let file_ref = FileSystemRef {
relative_path: "temp_file.txt".into(),
media_type: Some(mime::TEXT_PLAIN),
file_size: Some(test_content.len() as u64),
};

let content = RedacterDataItemContent::Value(test_content.to_string());
let input = RedacterDataItem { file_ref, content };

let redacter_options = RedacterOptions {
provider_options: RedacterProviderOptions::AwsComprehendDlp(
AwsComprehendDlpRedacterOptions {
region: Some(Region::new(test_aws_region.clone())),
},
),
allow_unsupported_copies: false,
csv_headers_disable: false,
csv_delimiter: None,
};

let redacter = AwsComprehendDlpRedacter::new(
redacter_options,
AwsComprehendDlpRedacterOptions {
region: Some(Region::new(test_aws_region)),
},
&reporter,
)
.await?;

let redacted_content = redacter.redact(input).await?;
match redacted_content {
RedacterDataItemContent::Value(value) => {
assert_eq!(value, "Hello, XXXX");
}
_ => panic!("Unexpected redacted content type"),
}

Ok(())
}
}
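
The fold in `redact_text_file` slices the `String` with the entity offsets, and Rust string slicing is byte-indexed and panics if an index falls inside a multi-byte character. If the offsets returned by the service count characters rather than bytes (an assumption here, not something this PR confirms), non-ASCII input could be masked at the wrong place or cause a panic. The sketch below shows one character-based alternative; `mask_span` and `mask_all` are hypothetical helper names, and offsets are taken as `usize` for simplicity rather than the SDK's integer type.

```rust
/// Replaces the characters in [begin, end) with 'X', counting in Unicode
/// scalar values rather than bytes. A missing bound extends the span to the
/// start or end of the text; if both are missing the text is left unchanged.
/// Out-of-range bounds are clamped instead of panicking.
fn mask_span(text: &str, begin: Option<usize>, end: Option<usize>) -> String {
    if begin.is_none() && end.is_none() {
        return text.to_string();
    }
    let total = text.chars().count();
    let stop = end.unwrap_or(total).min(total);
    let start = begin.unwrap_or(0).min(stop);
    text.chars()
        .enumerate()
        .map(|(i, c)| if i >= start && i < stop { 'X' } else { c })
        .collect()
}

/// Applies every span in turn. Each masked character is replaced by exactly
/// one 'X', so the character count never changes and later offsets stay valid.
fn mask_all(text: &str, spans: &[(Option<usize>, Option<usize>)]) -> String {
    spans
        .iter()
        .fold(text.to_string(), |acc, &(begin, end)| mask_span(&acc, begin, end))
}
```

For the test input used above, `mask_all("Hello, John", &[(Some(7), Some(11))])` gives `"Hello, XXXX"`, matching the assertion in `redact_text_file_test`.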