Amazon S3

This page contains the setup guide and reference information for the Amazon S3 source connector.

Prerequisites

Define a file pattern for the files you want to sync; see the Path Patterns section.

Setup guide

Step 1: Set up Amazon S3

  • If syncing from a private bucket, the credentials you use for the connection must have both read and list access on the S3 bucket. list is required to discover files based on the provided pattern(s).
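
As a rough illustration of the read and list requirement, the sketch below builds an IAM policy document for a single bucket. Mapping "read" to s3:GetObject and "list" to s3:ListBucket is our assumption about the minimal actions involved, and build_read_list_policy and the bucket name are hypothetical names used only for this example.

```python
import json


def build_read_list_policy(bucket: str) -> dict:
    """Return an IAM policy document granting read and list access to one bucket.

    Assumption: s3:ListBucket covers "list" and s3:GetObject covers "read".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # "list" access applies to the bucket itself, so that
                # object keys can be discovered.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # "read" access applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-example-bucket" is a placeholder bucket name.
    print(json.dumps(build_read_list_policy("my-example-bucket"), indent=2))
```

Note that the list statement targets the bucket ARN while the read statement targets the objects under it; mixing the two up is a common cause of access errors.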

Step 2: Set up the Amazon S3 connector in Airbyte

For Airbyte Cloud:

  1. Log into your Airbyte Cloud account.
  2. In the left navigation bar, click Sources. In the top-right corner, click + New source.
  3. On the Set up the source page, enter a name for the Amazon S3 connector and select Amazon S3 from the Source type dropdown.
  4. Set dataset appropriately. This will be the name of the table in the destination.
  5. If your bucket contains only files containing data for this table, use ** as path_pattern. See the Path Patterns section for more specific pattern matching.
  6. Leave schema as {} to automatically infer it from the file(s). For details on providing a schema, see the User Schema section.
  7. Fill in the fields within the provider box appropriately. If your bucket is not public, add credentials with sufficient permissions under aws_access_key_id and aws_secret_access_key.
  8. Choose the format corresponding to the format of your files and fill in fields as required. If unsure about values, try out the defaults and come back if needed. Find details on these settings in the File Format Settings section.

For Airbyte Open Source:

  1. Create a new S3 source with a suitable name. Since each S3 source maps to just a single table, it may be worth including that in the name.
  2. Set dataset appropriately. This will be the name of the table in the destination.
  3. If your bucket contains only files containing data for this table, use ** as path_pattern. See the Path Patterns section for more specific pattern matching.
  4. Leave schema as {} to automatically infer it from the file(s). For details on providing a schema, see the User Schema section.
  5. Fill in the fields within the provider box appropriately. If your bucket is not public, add credentials with sufficient permissions under aws_access_key_id and aws_secret_access_key.
  6. Choose the format corresponding to the format of your files and fill in fields as required. If unsure about values, try out the defaults and come back if needed. Find details on these settings in the File Format Settings section.

Supported sync modes

The Amazon S3 source connector supports the following sync modes:

| Feature | Supported? |
| --- | --- |
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |
| Replicate Incremental Deletes | No |
| Replicate Multiple Files (pattern matching) | Yes |
| Replicate Multiple Streams (distinct tables) | No |
| Namespaces | No |

File Compressions

| Compression | Supported? |
| --- | --- |
| Gzip | Yes |
| Zip | No |
| Bzip2 | Yes |
| Lzma | No |
| Xz | No |
| Snappy | No |

Please let us know any specific compressions you'd like to see support for next!

Path Patterns

(tl;dr -> path pattern syntax using wcmatch.glob. GLOBSTAR and SPLIT flags are enabled.)

This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:

  • Referencing many files with just one pattern, e.g. ** would indicate every file in the bucket.
  • Referencing future files that don't exist yet (and therefore don't have a specific path).

You must provide a path pattern. You can also provide many patterns split with | for more complex directory layouts.

Each path pattern is a reference from the root of the bucket, so don't include the bucket name in the pattern(s).

Some example patterns:

  • ** : match everything.
  • **/*.csv : match all files with specific extension.
  • myFolder/**/*.csv : match all csv files anywhere under myFolder.
  • */** : match everything at least one folder deep.
  • */*/*/** : match everything at least three folders deep.
  • **/file.*|**/file : match every file called "file" with any extension (or no extension).
  • x/*/y/* : match all files that sit in folder x -> any folder -> folder y.
  • **/prefix*.csv : match all csv files with specific prefix.
  • **/prefix*.parquet : match all parquet files with specific prefix.

Let's look at a specific example, matching the following bucket layout:

myBucket
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv

We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:

  • We could pick up every csv file called "partX" with the single pattern **/part*.csv.
  • To be a bit more robust, we could use the dual pattern some_table_files/*.csv|more_table_files/*.csv to pick up relevant files only from those exact folders.
  • We could achieve the above in a single pattern by using the pattern *table_files/*.csv. This could however cause problems in the future if new unexpected folders started being created.
  • We can also recursively wildcard, so adding the pattern extras/**/*.csv would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
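
To experiment with the patterns above before configuring a source, the sketch below approximates the matching with Python's standard library only. It is not wcmatch.glob itself: it covers just `*`, `?`, `**` as a whole path segment, and `|` splitting, and the helper names (matches, select, KEYS) are made up for this example. The keys mirror the files in the example bucket layout.

```python
import re


def _segment_to_regex(segment: str) -> str:
    """Translate one path segment; '*' and '?' never cross a '/'."""
    out = []
    for ch in segment:
        if ch == "*":
            out.append("[^/]*")
        elif ch == "?":
            out.append("[^/]")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def _glob_to_regex(pattern: str) -> re.Pattern:
    """Build a regex for a single glob pattern (no '|' inside)."""
    parts = pattern.split("/")
    regex = ""
    for i, part in enumerate(parts):
        is_last = i == len(parts) - 1
        if part == "**":
            # Globstar: any number of folders, including none.
            regex += ".*" if is_last else "(?:[^/]+/)*"
        else:
            regex += _segment_to_regex(part) + ("" if is_last else "/")
    return re.compile(regex)


def matches(key: str, path_pattern: str) -> bool:
    """True if the key matches any of the '|'-separated patterns."""
    return any(
        _glob_to_regex(p).fullmatch(key) is not None
        for p in path_pattern.split("|")
    )


# Object keys from the example layout, relative to the bucket root.
KEYS = [
    "some_table_files/part1.csv",
    "some_table_files/part2.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
]


def select(path_pattern: str) -> list:
    """Return the example keys that a path pattern would pick up."""
    return [key for key in KEYS if matches(key, path_pattern)]


if __name__ == "__main__":
    for pattern in ("**/part*.csv", "*table_files/*.csv", "extras/**/*.csv"):
        print(pattern, "->", select(pattern))
```

Because this is only an approximation, treat it as a way to reason about a pattern, and confirm the final behaviour with an actual sync.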

User Schema

Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from each file and a superset schema created. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:

  • You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the _ab_additional_properties map.
  • Your initial dataset is quite small (in terms of number of records), and you think the automatic type inference from this sample might not be representative of the data in the future.
  • You want to purposely define types for every column.
  • You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the _ab_additional_properties map.

Or any other reason! The schema must be provided as valid JSON as a map of {"column": "datatype"} where each datatype is one of:

  • string
  • number
  • integer
  • object
  • array
  • boolean
  • null

For example:

  • {"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
  • {"username": "string", "friends": "array", "information": "object"}
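
The sketch below shows one way to sanity-check a schema string before pasting it into the connector, and illustrates how columns outside the schema end up in the _ab_additional_properties map. Both functions are hypothetical helpers written for this example, not the connector's code, and filling missing columns with None is an assumption.

```python
import json

# The datatypes listed above.
ALLOWED_TYPES = {"string", "number", "integer", "object", "array", "boolean", "null"}


def parse_user_schema(raw: str) -> dict:
    """Parse a schema string and check that every datatype is allowed."""
    schema = json.loads(raw)
    if not isinstance(schema, dict):
        raise ValueError("schema must be a JSON map of column -> datatype")
    bad = {
        column: datatype
        for column, datatype in schema.items()
        if not isinstance(datatype, str) or datatype not in ALLOWED_TYPES
    }
    if bad:
        raise ValueError(f"unsupported datatypes: {bad}")
    return schema


def apply_schema(record: dict, schema: dict) -> dict:
    """Keep schema columns at the top level and pack the rest into a map.

    Assumption: columns missing from the record are filled with None.
    """
    out = {column: record.get(column) for column in schema}
    out["_ab_additional_properties"] = {
        key: value for key, value in record.items() if key not in schema
    }
    return out


if __name__ == "__main__":
    schema = parse_user_schema('{"id": "integer", "location": "string"}')
    print(apply_schema({"id": 1, "location": "Paris", "altitude": 35}, schema))
```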

S3 Provider Settings

  • bucket : name of the bucket your files are in

  • aws_access_key_id : one half of the required credentials for accessing a private bucket.

  • aws_secret_access_key : other half of the required credentials for accessing a private bucket.

  • path_prefix : an optional string that limits the files returned by AWS when listing files to only those starting with this prefix. This differs from path_pattern in that it is pushed down to the API call made to S3 rather than filtered in Airbyte, and it does not accept pattern-style symbols (such as the wildcard *). We recommend using this if your bucket has many folders and files that are unrelated to this stream and all the relevant files will always sit under this chosen prefix.

    • Together with path_pattern, there are multiple ways to specify the files to sync. For example, all the following configs are equivalent:
      • path_prefix = <empty>, path_pattern = path1/path2/myFolder/**/*.csv.
      • path_prefix = path1/, path_pattern = path2/myFolder/**/*.csv.
      • path_prefix = path1/path2/ and path_pattern = myFolder/**/*.csv
      • path_prefix = path1/path2/myFolder/, path_pattern = **/*.csv. This is the most efficient one because the directories are filtered earlier in the S3 API call. However, the difference in efficiency is usually negligible.
    • The rationale for having both path_prefix and path_pattern is to accommodate as many use cases as possible. If you find them confusing, feel free to ignore path_prefix and just set path_pattern.
  • endpoint : optional parameter that allows using S3-compatible services other than Amazon S3. Leave it blank to use the default Amazon service.

  • use_ssl : Allows using custom servers that are configured to use plain HTTP. Ignored when using the Amazon service.

  • verify_ssl_cert : Allows skipping the SSL validity check when using custom servers with self-signed certificates. Ignored when using the Amazon service.

File Format Settings

The Reader in charge of loading the file format is currently based on PyArrow (Apache Arrow).

Note that all files within one stream must adhere to the same read options for every provided format.

CSV

Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported so please ensure that this process happens consistently over time.
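
To see why reader options matter, the sketch below parses the same text twice with Python's built-in csv module, used here purely as a stand-in for the connector's PyArrow-based reader. The stdlib arguments delimiter, quotechar and doublequote play roughly the same role as the delimiter, quote_char and double_quote settings described in this section, and the quoted line break corresponds to the newlines_in_values case. The sample data and the parse helper are made up for this example.

```python
import csv
import io

# Semicolon-delimited sample with a doubled quote and a newline inside a value.
RAW = 'id;comment\n1;"said ""hi""; then left"\n2;"line one\nline two"\n'


def parse(text: str, **options) -> list:
    """Parse CSV text into a list of rows using the given reader options."""
    return list(csv.reader(io.StringIO(text, newline=""), **options))


# Options that match how the file was written.
good = parse(RAW, delimiter=";", quotechar='"', doublequote=True)

# Default options (comma delimiter): the columns are no longer split apart.
naive = parse(RAW)

if __name__ == "__main__":
    print(good)
    print(naive)
```

With matching options the sample yields a header plus two rows of two columns each, with the embedded quote and line break preserved inside the values. With the defaults, the header collapses into the single column "id;comment".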

  • delimiter : Even though CSV is an acronym for Comma Separated Values, it is used more generally as a term for flat file data that may or may not be comma separated. The delimiter field lets you specify which character acts as the separator.

  • quote_char : In some cases, data values may contain instances of reserved characters (like a comma, if that's the delimiter). CSVs can allow this behaviour by wrapping a value in defined quote characters so that on read it can parse it correctly.

  • escape_char : An escape character can be used to prefix a reserved character and allow correct parsing.

  • encoding : Some data may use a different character set (typically when different alphabets are involved). See the list of standard encodings in the Python codecs documentation.

  • double_quote : Whether two quotes in a quoted CSV value denote a single quote in the data.

  • newlines_in_values : Sometimes referred to as multiline. In most cases, newline characters signal the end of a row in a CSV, however text data may contain newline characters within it. Setting this to True allows correct parsing in this case.

  • block_size : This is the number of bytes to process in memory at a time while reading files. The default value here is usually fine but if your table is particularly wide (lots of columns / data in fields is large) then raising this might solve failures on detecting schema. Since this defines how much data to read into memory, raising this too high could cause Out Of Memory issues so use with caution.

  • additional_reader_options : This allows for editing the less commonly required CSV ConvertOptions. The value must be a valid JSON string, e.g.:

    {"timestamp_parsers": ["%m/%d/%Y %H:%M", "%Y/%m/%d %H:%M"], "strings_can_be_null": true, "null_values": ["NA", "NULL"]}
    
  • advanced_options : This allows for editing the less commonly required CSV ReadOptions. The value must be a valid JSON string. One use case is a CSV that has no header, or one where you want custom column names: specify column_names with this option.

    {"column_names": ["column1", "column2", "column3"]}
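
Since both additional_reader_options and advanced_options must be valid JSON strings, one low-risk way to produce them is to build a Python dictionary and serialise it, rather than typing the JSON by hand. The sketch below does that and adds a small check; is_valid_json_object is a hypothetical helper for this example.

```python
import json

# Serialising a dict guarantees valid JSON (note: True becomes true).
additional_reader_options = json.dumps(
    {"strings_can_be_null": True, "null_values": ["NA", "NULL"]}
)
advanced_options = json.dumps({"column_names": ["column1", "column2", "column3"]})


def is_valid_json_object(value: str) -> bool:
    """True if the string parses as a JSON object (a map of option -> value)."""
    try:
        return isinstance(json.loads(value), dict)
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print(additional_reader_options)
    print(advanced_options)
    # Single quotes are a Python habit and are not valid JSON.
    print(is_valid_json_object("{'column_names': ['column1']}"))
```

A common mistake is pasting a Python-style literal (single quotes, True/False/None) into these fields; the check above rejects that form.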
    

Parquet

Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. For now, the connector iterates through individual files at the abstract level, so partitioned Parquet datasets are unsupported. The following settings are available:

  • buffer_size : If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
  • columns : If not None, only these columns will be read from the file.
  • batch_size : Maximum number of records per batch. Batches may be smaller if there aren’t enough rows in the file.

You can find details on these settings in the PyArrow Parquet documentation.

Avro

The avro parser uses fastavro. Currently, no additional options are supported.

Jsonl

The Jsonl parser uses pyarrow, hence only the line-delimited JSON format is supported. For more detailed info, please refer to the docs (https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html).
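
If you are unsure whether a file is line-delimited JSON, the sketch below shows what the format looks like: one complete JSON value per line. It uses Python's built-in json module rather than pyarrow, so it only illustrates the file layout and not the connector's parsing. read_jsonl is a hypothetical helper, and skipping blank lines is an assumption of this example.

```python
import json


def read_jsonl(text: str) -> list:
    """Parse line-delimited JSON: one JSON value per non-empty line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records


# Two records, each on its own line.
SAMPLE = '{"id": 1, "tags": ["a", "b"]}\n{"id": 2, "tags": []}\n'

if __name__ == "__main__":
    print(read_jsonl(SAMPLE))
```

A pretty-printed JSON document that spans several lines is not line-delimited and fails this check on its first line.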

Changelog

| Version | Date | Pull Request | Subject |
| --- | --- | --- | --- |
| 0.1.23 | 2022-10-10 | 17991 | Fix pyarrow to JSON schema type conversion for arrays |
| 0.1.23 | 2022-10-10 | 17800 | Deleted use_ssl and verify_ssl_cert flags and hardcoded to True |
| 0.1.22 | 2022-09-28 | 17304 | Migrate to per-stream state |
| 0.1.21 | 2022-09-20 | 16921 | Upgrade pyarrow |
| 0.1.20 | 2022-09-12 | 16607 | Fix for reading jsonl files containing nested structures |
| 0.1.19 | 2022-09-13 | 16631 | Adjust column type to a broadest one when merging two or more json schemas |
| 0.1.18 | 2022-08-01 | 14213 | Add support for jsonl format files. |
| 0.1.17 | 2022-07-21 | 14911 | "decimal" type added for parquet |
| 0.1.16 | 2022-07-13 | 14669 | Fixed bug when extra columns appeared to be non-present in master schema |
| 0.1.15 | 2022-05-31 | 12568 | Fixed possible case of files being missed during incremental syncs |
| 0.1.14 | 2022-05-23 | 11967 | Increase unit test coverage up to 90% |
| 0.1.13 | 2022-05-11 | 12730 | Fixed empty options issue |
| 0.1.12 | 2022-05-11 | 12602 | Added support for Avro file format |
| 0.1.11 | 2022-04-30 | 12500 | Improve input configuration copy |
| 0.1.10 | 2022-01-28 | 8252 | Refactoring of files' metadata |
| 0.1.9 | 2022-01-06 | 9163 | Work-around for web-UI, backslash - t converts to tab for format.delimiter field. |
| 0.1.7 | 2021-11-08 | 7499 | Remove base-python dependencies |
| 0.1.6 | 2021-10-15 | 6615 & 7058 | Memory and performance optimisation. Advanced options for CSV parsing. |
| 0.1.5 | 2021-09-24 | 6398 | Support custom non Amazon S3 services |
| 0.1.4 | 2021-08-13 | 5305 | Support of Parquet format |
| 0.1.3 | 2021-08-04 | 5197 | Fixed bug where sync could hang indefinitely on schema inference |
| 0.1.2 | 2021-08-02 | 5135 | Fixed bug in spec so it displays in UI correctly |
| 0.1.1 | 2021-07-30 | 4990 | Fixed documentation url in source definition |
| 0.1.0 | 2021-07-30 | 4990 | Created S3 source connector |