v.0.0.3 Sync error handling, activate version, documentation (#2)
* v.0.0.2 schema and sync changes

Change number json schema to anyOf with multipleOf; skip empty rows; move write_bookmark to end of sync.py

* v.0.0.3 Sync activate version and error handling

Update README.md documentation. Improved logging and handling of errors and warnings. Better null handling in Discovery and Sync. Fix issues with activate version messages.
jeffhuth-bytecode authored and KAllan357 committed Jan 9, 2020
1 parent 5890b89 commit 43a24cb
Showing 5 changed files with 144 additions and 45 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog

## 0.0.3
* Update README.md documentation. Improved logging and handling of errors and warnings. Better null handling in Discovery and Sync. Fix issues with activate version messages.

## 0.0.2
* Change number json schema to anyOf with multipleOf; skip empty rows; move write_bookmark to end of sync.py

37 changes: 23 additions & 14 deletions README.md
@@ -11,51 +11,60 @@ This tap:
- [File Metadata](https://developers.google.com/drive/api/v3/reference/files/get)
- [Spreadsheet Metadata](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
- [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
- Outputs the following metadata streams:
- File Metadata: Name, audit/change info from Google Drive
- Spreadsheet Metadata: Basic metadata about the Spreadsheet: Title, Locale, URL, etc.
- Sheet Metadata: Title, URL, Area (max column and row), and Column Metadata
- Column Metadata: Column Header Name, Data type, Format
- Sheets Loaded: Sheet title, load date, number of rows
- For each Sheet:
- Outputs the schema for each resource (based on the column header and datatypes of first row of data)
- Outputs a record for all columns with column headers, and for each row of data until it reaches an empty row
- Outputs the schema for each resource (based on the column header and datatypes of row 2, the first row of data)
- Outputs a record for all columns that have column headers, and for each row of data
- Emits a Singer ACTIVATE_VERSION message after each sheet is complete. This forces hard deletes on the data downstream if fewer records are sent.
- Primary Key for each row in a Sheet is the Row Number: `__sdc_row`
- Each Row in a Sheet also includes Foreign Keys to the Spreadsheet Metadata, `__sdc_spreadsheet_id`, and Sheet Metadata, `__sdc_sheet_id`.
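As an illustration of the row keys and the ACTIVATE_VERSION step described above, here is a minimal sketch using only the standard library. The function name `build_sheet_messages` and its parameters are hypothetical and are not taken from the tap; the message shapes follow the Singer convention of `type`, `stream`, and `version` fields, and the assumption is that row 1 holds headers, so the first data row is numbered 2.

```python
import time


def build_sheet_messages(stream, spreadsheet_id, sheet_id, headers, rows, version=None):
    """Return Singer-style messages for one sheet: RECORDs, then ACTIVATE_VERSION.

    Hypothetical helper for illustration; not the tap's actual implementation.
    """
    if version is None:
        version = int(time.time() * 1000)  # assumption: millisecond timestamp as version

    messages = []
    for offset, row in enumerate(rows):
        record = {
            '__sdc_spreadsheet_id': spreadsheet_id,  # foreign key to spreadsheet metadata
            '__sdc_sheet_id': sheet_id,              # foreign key to sheet metadata
            '__sdc_row': offset + 2,                 # primary key: row 1 is the header row
        }
        record.update(zip(headers, row))
        messages.append({'type': 'RECORD', 'stream': stream,
                         'version': version, 'record': record})

    # Sent once the sheet is complete so a target can drop rows from older versions.
    messages.append({'type': 'ACTIVATE_VERSION', 'stream': stream, 'version': version})
    return messages
```

A target that honors ACTIVATE_VERSION can treat records carrying an older version as deleted, which is how fewer rows in the sheet would lead to hard deletes downstream.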

## API Endpoints
[**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
- Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version
- Primary keys: id
- Replication strategy: Full (GET file audit data for spreadsheet_id in config)
- Replication strategy: Incremental (GET file audit data for spreadsheet_id in config)
- Process/Transformations: Replicate Data if Modified
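The "Replicate Data if Modified" check can be sketched roughly as follows. The helper names (`file_metadata_url`, `is_modified`) and the bookmark argument are hypothetical, and the timestamp format (`YYYY-MM-DDTHH:MM:SS.fffZ`) is an assumption about what the Drive API returns for `modifiedTime`.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

DRIVE_FILES_URL = 'https://www.googleapis.com/drive/v3/files'


def file_metadata_url(spreadsheet_id):
    """Build the file endpoint URL shown above for one spreadsheet."""
    query = urlencode({'fields': 'id,name,createdTime,modifiedTime,version'}, safe=',')
    return '{}/{}?{}'.format(DRIVE_FILES_URL, spreadsheet_id, query)


def parse_timestamp(value):
    # Assumed format, e.g. 2020-01-09T12:34:56.789Z
    return datetime.strptime(value, '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=timezone.utc)


def is_modified(file_metadata, bookmark):
    """Return True when the file changed after the bookmark (or no bookmark exists)."""
    if bookmark is None:
        return True
    return parse_timestamp(file_metadata['modifiedTime']) > parse_timestamp(bookmark)
```

With a check like this, the sheets only need to be re-read when `is_modified` returns True for the current run.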

[**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
- This endpoint returns spreadsheet metadata, sheet metadata, and value metadata (data type information)
- Primary keys: spreadsheetId, title, field_name
- Primary keys: Spreadsheet Id, Sheet Id, Column Index
- Foreign keys: None
- Replication strategy: Full (get and replace file metadata for spreadsheet_id in config)
- Process/Transformations:
- Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
- sheetId, title, index, gridProperties (rowCount, columnCount)
- Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column uniqueness
- Header's field_name, position: data.rowData[0].values[i].formattedValue
- Create/Verify Datatypes (2nd row):
- Row 2's datatype, format: data.rowData[1].values[i]
- Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column name uniqueness
- Create/Verify Datatypes based on 2nd row value and cell metadata
- First check:
- [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue)
- Valid types: numberValue, stringValue, boolValue
- Invalid types: formulaValue, errorValue
- Then check:
- [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType)
- Valid types: UNSPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
- If DATE or DATE_TIME, set JSON schema datatype = string and format = date-time
- [effectiveFormat.numberFormat.pattern](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormat)
- Determine JSON schema column data type based on the value and the above cell metadata settings.
- If DATE, DATE_TIME, or TIME, set JSON schema format accordingly
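The two-step datatype check above (effective value key first, then number format type) might look something like the sketch below. `infer_column_schema` is a hypothetical name, the input is assumed to be a single row-2 cell dict as returned with `includeGridData=true`, and the `date`/`date-time`/`time` format strings are an assumption about what "set JSON schema format accordingly" means. The tap's real schema code is more detailed.

```python
def infer_column_schema(cell):
    """Map one row-2 cell dict to a JSON schema fragment (illustrative only)."""
    effective_value = cell.get('effectiveValue', {})
    format_type = cell.get('effectiveFormat', {}).get('numberFormat', {}).get('type')

    # Blank cell: no type information, so fall back to a nullable string.
    if not effective_value:
        return {'type': ['null', 'string']}

    # First check: formula and error values are invalid for typing.
    for invalid_key in ('formulaValue', 'errorValue'):
        if invalid_key in effective_value:
            raise ValueError('Unsupported 2nd-row value type: {}'.format(invalid_key))

    if 'boolValue' in effective_value:
        return {'type': ['null', 'boolean']}
    if 'stringValue' in effective_value:
        return {'type': ['null', 'string']}

    # Then check: numbers may really be dates or times, depending on the cell format.
    if 'numberValue' in effective_value:
        date_formats = {'DATE': 'date', 'DATE_TIME': 'date-time', 'TIME': 'time'}
        if format_type in date_formats:
            return {'type': ['null', 'string'], 'format': date_formats[format_type]}
        return {'type': ['null', 'number']}

    # Anything else is treated as a string.
    return {'type': ['null', 'string']}
```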

[**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS
- This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption)
- Primary keys: row
- Primary keys: _sdc_row
- Replication strategy: Full (GET file audit data for spreadsheet_id in config)
- Process/Transformations:
- Loop through sheets (compared to catalog selection)
- Send metadata for sheet
- Loop through ranges of rows until reaching empty row or area max row (from sheet metadata)
- Transform values, if necessary (dates, date-times, boolean, integer, numbers)
- Process/send records
- Loop through ALL columns for columns having a column header
- Loop through ranges of rows for ALL rows in sheet available area max row (from sheet metadata)
- Transform values, if necessary (dates, date-times, times, boolean).
- Date/time serial numbers converted to date, date-time, and time strings. Google Sheets uses Lotus 1-2-3 [Serial Number](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) format for date/times. These are converted to normal UTC date-time strings.
- Process/send records to target

## Authentication
The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions that show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.
4 changes: 2 additions & 2 deletions setup.py
@@ -3,15 +3,15 @@
from setuptools import setup, find_packages

setup(name='tap-google-sheets',
version='0.0.2',
version='0.0.3',
description='Singer.io tap for extracting data from the Google Sheets v4 API',
author='jeff.huth@bytecode.io',
classifiers=['Programming Language :: Python :: 3 :: Only'],
py_modules=['tap_google_sheets'],
install_requires=[
'backoff==1.8.0',
'requests==2.22.0',
'singer-python==5.8.1'
'singer-python==5.9.0'
],
entry_points='''
[console_scripts]
51 changes: 41 additions & 10 deletions tap_google_sheets/schema.py
@@ -21,6 +21,7 @@ def colnum_string(num):

# Create sheet_metadata_json with columns from sheet
def get_sheet_schema_columns(sheet):
    sheet_title = sheet.get('properties', {}).get('title')
    sheet_json_schema = OrderedDict()
    data = next(iter(sheet.get('data', [])), {})
    row_data = data.get('rowData', [])
@@ -62,15 +63,34 @@ def get_sheet_schema_columns(sheet):
skipped = 0
column_name = '{}'.format(header_value)
if column_name in header_list:
raise Exception('DUPLICATE HEADER ERROR: {}'.format(column_name))
raise Exception('DUPLICATE HEADER ERROR: SHEET: {}, COL: {}, CELL: {}1'.format(
sheet_title, column_name, column_letter))
header_list.append(column_name)

first_value = first_values[i]

first_value = None
try:
first_value = first_values[i]
except IndexError as err:
raise Exception('NO VALUE IN 2ND ROW FOR HEADER ERROR. SHEET: {}, COL: {}, CELL: {}2. {}'.format(
sheet_title, column_name, column_letter, err))

column_effective_value = first_value.get('effectiveValue', {})
for key in column_effective_value.keys():
if key in ('numberValue', 'stringValue', 'boolValue', 'errorType', 'formulaType'):
column_effective_value_type = key

col_val = None
if column_effective_value == {}:
column_effective_value_type = 'stringValue'
LOGGER.info('WARNING: NO VALUE IN 2ND ROW FOR HEADER. SHEET: {}, COL: {}, CELL: {}2.'.format(
sheet_title, column_name, column_letter))
LOGGER.info(' Setting column datatype to STRING')
else:
for key, val in column_effective_value.items():
if key in ('numberValue', 'stringValue', 'boolValue'):
column_effective_value_type = key
col_val = str(val)
elif key in ('errorType', 'formulaType'):
col_val = str(val)
raise Exception('DATA TYPE ERROR 2ND ROW VALUE: SHEET: {}, COL: {}, CELL: {}2, TYPE: {}, VALUE: {}'.format(
sheet_title, column_name, column_letter, key, col_val))

column_number_format = first_values[i].get('effectiveFormat', {}).get(
'numberFormat', {})
@@ -87,7 +107,13 @@ def get_sheet_schema_columns(sheet):
# https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType
#
column_format = None # Default
if column_effective_value_type == 'stringValue':
if column_effective_value == {}:
col_properties = {'type': ['null', 'string']}
column_gs_type = 'stringValue'
LOGGER.info('WARNING: 2ND ROW VALUE IS BLANK: SHEET: {}, COL: {}, CELL: {}2'.format(
sheet_title, column_name, column_letter))
LOGGER.info(' Setting column datatype to STRING')
elif column_effective_value_type == 'stringValue':
col_properties = {'type': ['null', 'string']}
column_gs_type = 'stringValue'
elif column_effective_value_type == 'boolValue':
@@ -138,8 +164,8 @@ def get_sheet_schema_columns(sheet):
else:
col_properties = {'type': ['null', 'string']}
column_gs_type = 'unsupportedValue'
LOGGER.info('Unsupported data type: {}, value: {}'.format(column_name, \
column_effective_value_type))
LOGGER.info('WARNING: UNSUPPORTED 2ND ROW VALUE: SHEET: {}, COL: {}, CELL: {}2, TYPE: {}, VALUE: {}'.format(
sheet_title, column_name, column_letter, column_effective_value_type, col_val))
LOGGER.info('Converting to string.')
else: # skipped
column_is_skipped = True
@@ -148,11 +174,16 @@ def get_sheet_schema_columns(sheet):
column_name = '__sdc_skip_col_{}'.format(column_index_str)
col_properties = {'type': ['null', 'string']}
column_gs_type = 'stringValue'
LOGGER.info('WARNING: SKIPPED COLUMN; NO COLUMN HEADER. SHEET: {}, COL: {}, CELL: {}1'.format(
sheet_title, column_name, column_letter))
LOGGER.info(' This column will be skipped during data loading.')

if skipped >= 2:
# skipped = 2 consecutive skipped headers
# Remove prior_header column_name
sheet_json_schema['properties'].pop(prior_header, None)
LOGGER.info('TWO CONSECUTIVE SKIPPED COLUMNS. STOPPING SCAN AT: SHEET: {}, COL: {}, CELL {}1'.format(
sheet_title, column_name, column_letter))
break

else:
@@ -245,7 +276,7 @@ def get_schemas(client, spreadsheet_id):
for sheet in sheets:
# GET sheet_json_schema for each worksheet (from function above)
sheet_json_schema, columns = get_sheet_metadata(sheet, spreadsheet_id, client)
LOGGER.info('columns = {}'.format(columns))
# LOGGER.info('columns = {}'.format(columns))

sheet_title = sheet.get('properties', {}).get('title')
schemas[sheet_title] = sheet_json_schema