diff --git a/README.md b/README.md
index 1550f7b..636bd9b 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,6 @@ The CLI supports the following subcommands:
 - `create_flows`: create RapidPro flows (in JSON format) from spreadsheets using content index
 - `flows_to_sheets`: convert RapidPro flows (in JSON format) into spreadsheets
 - `convert`: save input spreadsheets as JSON
-- `save_data_sheets`: save input spreadsheets as nested JSON using content index - an experimental feature that is likely to change.
 
 Full details of the available options for each can be found via the help feature:
 
diff --git a/docs/notation.md b/docs/notation.md
new file mode 100644
index 0000000..5e2438b
--- /dev/null
+++ b/docs/notation.md
@@ -0,0 +1,205 @@
+# Spreadsheet notation
+
+Summary of the spreadsheet notation used to convert sheets into a nested data structure (JSON). A series of data tables is shown alongside the resulting JSON structure.
+
+# Books
+
+A container for multiple tables. Also known as a spreadsheet or workbook. A book is converted to an object containing a property for each table. The property key is the name of the sheet; the value is the converted contents of the sheet.
+
+For example, given an Excel workbook with two sheets ("table1" and "table2"), the resulting JSON will be:
+
+```json
+{
+    "table1": [],
+    "table2": []
+}
+```
+
+# Tables
+
+Also known as a sheet in a spreadsheet (or workbook).
+
+The contents of a table are converted to a sequence of objects - one per row in the sheet. Each object has keys corresponding to the column headers of the sheet, and values taken from the cells of a particular row.
+
+| a  | b  |
+|----|----|
+| v1 | v2 |
+
+`data`
+
+```json
+{
+    "data": [
+        {"a": "v1", "b": "v2"}
+    ]
+}
+```
+
+This means that the first row of every table should be a header row that specifies the name of each column.
+
+# Basic types
+
+Refers to the following value types in JSON: `string`, `number`, `true` and `false`.
+
+| string | number | true | false |
+|--------|--------|------|-------|
+| hello  | 123    | true | false |
+
+`basic_types`
+
+```json
+{
+    "basic_types": [
+        {
+            "string": "hello",
+            "number": 123,
+            "true": true,
+            "false": false
+        }
+    ]
+}
+```
+
+The JSON type `null` is not represented because an empty cell is assumed to be equivalent to the empty string ("").
+
+# Sequences
+
+An ordered sequence of items. Also known as lists or arrays.
+
+| seq1 | seq1 | seq2.1 | seq2.2 | seq3     | seq4               |
+|------|------|--------|--------|----------|--------------------|
+| v1   | v2   | v1     | v2     | v1 \| v2 | v1 ; v2 \| v3 ; v4 |
+
+`sequences`
+
+```json
+{
+    "sequences": [
+        {
+            "seq1": ["v1", "v2"],
+            "seq2": ["v1", "v2"],
+            "seq3": ["v1", "v2"],
+            "seq4": [["v1", "v2"], ["v3", "v4"]]
+        }
+    ]
+}
+```
+
+`seq1`, `seq2` and `seq3` are equivalent. In all cases, the order of items is specified from left to right.
+
+`seq1` uses a 'wide' layout, where the column header is repeated and each column holds one item of the sequence. Values from columns with the same name are collected into a sequence in the resulting JSON object.
+
+`seq2` is similar to `seq1`, but the index of each item is specified explicitly.
+
+`seq3` uses an 'inline' layout, where the sequence is defined as a delimited string within a single cell of the table. The default delimiter is a vertical bar or pipe character ('|').
+
+Two levels of nesting are possible within a cell, as shown by `seq4` - a list of lists. This could be used to model a list of key-value pairs, which could easily be converted to an object (map / dictionary). The default delimiter for second-level sequences is a semi-colon (';').
+
+Delimiter characters can be prevented from being interpreted as delimiters by escaping them. An escape sequence begins with a backslash ('\\') and ends with the character to be escaped. For example, to escape a vertical bar, use: '\\|'.
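+
+For example, given a sheet named `escaping` (the sheet name and values here are purely illustrative) with a single cell containing `v1 \| v2`, the escaped delimiter is not treated as a separator, and the escape character is removed during conversion:
+
+| text       |
+|------------|
+| v1 \\\| v2 |
+
+`escaping`
+
+```json
+{
+    "escaping": [
+        {"text": "v1 | v2"}
+    ]
+}
+```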
+
+# Objects
+
+An unordered collection of key-value pairs (properties). Also known as maps, dictionaries or associative arrays.
+
+| obj1.key1 | obj1.key2 | obj2                   |
+|-----------|-----------|------------------------|
+| v1        | v2        | key1 ; v1 \| key2 ; v2 |
+
+`objects`
+
+```json
+{
+    "objects": [
+        {
+            "obj1": {
+                "key1": "v1",
+                "key2": "v2"
+            },
+            "obj2": [
+                ["key1", "v1"],
+                ["key2", "v2"]
+            ]
+        }
+    ]
+}
+```
+
+`obj1` and `obj2` are slightly different, but can be interpreted in the same way - as a list of key-value pairs.
+
+A wide layout is used for `obj1`, where one or more column headers use a dotted 'keypath' notation to identify a particular property key belonging to a particular object, and the corresponding cells in subsequent rows contain the values for that property. The dotted keypath notation can be used to access properties at deeper levels of nesting, e.g. `obj.key.subkey.etc`, as shown in the example below.
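+
+For example, given a sheet named `deep` (again, the sheet name and values are illustrative), each additional dotted segment in a header creates a further level of nesting:
+
+| user.name.first | user.name.last |
+|-----------------|----------------|
+| Ada             | Lovelace       |
+
+`deep`
+
+```json
+{
+    "deep": [
+        {
+            "user": {
+                "name": {
+                    "first": "Ada",
+                    "last": "Lovelace"
+                }
+            }
+        }
+    ]
+}
+```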
+
+An inline layout is used for `obj2`, where properties are defined as a sequence of key-value pairs. The delimiter between properties is a vertical bar or pipe character - the same as for top-level sequences. The delimiter between keys and values is a semi-colon character - the same as for second-level sequences.
+
+All the previous notation can be combined to create fairly complicated structures.
+
+| obj1.key1              | obj1.key1                      |
+|------------------------|--------------------------------|
+| 1 ; 2 ; 3 \| one ; two | active ; true \| debug ; false |
+
+`nesting`
+
+```json
+{
+    "nesting": [
+        {
+            "obj1": {
+                "key1": [
+                    [
+                        [1, 2, 3],
+                        ["one", "two"]
+                    ],
+                    [
+                        ["active", true],
+                        ["debug", false]
+                    ]
+                ]
+            }
+        }
+    ]
+}
+```
+
+# Templates
+
+Table cells may contain Jinja templates and RapidPro expressions. A cell is considered a template if it contains template placeholders anywhere within it. There are four types of template placeholder:
+
+- `{{ ... }}`
+- `{% ... %}`
+- `{@ ... @}`
+- `@( ... )`
+
+When converting between spreadsheets and JSON, templates are not interpreted in any way, just copied verbatim. This means that sequence delimiters do not need to be escaped if they occur within a template. Templates are intended to be interpreted at a later stage, during further processing.
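+
+For example, given a sheet named `templates` (an illustrative name), a cell containing a template placeholder is copied verbatim, even though it contains a sequence delimiter:
+
+| ids                                 |
+|-------------------------------------|
+| {@ values \| map(attribute="ID") @} |
+
+`templates`
+
+```json
+{
+    "templates": [
+        {"ids": "{@ values | map(attribute=\"ID\") @}"}
+    ]
+}
+```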
+
+# Metadata
+
+Information that would otherwise be lost during the conversion from spreadsheets to JSON is stored as metadata - in a top-level property with key `_idems`. The metadata property is intended to be 'hidden' and unlikely to clash with any sheet name.
+
+The original header names for each sheet are held as metadata to direct the conversion process from JSON back to spreadsheet. The original headers preserve the order of columns and whether a wide or inline layout was used.
+
+| seq1 | seq1 | seq2     |
+|------|------|----------|
+| v1   | v2   | v1 \| v2 |
+
+`sequences`
+
+```json
+{
+    "_idems": {
+        "tabulate": {
+            "sequences": {
+                "headers": [
+                    "seq1",
+                    "seq1",
+                    "seq2"
+                ]
+            }
+        }
+    },
+    "sequences": [
+        {
+            "seq1": ["v1", "v2"],
+            "seq2": ["v1", "v2"]
+        }
+    ]
+}
+```
diff --git a/pyproject.toml b/pyproject.toml
index df4f6f9..bcc0d65 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -34,9 +34,11 @@ dependencies = [
     "google-api-python-client~=2.6.0",
     "google-auth-oauthlib~=0.4.4",
     "networkx~=2.5.1",
+    "odfpy",
     "openpyxl",
     "pydantic >= 2",
-    "tablib[ods]>=3.1.0",
+    "python-benedict",
+    "tablib @ git+https://github.com/istride/tablib@v3.8.0-0",
 ]
 
 [project.urls]
diff --git a/src/rpft/cli.py b/src/rpft/cli.py
index 1c78946..d61f9f0 100644
--- a/src/rpft/cli.py
+++ b/src/rpft/cli.py
@@ -39,16 +39,16 @@ def flows_to_sheets(args):
     )
 
 
-def save_data_sheets(args):
-    output = converters.save_data_sheets(
-        args.input,
-        None,
-        args.format,
-        data_models=args.datamodels,
-        tags=args.tags,
-    )
-    with open(args.output, "w", encoding="utf-8") as export:
-        json.dump(output, export, indent=4)
+def uni_to_sheets(args):
+    with open(args.output, "wb") as handle:
+        handle.write(converters.uni_to_sheets(args.input))
+
+
+def sheets_to_uni(args):
+    data = converters.sheets_to_uni(args.input)
+
+    with open(args.output, "w", encoding="utf-8") as f:
+        json.dump(data, f, indent=2)
 
 
 def create_parser():
@@ -64,7 +64,8 @@ def create_parser():
     _add_create_command(sub)
     _add_convert_command(sub)
     _add_flows_to_sheets_command(sub)
-    _add_save_data_sheets_command(sub)
+    _add_uni_to_sheets_command(sub)
+    _add_sheets_to_uni_command(sub)
 
     return parser
 
@@ -77,25 +78,13 @@ def _add_create_command(sub):
     )
 
     parser.set_defaults(func=create_flows)
-    _add_content_index_arguments(parser)
-
-
-def _add_content_index_arguments(parser):
     parser.add_argument(
-        "--datamodels",
+        "input",
         help=(
-            "name of the module defining user data models underlying the data sheets,"
-            " e.g. if the model definitions reside in"
-            " ./myfolder/mysubfolder/mymodelsfile.py, then this argument should be"
-            " myfolder.mysubfolder.mymodelsfile"
+            "paths to XLSX or JSON files, or directories containing CSV files, or"
+            " Google Sheets IDs i.e. from the URL; inputs should be of the same format"
         ),
-    )
-    parser.add_argument(
-        "-f",
-        "--format",
-        choices=["csv", "google_sheets", "json", "xlsx"],
-        help="input sheet format",
-        required=True,
+        nargs="+",
     )
     parser.add_argument(
         "-o",
@@ -114,12 +103,20 @@ def _add_content_index_arguments(parser):
         nargs="*",
     )
     parser.add_argument(
-        "input",
+        "--datamodels",
         help=(
-            "paths to XLSX or JSON files, or directories containing CSV files, or"
-            " Google Sheets IDs i.e. from the URL; inputs should be of the same format"
+            "name of the module defining user data models underlying the data sheets,"
+            " e.g. if the model definitions reside in"
+            " ./myfolder/mysubfolder/mymodelsfile.py, then this argument should be"
+            " myfolder.mysubfolder.mymodelsfile"
         ),
-        nargs="+",
+    )
+    parser.add_argument(
+        "-f",
+        "--format",
+        choices=["csv", "google_sheets", "json", "xlsx"],
+        help="input sheet format",
+        required=True,
     )
 
 
@@ -180,14 +177,37 @@ def _add_flows_to_sheets_command(sub):
     )
 
 
-def _add_save_data_sheets_command(sub):
+def _add_uni_to_sheets_command(sub):
+    parser = sub.add_parser(
+        "uni-to-sheets",
+        help="convert JSON to sheets",
+    )
+    parser.set_defaults(func=uni_to_sheets)
+    parser.add_argument(
+        "input",
+        help=("location of input JSON file"),
+    )
+    parser.add_argument(
+        "output",
+        help=("location where sheets will be saved"),
+    )
+
+
+def _add_sheets_to_uni_command(sub):
     parser = sub.add_parser(
-        "save_data_sheets",
-        help="save data sheets referenced in context index as nested json",
+        "sheets-to-uni",
+        help="convert sheets to nested JSON",
     )
-    parser.set_defaults(func=save_data_sheets)
-    _add_content_index_arguments(parser)
+    parser.set_defaults(func=sheets_to_uni)
+    parser.add_argument(
+        "input",
+        help=("location of workbook"),
+    )
+    parser.add_argument(
+        "output",
+        help=("location where JSON will be saved"),
+    )
 
 
 if __name__ == "__main__":
diff --git a/src/rpft/converters.py b/src/rpft/converters.py
index 86a1e40..e0e9754 100644
--- a/src/rpft/converters.py
+++ b/src/rpft/converters.py
@@ -1,10 +1,12 @@
 import json
 import logging
 import os
+import re
 import shutil
 import sys
 from pathlib import Path
 
+from rpft.parsers.universal import bookify, parse_tables
 from rpft.parsers.creation.contentindexparser import ContentIndexParser
 from rpft.parsers.creation.tagmatcher import TagMatcher
 from rpft.parsers.sheets import (
@@ -13,9 +15,11 @@
     CSVSheetReader,
     GoogleSheetReader,
     JSONSheetReader,
+    ODSSheetReader,
     XLSXSheetReader,
 )
 from rpft.rapidpro.models.containers import RapidProContainer
+from tablib import Databook, Dataset
 
 LOGGER = logging.getLogger(__name__)
@@ -50,30 +54,20 @@
     return flows
 
 
-def save_data_sheets(input_files, output_file, sheet_format, data_models=None, tags=[]):
-    """
-    Save data sheets as JSON.
+def uni_to_sheets(infile) -> bytes:
+    with open(infile, "r") as handle:
+        data = json.load(handle)
 
-    Collect the data sheets referenced in the source content index spreadsheet(s) and
-    save this collection in a single JSON file. Returns the output as a dict.
+    sheets = bookify(data)
+    book = Databook(
+        [Dataset(*table[1:], headers=table[0], title=name) for name, table in sheets]
+    )
 
-    :param sources: list of source spreadsheets
-    :param output_files: (deprecated) path of file to export output to as JSON
-    :param sheet_format: format of the spreadsheets
-    :param data_models: name of module containing supporting Python data classes
-    :param tags: names of tags to be used to filter the source spreadsheets
-    :returns: dict representing the collection of data sheets.
-    """
+
+    return book.export("ods")
 
-    parser = get_content_index_parser(input_files, sheet_format, data_models, tags)
-    output = parser.data_sheets_to_dict()
-
-    if output_file:
-        with open(output_file, "w") as export:
-            json.dump(output, export, indent=4)
-
-    return output
+
+def sheets_to_uni(infile) -> list:
+    return parse_tables(create_sheet_reader(None, infile))
 
 
 def get_content_index_parser(input_files, sheet_format, data_models, tags):
@@ -121,18 +115,29 @@ def flows_to_sheets(
     )
 
 
 def create_sheet_reader(sheet_format, input_file):
-    if sheet_format == "csv":
-        sheet_reader = CSVSheetReader(input_file)
-    elif sheet_format == "xlsx":
-        sheet_reader = XLSXSheetReader(input_file)
-    elif sheet_format == "json":
-        sheet_reader = JSONSheetReader(input_file)
-    elif sheet_format == "google_sheets":
-        sheet_reader = GoogleSheetReader(input_file)
+    sheet_format = sheet_format if sheet_format else detect_format(input_file)
+    cls = {
+        "csv": CSVSheetReader,
+        "google_sheets": GoogleSheetReader,
+        "json": JSONSheetReader,
+        "ods": ODSSheetReader,
+        "xlsx": XLSXSheetReader,
+    }.get(sheet_format)
+
+    if cls:
+        return cls(input_file)
     else:
         raise Exception(f"Format {sheet_format} currently unsupported.")
 
-    return sheet_reader
+
+def detect_format(fp):
+    if bool(re.fullmatch(r"[a-z0-9_-]{44}", fp, re.IGNORECASE)):
+        return "google_sheets"
+
+    ext = Path(fp).suffix.lower()[1:]
+
+    if ext in ["xlsx", "ods"]:
+        return ext
 
 
 def sheets_to_csv(path, sheet_ids):
diff --git a/src/rpft/parsers/creation/contentindexparser.py b/src/rpft/parsers/creation/contentindexparser.py
index c4440b9..285da3f 100644
--- a/src/rpft/parsers/creation/contentindexparser.py
+++ b/src/rpft/parsers/creation/contentindexparser.py
@@ -1,6 +1,7 @@
 import importlib
 import logging
 from collections import OrderedDict
+
 from rpft.logger.logger import logging_context
 from rpft.parsers.common.model_inference import model_from_headers
 from rpft.parsers.common.sheetparser import SheetParser
@@ -56,7 +57,7 @@ def __init__(
         self.tag_matcher = tag_matcher
         self.template_sheets = {}
         self.data_sheets = {}
-        self.flow_definition_rows: list[ContentIndexRowModel] = []
+        self.flow_definition_rows = []
         self.campaign_parsers: dict[str, tuple[str, CampaignParser]] = {}
         self.surveys = {}
         self.trigger_parsers = OrderedDict()
diff --git a/src/rpft/parsers/sheets.py b/src/rpft/parsers/sheets.py
index bc89d4e..51eb2e1 100644
--- a/src/rpft/parsers/sheets.py
+++ b/src/rpft/parsers/sheets.py
@@ -19,6 +19,9 @@ def __init__(self, reader, name, table):
         self.name = name
         self.table = table
 
+    def __repr__(self):
+        return f"Sheet(name: '{self.name}')"
+
 
 class AbstractSheetReader(ABC):
     @property
@@ -31,6 +34,9 @@ def get_sheet(self, name) -> Sheet:
     def get_sheets_by_name(self, name) -> list[Sheet]:
         return [sheet] if (sheet := self.get_sheet(name)) else []
 
+    def __repr__(self):
+        return f"{type(self).__name__}(name: '{self.name}')"
+
 
 class CSVSheetReader(AbstractSheetReader):
     def __init__(self, path):
@@ -62,23 +68,9 @@ def __init__(self, filename):
             self.sheets[sheet.title] = Sheet(
                 reader=self,
                 name=sheet.title,
-                table=self._sanitize(sheet),
+                table=sanitize(sheet),
             )
 
-    def _sanitize(self, sheet):
-        data = tablib.Dataset()
-        data.headers = sheet.headers
-        # remove trailing Nones
-        while data.headers[-1] is None:
-            data.headers.pop()
-        for row in sheet:
-            vals = tuple(str(e) if e is not None else "" for e in row)
-            new_row = vals[: len(data.headers)]
-            if any(new_row):
-                # omit empty rows
-                data.append(new_row)
-        return data
-
 
 class GoogleSheetReader(AbstractSheetReader):
@@ -156,6 +148,41 @@ def get_sheets_by_name(self, name):
         return sheets
 
 
+class DatasetSheetReader(AbstractSheetReader):
+    def __init__(self, datasets):
+        self._sheets = {d.title: Sheet(self, d.title, d) for d in datasets}
+        self.name = "[datasets]"
+
+
+class ODSSheetReader(AbstractSheetReader):
+    def __init__(self, path):
+        book = tablib.Databook()
+
+        with open(path, "rb") as f:
+            book.load(f, format="ods")
+
+        self._sheets = {
+            sheet.title: Sheet(self, sheet.title, sanitize(sheet))
+            for sheet in book.sheets()
+        }
+        self.name = str(path)
+
+
+def sanitize(sheet):
+    data = tablib.Dataset()
+    data.headers = sheet.headers
+    # remove trailing Nones
+    while data.headers and data.headers[-1] is None:
+        data.headers.pop()
+    for row in sheet:
+        vals = tuple(str(e) if e is not None else "" for e in row)
+        new_row = vals[: len(data.headers)]
+        if any(new_row):
+            # omit empty rows
+            data.append(new_row)
+    return data
+
+
 def load_csv(path):
     with open(path, mode="r", encoding="utf-8") as csv:
         return tablib.import_set(csv, format="csv")
diff --git a/src/rpft/parsers/universal.py b/src/rpft/parsers/universal.py
new file mode 100644
index 0000000..26433f8
--- /dev/null
+++ b/src/rpft/parsers/universal.py
@@ -0,0 +1,210 @@
+import logging
+import re
+from collections import defaultdict
+from functools import singledispatch
+from typing import Any
+
+from benedict import benedict
+
+from rpft.parsers.sheets import AbstractSheetReader
+
+LOGGER = logging.getLogger(__name__)
+
+DELIMS = "|;"
+PROP_ACCESSOR = "."
+META_KEY = "_idems"
+TABULATE_KEY = "tabulate"
+HEADERS_KEY = "headers"
+Table = list[list[str]]
+Book = list[tuple[str, Table]]
+
+
+def bookify(data: dict) -> Book:
+    """
+    Convert a dict into a 'book' - a list of named tables.
+    """
+    meta = data.get(META_KEY, {}).get(TABULATE_KEY, {})
+
+    return [(k, tabulate(v, meta.get(k, {}))) for k, v in data.items() if k != META_KEY]
+
+
+def tabulate(data, meta: dict = {}) -> Table:
+    """
+    Convert a nested data structure to a tabular form
+    """
+    headers = meta.get(HEADERS_KEY, []) or list(
+        {k: None for item in data for k, _ in item.items()}.keys()
+    )
+    paths = keypaths(headers)
+    rows = []
+
+    for item in data:
+        obj = benedict(item)
+        rows += [[stringify(obj[kp]) for kp in paths]]
+
+    return [headers] + rows
+
+
+@singledispatch
+def stringify(value, delimiters=DELIMS, **_) -> str:
+    s = str(value)
+
+    return s if is_template(s) else re.sub(rf"([{delimiters}])", r"\\\1", s)
+
+
+@stringify.register
+def _(value: dict, delimiters=DELIMS, depth=0) -> str:
+    if len(delimiters[depth:]) > 1:
+        d1, d2 = delimiters[depth : depth + 2]
+    else:
+        raise ValueError("Too few delimiters to stringify dict")
+
+    s = f" {d1} ".join(
+        f"{stringify(k)}{d2} {stringify(v, delimiters=delimiters, depth=depth + 2)}"
+        for k, v in value.items()
+    )
+
+    if len(value) == 1:
+        s += " " + d1
+
+    return s
+
+
+@stringify.register
+def _(value: list, delimiters=DELIMS, depth=0) -> str:
+    d = delimiters[depth] if depth < len(delimiters) else None
+
+    if not d:
+        raise ValueError("Too few delimiters to stringify list")
+
+    s = f" {d} ".join(
+        stringify(item, delimiters=delimiters, depth=depth + 1) for item in value
+    )
+
+    if len(value) == 1:
+        s += f" {d}"
+    elif value[-1] == "":
+        s += d
+
+    return s
+
+
+@stringify.register
+def _(value: tuple, delimiters=DELIMS, depth=0) -> str:
+    return stringify(list(value), delimiters=delimiters, depth=depth)
+
+
+@stringify.register
+def _(value: bool, **_) -> str:
+    return str(value).lower()
+
+
+def parse_tables(reader: AbstractSheetReader) -> dict:
+    """
+    Parse a workbook into a nested structure
+    """
+    obj = benedict()
+
+    for title, sheet in reader.sheets.items():
+        obj.merge(parse_table(title, sheet.table.headers, sheet.table[:]))
+
+    return obj
+
+
+def parse_table(title: str = None, headers=tuple(), rows=tuple()):
+    """
+    Parse data in tabular form into a nested structure
+    """
+    title = title or "table"
+
+    if not headers or not rows:
+        return {title: []}
+
+    return create_obj(stream(title, headers, rows))
+
+
+def stream(title: str = None, headers=tuple(), rows=tuple()):
+    yield [META_KEY, TABULATE_KEY, title, HEADERS_KEY], headers
+
+    for i, row in enumerate(rows):
+        for h, v in zip(keypaths(headers), row):
+            yield [title, i] + h, parse_cell(v)
+
+
+def keypaths(headers):
+    counters = defaultdict(int)
+    indexed = []
+
+    for key in headers:
+        indexed += [(key, counters[key])]
+        counters[key] += 1
+
+    return [keypath(h, i, counters[h]) for h, i in indexed]
+
+
+def keypath(header, index, count):
+    expanded = [normalise_key(k) for k in header.split(PROP_ACCESSOR)]
+    i = index if index < count else count - 1
+
+    return expanded + [i] if count > 1 else expanded
+
+
+def normalise_key(key):
+    try:
+        return int(key) - 1
+    except ValueError:
+        return key
+
+
+def create_obj(pairs):
+    obj = benedict()
+
+    for kp, v in pairs:
+        obj[kp] = v
+
+    return obj
+
+
+def parse_cell(s: str, delimiters=DELIMS, depth=0) -> Any:
+    if type(s) is not str:
+        raise TypeError("Value to convert is not a string")
+
+    clean = s.strip() if s else ""
+
+    try:
+        return int(clean)
+    except Exception:
+        pass
+
+    try:
+        return float(clean)
+    except Exception:
+        pass
+
+    if clean in ("true", "false"):
+        return clean == "true"
+
+    if is_template(clean):
+        return clean
+
+    d = delimiters[depth] if depth < len(delimiters) else ""
+    # Split on the delimiter for the current depth, unless escaped with a backslash
+    pattern = rf"(?<!\\){re.escape(d)}" if d else ""
+
+    if pattern and re.search(pattern, clean):
+        items = re.split(pattern, clean)
+
+        # A trailing delimiter marks the end of the sequence, not an empty final item
+        if items[-1].strip() == "":
+            items = items[:-1]
+
+        return [parse_cell(item, delimiters, depth + 1) for item in items]
+
+    if depth < len(delimiters) - 1:
+        return parse_cell(clean, delimiters, depth + 1)
+
+    # Remove the backslash escapes from any delimiter characters
+    return re.sub(rf"\\([{re.escape(delimiters)}])", r"\1", clean)
+
+
+def is_template(s: str) -> bool:
+    return bool(re.search(r"{{.*?}}|{@.*?@}|{%.*?%}|@\(.*?\)", s))
diff --git a/tests/test_contentindexparser.py b/tests/test_contentindexparser.py
index e2bb72e..d711f59 100644
--- a/tests/test_contentindexparser.py
+++ b/tests/test_contentindexparser.py
@@ -3,8 +3,13 @@
 from rpft.parsers.creation.contentindexparser import ContentIndexParser
 from rpft.parsers.creation.tagmatcher import TagMatcher
-from rpft.parsers.sheets import CompositeSheetReader, CSVSheetReader, XLSXSheetReader
+from rpft.parsers.sheets import (
+    CompositeSheetReader,
+    CSVSheetReader,
+    XLSXSheetReader,
+)
 from rpft.rapidpro.models.triggers import RapidProTriggerError
+
 from tests import TESTS_ROOT
 from tests.mocks import MockSheetReader
 from tests.utils import Context, csv_join, traverse_flow
@@ -1417,77 +1422,3 @@ def test_with_model(self):
 
         self.assertFlowMessages(flows, "template - a", ["hello georg"])
         self.assertFlowMessages(flows, "template - b", ["hello chiara"])
-
-
-class TestSaveAsDict(TestCase):
-    def test_save_as_dict(self):
-        self.maxDiff = None
-        ci_sheet = (
-            "type,sheet_name,data_sheet,data_row_id,new_name,data_model,status\n"
-            "data_sheet,simpledata,,,simpledata_renamed,ListRowModel,\n"
-            "create_flow,my_basic_flow,,,,,\n"
-            "data_sheet,nesteddata,,,,NestedRowModel,\n"
-        )
-        simpledata = csv_join(
-            "ID,list_value.1,list_value.2",
-            "rowID,val1,val2",
-        )
-        nesteddata = (
-            "ID,value1,custom_field.happy,custom_field.sad\n"
-            "row1,Value1,Happy1,Sad1\n"
-            "row2,Value2,Happy2,Sad2\n"
-        )
-        my_basic_flow = csv_join(
-            "row_id,type,from,message_text",
-            ",send_message,start,Some text",
-        )
-        sheet_dict = {
-            "simpledata": simpledata,
-            "my_basic_flow": my_basic_flow,
-            "nesteddata": nesteddata,
-        }
-
-        output = ContentIndexParser(
-            MockSheetReader(ci_sheet, sheet_dict),
-            "tests.datarowmodels.nestedmodel",
-        ).data_sheets_to_dict()
-
-        output["meta"].pop("version")
-        exp = {
-            "meta": {
-                "user_models_module": "tests.datarowmodels.nestedmodel",
-            },
-            "sheets": {
-                "simpledata_renamed": {
-                    "model": "ListRowModel",
-                    "rows": [
-                        {
-                            "ID": "rowID",
-                            "list_value": ["val1", "val2"],
-                        }
-                    ],
-                },
-                "nesteddata": {
-                    "model": "NestedRowModel",
-                    "rows": [
-                        {
-                            "ID": "row1",
-                            "value1": "Value1",
-                            "custom_field": {
-                                "happy": "Happy1",
-                                "sad": "Sad1",
-                            },
-                        },
-                        {
-                            "ID": "row2",
-                            "value1": "Value2",
-                            "custom_field": {
-                                "happy": "Happy2",
-                                "sad": "Sad2",
-                            },
-                        },
-                    ],
-                },
-            },
-        }
-        self.assertEqual(output, exp)
[{"k1": ["v1", ""]}] + + table = tabulate(data) + + self.assertEqual(table[1][0], "v1 | |") + + def test_nested_arrays_within_a_single_cell(self): + data = [ + {"k1": ["seq1v1", ["seq2v1", "seq2v2"]]}, + ] + + table = tabulate(data) + + self.assertEqual(table[1][0], "seq1v1 | seq2v1 ; seq2v2") + + def test_raise_exception_if_too_much_nesting_for_a_single_cell(self): + data = [ + {"k1": ["seq1v1", ["seq2v1", ["seq3v1"]]]}, + ] + + self.assertRaises(Exception, tabulate, data) + + def test_arrays_use_wide_layout_if_indicated_by_metadata(self): + meta = { + "headers": [ + "choices", + "choices", + "choices", + "choices", + ] + } + data = [ + { + "choices": ["yes", "no", 1, False], + }, + ] + + table = tabulate(data, meta) + + self.assertEqual(table[0], ["choices", "choices", "choices", "choices"]) + self.assertEqual(table[1], ["yes", "no", "1", "false"]) + + def test_objects_use_single_cell_layout_by_default(self): + data = [ + { + "obj": { + "prop1": "val1", + "prop2": "val2", + }, + }, + ] + + table = tabulate(data) + + self.assertEqual(table[1], ["prop1; val1 | prop2; val2"]) + + def test_object_with_single_property_within_cell_has_trailing_delimiter(self): + data = [{"obj": {"k": "v"}}] + + table = tabulate(data) + + self.assertEqual(table[1], ["k; v |"]) + + def test_objects_use_wide_layout_if_indicated_by_metadata(self): + meta = {"headers": ["obj1.k1", "obj1.k2", "seq1.1.k1", "seq1.2.k2"]} + data = [ + { + "obj1": { + "k1": "obj1_k1_v", + "k2": "obj1_k2_v", + }, + "seq1": [ + {"k1": "seq1_k1_v"}, + {"k2": "seq1_k2_v"}, + ], + }, + ] + + table = tabulate(data, meta) + + self.assertEqual( + table[0], + ["obj1.k1", "obj1.k2", "seq1.1.k1", "seq1.2.k2"], + ) + self.assertEqual( + table[1], + ["obj1_k1_v", "obj1_k2_v", "seq1_k1_v", "seq1_k2_v"], + ) + + +class TestUniversalToWorkbook(TestCase): + def test_assembly(self): + data = { + "group1": [{"a": "a1", "b": "b1"}], + "group2": [{"A": "A1", "B": "B1"}], + "_idems": { + "tabulate": { + "group1": {"headers": ["a", "b"]}, + "group2": {"headers": ["B", "A"]}, + }, + }, + } + + workbook = bookify(data) + + self.assertEqual(len(workbook), 2) + self.assertEqual(workbook[0][0], "group1") + self.assertEqual(workbook[0][1], [["a", "b"], ["a1", "b1"]]) + self.assertEqual(workbook[1][0], "group2") + self.assertEqual( + workbook[1][1], + [["B", "A"], ["B1", "A1"]], + "Columns should be ordered according to metadata", + ) + self.assertEqual( + data, + { + "group1": [{"a": "a1", "b": "b1"}], + "group2": [{"A": "A1", "B": "B1"}], + "_idems": { + "tabulate": { + "group1": {"headers": ["a", "b"]}, + "group2": {"headers": ["B", "A"]}, + }, + }, + }, + "Input data should not be mutated", + ) + + +class TestConvertWorkbookToUniversal(TestCase): + + def test_workbook_converts_to_object(self): + workbook = DatasetSheetReader( + [ + Dataset(("t1a1", "t1b1"), headers=("T1A", "T1B"), title="table1"), + Dataset(("t2a1", "t2b1"), headers=("T2A", "T2B"), title="table2"), + ] + ) + + nested = parse_tables(workbook) + + self.assertIsInstance(nested, dict) + self.assertEqual(list(nested.keys()), ["_idems", "table1", "table2"]) + self.assertEqual( + list(nested["_idems"]["tabulate"].keys()), + ["table1", "table2"], + ) + + +class TestConvertTableToNested(TestCase): + + def test_default_type_is_string(self): + self.assertEqual( + parse_table( + title="title", + headers=["a"], + rows=[["a1"]], + ), + { + "_idems": {"tabulate": {"title": {"headers": ["a"]}}}, + "title": [{"a": "a1"}], + }, + ) + + def test_table_must_have_title(self): + self.assertEqual(parse_table(), 
{"table": []}) + + def test_integer_as_string_is_int(self): + parsed = parse_table(headers=["a"], rows=[["123"]]) + + self.assertEqual(parsed["table"][0]["a"], 123) + + def test_boolean_as_string_is_bool(self): + parsed = parse_table(headers=("a", "b"), rows=[("true", "false")]) + + self.assertEqual(parsed["table"][0]["a"], True) + self.assertEqual(parsed["table"][0]["b"], False) + + def test_delimited_string_is_array(self): + parsed = parse_table(headers=["a"], rows=[["one | 2 | true | 3.4"]]) + + self.assertEqual(parsed["table"][0]["a"], ["one", 2, True, 3.4]) + + def test_columns_with_same_name_are_grouped_into_list(self): + parsed = parse_table(headers=["a"] * 4, rows=[("one", "2", "true", "3.4")]) + + self.assertEqual(parsed["table"][0]["a"], ["one", 2, True, 3.4]) + + def test_columns_with_same_name_and_delimited_strings_is_2d_array(self): + parsed = parse_table(headers=["a"] * 2, rows=[("one | 2", "true | 3.4")]) + + self.assertEqual(parsed["table"][0]["a"], [["one", 2], [True, 3.4]]) + + def test_column_using_dot_notation_is_nested_object_property(self): + parsed = parse_table( + headers=("obj.prop1", "obj.prop2"), + rows=[("one", "2")], + ) + + self.assertEqual(parsed["table"][0]["obj"], {"prop1": "one", "prop2": 2}) + self.assertEqual( + parsed["_idems"]["tabulate"]["table"]["headers"], + ("obj.prop1", "obj.prop2"), + ) + + def test_nested_object_with_2d_array_property_value(self): + parsed = parse_table(headers=["obj.k1"] * 2, rows=[["1 | 2", "3 | 4"]]) + + self.assertEqual(parsed["table"][0]["obj"], {"k1": [[1, 2], [3, 4]]}) + + def test_nested_object_with_nested_object(self): + parsed = parse_table( + headers=["obj.k1"] * 2, + rows=[["k2; 2 | k3; false", "k4; v4 | k5; true"]], + ) + + self.assertEqual( + parsed["table"][0]["obj"], + {"k1": [[["k2", 2], ["k3", False]], [["k4", "v4"], ["k5", True]]]}, + ) + + +class TestCellConversion(TestCase): + + def setUp(self): + self.func = parse_cell + + def test_convert_cell_string_to_number(self): + self.assertEqual(self.func("123"), 123) + self.assertEqual(self.func("1.23"), 1.23) + + def test_output_clean_string_if_no_conversion_possible(self): + self.assertEqual(self.func("one"), "one") + self.assertEqual(self.func(" one "), "one") + self.assertEqual(self.func(""), "") + self.assertEqual(self.func("http://example.com/"), "http://example.com/") + self.assertEqual(self.func("k1: v1"), "k1: v1") + + def test_raises_error_if_not_string_input(self): + self.assertRaises(TypeError, self.func, None) + self.assertRaises(TypeError, self.func, 123) + + def test_convert_cell_string_to_bool(self): + self.assertEqual(self.func("true"), True) + self.assertEqual(self.func(" true "), True) + self.assertEqual(self.func("false"), False) + + def test_convert_cell_string_to_list(self): + self.assertEqual(self.func("one | 2 | false"), ["one", 2, False]) + self.assertEqual(self.func("one ; 2 ; false"), ["one", 2, False]) + self.assertEqual(self.func("one |"), ["one"]) + self.assertEqual(self.func("|"), [""]) + self.assertEqual(self.func("| 2 |"), ["", 2]) + self.assertEqual(self.func("a||"), ["a", ""]) + self.assertEqual(self.func("k1 | v1 : k2 | v2"), ["k1", "v1 : k2", "v2"]) + + def test_convert_cell_string_to_list_of_lists(self): + self.assertEqual(self.func("k1; v1 |"), [["k1", "v1"]]) + self.assertEqual(self.func("k1; k2; v2 |"), [["k1", "k2", "v2"]]) + self.assertEqual(self.func("k1; 1 | k2; true"), [["k1", 1], ["k2", True]]) + + def test_delimiters_can_be_configured(self): + self.assertEqual( + self.func( + "click | auth: sign_in_google; click | 
+
+
+class TestCellConversion(TestCase):
+
+    def setUp(self):
+        self.func = parse_cell
+
+    def test_convert_cell_string_to_number(self):
+        self.assertEqual(self.func("123"), 123)
+        self.assertEqual(self.func("1.23"), 1.23)
+
+    def test_output_clean_string_if_no_conversion_possible(self):
+        self.assertEqual(self.func("one"), "one")
+        self.assertEqual(self.func(" one "), "one")
+        self.assertEqual(self.func(""), "")
+        self.assertEqual(self.func("http://example.com/"), "http://example.com/")
+        self.assertEqual(self.func("k1: v1"), "k1: v1")
+
+    def test_raises_error_if_not_string_input(self):
+        self.assertRaises(TypeError, self.func, None)
+        self.assertRaises(TypeError, self.func, 123)
+
+    def test_convert_cell_string_to_bool(self):
+        self.assertEqual(self.func("true"), True)
+        self.assertEqual(self.func(" true "), True)
+        self.assertEqual(self.func("false"), False)
+
+    def test_convert_cell_string_to_list(self):
+        self.assertEqual(self.func("one | 2 | false"), ["one", 2, False])
+        self.assertEqual(self.func("one ; 2 ; false"), ["one", 2, False])
+        self.assertEqual(self.func("one |"), ["one"])
+        self.assertEqual(self.func("|"), [""])
+        self.assertEqual(self.func("| 2 |"), ["", 2])
+        self.assertEqual(self.func("a||"), ["a", ""])
+        self.assertEqual(self.func("k1 | v1 : k2 | v2"), ["k1", "v1 : k2", "v2"])
+
+    def test_convert_cell_string_to_list_of_lists(self):
+        self.assertEqual(self.func("k1; v1 |"), [["k1", "v1"]])
+        self.assertEqual(self.func("k1; k2; v2 |"), [["k1", "k2", "v2"]])
+        self.assertEqual(self.func("k1; 1 | k2; true"), [["k1", 1], ["k2", True]])
+
+    def test_delimiters_can_be_configured(self):
+        self.assertEqual(
+            self.func(
+                "click | auth: sign_in_google; click | emit: force_reprocess",
+                delimiters=";|:",
+            ),
+            [
+                ["click", ["auth", "sign_in_google"]],
+                ["click", ["emit", "force_reprocess"]],
+            ],
+        )
+
+    def test_inline_templates_are_preserved(self):
+        self.assertEqual(self.func("{{ template }}"), "{{ template }}")
+        self.assertEqual(self.func("{@ template @}"), "{@ template @}")
+        self.assertEqual(
+            self.func('{% if other_option!="" %}1wc;1wt;1wb{%endif-%}'),
+            '{% if other_option!="" %}1wc;1wt;1wb{%endif-%}',
+        )
+        self.assertEqual(self.func("{{ template }} |"), "{{ template }} |")
+        self.assertEqual(
+            self.func("{{ template }} | something | {{ blah }}"),
+            "{{ template }} | something | {{ blah }}",
+        )
+        self.assertEqual(
+            self.func(
+                "{{3*(steps.values()|length -1)}}|{{3*(steps.values()|length -1)+2}}"
+            ),
+            "{{3*(steps.values()|length -1)}}|{{3*(steps.values()|length -1)+2}}",
+        )
+        self.assertEqual(
+            self.func('6;0{%if skip_option != "" -%};skip{% endif %}'),
+            '6;0{%if skip_option != "" -%};skip{% endif %}',
+        )
+        self.assertEqual(
+            self.func('@(fields.survey_behave & "no|")'),
+            '@(fields.survey_behave & "no|")',
+        )
+
+    def test_delimiters_can_be_escaped(self):
+        self.assertEqual(
+            self.func(r"1 ; 2 | 3 \| 4 | 5 \; 6 \|"),
+            [[1, 2], "3 | 4", "5 ; 6 |"],
+        )