Create universal parser #143

Open
wants to merge 33 commits into
base: main
5bd2f8c
Create universal parser
istride Sep 17, 2024
70998ba
Fix invalid dict access syntax
istride Sep 17, 2024
0ba9d73
Create function to convert uni format workbook into nested
istride Sep 23, 2024
ff6fa77
Ensure sheet order when converting legacy sheets
istride Sep 24, 2024
0477412
Ensure column headers are preserved accurately
istride Sep 25, 2024
55f0e77
Support the same dotted path notation as legacy sheets
istride Sep 26, 2024
be4d2a1
Create command to convert universal sheets to JSON
istride Sep 30, 2024
b8a9a19
Create cell parser that preserves templates
istride Oct 10, 2024
a792590
Fix tests
istride Jan 17, 2025
34c8355
Parse cell content from existing sheets to nested JSON without models
istride Jan 27, 2025
bcd708f
Remove legacy sheets to universal format using models
istride Jan 28, 2025
f6201bb
Clean up uni to sheets conversion
istride Jan 28, 2025
c96f511
Remove template preserver
istride Jan 28, 2025
9d27fe9
Tidy conversion to sheets; fix bugs
istride Jan 31, 2025
e8d5580
Support Python v3.9
istride Jan 31, 2025
1b372f0
Remove type hints from parse_table
istride Jan 31, 2025
ab6cde7
Remove Lark-based cell parser
istride Jan 31, 2025
c2e1fb0
Remove lark package
istride Jan 31, 2025
70f8dc8
Remove non-essential changes
istride Jan 31, 2025
286f1e4
Allow dicts to be stringified with variable delimiters
istride Jan 31, 2025
cf308cd
Create documentation for spreadsheet notation
istride Feb 1, 2025
c6c6a26
Amend docs
istride Feb 1, 2025
9a1d86c
Add support for ODS files
istride Feb 3, 2025
79d76aa
Preserve templates; escape delimiters
istride Feb 3, 2025
7beb27d
Rename convert_cell to parse_cell
istride Feb 3, 2025
4993ac3
Consider escaped delimiters at the end of a cell
istride Feb 3, 2025
9d926ac
Stop escaping delimiters in templates
istride Feb 3, 2025
21c9098
Update docs
istride Feb 4, 2025
7d113e3
Add info about metadata
istride Feb 4, 2025
e2170e0
Treat RapidPro expressions as templates
istride Feb 6, 2025
ab9d6b6
Allow sequence delimiters to be configured
istride Feb 10, 2025
4c7bb7c
Require patched tablib
istride Feb 19, 2025
b1f34d2
Bug fix
istride Mar 6, 2025
1 change: 0 additions & 1 deletion README.md
@@ -16,7 +16,6 @@ The CLI supports the following subcommands:
 - `create_flows`: create RapidPro flows (in JSON format) from spreadsheets using content index
 - `flows_to_sheets`: convert RapidPro flows (in JSON format) into spreadsheets
 - `convert`: save input spreadsheets as JSON
-- `save_data_sheets`: save input spreadsheets as nested JSON using content index - an experimental feature that is likely to change.

 Full details of the available options for each can be found via the help feature:

205 changes: 205 additions & 0 deletions docs/notation.md
@@ -0,0 +1,205 @@
# Spreadsheet notation

This is a summary of the spreadsheet notation used to convert sheets into a nested data structure (JSON). Each section shows a data table alongside the resulting JSON structure.

# Books

A container for multiple tables. Also known as a spreadsheet or workbook. A book is converted to an object containing a property for each table. The property key is the name of the sheet; the value is the converted contents of the sheet.

For example, given an Excel workbook with two sheets ("table1" and "table2"), the resulting JSON will be:

```json
{
  "table1": [],
  "table2": []
}
```

# Tables

Also known as a sheet in a spreadsheet (or workbook).

The contents of a table are converted to a sequence of objects - corresponding to rows in the sheet. Each object will have keys corresponding to the column headers of the sheet, and values corresponding to a particular row in the sheet.

| a | b |
|----|----|
| v1 | v2 |

`data`

```json
{
  "data": [
    {"a": "v1", "b": "v2"}
  ]
}
```

This means that the first row of every table should be a header row that specifies the name of each column.
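
The row-to-object mapping described above can be sketched in Python as follows. This is a minimal illustration, not the code from this pull request: `table_to_objects` is a hypothetical name, and the sketch assumes every column header is unique (repeated headers are covered under Sequences).

```python
def table_to_objects(rows):
    """Convert a table (header row first) into a list of row objects."""
    if not rows:
        return []
    headers, *body = rows
    # Pair each cell with the header of its column.
    return [dict(zip(headers, row)) for row in body]


table = [["a", "b"], ["v1", "v2"]]
result = {"data": table_to_objects(table)}
```

Here `result` has the same shape as the `data` example above.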

# Basic types

Refers to the following value types in JSON: `string`, `number`, `true` and `false`.

| string | number | true | false |
|--------|--------|------|-------|
| hello | 123 | true | false |

`basic_types`

```json
{
  "basic_types": [
    {
      "string": "hello",
      "number": 123,
      "true": true,
      "false": false
    }
  ]
}
```

The JSON type `null` is not represented because an empty cell is assumed to be equivalent to the empty string ("").
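
One way to picture the conversion of cell text to basic types is the sketch below. The rules are assumptions made for illustration (the actual parser in this pull request may decide differently), and `parse_basic` is a hypothetical name. Note that Python's `int` and `float` accept some strings, such as `1_000` or `inf`, that a real parser may prefer to keep as text.

```python
def parse_basic(cell):
    """Guess the JSON type of a cell's text (illustrative rules only)."""
    text = cell.strip()
    if text == "true":
        return True
    if text == "false":
        return False
    # Try the numeric types in turn; fall back to the text itself.
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


cells = {"string": "hello", "number": "123", "true": "true", "false": "false"}
row = {header: parse_basic(value) for header, value in cells.items()}
```

An empty cell stays as the empty string, in line with the note about `null` above.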

# Sequences

An ordered sequence of items. Also known as lists or arrays.

| seq1 | seq1 | seq2.1 | seq2.2 | seq3 | seq4 |
|------|------|--------|--------|----------|--------------------|
| v1 | v2 | v1 | v2 | v1 \| v2 | v1 ; v2 \| v3 ; v4 |

`sequences`

```json
{
  "sequences": [
    {
      "seq1": ["v1", "v2"],
      "seq2": ["v1", "v2"],
      "seq3": ["v1", "v2"],
      "seq4": [["v1", "v2"], ["v3", "v4"]]
    }
  ]
}
```

`seq1`, `seq2` and `seq3` are equivalent. In all cases, the order of items is specified from left to right.

`seq1` uses a 'wide' layout, where the column header is repeated and each column holds one item in the sequence. Values from columns with the same name are collected into a sequence in the resulting JSON object.

`seq2` is similar to `seq1`, but the index of each item is specified explicitly.
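
The wide layout of `seq1` can be sketched as below. This is an illustrative assumption about how repeated headers might be collected, not the implementation in this pull request; `collect_wide` is a hypothetical name, and the indexed form used by `seq2` is not handled.

```python
from collections import Counter


def collect_wide(headers, row):
    """Group cell values by header; repeated headers become a list."""
    repeats = Counter(headers)
    out = {}
    for header, value in zip(headers, row):
        if repeats[header] > 1:
            # Columns are visited left to right, so order is preserved.
            out.setdefault(header, []).append(value)
        else:
            out[header] = value
    return out


wide = collect_wide(["seq1", "seq1", "other"], ["v1", "v2", "x"])
```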

`seq3` uses an 'inline' layout, where the sequence is defined as a delimited string within a single cell of the table. The default delimiter is a vertical bar or pipe character ('|').

Two levels of nesting are possible within a cell, as shown by `seq4` - a list of lists. This could be used to model a list of key-value pairs, which could easily be converted to an object (map / dictionary). The default delimiter for second-level sequences is a semi-colon (';').

To have a delimiter character treated as literal text, escape it. An escape sequence is a backslash ('\\') followed by the character to be escaped. For example, to escape a vertical bar, use: '\\|'.
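
The inline layout, including escaped delimiters, could be parsed along the lines of the sketch below. It is a simplified assumption rather than the parser from this pull request: `parse_inline` is a hypothetical name, values are left as strings, a cell without any unescaped delimiter is returned as plain text, and an escaped backslash is not treated specially.

```python
import re


def parse_inline(cell, delimiters=("|", ";")):
    """Split a cell on unescaped delimiters, one nesting level per delimiter."""

    def unescape(text):
        # Turn '\|' and '\;' back into the literal characters.
        for delimiter in delimiters:
            text = text.replace("\\" + delimiter, delimiter)
        return text.strip()

    def split(text, level):
        if level == len(delimiters):
            return unescape(text)
        # The lookbehind skips delimiters preceded by a backslash.
        pattern = r"(?<!\\)" + re.escape(delimiters[level])
        parts = re.split(pattern, text)
        if len(parts) == 1:
            return split(text, level + 1)
        return [split(part, level + 1) for part in parts]

    return split(cell, 0)


seq3 = parse_inline("v1 | v2")
seq4 = parse_inline("v1 ; v2 | v3 ; v4")
escaped = parse_inline(r"a \| b")
```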

# Objects

An unordered collection of key-value pairs (properties). Also known as maps, dictionaries or associative arrays.

| obj1.key1 | obj1.key2 | obj2 |
|-----------|-----------|------------------------|
| v1 | v2 | key1 ; v1 \| key2 ; v2 |

`objects`

```json
{
  "objects": [
    {
      "obj1": {
        "key1": "v1",
        "key2": "v2"
      },
      "obj2": [
        ["key1", "v1"],
        ["key2", "v2"]
      ]
    }
  ]
}
```

`obj1` is converted to an object and `obj2` to a list of key-value pairs. The two results differ slightly, but both can be interpreted in the same way, as a list of key-value pairs.

A wide layout is used for `obj1`, where one or more column headers use a dotted 'keypath' notation to identify a particular property key belonging to a particular object, and the corresponding cells in subsequent rows contain the values for that property. The dotted keypath notation can be used to access properties at deeper levels of nesting e.g. `obj.key.subkey.etc`.

An inline layout is used for `obj2`, where properties are defined as a sequence of key-value pairs. The delimiter of properties is a vertical bar or pipe character - same as top-level sequences. The delimiter of keys and values is a semi-colon character - same as second-level sequences.
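
Both object layouts can be illustrated with a short Python sketch. It is written under stated assumptions and is not the code from this pull request: `set_by_keypath` is a hypothetical helper, and numeric path segments such as `seq2.1` are not handled.

```python
def set_by_keypath(target, keypath, value):
    """Assign `value` at a dotted keypath, creating nested dicts as needed."""
    *parents, last = keypath.split(".")
    node = target
    for key in parents:
        node = node.setdefault(key, {})
    node[last] = value
    return target


# Wide layout: one column per property, addressed by keypath.
row = {}
set_by_keypath(row, "obj1.key1", "v1")
set_by_keypath(row, "obj1.key2", "v2")

# Inline layout: a list of key-value pairs, which converts to a dict.
pairs = [["key1", "v1"], ["key2", "v2"]]
as_object = dict(pairs)
```

After running this, `as_object` is equal to `row["obj1"]`, which is why the two layouts can be read in the same way.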

All the previous notation can be combined to create fairly complicated structures.

| obj1.key1 | obj1.key1 |
|------------------------|--------------------------------|
| 1 ; 2 ; 3 \| one ; two | active ; true \| debug ; false |

`nesting`

```json
{
  "nesting": [
    {
      "obj1": {
        "key1": [
          [
            [1, 2, 3],
            ["one", "two"]
          ],
          [
            ["active", true],
            ["debug", false]
          ]
        ]
      }
    }
  ]
}
```

# Templates

Table cells may contain Jinja templates. A cell is considered a template if it contains template placeholders anywhere within it. There are three types of template placeholders:

- `{{ ... }}`
- `{% ... %}`
- `{@ ... @}`

When converting between spreadsheets and JSON, templates will not be interpreted in any way, just copied verbatim. This means that sequence delimiters do not need to be escaped if they exist within a template. It is intended for templates to eventually be interpreted at a later stage, during further processing.
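
A check for template cells might look like the sketch below. The regular expression is an assumption made for illustration: it requires both the opening and the closing marker of a placeholder, which may be stricter or looser than the check used in this pull request. `is_template` is a hypothetical name.

```python
import re

# One alternative per placeholder type: {{ ... }}, {% ... %} and {@ ... @}.
TEMPLATE_MARKERS = re.compile(r"\{\{.*?\}\}|\{%.*?%\}|\{@.*?@\}", re.DOTALL)


def is_template(cell):
    """Return True if the cell contains a template placeholder anywhere."""
    return bool(TEMPLATE_MARKERS.search(cell))


checks = [
    is_template("Hello {{ name }}"),
    is_template("{% if x %}a | b{% endif %}"),
    is_template("v1 | v2"),
]
```

A cell that passes this check would be copied verbatim, so the `|` in the second example would not be treated as a sequence delimiter.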

# Metadata

Information that would otherwise be lost during the conversion from spreadsheets to JSON is stored as metadata, in a top-level property with the key `_idems`. The metadata property is intended to be 'hidden', and its key is unlikely to clash with any sheet name.

The original header names for each sheet are held as metadata to direct the conversion process from JSON back to spreadsheet. The original headers preserve the order of columns and whether a wide or inline layout was used.


| seq1 | seq1 | seq2 |
|------|------|----------|
| v1 | v2 | v1 \| v2 |

`sequences`

```json
{
  "_idems": {
    "tabulate": {
      "sequences": {
        "headers": [
          "seq1",
          "seq1",
          "seq2"
        ]
      }
    }
  },
  "sequences": [
    {
      "seq1": ["v1", "v2"],
      "seq2": ["v1", "v2"]
    }
  ]
}
```
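
To show how the stored headers could direct the conversion back to a spreadsheet, here is a minimal sketch. It is an assumption about the approach, not the code from this pull request: `row_to_cells` is a hypothetical name, and only flat sequences of strings are handled.

```python
from collections import Counter


def row_to_cells(headers, obj, delimiter=" | "):
    """Rebuild one spreadsheet row from a JSON object, guided by headers."""
    repeats = Counter(headers)  # how often each header appears
    position = Counter()  # next index to take from a wide sequence
    cells = []
    for header in headers:
        value = obj[header]
        if repeats[header] > 1:
            # Wide layout: one item of the sequence per repeated column.
            cells.append(value[position[header]])
            position[header] += 1
        elif isinstance(value, list):
            # Inline layout: the whole sequence goes into a single cell.
            cells.append(delimiter.join(value))
        else:
            cells.append(value)
    return cells


document = {
    "_idems": {"tabulate": {"sequences": {"headers": ["seq1", "seq1", "seq2"]}}},
    "sequences": [{"seq1": ["v1", "v2"], "seq2": ["v1", "v2"]}],
}
headers = document["_idems"]["tabulate"]["sequences"]["headers"]
rows = [row_to_cells(headers, obj) for obj in document["sequences"]]
```

Because `seq1` appears twice in the headers it is written in the wide layout, while `seq2` appears once and is written inline.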
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -34,9 +34,11 @@ dependencies = [
     "google-api-python-client~=2.6.0",
     "google-auth-oauthlib~=0.4.4",
     "networkx~=2.5.1",
+    "odfpy",
     "openpyxl",
     "pydantic >= 2",
-    "tablib[ods]>=3.1.0",
+    "python-benedict",
+    "tablib @ git+https://github.com/istride/[email protected]",
 ]

 [project.urls]
92 changes: 56 additions & 36 deletions src/rpft/cli.py
@@ -39,16 +39,16 @@ def flows_to_sheets(args):
     )


-def save_data_sheets(args):
-    output = converters.save_data_sheets(
-        args.input,
-        None,
-        args.format,
-        data_models=args.datamodels,
-        tags=args.tags,
-    )
-    with open(args.output, "w", encoding="utf-8") as export:
-        json.dump(output, export, indent=4)
+def uni_to_sheets(args):
+    with open(args.output, "wb") as handle:
+        handle.write(converters.uni_to_sheets(args.input))
+
+
+def sheets_to_uni(args):
+    data = converters.sheets_to_uni(args.input)
+
+    with open(args.output, "w", encoding="utf-8") as f:
+        json.dump(data, f, indent=2)


 def create_parser():
@@ -64,7 +64,8 @@ def create_parser():
     _add_create_command(sub)
     _add_convert_command(sub)
     _add_flows_to_sheets_command(sub)
-    _add_save_data_sheets_command(sub)
+    _add_uni_to_sheets_command(sub)
+    _add_sheets_to_uni_command(sub)

     return parser

@@ -77,25 +78,13 @@ def _add_create_command(sub):
     )

     parser.set_defaults(func=create_flows)
-    _add_content_index_arguments(parser)
-
-
-def _add_content_index_arguments(parser):
     parser.add_argument(
-        "--datamodels",
+        "input",
         help=(
-            "name of the module defining user data models underlying the data sheets,"
-            " e.g. if the model definitions reside in"
-            " ./myfolder/mysubfolder/mymodelsfile.py, then this argument should be"
-            " myfolder.mysubfolder.mymodelsfile"
+            "paths to XLSX or JSON files, or directories containing CSV files, or"
+            " Google Sheets IDs i.e. from the URL; inputs should be of the same format"
         ),
-    )
-    parser.add_argument(
-        "-f",
-        "--format",
-        choices=["csv", "google_sheets", "json", "xlsx"],
-        help="input sheet format",
-        required=True,
+        nargs="+",
     )
     parser.add_argument(
         "-o",
@@ -114,12 +103,20 @@ def _add_content_index_arguments(parser):
         nargs="*",
     )
     parser.add_argument(
-        "input",
+        "--datamodels",
         help=(
-            "paths to XLSX or JSON files, or directories containing CSV files, or"
-            " Google Sheets IDs i.e. from the URL; inputs should be of the same format"
+            "name of the module defining user data models underlying the data sheets,"
+            " e.g. if the model definitions reside in"
+            " ./myfolder/mysubfolder/mymodelsfile.py, then this argument should be"
+            " myfolder.mysubfolder.mymodelsfile"
         ),
-        nargs="+",
+    )
+    parser.add_argument(
+        "-f",
+        "--format",
+        choices=["csv", "google_sheets", "json", "xlsx"],
+        help="input sheet format",
+        required=True,
     )


@@ -180,14 +177,37 @@ def _add_flows_to_sheets_command(sub):
     )


-def _add_save_data_sheets_command(sub):
+def _add_uni_to_sheets_command(sub):
+    parser = sub.add_parser(
+        "uni-to-sheets",
+        help="convert JSON to sheets",
+    )
+    parser.set_defaults(func=uni_to_sheets)
+    parser.add_argument(
+        "input",
+        help=("location of input JSON file"),
+    )
+    parser.add_argument(
+        "output",
+        help=("location where sheets will be saved"),
+    )
+
+
+def _add_sheets_to_uni_command(sub):
     parser = sub.add_parser(
-        "save_data_sheets",
-        help="save data sheets referenced in context index as nested json",
+        "sheets-to-uni",
+        help="convert sheets to nested JSON",
     )

-    parser.set_defaults(func=save_data_sheets)
-    _add_content_index_arguments(parser)
+    parser.set_defaults(func=sheets_to_uni)
+    parser.add_argument(
+        "input",
+        help=("location of workbook"),
+    )
+    parser.add_argument(
+        "output",
+        help=("location where JSON will be saved"),
+    )


 if __name__ == "__main__":