[DRAFT] Array Support POC #282

forestmvey · 2023-07-06T16:26:28Z

Description

Add support for indexing arrays using square brackets. This implementation leaves it to users to decide how to handle their data. If array index brackets are supplied, the plugin assumes that the named field holds an array. If the value is not an array, or the user supplies an invalid index, an empty value (null) is returned.

The goal of adding syntax for returning and indexing arrays is to let users work with arrays without introducing a breaking change in the SQL plugin. By offloading the responsibility for handling stored array data to the user, we avoid the need for additional data mapping.

Syntax

array[:]
array[index]<[index]>

Examples

Dataset:

"array": [[1], [2, [3, 4]], 5]

Query:

SELECT array FROM arrays;

Result:
1

Query:

SELECT array[:] FROM arrays;

Result:
[[1], [2, [3, 4]], 5]

Query:

SELECT array[0] FROM arrays;

Result:
[1]

Query:

SELECT array[1] FROM arrays;

Result:
[2, [3, 4]]

Query:

SELECT array[1][1] FROM arrays;

Result:
[3, 4]

Query:

SELECT array[1][1][0] FROM arrays;

Result:
3

Query:

SELECT array[2] FROM arrays;

Result:
5

Query:

SELECT array[3] FROM arrays;

Result:
null
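
The indexing rules shown in these examples can be sketched on plain Java lists. This is only an illustration of the described behaviour, not the plugin's implementation; the class name `ArrayIndexSketch` and the method `resolve` are hypothetical, and the sketch assumes that an invalid access yields null, as in the last example above.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics the indexing rules described above on plain Java lists. */
class ArrayIndexSketch {

  /**
   * Applies each index in turn. Returns null as soon as the current value
   * is not a list or the index is out of bounds.
   */
  static Object resolve(Object value, int... indexes) {
    Object current = value;
    for (int index : indexes) {
      if (!(current instanceof List)) {
        return null; // indexing a non-array value
      }
      List<?> list = (List<?>) current;
      if (index < 0 || index >= list.size()) {
        return null; // index out of bounds
      }
      current = list.get(index);
    }
    return current;
  }

  public static void main(String[] args) {
    // Dataset from the examples: [[1], [2, [3, 4]], 5]
    List<Object> array = Arrays.asList(
        Arrays.asList(1),
        Arrays.asList(2, Arrays.asList(3, 4)),
        5);

    System.out.println(resolve(array, 1, 1));    // [3, 4]
    System.out.println(resolve(array, 1, 1, 0)); // 3
    System.out.println(resolve(array, 3));       // null (out of bounds)
    System.out.println(resolve(array, 2, 0));    // null (5 is not an array)
  }
}
```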

Object Arrays

The V2 engine currently behaves as follows for object arrays.

Dataset:

"objectArray": [{"id": [1, 2]}, {"id": 2}],
"nestedArray": [{"id": [1, 2]}, {"id": 2}]

Query:

SELECT objectArray FROM arrays;

Result:
{"id": 1}

Query:

SELECT nestedArray FROM arrays;

Result:
[{"id": [1, 2]}, {"id": 2}]

Added Functionality

Query:

SELECT objectArray[:] FROM arrays;
SELECT nestedArray[:] FROM arrays;

Result:
[{"id": [1, 2]}, {"id": 2}]

Query:

SELECT objectArray[1] FROM arrays;
SELECT nestedArray[1] FROM arrays;

Result:
{"id": 2}

Query:

SELECT objectArray[0].id FROM arrays;
SELECT nestedArray[0].id FROM arrays;

Result:
[1, 2]

Query:

SELECT objectArray[0].id[1] FROM arrays;
SELECT nestedArray[0].id[1] FROM arrays;

Result:
2
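
Mixing field access and indexes can be pictured as walking a path one step at a time. The sketch below is a rough model under stated assumptions, not the PR's code: `PathSketch`, `resolve`, and `sampleDocument` are hypothetical names, the document is modelled as nested `Map` and `List` values, and any step that cannot be applied yields null. The tokenizer is deliberately loose and simply skips characters such as the dots between steps.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: resolves paths such as objectArray[0].id[1] on maps and lists. */
class PathSketch {

  // A path step is either a field name or a bracketed non-negative index.
  private static final Pattern STEP = Pattern.compile("([A-Za-z_]\\w*)|\\[(\\d+)\\]");

  static Object resolve(Map<String, Object> document, String path) {
    Object current = document;
    Matcher matcher = STEP.matcher(path);
    while (matcher.find()) {
      if (matcher.group(1) != null) {
        // Field step: the current value must be an object.
        if (!(current instanceof Map)) {
          return null;
        }
        current = ((Map<?, ?>) current).get(matcher.group(1));
      } else {
        // Index step: the current value must be an array with that index.
        if (!(current instanceof List)) {
          return null;
        }
        List<?> list = (List<?>) current;
        int index = Integer.parseInt(matcher.group(2));
        if (index >= list.size()) {
          return null;
        }
        current = list.get(index);
      }
    }
    return current;
  }

  /** Dataset from the description: "objectArray": [{"id": [1, 2]}, {"id": 2}] */
  static Map<String, Object> sampleDocument() {
    Map<String, Object> document = new HashMap<>();
    document.put("objectArray", Arrays.asList(
        Collections.<String, Object>singletonMap("id", Arrays.asList(1, 2)),
        Collections.<String, Object>singletonMap("id", 2)));
    return document;
  }

  public static void main(String[] args) {
    Map<String, Object> document = sampleDocument();
    System.out.println(resolve(document, "objectArray[1]"));       // {id=2}
    System.out.println(resolve(document, "objectArray[0].id"));    // [1, 2]
    System.out.println(resolve(document, "objectArray[0].id[1]")); // 2
    System.out.println(resolve(document, "objectArray[1].id[0]")); // null (id is not an array)
  }
}
```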

Edge Cases

Index out of bounds returns empty
Indexing non-array value returns empty

Additional Functionality

Add support for ranges: array[<firstIndex>:<lastIndex>]
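
The range semantics are not specified in this description. As one possible reading, the sketch below treats both `firstIndex` and `lastIndex` as inclusive, clamps them to the array bounds, and returns an empty list for non-array values or empty ranges. All of these choices, and the names `RangeSketch` and `slice`, are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: one possible meaning of array[firstIndex:lastIndex]. */
class RangeSketch {

  /** Assumption: both bounds are inclusive and are clamped to the array. */
  static List<Object> slice(Object value, int firstIndex, int lastIndex) {
    if (!(value instanceof List)) {
      return Collections.emptyList(); // non-array value
    }
    List<?> list = (List<?>) value;
    int from = Math.max(firstIndex, 0);
    int to = Math.min(lastIndex, list.size() - 1);
    if (from > to) {
      return Collections.emptyList(); // empty or inverted range
    }
    return new ArrayList<Object>(list.subList(from, to + 1));
  }

  public static void main(String[] args) {
    List<Object> array = Arrays.asList(
        Arrays.asList(1),
        Arrays.asList(2, Arrays.asList(3, 4)),
        5);

    System.out.println(slice(array, 0, 1));  // [[1], [2, [3, 4]]]
    System.out.println(slice(array, 1, 10)); // [[2, [3, 4]], 5]
    System.out.println(slice(5, 0, 1));      // []
  }
}
```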

Limitations

Array support is limited by what OpenSearch itself supports for arrays. Arrays are not mapped as an array type; the mapping records only the type of the values inside the array. In addition, an array is stored as part of a single document, so the entire array is returned if any portion of it matches a filter. For example, an array consisting of [1, 2] is returned whole when filtering for a value of 1 or of 2.
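
The whole-array matching described here can be modelled in a few lines. This is a toy model, not OpenSearch code: each document is represented by a hypothetical list of integers, and `filter` keeps every document whose array contains the wanted value, returning the array untouched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: a filter matches on any element but returns the whole array. */
class WholeArrayMatchSketch {

  static List<List<Integer>> filter(List<List<Integer>> documents, int wanted) {
    return documents.stream()
        .filter(values -> values.contains(wanted)) // match on any element
        .collect(Collectors.toList());             // the full array is kept
  }

  public static void main(String[] args) {
    List<List<Integer>> documents = Arrays.asList(
        Arrays.asList(1, 2),
        Arrays.asList(3));

    System.out.println(filter(documents, 1)); // [[1, 2]]
    System.out.println(filter(documents, 2)); // [[1, 2]]
    System.out.println(filter(documents, 4)); // []
  }
}
```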

Clause Support

✅ SELECT Clause
- ❌ Function Support
❌ ORDER BY
❌ WHERE
❌ GROUP BY
❌ HAVING

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2023-07-06T16:39:35Z

Codecov Report

Merging #282 (41d7bf5) into integ-array-support-poc (e186cf7) will decrease coverage by 0.69%.
The diff coverage is 49.36%.

@@                      Coverage Diff                      @@
##             integ-array-support-poc     #282      +/-   ##
=============================================================
- Coverage                      99.98%   99.30%   -0.69%     
- Complexity                      2624     2637      +13     
=============================================================
  Files                            205      206       +1     
  Lines                           5955     6025      +70     
  Branches                         378      392      +14     
=============================================================
+ Hits                            5954     5983      +29     
- Misses                             1       38      +37     
- Partials                           0        4       +4

Flag	Coverage Δ
sql-engine	`99.30% <49.36%> (-0.69%)`	⬇️

Impacted Files	Coverage Δ
...ensearch/sql/expression/ExpressionNodeVisitor.java	`95.45% <0.00%> (-4.55%)`	⬇️
...org/opensearch/sql/analysis/QualifierAnalyzer.java	`69.23% <11.11%> (-30.77%)`	⬇️
...rg/opensearch/sql/analysis/ExpressionAnalyzer.java	`92.09% <36.36%> (-7.91%)`	⬇️
...earch/sql/expression/ArrayReferenceExpression.java	`43.33% <43.33%> (ø)`
...nsearch/sql/analysis/SelectExpressionAnalyzer.java	`100.00% <100.00%> (ø)`
...opensearch/sql/expression/HighlightExpression.java	`96.29% <100.00%> (-3.71%)`	⬇️
...opensearch/sql/expression/ReferenceExpression.java	`100.00% <100.00%> (ø)`
...h/sql/expression/function/OpenSearchFunctions.java	`100.00% <100.00%> (ø)`
...nsearch/sql/storage/bindingtuple/BindingTuple.java	`100.00% <100.00%> (ø)`

GumpacG · 2023-07-06T18:06:44Z

Regarding future support for features like geo_shape, consider cases like:
Data:

{
    "borders": [
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        },
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        },
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        },
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        }
    ]  
}

Accessing the coordinates: borders[0].border.coordinates
Accessing the starting point of a border: borders[0].border.coordinates[0]

Note: this example may be an edge case and may not represent typical real world use cases.

Yury-Fridlyand

1.

Please, add more tests:

Array of objects
Array in object
Array in nested

Consider different tricky combinations, like this for the first test:

[
  {
    "id" : 1,
    "name" : "one"
  },
  {
    "id" : 2,
    "name" : "two"
  },
  null
]

2.

What if I do?

select int0[0] from calcs

Please, add a test for this.

3.

Please add a design doc.

4.

Please, fix column headers/names.

Yury-Fridlyand · 2023-07-06T23:13:17Z

integ-test/src/test/java/org/opensearch/sql/sql/NestedIT.java

@@ -254,7 +254,7 @@ public void nested_function_and_field_with_order_by_clause() {
        rows("a", 4),
        rows("b", 2),
        rows("c", 3),
-        rows("zz", new JSONArray(List.of(3, 4))));
+        rows("zz", 3));


Isn't this a breaking change? Please compare this with the nested behavior in V1 and in V2 @ 2.8.

forestmvey (Jul 12, 2023): This is V1 behaviour, but my updates make this consistent with V2 behaviour. I believe this should be considered a bug rather than a breaking change, because a select on this value without the nested function would return 3. It is a very odd case, but the new change aligns with what is expected in V2.

Yury-Fridlyand · 2023-07-06T23:13:53Z

sql/src/main/antlr/OpenSearchSQLParser.g4

@@ -814,6 +815,11 @@ columnName
    : qualifiedName
    ;

+arrayColumnName
+    : qualifiedName LT_SQR_PRTHS COLON_SYMB RT_SQR_PRTHS
+    | qualifiedName LT_SQR_PRTHS decimalLiteral RT_SQR_PRTHS


Could decimalLiteral be negative?

Yury-Fridlyand · 2023-07-06T23:14:45Z

sql/src/main/java/org/opensearch/sql/sql/parser/AstExpressionBuilder.java

+  @Override
+  public UnresolvedExpression visitArrayColumnName(ArrayColumnNameContext ctx) {
+    UnresolvedExpression qualifiedName = visit(ctx.qualifiedName());
+    if (ctx.decimalLiteral() == null) {


Make decimalLiteral a named (labeled) value in the ANTLR grammar so that you can reference it by name here.

GumpacG · 2023-07-06T23:57:32Z

core/src/main/java/org/opensearch/sql/analysis/SelectExpressionAnalyzer.java

@@ -105,8 +107,17 @@ public List<NamedExpression> visitAllFields(AllFields node,
                                              AnalysisContext context) {
    TypeEnvironment environment = context.peek();
    Map<String, ExprType> lookupAllFields = environment.lookupAllFields(Namespace.FIELD_NAME);
-    return lookupAllFields.entrySet().stream().map(entry -> DSL.named(entry.getKey(),
-        new ReferenceExpression(entry.getKey(), entry.getValue()))).collect(Collectors.toList());
+    return lookupAllFields.entrySet().stream().map(entry ->


Could you add an example in the PR description for SELECT *? Will this return the full array for a column when SELECT * is queried? Since SELECT array still returns just the first value, is this still the case for SELECT * as well?

forestmvey (Jul 12, 2023): Yes, the functionality for SELECT * will stay the same; otherwise it would be a breaking change.

Yury-Fridlyand · 2023-07-12T19:54:54Z

common/src/main/java/org/opensearch/sql/common/utils/StringUtils.java

@@ -107,6 +107,6 @@ private static boolean isQuoted(String text, String mark) {
  }

  public static String removeParenthesis(String qualifier) {
-    return qualifier.replaceAll("\\[.+\\]", "");
+    return qualifier.replaceAll("\\[\\d+\\]", "");


Is it a regex? Do you want to match : or \d+:\d+ (or even \d*:\d*)?
Add a comment for this function please.
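
To make the question concrete, the sketch below compares the two patterns from the diff with a third candidate that also accepts `:` and range forms. The third pattern and the names `BracketStripSketch`, `stripGreedy`, `stripDigits`, and `stripIndexOrRange` are hypothetical and only illustrate the difference; they are not part of the PR.

```java
/** Illustrative only: how different bracket-stripping patterns behave. */
class BracketStripSketch {

  /** Original pattern: greedy, so it swallows everything between the first '[' and the last ']'. */
  static String stripGreedy(String qualifier) {
    return qualifier.replaceAll("\\[.+\\]", "");
  }

  /** Pattern from this diff: removes only numeric indexes such as [0]. */
  static String stripDigits(String qualifier) {
    return qualifier.replaceAll("\\[\\d+\\]", "");
  }

  /** Hypothetical candidate: numeric indexes plus [:] and range forms such as [1:3]. */
  static String stripIndexOrRange(String qualifier) {
    return qualifier.replaceAll("\\[(\\d+|\\d*:\\d*)\\]", "");
  }

  public static void main(String[] args) {
    System.out.println(stripGreedy("array[1].id[1]"));       // array
    System.out.println(stripDigits("array[1].id[1]"));       // array.id
    System.out.println(stripDigits("array[:]"));             // array[:]
    System.out.println(stripIndexOrRange("array[:]"));       // array
    System.out.println(stripIndexOrRange("array[1:3]"));     // array
    System.out.println(stripIndexOrRange("array[1].id[1]")); // array.id
  }
}
```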

forestmvey requested a review from a team July 6, 2023 16:39

Yury-Fridlyand reviewed Jul 6, 2023

GumpacG reviewed Jul 7, 2023

Yury-Fridlyand reviewed Jul 12, 2023

forestmvey added 5 commits July 13, 2023 09:53

Initial implementation for array support.

1e3d879

Signed-off-by: forestmvey <[email protected]>

Adding support for returning arrays as a whole.

05048cc

Signed-off-by: forestmvey <[email protected]>

Adding comment.

95e2c82

Signed-off-by: forestmvey <[email protected]>

Added indexing for each part of array column name.

579d44a

Signed-off-by: forestmvey <[email protected]>

Base array functionality working with IT tests.

41d7bf5

Signed-off-by: forestmvey <[email protected]>

forestmvey force-pushed the dev-array-support-poc branch from f2df802 to 41d7bf5 Compare July 13, 2023 16:54
