
Feature: Update data access tooling to better support distributed querying of big data #475

Closed
8 tasks done
carter-cundiff opened this issue Nov 19, 2024 · 4 comments · Fixed by #485


carter-cundiff commented Nov 19, 2024

Description

Currently, data access uses a GraphQL Quarkus app for accessing data outside of your Spark pipeline. GraphQL is not optimized for querying large datasets stored in data lakes. For better performance when accessing data lake data, GraphQL should be replaced with a tool specifically designed for querying large data lakes (e.g., Trino).

DOD

  • Implement Trino as a deploy profile option for data access
    • Create a baseline Helm chart using the official Trino Helm chart as the parent
      • Include defaults that configure the chart to use the Hive connector
      • Include Helm chart unit tests
    • When enabled, generate a deploy resource that depends on the aiSSEMBLE Trino Helm chart
    • Update the Antora docs to detail the new data access option
      • The current data access page becomes a drop-down with GraphQL and Trino pages
  • Remove the GraphQL Antora docs and deprecate the Fermenter profiles
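The baseline chart and Hive connector defaults described above could look roughly like the following sketch; the parent chart version, repository URL, and metastore URI are assumptions for illustration, not the actual published values:

```yaml
# Chart.yaml (sketch) -- official Trino chart as the parent dependency
apiVersion: v2
name: aissemble-trino-chart
version: 1.11.0-SNAPSHOT
dependencies:
  - name: trino
    version: "0.x"            # assumed; pin to a tested release of the official chart
    repository: https://trinodb.github.io/charts
---
# values.yaml (sketch) -- default wiring of the Hive connector to a metastore
trino:
  catalogs:
    hive: |
      connector.name=hive
      hive.metastore.uri=thrift://hive-metastore:9083  # assumed in-cluster service name
```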

Test Strategy/Script

  • OTS Only:

    • Within the aiSSEMBLE repo, run the following and verify it builds successfully:
      mvn clean install -pl :foundation-mda,:aissemble-trino-chart -Dmaven.build.cache.skipCache
      
  • Create a downstream project:

mvn archetype:generate -U -DarchetypeGroupId=com.boozallen.aissemble \
  -DarchetypeArtifactId=foundation-archetype \
  -DarchetypeVersion=1.11.0-SNAPSHOT \
  -DgroupId=com.test \
  -DartifactId=test-475 \
  -DprojectGitUrl=test.url \
  -DprojectName=test-475 \
  && cd test-475
  • Add the attached SparkPipeline.json to the test-475-pipeline-models/src/main/resources/pipelines/ directory
  • Add the attached PersonDictionary.json to the test-475-pipeline-models/src/main/resources/dictionaries/ directory
  • Add the attached Person.json to the test-475-pipeline-models/src/main/resources/records/ directory
  • Run mvn clean install until all the manual actions are complete
  • Add the following execution to the test-475-deploy/pom.xml:
<execution>
    <id>trino</id>
    <phase>generate-sources</phase>
    <goals>
        <goal>generate-sources</goal>
    </goals>
    <configuration>
        <basePackage>com.test</basePackage>
        <profile>data-access-trino-deploy-v2</profile>
        <!-- The property variables below are passed to the Generation Context and utilized
                to customize the deployment artifacts. -->
        <propertyVariables>
            <appName>trino</appName>
        </propertyVariables>
    </configuration>
</execution>
  • Add the following to the test-475-pipelines/spark-pipeline/src/main/java/com/test/TestSyncStep.java:
+import java.util.List;
+import java.util.stream.Stream;
+import simple.test.record.Person;
+import simple.test.record.PersonSchema;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;

...

    @Override
    protected void executeStepImpl() {
-         // TODO: Add your business logic here for this step!
-         logger.error("Implement executeStepImpl(..) or remove this pipeline step!");
+        logger.info("Saving Person to table People");
+        Person person = new Person();
+        person.setName("John Smith");
+        person.setAge(50);
+        PersonSchema personSchema = new PersonSchema();
+        List<Row> rows = Stream.of(person).map(PersonSchema::asRow).toList();
+        Dataset<Row> dataset = sparkSession.createDataFrame(rows, personSchema.getStructType());
+        saveDataset(dataset, "People");
+        logger.info("Completed saving to table People");
    }
  • Run mvn clean install -Dmaven.build.cache.skipCache to get any remaining manual actions
  • OTS Only: The project will fail to build because the new Helm chart has not been published yet
    • Update the test-475-deploy/src/main/resources/apps/trino/Chart.yaml with the following:

      dependencies:
        - name: aissemble-trino-chart
          version: 1.11.0-SNAPSHOT
      -   repository: oci://ghcr.io/boozallen
      +   repository: file://../../../../../../../aissemble/extensions/extensions-helm/aissemble-trino-chart
    • Continue the build with mvn clean install -Dmaven.build.cache.skipCache -rf :test-475-deploy

  • Complete the manual actions and run tilt up
  • Once all the resources are ready in the Tilt UI, start the spark-pipeline resource
  • Verify you see the following log output in the pipeline:
INFO TestSyncStep: Completed saving to table People
  • Connect to Trino using the CLI: ./trino --server http://localhost:8084
  • Run the following command to query the data:
select * from hive.default.people;
  • Verify you get the following output:
    name    | age
------------+-----
 John Smith |  50
(1 row)

Query 20241122_143943_00000_c3nss, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.65 [1 rows, 14B] [0 rows/s, 5B/s]
  • tilt down
  • Remove the following from test-475-pipeline-models/src/main/resources/records/Person.json on lines 5-7:
    "dataAccess": {
        "enabled": "false"
    },
  • Build the project once with mvn clean install -Dmaven.build.cache.skipCache and complete the manual actions
  • Build the project once with mvn clean install and verify you see the following warnings about data-access deprecation:
/your/path/test-475/test-475-pipelines/test-475-data-access/pom.xml:
Data Access using GraphQL is deprecated, please see the latest documentation for details on using Trino for Data Access: https://boozallen.github.io/aissemble/aissemble/current/data-access-details.html

/your/path/test-475/test-475-docker/test-475-data-access-docker/pom.xml:
The profile 'aissemble-data-access-docker' is deprecated, please replace all references to it.

/your/path/test-475/test-475-deploy/pom.xml:
The profile 'data-access-deploy-v2' is deprecated, please replace all references to it.

References/Additional Context

@carter-cundiff carter-cundiff added the enhancement New feature or request label Nov 19, 2024
@carter-cundiff carter-cundiff self-assigned this Nov 19, 2024
@carter-cundiff carter-cundiff added this to the 1.10.0 milestone Nov 19, 2024
@carter-cundiff carter-cundiff changed the title Feature: Update data access tooling to better support querying outside of a pipeline and support Hive Metastore v4.0.0 Feature: Update data access tooling to better support querying outside of a pipeline Nov 19, 2024
@carter-cundiff carter-cundiff changed the title Feature: Update data access tooling to better support querying outside of a pipeline Feature: Update data access tooling to better support distributed querying of big data Nov 19, 2024
@ewilkins-csi ewilkins-csi modified the milestones: 1.10.0, 1.11.0 Nov 20, 2024
@carter-cundiff
Contributor Author

DOD with @ewilkins-csi, @csun-cpointe

@carter-cundiff
Contributor Author

OTS with @nartieri @ewilkins-csi

@nartieri
Collaborator

Visual confirmation of success.
[screenshot omitted]

@nartieri
Collaborator

Confirmed deprecation warnings were received. Testing has passed.
