
Feature: Update data access tooling to better support distributed querying of big data #475

Closed
8 tasks done
carter-cundiff opened this issue Nov 19, 2024 · 4 comments · Fixed by #485


carter-cundiff commented Nov 19, 2024

Description

Currently, data access uses a GraphQL Quarkus app for accessing data outside of your Spark pipeline. GraphQL is not optimized for querying large datasets stored in data lakes. For better performance when accessing data lake data, GraphQL should be replaced with a tool specifically designed for querying large data lakes (e.g., Trino).

DOD

  • Implement Trino as a deploy profile option for data access
    • Create a baseline Helm chart using the official Trino Helm chart as the parent
      • Include defaults that configure the chart to use the Hive connector
      • Include Helm chart unit tests
    • When enabled, generate a deploy resource that depends on the aiSSEMBLE Trino Helm chart
    • Update the Antora docs to detail the new data access option
      • The current data access page becomes a drop-down with GraphQL and Trino pages
  • Remove the GraphQL Antora docs and deprecate the Fermenter profiles
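The baseline chart and Hive connector defaults described above could look roughly like the following sketch; the parent chart version, repository URL, and metastore URI are assumptions for illustration, not the actual published values:

```yaml
# Chart.yaml (sketch) -- official Trino chart as the parent dependency
apiVersion: v2
name: aissemble-trino-chart
version: 1.11.0-SNAPSHOT
dependencies:
  - name: trino
    version: "0.x"            # assumed; pin to a tested release of the official chart
    repository: https://trinodb.github.io/charts
---
# values.yaml (sketch) -- default wiring of the Hive connector to a metastore
trino:
  catalogs:
    hive: |
      connector.name=hive
      hive.metastore.uri=thrift://hive-metastore:9083  # assumed in-cluster service name
```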

Test Strategy/Script

  • OTS Only:

    • Within the aiSSEMBLE repo, run the following and verify it builds successfully:
      mvn clean install -pl :foundation-mda,:aissemble-trino-chart -Dmaven.build.cache.skipCache
      
  • Create a downstream project:

mvn archetype:generate -U -DarchetypeGroupId=com.boozallen.aissemble \
  -DarchetypeArtifactId=foundation-archetype \
  -DarchetypeVersion=1.11.0-SNAPSHOT \
  -DgroupId=com.test \
  -DartifactId=test-475 \
  -DprojectGitUrl=test.url \
  -DprojectName=test-475 \
  && cd test-475
  • Add the attached SparkPipeline.json to the test-475-pipeline-models/src/main/resources/pipelines/ directory
  • Add the attached PersonDictionary.json to the test-475-pipeline-models/src/main/resources/dictionaries/ directory
  • Add the attached Person.json to the test-475-pipeline-models/src/main/resources/records/ directory
  • Run mvn clean install until all the manual actions are complete
  • Add the following execution to the test-475-deploy/pom.xml:
<execution>
    <id>trino</id>
    <phase>generate-sources</phase>
    <goals>
        <goal>generate-sources</goal>
    </goals>
    <configuration>
        <basePackage>com.test</basePackage>
        <profile>data-access-trino-deploy-v2</profile>
        <!-- The property variables below are passed to the Generation Context and utilized
                to customize the deployment artifacts. -->
        <propertyVariables>
            <appName>trino</appName>
        </propertyVariables>
    </configuration>
</execution>
  • Add the following to the test-475-pipelines/spark-pipeline/src/main/java/com/test/TestSyncStep.java:
+import java.util.List;
+import java.util.stream.Stream;
+import simple.test.record.Person;
+import simple.test.record.PersonSchema;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;

...

    @Override
    protected void executeStepImpl() {
-         // TODO: Add your business logic here for this step!
-         logger.error("Implement executeStepImpl(..) or remove this pipeline step!");
+        logger.info("Saving Person to table People");
+        Person person = new Person();
+        person.setName("John Smith");
+        person.setAge(50);
+        PersonSchema personSchema = new PersonSchema();
+        List<Row> rows = Stream.of(person).map(PersonSchema::asRow).toList();
+        Dataset<Row> dataset = sparkSession.createDataFrame(rows, personSchema.getStructType());
+        saveDataset(dataset, "People");
+        logger.info("Completed saving to table People");
    }
  • Run mvn clean install -Dmaven.build.cache.skipCache to get any remaining manual actions
  • OTS Only: The project will fail to build because the new Helm chart has not been published yet
    • Update the test-475-deploy/src/main/resources/apps/trino/Chart.yaml with the following:

      dependencies:
        - name: aissemble-trino-chart
          version: 1.11.0-SNAPSHOT
      -   repository: oci://ghcr.io/boozallen
      +   repository: file://../../../../../../../aissemble/extensions/extensions-helm/aissemble-trino-chart
    • Continue the build with mvn clean install -Dmaven.build.cache.skipCache -rf :test-475-deploy

  • Complete the manual actions and run tilt up
  • Once all the resources are ready in the Tilt UI, start the spark-pipeline resource
  • Verify you see the following log output in the pipeline:
INFO TestSyncStep: Completed saving to table People
  • Connect to Trino using the CLI: ./trino --server http://localhost:8084
  • Run the following command to query the data:
select * from hive.default.people;
  • Verify you get the following output:
    name    | age
------------+-----
 John Smith |  50
(1 row)

Query 20241122_143943_00000_c3nss, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.65 [1 rows, 14B] [0 rows/s, 5B/s]
  • tilt down
  • Remove the following from test-475-pipeline-models/src/main/resources/records/Person.json on lines 5-7:
    "dataAccess": {
        "enabled": "false"
    },
  • Build the project once with mvn clean install -Dmaven.build.cache.skipCache and complete the manual actions
  • Build the project once with mvn clean install and verify you see the following warnings about data-access deprecation:
/your/path/test-475/test-475-pipelines/test-475-data-access/pom.xml:
Data Access using GraphQL is deprecated, please see the latest documentation for details on using Trino for Data Access: https://boozallen.github.io/aissemble/aissemble/current/data-access-details.html

/your/path/test-475/test-475-docker/test-475-data-access-docker/pom.xml:
The profile 'aissemble-data-access-docker' is deprecated, please replace all references to it.

/your/path/test-475/test-475-deploy/pom.xml:
The profile 'data-access-deploy-v2' is deprecated, please replace all references to it.

References/Additional Context

@carter-cundiff carter-cundiff added the enhancement New feature or request label Nov 19, 2024
@carter-cundiff carter-cundiff self-assigned this Nov 19, 2024
@carter-cundiff carter-cundiff added this to the 1.10.0 milestone Nov 19, 2024
@carter-cundiff carter-cundiff changed the title Feature: Update data access tooling to better support querying outside of a pipeline and support Hive Metastore v4.0.0 Feature: Update data access tooling to better support querying outside of a pipeline Nov 19, 2024
@carter-cundiff carter-cundiff changed the title Feature: Update data access tooling to better support querying outside of a pipeline Feature: Update data access tooling to better support distributed querying of big data Nov 19, 2024
@ewilkins-csi ewilkins-csi modified the milestones: 1.10.0, 1.11.0 Nov 20, 2024
@carter-cundiff
Contributor Author

DOD with @ewilkins-csi, @csun-cpointe

@carter-cundiff
Contributor Author

OTS with @nartieri @ewilkins-csi

@nartieri
Collaborator

Visual confirmation of success.
[screenshot omitted]

@nartieri
Collaborator

Confirmed deprecation warnings were received. Testing has passed.
