-
Notifications
You must be signed in to change notification settings - Fork 5
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Update doc dependency * Update docs content * Update docs content
- Loading branch information
Showing
19 changed files
with
672 additions
and
428 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
156 changes: 156 additions & 0 deletions
156
docs/docs/developer-guide/spark-mode/data-sources/sdmx.mdx
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,156 @@ | ||
--- | ||
id: sdmx | ||
title: Spark mode - SDMX source | ||
sidebar_label: SDMX | ||
slug: /developer-guide/spark-mode/data-sources/sdmx | ||
custom_edit_url: null | ||
--- | ||
|
||
`vtl-sdmx` module exposes the following utilities. | ||
|
||
### `buildStructureFromSDMX3` utility | ||
|
||
`TrevasSDMXUtils.buildStructureFromSDMX3` allows to obtain a Trevas DataStructure. | ||
|
||
Providing corresponding data, you can build a Trevas Dataset. | ||
|
||
```java | ||
Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "STRUCT_ID"); | ||
|
||
SparkDataset ds = new SparkDataset( | ||
spark.read() | ||
.option("header", "true") | ||
.option("delimiter", ";") | ||
.option("quote", "\"") | ||
.csv("path"), | ||
structure | ||
); | ||
``` | ||
|
||
### `SDMXVTLWorkflow` object | ||
|
||
The `SDMXVTLWorkflow` constructor takes 3 arguments: | ||
|
||
- a `ScriptEngine` (Trevas or another) | ||
- a `ReadableDataLocation` to handle an SDMX message | ||
- a map of names / Datasets | ||
|
||
```java | ||
SparkSession.builder() | ||
.appName("test") | ||
.master("local") | ||
.getOrCreate(); | ||
|
||
ScriptEngineManager mgr = new ScriptEngineManager(); | ||
ScriptEngine engine = mgr.getEngineByExtension("vtl"); | ||
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark"); | ||
|
||
ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml"); | ||
|
||
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of()); | ||
``` | ||
|
||
This object then allows you to activate the following 3 functions. | ||
|
||
### SDMXVTLWorkflow `run` function - Preview mode | ||
|
||
The `run` function can easily be called in a preview mode, without attached data. | ||
|
||
```java | ||
ScriptEngineManager mgr = new ScriptEngineManager(); | ||
ScriptEngine engine = mgr.getEngineByExtension("vtl"); | ||
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark"); | ||
|
||
ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml"); | ||
|
||
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of()); | ||
|
||
// instead of using TrevasSDMXUtils.buildStructureFromSDMX3 and data sources | ||
// to build Trevas Datasets, sdmxVtlWorkflow.getEmptyDatasets() | ||
// will handle SDMX message structures to produce Trevas Datasets | ||
// with metadata defined in this message, and adding empty data | ||
Map<String, Dataset> emptyDatasets = sdmxVtlWorkflow.getEmptyDatasets(); | ||
engine.getBindings(ScriptContext.ENGINE_SCOPE).putAll(emptyDatasets); | ||
|
||
Map<String, PersistentDataset> result = sdmxVtlWorkflow.run(); | ||
``` | ||
|
||
The preview mode allows to check the conformity of the SDMX file and the metadata of the output datasets. | ||
|
||
### SDMXVTLWorkflow `run` function | ||
|
||
Once an `SDMXVTLWorkflow` is built, it is easy to run the VTL validations and transformations defined in the SDMX file. | ||
|
||
```java | ||
Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "ds1"); | ||
|
||
SparkDataset ds1 = new SparkDataset( | ||
spark.read() | ||
.option("header", "true") | ||
.option("delimiter", ";") | ||
.option("quote", "\"") | ||
.csv("path/data.csv"), | ||
structure | ||
); | ||
|
||
ScriptEngineManager mgr = new ScriptEngineManager(); | ||
ScriptEngine engine = mgr.getEngineByExtension("vtl"); | ||
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark"); | ||
|
||
Map<String, Dataset> inputs = Map.of("ds1", ds1); | ||
|
||
ReadableDataLocation rdl = new ReadableDataLocationTmp("path/sdmx_file.xml"); | ||
|
||
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, inputs); | ||
|
||
Map<String, PersistentDataset> bindings = sdmxVtlWorkflow.run(); | ||
``` | ||
|
||
As a result, one will receive all the datasets defined as persistent in the `TransformationSchemes` definition. | ||
|
||
### SDMXVTLWorkflow `getTransformationsVTL` function | ||
|
||
Gets the VTL code corresponding to the SDMX TransformationSchemes definition. | ||
|
||
```java | ||
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of()); | ||
String vtl = sdmxVtlWorkflow.getTransformationsVTL(); | ||
``` | ||
|
||
### SDMXVTLWorkflow `getRulesetsVTL` function | ||
|
||
Gets the VTL code corresponding to the SDMX TransformationSchemes definition. | ||
|
||
```java | ||
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of()); | ||
String dprs = sdmxVtlWorkflow.getRulesetsVTL(); | ||
``` | ||
|
||
## Troubleshooting | ||
|
||
### Hadoop client | ||
|
||
The integration of `vtl-modules` with `hadoop-client` can cause dependency issues. | ||
|
||
It was noted that `com.fasterxml.woodstox.woodstox-core` is imported by `hadoop-client`, with an incompatible version for a `vtl-sdmx` sub-dependency. | ||
|
||
A way to fix this is to exclude `com.fasterxml.woodstox.woodstox-core` dependency from `hadoop-client` and import a newest version in your `pom.xml`: | ||
|
||
```xml | ||
<dependency> | ||
<groupId>org.apache.hadoop</groupId> | ||
<artifactId>hadoop-client</artifactId> | ||
<version>3.3.4</version> | ||
<exclusions> | ||
<exclusion> | ||
<groupId>com.fasterxml.woodstox</groupId> | ||
<artifactId>woodstox-core</artifactId> | ||
</exclusion> | ||
</exclusions> | ||
</dependency> | ||
<dependency> | ||
<groupId>com.fasterxml.woodstox</groupId> | ||
<artifactId>woodstox-core</artifactId> | ||
<version>6.5.1</version> | ||
</dependency> | ||
``` |
Oops, something went wrong.