[#228] Skeleton code to implement unified tabular processing #229

blcham · 2023-11-10T16:55:24Z

Implements #228

blcham · 2023-12-07T15:11:56Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/tabular/CSVReader.java

+    }
+
+    @Override
+    public Set<Row> getRows(TableSchema tableSchema,String sourceResourceUri, Mode outputMode){


It does not make sense to refactor this into a separate method of TabularReader!

Reason:
There is some nontrivial logic that references the specification. The same logic would then be duplicated in:

ExcelReader.getRows()
HtmlReader.getRows()

I have reverted this change; it is now done in TabularModule again, as it was before. It is done after getting the row statements so that the number of rows can be obtained without creating a new list reader.
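
A minimal sketch of the split discussed above, assuming a hypothetical `RawRowReader` interface (not the PR's actual `TabularReader` signature): each format-specific reader only yields raw cell values, while the specification-driven logic is written once on the module side. All names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: a reader only knows how to pull raw cell values out of its format.
interface RawRowReader {
    List<String> getHeader();
    List<List<String>> getRawRows();
}

// In-memory stand-in for a CSV-backed reader; an Excel or HTML reader
// would implement the same interface without repeating the shared logic.
final class InMemoryReader implements RawRowReader {
    private final List<List<String>> lines;

    InMemoryReader(List<List<String>> lines) {
        this.lines = lines;
    }

    public List<String> getHeader() {
        return lines.isEmpty() ? List.of() : lines.get(0);
    }

    public List<List<String>> getRawRows() {
        return lines.size() <= 1 ? List.of() : lines.subList(1, lines.size());
    }
}

final class SharedRowLogicSketch {
    // Format-independent logic: pairs each value with its column and numbers the rows.
    static List<String> describeRows(RawRowReader reader) {
        List<String> header = reader.getHeader();
        List<String> out = new ArrayList<>();
        int rowNumber = 1; // the header counts as row 1, so data rows start at 2
        for (List<String> row : reader.getRawRows()) {
            rowNumber++;
            for (int i = 0; i < header.size() && i < row.size(); i++) {
                out.add("row-" + rowNumber + " " + header.get(i) + "=" + row.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        RawRowReader reader = new InMemoryReader(List.of(
                List.of("aa", "bb"),
                List.of("1", "2")));
        describeRows(reader).forEach(System.out::println);
    }
}
```

With this shape, adding a new input format means implementing only the two reader methods.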

blcham · 2024-02-09T08:10:39Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

-            String[] header = listReader.getHeader(true); // skip the header (can't be used with CsvListReader)
+            TabularReader tabularReader = new CSVReader(listReader);
+
+            List<String> header = tabularReader.getHeader();


Check whether the tests cover skipHeader.

You know whether you need to read the header or not, so why read the header here?

The header is read here to check whether the input resource is empty.
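
As an illustration of that check, here is a small sketch that reads only the first line and treats a missing or blank header as an empty input. The class and method names are hypothetical, and the comma split is deliberately naive (no quoting support); the PR itself reads the header through its `TabularReader`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

final class EmptyInputCheckSketch {
    // Reads the first line as the header; an empty list signals that the resource has no content.
    static List<String> readHeader(Reader source) {
        try {
            BufferedReader in = new BufferedReader(source);
            String firstLine = in.readLine();
            if (firstLine == null || firstLine.isBlank()) {
                return List.of();
            }
            // Naive split for illustration only; real CSV parsing must handle quoted fields.
            return Arrays.asList(firstLine.split(",", -1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean isEmptyInput(Reader source) {
        return readHeader(source).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEmptyInput(new StringReader("")));
        System.out.println(isEmptyInput(new StringReader("aa,bb\n1,2")));
    }
}
```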

blcham · 2024-02-09T08:25:39Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

-            for (String columnTitle : header) {
-                String columnName = normalize(columnTitle);
-                boolean isDuplicate = !columnNames.add(columnName);
+            outputColumns = tabularReader.getOutputColumns(header);


Let's put it into a separate method TabularModule.setOutputColumns(tableSchema, header) and then access outputColumns using something like tableSchema.getColumns() ...

The open question is whether it should rather be tableSchema.setOutputColumns(header).

Going over the code, it seems to me that it makes sense to have tableSchema.setOutputColumns(header), but I am not sure about the name ... maybe addOutputColumns?

Watch out for the line tableSchema.adjustProperties(hasInputSchema, outputColumns, sourceResource.getUri()); ... it was not set on the tableSchema before.

This is now done in tableSchema; there are methods tableSchema.setOutputColumns(header,sourceResource.getUri(),dataPrefix,hasInputSchema); and tableSchema.getOutputColumns();
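
A rough sketch of what deriving output columns from a header could look like, based on the removed loop shown in the diff above (normalize the title, detect duplicates via `Set.add`). The simplified `Column` class, the two-argument signature, the normalization rule and the exception on duplicates are all assumptions for illustration; the real `TableSchema.setOutputColumns` takes more parameters and may behave differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OutputColumnsSketch {
    // Hypothetical, simplified column holder.
    static final class Column {
        final String title;
        final String name;
        final String propertyUri;

        Column(String title, String name, String propertyUri) {
            this.title = title;
            this.name = name;
            this.propertyUri = propertyUri;
        }
    }

    private List<Column> outputColumns = new ArrayList<>();

    // Assumption: normalization trims the title and replaces whitespace runs with '-'.
    static String normalize(String title) {
        return title.trim().replaceAll("\\s+", "-");
    }

    void setOutputColumns(List<String> header, String dataPrefix) {
        List<Column> columns = new ArrayList<>();
        Set<String> columnNames = new HashSet<>();
        for (String columnTitle : header) {
            String columnName = normalize(columnTitle);
            boolean isDuplicate = !columnNames.add(columnName);
            if (isDuplicate) {
                // Assumption: duplicates are rejected; the PR may resolve them differently.
                throw new IllegalArgumentException("Duplicate column name: " + columnName);
            }
            columns.add(new Column(columnTitle, columnName, dataPrefix + columnName));
        }
        outputColumns = columns; // replace the state only once the whole header is valid
    }

    List<Column> getOutputColumns() {
        return Collections.unmodifiableList(outputColumns);
    }

    public static void main(String[] args) {
        OutputColumnsSketch schema = new OutputColumnsSketch();
        schema.setOutputColumns(List.of("aa", "bb", "cc"), "http://onto.fel.cvut.cz/data/");
        for (Column c : schema.getOutputColumns()) {
            System.out.println(c.propertyUri);
        }
    }
}
```

Keeping the column list inside the schema object means later steps, such as the `adjustProperties` call mentioned above, can read it from one place.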

rodionnv · 2024-02-14T14:18:34Z

@blcham Please have a look at the current implementation of #228.
There are also two test failures with merged XLS and XLSX files. The reason is that the current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a check if(cell.getCellType() == CellType.BLANK)continue; However, it will ignore not only merged cells but also cells that are empty yet filled with color or that carry a table border, for example. If that is not acceptable, then I believe we will need to store, for every cell, whether it is merged or not, which will require more memory. Please share your opinion on this.
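
One possible way to limit the memory cost mentioned above is to keep only the merged rectangles rather than a flag per cell; Apache POI exposes them through `Sheet.getMergedRegions()` as a list of `CellRangeAddress`. The sketch below uses a hypothetical plain-Java `Region` class as a stand-in so that it runs without POI. It is an illustration of the idea, not the approach taken in this PR.

```java
import java.util.List;

final class MergedRegionSketch {
    // Hypothetical rectangle of merged cells, inclusive bounds, zero-based indices.
    static final class Region {
        final int firstRow;
        final int lastRow;
        final int firstCol;
        final int lastCol;

        Region(int firstRow, int lastRow, int firstCol, int lastCol) {
            this.firstRow = firstRow;
            this.lastRow = lastRow;
            this.firstCol = firstCol;
            this.lastCol = lastCol;
        }

        boolean contains(int row, int col) {
            return row >= firstRow && row <= lastRow && col >= firstCol && col <= lastCol;
        }

        boolean isAnchor(int row, int col) {
            return row == firstRow && col == firstCol;
        }
    }

    // True when the cell lies inside a merged region but is not its top-left anchor,
    // i.e. the cell carries no value of its own and can be skipped.
    static boolean isMergedAway(List<Region> regions, int row, int col) {
        for (Region region : regions) {
            if (region.contains(row, col) && !region.isAnchor(row, col)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Row 1 has columns 0..2 merged, like the "merged columns" row in the example.
        List<Region> regions = List.of(new Region(1, 1, 0, 2));
        System.out.println(isMergedAway(regions, 1, 0));
        System.out.println(isMergedAway(regions, 1, 1));
    }
}
```

Memory then grows with the number of merged regions instead of the number of cells, at the cost of a linear scan per lookup.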

blcham · 2024-03-02T22:06:17Z

Adding missing context:

Failing tests are in methods:
- executeWithSimpleTransformationMerged[Xls | Xlsx]()
The input file:

blcham · 2024-03-02T22:32:04Z

@rodionnv :

I believe the expected output is now correct.
I suggest removing bb from the example so that we also have an example of an empty cell ... could you run it on the main branch to find out how it is serialized to CSV?

rodionnv · 2024-03-03T13:43:38Z

@blcham
On main, it seems to be impossible to have empty cells. I changed input.xls so that it looks like this:

And I got this error:

Now, in the current implementation in this PR, empty cells are ignored in the same way both in excel and html files, like here:

Output:

<http://test-file#row-3>
        <http://onto.fel.cvut.cz/data/aa>
                "merged rows" ;
        <http://onto.fel.cvut.cz/data/cc>
                "ee" .

blcham · 2024-03-03T16:13:22Z

Now, in the current implementation in this PR, empty cells are ignored in the same way both in excel and html files, like here:

Yes, it makes sense like that, and for the first non-header row, we should generate:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

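The agreed behaviour (no statement for an empty cell) can be sketched as follows. The method name and the N-Triples-like string output are assumptions for illustration; the module itself builds RDF statements rather than strings, and this sketch does no literal escaping.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipEmptyCellsSketch {
    // Emits one statement per non-empty cell; null and empty values produce nothing.
    static List<String> rowStatements(String rowUri, List<String> propertyUris, List<String> values) {
        List<String> statements = new ArrayList<>();
        for (int i = 0; i < propertyUris.size() && i < values.size(); i++) {
            String value = values.get(i);
            if (value == null || value.isEmpty()) {
                continue; // empty cell: skip instead of emitting ""
            }
            statements.add("<" + rowUri + "> <" + propertyUris.get(i) + "> \"" + value + "\" .");
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> properties = List.of(
                "http://onto.fel.cvut.cz/data/aa",
                "http://onto.fel.cvut.cz/data/bb",
                "http://onto.fel.cvut.cz/data/cc");
        // First non-header row: only aa carries a value.
        rowStatements("http://test-file#row-2", properties, List.of("merged columns", "", ""))
                .forEach(System.out::println);
        // Next row: bb is empty, so only aa and cc are emitted.
        rowStatements("http://test-file#row-3", properties, List.of("merged rows", "", "ee"))
                .forEach(System.out::println);
    }
}
```
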
blcham · 2024-03-03T16:35:30Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/tabular/CSVReader.java

+            rowNumber++;
+
+            for (int i = 0; i < header.size(); i++) {
+                // 4.6.8.1


It seems we are mixing the processing logic with a concrete reader, aren't we? We can discuss.

blcham · 2024-03-03T16:38:02Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/model/TableSchema.java

@@ -73,6 +75,54 @@ public void setValueUrl(String valueUrl) {
        tabularModuleUtils.setVariable(this.valueUrl, valueUrl, value -> this.valueUrl = value, "valueUrl");
    }

+    private transient List<Column> outputColumns;
+
+    public void setOutputColumns(List<String> header,String sourceResourceUri,String dataPrefix,boolean hasInputSchema) throws UnsupportedEncodingException {


Please describe here what this method does; we will later use it as the documentation for the method.

blcham · 2024-03-03T16:41:57Z

s-pipes-modules/module-tabular/src/test/resources/examples/mergedCells/expected-output.ttl

Why did we have to change the file completely? I cannot really see the differences between the previous version and the new one.

You could just remove some values from the previous version, and it would show the history in a nice way, right?

BTW, do we have an option to use concrete URLs instead of blank nodes?

blcham · 2024-03-05T18:49:45Z

There is only one way to continue with this: make small PRs that can be merged immediately without breaking the existing implementation. Each should be easy to review (at most 15 minutes for me).


blcham force-pushed the 228-unify-tabular-processing branch 2 times, most recently from a308b3a to 9730de4, on August 14, 2024 14:18

blcham and others added 17 commits December 4, 2024 09:46

b38e186  [#228] Skeleton code to implement unified tabular processing
         Not complete -- tests are failing here.
05a1086  [Fix] Fix failing tests.
830fe5b  [Upd] Refactor creating output columns to the TabularReader
6f5cfe9  [Upd] Refactor getting table.rows columns to the TabularReader
956a1de  [Upd] Refactor getting rows and rowStatements to the TabularReader from TabularModule
a898b85  [Upd] Rollback the unwanted refactoring of the getRows() method into tabular reader
d359cf9  [Upd] Bring back getting output columns back to the TabularModule from TabularReader because it doesn't require reading the actual file.
de77d90  [Upd] Refactor setOuputColumns() to the TableSchema.
9fc4458  [Fix] Handle situation when the input file is empty.
f38094b  [Upd] Restructuring expected-output.ttl for merged cells to make it more readable.
eaada40  [Upd] Initial implementation of the ExcelReader.
6cb763c  [Upd] Initial implementation of the HtmlReader.
6cf60df  [Upd] Eliminate usage of the TSV convertors.
fa38bd0  [Upd] Refactor TabularModule.
00e63af  [Upd] Add blank cell to the tests.
b9b01f9  [Fix] Fix failing xls,xlsx tests.
7c1ea36  [Fix] Exclude empty cells in HTML files similar to Excel handling.

blcham force-pushed the 228-unify-tabular-processing branch from 9730de4 to 7c1ea36 on December 4, 2024 08:47
