Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#228] Skeleton code to implement unified tabular processing #229

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

blcham
Copy link
Contributor

@blcham blcham commented Nov 10, 2023

Implements #228

}

@Override
public Set<Row> getRows(TableSchema tableSchema,String sourceResourceUri, Mode outputMode){
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not make sense to refactor to separate method of TabularReader !!!

Reason:
There is some nontrivial logic that references specification. The same logic will be then duplicated in:

ExcelReader.getRows()
HtmlReader.getRows()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted this change, now this is done in TabularModule, as it was before. It is done after getting Row Statements in order to be possible to get the number of rows without creating new list reader.

String[] header = listReader.getHeader(true); // skip the header (can't be used with CsvListReader)
TabularReader tabularReader = new CSVReader(listReader);

List<String> header = tabularReader.getHeader();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if we have test support for skipHeader in tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you know if you need to read header or not so why reading the header here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Header is readed here to check if the input resource is empty.

for (String columnTitle : header) {
String columnName = normalize(columnTitle);
boolean isDuplicate = !columnNames.add(columnName);
outputColumns = tabularReader.getOutputColumns(header);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put it into separate method TabularModule.setOutputColumns(tableSchema, header) and then access outputColumns using something like tableSchema.getColumns() ...

the research question is whether it should not be rather tableSchema.setOuputColums(header).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going over code it seems to me that it makes sense to have tableSchema.setOuputColums(header) but not sure about name ... maybe addOuptutColumns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatch out line tableSchema.adjustProperties(hasInputSchema, outputColumns, sourceResource.getUri()); ... it was not set to the tableSchema before

Copy link
Contributor

@rodionnv rodionnv Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this is done in tableSchema, there are methods tableSchema.setOutputColumns(header,sourceResource.getUri(),dataPrefix,hasInputSchema); and tableSchema.getOutputColumns();

@rodionnv
Copy link
Contributor

@blcham Please have a look at the current implementation of the #228 .
There are also two test fails with merged XLS and XLSX files. The reason is that current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a control if(cell.getCellType() == CellType.BLANK)continue; But, it will ignore not only merged cells, but also cells that are empty, but filled with color, or contains table border for example. If it's not acceptable, then I believe it will require to store the information about every cell if it's merged or not, which will require more memory. Please share your opinion about this.

@blcham
Copy link
Contributor Author

blcham commented Mar 2, 2024

Adding missing context:

  • Failing tests are in methods:

    • executeWithSimpleTransformationMerged[Xls | Xlsx]()
  • The input file:
    image

@blcham
Copy link
Contributor Author

blcham commented Mar 2, 2024

@blcham Please have a look at the current implementation of the #228 . There are also two test fails with merged XLS and XLSX files. The reason is that current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a control if(cell.getCellType() == CellType.BLANK)continue; But, it will ignore not only merged cells, but also cells that are empty, but filled with color, or contains table border for example. If it's not acceptable, then I believe it will require to store the information about every cell if it's merged or not, which will require more memory. Please share your opinion about this.

@rodionnv :

  1. I believe the expected output now is correct
  2. i suggest removing bb from the example so we have also an example of an empty cell ... could you run it on main branch to find out how it is serialized to csv?

@rodionnv
Copy link
Contributor

rodionnv commented Mar 3, 2024

@blcham Please have a look at the current implementation of the #228 . There are also two test fails with merged XLS and XLSX files. The reason is that current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a control if(cell.getCellType() == CellType.BLANK)continue; But, it will ignore not only merged cells, but also cells that are empty, but filled with color, or contains table border for example. If it's not acceptable, then I believe it will require to store the information about every cell if it's merged or not, which will require more memory. Please share your opinion about this.

@rodionnv :

  1. I believe the expected output now is correct
  2. i suggest removing bb from the example so we have also an example of an empty cell ... could you run it on main branch to find out how it is serialized to csv?

@blcham
In main it seems that it's impossible to have empty cells. I have changed input.xls so it looks like this
image
And got this error:
image

Now, in the current implementation in this PR, empty cells are ignored in the same way both in excel and html files, like here:
image
Output:

<http://test-file#row-3>
        <http://onto.fel.cvut.cz/data/aa>
                "merged rows" ;
        <http://onto.fel.cvut.cz/data/cc>
                "ee" .

@blcham
Copy link
Contributor Author

blcham commented Mar 3, 2024

Now, in the current implementation in this PR, empty cells are ignored in the same way both in excel and html files, like here:

Yes, it makes sense like that, and for the first non-header row, we should generate:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

rowNumber++;

for (int i = 0; i < header.size(); i++) {
// 4.6.8.1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are mixing logic with concrete reader, aren't we? we can discuss.

@@ -73,6 +75,54 @@ public void setValueUrl(String valueUrl) {
tabularModuleUtils.setVariable(this.valueUrl, valueUrl, value -> this.valueUrl = value, "valueUrl");
}

private transient List<Column> outputColumns;

public void setOutputColumns(List<String> header,String sourceResourceUri,String dataPrefix,boolean hasInputSchema) throws UnsupportedEncodingException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe here what this method does, we will put it as documentation for the method for later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we have to change completely the file, i cannot really see the differences between previous version and the new one.

You could just remove some values from the previous version and it would show the history in a nice way right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do we have an option to use concrete URL instead of blank nodes?

@blcham
Copy link
Contributor Author

blcham commented Mar 5, 2024

There is only way to continue on this. Make small PRs that can be merged immediately without breaking existing implementation. It should be easy to review (at most 15 mins for me).

@blcham blcham force-pushed the 228-unify-tabular-processing branch 2 times, most recently from a308b3a to 9730de4 Compare August 14, 2024 14:18
@blcham blcham force-pushed the 228-unify-tabular-processing branch from 9730de4 to 7c1ea36 Compare December 4, 2024 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants