-
Notifications
You must be signed in to change notification settings - Fork 151
dev/add datatypes to excel columns #230
base: master
Are you sure you want to change the base?
dev/add datatypes to excel columns #230
Conversation
added eager datatype checker for first 1000 rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next to the inline comments, can you also add some test cases to DefaultSpreadsheetReaderDelegateTest
, which test the new functionality?
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
return table; | ||
} | ||
|
||
private void setColumnType(Iterator<Row> data, int rowLength, ColumnType[] columnTypes) { | ||
while (data.hasNext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially you're iterating over the complete data set now. I would expect you to only look at the first line (or maybe the first few lines of data) to get the column type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a query to look up only the first 1000records, I see that that got removed when getting DataRow types instead of the "normal" Row type.
I've added a countdown from 1000 in the while-loop
int eagerness = 1000;
...
...
while(data.hasnext() && eagerness-- > 0) {
...
...
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
It would have been nice to have a small discussion about this change before doing the PR because I feel like there's maybe a few things I would have liked to do differently:
|
Oh, and column type detection should be optional. We don't want to break existing users and we also don't want it to potentially break while reading beyond the 1000th first records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside of inline comments, please also add unit tests for the added functionality.
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
if (currentRow.getLastCellNum() == 0) { | ||
continue; | ||
} | ||
if (currentRow.getCell(index) == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change the logic a bit around here. I would propose you extract lines 189 through 218 into a separate method which returns the ColumnType
of a cell in a row. And then move the logic from lines 225 through 231 into this method deciding whether or not to assign that value to the columnTypes
array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted the method: getColumnTypeFromRow(final ColumnType columnType, final Row currentRow, int index)
from lines 189 through 218.
I haven't moved the logic of
checkColumnTypes(final ColumnType expecetedColumnType, ColumnType columnType)
to
getColumnTypeFromRow(final ColumnType columnType, final Row currentRow, int index)
Cause it gets used multiple times.
I'm not sure how you mean that this is general, because, yes it is general because all other data stores which support data types already have their own mechanism for detecting column types, but they're all implemented in different manners, because each data store provides a different manner to get the data types. For all these data stores it's default behavior to determine the column types, so why not for Excel too?
I completely agree with these three last remarks. I'm also not sure if scanning a 1000 rows in an Excel file may be too much, because of possible performance issues. |
Maybe just some "column type detector" reusable class that you could use as a sort of builder to add data type observations to. Basically just components that we could derive a general pattern from. And potentially have the |
Wishful thinking code:
|
That thing could conceivably then also detect whether something is nullable etc. |
I see what you mean now. I think the code here can be organized in a manner so that the column type detection is implemented on a separate class. I personally don't think it should be generalized right away in the context of this PR, but more likely in a separate PR, because I'm afraid otherwise the scope of this PR will become a bit too broad. |
Yes that's a fine approach. |
Maybe we can take a percentage of the file row count? Like check 10% with a max cap of 1000 rows. |
I don't think it's a case where we can set a default that our users will be happy with. I think 1000 is an OK default for when the feature is enabled. I think you should add the number into |
* I do like the change from VARCHAR to STRING. For the sake of separating concerns I would do that in a separate PR though. - changed STRING back to VARCHAR * And potentially have the 1000 constant in here incapsulated - Added EAGER_READ to ExcelConfiguration including boolean validateColumnTypes
* I do like the change from VARCHAR to STRING. For the sake of separating concerns I would do that in a separate PR though. - changed STRING back to VARCHAR * And potentially have the 1000 constant in here incapsulated - Added EAGER_READ to ExcelConfiguration including boolean validateColumnTypes
corrected a few tests cause these where breaking on the datatypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @kaspersorensen mentioned earlier, when the data type detection is enabled, you want to use it when writing to an Excel sheet, so for example in the ExcelInsertBuilder
, you probably want to check if the class of the value that is inserted matches the ColumnType
for the cell its inserted into.
excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java
Outdated
Show resolved
Hide resolved
if (currentRow.getLastCellNum() == 0) { | ||
continue; | ||
} | ||
columnTypes[index] = getColumnTypeFromRow(columnTypes[index], currentRow, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it better if you don't pass the current value of the columnType at the current index into the getColumnTypeFromRow
method. I would move the logic from the checkColumnType(ColumnType, ColumnType) here, so that method can be removed and the
getColumnTypeFromRow` just returns the ColumnType it determines for its inspected cell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a comment about this earlier on which I didn't get an answer. The logic in checkColumnType(expected, columnType)
gets used multiple times in this method so why not reuse the code in a seperate method like now?
excel/src/test/java/org/apache/metamodel/excel/ExcelDataContextTest.java
Outdated
Show resolved
Hide resolved
excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java
Outdated
Show resolved
Hide resolved
switching devices
switching devices and apearantly autosave was off -_-
…om/SociopathicPixel/metamodel into dev/add-datatypes-to-excel-columns # Conflicts: # excel/src/test/java/org/apache/metamodel/excel/ExcelDataContextTest.java
coudn't find any finals
final Column column = new MutableColumn(columnNamingSession.getNextColumnName(namingContext), | ||
ColumnType.STRING, table, j, true); | ||
final Column column; | ||
if (_configuration.isDetectColumnTypes()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do this check inside the getColumnTypes
method, because you only want to execute the logic which scans a number of rows in the Excel sheet for data types when this is activated, otherwise you'll only lose performance for something which you don't use. In case isDetectColumnTypes
returns false, just have the getColumnTypes
method return an array filled with ColumnType.STRING
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean the whole part? (lines 135 - 185).
Also appearantly I'm not in beta mode on this repo which is a bummer for selecting code in review comments.
} | ||
|
||
ColumnType columnType = columnTypes[index]; | ||
ColumnType expecetedColumnType = getColumnTypeFromRow(currentRow, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this variable to expectedColumnType
and make both this and the columnType
variable final
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done with next commit
if (currentRow.getCell(index) == null) { | ||
return ColumnType.STRING; | ||
} else { | ||
CellType cellType = currentRow.getCell(index).getCellType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this variable final
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the currentRow.getCell(index).getCellType()
into the switch statement with the next commit
public int getEagerness() { | ||
return eagerReader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this method and the underlying field to something different? When I see "eager" in combination with "reader", I interpret it as the opposite of "lazy reading", which is not what this method is about. I would rather see something like getNumberOfLinesToScan
, getScanLines
or something even better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done with next commit
@@ -69,12 +69,33 @@ public void execute() throws MetaModelException { | |||
private List<Row> updateRows(List<Row> rows) { | |||
for (ListIterator<Row> it = rows.listIterator(); it.hasNext();) { | |||
final Row original = (Row) it.next(); | |||
validateUpdateType(original); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it from the class with next commit, I think it was to prevent the insertion of for example a String value into a Column which only contains Integers and therefore has the columnType of Integer.
@@ -149,8 +151,33 @@ protected CellStyle fetch() { | |||
cell.setCellStyle(cellStyle.get()); | |||
} | |||
} | |||
validateUpdateType(row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the value is of adding this here. I think that for now it may be best to skip the "validation" part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see: ExcelDataContextTest.testUpdateDifferentDataTypes
It looks if the Column has a specific ColumnType and if so it will validate the given row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SociopathicPixel I'm sorry, but I still don't get why this has been added here.
Outside of the comments the (Excel) tests currently fail, that should be addressed. |
I see indeed that I still had one change staged which wasn't commited (validateColumnType vs detectColumnType) |
@SociopathicPixel Can you also add a unit test for an .xlsx Excel file, because MetaModel uses the |
@SociopathicPixel The values in a Row of an Excel DataSet are all still |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned before, the values in a Row of an Excel DataSet are all still String objects. They don't use the detected column type to return an object of the expected type. This is implemented in the ExcelUtils#createRow(Workbook, Row, DataSetHeader) method.
final ColumnType[] columnTypes = new ColumnType[rowLength]; | ||
if (_configuration.isDetectColumnTypes()) { | ||
|
||
int eagerness = _configuration.getEagerness(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change eagerness
to something like numberOfLinesToScan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
|
||
public ExcelConfiguration() { | ||
this(DEFAULT_COLUMN_NAME_LINE, true, false); | ||
} | ||
|
||
public ExcelConfiguration(int columnNameLineNumber, boolean skipEmptyLines, boolean skipEmptyColumns) { | ||
this(columnNameLineNumber, null, skipEmptyLines, skipEmptyColumns); | ||
this(columnNameLineNumber, null, skipEmptyLines, skipEmptyColumns, false, 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if the default value of a "1000" was in a constant, and maybe you want to set the default to zero in case you pass "false" as an argument for the detectColumnTypes
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
} | ||
|
||
public ExcelConfiguration(int columnNameLineNumber, ColumnNamingStrategy columnNamingStrategy, | ||
boolean skipEmptyLines, boolean skipEmptyColumns) { | ||
boolean skipEmptyLines, boolean skipEmptyColumns, boolean detectColumnTypes, int eagerness) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename eagerness
parameter to numberOfLinesToScan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
@@ -38,25 +38,38 @@ | |||
public static final int NO_COLUMN_NAME_LINE = 0; | |||
public static final int DEFAULT_COLUMN_NAME_LINE = 1; | |||
|
|||
private final int getNumberOfLinesToScan; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename getNumberOfLinesToScan
parameter to numberOfLinesToScan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
+ detectColumnTypes + "]"; | ||
} | ||
|
||
public int getEagerness() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename getEagerness
to getNumberOfLinesToScan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
@Override | ||
protected void decorateIdentity(List<Object> identifiers) { | ||
identifiers.add(columnNameLineNumber); | ||
identifiers.add(skipEmptyLines); | ||
identifiers.add(skipEmptyColumns); | ||
identifiers.add(detectColumnTypes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add numberOfLinesToScan
to identifiers
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
} | ||
|
||
@Override | ||
public String toString() { | ||
return "ExcelConfiguration[columnNameLineNumber=" | ||
+ columnNameLineNumber + ", skipEmptyLines=" + skipEmptyLines | ||
+ ", skipEmptyColumns=" + skipEmptyColumns + "]"; | ||
+ ", skipEmptyColumns=" + skipEmptyColumns +", detectColumnTypes=" | ||
+ detectColumnTypes + "]"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add numberOfLinesToScan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit.
@@ -152,10 +160,150 @@ public void close() throws IOException { | |||
}); | |||
} | |||
|
|||
private ColumnType[] getColumnTypes(final Sheet sheet, final Row row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are copy/pasting a lot of code into this class from the DefaultSpreadsheetReaderDelegate
class, that seems like a bad idea to me, why not always use the DefaultSpreadsheetReaderDelegate
in case you need to detect column types and don't change anything in this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with next commit
@SociopathicPixel It may also be a good idea to merge master branch (from apache/metamodel) into this branch, because that contains a general fix for the failing unit tests. |
@@ -149,8 +151,33 @@ protected CellStyle fetch() { | |||
cell.setCellStyle(cellStyle.get()); | |||
} | |||
} | |||
validateUpdateType(row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SociopathicPixel I'm sorry, but I still don't get why this has been added here.
final Style[] styles = new Style[size]; | ||
if (row != null) { | ||
for (int i = 0; i < size; i++) { | ||
final int columnNumber = header.getSelectItem(i).getColumn().getColumnNumber(); | ||
final Cell cell = row.getCell(columnNumber); | ||
final String value = ExcelUtils.getCellValue(workbook, cell); | ||
final Object value = ExcelUtils.getCellValue(workbook, cell); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning an Object doesn't do the trick, we know based on the column type, what type of object should be returned, so use that to either return a String, Integer, Boolean or Date object.
Updated: