dev/add datatypes to excel columns #230

SociopathicPixel · 2019-09-26T07:37:03Z

Updated:

poi-ooxml: version update from 4.0.1 to 4.1.0
ExcelDataContextTest: Updated test outcome String.

added eager datatype checker for first 1000 rows

arjansh

Besides the inline comments, can you also add some test cases to DefaultSpreadsheetReaderDelegateTest that test the new functionality?

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

arjansh · 2019-09-26T09:54:33Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

        return table;
    }

+    private void setColumnType(Iterator<Row> data, int rowLength, ColumnType[] columnTypes) {
+        while (data.hasNext()) {


Essentially you're iterating over the complete data set now. I would expect you to only look at the first line (or maybe the first few lines of data) to get the column type.

I had a query that looked up only the first 1000 records; I see that it got removed when I switched to getting DataRow types instead of the "normal" Row type.
I've added a countdown from 1000 to the while-loop:

int eagerness = 1000; ... ... while (data.hasNext() && eagerness-- > 0) { ... ...
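
A minimal, self-contained sketch of the bounded scan described above. The class and method names are hypothetical and the rows are plain strings rather than POI `Row` objects. Unlike the one-liner in the comment, it checks the remaining budget before calling `hasNext()`, so the budget never goes negative.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper (not part of the PR): visits at most a fixed number of rows. */
final class BoundedScanSketch {

    /** Visits at most maxRows rows from the iterator and returns how many were visited. */
    static <T> int scan(final Iterator<T> data, final int maxRows) {
        int remaining = maxRows;
        int visited = 0;
        while (remaining > 0 && data.hasNext()) {
            data.next(); // a real implementation would inspect the row's cells here
            remaining--;
            visited++;
        }
        return visited;
    }

    public static void main(final String[] args) {
        final List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        System.out.println(scan(rows.iterator(), 3));    // budget runs out first
        System.out.println(scan(rows.iterator(), 1000)); // data runs out first
    }
}
```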

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

kaspersorensen · 2019-09-27T06:52:57Z

It would have been nice to have a small discussion about this change before doing the PR because I feel like there's maybe a few things I would have liked to do differently:

* The idea of detecting column type seems general enough that we should have a facility for it as a decorator pattern or such.
* The PR does not do anything to ensure conversion of values to conform with the data types. I know we have a converter API for that in the core module.
* I do like the change from VARCHAR to STRING. For the sake of separating concerns I would do that in a separate PR though.

kaspersorensen · 2019-09-27T06:53:46Z

Oh, and column type detection should be optional. We don't want to break existing users and we also don't want it to potentially break while reading beyond the first 1000 records.

arjansh

Outside of inline comments, please also add unit tests for the added functionality.

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

arjansh · 2019-09-27T07:05:37Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

+                if (currentRow.getLastCellNum() == 0) {
+                    continue;
+                }
+                if (currentRow.getCell(index) == null) {


Can you change the logic a bit around here? I would propose you extract lines 189 through 218 into a separate method which returns the ColumnType of a cell in a row, and then move the logic from lines 225 through 231 into this method, deciding whether or not to assign that value to the columnTypes array.

Extracted the method: getColumnTypeFromRow(final ColumnType columnType, final Row currentRow, int index) from lines 189 through 218.

I haven't moved the logic of
checkColumnTypes(final ColumnType expecetedColumnType, ColumnType columnType)
to
getColumnTypeFromRow(final ColumnType columnType, final Row currentRow, int index)
because it is used multiple times.

arjansh · 2019-09-27T07:27:28Z

> It would have been nice to have a small discussion about this change before doing the PR because I feel like there's maybe a few things I would have liked to do differently:
> * The idea of detecting column type seems general enough that we should have a facility for it as a decorator pattern or such.

I'm not sure what you mean by "general". It is general in the sense that all other data stores which support data types already have their own mechanism for detecting column types, but each is implemented differently, because each data store provides a different way to get the data types. For all these data stores, determining the column types is the default behavior, so why not for Excel too?

> * The PR does not do anything to ensure conversion of values to conform with the data types. I know we have a converter API for that in the core module.
> * I do like the change from VARCHAR to STRING. For the sake of separating concerns I would do that in a separate PR though.
>
> Oh, and column type detection should be optional. We don't want to break existing users and we also don't want it to potentially break while reading beyond the first 1000 records.

I completely agree with these last three remarks. I'm also not sure whether scanning 1000 rows in an Excel file may be too much, because of possible performance issues.

kaspersorensen · 2019-09-27T07:41:40Z

> I'm not sure what you mean by "general". It is general in the sense that all other data stores which support data types already have their own mechanism for detecting column types, but each is implemented differently, because each data store provides a different way to get the data types. For all these data stores, determining the column types is the default behavior, so why not for Excel too?

Maybe just some reusable "column type detector" class that you could use as a sort of builder to add data type observations to. Basically just components that we could derive a general pattern from. And potentially have the 1000 constant encapsulated in there (and eventually parameterized).

kaspersorensen · 2019-09-27T07:45:09Z

Wishful thinking code:

ColumnTypeDetector ctd = new ColumnTypeDetector(tableName, columnNames);
for (Object[] record : nativeDataSet) {
  for (int i = 0; i < columnNames.length; i++) {
    Object value = record[i];
    ctd.registerValue(i, value);
  }
  if (ctd.sampledEnough()) {
    break;
  }
}
Table table = ctd.createTable();

kaspersorensen · 2019-09-27T07:46:00Z

That thing could conceivably then also detect whether something is nullable etc.
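
To make the idea concrete, here is a self-contained sketch of what such a detector might look like. Everything in it is an assumption for illustration: `ColumnTypeDetectorSketch` and its `DetectedType` enum are hypothetical stand-ins, not existing MetaModel classes, and the rule that conflicting observations fall back to STRING is a guess. It registers whole records rather than single values so that it can count records for `sampledEnough()`, and it tracks nullability as suggested above.

```java
import java.util.Date;

/** Hypothetical sketch of a reusable column type detector; not an existing MetaModel class. */
final class ColumnTypeDetectorSketch {

    /** Simplified stand-in for MetaModel's ColumnType (assumption). */
    enum DetectedType { STRING, INTEGER, BOOLEAN, DATE }

    private final DetectedType[] types;
    private final boolean[] nullable;
    private final int maxRecords;
    private int recordsSeen;

    ColumnTypeDetectorSketch(final int columnCount, final int maxRecords) {
        this.types = new DetectedType[columnCount];
        this.nullable = new boolean[columnCount];
        this.maxRecords = maxRecords;
    }

    /** Registers one record; missing trailing values count as null. */
    void registerRecord(final Object[] record) {
        for (int i = 0; i < types.length; i++) {
            registerValue(i, i < record.length ? record[i] : null);
        }
        recordsSeen++;
    }

    private void registerValue(final int index, final Object value) {
        if (value == null) {
            nullable[index] = true; // a missing value only affects nullability
            return;
        }
        final DetectedType observed = classify(value);
        if (types[index] == null) {
            types[index] = observed; // first observation for this column
        } else if (types[index] != observed) {
            types[index] = DetectedType.STRING; // assumed fallback when observations conflict
        }
    }

    private static DetectedType classify(final Object value) {
        if (value instanceof Integer || value instanceof Long) {
            return DetectedType.INTEGER;
        }
        if (value instanceof Boolean) {
            return DetectedType.BOOLEAN;
        }
        if (value instanceof Date) {
            return DetectedType.DATE;
        }
        return DetectedType.STRING;
    }

    /** True once the configured number of records has been registered. */
    boolean sampledEnough() {
        return recordsSeen >= maxRecords;
    }

    /** Detected type for a column; columns without any non-null value default to STRING. */
    DetectedType typeOf(final int index) {
        return types[index] == null ? DetectedType.STRING : types[index];
    }

    boolean isNullable(final int index) {
        return nullable[index];
    }
}
```

A caller would feed records in a loop and break as soon as `sampledEnough()` returns true, mirroring the loop in the wishful-thinking code.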

arjansh · 2019-09-27T08:10:41Z

> > I'm not sure what you mean by "general". It is general in the sense that all other data stores which support data types already have their own mechanism for detecting column types, but each is implemented differently, because each data store provides a different way to get the data types. For all these data stores, determining the column types is the default behavior, so why not for Excel too?
>
> Maybe just some reusable "column type detector" class that you could use as a sort of builder to add data type observations to. Basically just components that we could derive a general pattern from. And potentially have the 1000 constant encapsulated in there (and eventually parameterized).

I see what you mean now. I think the code here can be organized in a manner so that the column type detection is implemented on a separate class. I personally don't think it should be generalized right away in the context of this PR, but more likely in a separate PR, because I'm afraid otherwise the scope of this PR will become a bit too broad.

kaspersorensen · 2019-09-27T12:02:24Z

> I see what you mean now. I think the code here can be organized in a manner so that the column type detection is implemented on a separate class. I personally don't think it should be generalized right away in the context of this PR, but more likely in a separate PR, because I'm afraid otherwise the scope of this PR will become a bit too broad.

Yes that's a fine approach.

SociopathicPixel · 2019-09-28T17:19:19Z

> I'm also not sure whether scanning 1000 rows in an Excel file may be too much, because of possible performance issues.

Maybe we can take a percentage of the file row count? Like check 10% with a max cap of 1000 rows.

kaspersorensen · 2019-09-29T16:58:47Z

> Maybe we can take a percentage of the file row count? Like check 10% with a max cap of 1000 rows.

I don't think it's a case where we can set a default that our users will be happy with. I think 1000 is an OK default for when the feature is enabled. I think you should add the number into ExcelConfiguration along with a boolean that enables/disables the feature. And the default should be to have it OFF so that we retain backwards compatibility.

* "I do like the change from VARCHAR to STRING. For the sake of separating concerns I would do that in a separate PR though." - changed STRING back to VARCHAR
* "And potentially have the 1000 constant in here incapsulated" - added EAGER_READ to ExcelConfiguration, including the boolean validateColumnTypes

corrected a few tests cause these where breaking on the datatypes

arjansh

As @kaspersorensen mentioned earlier, when the data type detection is enabled, you want to use it when writing to an Excel sheet, so for example in the ExcelInsertBuilder, you probably want to check if the class of the value that is inserted matches the ColumnType for the cell it's inserted into.

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

arjansh · 2019-10-01T13:21:47Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

+                if (currentRow.getLastCellNum() == 0) {
+                    continue;
+                }
+                columnTypes[index] = getColumnTypeFromRow(columnTypes[index], currentRow, index);


I like it better if you don't pass the current value of the columnType at the current index into the getColumnTypeFromRow method. I would move the logic from checkColumnType(ColumnType, ColumnType) here, so that method can be removed and getColumnTypeFromRow just returns the ColumnType it determines for its inspected cell.

I made a comment about this earlier, to which I didn't get an answer. The logic in checkColumnType(expected, columnType) gets used multiple times in this method, so why not reuse the code in a separate method like now?
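
For illustration only, the split under discussion (one step that detects the type of a single cell, one step that folds it into the per-column array) can be sketched with a small pure function. The merge rules below are an assumption, since the thread does not show the body of checkColumnType, and `DetectedType` is a hypothetical stand-in for MetaModel's ColumnType.

```java
/** Hypothetical sketch of folding a newly detected cell type into the type recorded so far. */
final class ColumnTypeMergeSketch {

    /** Simplified stand-in for MetaModel's ColumnType (assumption). */
    enum DetectedType { STRING, INTEGER, BOOLEAN, DATE }

    /** Combines the type recorded so far with the type detected for the next cell. */
    static DetectedType merge(final DetectedType current, final DetectedType detected) {
        if (current == null) {
            return detected; // nothing recorded yet: take the new observation
        }
        if (detected == null || current == detected) {
            return current; // no new information
        }
        return DetectedType.STRING; // assumed fallback when rows disagree
    }
}
```

With a helper like this, the loop body reduces to `columnTypes[index] = merge(columnTypes[index], detectedTypeOfCell)`, whether the helper lives in its own method or is inlined.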

excel/src/test/java/org/apache/metamodel/excel/ExcelDataContextTest.java

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

switching devices

switching devices and apearantly autosave was off -_-

…om/SociopathicPixel/metamodel into dev/add-datatypes-to-excel-columns # Conflicts: # excel/src/test/java/org/apache/metamodel/excel/ExcelDataContextTest.java

coudn't find any finals

arjansh · 2019-10-11T11:14:30Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

-                    final Column column = new MutableColumn(columnNamingSession.getNextColumnName(namingContext),
-                            ColumnType.STRING, table, j, true);
+                    final Column column;
+                    if (_configuration.isDetectColumnTypes()) {


I would do this check inside the getColumnTypes method, because you only want to execute the logic which scans a number of rows in the Excel sheet for data types when this is activated, otherwise you'll only lose performance for something which you don't use. In case isDetectColumnTypes returns false, just have the getColumnTypes method return an array filled with ColumnType.STRING objects.

You mean the whole part (lines 135 - 185)?
Also, apparently I'm not in beta mode on this repo, which is a bummer for selecting code in review comments.
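
A minimal sketch of the suggested shape, with the flag checked inside the method so that no rows are scanned when detection is off. The names are hypothetical, the rows are plain `Object[]` arrays instead of POI rows, and the "all non-null values are Integers" rule only stands in for the real detection logic.

```java
import java.util.Arrays;

/** Hypothetical sketch: the enable/disable check lives inside getColumnTypes. */
final class GetColumnTypesSketch {

    /** Simplified stand-in for MetaModel's ColumnType (assumption). */
    enum DetectedType { STRING, INTEGER }

    static DetectedType[] getColumnTypes(final boolean detectColumnTypes, final Object[][] sampleRows,
            final int columnCount) {
        final DetectedType[] columnTypes = new DetectedType[columnCount];
        Arrays.fill(columnTypes, DetectedType.STRING);
        if (!detectColumnTypes) {
            return columnTypes; // feature off: every column is STRING and no row is inspected
        }
        for (int column = 0; column < columnCount; column++) {
            boolean sawValue = false;
            boolean allIntegers = true;
            for (final Object[] row : sampleRows) {
                final Object value = column < row.length ? row[column] : null;
                if (value == null) {
                    continue; // empty cells carry no type information
                }
                sawValue = true;
                if (!(value instanceof Integer)) {
                    allIntegers = false;
                }
            }
            if (sawValue && allIntegers) {
                columnTypes[column] = DetectedType.INTEGER;
            }
        }
        return columnTypes;
    }
}
```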

arjansh · 2019-10-11T11:26:01Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

+                }
+
+                ColumnType columnType = columnTypes[index];
+                ColumnType expecetedColumnType = getColumnTypeFromRow(currentRow, index);


Can you rename this variable to expectedColumnType and make both this and the columnType variable final?

done with next commit

arjansh · 2019-10-11T11:27:34Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

+        if (currentRow.getCell(index) == null) {
+            return ColumnType.STRING;
+        } else {
+            CellType cellType = currentRow.getCell(index).getCellType();


Can you make this variable final?

added the currentRow.getCell(index).getCellType() into the switch statement with the next commit

arjansh · 2019-10-11T11:33:32Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

+	public int getEagerness() {
+		return eagerReader;


Can you rename this method and the underlying field to something different? When I see "eager" in combination with "reader", I interpret it as the opposite of "lazy reading", which is not what this method is about. I would rather see something like getNumberOfLinesToScan, getScanLines or something even better.

done with next commit

arjansh · 2019-10-11T11:45:11Z

core/src/main/java/org/apache/metamodel/DeleteAndInsertBuilder.java

@@ -69,12 +69,33 @@ public void execute() throws MetaModelException {
    private List<Row> updateRows(List<Row> rows) {
        for (ListIterator<Row> it = rows.listIterator(); it.hasNext();) {
            final Row original = (Row) it.next();
+            validateUpdateType(original);


Why add this here?

Removed it from the class in the next commit. I think it was there to prevent the insertion of, for example, a String value into a Column which only contains Integers and therefore has the ColumnType Integer.

arjansh · 2019-10-11T11:48:49Z

excel/src/main/java/org/apache/metamodel/excel/ExcelInsertBuilder.java

@@ -149,8 +151,33 @@ protected CellStyle fetch() {
 					cell.setCellStyle(cellStyle.get());
 				}
 			}
+			validateUpdateType(row);


I'm not sure what the value is of adding this here. I think that for now it may be best to skip the "validation" part.

see: ExcelDataContextTest.testUpdateDifferentDataTypes

It checks whether the Column has a specific ColumnType and, if so, validates the given row.

@SociopathicPixel I'm sorry, but I still don't get why this has been added here.

arjansh · 2019-10-11T12:50:36Z

Apart from the comments, the (Excel) tests currently fail; that should be addressed.

SociopathicPixel · 2019-10-14T15:00:02Z

> Apart from the comments, the (Excel) tests currently fail; that should be addressed.

I see that I indeed still had one staged change which wasn't committed (validateColumnType vs detectColumnType).
Will be fixed in the next commit.

arjansh · 2019-10-21T11:49:30Z

@SociopathicPixel Can you also add a unit test for an .xlsx Excel file? MetaModel uses the XlsxSpreadsheetReaderDelegate to read these Excel files (see line 192 in ExcelDataContext), so you may need to build in a check which ensures that, in case detectColumnTypes is set to true on the ExcelConfiguration, the DefaultSpreadsheetReaderDelegate is always used for reading the Excel file.
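
The check could look roughly like the sketch below. It is an assumption-laden simplification: the enum merely names the two delegates mentioned above, and the real selection code in ExcelDataContext is not shown in this thread.

```java
/** Hypothetical sketch of choosing a reader delegate; not the actual ExcelDataContext code. */
final class DelegateSelectionSketch {

    /** Stand-ins for DefaultSpreadsheetReaderDelegate and XlsxSpreadsheetReaderDelegate. */
    enum Delegate { DEFAULT, XLSX }

    static Delegate choose(final boolean isXlsxFile, final boolean detectColumnTypes) {
        if (detectColumnTypes) {
            return Delegate.DEFAULT; // type detection is only implemented in the default delegate
        }
        return isXlsxFile ? Delegate.XLSX : Delegate.DEFAULT;
    }
}
```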

arjansh · 2019-10-21T11:55:15Z

@SociopathicPixel The values in a Row of an Excel DataSet are all still String objects. They don't use the detected column type to return an object of the expected type. This is implemented in the ExcelUtils#createRow(Workbook, Row, DataSetHeader) method.

…be better coded

arjansh

As mentioned before, the values in a Row of an Excel DataSet are all still String objects. They don't use the detected column type to return an object of the expected type. This is implemented in the ExcelUtils#createRow(Workbook, Row, DataSetHeader) method.

arjansh · 2019-10-24T08:15:14Z

excel/src/main/java/org/apache/metamodel/excel/DefaultSpreadsheetReaderDelegate.java

+        final ColumnType[] columnTypes = new ColumnType[rowLength];
+        if (_configuration.isDetectColumnTypes()) {
+
+            int eagerness = _configuration.getEagerness();


Can you change eagerness to something like numberOfLinesToScan?

Done with next commit.

arjansh · 2019-10-24T09:20:24Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java


 	public ExcelConfiguration() {
 		this(DEFAULT_COLUMN_NAME_LINE, true, false);
 	}

    public ExcelConfiguration(int columnNameLineNumber, boolean skipEmptyLines, boolean skipEmptyColumns) {
-        this(columnNameLineNumber, null, skipEmptyLines, skipEmptyColumns);
+        this(columnNameLineNumber, null, skipEmptyLines, skipEmptyColumns, false, 1000);


It would be nice if the default value of 1000 was in a constant, and maybe you want to set the default to zero in case you pass "false" as an argument for the detectColumnTypes property.

Done with next commit.
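
A sketch of what that suggestion might amount to, using a trimmed-down, hypothetical stand-in for ExcelConfiguration. The constant name and the constructor overloads are assumptions; only the two properties and the default of 1000 come from the thread.

```java
/** Hypothetical, trimmed-down stand-in for ExcelConfiguration. */
final class ScanConfigSketch {

    /** Assumed name for the constant that replaces the literal 1000. */
    static final int DEFAULT_NUMBER_OF_LINES_TO_SCAN = 1000;

    private final boolean detectColumnTypes;
    private final int numberOfLinesToScan;

    /** Detection is off by default, which keeps existing behavior unchanged. */
    ScanConfigSketch() {
        this(false, 0);
    }

    ScanConfigSketch(final boolean detectColumnTypes) {
        this(detectColumnTypes, DEFAULT_NUMBER_OF_LINES_TO_SCAN);
    }

    ScanConfigSketch(final boolean detectColumnTypes, final int numberOfLinesToScan) {
        this.detectColumnTypes = detectColumnTypes;
        // With detection disabled nothing is scanned, so the stored value is zero.
        this.numberOfLinesToScan = detectColumnTypes ? numberOfLinesToScan : 0;
    }

    boolean isDetectColumnTypes() {
        return detectColumnTypes;
    }

    int getNumberOfLinesToScan() {
        return numberOfLinesToScan;
    }

    @Override
    public String toString() {
        return "ScanConfigSketch[detectColumnTypes=" + detectColumnTypes + ", numberOfLinesToScan="
                + numberOfLinesToScan + "]";
    }
}
```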

arjansh · 2019-10-24T09:20:59Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

    }

    public ExcelConfiguration(int columnNameLineNumber, ColumnNamingStrategy columnNamingStrategy,
-            boolean skipEmptyLines, boolean skipEmptyColumns) {
+            boolean skipEmptyLines, boolean skipEmptyColumns, boolean detectColumnTypes, int eagerness) {


Can you rename eagerness parameter to numberOfLinesToScan?

Done with next commit.

arjansh · 2019-10-24T09:21:15Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

@@ -38,25 +38,38 @@
 	public static final int NO_COLUMN_NAME_LINE = 0;
 	public static final int DEFAULT_COLUMN_NAME_LINE = 1;

+	private final int getNumberOfLinesToScan;


Can you rename the getNumberOfLinesToScan field to numberOfLinesToScan?

Done with next commit.

arjansh · 2019-10-24T09:22:08Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

+				+ detectColumnTypes + "]";
+	}
+
+	public int getEagerness() {


Can you rename getEagerness to getNumberOfLinesToScan?

Done with next commit.

arjansh · 2019-10-24T09:22:32Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

 	@Override
 	protected void decorateIdentity(List<Object> identifiers) {
 		identifiers.add(columnNameLineNumber);
 		identifiers.add(skipEmptyLines);
 		identifiers.add(skipEmptyColumns);
+		identifiers.add(detectColumnTypes);


Can you also add numberOfLinesToScan to identifiers?

Done with next commit.

arjansh · 2019-10-24T09:23:07Z

excel/src/main/java/org/apache/metamodel/excel/ExcelConfiguration.java

 	}

 	@Override
 	public String toString() {
 		return "ExcelConfiguration[columnNameLineNumber="
 				+ columnNameLineNumber + ", skipEmptyLines=" + skipEmptyLines
-				+ ", skipEmptyColumns=" + skipEmptyColumns + "]";
+				+ ", skipEmptyColumns=" + skipEmptyColumns +", detectColumnTypes="
+				+ detectColumnTypes + "]";


Can you also add numberOfLinesToScan?

Done with next commit.

arjansh · 2019-10-24T09:26:05Z

excel/src/main/java/org/apache/metamodel/excel/XlsxSpreadsheetReaderDelegate.java

@@ -152,10 +160,150 @@ public void close() throws IOException {
                });
    }

+    private ColumnType[] getColumnTypes(final Sheet sheet, final Row row) {


You are copy/pasting a lot of code into this class from the DefaultSpreadsheetReaderDelegate class, which seems like a bad idea to me. Why not always use the DefaultSpreadsheetReaderDelegate in case you need to detect column types, and leave this class unchanged?

Done with next commit

arjansh · 2019-10-24T09:30:57Z

@SociopathicPixel It may also be a good idea to merge master branch (from apache/metamodel) into this branch, because that contains a general fix for the failing unit tests.

arjansh · 2019-10-28T08:54:36Z

excel/src/main/java/org/apache/metamodel/excel/ExcelInsertBuilder.java

@@ -149,8 +151,33 @@ protected CellStyle fetch() {
 					cell.setCellStyle(cellStyle.get());
 				}
 			}
+			validateUpdateType(row);


@SociopathicPixel I'm sorry, but I still don't get why this has been added here.

arjansh · 2019-10-28T08:55:51Z

excel/src/main/java/org/apache/metamodel/excel/ExcelUtils.java

        final Style[] styles = new Style[size];
        if (row != null) {
            for (int i = 0; i < size; i++) {
                final int columnNumber = header.getSelectItem(i).getColumn().getColumnNumber();
                final Cell cell = row.getCell(columnNumber);
-                final String value = ExcelUtils.getCellValue(workbook, cell);
+                final Object value = ExcelUtils.getCellValue(workbook, cell);


Returning an Object doesn't do the trick. Based on the column type we know what type of object should be returned, so use that to return either a String, Integer, Boolean or Date object.
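
As a rough illustration of returning a typed value, the sketch below converts the string form of a cell according to a detected type. All names are hypothetical, `DetectedType` is a stand-in for MetaModel's ColumnType, and the `yyyy-MM-dd` date pattern is an arbitrary assumption. A real implementation would more likely read the typed value straight from the POI cell instead of parsing a string.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

/** Hypothetical sketch of converting a raw cell string based on the detected column type. */
final class TypedCellValueSketch {

    /** Simplified stand-in for MetaModel's ColumnType (assumption). */
    enum DetectedType { STRING, INTEGER, BOOLEAN, DATE }

    static Object convert(final String raw, final DetectedType type) {
        if (raw == null) {
            return null; // empty cells stay null regardless of the column type
        }
        switch (type) {
        case INTEGER:
            return Integer.valueOf(raw.trim());
        case BOOLEAN:
            return Boolean.valueOf(raw.trim());
        case DATE:
            try {
                final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd"); // assumed pattern
                format.setLenient(false);
                return format.parse(raw.trim());
            } catch (final ParseException e) {
                throw new IllegalArgumentException("Not a date: " + raw, e);
            }
        default:
            return raw; // STRING columns keep the raw text
        }
    }
}
```

Because detection only samples the first rows, a value further down the sheet may not match the detected type; this sketch simply throws in that case, and what should actually happen is left open in the discussion above.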

Jan Bob added 2 commits September 25, 2019 14:37

updated poi-ooxml

90303fa

added eager datatype checker for first 1000 rows

refactored [tabs] to [spaces]

999647c

arjansh reviewed Sep 26, 2019


resolved review comments

5a75486

arjansh reviewed Sep 27, 2019


SociopathicPixel and others added 7 commits September 30, 2019 13:52

resolved review comments part 1

b82e728

removed empty spaces in whitelines

aae088f

added a test, however datatypes are not found...

7ae1b26

added test case for datatypes

77a1b80

corrected a few tests cause these where breaking on the datatypes

fixed another test that fel over

84c6f06

arjansh reviewed Oct 1, 2019


Jan Bob and others added 7 commits October 2, 2019 16:33

commit part 1; did some indentation fixes

b4d9a20

switching devices

commit part 1.01; did some indentation fixes

1a34285

switching devices and apearantly autosave was off -_-

resolving review comments

3785c91

Merge branch 'dev/add-datatypes-to-excel-columns' of https://github.c…

66749e9

…om/SociopathicPixel/metamodel into dev/add-datatypes-to-excel-columns # Conflicts: # excel/src/test/java/org/apache/metamodel/excel/ExcelDataContextTest.java

resolving indentation filler

a8ab4b6

revert all code style changes

57108bc

coudn't find any finals

wrote some tests, added update check

138b147

fixed assert that was set wrong

5088744

arjansh reviewed Oct 11, 2019


Jan Bob added 2 commits October 14, 2019 17:00

resolving review comments 1 of many

4bfcc41

resolving review comments

0701bab

resolving review comments, still there are a few thingies that could …

9c3976e

…be better coded

arjansh reviewed Oct 24, 2019


still need to pull apache/master into this branch

cc85725

arjansh reviewed Oct 28, 2019


resolving review comments, not finished yet

23a7781
