Infinite number of (unnamed) columns #13
RFC 4180 states that each line of a CSV should have the same number of fields, so this is really outside the scope of what we were trying to do. However, an approach that could work would be to treat the variable portion as a single quote-wrapped field with the commas inside it. So in your example it would become something like the following, with a matching schema.
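(The original example was lost from this page; based on the two-letter-columns-then-numbers layout described in the issue body, the reworked file might look like:)

```
a,b,"1,2,3"
c,d,"4,5"
```

with a schema along these lines (the column names and rules here are illustrative, not from the original comment):

```
version 1.1
@totalColumns 3
letterA: regex("[a-z]")
letterB: regex("[a-z]")
numbers: regex("[0-9]+(,[0-9]+)*")
```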
|
Ah, thanks for that clarification. Yes, I'm aware of the workaround. If I modify the scenario a bit, though... Example: File A and File B below.
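(The original files were lost from this page; the reconstructions below are illustrative, following the two-letter-columns-then-numbers layout described in the issue body.)

File A, where every record has 4 columns:

```
a,b,1,2
c,d,3,4
```

File B, where every record has 5 columns:

```
a,b,1,2,3
c,d,4,5,6
```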
Would it still violate the RFC and thus be out of scope for this project? |
It wouldn't directly violate the RFC, but as it stands a schema expects a constant number of columns, so currently all you could do is have a different schema for each number of columns. It was a fairly fundamental part of our thinking (maybe it shouldn't have been) that the number of fields was a defining feature of a homogeneous set of CSV files. It would certainly not be something that would be a priority for us right now. (Of course, it's not hard to include it in the schema language as such; it's developing the validator to implement the checks where most of the effort lies.) Thoughts @adamretter @rhubner ? |
@adbre Based on your File A and File B example above: I am not at all clear what it is you are trying to achieve; both of those appear to me to be valid CSV files for which a CSV Schema could be created. What seems to be the issue? |
I would like a single schema file for both File A and File B. The third column, and every column after it, follows the same rules. So I'm wondering if it would be possible to add support in the schema for specifying that the last column be repeated 1..N times (as long as the total number of columns for each record is still consistent for the entire file). |
@adbre Okay, so I think you are maybe talking about reuse of CSV Schema components. If so, I would suggest that this is outside of CSV Schema, and you could easily create a tool that allows you to assemble fragments of CSV Schema. For the above example, you would have three schema components: (1) the common fixed columns, (2) the remaining columns for File A, and (3) the remaining columns for File B.

I would imagine that you could just concatenate (1) and either (2) or (3) to give you a complete CSV Schema; a simple command would do this, or if you wanted to be fancier and have some options, then maybe a short Python script. A minimal sketch of such a script follows.
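(A sketch assuming one schema fragment per file and hypothetical filenames; the assembly happens outside the validator, as suggested above:)

```python
# assemble_schema.py - concatenate CSV Schema fragments into one schema file.
# Usage: python assemble_schema.py out.csvs common.csvs file-a-tail.csvs
import sys

def assemble(out_path, fragment_paths):
    # Write each fragment in order, ensuring a newline between fragments.
    with open(out_path, "w") as out:
        for path in fragment_paths:
            with open(path) as fragment:
                out.write(fragment.read().rstrip("\n") + "\n")

if __name__ == "__main__":
    assemble(sys.argv[1], sys.argv[2:])
```
|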
@adamretter I do not think you understood my question. I just had an idea for how to extend the CSV Schema, in a way that (to me) is not unreasonable, and I wanted to share it with the community. It's fine by me if you do not feel this "feature" belongs in the CSV Schema for one reason or another. |
@adbre I think you are correct that I do not understand what you are looking for :-/ Can you perhaps give me an example of what you think needs to be added to CSV Schema to support your needs, along with example CSV files and a CSV Schema? |
Gladly! file-a.csv and file-b.csv contain the same records as File A and File B above; schema.csvs is what I had in mind for validating both.
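(The file contents were lost from this page. The sketch of schema.csvs below uses a hypothetical `1-*` marker meaning "this column repeats one or more times", since the exact notation originally proposed was not preserved; the column rules are illustrative:)

```
version 1.1
letterA: regex("[a-z]")
letterB: regex("[a-z]")
/* hypothetical notation: 1-* marks a column repeated 1..N times */
number 1-*: regex("[0-9]+")
```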
In my original post I included a proposed notation along these lines. Edit: as I also noted in my original post, this would behave very much like the "varargs" notation in Java or the `params` keyword in C#. |
@adbre Okay, that makes more sense to me now; describing it as varargs didn't convey to me what you wanted. I think we would need to work on the syntax though, as I don't like the notation you proposed. We would need something that we could disambiguate in the AST, which currently looks like:

```
BodyPart ::= Comment* ColumnDefinition Comment*
ColumnDefinition ::= (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier ::= PositiveNonZeroIntegerLiteral | Ident
QuotedColumnIdentifier ::= StringLiteral
```

Perhaps something like:

```
BodyPart ::= Comment* (ColumnDefinition | ColumnGroup) Comment*
ColumnDefinition ::= (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier ::= PositiveNonZeroIntegerLiteral | Ident
QuotedColumnIdentifier ::= StringLiteral
ColumnGroup ::= "group(" positiveInteger "," positiveIntegerOrAny "):" ColumnRule
```

The column group would allow you to specify the number of following columns (from the current index) in the form "exactly 4 columns", "between 4 and 6 columns", or "4 or more columns". Obviously we would need some rules about how these interact with each other; for example, you couldn't have two sequential open-ended groups, as there would be no way to tell where one group ends and the next begins. I think the rules would need to pin down exactly when adjacent groups are unambiguous. How does that sound? Obviously the syntax could be improved; the above was just off the top of my head ;-)
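(To make the proposal concrete, here is a sketch of the earlier File A/File B schema using the suggested `group` syntax, assuming `*` is the "any" token for the upper bound; the column rules are illustrative:)

```
version 1.2
letterA: regex("[a-z]")
letterB: regex("[a-z]")
group(1, *): regex("[0-9]+")
```
|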
That sounds perfectly reasonable to me! I realized something, though, when you mentioned named columns... I don't use them, but if someone did, these "group" columns would be unnamed, with no possibility of giving them a name. A solution could be an additional argument giving the group a name, which would be used as a prefix followed by the numerical index of the column within that group. That would allow something similar to the sketch below, matching a CSV header with zero-based indexes.
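(The original snippets were lost from this page; the `"value"` name argument and the header below are illustrative:)

```
group("value", 1, *): regex("[0-9]+")
```

which, for three variable columns, would match a CSV header such as:

```
letterA,letterB,value0,value1,value2
```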
|
@adbre I did consider allowing you to name column groups, but then I thought that we could just use the underlying column name or index (if there is no column header) in any validation messages for the specific column. |
That works too |
@DavidUnderdown What do you reckon then? |
Thank you for your suggestion. We have discussed this, but have decided that, as this is not something we would use, it will not be prioritised for the 1.2 schema language update. Any pull requests, however, would be welcome. |
|
Forgive my naiveté @adbre, but couldn't the CSV files be pre-processed to pad them with max - n blank columns before validation, where max = the most columns seen (this time) in a CSV, and n = the number of columns in a given CSV? Then update the schema with max columns and do the validation?
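(A minimal sketch of that padding step, assuming simple comma-delimited input; the filenames and the choice of empty-string padding are illustrative. The schema would then be regenerated with @totalColumns set to max:)

```python
import csv

def pad_csv(in_path, out_path):
    """Pad every record with blanks up to the widest record seen (max - n)."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    max_cols = max(len(row) for row in rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            # Append max - n empty fields so every record has max_cols columns.
            writer.writerow(row + [""] * (max_cols - len(row)))
```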
The first sentence describing CSV Schema on https://digital-preservation.github.io/csv-schema sets the expectation that it can describe the data held in CSV files. I would argue that if you need to do pre-processing of CSV files before using CSV Schema, then the CSV Schema spec has already failed. Beyond its inability to actually describe the data in the CSV, if you are able to reliably perform pre-processing of a CSV without validating it first, then the usefulness of many of the directives involving headers is largely diminished.

There are multiple instances in this project where the solution is to dynamically generate a schema to match the data in the CSV (1) (2). The fact that this is considered a valid solution by the authors only makes it clear that the goal of CSV Schema is not to describe data in CSV files for the purposes of validation. I can only speculate that the true goal of CSV Schema is to be a pre-processor for the limited and specific needs of the National Archives, and that use cases which do not align with those needs are considered out of scope. With its current direction, CSV Schema will only ever be able to validate CSV files for those who have complete control of the CSV generation process, making it useless for anyone who wishes to use it in an ETL process. |
We don't currently have enough developers to implement additional features, I'm afraid, but we are certainly not opposed to the addition of further use cases in principle. If others wish to create appropriate EBNF for additional features, please do propose it as a pull request for inclusion in the 1.2 (or later) draft. In effect, we saw CSV Schema as something to be used after the transform stage (what else is preprocessing except a form of transformation?), to verify that your data will be OK to load. It seems you want to be able to use it earlier in the cycle, immediately after extraction? |
To back up my claim: #16 laid out some very clear arguments for an addition to the 1.2 draft, and #17 created the appropriate EBNF changes. I no longer work with this tool, but if the request had been approved, I had a couple of developers who could have worked full time on implementing the features. The use case was that the company was receiving CSVs from multiple and very inconsistent sources over which we had no control. To make things worse, they periodically modified the columns they would include. If we had had a robust schema validator at the time, we could have linked various schemas to specific transformation logic. This matches the use case for schema validation perfectly: validating that a CSV matches the expected data types. A schema is only useful if it is actually able to describe the document, and missing simple features like the one proposed here prevents that. |
@mhGLEIF It cannot. |
We have CSV files in which each entry consists of a "header" made up of a fixed number of columns, followed by a "body" with a variable number of columns. Example below, with a fixed number of columns, 2 (containing letters), followed by a variable number of columns (containing numbers).
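(The example file was lost from this page; an illustrative reconstruction:)

```
a,b,1
c,d,2,3
e,f,4,5,6
```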
The concept is very similar to the "varargs" notation in Java or the `params` keyword in C#. I'm looking for a way to express this in the schema file. The schema for this could be expressed by allowing the last column rule to repeat, as sketched in the schema.csvs example earlier on this page.