Standardizing resource and column names #8
Comments
As for the agreement on fixed names to refer to files within
so we fix the resource's As for a standard way to refer to column names, I do agree that the third option seems to be the cleanest and preferable ( |
Briefly discussed with @simleo. Overall, I like the usage of a well-defined property (here Maintaining the flexibility on the column naming also means existing CSV tables could be left unaltered under some conditions, with all the metadata layer expressed at the level of the JSON file. This approach might also simplify both the implementation and the validation of the file format.
See also #14 (not entirely orthogonal to this one)
I have updated the original description by appending a new "UPDATE (2017-05-22)" section. This stems from considerations that emerged after working on several pull requests that required a more extensive review of the existing code. I haven't altered the text above that, so that the original discussion can still be followed for future reference.
Fixed in #41.
For the background, see CellMigStandOrg/CMSO-samples#18. After discussing at today's meeting, the consensus for now seems to be:
Agree on fixed names to refer to files within `datapackage.json`. For instance, `objects_table` for the objects table. Note that it's not the file name that should be fixed, but rather the name that the JSON file uses to refer to it. For instance, the JSON data could be:
Or:
While the actual "path" of the tabular file changes from datapackage to datapackage, the resource's "name" is fixed. This could be implemented with a global variable defined somewhere in the API, e.g.:
which would be used both by data package generators and readers. The same goes for other files that make up the data package.
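The fixed resource name can be illustrated with a minimal sketch. The constant `OBJECTS_TABLE`, the helper functions, the package name and the file paths are all hypothetical, and the descriptors are plain dictionaries shaped like a `datapackage.json` rather than objects from the actual API:

```python
import json

# Hypothetical constant; assumed to live in a module shared by generators and readers.
OBJECTS_TABLE = "objects_table"


def make_descriptor(objects_path):
    """Generator side: build a minimal descriptor whose resource name is fixed."""
    return {
        "name": "example-package",  # illustrative package name
        "resources": [{"name": OBJECTS_TABLE, "path": objects_path}],
    }


def find_resource(descriptor, name):
    """Reader side: look a resource up by its fixed name, not by its path."""
    for resource in descriptor["resources"]:
        if resource["name"] == name:
            return resource
    raise KeyError(f"no resource named {name!r}")


# Two packages store the objects table under different (made-up) paths.
pkg_a = json.loads(json.dumps(make_descriptor("objects.csv")))
pkg_b = json.loads(json.dumps(make_descriptor("data/cells.csv")))

for pkg in (pkg_a, pkg_b):
    print(find_resource(pkg, OBJECTS_TABLE)["path"])
```

The reader never needs to know the path in advance: it only depends on the shared constant, which is the point of fixing the "name" rather than the file name.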
Readers also need a standard way to refer to column names. Options for solving this include:
- agree on fixed names for columns, such as "x_coord", "frame_idx", etc. In this case, the code that generates data packages would need to convert the CSV files during the process, replacing column names on the basis of a mapping file such as the `.ini` we are currently using.
- allow columns to have arbitrary names. In this case the datapackage should include the mapping file so that the reader can resolve the actual names.
In any event, keys in the mapping file (e.g., "x_coord_cmso") should be fixed.
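As a rough sketch of how a reader could resolve arbitrary column names through a mapping file with fixed keys, consider the following. The `[columns]` section name, the key "frame_idx_cmso", the column names `X` and `Frame`, and both helper functions are assumptions made for illustration; only the key "x_coord_cmso" comes from the discussion above:

```python
import configparser
import csv
import io

# Made-up mapping file: fixed keys on the left, actual column names on the right.
MAPPING_INI = """
[columns]
x_coord_cmso = X
frame_idx_cmso = Frame
"""

# Made-up CSV table whose header uses arbitrary, tool-specific names.
TABLE_CSV = "X,Frame\n1.5,0\n2.5,1\n"


def load_mapping(ini_text, section="columns"):
    """Return {fixed_key: actual_column_name} from the mapping file."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep keys exactly as written (default lowercases them)
    parser.read_string(ini_text)
    return dict(parser[section])


def read_with_fixed_keys(csv_text, mapping):
    """Read rows and re-key them so callers only see the fixed keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: row[col] for key, col in mapping.items()} for row in reader]


mapping = load_mapping(MAPPING_INI)
rows = read_with_fixed_keys(TABLE_CSV, mapping)
print(rows[0]["x_coord_cmso"], rows[0]["frame_idx_cmso"])
```

The same mapping could equally be applied at generation time to rewrite the CSV header, which corresponds to the first option; the difference is only where the translation happens.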
A third option that has not been discussed at the meeting is to use the "name" property of table schema fields for the fixed CMSO keys and the "title" property for the actual column name in the file, e.g.:
Note that the spec imposes a restriction on "name" attributes (lower case characters with ., _, - and / are allowed). This does not seem to be currently enforced for table schema fields in the Python API, but it is enforced for other "name" attributes at upper levels. In future versions of frictionless APIs they might add this missing restriction, so this third option:
UPDATE (2017-05-22)
De facto status:
- Concrete examples show that the upstream format (i.e., the output of cell tracking software packages) can be something other than CSV (e.g., XML for TrackMate);
- Even when the upstream format is some form of CSV, our current workflow is to read it into a pair of dataframes (for objects and links), do some processing and eventually write out our own `{objects,links,tracks}.csv` files. Currently these names are hardcoded, see readers return value and how it's used in the data package creation script. While these names should not be hardcoded (i.e., we should define them in the names module as we've done for other entities in "Single source for resource and column names" #15), having fixed file names is good (we are defining our own specialization of a tabular datapackage) and simplifies the task of reading them.
Since we always rewrite upstream data, the column names we use can also be fixed. This should also eliminate the confusion regarding the role of the config file (see "Role of config file in data package creation" #14), i.e., output names are always fixed and defined in the library, and the config file specifies input column names where necessary.
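The division of roles described in the update (fixed output names defined in the library, input column names supplied by the config) could look roughly like this. The constant names, the `rewrite_objects` helper, and the upstream column names `POSITION_X` and `FRAME` are hypothetical and not taken from the actual names module or config files:

```python
import csv
import io

# Hypothetical contents of a "names" module: fixed output names defined once.
OBJECTS_FILE = "objects.csv"
LINKS_FILE = "links.csv"
TRACKS_FILE = "tracks.csv"
X_COORD = "x_coord"      # column names taken from the examples in this issue
FRAME_IDX = "frame_idx"

# Hypothetical config content: fixed output name -> input column name.
config = {X_COORD: "POSITION_X", FRAME_IDX: "FRAME"}


def rewrite_objects(upstream_rows, config):
    """Return CSV text for the objects table with fixed output column names."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(config), lineterminator="\n")
    writer.writeheader()
    for row in upstream_rows:
        writer.writerow({fixed: row[src] for fixed, src in config.items()})
    return out.getvalue()


# Made-up upstream rows, as they might come out of a tracking tool's reader.
upstream = [{"POSITION_X": "1.5", "FRAME": "0"}]
outputs = {OBJECTS_FILE: rewrite_objects(upstream, config)}
print(outputs[OBJECTS_FILE])
```

With this split, readers of the data package only ever deal with the fixed names, while the config is needed solely when converting upstream data.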