Standardizing resource and column names #8

simleo · 2017-04-06T12:59:10Z

For the background, see CellMigStandOrg/CMSO-samples#18. After discussing at today's meeting, the consensus for now seems to be:

Agree on fixed names to refer to files within datapackage.json. For instance, objects_table for the objects table. Note that it's not the file name that should be fixed, but rather the name that the JSON file uses to refer to it.

For instance, the JSON data could be:

{
  "resources": [{
    "name": "objects_table",
    "path": "objects.csv",
    ...
    }]
  ...
}

Or:

{
  "resources": [{
    "name": "objects_table",
    "path": "tracking_output_0001.csv",
    ...
    }]
  ...
}

While the actual "path" of the tabular file changes from datapackage to datapackage, the resource's "name" is fixed. This could be implemented with a global variable defined somewhere in the API, e.g.:

OBJECTS_NAME = "objects_table"

which would be used both by data package generators and readers. The same goes for other files that make up the data package.
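
A minimal sketch of how such a shared constant might be used, assuming a plain JSON descriptor handled with the standard library only. `OBJECTS_NAME` comes from the proposal above; `make_descriptor` and `find_resource` are hypothetical helpers for illustration, not part of any existing API:

```python
import json

# Fixed resource name shared by generators and readers (from the proposal above).
OBJECTS_NAME = "objects_table"


def make_descriptor(objects_path):
    """Hypothetical generator side: the path varies, the name does not."""
    return {"resources": [{"name": OBJECTS_NAME, "path": objects_path}]}


def find_resource(descriptor, name):
    """Hypothetical reader side: look a resource up by its fixed name."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == name:
            return resource
    raise KeyError(f"no resource named {name!r}")


# Two packages with different file names resolve through the same constant.
for path in ("objects.csv", "tracking_output_0001.csv"):
    descriptor = json.loads(json.dumps(make_descriptor(path)))
    print(find_resource(descriptor, OBJECTS_NAME)["path"])
```

The reader never needs to know the file name in advance: it asks for `OBJECTS_NAME` and takes whatever "path" the descriptor reports.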

Readers also need a standard way to refer to column names. Options for solving this include:

1. Agree on fixed names for columns, such as "x_coord", "frame_idx", etc. In this case, the code that generates data packages would need to convert the CSV files during the process, replacing column names on the basis of a mapping file such as the .ini we are currently using.
2. Allow columns to have arbitrary names. In this case the data package should include the mapping file so that the reader can resolve the actual names.

In any event, keys in the mapping file (e.g., "x_coord_cmso") should be fixed.
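
The first option (converting the CSV on the basis of a mapping file) could look roughly like the sketch below. It assumes an .ini layout with fixed keys on the left and upstream column names on the right; the section name, the `frame_cmso` key, the upstream column names, and both helper functions are hypothetical and only illustrate the idea:

```python
import configparser
import csv
import io

# Hypothetical mapping file: fixed CMSO keys on the left,
# upstream column names on the right.
MAPPING_INI = """
[objects]
x_coord_cmso = POSITION_X
frame_cmso = FRAME
"""


def load_mapping(ini_text, section="objects"):
    """Return {upstream column name: fixed CMSO key} for one section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of keys as written
    parser.read_string(ini_text)
    return {actual: key for key, actual in parser[section].items()}


def rename_columns(csv_text, mapping):
    """Rewrite the CSV header, replacing mapped names and leaving others alone."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows[0] = [mapping.get(col, col) for col in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


upstream = "FRAME,POSITION_X\n0,1.5\n1,2.5\n"
print(rename_columns(upstream, load_mapping(MAPPING_INI)))
```

Under the second option the same mapping would instead ship inside the data package, and the reader would apply it in the opposite direction.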

A third option, which was not discussed at the meeting, is to use the "name" property of table schema fields for the fixed CMSO keys and the "title" property for the actual column name in the file, e.g.:

{
  "resources": [{
    "name": "objects_table",
    "schema": {
      "fields": [{
        "name": "cmso_frame",
        "title": "FRAME",
        ...

Note that the spec restricts "name" attributes (only lower case characters plus ., _, - and / are allowed). This does not currently seem to be enforced for table schema fields in the Python API, but it is enforced for other "name" attributes at upper levels. Future versions of the Frictionless APIs might add the missing restriction, so this third option:

- is probably the safest long-term
- does not require CSV conversions
- does not require the inclusion of a mapping file in the data package
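
As a rough sketch of how a reader might work under this third option, the snippet below resolves fixed "name" values to the "title" of the column in the file and checks names against the restriction quoted above. The regular expression is an assumed reading of that rule (allowing digits is an assumption), and the second field plus both helper functions are hypothetical:

```python
import re

# Assumed reading of the "name" rule quoted above: lower-case letters plus
# ".", "_", "-" and "/"; allowing digits as well is an assumption.
NAME_PATTERN = re.compile(r"^[a-z0-9._/-]+$")

descriptor = {
    "resources": [{
        "name": "objects_table",
        "schema": {
            "fields": [
                {"name": "cmso_frame", "title": "FRAME"},
                # Hypothetical second field, added for illustration only.
                {"name": "cmso_x_coord", "title": "POSITION_X"},
            ]
        },
    }]
}


def column_titles(descriptor, resource_name):
    """Map each fixed field "name" to the column "title" used in the file."""
    for resource in descriptor["resources"]:
        if resource["name"] == resource_name:
            fields = resource["schema"]["fields"]
            return {f["name"]: f.get("title", f["name"]) for f in fields}
    raise KeyError(resource_name)


def invalid_names(descriptor):
    """List resource and field names that break the assumed pattern."""
    names = []
    for resource in descriptor["resources"]:
        names.append(resource["name"])
        names.extend(f["name"] for f in resource["schema"]["fields"])
    return [n for n in names if not NAME_PATTERN.match(n)]


titles = column_titles(descriptor, "objects_table")
print(titles["cmso_frame"])       # column to read for the frame index
print(invalid_names(descriptor))  # names that would fail the check
```

Because the upstream column name lives in "title", an upper-case header such as FRAME never has to pass the "name" check.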

UPDATE (2017-05-22)

De facto status:

- Concrete examples show that the upstream format (i.e., the output of cell tracking software packages) is not necessarily CSV (e.g., XML for TrackMate).
- Even when the upstream format is some form of CSV, our current workflow is to read it into a pair of dataframes (for objects and links), do some processing, and eventually write out our own {objects,links,tracks}.csv files. These names are currently hardcoded: see the readers' return value and how it is used in the data package creation script. They should not be hardcoded (we should define them in the names module, as we have done for other entities in Single source for resource and column names #15), but having fixed file names is good: we are defining our own specialization of a tabular data package, and fixed names simplify the task of reading the files.
- Since we always rewrite upstream data, the column names we use can also be fixed. This should also eliminate the confusion about the role of the config file (see Role of config file in data package creation #14): output names are always fixed and defined in the library, and the config file specifies input column names where necessary.
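
The split described in this update (fixed output names in the library, input names in the config) might be sketched as follows. This is only an illustration with the standard library: every constant, column name, and function below is a hypothetical stand-in rather than the actual content of the names module, and the tracks table is left out for brevity:

```python
import csv
import json
import os
import tempfile

# Hypothetical stand-ins for constants that would live in the names module.
OBJECTS_FILE = "objects.csv"
LINKS_FILE = "links.csv"
OBJECTS_COLUMNS = ["cmso_object_id", "cmso_frame", "cmso_x_coord"]
LINKS_COLUMNS = ["cmso_link_id", "cmso_object_id"]


def write_table(out_dir, file_name, columns, rows, input_names):
    """Write rows (dicts keyed by upstream names) under the fixed column names."""
    with open(os.path.join(out_dir, file_name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        for row in rows:
            writer.writerow([row[input_names[c]] for c in columns])


def write_package(out_dir, objects, links, config):
    """Write both tables and a descriptor that uses only fixed names."""
    write_table(out_dir, OBJECTS_FILE, OBJECTS_COLUMNS, objects, config["objects"])
    write_table(out_dir, LINKS_FILE, LINKS_COLUMNS, links, config["links"])
    descriptor = {"resources": [
        {"name": "objects_table", "path": OBJECTS_FILE},
        {"name": "links_table", "path": LINKS_FILE},
    ]}
    with open(os.path.join(out_dir, "datapackage.json"), "w") as f:
        json.dump(descriptor, f, indent=2)
    return descriptor


# The config only names the *input* columns; output names never change.
config = {
    "objects": {"cmso_object_id": "ID", "cmso_frame": "FRAME",
                "cmso_x_coord": "POSITION_X"},
    "links": {"cmso_link_id": "TRACK_ID", "cmso_object_id": "ID"},
}
objects = [{"ID": 0, "FRAME": 0, "POSITION_X": 1.5}]
links = [{"TRACK_ID": 0, "ID": 0}]
out_dir = tempfile.mkdtemp()
write_package(out_dir, objects, links, config)
print(sorted(os.listdir(out_dir)))
```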


pcmasuzzo · 2017-04-06T14:43:39Z

As for the agreement on fixed names to refer to files within datapackage.json, perhaps best to have something like option 2:

{
  "resources": [{
    "name": "objects_table",
    "path": "tracking_output_0001.csv",
    ...
    }]
  ...
}

so we fix the resource's "name" (objects_table), while the actual "path" remains free to vary across data packages.

As for a standard way to refer to column names, I agree that the third option seems the cleanest and is preferable: the "name" property is fixed (something like x_coord_cmso), while the "title" changes and can be used to write the CSV files and read them back for further plotting, analytics, etc.

sbesson · 2017-04-18T16:48:25Z

Briefly discussed with @simleo. Overall, I like the use of a well-defined property (here, name) in both resources and schema to provide an unambiguous mapping between the tables/columns in the file format and the specification. Coming back to the initial discussions on the Positions, this has functionality similar to the prefixed CMSO:xxx column title.

Keeping the column naming flexible also means that existing CSV tables could be left unaltered under some conditions, with the whole metadata layer expressed at the level of the JSON file. This approach might also simplify both the implementation and the validation of the file format.

simleo · 2017-04-25T14:56:15Z

See also #14 (not entirely orthogonal to this one)

sbesson · 2017-05-05T15:09:17Z

@simleo is this partly or fully addressed via #15?

simleo · 2017-05-05T15:29:15Z

> @simleo is this partly or fully addressed via #15?

Only partly. createdp takes data from dictionaries created in readfile, where many fields are still hardwired to the specific values we have in our examples.

simleo · 2017-05-22T10:22:47Z

I have updated the original description by appending a new "UPDATE (2017-05-22)" section. This stems from considerations that emerged while working on several pull requests that required a more extensive review of the existing code. I haven't altered the text above it, so that the original discussion can still be followed for future reference.

simleo · 2017-07-05T10:57:17Z

Fixed in #41.

simleo added the enhancement label Apr 6, 2017

sbesson mentioned this issue Apr 19, 2017

Tracks table mismatch between specification and implementation CellMigStandOrg/Tracks#15

Closed

simleo mentioned this issue Apr 25, 2017

Role of config file in data package creation #14

Closed

simleo mentioned this issue Apr 25, 2017

Single source for resource and column names #15

Merged

This was referenced Jun 23, 2017

Review and fix hardcoded values #30

Open

Biotracks-specific datapackage reader #39

Closed

sbesson modified the milestone: 0.3.0 Jun 29, 2017

simleo self-assigned this Jun 29, 2017

simleo mentioned this issue Jun 30, 2017

Use standard names for the output format #41

Merged

simleo closed this as completed Jul 5, 2017
