Standardizing resource and column names #8

Closed
simleo opened this issue Apr 6, 2017 · 7 comments

@simleo
Member

simleo commented Apr 6, 2017

For the background, see CellMigStandOrg/CMSO-samples#18. After discussing at today's meeting, the consensus for now seems to be:

Agree on fixed names to refer to files within datapackage.json. For instance, objects_table for the objects table. Note that it's not the file name that should be fixed, but rather the name that the JSON file uses to refer to it.

For instance, the JSON data could be:

{
  "resources": [{
    "name": "objects_table",
    "path": "objects.csv",
    ...
    }]
  ...
}

Or:

{
  "resources": [{
    "name": "objects_table",
    "path": "tracking_output_0001.csv",
    ...
    }]
  ...
}

While the actual "path" of the tabular file changes from datapackage to datapackage, the resource's "name" is fixed. This could be implemented with a global variable defined somewhere in the API, e.g.:

OBJECTS_NAME = "objects_table"

which would be used both by data package generators and readers. The same goes for other files that make up the data package.
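A minimal sketch of how a reader could rely on such a constant, assuming the descriptor has already been loaded into a plain dictionary. `find_resource_path` is a hypothetical helper used for illustration, not part of any existing API:

```python
OBJECTS_NAME = "objects_table"  # fixed resource name shared by generators and readers


def find_resource_path(descriptor, resource_name):
    """Return the "path" of the resource whose "name" matches, or None."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == resource_name:
            return resource.get("path")
    return None


# Two packages with different file names but the same resource name.
package_a = {"resources": [{"name": OBJECTS_NAME, "path": "objects.csv"}]}
package_b = {"resources": [{"name": OBJECTS_NAME, "path": "tracking_output_0001.csv"}]}

path_a = find_resource_path(package_a, OBJECTS_NAME)
path_b = find_resource_path(package_b, OBJECTS_NAME)
```

The reader never needs to know the file name in advance; it only looks up the fixed resource name.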

Readers also need a standard way to refer to column names. Options for solving this include:

  • agree on fixed names for columns, such as "x_coord", "frame_idx", etc. In this case, the code that generates data packages would need to convert the CSV files during the process, replacing column names on the basis of a mapping file such as the .ini we are currently using.

  • allow columns to have arbitrary names. In this case the datapackage should include the mapping file so that the reader can resolve the actual names.

In any event, keys in the mapping file (e.g., "x_coord_cmso") should be fixed.
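To illustrate the second option, here is a rough sketch of a reader that resolves actual column names through a mapping file shipped with the data package. The section name `columns`, the key `frame_cmso`, the key `y_coord_cmso` and the upstream column names are assumptions made up for the example; only `x_coord_cmso` comes from the discussion above:

```python
import configparser
import csv
import io

# Hypothetical mapping file content: fixed CMSO keys on the left,
# actual column names from the tracking output on the right.
MAPPING_INI = """
[columns]
x_coord_cmso = POSITION_X
y_coord_cmso = POSITION_Y
frame_cmso = FRAME
"""

CSV_TEXT = "FRAME,POSITION_X,POSITION_Y\n0,10.5,20.0\n1,11.0,21.5\n"


def load_mapping(ini_text, section="columns"):
    """Parse the mapping file into a {fixed_key: actual_column} dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser[section])


def read_with_fixed_keys(csv_text, mapping):
    """Read CSV rows and re-key each one by the fixed CMSO keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: row[column] for key, column in mapping.items()} for row in reader]


mapping = load_mapping(MAPPING_INI)
rows = read_with_fixed_keys(CSV_TEXT, mapping)
```

With the first option the same mapping would be applied once at generation time instead, so that the CSV written into the data package already carries the fixed names.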

A third option that has not been discussed at the meeting is to use the "name" property of table schema fields for the fixed CMSO keys and the "title" property for the actual column name in the file, e.g.:

{
  "resources": [{
    "name": "objects_table",
    "schema": {
      "fields": [{
        "name": "cmso_frame",
        "title": "FRAME",
        ...

Note that the spec imposes a restriction on "name" attributes (lower case characters with ., _, - and / are allowed). This does not seem to be currently enforced for table schema fields in the Python API, but it is enforced for other "name" attributes at upper levels. Future versions of the frictionless APIs might add this missing restriction, so this third option:

  • is probably the safest long-term
  • does not require CSV conversions
  • does not require the inclusion of a mapping file in the data package
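A possible sketch of the third option from the reader's side, working on a descriptor loaded as a plain dictionary. The field `cmso_x_coord`, the title `POSITION_X` and the helper `column_titles` are hypothetical. The pattern follows the restriction quoted above; accepting digits as well is an assumption:

```python
import re

# Pattern based on the restriction quoted above; accepting digits is an assumption.
NAME_PATTERN = re.compile(r"^[a-z0-9._/-]+$")

descriptor = {
    "resources": [{
        "name": "objects_table",
        "schema": {
            "fields": [
                {"name": "cmso_frame", "title": "FRAME"},
                {"name": "cmso_x_coord", "title": "POSITION_X"},
            ]
        },
    }]
}


def column_titles(descriptor, resource_name):
    """Map each fixed field "name" to the actual column "title" for one resource."""
    for resource in descriptor["resources"]:
        if resource["name"] == resource_name:
            fields = resource["schema"]["fields"]
            # Fall back to the name itself when no title is given.
            return {f["name"]: f.get("title", f["name"]) for f in fields}
    raise KeyError(resource_name)


titles = column_titles(descriptor, "objects_table")
invalid_names = [name for name in titles if not NAME_PATTERN.match(name)]
```

Here the upper-case upstream header lives only in "title", so the fixed "name" values stay within the restricted character set.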

UPDATE (2017-05-22)

De facto status:

  • Concrete examples show that the upstream format (i.e., the output of cell tracking software packages) can be something other than CSV (e.g., XML for TrackMate);

  • Even when the upstream format is some form of CSV, our current workflow is to read it into a pair of dataframes (for objects and links), do some processing and eventually write out our own {objects,links,tracks}.csv files. Currently these names are hardcoded: see the readers' return value and how it's used in the data package creation script. While these names should not be hardcoded (i.e., we should define them in the names module as we've done for other entities in #15, "Single source for resource and column names"), having fixed file names is good (we are defining our own specialization of a tabular datapackage) and simplifies the task of reading them.

  • Since we always rewrite upstream data, the column names we use can also be fixed. This should also eliminate the confusion regarding the role of the config file (see #14, "Role of config file in data package creation"), i.e., output names are always fixed and defined in the library, and the config file specifies input column names where necessary.
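As a rough illustration of this split between fixed output names and configurable input names, here is a sketch using only the standard library. The constant names, the `cmso_*` column names and `write_objects` are hypothetical and do not reflect the actual contents of the names module:

```python
import csv
import io

# Fixed output names, as they might appear in a names module (values are illustrative).
OBJECTS_FILE = "objects.csv"
LINKS_FILE = "links.csv"
TRACKS_FILE = "tracks.csv"
OBJECT_COLUMNS = ("cmso_frame", "cmso_x_coord", "cmso_y_coord")


def write_objects(rows, input_columns):
    """Serialize upstream rows as CSV text using the fixed output header.

    input_columns maps each fixed output name to the upstream column name,
    which is the only thing a config file would need to provide.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(OBJECT_COLUMNS)
    for row in rows:
        writer.writerow([row[input_columns[name]] for name in OBJECT_COLUMNS])
    return buffer.getvalue()


upstream_rows = [{"FRAME": 0, "POSITION_X": 10.5, "POSITION_Y": 20.0}]
config = {"cmso_frame": "FRAME", "cmso_x_coord": "POSITION_X", "cmso_y_coord": "POSITION_Y"}
objects_csv = write_objects(upstream_rows, config)
```

Whatever the upstream format, the text written to `OBJECTS_FILE` always has the same header, so readers need no mapping at all.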

@pcmasuzzo
Member

As for the agreement on fixed names to refer to files within datapackage.json, perhaps best to have something like option 2:

{
  "resources": [{
    "name": "objects_table",
    "path": "tracking_output_0001.csv",
    ...
    }]
  ...
}

so we fix the resource's "name" (objects_table), while the actual "path" is free to change across data packages.

As for a standard way to refer to column names, I agree that the third option seems the cleanest and preferable: the "name" property is fixed (something like x_coord_cmso), while the "title" changes and can be used to write the CSV files and read them back for further plotting, analytics, etc.

@sbesson
Member

sbesson commented Apr 18, 2017

Briefly discussed with @simleo. Overall, I like the usage of a well-defined property (here name) in both resources and schema to provide an unambiguous mapping between the tables/columns in the file format and the specification. Coming back to the initial discussions on the Positions, this has similar functionality to the prefixed CMSO:xxx column title.

Maintaining the flexibility on the column naming also means existing CSV tables could be left unaltered under some conditions, with all the metadata layer expressed at the level of the JSON file. This approach might also simplify both the implementation and the validation of the file format.
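To give an idea of what validation could look like when everything is keyed on the name property, here is a hedged sketch that checks a descriptor dictionary against a table of required names. The `REQUIRED` table, the field names in it and the `validate` function are hypothetical:

```python
# Hypothetical requirements: resource name -> set of required field names.
REQUIRED = {"objects_table": {"cmso_frame", "cmso_x_coord"}}


def validate(descriptor, required=REQUIRED):
    """Return a list of error messages; an empty list means the check passed."""
    errors = []
    resources = {r.get("name"): r for r in descriptor.get("resources", [])}
    for resource_name, field_names in required.items():
        resource = resources.get(resource_name)
        if resource is None:
            errors.append(f"missing resource: {resource_name}")
            continue
        fields = resource.get("schema", {}).get("fields", [])
        present = {f.get("name") for f in fields}
        for missing in sorted(field_names - present):
            errors.append(f"{resource_name}: missing field {missing}")
    return errors


good = {"resources": [{
    "name": "objects_table",
    "path": "tracking_output_0001.csv",
    "schema": {"fields": [
        {"name": "cmso_frame", "title": "FRAME"},
        {"name": "cmso_x_coord", "title": "POSITION_X"},
    ]},
}]}
bad = {"resources": [{"name": "objects_table", "schema": {"fields": [{"name": "cmso_frame"}]}}]}

good_errors = validate(good)
bad_errors = validate(bad)
```

The check only inspects the JSON descriptor, so the CSV files themselves do not have to be opened or altered.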

@simleo
Member Author

simleo commented Apr 25, 2017

See also #14 (not entirely orthogonal to this one)

@sbesson
Member

sbesson commented May 5, 2017

@simleo is this partly or fully addressed via #15?

@simleo
Member Author

simleo commented May 5, 2017

> @simleo is this partly or fully addressed via #15?

Only partly. createdp takes data from dictionaries created in readfile, where many fields are still hardwired to the specific values we have in our examples.

@simleo
Member Author

simleo commented May 22, 2017

I have updated the original description by appending a new "UPDATE (2017-05-22)" section. This stems from considerations that emerged while working on several pull requests that required a more extensive review of the existing code. I haven't altered the text above that, so that the original discussion can still be followed for future reference.

@simleo
Member Author

simleo commented Jul 5, 2017

Fixed in #41.

simleo closed this as completed Jul 5, 2017
3 participants