Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single source for resource and column names #15

Merged
merged 9 commits into from
Apr 27, 2017

Conversation

simleo
Copy link
Member

@simleo simleo commented Apr 25, 2017

A first step towards fixing #8 and #3, but it does not address everything because #14 needs to be discussed first.

Note that:

  • The dpkg.py has been moved to an external script (scripts/create_dpkg.py). It's not part of the library and it had the same name as the package, which would have messed up absolute imports. This means that the package now has to be installed.

  • writeConfigFile.py is gone. The user can create the .ini file directly, that should be easier and less error-prone than modifying Python code (which would have been installed anyway).

  • "objects.csv" and "links.csv" are still hardwired in readfile.py and createdp.py. We should define these in the library as well at some point. There is more hardwired stuff, see Role of config file in data package creation #14.

  • I would follow @JaimePrilusky's suggestion to prepend rather than append cmso, e.g., cmso_x_coord.

  • For consistency, we might want to use cmso_frame_id instead of cmso_frame.

Copy link
Member

@pcmasuzzo pcmasuzzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good to me, this PR seems a big step forward towards a couple of issues (role of config file in the library and standard names for files).

After @sbesson review, for me this is good to go.

README.rst Outdated
this will create the *.ini configuration file*

This file should look something like this:
+ **step 1** - create a `cell_track_dpkg.ini` configuration file and place it in the same directory as your tracking file. The file must be structured as follows:

.. code-block::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for having deleted the writeConfigFile.py.

"schema": {
"foreignKeys": [{
"fields": "SPOT_ID",
"reference": {
"resource": "objectsTable",
"resource": "objects_table",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going towards spec compliance, nice!

Z_COORD_NAME = "z_coord_cmso"
FRAME_NAME = "frame_cmso"
OBJECT_NAME = "object_id_cmso"
LINK_NAME = "link_id_cmso"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simleo - the fields in this names file are standard, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file (does not necessarily have to be called names.py, we can change that here or in the future) is meant to be the single source where our resource / column names are defined. Everywhere else in the code we should use these vars (e.g., names.OBJECT_NAME) rather than hardwire the values. I have kept the current values (e.g., "object_id_cmso"), but they can change depending on what we agree upon for the spec (keeping in mind datapackage spec compliance, of course). Indeed, I have proposed some changes myself in the PR description.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -14,8 +14,9 @@ def push_to_pandas(directory, object_id_cmso):
storage = dp.push_datapackage(descriptor=descr, backend='pandas')
print(storage.buckets)

objects = storage['objects___objectstable']
links = storage['links___linkstable']
# FIXME: the following is hardwired
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simleo - can we get this from the names file as well?

Copy link
Member Author

@simleo simleo Apr 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked more deeply into this issue. The push_datapackage function generates those keys based on the resource's path and name via a convert_path function:

    table = os.path.splitext(path)[0]
    table = table.replace(os.path.sep, '__')
    if name is not None:
        table = '___'.join([table, name])
    table = re.sub('[^0-9a-zA-Z_]+', '_', table)
    table = table.lower()
    return table

Until datapackage version 0.x this was implemented in the public API (from datapackage.mappers import convert_path), but they have moved it to an internal function in datapackage.pushpull for the upcoming 1.0.0 version (they are at v1.0.0-alpha4 now). This is not good. I will open an issue on their repo. We can implement a workaround in the meantime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they have this already scheduled for a change, see frictionlessdata/datapackage-py#110 which they "merged" into frictionlessdata/datapackage-py#132.

@simleo
Copy link
Member Author

simleo commented Apr 26, 2017

Before this is merged can someone please double check that at least one example runs correctly (e.g., python3 create_dpkg.py bloboverlap15_spots.csv), since current test coverage is pretty low. Note that ICY examples are currently broken (#12).

@pcmasuzzo
Copy link
Member

pcmasuzzo commented Apr 27, 2017

@simleo I have tested this for the CellProfiler and the TrackMate examples, all good.
Obviously did not test on ICY, because broken as for #12.

One last thing before merging. The README should be updated in the line:

step 1: install the library (I did python setup.py install on a Windows 10 machine, worked like charm)
step 2: make sure you have a .ini file (as already explained)
step 3: run "python create_dpkg.py your-tracking-file" (instead of "run python dpkg.py your-tracking-file") (need to point out that the create_dpkg is in the scripts folder?) .

Do you mind committing these edits before we proceed with the merging? Thanks.

@pcmasuzzo
Copy link
Member

Thanks! For me this is good to go.

@sbesson sbesson modified the milestone: 0.2.0 Apr 27, 2017
Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed earlier today with @simleo @pcmasuzzo @gsergeant. Two minor comments about using relative imports within the package but these can be sorted out in upcoming PRs.

Overall, the strategy of defining and sourcing constants in a dedicated module makes sense and I don't have an obvious better suggestion than names at the moment. Merging so that we can move forwards with the package renaming (see #13)

@@ -7,6 +7,8 @@
import xlrd
from xlrd import XLRDError

import dpkg.names as names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use relative imports within the package?

from jsontableschema import infer
import dpkg.names as names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use relative imports within the package?

@sbesson sbesson merged commit b06dfe2 into CellMigStandOrg:master Apr 27, 2017
@simleo simleo deleted the standard_names branch April 27, 2017 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants