Skip to content
This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

Improve information obtainable from Dataset version property #230

Open
hayfield opened this issue Nov 8, 2017 · 11 comments
Open

Improve information obtainable from Dataset version property #230

hayfield opened this issue Nov 8, 2017 · 11 comments
Labels
datasets Relating to IATI Datasets. enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). versions Relating to IATI Version Numbers.

Comments

@hayfield
Copy link
Contributor

hayfield commented Nov 8, 2017

An iati.Dataset has a version property that specifies the version of the Standard that the Dataset is specified against.

At present, there are a range of different situations that may occur:

  1. A valid Standard version is specified (1.01, [...], 2.02)
  2. A valid Standard version is specified, but there is whitespace (at present this is stripped - 1.01, [...], 2.02)
  3. No version is specified (at present this becomes 1.01)
  4. Various Standard versions are specified and the 1.0x combination rules are followed, leading to something detectable (1.01, [...], 1.05)
  5. An invalid Standard version is specified and looks like a number, but is not a valid version (1, 17.6, -19.4, etc)
  6. An invalid Standard version is specified and it's a misc string (bob, jim, one point zero one, etc)
  7. An invalid Standard version is specified, but there is whitespace (at present this is stripped - 1, 17.6, -19.4, bob, jim, one point zero one, etc)
  8. Various Standard versions are specified and the 1.0x combination rules are followed, leading to no clear answer (None)

With these cases, at present:

  • 1, 5 and 6 will return the data as-is
  • 2 and 7 strip the whitespace and returns whatever else is there
  • 3 is a special case of 4
  • 4 does the magic combination as required by the IATI Standard.
  • 8 comes up with a magic value

This means that:

  • 1, 3 and 4 provide correct and useful information
  • 5 and 6 provide information that you need to remember to validate or deal with at a later point
  • 2 and 7 modify the data without saying
  • 8 is a special case that requires you know about the oddities of IATI

This is a mess.


As such, the property should be improved such that:

  • 1 returns the value as-is
  • 2 likely strip the whitespace and return the value, leaving notification of whitespace to a call to something in iati.validator
  • 3 and 4 return the value as detected via the 1.0x combination rules
  • 5, 6, 7 raise an error. This should be a custom error in iati.exceptions that inherits from ValueError and has an attribute containing the located data.
  • 8 raises an error with information about the 1.0x combination rules

This proposal would change use of this property from:

version_num = dataset.version
if version_num in iati.constants.STANDARD_VERSIONS:
  if version_num == '1.01':
    # hope this was detected and not guessed
  # do stuff

to:

try:
  version_num = dataset.version
except iati.exceptions.DataProblem as err: # nb: error name needs sorting out
  raise err
  # OR
  log(err.msg)
  handle_in_some_manner(err)
  # OR
  gracefully_handle_in_manner_appropriate_to_needs(err)
  # OR, if you know what you're doing
  version_num = err.data # no, really, give me the questionable data
# do stuff

This alternative is more Pythonic, provides more information at the point of failure, and simplifies the potential return values.

It also generalises far better to other pipleline work (#187) where the current code would look far clunkier (due to a lack of relevant-constant).

@hayfield hayfield added datasets Relating to IATI Datasets. enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). labels Nov 8, 2017
@hayfield hayfield mentioned this issue Nov 8, 2017
15 tasks
hayfield added a commit that referenced this issue Nov 8, 2017
Asdetailed in #230, there are a number of issues with the current behavior of dataset.version.
This provides a clunky workaround in a situation where the improved behavior would be beneficial.
@dalepotter
Copy link
Contributor

3 makes something up

3 fails silently, and returns something

I disagree. in the case where no version is specified, the Standard specifies that “1.01” should be assumed if not specified. It is required to specify this attribute if the document is using features specific to an IATI specification other than the initial 1.01 version..

@hayfield
Copy link
Contributor Author

hayfield commented Nov 9, 2017

in the case where no version is specified, the Standard specifies that [...]

So it does! (and it's mandatory at 2.0x) In which case, 3) is a particular case within 4) and should be handled as such.

Have edited the main post to fix this.

@dalepotter
Copy link
Contributor

After some internal team discussion on the merits of making this change, I am now inclined to support this change as it is more pythonic yet still allows access to invalid data from elements, albeit it is required to parse an error exception object.

However, it would be good to get the feedback from data users who:

  1. Have used IATI data in anger and,
  2. Done so using python.

Therefore, @andylolz @markbrough @akmiller01 - do you have any thoughts?

This change will likely set a precedent for the way in which we extend pyIATI in future to normalise IATI data (i.e. significantly reduce the complexity required to handle data in versions1.x and 2.x of the Standard) and provide APIs for developers (and the IATI Tech Team) to more easily create and maintain robust end user tools.

@markbrough
Copy link

So first thought is that it would probably be useful to handle the case of a dataset having multiple versions. For example, the Datastore returns the version against each iati-activity within the iati-extra namespace. Though v2 Standard (unlike v1) only allows publishers to publish a version in the root element (iati-activities) and not per activity, there is clearly a use case for considering a dataset as a combination of iati-activity from multiple versions (e.g. if a third party made a dataset of all activities for Bangladesh).

I don't think 8 should return an error, I think it should return a list of multiple versions.

This might be a bit off topic though, because I am not really sure what you would do with the result of a Dataset? Is this something I would get out of the Datastore, or just if I am trying to validate a single package on the Registry? (I am not being facetious, I am just unclear what happens with the output of this)

@akmiller01
Copy link

+1 to @markbrough comments.

Had my first experience with pyIATI this week; just some simple testing to get used to it. I'm not really sure what value the validation is adding when parsing something like output from the Datastore. Each activity has a different version, but because the iati-activities element is missing a version the package is parsing the whole thing as 1.01. Some pdb.set_trace() output pasted below:

(Pdb) activities
<iati.data.Dataset object at 0x00000000025B8D30>
(Pdb) dir(activities)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_raw_source_at_line', '_xml_str', '_xml_tree', 'source_around_line', 'source_at_line', 'version', 'xml_str', 'xml_tree']
(Pdb) activities.version
'1.01'
(Pdb) activities.xml_tree.xpath('iati-activities/iati-activity[1]' )[0].attrib["{%s}version" % iati_extra]
'2.01'

And beyond that, the Dataset object just seems to pull in the standard behavior from the lxml.etree function. If it weren't for the iati.default.codelist function in the package, it would be much easier just to parse Datastore output as XML with lxml.etree and raise your own errors where needed.

@hayfield
Copy link
Contributor Author

hayfield commented Nov 9, 2017

I don't think 8 should return an error, I think it should return a list of multiple versions.

The IATI Standard itself only permits a single version to be specified - something like the contents of the Datastore's iati-extra namespace extends this through use of a custom namespace. In the pyIATI case, you'd likely deal with this by creating a subclass that inherits from an iati.Dataset and extends / modifies the base functionality in the desired manner based on the intended use of the custom namespace.

because the iati-activities element is missing a version the package is parsing the whole thing as 1.01.

The Datastore does not output valid IATI XML. As such, it is necessary to convert it into a version that is valid IATI XML before pyIATI will handle it in the manner that is likely desired.

And beyond that, the Dataset object just seems to pull in the standard behavior from the lxml.etree function.

A Dataset object is a container to deal with the complexities of IATI. At this point in time this work hasn't been undertaken, but this issue is proposing an exception mechanism within which this work can be undertaken.

@markbrough
Copy link

The IATI Standard itself only permits a single version to be specified

This does not appear to be true for version 1. Also, there is clearly a use case for this in derivative datasets or third-party tools even if this is not something that publishers should be doing in their own data.

The Datastore does not output valid IATI XML. As such, it is necessary to convert it into a version that is valid IATI XML before pyIATI will handle it in the manner that is likely desired.

So yeah, maybe the confusion stems from the fact that I am not really sure what I am supposed to use this for. Validating real-world data coming out of real-world tools seems within scope? Why would I want to convert the data from the Datastore, just to pipe it through pyIATI? I am not sure where that gets me. Or does the Datastore output need to be corrected to output valid IATI XML? Again though maybe I am missing the point of pyIATI.

@hayfield
Copy link
Contributor Author

hayfield commented Nov 9, 2017

This does not appear to be true for version 1.

More specifically, a data file may only be a single version - should multiple be specified, the combination rules must be followed to determine the single version that the data is specified at.

Also, there is clearly a use case for this in derivative datasets or third-party tools

Aye, hence inheritance and custom namespaces are things.

Or does the Datastore output need to be corrected to output valid IATI XML?

Pretty much this.

The XML output from the Datastore (or any unmodified subtree at a scope greater than <iati-activity>) does not validate against any IATI Schema. As such, it needs munging into valid IATI XML for tooling that works with data that conforms to the IATI Standard. It cannot be fed raw and expected to be IATI XML.

@andylolz
Copy link
Contributor

andylolz commented Nov 12, 2017

in the case where no version is specified, the Standard specifies that [...]

So it does!

Are we talking about the organisation standard here, too? For iati-organisations, there doesn’t appear to be a default version. For iati-organisation, the default is apparently “1.0” 🤔

@andylolz
Copy link
Contributor

Until this thread I had no idea that in 1.0x, @version could also be declared on iati-activity / iati-organisation! I’m really surprised that’s there (and that it defaults to something other than the version on the parent element.)

It seems like in practice, it’s pretty widely omitted, which would break these combination rules you’ve created. About 139 publishers get it wrong for activity files, compared with only 15 publishers getting it right. The numbers are worse for org files – something like 165 publishers get it wrong; just 6 get it right. So if you implement this as proposed, the version attribute will mostly be throwing exceptions for 1.0x datasets. (And you’ll probably be parsing a lot of datasets as v1.01 that probably aren’t actually v1.01.)

So… I’m not at all sure about these combination rules. I absolutely agree that the 1.0x standard is really unclear (and arguably wrong) here – but I’m not sure the solution is to rigidly follow what’s (not) there. After all, very few publishers are following it. So I’m not sure the value in retrospectively enforcing rules that have not been codified before.

I guess that leads me to make the same comment others have made above – that it really depends what your usecases are for pyIATI. I gather you’re after a to-the-letter implementation of the standard as it stands… My sense is: the data that exists in the registry doesn’t follow that.

@hayfield
Copy link
Contributor Author

For iati-organisation, the default is apparently “1.0”

More evidence of the mystical v1.00! :D Though also, ARGH!!!

that it really depends what your usecases are for pyIATI

At the moment, it's deemed that a pyIATI Dataset contains data at a single version. Should there be data at multiple versions, multiple pyIATI Datasets should be created... which seems like something that a utility function could help with.

I’m not at all sure about these combination rules. I absolutely agree that the 1.0x standard is really unclear (and arguably wrong) here – but I’m not sure the solution is to rigidly follow what’s (not) there

To be honest, nor am I.

@dalepotter knows more about them than I do, though it seems like a reasonable way to work within the above design concept until a reasonable validation-based approach is implemented (current thought: validate the data against a bunch of versions, the one with the fewest errors is probably correct; maybe mixing in something about seeing at what version the newest element or attribute was added).

@hayfield hayfield added the versions Relating to IATI Version Numbers. label Feb 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
datasets Relating to IATI Datasets. enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). versions Relating to IATI Version Numbers.
Projects
None yet
Development

No branches or pull requests

5 participants