Skip to content
This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

Datasets do not validate as XML when created with a string containing a specified encoding #285

Open
dalepotter opened this issue Mar 7, 2018 · 8 comments
Labels
datasets Relating to IATI Datasets. enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). validation Changes to validation functionality.

Comments

@dalepotter
Copy link
Contributor

dalepotter commented Mar 7, 2018

Linked to #24, datasets with an encoding declared do not validate as XML.

This example shows the problem using code from the master branch (v0.3.0):

$ python
Python 3.6.0 (default, Dec 24 2016, 08:02:28) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iati
>>> dataset_xml_declaration_no_encoding = iati.Dataset("""
... <?xml version="1.0"?>
... <iati-activities version="xx">
...   <iati-activity>
...     <iati-identifier></iati-identifier>
...     <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...     <title>
...       <narrative>Xxxxxxx</narrative>
...     </title>
...     <description>
...       <narrative>Xxxxxxx</narrative>
...     </description>
...     <participating-org role="xx"></participating-org>
...     <activity-status code="xx"/>
...     <activity-date type="xx" iso-date="2013-11-27"/>
...     <activity-date type="xx" iso-date="2013-11-27">
...       <narrative>Xxxxxxx</narrative>
...     </activity-date>
...   </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_no_encoding)
True

vs. the same dataset with and encoding="UTF-8"? declared:

>>> dataset_xml_declaration_with_encoding = iati.Dataset("""
... <?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
...   <iati-activity>
...     <iati-identifier></iati-identifier>
...     <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...     <title>
...       <narrative>Xxxxxxx</narrative>
...     </title>
...     <description>
...       <narrative>Xxxxxxx</narrative>
...     </description>
...     <participating-org role="xx"></participating-org>
...     <activity-status code="xx"/>
...     <activity-date type="xx" iso-date="2013-11-27"/>
...     <activity-date type="xx" iso-date="2013-11-27">
...       <narrative>Xxxxxxx</narrative>
...     </activity-date>
...   </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding)
False

This latter XML (pastebin link for convenience) does validate as XML using two online XML validation sites: codebeautify and truugo

@dalepotter dalepotter added the bug This issue identifies and details a bug. label Mar 7, 2018
@hayfield
Copy link
Contributor

hayfield commented Mar 7, 2018

Noting that relevant tests are in test_data.py#TestDatasetWithEncoding

@hayfield hayfield added validation Changes to validation functionality. datasets Relating to IATI Datasets. labels Mar 7, 2018
@hayfield
Copy link
Contributor

hayfield commented Mar 7, 2018

An XML string must not have any leading whitespace, as both these examples do.

value_stripped = value.strip()
undertakes some amount of stripping of leading and trailing whitespace, though an explicit encoding may cause complications. There is currently only one test relating to leading whitespace - this doesn't seem fully comprehensive!

def test_dataset_xml_string_leading_whitespace(self):
"""Test Dataset creation with a valid XML string that is not IATI data."""
xml_str = iati.tests.resources.load_as_string('leading_whitespace_xml')
data = iati.Dataset(xml_str)
tree = etree.fromstring(xml_str.strip())
assert data.xml_str == xml_str.strip()
assert etree.tostring(data.xml_tree) == etree.tostring(tree)

@hayfield
Copy link
Contributor

hayfield commented Mar 7, 2018

>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
...  <iati-activities version="xx">
...    <iati-activity>
...      <iati-identifier></iati-identifier>
...      <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...      <title>
...        <narrative>Xxxxxxx</narrative>
...      </title>
...      <description>
...        <narrative>Xxxxxxx</narrative>
...      </description>
...      <participating-org role="xx"></participating-org>
...      <activity-status code="xx"/>
...      <activity-date type="xx" iso-date="2013-11-27"/>
...      <activity-date type="xx" iso-date="2013-11-27">
...        <narrative>Xxxxxxx</narrative>
...      </activity-date>
...    </iati-activity>
...  </iati-activities>
...  """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True

As premised, this is a problem that the provided string is not valid XML because it contains leading whitespace. This is therefore a problem with an explicit encoding in combination with leading whitespace (the automatic removal of which is deemed to be a feature of pyIATI).

I will update the title to better reflect this.

@hayfield hayfield changed the title iati.validator.is_xml fails with XML strings containing a defined encoding iati.validator.is_xml fails with XML strings containing leading whitespace and a defined encoding Mar 7, 2018
@hayfield hayfield changed the title iati.validator.is_xml fails with XML strings containing leading whitespace and a defined encoding Datasets do not strip leading whitespace upon creation when there is also a defined encoding Mar 7, 2018
@hayfield hayfield added enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). and removed enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). labels Mar 7, 2018
@dalepotter
Copy link
Contributor Author

I think the wrong string was tested! With no leading whitespace the same results come back...

No whitespace and no encoding

>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True

No whitespace and a UTF-8 encoding

>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
False

The error log tells us more...

>>> err_log = iati.validator.validate_is_xml(dataset_xml_declaration_with_encoding_3)
>>> len(err_log)
1
>>> err_log[0].name
'err-not-xml-not-string'
>>> err_log[0].info
"The value provided is a `<class 'str'>` rather than a `str`."

But... A workaround?!

However, when it is encoded to a bytes object all is well...

>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """.encode())
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
True

@hayfield mentioned that all tests for validation use bytes objects - I'd suggest adding some tests where we test strings.

@hayfield hayfield changed the title Datasets do not strip leading whitespace upon creation when there is also a defined encoding Datasets do not validate as XML when created with a string containing a specified encoding Mar 8, 2018
@hayfield
Copy link
Contributor

hayfield commented Mar 8, 2018

Due to the re-ordering of Dataset-creation operations in #286, the error occurs earlier under that branch. As such, that may be a better place to start from (also because it's a change that looks to explicitly separate how bytes and str objects are treated).

@hayfield
Copy link
Contributor

hayfield commented Mar 8, 2018

The underlying error raised by lxml is: ValueError('Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.',) - will look to improve the visibility of this message.

hayfield added a commit that referenced this issue Mar 8, 2018
lxml does not support strings with an encoding declaration. They must be bytes objects if there is an encoding declaration.
Previously, this error was grouped in with others. This separates two possible ValueErrors that lxml may raise so that it's clearer.

This issue was highlighted in #285
@hayfield hayfield self-assigned this Mar 8, 2018
@hayfield hayfield added enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). and removed bug This issue identifies and details a bug. labels Mar 8, 2018
@hayfield hayfield removed their assignment Mar 8, 2018
@hayfield
Copy link
Contributor

hayfield commented Mar 8, 2018

Changing from bug to enhancement since lxml does not support this feature, and so this would be some additional pyIATI functionality to convert strs to bytes where required.

@hayfield
Copy link
Contributor

hayfield commented Mar 8, 2018

NOTE: This is only a problem at Python 3 due to the changes to what a str is.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
datasets Relating to IATI Datasets. enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). validation Changes to validation functionality.
Projects
None yet
Development

No branches or pull requests

2 participants