Datasets do not validate as XML when created with a string containing a specified encoding #285

dalepotter · 2018-03-07T16:43:48Z

Linked to #24, datasets with an encoding declared do not validate as XML.

This example shows the problem using code from the master branch (v0.3.0):

$ python
Python 3.6.0 (default, Dec 24 2016, 08:02:28) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iati
>>> dataset_xml_declaration_no_encoding = iati.Dataset("""
... <?xml version="1.0"?>
... <iati-activities version="xx">
...   <iati-activity>
...     <iati-identifier></iati-identifier>
...     <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...     <title>
...       <narrative>Xxxxxxx</narrative>
...     </title>
...     <description>
...       <narrative>Xxxxxxx</narrative>
...     </description>
...     <participating-org role="xx"></participating-org>
...     <activity-status code="xx"/>
...     <activity-date type="xx" iso-date="2013-11-27"/>
...     <activity-date type="xx" iso-date="2013-11-27">
...       <narrative>Xxxxxxx</narrative>
...     </activity-date>
...   </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_no_encoding)
True

vs. the same dataset with encoding="UTF-8" declared:

>>> dataset_xml_declaration_with_encoding = iati.Dataset("""
... <?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
...   <iati-activity>
...     <iati-identifier></iati-identifier>
...     <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...     <title>
...       <narrative>Xxxxxxx</narrative>
...     </title>
...     <description>
...       <narrative>Xxxxxxx</narrative>
...     </description>
...     <participating-org role="xx"></participating-org>
...     <activity-status code="xx"/>
...     <activity-date type="xx" iso-date="2013-11-27"/>
...     <activity-date type="xx" iso-date="2013-11-27">
...       <narrative>Xxxxxxx</narrative>
...     </activity-date>
...   </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding)
False

This latter XML (pastebin link for convenience) does validate as XML using two online XML validation sites: codebeautify and truugo

hayfield · 2018-03-07T16:59:07Z

Noting that relevant tests are in test_data.py#TestDatasetWithEncoding

hayfield · 2018-03-07T17:23:38Z

An XML string must not have any leading whitespace, as both these examples do.

pyIATI/iati/data.py

Line 87 in 5958810

value_stripped = value.strip()

undertakes some amount of stripping of leading and trailing whitespace, though an explicit encoding may cause complications. There is currently only one test relating to leading whitespace - this doesn't seem fully comprehensive!

pyIATI/iati/tests/test_data.py

Lines 46 to 53 in 5958810

    def test_dataset_xml_string_leading_whitespace(self):
        """Test Dataset creation with a valid XML string that is not IATI data."""
        xml_str = iati.tests.resources.load_as_string('leading_whitespace_xml')
        data = iati.Dataset(xml_str)
        tree = etree.fromstring(xml_str.strip())

        assert data.xml_str == xml_str.strip()
        assert etree.tostring(data.xml_tree) == etree.tostring(tree)
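
For anyone wanting to see the leading-whitespace rule in isolation, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the snippet has no dependencies), and the parses helper is hypothetical, not part of pyIATI. It only shows that an XML declaration must be the very first thing in the string, and that stripping the string first avoids the problem.

```python
import xml.etree.ElementTree as ET

# A declaration preceded by a newline, as in the triple-quoted examples above.
xml_with_leading_whitespace = '\n<?xml version="1.0"?>\n<root/>'


def parses(xml_str):
    """Return True if the standard-library parser accepts the string.

    Hypothetical helper for illustration only; pyIATI has its own validator.
    """
    try:
        ET.fromstring(xml_str)
    except ET.ParseError:
        return False
    return True


# Rejected: the declaration is not at the start of the document.
print(parses(xml_with_leading_whitespace))
# Accepted once leading and trailing whitespace is stripped.
print(parses(xml_with_leading_whitespace.strip()))
```

This mirrors what the strip() call in data.py is meant to guarantee before the string reaches the parser.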

hayfield · 2018-03-07T17:34:43Z

>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
...  <iati-activities version="xx">
...    <iati-activity>
...      <iati-identifier></iati-identifier>
...      <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...      <title>
...        <narrative>Xxxxxxx</narrative>
...      </title>
...      <description>
...        <narrative>Xxxxxxx</narrative>
...      </description>
...      <participating-org role="xx"></participating-org>
...      <activity-status code="xx"/>
...      <activity-date type="xx" iso-date="2013-11-27"/>
...      <activity-date type="xx" iso-date="2013-11-27">
...        <narrative>Xxxxxxx</narrative>
...      </activity-date>
...    </iati-activity>
...  </iati-activities>
...  """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True

As premised, the problem is that the provided string is not valid XML because it contains leading whitespace. This is therefore an issue with an explicit encoding in combination with leading whitespace (the automatic removal of which is deemed to be a feature of pyIATI).

I will update the title to better reflect this.

dalepotter · 2018-03-07T17:55:37Z

I think the wrong string was tested! With no leading whitespace the same results come back...

No whitespace and no encoding

>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True

No whitespace and a UTF-8 encoding

>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
False

The error log tells us more...

>>> err_log = iati.validator.validate_is_xml(dataset_xml_declaration_with_encoding_3)
>>> len(err_log)
1
>>> err_log[0].name
'err-not-xml-not-string'
>>> err_log[0].info
"The value provided is a `<class 'str'>` rather than a `str`."

But... A workaround?!

However, when it is encoded to a bytes object all is well...

>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """.encode())
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
True

@hayfield mentioned that all tests for validation use bytes objects - I'd suggest adding some tests where we test strings.

hayfield · 2018-03-08T09:08:05Z

Due to the re-ordering of Dataset-creation operations in #286, the error occurs earlier under that branch. As such, that may be a better place to start from (also because it's a change that looks to explicitly separate how bytes and str objects are treated).

hayfield · 2018-03-08T09:20:21Z

The underlying error raised by lxml is: ValueError('Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.',) - will look to improve the visibility of this message.

lxml does not support strings with an encoding declaration. They must be bytes objects if there is an encoding declaration. Previously, this error was grouped in with others. This separates the two possible ValueErrors that lxml may raise so that the cause is clearer. This issue was highlighted in #285.

hayfield · 2018-03-08T10:07:57Z

Changing from bug to enhancement since lxml does not support this feature, and so this would be some additional pyIATI functionality to convert strs to bytes where required.
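
As a rough idea of what that str-to-bytes conversion could look like, here is a minimal sketch. The to_parser_input name, the regular expression, and the decision to encode using the declared encoding are all assumptions for illustration; none of this is existing pyIATI code. It uses only the standard library and does not call lxml itself.

```python
import re

# Matches an XML declaration at the start of the string that carries an
# encoding pseudo-attribute, capturing the declared encoding name.
_ENCODING_DECL = re.compile(
    r"""^<\?xml\s+version\s*=\s*["'][^"']*["']\s+encoding\s*=\s*["'](?P<enc>[A-Za-z][A-Za-z0-9._-]*)["']"""
)


def to_parser_input(value):
    """Return a value that is safe to hand to an XML parser.

    Hypothetical helper: bytes pass through untouched; a str is stripped,
    and is encoded to bytes only if it declares an encoding.
    """
    if isinstance(value, bytes):
        return value
    stripped = value.strip()
    match = _ENCODING_DECL.match(stripped)
    if match is None:
        # No encoding declaration, so the str can be parsed as-is.
        return stripped
    # Encode with the declared encoding so the declaration stays truthful.
    # An unknown encoding name raises LookupError here.
    return stripped.encode(match.group("enc"))
```

With something like this in place, a str such as '<?xml version="1.0" encoding="UTF-8"?><a/>' would reach the parser as bytes, while a str without an encoding declaration would be left as a str. Whether the declared encoding should be trusted or always replaced with UTF-8 is a design choice this sketch does not settle.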

hayfield · 2018-03-08T10:34:37Z

NOTE: This is only a problem in Python 3, due to the change in what a str is (Unicode text rather than bytes).

dalepotter added the bug This issue identifies and details a bug. label Mar 7, 2018

hayfield added validation Changes to validation functionality. datasets Relating to IATI Datasets. labels Mar 7, 2018

hayfield changed the title ~~iati.validator.is_xml fails with XML strings containing a defined encoding~~ iati.validator.is_xml fails with XML strings containing leading whitespace and a defined encoding Mar 7, 2018

hayfield changed the title ~~iati.validator.is_xml fails with XML strings containing leading whitespace and a defined encoding~~ Datasets do not strip leading whitespace upon creation when there is also a defined encoding Mar 7, 2018

hayfield added enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). and removed enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). labels Mar 7, 2018

hayfield changed the title ~~Datasets do not strip leading whitespace upon creation when there is also a defined encoding~~ Datasets do not validate as XML when created with a string containing a specified encoding Mar 8, 2018

hayfield mentioned this issue Mar 8, 2018

Make visible an error message about unsupported values #287

Merged

hayfield self-assigned this Mar 8, 2018

hayfield added enhancement Some sort of new functionality (rather than fixing or tweaking something that already existed). and removed bug This issue identifies and details a bug. labels Mar 8, 2018

hayfield removed their assignment Mar 8, 2018
