Datasets do not validate as XML when created with a string containing a specified encoding #285
Comments
Noting that relevant tests are in |
An XML string must not have any leading whitespace, as both these examples do:
Line 87 in 5958810
pyIATI/iati/tests/test_data.py Lines 46 to 53 in 5958810
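The leading-whitespace point can be checked in isolation. The following is a minimal sketch that uses the standard library's `xml.etree.ElementTree` rather than pyIATI or lxml, so it only illustrates the XML rule itself; the helper name `parses` and the sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample data (not taken from the pyIATI test suite).
DECLARATION = '<?xml version="1.0"?>'
BODY = '<iati-activities version="xx"/>'


def parses(text):
    """Return True if the standard-library parser accepts `text`."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# An XML declaration is only permitted at the very start of the document,
# so a newline before it makes the document ill-formed.
with_leading_whitespace = "\n" + DECLARATION + BODY

print(parses(with_leading_whitespace))
print(parses(with_leading_whitespace.lstrip()))
```

Stripping the whitespace before parsing is the behaviour the thread describes as a pyIATI feature; here it is done by hand with `str.lstrip()`.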
>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True
As premised, the provided string is not valid XML because it contains leading whitespace. This is therefore a problem with an explicit encoding in combination with leading whitespace (the automatic removal of which is deemed to be a feature of pyIATI). I will update the title to better reflect this.
I think the wrong string was tested! With no leading whitespace the same results come back...
No whitespace and no encoding
>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True
No whitespace and a UTF-8 encoding
>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
False
The error log tells us more...
>>> err_log = iati.validator.validate_is_xml(dataset_xml_declaration_with_encoding_3)
>>> len(err_log)
1
>>> err_log[0].name
'err-not-xml-not-string'
>>> err_log[0].info
"The value provided is a `<class 'str'>` rather than a `str`."
But... A workaround?!
However, when it is encoded to a
>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """.encode())
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
True
@hayfield mentioned that all tests for validation use bytes objects - I'd suggest adding some tests where we test strings.
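One way the suggested string tests could look is sketched below. It is only an outline under stated assumptions: `is_xml` here is a stand-in built on the standard library, not `iati.validator.is_xml`, and the `SAMPLES` data and test class name are invented for the illustration. The idea is simply to run every sample as both a `str` and a `bytes` object and require the same answer.

```python
import unittest
import xml.etree.ElementTree as ET


def is_xml(candidate):
    """Stand-in validator (assumption): True if the stdlib parser accepts the input."""
    try:
        ET.fromstring(candidate)
    except (ET.ParseError, TypeError, ValueError):
        return False
    return True


# Hypothetical samples covering the three declaration variants from the thread.
SAMPLES = {
    "no_declaration": '<iati-activities version="xx"/>',
    "declaration_without_encoding": '<?xml version="1.0"?><iati-activities version="xx"/>',
    "declaration_with_encoding": '<?xml version="1.0" encoding="UTF-8"?><iati-activities version="xx"/>',
}


class TestIsXmlInputTypes(unittest.TestCase):
    def test_str_and_bytes_agree(self):
        # Each sample is checked twice: once as str, once as UTF-8 bytes.
        for name, text in SAMPLES.items():
            for value in (text, text.encode("utf-8")):
                with self.subTest(sample=name, input_type=type(value).__name__):
                    self.assertTrue(is_xml(value))
```

In the real test suite the stand-in would be replaced by the library's own validator, which is where the `str` case with an encoding declaration currently fails.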
Due to the re-ordering of Dataset-creation operations in #286, the error occurs earlier under that branch. As such, that may be a better place to start from (also because it's a change that looks to explicitly separate how |
The underlying error raised by lxml is: |
lxml does not support strings with an encoding declaration. They must be bytes objects if there is an encoding declaration. Previously, this error was grouped in with others. This separates two possible ValueErrors that lxml may raise so that it's clearer. This issue was highlighted in #285
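The bytes requirement described above suggests normalising input before it reaches the parser. The following is a minimal sketch, not pyIATI's actual fix: `to_parser_input` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` is used only to show that the resulting bytes parse (it does not reproduce lxml's error).

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration that carries an encoding pseudo-attribute
# and captures the declared encoding name.
_ENCODING_DECL = re.compile(
    r'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def to_parser_input(xml):
    """Hypothetical helper: return input that is safe to hand to a parser.

    - bytes are passed through unchanged;
    - str has leading whitespace stripped;
    - str with an encoding declaration is encoded to bytes using the
      declared encoding, since such strings must be bytes for lxml.
    """
    if isinstance(xml, bytes):
        return xml
    if not isinstance(xml, str):
        raise TypeError("expected str or bytes, got {0}".format(type(xml).__name__))
    stripped = xml.lstrip()
    match = _ENCODING_DECL.match(stripped)
    if match is None:
        return stripped
    return stripped.encode(match.group(1))


text = '\n  <?xml version="1.0" encoding="UTF-8"?>\n<iati-activities version="xx"/>'
prepared = to_parser_input(text)
print(type(prepared).__name__)
print(ET.fromstring(prepared).tag)
```

Encoding with the declared encoding keeps the declaration and the byte content consistent; an unrecognised encoding name would raise `LookupError`, which a real implementation would need to handle.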
Changing from |
NOTE: This is only a problem at Python 3 due to the changes to what a |
Linked to #24, datasets with an encoding declared do not validate as XML.
This example shows the problem using code from the master branch (v0.3.0):
vs. the same dataset with an encoding="UTF-8"? declared:
This latter XML (pastebin link for convenience) does validate as XML using two online XML validation sites: codebeautify and truugo