rexlib is a pure Python module for working with tokenized XML. It allows you to iterate through the tokens of an XML document just as you would iterate through items in a list, and it provides an API for manipulating tokens. By writing simple generator functions, you can build very focused XML filters that can be chained together into surprisingly complex, robust, and efficient XML transformations.
rexlib works at the lexical level and is not necessarily concerned with well-formedness or validity. It excels when you want to make precisely targeted changes to an XML document without disturbing the rest of the document. It's also effective when you want a simple way to extract element and attribute content from XML. It's well suited to XML that is not well formed, XML that contains syntax errors in markup, and XML where you want control over lexical details that are normally discarded by XML parsers. Text or source code containing islands of markup can also be usefully manipulated.
Because rexlib is relatively simple and implemented in Python, you can use it to create practical, roll-your-own solutions.
rexlib relies on Robert D. Cameron's REX shallow-parsing regular expression for tokenizing XML. REX does the tokenizing; rexlib is all about making it easy to work with tokenized XML. Combining regex-based shallow parsing with secondary parsing of individual tokens leads to an XML processing API that's more procedural in flavor than event-based and tree-based XML APIs and should feel comfortable to anyone accustomed to text processing.
rexlib isn't an XML API per se; however, it does provide an API for tokens. Each token is an instance of one of the following classes:
Text - Plain text: a run of text not containing the < character
Start - A start tag: <tag> or <tag att="val">
Empty - An empty tag: <tag/> or <tag att="val"/>
End - An end tag: </tag>
Comment - A comment: <!-- comment -->
PI - A processing instruction: <?target instruction?>
XmlDecl - An XML declaration: <?xml version="1.0" ...?>
Doctype - A DOCTYPE declaration: <!DOCTYPE tag ...>
Cdata - A CDATA section: <![CDATA[ literal <markup/> ]]>
Error - A markup error: a token starting with < that does not end with >
All token classes are subclasses of the abstract class Token. Here is the class hierarchy for tokens:

                           [Token]
                              |
   ------------------------------------------------------
   |       |          |       |        |        |       |
 Text    [Tag]     Comment    PI    Doctype   Cdata   Error
           |                  |
      -------------        XmlDecl
      |           |
[StartOrEmpty]   End
      |
  ---------
  |       |
Start   Empty

[...] = abstract class
Let's tokenize some XML and see what we get:
>>> from rexlib import *
>>> s = '<p>Hello <pre>rexlib</pre> World!</p>'
>>> tokens = tokenize(s)
Note that tokenize() is a generator function; it returns a generator object that yields tokens.
>>> type(tokens)
<type 'generator'>
The easiest way to see what's inside tokens is to wrap it in list().
>>> print list(tokens)
[Start('<p>'),
Text('Hello '),
Start('<pre>'),
Text('rexlib'),
End('</pre>'),
Text(' World!'),
End('</p>')]
You can also get at the tokens by iterating with a for loop
>>> tokens = tokenize(s)
>>> for token in tokens:
... print token
...
Start('<p>')
Text('Hello ')
Start('<pre>')
Text('rexlib')
End('</pre>')
Text(' World!')
End('</p>')
or you can invoke the generator's next() method
>>> tokens = tokenize(s)
>>> tokens.next() # next(tokens) works as well
Start('<p>')
>>> tokens.next()
Text('Hello ')
or by using a list comprehension.
>>> tokens = tokenize(s)
>>> [token.xml for token in tokens]
['<p>', 'Hello ', '<pre>', 'rexlib', '</pre>', ' World!', '</p>']
Note token.xml in the list comprehension above. All subclasses of Token have an xml attribute that stores the current serialization of the token. When each token is instantiated, it is parsed into its components (tag name, attributes, etc.). Unless you modify the token, token.xml is just the original (unaltered) XML string. As soon as you change the token in some way, it is reserialized (rebuilt from its components). Reserialization doesn't happen until you make a change (or manually call token.reserialize()). For example,
>>> token = Start('<p>')
>>> token.xml
'<p>'
>>> token['class'] = 'block' # assignment triggers reserialization
>>> token.xml
'<p class="block">'
>>> token.name = 'para' # assignment triggers reserialization
>>> token.xml
'<para class="block">'
It's worth noting the use of Start('<p>') in the first line of the example above. You'll rarely instantiate a token manually like this; normally you'll just use tokenize(). But for testing, it's easier to type Start('<p>') than
>>> tokenize('<p>').next().xml
'<p>'
The main advantage to using tokenize() is that it identifies the type of token (text or markup) and instantiates the proper class. It would be very tedious if you had to create new XML by typing
>>> tokens = iter([Start('<p>'), Text('ouch!'), End('</p>')])
It's much easier to type
>>> tokens = tokenize('<p>ahh!</p>')
When experimenting in the interactive interpreter, it's almost always better to assign your XML string to a variable. This way you can easily refresh your token generator. For example,
>>> s = '<p>some xml string ...</p>'
>>> tokens = tokenize(s)
>>> tokens.next()
Start('<p>')
>>> tokens.next()
Text('some xml string ...')
Say now that you want to start over in order to test something else. All you have to do is refresh the generator.
>>> tokens = tokenize(s)
>>> tokens.next()
Start('<p>')
Don't worry, this doesn't get expensive. Because of the lazy nature of generators/iterators, you only tokenize as much as you consume. tokenize(s) costs nothing; it's not until you start consuming tokens that any actual work happens. The example above is similar in effect to doing a seek(0) on a file object. For example,
>>> fin = open('some_file')
>>> print fin.read(12)
>>> fin.seek(0) # go back to beginning of file
If you want to loop over the same sequence of tokens several times, you can also convert the generator to a list and then emulate a token generator using iter().
>>> tokens = tokenize(s)
>>> token_list = list(tokens)
>>> tokens = iter(token_list) # first pass over sequence
>>> tokens.next()
Start('<p>')
>>> tokens = iter(token_list) # second pass over sequence
>>> tokens.next()
Start('<p>')
The advantage here is that the token list is reusable, while a token generator would be spent after the first pass. To pass a token list (rather than a token generator) to a rexlib filter (explained below), you'll usually need to wrap it with iter(), as in the sketch below.
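For instance, here's a minimal sketch (first_child_text is just a made-up filter for illustration; it calls tokens.next() internally, which is why a plain list won't do):

>>> def first_child_text(tokens):
...     """Yield the token immediately following each <p> start tag."""
...     for token in tokens:
...         if token.is_a(Start, 'p'):
...             yield tokens.next()   # .next() needs an iterator, not a list
...
>>> token_list = list(tokenize(s))
>>> filtered = first_child_text(iter(token_list))   # wrap the reusable list with iter()
>>> filtered.next()
Text('some xml string ...')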
Here's a simple example to whet your appetite; it's a token filter that changes tag names according to a mapping you supply.
>>> def tag_filter(tokens, mapping):
... """Rename tags per supplied mapping."""
... for token in tokens:
... if token.is_a(Tag) and token.name in mapping:
... token.name = mapping[token.name]
... yield token
...
The filter doesn't need to differentiate start, end, or empty tags; it only cares that any subclass of Tag has a name attribute that may need to be updated. Here's an example of how you might use tag_filter:
>>> s = '<p>...<ex/><br/>...<ex>...</ex>...</p>'
>>> d = { 'p': 'para',
... 'ex': 'extract'}
>>> tokens = tokenize(s)
>>> tokens = tag_filter(tokens, d)
>>> s = concat_tokens(tokens)
>>> s
'<para>...<extract/><br/>...<extract>...</extract>...</para>'
Extracting text or attribute values from XML is quite straightforward. Just iterate through the tokens looking for what you're interested in and accumulate it in whatever way is convenient.
Here's a rexlib solution to Nelson Minar's problem of extracting 'xmlUrl' attributes from an OPML file:
>>> from rexlib import *
>>> s = open('foo.opml').read()
>>> tokens = tokenize(s)
>>> for token in tokens:
... if token.is_a(StartOrEmpty) and 'xmlUrl' in token:
... print token['xmlUrl']
You could also write a simple generator function.
>>> def extract_xmlUrl_attributes(tokens):
... for token in tokens:
... if token.is_a(StartOrEmpty) and 'xmlUrl' in token:
... yield token['xmlUrl']
...
>>> tokens = tokenize(s)
>>> print list(extract_xmlUrl_attributes(tokens))
Here's a simple extraction problem lifted from an entry in Uche Ogbuji's O'Reilly weblog:
The idea is simply to print all verses containing the word 'begat' [in] Jon Bosak's Old Testament in XML, a 3.3MB document. A quick note on the characteristics of the file: it contains 23145 v elements containing each Bible verse and only text: no child elements. The v elements and their content represent about 3.2MB of the file's total 3.3MB.
The fact that the v elements contain only text makes this problem even easier. All we need to do is tokenize the ot.xml file, iterate through the tokens looking for <v> start tags, grab the next token (which we know will be text), and check it for 'begat'; if 'begat' is found, append the text to a list. Here's the code:
>>> from rexlib import *
>>>
>>> ot = open('ot.xml').read()
>>>
>>> l = []
>>>
>>> tokens = tokenize(ot)
>>> for token in tokens:
...     if token.is_a(Start, 'v'):
...         text = tokens.next().xml
...         if 'begat' in text:
...             l.append(text)
...
>>> print '\n'.join(l)
To make this problem a little more realistic, let's pretend the document is marked up a little more richly and that the v elements contain mixed content (i.e., text and child elements). Once we find a <v> start tag, we'll need a way to find its matching end tag so that we can examine the full content of the element. accumulate_tokens() is the rexlib function we'll use; it's another generator function. Here's the code leading up to using accumulate_tokens():
tokens = tokenize(ot)   # remember to reset the token generator
for token in tokens:
    if token.is_a(Start, 'v'):
        v_tokens = accumulate_tokens(token, tokens)
accumulate_tokens() looks at its first argument (which needs to be a Start or Empty token) and iterates through its second argument looking for the matching end token. accumulate_tokens() is a generator function, which means it returns a generator. That generator is now bound to v_tokens, above.
Remember that generators are lazy: no work is done until you start iterating through them. At this point in the code, we haven't actually accumulated any tokens. The simplest way to force evaluation is to wrap v_tokens in a list().

v_list = list(v_tokens)   # unwind the generator
We now have accumulated the tokens that comprise the current v element, and they exist as a list bound to v_list. It may be worth pointing out that v_tokens is now spent: accumulate_tokens() advanced through tokens until it found the matching end tag for token. When the for loop continues, it implicitly calls tokens.next(), picking up where we left off (at the token following the end tag of the element we just accumulated).
Now it's time to do something with the v element. Let's say the v element looks like the following:
<v>And <a href="#Seth">Seth</a> lived an hundred and five years, and begat <a href="#Enos">Enos</a>:</v>
Here it would be safe to concatenate the tokens into an XML string (markup included) and search for "begat". However, since we're probably only interested in finding "begat" in the text and would rather avoid finding it in, say, an attribute value, we need a way to target only the text tokens. The rexlib function concat_tokens() handles both cases.
>>> s = ('<v>And <a href="#Seth">Seth</a> lived an hundred and five years, '
... 'and begat <a href="#Enos">Enos</a>:</v>')
>>> tokens = tokenize(s)
>>> concat_tokens(tokens) # includes markup
'<v>And <a href="#Seth">Seth</a> lived an hundred and five years, and begat <a href="#Enos">Enos</a>:</v>'
>>> tokens = tokenize(s)
>>> concat_tokens(tokens, Text) # include only Text tokens
'And Seth lived an hundred and five years, and begat Enos:'
The second argument to concat_tokens() is used as a filter: it preserves only tokens of the type(s) specified. If you wanted only the start and end tags, you could use
>>> tokens = tokenize(s)
>>> concat_tokens(tokens, (Start, End))
'<v><a href="#Seth"></a><a href="#Enos"></a></v>'
Let's restate the previous code and finish it up.
>>> l = [] # to hold the 'begat' verses
>>> tokens = tokenize(ot)
>>> for token in tokens:
...     if token.is_a(Start, 'v'):
...         v_tokens = accumulate_tokens(token, tokens)
...         v_list = list(v_tokens)
...         if 'begat' in concat_tokens(v_list, Text):  # search text only
...             l.append(concat_tokens(v_list))         # append text and markup
...
Simple extraction is one possibility. But with just a little additional work, we can turn this code into a token filter that instead enriches the OT markup by adding an attribute to v elements that contain 'begat'.
>>> def annotate_begat(tokens):
... for token in tokens:
... if token.is_a(Start, 'v'):
... v_tokens = accumulate_tokens(token, tokens)
... v_list = list(v_tokens)
... if 'begat' in concat_tokens(v_list, Text):
... # add an annotate attribute to <v>
... token['annotate'] = 'begat'
... for token in v_list:
... yield token # yield the element we accumulated
... else:
... yield token # yield all other tokens
...
Note the use of yield, which makes annotate_begat() a generator function.
Here we have a very focused filter that does one thing well. It's almost always better to keep your filters simple and single-minded; you can chain multiple filters together with very little speed penalty. Except for when you have to use list() to accumulate tokens, the effect of chaining generators is that each token travels through the entire chain of filters before the next token starts, similar to a Unix pipeline. As much as it seems like you must be iterating over the same sequence multiple times, it's more like you're iterating over the sequence just once, with each token percolating through the filter chain.
>>> tokens = tokenize(s)
>>> tokens = annotate_begat(tokens)
>>> # tokens = annotate_disciple(tokens) # Here's how you would
>>> # tokens = some_other_filter(tokens) # chain filters.
>>> concat_tokens(tokens)
'<v annotate="begat">And <a href="#Seth">Seth</a> lived an hundred and five years, and begat <a href="#Enos">Enos</a>:</v>'
In fact, you'll need to keep this lazy execution in mind when wrapping filter chains in try/except blocks. As an example, let's add a filter that raises an exception:
>>> def error_filter(tokens):
... for token in tokens:
... raise RuntimeError, 'hit error'
... yield token
...
>>> tokens = tokenize(s)
>>> try:
... tokens = annotate_begat(tokens)
... tokens = error_filter(tokens)
... except RuntimeError, value:
... print 'Caught error:', value
...
>>> concat_tokens(tokens)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "rexlib/token_filters.py", line 29, in concat_tokens
return ''.join([token.xml for token in tokens])
File "<stdin>", line 2, in error_filter
File "<stdin>", line 2, in annotate_begat
File "<stdin>", line 3, in error_filter
RuntimeError: hit error
Notice that the exception wasn't caught. That's because the generators don't "unwind" until concat_tokens(tokens) is run. concat_tokens() isn't magical; it's just the first bit of code that actually forces iteration through the tokens.
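To actually catch the error, move the try/except so that it wraps whatever forces iteration -- here, the concat_tokens() call:

>>> tokens = tokenize(s)
>>> tokens = annotate_begat(tokens)
>>> tokens = error_filter(tokens)
>>> try:
...     result = concat_tokens(tokens)   # iteration (and the error) happens here
... except RuntimeError, value:
...     print 'Caught error:', value
...
Caught error: hit error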
There have been occasions where I've written token filters thinking as if each filter iterates through the tokens completely before moving on to the next filter, only to find unexpected output. If you have a filter that depends on a previous filter having finished its job, you'll need to force execution by manually iterating or by wrapping with list().
tokens = tokenize(s)
tokens = filter1(tokens)
tokens = filter2(tokens)
token_list = list(tokens)    # causes filter1 and filter2 to run to completion
tokens = iter(token_list)
tokens = filter3(tokens)
tokens = filter4(tokens)
s = concat_tokens(tokens)    # causes filter3 and filter4 to run to completion

Or, alternately,

tokens = tokenize(s)
tokens = filter1(tokens)
tokens = filter2(tokens)
for token in tokens:
    ...   # a for loop also causes filter1 and filter2 to run to completion
Keep in mind that using list(tokens), not to mention concat_tokens(), loads all the tokens into memory at once; this can consume a lot of memory if you're working with very large XML files. Simple token filters are very memory friendly and fast, much like a pipeline.
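If memory is a concern, you can also stream filtered tokens straight to a file instead of building one big list or string; a minimal sketch (the output file name here is made up):

>>> fout = open('ot_annotated.xml', 'w')
>>> tokens = tokenize(open('ot.xml').read())
>>> tokens = annotate_begat(tokens)
>>> for token in tokens:
...     fout.write(token.xml)   # one token at a time; no token list is built
...
>>> fout.close()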
Each token type has its own API (methods, properties, attributes). All tokens inherit from an abstract base class, Token, which provides the following informal interface:
is_a(token_class) - Checks whether the current token (self) is an instance of token_class.

reserialize() - Rebuilds the token's xml attribute based on internal state. Whenever a change is made to the token, reserialize() is automatically called. About the only time you'll call reserialize() manually is when you've changed the template class attribute and want the token to reflect the change. See the template attribute, described below.

__repr__() - Controls the representation of the token in the interactive interpreter. By default, it shows only the first 45 characters of the xml attribute (controlled by the class attribute MAX_REPR_WIDTH); for example,

>>> Start('<very-long-tag-name att1="value1" att2="value2" att3="value3">')
Start('<very-long-tag-name att1="value1" att2="value2" ...')

xml - Stores the serialized form of the token.

template - String template used for reserialization. template is a class attribute, shared by all instances. If, for example, you wanted Empty tags to serialize as <tag /> rather than <tag/>, you could set the class attribute Empty.template = '<%s%s />' and write a token filter that invokes each Empty token's reserialize() method (see the sketch after this list). Setting Empty.template does not cause reserialization automatically because the class doesn't hold references to its instances. The default value for Empty.template is '<%s%s/>'.

encoding - Stores the encoding declared in a document's XML declaration. Defaults to sys.getdefaultencoding(). [TODO: What about processing fragments -- only use it if you want to be encoding-aware? How to handle fragments if internal Unicode fanciness is happening?]
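Here's the sketch promised in the template item above -- a filter that switches Empty tags to the <tag /> style (respace_empty_tags is just an illustrative name):

>>> Empty.template = '<%s%s />'         # class attribute; affects all Empty tokens
>>> def respace_empty_tags(tokens):
...     for token in tokens:
...         if token.is_a(Empty):
...             token.reserialize()     # rebuild xml using the new template
...         yield token
...
>>> concat_tokens(respace_empty_tags(tokenize('<p>text<br/></p>')))
'<p>text<br /></p>'
>>> Empty.template = '<%s%s/>'          # restore the default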
To the basic interface inherited from Token, the Text class adds one property, isspace. Text is the only token class that does not implement a reserialize() method -- not much point, since it, by definition, doesn't contain any markup. To modify a Text token, just assign directly to its xml attribute.

isspace - The value of isspace will be True if the token contains only whitespace; it's False otherwise.
The interface for Start and Empty tokens is the same. Both inherit from the abstract StartOrEmpty class. While you'll never see an instance of StartOrEmpty, it is useful when doing isinstance() tests. For example,
>>> start_token = Start('<tag att="value">')
>>> empty_token = Empty('<tag att="value"/>')
>>> start_token.is_a(StartOrEmpty)
True
>>> empty_token.is_a(StartOrEmpty)
True
>>> start_token.is_a(Empty)
False
Note that token.is_a(Start) is equivalent to isinstance(token, Start); however, is_a() reads better (to me at least) and, for Start, Empty, and End tokens (subclasses of Tag), is_a() lets you add one or more tag names as arguments to refine the test.
>>> token = Start('<p>')
>>> isinstance(token, Start)
True
>>> token.is_a(Start, 'p')
True
>>> token.is_a(Start, 'a', 'body', 'span')
False
For processing instructions, is_a() lets you specify targets (rather than tag names).
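For example (a quick sketch using an xml-stylesheet processing instruction):

>>> token = PI('<?xml-stylesheet href="style.css" type="text/css"?>')
>>> token.is_a(PI, 'xml-stylesheet')
True
>>> token.is_a(PI, 'php')
False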
Another useful idiom when you want to find one of a number of tags is
>>> tag_names = ['p', 'a', 'span', 'i', 'b', 'body']
>>> token.is_a(Start, *tag_names)
True
Remember, StartOrEmpty will match both start and empty tags; End will match end tags; and Tag will match start, empty, and end tags.
>>> token.is_a(StartOrEmpty), token.is_a(End), token.is_a(Tag)
(True, False, True)
is_a(token_class, *names) - Checks whether the current token (self) is an instance of token_class. You can also pass one or more tag names as arguments to refine the test.

attribute_name in token - Checks whether the token has an attribute named attribute_name; returns True or False.

del token[attribute_name] - Deletes the attribute named attribute_name if it exists; no error is raised if it doesn't exist.

set_attribute_order(attribute_order=['attr1', 'attr2'], sort=False) - Reorders attributes based on the attribute_order list. Any attributes listed in attribute_order will appear first (and in that order); any remaining attributes will follow in their original order. If sort is set to True, any remaining attributes will appear in case-insensitive sorted order. If you want to sort all attributes, use set_attribute_order(sort=True).

__getitem__, __setitem__, and __delitem__ - Attributes can be assigned, retrieved, and deleted using index notation on each token. Getting or deleting an attribute that is not present will not raise an exception.
>>> token = Start('<p>')
>>> token['class'] = 'block'   # assign attribute
>>> token
Start('<p class="block">')
>>> token['class']             # get attribute
'block'
>>> del token['class']         # delete attribute
>>> token
Start('<p>')
attributes - A dictionary-like object that preserves attribute order. You'll usually get and set attributes using index notation; see the __getitem__ description above for examples. attributes is an instance of AttributeDict, which adds three methods to the usual dictionary interface: has_key_nocase(), which simplifies matching attributes with inconsistent case; set_attribute_order(), which lets you specify attribute order; and to_xml(), which serializes the attributes as XML.

>>> token = Start('<p Class="block" indent="no">')
>>> token.attributes
{'Class': 'block', 'indent': 'no'}
>>> token.attributes.has_key_nocase('class')
True
>>> token.attributes.to_xml()
' Class="block" indent="no"'
>>> token.template % (token.name, token.attributes.to_xml())
'<p Class="block" indent="no">'
>>> token.set_attribute_order(['indent', 'Class'])
>>> token
Start('<p indent="no" Class="block">')
Note that to_xml() normalizes attribute value delimiters to double quotes. Any double quotes appearing in attribute values are escaped as &quot;. Adjust the source if you prefer single quotes.

>>> token = Start("""<p x='funky "quoted" attribute'>""")
>>> token
Start('<p x=\'funky "quoted" attribute\'>')
>>> token.attributes
{'x': 'funky "quoted" attribute'}
>>> token.attributes.to_xml()
' x="funky &quot;quoted&quot; attribute"'
Note that this normalization only happens if the token is modified (which triggers the reserialize() method).
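For instance, a token keeps its original quoting until something triggers reserialization:

>>> token = Start("<p x='abc'>")
>>> token.xml                  # untouched, so the single quotes survive
"<p x='abc'>"
>>> token['y'] = '1'           # the change triggers reserialize()
>>> token.xml
'<p x="abc" y="1">'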
name - The tag name.

ns_prefix - The namespace prefix, if present; an empty string otherwise.
Namespaces disclaimer: Since rexlib works mostly at the lexical level, it doesn't try to be sophisticated about namespaces. Tag names are treated as strings; you're free to map them to URIs and track scope as part of a token filter. However, if namespaces are important to your application, it wouldn't be hard to extend rexlib, say, to make is_a() tests work something like token.is_a(Start, (HTML_URI, 'p')) to match <html:p>, where "html" is actually mapped to a URI for purposes of comparison. Of course, each token would then need to store the namespace mappings that were in effect when it was instantiated. More practically, the Tag class could be used to store all known namespace mappings as they're encountered (with the mapping being visible to the Start, Empty, and End subclasses); this would be a much lighter-weight solution. The whole point of rexlib for me was that it was easy to extend whenever a new problem proved awkward to solve with XSLT, etc. So don't be afraid to read the source and modify it to solve the problems you face.
Let's first take a look at the Start token:
>>> s = '<p class="text" indent="no">'
>>> token = Start(s)
>>> token
Start('<p class="text" indent="no">')
>>> dir(token) # the list below is trimmed
['attributes', 'delete_attribute', 'has_attribute', 'is_a', 'local_part',
'name', 'prefix', 'reserialize', 'set_attribute_order', 'template', 'xml']
Here are examples of how the methods and attributes for Start tokens are used:
>>> token.xml
'<p class="text" indent="no">'
>>> token.name
'p'
Note that name is a property rather than a simple attribute so that when you assign a new name
>>> token.name = 'para'
reserialization is triggered.
>>> token.xml
'<para class="text" indent="no">'
Another property is ns_prefix.
>>> token = Start('<xhtml:p>')
>>> token.ns_prefix
'xhtml'
>>> token.ns_prefix = 'html'
>>> token.xml
'<html:p>'
>>> token.ns_prefix = ''
>>> token.xml
'<p>'
You can also change the namespace prefix by changing token.name.
>>> token.name = 'html:p'
>>> token.xml
'<html:p>'
XML attributes are stored in a special dictionary that keeps track of order.
>>> token.attributes
{'class': 'text', 'indent': 'no'}
>>> token.has_attribute('class')
True
>>> token.delete_attribute('class')
>>> token.xml
'<p indent="no">'
>>> token.is_a(Start)
True
>>> token.is_a(Start, 'p', 'para')
True
>>> token['class'] = 'newer_text'
>>> token.xml
'<p indent="no" class="newer_text">'
>>> token.set_attribute_order(['class', 'indent'])
>>> token.xml
'<p class="newer_text" indent="no">'
>>> token.name = 'para'
>>> token.xml
'<para class="newer_text" indent="no">'
>>> token.template
'<%s%s>'
>>> token.template % (token.name, token.attributes.to_xml())
'<para class="newer_text" indent="no">'
The Empty token is exactly the same as Start except for its template class attribute.
>>> Start.template
'<%s%s>'
>>> Empty.template
'<%s%s/>'
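Otherwise an Empty token behaves just like a Start token:

>>> token = Empty('<br clear="all"/>')
>>> token.name, token['clear']
('br', 'all')
>>> token.name = 'linebreak'
>>> token.xml
'<linebreak clear="all"/>'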
The End token does not have an attributes attribute and has a different template class attribute.
>>> End.template
'</%s>'
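End tokens still have a name, so renaming works the same way as for Start and Empty:

>>> token = End('</p>')
>>> token.name
'p'
>>> token.name = 'para'
>>> token.xml
'</para>'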
The Text token is the most primitive; it has only one attribute.
>>> token = Text('Here is some text')
>>> token.xml
'Here is some text'
It also has an isspace property, used to test whether the token is all whitespace.
>>> token.isspace
False
>>> Text(' \t\r \n').isspace
True
Here are the basics of the PI token.
>>> s = '<?xml version="1.0" encoding="utf-8"?>'
>>> token = PI(s)
PI tokens have two useful attributes, target and instruction.
>>> token.target
'xml'
>>> token.instruction
'version="1.0" encoding="utf-8"'
Processing instructions will sometimes contain pseudo-attributes, as in the example above. You can read a processing instruction's pseudo-attributes just like you would attributes
>>> token['version']
'1.0'
>>> token['encoding']
'utf-8'
Note, however, that a PI token's pseudo-attributes are read-only.
>>> token['encoding'] = "ascii"
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object does not support item assignment
If you want to change a pseudo-attribute, you'll need to rewrite the whole instruction. For example,
>>> token.instruction = 'version="1.0" encoding="ascii"'
>>> token.xml
'<?xml version="1.0" encoding="ascii"?>'
XML Declarations (XmlDecl) are a subclass of PI and so have the same properties.
>>> token = XmlDecl('<?xml version="1.0" encoding="utf-8"?>')
>>> token.target
'xml'
>>> token.instruction
'version="1.0" encoding="utf-8"'
>>> token['version']
'1.0'
Doctypes have four properties: document_element, id_type, id_value, and internal_subset.
>>> token = Doctype('<!DOCTYPE x:body SYSTEM "/S:/xml/dtd/xhtml1-strict-prefixed.dtd" [<!ENTITY abc "xyz">]>')
>>> token.document_element
'x:body'
>>> token.id_type
'SYSTEM'
>>> token.id_value
'/S:/xml/dtd/xhtml1-strict-prefixed.dtd'
>>> token.internal_subset
'<!ENTITY abc "xyz">'
>>> token = Doctype('<!DOCTYPE x:body SYSTEM "/S:/xml/dtd/xhtml1-strict-prefixed.dtd" [<!ENTITY abc "xyz">]>')
>>> token.document_element = 'html'
>>> token.id_type = 'PUBLIC'
>>> token.id_value = '-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
>>> token.internal_subset = ''
>>> token.xml
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
CDATA sections (Cdata) have two properties, content and escaped_content, plus a to_text_token() method for converting the token into a Text token.
>>> token = Cdata('<![CDATA[ literal <markup/> ]]>')
>>> token.to_text_token()
Text(' literal <markup/> ')
>>> token.content
' literal <markup/> '
>>> token.escaped_content
' literal &lt;markup/&gt; '
>>> token.content = 'abc'
>>> token.xml
'<![CDATA[abc]]>'
Comment tokens have a single property: content.
>>> token = Comment('<!-- A comment. -->')
>>> token.content
' A comment. '
>>> token.xml
'<!-- A comment. -->'
>>> token.content = 'A different comment. '
>>> token.xml
'<!--A different comment. -->'
Error tokens add a span attribute, which shows the token's location in the original string.
>>> s = '<p>Some <i text.</p>'
>>> tokens = tokenize(s)
>>> for token in tokens:
... if token.is_a(Error):
... print repr(token)
... print token.span
...
Error('<i ')
(8, 11)
>>> s[8:11]
'<i '
In the example above, the tokenizer will also report syntax errors on stderr, showing the exact location of the error.
Syntax error in markup: "<p>Some <i text.</p>"
This behavior can be disabled by setting the tokenizer's error_stream to None.
>>> tokens = tokenize(s, error_stream=None)
Explain the UTF-8 expectations of the SPE. Changing the SPE to use Unicode? PCRE has a DFA algorithm -- how to access pcre_test from Python?
Document the intent of Token.encoding and the Doctype trigger that updates the class attribute.
Show examples of the enumerate() idiom and why it's useful: it lets you do lookaheads by calling next() within a loop, makes it easy to keep track of the current index, and still lets you use continue to skip over some code and keep looping.
Add a Limitations section, giving examples where token processing can become onerous or error-prone.
Explain that SGML can be tokenized by using a modified shallow-parsing expression, provided that the SGML resembles XML (handling SGML's different PI and empty-tag syntax) -- although the lack of well-formedness makes SGML processing not terribly fun; show an example of making SGML well-formed (sgml -> xml), etc.
Note that assigning directly to token.xml (except for Text) should not be done if there's a chance that reserialization might be triggered later on: reserialize() overwrites token.xml based on internal state. (I'd rather not make token.xml a property.)
More real-world (simple, complex, and too-complex) examples.
Copyright 2013, David Niergarth