Module:SLAXML
Note - this module has not yet been tested thoroughly on Meta. Some features may not work.
SLAXML is a pure-Lua SAX-like streaming XML parser. It is more robust than many (simpler) pattern-based parsers that exist, properly supporting code like <expr test="5 > 7" />, CDATA nodes, comments, namespaces, and processing instructions.
It is currently not a truly valid XML parser, however, as it allows certain XML that is syntactically-invalid (not well-formed) to be parsed without reporting an error.
Features
- Pure Lua in a single file (two files if you use the DOM parser).
- Streaming parser does a single pass through the input and reports what it sees along the way.
- Supports processing instructions (<?foo bar?>).
- Supports comments (<!-- hello world -->).
- Supports CDATA sections (<![CDATA[ whoa <xml> & other content as text ]]>).
- Supports namespaces, resolving prefixes to the proper namespace URI (<foo xmlns="bar"> and <wrap xmlns:bar="bar"><bar:kittens/></wrap>).
- Supports unescaped greater-than symbols in attribute content (a common failing for simpler pattern-based parsers).
- Unescapes named XML entities (&lt; &gt; &amp; &quot; &apos;) and numeric entities (e.g. &#10;) in attributes and text nodes (but, properly, not in comments or CDATA). Properly handles edge cases like &#38;amp;.
- Optionally ignores whitespace-only text nodes (as appear when indenting XML markup).
- Includes a DOM parser that is both a convenient way to pull in XML to use as well as a nice example of using the streaming parser.
- Does not add any keys to the global namespace.
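The entity handling listed above can be tried in isolation. The following is a minimal sketch modelled on the substitution pattern in the module source, not the module's own code: the names entityMap and unescape are illustrative, and it assumes only decimal character references below 256 need decoding.

```lua
-- Sketch only: a standalone version of the entity unescaping described above.
-- entityMap and unescape are illustrative names, not a public SLAXML API.
local entityMap = { lt = "<", gt = ">", amp = "&", quot = '"', apos = "'" }

local function unescape(str)
  -- The extra parentheses drop gsub's second return value (the match count).
  return (str:gsub("(&(#?)([%d%a]+);)", function(orig, hash, body)
    if hash == "#" then
      local code = tonumber(body, 10)        -- decimal references only
      if code and code < 256 then return string.char(code) end
      return orig                            -- leave anything else untouched
    end
    return entityMap[body] or orig           -- unknown names pass through
  end))
end

print(unescape("5 &lt; 7 &amp;&amp; 7 &gt; 5"))  --> 5 < 7 && 7 > 5
print(unescape("&#38;amp;"))                     --> &amp;
```

Because gsub does not rescan its own replacements, &#38;amp; decodes to the literal text &amp; rather than collapsing all the way to a bare ampersand, which is the edge case mentioned in the list.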
Usage
local SLAXML = require 'slaxml'
local myxml = io.open('my.xml'):read('*a') -- read the whole file, not just the first line
-- Specify as many/few of these as you like
local parser = SLAXML:parser{
  startElement = function(name,nsURI) end, -- When "<foo" or "<x:foo" is seen
  attribute = function(name,value,nsURI) end, -- attribute found on current element
  closeElement = function(name,nsURI) end, -- When "</foo>" or "</x:foo>" or "/>" is seen
  text = function(text) end, -- text and CDATA nodes
  comment = function(content) end, -- comments
  pi = function(target,content) end, -- processing instructions e.g. "<?yes mon?>"
}
-- Ignore whitespace-only text nodes and strip leading/trailing whitespace from text
-- (does not strip leading/trailing whitespace from CDATA)
parser:parse(myxml,{stripWhitespace=true})
If you just want to see if it will parse your document correctly, you can simply do:
local SLAXML = require 'slaxml'
SLAXML:parse(myxml)
…which will cause SLAXML to use its built-in callbacks that print the results as seen.
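As a rough illustration of what a set of callbacks might do, the sketch below records an indented outline of a document. The names outline and depth are made up for the example, and the events are fed in by hand so that it runs without the module; in real use the same table would be passed to SLAXML:parser{...}. The hand-fed order follows the module source, where startElement fires first and the element's attributes follow.

```lua
-- Sketch only: a callback table that records an indented outline.
-- outline and depth are illustrative names. The calls at the bottom stand in
-- for the parser; with the module loaded you would instead write
--   SLAXML:parser(callbacks):parse(myxml)
local outline, depth = {}, 0

local callbacks = {
  startElement = function(name, nsURI)
    outline[#outline + 1] = string.rep("  ", depth) .. name
    depth = depth + 1
  end,
  attribute = function(name, value, nsURI)
    outline[#outline + 1] = string.rep("  ", depth) .. "@" .. name .. "=" .. value
  end,
  closeElement = function(name, nsURI)
    depth = depth - 1
  end,
  text = function(text)
    outline[#outline + 1] = string.rep("  ", depth) .. '"' .. text .. '"'
  end,
}

-- Event order for <root id="r1"><kid>hi</kid></root>
callbacks.startElement("root")
callbacks.attribute("id", "r1")
callbacks.startElement("kid")
callbacks.text("hi")
callbacks.closeElement("kid")
callbacks.closeElement("root")

print(table.concat(outline, "\n"))
```

Callbacks that are left out (comment and pi here) are simply skipped by the parser, so a table only needs the events it cares about.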
DOM Builder
If you simply want to build tables from your XML, you can alternatively:
local SLAXML = require 'slaxdom' -- also requires slaxml.lua; be sure to copy both files
local doc = SLAXML:dom(myxml)
The returned table is a 'document' comprising tables for elements, attributes, text nodes, comments, and processing instructions. See the following documentation for what each supports.
DOM Table Features
- Document - the root table returned from the SLAXML:dom() method.
  - doc.type: the string "document"
  - doc.name: the string "#doc"
  - doc.kids: an array table of child processing instructions, the root element, and comment nodes.
  - doc.root: the root element for the document
- Element
  - someEl.type: the string "element"
  - someEl.name: the string name of the element (without any namespace prefix)
  - someEl.nsURI: the namespace URI for this element; nil if no namespace is applied
  - someEl.attr: a table of attributes, indexed by name and index
    - local value = someEl.attr['attribute-name']: any namespace prefix of the attribute is not part of the name
    - local someAttr = someEl.attr[1]: a single attribute table (see below); useful for iterating all attributes of an element, or for disambiguating attributes with the same name in different namespaces
  - someEl.kids: an array table of child elements, text nodes, comment nodes, and processing instructions
  - someEl.el: an array table of child elements only
  - someEl.parent: reference to the parent element or document table
- Attribute
  - someAttr.type: the string "attribute"
  - someAttr.name: the name of the attribute (without any namespace prefix)
  - someAttr.value: the string value of the attribute (with XML and numeric entities unescaped)
  - someAttr.nsURI: the namespace URI for the attribute; nil if no namespace is applied
  - someAttr.parent: reference to the parent element table
- Text - for both CDATA and normal text nodes
  - someText.type: the string "text"
  - someText.name: the string "#text"
  - someText.value: the string content of the text node (with XML and numeric entities unescaped for non-CDATA text)
  - someText.parent: reference to the parent element table
- Comment
  - someComment.type: the string "comment"
  - someComment.name: the string "#comment"
  - someComment.value: the string content of the comment
  - someComment.parent: reference to the parent element or document table
- Processing Instruction
  - somePI.type: the string "pi"
  - somePI.name: the string name of the PI, e.g. <?foo …?> has a name of "foo"
  - somePI.value: the string content of the PI, i.e. everything but the name
  - somePI.parent: reference to the parent element or document table
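To make the table layout concrete, here is a small sketch that walks a document table and collects elements by name. The document is built by hand in the shape documented above so that the example runs without the module; addChild, newElement and findElements are hypothetical helpers, not part of SLAXML.

```lua
-- Sketch only: the tables below are hand-built in the documented shape; in real
-- use `doc` would come from SLAXML:dom(myxml).
local function newElement(name)  -- hypothetical helper
  return { type = "element", name = name, attr = {}, kids = {}, el = {} }
end

local function addChild(parent, node)  -- hypothetical helper
  node.parent = parent
  parent.kids[#parent.kids + 1] = node
  if node.type == "element" then parent.el[#parent.el + 1] = node end
  return node
end

-- Equivalent of <list><item>one</item><!-- note --><item>two</item></list>
local root = newElement("list")
local item1 = addChild(root, newElement("item"))
addChild(item1, { type = "text", name = "#text", value = "one" })
addChild(root, { type = "comment", name = "#comment", value = " note " })
local item2 = addChild(root, newElement("item"))
addChild(item2, { type = "text", name = "#text", value = "two" })
local doc = { type = "document", name = "#doc", kids = { root }, root = root }
root.parent = doc

-- Collect every descendant element with the given name, in document order.
local function findElements(node, name, found)
  found = found or {}
  for _, kid in ipairs(node.kids) do
    if kid.type == "element" then
      if kid.name == name then found[#found + 1] = kid end
      findElements(kid, name, found)
    end
  end
  return found
end

local items = findElements(doc, "item")
print(#items, items[1].kids[1].value, items[2].parent.name)
```

The same walk works from any element as well as from the document table, since both expose a kids array.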
Finding Text for a DOM Element
The following function can be used to calculate the "inner text" for an element:
function elementText(el)
  local pieces = {}
  for _,n in ipairs(el.kids) do
    if n.type=='element' then pieces[#pieces+1] = elementText(n)
    elseif n.type=='text' then pieces[#pieces+1] = n.value
    end
  end
  return table.concat(pieces)
end
local xml = [[<p>Hello <em>you crazy <b>World</b></em>!</p>]]
local para = SLAXML:dom(xml).root
print(elementText(para)) --> "Hello you crazy World!"
A Simpler DOM
If you want the DOM tables to be simpler to serialize, you can supply the simple option:
local dom = SLAXML:dom(myXML,{ simple=true })
In this case no table will have a parent attribute, elements will not have the el collection, and the attr collection will be a simple array (without values accessible directly via attribute name). In short, the output will be a strict hierarchy with no internal references to other tables, and all data represented in exactly one spot.
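Because a simple DOM is a strict hierarchy, it can be turned back into markup with a short recursive function. The sketch below is one way that might look, under several assumptions: the sample table is hand-built in the shape described above, escapeXML and serialize are illustrative names, and namespace information is ignored because prefixes are not stored on the nodes.

```lua
-- Sketch only: serialize a simple-mode DOM back to markup.
-- escapeXML and serialize are illustrative names, not part of SLAXML.
local function escapeXML(s)
  local map = { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;" }
  return (s:gsub('[&<>"]', map))
end

local function serialize(node)
  if node.type == "text" then
    return escapeXML(node.value)
  elseif node.type == "comment" then
    return "<!--" .. node.value .. "-->"
  elseif node.type == "pi" then
    return "<?" .. node.name .. " " .. node.value .. "?>"
  end
  -- Documents and elements both carry a kids array.
  local parts = {}
  for _, kid in ipairs(node.kids) do parts[#parts + 1] = serialize(kid) end
  local inner = table.concat(parts)
  if node.type == "document" then return inner end
  -- In simple mode attr is a plain array of attribute tables.
  local open = { node.name }
  for _, a in ipairs(node.attr) do
    open[#open + 1] = a.name .. '="' .. escapeXML(a.value) .. '"'
  end
  if inner == "" then return "<" .. table.concat(open, " ") .. "/>" end
  return "<" .. table.concat(open, " ") .. ">" .. inner .. "</" .. node.name .. ">"
end

-- Hand-built stand-in for SLAXML:dom(myXML,{ simple=true })
local dom = {
  type = "document", name = "#doc",
  kids = {
    { type = "element", name = "p",
      attr = { { type = "attribute", name = "title", value = 'a "b"' } },
      kids = {
        { type = "text", name = "#text", value = "1 < 2 " },
        { type = "element", name = "br", attr = {}, kids = {} },
      } },
  },
}

print(serialize(dom))  --> <p title="a &quot;b&quot;">1 &lt; 2 <br/></p>
```

CDATA sections come back as ordinary escaped text here, since the DOM does not record which text nodes were CDATA.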
Known Limitations / TODO
- Does not require or enforce well-formed XML. Certain syntax errors are silently ignored and consumed. For example:
  - foo="yes & no" is seen as a valid attribute
  - <root><child> invokes two startElement() calls but no closeElement() calls (as of v0.5.2 the parser then raises an error at the end of the input because elements are left unclosed)
  - <foo></bar> invokes startElement("foo") followed by closeElement("bar")
  - <foo> 5 < 6 </foo> is seen as valid text contents
- No support for custom entity expansion other than the standard XML entities (&lt; &gt; &quot; &apos; &amp;) and numeric ASCII entities (e.g. &#10;)
- XML Declarations (<?xml version="1.x"?>) are incorrectly reported as Processing Instructions
- No support for DTDs
- No support for extended (Unicode) characters in element/attribute names
- No support for charset
- No support for XInclude
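Since the streaming parser does not check that close tags match, a caller that needs that guarantee without the DOM builder could wrap its callbacks in a small stack check. This is only a sketch: checkedCallbacks is a hypothetical helper, not part of SLAXML, and the events are fed in by hand so that it runs without the module.

```lua
-- Sketch only: wrap a callback table so that mismatched close tags raise an
-- error. checkedCallbacks is a hypothetical helper, not part of SLAXML.
local function checkedCallbacks(inner)
  local open = {}     -- stack of {name, nsURI} for elements not yet closed
  local wrapped = {}
  for k, v in pairs(inner) do wrapped[k] = v end  -- pass other callbacks through

  wrapped.startElement = function(name, nsURI)
    open[#open + 1] = { name, nsURI }
    if inner.startElement then inner.startElement(name, nsURI) end
  end

  wrapped.closeElement = function(name, nsURI)
    local top = table.remove(open)
    if not top or top[1] ~= name or top[2] ~= nsURI then
      error(("Unexpected </%s>; expected </%s>"):format(
        name, top and top[1] or "(nothing open)"))
    end
    if inner.closeElement then inner.closeElement(name, nsURI) end
  end

  return wrapped
end

-- Events for <foo></bar>, fed by hand in place of SLAXML:parser(cb):parse(xml)
local cb = checkedCallbacks({})
cb.startElement("foo")
local ok, err = pcall(cb.closeElement, "bar")
print(ok, err)
```

This only covers tag pairing; the other cases in the list (bare ampersands, a stray < in text) would need checks inside the parser itself.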
History
- v0.5.2 2013-Nov-7
  - Lua 5.2 compatible
  - Parser now errors if it finishes without finding a root element, or if there are unclosed elements at the end.
  - (Proper element pairing is not enforced by the parser, but is, as in previous releases, enforced by the DOM builder.)
- v0.5.1 2013-Feb-18
  - <foo xmlns="bar"> now directly generates startElement("foo","bar") with no post callback for namespace required.
- v0.5 2013-Feb-18
  - Use the local SLAXML=require 'slaxml' pattern to prevent any pollution of the global namespace.
- v0.4.3 2013-Feb-17
  - Bugfix to allow empty attributes, i.e. foo=""
  - closeElement no longer includes namespace prefix in the name, includes the nsURI
- v0.4 2013-Feb-16
  - DOM adds .parent references
  - SLAXML.ignoreWhitespace is now :parse(xml,{stripWhitespace=true})
  - "simple" mode for DOM parsing
- v0.3 2013-Feb-15
  - Support namespaces for elements and attributes
    - <foo xmlns="barURI"> will call startElement("foo",nil) followed by namespace("barURI") (and then attribute("xmlns","barURI",nil)); you must apply the namespace to your element after creation.
    - Child elements without a namespace prefix that inherit a namespace will receive startElement("child","barURI")
    - <xy:foo> will call startElement("foo","uri-for-xy")
    - <foo xy:bar="yay"> will call attribute("bar","yay","uri-for-xy")
    - Runtime errors are generated for any namespace prefix that cannot be resolved
  - Add (optional) DOM parser that validates hierarchy and supports namespaces
- v0.2 2013-Feb-15
  - Supports expanding numeric entities e.g. &#34; -> "
  - Utility functions are local to parsing (not spamming the global namespace)
- v0.1 2013-Feb-7
  - Option to ignore whitespace-only text nodes
  - Supports unescaped > in attributes
  - Supports CDATA
  - Supports Comments
  - Supports Processing Instructions
License
Copyright (c) 2013 Gavin Kistner
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--[=====================================================================[
v0.5.2 Copyright © 2013 Gavin Kistner <!@phrogz.net>; MIT Licensed
See http://github.com/Phrogz/SLAXML for details.
--]=====================================================================]
local SLAXML = {
  VERSION = "0.5.2",
  -- Default callbacks: log each event as it is seen
  _call = {
    pi = function(target,content)
      mw.log(string.format("<?%s %s?>",target,content))
    end,
    comment = function(content)
      mw.log(string.format("<!-- %s -->",content))
    end,
    startElement = function(name,nsURI)
      mw.log(string.format("<%s%s>",name,nsURI and (" ("..nsURI..")") or ""))
    end,
    attribute = function(name,value,nsURI)
      mw.log(string.format(" %s=%q%s",name,value,nsURI and (" ("..nsURI..")") or ""))
    end,
    text = function(text)
      mw.log(string.format(" text: %q",text))
    end,
    closeElement = function(name,nsURI)
      mw.log(string.format("</%s>",name))
    end,
  }
}

-- Create a parser object that uses the supplied callbacks (or the defaults)
function SLAXML:parser(callbacks)
  return { _call=callbacks or self._call, parse=SLAXML.parse }
end
function SLAXML:parse(xml,options)
  if not options then options = { stripWhitespace=false } end

  -- Cache references for maximum speed
  local find, sub, gsub, char, push, pop = string.find, string.sub, string.gsub, string.char, table.insert, table.remove
  local first, last, match1, match2, match3, pos2, nsURI
  local unpack = unpack or table.unpack
  local pos = 1
  local state = "text"
  local textStart = 1
  local currentElement={}
  local currentAttributes={}
  local currentAttributeCt
  local nsStack = {}

  -- Named and numeric entity replacement for text and attribute values
  local entityMap = { ["lt"]="<", ["gt"]=">", ["amp"]="&", ["quot"]='"', ["apos"]="'" }
  local entitySwap = function(orig,n,s) return entityMap[s] or n=="#" and char(s) or orig end
  local function unescape(str) return gsub( str, '(&(#?)([%d%a]+);)', entitySwap ) end
  local anyElement = false

  -- Report any pending text between textStart and the start of the current match
  local function finishText()
    if first>textStart and self._call.text then
      local text = sub(xml,textStart,first-1)
      if options.stripWhitespace then
        text = gsub(text,'^%s+','')
        text = gsub(text,'%s+$','')
        if #text==0 then text=nil end
      end
      if text then self._call.text(unescape(text)) end
    end
  end
  -- Processing instructions: <?target content?>
  local function findPI()
    first, last, match1, match2 = find( xml, '^<%?([:%a_][:%w_.-]*) ?(.-)%?>', pos )
    if first then
      finishText()
      if self._call.pi then self._call.pi(match1,match2) end
      pos = last+1
      textStart = pos
      return true
    end
  end

  -- Comments: <!-- content -->
  local function findComment()
    first, last, match1 = find( xml, '^<!%-%-(.-)%-%->', pos )
    if first then
      finishText()
      if self._call.comment then self._call.comment(match1) end
      pos = last+1
      textStart = pos
      return true
    end
  end

  -- Resolve a prefix by searching the namespace stack from innermost to outermost
  local function nsForPrefix(prefix)
    for i=#nsStack,1,-1 do if nsStack[i][prefix] then return nsStack[i][prefix] end end
    error(("Cannot find namespace for prefix %s"):format(prefix))
  end
  -- Opening tag name, with optional prefix; the callback fires later in closeElement()
  local function startElement()
    anyElement = true
    first, last, match1 = find( xml, '^<([%a_][%w_.-]*)', pos )
    if first then
      currentElement[2] = nil
      finishText()
      pos = last+1
      first,last,match2 = find(xml, '^:([%a_][%w_.-]*)', pos )
      if first then
        currentElement[1] = match2
        currentElement[2] = nsForPrefix(match1)
        match1 = match2
        pos = last+1
      else
        currentElement[1] = match1
        -- No prefix: inherit the nearest default namespace, stored under '!'
        for i=#nsStack,1,-1 do if nsStack[i]['!'] then currentElement[2] = nsStack[i]['!']; break end end
      end
      currentAttributeCt = 0
      push(nsStack,{})
      return true
    end
  end

  -- One attribute, double- or single-quoted; xmlns declarations update the namespace stack
  local function findAttribute()
    first, last, match1 = find( xml, '^%s+([:%a_][:%w_.-]*)%s*=%s*', pos )
    if first then
      pos2 = last+1
      first, last, match2 = find( xml, '^"([^<"]*)"', pos2 ) -- FIXME: disallow non-entity ampersands
      if first then
        pos = last+1
        match2 = unescape(match2)
      else
        first, last, match2 = find( xml, "^'([^<']*)'", pos2 ) -- FIXME: disallow non-entity ampersands
        if first then
          pos = last+1
          match2 = unescape(match2)
        end
      end
    end
    if match1 and match2 then
      local currentAttribute = {match1,match2}
      local prefix,name = string.match(match1,'^([^:]+):([^:]+)$')
      if prefix then
        if prefix=='xmlns' then
          nsStack[#nsStack][name] = match2
        else
          currentAttribute[1] = name
          currentAttribute[3] = nsForPrefix(prefix)
        end
      else
        if match1=='xmlns' then
          nsStack[#nsStack]['!'] = match2
          currentElement[2] = match2
        end
      end
      currentAttributeCt = currentAttributeCt + 1
      currentAttributes[currentAttributeCt] = currentAttribute
      return true
    end
  end
  -- CDATA sections are reported as text, without unescaping
  local function findCDATA()
    first, last, match1 = find( xml, '^<!%[CDATA%[(.-)%]%]>', pos )
    if first then
      finishText()
      if self._call.text then self._call.text(match1) end
      pos = last+1
      textStart = pos
      return true
    end
  end

  -- End of an opening tag (">" or "/>"): fire startElement, then its attributes
  local function closeElement()
    first, last, match1 = find( xml, '^%s*(/?)>', pos )
    if first then
      state = "text"
      pos = last+1
      textStart = pos
      if self._call.startElement then self._call.startElement(unpack(currentElement)) end
      if self._call.attribute then for i=1,currentAttributeCt do self._call.attribute(unpack(currentAttributes[i])) end end
      if match1=="/" then
        pop(nsStack)
        if self._call.closeElement then self._call.closeElement(unpack(currentElement)) end
      end
      return true
    end
  end

  -- Closing tag: </foo> or </x:foo>
  local function findElementClose()
    first, last, match1, match2 = find( xml, '^</([%a_][%w_.-]*)%s*>', pos )
    if first then
      nsURI = nil
      for i=#nsStack,1,-1 do if nsStack[i]['!'] then nsURI = nsStack[i]['!']; break end end
    else
      first, last, match2, match1 = find( xml, '^</([%a_][%w_.-]*):([%a_][%w_.-]*)%s*>', pos )
      if first then nsURI = nsForPrefix(match2) end
    end
    if first then
      finishText()
      if self._call.closeElement then self._call.closeElement(match1,nsURI) end
      pos = last+1
      textStart = pos
      pop(nsStack)
      return true
    end
  end
  -- Main loop: alternate between scanning content and scanning an element's attributes
  while pos<#xml do
    if state=="text" then
      if not (findPI() or findComment() or findCDATA() or findElementClose()) then
        if startElement() then
          state = "attributes"
        else
          -- Plain text: skip ahead to the next "<" (or one character if already on one)
          first, last = find( xml, '^[^<]+', pos )
          pos = (first and last or pos) + 1
        end
      end
    elseif state=="attributes" then
      if not findAttribute() then
        if not closeElement() then
          error("Was in an element and couldn't find attributes or the close.")
        end
      end
    end
  end

  if not anyElement then error("Parsing did not discover any elements") end
  if #nsStack > 0 then error("Parsing ended with unclosed elements") end
end

return SLAXML