NMLParser - an XML-object serializer for Python

Ideas for a generic language-independent object model to XML mapping

The aim of this package is to support convenient, language-independent serialisation of in-memory object trees to XML, without having to write a custom parser each time the object model changes. A number of packages already do this for a number of languages, but all have their minor irritations: either they produce XML in a format you wouldn't choose if you were writing it by hand (e.g. Java's XML serialisation, GNOSIS objectify for Python), or they require ugly coding styles that result from machine-generated source code (e.g. JAXB), or they need ugly mapping files to customise the type of XML they will cope with (e.g. Castor for Java).

This package aims to combine clear XML with clear code while minimising the effort to read or write object trees in XML.

Some of the issues can be illustrated with an example. Start with a class definition:

class X {
  int f1 = 123;
  String f2 = "abc";
  Y f3;
  Y f4;
  List<Z> f5;
  Ref<A> f6;
}
What should the XML look like? One option would be:
<X f1="123" f2="abc">
  <f3><Y>ystuff</Y></f3>
  <f4><Y>ystuff</Y></f4>
  <f5><Z>zstuff</Z><Z>zstuff</Z></f5>
  <f6>path/of/A</f6>
</X>

The problem here is that there are many possible ways of going from an object model to XML. Different people prefer different encodings, so each XML file that serialises an object model is likely to have chosen a different selection of mappings. You end up having to modify your parser just to read the new file format, when you ought to be able to feed a generic parser the class descriptions and have it figure out how to read the format.
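
As a rough illustration of feeding a generic parser the class descriptions, the sketch below uses a Python dataclass as the description and reads simple fields from XML attributes. It is not NMLParser's actual API: the function name from_xml and the use of dataclasses are assumptions made for this example, and only the simple fields f1 and f2 of X are modelled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass
class X:
    # Only the simple fields of the example class are modelled here.
    f1: int = 123
    f2: str = "abc"


def from_xml(cls, elem):
    """Build an instance of `cls` from `elem`, driven only by the class description.

    Hypothetical helper: each dataclass field is looked up as an XML attribute
    and converted with the field's declared type. Missing attributes keep the
    class defaults.
    """
    hints = get_type_hints(cls)
    kwargs = {}
    for field in fields(cls):
        raw = elem.get(field.name)
        if raw is not None:
            kwargs[field.name] = hints[field.name](raw)
    return cls(**kwargs)


print(from_xml(X, ET.fromstring('<X f1="7"/>')))  # X(f1=7, f2='abc')
```

Adding a field to X changes what the reader accepts without any change to from_xml, which is the property the text is asking for.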

Some design objectives are:
  • make the XML as concise and readable as possible
  • be systematic about mapping objects to XML
  • be unambiguous
  • make the common case look good

Some arbitrary choices

There are a number of arbitrary choices when it comes to serialising an object in XML, with the result that each developer ends up doing it differently. Most of these choices are a matter of style (e.g. use attributes not elements). Because people do things differently, it can be hard to read someone else's XML format without recoding a parser.

A) Attributes or elements for simple types?

(Where a simple type can be coded as a single line string, e.g. int, double, String).

Should it be
i)	<X f1="123"/>
or
ii)	<X><f1>123</f1></X>
?
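
A reader does not have to commit to one of the two forms. The sketch below accepts either, using Python's standard xml.etree.ElementTree; the helper name read_simple is made up for this example and is not part of the package.

```python
import xml.etree.ElementTree as ET


def read_simple(elem, name, convert=str):
    """Return simple field `name` from an attribute (i) or a child element (ii).

    Hypothetical helper: the attribute form is tried first, then a child
    element with the same name. Returns None if neither is present.
    """
    if name in elem.attrib:
        return convert(elem.attrib[name])
    child = elem.find(name)
    if child is not None and child.text is not None:
        return convert(child.text.strip())
    return None


as_attribute = ET.fromstring('<X f1="123"/>')
as_element = ET.fromstring('<X><f1>123</f1></X>')

print(read_simple(as_attribute, "f1", int))  # 123
print(read_simple(as_element, "f1", int))    # 123
```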

B) Are XML <elements> the class names, the field names, or both?

Should it be:

i)      <f3><Y>ystuff</Y></f3>
or
ii)	<f3>ystuff</f3>
or
iii)	<Y>ystuff</Y>
?

(iii) introduces an ambiguity if there is more than one field of class Y, but the common case is that there is only one field of a given class. (ii) has the disadvantage that it becomes harder to pull out the instances of Y from the file.
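
The ambiguity in (iii) can be detected mechanically from the class description. The sketch below is one possible approach, not the package's implementation: the field tables ONE_Y and TWO_Y and the function assign_children are hypothetical. A class-name tag is accepted only when exactly one field has that class, and a field-name tag is always accepted.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical class descriptions: field name -> class name of that field.
ONE_Y = {"f3": "Y"}
TWO_Y = {"f3": "Y", "f4": "Y"}


def assign_children(elem, fields):
    """Map each child element of `elem` to a field name.

    A tag equal to a field name is taken as that field (form ii).
    A tag equal to a class name is accepted only if exactly one field
    has that class (form iii); otherwise it is rejected as ambiguous.
    """
    class_counts = Counter(fields.values())
    by_class = {cls: name for name, cls in fields.items() if class_counts[cls] == 1}
    result = {}
    for child in elem:
        if child.tag in fields:
            result[child.tag] = child.text
        elif child.tag in by_class:
            result[by_class[child.tag]] = child.text
        else:
            raise ValueError(f"cannot map <{child.tag}> to a unique field")
    return result


print(assign_children(ET.fromstring("<X><Y>ystuff</Y></X>"), ONE_Y))    # {'f3': 'ystuff'}
print(assign_children(ET.fromstring("<X><f4>ystuff</f4></X>"), TWO_Y))  # {'f4': 'ystuff'}
```

With TWO_Y, the input <X><Y>ystuff</Y></X> raises ValueError, which is the ambiguous case described above.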

C) Do lists need an explicit tag?

i)
	<f5><Z>zstuff</Z><Z>zstuff</Z></f5>
ii)
	<Z>zstuff</Z><Z>zstuff</Z>
iii)
	<f5>zstuff</f5><f5>zstuff</f5>
iv)
	<listOfZ><Z>zstuff</Z><Z>zstuff</Z></listOfZ>

An explicit tag for lists reduces ambiguity but introduces an extra level in the hierarchy. The common case is for a class to have a single sublist, in which case (ii) is the best choice.
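
For reading, forms (i), (ii) and (iii) can all be accepted by one routine when the class has a single list field. The following is a minimal sketch under that assumption; read_list is a hypothetical name, and an empty wrapper element is not handled specially.

```python
import xml.etree.ElementTree as ET


def read_list(elem, field_name, item_class):
    """Collect the text of the list items for one list field of `elem`.

    Hypothetical helper, assuming the class has only one list field.
    """
    wrappers = elem.findall(field_name)
    if len(wrappers) == 1 and len(wrappers[0]) > 0:
        # Form (i): one <f5> wrapper containing <Z> children.
        return [item.text for item in wrappers[0].findall(item_class)]
    if wrappers:
        # Form (iii): the field name repeated once per item.
        return [item.text for item in wrappers]
    # Form (ii): <Z> items placed directly under the parent.
    return [item.text for item in elem.findall(item_class)]


form_i = ET.fromstring("<X><f5><Z>zstuff</Z><Z>zstuff</Z></f5></X>")
form_ii = ET.fromstring("<X><Z>zstuff</Z><Z>zstuff</Z></X>")
form_iii = ET.fromstring("<X><f5>zstuff</f5><f5>zstuff</f5></X>")

for doc in (form_i, form_ii, form_iii):
    print(read_list(doc, "f5", "Z"))  # ['zstuff', 'zstuff'] each time
```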

D) How are class names / namespaces resolved?

Assume classes are packaged into namespaces, with full names like "org.neuroml.channel.XYZ". How is this full name coded in XML?

i) Short names, "classpath" attribute included at top.
   <topelem classpath="org.neuroml.channel; org.neuroml.network">
...
	<XYZ>xyzstuff</XYZ>

+ : concise, avoids cluttering XML with namespaces
- : may need an extra mechanism to resolve ambiguities
ii) Full names in XML elements
	<org.neuroml.channel.XYZ>xyzstuff</org.neuroml.channel.XYZ>

+ : unambiguous
- : unreadable
iii) Using XML namespaces xmlns:n="org.neuroml.channel"
	<n:XYZ>xyzstuff</n:XYZ>

+ : uses XML standards
- : makes the XML hard to read; most of the time (i) is unambiguous.
iv) Using an attribute like "class=" or "type=" to disambiguate where necessary
	<XYZ class="org.neuroml.channel.XYZ">xyzstuff</XYZ>

+ : keeps the common case simple
- : the unusual case is verbose
    

Note that (i), (ii) and (iii) assume the tag is the class name. If we choose B(ii) (tag is the field name), we have to use an attribute whenever the object's class is a subclass of the one in the schema.
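
Options (i) and (iv) combine naturally: short names are resolved against the classpath, and a class attribute is consulted only when present. The sketch below shows one way this lookup could work. The registry KNOWN_CLASSES, the extra class names in it (org.neuroml.network.XYZ, org.neuroml.network.Cell) and both function names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry of the full class names the parser knows about.
KNOWN_CLASSES = {
    "org.neuroml.channel.XYZ",
    "org.neuroml.network.XYZ",
    "org.neuroml.network.Cell",
}


def parse_classpath(root):
    """Split the top element's classpath attribute on ';'."""
    return [p.strip() for p in root.get("classpath", "").split(";") if p.strip()]


def resolve_class(elem, classpath):
    """Return the full class name for `elem`.

    An explicit class="..." attribute wins (option iv). Otherwise the short
    tag is tried in every classpath package (option i) and must match
    exactly one known class.
    """
    explicit = elem.get("class")
    if explicit is not None:
        return explicit
    candidates = [f"{pkg}.{elem.tag}" for pkg in classpath]
    matches = [name for name in candidates if name in KNOWN_CLASSES]
    if len(matches) != 1:
        raise LookupError(f"<{elem.tag}> matches {len(matches)} classes: {matches}")
    return matches[0]


root = ET.fromstring(
    '<topelem classpath="org.neuroml.channel; org.neuroml.network">'
    '<Cell/><XYZ/><XYZ class="org.neuroml.channel.XYZ"/>'
    "</topelem>"
)
classpath = parse_classpath(root)
print(resolve_class(root[0], classpath))  # org.neuroml.network.Cell
print(resolve_class(root[2], classpath))  # org.neuroml.channel.XYZ
```

The middle element, a bare <XYZ/>, matches two packages in this made-up registry and raises LookupError, which is the case the class attribute exists to resolve.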

Observations

Having the tag as a class is good news, as it makes it simpler to browse an XML file and pick out all instances of a class.

Avoiding unnecessary levels in the hierarchy can also be good news - e.g. skipping the list field name if the class has a single list.