Ideas for a generic language-independent object model to XML mapping
The aim of this package is to support convenient, language-independent serialisation
of in-memory object trees to XML, without having to
write a custom parser each time the object model changes.
A number of packages already do this for a number
of languages, but each has its irritations: either
it produces XML in a format you wouldn't choose if you
were writing it by hand (e.g. Java's XML serialisation, GNOSIS objectify for Python),
or it requires the ugly coding style that comes with machine-generated source code
(e.g. JAXB), or it needs ugly mapping files to customise the kind of XML it can cope with
(e.g. Castor for Java).
This package aims to combine clear XML with clear code while
minimising the effort to read or write object trees in XML.
Some of the issues can be illustrated with an example. Start with a class definition:
class X {
int f1 = 123;
String f2 = "abc";
Y f3;
Y f4;
List<Z> f5;
Ref<A> f6;
}
What should the XML look like? One option would be:
<X f1="123" f2="abc">
<f3><Y>ystuff</Y></f3>
<f4><Y>ystuff</Y></f4>
<f5><Z>zstuff</Z><Z>zstuff</Z></f5>
<f6>path/of/A</f6>
</X>
The problem is that there are many possible ways of
going from an object model to XML. Different people
prefer different encodings, so each XML file that is
a serialisation of an object model is likely to use
a different selection of mappings. You end up having
to modify your parser just to read the new file format, when you
ought to be able to feed a generic parser the class descriptions
and have it figure out how to read the format.
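As a rough illustration of what "feeding the class description to generic code" could look like, here is a minimal sketch that uses Java reflection to write the simple-typed fields of an object as XML attributes. The names SimpleFieldWriter, isSimple, escape and toXml are hypothetical and not part of any existing package. The nested class X is a cut-down version of the example above, and fields are sorted by name only because reflection does not guarantee an order.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: not the API of any existing package.
class SimpleFieldWriter {

    // Cut-down version of class X above: only the simple-typed fields.
    static class X {
        int f1 = 123;
        String f2 = "abc";
    }

    // "Simple" here means a primitive, a Number, a Boolean, a Character or a String.
    static boolean isSimple(Class<?> type) {
        return type.isPrimitive() || type == String.class
                || Number.class.isAssignableFrom(type)
                || type == Boolean.class || type == Character.class;
    }

    // Escape the characters that are not allowed inside a double-quoted attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Write the object as an empty element named after its class,
    // with one attribute per non-null simple field.
    static String toXml(Object obj) {
        Class<?> cls = obj.getClass();
        Field[] fields = cls.getDeclaredFields();
        // Reflection does not guarantee field order, so sort by name for stable output.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        StringBuilder sb = new StringBuilder("<").append(cls.getSimpleName());
        for (Field f : fields) {
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers()) || !isSimple(f.getType())) {
                continue;
            }
            try {
                f.setAccessible(true);
                Object value = f.get(obj);
                if (value != null) {
                    sb.append(' ').append(f.getName())
                      .append("=\"").append(escape(value.toString())).append('"');
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return sb.append("/>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(new X())); // prints <X f1="123" f2="abc"/>
    }
}
```

Nothing in toXml mentions X, f1 or f2: the same code would cope with a changed class definition, which is the property the text is after.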
Some design objectives are:
- make the XML as concise and readable as possible
- be systematic about mapping objects to XML
- be unambiguous
- make the common case look good
Some arbitrary choices
There are a number of arbitrary choices when it comes to
serialising an object in XML, with the result that each
developer ends up doing it differently. Most of these
choices are a matter of style (e.g. use attributes not
elements). Because people do things differently, it
is hard to read someone else's XML format without
recoding a parser.
A) Attributes or elements for simple types?
(Where a simple type can be coded as a single line string,
e.g. int, double, String).
Should it be
i) <X f1="123"/>
or
ii) <X><f1>123</f1></X>
?
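On the reading side, one way to sidestep this choice is to accept both forms. The sketch below is only an illustration under that assumption: SimpleValueReader, parse and simpleValue are hypothetical names, and the parsing uses the standard JAXP DOM classes that ship with Java.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: a reader that tolerates both A(i) and A(ii).
class SimpleValueReader {

    // Parse an XML string and return its root element.
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Return the value of the simple field `name`, looking first for an
    // attribute (form i) and then for a direct child element (form ii).
    // Returns null if neither is present.
    static String simpleValue(Element parent, String name) {
        if (parent.hasAttribute(name)) {
            return parent.getAttribute(name);
        }
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                return n.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(simpleValue(parse("<X f1=\"123\"/>"), "f1"));
        System.out.println(simpleValue(parse("<X><f1>123</f1></X>"), "f1"));
    }
}
```

Both calls in main yield the same string, so the writer can prefer one style while the reader stays indifferent.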
B) Are XML <elements> the class names, the field names, or both?
Should it be:
i) <f3><Y>ystuff</Y></f3>
or
ii) <f3>ystuff</f3>
or
iii) <Y>ystuff</Y>
?
(iii) introduces an ambiguity if there is more than one field of
class Y, but the common case is that there is only
one field of a given class. (ii) has the disadvantage that
it becomes harder to pull the instances of Y out of the
file.
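The ambiguity in (iii) can be detected mechanically from the class description. The following sketch assumes the tag is the short class name and looks for the one field of that class. TagToField, fieldForTag and the nested example classes are hypothetical names used only for illustration.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: map a class-name tag, as in B(iii), to a field.
class TagToField {

    static class Y {}
    static class Z {}

    // Common case: only one field of class Y.
    static class OneY { Y f3; Z other; }

    // Ambiguous case: two fields of class Y, like f3 and f4 in class X.
    static class TwoY { Y f3; Y f4; }

    // Find the single field of `owner` whose declared type has the short
    // name `tag`. Throws if there is no such field or more than one.
    static Field fieldForTag(Class<?> owner, String tag) {
        List<Field> matches = new ArrayList<>();
        for (Field f : owner.getDeclaredFields()) {
            if (!f.isSynthetic() && f.getType().getSimpleName().equals(tag)) {
                matches.add(f);
            }
        }
        if (matches.size() != 1) {
            throw new IllegalArgumentException(
                    "Tag <" + tag + "> matches " + matches.size()
                    + " fields of " + owner.getSimpleName());
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        System.out.println(fieldForTag(OneY.class, "Y").getName()); // prints f3
        try {
            fieldForTag(TwoY.class, "Y");
        } catch (IllegalArgumentException e) {
            System.out.println("ambiguous: " + e.getMessage());
        }
    }
}
```

A reader built this way could use form (iii) wherever the lookup succeeds and require the field name only for classes like TwoY.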
C) Do lists need an explicit tag?
i)
<f5><Z>zstuff</Z><Z>zstuff</Z></f5>
ii)
<Z>zstuff</Z><Z>zstuff</Z>
iii)
<f5>zstuff</f5><f5>zstuff</f5>
iv)
<listOfZ><Z>zstuff</Z><Z>zstuff</Z></listOfZ>
An explicit tag for lists reduces ambiguity but introduces
an extra level in the hierarchy. The common case is
for a class to have a single sublist, in which case
(ii) is the best choice.
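For (ii) to work, the reader has to find the list field from the element class alone. In Java the element type of a List<Z> field is available through reflection, so a sketch of that lookup might look like the following. ListFieldFinder and listFieldForTag are hypothetical names, and the nested X is a cut-down version of the example class.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical sketch: find the list field that repeated <Z> tags belong to.
class ListFieldFinder {

    static class Z {}

    // Cut-down version of class X above with a single sublist.
    static class X {
        String f2;
        List<Z> f5;
    }

    // Return the one List field of `owner` whose element type has the short
    // name `tag`, or null if there is none. Throws if several fields match,
    // in which case an explicit list tag would be needed.
    static Field listFieldForTag(Class<?> owner, String tag) {
        Field found = null;
        for (Field f : owner.getDeclaredFields()) {
            if (!List.class.isAssignableFrom(f.getType())) {
                continue;
            }
            Type generic = f.getGenericType();
            if (!(generic instanceof ParameterizedType)) {
                continue; // raw List: element type unknown
            }
            Type arg = ((ParameterizedType) generic).getActualTypeArguments()[0];
            if (arg instanceof Class && ((Class<?>) arg).getSimpleName().equals(tag)) {
                if (found != null) {
                    throw new IllegalArgumentException(
                            "More than one list of " + tag + " in " + owner.getSimpleName());
                }
                found = f;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(listFieldForTag(X.class, "Z").getName()); // prints f5
    }
}
```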
D) How are class names / namespaces resolved?
Assume classes are packaged into namespaces,
with full names like "org.neuroml.channel.XYZ".
How is this full name coded in XML?
i) Short names, "classpath" attribute included at top.
<topelem classpath="org.neuroml.channel; org.neuroml.network">
...
<XYZ>xyzstuff</XYZ>
+ : concise, avoids cluttering XML with namespaces
- : may need an extra mechanism to resolve ambiguities
ii) Full names in XML elements
<org.neuroml.channel.XYZ>xyzstuff</org.neuroml.channel.XYZ>
+ : unambiguous
- : unreadable
iii) Using XML namespaces xmlns:n="org.neuroml.channel"
<n:XYZ>xyzstuff</n:XYZ>
+ : uses XML standards
- : makes the XML hard to read; most of the time (i) is unambiguous.
iv) Using an attribute like "class=" or "type=" to disambiguate where necessary
<XYZ class="org.neuroml.channel.XYZ">xyzstuff</XYZ>
+ : keeps the common case simple
- : the unusual case is verbose
Note that (i), (ii) and (iii) assume the tag is the class name.
If we choose B(ii) (tag is the field name), we have to use an
attribute whenever the object's class is a subclass of the one in the schema.
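Options (i) and (iv) combine naturally: try the short name against each package in the "classpath" attribute, and fall back on an explicit class attribute where that is ambiguous. The sketch below assumes a semicolon-separated package list as in the example for (i). ClassResolver and resolve are hypothetical names, and JDK packages stand in for packages such as org.neuroml.channel so that the example runs on its own.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: resolve a short tag name to a class, as in D(i) and D(iv).
class ClassResolver {

    // `tag` is the element name, `classpath` the semicolon-separated package
    // list from the top element, and `explicitClass` the value of a "class"
    // attribute on the element, or null if it has none.
    static Class<?> resolve(String tag, String classpath, String explicitClass) {
        if (explicitClass != null) {
            // D(iv): an explicit full name always wins.
            try {
                return Class.forName(explicitClass);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown class " + explicitClass, e);
            }
        }
        // D(i): try the short name in each listed package.
        Set<Class<?>> matches = new LinkedHashSet<>();
        for (String pkg : classpath.split(";")) {
            String trimmed = pkg.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                matches.add(Class.forName(trimmed + "." + tag));
            } catch (ClassNotFoundException e) {
                // Not in this package; try the next one.
            }
        }
        if (matches.size() != 1) {
            throw new IllegalArgumentException(
                    "Tag <" + tag + "> matches " + matches.size() + " classes in " + classpath);
        }
        return matches.iterator().next();
    }

    public static void main(String[] args) {
        String classpath = "java.util; java.lang";
        System.out.println(resolve("ArrayList", classpath, null).getName());
        System.out.println(resolve("List", "java.lang", "java.util.LinkedList").getName());
    }
}
```

The error for zero or several matches is the "extra mechanism" that (i) needs: it tells the writer of the file exactly where a class attribute has to be added.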
Observations
Having the tag as a class is good news, as it makes
it simpler to browse an XML file and pick out all
instances of a class.
Avoiding unnecessary levels in the hierarchy can also be
good news - e.g. skipping the list field name if the
class has a single list.