Monday, April 25, 2011

XML specification and duplicate tag processing

XML has long been touted as a very promising method for information exchange. Some consider it too verbose and doubt how efficient it is when the data is voluminous. Even so, XML still reigns as the most widely accepted way to convey structured data: it is human-readable, it is extensible, and parsers for it are widely available.

I noticed one usage pattern in the product I work on: referring to another tag in order to copy its content.

A huge XML file that carries the control configuration for the entire application is usually edited by hand. It stores configuration properties for the various services that run as part of the product. What should we do when there are several duplicate services that have essentially identical properties?

For example -

<Top_Parent_Node attr1="val1">
<Service_Node attr2="val21">
<Prop attr3="val3">
....
... Complex set of enclosed tags ....
....
</Prop>
</Service_Node>

<Service_Node attr2="val22"> <!-- duplicated service tag : we need this for the application -->
<Prop attr3="val3"> <!-- Forced to repeat this from the previous tag -->
....
... Complex set of enclosed tags ....
....
</Prop>
</Service_Node>
.... More such repetitions ....
</Top_Parent_Node>

The simplest way is to repeat the properties at both locations by copy-paste. We are rather good at that.
We, however, screw up miserably when it comes to propagating a change in one set of properties to all the other identical locations.

I suspect this is a common situation that others run into as well, which makes a good case for formalising this requirement in the XML specification itself. The specification should allow a choice: either spell out the tags, or refer to another tag whose content is treated as if it had been copied into this tag during parsing.
For example -

<Top_Parent_Node attr1="val1">
<Service_Node attr2="val21" ?xmlref="N1"> <!-- Label this tag so that other tags can refer to it -->
<Prop attr3="val3">
....
... Complex set of enclosed tags ....
....
</Prop>
</Service_Node>

<Service_Node attr2="val22"> <!-- duplicated service tag : we need this for the application -->
<?xmlref="N1" /> <!-- No need to repeat - referred label is treated as copied -->
</Service_Node>
.... More such repetitions ....
</Top_Parent_Node>

A few points to note:
- The full spec of a node that may be duplicated lives in only one place.
- Any change made in that one place is reflected in every other place that refers to it.
- The first Service_Node, which carries the complete spec, is labelled in a unique manner. The label is part of the XML specification itself, and any node can be labelled this way. It therefore need not be declared as an available attribute in any DTD or schema.
- Any node can refer to this label by enclosing a <?xmlref> with the label identifier. The parser should copy the entire content of the referred node into the referring node.
- The referring node and the referred node need not be in the same hierarchy or at the same tree depth. The parser should cope with a referring node that appears before the referred node in the file, so that parsing stays independent of ordering. If the referred node is not found, the parser should throw an exception. DOM parsers can handle this in a straightforward manner, since the whole tree is in memory. A SAX parser, however, would need to read on to the end of the file in search of a referred node.
- I don't quite see a way to add partial overriding to this idea without unnecessarily complicating it and obfuscating the XML specification.
- Integrity is easy to maintain when the spec changes, and that gives this idea enough value to outweigh the fact that the readability of the XML is somewhat hampered.
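
The DOM-style handling described in the points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a parser extension: since the proposed ?xmlref syntax is not well-formed XML today, the sketch substitutes a hypothetical xmlref-id attribute for the label and a hypothetical <xmlref ref="..."/> element for the reference. It also assumes that a labelled node does not itself contain references, so nested or circular references are not handled.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the proposed syntax (not part of any XML standard).
LABEL_ATTR = "xmlref-id"  # marks a node that others may refer to
REF_TAG = "xmlref"        # element that refers to a labelled node
REF_ATTR = "ref"          # attribute on REF_TAG naming the label


def resolve_refs(root):
    """Replace each <xmlref ref="..."/> with copies of the labelled node's children."""
    # First pass: index all labels, so a reference may appear before its target.
    labels = {}
    for node in root.iter():
        label = node.get(LABEL_ATTR)
        if label is not None:
            if label in labels:
                raise ValueError(f"duplicate label: {label}")
            labels[label] = node

    # Second pass: expand references in place.
    # Assumption: labelled nodes contain no references themselves.
    for parent in list(root.iter()):
        # Walk children from last to first so earlier indexes stay valid.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != REF_TAG:
                continue
            label = child.get(REF_ATTR)
            if label not in labels:
                raise KeyError(f"unresolved reference: {label}")
            parent.remove(child)
            for offset, item in enumerate(labels[label]):
                # Deep copy, so each referring node gets its own subtree.
                parent.insert(index + offset, copy.deepcopy(item))
    return root
```

Calling resolve_refs(ET.fromstring(text)) on a document in which the second Service_Node holds only <xmlref ref="N1"/> leaves that node with its own copy of the Prop subtree, whether the labelled node comes before or after it in the file. A missing label raises an exception, as suggested above.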

When I encountered this problem at my workplace, it was solved at the application layer: a new tag was added inside the duplicates to refer to the other node. It was a simple hack, but it works around the problem rather than solving it. As you may know, this is what happens in a commercial context under time pressure.
