Monday, April 25, 2011

XML file comparison

Another problem is to compare two XML files with the same structure and bring out a result that spells out the differences in the two files. Again I am set out to talk on a solution more involved than the simple one we actually chose to implement.

Simple approach :
A simple approach is to start by referring to the topmost parent node in the code and try to find it in both the XML files. If the node is found in both the files, then no difference found at the first level and we can proceed to the next deeper level of nodes to query. The actual XML tags are thus hard-coded and the same process can be repeated. If it is not found in either of the files, then the first difference is found. We can note this difference and then generate some other structure to describe the difference or a "delta". We can call this as the difference structure. If the XML is a representation of a relational database then this difference structure can then be used to translate into a set of RDBMS queries that equalise the database contents. Thus it is quite plausible that the difference structure is a linear structure rather than a tree-like XML.

Drawbacks of the simple approach :
The logic to parse the structure is inextricably mixed up with the logic to generate a difference structure that is a linear one. If the XML structure changes, the parsing logic also changes and the changes in the delta generated also follow. What would be ideal is the parsing of the nodes be independent of what the XML is about and just the actual difference structure (and subsequent RDBMS queries) generated be dependent on the actual differences in the XML. Its the changes needed to the parsing code that is a matter of contention. I think we can do better.

A better approach :
As it usually happens in computer algorithms, the better approach is more complex than the simple ones. As an aside, the situation is quite different when dealing with mathematical statements and proofs - the better proofs are shorter and have an "elegance" quality in them, in the long run, they prove to be more readily intuitive.
Ok, here we should let the coding logic be independent of any actual XML tags. We actually create a nice array of strings that is multi-dimensional and "hand-write" the XML tags in them. So If the XML has a hierarchy that goes two levels deep, then the array will be actually a list of a list of strings. We just create a string structure that mirrors the XML. This looks tedious but is way simpler than coding that structure for parsing into something the compiler finds acceptable.
The parsing code traverses this string array and treats it as as a descriptor to the XML. It does the same for both the files and if it now finds a difference, it will proceed to generate a in-memory linear difference structure. A separate piece of code will translate the differences into something like a set of RDBMS queries as required and the important fact is that this query generator is strictly different from the parser.
If the XML structure changes (or more likely, extended), no change is needed to the parsing logic. Just the hand-written string array should be changed to reflect the changes. Some skill involved in coding the parsing logic but once done, this approach scores over the earlier one on counts of maintainability and extensibility.

I see parallels in this approach and this is one that could be described as a "data-driven" approach (my term, my quotes). The actual code is only like a markup processor, that is very generic and simplfied. The key point is not that compilation is avoided but that the changes are not in code but in an array-like stucture of simple strings or values.

No comments: