The XML Module
XML lacks well-founded semantics. Its usefulness lies mainly in its syntax, which lets different applications access content delivered in XML format more easily than if the same content were presented as a plain text file. Nevertheless, with the evolution of semantic technologies, pure syntax is no longer enough.
The XML module tries to build rules based on the syntactic structure of XML nodes and thereby compensates for some of XML's semantic shortcomings.
Content
- DataSelection
- DatabasePreparation
- DataPreProcessing
- DataSelection
- DataTransformation
- DataMining
- Results
- Conclusions
DataSelection
Create a new project named xml and add the trains.xml file from the examples/trains folder to the project.

Create a new experiment and start the workflow.

DatabasePreparation
This task prepares the database for the XML module. It does not require any interaction.
DataPreProcessing
This task fetches the document and prepares the internal XML database. Depending on the size of the selected XML file, this might take a while.
DataSelection
Next, we select all XML nodes with the tag "<train>" that have a subnode "<direction>" containing the PCDATA "east".
To do so, click "Select nodes" at the element "<train>, depth: 1".

A dialog will show up:

Execute the following steps:
- Select "<direction>, depth: 2" from subelements
- Click "Add"
- Select "east" from the Text drop-down list of the SubNode Selection dialog
- Click "Execute"
- Mark the results as "POSITIVE"
- Click "Apply" and close the dialog
Complete the dataset with negative examples by repeating these steps for XML nodes with the PCDATA "west" in the subelement "<direction>, depth: 2" and marking the results as negative. After that, your dataset should look like this:

The column "object" shows the internal identifier of the corresponding node. Click "Finish" to proceed.
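The selection made in the dialog can be approximated outside the tool. The following is a minimal sketch in Python, not the module's own implementation. It assumes a hypothetical document layout in which "<train>" elements sit directly under the root and each has a "<direction>" child; the real trains.xml may be structured differently. The list index stands in for the internal identifier, and the label "NEGATIVE" is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for trains.xml; the real file's layout may differ.
SAMPLE = """<trains>
  <train><direction>east</direction></train>
  <train><direction>west</direction></train>
  <train><direction>east</direction></train>
</trains>"""


def label_trains(root):
    """Label depth-1 <train> nodes by the text of their <direction> child."""
    labelled = []
    for index, train in enumerate(root.findall("train")):
        direction = train.findtext("direction", default="").strip()
        if direction == "east":
            labelled.append((index, "POSITIVE"))
        elif direction == "west":
            # Label name assumed; the tutorial only names "POSITIVE".
            labelled.append((index, "NEGATIVE"))
    return labelled


dataset = label_trains(ET.fromstring(SAMPLE))
print(dataset)
```

Nodes whose direction is neither "east" nor "west" are left out of the dataset, which mirrors selecting only the two groups by hand.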
DataTransformation
The XML data is transformed into clauses and the modes for Progol are extracted.
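To give an idea of what such a transformation could look like, here is a minimal Python sketch that turns one XML node into ground facts. The predicate names has_elem_direction and has_text are taken from the rule Progol reports later in this tutorial; the identifier scheme (n0, n1, ...) and the exact fact layout are assumptions and may not match what the module writes. Mode extraction is not covered here.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical single node; the real input comes from the selected dataset.
SAMPLE = "<train><direction>east</direction></train>"


def to_facts(root):
    """Walk an element tree and emit one fact per child link and text value."""
    ids = count()
    facts = []

    def visit(elem, elem_id):
        text = (elem.text or "").strip()
        if text:
            # Assumes the text is a valid lowercase atom; other text
            # would need quoting before being passed to a Prolog system.
            facts.append(f"has_text({elem_id},{text}).")
        for child in elem:
            child_id = f"n{next(ids)}"
            facts.append(f"has_elem_{child.tag}({elem_id},{child_id}).")
            visit(child, child_id)

    root_id = f"n{next(ids)}"
    facts.append(f"node({root_id}).")
    visit(root, root_id)
    return facts


facts = to_facts(ET.fromstring(SAMPLE))
for fact in facts:
    print(fact)
```

Each parent-child relation becomes one has_elem_<tag> fact and each non-empty text value one has_text fact, which is the vocabulary the learned rule is expressed in.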
DataMining
First, the input data is written to a temp file in the progol/temp folder.
Run Progol by clicking the "Run" button and sending the "generalise(node/1)?" command, which will result in the following output:

node(A) :- has_elem_direction(A,B), has_text(B,east).
This means that all positive nodes have a subelement direction with the text east. This result is not very surprising, because we chose our dataset according to exactly this criterion, but in larger XML documents you might uncover more interesting common structures of nodes. Experimenting with the modes might also yield other results.
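One way to read the rule is as a test applied to each node. The sketch below, which is only an illustration and not part of Helenos or Progol, restates the rule as a Python predicate and counts how many positive and negative examples it covers. The two example nodes are hypothetical.

```python
import xml.etree.ElementTree as ET


def rule_covers(node):
    """node(A) :- has_elem_direction(A,B), has_text(B,east)."""
    return any(
        (child.text or "").strip() == "east"
        for child in node.findall("direction")
    )


# Hypothetical examples, one per class.
positives = [ET.fromstring("<train><direction>east</direction></train>")]
negatives = [ET.fromstring("<train><direction>west</direction></train>")]

covered_pos = sum(rule_covers(n) for n in positives)
covered_neg = sum(rule_covers(n) for n in negatives)
print(covered_pos, covered_neg)
```

A rule that covers every positive example and no negative one, as here, is consistent with the dataset it was learned from.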
Results
The final workflow step displays whichever results you kept from your Progol runs.

The results are displayed in HTML format, so you can save them and view them in a browser.
Conclusions
The approach offered by Helenos is based on the analysis of the syntactic structure of XML documents. It gives basic insights into the common structures of selected XML nodes by extracting rules from a dataset of nodes. This might be a first step toward transforming shallow XML data into a format with richer semantics, such as RDF or perhaps even OWL.
