The Helenos Project
KDD Workbench for the Semantic Web

The XML Module

XML lacks a well-founded semantics. Its use and meaning lie mainly in its syntax, which allows different applications to access content delivered in XML format more easily than if the same content were presented as a plain text file. Nevertheless, with the evolution of semantic technologies, pure syntax is not enough.

The XML module tries to build rules based on the syntactic structure of XML nodes and thereby compensates for some of the semantic shortcomings of XML.

DataSelection

Create a new project named xml and add the trains.xml file from the examples/trains folder to the project.

Create a new experiment and start the workflow.

DatabasePreparation

This task prepares the database for the XML module. It does not require any interaction.

DataPreProcessing

This task fetches the document and prepares the internal XML database. Depending on the size of the selected XML file this might take a while.

DataSelection

Next, we want to select all XML nodes with the tag "<train>" that have a subnode "<direction>" with the PCDATA "east".

To do so, click "Select nodes" at the element "<train>, depth: 1".

A dialog will show up:

Execute the following steps:

Complete the dataset with negative examples by applying these steps to XML nodes with the PCDATA "west" in the subelement "<direction>, depth: 2". After that your dataset should look like this:

The column "object" shows the internal identifier of the corresponding node. Click "Finish" to proceed.
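
Outside the workbench, the selection made in the dialog can be approximated with a few lines of Python. This is only a sketch: the inline document is a hypothetical stand-in for trains.xml (the real file's layout may differ), and the helper name select_trains is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for trains.xml; the real file may be structured differently.
SAMPLE = """
<trains>
  <train><direction>east</direction><car><shape>rectangle</shape></car></train>
  <train><direction>west</direction><car><shape>ellipse</shape></car></train>
  <train><direction>east</direction><car><shape>hexagon</shape></car></train>
</trains>
"""


def select_trains(root, direction):
    """Return the <train> children of root whose <direction> child has the given text."""
    return [
        train
        for train in root.findall("train")
        if (train.findtext("direction") or "").strip() == direction
    ]


root = ET.fromstring(SAMPLE.strip())
positives = select_trains(root, "east")  # positive examples
negatives = select_trains(root, "west")  # negative examples
print(len(positives), len(negatives))
```

Here the root element sits at depth 0, so "<train>" is at depth 1 and "<direction>" at depth 2, matching the depths shown in the workbench.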

DataTransformation

The XML data is transformed into clauses and the modes for Progol are extracted.
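
To give an idea of what such a transformation might look like, the following Python sketch turns an XML tree into ground facts. The predicate names has_elem_direction and has_text are taken from the Progol output shown in the DataMining step; the identifier scheme (n1, n2, ...) and the function to_facts are assumptions made for illustration and are not necessarily what Helenos generates.

```python
import xml.etree.ElementTree as ET
from itertools import count


def to_facts(root):
    """Walk the tree and return has_elem_<tag>/2 and has_text/2 facts as strings."""
    ids = count(1)
    facts = []

    def visit(elem):
        ident = f"n{next(ids)}"  # assumed identifier scheme
        text = (elem.text or "").strip()
        if text:
            # Note: text that is not a plain lowercase atom would need quoting.
            facts.append(f"has_text({ident},{text}).")
        for child in elem:
            child_id = visit(child)
            facts.append(f"has_elem_{child.tag}({ident},{child_id}).")
        return ident

    visit(root)
    return facts


doc = ET.fromstring("<train><direction>east</direction></train>")
for fact in to_facts(doc):
    print(fact)
```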

DataMining

First, the input data is written to a temp file in the progol/temp folder.

Run Progol by clicking the "Run" button and sending the "generalise(node/1)?" command, which will result in the following output:

node(A) :- has_elem_direction(A,B), has_text(B,east).

This means that all positive nodes have a subelement direction with the text east. This result is not very surprising, because we chose our dataset according to this criterion, but in larger XML documents you might reveal more interesting common structures of nodes. Experimenting with the modes might yield other results.
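
The meaning of the learned clause can be illustrated by evaluating it by hand over a small set of facts. The facts below are hypothetical (one eastbound and one westbound train), and the Python function is merely a direct reading of the clause, not part of Helenos or Progol.

```python
# Hypothetical facts in the shape suggested by the learned rule.
has_elem_direction = {("n1", "n2"), ("n3", "n4")}
has_text = {("n2", "east"), ("n4", "west")}


def node(a):
    """node(A) :- has_elem_direction(A,B), has_text(B,east)."""
    return any(
        parent == a and (child, "east") in has_text
        for parent, child in has_elem_direction
    )


print(node("n1"), node("n3"))
```

The rule covers n1, whose direction subelement carries the text east, and rejects n3, whose direction subelement carries the text west.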

Results

The final workflow step displays whichever results you have kept from your Progol runs.

The results are displayed in HTML format, so you can save them and view them in a browser.

Conclusions

The approach offered by Helenos is based on the analysis of the syntactic structure of XML documents. It gives basic insights into the common structures of selected XML nodes by extracting rules from a dataset of nodes. This might be a first step towards transforming shallow XML data into a format with more semantics, such as RDF or perhaps even OWL.

© 2003-2006 AIFB - OntoWare Team