The Helenos Project
KDD Workbench for the Semantic Web

The HTML Module

The HTML module fetches documents from the web or the local file system and lets the teacher (you) build a dataset of positive and negative example documents. It then transforms the documents and tries to determine a general pattern that describes the dataset. Let us work through a simple example to get acquainted with this module.

Creating a project

Start Helenos, connect to the database and create a new project named html.

When you click the "Suggest" button, Helenos proposes a database, which will be created and used for this project.

DataSelection

The DataSelection shows up automatically. Select the HTML module and add the index.html file from the examples/almo folder of the release.

Creating an experiment

After finishing the DataSelection, the ExperimentManager appears. Create a new experiment by clicking the "Suggest" buttons. This is the best way to avoid conflicts with already existing experiments.

Again a database is created and used for the storage of all our data.

The ExperimentView

Next, you will get the ExperimentView. Helenos uses Almo as the workflow engine for the KDD process. First, you will see the configuration view of Almo, which lets you add optional elements to the process. The (+) at the workflow node indicates that you can add another task. Let us add the DataAlgorithm task by right-clicking on the workflow node and adding it. The DataAlgorithm task may be removed again, which is indicated by the (-), and it has two more optional elements of its own (indicated by (+)). For now, we will use the default optional elements Hits and PageRank.

After clicking the "Create instance" button, we will get the running instance of our KDD workflow.

The experiment settings

Before starting our workflow, take a look at the experiment settings.

General settings

Setting Meaning
ERROR_HANDLING Determines how errors that occur should be handled:
  • IGNORE: Leaves the documents with errors in the database, but ignores them.
  • DELETE: Deletes all documents with errors from the database.
  • DISPLAY: Displays a dialog, which asks how specific errors should be handled.
DISPLAY_PROGOL_WARNINGS
  • true: Displays all warnings of the ILP tool Progol.
  • false: Only displays the Progol output without warnings.
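
The three ERROR_HANDLING values can be pictured with a small sketch. It is not Helenos code: the function apply_error_handling, its arguments and the ask callback are hypothetical names, and the database is modelled as a plain list of document names.

```python
from enum import Enum


class ErrorHandling(Enum):
    IGNORE = "IGNORE"
    DELETE = "DELETE"
    DISPLAY = "DISPLAY"


def apply_error_handling(documents, failed, policy, ask=None):
    """Return (stored, usable) document lists after applying the policy.

    documents: names of all documents in the (modelled) database
    failed:    set of names for which an error occurred
    ask:       callback used for DISPLAY; it stands in for the dialog and
               returns ErrorHandling.IGNORE or ErrorHandling.DELETE
    """
    stored, usable = [], []
    for doc in documents:
        if doc not in failed:
            stored.append(doc)
            usable.append(doc)
            continue
        decision = policy
        if policy is ErrorHandling.DISPLAY:
            # DISPLAY defers the choice to the user, one error at a time.
            decision = ask(doc)
        if decision is ErrorHandling.IGNORE:
            stored.append(doc)  # stays in the database but is not used
        # DELETE: the document is dropped from the database entirely.
    return stored, usable
```

With DELETE, a failed document disappears from both lists; with IGNORE, it remains stored but is left out of the usable documents.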

HTML module settings

You will find the specific HTML module settings at the /MODULES/HTML node. We will get to them as we go through each KDD task. Do not change any default settings for this tutorial.

Now we will start the workflow. The rest of the tutorial introduces each KDD task in turn.

DatabasePreparation

This task prepares the database for the HTML module. It does not require any interaction.

DataPreProcessing

The HTML documents are fetched, cleaned (transformed from possibly dirty HTML into proper XHTML) and integrated (crawled). Depending on your crawler settings, this task might take a while. Use the crawler settings wisely to limit the number of documents in your dataset. You will find the settings at the /MODULES/HTML/CRAWLER node.

Setting Meaning
STAYONHOST
  • true: Only deal with documents from the same domain. If you are using documents from your local file system, only those will be used.
  • false: Accept any domain.
DEPTH Limits the document depth. The crawler uses breadth-first search as its strategy. -1 means that the depth is not limited.
MAXDOCUMENTS The maximum number of documents in your dataset. -1 means no limit, so all documents of a specific domain will be added.
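
How the three crawler settings interact can be sketched as a breadth-first walk over a link graph. This is only an illustration, not the Helenos crawler: the function crawl and its parameters are made-up names, pages are not fetched but looked up in a dictionary, and it is an assumption that a depth of 0 means "start document only" and that the start document counts towards MAXDOCUMENTS.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start, links, stay_on_host=True, depth=-1, max_documents=-1):
    """Breadth-first walk over a link graph; returns visited URLs in order.

    links maps each URL to the URLs it links to and stands in for
    fetching and parsing real pages.
    """
    start_host = urlparse(start).netloc
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if depth != -1 and level >= depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue
            if stay_on_host and urlparse(target).netloc != start_host:
                continue  # STAYONHOST: skip documents from other hosts
            if max_documents != -1 and len(visited) >= max_documents:
                return visited  # MAXDOCUMENTS reached
            seen.add(target)
            visited.append(target)
            queue.append((target, level + 1))
    return visited
```

Because the queue is processed first-in, first-out, all documents at one depth are collected before any deeper ones, so a small DEPTH or MAXDOCUMENTS value keeps the documents closest to the start page.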

While the preprocessing is running, the most likely error is that a resource does not exist. Since we chose to have errors displayed, we will get the following dialog for each document that cannot be found.

Please select "delete" every time this dialog shows up. If you do not want to be asked about every error when you are crawling a domain of a few hundred documents, select the DELETE or IGNORE value of the ERROR_HANDLING setting.

DataAlgorithms

This task runs the popular Hits and PageRank algorithms on the documents, which might help you when you are building a dataset of positive and negative examples.

Clicking "Start algorithm" executes the algorithm, and clicking "Next" takes you to the next element of the workflow.
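
To give an idea of what the two algorithms compute, here is a minimal textbook-style sketch of both on a small link graph. It is not the Helenos implementation; the function names, the damping factor of 0.85 and the fixed number of iterations are assumptions made for the example.

```python
def all_pages(links):
    """Every page that appears as a source or a target of a link."""
    return sorted(set(links) | {t for ts in links.values() for t in ts})


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [linked pages]}."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


def hits(links, iterations=50):
    """Iterative HITS; returns (hub scores, authority scores)."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking here.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                auth[t] += hub[page]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

Both functions score a page by the links pointing at it, so a start page that every other page links back to typically ends up with the highest PageRank and authority value.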

DataSelection

Now we have reached the crucial part of the workflow. At the DataSelection you choose your training set of positive and negative examples. You can base your selection on the DataAlgorithms you have executed. Clicking "Select" will choose a dataset based on the "Percentage of documents used as training set" and the "Ratio of positive to negative examples". But we will not use these features. Select the documents manually according to the screenshot above.

You can use the "Load ..." and "Save ..." buttons to serialize your data selection into a text file.
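
The tutorial does not spell out how "Select" turns the two numbers into a dataset. One plausible reading is sketched below, purely as an assumption: the documents are ranked by an algorithm score, the best-ranked ones become positive examples and the worst-ranked ones negative examples. The function select_training_set and its parameters are hypothetical.

```python
import random


def select_training_set(scores, percentage, pos_to_neg_ratio, rng=None):
    """Pick (positives, negatives) from a dict {document: score}.

    percentage:       share of all documents to use, 0..100
    pos_to_neg_ratio: positives per negative, e.g. 1.0 for a balanced set
    rng:              optional random.Random; if given, the documents are
                      shuffled instead of ranked by score
    """
    docs = sorted(scores, key=scores.get, reverse=True)
    if rng is not None:
        rng.shuffle(docs)
    size = round(len(docs) * percentage / 100)
    n_pos = round(size * pos_to_neg_ratio / (1 + pos_to_neg_ratio))
    n_neg = size - n_pos
    positives = docs[:n_pos]
    negatives = docs[len(docs) - n_neg:]
    return positives, negatives
```

For ten documents, a percentage of 60 and a ratio of 1.0, this yields three positive and three negative examples.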

DataTransformation

The next crucial point is the DataTransformation. This task depends on the following settings:

/MODULES/HTML/TRIPLE_FACTORY

Setting Meaning
IGNORE_TAGS Ignores the nodes specified by the tags. The element tags have to be written in capital letters.

/MODULES/HTML/PROGOL/MODES

The settings DOCTITLE, METATAGS, MAIL, RELATIONS_INTERN and RELATIONS_EXTERN extract "intelligent" triples from the HTML documents, independent of the IGNORE_TAGS setting.

The settings ELEMENT, TAG and ATTRIBUTE generate triples from the original document, leaving out the nodes listed in IGNORE_TAGS.

Setting Meaning
DOCTITLE Extracts the title of the document.
METATAGS Extracts the metatag elements.
MAIL Extracts emails from the documents (the attribute value of the <A href="mailto:xxx" /> nodes).
RELATIONS_INTERN Represents the link relations between documents within the dataset.
RELATIONS_EXTERN Represents the link relations from documents within the dataset to documents outside of the crawled domain.
ELEMENT Represents relations between html element nodes.
TEXT Represents the text contained in element tags.
ATTRIBUTE Represents the attributes of element tags.

Since we have not changed the default settings, only the "intelligent" triples are generated for our example.
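
As a rough illustration of what the "intelligent" triples look like, the following sketch pulls the title, the meta content, mail addresses and internal links out of one document and prints them as Prolog-style facts. It is a sketch under assumptions, not the Helenos triple factory: the class TripleExtractor, the helper normalise and the predicate name mail are made up, and replacing punctuation with spaces is only a guess at how values such as 'text html charset iso 8859 1' come about.

```python
import re
from html.parser import HTMLParser


def normalise(text):
    """Replace every run of non-alphanumeric characters with one space."""
    return re.sub(r"[^0-9A-Za-z]+", " ", text).strip()


class TripleExtractor(HTMLParser):
    """Collects Prolog-style facts about one document."""

    def __init__(self, doc_id, known_docs):
        super().__init__()
        self.doc_id = doc_id
        self.known_docs = known_docs  # maps href -> document id in the dataset
        self.facts = []
        self._title_parts = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._title_parts = []
        elif tag == "meta" and attrs.get("content"):
            value = normalise(attrs["content"])
            self.facts.append(f"metatag({self.doc_id},content,'{value}').")
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.startswith("mailto:"):
                # The predicate name "mail" is a guess.
                self.facts.append(f"mail({self.doc_id},'{href[7:]}').")
            elif href in self.known_docs:
                target = self.known_docs[href]
                self.facts.append(f"internal_relation({self.doc_id},{target}).")

    def handle_data(self, data):
        if self._title_parts is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._title_parts is not None:
            title = normalise("".join(self._title_parts))
            self.facts.append(f"doctitle({self.doc_id},'{title}').")
            self._title_parts = None


def extract_facts(doc_id, html, known_docs):
    parser = TripleExtractor(doc_id, known_docs)
    parser.feed(html)
    parser.close()
    return parser.facts
```

Links whose target is not in known_docs are simply skipped here; with RELATIONS_EXTERN they would become facts of their own.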

DataMining

Once the input data has been written to a temp file in the progol/temp folder, Progol can be executed. Editing the modes is not advisable unless you are absolutely sure what you are doing. You can, however, edit the settings to limit or expand the search space of Progol. (See the Progol documentation for further information.)

After clicking the "Run" button, the Progol process is started and the input data is written to it. The command for finding a general pattern in the input data is "generalise(document/1)?". Send it and Progol will find the following rule:

document(A) :- internal_relation(A,A).

This rule describes all positive examples and none of the negative examples. It states that every positive document has a relation to itself, that is, it links to itself.
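
What it means for a rule to describe all positive and no negative examples can be checked by hand on a handful of facts. The sketch below does this for the self-link rule; the document ids and the helper names self_link and check_rule are invented for the example.

```python
def self_link(doc, facts):
    """Body of: document(A) :- internal_relation(A,A)."""
    return ("internal_relation", doc, doc) in facts


def check_rule(rule, positives, negatives, facts):
    """Return (covers_all_positives, covers_no_negative) for a rule body."""
    covers_all_positives = all(rule(d, facts) for d in positives)
    covers_no_negative = not any(rule(d, facts) for d in negatives)
    return covers_all_positives, covers_no_negative


# Hypothetical dataset: d1 and d2 link to themselves, d3 only links to d1.
facts = {
    ("internal_relation", "d1", "d1"),
    ("internal_relation", "d2", "d2"),
    ("internal_relation", "d2", "d1"),
    ("internal_relation", "d3", "d1"),
}
result = check_rule(self_link, ["d1", "d2"], ["d3"], facts)
```

A rule is only an acceptable generalisation of the dataset if both values in the result are true.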

If you are not satisfied with this result, remove the "modeb(*,internal_relation(+object,-object))?" mode and run Progol again. Now we will get the following result:

document(A) :- metatag(A,content,'text html charset iso 8859 1').

After removing the "modeb(*,metatag(+object,#object,#object))?" mode, the output will be:

document(A) :- doctitle(A,'Almo').

You might find a new result if you remove the "modeb(1,doctitle(+object,#object))?" mode as well. The reason is that Progol stops as soon as it finds a general rule: with the doctitle mode active, it finds the doctitle rule first and stops. Experiment with adding and removing modes, but be careful not to remove the document mode, because then you will not be able to generalize the documents.
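
If you keep the modes in a text file, removing and restoring them can be scripted. The helper below is a small sketch with a made-up name, remove_mode; it drops the body modes of one predicate and never touches head modes. The exact spelling of the document mode, "modeh(1,document(+object))?", is an assumption, since the tutorial does not show it.

```python
def remove_mode(modes, predicate):
    """Return the modes without the body modes (modeb) of one predicate.

    Head modes (modeh) are always kept, so the document mode survives.
    """
    kept = []
    for mode in modes:
        is_body_mode = mode.startswith("modeb(")
        if is_body_mode and f",{predicate}(" in mode:
            continue
        kept.append(mode)
    return kept


modes = [
    "modeh(1,document(+object))?",  # assumed form of the document mode
    "modeb(*,internal_relation(+object,-object))?",
    "modeb(*,metatag(+object,#object,#object))?",
    "modeb(1,doctitle(+object,#object))?",
]
without_relations = remove_mode(modes, "internal_relation")
```

Calling remove_mode again on the result with "metatag" reproduces the second step of the walkthrough above.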

The "Save files ..." button lets you save the input file for Progol to your file system. Furthermore, you will find "Load ..." and "Save ..." buttons at the Modes and Settings tabs, which allow you to save all relevant Progol input files generated by Helenos to your file system.

Results

The final workflow step displays the results, depending on which result you have kept from your Progol executions.

The results are displayed in HTML format, so you can save them and view them in a browser.

Conclusion

Finding general rules within your HTML dataset can be quite frustrating. Discovering something in a domain you know nothing about is sometimes pure luck. Experimenting with the triple factory, the Progol modes and the Progol settings helps to limit the basic search space. At the DataSelection task you can specify a number of iterations to be carried out. By adjusting the "Percentage of documents used as training set" and the "Ratio of positive to negative examples" settings and checking the "Random" box, you can also let Helenos explore a domain iteratively, based on a random document selection.

© 2003-2006 AIFB - OntoWare Team