The HTML Module
The HTML module fetches documents from the web or the local file system, lets you (the teacher) build a dataset of positive and negative example documents, transforms the documents, and tries to find a general pattern that describes the dataset. Let us work through a simple example so that you become acquainted with this module.
Content
- Creating a project
- DataSelection
- Creating an experiment
- The ExperimentView
- The experiment settings
- DatabasePreparation
- DataPreProcessing
- DataAlgorithms
- DataSelection
- DataTransformation
- DataMining
- Results
- Conclusion
Creating a project
Start Helenos, connect to the database and create a new project named html.

When you click the "Suggest" button, Helenos proposes a database, which will be created and used for this project.
DataSelection
The DataSelection shows up automatically. Select the HTML module and add the index.html file from the examples/almo folder of the release.

Creating an experiment
After finishing the DataSelection, the ExperimentManager appears. Create a new experiment by clicking the "Suggest" buttons. This is the best way to avoid conflicts with already existing experiments.

Again a database is created and used for the storage of all our data.
The ExperimentView
Next, you will get the ExperimentView. Helenos uses Almo as its workflow engine for the KDD process. First, you will get the configuration view of Almo. This offers you the opportunity to add optional elements to the process. The (+) at the workflow node indicates that you can add another task. Let us add the DataAlgorithm task by right-clicking on the workflow node and adding it. The DataAlgorithm task may be removed, which is indicated by the (-), and it also has two more optional elements (indicated by (+)). For now, we will use the default optional elements Hits and PageRank.

After clicking the "Create instance" button, we will get the running instance of our KDD workflow.
The experiment settings
Before starting our workflow, take a look at the experiment settings.

General settings
| Setting | Meaning |
|---|---|
| ERROR_HANDLING | Determines how errors that occur should be handled. |
| DISPLAY_PROGOL_WARNINGS | |
HTML module settings
You will find the specific HTML module settings at the /MODULES/HTML node. We will get to them as we go through each KDD task. Do not change any default settings for this tutorial.
Now we will start the workflow. The rest of the tutorial introduces each KDD task in turn.
DatabasePreparation
This task prepares the database for the HTML module. It does not require any interaction.
DataPreProcessing
The HTML documents are fetched, cleaned (transformed from possibly dirty HTML into proper XHTML) and integrated (crawled). Depending on your crawler settings, this task might take a while. Use the crawler settings wisely to limit the number of documents in your dataset. You will find the settings at the /MODULES/HTML/CRAWLER node.
| Setting | Meaning |
|---|---|
| STAYONHOST | |
| DEPTH | Limits the document depth. The crawler uses breadth-first search as its strategy. -1 means that there is no limit on the depth. |
| MAXDOCUMENTS | The maximum number of documents in your dataset. -1 means no limit, so that all documents of a specific domain will be added. |
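The interplay of DEPTH and MAXDOCUMENTS with a breadth-first crawl can be illustrated with a small sketch. This is not the Helenos crawler: the function name `crawl`, the in-memory `LINKS` graph (a stand-in for fetching and parsing real pages) and the convention that depth 0 means "only the start document" are assumptions made for illustration.

```python
from collections import deque


def crawl(start, links, depth=-1, max_documents=-1):
    """Breadth-first traversal of a link graph with optional limits.

    links maps a document to the documents it links to.
    A value of -1 disables the respective limit.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        if max_documents != -1 and len(order) >= max_documents:
            break  # document limit reached
        url, level = queue.popleft()
        order.append(url)
        if depth != -1 and level >= depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


# Hypothetical link graph, for illustration only.
LINKS = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["index.html"],
}

print(crawl("index.html", LINKS, depth=1))
print(crawl("index.html", LINKS, max_documents=3))
```

Because the traversal is breadth-first, a document limit cuts off the deepest documents first, so the documents closest to the start page are the ones that end up in the dataset.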
While the preprocessing is running, the most likely error is that a resource does not exist. Since we chose to have errors displayed, we will get the following dialog for each document that cannot be found.

Please select "delete" every time this dialog shows up. If you do not want to be asked about every error when you are crawling a domain of a few hundred documents, select the DELETE or IGNORE value of the ERROR_HANDLING setting.
DataAlgorithms
This task runs the popular Hits and PageRank algorithms on the documents. Their scores might help you when you are building a dataset of positive and negative examples.

Clicking "Start algorithm" executes the algorithm and by clicking "Next" you will get to the next element of the workflow.
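To give an idea of what such a score expresses, here is a minimal sketch of PageRank computed by power iteration on a tiny link graph. It is not the Helenos implementation: the function name, the damping factor of 0.85 (the conventional choice), the fixed iteration count and the example graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict mapping each document to its PageRank score.

    links maps a document to the list of documents it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document receives the same base share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass the damped rank on in equal parts to all link targets.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A document without outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph, for illustration only.
GRAPH = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": ["c.html"],
    "c.html": ["index.html"],
}

for doc, score in sorted(pagerank(GRAPH).items()):
    print(doc, round(score, 3))
```

Documents that are linked from many (or from highly ranked) documents score higher, which is why such a score can serve as a hint when choosing examples.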
DataSelection

Now we have reached the crucial part of the workflow. At the DataSelection you choose your training set of positive and negative examples. You can base your selection on the DataAlgorithms you have executed. Clicking "Select" will choose a dataset based on the "Percentage of documents used as training set" and the "Ratio of positive to negative examples". But we will not use these features. Select the documents manually according to the screenshot above.
You can use the "Load ..." and "Save ..." buttons to serialize your data selection into a text file.
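How a selection based on the two parameters might work can be sketched as follows. The tutorial does not describe how Helenos decides which documents become positive or negative examples, so this sketch simply assumes that the highest-scoring documents are taken as positives and the lowest-scoring ones as negatives. The function name, the parameter names and the example scores are hypothetical.

```python
def select_training_set(scores, percentage, pos_to_neg):
    """Split a share of the documents into positive and negative examples.

    scores      maps a document to a score (for example from PageRank)
    percentage  share of all documents used as training set (0 to 100)
    pos_to_neg  ratio of positive to negative examples (2 means 2:1)
    """
    # Rank by descending score; the name breaks ties deterministically.
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    size = round(len(ranked) * percentage / 100)
    n_pos = round(size * pos_to_neg / (pos_to_neg + 1))
    n_neg = size - n_pos
    positives = ranked[:n_pos]
    # Assumption: negatives come from the bottom of the ranking.
    negatives = ranked[len(ranked) - n_neg:] if n_neg else []
    return positives, negatives


# Hypothetical scores, for illustration only.
SCORES = {f"doc{i}.html": 10 - i for i in range(10)}

positives, negatives = select_training_set(SCORES, percentage=60, pos_to_neg=2)
print("positive:", positives)
print("negative:", negatives)
```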
DataTransformation
The next crucial point is the DataTransformation. This task depends on the following settings:

/MODULES/HTML/TRIPLE_FACTORY
| Setting | Meaning |
|---|---|
| IGNORE_TAGS | Ignores the nodes specified by the tags. The element tags have to be written in capital letters. |
/MODULES/HTML/PROGOL/MODES
The settings DOCTITLE, METATAGS, MAIL, RELATIONS_INTERN and RELATIONS_EXTERN extract "intelligent" triples from the HTML documents, independent of the IGNORE_TAGS.
The settings ELEMENT, TAG and ATTRIBUTE generate triples from the original document, excluding the nodes listed in IGNORE_TAGS.
| Setting | Meaning |
|---|---|
| DOCTITLE | Extracts the title of the document. |
| METATAGS | Extracts the metatag elements. |
| MAIL | Extracts emails from the documents (the attribute value of the <A href="mailto:xxx" /> nodes). |
| RELATIONS_INTERN | Represents the link relations between documents within the dataset. |
| RELATIONS_EXTERN | Represents the link relations from documents within the dataset to documents, which are outside of the crawled domain. |
| ELEMENT | Represents relations between html element nodes. |
| TEXT | Represents the text contained in element tags. |
| ATTRIBUTE | Represents the attributes of element tags. |
Since we have not changed the default settings, only the "intelligent" triples are generated for our example.
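The idea behind the "intelligent" triples can be sketched with Python's standard `html.parser` module. This is an illustration, not the Helenos triple factory. The predicate names `doctitle`, `metatag` and `internal_relation` appear in the Progol rules of this tutorial; `mail` and `external_relation`, the class and function names and the example page are assumptions. Unlike the rule output shown in the tutorial, the sketch does not normalize attribute values.

```python
from html.parser import HTMLParser


class TripleExtractor(HTMLParser):
    """Collects simple facts about one HTML document."""

    def __init__(self, doc_id, known_docs):
        super().__init__()
        self.doc_id = doc_id
        self.known_docs = set(known_docs)  # documents inside the dataset
        self.triples = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "title":
            self._in_title = True
            self._title_parts = []
        elif tag == "meta":
            for name, value in attrs:
                if value is not None:
                    self.triples.append(("metatag", self.doc_id, name, value))
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            if href.startswith("mailto:"):
                self.triples.append(("mail", self.doc_id, href[len("mailto:"):]))
            elif href in self.known_docs:
                self.triples.append(("internal_relation", self.doc_id, href))
            else:
                self.triples.append(("external_relation", self.doc_id, href))

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            title = "".join(self._title_parts).strip()
            self.triples.append(("doctitle", self.doc_id, title))


def extract_triples(doc_id, html, known_docs):
    parser = TripleExtractor(doc_id, known_docs)
    parser.feed(html)
    parser.close()
    return parser.triples


# Hypothetical page, for illustration only.
PAGE = """<html><head><title>Almo</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head><body>
<a href="index.html">Home</a>
<a href="mailto:someone@example.org">Mail</a>
<a href="http://example.org/">Elsewhere</a>
</body></html>"""

for triple in extract_triples("index.html", PAGE, {"index.html"}):
    print(triple)
```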
DataMining
After the input data has been written to a temp file in the progol/temp folder, Progol can be executed. Do not edit the modes unless you are absolutely sure what you are doing. You can, however, edit the settings to limit or expand the search space of Progol.
After clicking the "Run" button, the Progol process is started and the input data is written to it. The command for finding a general pattern in the input data is "generalise(document/1)?". Send it and Progol will find the following rule:
document(A) :- internal_relation(A,A).
This means that the general rule that describes all positive examples and none of the negative examples is that every positive document has a relation to itself.
If you are not satisfied with this result, remove the "modeb(*,internal_relation(+object,-object))?" mode and run Progol again. Now we will get the following result:
document(A) :- metatag(A,content,'text html charset iso 8859 1').
After removing the "modeb(*,metatag(+object,#object,#object))?" mode, the output will be:
document(A) :- doctitle(A,'Almo').
You might find a new result, if you remove the "modeb(1,doctitle(+object,#object))?" mode as well. The reason is that Progol stops as soon as it finds a general rule. With the doctitle mode being active, it finds the doctitle rule first and stops. Play around by adding and removing the modes, but be careful not to remove the document mode, because then you will not be able to generalize the documents.
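What it means for a rule to describe all positive examples and no negative example can be checked by hand on a small set of facts. The following sketch does not run Progol and does not search for rules; it only tests given rules against hypothetical facts. All names and data in it are made up for illustration.

```python
# Hypothetical facts in the style of the generated triples.
FACTS = {
    ("internal_relation", "index.html", "index.html"),
    ("internal_relation", "index.html", "about.html"),
    ("internal_relation", "about.html", "index.html"),
    ("doctitle", "index.html", "Almo"),
    ("doctitle", "about.html", "About"),
}


def links_to_itself(doc, facts):
    # Corresponds to: document(A) :- internal_relation(A,A).
    return ("internal_relation", doc, doc) in facts


def has_title_almo(doc, facts):
    # Corresponds to: document(A) :- doctitle(A,'Almo').
    return ("doctitle", doc, "Almo") in facts


def links_to_index(doc, facts):
    # A rule that is too general: it also covers the negative example.
    return ("internal_relation", doc, "index.html") in facts


def is_consistent(rule, positives, negatives, facts):
    """True if the rule covers every positive and no negative example."""
    covers_all_positives = all(rule(doc, facts) for doc in positives)
    covers_no_negative = not any(rule(doc, facts) for doc in negatives)
    return covers_all_positives and covers_no_negative


POSITIVES = ["index.html"]
NEGATIVES = ["about.html"]

for rule in (links_to_itself, has_title_almo, links_to_index):
    print(rule.__name__, is_consistent(rule, POSITIVES, NEGATIVES, FACTS))
```

Several different rules can be consistent with the same examples, which is why removing a mode leads Progol to report another rule.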
The "Save files ..." button lets you save the input file for Progol to your file system. Furthermore, you will find "Load ..." and "Save ..." buttons at the Modes and Settings tabs, which allow you to save all relevant Progol input files generated by Helenos to your file system.
Results
Depending on which result you have kept from your Progol executions, the results are displayed as the final workflow step.

The results are displayed in html format, so you can save them and view them in a browser.
Conclusion
It can be quite frustrating to find general rules within your HTML dataset. Sometimes it is pure luck to discover something within a domain of which you have no knowledge. Playing around with the triple factory, Progol modes and Progol settings helps to limit the basic search space. At the dataset task you can specify a number of iterations to be carried out. By adjusting the "Percentage of documents used as training set" and the "Ratio of positive to negative examples" settings and checking the "Random" box, you can also have Helenos explore a domain iteratively, based on a random document selection.
