Paul Buitelaar, Michael Sintek, Yasir Iqbal
OntoLT is a plug-in for Protégé with which concepts (Protégé classes) and relations (Protégé slots) can be extracted automatically from a linguistically annotated text collection. For this purpose, OntoLT provides mapping rules, defined by use of a precondition language, that allow for a mapping between linguistic entities in text and class/slot candidates in Protégé.
To experiment with OntoLT you will have access to a number of text collections that have been automatically annotated with the necessary linguistic information. Annotation is according to a proprietary XML-format that includes: part-of-speech tags, morphological analysis, phrases (including head-modifier analysis) and predicate-argument structure (including grammatical functions like subject, direct object).
The annotation allows for the automatic extraction of linguistic entities according to predefined preconditions: heads of nominal phrases, predicates and their subjects, etc. Linguistic objects that can be extracted in this way can be used to construct a shallow ontology of domain relevant concepts, sub-concepts and relations between concepts.
Here we describe the terms you need to know before using or extending OntoLT. Throughout this document we assume that the reader has some knowledge of XML and linguistic annotation and a basic understanding of ontologies. For a more detailed overview we recommend reading the following publication:
Paul Buitelaar, Daniel Olejnik, Michael Sintek. A Protégé Plug-In for Ontology Extraction from Text Based on Linguistic Analysis. In: Proceedings of the 1st European Semantic Web Symposium (ESWS), Heraklion, Greece, May 2004.
The input for OntoLT consists of a linguistically annotated corpus (in a proprietary XML format – see Buitelaar et al., 2004), which will be processed by a set of mapping rules that are defined over XPath expressions.
In the context of the OntoLT tool, a corpus is a text collection in XML format, where text is annotated using linguistic annotation tools.
OntoLT uses linguistic annotation that includes part-of-speech, morphological inflection and decomposition, phrase and dependency structure (head-complement, head-modifier and grammatical functions).
An XPath expression is used to extract required elements or attributes from a given XML document (see for instance http://www.w3schools.com/xpath/).
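As a minimal sketch of how an XPath expression selects elements, the Python snippet below evaluates one over a tiny hand-written XML fragment. The element and attribute names (sentence, phrase, mod, head, type) are hypothetical and do not reproduce the actual OntoLT annotation format. Note also that Python's standard xml.etree.ElementTree module supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation fragment; NOT the real OntoLT XML format.
SAMPLE = """
<sentence id="s1">
  <phrase type="NP"><mod>semantic</mod><head>web</head></phrase>
  <phrase type="VP"><head>evolves</head></phrase>
  <phrase type="NP"><head>ontology</head></phrase>
</sentence>
"""

def np_heads(xml_text):
    """Return the head words of all nominal phrases (type="NP")."""
    root = ET.fromstring(xml_text.strip())
    # XPath: any <phrase> whose type attribute is "NP", then its <head> child.
    return [h.text for h in root.findall(".//phrase[@type='NP']/head")]

print(np_heads(SAMPLE))  # ['web', 'ontology']
```

The expression selects only the heads of nominal phrases, so the head of the verb phrase is not returned.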
A mapping rule defines how to map linguistic entities in a corpus to Protégé classes and/or slots. Such rules are implemented by use of a precondition language, where a precondition is implemented as an XPath expression over the XML-based linguistic annotation. If all preconditions of a mapping rule are satisfied, a set of linguistic entities will be generated and one or more operators will be activated that define how each of these will be mapped to a corresponding Protégé class or slot.
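The idea of a mapping rule can be sketched in a few lines of Python: every precondition is an XPath expression, and the operators run only when all preconditions match. This is an illustration under stated assumptions, not the OntoLT implementation. The XML element names, the RULE dictionary layout and the operator names (create_class, create_subclass) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation fragment; NOT the real OntoLT XML format.
SAMPLE = """
<text>
  <phrase type="NP"><mod>semantic</mod><head>web</head></phrase>
  <phrase type="NP"><head>ontology</head></phrase>
  <phrase type="VP"><head>evolves</head></phrase>
</text>
"""

# Hypothetical rule: each precondition is an XPath expression evaluated
# relative to a context element; all of them must match for the rule to fire.
RULE = {
    "context": ".//phrase[@type='NP']",
    "preconditions": {"head": "./head", "mod": "./mod"},
}

def operators(bound):
    """Hypothetical operators: map the bound entities to class candidates."""
    head = bound["head"]
    return [
        ("create_class", head),
        ("create_subclass", f"{bound['mod']}_{head}", head),
    ]

def apply_rule(rule, xml_text):
    """Return the operator calls for every context satisfying all preconditions."""
    root = ET.fromstring(xml_text.strip())
    actions = []
    for ctx in root.findall(rule["context"]):
        bound = {}
        for var, xpath in rule["preconditions"].items():
            node = ctx.find(xpath)
            if node is None:
                break  # a precondition failed; skip this context
            bound[var] = node.text
        else:
            # Reached only if no precondition failed.
            actions.extend(operators(bound))
    return actions

print(apply_rule(RULE, SAMPLE))
```

In this sketch the second noun phrase has no modifier, so one precondition fails and no candidates are generated for it.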
OntoLT includes a statistical analysis functionality to lexically constrain a mapping rule towards linguistic entities that are relevant for the domain. It computes a relevance score for each linguistic entity by comparison of its frequency in a domain corpus with that of its frequency in a reference corpus. Linguistic entities that are more specific for the domain corpus will receive a higher score.
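The comparison of domain and reference frequencies can be illustrated with a small Python sketch. This guide does not specify the exact measure that OntoLT computes, so the smoothed relative-frequency ratio used here is an assumption chosen for simplicity, and the function name and the toy word lists are hypothetical.

```python
from collections import Counter

def relevance_scores(domain_tokens, reference_tokens):
    """Score each domain term by the ratio of its relative frequency in the
    domain corpus to its add-one smoothed relative frequency in the
    reference corpus. Illustrative only; not OntoLT's actual measure."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    dom_total = sum(dom.values())
    vocab_size = len(set(dom) | set(ref))
    ref_total = sum(ref.values()) + vocab_size  # add-one smoothing mass
    scores = {}
    for term, count in dom.items():
        dom_rel = count / dom_total
        ref_rel = (ref[term] + 1) / ref_total
        scores[term] = dom_rel / ref_rel
    return scores

# Toy data: "ontology" is frequent in the domain corpus, absent elsewhere.
domain = ["ontology", "ontology", "ontology", "time", "web"]
reference = ["time", "time", "time", "web", "house", "house"]

ranked = sorted(relevance_scores(domain, reference).items(),
                key=lambda item: item[1], reverse=True)
for term, score in ranked:
    print(f"{term}\t{score:.2f}")
```

Terms that are common in general language, such as "time" in the toy data, end up at the bottom of the ranking even if they occur in the domain corpus.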
In order to work with OntoLT you should first start Protégé and open the “OntoLT” project.
Please note that you should use Protégé-2000 version 1.8.
Plug-ins in Protégé are implemented as tabs. You will find OntoLT as the right-most tab. Clicking on it will show something like Figure 1 (to see the content of the right-side pane, click on the first mapping rule in the left-most pane).
OntoLT itself also consists of tabs, which have the following functionality (left-to-right):
· Mappings: Here you can define a new mapping rule or refine an existing mapping rule (see section Statistical Relevance)
· XPaths: Here you can define XPath expressions (i.e. preconditions) or redefine them according to a different XML format
· Corpora: Here you can upload one or more annotated texts or text collections (see section Upload a Corpus)
· CandidateView: Here you can extract concepts and/or relations from the annotated text collection (see section Extract Candidates)
Figure 1: An Example Mapping Rule in the “Mappings” Tab
To extract concepts and/or relations from text you will need to upload an annotated text collection (in the context of the Short User Guide this will be the KMI corpus). You can do this in the “Corpora” tab as shown in Figure 2.
First click on the “C” button in the left pane. You will be able to upload a new corpus and to give it a name. Write a name (for naming instructions see below) in the dedicated box and click in the lower pane. Now you can upload files by clicking on the + in the same pane.
For ease of use in the experiments, upload two corpora:
· KMI_small with the first 10 documents in the ‘data/xml/KMI’ directory
· KMI_all with all of the documents in the ‘data/xml/KMI’ directory
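If you want to check which files belong in KMI_small before uploading them, a short Python sketch such as the following can list them. It assumes that the annotated documents carry an .xml extension and that "first" means first in alphabetical order of file names; neither is stated in this guide, and the function name is hypothetical.

```python
from pathlib import Path

def first_documents(directory, limit=10, pattern="*.xml"):
    """Return the first `limit` files matching `pattern`, sorted by name.

    Assumes alphabetical order defines "first"; adjust if your file
    dialog sorts differently."""
    return sorted(Path(directory).glob(pattern))[:limit]

if __name__ == "__main__":
    # Directory name taken from this guide; run from the OntoLT project folder.
    for path in first_documents("data/xml/KMI"):
        print(path.name)
```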
Figure 2: Uploading a Corpus in the “Corpora” Tab
You will now be able to extract class and/or slot candidates from the annotated corpus. To do this you should go to the “CandidateView” tab. Press the “binoculars” button in the left-most pane. Select a corpus to work with (use KMI_small the first time) and press “ok”. A progress window will appear, after which an extraction-ID (date/time) is generated in the left-most pane. To see the extracted candidates click on this ID. You will now see something similar to the example shown in Figure 3.
To inspect extraction information for individual classes or slots, select one by clicking on its ‘key icon’, then on the ‘key icon’ of the “Candidates” folder and then on one of the class candidates listed. The detailed extraction information for this particular candidate will then be shown, which includes the mapping rule that has been applied, the sentence from which the candidate was extracted and the operator that has been executed. An example is shown in Figure 4.
Figure 3: An Example of Extracted Candidates in the "CandidateView" Tab
Figure 4: Context for an Extraction Example in the "CandidateView" Tab
In ontology development for particular tasks and/or domains there is a need to focus only on those concepts and relations that are (highly) relevant for the task or domain in question. For this purpose, OntoLT includes a functionality for the computation of a statistical relevance score for extracted linguistic entities by comparison of their frequencies in a (domain-specific) corpus with frequencies in a reference (more general) corpus. In this way, word use in a particular domain is contrasted with that of more general word use.
You can use this functionality in the “Mappings” tab by pressing the “binoculars” button in the left-most pane. After pressing “ok”, you will be presented with a form that you could fill out, for instance, as follows:
· Mapping: select HeadNounToClass_ModifierToClass
· Relevant Variable: select HeadNoun_Text
· Corpus: select KMI
· ReferenceCorpus file: upload the Nouns_BNC.txt file from the ‘data/analyzer’ directory
After you press “ok”, a progress window will appear, after which the results of the statistical relevance computation will be presented as shown in Figure 5.
Figure 5: An Example of Statistical Relevance Computation in the "Mappings" Tab
You can edit the results by (de-)selecting any of the extracted terms. If you are satisfied with the results press “ok”. The selected mapping rule will now have been extended with a further precondition (in this case on the value of a head-noun). To display this update push the “update” button in the “Conditions” pane (the button with the small green arrow).
After refinement of the mapping rules you can extract domain-specific candidate classes and slots by returning to the “CandidateView” tab and running the extraction again. The results will now be restricted to the selected terms only.
Extracted candidate classes and slots can be validated (select or deselect them) and you can generate ontology fragments accordingly by pushing the right-most button in the middle pane. After doing this, switch to the “Classes” tab (the left-most of the main tabs of Protégé). The extracted classes and slots will be displayed here and will look similar to what is shown in Figure 6.
Figure 6: An Example of Extracted Ontology Fragments
To integrate extracted ontology fragments with other tools, you are advised to export results to the Protégé or RDF/S format (go to the Project menu in the main window of Protégé, select “Save in Format…” and select your favorite format).