Studies on ontologies have recently received growing attention, largely due to the advent of Web services. The Semantic Web [TBL01] represents a new direction in knowledge representation: an attempt to make information, texts and knowledge shareable among the different actors (e.g. producer/provider vs. consumer/user) involved in a Web application. From the point of view of language and text processing, the Semantic Web poses challenges in several directions. Linguistic capabilities are needed from at least two major perspectives.
First, a strong linguistic grounding is needed for expressing domain ontologies, i.e. the facts about a domain that are critical to the inferences of the target Web application. An ontology is thus seen as a theory about a domain, and it obeys specific logical constraints. However, it is usually expressed in linguistic terms, as relations and entities are first named by the knowledge engineer. Most of the work done in knowledge representation and modeling for language understanding (e.g. [Gruber1993]) is analogous and is targeted at a very similar task: the disambiguation of texts. Any research on the treatment of semantics over the Web should thus take into account all the accumulated experience (principles, formalisms, and results, i.e. existing knowledge bases) and language-oriented resources (e.g. dictionaries, lexicons and thesauri).
A second aspect makes the Semantic Web very interesting for research in NLP. Before entering the scenario of interoperable services, any Web document is (to a large extent) a textual object and, as such, it obeys laws that are linguistic in nature. In order to map any such instance into a (set of) semantically interoperable data object(s), we need to give it a structure. As already noted some years ago, the task of information extraction (IE) [SCIE97] can be formulated as the activity of matching or discovering structured information (i.e. the target templates) where such a structure is only implicitly present (i.e. in texts). In other words, IE makes explicit what is only implicitly expressed by the language. Clearly, this applies to most of the information currently available (in a logical, i.e. structured, form) within the Semantic Web. Although the IE mapping can be the responsibility of the provider, it must be noted that every textual object needs to be rewritten in some form for semantic interoperability. This calls for robust and large-scale NLP capabilities.
The cross-fertilization between advanced data modeling technologies (as pursued in the Semantic Web) and NLP is thus a very interesting line of research. On the one hand, NLP is a kind of tool for "uploading" the Semantic Web. On the other hand, the Web provides more context for a semantic interpretation process in which the necessary world and language knowledge is more constrained.
Ontological and lexical knowledge interact critically in these processes. The ontology determines the domain (i.e. the set of things and their relations) in which the extraction process is immersed. However, it does not capture the "linguistic" ways in which such properties are realized in texts. For example, an ontology does not tell us about the different ways a given event is realized, since many verbs can denote the same event. From an NLP perspective, such knowledge is usually coded in a lexicon. However, such a lexicon must express linguistic properties connected to some relevant concept in the ontology. I will refer to the set of such linguistic information as the linguistic interface of the ontology.
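To make the notion of a linguistic interface more concrete, the following is a minimal sketch, not the representation assumed in this work. It stores, for each ontology concept, the lemmas that may realize it in text. The class name `LinguisticInterface`, its methods, and the concept names (`AcquisitionEvent`, `HiringEvent`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class LinguisticInterface:
    """Toy lexicon linking ontology concepts to the lemmas that realize them."""

    # concept name -> set of lower-cased lemmas observed for that concept
    realizations: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, concept: str, lemma: str) -> None:
        """Record that `lemma` is one way of expressing `concept` in text."""
        self.realizations.setdefault(concept, set()).add(lemma.lower())

    def concepts_for(self, lemma: str) -> Set[str]:
        """Return every concept that `lemma` may denote (possibly several)."""
        lemma = lemma.lower()
        return {c for c, lemmas in self.realizations.items() if lemma in lemmas}


# Hypothetical entries: several verbs denote the same event concept.
interface = LinguisticInterface()
for verb in ("buy", "acquire", "purchase"):
    interface.add("AcquisitionEvent", verb)
interface.add("HiringEvent", "hire")
interface.add("HiringEvent", "acquire")  # an ambiguous verb in this toy lexicon
```

In this sketch, looking up "purchase" yields a single concept, whereas "acquire" yields two; the ontology is then what allows such ambiguity to be resolved in context.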
It is clear that the linguistic interface of an ontology is not a static form of knowledge. It in fact interacts with the ontology and grows with it whenever changes occur in the domain. Whenever new "linguistic" evidence is available for a concept, new factual information can be detected and added to the existing ontology. Here two processes interplay in a fruitful manner: learning to extract and learning new concepts.
The ontology makes available to the linguistic component a conceptual framework by which ambiguity in source texts can be governed. This leads to a scenario where learning the linguistic interface is possible, since the ways in which concepts are expressed can be induced by observing corpora (learning to extract structures from raw texts).
When rules for matching information in texts are available and are also linked to the ontology concepts, some unexpected phenomena (e.g. nouns/senses that are systematically related in texts) can be more readily discovered and proposed as new ontological knowledge. This is the process of learning new concepts. The two mechanisms above interact: the more accurate one is, the better the control over the other.
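The discovery of systematically related nouns/senses can be illustrated with a deliberately simple sketch. It assumes that the matching rules reduce to a word-to-concept lookup and that "systematically related" means co-occurring in at least `min_count` sentences; both are simplifying assumptions made here, not the method proposed in this work. The function `candidate_relations`, the toy lexicon and the concept names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_relations(sentences, lexicon, known_relations, min_count=2):
    """Propose concept pairs that co-occur often but are not yet in the ontology.

    sentences: list of token lists; lexicon: token -> concept name;
    known_relations: set of (concept, concept) pairs in alphabetical order.
    """
    counts = Counter()
    for tokens in sentences:
        # Concepts evidenced in this sentence, sorted so each pair has one form.
        concepts = sorted({lexicon[t] for t in tokens if t in lexicon})
        for pair in combinations(concepts, 2):
            counts[pair] += 1
    return [
        pair
        for pair, n in counts.items()
        if n >= min_count and pair not in known_relations
    ]


# Hypothetical toy data.
lexicon = {
    "company": "Organization", "firm": "Organization",
    "ceo": "Person", "manager": "Person",
    "city": "Location",
}
sentences = [
    ["the", "ceo", "left", "the", "company"],
    ["a", "manager", "joined", "the", "firm"],
    ["the", "company", "moved", "to", "the", "city"],
]
known = {("Location", "Organization")}
proposed = candidate_relations(sentences, lexicon, known)
```

Here the only pair proposed is `("Organization", "Person")`, which is seen twice and is absent from the known relations. A better lexicon yields cleaner counts, and each accepted relation in turn constrains later extraction, which is the interplay described above.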