Beyond Named Entity Recognition

Semantic labelling for NLP tasks



Centro Cultural de Belem
LISBON, Portugal
25th May 2004

In Association with LREC 2004
Main conference 26-28 May 2004


Motivation and Aims


Although it is generally assumed that improvements in language processing will come from integrating linguistic information with statistical techniques, language is very diverse, and looking for specific word patterns that repeat often enough to be statistically significant tends not to be a fruitful task: sequences longer than three words are rarely repeated often enough. At the same time, the identification of named entities (names, dates, places, organizations, etc.) has proved to be a very useful preliminary task in many natural language processing systems. We are interested in pursuing approaches that extend this notion by identifying and labeling other semantic information in a text, in such a way as to allow repeatable semantic patterns to emerge. Our interest is in attacking the data sparseness problem by exploring ways to collapse semantically related phrases that are expressed by different word sequences.


Although this approach seems closely related to previously proposed class-based language models (see, for example, Brown et al. 90 in Computational Linguistics), it differs in that the empirical notion of classes used in that work (e.g. classes made up of collocationally similar words) is replaced by semantically justified sets.
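The effect of collapsing related phrases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label lexicon, the label names and the three example sentences are hypothetical and do not come from any resource discussed at the workshop. The sketch counts word trigrams before and after replacing words with semantic labels.

```python
from collections import Counter

# Hypothetical label lexicon (an assumption made purely for illustration).
SEMANTIC_LABELS = {
    "lisbon": "<PLACE>",
    "rome": "<PLACE>",
    "sheffield": "<PLACE>",
    "monday": "<DATE>",
    "friday": "<DATE>",
    "flew": "<MOTION>",
    "travelled": "<MOTION>",
    "drove": "<MOTION>",
}


def label_tokens(tokens):
    """Replace each token that has a semantic label with that label."""
    return [SEMANTIC_LABELS.get(token, token) for token in tokens]


def trigram_counts(sentences, use_labels=False):
    """Count trigrams over whitespace-tokenised, lower-cased sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if use_labels:
            tokens = label_tokens(tokens)
        # zip over three shifted views of the token list yields the trigrams
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts


# Hypothetical example sentences: same meaning pattern, different words.
SENTENCES = [
    "She flew to Lisbon on Monday",
    "He travelled to Rome on Friday",
    "They drove to Sheffield on Monday",
]

surface = trigram_counts(SENTENCES)
labelled = trigram_counts(SENTENCES, use_labels=True)

print("distinct surface trigrams: ", len(surface))
print("distinct labelled trigrams:", len(labelled))
print("most frequent labelled trigram:", labelled.most_common(1)[0])
```

On this toy input no surface trigram occurs more than once, whereas the labelled version contains patterns that repeat across all three sentences. Which labels to use, and at what granularity, is exactly the open question the workshop addresses.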


Notice how Named Entity (NE) tagging and Word Sense Disambiguation (WSD) represent, in terms of granularity and representational complexity, two extremes of a single general problem: semantic disambiguation. Semantic disambiguation thus serves the purpose of improving the generalization power of statistical models. One of the questions here is how to determine a suitable level of clustering (for NE identification and for WSD) that would lead to high accuracy and to improved performance of the resulting statistical models.


Reasons for Interest


It should be noted that several recent independent research efforts on the statistical treatment of semantic phenomena (e.g. WordNet navigation as a stochastic process, as studied in Light and Abney or in Ciaramita & Johnson, 2003) correlate highly with the research program proposed above.


The workshop will offer a forum where experience from lexical semantics and statistical learning will be presented and fruitful discussion among researchers in both fields will be promoted. The workshop is expected to attract researchers and practitioners from a range of areas, as well as developers of large-scale semantic resources who are interested in effective methods of semantic labeling.


Topics to be addressed in the workshop include, but are not limited to:

  • Methods for lexical-semantic annotation of corpora
  • Methods and standards for lexical semantic representation of dictionary information
  • Lexico-semantic taxonomies
  • Existing sources of classification: dictionaries, thesauri and computerized ontologies
  • Corpus-driven methods for semantic disambiguation
  • Feature selection for semantic disambiguation
  • Lexico-semantic tagging of very large corpora
  • Algorithms and methods for disambiguation of semantic phenomena
  • Statistical learning models and their applications to semantic labeling
  • Computational learning frameworks for Natural Language Learning
  • Semi-supervised and unsupervised statistical semantic disambiguation
  • Evaluation of semantic disambiguation



Workshop format


The workshop will be a half-day event with position statements from invited speakers (half an hour each) and two hours for 4-6 presentations of scientific papers. Submissions may present work in progress as well as more complete work that falls within the scope defined by the topics listed above. A final one-hour open discussion among all the workshop participants will be moderated by the organizers. In order to stimulate an interesting general discussion, each member of the program committee will be invited to submit a position statement of max. 1000 words.




Participants are invited to submit an extended abstract of max. 3500 words concerning one or more of the topics of interest. Each accepted paper receives a slot of 25 minutes for presentation (15 minutes for the talk and 10 minutes for discussion). Each submission should show: title; author(s); affiliation(s); and the contact author's e-mail address, postal address, telephone and fax numbers. Submissions must be sent electronically in PDF to the following address:

Roberto Basili

Dept. of Computer Science, Systems and Management
University of Roma Tor Vergata


Proceedings and Publications


Proceedings of the workshop will be printed by the LREC Local Organising Committee.

The organizers are negotiating the publication of a special issue on “Semantic tagging/labelling for NLP tasks” with the Computer Speech and Language Journal, and selected papers will appear in that issue.


Important dates

Extended abstract submission (max. 3500 words)

16th of February 2004

Notification of acceptance

8th of March 2004

Preliminary Program

29th of March 2004

Submission of the final version of paper

5th of April 2004

Workshop

25th of May 2004





Organizing Committee


Louise Guthrie - University of Sheffield, UK

Roberto Basili - University of Rome, Tor Vergata, Italy

Eva Hajicova - Charles University, Czech Republic

Fred Jelinek - Johns Hopkins University, Maryland, USA

Further Information

For any information related to the organization, please contact: 

Roberto Basili 

Dept. of Computer Science, Systems and Management
University of Roma Tor Vergata
Via di Tor Vergata
00133 Roma (ITALY)

tel:     +39 06 72597391
fax:    +39 06 72597460

