Projects

SenSem: Databank of Spanish sentences annotated syntactically and semantically

As part of the SenSem project (Sentence Semantics: Creación de una Base de Datos de Semántica Oracional), we created a corpus of sentences annotated at the syntactic and semantic levels.

The source corpus consists of around 13 million words extracted from the online versions of a Spanish newspaper (El Periódico). From this corpus, 25,000 sentences were randomly selected, 100 for each of the 250 most frequent verbs in current Spanish. Each sentence has been labelled with the verb sense it exemplifies, the type of each complement it takes (argument or adjunct), and the syntactic category and function of each complement; each argument has also been labelled with a semantic role. The sentence as a whole has been annotated for its semantics as well, covering both aspectual information and the type of construction expressed.
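The annotation layers described above can be pictured as a small record per sentence. The following is only an illustrative sketch: the class names, field names, and label values (such as "aprobar_1", "agent", or "event") are assumptions made for the example and do not reproduce the actual SenSem tag set or file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    """One complement of the verb (hypothetical structure)."""
    text: str
    kind: str                   # "argument" or "adjunct"
    category: str               # syntactic category, e.g. "NP"
    function: str               # syntactic function, e.g. "subject"
    role: Optional[str] = None  # semantic role, required for arguments

    def __post_init__(self):
        if self.kind not in ("argument", "adjunct"):
            raise ValueError(f"unknown complement type: {self.kind}")
        if self.kind == "argument" and self.role is None:
            raise ValueError("every argument needs a semantic role")


@dataclass
class AnnotatedSentence:
    """One corpus sentence with its annotation layers (hypothetical)."""
    text: str
    verb: str
    sense: str         # verb sense exemplified by the sentence
    aspect: str        # aspectual information for the sentence
    construction: str  # type of construction expressed
    constituents: List[Constituent] = field(default_factory=list)

    def arguments(self) -> List[Constituent]:
        return [c for c in self.constituents if c.kind == "argument"]


# Invented example sentence; all labels are placeholders.
example = AnnotatedSentence(
    text="El gobierno aprobó la ley ayer",
    verb="aprobar",
    sense="aprobar_1",
    aspect="event",
    construction="active",
    constituents=[
        Constituent("El gobierno", "argument", "NP", "subject", "agent"),
        Constituent("la ley", "argument", "NP", "direct object", "theme"),
        Constituent("ayer", "adjunct", "AdvP", "adverbial"),
    ],
)
```

Keeping the sentence-level labels (sense, aspect, construction) apart from the per-complement labels mirrors the two levels of annotation the paragraph describes.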

From this annotated corpus, a lexical database of verbs has been created that gathers all of the information described above. The unit of description is the verb sense. Each description includes the argument structure of the sense, with its subcategorization patterns and their frequencies, the semantic roles, and information on sentence semantics.
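As a rough sketch of how such a lexicon could be derived from the annotated sentences, the snippet below counts subcategorization patterns per verb sense and turns the counts into relative frequencies. The function name `build_lexicon`, the input layout, and the sample labels are assumptions made for illustration; this is not the project's actual procedure or data.

```python
from collections import Counter, defaultdict


def build_lexicon(annotations):
    """Group subcategorization patterns by verb sense and count them.

    `annotations` is an iterable of (sense, pattern) pairs, where a
    pattern is a tuple of (function, category, role) triples, one per
    argument. Returns {sense: [(pattern, count, relative_frequency)]},
    with patterns sorted from most to least frequent.
    """
    counts = defaultdict(Counter)
    for sense, pattern in annotations:
        counts[sense][tuple(pattern)] += 1

    lexicon = {}
    for sense, patterns in counts.items():
        total = sum(patterns.values())
        lexicon[sense] = [
            (pattern, n, n / total) for pattern, n in patterns.most_common()
        ]
    return lexicon


# Invented sample data; sense identifiers and labels are placeholders.
transitive = (("subject", "NP", "agent"), ("direct object", "NP", "theme"))
intransitive = (("subject", "NP", "theme"),)
sample = [
    ("abrir_1", transitive),
    ("abrir_1", transitive),
    ("abrir_1", intransitive),
    ("abrir_2", (("subject", "NP", "agent"),)),
]

lexicon = build_lexicon(sample)
for sense, entries in lexicon.items():
    for pattern, n, freq in entries:
        print(sense, n, round(freq, 2), pattern)
```

Indexing the result by sense rather than by verb lemma follows the statement that the sense is the unit of description.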

The lexicon and the corpus are linked at the sense level, and together they make up what we call the databank of the sentential semantics of Spanish verbs. Both resources are available on the web and provide a source of linguistic information that we hope will be useful in different areas of natural language processing and in linguistic research in general.

This project has since been extended with new funding.

Funding:

2004-2006 – Ministerio de Ciencia y Tecnología (BFF2003-06456)

Staff:

Ana Fernández Montraveta

Irene Castellón Masalles

Glòria Vázquez García

Mihaela Topor

Jaume Tió i Casacuberta

 

Development:

Joan Antoni Capilla Pérez

Laura Alonso i Alemany

Iolanda Mateu Dolcet

Marta Coll-Florit

José Lara

Publications:

  • Fernández, A., G. Vázquez, I. Castellón (2004). “La desambiguación automática de oraciones pronominales”. J. Valera, J.M. Oró, J. Anderson (eds.), Lengua y Sociedad: Lingüística aplicada en la era global y multicultural. Universidad de Santiago de Compostela, p. 127-144. ISBN: 84-9750-398-9
  • Vázquez, G., L. Alonso, I. Castellón, A. Fernández Montraveta (2004). “A Set of Heuristics for Semantic Sentence Disambiguation for Spanish”, 4th International Conference on Language Resources and Evaluation (LREC 2004). Lisbon, Portugal. ISBN: 2-9517408-1-6
  • Vázquez, G., A. Fernández, I. Castellón (2004). “El corpus Sensem: caracterización sintáctico-semántica de los verbos del español”. XXXIV Simposio de la Sociedad Española de Lingüística. Madrid
  • Castellón, I., A. Fernández, G. Vázquez (2005). “La semántica oracional del español: perspectiva desde el léxico”. G. Wotjak, J. Cantero (eds.), Entre semántica léxica, teoría del léxico y sintaxis. Frankfurt:Leipzig. Peter Lang, Europäischer Verlag der Wissenschaften, p. 113-122. ISBN: 3-631-53207-5. ISSN: 1436-1914
  • Vázquez, G., A. Fernández, L. Alonso (2005). “Description of the Guidelines for the Syntactico-semantic Annotations of a Corpus in Spanish”. Angelova, G., K. Bontcheva, R. Mitkov, N. Nicolov (eds.), International Conference Recent Advances in Natural Language. Shoumen (Bulgaria), p. 603-607. ISBN: 954-91743-3-6
  • Fernández, A., G. Vázquez, I. Castellón (2004). “Sensem: base de datos verbal del español”. G. de Ita, O. Fuentes, M. Osorio (eds.), IX Ibero-American Workshop on Artificial Intelligence, IBERAMIA. Puebla de los Ángeles, Mexico, p. 155-163. ISBN: 968-863-786-6
  • Alonso, L., J.A. Capilla, I. Castellón, A. Fernández, G. Vázquez (2005). “The Sensem Project: Syntactico-Semantic Annotation of Sentences in Spanish”, Proceedings of the International Conference RANLP, p. 39-46. Borovets, Bulgaria. ISBN: 954-91743-3-6
  • Castellón, I., A. Fernández, G. Vázquez, L. Alonso, J.A. Capilla (2006). “The Sensem Corpus: a Corpus Annotated at the Syntactic and Semantic Level”, Fifth International Conference on Language Resources and Evaluation (LREC), p. 355-359
  • Alonso, L., I. Castellón, N. Tincheva (2006). “Detección automática de errores en el Corpus Sensem”, Congreso de la Asociación Española de Lingüística Aplicada (AESLA)
  • Vázquez, G., L. Alonso, J.A. Capilla, I. Castellón, A. Fernández (2006). “SenSem: sentidos verbales, semántica oracional y anotación de corpus”, Procesamiento del Lenguaje Natural, 37, p. 113-120. ISSN: 1135-5948
  • Fernández, A., G. Vázquez, I. Castellón (2006). “SenSem: a Databank for Spanish Verbs”, Proceedings of the X Ibero-American Workshop on Artificial Intelligence, IBERAMIA. Ribeirão Preto, Brazil
  • Fernández, A., G. Vázquez, D. Teruel (2007). “Interfaz de explotación del corpus SenSem”. R. Mairal et al. (eds.), Aprendizaje de lengua, uso del lenguaje y modelación cognitiva. Perspectivas aplicadas entre disciplinas. Madrid: UNED, p. 1501-1508. ISBN: 978-84-611-6897-2