Resources

Lexical and Semantic Resources

Verbal databases with syntactic and semantic information
In these databases we describe several constructions in which a verb can participate.

 

Spanish SenSem >

This database describes the 250 most frequent Spanish verbs from a syntactic and semantic perspective. The information provided in the lexical database has been inferred from the annotation of a journalistic corpus of over 700,000 words and a small literary corpus. We define all the senses for each lemma (verb form) and indicate the degree of representation of each sense in the corpus. For each sense, we propose a definition, the Aktionsart and the semantic roles. Each sense is also linked to its WordNet equivalent and to the corresponding examples from the corpus. The sentences are organized according to the subcategorization patterns that they represent. The frequency of each pattern is also indicated.

+ information

 

Spanish Volem >

To build this resource, we started from the Spanish database included in the multilingual lexicon VOLEM. We enlarged the number of entries described as well as the number of constructions taken into account.

 

Multilingual Volem >

In this multilingual resource (Spanish-Catalan-French-Basque), subcategorization frames together with their semantics are specified for each verb. It also provides information regarding semantic roles and examples of use.

 

Catalan SenSem >

This database describes approximately 250 Catalan verbs from a syntactic and semantic perspective. The information provided in the lexical database has been inferred from the annotation of a journalistic corpus of over 700,000 words. We define all the senses for each lemma (verb form) and indicate the degree of representation of each sense in the corpus. For each sense, we propose a definition, the Aktionsart and the semantic roles. Each sense is also linked to its WordNet equivalent and to the corresponding examples from the corpus. The sentences are organized according to the subcategorization patterns that they represent. The frequency of each pattern is also indicated.

+ information

Semantic nets
We have been working on several fronts to update the Spanish and Catalan WordNets, and all of these improvements have been incorporated into the Multilingual Central Repository (MCR). The MCR also includes the English, Basque and Galician WordNets, which have been built jointly with other research groups: TALP, IXA, IULA and the University of Vigo.

 

EuroWordNet interlinguistic annotation with the Top Concept Ontology >

This resource provides a complete and consistent ontological tagging of the nominal structure of WordNet 1.6 with the semantic features defined in the EuroWordNet Top Concept Ontology. WordNet 1.6 is mapped to the EuroWordNet Inter-Lingual Index (ILI); therefore, this tagging can be applied to any WordNet of any language that is mapped to the ILI. This semantically tagged WordNet can be useful in many NLP tasks that involve semantic processing.
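The idea of reusing the tagging through the ILI can be illustrated with a minimal sketch. It assumes two lookup tables: one from ILI records to ontology features and one from the synsets of a target-language wordnet to ILI records. The record ids, synset ids, feature assignments and the function name `project_features` are all hypothetical and chosen only for illustration; they do not reflect the actual file formats of the resource.

```python
from typing import Dict, FrozenSet

# Hypothetical ILI record ids with illustrative feature sets
# (not taken from the actual resource).
ILI_FEATURES: Dict[str, FrozenSet[str]] = {
    "ili-0001": frozenset({"Living", "Animal"}),
    "ili-0002": frozenset({"Artifact", "Object"}),
}

# Hypothetical mapping from target-language synsets to ILI records.
SYNSET_TO_ILI: Dict[str, str] = {
    "spa-perro-n-1": "ili-0001",
    "spa-mesa-n-1": "ili-0002",
    "spa-idea-n-1": "ili-9999",  # ILI record without recorded features
}


def project_features(
    synset_to_ili: Dict[str, str],
    ili_features: Dict[str, FrozenSet[str]],
) -> Dict[str, FrozenSet[str]]:
    """Give each synset the features of the ILI record it is mapped to."""
    return {
        synset: ili_features.get(ili_id, frozenset())
        for synset, ili_id in synset_to_ili.items()
    }


tagged = project_features(SYNSET_TO_ILI, ILI_FEATURES)
for synset, features in sorted(tagged.items()):
    print(synset, sorted(features))
```

Under these assumptions, a wordnet only needs its own synset-to-ILI mapping to inherit the tags; synsets mapped to an untagged record simply receive an empty feature set.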

Reference:
Álvez J., J. Atserias, J. Carrera, S. Climent, A. Oliver and G. Rigau (2008) Consistent annotation of EuroWordNet with the Top Concept Ontology. In Proceedings of The 4th Global Wordnet Association Conference. Szeged. Hungary. http://cv.uoc.es/~grc0_001091_web/files/Alvez-et-al-GWA2008.pdf
This resource has been developed within the following projects:
MEANING: Developing Multilingual Web-Scale Language Technologies. UE. IST Programme. FP5. IST-2001-34460 (2002-2005)
KNOW. Desarrollo de tecnologias multilingues a gran escala para la comprension del lenguaje. Ministerio de Educación y Ciencia. TIN2006-15049-C03-02. (2006-2009)

 

Spanish and Catalan WordNet 3.0 – Automatic construction >

Work on the WordNet 3.0 version for Catalan was automated, and we have completed WordNet 3.0 for Spanish. A variety of methods were employed, based on the automatic translation of annotated corpora, WordNet glosses, bilingual dictionaries and encyclopedic sources.

 

Spanish WordNet 3.0 – Manual construction >

This is the first available version of Spanish WordNet 3.0. It derives from the preexisting English resource. Approximately 10,000 glosses have been translated, which means that about 30,000 lexical entries are available for Spanish. The noteworthy addition in this version is that corpus definitions and example words have been annotated both morphosyntactically and semantically.

+ information

 

WN-Toolkit >

WN-Toolkit is a toolkit for the semiautomatic creation of wordnets for any language. It is based on either dictionaries or parallel corpora. It has been developed within the project SKR (Representación del conocimiento semántico). Ministerio de Ciencia e Innovación. TIN2009-14715-C04.

Reference:
Oliver A. (2014) WN-Toolkit: Automatic Generation of WordNets following the expand model. In Proceedings of the 7th International Global WordNet Conference. Tartu, Estonia.

Others

Dictionary Catalan-German >

This dictionary is a resource created by Dr. Jaume Tió. Searches range from words and phrases to inflectional paradigms, syntactic analysis, and the initial or final fragments of canonical entries or phrases.

 

Lexicon of prototypical discourse markers >

This is the seminal discourse marker lexicon used in the thesis Representing discourse for automatic text summarization via shallow NLP techniques. The discourse markers listed here were the primary source of evidence for drawing the semantic maps used to obtain an inventory of basic discursive meanings. This lexicon is also the basis for the implementation of a discourse segmenter and for the discourse analysis exploited by the e-mail summarizer Carpanta. The lexicon is parallel in three languages: Catalan, Spanish and English. Therefore, in this starting version of the lexicon we have only included those discourse markers that have a near-synonym in one of the other languages. Those that do not have a near-synonym have been included in the extended version of the lexicon, created by applying bootstrapping techniques to this starting lexicon.

The discourse markers that constitute the prototypical lexicon were obtained from previous work, mostly Knott (1996) and Marcu (1997), with the restriction that they are highly grammaticalized. We have also included in the lexicon some closed-class words, obtained from the dictionary of the FreeLing morphosyntactic analyzer. We have discarded closed-class words that are very vague and highly ambiguous as discourse markers. The lexicon consists of 84 discourse markers, representing different discursive meanings. Some discourse markers have been assigned more than one meaning per dimension because they are ambiguous, and others have been assigned none because they are underspecified.

 

Trilingual dictionary of periphrasis: Spanish –> Romanian/Catalan >

This tool is based on the research that Mihaela Topor carried out for her doctoral thesis. In this thesis, she defines 44 Spanish periphrases and includes their translations into Romanian and Catalan. On the basis of these 44 periphrases, different degrees of grammaticalization are established. Each verbal complex is described semantically by providing a definition and is assigned to one of two possible groups: aspectual or modal. Additionally, she specifies the semantic subclass and, whenever possible, provides other equivalent periphrases (synonyms or near-synonyms). Her description includes the following usage restrictions: actional, temporal, recursivity and semantic type of subject. Users of the tool can browse a significant number of examples together with the bibliographic references for each periphrasis.

+ information

 

Terminology Extraction Suite >

Terminology Extraction Suite is an application for automatic terminology extraction, designed to be effective and easy to use. The application is written in Perl and can be run on Linux, Windows and Mac. It uses a statistics-based method to extract terminology automatically. It can extract candidate terms from one language and automatically search for an equivalent translation in a parallel corpus.
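As a rough illustration of what a statistics-based approach to candidate extraction can look like, the sketch below counts word n-grams, discards those that start or end with a stopword, and keeps the ones above a frequency threshold. This is a generic technique written in Python for readability, not the method or code of Terminology Extraction Suite itself; the function name `candidate_terms`, the stopword list and the thresholds are assumptions made for the example.

```python
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stopword list (an assumption; a real list would be larger).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on", "with"}


def candidate_terms(
    text: str, max_n: int = 3, min_freq: int = 2
) -> List[Tuple[str, int]]:
    """Return n-grams (1..max_n words) seen at least min_freq times.

    N-grams that begin or end with a stopword are skipped.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i : i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    ranked = [(term, freq) for term, freq in counts.items() if freq >= min_freq]
    # Most frequent first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


sample = (
    "Machine translation is hard. Statistical machine translation uses "
    "a parallel corpus. A parallel corpus helps machine translation."
)
for term, freq in candidate_terms(sample):
    print(freq, term)
```

Raw frequency is the simplest possible statistic; it is used here only to keep the sketch short, and the resulting list is a set of candidates that would still need to be reviewed.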