Resources

Corpora

Gold Standard de Factualitat TAGFACT

This corpus annotates the factuality of the verbal predicates in a Spanish journalistic corpus of 22 news items drawn from 6 newspapers. It contains more than 10,000 words, and about 1,300 predicates have been tagged. The news items are mainly political in theme. The corpus was annotated manually, after a prior training phase. It is publicly available in the Download section.

+ information

 

Catalan SenSem

This corpus was created by translating journalistic texts from the Spanish SenSem corpus. It comprises approximately 600,000 words (20,000 sentences). The syntactic and semantic annotations were transferred from the Spanish corpus and amended where necessary. The two corpora are parallel; however, the search engine only shows the sentences of each language separately. The engine offers several search options and also displays the annotation of each sentence.

+ information

 

Spanish SenSem

The corpus features texts from both journalistic and, to a lesser extent, literary sources, and consists of approximately a million words (30,000 sentences). It was created by collecting 125 sentences for each of the 250 most common verbs in Spanish. The sentences have been annotated manually at the syntactic and semantic levels (semantic roles, syntactic functions, phrase categories, construction types, aspect, aspectuality, modality and polarity). The search engine supports search criteria such as verbs and a range of linguistic phenomena. You can also view the linguistic annotation of the sentences.

+ information

 

Semantic annotation of nouns in SenSem Corpus

The lexical annotation of the SenSem corpus has been expanded in this resource by annotating the semantic tags of the verbal predicates of the corpus sentences. The Spanish WordNet 1.6 was used as the semantic resource to do so.

+ information

 

Trilingual Parallel Corpora GRIAL

This corpus is drawn from manuals, textbooks and magazines in the field of computer science. It is a parallel corpus for English, Spanish and Catalan. It comprises a total of 2,257,498 words (1,031,911 English; 891,903 Spanish; 393,684 Catalan) and has been automatically tagged at the morphosyntactic level.

+ information