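Adquisició d’informació lèxica i morfosintàctica a partir de corpus sense anotar: aplicació al rus i al croat [Acquisition of lexical and morphosyntactic information from unannotated corpora: application to Russian and Croatian] (2004)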

Author: Antoni Oliver Gonzalez

Supervisors: Irene Castellón and Lluís Màrquez

In this thesis, several methodologies for the automatic acquisition of lexical and morphosyntactic information are presented, together with the unsupervised learning of morphology from unannotated corpora. The methodologies have been tested on two Slavic languages, Russian and Croatian, which are characterised by a very rich, mainly concatenative morphology. This characteristic has guided the design of algorithms that can be adapted very easily to other languages, provided that they have a relatively rich morphology and that their main morphological processes, whether suffixal or prefixal, can be described concatenatively. An exhaustive evaluation of the methodologies presented has been carried out, and it shows that they work very well for these languages. The fact that some of the algorithms work from unannotated corpora makes them very interesting for creating new lexical resources and for extending existing ones. The algorithms presented in this work can also use the Internet to search for information that is not present in the corpus. This makes it possible to apply these processes without having to compile very large corpora.