On the constitution of corpora of less-resourced languages





less-resourced languages, NLP, corpus


While the use of corpora in linguistic studies is quite old, Corpus Linguistics is a relatively new area of study, which emerged as access to computers, and consequently to Natural Language Processing (NLP), expanded. As the field gained influence within linguistic research, the concept of a corpus became more specific: breadth of sampling and standard references, as well as machine readability and finiteness, became essential criteria for composing samples. At the same time, however, smaller and much narrower corpora emerged with distinct purposes, such as documenting endangered languages. Against this background, this paper discusses the differences between “prototypical” corpora built on the assumptions of Corpus Linguistics and corpora of less-resourced languages, which have a small digital footprint. I demonstrate that corpora of less-resourced languages tend to be more specialized and hardly ever fulfill the criteria required of a broad and representative corpus. Despite the limitations entailed by issues specific to each language, I conclude that corpora for less-resourced languages must be built, even if they do not fulfill all the desirable criteria of Corpus Linguistics. The results must be exploited in diverse ways, whether through the creation of new technologies, as empirical support for linguistic theories, or in promoting the language within the community.

Author Biography

Lílian Teixeira de Sousa, Universidade Federal da Bahia (UFBa)

She holds a bachelor's degree in Letters (Letras) from the Universidade Federal de Ouro Preto (2004), a master's degree in Linguistic Studies from the Universidade Federal de Minas Gerais (2007), and a doctorate in Linguistics from the Universidade Estadual de Campinas (2012), including a sandwich (visiting doctoral) period at the Free University of Berlin. She is currently an adjunct professor at the Universidade Federal da Bahia, teaching at the undergraduate and graduate levels. She has experience in Linguistics, with an emphasis on Linguistic Theory and Analysis, working mainly on the following topics: diachrony, syntax, and interfaces.

