On the constitution of corpora of less-resourced languages
DOI:
https://doi.org/10.31513/linguistica.2020.v16n1a31709
Keywords:
less-resourced languages, NLP, corpus
Abstract
While the use of corpora in linguistic studies is quite old, Corpus Linguistics is a relatively new area of study, emerging with the expansion of access to computers and, consequently, to Natural Language Processing (NLP). As the subject gained influence within linguistic research, the concept of corpus became more specific. Breadth of sampling and standard references, as well as machine readability and finiteness, became essential criteria for composing samples. At the same time, however, smaller and much narrower corpora emerged with distinct purposes, such as those documenting endangered languages. Against this background, this paper discusses the differences between “prototypical” corpora, built on the assumptions of Corpus Linguistics, and corpora of less-resourced languages, which have a small digital footprint. I demonstrate that the corpora of less-resourced languages tend to be more specialized and hardly ever fulfill the criteria required of a broad and representative corpus. In spite of the limitations entailed by issues specific to each language, I conclude that the constitution of corpora for less-resourced languages must be undertaken, even if they do not fulfill all the desirable criteria of Corpus Linguistics. The results must be exploited in diverse ways, whether through the creation of new technologies, as empirical support for linguistic theories, or in promoting the language in the community.
License
Authors who publish in the Revista Linguí∫tica agree with the following terms:
The authors retain their rights and grant the journal the right of first publication of the article, which is simultaneously licensed under a Creative Commons license that permits sharing the published content with third parties, provided that the author and the first publication in the Revista Linguí∫tica are acknowledged.
Authors may enter into additional agreements for the non-exclusive distribution of their published work (for example, posting in online institutional or non-profit repositories, or book chapters) so long as they acknowledge its initial publication in the Revista Linguí∫tica.
The journal Revista Linguí∫tica is published by the Postgraduate Program in Linguistics of UFRJ and employs a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).