LX-Lemmatizer's documentation
LX-Lemmatizer
LX-Lemmatizer is a freely available online service for fully-fledged lemmatization of Portuguese verbs. It was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
You may also be interested in our LX-Suite online service for the shallow processing of Portuguese.
Features
LX-Lemmatizer takes a Portuguese verb form and delivers all the corresponding lemmata (infinitive forms) together with the inflectional feature values. Lemmata that are less likely, but still orthographically possible, are grouped together in a last section under the header "Other possible lemmata".
At its inception (November 2005), it was the first freely available online service for fully-fledged Portuguese verb lemmatization, including the full range of pronominal conjugation forms. It thus handles:
- Pronominal conjugation
The Portuguese verbal inflection system is one of the most complex parts of Portuguese morphology, and of the language as a whole, given the high number of conjugated forms for each verb (ca. 70 forms in non-pronominal conjugation), the number of productive inflection rules involved, and the number of irregular forms and exceptions to such rules.
This complexity is further increased when the so-called pronominal conjugation is taken into account. The Portuguese language has verbal clitics, which according to some authors are to be analyzed as part of the inflectional suffix system:
- the forms of the clitics may depend on the Number (Singular vs. Plural), the Person (First, Second, Third or Second courtesy), the Gender (Masculine vs. Feminine), the grammatical function they correspond to (Subject, Direct object or Indirect object), and the anaphoric properties (Pronominal vs. Reflexive);
- up to three clitics (e.g. deu-se-lho / gave-One-ToHim_It) may be associated with a verb form;
- clitics may occur in so-called enclisis, i.e. as a final part of the verb form (e.g. deu-o / gave-It), or in mesoclisis, i.e. as a medial part of the verb form (e.g. dá-lo-ia / give-it-CONDITIONAL). In some variants, when the verb form occurs in certain syntactic or semantic contexts (e.g. in the scope of negation), the clitics appear in proclisis, i.e. before the verb form (e.g. não o deu / NOT it gave);
- clitics follow specific rules for their concatenation.
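The clitic patterns listed above can be illustrated with a minimal sketch. It assumes that clitics are attached with hyphens, relies on a small illustrative clitic inventory (not the lexicon used by LX-Lemmatizer), and the name `split_clitics` is hypothetical. It only separates the verb material from the clitics; it does not restore the underlying verb form or apply the concatenation rules.

```python
# Illustrative, non-exhaustive clitic inventory (an assumption, not the tool's lexicon).
SIMPLE_CLITICS = {
    "me", "te", "se", "nos", "vos", "lhe", "lhes",
    "o", "a", "os", "as", "lo", "la", "los", "las", "no", "na", "nas",
}

# Contracted clitics expand to an indirect object plus a direct object clitic.
# Only one reading is kept here ("lho" may also stand for lhes + o).
CONTRACTIONS = {
    "mo": ("me", "o"), "ma": ("me", "a"), "mos": ("me", "os"), "mas": ("me", "as"),
    "to": ("te", "o"), "ta": ("te", "a"), "tos": ("te", "os"), "tas": ("te", "as"),
    "lho": ("lhe", "o"), "lha": ("lhe", "a"), "lhos": ("lhe", "os"), "lhas": ("lhe", "as"),
}


def split_clitics(form):
    """Split a hyphenated verb form into verb material and clitics."""
    head, *rest = form.lower().split("-")
    clitics = []
    ending = ""
    for index, part in enumerate(rest):
        if part in CONTRACTIONS:
            clitics.extend(CONTRACTIONS[part])
        elif part in SIMPLE_CLITICS:
            clitics.append(part)
        elif index == len(rest) - 1 and clitics:
            # A non-clitic final segment after a clitic is a verb ending: mesoclisis.
            ending = part
        else:
            raise ValueError(f"unrecognised segment {part!r} in {form!r}")
    if ending:
        placement = "mesoclisis"
    elif clitics:
        placement = "enclisis"
    else:
        placement = "none"
    return {"head": head, "ending": ending, "clitics": clitics, "placement": placement}


print(split_clitics("deu-se-lho"))
print(split_clitics("dá-lo-ia"))
```

For deu-se-lho the sketch reports three clitics (se, lhe, o) in enclisis, and for dá-lo-ia it reports the clitic lo in mesoclisis between dá and the ending ia.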
Additionally, LX-Lemmatizer exhaustively handles a set of inflection cases which tend not to be supported together in verbal lemmatizers:
- Compound tenses
- Double forms for past participles (regular and irregular)
- Past participle forms inflected for number and gender
- Negative imperative forms
- Courtesy forms for second person
LX-Lemmatizer handles both known verbs and unknown verbs. It thus lemmatizes:
- Neologisms (with orthographic suffix)
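As a rough illustration of how unknown verbs can be lemmatized from their orthographic suffix alone, the sketch below proposes every lemma that a handful of toy suffix rules allows. The rule table and the name `candidate_lemmata` are hypothetical and far smaller than any realistic rule set; this is not the algorithm used by LX-Lemmatizer.

```python
# Toy suffix rules: (form ending, lemma ending, feature description).
# Deliberately tiny and hypothetical; a realistic rule set is far larger.
SUFFIX_RULES = [
    ("ou", "ar", "preterite indicative, 3rd person singular"),
    ("ava", "ar", "imperfect indicative, 1st/3rd person singular"),
    ("ia", "er", "imperfect indicative, 1st/3rd person singular"),
    ("ia", "ir", "imperfect indicative, 1st/3rd person singular"),
    ("a", "ar", "present indicative, 3rd person singular"),
]


def candidate_lemmata(form):
    """Return every (lemma, features) pair licensed by the toy suffix rules."""
    form = form.lower()
    candidates = []
    for ending, lemma_ending, features in SUFFIX_RULES:
        stem = form[: -len(ending)]
        # Require a non-empty stem so that the ending alone is never analysed.
        if form.endswith(ending) and stem:
            candidates.append((stem + lemma_ending, features))
    return candidates


print(candidate_lemmata("blogou"))
print(candidate_lemmata("copia"))
```

With these rules, blogou yields the single candidate blogar, while copia yields coper, copir and copiar. Every candidate is orthographically possible, but only copiar is an actual verb, which is why a real lemmatizer has to rank or separate the less likely lemmata.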
It is also worth noting the design principles that LX-Lemmatizer adopts with respect to so-called defective verbs:
- Defectives
LX-Lemmatizer does not follow some unsubstantiated assumptions from traditional grammar, according to which many verb forms do not exist and/or should not be used because they sound awkward or because their use is semantically very restricted.
Accordingly, to give an example, all conjugated forms of weather verbs are lemmatized, as they can be used at least non-literally. To give another example, all verb forms of verbs like falir are also lemmatized.
- Special cases
LX-Lemmatizer does assume, though, that some forms are impossible (e.g. the imperative forms of verbs such as querer / to want: *quer tu) and that some clitics do not combine with certain verb forms (e.g. second person non-courtesy clitics and second person courtesy verb forms with the same number: *você ama-te / you_COURTESY love-yourself_NONCOURTESY).
Other special cases, also not lemmatized, include impersonal se and passive se, which do not occur with first or second person verb forms.
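The restriction on courtesy mismatches described above can be sketched as a simple feature check. The `Agreement` class and the `clitic_compatible` function are hypothetical names introduced for illustration, and modelling courtesy as a flag on the second person is an assumption; the sketch encodes only that one restriction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agreement:
    """Person, number and courtesy features (an assumed, simplified model)."""
    person: int             # 1, 2 or 3
    number: str             # "sg" or "pl"
    courtesy: bool = False  # True for second person courtesy


def clitic_compatible(verb, clitic):
    """Reject a non-courtesy second person clitic on a courtesy verb form of the same number."""
    clashing_second_person = (
        verb.person == 2
        and clitic.person == 2
        and verb.courtesy
        and not clitic.courtesy
        and verb.number == clitic.number
    )
    return not clashing_second_person


voce_ama = Agreement(person=2, number="sg", courtesy=True)  # você ama
tu_amas = Agreement(person=2, number="sg")                  # tu amas
te = Agreement(person=2, number="sg")                       # clitic te

print(clitic_compatible(voce_ama, te))  # *você ama-te is rejected
print(clitic_compatible(tu_amas, te))   # tu amas-te is accepted
```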
LX-Lemmatizer handles the very few cases where there may be different forms in different variants:
- Orthographic and paradigmatic differences
When a given verb, inflected with a given set of feature values, has different orthographic representations, all such representations can be lemmatized. To give an example, both representations for '(I) argued', argui (European) and argüi (Brazilian), are lemmatized.
- Other cases
Differences in irregular forms are also handled under the same approach. One such example is the past participle of 'to accept', with aceite (European) and aceito (Brazilian), which are both lemmatized.
Note that, in general, LX-Lemmatizer returns different lemmata for verb forms that share the same meaning and the same set of inflectional feature values when the spelling of the lemma can be predicted from the spelling of the form entered by the user. For instance, all lemmata of verb forms of 'to act' will start either with act- or with at-, depending on whether the user enters act* (European) or at* (Brazilian) as the representation of the form to be lemmatized.
Aiming at optimizing usability, LX-Lemmatizer adopts the following scheme concerning the position of clitics:
- Clitic placement
Variants of Portuguese may differ with respect to the relative order between the clitic forms and the verb forms. In some variants, e.g. Brazilian, clitics as a rule occur to the left of the verb form (so-called proclisis), while in others, e.g. European, the clitics appear to the left of the verb, to the right of it (enclisis), or in medial position (mesoclisis), depending on the context where the verb form occurs. In order to preserve the usability of the verbal lemmatizer, pronominal forms can be entered according to any variant.
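One way to picture this placement-independent input is a normalization step that maps proclitic and enclitic entries to the same pair of verb form and clitics. The sketch below is an assumption about how such a step could look, not a description of the service; `normalize_placement` and the clitic inventory are hypothetical, and mesoclisis is left out because it would require restoring the underlying verb form.

```python
# Illustrative, non-exhaustive clitic inventory (an assumption for this sketch).
CLITICS = {"me", "te", "se", "nos", "vos", "lhe", "lhes", "o", "a", "os", "as"}


def normalize_placement(entry):
    """Return (verb form, clitics) for a proclitic or enclitic entry."""
    tokens = entry.lower().split()
    proclitics = []
    # Proclisis: clitics are separate words before the verb form.
    while len(tokens) > 1 and tokens[0] in CLITICS:
        proclitics.append(tokens.pop(0))
    if len(tokens) != 1:
        raise ValueError(f"expected a single verb form in {entry!r}")
    # Enclisis: clitics are attached after the verb form with hyphens.
    verb, *enclitics = tokens[0].split("-")
    unknown = [part for part in enclitics if part not in CLITICS]
    if unknown:
        raise ValueError(f"unsupported segments {unknown!r} in {entry!r}")
    return verb, tuple(proclitics + enclitics)


print(normalize_placement("o deu"))
print(normalize_placement("deu-o"))
```

Both calls produce the same pair, so a later lemmatization step would not need to know which variant the user had in mind.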
Authorship
LX-Lemmatizer is being developed by António Branco and Filipe Nunes, with the help of Francisco Costa, of the NLX-Natural Language and Speech Group, at the University of Lisbon, Department of Informatics.
Acknowledgments
The work leading to the LX-Lemmatizer was partly supported by FCT-Fundação para a Ciência e Tecnologia under the grant POSI/PLP/47058/2002 for the project TagShare.
References
Irrespective of the most recent version of this tool you may use, when mentioning it, please cite this reference:
- Branco, António and Filipe Nunes, 2012, "Verb Analysis in a Highly Inflective Language with an MFF Algorithm", in Proceedings, 11th International Conference on the Computational Processing of Portuguese, Lecture Notes in Artificial Intelligence, 7243, Berlin, Springer, pp. 1-11.
Contact us
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Why LX-Lemmatizer?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
License
No fee, attribution, all rights reserved, no redistribution, non commercial, no warranty, no liability, no endorsement, temporary, non exclusive, share alike.
The complete text of this license is here.