File processing
Input format: Input files must be plain text (.txt) with UTF-8 encoding and contain Portuguese text. Input files and folders may also be compressed in .zip format.
Privacy: The input file you upload and the respective output files will be automatically deleted from our servers after processing and after you download the result. No copies of your files will be retained after your use of this service.
Email address validation
Your input file is large and may take some time to process.
To receive by email a URL from which to download your processed file, send an email to request@portulanclarin.net with the code displayed below in the "Subject" field and an empty message body.
Privacy: After we reply to you with the URL for download, your email address is automatically deleted from our records.
Designing your own experiment with a Jupyter Notebook
A Jupyter notebook (hereafter simply notebook) is a type of document that contains executable code interspersed with visualizations of code execution results and narrative text.
Below we provide an example notebook which you may use as a starting point for designing your own experiments using language resources offered by PORTULAN CLARIN.
Prerequisites
To execute this notebook, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days and allows you to submit a total of 1 billion characters in requests of no more than 4000 characters each, up to a maximum of 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data sent to any PORTULAN CLARIN web service and the respective output will be automatically deleted from our computers after being processed. However, when running a notebook on an external service, such as the ones suggested below, you should take their data privacy policies into consideration.
Running the notebook
You have three options to run the notebook presented below:
- Run on Binder — The Binder Project is funded by a 501(c)(3) non-profit organization and is described in detail in the following paper: Jupyter et al., "Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale." Proceedings of the 17th Python in Science Conference. 2018. doi:10.25080/Majora-4af1f417-011
- Run on Google Colab — Google Colaboratory is a free-to-use product from Google Research.
- Download the notebook from our public GitHub repository and run it on your computer. This is a more advanced option, which requires you to install Python 3 and Jupyter on your computer (see the sketch after this list). For anyone without prior experience setting up a Python development environment, we strongly recommend one of the two options above.
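For reference, running the notebook locally boils down to two commands. This is a minimal sketch, assuming Python 3 and pip are already installed; the notebook filename is a placeholder for the file you downloaded from the repository (the notebook also installs requests and matplotlib by itself if they are missing):
pip3 install notebook requests matplotlib
jupyter notebook lx-suite-example.ipynb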
This is only a preview of the notebook. To run it, please choose one of the options above.
Using LX-Suite to annotate a text from the BDCamões corpus
This is an example notebook that illustrates how you can use the LX-Suite web service to annotate a sample text from the BDCamões corpus (the full corpus is available from the PORTULAN CLARIN repository).
Before you run this example, replace access_key_goes_here with your web service access key, below:
LXSUITE_WS_API_KEY = 'access_key_goes_here'
LXSUITE_WS_API_URL = 'https://portulanclarin.net/workbench/lx-suite/api/'
Importing required Python modules
The next cell will take care of installing the requests and matplotlib packages, if not already installed, and make them available to use in this notebook.
try:
    import requests
except ImportError:
    # requests is not installed; install it and import it again
    !pip3 install requests
    import requests

try:
    import matplotlib.pyplot as plt
except ImportError:
    # matplotlib is not installed; install it and import it again
    !pip3 install matplotlib
    import matplotlib.pyplot as plt

import collections
Wrapping the complexities of the JSON-RPC API in a simple, easy-to-use function
The WSException class defined below will be used later to identify errors reported by the web service.
class WSException(Exception):
    'Webservice Exception'

    def __init__(self, errordata):
        "errordata is a dict returned by the webservice with details about the error"
        super().__init__(self)
        assert isinstance(errordata, dict)
        self.message = errordata["message"]
        # see https://json-rpc.readthedocs.io/en/latest/exceptions.html for more info
        # about JSON-RPC error codes
        if -32099 <= errordata["code"] <= -32000:  # Server Error
            if errordata["data"]["type"] == "WebServiceException":
                self.message += f": {errordata['data']['message']}"
            else:
                self.message += f": {errordata['data']!r}"

    def __str__(self):
        return self.message
The next function invokes the LX-Suite webservice through its public JSON-RPC API.
def annotate(text, format):
    '''
    Arguments
        text: a string with a maximum of 4000 characters, Portuguese text, with
            the input to be processed
        format: either 'CINTIL', 'CONLL' or 'JSON'

    Returns a string with the output according to the specification in
    https://portulanclarin.net/workbench/lx-suite/

    Raises a WSException if an error occurs.
    '''
    request_data = {
        'method': 'annotate',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'text': text,
            'format': format,
            'key': LXSUITE_WS_API_KEY,
        },
    }
    request = requests.post(LXSUITE_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
Let us test the function we just defined:
text = '''Esta frase serve para testar o funcionamento da suite. Esta outra
frase faz o mesmo.'''
# the CONLL annotation format is a popular format for annotating part of speech
result = annotate(text, format="CONLL")
print(result)
#id form lemma cpos pos feat head deprel phead pdeprel
1 Esta - DEM DEM fs - - - -
2 frase FRASE CN CN fs - - - -
3 serve SERVIR V V pi-3s - - - -
4 para - PREP PREP - - - - -
5 testar TESTAR V V INF-nInf - - - -
6 o - DA DA ms - - - -
7 funcionamento FUNCIONAMENTO CN CN ms - - - -
8 de_ - PREP PREP - - - - -
9 a - DA DA fs - - - -
10 suite SUITE CN CN fs - - - -
11 . - PNT PNT - - - - -

#id form lemma cpos pos feat head deprel phead pdeprel
1 Esta - DEM DEM fs - - - -
2 outra OUTRO ADJ ADJ fs - - - -
3 frase FRASE CN CN fs - - - -
4 faz FAZER V V pi-3s - - - -
5 o - LDEM1 LDEM1 - - - - -
6 mesmo - LDEM2 LDEM2 - - - - -
7 . - PNT PNT - - - - -
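Because the CONLL output is plain text with one token per line and whitespace-separated columns, it is easy to load back into Python. Below is a minimal sketch of such a parser; it assumes the column layout given in the header lines above, and the parse_conll helper is our own, not part of the web service API:
CONLL_COLUMNS = ["id", "form", "lemma", "cpos", "pos", "feat",
                 "head", "deprel", "phead", "pdeprel"]

def parse_conll(conll_text):
    '''Yields one sentence at a time, as a list of {column: value} dicts.'''
    sentence = []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank line or "#id ..." header
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLL_COLUMNS, line.split())))
    if sentence:
        yield sentence

for sentence in parse_conll(result):
    print([token["form"] for token in sentence])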
The JSON output format
The JSON format (which we obtain by passing format="JSON" into the annotate function) is more convenient when we need to further process the annotations, because each abstraction is mapped directly into a Python native object (lists, dicts, strings, etc.) as follows:
- The returned object is a list, where each element corresponds to a paragraph of the given text;
- In turn, each paragraph is a list where each element represents a sentence;
- Each sentence is a list where each element represents a token;
- Each token is a dict where each key-value pair is an attribute of the token.
annotated_text = annotate(text, format="JSON")
for pnum, paragraph in enumerate(annotated_text, start=1):  # enumerate paragraphs in text, starting at 1
    print(f"paragraph {pnum}:")
    for snum, sentence in enumerate(paragraph, start=1):  # enumerate sentences in paragraph, starting at 1
        print(f"  sentence {snum}:")
        for tnum, token in enumerate(sentence, start=1):  # enumerate tokens in sentence, starting at 1
            print(f"    token {tnum}: {token!r}")  # print a token representation
paragraph 1:
  sentence 1:
    token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs'}
    token 2: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs'}
    token 3: {'form': 'serve', 'space': 'LR', 'pos': 'V', 'lemma': 'SERVIR', 'infl': 'pi-3s'}
    token 4: {'form': 'para', 'space': 'LR', 'pos': 'PREP'}
    token 5: {'form': 'testar', 'space': 'LR', 'pos': 'V', 'lemma': 'TESTAR', 'infl': 'INF-nInf'}
    token 6: {'form': 'o', 'space': 'LR', 'pos': 'DA', 'infl': 'ms'}
    token 7: {'form': 'funcionamento', 'space': 'LR', 'pos': 'CN', 'lemma': 'FUNCIONAMENTO', 'infl': 'ms'}
    token 8: {'form': 'de_', 'space': 'L', 'raw': 'da', 'pos': 'PREP'}
    token 9: {'form': 'a', 'space': 'R', 'pos': 'DA', 'infl': 'fs'}
    token 10: {'form': 'suite', 'space': 'L', 'pos': 'CN', 'lemma': 'SUITE', 'infl': 'fs'}
    token 11: {'form': '.', 'space': 'R', 'pos': 'PNT'}
  sentence 2:
    token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs'}
    token 2: {'form': 'outra', 'space': 'LR', 'pos': 'ADJ', 'lemma': 'OUTRO', 'infl': 'fs'}
    token 3: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs'}
    token 4: {'form': 'faz', 'space': 'LR', 'pos': 'V', 'lemma': 'FAZER', 'infl': 'pi-3s'}
    token 5: {'form': 'o', 'space': 'LR', 'pos': 'LDEM1'}
    token 6: {'form': 'mesmo', 'space': 'L', 'pos': 'LDEM2'}
    token 7: {'form': '.', 'space': 'R', 'pos': 'PNT'}
Downloading and preparing our working text
In the next code cell, we will download a copy of the book "Viagens na minha terra" and prepare it to be used as our working text.
# A plain text version of this book is available from our GitHub repository:
sample_text_url = "https://github.com/portulanclarin/jupyter-notebooks/raw/main/sample-data/viagensnaminhaterra.txt"
req = requests.get(sample_text_url)
sample_text_lines = req.text.splitlines()
num_lines = len(sample_text_lines)
print(f"The downloaded text contains {num_lines} lines")
# discard whitespace at beginning and end of each line:
sample_text_lines = [line.strip() for line in sample_text_lines]
# discard empty lines
sample_text_lines = [line for line in sample_text_lines if line]
# how many lines do we have left?
num_lines = len(sample_text_lines)
print(f"After discarding empty lines we are left with {num_lines} non-empty lines")
The downloaded text contains 2509 lines
After discarding empty lines we are left with 2205 non-empty lines
Annotating with the LX-Suite web service
There is a limit on the number of web service requests per hour that can be made with any given key. Thus, we should send as much text as possible in each request, while also conforming to the limit of 4000 characters per request.
To this end, the following function slices our text into chunks of at most 4000 characters:
def slice_into_chunks(lines, max_chunk_size=4000):
    chunk, chunk_size = [], 0
    for lnum, line in enumerate(lines, start=1):
        if (chunk_size + len(line)) <= max_chunk_size:
            chunk.append(line)
            chunk_size += len(line) + 1
            # the + 1 above is for the newline character terminating each line
        else:
            yield "\n".join(chunk)
            if len(line) > max_chunk_size:
                print(f"line {lnum} is longer than {max_chunk_size} characters; truncating")
                line = line[:max_chunk_size]
            chunk, chunk_size = [line], len(line) + 1
    if chunk:
        yield "\n".join(chunk)
Next, we will apply slice_into_chunks to the sample text to get the chunks to be annotated.
chunks = list(slice_into_chunks(sample_text_lines))
annotated_text = [] # annotated paragraphs will be stored here
chunks_processed = 0 # this variable keeps track of which chunks have been processed already
print(f"There are {len(chunks)} chunks to be annotated")
There are 105 chunks to be annotated
Next, we will invoke annotate on each chunk.
If we get an exception while annotating a chunk:
- check the exception message to determine the cause;
- if the maximum number of requests per hour has been exceeded, wait some time before retrying;
- if a temporary error occurred in the web service, try again later.
In any case, as long as the notebook is not shut down or restarted, the text that has been annotated so far is not lost, and re-running the following cell will resume from the point where the exception occurred.
for cnum, chunk in enumerate(chunks[chunks_processed:], start=chunks_processed+1):
    try:
        annotated_text.extend(annotate(chunk, format="JSON"))
        chunks_processed = cnum
        # print one dot for each annotated chunk to get some progress feedback
        print(".", end="", flush=True)
    except Exception as exc:
        chunk_preview = chunk[:100] + "[...]" if len(chunk) > 100 else chunk
        print(
            f"\nError: annotation of chunk {cnum} failed ({exc}); chunk contents:\n\n{chunk_preview}\n\n"
        )
        break
.........................................................................................................
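The manual retry procedure described above can also be automated. The following is a minimal sketch, assuming a fixed delay between attempts; the three attempts and the 15-minute pause are arbitrary choices of ours, sized to the limit of 200 requests per hour:
import time

MAX_ATTEMPTS = 3  # arbitrary; how many times to try each chunk before giving up
RETRY_DELAY = 15 * 60  # seconds; arbitrary pause sized to the hourly request limit

for cnum, chunk in enumerate(chunks[chunks_processed:], start=chunks_processed+1):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            annotated_text.extend(annotate(chunk, format="JSON"))
            chunks_processed = cnum
            print(".", end="", flush=True)
            break  # this chunk is done; move on to the next one
        except WSException as exc:
            print(f"\nchunk {cnum}, attempt {attempt} failed: {exc}")
            if attempt == MAX_ATTEMPTS:
                raise  # give up; annotated_text still holds the progress so far
            time.sleep(RETRY_DELAY)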
Let's create a pie chart with the most common part-of-speech tags
%matplotlib inline

tag_frequencies = collections.Counter(
    token["pos"]
    for paragraph in annotated_text
    for sentence in paragraph
    for token in sentence
).most_common()

tags = [tag for tag, _ in tag_frequencies[:9]]
freqs = [freq for _, freq in tag_frequencies[:9]]
tags.append("other")
freqs.append(sum(freq for _, freq in tag_frequencies[9:]))  # sum the remaining tags into "other"
plt.rcParams['figure.figsize'] = [10, 10]
fig1, ax1 = plt.subplots()
ax1.pie(freqs, labels=tags, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # equal aspect ratio ensures that pie is drawn as a circle
plt.show()
# To learn more about matplotlib visit https://matplotlib.org/
Getting the status of a web service access key
def get_key_status():
    '''Returns a dict with the detailed status of the web service access key'''
    request_data = {
        'method': 'key_status',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'key': LXSUITE_WS_API_KEY,
        },
    }
    request = requests.post(LXSUITE_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
get_key_status()
{'requests_remaining': 99999140, 'chars_remaining': 998236690, 'expiry': '2030-01-10T00:00+00:00'}
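Since key_status reports the remaining quota, it can be checked before submitting a batch of chunks. A minimal sketch, assuming the fields shown in the output above (the enough_quota helper is our own):
def enough_quota(chunks):
    '''Checks whether the access key's remaining quota covers all given chunks.'''
    status = get_key_status()
    chars_needed = sum(len(chunk) for chunk in chunks)
    return (status["requests_remaining"] >= len(chunks)
            and status["chars_remaining"] >= chars_needed)

if not enough_quota(chunks):
    print("Not enough quota left on this key; request a new key or reduce the input.")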
Instructions to use this web service
The web service for this application is available at https://portulanclarin.net/workbench/lx-suite/api/.
Below is an example of how to use this web service with Python 3.
This example uses the requests package. To install it, run this command in the command line:
pip3 install requests
To use this web service, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days and allows you to submit a total of 1 billion characters in requests of no more than 4000 characters each, up to a maximum of 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data and the respective output will be automatically deleted from our computer after being processed. No copies will be retained after your use of this service.
import json
import requests # to install this library, enter in your command line:
# pip3 install requests
# This is a simple example to illustrate how you can use the LX-Suite web service
# Requires: key is a string with your access key
# Requires: text is a string, UTF-8, with a maximum of 4000 characters, Portuguese text,
#           with the input to be processed
# Requires: format is a string, indicating the output format, which can be either
#           'CINTIL', 'CONLL' or 'JSON'
# Ensures: output according to the specification in https://portulanclarin.net/workbench/lx-suite/
# Ensures: dict with the number of requests and characters input so far with the access key,
#          and its date of expiry
key = 'access_key_goes_here'  # before you run this example, replace access_key_goes_here
                              # by your access key
format = 'CONLL'  # other possible values are 'CINTIL' and 'JSON'
# this string can be replaced by your input
text = '''Esta frase serve para testar o funcionamento da suite. Esta outra
frase faz o mesmo.'''
# To read input text from a file, uncomment this block
#inputFile = open("myInputFileName", "r", encoding="utf-8") # replace myInputFileName by
# the name of your file
#text = inputFile.read()
#inputFile.close()
# Processing:
url = "https://portulanclarin.net/workbench/lx-suite/api/"
request_data = {
    'method': 'annotate',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': text,
        'format': format,
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Result:")
    print(response_data["result"])
# To write output in a file, uncomment this block
#outputFile = open("myOutputFileName","w", encoding="utf-8") # replace myOutputFileName by
# the name of your file
#output = response_data["result"]
#outputFile.write(output)
#outputFile.close()
# Getting access key status:
request_data = {
    'method': 'key_status',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Key status:")
    print(json.dumps(response_data["result"], indent=4))
Access key for the web service
Email address validation
To receive by email your access key for this web service, send an email to request@portulanclarin.net with the code displayed below in the "Subject" field and an empty message body.
Privacy: When your access key expires, your email address is automatically deleted from our records.
LX-Suite documentation
LX-Suite
LX-Suite is a freely available online service for the shallow processing of Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
You may also be interested in our LX-Sentence Splitter, LX-Tokenizer or LX-Tagger online services for sentence splitting, tokenization or part-of-speech tagging of Portuguese, in our LX-Conjugator and LX-Lemmatizer online services for the conjugation and lemmatization of verbs, and in the LX-Inflector online service for the inflection of nominal classes.
Features and evaluation
LX-Suite is composed of a set of shallow processing tools:
- LX-Sentence Splitter:
  - Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
  - Unwraps sentences split over different lines.
  An f-score of 99.94% was obtained when testing on a 12,000 sentence corpus accurately hand-tagged with respect to sentence and paragraph boundaries.
- LX-Tokenizer:
  - Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.
    um exemplo → |um|exemplo|
  - Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
    do → |de_|o|
  - Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|
  - Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
    dá-se-lho → |dá|-se|-lhe_|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|
  - This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
    deste → |deste| when occurring as a Verb
    deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
  This tool achieves an f-score of 99.72%.
- LX-Tagger:
  - Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token, using a / (slash) symbol as separator:
    um exemplo → um/IA exemplo/CN
  - Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:
    de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
  This tagger was developed using the TnT software on a manually annotated 260k token corpus. An accuracy of 96.87% was obtained under 10-fold cross-validation.
- LX-Featurizer (nominal):
  - Assigns inflection feature values to words from the nominal categories (Gender (masculine or feminine), Number (singular or plural) and, when applicable, Person (1st, 2nd and 3rd)):
    os/DA gatos/CN → os/DA#mp gatos/CN#mp
  - Assigns degree feature values (diminutive, superlative and comparative) to words from the nominal categories:
    os/DA gatinhos/CN → os/DA#mp gatinhos/CN#mp-dim
  - Sometimes, due to the so-called invariant words, the featurizer is not able to determine a feature value. In those cases, it assigns a g value for an underspecified Gender and an n value for an underspecified Number. Note, however, that if provided with an adequate context, the featurizer might resolve such cases:
    Vi/V pianistas/CN → Vi/V pianistas/CN#gp
    Vi/V as/DA pianistas/CN → Vi/V as/DA#fp pianistas/CN#fp
  This tool has an f-score of 91.07%. For an online service supported by this tool (without performing disambiguation) see LX-Inflector.
- LX-Lemmatizer (nominal):
  - Assigns a lemma to words from the nominal categories (Adjectives, Common Nouns and Past Participles). This lemma corresponds to the form that one would find in a dictionary, typically the masculine singular form. The lemma is inserted into the token, with / (slash) as a delimiter:
    gatas/CN#fp → gatas/GATO/CN#fp
    normalíssimo/ADJ#ms-sup → normalíssimo/NORMAL/ADJ#ms-sup
  This tool has an f-score of 97.67%. For an online service supported by this tool (without performing disambiguation) see LX-Inflector.
- LX-Lemmatizer and Featurizer (verbal):
  - Assigns a lemma and inflection feature values to verbs. The lemma corresponds to the infinitive form of the verb. The lemma is inserted into the token, with / (slash) as a delimiter:
    escrevi/V → escrevi/ESCREVER/V#ppi-1s
  The tool disambiguates among the various lemma-inflection pairs that can be assigned to a verb form, achieving 95.96% accuracy.
  For an online service supported by this tool (without performing disambiguation) see LX-Lemmatizer.
These tools work in a pipeline, where each tool takes as input the output of the previous one.
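To see the markup described above in practice, you can request the CINTIL output format from the LX-Suite web service. The following is a minimal sketch using the same JSON-RPC API as the examples earlier on this page; replace access_key_goes_here with your access key:
import requests

url = "https://portulanclarin.net/workbench/lx-suite/api/"
request_data = {
    'method': 'annotate',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': 'Esta frase serve para testar o funcionamento da suite.',
        'format': 'CINTIL',  # the markup produced by the pipeline described above
        'key': 'access_key_goes_here',  # replace with your access key
    },
}
response_data = requests.post(url, json=request_data).json()
# print either the annotated text or the error reported by the web service
print(response_data.get("result", response_data.get("error")))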
Authorship
LX-Suite is being developed by António Branco and João Silva, with the key contribution of Filipe Nunes (verbal lemmatizer), and the help of Francisco Costa, Catarina Ribeiro and Ricardo Santos at the NLX—Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Acknowledgments
The development of a state-of-the-art, complete suite of shallow processing tools for Portuguese was supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
This project was developed in cooperation with CLUL—Centro de Linguística da Universidade de Lisboa. The training and test corpora prepared for the development of this demo evolved from a corpus provided by CLUL.
This demo includes a part-of-speech tagger developed with Thorsten Brants' TnT software with his written permission.
Publications
Irrespective of the version of this tool you use, when mentioning it, please cite this reference:
- Branco, António and João Silva, 2004. "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Paris, ELRA, pp. 507-510.
Other publications:
- Branco, António and João Silva, 2006. "Dedicated Nominal Featurization of Portuguese". In Proceedings of the VII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR'06)
- Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. "Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project". In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).
- Branco, António, Filipe Nunes and João Silva, 2006. Verb Analysis in an Inflective Language: Simpler is better. Internal report, University of Lisbon, Department of Informatics, NLX-Natural Language and Speech Group.
- Branco, António and João Silva, 2005. "Accurate Annotation: an Efficiency Metric". In Nicolas Nicolov, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.), Recent Advances in Natural Language Processing III, Amsterdam, John Benjamins, pp. 173-182.
- Branco, António and João Silva, 2004. "Swift Development of State of the Art Taggers for Portuguese". In António Branco, Amália Mendes and Ricardo Ribeiro (orgs.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 29-46.
- Branco, António and João Silva, 2004. "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Paris, ELRA, pp. 507-510.
- Branco, António, Amália Mendes and Ricardo Ribeiro (eds.), 2003. Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003. Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, Technical Report TR-2003-28.
- Branco, António and João Silva, 2003. "Portuguese-specific Issues in the Rapid Development of State of the Art Taggers". In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003, Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, TR-2003-28, pp. 7-10.
- Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2004. "Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources". In António Branco, Amália Mendes and Ricardo Ribeiro (orgs.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 47-62.
- Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2003. "Reusing Available Resources for Tagging a Spoken Portuguese Corpus". In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003, 2003, pp. 25-28.
- TagShare, 2004, Manual de Etiquetação e Convenções, Internal Report, University of Lisbon, Department of Informatics, NLX-Natural Language and Speech Group.
Contact us
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Tagset
Part-of-speech tags
Tag | Category | Examples |
---|---|---|
ADJ | Adjectives | bom, brilhante, eficaz, … |
ADV | Adverbs | hoje, já, sim, felizmente, … |
CARD | Cardinals | zero, dez, cem, mil, … |
CJ | Conjunctions | e, ou, tal como, … |
CL | Clitics | o, lhe, se, … |
CN | Common nouns | computador, cidade, ideia, … |
DA | Definite articles | o, os, … |
DEM | Demonstratives | este, esses, aquele, … |
DFR | Denominators of fractions | meio, terço, décimo, %, … |
DGTR | Roman numerals | VI, LX, MMIII, MCMXCIX, … |
DGT | Digits | 0, 1, 42, 12345, 67890, … |
DM | Discourse marker | olá, … |
EADR | Electronic addresses | http://www.di.fc.ul.pt, … |
EOE | End of enumeration | etc |
EXC | Exclamative | ah, ei, etc. |
GER | Gerunds | sendo, afirmando, vivendo, … |
GERAUX | Gerund "ter"/"haver" in compound tenses | tendo, havendo … |
IA | Indefinite articles | uns, umas, … |
IND | Indefinites | tudo, alguém, ninguém, … |
INF | Infinitive | ser, afirmar, viver, … |
INFAUX | Infinitive "ter"/"haver" in compound tenses | ter, haver … |
INT | Interrogatives | quem, como, quando, … |
ITJ | Interjection | bolas, caramba, … |
LTR | Letters | a, b, c, … |
MGT | Magnitude classes | unidade, dezena, dúzia, resma, … |
MTH | Months | janeiro, dezembro, … |
NP | Noun phrases | idem, … |
ORD | Ordinals | primeiro, centésimo, penúltimo, … |
PADR | Part of address | Rua, av., rot., … |
PNM | Part of name | Lisboa, António, João, … |
PNT | Punctuation marks | ., ?, (, … |
POSS | Possessives | meu, teu, seu, … |
PPA | Past participles not in compound tenses | afirmados, vivida, … |
PP | Prepositional phrases | algures, … |
PPT | Past participle in compound tenses | sido, afirmado, vivido, … |
PREP | Prepositions | de, para, em redor de, … |
PRS | Personals | eu, tu, ele, … |
QNT | Quantifiers | todos, muitos, nenhum, … |
REL | Relatives | que, cujo, tal que, … |
STT | Social titles | Presidente, drª., prof., … |
SYB | Symbols | @, #, &, … |
TERMN | Optional terminations | (s), (as), … |
UM | "um" or "uma" | um, uma |
UNIT | Abbreviated measurement units | kg., km., … |
VAUX | Finite "ter" or "haver" in compound tenses | temos, haveriam, … |
V | Verbs (other than PPA, PPT, INF or GER) | falou, falaria, … |
WD | Week days | segunda, terça-feira, sábado, … |
Multi-word expressions | ||
LADV1…LADVn | Multi-word adverbs | de facto, em suma, um pouco, … |
LCJ1…LCJn | Multi-word conjunctions | assim como, já que, … |
LDEM1…LDEMn | Multi-word demonstratives | o mesmo, … |
LDFR1…LDFRn | Multi-word denominators of fractions | por cento |
LDM1…LDMn | Multi-word discourse markers | pois não, até logo, … |
LITJ1…LITJn | Multi-word interjections | meu Deus |
LPRS1…LPRSn | Multi-word personals | a gente, si mesmo, V. Exa., … |
LPREP1…LPREPn | Multi-word prepositions | através de, a partir de, … |
LQD1…LQDn | Multi-word quantifiers | uns quantos, … |
LREL1…LRELn | Multi-word relatives | tal como, … |
Other tags
Tag | Description |
---|---|
m | Masculine |
f | Feminine |
s | Singular |
p | Plural |
dim | Diminutive |
sup | Superlative |
comp | Comparative |
1 | First person |
2 | Second person |
3 | Third person |
pi | Presente do indicativo |
ppi | Pretérito perfeito do indicativo |
ii | Pretérito imperfeito do indicativo |
mpi | Pretérito mais que perfeito do indicativo |
fi | Futuro do indicativo |
c | Condicional |
pc | Presente do conjuntivo |
ic | Pretérito imperfeito do conjuntivo |
fc | Futuro do conjuntivo |
imp | Imperativo |
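As an illustration of how this tagset appears in the web service output, the sketch below maps part-of-speech tags from the JSON output of the annotate function defined earlier on this page to the category names in the table above; the CATEGORY_NAMES mapping is our own and covers only a small excerpt of the tagset, to be extended as needed:
CATEGORY_NAMES = {
    # a small excerpt from the part-of-speech tagset above
    "ADJ": "Adjectives", "ADV": "Adverbs", "CN": "Common nouns",
    "DA": "Definite articles", "DEM": "Demonstratives",
    "PREP": "Prepositions", "PNT": "Punctuation marks", "V": "Verbs",
}

annotated = annotate("Esta frase serve para testar o funcionamento da suite.",
                     format="JSON")
for paragraph in annotated:
    for sentence in paragraph:
        for token in sentence:
            # fall back to the raw tag when it is not in the excerpt above
            print(f"{token['form']:15} {CATEGORY_NAMES.get(token['pos'], token['pos'])}")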
Why LX-Suite?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
License
No fee, attribution, all rights reserved, no redistribution, non commercial, no warranty, no liability, no endorsement, temporary, non exclusive, share alike.
The complete text of this license is here.