LX-Tokenizer is a freely available online service for tokenizing Portuguese text. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Features and evaluation
The LX-Tokenizer service is composed of two processing tools:
- LX Sentence Splitter:
Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
Unwraps sentences split over different lines.
An f-score of 99.94% was obtained when testing on a 12,000-sentence corpus accurately hand-tagged with respect to sentence and paragraph boundaries.
- LX Tokenizer:
Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.
um exemplo → |um|exemplo|
- Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol.
do → |de_|o|
- Marks spacing around punctuation or symbols. The \* and */ symbols indicate a space to the left and a space to the right, respectively:
um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
- Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol.
dá-se-lho → |dá|-se|-lhe_|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
- This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
deste → |deste| when occurring as a Verb
deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
This tool achieves an f-score of 99.72%.
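As a rough illustration of the conventions above, a toy whitespace tokenizer with contraction expansion might look like the following. This is a hypothetical minimal sketch, not the LX-Tokenizer's actual rules; the CONTRACTIONS table is an invented three-entry example of what a real lexicon would cover.

```python
# Toy illustration of whitespace tokenization plus contraction expansion.
# NOT the LX-Tokenizer implementation; the table below is a hypothetical
# minimal stand-in for a full contraction lexicon.

CONTRACTIONS = {
    "do": ["de_", "o"],    # Preposition + Article; first element marked with _
    "da": ["de_", "a"],
    "no": ["em_", "o"],
}

def toy_tokenize(text):
    """Split on whitespace, then expand known contractions."""
    tokens = []
    for word in text.split():
        tokens.extend(CONTRACTIONS.get(word, [word]))
    return tokens

# Render with | marking token boundaries, as in the examples above:
print("|" + "|".join(toy_tokenize("um exemplo")) + "|")   # |um|exemplo|
print("|" + "|".join(toy_tokenize("do exemplo")) + "|")   # |de_|o|exemplo|
```

A real tokenizer would additionally need context to resolve ambiguous strings such as "deste", which is why the actual tool cannot work from a lookup table alone.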
These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
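The pipeline scheme can be sketched as a simple function composition, where each stage consumes the previous stage's output. The splitter and tokenizer below are naive hypothetical stand-ins (splitting on ". " and on whitespace), not the actual LX tools.

```python
# Sketch of the pipeline scheme: each tool takes as input the output of
# the previous tool. Both tools here are toy stand-ins for illustration.

def sentence_splitter(text):
    """Mark sentence boundaries with <s>...</s> (naive: split on '. ')."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return " ".join("<s> " + s.rstrip(".") + " </s>" for s in sentences)

def tokenizer(marked_text):
    """Whitespace tokenization of the splitter's output."""
    return marked_text.split()

def pipeline(text, tools):
    """Feed each tool the output of the previous one."""
    result = text
    for tool in tools:
        result = tool(result)
    return result

out = pipeline("Um exemplo. Outro exemplo.", [sentence_splitter, tokenizer])
print(out)
# ['<s>', 'Um', 'exemplo', '</s>', '<s>', 'Outro', 'exemplo', '</s>']
```

The pipeline order matters: the tokenizer runs over text that already carries the splitter's sentence marks.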
The development of a state-of-the-art, complete suite of shallow processing tools for Portuguese was supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
This project was developed in cooperation with CLUL—Centro de Linguística da Universidade de Lisboa. The training and test corpora prepared for the development of this demo evolved from a corpus provided by CLUL.
To reference this work, please cite the following paper:
- Branco, António and João Silva, 2006. A Suite of Shallow Processing Tools for Portuguese: LX-Tokenizer. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL'06).
Contact us at: nlxgroup (at) di.fc.ul.pt
Why "LX"? Because LX is the "code" name Lisboners like to use to refer to their hometown.