
File processing
Input format: Input files must be in .txt FORMAT with UTF-8 ENCODING and contain PORTUGUESE TEXT. Input files and folders can also be compressed to the .zip format.
Privacy: The input file you upload and the respective output files will be automatically deleted from our computer after being processed and the result downloaded by you. No copies of your files will be retained after your use of this service.
Your input file is large and processing it may take some time. To receive by email the URL from which to download your processed file when it is ready, enter your email address below. After being used for this purpose, your email address will be deleted from our computer.
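The input format requirements above (.txt files, UTF-8 encoding, optionally compressed to .zip) can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; the function names `is_utf8` and `zip_txt_files` are hypothetical helpers introduced here for illustration and are not part of the service.

```python
import zipfile
from pathlib import Path


def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def zip_txt_files(folder, archive_path):
    """Bundle the UTF-8 .txt files found under `folder` into a .zip archive.

    Returns two lists of relative file names: those added to the archive
    and those skipped because they are not valid UTF-8.
    """
    folder = Path(folder)
    added, skipped = [], []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*.txt")):
            name = path.relative_to(folder).as_posix()
            if is_utf8(path):
                archive.write(str(path), name)
                added.append(name)
            else:
                skipped.append(name)
    return added, skipped
```

Calling `zip_txt_files("my_folder", "my_input.zip")` (both names are placeholders) would produce an archive containing only the files that pass the encoding check. The sketch cannot verify that the text is Portuguese.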
Instructions to use this web service
The web service for this application is available at https://portulanclarin.net/workbench/lx-tagger/api/.
Below is an example of how to use this web service with Python 3.
This example relies on the requests package. To install this package, run this command in the command line:
pip3 install requests
To use this web service, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days. It allows you to submit a total of 1 billion characters, in requests of no more than 4000 characters each. It allows up to 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data and the respective output will be automatically deleted from our computer after being processed. No copies will be retained after your use of this service.
```python
import json
import requests  # to install this library, enter in your command line:
                 #   pip3 install requests

# This is a simple example to illustrate how you can use the LX-Tagger web service

# Requires: key is a string with your access key
# Requires: text is a string, UTF-8, with a maximum 4000 characters, Portuguese text, with
#           the input to be processed
# Requires: format is a string, indicating the output format which can be one of these
#           values: 'CINTIL', 'CONLL', 'column' or 'JSON'
# Ensures: output according to specification in https://portulanclarin.net/workbench/lx-tagger/
# Ensures: dict with number of requests and characters input so far with the access key, and
#          its date of expiry

key = 'access_key_goes_here'  # before you run this example, replace access_key_goes_here by
                              # your access key
format = 'CONLL'  # other possible values are 'CINTIL', 'column' or 'JSON'

# this string can be replaced by your input
text = '''Esta frase serve para testar o funcionamento do LX-Tagger.
Esta outra frase faz o mesmo.'''

# To read input text from a file, uncomment this block
#inputFile = open("myInputFileName", "r", encoding="utf-8")  # replace myInputFileName by
#                                                            # the name of your file
#text = inputFile.read()
#inputFile.close()

# Processing:
url = "https://portulanclarin.net/workbench/lx-tagger/api/"
request_data = {
    'method': 'tag',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': text,
        'format': format,
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Result:")
    print(response_data["result"])
    # To write output in a file, uncomment this block
    #outputFile = open("myOutputFileName", "w", encoding="utf-8")  # replace myOutputFileName by
    #                                                              # the name of your file
    #output = response_data["result"]
    #outputFile.write(output)
    #outputFile.close()

# Getting access key status:
request_data = {
    'method': 'key_status',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Key status:")
    print(json.dumps(response_data["result"], indent=4))
```
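Texts longer than the 4000-character limit per request have to be split before submission, and the limit of 200 requests per hour implies an average pause of 18 seconds between requests. The following is a minimal sketch of one way to do this, not part of the service itself: the function name `split_into_chunks` and the constant names are hypothetical, and cutting at line breaks or spaces is an assumption made here so that words are not split. A cut may still fall in the middle of a sentence, which could affect tagging near the chunk boundary.

```python
MAX_CHARS_PER_REQUEST = 4000   # limit stated for each request
MAX_REQUESTS_PER_HOUR = 200    # rate limit stated for an access key

# Average pause, in seconds, that keeps a long run under the hourly rate.
MIN_SECONDS_BETWEEN_REQUESTS = 3600 / MAX_REQUESTS_PER_HOUR


def split_into_chunks(text, limit=MAX_CHARS_PER_REQUEST):
    """Split `text` into pieces of at most `limit` characters.

    Cuts are made after the last line break in each window when there is
    one, otherwise after the last space, otherwise at the limit itself.
    Joining the returned pieces gives back the original text.
    """
    chunks = []
    remaining = text
    while len(remaining) > limit:
        window = remaining[:limit]
        cut = window.rfind("\n") + 1      # 0 when no line break was found
        if cut == 0:
            cut = window.rfind(" ") + 1   # 0 when no space was found
        if cut == 0:
            cut = limit                   # no boundary available: hard cut
        chunks.append(remaining[:cut])
        remaining = remaining[cut:]
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each returned piece can then be sent as the `text` parameter of a `tag` request, pausing for `MIN_SECONDS_BETWEEN_REQUESTS` seconds (for instance with `time.sleep`) between consecutive requests.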
LX-Tagger documentation
LX-Tagger
LX-Tagger is a freely available online service for the part-of-speech tagging of Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
You may also be interested in using our LX-Suite online service, which adds sub-syntactic analysis.
Features and evaluation
The LX-Tagger service is composed of a set of shallow processing tools:
- LX Sentence Splitter:
  Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>. Unwraps sentences split over different lines.
  An f-score of 99.94% was obtained when testing on a 12,000 sentence corpus accurately hand tagged with respect to sentence and paragraph boundaries.
- LX-Tokenizer:
  - Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly:
    um exemplo → |um|exemplo|
  - Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
    do → |de_|o|
  - Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|
  - Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
    dá-se-lho → |dá|-se|-lhe_|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|
  - This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
    deste → |deste| when occurring as a Verb
    deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
  This tool achieves an f-score of 99.72%.
- LX-Tagger:
  - Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token by using a / (slash) symbol as separator:
    um exemplo → um/IA exemplo/CN
  - Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:
    de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
  This tagger was developed using the TnT software on a manually annotated 260k token corpus. An accuracy of 96.87% was obtained under 10-fold cross-evaluation.
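The tokenizer marks described in the list above can be read off a token mechanically. The following is a rough sketch, assuming that the backslash in the \* mark is a literal character and that the marks appear on tokens exactly as in the examples; the function name `read_marks` and the keys of the returned dictionary are hypothetical. It is a heuristic: a token that merely begins with a hyphen would be reported as a clitic, and the -CL- placeholder of mesoclisis is left untouched.

```python
def read_marks(token):
    """Strip LX-Tokenizer marks from `token` and report which ones were found."""
    info = {
        "space_left": False,        # token carried the \* mark
        "space_right": False,       # token carried the */ mark
        "clitic": False,            # token began with a hyphen
        "contraction_head": False,  # token ended with an underscore
        "altered_verb": False,      # token ended with a hash
    }
    form = token
    if form.startswith("\\*"):
        info["space_left"] = True
        form = form[2:]
    if form.endswith("*/"):
        info["space_right"] = True
        form = form[:-2]
    # The length checks keep a bare "-", "_" or "#" token intact.
    if len(form) > 1 and form.startswith("-"):
        info["clitic"] = True
        form = form[1:]
    if len(form) > 1 and form.endswith("_"):
        info["contraction_head"] = True
        form = form[:-1]
    if len(form) > 1 and form.endswith("#"):
        info["altered_verb"] = True
        form = form[:-1]
    info["form"] = form
    return info
```

For the token `-lhe_` from the dá-se-lho example, the function reports a clitic that is also the first element of an expanded contraction, with `lhe` as the bare form.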
These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
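The token/TAG notation produced at the end of the pipeline can be split back into pairs, and the numbered tags of a multi-token expression can be merged into a single unit. The following is a minimal sketch, assuming whitespace-separated token/TAG items as in the examples above; the function names `parse_tagged` and `group_multiword` are hypothetical. The split is made at the last slash so that tokens containing slashes, such as electronic addresses or the */ mark, keep their form.

```python
import re

# Tags of multi-token expressions: "L", the expression tag, then the position.
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def parse_tagged(line):
    """Turn 'um/IA exemplo/CN' into [('um', 'IA'), ('exemplo', 'CN')]."""
    pairs = []
    for item in line.split():
        token, slash, tag = item.rpartition("/")
        if not slash:
            # No slash at all: keep the item as an untagged token.
            token, tag = item, ""
        pairs.append((token, tag))
    return pairs


def group_multiword(pairs):
    """Merge consecutive tokens tagged LXXX1, LXXX2, ... into one entry tagged LXXX."""
    grouped = []
    for token, tag in pairs:
        match = MWE_TAG.match(tag)
        if not match:
            grouped.append((token, tag))
            continue
        label = "L" + match.group(1)
        position = int(match.group(2))
        if position > 1 and grouped and grouped[-1][1] == label:
            previous_tokens, _ = grouped[-1]
            grouped[-1] = (previous_tokens + " " + token, label)
        else:
            grouped.append((token, label))
    return grouped
```

Applied to the example above, `group_multiword(parse_tagged("de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4"))` yields a single entry, `('de maneira a que', 'LCJ')`.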
Authorship
LX-Tagger was developed by António Branco and João Silva at the NLX—Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Acknowledgments
The development of this state-of-the-art tagger for Portuguese was supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
This project was developed in cooperation with CLUL—Centro de Linguística da Universidade de Lisboa. The training and test corpora prepared for the development of this service evolved from a corpus provided by CLUL.
The part-of-speech tagger underlying this service was developed with Thorsten Brants' TnT software with his written permission.
References
Irrespective of the most recent version of this tool you may use, when mentioning it, please cite this reference:
- Branco, António and João Silva, 2004, "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". In Proceedings, LREC2004 - 4th International Conference on Language Resources and Evaluation, Lisbon, 26-28 May 2004, pp.507-510.
Contact us
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Tagset
Tag | Category | Examples |
---|---|---|
ADJ | Adjectives | bom, brilhante, eficaz, … |
ADV | Adverbs | hoje, já, sim, felizmente, … |
CARD | Cardinals | zero, dez, cem, mil, … |
CJ | Conjunctions | e, ou, tal como, … |
CL | Clitics | o, lhe, se, … |
CN | Common Nouns | computador, cidade, ideia, … |
DA | Definite Articles | o, os, … |
DEM | Demonstratives | este, esses, aquele, … |
DFR | Denominators of Fractions | meio, terço, décimo, %, … |
DGTR | Roman Numerals | VI, LX, MMIII, MCMXCIX, … |
DGT | Digits | 0, 1, 42, 12345, 67890, … |
DM | Discourse Marker | olá, … |
EADR | Electronic Addresses | http://www.di.fc.ul.pt, … |
EOE | End of Enumeration | etc |
EXC | Exclamative | ah, ei, etc. |
GER | Gerunds | sendo, afirmando, vivendo, … |
GERAUX | Gerund "ter"/"haver" in compound tenses | tendo, havendo, … |
IA | Indefinite Articles | uns, umas, … |
IND | Indefinites | tudo, alguém, ninguém, … |
INF | Infinitive | ser, afirmar, viver, … |
INFAUX | Infinitive "ter"/"haver" in compound tenses | ter, haver, … |
INT | Interrogatives | quem, como, quando, … |
ITJ | Interjection | bolas, caramba, … |
LTR | Letters | a, b, c, … |
MGT | Magnitude Classes | unidade, dezena, dúzia, resma, … |
MTH | Months | Janeiro, Dezembro, … |
NP | Noun Phrases | idem, … |
ORD | Ordinals | primeiro, centésimo, penúltimo, … |
PADR | Part of Address | Rua, av., rot., … |
PNM | Part of Name | Lisboa, António, João, … |
PNT | Punctuation Marks | ., ?, (, … |
POSS | Possessives | meu, teu, seu, … |
PPA | Past Participles not in compound tenses | afirmados, vivida, … |
PP | Prepositional Phrases | algures, … |
PPT | Past Participle in compound tenses | sido, afirmado, vivido, … |
PREP | Prepositions | de, para, em redor de, … |
PRS | Personals | eu, tu, ele, … |
QNT | Quantifiers | todos, muitos, nenhum, … |
REL | Relatives | que, cujo, tal que, … |
STT | Social Titles | Presidente, drª., prof., … |
SYB | Symbols | @, #, &, … |
TERMN | Optional Terminations | (s), (as), … |
UM | "um" or "uma" | um, uma |
UNIT | Abbreviated Measurement Units | kg., km., … |
VAUX | Finite "ter" or "haver" in compound tenses | temos, haveriam, … |
V | Verbs (other than PPA, PPT, INF or GER) | falou, falaria, … |
WD | Week Days | segunda, terça-feira, sábado, … |
Multi-Word Expressions | | |
LADV1…LADVn | Multi-Word Adverbs | de facto, em suma, um pouco, … |
LCJ1…LCJn | Multi-Word Conjunctions | assim como, já que, … |
LDEM1…LDEMn | Multi-Word Demonstratives | o mesmo, … |
LDFR1…LDFRn | Multi-Word Denominators of Fractions | por cento |
LDM1…LDMn | Multi-Word Discourse Markers | pois não, até logo, … |
LITJ1…LITJn | Multi-Word Interjections | meu Deus |
LPRS1…LPRSn | Multi-Word Personals | a gente, si mesmo, V. Exa., … |
LPREP1…LPREPn | Multi-Word Prepositions | através de, a partir de, … |
LQD1…LQDn | Multi-Word Quantifiers | uns quantos, … |
LREL1…LRELn | Multi-Word Relatives | tal como, … |
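When post-processing tagged output, the table above can be turned into a lookup from tag to category. The following is a small sketch that copies only a few rows of the table (the remaining rows can be added in the same way); the names `CATEGORIES`, `MULTI_WORD_CATEGORIES` and `describe_tag` are hypothetical. Multi-word tags are recognised by the "L" prefix plus a trailing position number, which keeps single tags that merely start with "L", such as LTR, from being misread. Note that the multi-word quantifier tag is LQD while the single-word tag is QNT, so the multi-word rows are kept in their own dictionary instead of being derived from the single-word ones.

```python
import re

# Excerpt of the single-token rows of the tagset table.
CATEGORIES = {
    "ADJ": "Adjectives",
    "CJ": "Conjunctions",
    "CN": "Common Nouns",
    "IA": "Indefinite Articles",
    "LTR": "Letters",
    "PREP": "Prepositions",
    "QNT": "Quantifiers",
}

# Excerpt of the multi-word rows, keyed by the tag without its position number.
MULTI_WORD_CATEGORIES = {
    "LADV": "Multi-Word Adverbs",
    "LCJ": "Multi-Word Conjunctions",
    "LPREP": "Multi-Word Prepositions",
    "LQD": "Multi-Word Quantifiers",
}

MWE_TAG = re.compile(r"^(L[A-Z]+)(\d+)$")


def describe_tag(tag):
    """Return a readable description of `tag`, or 'unknown tag'."""
    match = MWE_TAG.match(tag)
    if match:
        label, position = match.group(1), int(match.group(2))
        category = MULTI_WORD_CATEGORIES.get(label)
        if category is not None:
            return f"{category}, token {position}"
        return "unknown tag"
    return CATEGORIES.get(tag, "unknown tag")
```

For instance, `describe_tag("LCJ3")` gives "Multi-Word Conjunctions, token 3", while `describe_tag("LTR")` gives "Letters".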
Why LX-Tagger?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
License
No fee, attribution, all rights reserved, no redistribution, non commercial, no warranty, no liability, no endorsement, temporary, non exclusive, share alike.
The complete text of this license is here.