
File processing
Input format: Input files must be plain text (.txt) encoded in UTF-8 and contain Portuguese text. Input files and folders can also be compressed into the .zip format.
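For batch processing, several .txt files can be packed into one archive before uploading. As a minimal sketch (the file names are hypothetical and this packing step is our own suggestion, not part of the service), such an archive can be built with Python's standard zipfile module:

import zipfile

# pack UTF-8 .txt input files into a single .zip archive before uploading;
# "texto1.txt" and "texto2.txt" are hypothetical file names
with zipfile.ZipFile("inputs.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name in ["texto1.txt", "texto2.txt"]:
        archive.write(name)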
Privacy: The input file you upload and the respective output files are automatically deleted from our servers once processing finishes and you have downloaded the result. No copies of your files are retained after your use of this service.
If your input file is large, its processing may take some time. To receive by email the URL from which to download your processed file when it is ready, enter your email address below. After being used for this purpose, your email address is deleted from our servers.
Instructions to use this web service
The web service for this application is available at https://portulanclarin.net/workbench/lx-tokenizer/api/.
Below is an example of how to use this web service with Python 3. The example relies on the requests package; to install it, run the following command in the command line:
pip3 install requests
To use this web service, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days. It allows you to submit a total of 1 billion characters through requests of no more than 200,000 characters each, and to make up to 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data and the respective output are automatically deleted from our servers after processing. No copies are retained after your use of this service.
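Given these limits, texts longer than 200,000 characters must be split across several requests. The sketch below is illustrative only and not part of the official example; it assumes paragraphs are separated by blank lines and that no single paragraph exceeds the limit:

MAX_CHARS = 200000  # maximum characters per request, per the access key terms

def chunk_text(text, limit=MAX_CHARS):
    # cut at blank lines so no paragraph is split across two requests;
    # assumes every individual paragraph fits within the limit;
    # remember the separate rate limit of 200 requests per hour
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = current + "\n\n" + paragraph if current else paragraph
        if len(candidate) > limit and current:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks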
import json
import requests # to install this library, enter in your command line: pip3 install requests

# This is a simple example to illustrate how you can use the LX-Tokenizer web service
# Requires: key is a string with your access key
# Requires: text is a string, UTF-8, with a maximum of 200000 characters, Portuguese text,
#   with the input to be processed
# Requires: format is a string indicating the output format, which can be either
#   'CINTIL' or 'JSON'
# Ensures: output according to the specification in https://portulanclarin.net/workbench/lx-tokenizer/
# Ensures: dict with the number of requests and characters input so far with the access key,
#   and its date of expiry

key = 'access_key_goes_here' # before you run this example, replace access_key_goes_here
                             # by your access key

# this string can be replaced by your input
text = '''Dentro deste parágrafo, há vários casos especiais para a separação de
palavras na ortografia do português. Tu também deste este exemplo: dar-se-lho-ia.
E prà frente é que é pelo caminho certo, etc.'''

# To read the input text from a file, uncomment this block
#inputFile = open("myInputFileName", "r", encoding="utf-8") # replace myInputFileName
#text = inputFile.read()                                    # by the name of your file
#inputFile.close()

format = 'CINTIL' # the other possible value is 'JSON'

# Processing:
url = "https://portulanclarin.net/workbench/lx-tokenizer/api/"
request_data = {
    'method': 'tokenize',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': text,
        'format': format,
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Result:")
    print(response_data["result"])

# To write the output to a file, uncomment this block
#outputFile = open("myOutputFileName", "w", encoding="utf-8") # replace myOutputFileName
#output = response_data["result"]                             # by the name of your file
#outputFile.write(output)
#outputFile.close()

# Getting the access key status:
request_data = {
    'method': 'key_status',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Key status:")
    print(json.dumps(response_data["result"], indent=4))
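For repeated use, the request logic in the example above can be wrapped in a small helper function. This is a convenience sketch rather than part of the official example; the function name and the 60-second timeout are our own choices:

import requests

API_URL = "https://portulanclarin.net/workbench/lx-tokenizer/api/"

def tokenize(text, key, fmt='CINTIL'):
    # same JSON-RPC call as in the example above, with a network timeout added
    request_data = {
        'method': 'tokenize',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {'text': text, 'format': fmt, 'key': key},
    }
    response = requests.post(API_URL, json=request_data, timeout=60)
    response_data = response.json()
    if "error" in response_data:
        raise RuntimeError(response_data["error"])
    return response_data["result"]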
LX-Tokenizer's documentation
LX-Tokenizer
LX-Tokenizer is a freely available online service for tokenizing Portuguese text. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
You may also be interested in using our LX Sentence Splitter, LX-Tagger, or LX-Suite online services for delimiting sentences, part-of-speech tagging, and sub-syntactic analysis of Portuguese, respectively.
Features and evaluation
The LX-Tokenizer service is composed of two processing tools:
- LX Sentence Splitter:
  - Marks sentence boundaries with <s>…</s> and paragraph boundaries with <p>…</p>.
  - Unwraps sentences split over different lines.
  An f-score of 99.94% was obtained when testing on a 12,000-sentence corpus accurately hand-tagged with respect to sentence and paragraph boundaries.
- LX-Tokenizer:
  - Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly:
    um exemplo → |um|exemplo|
  - Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
    do → |de_|o|
  - Marks spacing around punctuation or symbols. The \* and */ symbols indicate a space to the left and a space to the right, respectively:
    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|
  - Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
    dá-se-lho → |dá|-se|-lhe_|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|
  - This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
    deste → |deste| when occurring as a Verb
    deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)
  This tool achieves an f-score of 99.72%. (A short sketch after this list illustrates how the markup above can be read programmatically.)
These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
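To make these markup conventions concrete, here is a small illustrative sketch, not part of the service itself, that extracts sentences from the <s>…</s> markup and labels tokens according to the boundary symbols described above (the sample strings are taken from the examples in this list):

import re

def sentences(marked_text):
    # sentences are wrapped in <s>...</s>, paragraphs in <p>...</p>
    return re.findall(r"<s>(.*?)</s>", marked_text, flags=re.DOTALL)

def describe_tokens(tokenized):
    # tokens are separated by | (vertical bar) symbols
    for token in tokenized.strip("|").split("|"):
        notes = []
        if token.startswith("-"):
            notes.append("detached clitic pronoun")
        if token.endswith("_"):
            notes.append("first element of an expanded contraction")
        if "-CL-" in token:
            notes.append("mesoclisis mark inside the verb")
        if token.endswith("#"):
            notes.append("verb form with vocalic alteration")
        if token.startswith("\\*"):
            notes.append("space to the left")
        if token.endswith("*/"):
            notes.append("space to the right")
        print(token, "->", "; ".join(notes) or "plain token")

print(sentences("<p><s>Primeira frase.</s> <s>Segunda frase.</s></p>"))
describe_tokens("|dá|-se|-lhe_|-o|")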
Authorship
LX-Tokenizer was developed by António Branco and João Silva at the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Acknowledgments
The development of a state-of-the-art, complete suite of shallow processing tools for Portuguese was supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
Publications
Whichever version of this tool you use, please cite the following reference when mentioning it:
- Silva, João, António Branco, Sérgio Castro and Ruben Reis, 2010, "Out-of-the-Box Robust Parsing of Portuguese". In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR2010), Lecture Notes in Artificial Intelligence, 6001, Berlin, Springer, pp. 75–85.
Contact us
Contact us by email at nlxgroup (at) di.fc.ul.pt.
Why LX-Tokenizer?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
License
No fee, attribution, all rights reserved, no redistribution, non-commercial, no warranty, no liability, no endorsement, temporary, non-exclusive, share-alike.
The complete text of this license is here.