File processing
Input format: Input files must be plain text (.txt) with UTF-8 encoding and contain Portuguese text. Input files and folders may also be compressed in .zip format.
Privacy: The input file you upload and the respective output files will be automatically deleted from our servers after processing and after you download the result. No copies of your files will be retained after your use of this service.
Email address validation
Your input file is large and may take some time to process.
To receive by email a URL from which to download your processed file, send an email to request@portulanclarin.net with the code displayed below in the "Subject" field and an empty message body.
Privacy: After we reply to you with the URL for download, your email address is automatically deleted from our records.
Designing your own experiment with a Jupyter Notebook
A Jupyter notebook (hereafter simply notebook) is a type of document that contains executable code interspersed with visualizations of code execution results and narrative text.
Below we provide an example notebook which you may use as a starting point for designing your own experiments using language resources offered by PORTULAN CLARIN.
Prerequisites
To execute this notebook, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days and allows you to submit a total of 1 billion characters in requests of no more than 4000 characters each, up to a maximum of 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data sent to any PORTULAN CLARIN web service and the respective output will be automatically deleted from our computers after being processed. However, when running a notebook on an external service, such as the ones suggested below, you should take their data privacy policies into consideration.
Running the notebook
You have three options to run the notebook presented below:
- Run on Binder — The Binder Project is funded by a 501(c)(3) non-profit organization and is described in detail in the following paper: Jupyter et al., "Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale." Proceedings of the 17th Python in Science Conference. 2018. doi:10.25080/Majora-4af1f417-011
- Run on Google Colab — Google Colaboratory is a free-to-use product from Google Research.
- Download the notebook from our public GitHub repository and run it on your computer. This is a more advanced option, which requires you to install Python 3 and Jupyter on your computer (see the sketch after this list). For anyone without prior experience setting up a Python development environment, we strongly recommend one of the two options above.
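For reference, running the notebook locally boils down to two commands. This is a minimal sketch, assuming Python 3 and pip are already installed; the notebook filename is a placeholder for the file you downloaded from the repository (the notebook also installs requests and matplotlib by itself if they are missing):
pip3 install notebook requests matplotlib
jupyter notebook lx-suite-example.ipynb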
This is only a preview of the notebook. To run it, please choose one of the options above.
Using LX-Suite to annotate a text from the BDCamões corpus
This is an example notebook that illustrates how you can use the LX-Suite web service to annotate a sample text from the BDCamões corpus (the full corpus is available from the PORTULAN CLARIN repository).
Before you run this example, replace access_key_goes_here with your web service access key, below:
LXSUITE_WS_API_KEY = 'access_key_goes_here'
LXSUITE_WS_API_URL = 'https://portulanclarin.net/workbench/lx-suite/api/'
Importing required Python modules
The next cell will take care of installing the requests and matplotlib packages, if not already installed, and make them available to use in this notebook.
try:
    import requests
except ImportError:
    # requests is not installed; install it and import it again
    !pip3 install requests
    import requests

try:
    import matplotlib.pyplot as plt
except ImportError:
    # matplotlib is not installed; install it and import it again
    !pip3 install matplotlib
    import matplotlib.pyplot as plt

import collections
Wrapping the complexities of the JSON-RPC API in a simple, easy-to-use function
The WSException class defined below will be used later to identify errors reported by the web service.
class WSException(Exception):
    'Webservice Exception'

    def __init__(self, errordata):
        "errordata is a dict returned by the webservice with details about the error"
        super().__init__(self)
        assert isinstance(errordata, dict)
        self.message = errordata["message"]
        # see https://json-rpc.readthedocs.io/en/latest/exceptions.html for more info
        # about JSON-RPC error codes
        if -32099 <= errordata["code"] <= -32000:  # Server Error
            if errordata["data"]["type"] == "WebServiceException":
                self.message += f": {errordata['data']['message']}"
            else:
                self.message += f": {errordata['data']!r}"

    def __str__(self):
        return self.message
The next function invokes the LX-Suite webservice through its public JSON-RPC API.
def annotate(text, format):
    '''
    Arguments
        text: a string with a maximum of 4000 characters, Portuguese text, with
            the input to be processed
        format: either 'CINTIL', 'CONLL' or 'JSON'

    Returns a string with the output according to the specification in
    https://portulanclarin.net/workbench/lx-suite/

    Raises a WSException if an error occurs.
    '''
    request_data = {
        'method': 'annotate',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'text': text,
            'format': format,
            'key': LXSUITE_WS_API_KEY,
        },
    }
    request = requests.post(LXSUITE_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
Let us test the function we just defined:
text = '''Esta frase serve para testar o funcionamento da suite. Esta outra
frase faz o mesmo.'''
# the CONLL annotation format is a popular format for annotating part of speech
result = annotate(text, format="CONLL")
print(result)
#id form lemma cpos pos feat head deprel phead pdeprel
1 Esta - DEM DEM fs - - - -
2 frase FRASE CN CN fs - - - -
3 serve SERVIR V V pi-3s - - - -
4 para - PREP PREP - - - - -
5 testar TESTAR V V INF-nInf - - - -
6 o - DA DA ms - - - -
7 funcionamento FUNCIONAMENTO CN CN ms - - - -
8 de_ - PREP PREP - - - - -
9 a - DA DA fs - - - -
10 suite SUITE CN CN fs - - - -
11 . - PNT PNT - - - - -

#id form lemma cpos pos feat head deprel phead pdeprel
1 Esta - DEM DEM fs - - - -
2 outra OUTRO ADJ ADJ fs - - - -
3 frase FRASE CN CN fs - - - -
4 faz FAZER V V pi-3s - - - -
5 o - LDEM1 LDEM1 - - - - -
6 mesmo - LDEM2 LDEM2 - - - - -
7 . - PNT PNT - - - - -
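Because the CONLL output is plain text with one token per line and whitespace-separated columns, it is easy to load back into Python. Below is a minimal sketch of such a parser; it assumes the column layout given in the header lines above, and the parse_conll helper is our own, not part of the web service API:
CONLL_COLUMNS = ["id", "form", "lemma", "cpos", "pos", "feat",
                 "head", "deprel", "phead", "pdeprel"]

def parse_conll(conll_text):
    '''Yields one sentence at a time, as a list of {column: value} dicts.'''
    sentence = []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank line or "#id ..." header
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLL_COLUMNS, line.split())))
    if sentence:
        yield sentence

for sentence in parse_conll(result):
    print([token["form"] for token in sentence])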
The JSON output format
The JSON format (which we obtain by passing format="JSON" into the annotate function) is more convenient when we need to further process the annotations, because each abstraction is mapped directly into a Python native object (lists, dicts, strings, etc.) as follows:
- The returned object is a list, where each element corresponds to a paragraph of the given text;
- In turn, each paragraph is a list where each element represents a sentence;
- Each sentence is a list where each element represents a token;
- Each token is a dict where each key-value pair is an attribute of the token.
annotated_text = annotate(text, format="JSON")
for pnum, paragraph in enumerate(annotated_text, start=1):  # enumerate paragraphs in text, starting at 1
    print(f"paragraph {pnum}:")
    for snum, sentence in enumerate(paragraph, start=1):  # enumerate sentences in paragraph, starting at 1
        print(f"  sentence {snum}:")
        for tnum, token in enumerate(sentence, start=1):  # enumerate tokens in sentence, starting at 1
            print(f"    token {tnum}: {token!r}")  # print a token representation
paragraph 1:
  sentence 1:
    token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs'}
    token 2: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs'}
    token 3: {'form': 'serve', 'space': 'LR', 'pos': 'V', 'lemma': 'SERVIR', 'infl': 'pi-3s'}
    token 4: {'form': 'para', 'space': 'LR', 'pos': 'PREP'}
    token 5: {'form': 'testar', 'space': 'LR', 'pos': 'V', 'lemma': 'TESTAR', 'infl': 'INF-nInf'}
    token 6: {'form': 'o', 'space': 'LR', 'pos': 'DA', 'infl': 'ms'}
    token 7: {'form': 'funcionamento', 'space': 'LR', 'pos': 'CN', 'lemma': 'FUNCIONAMENTO', 'infl': 'ms'}
    token 8: {'form': 'de_', 'space': 'L', 'raw': 'da', 'pos': 'PREP'}
    token 9: {'form': 'a', 'space': 'R', 'pos': 'DA', 'infl': 'fs'}
    token 10: {'form': 'suite', 'space': 'L', 'pos': 'CN', 'lemma': 'SUITE', 'infl': 'fs'}
    token 11: {'form': '.', 'space': 'R', 'pos': 'PNT'}
  sentence 2:
    token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs'}
    token 2: {'form': 'outra', 'space': 'LR', 'pos': 'ADJ', 'lemma': 'OUTRO', 'infl': 'fs'}
    token 3: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs'}
    token 4: {'form': 'faz', 'space': 'LR', 'pos': 'V', 'lemma': 'FAZER', 'infl': 'pi-3s'}
    token 5: {'form': 'o', 'space': 'LR', 'pos': 'LDEM1'}
    token 6: {'form': 'mesmo', 'space': 'L', 'pos': 'LDEM2'}
    token 7: {'form': '.', 'space': 'R', 'pos': 'PNT'}
Downloading and preparing our working text
In the next code cell, we will download a copy of the book "Viagens na minha terra" and prepare it to be used as our working text.
# A plain text version of this book is available from our GitHub repository:
sample_text_url = "https://github.com/portulanclarin/jupyter-notebooks/raw/main/sample-data/viagensnaminhaterra.txt"
req = requests.get(sample_text_url)
sample_text_lines = req.text.splitlines()
num_lines = len(sample_text_lines)
print(f"The downloaded text contains {num_lines} lines")
# discard whitespace at beginning and end of each line:
sample_text_lines = [line.strip() for line in sample_text_lines]
# discard empty lines
sample_text_lines = [line for line in sample_text_lines if line]
# how many lines do we have left?
num_lines = len(sample_text_lines)
print(f"After discarding empty lines we are left with {num_lines} non-empty lines")
The downloaded text contains 2509 lines
After discarding empty lines we are left with 2205 non-empty lines
Annotating with the LX-Suite web service
There is a limit on the number of web service requests per hour that can be made with any given key. Thus, we should send as much text as possible in each request, while also conforming to the limit of 4000 characters per request.
To this end, the following function slices our text into chunks of at most 4000 characters:
def slice_into_chunks(lines, max_chunk_size=4000):
    chunk, chunk_size = [], 0
    for lnum, line in enumerate(lines, start=1):
        if (chunk_size + len(line)) <= max_chunk_size:
            chunk.append(line)
            chunk_size += len(line) + 1
            # the + 1 above is for the newline character terminating each line
        else:
            yield "\n".join(chunk)
            if len(line) > max_chunk_size:
                print(f"line {lnum} is longer than {max_chunk_size} characters; truncating")
                line = line[:max_chunk_size]
            chunk, chunk_size = [line], len(line) + 1
    if chunk:
        yield "\n".join(chunk)
Next, we will apply slice_into_chunks to the sample text to get the chunks to be annotated.
chunks = list(slice_into_chunks(sample_text_lines))
annotated_text = [] # annotated paragraphs will be stored here
chunks_processed = 0 # this variable keeps track of which chunks have been processed already
print(f"There are {len(chunks)} chunks to be annotated")
There are 105 chunks to be annotated
Next, we will invoke annotate on each chunk.
If we get an exception while annotating a chunk:
- check the exception message to determine the cause;
- if the maximum number of requests per hour has been exceeded, wait some time before retrying;
- if a temporary error occurred in the web service, try again later.
In any case, as long as the notebook is not shut down or restarted, the text that has been annotated so far is not lost, and re-running the following cell will resume from the point where the exception occurred.
for cnum, chunk in enumerate(chunks[chunks_processed:], start=chunks_processed+1):
    try:
        annotated_text.extend(annotate(chunk, format="JSON"))
        chunks_processed = cnum
        # print one dot for each annotated chunk to get some progress feedback
        print(".", end="", flush=True)
    except Exception as exc:
        chunk_preview = chunk[:100] + "[...]" if len(chunk) > 100 else chunk
        print(
            f"\nError: annotation of chunk {cnum} failed ({exc}); chunk contents:\n\n{chunk_preview}\n\n"
        )
        break
.........................................................................................................
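The manual retry procedure described above can also be automated. The following is a minimal sketch, assuming a fixed delay between attempts; the three attempts and the 15-minute pause are arbitrary choices of ours, sized to the limit of 200 requests per hour:
import time

MAX_ATTEMPTS = 3  # arbitrary; how many times to try each chunk before giving up
RETRY_DELAY = 15 * 60  # seconds; arbitrary pause sized to the hourly request limit

for cnum, chunk in enumerate(chunks[chunks_processed:], start=chunks_processed+1):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            annotated_text.extend(annotate(chunk, format="JSON"))
            chunks_processed = cnum
            print(".", end="", flush=True)
            break  # this chunk is done; move on to the next one
        except WSException as exc:
            print(f"\nchunk {cnum}, attempt {attempt} failed: {exc}")
            if attempt == MAX_ATTEMPTS:
                raise  # give up; annotated_text still holds the progress so far
            time.sleep(RETRY_DELAY)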
Let's create a pie chart with the most common part-of-speech tags
%matplotlib inline

tag_frequencies = collections.Counter(
    token["pos"]
    for paragraph in annotated_text
    for sentence in paragraph
    for token in sentence
).most_common()

tags = [tag for tag, _ in tag_frequencies[:9]]
freqs = [freq for _, freq in tag_frequencies[:9]]
tags.append("other")
freqs.append(sum(freq for _, freq in tag_frequencies[9:]))  # sum the remaining tags into "other"
plt.rcParams['figure.figsize'] = [10, 10]
fig1, ax1 = plt.subplots()
ax1.pie(freqs, labels=tags, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # equal aspect ratio ensures that pie is drawn as a circle
plt.show()
# To learn more about matplotlib visit https://matplotlib.org/
Getting the status of a web service access key
def get_key_status():
    '''Returns a dict with the detailed status of the web service access key'''
    request_data = {
        'method': 'key_status',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'key': LXSUITE_WS_API_KEY,
        },
    }
    request = requests.post(LXSUITE_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
get_key_status()
{'requests_remaining': 99999140, 'chars_remaining': 998236690, 'expiry': '2030-01-10T00:00+00:00'}
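Since key_status reports the remaining quota, it can be checked before submitting a batch of chunks. A minimal sketch, assuming the fields shown in the output above (the enough_quota helper is our own):
def enough_quota(chunks):
    '''Checks whether the access key's remaining quota covers all given chunks.'''
    status = get_key_status()
    chars_needed = sum(len(chunk) for chunk in chunks)
    return (status["requests_remaining"] >= len(chunks)
            and status["chars_remaining"] >= chars_needed)

if not enough_quota(chunks):
    print("Not enough quota left on this key; request a new key or reduce the input.")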
Instructions to use this web service
The web service for this application is available at https://portulanclarin.net/workbench/lx-suite/api/.
Below is an example of how to use this web service with Python 3.
This example uses the requests package. To install it, run this command in the command line:
pip3 install requests
To use this web service, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days and allows you to submit a total of 1 billion characters in requests of no more than 4000 characters each, up to a maximum of 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data and the respective output will be automatically deleted from our computer after being processed. No copies will be retained after your use of this service.
import json
import requests # to install this library, enter in your command line:
# pip3 install requests
# This is a simple example to illustrate how you can use the LX-Suite web service
# Requires: key is a string with your access key
# Requires: text is a string, UTF-8, with a maximum of 4000 characters, Portuguese text,
#           with the input to be processed
# Requires: format is a string, indicating the output format, which can be either
#           'CINTIL', 'CONLL' or 'JSON'
# Ensures: output according to the specification in https://portulanclarin.net/workbench/lx-suite/
# Ensures: dict with the number of requests and characters input so far with the access key,
#          and its date of expiry
key = 'access_key_goes_here'  # before you run this example, replace access_key_goes_here
                              # by your access key
format = 'CONLL'  # other possible values are 'CINTIL' and 'JSON'
# this string can be replaced by your input
text = '''Esta frase serve para testar o funcionamento da suite. Esta outra
frase faz o mesmo.'''
# To read input text from a file, uncomment this block
#inputFile = open("myInputFileName", "r", encoding="utf-8") # replace myInputFileName by
# the name of your file
#text = inputFile.read()
#inputFile.close()
# Processing:
url = "https://portulanclarin.net/workbench/lx-suite/api/"
request_data = {
    'method': 'annotate',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': text,
        'format': format,
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Result:")
    print(response_data["result"])
# To write output in a file, uncomment this block
#outputFile = open("myOutputFileName","w", encoding="utf-8") # replace myOutputFileName by
# the name of your file
#output = response_data["result"]
#outputFile.write(output)
#outputFile.close()
# Getting access key status:
request_data = {
    'method': 'key_status',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Key status:")
    print(json.dumps(response_data["result"], indent=4))
Access key for the web service
Email address validation
To receive by email your access key for this web service, send an email to request@portulanclarin.net with the code displayed below in the "Subject" field and an empty message body.
Privacy: When your access key expires, your email address is automatically deleted from our records.
LX-Suite documentation
LX-Suite
LX-Suite is a freely available online service for the shallow processing of Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
You may also be interested in our LX-Sentence Splitter, LX-Tokenizer or LX-Tagger online services for sentence splitting, tokenization or part-of-speech tagging of Portuguese, in our LX-Conjugator and LX-Lemmatizer online services for the conjugation and lemmatization of verbs, and in the LX-Inflector online service for the inflection of nominal classes.
Features and evaluation
LX-Suite is composed of a set of shallow processing tools:
- LX-Sentence Splitter:
  - Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
  - Unwraps sentences split over different lines.
  An f-score of 99.94% was obtained when testing on a 12,000 sentence corpus accurately hand-tagged with respect to sentence and paragraph boundaries.
- LX-Tokenizer:
  - Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.
    um exemplo → |um|exemplo|
  - Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
    do → |de_|o|
  - Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|
  - Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
    dá-se-lho → |dá|-se|-lhe_|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|
  - This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
    deste → |deste| when occurring as a Verb
    deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
  This tool achieves an f-score of 99.72%.
- LX-Tagger:
  - Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token, using a / (slash) symbol as separator:
    um exemplo → um/IA exemplo/CN
  - Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:
    de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
  This tagger was developed using the TnT software on a manually annotated 260k token corpus. An accuracy of 96.87% was obtained under 10-fold cross-validation.
- LX-Featurizer (nominal):
  - Assigns inflection feature values to words from the nominal categories (Gender (masculine or feminine), Number (singular or plural) and, when applicable, Person (1st, 2nd and 3rd)):
    os/DA gatos/CN → os/DA#mp gatos/CN#mp
  - Assigns degree feature values (diminutive, superlative and comparative) to words from the nominal categories:
    os/DA gatinhos/CN → os/DA#mp gatinhos/CN#mp-dim
  - Sometimes, due to the so-called invariant words, the featurizer is not able to determine a feature value. In those cases, it assigns a g value for an underspecified Gender and an n value for an underspecified Number. Note, however, that if provided with an adequate context, the featurizer might resolve such cases:
    Vi/V pianistas/CN → Vi/V pianistas/CN#gp
    Vi/V as/DA pianistas/CN → Vi/V as/DA#fp pianistas/CN#fp
  This tool has an f-score of 91.07%. For an online service supported by this tool (without performing disambiguation) see LX-Inflector.
- LX-Lemmatizer (nominal):
  - Assigns a lemma to words from the nominal categories (Adjectives, Common Nouns and Past Participles). This lemma corresponds to the form that one would find in a dictionary, typically the masculine singular form. The lemma is inserted into the token, with / (slash) as a delimiter:
    gatas/CN#fp → gatas/GATO/CN#fp
    normalíssimo/ADJ#ms-sup → normalíssimo/NORMAL/ADJ#ms-sup
  This tool has an f-score of 97.67%. For an online service supported by this tool (without performing disambiguation) see LX-Inflector.
- LX-Lemmatizer and Featurizer (verbal):
  - Assigns a lemma and inflection feature values to verbs. The lemma corresponds to the infinitive form of the verb. The lemma is inserted into the token, with / (slash) as a delimiter:
    escrevi/V → escrevi/ESCREVER/V#ppi-1s
  The tool disambiguates among the various lemma-inflection pairs that can be assigned to a verb form, achieving 95.96% accuracy.
  For an online service supported by this tool (without performing disambiguation) see LX-Lemmatizer.
These tools work in a pipeline, where each tool takes as input the output of the previous one.
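To see the markup described above in practice, you can request the CINTIL output format from the LX-Suite web service. The following is a minimal sketch using the same JSON-RPC API as the examples earlier on this page; replace access_key_goes_here with your access key:
import requests

url = "https://portulanclarin.net/workbench/lx-suite/api/"
request_data = {
    'method': 'annotate',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': 'Esta frase serve para testar o funcionamento da suite.',
        'format': 'CINTIL',  # the markup produced by the pipeline described above
        'key': 'access_key_goes_here',  # replace with your access key
    },
}
response_data = requests.post(url, json=request_data).json()
# print either the annotated text or the error reported by the web service
print(response_data.get("result", response_data.get("error")))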
Authorship
LX-Suite is being developed by António Branco and João Silva, with the key contribution of Filipe Nunes (verbal lemmatizer), and the help of Francisco Costa, Catarina Ribeiro and Ricardo Santos at the NLX—Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Acknowledgments
The development of a state-of-the-art, complete suite of shallow processing tools for Portuguese was supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
This project was developed in cooperation with CLUL—Centro de Linguística da Universidade de Lisboa. The training and test corpora prepared for the development of this demo evolved from a corpus provided by CLUL.
This demo includes a part-of-speech tagger developed with Thorsten Brants' TnT software with his written permission.
Publications
Irrespective of the version of this tool you use, when mentioning it, please cite this reference:
- Branco, António and João Silva, 2004. "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Paris, ELRA, pp. 507-510.
Other publications:
- Branco, António and João Silva, 2006. "Dedicated Nominal Featurization of Portuguese". In Proceedings of the VII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR'06)
- Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. "Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project". In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).
- Branco, António, Filipe Nunes and João Silva, 2006. Verb Analysis in an Inflective Language: Simpler is better. Internal report, University of Lisbon, Department of Informatics, NLX-Natural Language and Speech Group.
- Branco, António and João Silva, 2005. "Accurate Annotation: an Efficiency Metric". In Nicolas Nicolov, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.), Recent Advances in Natural Language Processing III, Amsterdam, John Benjamins, pp. 173-182.
- Branco, António and João Silva, 2004. "Swift Development of State of the Art Taggers for Portuguese". In António Branco, Amália Mendes and Ricardo Ribeiro (orgs.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 29-46.
- Branco, António and João Silva, 2004. "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Paris, ELRA, pp. 507-510.
- Branco, António, Amália Mendes and Ricardo Ribeiro (eds.), 2003. Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003. Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, Technical Report TR-2003-28.
- Branco, António and João Silva, 2003. "Portuguese-specific Issues in the Rapid Development of State of the Art Taggers". In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003, Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, TR-2003-28, pp. 7-10.
- Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2004. "Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources". In António Branco, Amália Mendes and Ricardo Ribeiro (orgs.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 47-62.
- Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2003. "Reusing Available Resources for Tagging a Spoken Portuguese Corpus". In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003, 2003, pp. 25-28.
- TagShare, 2004, Manual de Etiquetação e Convenções, Internal Report, University of Lisbon, Department of Informatics, NLX-Natural Language and Speech Group.
Contact us
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Tagset
Part-of-speech tags
Tag | Category | Examples |
---|---|---|
ADJ | Adjectives | bom, brilhante, eficaz, … |
ADV | Adverbs | hoje, já, sim, felizmente, … |
CARD | Cardinals | zero, dez, cem, mil, … |
CJ | Conjunctions | e, ou, tal como, … |
CL | Clitics | o, lhe, se, … |
CN | Common nouns | computador, cidade, ideia, … |
DA | Definite articles | o, os, … |
DEM | Demonstratives | este, esses, aquele, … |
DFR | Denominators of fractions | meio, terço, décimo, %, … |
DGTR | Roman numerals | VI, LX, MMIII, MCMXCIX, … |
DGT | Digits | 0, 1, 42, 12345, 67890, … |
DM | Discourse marker | olá, … |
EADR | Electronic addresses | http://www.di.fc.ul.pt, … |
EOE | End of enumeration | etc |
EXC | Exclamative | ah, ei, etc. |
GER | Gerunds | sendo, afirmando, vivendo, … |
GERAUX | Gerund "ter"/"haver" in compound tenses | tendo, havendo … |
IA | Indefinite articles | uns, umas, … |
IND | Indefinites | tudo, alguém, ninguém, … |
INF | Infinitive | ser, afirmar, viver, … |
INFAUX | Infinitive "ter"/"haver" in compound tenses | ter, haver … |
INT | Interrogatives | quem, como, quando, … |
ITJ | Interjection | bolas, caramba, … |
LTR | Letters | a, b, c, … |
MGT | Magnitude classes | unidade, dezena, dúzia, resma, … |
MTH | Months | janeiro, dezembro, … |
NP | Noun phrases | idem, … |
ORD | Ordinals | primeiro, centésimo, penúltimo, … |
PADR | Part of address | Rua, av., rot., … |
PNM | Part of name | Lisboa, António, João, … |
PNT | Punctuation marks | ., ?, (, … |
POSS | Possessives | meu, teu, seu, … |
PPA | Past participles not in compound tenses | afirmados, vivida, … |
PP | Prepositional phrases | algures, … |
PPT | Past participle in compound tenses | sido, afirmado, vivido, … |
PREP | Prepositions | de, para, em redor de, … |
PRS | Personals | eu, tu, ele, … |
QNT | Quantifiers | todos, muitos, nenhum, … |
REL | Relatives | que, cujo, tal que, … |
STT | Social titles | Presidente, drª., prof., … |
SYB | Symbols | @, #, &, … |
TERMN | Optional terminations | (s), (as), … |
UM | "um" or "uma" | um, uma |
UNIT | Abbreviated measurement units | kg., km., … |
VAUX | Finite "ter" or "haver" in compound tenses | temos, haveriam, … |
V | Verbs (other than PPA, PPT, INF or GER) | falou, falaria, … |
WD | Week days | segunda, terça-feira, sábado, … |
Multi-word expressions | ||
LADV1…LADVn | Multi-word adverbs | de facto, em suma, um pouco, … |
LCJ1…LCJn | Multi-word conjunctions | assim como, já que, … |
LDEM1…LDEMn | Multi-word demonstratives | o mesmo, … |
LDFR1…LDFRn | Multi-word denominators of fractions | por cento |
LDM1…LDMn | Multi-word discourse markers | pois não, até logo, … |
LITJ1…LITJn | Multi-word interjections | meu Deus |
LPRS1…LPRSn | Multi-word personals | a gente, si mesmo, V. Exa., … |
LPREP1…LPREPn | Multi-word prepositions | através de, a partir de, … |
LQD1…LQDn | Multi-word quantifiers | uns quantos, … |
LREL1…LRELn | Multi-word relatives | tal como, … |
Other tags
Tag | Description |
---|---|
m | Masculine |
f | Feminine |
s | Singular |
p | Plural |
dim | Diminutive |
sup | Superlative |
comp | Comparative |
1 | First person |
2 | Second person |
3 | Third person |
pi | Presente do indicativo |
ppi | Pretérito perfeito do indicativo |
ii | Pretérito imperfeito do indicativo |
mpi | Pretérito mais que perfeito do indicativo |
fi | Futuro do indicativo |
c | Condicional |
pc | Presente do conjuntivo |
ic | Pretérito imperfeito do conjuntivo |
fc | Futuro do conjuntivo |
imp | Imperativo |
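As an illustration of how this tagset appears in the web service output, the sketch below maps part-of-speech tags from the JSON output of the annotate function defined earlier on this page to the category names in the table above; the CATEGORY_NAMES mapping is our own and covers only a small excerpt of the tagset, to be extended as needed:
CATEGORY_NAMES = {
    # a small excerpt from the part-of-speech tagset above
    "ADJ": "Adjectives", "ADV": "Adverbs", "CN": "Common nouns",
    "DA": "Definite articles", "DEM": "Demonstratives",
    "PREP": "Prepositions", "PNT": "Punctuation marks", "V": "Verbs",
}

annotated = annotate("Esta frase serve para testar o funcionamento da suite.",
                     format="JSON")
for paragraph in annotated:
    for sentence in paragraph:
        for token in sentence:
            # fall back to the raw tag when it is not in the excerpt above
            print(f"{token['form']:15} {CATEGORY_NAMES.get(token['pos'], token['pos'])}")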
Why LX-Suite?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
License
No fee, attribution, all rights reserved, no redistribution, non commercial, no warranty, no liability, no endorsement, temporary, non exclusive, share alike.
The complete text of this license is here.