File processing
Input format: Input files must be plain text (.txt), UTF-8 encoded, and contain Portuguese text. Input files and folders may also be compressed in .zip format.
Privacy: The input file you upload and the respective output files will be automatically deleted from our computers after processing and after you have downloaded the result. No copies of your files are retained after your use of this service.
Email address validation
Your input file is large, so processing it may take some time.
To receive by email a URL from which to download your processed file, copy the code displayed below into the "Subject:" field of an email message (leaving the message body empty) and send it to request@portulanclarin.net
Privacy: After we reply to you with the download URL, your email address is automatically deleted from our records.
Designing your own experiment with a Jupyter Notebook
A Jupyter notebook (hereafter simply notebook) is a type of document that contains executable code interspersed with visualizations of code execution results and narrative text.
Below we provide an example notebook which you may use as a starting point for designing your own experiments using language resources offered by PORTULAN CLARIN.
Prerequisites
To execute this notebook, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days and allows you to submit a total of 500 million characters, in requests of no more than 2000 characters each. It allows up to 100,000 requests, at a rate of at most 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
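Given these limits, texts longer than 2000 characters must be split across several requests. The following is a minimal sketch (not part of the service) of one way to do this: it breaks the text at newlines so sentences are not cut in half, and sleeps between requests to stay under the hourly rate. The submit argument is a placeholder for whatever request function you use, for instance the parse function defined further below.

import time

MAX_CHARS = 2000
SECONDS_BETWEEN_REQUESTS = 3600 / 200  # 18 seconds keeps under 200 requests/hour

def submit_in_chunks(text, submit):
    """Call submit(chunk) for successive chunks of text and collect the results.

    Illustrative helper; assumes no single line exceeds MAX_CHARS.
    """
    results = []
    chunk = ""
    for line in text.splitlines(keepends=True):
        if len(chunk) + len(line) > MAX_CHARS:
            results.append(submit(chunk))
            time.sleep(SECONDS_BETWEEN_REQUESTS)
            chunk = ""
        chunk += line
    if chunk:
        results.append(submit(chunk))
    return results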
The input data sent to any PORTULAN CLARIN web service and the respective output will be automatically deleted from our computers after being processed. However, when running a notebook on an external service, such as the ones suggested below, you should take their data privacy policies into consideration.
Running the notebook
You have three options to run the notebook presented below:
- Run on Binder — The Binder Project is funded by a 501(c)(3) non-profit organization and is described in detail in the following paper: Jupyter et al., "Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale." Proceedings of the 17th Python in Science Conference. 2018. doi:10.25080/Majora-4af1f417-011
- Run on Google Colab — Google Colaboratory is a free-to-use product from Google Research.
- Download the notebook from our public GitHub repository and run it on your computer. This is a more advanced option, which requires you to install Python 3 and Jupyter on your computer. For anyone without prior experience setting up a Python development environment, we strongly recommend one of the two options above.
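For reference, assuming Python 3 and pip are already installed, a setup along these lines should work (the notebook filename below is illustrative, not the actual repository filename):

pip3 install notebook
jupyter notebook lx-depparser-example.ipynb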
This is only a preview of the notebook. To run it, please choose one of the following options:
Using LX-DepParser to parse sentences and display dependency tree graphs
This is an example notebook that illustrates how you can use the LX-DepParser web service to parse sentences and how to visualize dependency tree graphs in a notebook.
Before you run this example, replace access_key_goes_here with your web service access key, below:
LXDEPPARSER_WS_API_KEY = 'access_key_goes_here'
LXDEPPARSER_WS_API_URL = 'https://portulanclarin.net/workbench/lx-depparser/api/'
Importing required Python modules
The next cell takes care of installing the requests and pydependencygrapher packages, if they are not already installed, and of making them available for use in this notebook.
try:
    import requests
except ImportError:
    # install the package and retry the import
    !pip3 install requests
    import requests

try:
    import pydependencygrapher
except ImportError:
    # pydependencygrapher depends on pycairo, which needs these system libraries;
    # see https://github.com/pygobject/pycairo/issues/39#issuecomment-391830334
    !apt-get install libcairo2-dev libjpeg-dev libgif-dev
    !pip3 install pydependencygrapher
    import pydependencygrapher

import base64
import IPython
Wrapping the complexities of the JSON-RPC API in a simple, easy-to-use function
The WSException class, defined below, will be used later to identify errors reported by the web service.
class WSException(Exception):
    'Webservice Exception'
    def __init__(self, errordata):
        "errordata is a dict returned by the webservice with details about the error"
        super().__init__(self)
        assert isinstance(errordata, dict)
        self.message = errordata["message"]
        # see https://json-rpc.readthedocs.io/en/latest/exceptions.html for more info
        # about JSON-RPC error codes
        if -32099 <= errordata["code"] <= -32000:  # Server Error
            if errordata["data"]["type"] == "WebServiceException":
                self.message += f": {errordata['data']['message']}"
            else:
                self.message += f": {errordata['data']!r}"
    def __str__(self):
        return self.message
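As a hypothetical illustration (not part of the original notebook) of how this exception surfaces, a call to the parse function defined in the next section can be wrapped as follows:

# Hypothetical usage: catch and report a web service error
# (parse is defined in the next section)
try:
    result = parse("Uma frase de teste.", tagset="CINTIL", format="CONLL")
except WSException as error:
    print(f"The web service reported an error: {error}")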
The next function invokes the LX-DepParser web service through its public JSON-RPC API.
def parse(text, tagset, format):
    '''
    Arguments

    text: a string with a maximum of 2000 characters, Portuguese text, with
        the input to be processed
    tagset: either 'CINTIL' or 'UD' (universal dependencies)
    format: either 'CONLL' or 'JSON'

    Returns a string with the output according to specification in
    https://portulanclarin.net/workbench/lx-depparser/

    Raises a WSException if an error occurs.
    '''
    request_data = {
        'method': 'parse',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'text': text,
            'tagset': tagset,
            'format': format,
            'key': LXDEPPARSER_WS_API_KEY,
        },
    }
    request = requests.post(LXDEPPARSER_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
Let us test the function we just defined:
text = '''Esta frase serve para testar o funcionamento do parser de dependências. Esta outra
frase faz o mesmo.'''
# the CONLL annotation format is a popular format for annotating part of speech
# and dependency tree graphs
result = parse(text, tagset="CINTIL", format="CONLL")
print(result)
#id   form            lemma          cpos   pos    feat      head  deprel  phead  pdeprel
1     Esta            -              DEM    DEM    fs        2     SP      2      SP
2     frase           FRASE          CN     CN     fs        3     SJ      3      SJ
3     serve           SERVIR         V      V      pi-3s     0     ROOT    0      ROOT
4     para            -              PREP   PREP   -         3     C       3      C
5     testar          TESTAR         V      V      INF-nInf  3     COORD   3      COORD
6     o               -              DA     DA     ms        7     SP      7      SP
7     funcionamento   FUNCIONAMENTO  CN     CN     ms        5     DO      5      DO
8     de_             -              PREP   PREP   -         7     OBL     7      OBL
9     o               -              DA     DA     ms        10    SP      10     SP
10    parser          PARSER         CN     CN     ms        8     C       8      C
11    de              -              PREP   PREP   -         10    M       10     M
12    dependências    DEPENDÊNCIA    CN     CN     fp        11    C       11     C
13    .               -              PNT    PNT    -         3     PUNCT   3      PUNCT

#id   form            lemma          cpos   pos    feat      head  deprel  phead  pdeprel
1     Esta            -              DEM    DEM    fs        3     SP      3      SP
2     outra           OUTRO          ADJ    ADJ    fs        3     SP      3      SP
3     frase           FRASE          CN     CN     fs        4     SJ      4      SJ
4     faz             FAZER          V      V      pi-3s     0     ROOT    0      ROOT
5     o               -              LDEM1  LDEM1  -         4     DO      4      DO
6     mesmo           -              LDEM2  LDEM2  -         4     DO      4      DO
7     .               -              PNT    PNT    -         4     PUNCT   4      PUNCT
Displaying dependency tree graphs from parsed text in CONLL format
To view dependency tree graphs for the parsed sentences, first we will split the CONLL output on empty lines to get one set of lines per sentence (each line carrying information pertaining to each token).
def group_sentence_conll_lines(conll_lines):
    """Groups CONLL-encoded lines (one line encodes one token) into sentences.

    This function takes as argument a sequence of CONLL lines and returns a
    list of lists, each inner list containing the CONLL lines of one sentence.
    """
    parsed_sentences = []
    current_sentence = []
    for line in conll_lines:
        # lines starting with # are comments; ignore
        if line.startswith("#"):
            continue
        # one or more consecutive empty lines mark the end of a sentence
        if not line:
            if current_sentence:
                parsed_sentences.append(current_sentence)
                current_sentence = []
        else:
            current_sentence.append(line)
    if current_sentence:
        parsed_sentences.append(current_sentence)
    return parsed_sentences
Let us define a function render_tree that displays a sentence dependency graph, using the pydependencygrapher package to render the graph into an image and the IPython package to display the resulting image. We also define a function render_tree_from_conll that takes a CONLL sentence (a list of CONLL-formatted lines, one per token), creates one pydependencygrapher.Token object for each token, and then calls render_tree to display the dependency graph.
def render_tree(sentence):
    graph = pydependencygrapher.DependencyGraph(sentence)
    graph.draw()
    b64png = graph.save_buffer()
    IPython.display.display(IPython.display.Image(data=base64.b64decode(b64png)))

def render_tree_from_conll(conll_sentence):
    sentence = [pydependencygrapher.Token(*conll_token.split("\t")) for conll_token in conll_sentence]
    return render_tree(sentence)
conll_lines = result.splitlines(keepends=False)
for conll_sentence in group_sentence_conll_lines(conll_lines):
    render_tree_from_conll(conll_sentence)
The JSON output format

The JSON format (which we obtain by passing format="JSON" into the parse function) is more convenient when we need to further process the annotations, because each abstraction is mapped directly into a native Python object (lists, dicts, strings, etc.) as follows:

- The returned object is a list, where each element corresponds to a paragraph of the given text;
- In turn, each paragraph is a list where each element represents a sentence;
- Each sentence is a list where each element represents a token;
- Each token is a dict where each key-value pair is an attribute of the token.
parsed_text = parse(text, tagset="CINTIL", format="JSON")
for pnum, paragraph in enumerate(parsed_text, start=1):  # enumerate paragraphs in text, starting at 1
    print(f"paragraph {pnum}:")
    for snum, sentence in enumerate(paragraph, start=1):  # enumerate sentences in paragraph, starting at 1
        print(f"  sentence {snum}:")
        for tnum, token in enumerate(sentence, start=1):  # enumerate tokens in sentence, starting at 1
            print(f"    token {tnum}: {token!r}")  # print a token representation
paragraph 1:
  sentence 1:
    token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs', 'deprel': 'SP', 'parent': 2}
    token 2: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs', 'deprel': 'SJ', 'parent': 3}
    token 3: {'form': 'serve', 'space': 'LR', 'pos': 'V', 'lemma': 'SERVIR', 'infl': 'pi-3s', 'deprel': 'ROOT', 'parent': 0}
    token 4: {'form': 'para', 'space': 'LR', 'pos': 'PREP', 'deprel': 'C', 'parent': 3}
    token 5: {'form': 'testar', 'space': 'LR', 'pos': 'V', 'lemma': 'TESTAR', 'infl': 'INF-nInf', 'deprel': 'COORD', 'parent': 3}
    token 6: {'form': 'o', 'space': 'LR', 'pos': 'DA', 'infl': 'ms', 'deprel': 'SP', 'parent': 7}
    token 7: {'form': 'funcionamento', 'space': 'LR', 'pos': 'CN', 'lemma': 'FUNCIONAMENTO', 'infl': 'ms', 'deprel': 'DO', 'parent': 5}
    token 8: {'form': 'de_', 'space': 'L', 'raw': 'do', 'pos': 'PREP', 'deprel': 'OBL', 'parent': 7}
    token 9: {'form': 'o', 'space': 'R', 'pos': 'DA', 'infl': 'ms', 'deprel': 'SP', 'parent': 10}
    token 10: {'form': 'parser', 'space': 'LR', 'pos': 'CN', 'lemma': 'PARSER', 'infl': 'ms', 'deprel': 'C', 'parent': 8}
    token 11: {'form': 'de', 'space': 'LR', 'pos': 'PREP', 'deprel': 'M', 'parent': 10}
    token 12: {'form': 'dependências', 'space': 'L', 'pos': 'CN', 'lemma': 'DEPENDÊNCIA', 'infl': 'fp', 'deprel': 'C', 'parent': 11}
    token 13: {'form': '.', 'space': 'R', 'pos': 'PNT', 'deprel': 'PUNCT', 'parent': 3}
  sentence 2:
    token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs', 'deprel': 'SP', 'parent': 3}
    token 2: {'form': 'outra', 'space': 'LR', 'pos': 'ADJ', 'lemma': 'OUTRO', 'infl': 'fs', 'deprel': 'SP', 'parent': 3}
    token 3: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs', 'deprel': 'SJ', 'parent': 4}
    token 4: {'form': 'faz', 'space': 'LR', 'pos': 'V', 'lemma': 'FAZER', 'infl': 'pi-3s', 'deprel': 'ROOT', 'parent': 0}
    token 5: {'form': 'o', 'space': 'LR', 'pos': 'LDEM1', 'deprel': 'DO', 'parent': 4}
    token 6: {'form': 'mesmo', 'space': 'L', 'pos': 'LDEM2', 'deprel': 'DO', 'parent': 4}
    token 7: {'form': '.', 'space': 'R', 'pos': 'PNT', 'deprel': 'PUNCT', 'parent': 4}
Displaying dependency graphs from parsed text in JSON format
Let us define a function, similar to render_tree_from_conll, to display dependency graphs for JSON-encoded sentences.
def render_tree_from_json(json_sentence):
    # this attribute list mirrors the CONLL columns after the id column:
    # "pos" stands in for both cpos and pos, and "parent"/"deprel" are
    # repeated to fill the phead/pdeprel columns
    token_attributes = ["form", "lemma", "pos", "pos", "infl", "parent", "deprel", "parent", "deprel"]
    sentence = []
    for num, token in enumerate(json_sentence, start=1):
        sentence.append(
            pydependencygrapher.Token(
                num,
                *[token.get(attribute, "_") for attribute in token_attributes]
            )
        )
    return render_tree(sentence)
Let us test the function we just defined:
for paragraph in parsed_text:
    for sentence in paragraph:
        render_tree_from_json(sentence)
Getting the status of a web service access key
def get_key_status():
    '''Returns a dict with the detailed status of the web service access key'''
    request_data = {
        'method': 'key_status',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'key': LXDEPPARSER_WS_API_KEY,
        },
    }
    request = requests.post(LXDEPPARSER_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
get_key_status()
{'requests_remaining': 99999970, 'chars_remaining': 999998849, 'expiry': '2030-01-10T00:00+00:00'}
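The returned dict makes it easy to keep an eye on the remaining quota. The following helper is a small illustrative sketch, not part of the original notebook; the thresholds are arbitrary defaults, and it assumes the expiry string stays in the ISO format shown above.

import datetime

def check_quota(status, min_requests=100, min_chars=10_000):
    """Print warnings when the access key is close to exhaustion or expiry.

    status is the dict returned by get_key_status().
    """
    expiry = datetime.datetime.fromisoformat(status["expiry"])
    now = datetime.datetime.now(datetime.timezone.utc)
    if expiry < now + datetime.timedelta(days=3):
        print(f"warning: key expires soon ({status['expiry']})")
    if status["requests_remaining"] < min_requests:
        print("warning: few requests remaining")
    if status["chars_remaining"] < min_chars:
        print("warning: few characters remaining")

check_quota(get_key_status())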
Instructions to use this web service
The web service for this application is available at https://portulanclarin.net/workbench/lx-depparser/api/.
Below is an example of how to use this web service with Python 3. The example relies on the requests package; to install it, run this command in the command line: pip3 install requests
To use this web service, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days and allows you to submit a total of 500 million characters, in requests of no more than 2000 characters each. It allows up to 100,000 requests, at a rate of at most 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data and the respective output will be automatically deleted from our computers after being processed. No copies will be retained after your use of this service.
import json
import requests  # to install this library, enter in your command line:
                 # pip3 install requests

# This is a simple example to illustrate how you can use the LX-DepParser web service
# Requires: key is a string with your access key
# Requires: text is a string, UTF-8, with a maximum of 2000 characters, Portuguese text,
#           with the input to be processed
# Requires: tagset is a string, indicating the tagset to be used in the output, which can
#           be either 'CINTIL' or 'UD' (universal dependencies)
# Requires: format is a string, indicating the output format, which can be either
#           'CONLL' or 'JSON'
# Ensures: output according to specification in https://portulanclarin.net/workbench/lx-depparser/
# Ensures: dict with number of requests and characters input so far with the access key, and
#          its date of expiry

key = 'access_key_goes_here'  # before you run this example, replace access_key_goes_here
                              # with your access key

# this string can be replaced by your input
text = '''A Maria tem razão.
Mesmo assim, ensaia algumas aproximações.
A emissão será cotada na Bolsa de Valores do Luxemburgo.'''

tagset = 'CINTIL'  # set this to 'UD' to get universal dependencies
format = 'CONLL'  # the other possible value is 'JSON'

# To read input text from a file, uncomment this block
#inputFile = open("myInputFileName", "r", encoding="utf-8")  # replace myInputFileName
#                                                            # by the name of your file
#text = inputFile.read()
#inputFile.close()
# Processing:
url = "https://portulanclarin.net/workbench/lx-depparser/api/"
request_data = {
    'method': 'parse',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': text,
        'tagset': tagset,
        'format': format,
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Result:")
    print(response_data["result"])

# To write the output to a file, uncomment this block
#outputFile = open("myOutputFileName", "w", encoding="utf-8")  # replace myOutputFileName
#                                                              # by the name of your file
#output = response_data["result"]
#outputFile.write(output)
#outputFile.close()
# Getting access key status:
request_data = {
    'method': 'key_status',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Key status:")
    print(json.dumps(response_data["result"], indent=4))
Access key for the web service
Email address validation
To receive by email your access key for this web service, copy the code displayed below into the "Subject" field of an email message (leaving the message body empty) and send it to request@portulanclarin.net
Privacy: When your access key expires, your email address is automatically deleted from our records.
LX-DepParser's documentation
LX-DepParser
LX-DepParser is a free online service for the syntactic analysis of Portuguese. It allows the automatic parsing of sentences in Portuguese in terms of the grammatical functions of their words.
This service was developed and is maintained at the University of Lisbon by the NLX-Speech and Natural Language Group, Department of Informatics.
Parser
LX-DepParser is an MSTParser trained on Portuguese data.
The parser was trained on 22,118 sentences (comprising 250,056 word tokens) taken from the CINTIL-DependencyBank. This treebank is developed and maintained at the University of Lisbon by the NLX-Speech and Natural Language Group of the Department of Informatics. In terms of evaluation, LX-DepParser achieves an unlabeled attachment score (UAS) of 94.42 and a labeled attachment score (LAS) of 91.23, measured through 10-fold cross-validation.
Consequently, the parser output complies with the design options adopted for the construction of the CINTIL-DependencyBank (see "Annotation guidelines" below). The output of the parser can also be obtained in the Universal Dependencies format, which results from converting the original CINTIL output by means of a set of regular expression rules over dependency trees; this conversion may introduce some residual distortions.
Tagset
Grammatical function tagset
Tag | Category |
---|---|
C | Complement |
CARD | Cardinal in multi-word cardinals |
COORD | Coordination |
CONJ | Conjunction |
DEP | Dependency |
DO | Direct Object |
IO | Indirect Object |
M | Modifier |
N | Name in multi-word proper names |
OBL | Oblique Complement |
PRD | Predicate |
PUNCT | Punctuation |
ROOT | Sentence root |
SJ | Subject |
SJac | Subject of an anticausative |
SJcp | Subject of complex predicate |
SP | Specifier |
Part-of-speech tags (high granularity)
Tag | Category |
---|---|
A | Adjective |
AP | Adjective Phrase |
ADV | Adverb |
ADVP | Adverb Phrase |
C | Complementizer |
CP | Complementizer Phrase |
CARD | Cardinal |
CONJ | Conjunction |
CONJP | Conjunction Phrase |
D | Determiner |
DEM | Demonstrative |
N | Noun |
NP | Noun Phrase |
P | Preposition |
PP | Preposition Phrase |
POSS | Possessive |
QNT | Predeterminer |
S | Sentence |
V | Verb |
VP | Verb Phrase |
Inflection tags
Tag | Description |
---|---|
Tags for nominal categories | |
m | Masculine |
f | Feminine |
g | Indeterminate Gender |
s | Singular |
p | Plural |
n | Indeterminate Number |
dim | Diminutive |
sup | Superlative |
comp | Comparative |
Tags for verbs | |
1 | First Person |
2 | Second Person |
3 | Third Person |
pi | Presente do Indicativo |
ppi | Pretérito Perfeito do Indicativo |
ii | Pretérito Imperfeito do Indicativo |
mpi | Pretérito Mais que Perfeito do Indicativo |
fi | Futuro do Indicativo |
c | Condicional |
pc | Presente do Conjuntivo |
ic | Pretérito Imperfeito do Conjuntivo |
fc | Futuro do Conjuntivo |
imp | Imperativo |
Tags for infinitive verbs | |
ifl | Inflected |
nifl | Not Inflected |
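As a worked example of reading these tags, the feature string pi-3s on a verb combines the tense tag pi (Presente do Indicativo) with person 3 and number s (singular). The following decoder is a minimal illustrative sketch built from the tables above; it is not part of the service API.

# Illustrative decoder for verb inflection strings such as "pi-3s";
# the mappings come from the inflection tag tables above
TENSES = {
    "pi": "Presente do Indicativo",
    "ppi": "Pretérito Perfeito do Indicativo",
    "ii": "Pretérito Imperfeito do Indicativo",
    "mpi": "Pretérito Mais que Perfeito do Indicativo",
    "fi": "Futuro do Indicativo",
    "c": "Condicional",
    "pc": "Presente do Conjuntivo",
    "ic": "Pretérito Imperfeito do Conjuntivo",
    "fc": "Futuro do Conjuntivo",
    "imp": "Imperativo",
}
PERSONS = {"1": "first person", "2": "second person", "3": "third person"}
NUMBERS = {"s": "singular", "p": "plural", "n": "indeterminate number"}

def decode_verb_infl(infl):
    """Decode a verb inflection string like 'pi-3s' into a readable description."""
    tense, _, agreement = infl.partition("-")
    person = PERSONS.get(agreement[:1], agreement[:1])
    number = NUMBERS.get(agreement[1:2], agreement[1:2])
    return f"{TENSES.get(tense, tense)}, {person} {number}"

print(decode_verb_infl("pi-3s"))  # Presente do Indicativo, third person singular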
Annotation guidelines
The analyses produced by LX-DepParser are similar to the dependency representations found in the CINTIL-DependencyBank on which LX-DepParser was trained. This dependency treebank was designed along the principles described in the following handbook:
- Branco António, Sérgio Castro, João Silva, Francisco Costa, 2011, CINTIL DepBank Handbook: Design options for the representation of grammatical dependencies. Department of Informatics, University of Lisbon, Technical Reports series, nb. di-fcul-tr-11-03.
Authorship
LX-DepParser was developed by Rúben Reis, under the direction of António Branco at the NLX-Group on Natural Language and Speech.
Publications
Irrespective of the version of this tool that you use, when mentioning it, please cite this reference:
- Branco António, Sérgio Castro, João Silva, Francisco Costa, 2011, CINTIL DepBank Handbook: Design options for the representation of grammatical dependencies. Department of Informatics, University of Lisbon, Technical Reports series, nb. di-fcul-tr-11-03.
Contact us
You can contact us at the following email address: 'nlx' followed by '@' followed by 'di.fc.ul.pt'.
Acknowledgments
LX-DepParser was partially funded by FCT, the Foundation for Science and Technology, under contract FCT/PTDC/PLP/81157/2006 for the project SemanticShare.
License
No fee, attribution, all rights reserved, no redistribution, non-commercial, no warranty, no liability, no endorsement, temporary, non-exclusive, share-alike.
The complete text of this license is here.