
File processing
Input format: Input files must be in .txt FORMAT with UTF-8 ENCODING and contain PORTUGUESE TEXT. Input files and folders can also be compressed to the .zip format.
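Before uploading, it can help to check locally that your files meet the requirements above. The following is a minimal sketch, not part of the service: the helper names `is_utf8` and `zip_txt_files` are hypothetical, and the sketch only verifies the encoding and packs the files, it does not check that the text is Portuguese.

```python
import zipfile
from pathlib import Path


def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def zip_txt_files(folder, archive_path):
    """Pack every UTF-8 .txt file under `folder` into a .zip archive.

    Returns the list of archived file names, relative to `folder`.
    Files that are not valid UTF-8 are skipped.
    """
    folder = Path(folder)
    archived = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*.txt")):
            if is_utf8(path):
                name = path.relative_to(folder).as_posix()
                archive.write(path, arcname=name)
                archived.append(name)
    return archived
```

Calling `zip_txt_files("my_corpus", "my_corpus.zip")` (both names are placeholders) would produce an archive containing only the `.txt` files that passed the encoding check.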
Privacy: The input file you upload and the respective output files will be automatically deleted from our computer after being processed and the result downloaded by you. No copies of your files will be retained after your use of this service.
Email address validation
The size of your input file is large and its processing may take some time.
To receive by email a URL from which to download your processed file, please copy the code displayed below into the "Subject" field of an email message (with the message body empty) and send it to request@portulanclarin.net
Privacy: After we reply to you with the URL for download, your email address is automatically deleted from our records.
Designing your own experiment with a Jupyter Notebook
A Jupyter notebook (hereafter just notebook, for short) is a type of document that contains executable code interspersed with visualizations of code execution results and narrative text.
Below we provide an example notebook which you may use as a starting point for designing your own experiments using language resources offered by PORTULAN CLARIN.
Pre-requisites
To execute this notebook, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days. It allows you to submit a total of 1 billion characters by means of requests with no more than 4000 characters each. It allows you to make 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data sent to any PORTULAN CLARIN web service and the respective output will be automatically deleted from our computers after being processed. However, when running a notebook on an external service, such as the ones suggested below, you should take their data privacy policies into consideration.
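Because each request is limited to 4000 characters, longer texts have to be split before they are submitted. The following is a minimal sketch of one way to do this; `split_text` is a hypothetical helper, the sentence splitting is deliberately naive, and it assumes the limit is counted in Python string characters.

```python
import re

MAX_CHARS = 4000  # per-request limit stated above


def split_text(text, max_chars=MAX_CHARS):
    """Split `text` into chunks of at most `max_chars` characters.

    Chunks break after sentence-final punctuation where possible; a single
    sentence longer than `max_chars` is cut at the limit.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut sentences that exceed the limit on their own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate request. Note that entities spanning a chunk boundary may be missed, which is why the sketch prefers sentence boundaries.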
Running the notebook
You have three options to run the notebook presented below:
- Run on Binder: The Binder Project is funded by a 501c3 non-profit organization and is described in detail in the following paper:
  Jupyter et al., "Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale." Proceedings of the 17th Python in Science Conference. 2018. doi://10.25080/Majora-4af1f417-011
- Run on Google Colab: Google Colaboratory is a free-to-use product from Google Research.
- Download the notebook from our public Github repository and run it on your computer.
This is a more advanced option, which requires you to install Python 3 and Jupyter on your computer. For anyone without prior experience setting up a Python development environment, we strongly recommend one of the two options above.
This is only a preview of the notebook. To run it, please choose one of the options above.
Using LX-NER to make a quantitative analysis of a text
This is an example notebook that illustrates how you can use the LX-NER web service to analyse a text.
Before you run this example, replace access_key_goes_here by your webservice access key, below:
LXNER_WS_API_KEY = 'access_key_goes_here'
LXNER_WS_API_URL = 'https://portulanclarin.net/workbench/lx-ner/api/'
Importing required Python modules
The next cell will take care of installing the requests package, if not already installed, and make it available to use in this notebook.
try:
    import requests
except ImportError:
    # IPython shell escape: only works inside a notebook cell
    !pip3 install requests
    import requests
from IPython.display import HTML, display_html
Wrapping the complexities of the JSON-RPC API in a simple, easy to use function
The WSException class defined below will be used later to identify errors from the webservice.
class WSException(Exception):
    'Webservice Exception'
    def __init__(self, errordata):
        "errordata is a dict returned by the webservice with details about the error"
        assert isinstance(errordata, dict)
        super().__init__(errordata["message"])
        self.message = errordata["message"]
        # see https://json-rpc.readthedocs.io/en/latest/exceptions.html for more info
        # about JSON-RPC error codes
        if -32099 <= errordata["code"] <= -32000:  # Server Error
            if errordata["data"]["type"] == "WebServiceException":
                self.message += f": {errordata['data']['message']}"
            else:
                self.message += f": {errordata['data']!r}"
    def __str__(self):
        return self.message
The next function invokes the LX-NER webservice through its public JSON-RPC API.
def recognize(text, format):
    '''
    Arguments
        text: a string with a maximum of 4000 characters, Portuguese text, with
              the input to be processed
        format: either "tagged" or "JSON"
    Returns a string or JSON object with the output according to specification in
        https://portulanclarin.net/workbench/lx-ner/
    Raises a WSException if an error occurs.
    '''
    request_data = {
        'method': 'recognize',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'text': text,
            'format': format,
            'key': LXNER_WS_API_KEY,
        },
    }
    request = requests.post(LXNER_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
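When many requests are sent in a loop, the limit of 200 requests per hour mentioned in the pre-requisites has to be respected. The sketch below is one simple way to do that, under the assumption that spacing requests evenly (one every 18 seconds) is acceptable; the `Throttle` class is a hypothetical helper and not part of the web service or of the notebook.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_hour` happen per hour."""

    def __init__(self, max_per_hour=200, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # 18 s for 200 requests/hour
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

In a notebook, you would create one `Throttle()` and call its `wait()` method immediately before each call to the web service. Even spacing is stricter than the stated limit requires, but it avoids having to track a sliding one-hour window.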
Highlighting recognized entities
Let's define a function to pretty print a text with recognized named entities highlighted:
def print_text_with_nes(paragraphs):
    html = ["<div class=\"ner-output\">"]
    for paragraph in paragraphs:
        html.append("<p>")
        for sentence in paragraph:
            html.append("<span class=\"sentence\">")
            within_ne = False
            within_ne_rb = False
            for token in sentence:
                # ne = named entity recognized with statistical recognizer
                # ne_rb = named entity recognized with rule-based recognizer
                ne, ne_rb = token["ne"], token["ne_rb"]
                if within_ne and not ne.startswith("I-"):
                    # close previous named entity
                    html.append("</span>")
                    within_ne = False
                if within_ne_rb and not ne_rb.startswith("I-"):
                    # close previous rule-based named entity
                    html.append("</span>")
                    within_ne_rb = False
                if ne.startswith("B-"):
                    html.append(f'<span class="ne {ne[2:].lower()}">')
                    within_ne = True
                if ne_rb.startswith("B-"):
                    html.append(f'<span class="ne {ne_rb[2:].lower()}">')
                    within_ne_rb = True
                html.append(token["form"])
                if "R" in token["space"]:
                    html.append(" ")
            # close any entity still open at the end of the sentence
            if within_ne:
                html.append("</span>")
            if within_ne_rb:
                html.append("</span>")
            html.append("</span>")
        html.append("</p>")
    html.append("</div>")
    display_html(HTML("".join(html)))
Let's define a set of CSS rules for color-coding recognized named entities:
display_html(HTML("""<style>
    .ne {
        color: #000;
        background-color: #eee;
        margin: 3px;
        padding: 3px 5px;
        border-radius: 3px;
        font-weight: bold;
    }
    .ne.numex { color: brown; }
    .ne.measex { color: blue; }
    .ne.timex { color: green; }
    .ne.addrex { color: red; }
    .ne.per { color: brown; }
    .ne.org { color: blue; }
    .ne.loc { color: green; }
    .ne.evt { color: red; }
    .ne.wrk { color: purple; }
    .ne.msc { color: orchid; }
    .reference {
        float: left;
        padding: 16px;
        border: 1px dotted #aaa;
    }
</style>
"""))
The next function will print a reference for the color-coded highlighting of named entities:
def print_color_reference():
    display_html(HTML("""
        <p class="reference">Color coding for recognized named entities:
        <span class="ne numex">number</span>
        <span class="ne measex">measure</span>
        <span class="ne timex">time</span>
        <span class="ne addrex">address</span>
        <span class="ne per">person</span>
        <span class="ne org">organization</span>
        <span class="ne loc">location</span>
        <span class="ne evt">event</span>
        <span class="ne wrk">work</span>
        <span class="ne msc">miscellaneous</span>
        </p>
    """))
print_color_reference()
Color coding for recognized named entities: number measure time address person organization location event work miscellaneous
Next, we will use the functions we defined above to recognize named entities and pretty-print them as HTML; finally, we also print a reference for the color-coded highlighting:
text = '''
A final do Campeonato Europeu de Futebol de 2016 realizou-se em 10 de julho de 2016 no Stade de France
em Saint-Denis, França. Foi disputada entre Portugal e a França, que era a equipa anfitriã. Os portugueses
ganharam a partida e sagraram-se campeões europeus de futebol. Esta foi a segunda participação numa final
deste campeonato para Portugal e a terceira para a França. Os portugueses haviam participado anteriormente
nas edições de 1984 e em todas as edições desde 1996. O seu melhor resultado anterior foi em 2004, com o
título de vice-campeão. Já os franceses participaram em 1960, 1984 e em todas as edições desde 1992,
tendo-se sagrado campeões nas edições de 1984 e de 2000.
'''
result = recognize(text, format="JSON")
print_text_with_nes(result)
print_color_reference()
A final de_o Campeonato Europeu de Futebol de 2016 realizou-se em 10 de julho de 2016 em_ o Stade de France em Saint-Denis, França. Foi disputada entre Portugal e a França, que era a equipa anfitriã. Os portugueses ganharam a partida e sagraram-se campeões europeus de futebol. Esta foi a segunda participação em_uma final de_ este campeonato para Portugal e a terceira para a França. Os portugueses haviam participado anteriormente em_ as edições de 1984 e em todas as edições desde 1996. O seu melhor resultado anterior foi em 2004, com o título de vice-campeão. Já os franceses participaram em 1960, 1984 e em todas as edições desde 1992, tendo-se sagrado campeões em_ as edições de 1984 e de 2000.
Color coding for recognized named entities: number measure time address person organization location event work miscellaneous
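To go from highlighting to an actual quantitative analysis, the recognized entities can be counted by type. The sketch below assumes the JSON output has the structure that print_text_with_nes relies on (paragraphs containing sentences containing token dictionaries with "ne" and "ne_rb" tags, where a "B-" prefix opens an entity). The function name `count_entities`, the tag "O" for tokens outside any entity, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def count_entities(paragraphs):
    """Count recognized entities by type.

    `paragraphs` is assumed to be a list of paragraphs, each a list of
    sentences, each a list of token dicts with "ne" and "ne_rb" tags.
    """
    counts = Counter()
    for paragraph in paragraphs:
        for sentence in paragraph:
            for token in sentence:
                for tag in (token["ne"], token["ne_rb"]):
                    # A "B-" tag opens a new entity; "I-" continues one.
                    if tag.startswith("B-"):
                        counts[tag[2:]] += 1
    return counts


# Hand-made sample in the assumed format (not actual service output).
sample = [[[
    {"form": "Portugal", "space": "LR", "ne": "B-LOC", "ne_rb": "O"},
    {"form": "em", "space": "LR", "ne": "O", "ne_rb": "O"},
    {"form": "10", "space": "LR", "ne": "O", "ne_rb": "B-TIMEX"},
    {"form": "de", "space": "LR", "ne": "O", "ne_rb": "I-TIMEX"},
    {"form": "julho", "space": "LR", "ne": "O", "ne_rb": "I-TIMEX"},
]]]
print(count_entities(sample))
```

With a real result, `count_entities(result)` would give one count per entity type, which can then be tabulated or plotted.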
Getting the status of a webservice access key
def get_key_status():
    '''Returns a string with the detailed status of the webservice access key'''
    request_data = {
        'method': 'key_status',
        'jsonrpc': '2.0',
        'id': 0,
        'params': {
            'key': LXNER_WS_API_KEY,
        },
    }
    request = requests.post(LXNER_WS_API_URL, json=request_data)
    response_data = request.json()
    if "error" in response_data:
        raise WSException(response_data["error"])
    else:
        return response_data["result"]
get_key_status()
{'requests_remaining': 99999982, 'chars_remaining': 999989426, 'expiry': '2030-01-10T00:00+00:00'}
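The status dictionary can be turned into a quick check of whether the key is still usable. This is a minimal sketch that assumes the three fields shown in the output above and an ISO 8601 expiry timestamp; `summarize_key_status` is a hypothetical helper.

```python
from datetime import datetime, timezone


def summarize_key_status(status, now=None):
    """Return days until expiry and whether any quota is left."""
    now = now or datetime.now(timezone.utc)
    expiry = datetime.fromisoformat(status["expiry"])
    return {
        "days_left": (expiry - now).days,
        "usable": expiry > now
        and status["requests_remaining"] > 0
        and status["chars_remaining"] > 0,
    }


# Dictionary copied from the output shown above.
status = {
    "requests_remaining": 99999982,
    "chars_remaining": 999989426,
    "expiry": "2030-01-10T00:00+00:00",
}
print(summarize_key_status(status))
```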
Instructions to use this web service
The web service for this application is available at https://portulanclarin.net/workbench/lx-ner/api/.
Below you find an example of how to use this web service with Python 3.
This example resorts to the requests package. To install this package, run this command in the command line:
pip3 install requests
To use this web service, you need an access key, which you can obtain by clicking the button below. A key is valid for 31 days. It allows you to submit a total of 1 billion characters by means of requests with no more than 4000 characters each. It allows you to make 100,000 requests, at a rate of no more than 200 requests per hour.
For other usage regimes, you should contact the helpdesk.
The input data and the respective output will be automatically deleted from our computer after being processed. No copies will be retained after your use of this service.
import json
import requests # to install this library, enter in your command line:
# pip3 install requests
# This is a simple example to illustrate how you can use the LX-NER web service
# Requires: key is a string with your access key
# Requires: text is a string, UTF-8, with a maximum 4000 characters, Portuguese text, with
# the input to be processed
# Requires: format is a string, indicating the output format, which can be either
# 'tagged' or 'JSON'
# Ensures: output according to specification in https://portulanclarin.net/workbench/lx-ner/
# Ensures: dict with number of requests and characters input so far with the access key, and
# its date of expiry
key = 'access_key_goes_here' # before you run this example, replace access_key_goes_here by
# your access key
# this string can be replaced by your input
text = '''Longos anos o Ramalhete permanecera desabitado, com teias de aranha pelas grades
dos postigos térreos, e cobrindo-se de tons de ruína.'''
# To read input text from a file, uncomment this block
#inputFile = open("myInputFileName", "r", encoding="utf-8") # replace myInputFileName by
# the name of your file
#text = inputFile.read()
#inputFile.close()
format = 'tagged' # other possible value is 'JSON'
# Processing:
url = "https://portulanclarin.net/workbench/lx-ner/api/"
request_data = {
    'method': 'recognize',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'text': text,
        'format': format,
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Result:")
    print(response_data["result"])
# To write output in a file, uncomment this block
#outputFile = open("myOutputFileName","w", encoding="utf-8") # replace myOutputFileName by
# the name of your file
#output = response_data["result"]
#outputFile.write(output)
#outputFile.close()
# Getting access key status:
request_data = {
    'method': 'key_status',
    'jsonrpc': '2.0',
    'id': 0,
    'params': {
        'key': key,
    },
}
request = requests.post(url, json=request_data)
response_data = request.json()
if "error" in response_data:
    print("Error:", response_data["error"])
else:
    print("Key status:")
    print(json.dumps(response_data["result"], indent=4))
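The two requests in the example share the same JSON-RPC 2.0 envelope, so the repeated dictionaries can be factored into small helpers. The following sketch is one possible way to do that; `build_request` and `extract_result` are hypothetical names, and raising `RuntimeError` is an arbitrary choice made for illustration.

```python
def build_request(method, request_id=0, **params):
    """Build a JSON-RPC 2.0 request body like the ones used above."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "id": request_id,
        "params": params,
    }


def extract_result(response_data):
    """Return the "result" member, or raise if the response carries an error."""
    if "error" in response_data:
        error = response_data["error"]
        raise RuntimeError(
            f"web service error {error.get('code')}: {error.get('message')}"
        )
    return response_data["result"]
```

With these helpers, the body for the first call would be `build_request("recognize", text=text, format=format, key=key)` and the body for the second `build_request("key_status", key=key)`; either can be passed as the `json=` argument of `requests.post`.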
Access key for the web service
To receive your access key for this web service by email, please copy the code displayed below into the "Subject" field of an email message (with the message body empty) and send it to request@portulanclarin.net
Privacy: When your access key expires, your email address is automatically deleted from our records.
LX-NER's documentation
LX-NER
LX-NER is a freely available online service for the recognition of expressions for named entities in Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
You may also be interested in using our LX-Suite online service for the shallow processing of Portuguese.
Features
LX-NER takes a segment of Portuguese text and identifies, circumscribes and classifies the expressions for named entities it contains. Furthermore, each named entity receives a standard representation. It handles the following types of expressions:
- Number-based expressions
  - Numbers:
    Expressions denoting numbers are marked as NUMEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
    - Arabic: Entities expressed by a sequence of digits, with the option of using a period to separate a string of 3 digits, counting from the right.
    - Decimal: Entities expressed by an arabic number followed by a decimal part, with a comma separating both parts.
    - Non-compliant: Entities expressed by digits, the period and comma symbols, organized in any possible way. All entities not covered by the previous 2 subtypes are included here.
    - Roman: Entities expressed by the roman letters [IVXLCDM], in either uppercase or lowercase, with the string of letters obeying the well-formedness rules for roman numerals.
    - Cardinal: Entities that are expressed by a full or partial word description of an arabic or decimal number. A full cardinal numeral is composed of words, while a partial cardinal number is a hybrid composed of words and arabic or decimal numbers.
    - Fraction: Entities expressed by arabic, decimal or cardinal numbers, and specific symbols or expressions representing division.
    - Magnitude class: Entities expressed by arabic, decimal or cardinal numbers together with expressions representing numerical magnitude.
  - Measures:
    Terms expressing measure values are marked as MEASEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
    - Currency: Expressions composed of an arabic, decimal or cardinal number followed by a word or expression representing a currency (e.g. libras).
    - Time: Expressions composed of an arabic, decimal or cardinal number followed by a word or expression representing a time measure (e.g. segundos).
    - Scientific units: Expressions composed of an arabic, decimal or cardinal number followed by a word or expression representing a scientific unit (e.g. toneladas).
  - Time:
    Terms expressing time are marked as TIMEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
    - Date: Expressions representing a date, whose components can be a day of the week (e.g. Segunda-Feira), a day of the month (e.g. 27), a month (e.g. Novembro) or a year (e.g. 2006).
    - Time periods: Expressions made by arabic, roman or cardinal numbers and an explicit indication of a period of time concerning a specific year, decade or century.
    - Time of the day: Expressions with different formats, indicating a specific time of the day.
  - Addresses:
    Expressions conveying addresses are marked as ADDREX. A list of subparts is considered, allowing for a more refined classification of these expressions:
    - Global section: Expressions referring to the global position of a certain location (e.g. Rua Almeida Garrett). This address part is mandatory for an address to be recognized.
    - Local section: Expressions referring to a specific position within the global position (e.g. Nº 17 - 7º Dto).
    - Zip code: Expressions referring to the zip code component of an address (e.g. 3654-548 Lisboa).
- Name-based expressions
  - Names:
    Expressions conveying names are marked as NAMEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
    - Persons: Expressions conveying names of people, with the option of considering the job or social status of a person if present (e.g. Presidente Cavaco Silva).
    - Organizations: Expressions conveying names of companies (e.g. LG Electronics) and political organizations (e.g. ONU).
    - Locations: Expressions referring to specific geographical locations (e.g. Portugal).
    - Events: Expressions referring to competitions, conferences, workshops and similar events (e.g. 2ª Conferência Sobre o Acesso Livre ao Conhecimento).
    - Works: Expressions referring to movies, books, paintings and similar works (e.g. O Retrato de Dorian Gray).
    - Miscellaneous: Expressions referring to entities that can't be classified according to any of the previous subtypes (e.g. Boeing 747).
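To make the Arabic, Decimal and Roman subtype descriptions above more concrete, here is an illustrative sketch of how such definitions can be written as regular expressions. These patterns are our own reading of the prose descriptions and are not LX-NER's actual rules; the function `numex_subtype` is hypothetical.

```python
import re

# Arabic: digits, optionally grouped in threes with periods (e.g. 1.234.567).
ARABIC = re.compile(r"\d{1,3}(?:\.\d{3})+|\d+")
# Decimal: an Arabic number, a comma, and a decimal part (e.g. 2,34).
DECIMAL = re.compile(rf"(?:{ARABIC.pattern}),\d+")
# Roman: letters IVXLCDM in a well-formed order, upper or lower case.
# The lookahead rejects the empty string, which the rest would accept.
ROMAN = re.compile(
    r"(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})",
    re.IGNORECASE,
)


def numex_subtype(token):
    """Classify a token as 'decimal', 'arabic', 'roman' or None."""
    for name, pattern in (("decimal", DECIMAL), ("arabic", ARABIC), ("roman", ROMAN)):
        if pattern.fullmatch(token):
            return name
    return None


for example in ("257", "1.234.567", "2,34", "XXI", "IIII", "12.34"):
    print(example, "->", numex_subtype(example))
```

Tokens such as "12.34" match none of the three patterns; in the classification above they would fall under the Non-compliant subtype, which this sketch does not model.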
Evaluation
- Number-based expressions: The number-based component is built upon handcrafted regular expressions. It was developed and evaluated against a manually constructed test-suite including over 300 examples. It scored 85.19% precision and 85.91% recall.
- Name-based expressions: The name-based component is built upon stochastic procedures. It was trained over a manually annotated corpus of approximately 208,000 words, and evaluated against an unseen portion with approximately 52,000 words. It scored 86.53% precision and 84.94% recall.
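For readers unfamiliar with the two measures reported above: precision is the share of recognized expressions that are correct, and recall is the share of expected expressions that were recognized. The sketch below shows the standard computation; the counts are made up for illustration and are not LX-NER's evaluation data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as fractions between 0 and 1."""
    predicted = true_positives + false_positives  # everything the tool marked
    actual = true_positives + false_negatives     # everything it should have marked
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up counts for illustration only (not LX-NER's evaluation data).
p, r = precision_recall(true_positives=80, false_positives=20, false_negatives=10)
print(f"precision={p:.2%} recall={r:.2%}")
```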
Authorship
LX-NER is being developed by João Balsa, António Branco, Eduardo Ferreira and Sara Silveira, with the help of João Silva, of the NLX-Natural Language and Speech Group, at the University of Lisbon, Department of Informatics.
Acknowledgments
The work leading to the LX-NER was partly supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
References
Irrespective of the most recent version of this tool you may use, when mentioning it, please cite this reference:
Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. "Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project". In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).
Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. "Linguistic Resources and Software for Shallow Processing". In Actas do XXI Encontro da Associação Portuguesa de Linguística (APL'05).
Contact us
Contact us using the following e-mail address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Why LX-NER?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
Output encoding
Coverage and output encoding for number-based terms
terms for … | examples | output color |
---|---|---|
numbers | 257, setenta e sete, 6/34, … | brown |
measure | 75 kg, 2,34 horas, 52 EUR, … | blue |
time | 10:35, 7 de Maio, séc. XXI, … | green |
addresses | Av. de Paris, R. 1º de Maio, … | red |
Coverage and output encoding for name-based terms
terms for … | examples | output color |
---|---|---|
persons | João Silva, Ex. Sr. Dr. José Francisco … | brown |
organizations | DI-FCUL, Ordem dos Engenheiros, … | blue |
locations | Lisboa, Inglaterra, Serra da Estrela, … | green |
events | Euro 2004, Feira da Agricultura, … | red |
works | Os Lusíadas, A Guerra das Estrelas, … | purple |
miscellaneous | Natureza, Matemática, Psicologia, … | orchid |
License
No fee, attribution, all rights reserved, no redistribution, non commercial, no warranty, no liability, no endorsement, temporary, non exclusive, share alike.
The complete text of this license is here.