LX-NER's documentation

LX-NER

LX-NER is a freely available online service for the recognition of expressions for named entities in Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.

You may also be interested in our LX-Suite online service for the shallow processing of Portuguese.

Features

LX-NER takes a segment of Portuguese text and identifies, circumscribes and classifies the expressions for named entities it contains. Furthermore, each named entity receives a standard representation. It handles the following types of expressions:

Number-based expressions

Numbers:
Expressions denoting numbers are marked as NUMEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
- Arabic:
  Entities expressed by a sequence of digits, optionally with a period separating each group of 3 digits, counting from the right.
- Decimal:
  Entities expressed by an arabic number followed by a decimal part, with a comma separating both parts.
- Non-compliant:
  Entities expressed by digits, periods and commas arranged in any other way. All such entities not covered by the previous 2 subtypes are included here.
- Roman:
  Entities expressed by the roman letters [IVXLCDM], in either uppercase or lowercase, with the string of letters obeying the well-formedness rules for roman numerals.
- Cardinal:
  Entities expressed by a full or partial word description of an arabic or decimal number. A full cardinal number is composed entirely of words, while a partial cardinal number is a hybrid composed of words and arabic or decimal numbers.
- Fraction:
  Entities expressed by arabic, decimal or cardinal numbers, and specific symbols or expressions representing division.
- Magnitude class:
  Entities expressed by arabic, decimal or cardinal numbers together with expressions representing numerical magnitude.
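
The digit-based and letter-based subtypes above can be illustrated with a few regular expressions. This is only a minimal sketch based on the descriptions in this list, not the rules LX-NER actually uses: the function name `classify_number` and the patterns are assumptions, the Roman pattern is limited to values from 1 to 3999, mixed-case Roman strings are not rejected, and the cardinal, fraction and magnitude class subtypes are left out because they depend on word lists.

```python
import re
from typing import Optional

# Arabic: plain digits, or groups of three digits separated by periods.
ARABIC = re.compile(r"\d+|\d{1,3}(?:\.\d{3})+")
# Decimal: an arabic number, a comma, then the decimal part.
DECIMAL = re.compile(r"(?:\d+|\d{1,3}(?:\.\d{3})+),\d+")
# Roman: well-formed numerals from 1 to 3999; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})",
    re.IGNORECASE,
)
# Non-compliant: any other mix of digits, periods and commas with at least one digit.
NON_COMPLIANT = re.compile(r"[\d.,]*\d[\d.,]*")


def classify_number(token: str) -> Optional[str]:
    """Return a NUMEX subtype name for a single token, or None if none applies."""
    if ARABIC.fullmatch(token):
        return "arabic"
    if DECIMAL.fullmatch(token):
        return "decimal"
    if ROMAN.fullmatch(token):
        return "roman"
    if NON_COMPLIANT.fullmatch(token):
        return "non-compliant"
    return None


for token in ["257", "1.234.567", "2,34", "XXI", "12.34", "IIII"]:
    print(token, classify_number(token))
```

The order of the checks matters: the non-compliant pattern would also accept any arabic or decimal number, so it is tried last, mirroring the "not covered by the previous subtypes" wording.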
Measures:
Terms expressing measure values are marked as MEASEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
- Currency:
  Expressions composed of an arabic, decimal or cardinal number followed by a word or expression representing a currency (e.g. libras).
- Time:
  Expressions composed of an arabic, decimal or cardinal number followed by a word or expression representing a time measure (e.g. segundos).
- Scientific units:
  Expressions composed of an arabic, decimal or cardinal number followed by a word or expression representing a scientific unit (e.g. toneladas).
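
As a rough illustration of the "number followed by a unit word" pattern shared by the three measure subtypes, the sketch below pairs a numeric pattern with a tiny unit lexicon. The lexicon, the name `classify_measure` and the returned tuple are all hypothetical; the words come from the examples on this page, and the service's real lexicon and output are not described here. Cardinal numbers written in words are not handled.

```python
import re
from typing import Optional, Tuple

# Illustrative lexicon only, built from examples given in this documentation.
UNITS = {
    "currency": {"libras", "EUR"},
    "time": {"segundos", "horas"},
    "scientific units": {"toneladas", "kg"},
}

# An arabic number (optionally with period-separated groups) and an optional decimal part.
NUMBER = r"\d+(?:\.\d{3})*(?:,\d+)?"
MEASURE = re.compile(rf"({NUMBER})\s+(\w+)")


def classify_measure(expression: str) -> Optional[Tuple[str, str, str]]:
    """Return (value, unit, subtype) if the expression is a known measure, else None."""
    match = MEASURE.fullmatch(expression.strip())
    if match is None:
        return None
    value, unit = match.groups()
    for subtype, words in UNITS.items():
        if unit in words:
            return value, unit, subtype
    return None


print(classify_measure("2,34 horas"))
print(classify_measure("52 EUR"))
```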
Time:
Terms expressing time are marked as TIMEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
- Date:
  Expressions representing a date, whose components can be a day of the week (e.g. Segunda-Feira), a day of the month (e.g. 27), a month (e.g. Novembro) or a year (e.g. 2006).
- Time periods:
  Expressions consisting of arabic, roman or cardinal numbers together with an explicit indication of a period of time concerning a specific year, decade or century.
- Time of the day:
  Expressions with different formats, indicating a specific time of the day.
Addresses:
Expressions conveying addresses are marked as ADDREX. A list of subparts is considered, allowing for a more refined classification of these expressions:
- Global section:
  Expressions referring to the global position of a certain location (e.g. Rua Almeida Garrett). This address part is mandatory for an address to be recognized.
- Local section:
  Expressions referring to a specific position within the global position (e.g. Nº 17 - 7º Dto).
- Zip code:
  Expressions referring to the zip code component of an address (e.g. 3654-548 Lisboa).
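
The zip code component follows a regular shape in the example above: four digits, a hyphen, three digits, then a locality name. A minimal sketch of matching that shape is given below. The pattern and the name `find_zip_codes` are assumptions for illustration, and treating the locality as a run of capitalized words is a simplification that misses names containing lowercase particles such as "de".

```python
import re
from typing import List, Optional, Tuple

# NNNN-NNN, optionally followed by one or more capitalized words (the locality).
ZIP_CODE = re.compile(
    r"\b(\d{4}-\d{3})\b"
    r"(?:\s+([A-ZÁÉÍÓÚÂÊÔÃÕÇ]\w*(?:\s+[A-ZÁÉÍÓÚÂÊÔÃÕÇ]\w*)*))?"
)


def find_zip_codes(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (code, locality) pairs; locality is None when no name follows the code."""
    return [(m.group(1), m.group(2)) for m in ZIP_CODE.finditer(text)]


print(find_zip_codes("Rua Almeida Garrett, Nº 17 - 7º Dto, 3654-548 Lisboa"))
```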

Name-based expressions

Names:
Expressions conveying names are marked as NAMEX. A list of subtypes is considered, allowing for a more refined classification of these expressions:
- Persons:
  Expressions conveying names of people, optionally including the person's job or social status when it is present (e.g. Presidente Cavaco Silva).
- Organizations:
  Expressions conveying names of companies (e.g. LG Electronics) and political organizations (e.g. ONU).
- Locations:
  Expressions referring to specific geographical locations (e.g. Portugal).
- Events:
  Expressions referring to competitions, conferences, workshops and similar events (e.g. 2ª Conferência Sobre o Acesso Livre ao Conhecimento).
- Works:
  Expressions referring to movies, books, paintings and similar works (e.g. O Retrato de Dorian Gray).
- Miscellaneous:
  Expressions referring to entities that can't be classified according to any of the previous subtypes (e.g. Boeing 747).

Evaluation

Number-based expressions
Name-based expressions

Authorship

LX-NER is being developed by João Balsa, António Branco, Eduardo Ferreira and Sara Silveira, with the help of João Silva, of the NLX-Natural Language and Speech Group, at the University of Lisbon, Department of Informatics.

Acknowledgments

The work leading to LX-NER was partly supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and by the European Commission under the contract FP6/STREP/27391 for the project LT4eL.

References

Whichever version of this tool you use, please cite the following when mentioning it:

Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. "Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project". In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).

Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. "Linguistic Resources and Software for Shallow Processing". In Actas do XXI Encontro da Associação Portuguesa de Linguística (APL'05).

Contact us

Why LX-NER?

LX, because LX is the "code" name Lisboners like to use for their hometown.

Output encoding

Coverage and output encoding for number-based terms
terms for …	examples	output
numbers	257, setenta e sete, 6/34, …	brown
measure	75 kg, 2,34 horas, 52 EUR, …	blue
time	10:35, 7 de Maio, séc. XXI, …	green
addresses	Av. de Paris, R. 1º de Maio, …	red
Coverage and output encoding for name-based terms
terms for …	examples	output
persons	João Silva, Ex. Sr. Dr. José Francisco …	brown
organizations	DI-FCUL, Ordem dos Engenheiros, …	blue
locations	Lisboa, Inglaterra, Serra da Estrela, …	green
events	Euro 2004, Feira da Agricultura, …	red
works	Os Lusíadas, A Guerra das Estrelas, …	purple
miscellaneous	Natureza, Matemática, Psicologia, …	orchid
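
The two tables above amount to a lookup from entity class to output color. The sketch below writes that lookup down and uses it to wrap an expression in a colored HTML span. The dictionary keys and colors are taken from the tables, but the `highlight` function and the span markup are purely hypothetical: this page does not describe how the service itself applies the colors.

```python
from html import escape

# Entity class -> output color, as listed in the two tables above.
OUTPUT_COLORS = {
    # number-based terms
    "numbers": "brown",
    "measure": "blue",
    "time": "green",
    "addresses": "red",
    # name-based terms
    "persons": "brown",
    "organizations": "blue",
    "locations": "green",
    "events": "red",
    "works": "purple",
    "miscellaneous": "orchid",
}


def highlight(expression: str, kind: str) -> str:
    """Wrap an expression in a span colored according to its entity class."""
    color = OUTPUT_COLORS[kind]  # raises KeyError for an unknown class
    return f'<span style="color: {color}">{escape(expression)}</span>'


print(highlight("Lisboa", "locations"))
```

Note that the same four colors are reused across the two tables, so a color identifies a class only once it is known whether the output is number-based or name-based.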

License

The complete text of this license is here.

To:	`request@portulanclarin.net`
Subject:
