LX-Parser's documentation

LX-Parser

LX-Parser is a freely available on-line service for constituency parsing of Portuguese sentences. This service was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.

LX-Parser performs a syntactic analysis of Portuguese sentences in terms of their constituency structure.

Supporting parser

LX-Parser is supported by the Stanford Parser. The Stanford Parser, developed at Stanford University, is a statistical parser trained on a previously annotated corpus.

A total of 22,118 sentences from CINTIL-Treebank were used for training. This treebank is being developed and maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.

The parser uses probabilistic grammars. Under the Parseval metric, it achieves an F-score of 89%, a value obtained through 10-fold cross-validation.
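As a rough illustration of what a Parseval-style F-score measures, the sketch below computes labeled-bracket precision, recall and their harmonic mean for a single sentence. The function name `parseval_scores` and the example brackets are hypothetical, and the exact evaluation setup behind the 89% figure is not described here, so this should be read as a minimal sketch rather than the procedure actually used.

```python
from collections import Counter


def parseval_scores(gold_brackets, test_brackets):
    """Return (precision, recall, f_score) over labeled brackets.

    Each bracket is a (label, start, end) tuple. Counters are used so that
    a repeated bracket is matched at most as often as it occurs in both sets.
    """
    gold = Counter(gold_brackets)
    test = Counter(test_brackets)
    matched = sum((gold & test).values())
    precision = matched / sum(test.values()) if test else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical brackets for a three-token sentence (not real parser output).
gold = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
test = [("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)]
precision, recall, f_score = parseval_scores(gold, test)
```

Here two of the three proposed brackets match the gold analysis, so precision, recall and F-score are all two thirds.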

Annotation guidelines

The syntactic analyses produced by LX-Parser are similar to the analyses found in the treebank on which LX-Parser was trained. This treebank was designed along the principles described in the following handbook:

Branco, António, João Silva, Francisco Costa and Sérgio Castro, 2011, CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency. Department of Informatics, University of Lisbon, Technical Reports series, nb. di-fcul-tp-11-02.

Authorship

LX-Parser was developed by Patricia Gonçalves and João Silva, managed by António Branco, at the NLX-Natural Language and Speech Group, partly in the scope of the SemanticShare Project, funded by FCT-Fundação para a Ciência e Tecnologia.

Publications

Whichever version of this tool you use, please cite the following reference when mentioning it:

Silva, João, António Branco, Sérgio Castro and Ruben Reis, 2010, "Out-of-the-Box Robust Parsing of Portuguese". In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR2010), Lecture Notes in Artificial Intelligence, 6001, Berlin, Springer, pp.75–85.

Contact us

Acknowledgments

This work was partly supported by FCT-Foundation for Science and Technology under the grant FCT/PTDC/PLP/81157/2006 for project SemanticShare. The system uses the PHPSyntaxTree Visualizer and the Stanford Parser.

Release

LX-Parser is made available as a standalone parser that you can download and run locally on your computer.

License

LX-Parser is distributed under an MIT license.

Required download

The parser model file, cintil.ser.gz.
Stanford Parser (requires Java 5 or later). Note that the model was created with version 1.6.5 of the parser. More recent versions of the software seem to be unable to load the model.
LX-Tokenizer to tokenize input prior to parsing.

Instructions

Example command line:

java -Xmx500m -cp /path/to/stanford-parser.jar \
    edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -tokenized -sentences newline -outputFormat oneline \
    -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel \
    cintil.ser.gz input.txt

A quick explanation of the options:

For some more complex sentences, the default heap size used by Java might not be enough. We increase the maximum heap size to 500 megabytes with the -Xmx500m option.
The path to the Stanford Parser JAR file is provided with the -cp option.
Next comes the name of the Java class we wish to run (LexicalizedParser).
The input to the parser must already be tokenized (see LX-Tokenizer for details on tokenization decisions). We indicate this through the -tokenized option.
Each sentence in the input is separated by newline. We indicate this through the -sentences newline option.
The output format is one parse per line. We indicate this through the -outputFormat oneline option. NB: The parser always adds a ROOT node. You can remove it in a post-processing step.
A class (BaseUnknownWordModel, part of the Stanford Parser package) that implements a baseline word model is used to handle unknown words. It is chosen by the -uwModel option.
The final two arguments are the model file and the input file.
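The post-processing step that removes the ROOT node could be sketched as follows. This assumes each output line has the shape `(ROOT (S ...))`, with a single outer ROOT bracket wrapping the parse; the function name `strip_root` and the example sentence are hypothetical and not part of LX-Parser or the Stanford Parser.

```python
def strip_root(parse_line):
    """Remove an outer (ROOT ...) wrapper from a one-line bracketed parse.

    Lines that do not start with a ROOT bracket are returned unchanged
    (apart from surrounding whitespace being trimmed).
    """
    text = parse_line.strip()
    prefix = "(ROOT"
    if not (text.startswith(prefix) and text.endswith(")")):
        return text
    rest = text[len(prefix):]
    # Make sure the label is exactly ROOT, not a longer label such as ROOTX.
    if not rest or not (rest[0].isspace() or rest[0] == "("):
        return text
    # Drop the closing bracket that belongs to ROOT.
    return rest[:-1].strip()


# Hypothetical example line, not actual parser output.
example = "(ROOT (S (NP (N Maria)) (VP (V dorme))))"
stripped = strip_root(example)
# stripped == "(S (NP (N Maria)) (VP (V dorme)))"
```

To process a whole output file, the same function can be applied to each line in turn, since the oneline format places one parse per line.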

Tagset

Tag	Category
A	Adjective
AP	Adjective Phrase
ADV	Adverb
ADVP	Adverb Phrase
C	Complementizer
CL	Clitics
CP	Complementizer Phrase
CARD	Cardinal
CONJ	Conjunction
CONJP	Conjunction Phrase
D	Determiner
DEM	Demonstrative
N	Noun
NP	Noun Phrase
O	Ordinals
P	Preposition
PP	Preposition Phrase
PPA	Past Participles/Adjectives
POSS	Possessive
PRS	Personals
QNT	Predeterminer
REL	Relatives
S	Sentence
V	Verb
VP	Verb Phrase
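When reading parser output, it can help to expand each node label into its category name. The sketch below does this with a simple regular expression over a one-line bracketed parse. The `TAGSET` dictionary copies the table above; the function name `describe_labels` and the example parse are hypothetical, and the approach assumes that literal parentheses do not appear as tokens in the parse.

```python
import re

# Tag-to-category mapping copied from the tagset table.
TAGSET = {
    "A": "Adjective", "AP": "Adjective Phrase",
    "ADV": "Adverb", "ADVP": "Adverb Phrase",
    "C": "Complementizer", "CL": "Clitics", "CP": "Complementizer Phrase",
    "CARD": "Cardinal",
    "CONJ": "Conjunction", "CONJP": "Conjunction Phrase",
    "D": "Determiner", "DEM": "Demonstrative",
    "N": "Noun", "NP": "Noun Phrase",
    "O": "Ordinals",
    "P": "Preposition", "PP": "Preposition Phrase",
    "PPA": "Past Participles/Adjectives",
    "POSS": "Possessive", "PRS": "Personals",
    "QNT": "Predeterminer", "REL": "Relatives",
    "S": "Sentence", "V": "Verb", "VP": "Verb Phrase",
}


def describe_labels(parse_line):
    """List (tag, category) pairs for every node label, in order of appearance."""
    # A node label is the run of non-space, non-bracket characters after "(".
    labels = re.findall(r"\(([^\s()]+)", parse_line)
    return [(label, TAGSET.get(label, "unknown")) for label in labels]


# Hypothetical example parse, not actual parser output.
pairs = describe_labels("(S (NP (N Maria)) (VP (V dorme)))")
```

Labels that are not in the table, such as the ROOT node added by the parser, are reported as "unknown".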

Why LX-Parser?

LX because LX is the shorthand form Lisboners often use to refer to their hometown.

License

The complete text of this license is here.

To:	`request@portulanclarin.net`