LX-Parser (beta version) is a freely available on-line service for constituency parsing of Portuguese sentences. This service was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
LX-Parser performs a syntactic analysis of Portuguese sentences in terms of their constituency structure.
LX-Parser is supported by the Stanford Parser.
The parser developed at Stanford University is a statistical parser that is trained on a previously annotated corpus.
A total of 22118 sentences from the CINTIL Treebank were used for training. This treebank is being developed and maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
The parser uses probabilistic grammars. Under the Parseval metric it achieves an F-score of 89% (value obtained through 10-fold cross-validation).
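To make the metric concrete, the following is a minimal sketch of a Parseval-style labeled bracket score for a single sentence. It is not the evaluation code used for LX-Parser: the nested-tuple tree representation and the function names `brackets` and `parseval` are assumptions made for illustration, and real evaluations usually apply further conventions (for example, ignoring part-of-speech nodes or punctuation) that are omitted here.

```python
from collections import Counter


def brackets(tree):
    """Collect labeled spans (label, start, end) from a nested-tuple tree.

    Assumed representation: a node is (label, child, child, ...),
    and a token is a plain string.
    """
    spans = Counter()

    def walk(node, start):
        if isinstance(node, str):
            # A token covers exactly one position.
            return start + 1
        label, *children = node
        end = start
        for child in children:
            end = walk(child, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def parseval(gold, predicted):
    """Return (precision, recall, F-score) over labeled brackets."""
    g, p = brackets(gold), brackets(predicted)
    matched = sum((g & p).values())  # brackets present in both trees
    precision = matched / sum(p.values())
    recall = matched / sum(g.values())
    if matched == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted analyses of the same three tokens.
gold = ("S", ("NP", "o", "gato"), ("VP", "dorme"))
predicted = ("S", ("NP", "o"), ("VP", "gato", "dorme"))
print(parseval(gold, predicted))
```

In this toy pair only the S bracket matches, so precision, recall and F-score are all one third. A figure such as the 89% above would be aggregated over all sentences in each cross-validation fold.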
The syntactic analyses produced by LX-Parser are similar to the analyses found in the treebank on which LX-Parser was trained. This treebank was designed along the principles described in the following handbook:
Branco, António, João Silva, Francisco Costa, Sérgio Castro, 2011, CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency. Department of Informatics, University of Lisbon, Technical Reports series, nb. di-fcul-tp-11-02.
LX-Parser is being developed at the NLX-Natural Language and Speech Group by Patrícia Gonçalves and João Silva, under the management of António Branco, partly in the scope of the SemanticShare Project, funded by FCT-Fundação para a Ciência e Tecnologia.
When mentioning this parser, this is the reference to be used:
- Silva, João and António Branco and Sérgio Castro and Ruben Reis. Out-of-the-Box Robust Parsing of Portuguese. In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR'10), pp. 75–85.
To use LX-Parser you must agree to its license.
Contact us using the following email address: 'nlx' concatenated with 'at' concatenated with 'di.fc.ul.pt'
This work was partly supported by FCT-Foundation for Science and Technology under the grant FCT/PTDC/PLP/81157/2006 for project SemanticShare.
The system uses the PHPSyntaxTree Visualizer and the Stanford Parser.
The name uses LX because LX is the "code" name Lisboners like to use to refer to their hometown.
LX-Parser is made available as a standalone parser that you can download and run locally on your computer. You will need:
- The parser model file, cintil.ser.gz
- Stanford Parser (requires Java 5 or later). Note that the model was created with version 1.6.5 of the parser. More recent versions of the software seem to be unable to load the model.
- LX-Tokenizer to tokenize input prior to parsing.
Example command line:
java -Xmx500m -cp /path/to/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -sentences newline -outputFormat oneline -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel cintil.ser.gz input.txt
A quick explanation of the options:
- For some more complex sentences, the default heap size used by Java might not be enough. We increase the maximum heap size to 500 megabytes with the -Xmx500m option.
- The path to the Stanford Parser JAR file is provided with the -cp option.
- Next comes the name of the Java class we wish to run (LexicalizedParser).
- The input to the parser must already be tokenized (see LX-Tokenizer for details on tokenization decisions). We indicate this through the -tokenized option.
- Each sentence in the input is separated by newline. We indicate this through the -sentences newline option.
- The output format is one parse per line. We indicate this through the -outputFormat oneline option. NB: The parser always adds a ROOT node. You can remove it in a post-processing step.
- A class (BaseUnknownWordModel, part of the Stanford Parser package) that implements a baseline word model is used to handle unknown words. It is chosen by the -uwModel option.
- The final two arguments are the model file and the input file.
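The invocation and the ROOT-removal step mentioned above can be sketched in Python as follows. This is only an illustration, not part of the LX-Parser distribution: the helper names `build_command`, `strip_root` and `parse_file` are hypothetical, the paths are placeholders, and the code assumes that each output line has the form "(ROOT ...)" as produced by the oneline output format.

```python
import subprocess

PARSER_CLASS = "edu.stanford.nlp.parser.lexparser.LexicalizedParser"
UNKNOWN_WORD_MODEL = "edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel"


def build_command(jar, model, input_file, heap="500m"):
    """Assemble the same command line as the example above."""
    return [
        "java", f"-Xmx{heap}", "-cp", jar, PARSER_CLASS,
        "-tokenized", "-sentences", "newline",
        "-outputFormat", "oneline",
        "-uwModel", UNKNOWN_WORD_MODEL,
        model, input_file,
    ]


def strip_root(line):
    """Remove the outer ROOT node from a one-line parse, if present."""
    line = line.strip()
    prefix = "(ROOT "
    if line.startswith(prefix) and line.endswith(")"):
        # Drop the "(ROOT " prefix and its matching final parenthesis.
        return line[len(prefix):-1].strip()
    return line


def parse_file(jar, model, input_file):
    """Run the parser (requires Java) and return parses without ROOT."""
    result = subprocess.run(
        build_command(jar, model, input_file),
        capture_output=True, text=True, check=True,
    )
    return [strip_root(l) for l in result.stdout.splitlines() if l.strip()]


# strip_root can be tried on its own, without running the parser:
print(strip_root("(ROOT (S (NP (N Maria)) (VP (V dorme))))"))
```

With the files listed above in place, a call such as `parse_file("/path/to/stanford-parser.jar", "cintil.ser.gz", "input.txt")` would return one parse string per input sentence.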