LX-Parser is a freely available on-line service for constituency parsing of Portuguese sentences. This service was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
LX-Parser performs a syntactic analysis of Portuguese sentences in terms of their constituency structure.
LX-Parser is based on the Stanford Parser. The parser developed at Stanford University is a statistical parser that is trained over a previously annotated corpus (a treebank).
A total of 22,118 sentences from CINTIL-Treebank were used for training. This treebank is being developed and maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
The parser uses probabilistic grammars. Under the Parseval metric, it achieves an F-score of 89% (value obtained through 10-fold cross-validation).
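For reference, the Parseval F-score is conventionally computed from labeled-constituent precision and recall. The formulation below is the standard textbook definition, given here as a sketch; the symbols P, R and F are illustrative notation and are not taken from the LX-Parser documentation.

```latex
\[
P = \frac{\text{correct constituents in the parser output}}{\text{all constituents in the parser output}},
\qquad
R = \frac{\text{correct constituents in the parser output}}{\text{all constituents in the gold tree}},
\qquad
F = \frac{2PR}{P + R}
\]
```

A constituent counts as correct when its label and its span both match a constituent in the gold-standard tree.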
The syntactic analyses produced by LX-Parser are similar to the analyses found in the treebank on which LX-Parser was trained. This treebank was designed along the principles described in the following handbook:
- Branco, António, João Silva, Francisco Costa, Sérgio Castro, 2011, CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency. Department of Informatics, University of Lisbon, Technical Reports series, nb. di-fcul-tp-11-02.
LX-Parser was developed by Patricia Gonçalves and João Silva, managed by António Branco, at the NLX-Natural Language and Speech Group, partly in the scope of the SemanticShare Project, funded by FCT-Fundação para a Ciência e Tecnologia.
Whichever version of this tool you use, please cite the following reference when mentioning it:
- Silva, João, António Branco, Sérgio Castro and Ruben Reis, 2010, "Out-of-the-Box Robust Parsing of Portuguese". In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR2010), Lecture Notes in Artificial Intelligence, 6001, Berlin, Springer, pp.75–85.
Contact us using the following email address: 'nlx' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
This work was partly supported by FCT-Foundation for Science and Technology under the grant FCT/PTDC/PLP/81157/2006 for project SemanticShare. The system uses the PHPSyntaxTree Visualizer and the Stanford Parser.
LX-Parser is also made available as a standalone parser that you can download and run locally on your computer.
LX-Parser is distributed under an MIT license. To run it locally, you need the following:
- The parser model file, cintil.ser.gz.
- Stanford Parser (requires Java 5 or later). Note that the model was created with version 1.6.5 of the parser. More recent versions of the software seem to be unable to load the model.
- LX-Tokenizer to tokenize input prior to parsing.
Example command line:
java -Xmx500m -cp /path/to/stanford-parser.jar \
    edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -tokenized -sentences newline -outputFormat oneline \
    -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel \
    cintil.ser.gz input.txt
A quick explanation of the options:
- For some more complex sentences, the default heap size used by Java might not be enough. We increase the maximum heap size to 500 megabytes with the -Xmx500m option.
- The path to the Stanford Parser JAR file is provided with the -cp option.
- The name of the Java class we wish to run (edu.stanford.nlp.parser.lexparser.LexicalizedParser).
- The input to the parser must already be tokenized (see LX-Tokenizer for details on tokenization decisions). We indicate this through the -tokenized option.
- Each sentence in the input is separated by newline. We indicate this through the -sentences newline option.
- The output format is one parse per line (-outputFormat oneline). NB: The parser always adds a ROOT node. You can remove it in a post-processing step.
- A class (BaseUnknownWordModel, part of the Stanford Parser package) that implements a baseline word model is used to handle unknown words. It is chosen by the -uwModel option.
- The final two arguments are the model file and the input file.
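The post-processing step that removes the ROOT node can be sketched as a small shell filter. This is a minimal sketch, not part of the LX-Parser distribution: the function name strip_root is hypothetical, and it assumes each parse is on a single line of the form "(ROOT ...)", as produced with -outputFormat oneline.

```shell
# strip_root: hypothetical helper, not shipped with LX-Parser.
# Reads one-line parses on stdin and drops the outer "(ROOT ... )" wrapper.
# Lines that do not start with "(ROOT " are passed through unchanged.
strip_root() {
    sed -E 's/^\(ROOT (.*)\)$/\1/'
}

# Example (assumes the parser writes its parses to standard output):
#   java ... cintil.ser.gz input.txt | strip_root > parses.txt
```

Because the pattern is anchored at both ends of the line, only the outermost wrapper is removed and the inner bracketing is left intact.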
Why LX? Because LX is the shorthand form Lisboners often use to refer to their hometown.