|SNS||Sentence with null subject|
LX-Parser is a freely available on-line service for constituency parsing of Portuguese sentences. This service was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
LX-Parser performs a syntactic analysis of Portuguese sentences in terms of their constituency structure.
LX-Parser is supported by the Stanford Parser. The parser developed at Stanford University is a statistical parser that is trained over a previously annotated corpus.
A total of 22,118 sentences from CINTIL Treebank were used for training. This treebank is being developed and maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
The parser uses probabilistic grammars. Under the Parseval metric it achieves an F-score of 89% (value obtained through 10-fold cross-validation).
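As a rough illustration of how a Parseval-style F-score is computed, the sketch below compares the labeled brackets of a gold parse with those of a predicted parse. It is a simplified version, not the evaluation script used for LX-Parser: brackets are treated as a set of (label, start, end) triples, and the function name `parseval` and the example brackets are hypothetical.

```python
from typing import Set, Tuple

# A labeled bracket: (constituent label, index of first token, index after last token).
Bracket = Tuple[str, int, int]


def parseval(gold: Set[Bracket], predicted: Set[Bracket]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over labeled brackets.

    Simplifying assumption: brackets are compared as sets, so repeated
    identical brackets (e.g. in unary chains) are counted once.
    """
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical three-token sentence: the predicted parse mislabels one bracket.
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)}
predicted = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("PP", 2, 3)}
precision, recall, f_score = parseval(gold, predicted)
print(precision, recall, f_score)
```

Here three of the four brackets match on both sides, so precision, recall and F-score are all 0.75.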
The syntactic analyses produced by LX-Parser are similar to the analyses found in the treebank on which LX-Parser was trained. This treebank was designed along the principles described in the following handbook:
- Branco, António, João Silva, Francisco Costa, Sérgio Castro, 2011, CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency. Department of Informatics, University of Lisbon, Technical Reports series, nb. di-fcul-tp-11-02.
LX-Parser was developed by Patricia Gonçalves and João Silva, managed by António Branco, at the NLX-Natural Language and Speech Group, partly in the scope of the SemanticShare Project, funded by FCT-Fundação para a Ciência e Tecnologia.
Whichever version of this tool you use, please cite the following reference when mentioning it:
- Silva, João, António Branco, Sérgio Castro and Ruben Reis. "Out-of-the-Box Robust Parsing of Portuguese". In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR'10), pp. 75–85.
Contact us using the following email address: 'nlx' concatenated with 'at' concatenated with 'di.fc.ul.pt'
This work was partly supported by FCT-Foundation for Science and Technology under the grant FCT/PTDC/PLP/81157/2006 for project SemanticShare.
The system uses the PHPSyntaxTree Visualizer and the Stanford Parser.
LX-Parser is made available as a standalone parser that you can download and run locally on your computer.
To use LX-Parser you must agree to its license. You will need:
- The parser model file, cintil.ser.gz
- Stanford Parser (requires Java 5 or later). Note that the model was created with version 1.6.5 of the parser. More recent versions of the software seem to be unable to load the model.
- LX-Tokenizer to tokenize input prior to parsing.
Example command line:
java -Xmx500m -cp /path/to/stanford-parser.jar \
    edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -tokenized -sentences newline -outputFormat oneline \
    -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel \
    cintil.ser.gz input.txt
A quick explanation of the options:
- For some more complex sentences, the default heap size used by Java might not be enough. We increase the maximum heap size to 500 megabytes with the -Xmx500m option.
- The path to the Stanford Parser JAR file is provided with the -cp option.
- The name of the Java class we wish to run (edu.stanford.nlp.parser.lexparser.LexicalizedParser).
- The input to the parser must already be tokenized (see LX-Tokenizer for details on tokenization decisions). We indicate this through the -tokenized option.
- Each sentence in the input is separated by newline. We indicate this with the -sentences newline option.
- The output format is one parse per line. NB: The parser always adds a ROOT node. You can remove it in a post-processing step.
- A class (BaseUnknownWordModel, part of the Stanford Parser package) that implements a baseline word model is used to handle unknown words. It is chosen by the -uwModel option.
- The final two arguments are the model file and the input file.
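The command line and the ROOT-removal post-processing step can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `build_command` and `strip_root` are hypothetical, the paths are placeholders, and the code assumes each output line has the form "(ROOT ...)" as produced by the one-parse-per-line output format.

```python
from typing import List


def build_command(jar_path: str, model_path: str, input_path: str) -> List[str]:
    """Assemble the argument list for the example command line (hypothetical helper)."""
    return [
        "java", "-Xmx500m",
        "-cp", jar_path,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-tokenized",
        "-sentences", "newline",
        "-outputFormat", "oneline",
        "-uwModel", "edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel",
        model_path,
        input_path,
    ]


def strip_root(parse_line: str) -> str:
    """Remove the outer ROOT node from a one-line parse.

    Assumes the line looks like "(ROOT <tree>)"; any other line is
    returned unchanged (apart from surrounding whitespace).
    """
    line = parse_line.strip()
    prefix = "(ROOT "
    if line.startswith(prefix) and line.endswith(")"):
        return line[len(prefix):-1].strip()
    return line


# Hypothetical parser output for a two-word sentence.
example = "(ROOT (S (NP (N Maria)) (VP (V dorme))))"
print(strip_root(example))
```

The list returned by `build_command` can be passed to `subprocess.run(..., capture_output=True, text=True)`, with `strip_root` then applied to each line of the captured standard output.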
The name contains LX because LX is the shorthand form Lisboners often use to refer to their hometown.