Porttinari – PORTuguese Treebank
Handle: | https://hdl.handle.net/21.11129/0000-0011-2ACB-A (persistent URL to this page) |
---|---|
URL: | https://sites.google.com/icmc.usp.br/poetisa/porttinari |
Porttinari-base (Duran et al., 2023) is the journalistic portion of Porttinari (short for "PORTuguese Treebank"), which is planned as a large multi-genre treebank for Portuguese (Pardo et al., 2021) and follows the Universal Dependencies international grammar framework (de Marneffe et al., 2021).
As reported by Duran et al. (2023), Porttinari currently comprises three subcorpora with different characteristics and purposes:
· Porttinari-base (released here), a corpus manually revised in detail to serve as a gold standard (divided into training, development and test folds), with an average annotation review agreement (kappa) of 97.8% for part-of-speech tags and 96.2% for dependency relations. It has 8,418 sentences and 168,080 tokens;
· Porttinari-check, a small corpus structurally similar to Porttinari-base, intended as a testbed for additional and more diverse evaluations and as an illustration of the contrast between manual and automatic annotation. It has 1,685 sentences and 33,576 tokens;
· Porttinari-automatic, a very large corpus automatically annotated by a state-of-the-art parser trained on Porttinari-base. It has 3,954,218 sentences and 94,444,424 tokens.
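The sentence and token counts listed above can be checked against a downloaded file with a few lines of Python. The sketch below assumes the release is distributed in CoNLL-U, the usual exchange format for Universal Dependencies treebanks; this page does not state the file format, the file names, or the counting convention behind the reported figures, so the function name `count_conllu` and the sample sentence are illustrative only. It counts syntactic words and skips multiword-token ranges and empty nodes, which may differ from how the reported totals were computed.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, words) for an iterable of CoNLL-U lines.

    Assumption: only lines whose ID column is a plain integer are counted.
    Multiword-token ranges (e.g. "2-3") and empty nodes (e.g. "2.1") are skipped.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            words += 1
            in_sentence = True
    if in_sentence:
        # The input ended without a trailing blank line.
        sentences += 1
    return sentences, words


# Made-up sample sentence, not taken from the corpus.
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Ela chegou.\n"
    "1\tEla\tele\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tchegou\tchegar\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    n_sentences, n_words = count_conllu(SAMPLE.splitlines())
    print(n_sentences, n_words)
```

To run it on a real file, pass an open file handle instead of the sample, for example `count_conllu(open(path, encoding="utf-8"))`, where `path` is whatever local name the downloaded file has.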
The texts in the treebank come from the Folha de São Paulo newspaper and are publicly available on the Kaggle website.
For the interested reader, Porttinari-check and Porttinari-automatic, as well as other related information, may be accessed at https://sites.google.com/icmc.usp.br/poetisa/porttinari