BDCamões DependencyBank (Part I)
Handle: https://hdl.handle.net/21.11129/0000-000D-F8AA-C (persistent URL to this page)
The BDCamões Corpus (Collection of Portuguese Literary Documents from the Digital Library of Camões I.P.) is a collection of literary documents written in Portuguese, in plain-text .txt format, with close to 4 million words from over 200 complete documents by 83 authors in 14 genres, covering a time span from the 15th to the 21st century and adhering to different orthographic conventions. This set of characteristics makes BDCamões an invaluable resource for research in Language Science and Technology and in Digital Humanities.
All documents in Modern Portuguese, as well as older documents whose edition has been transcribed into the modern orthographic norm, have been automatically parsed with state-of-the-art language processing tools for Portuguese (Branco and Silva, 2006). They are thus annotated with linguistic information that follows from the design of these tools, which is described in detail in their guidelines and documentation (Branco et al., 2015).
The resulting linguistic annotation comprises part-of-speech tags (e.g. PREP, ADV, etc.), morphology (lemmas for words from the open categories; gender and number for words from nominal categories; tense, aspect, person and number for verbs), named entities (in BIO notation), syntactic analysis in terms of graphs of grammatical dependencies (e.g. SJ, OBL, M, etc.), and semantic analysis in terms of semantic roles (e.g. ARG1, ARG2, LOC, etc.). A second version of the dependency graphs was obtained by converting them to the so-called Universal Dependencies (de Marneffe et al., 2014).
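As an illustration of the BIO notation mentioned above, the following minimal Python sketch groups a sequence of per-token BIO tags into entity spans. It assumes the common convention of `B-TYPE` / `I-TYPE` / `O` tags; the entity type labels (`PER`, `LOC`), the example sentence, and the function name `bio_to_spans` are hypothetical and are not taken from the BDCamões documentation, whose actual file layout and tag inventory should be checked in the guidelines cited above.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start, end_exclusive, entity_type) spans.

    Assumes tags of the form "B-TYPE", "I-TYPE" or "O" (an assumption,
    not the documented BDCamões tag set).
    """
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the type matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray I- tag with no open span: treat it as a span start.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical example sentence and tags, for illustration only.
tokens = ["Luís", "de", "Camões", "nasceu", "em", "Lisboa"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
for start, end, etype in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Returning token offsets rather than strings keeps the spans easy to align with the other annotation layers (part-of-speech tags, dependencies, semantic roles) that are attached to the same tokens.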
BDCamões DependencyBank is distributed in two parts. Part I is distributed with license CC-BY, and Part II with license MS NC-NoReD-ND 2.0.
Part I consists of 114 complete documents, comprising over 3.4 million tokens.