
Developed at the University of Lisbon, by NLX — Natural Language and Speech Group and CLUL — Centro de Linguística da Universidade de Lisboa
Table of contents
- CINTIL online concordancer
- CINTIL corpus
- Acquiring CINTIL
- Authorship
- Contact us
- References
- Acknowledgements
CINTIL online concordancer
CINTIL online concordancer (beta version) is a freely available online concordancing service that supports research use of the CINTIL Corpus. This concordancer was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics, in cooperation with the REPORT Group of CLUL-Centro de Linguística da Universidade de Lisboa.
CINTIL concordancer allows the use of generic patterns to specify the occurrences to be retrieved. This makes it possible to uncover linguistic structures of high complexity and to use this service as a powerful research tool.
You may also be interested in using the companion tools.
CINTIL corpus
CINTIL-Corpus Internacional do Português is a linguistically interpreted corpus of Portuguese. At present it is composed of 1 million annotated tokens, each of which has been verified by human expert annotators. The annotation comprises information on part-of-speech, lemma and inflection for open classes, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).
This corpus is being developed and maintained at the University of Lisbon by the REPORT Group of CLUL-Centro de Linguística da Universidade de Lisboa in cooperation with NLX-Natural Language and Speech Group of the Department of Informatics. It was the first of its class to be developed for Portuguese in terms of the combined dimensions of size, depth of linguistic information, range of domains and sources, and level of accuracy. The present version is the most recent outcome of an ongoing and long-term endeavour to continuously enlarge and refine this corpus along all these dimensions, with the purpose of providing an enhanced resource for the research on the Linguistics of Portuguese and the development of language technology.
Acquiring CINTIL
The CINTIL corpus is released through ELDA-Evaluation and Language Resources Distribution Agency. Details are provided here.
Authorship
The CINTIL Corpus received several contributions:
- Raw text, and previous versions
REPORT Group of the CLUL-Centro de Linguística da Universidade de Lisboa
- Present version
The CINTIL Corpus was developed, between March 2004 and December 2006, under the coordination of António Branco (FCUL-Faculdade de Ciências da Universidade de Lisboa) and Maria Fernanda Bacelar do Nascimento (CLUL-Centro de Linguística da Universidade de Lisboa), by a team including Sandra Antunes (CLUL), Florbela Barreto (CLUL), José Bettencourt Gonçalves (CLUL), João Silva (FCUL), Amália Mendes (CLUL) and Filipe Nunes (FCUL), partly in the scope of the TagShare Project, funded by FCT-Fundação para a Ciência e Tecnologia under the research contract POSI/PLP/47058/2002.
Contact us
Contact us using the following email address: 'cintil' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
CINTIL is an ongoing endeavour to develop a corpus with increasingly enhanced accuracy. If, after checking the underlying assumptions under which the current version was produced (see here), you detect something that deserves to be improved, please let us know.
Please note that this is not an online linguistic help-desk service, and questions unrelated to the CINTIL Corpus will not be answered.
References
Barreto, Florbela, António Branco, Eduardo Ferreira,
Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and
João Silva, 2006, "Open Resources and Tools for the Shallow
Processing of Portuguese", Proceedings of the 5th
International Conference on Language Resources and Evaluation
(LREC2006), Genoa, Italy.
Branco, António and João Silva, 2006,
"LX-Suite: Shallow Processing Tools for Portuguese", Proceedings of
the 11th Conference of the European Chapter of the Association
for Computational Linguistics (EACL2006), Trento, Italy,
pp.179-182.
Barreto, Florbela, António Branco, Eduardo Ferreira,
Amália Mendes, Fernanda Bacelar Nascimento, Filipe Nunes and
João Silva, 2006, "Linguistic Resources and Software for
Shallow Processing", In Actas do XXI Encontro Anual da
Associação Portuguesa de Linguística, Lisbon,
Portugal.
Acknowledgments
The work leading to the CINTIL Corpus was partly supported by
FCT-Fundação para a Ciência e Tecnologia under the
grant POSI/PLP/47058/2002 for the project TagShare.
We are very grateful to Adam Przepiórkowski and his team, from the
IPIPAN - The Institute of Computer Science of the Polish Academy
of Sciences, Warsaw, for the support in the adaptation of Poliqarp
to the Portuguese language and CINTIL constraints.
Table of contents
- Cheat sheet / Quick reference
- Query outcome
- User interface
- Searching orthographic forms
- Searching through linguistic information
- Advanced queries
Cheat sheet / Quick reference
Field | Keyword | Values |
---|---|---|
Orthographic form | orth | any |
Part-of-speech tag | pos | full table |
Inflection feature | gender | f, m, g |
Inflection feature | number | s, p, n |
Inflection feature | degree | dim, sup, comp |
Inflection feature | person | 1, 2, 3 |
Inflection feature | time | full table |
Inflection feature | inflection | ifl, nifl |
Lemma (base form) | base | any |
Named-entity | iob | full table |
Metadata | source | writtennews writtenfiction writtenother spoken |
Query outcome
The CINTIL Online Concordancer retrieves passages containing occurrences of a given target expression in the CINTIL corpus. The target expression is entered in the query text box. The retrieved passages are displayed below that box.
When the "Show tags" box is checked, the concordancer also displays the linguistic annotation.
For each token, this annotation is displayed between square brackets, with a colon separating each field. For instance, the annotation for the common noun gatas will be displayed as follows:
occurrence with annotation ⟶ gatas [ gato : cn : f : p : O ]
keywords ⟶ orth [ base : pos : gender : number : IOB ]
Note that this annotation is displayed in a slightly different format than the one used in the corpus release. For a description of the latter, check here.
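As a minimal sketch of how the displayed annotation could be read back into named fields, the following Python snippet splits the bracketed part on colons. The function name `parse_annotation` is hypothetical, and the field order is an assumption taken from the keywords shown for the noun example above; tokens of other categories may carry different fields.

```python
# Field order assumed from the keywords shown for the noun example.
FIELDS = ["base", "pos", "gender", "number", "iob"]

def parse_annotation(form, annotation):
    """Turn a form plus '[ gato : cn : f : p : O ]' into a dict of named fields."""
    inner = annotation.strip().strip("[]")
    values = [value.strip() for value in inner.split(":")]
    token = {"orth": form}
    token.update(zip(FIELDS, values))
    return token

token = parse_annotation("gatas", "[ gato : cn : f : p : O ]")
print(token)
```

The resulting dictionary uses the same keywords that queries use, which makes it easy to relate a displayed passage to the query that would retrieve it.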
For practical reasons, each passage returned with the occurrence contains at most 10 tokens.
Also for practical reasons, not all passages with occurrences of the target expression in the CINTIL corpus are returned, and the order in which the passages are displayed does not correspond to their order of occurrence in the corpus. Note, however, that the outcome of the CINTIL online concordancer can be used as a reference in research, given that identical queries return identical results.
When it is imperative to have access to every occurrence, the interested user can acquire a copy of the corpus and run a concordancer of their choice over that local copy.
User interface
The online concordancer user interface is quite self-explanatory. The "Sort" buttons sort the results alphabetically according to their context.
The right-hand side button sorts the passages using their right side context.
The left-hand side button sorts the passages according to the context on their left side, from right to left.
The following example illustrates the use of these two buttons over the outcome of the same search for carro (with 2 words of left context and 1 word of right context):
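The two sort orders can be sketched in Python as follows. This is only an illustration under stated assumptions: the concordance lines are invented, the function names `sort_right` and `sort_left` are hypothetical, and Python's default string ordering stands in for whatever collation the concordancer actually uses.

```python
# Hypothetical concordance lines: (left context, keyword, right context).
hits = [
    (["comprou", "um"], "carro", ["novo"]),
    (["vendeu", "o"], "carro", ["velho"]),
    (["lavou", "aquele"], "carro", ["azul"]),
]

def sort_right(lines):
    """Sort by the right context, read from left to right."""
    return sorted(lines, key=lambda line: line[2])

def sort_left(lines):
    """Sort by the left context, read from the keyword outwards (right to left)."""
    return sorted(lines, key=lambda line: list(reversed(line[0])))

for left, keyword, right in sort_left(hits):
    print(" ".join(left), keyword, " ".join(right))
```

Sorting on the reversed left context means that the word immediately before the keyword is compared first, so passages with the same preceding word end up next to each other.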
Searching orthographic forms
- Case-sensitiveness
- Search is case-sensitive. For a case-insensitive search, append /i to the orthographic form:
- by entering gato, occurrences of gato are obtained
- gato/i gets occurrences of gato, Gato, GATO, etc.
- Sub-sequence matching
- The query expression matches whole tokens. For instance, gato will not match parts of words, and will not return regato or obrigatoriamente.
To allow sub-sequence matching, append /x to the orthographic form (which can be combined with the /i mentioned previously).
For instance:
- gato will only match gato
- gato/x will match any word containing the string gato, such as obrigatoriamente
- gato/xi is as above, but case-insensitive
- Contractions
- Note that in the CINTIL corpus contractions (e.g. daquela, aos, nas) are expanded and encoded as two tokens, with an underscore symbol appended to the first (e.g. de_ aquela, a_ os, em_ as).
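The effect of the /i and /x modifiers can be approximated with Python's `re` module. This is a sketch of the behaviour described above, not the concordancer's implementation; the helper name `matches` and its `flags` argument are hypothetical.

```python
import re

def matches(form, token, flags=""):
    """Emulate a plain orthographic query with optional 'i' and 'x' modifiers."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    pattern = re.escape(form)  # the form is taken literally, not as a pattern
    if "x" in flags:
        # /x: the form may occur anywhere inside the token
        return re.search(pattern, token, re_flags) is not None
    # default: the form must cover the whole token
    return re.fullmatch(pattern, token, re_flags) is not None

print(matches("gato", "Gato"))                    # case-sensitive by default
print(matches("gato", "Gato", "i"))               # gato/i
print(matches("gato", "obrigatoriamente", "x"))   # gato/x
```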
Searching for patterns
It is possible to search using general patterns (also known as regular expressions). A query can thus include regular expressions, provided it is enclosed in quotes. The usual notational conventions are followed:
- Alternation
- Alternatives are introduced by the | (vertical bar) character:
- "gato|peixe" matches all occurrences of gato and all occurrences of peixe
- Character sets
- A set of characters within square brackets matches occurrences of any of those characters:
- "gat[ao]" matches occurrences of gata and gato
- "[pg]at[ao]" will match occurrences of gata, gato, pata and pato
- "[^abcd][efg]" matches tokens with two characters, the first one not being a, b, c or d and the second one being e, f or g
- Period
- The "." (period) matches any single character (letter, digit or symbol):
- "gat.s" will match gatas, gatbs, gatcs, gat1s, etc.
- Optionality
- The "?" (question mark) makes the character/expression preceding it optional:
- "gatos?" matches gato and gatos.
- Iteration
- There are three ways of expressing iteration. The * (star) operator matches the character/expression preceding it zero or more times, the + (plus) operator matches it one or more times, and braces set explicit bounds on the number of repetitions:
- "gat.*" matches any word starting with gat, including gat itself
- ".*gato.*" matches any word containing the string gato (this is equivalent to gato/x)
- "gat.+" matches any word starting with gat, but not gat since + enforces at least one occurrence
- "gat.{2,4}" matches words that start with gat and that have 2 to 4 additional characters
- "[^aer]{5,}" matches words without a, e or r that have 5 or more characters.
- Grouping
- Parentheses are used to group expressions. The operators described above can then be applied to the whole expression in parentheses as if it were a single character:
- "gat(inh)?o" matches gato and gatinho (i.e. the sequence inh that follows t is optional)
- "ga(to)*" matches ga, gato, gatoto, gatototo, etc. (i.e. to may occur zero or more times)
Note that any of these expressions may also be modified by the /i and /x described previously.
For instance:
- "ga.*"/i matches words starting with ga, Ga, gA or GA
- "(ra){2}"/x matches words that contain two consecutive occurrences of ra (e.g. rara, mostraram, etc.)
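Several of the patterns above can be tried out locally with Python's `re` module, which uses the same notation for these constructs. This is a sketch under assumptions: the word list is invented, the helper `token_matches` is hypothetical, and `re.fullmatch` is used to mimic the fact that a query matches whole tokens. The concordancer's own pattern dialect may differ in details not covered here.

```python
import re

def token_matches(pattern, token):
    # Quoted query patterns match whole tokens, hence fullmatch
    return re.fullmatch(pattern, token) is not None

# Invented word list for illustration
words = ["gato", "gata", "gatos", "gatinho", "pato", "peixe", "gat", "regato"]

patterns = ["gato|peixe", "[pg]at[ao]", "gatos?", "gat.*",
            "gat.+", "gat(inh)?o", "gat.{2,4}"]

for pattern in patterns:
    print(pattern, [w for w in words if token_matches(pattern, w)])
```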
Searching through linguistic information
Each token is associated with linguistic information, encoded by means of annotation tags. Each tag is composed of a field and its value, written in square brackets ([field=value]). For example, [gender=m], [time=pi], etc.
Each field is instantiated by a keyword.
The values can be matched with any of the methods described above:
- [field=pattern] is the format for such queries.
Field-pattern pairs can be combined by using logical operators: & (ampersand) for conjunction and | (vertical bar) for disjunction:
- [field=pattern & field=pattern]
- [field=pattern | field=pattern]
In addition, the negation symbol ! (exclamation mark) matches tokens whose field values do not conform to a given pattern:
- [!field=pattern] is one format for such negation
- [field!=pattern] is equivalent to the previous query.
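The way field tests combine can be sketched in Python by treating each token as a dictionary and each [field=pattern] as a predicate. Everything here is illustrative: the helper names `field`, `all_of` and `any_of` are hypothetical, the tokens are invented, and treating an absent field as an empty value is an assumption of this sketch.

```python
import re

def field(name, pattern, negate=False):
    """Test for [name=pattern]; with negate=True, for [name!=pattern]."""
    def test(token):
        # Assumption: a field absent from the token counts as an empty value
        hit = re.fullmatch(pattern, token.get(name, "")) is not None
        return not hit if negate else hit
    return test

def all_of(*tests):
    """Conjunction, written & inside a query."""
    return lambda token: all(t(token) for t in tests)

def any_of(*tests):
    """Disjunction, written | inside a query."""
    return lambda token: any(t(token) for t in tests)

# Invented annotated tokens
tokens = [
    {"orth": "gato", "gender": "m", "number": "s"},
    {"orth": "gatas", "gender": "f", "number": "p"},
    {"orth": "gata", "gender": "f", "number": "s"},
    {"orth": "gatos", "gender": "m", "number": "p"},
]

def run(query):
    return [t["orth"] for t in tokens if query(t)]

print(run(all_of(field("gender", "f"), field("number", "p"))))  # [gender=f & number=p]
print(run(any_of(field("gender", "f"), field("number", "p"))))  # [gender=f | number=p]
print(run(field("gender", "f", negate=True)))                   # [gender!=f]
```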
Orthographic form (again)
The orthographic form itself can be matched via the keyword orth:
- [orth=gato] matches tokens with the orthographic form gato. This returns the same result as simply searching for gato. This alternative but equivalent form is useful when combining orth with other fields (to be discussed below).
- [orth="gat.*" & orth!=gato] matches tokens that begin with gat but that are not gato
Part-of-speech
Occurrences with a given part-of-speech (POS) category are selected with the keyword pos:
- [pos=cn] matches tokens with the POS tag cn (common noun)
- [pos=cn & orth="ga.*"] matches tokens that are common nouns and begin with ga
- [pos="d.*"] matches tokens with any POS tag whose name begins with d
- [pos!=pnt] matches tokens that are not punctuation (the pnt tag)
Nominal inflection
The keywords gender and number have, respectively, the values f (feminine) or m (masculine), and the values s (singular) or p (plural). They make it possible to match occurrences with selected inflection features:
- [gender=f] matches all tokens with feminine inflection
- [number=s & orth=".*s"] matches all tokens with singular inflection that end in s
- [gender!=m] matches tokens that do not have masculine inflection. Note that this also matches those tokens to which gender inflection is not even applicable, such as prepositions, punctuation, symbols, etc.
Some tokens may bear degree features, accessed through the degree keyword:
- [degree=dim] matches all tokens with diminutive degree
Verbal inflection
To match tokens according to their verbal inflection features, use the person, time and number keywords:
- [person="1"] matches tokens inflected for first person
- [time="ppi"] matches tokens inflected for the Pretérito Perfeito Indicativo
- [person="3" & number="s" & time="fc"] matches all forms expressing the third person singular of Futuro Conjuntivo
- [person!="1"] matches tokens that do not have 1st-person inflection. Note that this also matches those tokens to which person inflection is not even applicable, such as prepositions, punctuation, symbols, etc.
Here is the list of verbal inflection tags.
Infinitives can occur inflected or not inflected. This information is matched through the inflection keyword.
Lemma
In order to match tokens by their lemma, the base keyword can be used:
- [base=rato] matches words with rato as their base form (lemma), such as rato, ratos or ratinho, etc.
- [pos=cn & base=".*s"] finds common nouns whose lemma ends in s
- [orth=foi & pos=v & base!=ir] matches occurrences of the verb form foi that do not belong to verb ir
Named-entity
To match tokens that are part of an expression naming an entity, the iob keyword is used:
- [iob=B-LOC] matches tokens that are the beginning (B-) of an expression naming an entity whose semantic type is "location" (LOC).
- [iob=I-PER] matches tokens that are inside (I-) an expression naming an entity of type "person" (PER).
Here is the list of named-entity tags.
Metadata
It is possible to use metadata to restrict the match to a given type of text through the use of the meta command:
- gato meta source=writtennews matches gato only in the news portion (writtennews) of the corpus
- gato meta source="written.*" matches gato only in the written portion of the corpus (includes writtennews, writtenfiction and writtenother)
For a list of metadata fields and values, see here.
Advanced queries
Through the combination of the different search options described above, it is possible to construct advanced queries and uncover relevant linguistic information:
- situação[pos=adj] returns the occurrences of the word situação followed by an adjective
- [pos=da][pos=cn] returns the occurrences of a definite article (the da tag) followed by a common noun
- [pos=da][pos=adj]?[pos=cn] is similar to the previous query, but allows a single, optional adjective (indicated by the adj tag) between the definite article and the common noun
- [pos="cn|adj"]{3,} returns sequences of at least 3 consecutive tokens that are adjectives or common nouns (in any relative order)
- [pos=da][pos!=cn]{2,3}[pos=adj] returns sequences of a definite article followed by 2 or 3 tokens that are not common nouns and that are followed by an adjective
- ... etc.
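To give an intuition for how such token-sequence queries behave, the following Python sketch encodes the POS tags of a sentence as a string and runs an ordinary regular expression over it. This is one possible way of emulating the queries, not a description of how the concordancer works internally; the sentence, its tags and the function name `find_sequences` are invented for illustration.

```python
import re

# Invented tagged sentence: (orthographic form, POS tag)
sentence = [("o", "da"), ("gato", "cn"), ("viu", "v"),
            ("a", "da"), ("pequena", "adj"), ("gata", "cn")]

def find_sequences(tag_regex, tokens):
    """Return the token sequences whose POS tags match tag_regex.

    Each tag is wrapped as <tag>, so a regular expression over the
    encoded string operates on whole tokens.
    """
    encoded = "".join(f"<{pos}>" for _, pos in tokens)
    spans = []
    for m in re.finditer(tag_regex, encoded):
        start = encoded.count("<", 0, m.start())  # tokens before the match
        end = encoded.count("<", 0, m.end())      # tokens up to the match end
        spans.append([orth for orth, _ in tokens[start:end]])
    return spans

# Counterpart of [pos=da][pos=adj]?[pos=cn]
print(find_sequences(r"<da>(?:<adj>)?<cn>", sentence))
```

In this encoding, a query such as [pos="cn|adj"]{3,} would correspond to the expression `(?:<(?:cn|adj)>){3,}`.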
Aligning matches
It is possible to split the outcome of the query into two columns to make it more readable by using the ^ (caret) symbol:
- [pos=da][pos!=cn]{2}^[pos=adj] matches a definite article followed by 2 tokens that are not common nouns, followed by an adjective. The definite article and the following 2 tokens will be displayed in a column while the final adjective will be shown in a column by itself.