Developed at the University of Lisbon.

Query outcome

The CINTIL Online Concordancer retrieves passages containing occurrences of a given target expression in the CINTIL corpus.

The target expression is entered in the query text box. The retrieved passages are displayed below that box.

When the "Show tags" box is checked, the concordancer also displays the linguistic annotation.

For each token, this annotation is displayed between square brackets, with a colon separating each field. For instance, the annotation for the common noun carros will be displayed as follows:

Note that this annotation is displayed in a slightly different format from the one used in the corpus release. For a description of the latter, see the annotation guidelines.

For practical reasons, each passage returned around an occurrence contains at most 10 tokens.

Also for practical reasons, not all passages with occurrences of the target expression in the CINTIL corpus are returned, and the order in which the passages are displayed does not necessarily follow their order of occurrence in the corpus. Note, however, that the outcome of the CINTIL online concordancer can still be used as a reference in research, given that identical queries return identical outcomes.

In those use cases where access to every occurrence is imperative, the interested user can acquire a copy of the corpus and run a concordancer of their choice over that local copy.

Searching orthographic forms

Case sensitivity

Search is case-sensitive. For a case-insensitive search, append /i to the orthographic form:

gato gets occurrences of gato only
gato/i gets occurrences of gato, Gato, GATO, etc.

Sub-sequence matching

A query expression matches whole tokens. For instance, gato will not match parts of words, and will not return regato or obrigatoriamente.

To allow sub-sequence matching, append /x to the orthographic form (which can be combined with the /i mentioned previously).

For instance:

gato will only match gato
gato/x will match any word containing the string gato, such as obrigatoriamente
gato/xi is as above, but case-insensitive
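
The effect of the two modifiers can be approximated outside the concordancer. The following is a minimal sketch in Python, not the concordancer's own implementation: the function name `matches_orth` and the use of Python's `re` module are assumptions made purely for illustration.

```python
import re

def matches_orth(query, token):
    """Emulate a plain orthographic query with optional /i and /x modifiers.

    Hypothetical helper: it only mirrors the behaviour described in the text.
    """
    form, _, modifiers = query.partition("/")
    # /i switches to a case-insensitive comparison.
    flags = re.IGNORECASE if "i" in modifiers else 0
    # The form is taken literally here (no regular expression operators).
    literal = re.escape(form)
    if "x" in modifiers:
        # /x: the form may occur anywhere inside the token.
        return re.search(literal, token, flags) is not None
    # Default: the form must cover the whole token.
    return re.fullmatch(literal, token, flags) is not None

tokens = ["gato", "Gato", "regato", "obrigatoriamente", "OBRIGATORIAMENTE"]
for query in ["gato", "gato/i", "gato/x", "gato/xi"]:
    print(query, [t for t in tokens if matches_orth(query, t)])
```

In this sketch only `gato/xi` accepts all five sample tokens, since it combines sub-sequence matching with case-insensitivity.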

Contractions

Note that in the CINTIL corpus contractions (e.g. daquela, aos, nas) are reverted and encoded as two tokens, where the first is suffixed with an underscore symbol (e.g. de_ aquela, a_ os, em_ as).
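
When preparing queries programmatically, this means a contracted word has to be replaced by its two tokens. The sketch below is a hypothetical Python helper, assuming that a query should target the expanded tokens: the table `CONTRACTIONS` contains only the three examples given above, and the name `query_tokens` is invented for illustration.

```python
# Expansions copied from the examples in the text; a real table would be larger.
CONTRACTIONS = {
    "daquela": ("de_", "aquela"),
    "aos": ("a_", "os"),
    "nas": ("em_", "as"),
}

def query_tokens(word):
    """Return the token(s) to search for in place of a surface word."""
    # Words that are not known contractions are searched for unchanged.
    return list(CONTRACTIONS.get(word, (word,)))

print(query_tokens("daquela"))
print(query_tokens("gato"))
```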

Searching for patterns

It is possible to search with general patterns (regular expressions). A query can thus include a regular expression, provided it is enclosed in quotes. The usual notational conventions are followed:

Alternation

Alternatives are introduced by the | (vertical bar) character:

"gato|peixe" matches all occurrences of gato and all occurrences of peixe

Character sets

A set of characters within square brackets matches any one of those characters:

"gat[ao]" matches occurrences of gata and gato
"[pg]at[ao]" will match occurrences of gata, gato, pata and pato

A set can be negated by placing a ^ (caret) symbol immediately after the opening bracket.

"[^abcd][efg]" matches tokens with two characters, the first one not being a, b, c or d and the second one being e, f or g

Period

The "." (period) matches any single character (letter, digit or symbol):

"gat.s" will match gatas, gatbs, gatcs, gat1s, etc.

Optionality

The "?" (question mark) makes the character or expression preceding it optional:

"gatos?" matches gato and gatos.

Iteration

There are three ways of expressing iteration. The * (star) operator matches the character or expression preceding it zero or more times:

"gat.*" matches any word starting with gat, including gat itself
".*gato.*" matches any word containing the string gato (this is equivalent to gato/x)

The + (plus) operator is similar, but enforces that there is at least one occurrence of the character/expression preceding it:

"gat.+" matches any word starting with gat, but not gat since + enforces at least one occurrence

Finally, {l,u} bounds the number of iterations by a lower (l) and an upper (u) value. Either bound may be omitted: {l,} means "at least l times" and {,u} means "at most u times". A single value, {n}, means "exactly n times":

"gat.{2,4}" matches words that start with gat and that have 2 to 4 additional characters
"[^aer]{5,}" matches words without a, e or r that have 5 or more characters.

Grouping

Parentheses are used to group expressions. The operators described above can then be applied to the whole expression in parentheses as if it were a single character:

"gat(inh)?o" matches gato and gatinho (i.e. the sequence inh that follows t is optional)
"ga(to)*" matches ga, gato, gatoto, gatototo, etc. (i.e. to may occur zero or more times)

Note that any of these expressions may also be modified by the /i and /x described previously.

For instance:

"ga.*"/i matches words starting with ga, Ga, gA or GA
"(ra){2}"/x matches words that contain two consecutive occurrences of ra (e.g. rara, mostraram, etc.)

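Because the patterns follow the usual regular expression conventions, the examples in this section can be tried out locally. The sketch below uses Python's `re` module as a stand-in, under the assumption that the concordancer's dialect agrees with Python's on these operators. The helper names `full_token_match` and `sub_sequence_match` are hypothetical, and the token golfinho is an extra sample word not taken from the examples.

```python
import re

def full_token_match(pattern, token, flags=0):
    """True if the pattern covers the whole token (the default behaviour)."""
    return re.fullmatch(pattern, token, flags) is not None

def sub_sequence_match(pattern, token, flags=0):
    """True if the pattern matches somewhere inside the token (the /x behaviour)."""
    return re.search(pattern, token, flags) is not None

# (pattern, token, expected) triples based on the examples in this section.
CASES = [
    ("gato|peixe", "peixe", True),
    ("[pg]at[ao]", "pato", True),
    ("[^abcd][efg]", "ae", False),   # first character must not be a, b, c or d
    ("gat.s", "gat1s", True),
    ("gatos?", "gatos", True),
    ("gat.*", "gat", True),          # * allows zero additional characters
    ("gat.+", "gat", False),         # + requires at least one
    ("gat.{2,4}", "gatinho", True),
    ("gat.{2,4}", "gato", False),    # only one additional character
    ("[^aer]{5,}", "golfinho", True),
    ("gat(inh)?o", "gatinho", True),
    ("ga(to)*", "gatoto", True),
]

for pattern, token, expected in CASES:
    assert full_token_match(pattern, token) == expected, (pattern, token)

# The /i and /x modifiers correspond to IGNORECASE and to searching inside the token.
assert full_token_match("ga.*", "GAto", re.IGNORECASE)
assert sub_sequence_match("(ra){2}", "mostraram")
```
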
Searching through linguistic information

Each token is associated with linguistic information, encoded by means of annotation tags. Each tag is composed of a field and its value in square brackets ([field=value]). For example, [gender=m], [time=pi], etc.

Each field is identified by a keyword.

The values can be matched with any of the methods described above:

[field=pattern] is the format for such queries.

Field-pattern pairs can be combined by using logical operators: & (ampersand) for conjunction and | (vertical bar) for disjunction:

[field=pattern & field=pattern]
[field=pattern | field=pattern]

In addition, the negation symbol ! (exclamation mark) matches tokens whose field values do not conform to a given pattern:

[!field=pattern] is one format for such negation
[field!=pattern] is equivalent to the previous query.
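
The logic of these field queries can be pictured as predicates over a token's annotation. The following Python sketch is only an illustration under stated assumptions: tokens are modelled as dictionaries, the helper names `field_is` and `field_is_not` are hypothetical, and the two sample annotations are invented rather than taken from the corpus.

```python
import re

def field_is(token, field, pattern):
    """[field=pattern]: the field is present and its whole value matches."""
    value = token.get(field)
    return value is not None and re.fullmatch(pattern, value) is not None

def field_is_not(token, field, pattern):
    """[field!=pattern] or [!field=pattern]: plain negation of field_is."""
    return not field_is(token, field, pattern)

# Invented sample annotations, for illustration only.
gato = {"orth": "gato", "pos": "cn", "gender": "m", "number": "s"}
de = {"orth": "de", "pos": "prep"}

# [pos=cn & orth="ga.*"]: conjunction of two constraints.
print(field_is(gato, "pos", "cn") and field_is(gato, "orth", "ga.*"))
# [pos=cn | pos=prep]: disjunction of two constraints.
print(field_is(de, "pos", "cn") or field_is(de, "pos", "prep"))
# [gender!=m]: also true for a token that has no gender value at all.
print(field_is_not(de, "gender", "m"))
```

Modelling negation as the complement of a match is a design choice that reproduces the behaviour noted later in this guide for queries such as [gender!=m], which also return tokens to which the field does not apply.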

Orthographic form (again)

The orthographic form itself can be matched via the keyword orth:

[orth=gato] matches tokens with the orthographic form gato. This returns the same result as simply searching for gato. This equivalent notation is useful when combining orth with other fields (discussed below).
[orth="gat.*" & orth!=gato] matches tokens that begin with gat but that are not gato

Part-of-speech

Occurrences with a given part-of-speech (POS) category are selected with the keyword pos:

[pos=cn] matches tokens with the POS tag cn (common noun)
[pos=cn & orth="ga.*"] matches tokens that are common nouns and begin with ga
[pos="d.*"] matches tokens with any POS tag whose name begins with d
[pos!=pnt] matches tokens that are not punctuation (the pnt tag)

The list of POS tags may be consulted under the "Tagsets" tab at the top of this panel.

Nominal inflection

The keywords gender and number have, respectively, the values f (feminine) or m (masculine), and the values s (singular) or p (plural). They match occurrences with the selected inflection features:

[gender=f] matches all tokens with feminine inflection
[number=s & orth=".*s"] matches all tokens with singular inflection that end in s
[gender!=m] matches tokens that do not have masculine inflection. Note that this also matches those tokens to which gender inflection is not even applicable, such as prepositions, punctuation, symbols, etc.

Some tokens may bear degree features, accessed through the degree keyword:

[degree=dim] matches all tokens with diminutive degree

Verbal inflection

To match tokens according to their verbal inflection features, use the person, time and number keywords:

[person="1"] matches tokens inflected for first person
[time="ppi"] matches tokens inflected for the Pretérito Perfeito Indicativo
[person="3" & number="s" & time="fc"] matches all forms expressing the third person singular of Futuro Conjuntivo
[person!="1"] matches tokens that do not have 1st-person inflection. Note that this also matches those tokens to which person inflection is not even applicable, such as prepositions, punctuation, symbols, etc.

The list of verbal inflection tags may be consulted under the "Tagsets" tab at the top of this panel.
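
Queries that combine several features follow a regular shape, so they can be generated rather than typed by hand. The sketch below is a hypothetical Python helper (the name `build_query` is not part of the concordancer) that joins keyword and value pairs with & and quotes every value, as in the examples above.

```python
def build_query(**features):
    """Join keyword=value constraints into one bracketed token query."""
    # Every value is quoted, which the examples above show to be accepted.
    parts = ['{}="{}"'.format(key, value) for key, value in features.items()]
    return "[" + " & ".join(parts) + "]"

# Third person singular of Futuro Conjuntivo.
print(build_query(person="3", number="s", time="fc"))
```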

Infinitives can occur inflected or not inflected. This information is matched through the inflection keyword, with the values ifl (inflected) and nifl (not inflected).

Lemma

In order to match tokens by their lemma, the base keyword can be used:

[base=rato] matches words with rato as their base form (lemma), such as rato, ratos or ratinho, etc.
[pos=cn & base=".*s"] finds common nouns whose lemma ends in s
[orth=foi & pos=v & base!=ir] matches occurrences of the verb form foi that do not belong to verb ir

Named-entity

To match tokens that are part of an expression naming an entity, use the iob keyword:

[iob=B-LOC] matches tokens that are the beginning (B-) of an expression naming an entity whose semantic type is "location" (LOC).
[iob=I-PER] matches tokens that are inside (I-) an expression naming an entity of type "person" (PER).

The list of named-entity tags may be consulted under the "Tagsets" tab at the top of this panel.

Metadata

It is possible to use metadata to restrict the match to a given type of text through the use of the meta command:

gato meta source=writtennews matches gato only in the news portion (writtennews) of the corpus
gato meta source="written.*" matches gato only in the written portion of the corpus (includes writtennews, writtenfiction and writtenother)
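
The effect of the meta restriction can be pictured as a filter applied to the matches. The Python sketch below is an illustration only: the list `hits` and the function `restrict_by_source` are hypothetical, and the source values are the four listed in the quick reference.

```python
import re

# Invented hits for the query gato, one per source value.
hits = [
    {"orth": "gato", "source": "writtennews"},
    {"orth": "gato", "source": "writtenfiction"},
    {"orth": "gato", "source": "writtenother"},
    {"orth": "gato", "source": "spoken"},
]

def restrict_by_source(matches, pattern):
    """Keep only the matches whose source value fully matches the pattern."""
    return [m for m in matches if re.fullmatch(pattern, m["source"])]

print(len(restrict_by_source(hits, "writtennews")))
print(len(restrict_by_source(hits, "written.*")))
```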

For a list of metadata fields and values, see the "Quick reference" tab at the top of this panel.

Advanced queries

Through the combination of the different search options described above, it is possible to construct advanced queries and uncover relevant linguistic information:

situação[pos=adj] returns the occurrences of the word situação followed by an adjective
[pos=da][pos=cn] returns the occurrences of a definite article (the da tag) followed by a common noun
[pos=da][pos=adj]?[pos=cn] is similar to the previous query, but allows a single, optional adjective (indicated by the adj tag) between the definite article and the common noun
[pos="cn|adj"]{3,} returns sequences with at least 3 consecutive adjectives and common nouns (in any relative order)
[pos=da][pos!=cn]{2,3}[pos=adj] returns sequences of a definite article followed by 2 or 3 tokens that are not common nouns and that are followed by an adjective
... etc.
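
One way to reason about such token sequences is to treat each token's POS tag as a symbol and apply an ordinary regular expression to the string of tags. The Python sketch below does this under stated assumptions: it is not how the concordancer is implemented, the function `find_tag_sequences` and the ";" encoding are invented for illustration, and the sample sentence and its tags are made up.

```python
import re

def find_tag_sequences(pos_tags, tag_regex):
    """Return (start, end) token index pairs whose POS tags match tag_regex.

    tag_regex is written over an encoding in which every tag is followed
    by ";", e.g. "da;(?:adj;)?cn;" for [pos=da][pos=adj]?[pos=cn].
    """
    encoded = ";" + "".join(tag + ";" for tag in pos_tags)
    spans = []
    # The lookbehind forces every match to begin at a token boundary.
    for match in re.finditer("(?<=;)(?:" + tag_regex + ")", encoded):
        # Separators before the match give the index of its first token.
        start = encoded.count(";", 0, match.start()) - 1
        # Separators inside the match give the number of tokens covered.
        length = encoded.count(";", match.start(), match.end())
        spans.append((start, start + length))
    return spans

# Invented sample sentence with invented tags.
words = ["o", "velho", "gato", "viu", "a", "gata"]
tags = ["da", "adj", "cn", "v", "da", "cn"]

spans = find_tag_sequences(tags, "da;(?:adj;)?cn;")
phrases = [" ".join(words[start:end]) for start, end in spans]
print(phrases)
```

Here the optional adjective lets the same pattern cover both "o velho gato" and "a gata".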

Aligning matches

It is possible to split the outcome of the query into two columns to make it more readable by using the ^ (caret) symbol:

[pos=da][pos!=cn]{2}^[pos=adj] matches a definite article followed by 2 tokens that are not common nouns, followed by an adjective. The definite article and the following 2 tokens will be displayed in a column while the final adjective will be shown in a column by itself.

Query syntax cheat sheet

Basic query
	a word matches itself
Query modifiers
`/i`	case-insensitive match
`/x`	sub-sequence matching
Character expressions
`.`	any single character
`[ ]`	character from a set
`[^ ]`	character from negated set

Repetition operators
`?`	optional
`*`	zero or more times
`+`	one or more times
`{n}`	exactly `n` times
`{n,}`	`n` or more times
`{,n}`	up to `n` times
`{m,n}`	from `m` to `n` times

Combining expressions
`e₁e₂`	`e₁` followed by `e₂`
`|`	alternation
`( )`	grouping
Search by annotation
`[keyword=expression]`
`[keyword!=expression]`
`[key₁=exp₁ & key₂=exp₂]`
`[key₁=exp₁ | key₂=exp₂]`

Regular expressions must be enclosed in quotes.
Contractions are reverted and encoded as two tokens, where the first is concatenated with an underscore.

Quick reference

Field	Keyword	Values
Orthographic form	`orth`	any
Part-of-speech tag	`pos`	full table
Inflection feature	`gender`	`f`, `m`, `g`
	`number`	`s`, `p`, `n`
	`degree`	`dim`, `sup`, `comp`
	`person`	`1`, `2`, `3`
	`time`	full table
	`inflection`	`ifl`, `nifl`
Lemma (base form)	`base`	any
Named-entity	`iob`	full table
Metadata	`source`	`writtennews` `writtenfiction` `writtenother` `spoken`

Part-of-speech tags

Tag	Category	Examples
ADJ	Adjectives	bom, brilhante, eficaz, …
ADV	Adverbs	hoje, já, sim, felizmente, …
CARD	Cardinals	zero, dez, cem, mil, …
CJ	Conjunctions	e, ou, tal como, …
CL	Clitics	o, lhe, se, …
CN	Common Nouns	computador, cidade, ideia, …
DA	Definite Articles	o, os, …
DEM	Demonstratives	este, esses, aquele, …
DFR	Denominators of Fractions	meio, terço, décimo, %, …
DGTR	Roman Numerals	VI, LX, MMIII, MCMXCIX, …
DGT	Arabic Numerals	0, 1, 42, 12345, 67890, …
DM	Discourse Marker	olá, …
EADR	Electronic Addresses	http://www.di.fc.ul.pt, …
EOE	End of Enumeration	etc
EXC	Exclamation	ah, ei, …
GER	Gerunds	sendo, afirmando, vivendo, …
GERAUX	Gerund "ter"/"haver" in compound tenses	tendo, havendo
IA	Indefinite Articles	uns, umas, …
IND	Indefinites	tudo, alguém, ninguém, …
INF	Infinitive	ser, afirmar, viver, …
INFAUX	Infinitive "ter"/"haver" in compound tenses	ter, haver, …
INT	Interrogatives	quem, como, quando, …
ITJ	Interjection	bolas, caramba, …
LTR	Letters	a, b, c, …
MGT	Magnitude Classes	unidade, dezena, dúzia, resma, …
MTH	Months	Janeiro, Dezembro, …
NP	Noun Phrases	idem, …
ORD	Ordinals	primeiro, centésimo, penúltimo, …
PADR	Part of Address	Rua, av., rot., …
PNM	Part of Name	Lisboa, António, João, …
PNT	Punctuation Marks	., ?, (, …
POSS	Possessives	meu, teu, seu, …
PPA	Past Participles not in compound tenses	sido, afirmados, vivida, …
PP	Prepositional Phrases	algures, …
PPT	Past Participle in compound tenses	sido, afirmado, vivido, …
PREP	Prepositions	de, para, em redor de, …
PRS	Personals	eu, tu, ele, …
QNT	Quantifiers	todos, muitos, nenhum, …
REL	Relatives	que, cujo, tal que, …
STT	Social Titles	Presidente, drª., prof., …
SYB	Symbols	@, #, &, …
TERMN	Optional Terminations	(s), (as), …
UM	"um" or "uma"	um, uma
UNIT	Abbreviated Measurement Unit	kg., km., …
VAUX	Finite "ter" or "haver" in compound tenses	temos, haveriam, …
V	Verbs (other than PPA, PPT, INF or GER)	falou, falaria, …
WD	Week Days	segunda, terça-feira, sábado, …
Tags for multi-word expressions
LADV1…LADVn	Multi-Word Adverbs	de facto, em suma, um pouco, …
LCJ1…LCJn	Multi-Word Conjunctions	assim como, já que, …
LDEM1…LDEMn	Multi-Word Demonstratives	o mesmo, …
LDFR1…LDFRn	Multi-Word Denominators of Fractions	por cento
LDM1…LDMn	Multi-Word Discourse Markers	pois não, até logo, …
LITJ1…LITJn	Multi-Word Interjections	meu Deus
LPRS1…LPRSn	Multi-Word Personals	a gente, si mesmo, V. Exa., …
LPREP1…LPREPn	Multi-Word Prepositions	através de, a partir de, …
LQD1…LQDn	Multi-Word Quantifiers	uns quantos, …
LREL1…LRELn	Multi-Word Relatives	tal como, …
Tags specific to the spoken corpus
EMP	Emphasis
EL	Extra-linguistic
PL	Para-linguistic
FRG	Fragment

Inflection tags

Tag	Description
Tags for nominal categories
m	Masculine
f	Feminine
g	Underspecified gender
s	Singular
p	Plural
n	Underspecified number
dim	Diminutive
sup	Superlative
comp	Comparative
Tags for verbs
1	First Person
2	Second Person
3	Third Person
pi	Presente do Indicativo
ppi	Pretérito Perfeito do Indicativo
ii	Pretérito Imperfeito do Indicativo
mpi	Pretérito Mais que Perfeito do Indicativo
fi	Futuro do Indicativo
c	Condicional
pc	Presente do Conjuntivo
ic	Pretérito Imperfeito do Conjuntivo
fc	Futuro do Conjuntivo
imp	Imperativo
Tags for infinitive verbs
ifl	Inflected
nifl	Not Inflected

Named-entity tags

Semantic type	description	example
PER	person	...o[O] João[B-PER] Silva[I-PER] disse[O]...
ORG	organization	...a[O] Universidade[B-ORG] de[I-ORG] Lisboa[I-ORG] comprou[O]...
LOC	location	...de[O] Londres[B-LOC] a[O] Paris[B-LOC]...
WRK	work	...a[O] Mona[B-WRK] Lisa[I-WRK] está[O]...
MSC	other cases	...o[O] RMS[B-MSC] Titanic[I-MSC] afundou[O]...

CINTIL Corpus Concordancer's documentation

CINTIL online concordancer

The CINTIL online concordancer is a freely available online concordancing service to support the research usage of the CINTIL Corpus. This concordancer was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics, in cooperation with the Grammar and Resources (formerly REPORT) Group of CLUL-Centro de Linguística da Universidade de Lisboa.

The CINTIL concordancer allows the use of patterns to specify the occurrences to be retrieved. This makes it possible to uncover linguistic structures of high complexity and to use this service as a powerful research tool.

How to use the concordancer?

You may be interested also in using the companion tools.

CINTIL corpus

CINTIL-Corpus Internacional do Português is a linguistically interpreted corpus of Portuguese. At present it is composed of 1 million annotated tokens, each of which has been verified by human expert annotators. The annotation comprises information on part-of-speech, lemma and inflection of open-class words, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).

This corpus is being developed and maintained at the University of Lisbon by the Grammar and Resources (formerly REPORT) Group of CLUL-Centro de Linguística da Universidade de Lisboa in cooperation with NLX-Natural Language and Speech Group of the Department of Informatics. It was the first of its class to be developed for Portuguese in terms of the combined dimensions of size, depth of linguistic information, range of domains and sources, and level of accuracy. The present version is the most recent outcome of an ongoing and long-term endeavour to continuously enlarge and refine this corpus along all these dimensions, with the purpose of providing an enhanced resource for the research on the Linguistics of Portuguese and the development of language technology.

What is in the corpus?

Acquiring CINTIL

CINTIL corpus is released through ELDA-Evaluation and Language Resources Distribution Agency. Details are provided here.

Authorship

The CINTIL Corpus received several contributions:

Raw text, and previous versions
Grammar and Resources (formerly REPORT) Group of the CLUL-Centro de Linguística da Universidade de Lisboa
Present version
The CINTIL Corpus was developed, between March 2004 and December 2006, under the coordination of António Branco (FCUL-Faculdade de Ciências da Universidade de Lisboa) and Maria Fernanda Bacelar do Nascimento (CLUL-Centro de Linguística da Universidade de Lisboa), by a team including Sandra Antunes (CLUL), Florbela Barreto (CLUL), José Bettencourt Gonçalves (CLUL), João Silva (FCUL), Amália Mendes (CLUL) and Filipe Nunes (FCUL), partly in the scope of the TagShare Project, funded by FCT-Fundação para a Ciência e Tecnologia under the research contract POSI/PLP/47058/2002.

Contact us

CINTIL is an ongoing endeavour to develop a corpus with increasingly enhanced accuracy. If, after checking the assumptions under which the current version was produced (described here), you detect something that deserves to be improved, let us know.

Please note that this is not an online linguistic help-desk service, and questions unrelated to the CINTIL Corpus will not be answered.

White Papers

Irrespective of the most recent version of this tool you may use, when mentioning it, please cite this reference:

Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and João Silva, 2006, "Open Resources and Tools for the Shallow Processing of Portuguese", Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), Genoa, Italy.

Branco, António and João Silva, 2006, "LX-Suite: Shallow Processing Tools for Portuguese", Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL2006), Trento, Italy, pp.179-182.

Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Fernanda Bacelar Nascimento, Filipe Nunes and João Silva, 2006, "Linguistic Resources and Software for Shallow Processing", In Actas do XXI Encontro Anual da Associação Portuguesa de Linguística, Lisbon, Portugal.

Acknowledgments

The work leading to the CINTIL Corpus was partly supported by FCT-Fundação para a Ciência e Tecnologia under the grant POSI/PLP/47058/2002 for the project TagShare.

We are very grateful to Adam Przepiórkowski and his team, from the IPIPAN - The Institute of Computer Science of the Polish Academy of Sciences, Warsaw, for the support in the adaptation of Poliqarp to the Portuguese language and CINTIL constraints.

Corpus composition

CINTIL is a corpus of Portuguese with 1 million annotated tokens, each of which has been verified by human expert annotators. The annotation comprises information on part-of-speech, lemma and inflection of open-class words, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).

Over one third of the corpus is composed of transcribed spoken materials, with about half of that being the transcription of informal conversations.

The rest of the corpus is composed of written materials. The majority (58.73%) of this written corpus consists of articles from newspapers and magazines, such as the Jornal Público, Diário de Notícias, Revista Visão, etc. The rest of the written corpus is mostly composed of fiction.

A more detailed breakdown of the composition of CINTIL is presented in the following table:

*Breakdown of the CINTIL corpus*
Portion	Type	Share of corpus	Tokens
Written (689,124 tokens)	News	33.96%	404,690
	Fiction	16.80%	200,194
	Other	7.07%	84,240
Spoken (502,622 tokens)	Formal/Natural	8.18%	97,499
	Formal/Media	7.45%	88,727
	Formal/Phone	4.05%	48,284
	Informal/Private	18.26%	217,604
	Informal/Public	4.05%	48,221
	Informal/Phone	0.19%	2,287
Total			1,191,746
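
The figures quoted in this section can be re-derived from the token counts in the table. The short Python sketch below does the arithmetic; the counts are copied from the table, while the names `TOKEN_COUNTS`, `portion_total` and `share` are introduced here only for illustration.

```python
# Token counts copied from the breakdown table.
TOKEN_COUNTS = {
    ("written", "News"): 404_690,
    ("written", "Fiction"): 200_194,
    ("written", "Other"): 84_240,
    ("spoken", "Formal/Natural"): 97_499,
    ("spoken", "Formal/Media"): 88_727,
    ("spoken", "Formal/Phone"): 48_284,
    ("spoken", "Informal/Private"): 217_604,
    ("spoken", "Informal/Public"): 48_221,
    ("spoken", "Informal/Phone"): 2_287,
}

def portion_total(portion):
    """Sum the token counts of one portion ("written" or "spoken")."""
    return sum(n for (p, _), n in TOKEN_COUNTS.items() if p == portion)

def share(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100 * part / whole, 2)

total = sum(TOKEN_COUNTS.values())
written = portion_total("written")
spoken = portion_total("spoken")
news = TOKEN_COUNTS[("written", "News")]

print(total, written, spoken)
print(share(news, total))    # share of news in the whole corpus
print(share(news, written))  # share of news in the written portion
```

The two shares differ because the table expresses percentages relative to the whole corpus, whereas the 58.73% quoted in the text is relative to the written portion only.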

Companion tools and online services

You may also be interested in the companion tools. These tools deliver output in strict accordance with CINTIL's annotation conventions.

They include the following individual tools covering analysis and generation procedures:

Sentence chunker: detects and marks paragraph and sentence boundaries; 99.94% accuracy
Tokenizer: segments text into tokens, expands contractions, detaches clitic pronouns from verbs, etc.; 99.72% accuracy
POS tagger: assigns POS tags to tokens in context; 96.87% accuracy
Nominal featurizer: assigns inflection features (gender and number; and person and degree when applicable) to (attested or putative) words from the nominal POS categories, resolving ambiguity in context; 91.07% f-score
Nominal lemmatizer: assigns a lemma to words from the nominal POS categories (viz. common nouns and adjectives), resolving ambiguity in context; 97.67% f-score
Verbal featurizer and lemmatizer: assigns inflection features (tense, person and number) and lemma (infinitive form) to any (attested or putative) verb form, resolving ambiguity in context; 95.96% f-score
Verbal conjugator: delivers a conjugation table given any (attested or putative) infinitive verb form and the specification of possible associated clitics
Nominal inflector: delivers an inflected form given another (attested or putative) form and the inflected features required for the output

These tools were bundled into four suites of self-contained functionality and made available as the following online services:

LX-Inflector (portulanclarin.net) This service returns the lemma and the inflected form of any nominal form provided (common nouns or adjectives, neologisms included), according to the inflection feature values entered. This functionality is supported by the composition of the nominal featurizer, lemmatizer and inflector tools indicated above.
LX-Lemmatizer (portulanclarin.net/workbench/lx-lemmatizer/) This service provides fully-fledged lemmatization of Portuguese verbs: given a verb form (neologisms included), it delivers all possible lemmata together with their respective inflection features.
LX-Conjugator (portulanclarin.net/workbench/lx-conjugator/) This service provides fully-fledged conjugation tables for any verb form entered (neologisms included), including the full range of pronominal forms.
LX-Suite (portulanclarin.net/workbench/lx-suite/) This service is supported by the composition of the range of processing tools for analysis: sentence chunker, tokenizer, POS tagger, nominal featurizer, nominal lemmatizer and verbal lemmatizer. The resulting functionality ensures that the input, entered as raw text, is segmented into sentences and tokens, and that its tokens are associated with the corresponding lemmata and tagged with linguistic information on their POS and inflection feature values. The outcome is resolved with respect to the ambiguities that arise at the different levels of processing.

Annotation guidelines

The linguistic information encoded in CINTIL adheres to the annotation guidelines described here. However, for practical reasons, the concordancer displays the annotation—when the "Show tags" box is checked—in a slightly different format. For more details, see query outcome.

Tagset

Named-entity tags

Named-entity tags are composed of two parts, separated by a hyphen. The first part is used to mark the span of the entity, using the BIO notation: B indicates the first token of the entity, I indicates a token belonging to the entity other than the first, and O indicates a token that is not part of an entity. The second part of the tag indicates the semantic type of the entity. The list of types together with examples is shown in the table below.

Semantic type	description	example
PER	person	...o[O] João[B-PER] Silva[I-PER] disse[O]...
ORG	organization	...a[O] Universidade[B-ORG] de[I-ORG] Lisboa[I-ORG] comprou[O]...
LOC	location	...de[O] Londres[B-LOC] a[O] Paris[B-LOC]...
WRK	work	...a[O] Mona[B-WRK] Lisa[I-WRK] está[O]...
MSC	other cases	...o[O] RMS[B-MSC] Titanic[I-MSC] afundou[O]...
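
The BIO notation described above can be decoded mechanically into entity spans. The following Python sketch is one possible way to do so, offered as an illustration rather than as part of the concordancer: the function name `entity_spans` is hypothetical, and the inputs are the ORG and LOC examples from the table.

```python
def entity_spans(tagged):
    """Group (token, tag) pairs into (type, text) entities using BIO tags."""
    entities = []
    current = None
    for token, tag in tagged:
        if tag.startswith("B-"):
            # B- opens a new entity of the given semantic type.
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # I- extends the entity that is currently open.
            current[1].append(token)
        else:
            # O (or a stray I-) closes any open entity.
            current = None
    return [(etype, " ".join(tokens)) for etype, tokens in entities]

org_example = [("a", "O"), ("Universidade", "B-ORG"), ("de", "I-ORG"),
               ("Lisboa", "I-ORG"), ("comprou", "O")]
loc_example = [("de", "O"), ("Londres", "B-LOC"), ("a", "O"), ("Paris", "B-LOC")]

print(entity_spans(org_example))
print(entity_spans(loc_example))
```

Because every entity starts with its own B- tag, adjacent or nearby entities of the same type, such as Londres and Paris, remain separate.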

Query outcome

The CINTIL Online Concordancer permits to retrieve passages with occurrences of a given target expression in the CINTIL corpus.

The target expression is entered in the query text box. The retrieved passages are displayed below that box.

When the "Show tags" box is checked, the concordancer displays also the linguistic annotation.

For each token, this annotation is displayed between square brackets, with a colon separating each field. For instance, the annotation for the common noun carros will be displayed as follows:

keywords

Note that this annotation is displayed in a slightly different format than the one used in the corpus release. For a description of the latter, check here.

For practical reasons each passage returned with the occurrence contains at most 10 tokens.

In those usage cases where it is imperative to have access to every occurrence, the interested user can acquire a copy of the corpus and run a concordancer of his choice over that local copy.

Searching orthographic forms

Case-sensitiveness

Search is case-sensitive. For a case-insensitive search, append /i to the orthographic form:

by entering gato, occurrences of gato are obtained
gato/i gets occurrences of gato, Gato, GATO, etc.

Sub-sequence matching

The query expression match whole tokens. For instance gato will not match parts of words, and will not return regato or obrigatoriamente.

To allow sub-sequence matching, append /x to the orthographic form (which can be combined with the /i mentioned previously).

For instance:

gato will only match gato
gato/x will match any word containing the string gato, such as obrigatoriamente
gato/xi is as above, but case-insensitive

Contractions

Searching for patterns

It is possible to search with general pattern (regular expressions). A query can thus include regular expressions, provided it is enclosed in quotes. The usual notational conventions are followed:

Alternation

Alternatives are introduced by the | (vertical bar) character:

"gato|peixe" matches all occurrences of gato and all occurrences of peixe

Character sets

A set of characters within square brackets matches an occurrence of any one of those characters:

"gat[ao]" matches occurrences of gata and gato
"[pg]at[ao]" will match occurrences of gata, gato, pata and pato

A set can be negated by placing a ^ (caret) symbol immediately after the opening bracket.

"[^abcd][efg]" matches tokens with two characters, the first one not being a, b, c or d and the second one being e, f or g

Period

The "." (period) matches any single character (letter, digit or symbol):

"gat.s" will match gatas, gatbs, gatcs, gat1s, etc.

Optionality

The "?" (question mark) makes the character or expression preceding it optional:

"gatos?" matches gato and gatos.

Iteration

There are three ways of expressing iteration. The * (star) operator matches the character or expression preceding it zero or more times:

"gat.*" matches any word starting with gat, including gat itself
".*gato.*" matches any word containing the string gato (this is equivalent to gato/x)

The + (plus) operator is similar, but requires at least one occurrence of the character or expression preceding it:

"gat.+" matches any word starting with gat, but not gat itself, since + requires at least one occurrence

The {min,max} (curly braces) operator bounds the number of occurrences of the character or expression preceding it; leaving out the maximum removes the upper bound:

"gat.{2,4}" matches words that start with gat and that have 2 to 4 additional characters
"[^aer]{5,}" matches words without a, e or r that have 5 or more characters

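The iteration operators can be tried out offline with Python's `re` module, which follows the same notational conventions for these operators. The sketch below is an illustration under that assumption, not the concordancer's engine; `select` and the token list are hypothetical. `re.fullmatch` is used because queries match whole tokens.

```python
import re

def select(pattern, tokens):
    """Keep the tokens that the pattern matches in full (whole-token match)."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

# Hypothetical tokens, not taken from the corpus
tokens = ["gat", "gato", "gatos", "gatinho", "gatinhos", "juntos"]

star = select(r"gat.*", tokens)        # zero or more characters after gat
plus = select(r"gat.+", tokens)        # at least one character after gat
bounded = select(r"gat.{2,4}", tokens) # 2 to 4 characters after gat
no_aer = select(r"[^aer]{5,}", tokens) # 5 or more characters, none of a, e, r

print(star)
print(plus)
print(bounded)
print(no_aer)
```

Comparing `star` with `plus` shows the only difference between the two operators: the bare string gat is kept by the first and dropped by the second.
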
Grouping

Parentheses are used to group expressions. The operators described above can then be applied to the whole expression in parentheses as if it were a single character:

"gat(inh)?o" matches gato and gatinho (i.e. the sequence inh that follows t is optional)
"ga(to)*" matches ga, gato, gatoto, gatototo, etc. (i.e. to may occur zero or more times)

Note that any of these expressions may also be modified by the /i and /x described previously.

For instance:

"ga.*"/i matches words starting with ga, Ga, gA or GA
"(ra){2}"/x matches words that contain two consecutive occurrences of ra (e.g. rara, mostraram, etc.)

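As a rough sketch of how a quoted pattern and its /i and /x suffixes fit together, the following Python function turns a query string into a matching function. The query grammar assumed here (a quoted pattern followed by an optional slash and flag letters) is inferred from the examples above, and `compile_query` is a hypothetical name; the real parser may differ.

```python
import re

# Assumed shape of a pattern query: "pattern" optionally followed by /i, /x, /ix or /xi
QUERY = re.compile(r'"(?P<pattern>.*)"(?:/(?P<flags>[ix]+))?')

def compile_query(query):
    """Return a function that tests one token against the query."""
    parsed = QUERY.fullmatch(query)
    if parsed is None:
        raise ValueError("not a quoted pattern query: " + query)
    flags = parsed.group("flags") or ""
    re_flags = re.IGNORECASE if "i" in flags else 0
    regex = re.compile(parsed.group("pattern"), re_flags)
    # /x allows the pattern to match inside the token; otherwise match it whole
    return regex.search if "x" in flags else regex.fullmatch

starts_with_ga = compile_query('"ga.*"/i')
contains_rara = compile_query('"(ra){2}"/x')
gato_or_gatinho = compile_query('"gat(inh)?o"')

print(bool(starts_with_ga("GAto")))
print(bool(contains_rara("mostraram")))
print(bool(gato_or_gatinho("gatinho")))
```
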
Searching through linguistic information

Tokens can also be matched through their linguistic annotation. Each annotation field is addressed by a keyword.

The values can be matched with any of the methods described above:

[field=pattern] is the format for such queries.

Field-pattern pairs can be combined by using logical operators: & (ampersand) for conjunction and | (vertical bar) for disjunction:

[field=pattern & field=pattern]
[field=pattern | field=pattern]

In addition, the negation symbol ! (exclamation mark) matches tokens whose field values do not conform to a given pattern:

[!field=pattern] is one format for such negation
[field!=pattern] is equivalent to the previous query.

Orthographic form (again)

The orthographic form itself can be matched via the keyword orth:

[orth=gato] matches tokens with the orthographic form gato. This returns the same result as simply searching for gato. Using this alternative but equivalent way is useful when combining orth with other fields (to be discussed below)
[orth="gat.*" & orth!=gato] matches tokens that begin with gat but that are not gato

Part-of-speech

Occurrences with a given part-of-speech (POS) category are selected with the pos keyword:

[pos=cn] matches tokens with the POS tag cn (common noun)
[pos=cn & orth="ga.*"] matches tokens that are common nouns and begin with ga
[pos="d.*"] matches tokens with any POS tag whose name begins with d
[pos!=pnt] matches tokens that are not punctuation (the pnt tag)

Here is the list of POS tags.

Nominal inflection

[gender=f] matches all tokens with feminine inflection
[number=s & orth=".*s"] matches all tokens with singular inflection that end in s
[gender!=m] matches tokens that do not have masculine inflection. Note that this also matches those tokens to which gender inflection is not even applicable, such as prepositions, punctuation, symbols, etc.
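
The behaviour of field queries, including the way negation also picks up tokens that lack the field altogether, can be modelled with a small Python sketch. The token dictionaries, the helper names and the choice to treat an absent field as an empty value are all assumptions made for illustration; they say nothing about how the concordancer stores its data.

```python
import re

def field(name, pattern, negate=False):
    """Build a test for [name=pattern], or [name!=pattern] when negate is True."""
    def test(token):
        value = token.get(name, "")  # assumption: an absent field is an empty value
        return (re.fullmatch(pattern, value) is not None) != negate
    return test

def both(*tests):
    """Conjunction, the & operator."""
    return lambda token: all(t(token) for t in tests)

def either(*tests):
    """Disjunction, the | operator."""
    return lambda token: any(t(token) for t in tests)

# Hypothetical annotated tokens; the punctuation token has no gender field
tokens = [
    {"orth": "a", "pos": "da", "gender": "f"},
    {"orth": "gata", "pos": "cn", "gender": "f"},
    {"orth": "gatos", "pos": "cn", "gender": "m"},
    {"orth": ".", "pos": "pnt"},
]

def run(test):
    return [t["orth"] for t in tokens if test(t)]

cn_with_ga = run(both(field("pos", "cn"), field("orth", "ga.*")))  # [pos=cn & orth="ga.*"]
da_or_pnt = run(either(field("pos", "da"), field("pos", "pnt")))   # [pos=da | pos=pnt]
not_masc = run(field("gender", "m", negate=True))                  # [gender!=m]

print(cn_with_ga, da_or_pnt, not_masc)
```

In this model the punctuation token appears in `not_masc` because its missing gender value simply fails to match m, which mirrors the note above about tokens to which gender is not applicable.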

Some tokens may bear degree features, accessed through the degree keyword:

[degree=dim] matches all tokens with diminutive degree

Verbal inflection

To match tokens according to their verbal inflection features, use the person, time and number keywords:

[person="1"] matches tokens inflected for first person
[time="ppi"] matches tokens inflected for the Pretérito Perfeito Indicativo
[person="3" & number="s" & time="fc"] matches all forms expressing the third person singular of Futuro Conjuntivo
[person!="1"] matches tokens that do not have 1st-person inflection. Note that this also matches those tokens to which person inflection is not even applicable, such as prepositions, punctuation, symbols, etc.

Here is the list of verbal inflection tags.

Infinitives can occur inflected or not inflected. This information is matched through the inflection keyword.

Lemma

In order to match tokens by their lemma, the base keyword can be used:

[base=rato] matches words with rato as their base form (lemma), such as rato, ratos or ratinho, etc.
[pos=cn & base=".*s"] finds common nouns whose lemma ends in s
[orth=foi & pos=v & base!=ir] matches occurrences of the verb form foi that do not belong to verb ir

Named-entity

To match tokens that are part of an expression naming an entity, the iob keyword is used:

[iob=B-LOC] matches tokens that are the beginning (B-) of an expression naming an entity whose semantic type is "location" (LOC).
[iob=I-PER] matches tokens that are inside (I-) an expression naming an entity of type "person" (PER).

Here is the list of named-entity tags.
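
To see how the B- and I- prefixes delimit multi-token names, the following Python sketch regroups a tagged sentence into entities. The sentence and its tags are invented for illustration, and the tag O for tokens outside any entity follows the common IOB convention, which is assumed here rather than taken from the tag list.

```python
def entities(tagged):
    """Group (word, iob) pairs into (type, text) entities."""
    spans = []
    current = None  # the entity being built, as [type, words]
    for word, tag in tagged:
        if tag.startswith("B-"):
            # B- opens a new entity
            current = [tag[2:], [word]]
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # I- continues the entity opened by the previous B- of the same type
            current[1].append(word)
        else:
            # any other tag closes the current entity
            current = None
    return [(kind, " ".join(words)) for kind, words in spans]

# Hypothetical tagged sentence
sentence = [
    ("Fernando", "B-PER"), ("Pessoa", "I-PER"), ("nasceu", "O"),
    ("em", "O"), ("Lisboa", "B-LOC"), (".", "O"),
]

found = entities(sentence)
print(found)
```

A query such as [iob=B-PER] would therefore hit only the first token of each person name, while [iob=I-PER] hits the remaining ones.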

Metadata

It is possible to use metadata to restrict the match to a given type of text through the use of the meta command:

gato meta source=writtennews matches gato only in the news portion (writtennews) of the corpus
gato meta source="written.*" matches gato only in the written portion of the corpus (includes writtennews, writtenfiction and writtenother)

For a list of metadata fields and values, see here.

Advanced queries

Through the combination of the different search options described above, it is possible to construct advanced queries and uncover relevant linguistic information:

situação[pos=adj] returns the occurrences of the word situação followed by an adjective
[pos=da][pos=cn] returns the occurrences of a definite article (the da tag) followed by a common noun
[pos=da][pos=adj]?[pos=cn] is similar to the previous query, but allows a single, optional adjective (indicated by the adj tag) between the definite article and the common noun
[pos="cn|adj"]{3,} returns sequences with at least 3 consecutive adjectives and common nouns (in any relative order)
[pos=da][pos!=cn]{2,3}[pos=adj] returns sequences of a definite article followed by 2 or 3 tokens that are not common nouns and that are followed by an adjective
... etc.
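
Sequence queries like these can be approximated offline by running a regular expression over the sequence of POS tags instead of over characters. The Python sketch below does this for [pos=da][pos=adj]?[pos=cn]; the encoding (each tag followed by a space), the `find_sequences` helper and the annotated sentence are assumptions for illustration, and the hand translation of the query into a tag-level pattern is not something the concordancer exposes.

```python
import re

def find_sequences(tag_regex, tokens):
    """Return the word sequences whose POS tags match tag_regex.

    Each tag is written followed by one space, so a pattern such as
    'da (?:adj )?cn ' describes whole tags only.
    """
    tags = "".join(pos + " " for _, pos in tokens)
    results = []
    # (?<!\S) forces a match to start at the beginning of a tag
    for match in re.finditer(r"(?<!\S)" + tag_regex, tags):
        start = tags[:match.start()].count(" ")  # tags before the match
        length = match.group(0).count(" ")       # tags inside the match
        results.append([orth for orth, _ in tokens[start:start + length]])
    return results

# Hypothetical annotated sentence as (orth, pos) pairs
sentence = [
    ("o", "da"), ("gato", "cn"), ("viu", "v"),
    ("a", "da"), ("velha", "adj"), ("casa", "cn"), (".", "pnt"),
]

# [pos=da][pos=adj]?[pos=cn] translated by hand into a tag-level pattern
hits = find_sequences(r"da (?:adj )?cn ", sentence)
print(hits)
```

The optional group in the tag-level pattern plays the role of the ? after [pos=adj], so both the bare article-noun pair and the one with an intervening adjective are found.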

Aligning matches

It is possible to split the outcome of the query into two columns to make it more readable by using the ^ (caret) symbol:

[pos=da][pos!=cn]{2}^[pos=adj] matches a definite article followed by 2 tokens that are not common nouns, followed by an adjective. The definite article and the following 2 tokens will be displayed in a column while the final adjective will be shown in a column by itself.

License

The complete text of this license is here.