Query outcome

The CINTIL Online Concordancer retrieves passages containing occurrences of a given target expression in the CINTIL corpus.

The target expression is entered in the query text box. The retrieved passages are displayed below that box.

When the "Show tags" box is checked, the concordancer also displays the linguistic annotation.

For each token, this annotation is displayed between square brackets, with a colon separating each field. For instance, the annotation for the common noun carros will be displayed as follows:

keywords

Note that this annotation is displayed in a slightly different format from the one used in the corpus release. For a description of the latter, check here.

For practical reasons, each passage returned with an occurrence contains at most 10 tokens.

Also for practical reasons, not all passages with occurrences of the target expression in the CINTIL corpus are returned, and the order in which the passages are displayed does not correspond to the order in which they occur in the corpus. Note, however, that the outcome of the CINTIL Online Concordancer can be used as a reference in research, given that identical queries return identical outcomes.

In those use cases where it is imperative to have access to every occurrence, the interested user can acquire a copy of the corpus and run a concordancer of their choice over that local copy.
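
For users who take the local-copy route, a keyword-in-context listing can be sketched in a few lines of Python. This is only a minimal illustration, not the way the CINTIL concordancer is implemented: the function name kwic, the whitespace tokenization and the sample sentence are assumptions made for the example. The default window (4 tokens to the left, 5 to the right) keeps each passage within 10 tokens.

```python
def kwic(tokens, target, left=4, right=5):
    """Return one (left context, hit, right context) triple per occurrence.

    Each passage holds at most left + 1 + right tokens.
    """
    passages = []
    for i, token in enumerate(tokens):
        if token == target:  # whole-token, case-sensitive comparison
            before = tokens[max(0, i - left):i]
            after = tokens[i + 1:i + 1 + right]
            passages.append((before, token, after))
    return passages


# Hypothetical sample text, split on whitespace for simplicity.
sample = "o gato viu o rato e o gato fugiu".split()
for before, hit, after in kwic(sample, "gato", left=2, right=2):
    print(" ".join(before), f"[{hit}]", " ".join(after))
```

Unlike the online service, a local run of this kind returns every occurrence, in corpus order.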

Searching orthographic forms

Case sensitivity
Search is case-sensitive. For a case-insensitive search, append /i to the orthographic form:
  • by entering gato, occurrences of gato are obtained
  • gato/i gets occurrences of gato, Gato, GATO, etc.
Sub-sequence matching
The query expression matches whole tokens. For instance, gato will not match parts of words, and will not return regato or obrigatoriamente.

To allow sub-sequence matching, append /x to the orthographic form (which can be combined with the /i mentioned previously).

For instance:

  • gato will only match gato
  • gato/x will match any word containing the string gato, such as obrigatoriamente
  • gato/xi is as above, but case-insensitive
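
The effect of the two modifiers can be emulated with Python's re module. The sketch below is an approximation of the behaviour described above, not the concordancer's own code, and the helper name matches is hypothetical: whole-token matching corresponds to re.fullmatch, /x to re.search, and /i to the IGNORECASE flag.

```python
import re


def matches(form, token, modifiers=""):
    """Emulate a plain orthographic query with optional /i and /x modifiers."""
    flags = re.IGNORECASE if "i" in modifiers else 0
    pattern = re.escape(form)  # a plain form is literal text, not a pattern
    if "x" in modifiers:
        # /x: the form may occur anywhere inside the token
        return re.search(pattern, token, flags) is not None
    # default: the form must cover the whole token
    return re.fullmatch(pattern, token, flags) is not None


print(matches("gato", "obrigatoriamente"))       # False
print(matches("gato", "obrigatoriamente", "x"))  # True
print(matches("gato", "GATO", "i"))              # True
print(matches("gato", "REGATO", "x"))            # False
print(matches("gato", "REGATO", "xi"))           # True
```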
Contractions
Note that in the CINTIL corpus, contractions (e.g. daquela, aos, nas) are expanded and encoded as two tokens, the first of which ends with an underscore symbol (e.g. de_ aquela, a_ os, em_ as).
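
As an illustration of this encoding, the following Python sketch expands the three contractions mentioned above. The lookup table and the function name expand are assumptions made for the example; the corpus itself covers far more contractions than these.

```python
# Expansions taken from the examples above; a real table would be much larger.
CONTRACTIONS = {
    "daquela": ("de_", "aquela"),
    "aos": ("a_", "os"),
    "nas": ("em_", "as"),
}


def expand(tokens):
    """Replace each known contraction by its two-token encoding."""
    expanded = []
    for token in tokens:
        # Unknown tokens are kept unchanged as a one-element tuple.
        expanded.extend(CONTRACTIONS.get(token, (token,)))
    return expanded


print(expand(["vive", "nas", "ruas"]))  # ['vive', 'em_', 'as', 'ruas']
```

A query aimed at a contraction therefore presumably has to target the two-token form (for instance em_ followed by as) rather than the surface form nas.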

Searching for patterns

It is possible to search with general patterns (also known as regular expressions). A query can thus include regular expressions, provided it is enclosed in quotes. The usual notational conventions are followed:

Alternation
Alternatives are introduced by the | (vertical bar) character:
  • "gato|peixe" matches all occurrences of gato and all occurrences of peixe
Character sets
A set of characters within square brackets matches occurrences of any of those characters:
  • "gat[ao]" matches occurrences of gata and gato
  • "[pg]at[ao]" will match occurrences of gata, gato, pata and pato
A set can be negated by placing a ^ (caret) symbol immediately after the opening bracket.
  • "[^abcd][efg]" matches tokens with two characters, the first one not being a, b, c or d and the second one being e, f or g
Period
The "." (period) matches any single character (letter, digit or symbol):
  • "gat.s" will match gatas, gatbs, gatcs, gat1s, etc.
Optionality
The "?" (question mark) makes the character or expression preceding it optional:
  • "gatos?" matches gato and gatos.
Iteration
There are three ways to express iteration. The * (star) operator matches the character or expression preceding it zero or more times:
  • "gat.*" matches any word starting with gat, including gat itself
  • ".*gato.*" matches any word containing the string gato (this is equivalent to gato/x)
The + (plus) operator is similar, but requires at least one occurrence of the character or expression preceding it:
  • "gat.+" matches any word starting with gat, but not gat since + enforces at least one occurrence
Finally, {l,u} bounds the number of iterations by a lower (l) and an upper (u) value. Either bound may be omitted: {l,} means "at least l times" and {,u} means "at most u times". A single value, {n}, means "exactly n times":
  • "gat.{2,4}" matches words that start with gat and that have 2 to 4 additional characters
  • "[^aer]{5,}" matches words without a, e or r that have 5 or more characters.
Grouping
Parentheses are used to group expressions. The operators described above can then be applied to the whole expression in parentheses as if it were a single character:
  • "gat(inh)?o" matches gato and gatinho (i.e. the sequence inh that follows t is optional)
  • "ga(to)*" matches ga, gato, gatoto, gatototo, etc. (i.e. to may occur zero or more times)

Note that any of these expressions may also be modified by the /i and /x described previously.

For instance:

  • "ga.*"/i matches words starting with ga, Ga, gA or GA
  • "(ra){2}"/x matches words that contain two consecutive occurrences of ra (e.g. rara, mostraram, etc.)
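
Patterns of this kind can be tried out offline before being submitted. The Python sketch below assumes that the re module's dialect is close enough to the one used by the concordancer for these simple cases, which this page does not confirm; the word list and the helper name hits are made up for the example.

```python
import re

# Hypothetical word list used to try the patterns on.
words = ["ga", "gato", "gatos", "gatinho", "gatoto", "pata", "rara", "mostraram"]


def hits(pattern, modifiers=""):
    """List the sample words matched by a quoted pattern with optional /i and /x."""
    flags = re.IGNORECASE if "i" in modifiers else 0
    # /x looks for the pattern anywhere in the word; otherwise the whole word must match.
    test = re.search if "x" in modifiers else re.fullmatch
    return [w for w in words if test(pattern, w, flags)]


print(hits("gat(inh)?o"))    # ['gato', 'gatinho']
print(hits("ga(to)*"))       # ['ga', 'gato', 'gatoto']
print(hits("[pg]at[ao]"))    # ['gato', 'pata']
print(hits("(ra){2}", "x"))  # ['rara', 'mostraram']
```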

Searching through linguistic information

Each token is associated with linguistic information, encoded by means of annotation tags. Each tag is composed of a field and its value in square brackets ([field=value]). For example, [gender=m], [time=pi], etc.

Each field is instantiated by a keyword.

The values can be matched with any of the methods described above:

  • [field=pattern] is the format for such queries.

Field-pattern pairs can be combined by using logical operators: & (ampersand) for conjunction and | (vertical bar) for disjunction:

  • [field=pattern & field=pattern]
  • [field=pattern | field=pattern]

In addition, the negation symbol ! (exclamation mark) matches tokens whose field values do not conform to a given pattern:

  • [!field=pattern] is one format for such negation
  • [field!=pattern] is equivalent to the previous query.
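
The logic of these operators can be pictured by treating each token as a set of field-value pairs. The Python sketch below is an illustration under that assumption, not the concordancer's implementation; the sample token and the helper name has are hypothetical, while the field names are keywords used on this page.

```python
import re

# A hypothetical annotated token.
token = {"orth": "gatos", "pos": "cn", "gender": "m", "number": "p", "base": "gato"}


def has(tok, field, pattern):
    """True when the field exists and its whole value matches the pattern."""
    value = tok.get(field)
    return value is not None and re.fullmatch(pattern, value) is not None


# [pos=cn & orth="ga.*"]  (conjunction: both conditions must hold)
print(has(token, "pos", "cn") and has(token, "orth", "ga.*"))  # True

# [gender=f | number=p]  (disjunction: one condition is enough)
print(has(token, "gender", "f") or has(token, "number", "p"))  # True

# [base!=gato], equivalently [!base=gato]  (negation)
print(not has(token, "base", "gato"))  # False
```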

Orthographic form (again)

The orthographic form itself can be matched via the keyword orth:

  • [orth=gato] matches tokens with the orthographic form gato. This returns the same result as simply searching for gato. Using this alternative but equivalent way is useful when combining orth with other fields (to be discussed below)
  • [orth="gat.*" & orth!=gato] matches tokens that begin with gat but that are not gato

Part-of-speech

Occurrences with a given part-of-speech (POS) category are selected with the keyword pos:

  • [pos=cn] matches tokens with the POS tag cn (common noun)
  • [pos=cn & orth="ga.*"] matches tokens that are common nouns and begin with ga
  • [pos="d.*"] matches tokens with any POS tag whose name begins with d
  • [pos!=pnt] matches tokens that are not punctuation (the pnt tag)

Here is the list of POS tags.

Nominal inflection

The keywords gender and number have, respectively, the values f (feminine) or m (masculine), and the values s (singular) or p (plural). They make it possible to match occurrences with selected inflection features:

  • [gender=f] matches all tokens with feminine inflection
  • [number=s & orth=".*s"] matches all tokens with singular inflection that end in s
  • [gender!=m] matches tokens that do not have masculine inflection. Note that this also matches those tokens to which gender inflection is not even applicable, such as prepositions, punctuation, symbols, etc.
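
The side effect of negation noted in the last item can be reproduced with a small Python sketch. The token list and the helper name has are assumptions made for the example. A token that lacks a field cannot satisfy a positive condition on it, so the negated condition holds for it; adding a positive condition on another field, such as pos, narrows the result.

```python
import re

# Hypothetical annotated tokens; the preposition carries no gender field at all.
tokens = [
    {"orth": "gata", "pos": "cn", "gender": "f"},
    {"orth": "gato", "pos": "cn", "gender": "m"},
    {"orth": "de", "pos": "prep"},
]


def has(tok, field, pattern):
    """True only when the token carries the field and its whole value matches."""
    return field in tok and re.fullmatch(pattern, tok[field]) is not None


# [gender!=m]: everything that is not masculine, including tokens without gender.
not_masculine = [t["orth"] for t in tokens if not has(t, "gender", "m")]
print(not_masculine)  # ['gata', 'de']

# [gender!=m & pos=cn]: the same negation, restricted to common nouns.
nouns_only = [t["orth"] for t in tokens
              if not has(t, "gender", "m") and has(t, "pos", "cn")]
print(nouns_only)  # ['gata']
```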

Some tokens may bear degree features, accessed through the degree keyword:

  • [degree=dim] matches all tokens with diminutive degree

Verbal inflection

To match tokens according to their verbal inflection features, use the person, time and number keywords:

  • [person="1"] matches tokens inflected for first person
  • [time="ppi"] matches tokens inflected for the Pretérito Perfeito Indicativo
  • [person="3" & number="s" & time="fc"] matches all forms expressing the third person singular of Futuro Conjuntivo
  • [person!="1"] matches tokens that do not have 1st-person inflection. Note that this also matches those tokens to which person inflection is not even applicable, such as prepositions, punctuation, symbols, etc.

Here is the list of verbal inflection tags.

Infinitives can occur inflected or not inflected. This information is matched through the inflection keyword.

Lemma

In order to match tokens by their lemma, the base keyword can be used:

  • [base=rato] matches words with rato as their base form (lemma), such as rato, ratos or ratinho
  • [pos=cn & base=".*s"] finds common nouns whose lemma ends in s
  • [orth=foi & pos=v & base!=ir] matches occurrences of the verb form foi that do not belong to verb ir

Named-entity

To match tokens that are part of an expression naming an entity, use the iob keyword:

  • [iob=B-LOC] matches tokens that are the beginning (B-) of an expression naming an entity whose semantic type is "location" (LOC).
  • [iob=I-PER] matches tokens that are inside (I-) an expression naming an entity of type "person" (PER).

Here is the list of named-entity tags.
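
To see how the B- and I- prefixes delimit an entity, the following Python sketch groups tagged tokens into entity spans. It is an illustrative reading of the tagging scheme, and the function name entities is hypothetical. The sample is the ORG example from the named-entity tag table on this page.

```python
def entities(tagged):
    """Group (token, iob) pairs into (type, text) entity spans."""
    spans, current = [], None
    for token, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity of the given type.
            current = [tag[2:], [token]]
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current[1].append(token)
        else:
            # O (outside) or a mismatched I- tag closes the open entity.
            current = None
    return [(kind, " ".join(words)) for kind, words in spans]


tagged = [("a", "O"), ("Universidade", "B-ORG"), ("de", "I-ORG"),
          ("Lisboa", "I-ORG"), ("comprou", "O")]
print(entities(tagged))  # [('ORG', 'Universidade de Lisboa')]
```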

Metadata

It is possible to use metadata to restrict the match to a given type of text through the use of the meta command:

  • gato meta source=writtennews matches gato only in the news portion (writtennews) of the corpus
  • gato meta source="written.*" matches gato only in the written portion of the corpus (includes writtennews, writtenfiction and writtenother)

For a list of metadata fields and values, see here.
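
The pattern in the second query can be checked against the four source values listed in the quick reference. The Python sketch below assumes that a metadata pattern must match the whole value, as patterns do for tokens; this page does not state that explicitly.

```python
import re

# The source values listed in the quick reference on this page.
sources = ["writtennews", "writtenfiction", "writtenother", "spoken"]

# source="written.*"
selected = [s for s in sources if re.fullmatch("written.*", s)]
print(selected)  # ['writtennews', 'writtenfiction', 'writtenother']
```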

Advanced queries

Through the combination of the different search options described above, it is possible to construct advanced queries and uncover relevant linguistic information:

  • situação[pos=adj] returns the occurrences of the word situação followed by an adjective
  • [pos=da][pos=cn] returns the occurrences of a definite article (the da tag) followed by a common noun
  • [pos=da][pos=adj]?[pos=cn] is similar to the previous query, but allows a single, optional adjective (indicated by the adj tag) between the definite article and the common noun
  • [pos="cn|adj"]{3,} returns sequences with at least 3 consecutive adjectives and common nouns (in any relative order)
  • [pos=da][pos!=cn]{2,3}[pos=adj] returns sequences of a definite article followed by 2 or 3 tokens that are not common nouns and that are followed by an adjective
  • ... etc.
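
One way to picture how such token-sequence queries work is to write the POS tags of a sentence as a single string and run an ordinary regular expression over it. The Python sketch below follows that idea; it is an assumption about how the matching could be emulated, not a description of the concordancer's internals. The sample sentence and the function name find are hypothetical.

```python
import re

# Hypothetical tagged sentence as (orth, pos) pairs.
sentence = [("a", "da"), ("nova", "adj"), ("situação", "cn"), ("é", "v"),
            ("o", "da"), ("gato", "cn")]


def find(sentence, tag_pattern):
    """Return the word sequences whose POS tags match tag_pattern.

    tag_pattern is a regex over a string holding one "tag " chunk per token.
    """
    tags = "".join(pos + " " for _, pos in sentence)
    results = []
    # The lookbehind anchors each match at the start of a tag.
    for m in re.finditer(r"(?<![^ ])(?:" + tag_pattern + ")", tags):
        start = tags[:m.start()].count(" ")  # index of the first matched token
        length = m.group().count(" ")        # number of matched tokens
        results.append(" ".join(orth for orth, _ in sentence[start:start + length]))
    return results


# [pos=da][pos=adj]?[pos=cn]
print(find(sentence, r"da (adj )?cn "))  # ['a nova situação', 'o gato']
# [pos=da][pos=cn]
print(find(sentence, r"da cn "))         # ['o gato']
```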

Aligning matches

It is possible to split the outcome of the query into two columns to make it more readable by using the ^ (caret) symbol:

  • [pos=da][pos!=cn]{2}^[pos=adj] matches a definite article followed by 2 tokens that are not common nouns, followed by an adjective. The definite article and the following 2 tokens will be displayed in a column while the final adjective will be shown in a column by itself.

Query syntax cheat sheet

Basic query
a word matches itself
Query modifiers
/i case-insensitive match
/x sub-sequence matching
Character expressions
. any single character
[ ] character from a set
[^ ] character from negated set
Repetition operators
? optional
* zero or more times
+ one or more times
{n} exactly n times
{n,} n or more times
{,n} up to n times
{m,n} from m to n times
Combining expressions
e1e2 e1 followed by e2
| alternation
( ) grouping
Search by annotation
[keyword=expression]
[keyword!=expression]
[key1=exp1 & key2=exp2]
[key1=exp1 | key2=exp2]

Regular expressions must be enclosed in quotes.
Contractions are expanded and encoded as two tokens, the first of which ends with an underscore.

Quick reference

Field                Keyword      Values
Orthographic form    orth         any
Part-of-speech tag   pos          full table
Inflection feature   gender       f, m, g
Inflection feature   number       s, p, n
Inflection feature   degree       dim, sup, comp
Inflection feature   person       1, 2, 3
Inflection feature   time         full table
Inflection feature   inflection   ifl, nifl
Lemma (base form)    base         any
Named-entity         iob          full table
Metadata             source       writtennews, writtenfiction, writtenother, spoken

Part-of-speech tags

Tag Category Examples
ADJ Adjectives bom, brilhante, eficaz, …
ADV Adverbs hoje, já, sim, felizmente, …
CARD Cardinals zero, dez, cem, mil, …
CJ Conjunctions e, ou, tal como, …
CL Clitics o, lhe, se, …
CN Common Nouns computador, cidade, ideia, …
DA Definite Articles o, os, …
DEM Demonstratives este, esses, aquele, …
DFR Denominators of Fractions meio, terço, décimo, %, …
DGTR Roman Numerals VI, LX, MMIII, MCMXCIX, …
DGT Arabic Numerals 0, 1, 42, 12345, 67890, …
DM Discourse Marker olá, …
EADR Electronic Addresses http://www.di.fc.ul.pt, …
EOE End of Enumeration etc
EXC Exclamation ah, ei, …
GER Gerunds sendo, afirmando, vivendo, …
GERAUX Gerund "ter"/"haver" in compound tenses tendo, havendo
IA Indefinite Articles uns, umas, …
IND Indefinites tudo, alguém, ninguém, …
INF Infinitive ser, afirmar, viver, …
INFAUX Infinitive "ter"/"haver" in compound tenses ter, haver, …
INT Interrogatives quem, como, quando, …
ITJ Interjection bolas, caramba, …
LTR Letters a, b, c, …
MGT Magnitude Classes unidade, dezena, dúzia, resma, …
MTH Months Janeiro, Dezembro, …
NP Noun Phrases idem, …
ORD Ordinals primeiro, centésimo, penúltimo, …
PADR Part of Address Rua, av., rot., …
PNM Part of Name Lisboa, António, João, …
PNT Punctuation Marks ., ?, (, …
POSS Possessives meu, teu, seu, …
PPA Past Participles not in compound tenses sido, afirmados, vivida, …
PP Prepositional Phrases algures, …
PPT Past Participle in compound tenses sido, afirmado, vivido, …
PREP Prepositions de, para, em redor de, …
PRS Personals eu, tu, ele, …
QNT Quantifiers todos, muitos, nenhum, …
REL Relatives que, cujo, tal que, …
STT Social Titles Presidente, drª., prof., …
SYB Symbols @, #, &, …
TERMN Optional Terminations (s), (as), …
UM "um" or "uma" um, uma
UNIT Abbreviated Measurement Unit kg., km., …
VAUX Finite "ter" or "haver" in compound tenses temos, haveriam, …
V Verbs (other than PPA, PPT, INF or GER) falou, falaria, …
WD Week Days segunda, terça-feira, sábado, …
Tags for multi-word expressions
LADV1…LADVn Multi-Word Adverbs de facto, em suma, um pouco, …
LCJ1…LCJn Multi-Word Conjunctions assim como, já que, …
LDEM1…LDEMn Multi-Word Demonstratives o mesmo, …
LDFR1…LDFRn Multi-Word Denominators of Fractions por cento
LDM1…LDMn Multi-Word Discourse Markers pois não, até logo, …
LITJ1…LITJn Multi-Word Interjections meu Deus
LPRS1…LPRSn Multi-Word Personals a gente, si mesmo, V. Exa., …
LPREP1…LPREPn Multi-Word Prepositions através de, a partir de, …
LQD1…LQDn Multi-Word Quantifiers uns quantos, …
LREL1…LRELn Multi-Word Relatives tal como, …
Tags specific to the spoken corpus
EMP Emphasis
EL Extra-linguistic
PL Para-linguistic
FRG Fragment

Inflection tags

Tag Description
Tags for nominal categories
m Masculine
f Feminine
g Underspecified gender
s Singular
p Plural
n Underspecified number
dim Diminutive
sup Superlative
comp Comparative
Tags for verbs
1 First Person
2 Second Person
3 Third Person
pi Presente do Indicativo
ppi Pretérito Perfeito do Indicativo
ii Pretérito Imperfeito do Indicativo
mpi Pretérito Mais que Perfeito do Indicativo
fi Futuro do Indicativo
c Condicional
pc Presente do Conjuntivo
ic Pretérito Imperfeito do Conjuntivo
fc Futuro do Conjuntivo
imp Imperativo
Tags for infinitive verbs
ifl Inflected
nifl Not Inflected

Named-entity tags

Semantic type description example
PER person ...o[O] João[B-PER] Silva[I-PER] disse[O]...
ORG organization ...a[O] Universidade[B-ORG] de[I-ORG] Lisboa[I-ORG] comprou[O]...
LOC location ...de[O] Londres[B-LOC] a[O] Paris[B-LOC]...
WRK work ...a[O] Mona[B-WRK] Lisa[I-WRK] está[O]...
MSC other cases ...o[O] RMS[B-MSC] Titanic[I-MSC] afundou[O]...