News corpus categorised

The News corpus developed by LIACC, distributed in JSON format, was complemented with part-of-speech (POS) and keyword/topic annotations.

POS-tagging
===========
The POS-tagging used the tagger described in Généreux et al. (2012).
The title and text body were extracted, tokenized and POS-tagged.
Two new fields were added to the JSON file: "POS-body" and "POS-title", which contain the tokenized and POS-tagged versions of the fields "body" and "title".

Sentence boundaries in the fields "POS-body" and "POS-title" are indicated by the marker "<utt>\n".

Non-Portuguese parts of the news texts were removed from the POS-tagged text.
An example can be found in the document with "_id" 8200539.
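
The following is a minimal Python sketch of how the "POS-body" and "POS-title" fields could be read back into sentences of (token, tag) pairs. It is not part of the corpus tooling: the function names are hypothetical, and it assumes that tags never contain a slash, so each item is split at its last "/". Some items in the corpus carry a doubled slash (for example "galgou//V#ppi-3s"); the sketch drops the extra slash from the token.

```python
UTT_MARKER = "<utt>\n"  # sentence boundary marker used in "POS-body" and "POS-title"


def split_item(item):
    """Split one 'token/TAG' item at its last slash (assumes tags contain no slash)."""
    token, sep, tag = item.rpartition("/")
    if not sep:
        # No slash at all: keep the item as the token, with an empty tag.
        return item, ""
    if len(token) > 1 and token.endswith("/"):
        # Doubled slash such as "galgou//V#ppi-3s": drop the extra one.
        token = token[:-1]
    return token, tag


def parse_pos_field(pos_text):
    """Return a list of sentences, each a list of (token, tag) pairs."""
    sentences = []
    for chunk in pos_text.split(UTT_MARKER):
        pairs = [split_item(item) for item in chunk.split()]
        if pairs:  # skip the empty chunk after the final marker
            sentences.append(pairs)
    return sentences
```

Calling parse_pos_field(entry["POS-title"]) on a loaded entry would give one inner list per sentence of the title.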



Keyword topics
==============

The keywords and topics assigned to each news article were retrieved from the original HTML pages, using the following HTML tags:

<meta name="keywords" content="" />
<meta name="news_keywords" content="" />

The result is one keyword list per article, for example:
8196831: ['FC Porto', 'Futebol Nacional', 'Liga ZON Sagres', 'O Jogo', 'OJ']
8197007: ['Arábia Saudita', 'Diário de Notícias', 'DN', 'execução', 'Irão', 'Mundo']
8197454: ['Argentina', 'Dakar', 'Rali']
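
As an illustration only, the extraction from the two meta tags could look like the Python sketch below, which uses the standard library's html.parser. This is not the script from the repository: the class and function names are hypothetical, and it assumes that the "content" attribute holds a comma-separated list and that duplicates across the two tags should be merged.

```python
from html.parser import HTMLParser

META_NAMES = ("keywords", "news_keywords")  # the two meta tags named above


class KeywordMetaParser(HTMLParser):
    """Collect keywords from <meta name="keywords"> and <meta name="news_keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here by HTMLParser as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") not in META_NAMES:
            return
        # Assumption: keywords are separated by commas inside "content".
        for keyword in (attributes.get("content") or "").split(","):
            keyword = keyword.strip()
            if keyword and keyword not in self.keywords:
                self.keywords.append(keyword)


def extract_keywords(html):
    """Return the de-duplicated keywords found in one HTML page, in page order."""
    parser = KeywordMetaParser()
    parser.feed(html)
    return parser.keywords
```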


The scripts for retrieving the files can be found here:
https://github.com/LanguageMachines/news-pt


---
Example JSON entry with the added fields "keywords", "POS-body" and "POS-title":


{"_id": 8212175, "body": "O Rio Douro galgou, na madrugada desta segunda-feira, as margens do Porto e Gaia como não acontecia há cerca de cinco anos. Depois de horas de aflição, em Miragaia, na Ribeira do Porto e a Ribeira de Gaia, comerciantes e moradores passaram a manhã a limpar estragos.", "pubdate": {"$date": "2016-01-11T13:59:00.000+0000"}, "title": "Manhã de limpeza após madrugada de cheias", "url": "http://www.jn.pt/live/Atualidade/default.aspx?content_id=4973950&page=-1", "source": "Jornal de Notícias", "keywords": ["Atualidade", "Cheias", "Concelho Vila Nova de Gaia", "JN", "jnlive", "Jornal de Notícias", "Porto", "Rio Douro"], "POS-body": "O/DA#ms Rio/PNM Douro/PNM galgou//V#ppi-3s ,/PNT na/PREP+DA#fs madrugada/CN#fs desta/PREP+DEM#fs segunda-feira/WD#fs ,/PNT as/DA#fp margens/CN#fp do/PREP+DA#ms Porto/PNM e/CJ Gaia/PNM como/CJ não/ADV acontecia/V#ii-3s há/V#pi-3s cerca/ADV de/PREP cinco/CARD#gp anos/CN#mp ./PNT <utt>\nDepois/ADV de/PREP horas/CN#fp de/PREP aflição/CN#fs ,/PNT em/PREP Miragaia//PNM ,/PNT na/PREP+DA#fs Ribeira/PNM do/PREP+DA#ms Porto/PNM e/CJ a/DA#fs Ribeira/PNM de/PREP Gaia/PNM ,/PNT comerciantes/CN#mp e/CJ moradores/CN#mp passaram/V#ppi-3p a/DA#fs manhã/CN#fs a/PREP limpar/INF#ninf estragos/CN#mp ./PNT <utt>\n", "POS-title": "Manhã/PNM de/PREP limpeza/CN#fs após/PREP madrugada/CN#fs de/PREP cheias/CN#fp <utt>\n"}
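
An entry like the one above can be loaded with Python's standard json module. The sketch below is hypothetical: it assumes each entry is available as one JSON object in a string (for instance one object per line of the corpus file, which is not confirmed here), and the function name and returned keys are chosen for illustration.

```python
import json
from datetime import datetime

# Format of the "$date" value seen in the example, e.g. "2016-01-11T13:59:00.000+0000".
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def summarise_entry(json_text):
    """Parse one JSON entry and return its id, title, publication date and keywords."""
    entry = json.loads(json_text)
    return {
        "id": entry["_id"],
        "title": entry["title"],
        "published": datetime.strptime(entry["pubdate"]["$date"], DATE_FORMAT),
        # "keywords" is one of the added fields; default to an empty list if absent.
        "keywords": entry.get("keywords", []),
    }
```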




