Welcome to ScienQuest

ScienQuest is a website that allows you to consult structured and annotated text corpora, without having to be a specialist in natural language processing. On ScienQuest, you can search for words, word sequences or syntactic trees in a corpus, and display the results either as KWIC concordances or as lexical frequency tables.
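For readers unfamiliar with these two display formats, here is a minimal Python sketch of what a KWIC (keyword in context) concordance and a lexical frequency table look like. It is an illustration only, not ScienQuest's implementation: the function names `kwic` and `frequency_table`, the sample sentence and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


def frequency_table(tokens):
    """Count case-folded tokens, most frequent first."""
    return Counter(t.lower() for t in tokens).most_common()


# Hypothetical sample text, split naively on whitespace.
tokens = "the corpus is annotated and the corpus is searchable".split()

# One line per occurrence, with the keyword aligned in the middle column.
for left, hit, right in kwic(tokens, "corpus", width=2):
    print(f"{left:>20} | {hit} | {right}")

# The three most frequent word forms with their counts.
print(frequency_table(tokens)[:3])
```

A real corpus query would match on lemmas, parts of speech or dependency relations rather than raw word forms, but the two output shapes are the same: aligned context lines and a ranked count table.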

Most of the corpora on ScienQuest include annotations for parts of speech, lemmas and syntactic dependencies, although the details vary from corpus to corpus.

Just starting out? Select a corpus to begin. Try the Demo corpus.

Demo

Comparison of 5 textual genres: encyclopaedia, literature, press, science, tourism.

CoMeRe Intermittents

Corpus #Intermittent, tweets linked to a controversial discursive event.

CoMeRe SMSAlpes

Alpes4science, a corpus of authentic SMS messages collected in the Alps.

CoMeRe WikiConflicts

Conflicts in the French-language Wikipedia.

Academic discourse − EIIDA English

Comparable corpus of written communications and conference transcripts in linguistics and geochemistry.

Academic discourse − EIIDA French

Comparable corpus of written communications and conference transcripts in linguistics and geochemistry.

Est Républicain

The 8,894 issues in version 0.3 of the Est Républicain newspaper corpus.

frWaC

Corpus built automatically from websites in the .fr domain.

OpenSubtitles − OpenSubs-en

Sample of the OpenSubtitles corpus from OPUS.

OpenSubtitles − OpenSubs-es

Sample of the OpenSubtitles corpus from OPUS.

OpenSubtitles − OpenSubs-fr

Sample of the OpenSubtitles corpus from OPUS.

OpenSubtitles − OpenSubs-ro

Sample of the OpenSubtitles corpus from OPUS.

OpenSubtitles − OpenSubs-zh

Sample of the OpenSubtitles corpus from OPUS.

Scientext − Conference Paper Reviews

This corpus contains 520 comments from reviewers for a conference of young researchers in linguistics (Colloque international des étudiants chercheurs en Didactique des Langues et en Linguistique, 2010).

Version 1.0 of the corpus, built at the LIDILEM laboratory by Françoise Boch and Achille Falaise, as part of the French ANR Scientext project.

This corpus was annotated using Syntex, a parser developed by Didier Bourigault.

Scientext − English as a Foreign Language

This corpus contains texts written in English by French university students, mainly 2nd and 3rd year English majors. These students learn how to write long argumentative essays in English (4,500 words), based on extensive documentary research.

Version 1.0 of the corpus, built at the LLS laboratory by John Osborne, Alice Henderson and Robert Barr, as part of the ANR Scientext project.

This corpus was annotated using Syntex, a parser developed by Didier Bourigault.

Scientext − Scientific Texts in English

This corpus was built by the LiCorn research group from the University of South Brittany (Geoffrey Williams, Chrystel Millon). The texts, taken from BioMed Central, an independent publisher, focus exclusively on biology and medicine.

This corpus was annotated using Syntex, a parser developed by Didier Bourigault.

Scientext − Academic writing in French

This corpus was designed as a representative sample of the different branches of science and their disciplines.

TALN

The TALN Archives corpus was compiled by Florian Boudin in 2013 from the TALN and RECITAL conference websites (1997-2014). The corpus contains texts in PDF format with metadata (BibTeX annotations and summaries). A subset of 586 articles was then selected and processed by Ludovic Tanguy, who extracted the full text and parsed it with TALISMANE. The resulting treebank contains 2.3 million tokens and is annotated for part of speech, lemma and syntactic dependency.

Wikivoyage

Wikivoyage is a web-based tourist guide written by volunteer authors.