Welcome to ScienQuest

ScienQuest is a website that allows you to consult structured and annotated text corpora, without having to be a specialist in natural language processing. On ScienQuest, you can search for words, word sequences or syntactic trees in a corpus, and display the results either as KWIC concordances or as lexical frequency tables.

Most of the corpora on ScienQuest include annotations for parts of speech, lemmas and syntactic dependencies, but each corpus is different in detail.

Just starting out? Select a corpus to begin. Try the Demo corpus.
Demo
Comparison of 5 textual genres: encyclopaedia, literature, press, science, tourism.
CoMeRe GETALP
This is a French-language text chat corpus from the EpikNet Internet Relay Chat (IRC) network. The corpus was collected in 2004.
CoMeRe Intermitents
Corpus #Intermittent, tweets linked to a controversial discursive event.
CoMeRe SMSAlpes
Alpes4science, a corpus of real SMS messages collected in the Alps.
CoMeRe WikiConflicts
Conflicts in the French-language Wikipedia.
Discours académique − EIIDA English
Comparable corpus of written communications and conference transcripts in linguistics and geochemistry.
Discours académique − EIIDA French
Comparable corpus of written communications and conference transcripts in linguistics and geochemistry.
Est Républicain
The 8,894 issues of version 0.3 of the Est Républicain newspaper corpus (Corpus du Journal de l'Est Républicain).
frWaC
Corpus built automatically from websites in the .fr domain.
OpenSubtitles − OpenSubs-en
Sample of the OpenSubtitles corpus from OPUS.
OpenSubtitles − OpenSubs-es
Sample of the OpenSubtitles corpus from OPUS.
OpenSubtitles − OpenSubs-fr
Sample of the OpenSubtitles corpus from OPUS.
OpenSubtitles − OpenSubs-ro
Sample of the OpenSubtitles corpus from OPUS.
OpenSubtitles − OpenSubs-zh
Sample of the OpenSubtitles corpus from OPUS.
Scientext − Conference Paper Reviews
This corpus contains 520 comments from reviewers for a conference of young researchers in linguistics (Colloque international des étudiants chercheurs en Didactique des Langues et en Linguistique, 2010).
Version 1.0 of the corpus, built at the LIDILEM laboratory by Françoise Boch and Achille Falaise, as part of the French ANR Scientext project.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
Scientext − English as a Foreign Language
This corpus contains texts written in English by French university students, mainly 2nd and 3rd year English majors. These students learn how to write long argumentative essays in English (4,500 words), based on extensive documentary research.
Version 1.0 of the corpus, built at the LLS laboratory by John Osborne, Alice Henderson and Robert Barr, as part of the ANR Scientext project.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
Scientext − Scientific Texts in English
This corpus was built by the LiCorn research group from the University of South Brittany (Geoffrey Williams, Chrystel Millon). The texts, taken from BioMed Central, an independent publisher, focus exclusively on biology and medicine.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
Scientext − Academic writing in French
This corpus was built to create a representative sample of the different branches of science and their scientific disciplines.
TALN
The TALN Archives corpus was compiled by Florian Bourdin in 2013 using TALN and RECITAL conference websites (1997-2014). The corpus contains texts in PDF format with metadata (BibTeX annotations and summaries). A subset of 586 articles was then selected and processed by Ludovic Tanguy to extract the full text and parse it syntactically with Talismane. The resulting treebank contains 2.3 million tokens and is annotated for part of speech, lemma and syntactic dependency.
Wikivoyage
Wikivoyage is a web-based tourist guide written by volunteer authors.