Welcome to ScienQuest

ScienQuest is a software tool for browsing structured, annotated text corpora, searching them for words, word sequences, or syntactic trees, and displaying the results as KWIC (keyword in context) concordances and lexical frequencies. Most corpora on ScienQuest are annotated with parts of speech, lemmas, and syntactic dependencies, but different corpora are not necessarily annotated in the same way.
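To make the two output types concrete, here is a minimal Python sketch of a KWIC concordance and a lexical frequency count over a whitespace-tokenized sentence. It is purely illustrative and is not ScienQuest's implementation: the `kwic` function, its `width` parameter, and the sample sentence are assumptions made for this example, and matching here is on the raw word form rather than on lemmas or syntactic annotations.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword.

    Hypothetical helper: matching is case-insensitive on the surface form only.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Made-up sample text, split on whitespace for simplicity.
tokens = "the corpus is annotated and the corpus is searchable".split()

# Print one aligned line per hit, with the keyword in the middle column.
for left, hit, right in kwic(tokens, "corpus", width=2):
    print(f"{left:>20} | {hit} | {right}")

# Lexical frequencies: how often each lowercased word form occurs.
freq = Counter(t.lower() for t in tokens)
```

A real query engine would also match on lemma, part of speech, or dependency relations where the corpus provides them, which this sketch leaves out.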

Select a Corpus

First, choose a corpus according to the language and text genre you want to study.

Each corpus is listed with a short description, the number of texts and words it contains, and the software used to analyze it.

Demo

Scientext

TALN

586 texts

2M words

Talismane

The TALN Archives corpus was compiled by Florian Bourdin in 2013 using TALN and RECITAL conference websites (1997-2014). The corpus contains texts in PDF format with metadata (BibTeX annotations and summaries). A subset of 586 articles was then selected and processed by Ludovic Tanguy to extract the full text and parse it with Talismane. The resulting treebank contains 2.3 million tokens and is annotated with part of speech, lemma, and syntactic dependency. Learn more.

Conference Paper Reviews

570 texts

35k words

Syntex

This corpus contains 520 comments from reviewers for a conference of young researchers in linguistics (Colloque international des étudiants chercheurs en Didactique des Langues et en Linguistique, 2010).

Version 1.0 of the corpus was built at the LIDILEM laboratory by Françoise Boch and Achille Falaise, as part of the French ANR Scientext project.

This corpus was annotated using Syntex, a parser developed by Didier Bourigault.

English as a Foreign Language

272 texts

1M words

Syntex

This corpus contains texts written in English by French university students, mainly 2nd and 3rd year English majors. These students learn how to write long argumentative essays in English (4,500 words), based on extensive documentary research.

Version 1.0 of the corpus was built at the LLS laboratory by John Osborne, Alice Henderson and Robert Barr, as part of the ANR Scientext project.

This corpus was annotated using Syntex, a parser developed by Didier Bourigault.

Scientific Texts in English

8k texts

43M words

Syntex

The texts, taken from the independent publisher BioMed Central, focus exclusively on biology and medicine.

This corpus was annotated using Syntex, a parser developed by Didier Bourigault. This corpus was built by the LiCorn research group from the University of South Brittany (Geoffrey Williams, Chrystel Millon).

Scientific and Technical Texts in French

fr

205 texts

6M words

Syntex

This corpus was built to create a representative sample of the different branches of science and their scientific disciplines. Learn more.

Academic Discourse

EIIDA English

en

60 texts

332k words

TreeTagger/Susanne

Comparable corpus of written papers and transcribed conference talks in linguistics and geochemistry. Learn more.

EIIDA French

fr

60 texts

410k words

TreeTagger/PERCEO

Comparable corpus of written papers and transcribed conference talks in linguistics and geochemistry. Learn more.

Large Text Collections

Press

Est Républicain

fr

9k texts

87M words

Talismane

The 8,894 issues in version 0.3 of the Est Républicain newspaper corpus. Learn more.

Tourism

Wikivoyage

fr

639 texts

1M words

Talismane

This corpus is in its testing phase. Wikivoyage is a web-based travel guide written collaboratively by volunteer authors. Learn more.