EIIDA
EIIDA French
Études interdisciplinaires et interlinguistiques du discours académique (Interdisciplinary and Cross-Linguistic Studies of Academic Discourse) - French Corpus
The EIIDA project aims to compare written and spoken scientific discourse and to examine the impact of direct transmission on scientific discourse. The cross-linguistic study compares written and spoken academic discourse (research articles vs. conference talks) in three languages (English, French and Spanish) in order to analyze the impact of the speaker's or writer's linguistic culture on these two modes of communication. For the interdisciplinary comparison, we are collecting corpora in the exact sciences (geochemistry) and in the humanities (linguistics). The English and French corpora, whose transcriptions have been verified, are available on ScienQuest. This corpus contains the French-language content.
This corpus contains 61 texts (353,060 words).
Scientext
TALN
TALN Conference Proceedings
The TALN Archives corpus was compiled by Florian Boudin in 2013 from the TALN and RECITAL conference websites (1997-2014). The corpus contains texts in PDF format with metadata (BibTeX annotations and abstracts). A subset of 586 articles was then selected and processed by Ludovic Tanguy to extract the full text and parse it with Talismane. The resulting treebank contains 2.3 million tokens and is annotated with part of speech, lemma and syntactic dependency.
This corpus contains 586 texts (2,335,943 words).
Conference Paper Reviews
Scientext Corpus - Reviews of the 2010 CEDIL conference
This corpus contains 520 comments from reviewers for a conference of young researchers in linguistics (Colloque international des étudiants chercheurs en Didactique des Langues et en Linguistique, 2010).
Version 1.0 of the corpus was built at the LIDILEM laboratory by Françoise Boch and Achille Falaise, as part of the French ANR Scientext project.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
This corpus contains 570 texts (34,805 words).
English as a Foreign Language
Scientext Corpus - Texts in English as a Foreign Language
This corpus contains texts written in English by French university students, mainly 2nd and 3rd year English majors. These students learn how to write long argumentative essays in English (4,500 words), based on extensive documentary research.
Version 1.0 of the corpus was built at the LLS laboratory by John Osborne, Alice Henderson and Robert Barr, as part of the ANR Scientext project.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
This corpus contains 272 texts (1,020,146 words).
Scientific Texts in English
Scientext Corpus - Scientific Texts in English
This corpus was built by the LiCorn research group at the University of South Brittany (Geoffrey Williams, Chrystel Millon). The texts, taken from BioMed Central, an independent publisher, focus exclusively on biology and medicine.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
This corpus contains 7,564 texts (35,244,378 words).
Scientific and Technical Texts in French
Scientext Corpus - Scientific and Technical Texts in French
This corpus was built to provide a representative sample of the different branches of science and their disciplines. It covers three branches: social sciences (linguistics, psychology, educational sciences and, to a certain degree, Natural Language Processing), life sciences (biology, medicine) and applied or engineering sciences (electrical and mechanical engineering), although the boundaries between these branches cannot be clearly drawn.
This corpus was annotated using Syntex, a parser developed by Didier Bourigault.
This corpus contains 205 texts (5,063,315 words).
Press
Est Républicain
Corpus of the regional daily newspaper "L'Est Républicain" (1999-2003)
This corpus contains 58 issues taken from version 0.3 of the corpus, normalized by Bertrand Gaiffe and Kamel Nehbi under the direction of Bertrand Gaiffe, and distributed under a Creative Commons license by the CNRTL.
Annotated for ScienQuest with MElt and Malt, both trained on the French Treebank.
This corpus contains 58 texts (15,668,642 words).
The (unannotated) corpus can be downloaded here: http://www.cnrtl.fr/corpus/estrepublicain/
The annotation software can be downloaded here: http://alpage.inria.fr/statgram/frdep/fr_stat_dep_malt.html