The corpus



The corpus comprises 4872 sonnets, mainly from the 19th century. We have identified 760 authors: 4414 sonnets were written by men (660 authors) and 439 by women (107 authors). The remaining 19 sonnets could not be attributed to an author identified as male or female.
Of the authors identified, 19 have been selected and can be used as one of the possible constraints on the poem generator page. They are:

  • Théodore de Banville (1823 - 1891) - 37 poems
  • Charles Baudelaire (1821 - 1867) - 66 poems
  • Eugénie Casanova (1825 - 1908) - 22 poems
  • François Coppée (1842 - 1908) - 28 poems
  • Charles des Guerrois (1817 - 1916) - 66 poems
  • Théophile Gautier (1811 - 1872) - 55 poems
  • José-Maria de Heredia (1842 - 1893) - 127 poems
  • Leconte de Lisle (1818 - 1894) - 25 poems
  • Stéphane Mallarmé (1842 - 1898) - 27 poems
  • Robert de Montesquiou (1855 - 1921) - 79 poems
  • Alfred de Musset (1810 - 1857) - 21 poems
  • Gérard de Nerval (1808 - 1855) - 18 poems
  • Claudius Popelin (1825 - 1892) - 70 poems
  • Sully Prudhomme (1839 - 1907) - 166 poems
  • Arthur Rimbaud (1854 - 1891) - 17 poems
  • Charles-Augustin Sainte-Beuve (1804 - 1869) - 57 poems
  • Blanche Sari-Flégier (1852 - 1914) - 32 poems
  • Paul Verlaine (1844 - 1896) - 110 poems

The corpus was assembled from various sources, mainly data available online, such as the collections of the French National Library (BnF, 3980 sonnets) and Wikisource (781 sonnets), but also data extracted from blogs, anthologies, and the Malherbe corpus.
The composition of our corpus is rather uneven, since:
  • The work on the first two sources was done automatically and only partly checked and corrected by hand. Errors therefore remain.
  • The corpus consists mostly of data from the BnF, since this work is the result of a partnership with that institution.
  • Some authors are represented by very few sonnets, and others are absent altogether.
  • Experiments with particular corpora (for example, contemporary authors whose works are still under copyright) are planned, but they are more difficult to implement and to make public because of copyright restrictions.
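
To make this imbalance concrete, the figures quoted above can be tallied in a few lines of Python. This is only an illustrative sketch: the variable and function names are hypothetical, and it assumes that each sonnet is counted under exactly one source, which the description does not state explicitly.

```python
# Figures quoted in the corpus description above.
TOTAL_SONNETS = 4872

by_author_gender = {"men": 4414, "women": 439, "unassigned": 19}
by_source = {"BnF": 3980, "Wikisource": 781}


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# The three author categories should account for every sonnet.
assert sum(by_author_gender.values()) == TOTAL_SONNETS

# Assumption: sources do not overlap, so whatever is not from the BnF or
# Wikisource comes from blogs, anthologies, and the Malherbe corpus.
other_sources = TOTAL_SONNETS - sum(by_source.values())

print(shares(by_author_gender, TOTAL_SONNETS))
print(shares({**by_source, "other": other_sources}, TOTAL_SONNETS))
```

Under that assumption, the sources other than the BnF and Wikisource together contribute the sonnets left over once those two are subtracted from the total.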