Numapresse is a digital humanities research project funded by the French National Research Agency (ANR) and launched in October 2016. It aims to apply text mining techniques to the digital archives of French-speaking newspapers over a very long time frame, from 1800 to today. Initially focused on experimental attempts, the project has recently shifted to large-scale implementation, resulting, for instance, in the automated classification of news genres for every issue of two French dailies, Le Matin and Le Petit Parisien, from 1900 to 1940.
This is the first episode of an ongoing series of reports. It presents the contextual approach to text mining advocated by Numapresse, which uses enriched textual and editorial data to reconstruct the complex layout of the newspaper. The next episodes will focus on text classification, news image analysis and, finally, the ecosystem of circulating news.
This translation from the original French was started by Simon Hengchen and completed by Pierre-Carl Langlais.
Representing text in space
For distant reading, newspapers are complicated: they cannot be separated from their editorial context. A standard newspaper like Le Matin published hundreds of thousands of pages over several decades, each one containing a heterogeneous set of documents belonging to different genres and authors. Unlike books, newspaper text is not “a flow” but a collection of autonomous pieces (articles, illustrations, ads, tables) that may be combined into more complex “sections” (feuilletons, thematic coverage of a specific event, etc.). Although they are rarely recorded in digital archives, non-verbal elements such as frames or separators play a crucial part in making this sophisticated editorial architecture readable.
The visualisation below shows all the words published in Le Matin in September 1935. Each word is a point in space. The use of transparent coloring highlights some regular structures of the news, such as the seven-column layout, the feuilleton on page 2 or the fact that some pages acted as occasional supplements (pages 9 and 10 are lighter, because the default length of Le Matin is 8 pages).
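A minimal sketch of this kind of plot, assuming a token table with per-word coordinates (the column names "page", "x" and "y" are illustrative, not the project's actual schema):

```python
# Sketch: plot every word of an issue as a semi-transparent point,
# assuming a pandas DataFrame `tokens` with hypothetical columns
# "page", "x" and "y" holding the word coordinates from the ALTO files.
import matplotlib.pyplot as plt
import pandas as pd

def plot_issue(tokens: pd.DataFrame, n_cols: int = 4):
    pages = sorted(tokens["page"].unique())
    n_rows = -(-len(pages) // n_cols)  # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 3 * n_rows), squeeze=False)
    for ax in axes.flat:
        ax.axis("off")
    for ax, page in zip(axes.flat, pages):
        words = tokens[tokens["page"] == page]
        # alpha < 1 lets dense zones (columns, feuilleton, supplements) stand out
        ax.scatter(words["x"], words["y"], s=1, alpha=0.1, color="black")
        ax.invert_yaxis()  # ALTO coordinates start from the top-left corner
        ax.set_title(f"page {page}")
    plt.tight_layout()
    plt.show()
```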
This visualisation is made possible by the use of the complete digitized archives created by Optical Character Recognition software. A set of XML tags in the METS/ALTO format registers the coordinates of each word, line and text block recomposed from the original image file.
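As an illustration, a minimal sketch of this kind of extraction, assuming a standard ALTO file in which each `<String>` element carries CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes (the exact namespace varies between ALTO versions):

```python
# Sketch: read word-level coordinates from an ALTO file.
import xml.etree.ElementTree as ET

def read_alto_words(path: str):
    tree = ET.parse(path)
    words = []
    for element in tree.iter():
        # Match <String> elements regardless of the exact ALTO namespace
        if element.tag == "String" or element.tag.endswith("}String"):
            words.append({
                "text": element.get("CONTENT"),
                "x": int(element.get("HPOS")),
                "y": int(element.get("VPOS")),
                "width": int(element.get("WIDTH")),
                "height": int(element.get("HEIGHT")),
                "confidence": element.get("WC"),  # word confidence, when present
            })
    return words
```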
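For comparison, a bag-of-words matrix of this kind can be built in a few lines; the sketch below (with scikit-learn) shows how the counts are kept while word order and position are discarded:

```python
# Sketch: a term/document matrix keeps word counts but discards
# any information about order or position on the page.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["le feuilleton du jour", "le jour du feuilleton du jour"]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['du' 'feuilleton' 'jour' 'le']
print(matrix.toarray())                    # [[1 1 1 1], [2 1 2 1]]
```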
With news archives, these compression strategies prove too limiting: to preserve the layout structure, we are bound to use data representations that are bulkier than the original corpora. We have used a long table structure historically used by linguists and favored by recent text mining tools such as Tidytext (created by Julia Silge and David Robinson for the R programming language). The text is not compressed but restituted as a continuous string of tokens: each line records one token, in the order in which the tokens occur in the text. With these long tables, each token can be associated with a rich set of token-level metadata, such as its coordinates in the image file, but also the size of the text, its font or its style (italic, bold, underlined…), or the OCR confidence level.
This representation is not limited to text. Other editorial objects preserved in the digital archives can be recorded as well. The XML tree documents, for instance, the position of news illustrations (which can then be extracted through the API of the French National Library).
The preservation of all these elements brings new insights. The visualization below displays the 4-page issue of La Liberté from 16 July 1865, colored by OCR confidence level. This makes it possible to determine where problems might occur: Optical Character Recognition is clearly less trustworthy for complex typographic structures like tables or ads.
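In Python, the same long-table idea can be sketched with pandas (the project itself relies on R and Tidytext; the column names below are illustrative):

```python
# Sketch of a token-level long table: one row per token, in reading order,
# with token-level metadata joined as extra columns (names are illustrative).
import pandas as pd

tokens = pd.DataFrame([
    {"token": "Le",      "page": 1, "x": 112, "y": 540, "font_size": 9, "style": "roman",  "ocr_confidence": 0.98},
    {"token": "Matin",   "page": 1, "x": 145, "y": 540, "font_size": 9, "style": "roman",  "ocr_confidence": 0.95},
    {"token": "annonce", "page": 1, "x": 210, "y": 540, "font_size": 9, "style": "italic", "ocr_confidence": 0.72},
])

# The row order preserves the order of the text,
# so the full string can always be restituted:
print(" ".join(tokens["token"]))
```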
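Gallica serves its scanned images through the IIIF Image API, so once the XML tree gives the position of an illustration, the corresponding image region can be requested directly. A sketch of the kind of request involved (the ark identifier and coordinates are placeholders, and the exact URL pattern should be checked against the BnF documentation):

```python
# Sketch: fetch an illustration region from Gallica via the IIIF Image API.
# The ark identifier below is a placeholder, not a real document.
import requests

ark = "ark:/12148/bpt6kXXXXXXX"   # placeholder Gallica identifier
page = 2
x, y, w, h = 350, 1200, 800, 600  # region coordinates taken from the XML tree

url = f"https://gallica.bnf.fr/iiif/{ark}/f{page}/{x},{y},{w},{h}/full/0/native.jpg"
response = requests.get(url)
with open("illustration.jpg", "wb") as out:
    out.write(response.content)
```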
Other projects, such as Oceanic Exchanges, NewsEye or Impresso, use similar approaches. As a highly challenging corpus, news archives encourage the development of innovative tools and methods.
The use of long tables based on continuous tokens makes it possible to convert complex XML trees into structured data formats. It also makes it possible to join supplementary information. Within Numapresse, we have created a complete enrichment pipeline for the original text using the syntactic parser spaCy. Each word can therefore be associated with a lemma, with morphological, syntactic and dependency classes, and with named-entity recognition tags.
All these outputs are stored in sizeable tables. An issue of a newspaper is typically transformed into a dataset of thousands of lines (one for each word) and twenty columns (one for each piece of metadata associated with a word). Since one newspaper may have published tens of thousands of issues, data management quickly becomes an important issue. Fortunately, we were able to run all the needed processing and store all the output on a programming platform hosted by the French research infrastructure Huma-Num: a remote version of RStudio allows us to launch R or Python scripts from a web browser.
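A minimal sketch of this enrichment step with spaCy (the small French model is used here; the project may rely on a different model or pipeline configuration):

```python
# Sketch: enrich tokens with spaCy annotations
# (lemma, part of speech, dependency relation, named-entity tag).
import spacy

nlp = spacy.load("fr_core_news_sm")  # a French model; larger models also exist

doc = nlp("Le président de la République est arrivé hier à Marseille.")
rows = [{
    "token": token.text,
    "lemma": token.lemma_,
    "pos": token.pos_,
    "dependency": token.dep_,
    "entity": token.ent_type_,   # empty string when the token is not part of an entity
} for token in doc]

for row in rows:
    print(row)
```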
Modelling editorial structures
The enriched textual files can be used for enhanced queries such as “get all the person names in newspaper X in year Y”. More crucially, they are also the stepping stone to recreating complex editorial structures.
Newspaper text has always been highly structured: more or less informal rules have dictated the placing of texts, the form of titles, or where signatures can be found. These rules have changed a lot across time: until the end of the 19th century, the contemporary “article” did not really have an equivalent, and texts were often “floating paragraphs” copied from newspaper to newspaper without any mention of their authors.
Over the 19th century, most of the French press gradually adopted a similar ontology to order the newspaper text, with such rubrics as “Premier Paris”, “Foreign news” or “feuilletons”. Some of these categories were later adopted internationally: the “feuilleton”, under its French name, is still a staple of German newspapers even though it has long disappeared from France. Conversely, numerous French categories were likely borrowed from or inspired by foreign sources. All these shared standards contribute to important similarities in news layout, within a 20 to 30 year timeframe, across the European continent and beyond.
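A sketch of such a query on the token-level long table, assuming hypothetical columns "newspaper", "year", "token" and "entity" (the named-entity tag produced by the spaCy pipeline):

```python
# Sketch: "get all the person names in newspaper X in year Y"
# expressed as a filter on a token-level long table (columns are illustrative).
import pandas as pd

tokens = pd.DataFrame([
    {"newspaper": "Le Matin", "year": 1935, "token": "Laval",     "entity": "PER"},
    {"newspaper": "Le Matin", "year": 1935, "token": "Marseille", "entity": "LOC"},
    {"newspaper": "Le Matin", "year": 1934, "token": "Doumergue", "entity": "PER"},
])

persons = tokens[
    (tokens["newspaper"] == "Le Matin")
    & (tokens["year"] == 1935)
    & (tokens["entity"] == "PER")
]
print(persons["token"].value_counts())
```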
Through the combination of textual data and page-structure information about the articles, it becomes possible to recreate recurring norms and editorial structures with a low margin of error. A robust linear regression can retrieve the “normal” limits of the columns, and then measure the spatial anomalies between the expected start and end of the text and its actual start and end. Lines with a significant shift to the left or to the right are likely right- or left-justified; lines with equal shifts on both sides are probably centered. Editorial objects tend to conform to highly codified positions, as shown in the illustration below.
Textual data can then be used to refine the classification, as well as to correct some uncertainties. Many titles are already known in advance (“Paris”, “Foreign news”, etc.) and do not vary beyond occasional OCR mistakes. Signatures are always made of an initial and a name. Press agencies sign within parentheses or in italics, and later newspapers use set formulas to introduce journalists (“From our foreign correspondent…”, “Dispatch from the New York Herald”), which can also be exploited.
All these rules show the pertinence of a combined qualitative and quantitative reading of the newspaper. We cannot recompose those informal modes of reading without knowing the specifics of newspaper archives, and the use of text mining tools acts as a creative constraint which sheds light on patterns that are not always picked up by human readers (usually because they are too “obvious” and tend to fade into the background).
With this mixed approach and the metadata already collected by the project Le Rez-de-chaussée, Numapresse managed to identify almost all the serial novels (romans-feuilletons) published in the Journal des débats between 1837 and 1845. An experimental platform aggregates every episode of well-known novels first published in the press, such as Dumas’ Count of Monte Cristo or Eugène Sue’s Mysteries of Paris. Automated extraction based on editorial modeling can concretely extend the work of digital humanities projects.
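A minimal sketch of this idea, using one possible robust estimator (scikit-learn's HuberRegressor) and assuming each text line comes with its column index and horizontal start and end coordinates (the column names and pixel thresholds are illustrative):

```python
# Sketch: fit a robust regression of the horizontal start and end of each
# text line against its column index, so that indented or centered lines do
# not drag the estimate; the residuals then measure each line's "spatial anomaly".
import numpy as np
import pandas as pd
from sklearn.linear_model import HuberRegressor

def line_anomalies(lines: pd.DataFrame) -> pd.DataFrame:
    X = lines[["column"]].to_numpy()
    model_start = HuberRegressor().fit(X, lines["x_start"])
    model_end = HuberRegressor().fit(X, lines["x_end"])
    lines = lines.copy()
    lines["left_shift"] = lines["x_start"] - model_start.predict(X)
    lines["right_shift"] = model_end.predict(X) - lines["x_end"]
    # A large shift on one side suggests a justified or indented line;
    # roughly equal shifts on both sides suggest a centered line.
    lines["centered"] = (
        (lines["left_shift"] > 20)
        & (np.abs(lines["left_shift"] - lines["right_shift"]) < 10)
    )
    return lines
```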
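A few of these rules translate naturally into simple patterns. The sketch below gives an idea of the kind of matching involved; the headings and regular expressions are illustrative examples, not the project's actual rules:

```python
# Sketch: illustrative textual patterns used to refine the layout classification.
import re

KNOWN_HEADINGS = {"PARIS", "NOUVELLES DE L'ÉTRANGER", "DERNIÈRE HEURE"}

SIGNATURE = re.compile(r"^[A-ZÉ]\.\s?[A-ZÉ][a-zé\-]+\.?$")           # e.g. "G. Duval"
AGENCY = re.compile(r"\((Havas|Reuter|Fournier)\)")                   # agency within parentheses
CORRESPONDENT = re.compile(r"De notre correspondant", re.IGNORECASE)  # "From our correspondent…"

def classify_line(text: str) -> str:
    stripped = text.strip()
    if stripped.upper() in KNOWN_HEADINGS:
        return "heading"
    if SIGNATURE.match(stripped):
        return "signature"
    if AGENCY.search(stripped) or CORRESPONDENT.search(stripped):
        return "source"
    return "body"
```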
Diversifying our collections
The first analyses of Numapresse relied on the digitized archives generated by the Europeana Newspapers project, which include about 10 French newspapers published from 1800 to 1945.
We focused on this corpus for two reasons:
- It is possible to download the complete archives at once as a dump (whereas Gallica only allows page-by-page XML downloads through an API, which can take much more time).
- The Europeana Newspapers project has been a leader in Optical Layout Recognition. Supplementary XML files record whether a piece of text was part of a higher-order element such as an article, an ad or an image caption, which makes it fairly easy to build structured corpora.
Since the Numapresse project aims to make this enriched text mining approach applicable to other sources, we had to develop new tools.
We have written a Python program to read and extract data from PDFs. PDF is a richer format than customarily believed: it usually contains layout and font information, which makes it possible to keep track of the spatiality of the text (though it may lack some important data, such as OCR confidence).
Our biggest focus is currently the reconstruction of the editorial structure of newspapers. Subdividing the text into coherent objects (“articles”, “news wires”, “illustrations”) turned out to be a fundamental prerequisite for recovering important contextual information, such as the original time and place of a correspondence or the signatures of the journalists. All the rich textual data available may not prove sufficient for this task: it obliterates the complex set of frames and dividing lines that makes the newspaper readable as a heterogeneous collection. We are currently trying to reconstruct this supplementary information using pattern recognition (with OpenCV). In the long run, deep learning approaches might be much more effective, as the initial results of dhSegment on manuscript layouts suggest. Nevertheless, fitting a model architecture to a text structure as complex as the modern newspaper would certainly be challenging.
Finally, thanks to the dramatic progress of the free OCR tool Tesseract, we were able to create a text version of newspapers that were only available in image format. For the time being we have not attempted to train Tesseract on the specific fonts of older newspapers; ongoing tests at the Impresso project suggest that accuracy can be significantly enhanced this way. Even so, the decent results produced by the latest version of Tesseract have been instrumental in allowing us to deal with a more diverse set of sources.
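A minimal sketch of this kind of extraction with PyMuPDF (the Numapresse extractor may rely on a different library; this simply shows the layout information a PDF can expose):

```python
# Sketch: extract word coordinates from a PDF with PyMuPDF (fitz).
import fitz  # PyMuPDF

def pdf_to_tokens(path: str):
    rows = []
    with fitz.open(path) as doc:
        for page_number, page in enumerate(doc, start=1):
            # get_text("words") returns (x0, y0, x1, y1, word, block_no, line_no, word_no)
            for x0, y0, x1, y1, word, block_no, line_no, word_no in page.get_text("words"):
                rows.append({
                    "page": page_number,
                    "token": word,
                    "x": x0,
                    "y": y0,
                    "width": x1 - x0,
                    "height": y1 - y0,
                    "block": block_no,
                    "line": line_no,
                })
    return rows
```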
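As an illustration of this pattern-recognition step, a sketch of separator detection with OpenCV morphology; the kernel sizes and threshold parameters are arbitrary and would need to be tuned for each newspaper:

```python
# Sketch: detect long horizontal and vertical separators in a page scan
# with OpenCV morphology (parameters are illustrative, not tuned values).
import cv2

def find_separators(image_path: str):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Invert and binarize: printed strokes become white on black
    binary = cv2.adaptiveThreshold(
        255 - gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2
    )
    # Keep only shapes much wider (or taller) than ordinary characters
    horizontal = cv2.morphologyEx(
        binary, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_RECT, (80, 1))
    )
    vertical = cv2.morphologyEx(
        binary, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
    )
    return horizontal, vertical
```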
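A sketch of this OCR step with pytesseract, which also returns a word-level confidence that can be stored in the long table (the French traineddata, "fra", has to be installed alongside Tesseract):

```python
# Sketch: OCR a scanned page with Tesseract through pytesseract,
# keeping the word-level confidence and coordinates for the long table.
import pytesseract
from PIL import Image

def ocr_page(image_path: str):
    data = pytesseract.image_to_data(
        Image.open(image_path), lang="fra", output_type=pytesseract.Output.DICT
    )
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():
            words.append({
                "token": text,
                "confidence": conf,
                "x": left, "y": top, "width": width, "height": height,
            })
    return words
```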