Open Humanities Awards: FinderApp WiTTFind update 1

*This is the first in a series of posts from Dr Maximilian Hadersbeck, the recipient of the DM2E Open Humanities Awards – DM2E track.*

One aim of our project is to extend our FinderApp WiTTFind, which is currently used for exploring and researching only Ludwig Wittgenstein’s Big Typescript TS-213 (BT), to the rest of the 5000 pages of Wittgenstein’s Nachlass that are made freely available by the Wittgenstein Archives at the University of Bergen and are used as linked data by software from the DM2E project. With the money from the award, we were able to bring three new members into our research group “Wittgenstein in Co-Text”: Roman Capsamun, Yuliya Kalasouskaya and Stefan Schweter.

To gain good insight into the current work of the Archives, the Bergen Electronic Edition (BEE) and the openly available parts of Wittgenstein’s Nachlass, two members of our research group, Angela Krey and Matthias Lindinger, traveled to the Wittgenstein Archives in Bergen. Together with Dr. Alois Pichler and Øyvind Liland Gjesdal (University of Bergen Library), they discussed the latest developments at the archive and transferred the rest of the 5000 pages from the Nachlass of Ludwig Wittgenstein to our institute. At the Bergen archive they also discussed the high density (HD) scanning of the complete Wittgenstein Nachlass, which is being done in cooperation with Trinity College, Cambridge. During their visit they were able to attend lectures by Prof. Dr. Peter Hacker, a well-known Wittgenstein researcher, who was visiting the archive and spoke about “Philosophy and Neuroscience” and “The Nature of Consciousness”. After the lectures, they gave him a demo of our FinderApp WiTTFind, which impressed him very much.

University of Bergen, Department of Philosophy, which houses the Wittgenstein Archives

Milestones completed in our award project in September 2014 include:

Extending the Nachlass-data for our FinderApp WiTTFind

We transferred the rest of the 5000 pages of the freely available part of Wittgenstein’s Nachlass into the storage area at our institute. One problem with the XML-TEI-P5 compatible edition data in Bergen is that its XML tags carry a lot of information that is not relevant to our FinderApp. We therefore defined a restricted XML-TEI-P5 compatible tagset that includes all the information our FinderApp needs. We call this tagset the “CISWAB-tagset”. To reduce the Bergen tagset to our CISWAB-tagset, we wrote XSLT scripts together with our cooperation partner in Bergen. To validate the CISWAB-tagset data, we defined an XML DTD (CISWAB-DTD).
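The idea behind the tagset reduction can be illustrated with a small sketch. It is not our XSLT: it uses only the Python standard library, ignores TEI namespaces, and the allow-list below is a made-up example, since the actual CISWAB-tagset is not listed in this post. Elements outside the allow-list are unwrapped, so their text is kept while the tag itself is dropped. The `unknown_tags` helper is only a rough stand-in for a tag check and does not replace validation against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list for illustration; NOT the real CISWAB-tagset.
EXAMPLE_ALLOWED = {"ab", "s"}


def _append_text(parent, text):
    """Append text at the end of parent's content (after its last child, if any)."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_children(src, dst, allowed):
    """Copy src's content into dst, unwrapping elements whose tag is not allowed."""
    _append_text(dst, src.text)
    for child in src:
        if child.tag in allowed:
            kept = ET.SubElement(dst, child.tag, dict(child.attrib))
            _copy_children(child, kept, allowed)
        else:
            # Drop the tag but keep its text and any allowed descendants.
            _copy_children(child, dst, allowed)
        _append_text(dst, child.tail)


def reduce_tagset(root, allowed=EXAMPLE_ALLOWED):
    """Return a reduced copy of root; the root element itself is always kept."""
    new_root = ET.Element(root.tag, dict(root.attrib))
    _copy_children(root, new_root, allowed)
    return new_root


def unknown_tags(root, allowed=EXAMPLE_ALLOWED):
    """List tags that occur in the tree but are not in the allow-list."""
    return sorted({elem.tag for elem in root.iter()} - set(allowed))


if __name__ == "__main__":
    source = '<ab>Die <emph rend="u">Sprache</emph> ist <s>ein Labyrinth</s>.</ab>'
    root = ET.fromstring(source)
    print(unknown_tags(root))  # ['emph']
    print(ET.tostring(reduce_tagset(root), encoding="unicode"))
    # <ab>Die Sprache ist <s>ein Labyrinth</s>.</ab>
```

Unwrapping rather than deleting disallowed elements is a design choice made for this sketch: it keeps the running text intact, which matters for a search tool, but a real reduction may well need to delete some elements entirely.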

Extending the syntactic disambiguation of the Nachlass-Data

To extend syntactic disambiguation to the rest of the 5000 pages, we had to write new scripts that run the part-of-speech (POS) tagging stage with the “TreeTagger” automatically. Every new incoming CISWAB-XML file is automatically tagged and inserted into the storage area of our FinderApp.
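A minimal sketch of such an automatic tagging step might look as follows. The directory layout, the `.tagged` suffix and the function names are assumptions made for illustration, not our actual scripts. The tagger is passed in as a function so that the loop does not depend on how the TreeTagger is installed; `run_external_tagger` shows one way to wrap an external command that reads from standard input, with the command itself supplied by the caller. Extracting and tokenizing the text from the XML before tagging is left out here.

```python
import subprocess
from pathlib import Path


def run_external_tagger(text, command):
    """Pipe text through an external tagger command and return its output.

    `command` is an argument list supplied by the caller; which executable
    and options to use depends on the local tagger installation.
    """
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


def tag_new_files(incoming, storage, tagger):
    """Tag every XML file in `incoming` that has no tagged copy in `storage`.

    Returns the names of the files processed in this run.
    """
    incoming, storage = Path(incoming), Path(storage)
    storage.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in sorted(incoming.glob("*.xml")):
        dst = storage / (src.stem + ".tagged")  # hypothetical naming scheme
        if dst.exists():
            continue  # already tagged in an earlier run
        tagged = tagger(src.read_text(encoding="utf-8"))
        dst.write_text(tagged, encoding="utf-8")
        processed.append(src.name)
    return processed
```

Because files that already have a tagged copy are skipped, the function can be run repeatedly, for example from a scheduled job, and only newly arrived files are processed.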

Using “Tesseract” for OCR and switching to HD-scans for our WiTTReader

One central part of our FinderApp is the facsimile reader WiTTReader, which makes it possible to display, browse and highlight all the hits found by the Finder within the original facsimile. Up to now we have used only single density (SD) facsimiles to scroll through the Nachlass. In the next generation of our FinderApp we want to use high density (HD) facsimiles, which are currently being produced at Trinity College in Cambridge.
As it is very important in our project to use only open source tools, we will no longer use the OCR tool ABBYY FineReader (version 11). After some tests, we decided to use “Tesseract”, which is also used by the Google Books project. We transferred the first HD facsimiles of the Nachlass to our institute, and the first OCR quality tests with “Tesseract” are very promising.
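One common way to put a number on OCR quality is the character error rate (CER): the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below shows this metric purely as an illustration; it is not a description of how our own quality tests were carried out. The helper `tesseract_command` only builds a command line for the Tesseract CLI (image path, `stdout` as the output target, `-l` for the language); the language code `deu` is an assumption for German text.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


def tesseract_command(image_path, lang="deu"):
    """Build an argument list that makes Tesseract print recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]
```

A lower CER means the OCR output is closer to the reference; a value of 0.0 means the two strings are identical. Comparing the CER of SD and HD scans of the same page would be one way to check whether the higher resolution pays off.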