
Transkribus, My Hero

                 After fighting with every OCR program I could get my hands on, I was finally recommended Transkribus by my supervisor and two colleagues at work. I was hesitant, since the software I had previously attempted to use was mostly able to recognize individual words but unable to differentiate between columns, as demonstrated in my last post.

                  So, you may be asking, “what is this Transkribus you speak of?” Transkribus is an AI-powered platform that aids in research and transcription by recognizing text, both typed and handwritten, as well as page layout and structure. Historical and archival institutions are even beginning to use it to increase accessibility to records and materials. There is also a free tier that provides limited access to its OCR and transcription services.

                  Regardless of my reservations, I went ahead and created a Transkribus account, created a new collection, and uploaded my PDF files of the House of Commons and Senate records on the Citizenship Act readings. The great thing about Transkribus is that it offers short, simple videos showing you how to do exactly what you need, so uploading was not an issue. I then tried a few of the first and most popular Text Recognition models. Unfortunately, they were also unable to distinguish between the two columns, and I was frustrated and devastated. But I’m a pretty resilient person, so I went back and researched some of the models in other sections. I managed to find one under “Layout” called “Danish Newspapers 1750-1850,” whose description states that the model “is based on training data from Danish newspaper scans from …1750 to 1850 [and that] It works best on pages with two columns”! I figured if this didn’t solve my problems, nothing would.


                  I went through the process again, ran the model on my PDFs, and it actually worked. It wasn’t perfect: it didn’t disregard the page numbers or heading information, but I was happy to edit those out. And I’m sure there are other words and letters that were not transcribed accurately, but over the course of attempting to OCR these records, this was the best version so far. I’ll take it.
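Since I ended up editing out the page numbers and headings by hand, a small script could speed that part up on the exported text files. Here is a minimal sketch; the heading pattern (“HOUSE OF COMMONS” / “SENATE”) is an assumption for illustration, not the exact running heads in my files, and would need adjusting to the real output:

```python
import re

def clean_transcription(text):
    """Drop lines that are only a page number, plus assumed running heads.

    Hypothetical post-processing sketch: the heading regex below is a
    placeholder and should be matched to the actual exported text.
    """
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip lines that are nothing but a page number, e.g. "123"
        if re.fullmatch(r"\d{1,4}", stripped):
            continue
        # Skip an assumed repeated running head (placeholder pattern)
        if re.fullmatch(r"HOUSE OF COMMONS|SENATE", stripped, re.IGNORECASE):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)

sample = "HOUSE OF COMMONS\n123\nMr. Speaker rose to address the chamber."
print(clean_transcription(sample))
```

Run over each exported .txt file, this would leave only the body text, though any mis-recognized words would still need manual correction.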
