To date, I have yet to analyze my selected text through any software because, no matter what I do or how many problems I solve, I hit roadblock after roadblock.
As previously mentioned, I intend to analyze text from the House of Commons and Senate -- specifically the readings pertaining to the first Canadian Citizenship Act. My initial issue was that, although this resource had been digitized and OCRed (Optical Character Recognition -- when software converts images of textual documents into readable, editable and searchable text), the OCR was conducted years ago and was not wholly accurate. Many words were read incorrectly, and although each page has two separate columns, in some sections the OCR recognized them as a single column.
Therefore, my first task was to remove the old, bad OCR and redo it with newer technology to improve the accuracy. On the recommendation of another digital humanities student, I spent several weeks attempting to write Python code with the help of ChatGPT. As someone who has no formal training, or even basic training, in Python, I found this fairly fruitless. After some research, I found articles reporting on studies that had observed a significant decline in the accuracy of ChatGPT over time (Paolo Confino, Fortune). In light of this information, I switched tactics and began looking into programs that would be able to re-OCR my documents. Several individuals recommended Adobe Acrobat, but Carleton was unable to provide me with a license, and I am unable to pay for it with my own funds. However, Carleton was able to provide me with a license for Foxit PDF Reader. After a couple of online tutorials, I learned how to remove the existing OCR from my documents and apply new OCR to them.
This new OCR did prove to be more accurate within Foxit PDF Reader. Excited and looking forward to finally beginning my textual analysis after months of struggling, I exported the newly re-OCRed documents into .txt files to create my corpus, only to discover that while two-column OCRed documents may read as two columns in PDF format, they revert to reading as one column in .txt format.
Another roadblock occurs...