
And so the data analysis begins…

I wanted to systematically go through different programs to analyze word frequency as well as topics. The programs I have selected are Voyant, AntConc, Topic Modeling Tool, and NVivo.

I began with Voyant as it is the simplest to use.

What is Voyant you ask?

Voyant is a web-based textual analysis tool that lets users visualize and analyze textual data and identify patterns within a corpus. The tool was created and developed by Stéfan Sinclair of McGill University and Geoffrey Rockwell of the University of Alberta.[1]

You can find this free online tool at voyant-tools.org.

Voyant has several tools that I found interesting and potentially useful for this project. For example, the Summary section reports the total number of words as well as the total number of unique words within the entire corpus: 335,645 total words and 12,346 unique words. This is useful for manually calculating the percentages of specific words indicating topics, although that isn't strictly necessary since other programs like NVivo will do it as well. Voyant (like the other programs) can also create word clouds of varying capacity based on word frequency. I should mention that you can refine the results by adding stopwords to the stopword list, which I have done by amalgamating stopwords suggested by my digital humanities professor, stopwords from two separate online lists of common words, and my own stopword list to narrow context and topics. By the end of this project you will be able to view all of this online, as I will be uploading the data for anyone interested.
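To make concrete what the Summary statistics and stopword filtering amount to, here is a minimal sketch in Python of the same computation. The sample text and the tiny stopword list are hypothetical stand-ins; the real corpus is the Hansard transcripts and the real stopword list is the amalgamated one described above.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=frozenset()):
    """Tokenize lowercased text and count word frequencies,
    skipping any token on the stopword list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Hypothetical sample; the real input is the full corpus of readings.
sample = "Japan and Japanese citizenship and the Citizenship Act"
stopwords = {"and", "the"}

freqs = word_frequencies(sample, stopwords)
total_words = sum(freqs.values())   # analogous to Voyant's total word count
unique_words = len(freqs)           # analogous to Voyant's unique word count
```

From these two counts you can then compute the manual percentage of any topic word, e.g. `freqs["citizenship"] / total_words`.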

 


Moving forward, Voyant also provides a really cool tool called “Trends,” which “generates a graph that demonstrates how the frequency of a particular word changes over time.”[2] This is interesting because there are over 20 documents spanning the course of the Citizenship Act readings, from 1945 to 1946. This type of data can demonstrate which topics were more important than others (the topics I am examining in this thesis). Because I specifically included stem words (for example “Jap,” which captures both Japan and Japanese) for the topics I am investigating, this tool has provided a visualization of how the topics compare to one another as well as how they change over time.
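The per-document series behind a Trends graph can be sketched the same way: count, in each document in chronological order, how many tokens begin with a given stem. The snippets below are hypothetical stand-ins for the 1945–1946 readings.

```python
import re

def stem_trend(documents, stem):
    """For each document, count tokens that begin with the given stem
    (e.g. 'jap' matches both 'Japan' and 'Japanese'), mirroring the
    per-document frequency series that Voyant's Trends graph plots."""
    counts = []
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.append(sum(1 for t in tokens if t.startswith(stem)))
    return counts

# Hypothetical snippets standing in for the chronological transcripts.
docs = [
    "Japan was discussed at length.",
    "The Japanese question arose again; Japan remained central.",
    "Citizenship dominated the debate.",
]
trend = stem_trend(docs, "jap")
```

Plotting one such series per stem word gives the same topic-over-time comparison the Trends tool draws automatically.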

 


Although Voyant has other interesting tools, the ones mentioned above are the ones relevant to my project at this time.



[1] “Guides: Text Analysis: Voyant.” Upenn.edu, 2016, guides.library.upenn.edu/penntdm/tools/voyant. Accessed 2 Feb. 2025.

[2] “Guides: Text Analysis: Voyant.” Upenn.edu, 2016, guides.library.upenn.edu/penntdm/tools/voyant. Accessed 2 Feb. 2025.
