
And so the data analysis begins…

I wanted to systematically go through different programs to analyze word frequency as well as topics. The programs I have selected are Voyant, AntConc, the Topic Modeling Tool, and NVivo.

I began with Voyant as it is the simplest to use.

What is Voyant you ask?

Voyant is a web-based textual analysis tool that lets users visualize and analyze textual data and identify patterns within a corpus. The tool was created and developed by Stéfan Sinclair of McGill University and Geoffrey Rockwell of the University of Alberta.[1]

You can find this free online tool at voyant-tools.org.

Voyant has several tools that I found interesting and potentially useful for this project. For example, the Summary section reports the total number of words as well as the number of unique words in the entire corpus: 335,645 total words and 12,346 unique words. This is useful for manually calculating the percentage of specific words that indicate topics. That isn't strictly necessary, as other programs like NVivo will do it as well, but it is an option. Voyant (like the other programs) can also create word clouds of varying size based on word frequency.

I should also mention that you can refine the results by adding stopwords to the stopword list. I built mine by amalgamating stopwords suggested by my digital humanities professor, adding stopwords from two separate online lists of common words, and adding my own list to narrow context and topics. By the end of this project you will be able to view all of this online, as I will be uploading the data for anyone interested.
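For anyone curious what the "manual percentage" calculation looks like outside Voyant, here is a minimal sketch in Python. The stopword lists and sample sentence are placeholders, not my actual corpus or lists:

```python
from collections import Counter
import re

def word_percentage(text, target_stem, stopwords):
    """Percentage of non-stopword tokens that begin with target_stem."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stopwords]
    counts = Counter(tokens)
    hits = sum(n for word, n in counts.items() if word.startswith(target_stem))
    return 100 * hits / len(tokens) if tokens else 0.0

# Amalgamate stopword lists from several sources, as described above
# (these tiny lists are illustrative only).
stopwords = set()
for source in (["the", "and", "of"], ["mr", "hon"], ["speaker"]):
    stopwords.update(source)

sample = "The hon. member spoke of Japan and the Japanese question."
print(round(word_percentage(sample, "jap", stopwords), 1))  # prints 40.0
```

The same function works on a whole corpus if you concatenate the document texts first; dividing by the post-stopword token count (rather than the raw total) keeps the percentages comparable across documents of different lengths.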

Moving forward, Voyant also provides a really cool tool called “Trends,” which “generates a graph that demonstrates how the frequency of a particular word changes over time.”[2] This is interesting because there are more than 20 documents spanning the course of the Citizenship Act readings, from 1945 to 1946. This type of data can show which topics were more prominent than others (the topics I am examining in this thesis). Because I specifically included stem words (for example “Jap,” which captures both “Japan” and “Japanese”) for the topics I am investigating, this tool has provided a visualization of how the topics compare to one another as well as how they change over time.
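Under the hood, a Trends-style graph is just a relative frequency per document, plotted in chronological order. A rough sketch of the idea, with hypothetical document names and text standing in for the actual readings:

```python
import re

def stem_frequency(text, stem):
    """Relative frequency (per 1,000 tokens) of words starting with stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for t in tokens if t.startswith(stem))
    return 1000 * hits / len(tokens) if tokens else 0.0

# Hypothetical readings in chronological order (1945-1946).
documents = {
    "reading_1945_04": "Japan was discussed at length today ...",
    "reading_1946_05": "Citizenship and the Japanese question arose again ...",
}

# One point per document; plotting these in order gives the trend line.
for name, text in sorted(documents.items()):
    print(name, round(stem_frequency(text, "jap"), 1))
```

Normalizing per 1,000 tokens (rather than using raw counts) matters because the readings vary in length; Voyant's Trends view does the same kind of relative scaling so documents of different sizes remain comparable.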

Although Voyant has other interesting tools, the ones mentioned above are the most relevant to my project at this time.



[1] “Guides: Text Analysis: Voyant.” Upenn.edu, 2016, guides.library.upenn.edu/penntdm/tools/voyant. Accessed 2 Feb. 2025.

[2] “Guides: Text Analysis: Voyant.” Upenn.edu, 2016, guides.library.upenn.edu/penntdm/tools/voyant. Accessed 2 Feb. 2025.
