AntConc


AntConc: An introduction

My first work with AntConc was in a class on world English varieties. As part of a major project, I was required to download one of the ICE corpora for a variety of English. Being a glutton for punishment, I opted for Canadian English, a variety with few distinguishing features outside of accent and inflection.

This gave me a good introduction to how the software works, because I really had to search to find differences. This was around the time I was getting interested in linguistics, so I started playing around with the software for fun.

Being freeware, AntConc is great for classes because it provides an overview of general corpus tools at no cost to students. However, it is not especially user-friendly and is hard to use without guidance from someone more experienced.

Formatting corpus samples can be difficult because AntConc can only read .txt documents. Converting to .txt erases all formatting and introduces some errors. When uploading books, I found that asterisks and other formatting marks were turned into nonsensical symbols that I had to compare against the original text to decipher. For example, “Jane’s” becomes “Jane\x92s”, which can cause issues when searching.
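Symbols like “\x92” usually point to an encoding mismatch rather than lost text. The sketch below assumes the file was saved in the Windows-1252 encoding (a guess based on the byte values, not something I have confirmed for these files) and shows how such bytes could be decoded and then flattened to plain ASCII before loading the file into AntConc. The sample bytes and the `ASCII_MAP` table are made up for illustration.

```python
# Assumption: the .txt file was saved as Windows-1252, where the byte
# \x92 is a right single quotation mark and \x97 is an em dash.
raw = b"Jane\x92s answer\x97at last"

# Decode the bytes with the encoding they were (presumably) written in.
text = raw.decode("cp1252")

# Hypothetical mapping that flattens curly punctuation to ASCII so that
# a search for "Jane's" matches what is actually in the file.
ASCII_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": " - ", # en dash, em dash
})
clean = text.translate(ASCII_MAP)
print(clean)
```

Running the cleaned text through the word list should make the stray “X” entries disappear, if the encoding guess is right.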

However, the tools available are useful. AntConc can generate concordances, concordance plots, a file view, clusters or N-grams, collocates, and word lists, and it can compare keywords between a corpus and a reference corpus. Currently, I have the Gutenberg editions of Jane Eyre and Wuthering Heights uploaded as corpus samples.

Word List:

When uploading a new corpus file, I was taught to always run the word list first, as this is the first step for using the other features as well. The word list tool analyzes the text for the most commonly found words. From there, the researcher must remove formatting marks and ignore common language before getting to useful data. For example, there are 18,659 instances of “X”, but this is because \x97 replaced some of the formatting when I converted the file to .txt format in Microsoft Word. That’s my human error.

This table shows the top 5 hits and their frequency:

Rank    Frequency    Word
1       8058         X
2       4748         And
3       4572         The
4       4188         Xa
5       4129         i



As stated above, these don’t tell us much because they’re just articles, conjunctions, pronouns, and formatting codes. It isn’t until I scroll down into the hundreds that I start getting unique language that could be of interest. With Wuthering Heights loaded, it’s no surprise that the common unique words are related to the home.
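The scrolling could be shortened by filtering out common words before counting. This is only a rough sketch of what a word list does, not AntConc's own method: the tokenizer is a simple regular expression, and `STOP_WORDS` is a tiny hypothetical stop list that a real project would replace with a much longer one. The sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical stop list; a real one would be far longer.
STOP_WORDS = {"and", "the", "i", "a", "of", "to"}

def word_list(text, stop_words=frozenset()):
    """Return (word, count) pairs, most frequent first."""
    # Lowercase the text and keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common()

sample = "The moor and the heath and the house. I saw the house."
print(word_list(sample)[:3])              # unfiltered
print(word_list(sample, STOP_WORDS)[:3])  # with the stop list applied
```

With the stop list applied, content words such as “house” rise to the top instead of “the” and “and”.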



Concordance:

Once the word list is generated, concordances and concordance plots can be generated. This involves searching for a specific term, which causes AntConc to return every instance of that term’s use within the corpus samples. I searched for “Jane” because one of the samples is Jane Eyre. Having every instance in order of appearance shows that Jane is initially used to refer to the character and is almost always preceded by “Miss”. That might lead me to do a collocate search. Of interest is the 11th hit, the first unique one, where Jane is preceded by “nasty” instead.
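For anyone curious about what a concordance is doing under the hood, here is a minimal keyword-in-context sketch. It is my own approximation, not AntConc's code: the function name `concordance` and the `width` parameter are made up, and the sample text is invented rather than quoted from the novel.

```python
import re

def concordance(text, term, width=30):
    """Return one keyword-in-context line per hit, in order of appearance."""
    lines = []
    # \b keeps the search to whole words, so "Jane" does not match "Janet".
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

sample = "Miss Jane is here. You nasty Jane! Where is Miss Jane Eyre?"
hits = concordance(sample, "Jane", width=12)
for line in hits:
    print(line)
```

Reading down the aligned column is what makes a pattern like “Miss” before the name, or an outlier like “nasty”, easy to spot.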

Concordance plots show a graph of every instance of the word’s use in each text. Because there aren’t any characters named Jane in Wuthering Heights, there is no concordance plot for that sample. The concordance plot for Jane Eyre shows that there is a long stretch of text where Jane is not mentioned very often followed by a section where the name is mentioned very often. This might cause me to mark off the sections where the name is being used more often or less often and compare to what is going on in the text at that time.
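A concordance plot is essentially a record of where in the text each hit falls. The sketch below is an assumption about the general idea, not AntConc's implementation: `hit_indices` and `plot_strip` are hypothetical helpers, and the sample text is a placeholder.

```python
import re

def hit_indices(text, term):
    """Return (token positions of each hit, total number of tokens)."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [i for i, tok in enumerate(tokens) if tok == term], len(tokens)

def plot_strip(indices, total, width=40):
    """Draw a one-line text 'barcode': '|' where a hit falls, '.' elsewhere."""
    cells = ["."] * width
    for i in indices:
        # Scale the token position down to one of `width` cells.
        cells[i * width // total] = "|"
    return "".join(cells)

# Placeholder text: one early mention, a long gap, then two late mentions.
sample = "Jane " + "word " * 8 + "Jane Jane"
hits, total = hit_indices(sample, "Jane")
print(plot_strip(hits, total, width=11))
```

Long runs of dots correspond to the stretches where the name goes unmentioned, which are the sections worth comparing against the plot of the novel.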




File View:

This tool just lets me see the text as it normally looks and search for terms this way. This is more useful for looking at erasure. If I had instead uploaded The Great Gatsby and searched for terms relating to African Americans, I would have found only 5 instances. This view would allow me to see the context of those instances.

Clusters and N-grams:

An N-gram is a contiguous sequence of N items, usually words, taken from a text. A similar idea is used in biology, where one can search for a fragment of a sequence and return every place it occurs. If I uploaded a biology text and searched for “oxide”, AntConc would return every cluster that contains “oxide”. So, “carbon monoxide” might be more common than “carbon dioxide” in a text about smoke inhalation. Due to the formatting errors in my corpus samples, most of my N-grams are also formatting errors. With those errors fixed, I could use the tool to search for morpheme roots and return every form of a specific term.
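Counting word-level N-grams takes only a few lines. This is a sketch of the general technique, not of AntConc's internals; the `ngrams` helper and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "carbon monoxide and carbon dioxide and carbon monoxide".split()

# Count all two-word sequences (bigrams).
bigrams = Counter(ngrams(tokens, 2))

# Keep only the clusters that contain the search term.
clusters = {gram: n for gram, n in bigrams.items() if "carbon" in gram}
for gram, n in sorted(clusters.items(), key=lambda kv: -kv[1]):
    print(" ".join(gram), n)
```

Changing the second argument of `ngrams` to 3 or 4 gives longer clusters, at the price of lower counts for each one.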

Collocates:

Collocation is one of my favorite tools in corpus linguistics. AntConc allows me to search for a term and narrow the collocates to a chosen number of words to the left and to the right. Normally, I use this for newspapers in the BYU COCA, searching for the AP standard term for a race and looking at the 3 terms preceding it. This usually shows that some races are portrayed more negatively in the media than others.

AntConc is a little less user-friendly than the BYU COCA, but the collocation tool works the same. Because of the nature of the two texts, I searched for the term “child” with a window of 5 collocates to the left and 5 to the right. The returned collocates are largely negative, with “orphan” and “wailing” being the two most common. “Cradle” is found commonly to the right, followed by a formatting mark and then the words “tyrannically”, “troublesome”, “thoughtlessly”, and “teases”. This shows that the word “child” is not positively presented in the two texts I uploaded as samples.
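The window search can be sketched as follows. This version only counts raw frequencies on each side of the search term; AntConc can also rank collocates with a statistical measure, which is not reproduced here. The `collocates` function and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

def collocates(text, term, left=5, right=5):
    """Count the words within `left`/`right` tokens of each hit of `term`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    left_counts, right_counts = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            # Words immediately before the hit, up to `left` of them.
            left_counts.update(tokens[max(0, i - left):i])
            # Words immediately after the hit, up to `right` of them.
            right_counts.update(tokens[i + 1:i + 1 + right])
    return left_counts, right_counts

sample = ("The orphan child slept in the cradle. "
          "A wailing child teases the orphan child.")
left_hits, right_hits = collocates(sample, "child", left=2, right=2)
print(left_hits.most_common(3))
print(right_hits.most_common(3))
```

Keeping the left and right counts separate matters, because a word like “orphan” before “child” reads differently from a word that follows it.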

Keyword lists:

This tool takes a little more finesse. To generate a keyword list, both a target corpus and a reference corpus must be loaded. The tool compares the two and returns the words that are unusually frequent in the target corpus relative to the reference corpus. When studying literary texts, the first few keywords are generally the names and places specific to the text, followed by more interesting data.

Using a keyword list, one can see some of the lexical choices authors make in comparison to one another. It is also useful in political contexts, where one party might have unique buzzwords. For example, comparing Obama’s DNC speeches as the corpus samples to Romney’s RNC speeches as the reference corpus will return “change” as a keyword because it was part of Obama’s campaign.
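To make the comparison concrete, here is a sketch that scores words with log-likelihood, one commonly used keyness statistic. Whether this matches the measure selected in a given AntConc setup is an assumption on my part, and the two token lists are invented toy text, not excerpts from any real speech.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness score: a, b = word count in target and reference corpus;
    c, d = total tokens in target and reference corpus."""
    e1 = c * (a + b) / (c + d)  # expected count in target
    e2 = d * (a + b) / (c + d)  # expected count in reference
    score = 0.0
    if a:
        score += a * math.log(a / e1)
    if b:
        score += b * math.log(b / e2)
    return 2 * score

def keywords(target_tokens, reference_tokens):
    """Words relatively more frequent in the target, highest score first."""
    target, ref = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = [(w, log_likelihood(a, ref[w], c, d))
              for w, a in target.items()
              if a / c > ref[w] / d]  # keep only overused words
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented toy corpora, for illustration only.
target = "change we need change we can believe in".split()
reference = "we need jobs we need growth we can believe in".split()
print(keywords(target, reference)[:3])
```

A word that appears at the same rate in both corpora scores zero, which is why shared vocabulary like “we” drops out and a campaign word rises to the top.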

Overall, AntConc is good software to start with because it allows one to learn corpus tools without any risk of wasting money. It is not the most user-friendly program, but there are many resources available online that explain the software to beginners.

I always have a copy of AntConc on my computer for comparing small corpora. With things like song lyrics, it’s easy to clear formatting errors because they’re such small samples. For collocation and keyword lists, AntConc is a good choice.
