AntConc
AntConc: An introduction
My first work with AntConc was in a class on world English
varieties. As part of a major project, I was required to download one of the
ICE corpora for a variety of English. Being a sucker for punishment, I opted
for Canadian English because there isn’t much difference outside of accent and
inflection.
This gave me a good introduction to how the software works, because
I really had to search to find differences. Around that time I was
getting interested in linguistics, so I started playing around with the
software for fun.
Because it is freeware, AntConc is great for classes: it
provides an overview of general corpus tools at no extra cost
to students. However, it is also less user-friendly and harder to use without
guidance from a more experienced user.
Preparing corpus samples can be difficult, because AntConc
can only read .txt documents. Converting to .txt strips all formatting and
can introduce errors. When uploading books, I found that asterisks and other
formatting marks were turned into nonsensical symbols that I had to compare
against the original to decipher. For example, “Jane’s” becomes “Jane\x92s”,
which can cause problems when searching.
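The `\x92` in that example is consistent with a file saved in the Windows-1252 encoding, where that byte stands for a curly apostrophe. Assuming that is what happened here (the post does not confirm the encoding), a minimal sketch of decoding such a file before loading it into AntConc might look like this. The function names and file paths are hypothetical.

```python
from pathlib import Path


def decode_cp1252(raw: bytes) -> str:
    """Decode bytes that were saved as Windows-1252 into a Python string."""
    # errors="replace" swaps the few byte values cp1252 leaves undefined
    # for U+FFFD instead of raising an exception.
    return raw.decode("cp1252", errors="replace")


def convert_file(src: str, dst: str) -> None:
    """Rewrite a Windows-1252 .txt file as UTF-8 (both paths are hypothetical)."""
    text = decode_cp1252(Path(src).read_bytes())
    Path(dst).write_text(text, encoding="utf-8")


# Byte 0x92 is the right single quotation mark (U+2019) in Windows-1252.
print(decode_cp1252(b"Jane\x92s") == "Jane\u2019s")  # True
```

If the file was actually saved in some other encoding, the decoded text will still look wrong, so it is worth spot-checking a few known words after converting.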
However, the tools available are useful. AntConc can generate
concordances, concordance plots, a file view, clusters or N-grams, collocates,
and a word list, and it can compare keywords between a corpus and a reference
corpus. Currently, I have the Gutenberg editions of Jane Eyre and Wuthering
Heights uploaded as corpus samples.
Word List:
When uploading a new corpus file, I was taught to always run
the word list first, as this is also the first step for using the other
features. The word list tool counts how often each word occurs in the text and
ranks the words by frequency. From there, the researcher must remove formatting
marks and ignore common function words before getting to useful data. For
example, there are 18,659 instances of “X”, but this is because \x97 replaced
some of the formatting when the text was converted to .txt format in Microsoft
Word. That’s my human error.
This table shows the top 5 hits and their frequency:
Rank | Frequency | Word
1 | 8058 | X
2 | 4748 | And
3 | 4572 | The
4 | 4188 | Xa
5 | 4129 | i
As stated above, these don’t tell us much because they’re
just articles, conjunctions, pronouns, and formatting codes. It isn’t until
I scroll down into the hundreds that I start getting distinctive vocabulary
that could be of interest. With Wuthering Heights plugged in, it’s no surprise
that the common distinctive words are related to the home.
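The idea behind a word list can be sketched in a few lines. This is only an illustration of frequency counting with an optional stop list, not AntConc's own implementation. The tokenizing rule and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def word_list(text, stopwords=frozenset()):
    """Count lowercase word tokens, skipping anything in the stop list."""
    # Assumed tokenizer: runs of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


sample = "The child and the cradle and the moor"  # made-up sample text
for rank, (word, freq) in enumerate(word_list(sample).most_common(3), start=1):
    print(rank, freq, word)

# Passing a stop list pushes the function words out of the way.
print(word_list(sample, stopwords={"the", "and"}).most_common())
```

Filtering with a stop list does in code what scrolling past the first few hundred rows does by hand.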
Concordance:
Once the word list is generated, concordances and concordance
plots can be generated. This involves searching for a specific term, which
causes AntConc to return every instance of that term’s use within the corpus
samples. I searched for “Jane” because one of the samples is Jane Eyre. Having
every instance in order of appearance shows that Jane is initially used to
refer to the character and is almost always preceded by “Miss”. That might lead
me to do a collocate search. Of interest is the 11th hit, the first
unique one, where Jane is preceded by “nasty” instead.
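A concordance is often called a keyword-in-context (KWIC) view. The following is a rough sketch of the idea, assuming a simple whole-word, case-insensitive match and a fixed number of context characters on each side. The function name and sample text are invented for illustration.

```python
import re


def concordance(text, term, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = r"\b" + re.escape(term) + r"\b"
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


sample = "Miss Jane was here. That nasty Jane Eyre!"  # made-up sample text
for left, hit, right in concordance(sample, "Jane", width=12):
    # Right-align the left context so the search term lines up in a column.
    print(f"{left:>12} [{hit}] {right}")
```

Lining the hits up in a column is what makes a pattern like “Miss” before “Jane” easy to spot by eye.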
Concordance plots show a graph of every instance of the word’s
use in each text. Because there aren’t any characters named Jane in Wuthering
Heights, there is no concordance plot for that sample. The concordance plot for
Jane Eyre shows that there is a long stretch of text where Jane is not
mentioned very often followed by a section where the name is mentioned very
often. This might cause me to mark off the sections where the name is being
used more often or less often and compare to what is going on in the text at
that time.
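The information a concordance plot conveys is where in the text each hit falls. A minimal sketch of that idea, assuming positions are measured as a fraction of the token count and drawn on a fixed-width text strip, could look like this. The names, tokenizer, and sample text are hypothetical.

```python
import re


def hit_positions(text, term):
    """Relative position (0.0 to 1.0) of every token equal to the term."""
    tokens = re.findall(r"[A-Za-z']+", text)
    total = len(tokens)
    return [i / total for i, tok in enumerate(tokens) if tok.lower() == term.lower()]


def plot_strip(positions, width=40):
    """Draw one '|' per hit on a strip of dots representing the whole text."""
    cells = ["."] * width
    for p in positions:
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)


sample = "Jane a b c d e f g h Jane"  # made-up sample text
print(plot_strip(hit_positions(sample, "jane"), width=10))
```

Dense runs of bars and long runs of dots correspond to the stretches where the name is used often or rarely.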
File View:
This tool just lets me see the text as it normally looks and
search for terms this way. This is more useful for looking at erasure. If I had
instead uploaded The Great Gatsby and searched for terms relating to African
Americans, I would have found only 5 instances. This view would allow me to see
the context of those instances.
Clusters and N-grams:
N-grams are contiguous sequences of n items in a text, usually
words. They’re commonly used in searching through biological data because one
can look for a piece of a protein and return every instance of that piece in a
larger sequence. If I uploaded a
biology text and searched for “oxide” as an N-gram, AntConc would return every
instance of “oxide” in order with its cluster. So, “carbon monoxide” might be
more common than “carbon dioxide” in a text about smoke inhalation. Due to the
formatting errors in my corpus samples, most of my N-grams are also formatting
errors. With those errors fixed, I could use it to search for morpheme roots
and return every form of a specific term.
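To make the idea concrete, here is a small sketch of word N-grams and of clusters, taken here to mean N-grams that contain a given search word. That reading of "cluster", the function names, and the sample tokens are assumptions for illustration rather than a description of AntConc's internals.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clusters(tokens, term, n):
    """Count only the n-grams that contain the search term."""
    return Counter(g for g in ngrams(tokens, n) if term in g)


# Made-up sample tokens.
tokens = "carbon monoxide and carbon dioxide and carbon monoxide".split()
for gram, freq in clusters(tokens, "carbon", 2).most_common():
    print(freq, " ".join(gram))
```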
Collocates:
Collocation is one of my favorite tools in corpus
linguistics. AntConc allows me to search for a term and limit the collocates
to a chosen number of words to the left and to the right. Normally, I use this
for newspapers in the BYU COCA: I search for the AP standard term for a race
and look at the 3 words preceding it. This usually shows empirically that some
races are portrayed more negatively in the media than others.
AntConc is a little less user friendly than the BYU COCA,
but the collocation tool works the same. Because of the nature of the two
texts, I searched for the term “child” with a window of 5 collocates to the
left and 5 to the right. The returning collocates are largely negative, with “orphan”
and “wailing” being the two most common. “cradle” is found commonly to the
right, followed by a formatting mark and then the words “tyrannically”, “troublesome”,
“thoughtlessly”, and “teases”. This shows that the word “child” is not positively
presented in the two texts I uploaded as samples.
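The windowing step can be sketched as follows. This version only counts raw co-occurrences inside the window. Corpus tools usually rank collocates with an association statistic as well, which is left out here. The function name and sample tokens are made up.

```python
from collections import Counter


def collocates(tokens, node, left=5, right=5):
    """Count the words found within the window around each hit of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Slice up to `left` tokens before the hit and `right` tokens after it.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


tokens = "the wailing child in the cradle".split()  # made-up sample tokens
print(collocates(tokens, "child", left=1, right=1))
print(collocates(tokens, "child").most_common(1))
```

Narrowing the window to one word on each side keeps only the immediate neighbors, while a window of 5 pulls in looser associations such as “cradle”.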
Keyword lists:
This tool takes a little more finesse. In order to generate
a keyword list, both a target corpus and a reference corpus must be uploaded.
The tool compares the two and returns the words that are unusually frequent in
the target corpus relative to the reference corpus. When studying literary
texts, the first few keywords are generally the names and places specific to
the text, followed by more interesting data.
Using a keyword list, one can see some of the lexical
choices authors make in comparison to one another. It is also useful in
political contexts where one party might have unique buzzwords. For example,
comparing Obama’s DNC speeches as the corpus samples to Romney’s RNC speeches
as the reference corpus will return “change” as a keyword because it was part
of Obama’s campaign.
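One common way to score keywords is the log-likelihood statistic, which compares a word's observed frequency in each corpus with the frequency expected if both corpora used the word at the same rate. The sketch below assumes that measure. Whether it matches the settings used for the results above is not stated in this post, and the token lists are invented toy data, not real speech text.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c target tokens and b times in d reference tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(target_tokens, reference_tokens):
    """Rank words that are relatively more frequent in the target corpus."""
    target, ref = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scores = {}
    for word, a in target.items():
        b = ref[word]
        if a / c > b / d:  # keep only words overused in the target corpus
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data for illustration only.
target = "change change change we hope".split()
reference = "we hope we believe jobs".split()
print(keywords(target, reference))
```

Because the score grows with corpus size, very small samples like this toy pair produce low values, so real comparisons need far more text.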
Overall, AntConc is good software to start with because it
allows one to learn corpus software without any risk of wasting money. It is
naturally less user-friendly, but there are many resources available online
that explain the software to beginners.
I always have a copy of AntConc on my computer for comparing
small corpora. With things like song lyrics, it’s easy to clear formatting
errors because they’re such small samples. For collocation and keyword lists,
AntConc is good software.