Getting started

Corpus linguistics has a long and rich history, along with a great number of misconceptions. The term itself first appears in the 1980s, but early analog methods of corpus linguistics have been around much longer. In essence, corpus linguistics is a means of studying a sample of words using a computer. For example, BYU's Corpus of Contemporary American English (COCA) compiles a massive amount of text from media outlets such as Scientific American or The New York Times. From that sample, we can extract every instance of a word being used in the compiled media and draw conclusions about that word's representation (or lack thereof).
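To make "extract every instance of a word" concrete, here is a minimal keyword-in-context (KWIC) sketch. It is only an illustration: the function name `concordance`, the `width` parameter, and the sample sentence are my own placeholders, and real tools such as the COCA web interface do far more than this.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` in `text`."""
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

# Placeholder text, not taken from any real corpus.
sample = "The corpus is large. A corpus can be small. Corpus design matters."
hits = concordance(sample, "corpus", width=10)
for line in hits:
    print(line)
```

Reading the hits line by line is where the qualitative part of the work starts: the computer finds the instances, but deciding what each one means is still up to the researcher.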
The beauty of corpus linguistics, as I mentioned in my previous post, is that it provides empirical data that is more objective than other means of study in linguistics. This allows for stronger arguments backed by more persuasive evidence. For example, it can be argued that minorities are portrayed negatively in media. It is far more effective to be able to say that out of the 800 times that black people were referenced in The New York Times last year, X of those references were negative, while out of the 800 times white people were referenced, Y of those were negative. If one number is significantly higher than the other, we have a decent argument.
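One way to check whether one count is "significantly higher" than another is a two-proportion z-test. The sketch below is my own choice of test, not a standard the post depends on, and the counts in it are invented placeholders for two arbitrary groups, not real figures from any newspaper.

```python
import math

def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    # Pooled proportion under the assumption that both groups share one rate.
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Both groups are all hits or all misses; there is no difference to test.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented placeholder counts: 240 of 800 versus 160 of 800 negative references.
z, p_value = two_proportion_z(240, 800, 160, 800)
print(f"z = {z:.2f}, p = {p_value:.6f}")
```

A small p-value only says the difference is unlikely to be chance. It says nothing about whether the references were coded as "negative" in a fair and consistent way, which is a separate problem.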
A corpus is only as useful as it was created to be. If I build a corpus of heavy metal lyrics, I can uncover commonly used words in metal, how they're related to each other, etc. However, I cannot use a corpus of heavy metal lyrics to draw conclusions about other types of music. In the same breath, we must be ethical in how we study our data, and be careful not to misrepresent results. It may very well be that the word "death" appears often in metal, but that doesn't address whether it appears equally often in other types of music. We can't say that metal is particularly obsessed with death without doing further research.
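Comparing a word across genres means comparing rates rather than raw counts, since the corpora will rarely be the same size. Here is a rough sketch of a per-thousand-words rate; the tokenizer is deliberately crude, and the two "lyrics" strings are made-up placeholders rather than real songs.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_thousand(text, word):
    """Occurrences of `word` per 1,000 tokens of `text`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Made-up placeholder lines standing in for two genre corpora.
metal = "death comes and death stays"
pop = "love me love you and death too"

metal_rate = per_thousand(metal, "death")
pop_rate = per_thousand(pop, "death")
print(f"metal: {metal_rate:.1f} per 1,000, pop: {pop_rate:.1f} per 1,000")
```

Only with both rates in hand can we start to ask whether metal uses "death" more than other genres, and even then the answer holds only for the songs that were actually sampled.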
In addition to being representative, a good corpus must be balanced. I was playing with some romantic metal songs recently. Because of the nature of that type of music, there aren't a whole lot of them. Of the three that I had, I noticed that none of them makes reference to things we would consider romantic. None of them includes the word "love" or references to a partner. It would be very difficult to compare this to pop, because there are millions of romantic pop songs. I can't compare three songs to hundreds of other ones. Because of this, I'm looking for more romantic metal songs, a fun and rather difficult task.
I have a handful of corpus projects that I'm currently working on. I am looking at a previously created heavy metal corpus to help me build a few of my own. In particular, I'm looking at comparing the lyrics of Stone Sour with the lyrics of Slipknot. Both bands have the same lead singer, but Slipknot is known for being much heavier. I'm curious if the lyrics are also more aggressive.
In addition, I'm looking at studying Frankenstein for collocations of the word "eyes", which is the most commonly used unique word in the text.
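A collocation search boils down to counting which words show up within a few tokens of the node word. The sketch below assumes a simple fixed window and a crude tokenizer; the function name `collocates`, the `window` size, and the sample sentence are all placeholders of mine, and the sentence is not a quotation from Frankenstein.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words appearing within `window` tokens of each `node` occurrence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented sentence for illustration only.
sample = "her bright eyes met his tired eyes across the room"
counts = collocates(sample, "eyes", window=2)
print(counts.most_common(3))
```

On a full novel the top of this list will be dominated by function words like "the" and "his", so real collocation work usually filters those out or ranks pairs with an association measure such as mutual information instead of raw counts.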
I'm also playing with comparing gothic texts to fairy tales, because the word frequency results for two gothic texts are very similar to the results for two of Grimm's fairy tales. It might explain why the gothic has an implied relation to fantasy.
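One possible way to put a number on "very similar" frequency results is cosine similarity between relative-frequency profiles. This is an assumption on my part about how the comparison could be done, not a description of the method used above, and the function names and toy strings are placeholders.

```python
import math
import re
from collections import Counter

def profile(text):
    """Map each word in `text` to its relative frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}

def cosine(p, q):
    """Cosine similarity between two frequency profiles (0 = no shared words)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Toy strings standing in for whole texts.
same = cosine(profile("the dark wood"), profile("the dark wood"))
none = cosine(profile("castle"), profile("forest"))
print(same, none)
```

Because very common words dominate any frequency profile, two English texts will often score high on this measure no matter their genre, so a high score on its own is weak evidence of a special relationship.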

Comments

  1. You should check out Tognini-Bonelli's "Theoretical overview and evolution of corpus linguistics" for a historical view of CL. I disagree that the term only appears in the 80s, as there was indeed a key development in the 60s that should earn the distinction.

    "studying a sample of words..." What type of words? from where? I think "authentic collection of language" is more accurate.

    "A corpus is only as useful..." Yes, I agree with this statement. While John Sinclair may have criticized small corpora, he also stated that if created on principle to answer a particular research question, then size is irrelevant. Saying all corpora must be at least 250k words is arbitrary and not based on data.

    Yes, interpretation of results must be done with caution. It's inaccurate to say that corp ling is a solely quantitative endeavor...lots of qualitative work being done.

  2. Frankenstein: I wonder what other body parts are frequently present or absent. You could think more broadly about the body and its discursive representation in the text.

