Question for the Audience
Audience and loyal readers of Varifrank:
I need your help with the following problem.
I'm reading and reviewing Zawahiri's most recent letter (see previous post), and it occurred to me to check this letter against previous letters to see how they compare. It then occurred to me that we might apply a little software magic to this problem.
The problem is this: I have two pieces of text, supposedly written by the same author. I want to show that both pieces of text were written by the same person by applying some form of statistical analysis to them. Writing style, commonly misspelled or misused words, and recurring phrases should all be detectable by software.
This falls in the realm of 'plagiarism detection', which is something I have had some experience with in regard to software but not raw text. It is my understanding that universities have recently begun applying it to term papers.
If you know of any such software, please drop me a note with the information. It will come in very handy in my task of taking apart this beast of a letter, and it will help test a couple of theories I have about the text.
Posted @ October 13, 2005 06:55 PM | Current Affairs
It would seem to me that you would have to compare the documents in the original language; otherwise you would be comparing the translators rather than the original authors.
Posted by: geoffb
at October 13, 2005 10:20 PM
I thought of that, but I figured that we should first determine if it's possible, then work out what to translate later.
There has to be some quantitative way to evaluate what's been written to determine authorship.
Posted by: varifrank
at October 13, 2005 11:48 PM
There are some good texts on Bayesian analysis of word usage to identify authorship. This was applied to the Federalist papers (among others).
A link to get you started is here:
http://reports-archive.adm.cs.cmu.edu/anon/usr/ftp/cald/CMU-CALD-04-106.pdf
Comparisons should be done in the original language.
In statistical hypothesis testing, the null hypothesis (H0) should be:
The two documents are equivalent.
and the alternative hypothesis (Ha) should be:
The two documents are different.
Set a confidence level (90% would be a good start). If you fail to reject the null hypothesis, then the two documents are statistically indistinguishable at that level of confidence. That is consistent with the null hypothesis but does not prove it.
Note that this will not provide the identity of the author. It shows only that the word usage is similar in both papers, which is a good indication that they are by the same author.
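As a rough illustration of the test described above, here is a minimal Python sketch. It assumes a chi-square test of homogeneity on the counts of a few marker words; that choice of test, the `FUNCTION_WORDS` list, and the function names are all my own assumptions, not something taken from the linked report. The list is English-only and would need replacing for text in the original language.

```python
import re
from collections import Counter

# Hypothetical marker list, for illustration only.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def chi_square_word_usage(text_a, text_b, words=FUNCTION_WORDS):
    """Return (chi-square statistic, degrees of freedom) for a 2 x k table
    of marker-word counts, with one extra column for all other words."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    row_a = [counts_a[w] for w in words]
    row_b = [counts_b[w] for w in words]
    row_a.append(sum(counts_a.values()) - sum(row_a))  # "other" column
    row_b.append(sum(counts_b.values()) - sum(row_b))
    total_a, total_b = sum(row_a), sum(row_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both texts must contain at least one word")
    grand = total_a + total_b
    stat, used_columns = 0.0, 0
    for a, b in zip(row_a, row_b):
        column = a + b
        if column == 0:
            continue  # word absent from both texts, carries no information
        used_columns += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, used_columns - 1
```

Compare the statistic against a chi-square table for the returned degrees of freedom at the chosen level. A large statistic argues that the word usage differs; a small one only means the test found no difference.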
There also may be some open source software that would do this (most likely in English).
Regards
Posted by: Yttrium
at October 14, 2005 01:59 PM
One of your problems is that even if the document is genuine, different translators will add variance, making it seem more like different authors. It seems like a long enough document to get good statistics. One approach you might try is to rank the common words by frequency in several of his known documents. Do the same thing for translations of another author writing in the same language. You should be able to get an intuitive feel just by visual inspection of the lists. If you want to get formal, there may be non-parametric tests for checking whether ranked lists come from the same population.
If you can inspect the document you'll do better. In English, punctuation and page layout can be very telling. The kind of thing you're looking for is an idiosyncratic standard that the writer maintains at a subconscious level, not something that a forger would think to emulate. Much better is to find a subconscious standard that the forger adheres to without realizing it. Some people capitalize a lot more than other people, but not so you would notice. You can look at word length frequencies too. Some people use a lot more two-letter words than others. Of course, with someone like Zarqawi you might want to use smaller words anyway.
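The ranked-list comparison could be sketched roughly as follows. I am assuming Spearman's rank correlation as the non-parametric measure (one option among several), and the function names and the cutoff `n` are made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def average_ranks(values):
    """Rank values with 1 = largest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_word_ranks(text_a, text_b, n=50):
    """Spearman correlation between the frequency ranks of the words that
    appear in either text's top-n list."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocab = sorted({w for w, _ in counts_a.most_common(n)}
                   | {w for w, _ in counts_b.most_common(n)})
    ranks_a = average_ranks([counts_a[w] for w in vocab])
    ranks_b = average_ranks([counts_b[w] for w in vocab])
    mean = (len(vocab) + 1) / 2  # mean of any complete set of ranks
    cov = sum((x - mean) * (y - mean) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean) ** 2 for x in ranks_a)
    var_b = sum((y - mean) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("ranks are constant; correlation is undefined")
    return cov / (var_a * var_b) ** 0.5
```

A value near 1 means the two texts order their common words much the same way. The number only means something next to a baseline, such as the same score computed against a different author translated from the same language.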
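Two of those habits, word length and capitalization, are easy to measure. Here is a minimal Python sketch; the function names are hypothetical and the tokenizer is a crude one that only handles unaccented Latin letters.

```python
import re
from collections import Counter

def words_of(text):
    """Split text into alphabetic tokens, keeping the original case."""
    return re.findall(r"[A-Za-z']+", text)

def word_length_profile(text):
    """Map each word length to the share of words having that length."""
    words = words_of(text)
    if not words:
        return {}
    lengths = Counter(len(w) for w in words)
    return {length: count / len(words) for length, count in lengths.items()}

def capitalization_rate(text):
    """Share of words that start with an uppercase letter. This includes
    sentence-initial words, so compare texts of similar sentence length."""
    words = words_of(text)
    if not words:
        return 0.0
    return sum(1 for w in words if w[0].isupper()) / len(words)
```

For example, `word_length_profile(text).get(2, 0.0)` gives the share of two-letter words, which can be compared across several documents known to be by the same writer.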
Posted by: jj
at October 20, 2005 12:41 PM


