Conclusions of Voynich Manuscript Analysis

2017-04-09

How I arrived at my conclusions

Visualizing data is extremely useful. To analyse the Voynich manuscript, I used principal component analysis, k-means clustering, and minimum spanning trees to extract useful data from a text which has baffled people, including NSA cryptographers, for over a century. I arrived at a conclusion I wasn't expecting, and one which many people who have put much time and effort into their own theories will be unhappy with: the text of the Voynich Manuscript is probably meaningless, and appears to have been generated randomly using state transition diagrams. This disappointed me. I had been looking forward to identifying the language, deciphering it, and then (because my knowledge of the languages used in the early 15th century is limited) passing it on to someone else to translate.

The Principal Component Analysis of the Voynich Manuscript Pages initially hinted that the text was meaningful, as similar page types tended to cluster together. I started to become suspicious after working on The Principal Component Analysis of the Voynich Manuscript Words, in which I found that certain words clustered together simply because they occurred frequently on certain pages, but in no particular order. Some of the words which clustered merely had the same suffix. This would be like one or two chapters in a book having significantly more words ending in "ing". I haven't noticed any word order, and as far as I know, nobody else has either.

Other researchers have noticed similar patterns. Prescott Currier spotted that certain words, and certain suffixes, were common on some pages and rare on others (Papers on the Voynich Manuscript). Marcelo A. Montemurro and Damián H. Zanette reported in Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis that certain words occur together across the text (see their Figure 2), which corresponds to my results in A Principal Component Analysis of the Voynich Manuscript Words. Their Figure 4 on linguistic and pictorial relationships between the sections of the Voynich manuscript corresponds to my results in A Principal Component Analysis of the Voynich Manuscript Pages. Only by investigating this further did I show that these results are also consistent with randomly generated text.

So I began examining the distribution of suffixes and prefixes, and showed that the core vocabulary of the manuscript appears to be constructed by concatenating certain prefixes and suffixes, and that the frequency of a word in the manuscript was often close to the product of the frequencies of its prefix and suffix. Then I noticed that the glyph sequences within prefixes and suffixes had a very simple grammar, and drew state transition diagrams for them. Others have made similar observations, e.g. John Tiltman and Mike Roe (Analysis Section ( 3/5 ) - Word structure). Then I found I could easily extend the state transition diagrams to handle 90% of the words in the text, and switched representation from diagrams, which soon became unwieldy, to tables. With more effort, and somewhat larger and more complex tables, this coverage could have been improved.
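The prefix and suffix observation can be illustrated with a small sketch. The corpus below is an invented toy example, not manuscript data, and the function name independence_table is my own. It compares each word's observed relative frequency with the product of the relative frequencies of its prefix and suffix, which is what you would expect if the two halves were chosen independently.

```python
from collections import Counter

def independence_table(pairs):
    """Map each word to (observed frequency, P(prefix) * P(suffix))."""
    total = len(pairs)
    word_counts = Counter(pairs)
    prefix_counts = Counter(prefix for prefix, _ in pairs)
    suffix_counts = Counter(suffix for _, suffix in pairs)
    table = {}
    for (prefix, suffix), n in word_counts.items():
        observed = n / total
        expected = (prefix_counts[prefix] / total) * (suffix_counts[suffix] / total)
        table[prefix + suffix] = (observed, expected)
    return table

# Invented toy corpus of (prefix, suffix) pairs; the counts are NOT from the manuscript.
pairs = (
    [("qo", "dy")] * 4
    + [("qo", "aiin")] * 2
    + [("ch", "dy")] * 2
    + [("ch", "aiin")] * 1
)

for word, (observed, expected) in independence_table(pairs).items():
    print(f"{word:8s} observed={observed:.3f} expected={expected:.3f}")
```

In this toy corpus the two columns agree exactly, because the counts were chosen that way. On real text the interesting question is how far the observed values drift from the expected ones.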

As different sections have different word frequencies, I divided the manuscript into its different parts (herbal, text, biological, etc.) and ran a cluster analysis on each. This was done by principal component analysis followed by k-means clustering and minimum spanning trees, which were used to make manual corrections to the clusters found by k-means.
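As a rough illustration of how a minimum spanning tree can help check clusters, here is a minimal sketch. It assumes the pages have already been reduced to 2D points (for example by principal component analysis), and the points are invented. It builds the tree with Prim's algorithm and then cuts the longest edge, which is one simple way to see where a cluster boundary might lie. This is not necessarily the procedure I followed by hand, and the function names are hypothetical.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, dist) edges."""
    # best[j] = (distance from vertex j to the tree, tree vertex achieving it)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((i, j, d))
        # Vertex j has joined the tree, so it may offer shorter links to the rest.
        for k in best:
            dk = math.dist(points[j], points[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges

def split_on_longest_edge(n, edges):
    """Remove the longest tree edge and return the two resulting groups of vertices."""
    longest = max(edges, key=lambda e: e[2])
    adjacent = {v: [] for v in range(n)}
    for edge in edges:
        if edge is not longest:
            i, j, _ = edge
            adjacent[i].append(j)
            adjacent[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adjacent[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return sorted(seen), sorted(set(range(n)) - seen)

# Invented 2D points standing in for pages projected onto two principal components.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = split_on_longest_edge(len(points), minimum_spanning_tree(points))
print(groups)
```

A page that k-means has assigned to one cluster but which hangs off another cluster by a short tree edge is a candidate for manual reassignment.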

In finite state machines, transitions between states are decided by testing whether certain conditions apply. In the Voynich Manuscript, these conditions are unknown, but the simplest assumption is that transitions between states are made randomly. Some transitions are followed more often than others, and this varies across the manuscript, so the transitions must be weighted, and the weights in different sections of the manuscript must be different. For each page cluster, I assigned different weights to the transitions, based on how often each transition is followed in that cluster.
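A minimal sketch of generating a word by following weighted transitions at random is given below. The table, its states, glyphs, and weights are all hypothetical and far smaller than the tables I actually used; generate_word is an illustrative name. Each page cluster would get its own copy of the table, with the same structure but different weights.

```python
import random

# Hypothetical table: state -> list of (glyphs emitted, next state, weight).
# The weights are made up; real ones would be counted per page cluster.
TABLE = {
    "start": [("q", "a", 3), ("ch", "a", 2), ("", "a", 5)],
    "a": [("o", "b", 6), ("e", "b", 4)],
    "b": [("dy", "end", 5), ("aiin", "end", 3), ("l", "end", 2)],
}

def generate_word(table, rng, start="start", end="end"):
    """Walk the table from start to end, picking each transition by weight."""
    state, glyphs = start, []
    while state != end:
        options = table[state]
        glyph, state = rng.choices(
            [(g, nxt) for g, nxt, _ in options],
            weights=[w for _, _, w in options],
        )[0]
        glyphs.append(glyph)
    return "".join(glyphs)

rng = random.Random(0)  # fixed seed so the run is repeatable
print(" ".join(generate_word(TABLE, rng) for _ in range(10)))
```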

The state transition tables were then used to generate a new manuscript. This had similar word frequencies, followed Zipf's Law approximately and in the same way as the real manuscript, and had a similar word length distribution.
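One simple way to compare a generated text with the real one against Zipf's Law is to fit a straight line to log frequency versus log rank. For an ideal Zipfian text the slope is close to -1. The sketch below is an illustration under that assumption, using a synthetic word list rather than manuscript data; zipf_slope is a hypothetical helper, not the code behind my plots.

```python
import math
from collections import Counter

def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic corpus: the word of rank r occurs 1000 // r times (not manuscript data).
synthetic = [f"w{r}" for r in range(1, 21) for _ in range(1000 // r)]
print(round(zipf_slope(synthetic), 2))
```

Running the same function on the real transcription and on the generated text gives two slopes that can be compared directly.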

There are still a few loose ends. It's known that the first glyphs of paragraphs and lines are distributed differently from the subsequent glyphs in paragraphs and lines, e.g. p and f occur more often at the beginning of paragraphs. This could be handled by forcing a paragraph or line break when those characters are generated. If a word with a particular property is required (e.g. a label), words could be discarded until a suitable one is generated.

What I can definitely conclude is that the text of the Voynich Manuscript is consistent with text generated using a state transition table and following weighted transitions randomly. The weights used in different parts of the manuscript vary, allowing the various sections of the manuscript to differ from one another.

Why the Voynich Manuscript text is unlikely to be ciphertext

Good cipher systems generate output which appears to be random. The Voynich Manuscript does not appear to be random, but is consistent with the input of the text generation process being random. In principle, random-looking ciphertext could have been used as that input. However, because the state transition probabilities are weighted, decryption would then be ambiguous, and extremely difficult if not impossible.

Why it's unlikely to be plaintext

Stephen Bax, in A proposed partial decoding of the Voynich Manuscript, is working on the assumption that the text is in an unknown language. He claims to have identified some of the glyphs with letters, and to have translated a few of the words, by matching their occurrences to images. This depends upon correct identification of herbs and constellations, but these are poorly drawn and difficult to identify. In a 240-page illustrated manuscript, there are bound to be some coincidental associations between words and images. Also, deciding upon an unknown language (as opposed to a known but unidentified one) means there's a further degree of freedom which makes coincidences a near certainty. Another big problem with the natural language hypothesis is that there is no discernible word order in the Voynich Manuscript.

If it's meaningless

How could such a method have been devised in the early 15th century? That's a matter for historians, but it involves a state transition table, which is just a table and a simple algorithm: choose a transition, go to the next state, and repeat. It requires a method for making random decisions. I suggest playing cards or tarot cards, both of which appeared in Europe around the time the manuscript was written, or shortly before. Shuffle the deck, turn over a card, use it to choose a transition (if the transition's probability is very low, as it sometimes is, it may require two cards), and repeat. The only problem is that this is slow and tedious.
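To make the card idea concrete, here is a minimal sketch of one way it could work. It assumes a 52-card deck and a hypothetical set of three transitions whose weights happen to divide the deck evenly. Each transition is allotted a block of cards in proportion to its weight, and the card turned over decides which transition is followed. This is speculation about a possible procedure, not a reconstruction of what any scribe actually did.

```python
import random

def card_ranges(weights, deck_size=52):
    """Allot each transition a block of card numbers proportional to its weight."""
    total = sum(weights.values())
    ranges, start = {}, 0
    for name, weight in weights.items():
        size = weight * deck_size // total
        ranges[name] = range(start, start + size)
        start += size
    return ranges

def choose_by_card(ranges, rng, deck_size=52):
    """Shuffle, then turn over cards until one falls in some transition's block."""
    deck = list(range(deck_size))
    rng.shuffle(deck)
    for card in deck:
        for name, block in ranges.items():
            if card in block:
                return name
    raise ValueError("no card is allotted to any transition")

# Hypothetical weights for three transitions out of one state.
ranges = card_ranges({"dy": 6, "aiin": 4, "l": 3})
print({name: len(block) for name, block in ranges.items()})
print(choose_by_card(ranges, random.Random(0)))
```

If the weights do not divide the deck evenly, a few cards are left unallotted and are simply skipped.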

How could the creator of the Voynich Manuscript have known their method would generate text whose properties resemble those of natural languages? This might have been an unintended consequence of the method. I've shown that the manuscript's agreement with Zipf's Law, for example, is only approximate, so it doesn't exactly match the properties of natural languages.

Why? Possibly to swindle a rich person out of lots of money, or to convince them its creator should be hired because of their secret knowledge. Or possibly its creator thought they were channeling some higher power.

A few others have also argued that the manuscript is meaningless, and have suggested alternative ways of generating it (see Hoaxing the Voynich Manuscript, part 7: Producing the text and How the Voynich Manuscript was created). I suspect that their methods generate text resembling the actual text to the extent that their methods fit the state machine model I have developed.

How to prove me wrong

If you can produce a meaningful plaintext (as it's in an unknown alphabet, it's at the very least a simple substitution cipher), along with a translation of several pages, and this is arrived at with only limited ambiguity, I'll be convinced you've solved it and that I'm wrong. Or if you produce a simpler method for generating meaningless text which matches the actual text better than or as well as my method, I'll conclude that it's likely that I'm wrong.

If I'm wrong

I hope the work I've done will still be helpful to other investigators.

If I'm right

There are still many unsolved mysteries associated with the Voynich Manuscript. These are still worth investigating.