A Principal Component Analysis of the Voynich Manuscript Glyphs

2017-04-09

Voynichese

The principal component analysis of the pages of the Voynich Manuscript suggested that the text was meaningful, and the small number of glyphs suggested that it was written in an alphabet or abjad. 15th-century cryptography wasn't particularly advanced. Common tricks available to cryptographers of the period were the replacement of each frequently occurring plaintext letter (usually a vowel) with several different ciphertext glyphs (homophones), the addition of nulls, and the replacement of common morphemes with a single glyph. Beyond these tricks, there is the question of which language the plaintext is written in.

I thought it might be worth checking whether these tricks would succumb to a principal component analysis attack. As I'll show below, each language has its own signature distribution of letters/glyphs on a PCA plot. Each letter is represented by a vector whose components are the probabilities of each possible following letter, and the PCA is carried out on these vectors.
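The construction described above can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the code used to produce the figures. The function names, the collapsing of every run of non-letters into a single *, and the centring of the columns before the SVD are all assumptions.

```python
import re
import numpy as np

def successor_vectors(text):
    """Return (symbols, P), where P[i, j] estimates the probability
    that symbols[j] immediately follows symbols[i]."""
    # Assumption: any run of spaces or punctuation becomes one '*'.
    s = re.sub(r"[^a-z]+", "*", text.lower())
    symbols = sorted(set(s))
    index = {c: i for i, c in enumerate(symbols)}
    counts = np.zeros((len(symbols), len(symbols)))
    for a, b in zip(s, s[1:]):
        counts[index[a], index[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # a symbol seen only as the final character
    return symbols, counts / row_sums

def first_two_components(P):
    """Project each symbol's successor vector onto the first two principal axes."""
    centred = P - P.mean(axis=0)
    U, S, _ = np.linalg.svd(centred, full_matrices=False)
    return U[:, :2] * S[:2]

symbols, P = successor_vectors("the cat sat on the mat. the hat")
scores = first_two_components(P)  # one (x, y) point per symbol, ready to plot
```

Each row of scores gives the plot position of the corresponding entry in symbols. The sign of each axis is arbitrary, so a plot may appear mirrored without any change of meaning.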

I expected homophones to lie close together on the plot, although a cryptographer could prevent this by choosing each homophone according to the letter that follows it. I also expected nulls, being inserted at random places, to leave the distribution of the other letters on the PCA plot undisturbed.
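To make the two tricks concrete, here is a small sketch of a homophonic substitution with randomly inserted nulls. The function name, the glyph labels (E1, A1, # and so on) and the null rate are invented for illustration. Synthetic ciphertext of this kind could be used to check how homophones and nulls behave on a PCA plot.

```python
import random

def encipher(plaintext, homophones, nulls, null_rate=0.1, seed=0):
    """Homophonic substitution with randomly inserted nulls (illustrative only)."""
    rng = random.Random(seed)
    out = []
    for letter in plaintext:
        if nulls and rng.random() < null_rate:
            out.append(rng.choice(nulls))  # meaningless padding glyph
        # Letters without homophones are passed through unchanged.
        out.append(rng.choice(homophones.get(letter, [letter])))
    return out

# Hypothetical key: only the common vowels e and a get several glyphs.
homophones = {"e": ["E1", "E2", "E3"], "a": ["A1", "A2"]}
cipher = encipher("a*sea*breeze", homophones, nulls=["#"], null_rate=0.2, seed=1)
```

Here the homophone is chosen at random. A cryptographer who chose it according to the following letter, as mentioned above, would give each homophone a different successor vector.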

To reduce clutter, I replaced endings such as iin with a single symbol iX. The resulting PCA plot is shown in Figure 1. In all of the plots below, * denotes a space between words or punctuation.

Figure 1: Plot of first two principal components for Voynich Manuscript glyphs
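Splitting a transcribed word into the glyphs of Table 1 might look something like the sketch below. The multi-character glyphs are taken from the table, but the greedy longest-match rule is an assumption. So is the rule that iX stands for a word-final run of i's followed by n, since the text gives only iin as an example.

```python
import re

# Multi-character glyphs listed in Table 1, longest first so they win over their prefixes.
MULTI = ["cth", "ckh", "cph", "cfh", "ch", "sh", "qo", "ck"]
GLYPH = re.compile("|".join(MULTI) + "|iX|[a-z]")

def tokenise_word(word):
    """Split one transcribed word into a list of glyph symbols."""
    # Assumption: 'iX' replaces word-final in, iin, iiin, ...
    word = re.sub(r"i+n$", "iX", word)
    return GLYPH.findall(word)

print(tokenise_word("daiin"))    # ['d', 'a', 'iX']
print(tokenise_word("qokeedy"))  # ['qo', 'k', 'e', 'e', 'd', 'y']
```

Counting the resulting symbols over the whole text and dividing by the total would give relative frequencies of the kind listed in Table 1.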

Glyph	Frequency
o	0.130177
e	0.129481
y	0.113900
a	0.092133
d	0.083695
ch	0.071018
l	0.067553
k	0.064508
iX	0.044134
r	0.043328
t	0.038670
qo	0.034128
sh	0.029044
s	0.018612
p	0.009116
m	0.006748
cth	0.006129
ckh	0.005858
f	0.002781
i	0.002039
cph	0.001400
h	0.001271
n	0.001013
q	0.000858
c	0.000839
g	0.000619
cfh	0.000477
x	0.000226
ck	0.000174
v	5.806303e-5
z	1.290289e-5

Table 1: Glyph Frequencies in the Voynich Manuscript

Latin

Texts by three Latin authors, Virgil, Caesar and Pliny, were analysed. The resulting PCA plots were very similar to one another, suggesting that such plots can be used to identify a language.

An initial PCA showed that k and z are outliers which distort the principal components found. As both are rare letters in Latin, I repeated the principal component analysis without them.
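One way to repeat the analysis without the outliers is sketched below. It assumes that "without them" means deleting the outliers' rows and columns from the matrix of successor counts and renormalising. Deleting the letters from the text itself would be another reasonable reading. The function names and the toy counts are illustrative.

```python
import numpy as np

def drop_symbols(symbols, counts, outliers):
    """Remove the rows and columns for the given symbols, then renormalise each row."""
    keep = [i for i, s in enumerate(symbols) if s not in outliers]
    kept = counts[np.ix_(keep, keep)].astype(float)
    row_sums = kept.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid dividing by zero for emptied rows
    return [symbols[i] for i in keep], kept / row_sums

def pca_scores(P, k=2):
    """Scores of each row of P on the first k principal components."""
    centred = P - P.mean(axis=0)
    U, S, _ = np.linalg.svd(centred, full_matrices=False)
    return U[:, :k] * S[:k]

# Toy successor counts: rows are the current letter, columns the following letter.
symbols = ["a", "k", "t"]
counts = np.array([[2, 1, 1],
                   [1, 0, 0],
                   [3, 0, 1]])
kept_symbols, P = drop_symbols(symbols, counts, {"k"})
scores = pca_scores(P)
```

Because a rare letter's successor vector is estimated from very few occurrences, it can sit far from the other points and dominate a principal axis, which is why removing it changes the plot.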

Figure 2: Plot of first two principal components for Virgil's Aeneid, without outliers k and z

Glyph	Frequency
e	0.121002
a	0.097431
i	0.097350
t	0.083119
u	0.080970
s	0.075550
r	0.070295
n	0.060642
m	0.051480
o	0.050405
c	0.039646
l	0.032904
p	0.026484
d	0.025649
q	0.018154
v	0.015316
b	0.014030
g	0.013167
f	0.010746
h	0.008801
x	0.004818
y	0.001921
z	9.521685e-5
k	2.720481e-5

Table 2: Letter Frequencies in Virgil's Aeneid

Figure 3: Plot of first two principal components for Caesar's De Bello Gallico, Libri V-VIII, without outliers

Figure 4: Plot of first two principal components for Pliny's Naturalis Historia, Libri II, VII, XXV-XXVII, XXXIII, without outliers

After removing the outliers from the Latin texts, the plots of the first two principal components of the Roman letters are all remarkably similar.

English

Next, texts of two authors, Charles Dickens and Jane Austen, were analysed. Again, their resulting PCA plots were very similar.

Figure 5: Plot of first two principal components for Dickens's Oliver Twist

Glyph	Frequency
e	0.125144
t	0.088043
a	0.078001
o	0.076936
i	0.068992
n	0.067815
h	0.066219
s	0.059808
r	0.059404
d	0.047431
l	0.040801
u	0.027248
m	0.026382
w	0.024629
c	0.023227
g	0.021794
f	0.021060
y	0.020697
p	0.017588
b	0.016253
v	0.009536
k	0.009025
j	0.001374
x	0.001357
q	0.001003
z	2.328299e-4

Table 3: Letter Frequencies in Dickens's Oliver Twist

Figure 6: Plot of first two principal components for Austen's Mansfield Park

The plots of both modern English authors are also remarkably similar.

German

Similar results were obtained for two German authors, Johann Wolfgang von Goethe and Hermann Heiberg.

c is an outlier, but it isn't rare in German, so it cannot be omitted. However, as it is almost invariably followed by h, it can be replaced by the digraph ch. As my displays don't show non-ASCII characters, umlauts are replaced by colons, and ß is displayed as ss. This is done in the following plot:

Figure 7: Plot of first two principal components for Goethe's Die Wahlverwandtschaften - Kapitel 2, without outliers
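The substitutions used for Figure 7 can be sketched roughly as follows. The function name is illustrative. Writing an umlaut as the base vowel followed by a colon (for example u: for ü) is an assumption about how the colons are placed, and a c that is not followed by h is simply kept as c.

```python
import re

# Assumed ASCII display labels for the non-ASCII letters.
DISPLAY = {"ä": "a:", "ö": "o:", "ü": "u:", "ß": "ss"}
TOKEN = re.compile(r"ch|[a-zäöüß]")  # 'ch' is tried first, so it becomes one symbol

def german_symbols(text):
    """Turn German text into a list of plot symbols, with '*' after each word."""
    out = []
    for word in re.split(r"[^a-zäöüß]+", text.lower()):
        if word:
            out.extend(DISPLAY.get(t, t) for t in TOKEN.findall(word))
            out.append("*")
    return out

print(german_symbols("Glück, ach!"))
# ['g', 'l', 'u:', 'c', 'k', '*', 'a', 'ch', '*']
```

The relabelling affects only how the symbols are displayed. Each of ä, ö, ü and ß is still counted as a single letter, as in Table 4.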

Glyph	Frequency
e	0.173716
n	0.106739
i	0.079999
r	0.065936
s	0.060876
t	0.057965
h	0.054011
a	0.052714
d	0.048093
u	0.039504
l	0.038427
c	0.034150
g	0.032500
m	0.025754
o	0.020510
b	0.018573
w	0.018210
f	0.015596
z	0.012606
k	0.010194
v	0.008176
ü	0.006906
ä	0.005588
ß	0.004425
p	0.003747
ö	0.002772
j	0.001987
q	2.029754e-4
y	8.835400e-5
x	3.581919e-5

Table 4: Letter Frequencies in Goethe's Die Wahlverwandtschaften - Kapitel 2

Figure 8: Plot of first two principal components for Heiberg's Charaktere und Schicksale, without outliers

The plots of both modern German authors are also remarkably similar.

Conclusion

Because languages can be identified by their PCA letter plots, and the arrangement of points on a plot is unaffected by simple substitution encipherment (only the labels change), both the language and the substitutions can in principle be identified by this method. However, it was not possible to identify the language of the Voynich Manuscript, whose plot, with its narrow bands of glyphs, looked very different from that of any of the languages I examined (which, in addition to the languages shown here, also included Italian, French and Polish).
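The claim that simple substitution leaves the plot unchanged can be checked numerically. The sketch below enciphers a short sample with a random substitution key and compares each plaintext letter's PCA scores with those of its cipher equivalent. The sample sentence, the key and the function name are all illustrative, and word breaks (*) are assumed to be left unenciphered.

```python
import random
import re
import string
import numpy as np

def symbol_scores(seq, k=2):
    """Map each symbol to the magnitudes of its first k PCA scores."""
    symbols = sorted(set(seq))
    index = {s: i for i, s in enumerate(symbols)}
    counts = np.zeros((len(symbols), len(symbols)))
    for a, b in zip(seq, seq[1:]):
        counts[index[a], index[b]] += 1
    sums = counts.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    P = counts / sums
    U, S, _ = np.linalg.svd(P - P.mean(axis=0), full_matrices=False)
    # The sign of each component is arbitrary, so compare magnitudes only.
    return {s: np.abs(U[index[s], :k] * S[:k]) for s in symbols}

sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
plain = re.sub(r"[^a-z]+", "*", sample)

# Random simple substitution key; '*' is left as it is.
letters = list(string.ascii_lowercase)
shuffled = letters[:]
random.Random(1).shuffle(shuffled)
key = dict(zip(letters, shuffled))
key["*"] = "*"
cipher = "".join(key[c] for c in plain)

plain_scores = symbol_scores(plain)
cipher_scores = symbol_scores(cipher)
same = all(np.allclose(plain_scores[c], cipher_scores[key[c]], atol=1e-8)
           for c in plain_scores)
print(same)
```

A substitution only permutes the rows and columns of the successor matrix, so the points keep their positions (up to a mirror image of each axis) and only their labels change. Matching the labelled points of a known language against the unlabelled pattern of a ciphertext is what would, in principle, reveal the key.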