A Principal Component Analysis of the Voynich Manuscript Glyphs

2017-04-09

Voynichese

The principal component analysis of the pages of the Voynich Manuscript suggested that the text was meaningful, and the small number of glyphs suggested that it was written in an alphabet or abjad. Fifteenth-century cryptography wasn't particularly advanced. The common tricks available to cryptographers of the period were replacing each frequently occurring plaintext letter (usually a vowel) with one of several different ciphertext glyphs (homophones), adding nulls, and replacing common morphemes with a single glyph. In addition to these, there is the question of which language the plaintext is written in.

I thought it might be worth checking whether these tricks would succumb to a principal component analysis attack. Each letter is represented by a vector whose components are the probabilities of each possible following letter, and the plot shows the first two principal components of these vectors. As I'll show below, each language has its own signature distribution of letters/glyphs on such a PCA plot.
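The letter vectors described above can be sketched in a few lines. This is not the code behind the plots in this article; it is a minimal illustration that assumes a centred, unscaled PCA computed with NumPy's SVD, and it treats every non-letter as a word break. The function names and the sample sentence are hypothetical.

```python
import numpy as np

def following_letter_matrix(text):
    """Return (glyphs, P), where P[i, j] estimates the probability
    that glyphs[j] immediately follows glyphs[i]."""
    # Assumption: every non-letter (space, punctuation) is a word break '*'.
    symbols = [c if c.isalpha() else "*" for c in text.lower()]
    # Collapse runs of breaks so that '*' never follows '*'.
    symbols = [s for i, s in enumerate(symbols)
               if s != "*" or i == 0 or symbols[i - 1] != "*"]
    glyphs = sorted(set(symbols))
    index = {g: i for i, g in enumerate(glyphs)}
    counts = np.zeros((len(glyphs), len(glyphs)))
    for a, b in zip(symbols, symbols[1:]):
        counts[index[a], index[b]] += 1
    # Normalise each row; a glyph with no successor keeps an all-zero row.
    totals = counts.sum(axis=1, keepdims=True)
    return glyphs, counts / np.maximum(totals, 1)

def first_two_components(P):
    """Project the rows of P onto their first two principal components."""
    centred = P - P.mean(axis=0)  # assumption: centred, unscaled PCA
    U, S, _ = np.linalg.svd(centred, full_matrices=False)
    return U[:, :2] * S[:2]

glyphs, P = following_letter_matrix("arma virumque cano, troiae qui primus ab oris")
coords = first_two_components(P)
for g, (x, y) in zip(glyphs, coords):
    print(f"{g}\t{x:+.3f}\t{y:+.3f}")
```

Each printed row gives one glyph and its position on the plot. A single sentence is far too short to give a meaningful picture; the plots here are based on whole books.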

I expected homophones to appear close together on the plot, although the encipherer could avoid this by making the choice of homophone depend on the following letter. I also expected nulls, inserted at random places, to leave the distribution of the other letters on the PCA plot largely undisturbed.
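To make the two tricks concrete, here is a toy encipherment with homophones and nulls. The homophone table, the null glyph and the null rate are all hypothetical; homophones are picked at random, which is the case where they would be expected to land close together on the plot.

```python
import random

# Hypothetical cipher alphabet: digits stand in for invented glyphs.
HOMOPHONES = {"e": ["1", "2", "3"], "a": ["4", "5"]}
NULLS = ["0"]

def encipher(plaintext, null_rate=0.05, seed=0):
    """Replace frequent letters by a randomly chosen homophone and
    sprinkle nulls at random positions."""
    rng = random.Random(seed)
    out = []
    for ch in plaintext:
        out.append(rng.choice(HOMOPHONES.get(ch, [ch])))
        if rng.random() < null_rate:
            out.append(rng.choice(NULLS))
    return "".join(out)

def decipher(ciphertext):
    """Strip nulls and map every homophone back to its plaintext letter."""
    reverse = {g: p for p, gs in HOMOPHONES.items() for g in gs}
    return "".join(reverse.get(c, c) for c in ciphertext if c not in NULLS)

message = "gallia est omnis divisa in partes tres"
secret = encipher(message)
print(secret)
print(decipher(secret))
```

Because each homophone of e is followed by the same letters as e itself, up to sampling noise, their following-letter vectors should be similar.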

To reduce clutter, I replaced endings such as iin with a single symbol iX. The resulting PCA plot is shown in Figure 1. In all of the plots below, * denotes a space between words, or punctuation.
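Splitting a transcribed word into glyphs might look like the sketch below. The multi-character glyphs are those that appear in Table 1; the greedy longest-match rule is an assumption, and only the ending iin is named in the text, so the table of endings contains just that one entry.

```python
# Multi-character glyphs listed in Table 1, longest first so that
# 'ckh' is tried before 'ck' and 'ch'.
MULTI_GLYPHS = ["cth", "ckh", "cph", "cfh", "ch", "sh", "qo", "ck"]
# Only 'iin' is named in the text; other endings would be added here.
ENDINGS = {"iin": "iX"}

def tokenise_word(word):
    """Split one word into glyph tokens, replacing a known ending."""
    tail = None
    for ending, symbol in ENDINGS.items():
        if word.endswith(ending):
            word, tail = word[: -len(ending)], symbol
            break
    tokens, i = [], 0
    while i < len(word):
        for glyph in MULTI_GLYPHS:
            if word.startswith(glyph, i):
                tokens.append(glyph)
                i += len(glyph)
                break
        else:  # no multi-character glyph starts here
            tokens.append(word[i])
            i += 1
    if tail:
        tokens.append(tail)
    return tokens

def tokenise(text):
    """Tokenise whitespace-separated words, marking each word break with '*'."""
    out = []
    for word in text.split():
        out.extend(tokenise_word(word))
        out.append("*")
    return out

print(tokenise("shol daiin"))
```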

Voynich Glyphs PCA

Figure 1: Plot of first two principal components for Voynich Manuscript glyphs

Glyph   Frequency
o       0.130177
e       0.129481
y       0.113900
a       0.092133
d       0.083695
ch      0.071018
l       0.067553
k       0.064508
iX      0.044134
r       0.043328
t       0.038670
qo      0.034128
sh      0.029044
s       0.018612
p       0.009116
m       0.006748
cth     0.006129
ckh     0.005858
f       0.002781
i       0.002039
cph     0.001400
h       0.001271
n       0.001013
q       0.000858
c       0.000839
g       0.000619
cfh     0.000477
x       0.000226
ck      0.000174
v       5.806303e-5
z       1.290289e-5

Table 1: Glyph frequencies in the Voynich Manuscript
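The frequencies in Table 1 sum to 1 to within rounding, which suggests that word breaks are not counted. Under that assumption, a table of this kind could be produced from a list of glyph tokens as follows (the function name and the sample tokens are hypothetical):

```python
from collections import Counter

def glyph_frequencies(tokens):
    """Relative frequency of each glyph, most frequent first,
    ignoring the word-break marker '*'."""
    counts = Counter(t for t in tokens if t != "*")
    total = sum(counts.values())
    return {glyph: n / total for glyph, n in counts.most_common()}

sample = ["d", "a", "iX", "*", "ch", "o", "l", "*", "d", "a", "iX", "*"]
for glyph, freq in glyph_frequencies(sample).items():
    print(f"{glyph}\t{freq:.6f}")
```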

Latin

Texts by three Latin authors, Virgil, Caesar and Pliny, were analysed. The resulting PCA plots were very similar, suggesting that such plots can be used to identify languages.

An initial PCA showed that k and z are outliers which affect the discovered principal components. As they are rare letters in Latin, it was decided to repeat the principal component analysis without them.
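One way to repeat the analysis without the outliers is to drop them from the following-letter counts, both as preceding and as following letters, and renormalise before running the PCA again. Whether the plots here were produced this way or by deleting the letters from the text is not stated, so treat the sketch below as an assumption; the helper names and the sample string are hypothetical.

```python
from collections import Counter, defaultdict

def bigram_counts(symbols):
    """counts[a][b] = number of times glyph b immediately follows glyph a."""
    counts = defaultdict(Counter)
    for a, b in zip(symbols, symbols[1:]):
        counts[a][b] += 1
    return counts

def without_glyphs(counts, excluded):
    """Drop the excluded glyphs as rows (preceding letter) and as
    columns (following letter)."""
    return {a: {b: n for b, n in row.items() if b not in excluded}
            for a, row in counts.items() if a not in excluded}

def to_probabilities(counts):
    """Renormalise each surviving row so that it sums to 1 again."""
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        if total:
            probs[a] = {b: n / total for b, n in row.items()}
    return probs

symbols = list("kalendae*zephyrus*arma*")
probs = to_probabilities(without_glyphs(bigram_counts(symbols), {"k", "z"}))
print(probs["*"])  # {'a': 1.0}
```

In the sample, a word break is followed once by z and once by a; with z removed, the remaining probability mass goes to a.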

Virgil glyphs without outliers

Figure 2: Plot of first two principal components for Virgil's Aeneid, without outliers k and z

Glyph   Frequency
e       0.121002
a       0.097431
i       0.097350
t       0.083119
u       0.080970
s       0.075550
r       0.070295
n       0.060642
m       0.051480
o       0.050405
c       0.039646
l       0.032904
p       0.026484
d       0.025649
q       0.018154
v       0.015316
b       0.014030
g       0.013167
f       0.010746
h       0.008801
x       0.004818
y       0.001921
z       9.521685e-5
k       2.720481e-5

Table 2: Letter Frequencies in Virgil's Aeneid

Caesar glyphs without outliers

Figure 3: Plot of first two principal components for Caesar's De Bello Gallico, Libri V-VIII, without outliers

Pliny glyphs without outliers

Figure 4: Plot of first two principal components for Pliny's Naturalis Historia, Libri II,VII,XXV-XXVII,XXXIII, without outliers

After removing the outliers from the Latin texts, the plots of the first two principal components of the Roman letters are all remarkably similar.

English

Next, texts of two authors, Charles Dickens and Jane Austen, were analysed. Again, their resulting PCA plots were very similar.

Dickens glyphs

Figure 5: Plot of first two principal components for Dickens's Oliver Twist

Glyph   Frequency
e       0.125144
t       0.088043
a       0.078001
o       0.076936
i       0.068992
n       0.067815
h       0.066219
s       0.059808
r       0.059404
d       0.047431
l       0.040801
u       0.027248
m       0.026382
w       0.024629
c       0.023227
g       0.021794
f       0.021060
y       0.020697
p       0.017588
b       0.016253
v       0.009536
k       0.009025
j       0.001374
x       0.001357
q       0.001003
z       2.328299e-4

Table 3: Letter Frequencies in Dickens's Oliver Twist

Austen glyphs

Figure 6: Plot of first two principal components for Austen's Mansfield Park

The plots of both modern English authors are also remarkably similar.

German

Similar results were obtained for two German authors, Johann Wolfgang von Goethe and Hermann Heiberg.

c is an outlier, but isn't rare in German so cannot be omitted. However, as it is nearly always followed by h (the main exception being ck), it can be replaced by the digraph ch. As my displays don't show non-ASCII characters, umlauts are replaced by colons, and ß is displayed as ss. This is done in the following plot:
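The German substitutions just described can be sketched as a small tokeniser. The exact display form of the umlauts is an assumption (here ä becomes a:), as is leaving a c that is not followed by h as a plain c; the function name is hypothetical.

```python
# Assumed display forms: the umlaut dots become a trailing colon.
DISPLAY = {"ä": "a:", "ö": "o:", "ü": "u:", "ß": "ss"}

def german_glyphs(word):
    """Split a German word into glyphs, merging 'ch' into one glyph and
    mapping non-ASCII letters to their ASCII display form."""
    word = word.lower()
    glyphs, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            glyphs.append("ch")  # treat the digraph as a single glyph
            i += 2
        else:
            glyphs.append(DISPLAY.get(word[i], word[i]))
            i += 1
    return glyphs

print(german_glyphs("Mädchen"))
print(german_glyphs("Straße"))
```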

Goethe glyphs without outliers

Figure 7: Plot of first two principal components for Goethe's Die Wahlverwandtschaften - Kapitel 2, without outliers

Glyph   Frequency
e       0.173716
n       0.106739
i       0.079999
r       0.065936
s       0.060876
t       0.057965
h       0.054011
a       0.052714
d       0.048093
u       0.039504
l       0.038427
c       0.034150
g       0.032500
m       0.025754
o       0.020510
b       0.018573
w       0.018210
f       0.015596
z       0.012606
k       0.010194
v       0.008176
ü       0.006906
ä       0.005588
ß       0.004425
p       0.003747
ö       0.002772
j       0.001987
q       2.029754e-4
y       8.835400e-5
x       3.581919e-5

Table 4: Letter Frequencies in Goethe's Die Wahlverwandtschaften - Kapitel 2

Heiberg glyphs without outliers

Figure 8: Plot of first two principal components for Heiberg's Charaktere und Schicksale, without outliers

The plots of both modern German authors are also remarkably similar.

Conclusion

Because languages can be identified by their PCA letter plots, and the plots are unaffected by simple substitution encipherment (which merely relabels the points), both the language and the substitutions can in principle be identified by this method. However, it was not possible to identify the language of the Voynich Manuscript, whose plot, with its narrow bands of glyphs, looked very different from any of the languages I examined (which, in addition to the languages shown here, also included Italian, French and Polish).
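The claim that simple substitution leaves the plot unchanged can be checked numerically. The sketch below assumes a centred, unscaled PCA via NumPy's SVD, which may differ from the procedure used for the plots above; the sample sentence and the substitution key are hypothetical. Because the sign of each principal component is arbitrary, it compares the distances between pairs of points rather than raw coordinates.

```python
import numpy as np

def pca_scores(text):
    """Map each symbol of text to its first two principal-component
    scores, computed from following-symbol probabilities."""
    glyphs = sorted(set(text))
    idx = {g: i for i, g in enumerate(glyphs)}
    counts = np.zeros((len(glyphs), len(glyphs)))
    for a, b in zip(text, text[1:]):
        counts[idx[a], idx[b]] += 1
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    centred = P - P.mean(axis=0)
    U, S, _ = np.linalg.svd(centred, full_matrices=False)
    return dict(zip(glyphs, U[:, :2] * S[:2]))

def max_distance_mismatch(plain_scores, cipher_scores, key):
    """Largest difference between a pairwise distance in the plaintext
    plot and the corresponding distance in the ciphertext plot."""
    worst = 0.0
    for a in plain_scores:
        for b in plain_scores:
            d_plain = np.linalg.norm(plain_scores[a] - plain_scores[b])
            d_cipher = np.linalg.norm(cipher_scores[a.translate(key)]
                                      - cipher_scores[b.translate(key)])
            worst = max(worst, abs(d_plain - d_cipher))
    return worst

plain = "gallia est omnis divisa in partes tres quarum unam incolunt belgae"
key = str.maketrans("abcdefghijklmnopqrstuvwxyz", "qwertyuiopasdfghjklzxcvbnm")
cipher = plain.translate(key)
mismatch = max_distance_mismatch(pca_scores(plain), pca_scores(cipher), key)
print(mismatch)
```

A mismatch at the level of floating-point rounding means the two plots have the same shape, with only the labels on the points changed.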


© Copyright Donald Fisk 2017