Zipf's Law and Word Length Distributions

2017-04-09

Zipf's law states that, in a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table, so the second most common word occurs about half as often as the most common one. What We Know About The Voynich Manuscript states: "The word frequency distribution follows Zipf's law, which is a necessary (though not sufficient) test of linguistic plausibility."

Figure 1 shows that the Voynich Manuscript text follows Zipf's law, but only approximately. Figure 2 shows that text generated using state transition tables has an almost identical distribution to that of the real Voynich Manuscript.

Figure 1: log_e frequency vs log_e rank for Voynich Manuscript

Figure 2: log_e frequency vs log_e rank for generated Voynich Manuscript
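On log-log axes, an exact Zipf distribution is a straight line with slope -1, so one way to quantify "approximately" is to fit a line to log_e frequency against log_e rank. The following is a minimal sketch of that calculation, not the code used to produce the figures. The function names and the toy input are illustrative assumptions; a real run would pass in the list of words from a transcription.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return (rank, count) pairs, most frequent word first (rank 1)."""
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of ln(count) against ln(rank).

    Needs at least two ranks; an exact Zipf distribution gives -1.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input (not Voynich data): the counts 12, 6, 4, 3 are exactly 12 / rank.
sample = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
pairs = rank_frequency(sample)
slope = loglog_slope(pairs)
```

Comparing the fitted slope for the real and generated texts would give a single number to set beside the visual comparison of the two plots.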

The distribution of word lengths in the generated manuscript also resembles that of the original manuscript, though not exactly: the real manuscript has more long words, and the generated manuscript has more single-glyph words. This could possibly be remedied by using state transition tables with more feedback.

Figure 3: Word length distribution for Voynich Manuscript

Figure 4: Word length distribution for generated Voynich Manuscript
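A word length distribution of this kind can be sketched as below. This is an illustration only: it assumes each character of the transcription stands for one glyph, which may not match how glyphs were counted for the figures, and the function name and sample words are hypothetical.

```python
from collections import Counter


def length_distribution(words):
    """Map word length -> fraction of all word tokens with that length."""
    lengths = Counter(len(word) for word in words)
    total = sum(lengths.values())
    return {n: lengths[n] / total for n in sorted(lengths)}


# A handful of words for illustration, not a real sample of the manuscript.
sample = ["daiin", "ol", "chedy", "s", "ol", "y", "aiin", "dar"]
dist = length_distribution(sample)
```

Running the same function over the real and the generated text and comparing the fractions at length 1 and at the longest lengths would make the differences noted above explicit.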

Table 1 shows the number of occurrences of the commonest words in the real and the generated manuscripts.

Voynich           Generated
Word     Count    Word     Count
daiin      807    ol         841
ol         528    daiin      699
chedy      495    chedy      620
aiin       457    chey       469
shedy      424    shedy      455
chol       381    l          452
or         354    ar         420
ar         348    dar        400
chey       339    dy         393
qokeey     308    or         359
qokeedy    301    shey       353
dar        298    qokaiin    329
qokain     277    aiin       329
shey       276    chy        313
qokedy     265    qokeedy    305
qokaiin    262    al         302
al         253    qokain     283
dal        243    dal        274
dy         229    qokeey     247
okaiin     209    r          246
s          208    chdy       243
chor       206    s          242
dain       189    chol       239
qokal      188    qokedy     215
shol       175    dain       210
cheey      174    qoky       203
okeey      174    okaiin     203
cheol      167    cheey      192
otedy      154    qokal      183
otaiin     150    chor       176
qokar      149    otaiin     169
qol        148    ain        165
y          143    qokar      165

Table 1: Word counts for real and generated Voynich Manuscript
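One simple way to compare the two columns of Table 1 is to check how many of the top words they share. The sketch below does this for the first five rows of each column, assuming the "Voynich" column is the real manuscript and the "Generated" column is the generated text. The helper name is hypothetical.

```python
def top_words(counts, n):
    """Return the n most frequent words in a {word: count} mapping, as a set."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


# First five rows of each column of Table 1.
real = {"daiin": 807, "ol": 528, "chedy": 495, "aiin": 457, "shedy": 424}
generated = {"ol": 841, "daiin": 699, "chedy": 620, "chey": 469, "shedy": 455}

# Words that appear in the top five of both texts.
shared = top_words(real, 5) & top_words(generated, 5)
```

Here four of the five words are shared (daiin, ol, chedy and shedy). The same check could be extended to the full table.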

© Copyright Donald Fisk 2017