Generating the Voynich Manuscript

2017-03-29

It is possible to generate most of the words in the manuscript by adding extra states and transitions to the state transition table in Voynich State Transitions. Without much effort, I was able to generate 33,123 words out of a total of 37,123, i.e. just under 90%, using the state transition table shown in Table 1 below. It may be possible to generate almost all of it, using a larger table or different states and transitions. Anything remaining (e.g. rare glyphs) could have been improvised by the scribe.

		start	qo	d1	r1	s1	o1	a1	l1	ch1	sh1	y1	f	k	p	t	cfh	ckh	cph	cth	e1	e2	e3	o3	d2	ch2	sh2	y2	a2	o2	l2	r2	s2	im	iim	in	iin	iiin	ir	iir	m	finish
start			150	94	13	33	205	29	37	160	89	43	3	31	13	25	ε	5	3	13	ε	1	1	2	ε			4	25	18												ε
qo	qo			24	6				48	5	ε		8	597	31	214	ε	13	1	5	5	10	6	1	ε			1	10	4							ε					6
d1	d						51	210		62	34	24									3	8	3	4	2			157	381	34												28
r1	r									43	16										1	1	ε	32	2			44	184	64												613
s1	s						169	158		51	16										3	7	ε	4	4			21	220	87												259
o1	o			75		19			289	9	5		9	260	34	242	1	8	2	10	4	5	6						11		10
a1	a			7	518				461				ε	8	ε	4	ε	ε	ε	ε
l1	l			41			43			109	44		7	177	6	17					3	3	2	6					49	30												465
ch1	ch						180					6	ε	16	2	7	1	24	3	13	27	158	290	18	72			76	42	57	4	2										3
sh1	sh						142					4	ε	13	ε	5	ε	21	ε	12	40	217	343	42	36			49	29	39	3	ε										7
y1	y			60		10			10	148	54		13	349	45	302													9
f	f									523	50										6	ε	ε	25	ε			65	171	106												53
k	k									112	22										48	234	119	14	1			71	309	62												8
p	p									592	57										ε	ε	ε	27	19			34	128	121												20
t	t									175	32										29	168	117	28	2			78	273	88												9
cfh	cfh																				ε	111	130	241	111			296	93	19
ckh	ckh																				5	65	224	103	38			516	41	7
cph	cph																				13	132	182	233	38			258	126	19
cth	cth																				3	65	149	240	32			423	74	14
e1	e																3	113	5	56		823
e2	e																						541	448																		11
e3	e												4	42	6	18									477	13	4	370	37				29
o3	o												ε	34	3	18									268	6	2				255	126	49									239
d2	d																											782	149	14												55
ch2	ch																						388		178			426														8
sh2	sh																						349		181			446														24
y2	y																								7																	993
a2	a																														173	180	6	4	2	155	353	10	43	9	66
o2	o																											23			262	589	11	ε	ε	5	51	9	2	5	41
l2	l																								44	20	9	108	32	20												768
r2	r																								6	8	1	37	69	23												857
s2	s																																									1000
im	im																																									1000
iim	iim																																									1000
in	in																																									1000
iin	iin																																									1000
iiin	iiin																																									1000
ir	ir																																									1000
iir	iir																																									1000
m	m																																									1000
finish		1000

Table 1: State transition table accounting for almost 90% of the words in the Voynich Manuscript
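To make the mechanism concrete, here is a minimal sketch of how a word could be drawn from a table of this kind: start in the start state, pick the next state at random in proportion to the row's weights, emit that state's glyph, and stop on reaching finish. The table in the sketch is a toy fragment with illustrative weights, not Table 1 itself, and the names TOY_TABLE, EPSILON and generate_word are hypothetical. Treating ε as a small positive weight is also an assumption.

```python
import random

# Assumption: an ε entry stands for a small but non-zero weight.
EPSILON = 0.5

# Toy fragment in the spirit of Table 1 (illustrative weights, NOT the real table).
# Each state maps to (glyph emitted on entering it, {next state: weight}).
TOY_TABLE = {
    "start": ("", {"qo": 150, "ch1": 160, "d1": 94}),
    "qo": ("qo", {"k": 597, "t": 214}),
    "ch1": ("ch", {"e2": 158, "y2": 76, "k": EPSILON}),
    "d1": ("d", {"a2": 381, "y2": 157}),
    "k": ("k", {"e2": 234, "a2": 309}),
    "t": ("t", {"e2": 168, "a2": 273}),
    "e2": ("e", {"e3": 541, "y2": 400}),
    "e3": ("e", {"d2": 477, "y2": 370}),
    "d2": ("d", {"y2": 782}),
    "a2": ("a", {"iin": 353, "r2": 180}),
    "y2": ("y", {"finish": 1000}),
    "iin": ("iin", {"finish": 1000}),
    "r2": ("r", {"finish": 1000}),
}


def generate_word(table, rng):
    """Random walk from 'start' to 'finish', concatenating the glyphs visited."""
    state, word = "start", ""
    while True:
        transitions = table[state][1]
        # Weighted random choice of the next state.
        nxt = rng.choices(list(transitions), weights=list(transitions.values()))[0]
        if nxt == "finish":
            return word
        word += table[nxt][0]
        state = nxt


rng = random.Random(0)
words = [generate_word(TOY_TABLE, rng) for _ in range(200)]
print(" ".join(words[:10]))
```

Because the weights are only relative, the rows do not need to sum to exactly 1000 for the weighted choice to work.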

It would have been impossible to generate all of the manuscript using a single table such as Table 1, however. To account for the differences between sections, each manuscript section requires its own table, which could have been composed by tweaking entries in a previously produced table.

Measuring transition probabilities using the method described in Transition Probabilities, with the expanded state transition table (Table 1) as a template, resulted on average in a significantly higher error rate (the change ranged from slightly lower to more than double). I therefore adjusted the transition probabilities by finding words that the state transition table generates more or less often than they occur in the manuscript, and adjusting their transition counts downwards or upwards accordingly. This improved the results, but there were still some discrepancies. It might have been possible to improve the results further, e.g. by doing gradient descent in a multi-dimensional space (each dimension corresponding to a non-zero entry in the state transition table), but this would be very computationally intensive.
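The exact adjustment rule is not spelled out above, so the following is only one way such a nudge could look, under my own assumptions: every transition on a word's path is scaled by a small factor derived from the ratio of actual to generated counts, and the affected rows are then rescaled to sum to 1000. The function name adjust_for_word, the rate parameter and the add-one smoothing are all hypothetical.

```python
def adjust_for_word(table, path, generated, actual, rate=0.1):
    """Nudge the transitions on `path` up if the word is under-generated,
    down if it is over-generated, then renormalise the touched rows.

    `table` maps state -> {next state: weight}; `path` is the list of states
    the word passes through, from 'start' to 'finish'.
    """
    # Add-one smoothing avoids division by zero (an assumption of this sketch).
    factor = ((actual + 1) / (generated + 1)) ** rate
    for src, dst in zip(path, path[1:]):
        table[src][dst] *= factor
    # Rescale each touched row so its weights sum to 1000 again.
    for src in set(path[:-1]):
        total = sum(table[src].values())
        for dst in table[src]:
            table[src][dst] *= 1000 / total


# Toy example: word "a" is generated 20 times but occurs only 10 times.
toy = {
    "start": {"a": 500, "b": 500},
    "a": {"finish": 1000},
    "b": {"finish": 1000},
}
adjust_for_word(toy, ["start", "a", "finish"], generated=20, actual=10)
print(toy["start"])
```

Repeating this over many words would trade accuracy on one word against another wherever they share transitions, which may be why some discrepancies remain.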

Figure 1 shows the word counts for text generated ten times using the state transition table for the red herbal pages (Currier B) of the manuscript (Table 2), plotted against the actual word counts in the Voynich Manuscript. Ideally, the points would lie on a straight line of gradient 1.0, but any process with an element of randomness will result in points being spread about that line. Here, there is an additional error caused by an imperfect state transition table. Similar plots for some other parts of the manuscript are shown in Figure 2, Figure 3, and Figure 4.

Figure 1: Generated vs actual word counts for red herbal pages
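The data behind a plot like this can be assembled roughly as follows. This is a sketch that assumes each of the ten generated runs contributes its own set of points, which may not be how the figure was actually produced. The helper names count_pairs and gradient_through_origin are hypothetical, and the gradient is a plain least-squares fit through the origin.

```python
from collections import Counter


def count_pairs(actual_words, generated_runs):
    """Return (actual count, generated count) pairs, one per word per run."""
    actual = Counter(actual_words)
    pairs = []
    for run in generated_runs:
        generated = Counter(run)
        # Include words that appear on only one side; Counter yields 0 for them.
        for word in actual.keys() | generated.keys():
            pairs.append((actual[word], generated[word]))
    return pairs


def gradient_through_origin(pairs):
    """Least-squares slope of generated vs actual for a line through (0, 0)."""
    numerator = sum(a * g for a, g in pairs)
    denominator = sum(a * a for a, _ in pairs)
    return numerator / denominator


# Toy check: a run that reproduces the actual counts exactly gives gradient 1.0.
actual_words = ["daiin", "daiin", "chedy"]
runs = [["daiin", "chedy", "daiin"]]
pairs = count_pairs(actual_words, runs)
print(sorted(pairs), gradient_through_origin(pairs))
```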

		start	qo	d1	r1	s1	o1	a1	l1	ch1	sh1	y1	f	k	p	t	cfh	ckh	cph	cth	e1	e2	e3	o3	d2	ch2	sh2	y2	a2	o2	l2	r2	s2	im	iim	in	iin	iiin	ir	iir	m	finish
start			119	100	7	21	211	22	19	153	79	63	6	47	16	39	ε	7	3	4	2	1	2	ε	ε			3	44	33												ε
qo	qo			34	3				11	3	ε		26	660	26	198	ε	8	ε	8	ε	3	8	ε	ε			ε	5	5							ε					3
d1	d						32	170		38	32	44									2	8	2	2	ε			152	457	32												28
r1	r									9	ε										9	ε	ε	52	17			78	155	9												672
s1	s						37	110		37	12										ε	24	ε	ε	12			24	463	24												256
o1	o			124		16			189	3	2		18	371	23	209	1	9	1	6	2	1	9						9		7
a1	a			6	577				393				ε	18	ε	ε	ε	ε	ε	6
l1	l			125			78			95	44		20	176	7	31					3	ε	ε	ε					85	37												298
ch1	ch						130					6	ε	29	2	11	2	80	2	15	29	118	300	8	174			56	15	15	ε	ε										6
sh1	sh						160					ε	ε	20	ε	ε	ε	26	ε	15	47	215	334	6	84			49	9	20	ε	ε										15
y1	y			66		4			ε	66	26		31	445	70	284													9
f	f									455	13										26	ε	ε	ε	ε			65	234	78												130
k	k									105	24										22	145	165	16	2			100	353	55												13
p	p									564	60										9	ε	ε	51	51			51	137	34												43
t	t									165	42										6	109	196	44	ε			77	286	60												14
cfh	cfh																				ε	167	167	333	167			167	ε	ε
ckh	ckh																				ε	42	150	33	100			633	42	ε
cph	cph																				ε	333	333	83	ε			250	ε	ε
cth	cth																				ε	22	348	65	87			435	43	ε
e1	e																13	125	ε	75		788
e2	e																						461	514																		26
e3	e												20	150	15	24									560	8	4	181	8				30
o3	o												3	26	3	19									511	10	3				160	70	64									131
d2	d																											781	150	13												55
ch2	ch																						400		450			150														ε
sh2	sh																						222		333			444														ε
y2	y																								10																	990
a2	a																														143	220	8	5	ε	75	392	5	45	14	93
o2	o																											11			257	688	18	ε	ε	ε	7	4	ε	ε	14
l2	l																								63	30	7	133	17	20												730
r2	r																								19	2	4	23	46	8												897
s2	s																																									1000
im	im																																									1000
iim	iim																																									1000
in	in																																									1000
iin	iin																																									1000
iiin	iiin																																									1000
ir	ir																																									1000
iir	iir																																									1000
m	m																																									1000
finish		1000

Table 2: State transition table for the red herbal pages
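Tables laid out like Table 1 and Table 2 (tab-separated, first column the state name, second column the glyph, remaining columns the transition weights) could be loaded with something like the sketch below. The function name parse_table and the EPSILON value are hypothetical, and reading ε as a small non-zero weight is an assumption.

```python
# Assumption: an ε cell stands for a small but non-zero weight.
EPSILON = 0.5


def parse_table(text):
    """Parse a tab-separated transition table into
    {state: (glyph, {next state: weight})}."""
    rows = text.strip("\n").split("\n")
    # The header row has two empty leading cells (state name and glyph columns).
    targets = rows[0].split("\t")[2:]
    table = {}
    for row in rows[1:]:
        cells = row.split("\t")
        state, glyph, weights = cells[0], cells[1], cells[2:]
        transitions = {}
        for target, cell in zip(targets, weights):
            if cell == "ε":
                transitions[target] = EPSILON
            elif cell:
                transitions[target] = float(cell)
        table[state] = (glyph, transitions)
    return table


# Tiny made-up table in the same layout as the real ones.
sample = (
    "\t\tstart\ta\tfinish\n"
    "start\t\t\t1000\tε\n"
    "a\ta\t\t\t1000\n"
    "finish\t\t1000\n"
)
parsed = parse_table(sample)
print(parsed)
```

Summing each parsed row is a quick way to spot a damaged line, since the rows in these tables appear to add up to roughly 1000.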

Figure 2: Generated vs actual word counts for blue herbal pages

Figure 3: Generated vs actual word counts for astronomical/astrological/cosmological pages

Figure 4: Generated vs actual word counts for blue biological pages

The RMS errors, recomputed for the augmented state transition tables, are shown in Table 3.

	blue herbal	black herbal	red herbal	green herbal	astro	pharma	blue text	black text	red text	blue bio	black bio	red bio
Error	3.93	3.70	4.81	3.22	4.03	3.45	3.71	7.81	4.13	3.13	3.59	3.41

Table 3: RMS errors for non-zero values of predicted word counts
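The precise error measure is defined in Transition Probabilities rather than here, so the sketch below is only my reading of the caption: the root mean square of the differences between predicted and actual word counts, taken over words whose predicted count is non-zero. The function name rms_error and the example counts are made up for illustration.

```python
import math


def rms_error(predicted, actual):
    """RMS of (predicted - actual) over words with a non-zero predicted count.

    `predicted` and `actual` map words to counts; a word missing from
    `actual` is treated as having an actual count of 0.
    """
    diffs = [p - actual.get(word, 0) for word, p in predicted.items() if p > 0]
    if not diffs:
        return 0.0
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up counts: "qokal" is ignored because its predicted count is zero.
predicted = {"daiin": 10, "chedy": 4, "qokal": 0}
actual = {"daiin": 7, "chedy": 8, "qokal": 5}
error = rms_error(predicted, actual)
print(round(error, 2))
```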

Each section of the manuscript was generated using its own state transition table. A paragraph break was inserted every time a word beginning with p or f was generated, and lines were broken after reaching a certain length. There are probably better ways of determining paragraph breaks, but I didn't investigate this. Note that one of the characteristics of the text, word repetition or similar words in close proximity, occurs naturally as a result of randomness.
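The layout rule described above can be sketched as follows. The line length is not stated in the text, so max_line_len is a placeholder, and the function name layout is hypothetical. The sketch starts a new paragraph before any word beginning with p or f, and ends a line once it has reached the chosen length.

```python
def layout(words, max_line_len=60):
    """Group words into paragraphs (lists of lines).

    A new paragraph starts before any word beginning with 'p' or 'f';
    a line is closed once it has reached `max_line_len` characters.
    """
    paragraphs, lines, line = [], [], ""
    for word in words:
        if word.startswith(("p", "f")) and (line or lines):
            # Close the current paragraph before placing this word.
            if line:
                lines.append(line)
            paragraphs.append(lines)
            lines, line = [], ""
        line = f"{line} {word}" if line else word
        if len(line) >= max_line_len:
            lines.append(line)
            line = ""
    if line:
        lines.append(line)
    if lines:
        paragraphs.append(lines)
    return paragraphs


demo = layout(["pchedy", "daiin", "qokal", "fchedy", "ol", "pol"], max_line_len=12)
for paragraph in demo:
    print("\n".join(paragraph), end="\n\n")
```

Words beginning with cph or cfh do not trigger a break here, since the check looks only at the first character.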

The resulting text is contained in Generated Voynich Manuscript. I may convert it to HTML and add some illustrations later.

The real Voynich Manuscript, converted into EVA, is here.