Experimenting with text generation to create novel Ferengi Rules of Acquisition

This post presents an evolving attempt to generate novel Ferengi Rules of Acquisition with classic and contemporary approaches to natural language processing.

The Ferengi are a species of uber-capitalists in the Star Trek universe with a 1950s mentality on social issues. The Rules of Acquisition are one part governmental charter, one part value system. They’re written as proverbs, so they’re naturally short, and there aren’t a ton of them (~138 across the comprehensive Star Trek universe). In their own words:

“Every Ferengi business transaction is governed by 285 Rules of Acquisition to ensure a fair and honest deal for all parties concerned… well most of them anyway.” - Quark, “The Maquis: Part 1”. Star Trek: Deep Space Nine

This means it should be a bit of a challenge to train up an LSTM to generate novel and sensible additions. On the other hand, as you might expect from the socio-cultural guidelines of a socially backwards, profit-at-all-costs society, the themes of the Rules are limited, suggesting that a fairly vanilla LSTM might be able to do a decent job.


First Attempt: train an LSTM on the Rules as a single solid block of text, then generate character-by-character predictions. This approach was heavily inspired by Jason Brownlee’s blog post.

LSTM on the Ferengi Rules of Acquisition

getting the rules

load some packages

import requests
from bs4 import BeautifulSoup

def getMiddleColumnFromRow(row):
    # the rule text lives in the second column of each table row
    return row.findAll('td')[1].text.strip()

grab the text

response = requests.get(
    'http://memory-beta.wikia.com/wiki/Ferengi_Rules_of_Acquisition'
)
soup = BeautifulSoup(response.content, 'lxml')
rules = list(map(getMiddleColumnFromRow, soup.find('table').findAll('tr')))
rules = rules[1:]  # drop the header row
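
quick sanity check on the scrape (the exact count and first rule depend on the current state of the wiki page)

print(len(rules))
print(rules[0])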

clean up the text

rules = [s.lower() for s in rules]
rules_block = " ".join(rules)
rules_block = rules_block.replace('"', '').replace('*', '').replace('[', '').replace(']', '')

build a dict with all the characters

chars = sorted(list(set(rules_block)))
chars_to_int = dict((c, i) for i, c in enumerate(chars))

basic text properties

n_chars = len(rules_block)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
Total Characters:  6155
Total Vocab:  36

advanced properties / ins and outs

# slide a 100-character window across the text; each window predicts the character that follows it
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = rules_block[i:i + seq_length]
	seq_out = rules_block[i + seq_length]
	dataX.append([chars_to_int[char] for char in seq_in])
	dataY.append(chars_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
Total Patterns:  6055
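
to make the windowing concrete, each pattern pairs a 100-character window of text with the single character that follows it; here is a sketch of decoding one pattern back to text (the exact strings depend on the scraped page)

sample_in = ''.join(chars[i] for i in dataX[0])
sample_out = chars[dataY[0]]
print(repr(sample_in), '->', repr(sample_out))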

set up the model

import numpy as np
import sys
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

create input/output encodings

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
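
sanity check on the shapes: X should be (n_patterns, seq_length, 1) and y should have one column per character class (36 here, assuming every character shows up as a target at least once)

print(X.shape)
print(y.shape)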

set model params

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape = (X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
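
a quick summary confirms the architecture: one 256-unit LSTM layer, dropout, and a dense softmax over the character vocabulary

model.summary()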

set up the checkpointing regime

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

run the model

model.fit(X, y, epochs = 60, batch_size = 128, callbacks = callbacks_list)
Epoch 1/60
6016/6055 [============================>.] - ETA: 0s - loss: 0.4661
Epoch 00001: loss did not improve
6055/6055 [==============================] - 45s 7ms/step - loss: 0.4655
Epoch 2/60
6016/6055 [============================>.] - ETA: 0s - loss: 0.3908
Epoch 00002: loss improved from 0.42985 to 0.39136, saving model to weights-improvement-02-0.3914.hdf5
6055/6055 [==============================] - 41s 7ms/step - loss: 0.3914
Epoch 3/60
6016/6055 [============================>.] - ETA: 0s - loss: 0.3774
Epoch 00003: loss improved from 0.39136 to 0.37890, saving model to weights-improvement-03-0.3789.hdf5
6055/6055 [==============================] - 57s 9ms/step - loss: 0.3789
Epoch 4/60
6016/6055 [============================>.] - ETA: 0s - loss: 0.5022
Epoch 00004: loss did not improve
6055/6055 [==============================] - 64s 11ms/step - loss: 0.5020
Epoch 5/60
6016/6055 [============================>.] - ETA: 0s - loss: 0.3542
Epoch 00005: loss improved from 0.37890 to 0.35414, saving model to weights-improvement-05-0.3541.hdf5
6055/6055 [==============================] - 53s 9ms/step - loss: 0.3541

Epoch 60/60
6016/6055 [============================>.] - ETA: 0s - loss: 2.8709
Epoch 00060: loss did not improve

<keras.callbacks.History at 0x7fc8fbad9e10>

checking out the result

after more training we see some progress

# load the network weights
filename = "weights-improvement-19-0.2142.hdf5"
model.load_weights(filename)
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
int_to_char = dict((i, c) for i, c in enumerate(chars))

# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed: ")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose = 0)
	index = np.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	sys.stdout.write(result)
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")
Seed: 
" t ain't broke, don't fix it. deep down, everyone's a ferengi. no good deed ever goes unpunished. alw "
ays get somebody else to do the lifting. never get into anything that you can't get out of. 
a man is only worth the sum of his possessions. an angry man is an enemy, and a satisfied man is an ally.. 
the less employees know about the cash flow, the smaller the share they can demand. 
only a fool passes up a business opportunity. the more time they take deciding, the more money they will spend.
s tor tear your far dom ner rasi foomims ts neon aootn nn meeee. 
teel if pollh n  eaat irdi yeu cen bbtat tat seth too canio met ss aoos rntr toer tsmk to necd anp llow waat it aoot fn aatinem. 
m mere ssut  on betior lat s tele.t tooeins, wever teee  mrrorian tsene aacitsieni woane auuunens is erre 
aesineess th neal you can beeter take a venetaoe tho merd nothods ianere thtnh torr aon bdsicesiit  lhver allon famile to 
stald to shof t  is oo whe woust oo time tomk toonits in toe einhenier the merce thsesns in  lm yher ysur arrfdds bo  
lhver tekes. foead is the ienllfie iome tour presi seene's fothing
Done.

at $\mathrm{loss} \leq 0.06$ we’ve overfit the data; closer inspection suggests that memorizing the input is all we’re doing

# load the network weights
filename = "weights-improvement-54-0.0554.hdf5"
model.load_weights(filename)
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
int_to_char = dict((i, c) for i, c in enumerate(chars))

# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed: ")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose = 0)
	index = np.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	sys.stdout.write(result)
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")
Seed: 
" etter suit than your own. the bigger the smile, the sharper the knife. never ask when you can take.  "
never trust anybody taller than you. rate divided by time equals profit. (also known as the velocity of wealth.) 
take joy from profit, and profit from joy. good customers are as rare as latinum. treasure them. there is no substitute for success.
free advice is seldom cheap. keep your lies consistent. the riskier the road, the greater the profit. 
work is the best therapy-at least for your employees. win or lose, there's always hupyrian beetle snuff. 
someone's always got bigger ears. ferengi are not responsible for the stupidity of other races. knowledge equals profit. 
home is where the heart is, but the stars are made of latinum. every once in a while, declare peace. 
it confuses the hell out of your enemies. if you break it, i'll charge you for it! beware of the vulcan greed for knowledge. 
the flimsier the product, the higher the price. never let the competition know what you're thinking. 
learn the customer's weaknesses, so that you can better take advantage of him. it ain't over 'til i
Done.
int_to_char = dict((i, c) for i, c in enumerate(chars))

# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed: ")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose = 0)
	index = np.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	sys.stdout.write(result)
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")
Seed: 
"  a product. time, like latinum, is a highly limited commodity. more is good...all is better. always  "
leave yourself an out. a wife is a luxury... a smart accountant a neccessity. 
when the messenger comes to appropriate your profits, kill the messenger. 
a wealthy man can afford everything except a conscience. never let doubt interfere with your lust for latinum. 
when in doubt, lie. always inspect the merchandise before making a deal. if it ain't broke, don't fix it. 
deep down, everyone's a ferengi. no good deed ever goes unpunished. always get somebody else to do the lifting. 
never get into anything that you can't get out of. a man is only worth the sum of his possessions. 
an angry man is an enemy, and a satisfied man is an ally.. 
the less employees know about the cash flow, the smaller the share they can demand. only a fool passes up a business opportunity.
the more time they take deciding, the more money they will spend.s aon thee tour frodit. fewares anx your fond berine. 
sometimes what you get yrer cace to kropifs ts meol tou  atpi der oe yorrhit  novn ts need t erll frrm das. door
Done.
int_to_char = dict((i, c) for i, c in enumerate(chars))

# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed: ")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose = 0)
	index = np.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	sys.stdout.write(result)
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")
Seed: 
"  wealthy man can afford everything except a conscience. never let doubt interfere with your lust for "
 latinum. when in doubt, lie. always inspect the merchandise before making a deal. if it ain't broke, don't fix it. 
 deep down, everyone's a ferengi. no good deed ever goes unpunished. always get somebody else to do the lifting. 
 never get into anything that you can't get out of. a man is only worth the sum of his possessions. 
 an angry man is an enemy, and a satisfied man is an ally.. the less employees know about the cash flow, the smaller the share they can demand. 
 only a fool passes up a business opportunity. the more time they take deciding, the more money they will spend.s aon thee tour frodit. 
 fewares anx your fond berine. sometimes what you get yrer cace to kropifs ts meol tou  atpi der oe yorrhit  novn ts need t erll frrm das. 
 dooreit is arofee..o lore as iotioess taah a qeeetiin  s mere sew letgo  lety asmayse dut mv wiat yhu can to feo .
 on ortr thomiod thee y aow you  mead  rriedts it ee wete ns an bufey. aot yhe siace nhere l cela t  is anoeddeeee tht olly.
 on khew touk ansioee
Done.

So, the character-by-character approach clearly is not working for us: the model has only learned to reproduce the input verbatim, and we don’t get any of the novelty we were hoping for. Note that this is before any parameter tuning; I simply chose some sensible defaults (e.g., the Adam optimizer, unit dropout) and fed the input in character by character.

The biggest concern with an LSTM on this project was that the dataset is simply too small to learn a generative representation. No intermediate checkpoint produced good rules; checkpoints differed only in how faithfully they replicated the original character sequence. The choice of 100-character input windows may be reinforcing this verbatim replication. On the other hand, word-by-word prediction will yield an even smaller set of patterns. Stay tuned for the next iteration, where a word-by-word LSTM will be examined; a rough sketch of the word-level input prep is below.
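
as a preview, here is a minimal sketch of how the word-level inputs might be prepared; the naive whitespace tokenizer and the 10-word window are placeholder choices for illustration, not the final setup

# split the cleaned text into word tokens (naive whitespace split; punctuation stays attached)
words = rules_block.split()
vocab = sorted(set(words))
word_to_int = dict((w, i) for i, w in enumerate(vocab))

# build (input window, next word) pairs, mirroring the character-level setup
word_seq_length = 10
wordX, wordY = [], []
for i in range(0, len(words) - word_seq_length):
    wordX.append([word_to_int[w] for w in words[i:i + word_seq_length]])
    wordY.append(word_to_int[words[i + word_seq_length]])
print("Total word patterns: ", len(wordX))
print("Word vocab: ", len(vocab))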