IMDB Movie Review Dataset¶

This is a library to download and parse IMDB's Large Movie Review Dataset, plus a demo of a transformer-based model. The dataset has 25K training examples and 25K test examples, plus 50K unlabeled examples.

It's inspired by Keras' Text classification with Transformer demo.

Data Preparation¶

Downloading data files¶

To download, uncompress and untar the dataset to a local directory, do the following. If the data was already downloaded to the given --data directory, it returns immediately.

In [1]:
import (
    "flag"
    "os"

    "github.com/gomlx/gomlx/examples/imdb"
    "github.com/gomlx/gomlx/support/fsutil"
    "github.com/janpfeifer/must"

    _ "github.com/gomlx/gomlx/backends/default"
)

var (
	flagDataDir    = flag.String("data", "~/tmp/imdb", "Directory to cache downloaded and generated dataset files.")
	flagEval       = flag.Bool("eval", true, "Whether to evaluate the model on the validation data in the end.")
	flagVerbosity  = flag.Int("verbosity", 1, "Level of verbosity, the higher the more verbose.")
	flagCheckpoint = flag.String("checkpoint", "", "Directory to save and load checkpoints from. If left empty, no checkpoints are created.")
)

func AssertDownloaded() {
    *flagDataDir = must.M1(fsutil.ReplaceTildeInDir(*flagDataDir))
    if !fsutil.MustFileExists(*flagDataDir) {
        must.M(os.MkdirAll(*flagDataDir, 0777))
    }
    must.M(imdb.Download(*flagDataDir))
}

%%
AssertDownloaded()
> Loading previously generated preprocessed binary file.
Loaded data from "aclImdb.bin": 100000 examples, 141088 unique tokens, 23727054 tokens in total.

Sampling some examples¶

It creates a small dataset and prints out some random examples.

It also defines the DType, used for all internal representations of the model, and the flag --max_len that defines the maximum number of tokens used per observation. This will be used in the modeling later.

In [2]:
import "github.com/gomlx/gomlx/examples/imdb"

%%
AssertDownloaded()
imdb.PrintSample(3)
> Loading previously generated preprocessed binary file.
Loaded data from "aclImdb.bin": 100000 examples, 141088 unique tokens, 23727054 tokens in total.
┌────────────────────────────────────────────────────────────┐
│                                                            │
│    [Sample 0 - label 1]                                    │
│    <START> so we saw this on dvd at our apartment here     │
│    in paris we re all here on an exchange program we       │
│    all laughed so hard cuz so much of what was going on    │
│    in the movie happened to us i mean yeah sure some of    │
│    it was pretty clich d but still true know what i m      │
│    saying i think i related more to the quiet guy the      │
│    italian than xavier because i m more of the observer    │
│    in our group anyway i wish i had a hot roommate like    │
│    cecile de france she seems like a cool chick in the     │
│    movie and for real after i saw her hosting the          │
│    cannes festival last month now i m thinking i wanna     │
│    go to barcelona next summer after seeing this movie     │
│    i gotta check out the sequel too which just came out    │
│    here in france                                          │
│                                                            │
│                                                            │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                                                            │
│    [Sample 1 - label 1]                                    │
│    <START> a wonderful film ahead of its time i think      │
│    so in the eighty s it was all about winning greed is    │
│    good remember that one i have seen this film more       │
│    that 20 times to me this is a real desert island        │
│    film i keep watching because there is always            │
│    something more to learn about these flawed              │
│    characters that i just love jessica tandy and hume      │
│    cronin are simply wonderful also beverly d angelo       │
│    beau bridges come in at a close second don t get me     │
│    wrong there are many more great performance s in        │
│    this film and it is also the way it is written that     │
│    made it for me and i hope you a film that you will      │
│    want to see over and over i think tv shows like         │
│    exposure and now earl owe a lot to this film but        │
│    remember it is not a tom cruise film                    │
│                                                            │
│                                                            │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                                                            │
│    [Sample 2 - label 0]                                    │
│    the characters are presented with numerous              │
│    opportunities to a save their loved ones b get the      │
│    police to help c escape or most importantly d kill      │
│    luther i can t feel empathy or fear for characters      │
│    that are too stupid to help themselves snub chances     │
│    to arm themselves with guns and knives while luther     │
│    is away a policeman eventually arrives and is           │
│    equally ineffective in stopping luther even though      │
│    at one point he has a rifle squarely aimed at luther    │
│    while luther clucks and does his rendition of the       │
│    polish chicken dance i found myself futilely            │
│    coaching my television make sure he s dead hes gone     │
│    get out of there or just kill him already luther is     │
│    a bloodthirsty savage but he is hardly hannibal         │
│    lecter if you can t outsmart this egghead you           │
│    deserve what s coming to you by halfway through the     │
│    movie you ll be so lethargic to the fates of the        │
│    half wits that only morbid curiosity will sustain       │
│    you to last to the mildly amusing ending this movie     │
│    was noted as one of fangoria s 101 greatest movies      │
│    you ve never seen well fangoria is half right in the    │
│    case of luther the geek                                 │
│                                                            │
│                                                            │
└────────────────────────────────────────────────────────────┘
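Reviews like the samples above vary in length, while the model works on a fixed maximum number of tokens per observation. A minimal sketch of how a token sequence can be cut or padded to that length follows. The function padOrTruncate is hypothetical, and padding on the right with token ID 0 is an assumption made for illustration; the library may pad differently.

```go
package main

import "fmt"

// padOrTruncate returns exactly maxLen token IDs plus a mask that is true
// where a real token is present. Longer inputs are cut, shorter ones are
// filled with padID on the right (an assumption of this sketch).
func padOrTruncate(tokens []int, maxLen, padID int) (padded []int, mask []bool) {
	padded = make([]int, maxLen)
	mask = make([]bool, maxLen)
	for i := range padded {
		if i < len(tokens) {
			padded[i], mask[i] = tokens[i], true
		} else {
			padded[i] = padID
		}
	}
	return padded, mask
}

func main() {
	short, shortMask := padOrTruncate([]int{5, 8, 13}, 5, 0)
	fmt.Println(short, shortMask) // [5 8 13 0 0] [true true true false false]

	long, longMask := padOrTruncate([]int{1, 2, 3, 4, 5, 6}, 4, 0)
	fmt.Println(long, longMask) // [1 2 3 4] [true true true true]
}
```

The mask is what lets later layers, such as attention, ignore the padded positions.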

Training¶

We will create 3 different types of models for this demo: Bag of Words ("bow"), Convolutional ("cnn") and Transformer ("transformer").

Model Configuration¶

As with other demos, we leverage the model.Store object to store all model and training parameters. One can set specific parameters using the -set command line flag.

The imdb.CreateModelStore() function sets the default values for the hyperparameters that may be used by any of the 3 model types. The parameter "model" specifies the model type.

In [3]:
import (
    "golang.org/x/exp/maps"
    "github.com/gomlx/gomlx/ml/model"
)

// settings is bound to a "-set" flag to be used to set store hyperparameters.
var settings = commandline.CreateSettingsFlag(imdb.CreateModelStore(), "set")

// StoreFromSettings is the default store (createModelStore) changed by -set flag.
// It also returns the list of parameters changed by -set in paramsSet: we use this later to avoid loading over the values from checkpoint.
func StoreFromSettings() (store *model.Store, paramsSet []string) {
    store = imdb.CreateModelStore()
    paramsSet = must.M1(commandline.ParseSettings(store, *settings))
    return store, paramsSet
}

%% -set="model=cnn"
fmt.Printf("Model types: %q\n", maps.Keys(imdb.ValidModels))
store, _ := StoreFromSettings()
fmt.Println(commandline.SprintSettings(store.RootScope()))
Model types: ["transformer" "bow" "cnn"]
	"/activation": (string) 
	"/adam_dtype": (string) 
	"/adam_epsilon": (float64) 1e-07
	"/batch_size": (int) 32
	"/cnn_dropout_rate": (float64) 0.5
	"/cnn_normalization": (string) 
	"/cnn_num_layers": (float64) 5
	"/cosine_schedule_steps": (int) 0
	"/dropout_rate": (float64) 0.1
	"/eval_batch_size": (int) 200
	"/fnn_dropout_rate": (float64) 0.3
	"/fnn_normalization": (string) 
	"/fnn_num_hidden_layers": (int) 2
	"/fnn_num_hidden_nodes": (int) 32
	"/fnn_residual": (bool) true
	"/imdb_content_max_len": (int) 200
	"/imdb_include_separators": (bool) false
	"/imdb_mask_word_task_weight": (float64) 0
	"/imdb_max_vocab": (int) 20000
	"/imdb_token_embedding_size": (int) 32
	"/imdb_use_unsupervised": (bool) false
	"/imdb_word_dropout_rate": (float64) 0
	"/l1_regularization": (float64) 0
	"/l2_regularization": (float64) 0
	"/learning_rate": (float64) 0.0001
	"/model": (string) cnn
	"/normalization": (string) layer
	"/num_checkpoints": (int) 3
	"/optimizer": (string) adamw
	"/plots": (bool) true
	"/train_steps": (int) 5000
	"/transformer_att_key_size": (int) 8
	"/transformer_dropout_rate": (float64) -1
	"/transformer_max_att_len": (int) 200
	"/transformer_num_att_heads": (int) 2
	"/transformer_num_att_layers": (int) 1
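The -set flag takes a list of key=value pairs separated by ";" and overrides typed defaults like the ones listed above. The following is a rough standard-library sketch of that idea. It is not the implementation of commandline.ParseSettings: the function applySettings and the plain map used as a parameter store are hypothetical, introduced only for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// applySettings parses "key=value;key=value" and overwrites entries of params,
// converting each value to the type of the existing default. It returns the
// list of keys that were set, in the order they appeared.
func applySettings(params map[string]any, settings string) ([]string, error) {
	var changed []string
	for _, part := range strings.Split(settings, ";") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		key, value, ok := strings.Cut(part, "=")
		if !ok {
			return nil, fmt.Errorf("setting %q is not in key=value form", part)
		}
		current, found := params[key]
		if !found {
			return nil, fmt.Errorf("unknown parameter %q", key)
		}
		var parsed any
		var err error
		switch current.(type) {
		case int:
			parsed, err = strconv.Atoi(value)
		case float64:
			parsed, err = strconv.ParseFloat(value, 64)
		case bool:
			parsed, err = strconv.ParseBool(value)
		case string:
			parsed = value
		default:
			err = fmt.Errorf("unsupported type %T", current)
		}
		if err != nil {
			return nil, fmt.Errorf("parameter %q: %w", key, err)
		}
		params[key] = parsed
		changed = append(changed, key)
	}
	return changed, nil
}

func main() {
	// A few defaults copied from the listing above.
	params := map[string]any{"model": "transformer", "learning_rate": 1e-4, "train_steps": 5000, "plots": true}
	changed, err := applySettings(params, "model=cnn;learning_rate=1e-3;train_steps=10000")
	if err != nil {
		panic(err)
	}
	fmt.Println("changed:", changed)
	fmt.Println("model:", params["model"], "train_steps:", params["train_steps"])
}
```

Returning the list of changed keys mirrors the paramsSet value in the cell above, which is used later to avoid overwriting explicitly set values when loading a checkpoint.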

Bag Of Words Model (bow)¶

This is the simplest model we are going to train: it embeds each token of the sentence (the default embedding size is 32), takes the element-wise maximum of the embeddings over the sentence, and passes the result through an FNN.

The code in imdb.BagOfWordsModelGraph looks like this:

// BagOfWordsModelGraph builds the computation graph for the "bag of words" model: the token embeddings are
// reduced (max) over the content length and the result is fed to an FNN.
func BagOfWordsModelGraph(scope *model.Scope, tokens *Node) *Node {
	embed, _ := EmbedTokensGraph(scope, tokens)

	// Take the max over the content length, and put an FNN on top.
	// Shape transformation: [batch_size, content_len, embed_size] -> [batch_size, embed_size]
	embed = ReduceMax(embed, 1)
	logits := fnn.New(scope, embed, 1).Done()
	return logits
}

We played a bit with the hyperparameters to get to ~85% accuracy on the validation data.

The code for imdb.TrainWithStore is here. It's a straightforward GoMLX training loop.

In [4]:
%% --set="model=bow;l2_regularization=1e-3;learning_rate=1e-4;normalization=none;train_steps=10000"
store, paramsSet := StoreFromSettings()
imdb.TrainWithStore(store, *flagDataDir, *flagCheckpoint, paramsSet, *flagEval, *flagVerbosity)
> Loading previously generated preprocessed binary file.
Loaded data from "aclImdb.bin": 100000 examples, 141088 unique tokens, 23727054 tokens in total.
Backend "xla":	xla:cuda - PJRT "cuda" plugin (/home/janpf/.local/lib/go-xla/nvidia/pjrt_c_api_cuda_plugin.so) v0.100 [StableHLO] [1 device(s)]
Model: bow
         7% [=>......................................] (350 steps/s) [0s:26s] [step=719] [loss=0.785] [~loss=0.795] [~acc=56.98%]        
       100% [========================================] (2107 steps/s) [step=9999] [loss=0.343] [~loss=0.338] [~acc=88.24%]                

Metric: accuracy

Metric: loss

	[Step 10000] median train step: 218 microseconds

Results on train-eval:
	Mean Loss (#loss): 0.25
	Mean Accuracy (#acc): 93.23%
Results on test-eval:
	Mean Loss (#loss): 0.385
	Mean Accuracy (#acc): 85.48%

Convolution Model (cnn)¶

The function imdb.Conv1DModelGraph creates a 1D convolution model with a configurable number of convolution layers. After the convolutions, it behaves the same way as the Bag Of Words model.

The core of the convolution model looks like this:

	// 1D Convolution: embed is [batch_size, content_len, embed_size].
	numConvolutions := model.GetParamOr(scope, "cnn_num_layers", 5)
	logits := embed
	for convIdx := range numConvolutions {
		scope := scope.Inf("%03d_conv", convIdx)
		residual := logits
		if convIdx > 0 {
			logits = NormalizeSequence(scope, logits)
		}
		logits = layers.Convolution(scope, logits).KernelSize(7).Filters(embedSize).Strides(1).Done()
		logits = activation.ApplyFromScope(scope, logits)
		if dropoutNode != nil {
			logits = layers.Dropout(scope, logits, dropoutNode)
		}
		if residual.Shape().Equal(logits.Shape()) {
			logits = Add(logits, residual)
		}
	}

	// Take the max over the content length, and put an FNN on top.
	// Shape transformation: [batch_size, content_len, embed_size] -> [batch_size, embed_size]
	logits = ReduceMax(logits, 1)
	logits = fnn.New(scope, logits, 1).Done()
	logits.AssertDims(batchSize, 1)
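The loop above only adds the residual when the shapes match. The plain-Go sketch below illustrates why that check matters: a stride-1 convolution without padding shortens the sequence by kernelSize-1 positions, while a zero-padded one keeps the length. Whether layers.Convolution pads by default is not stated here, so both cases are shown. The functions conv1D and addResidualIfSameShape are hypothetical helpers written for this illustration, and a kernel size of 3 is used instead of 7 to keep the numbers small.

```go
package main

import "fmt"

// conv1D applies a stride-1 1D convolution over x, shaped [seqLen][inChannels].
// kernel is shaped [kernelSize][inChannels][outChannels]. With padSame the input
// is zero-padded so the output keeps seqLen positions; otherwise the output has
// seqLen-kernelSize+1 positions.
func conv1D(x [][]float64, kernel [][][]float64, padSame bool) [][]float64 {
	kernelSize := len(kernel)
	outChannels := len(kernel[0][0])
	offset, outLen := 0, len(x)-kernelSize+1
	if padSame {
		offset, outLen = (kernelSize-1)/2, len(x)
	}
	if outLen < 0 {
		outLen = 0
	}
	out := make([][]float64, outLen)
	for t := range out {
		out[t] = make([]float64, outChannels)
		for j := 0; j < kernelSize; j++ {
			src := t + j - offset
			if src < 0 || src >= len(x) {
				continue // Zero padding contributes nothing.
			}
			for c, v := range x[src] {
				for o := range out[t] {
					out[t][o] += v * kernel[j][c][o]
				}
			}
		}
	}
	return out
}

// addResidualIfSameShape adds residual to y in place, but only when both have
// the same shape, mirroring the check in the loop above.
func addResidualIfSameShape(y, residual [][]float64) [][]float64 {
	if len(y) != len(residual) || (len(y) > 0 && len(y[0]) != len(residual[0])) {
		return y
	}
	for t := range y {
		for c := range y[t] {
			y[t][c] += residual[t][c]
		}
	}
	return y
}

func main() {
	x := [][]float64{{1}, {2}, {3}, {4}, {5}}       // seqLen=5, 1 channel
	kernel := [][][]float64{{{1}}, {{1}}, {{1}}}     // kernelSize=3, 1 -> 1 channel
	fmt.Println(addResidualIfSameShape(conv1D(x, kernel, false), x)) // [[6] [9] [12]]
	fmt.Println(addResidualIfSameShape(conv1D(x, kernel, true), x))  // [[4] [8] [12] [16] [14]]
}
```

In the unpadded case the output has 3 positions instead of 5, so the residual is silently skipped; in the padded case the residual is added to every position.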

Notice how well it overfits the training data (98.90% accuracy on train-eval in the run below), but that doesn't help the test results (84.07%). Improving this would require careful regularization.

In [5]:
%% --set="model=cnn;l2_regularization=1e-3;learning_rate=1e-4;normalization=layer;train_steps=10000"
store, paramsSet := StoreFromSettings()
imdb.TrainWithStore(store, *flagDataDir, *flagCheckpoint, paramsSet, *flagEval, *flagVerbosity)
> Loading previously generated preprocessed binary file.
Loaded data from "aclImdb.bin": 100000 examples, 141088 unique tokens, 23727054 tokens in total.
Backend "xla":	xla:cuda - PJRT "cuda" plugin (/home/janpf/.local/lib/go-xla/nvidia/pjrt_c_api_cuda_plugin.so) v0.100 [StableHLO] [1 device(s)]
Model: cnn
         7% [=>......................................] (368 steps/s) [1s:25s] [step=719] [loss=1.58] [~loss=1.47] [~acc=51.61%]        
       100% [========================================] (944 steps/s) [step=9999] [loss=0.368] [~loss=0.331] [~acc=96.32%]        

Metric: accuracy

Metric: loss

	[Step 10000] median train step: 560 microseconds

Results on train-eval:
	Mean Loss (#loss): 0.263
	Mean Accuracy (#acc): 98.90%
Results on test-eval:
	Mean Loss (#loss): 0.978
	Mean Accuracy (#acc): 84.07%

Transformer Model¶

Finally, a Transformer version of the model, as defined in the famous "Attention Is All You Need" paper.

Notice it is not significantly better than the simple Bag Of Words model above, likely because there is not enough data for the transformer to make a difference. The success of transformers in large language models is in large part due to training on huge amounts of unsupervised (or self-supervised) data, but that is beyond the scope of this small test.

The code is in imdb.TransformerModelGraph, and the core of it looks like this:

    ...
	// Add the requested number of attention layers.
	numAttLayers := model.GetParamOr(scope, "transformer_num_att_layers", 1)
	numAttHeads := model.GetParamOr(scope, "transformer_num_att_heads", 2)
	attKeySize := model.GetParamOr(scope, "transformer_att_key_size", 8)
	for layerNum := range numAttLayers {
		// Each layer in its own scope.
		scope := scope.Inf("%03d_attention_layer", layerNum)
		residual := embed
		embed = layers.MultiHeadAttention(scope.In("000_attention"), embed, embed, embed, numAttHeads, attKeySize).
			SetKeyMask(mask).SetQueryMask(mask).
			SetOutputDim(embedSize).
			SetValueHeadDim(embedSize).Done()
		if dropoutNode != nil {
			embed = layers.Dropout(scope.In("001_dropout"), embed, dropoutNode)
		}
		embed = NormalizeSequence(scope.In("002_normalization"), embed)
		attentionOutput := embed

		// Transformers recipe: 2 dense layers after attention.
		embed = fnn.New(scope.In("003_fnn"), embed, embedSize).NumHiddenLayers(1, embedSize).Done()
		if dropoutNode != nil {
			embed = layers.Dropout(scope.In("004_dropout"), embed, dropoutNode)
		}
		embed = Add(embed, attentionOutput)
		embed = NormalizeSequence(scope.In("005_normalization"), embed)

		// Residual connection:
		if layerNum > 0 {
			embed = Add(residual, embed)
		}
	}
    ...

With only 5000 steps we got ~86% on the test data, with noticeable overfitting as well (91.68% on train-eval).

In [6]:
%% --set="model=transformer;normalization=none;activation=swish;l2_regularization=1e-3;cnn_dropout_rate=0.5;fnn_dropout_rate=0.3;learning_rate=1e-4;train_steps=5000"
store, paramsSet := StoreFromSettings()
imdb.TrainWithStore(store, *flagDataDir, *flagCheckpoint, paramsSet, *flagEval, *flagVerbosity)
> Loading previously generated preprocessed binary file.
Loaded data from "aclImdb.bin": 100000 examples, 141088 unique tokens, 23727054 tokens in total.
Backend "xla":	xla:cuda - PJRT "cuda" plugin (/home/janpf/.local/lib/go-xla/nvidia/pjrt_c_api_cuda_plugin.so) v0.100 [StableHLO] [1 device(s)]
Model: transformer
        14% [====>...................................] (429 steps/s) [2s:9s] [step=724] [loss=1.16] [~loss=1.2] [~acc=49.66%]          
       100% [========================================] (684 steps/s) [step=4999] [loss=0.625] [~loss=0.582] [~acc=91.01%]                

Metric: accuracy

Metric: loss

	[Step 5000] median train step: 689 microseconds

Results on train-eval:
	Mean Loss (#loss): 0.608
	Mean Accuracy (#acc): 91.68%
Results on test-eval:
	Mean Loss (#loss): 0.676
	Mean Accuracy (#acc): 86.15%