Train a seq2seq model in Keras without one-hot encoded data

Common seq2seq architectures typically use one-hot encoded inputs and outputs. However, when the target has a large vocabulary, say the 3k+ commonly used characters in Chinese, pre-computing a one-hot vector for every timestep of every sample consumes a large amount of memory and quickly becomes infeasible as the sample size grows. As a result, one may want to encode each distinct character as a single integer instead of a vector.
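As a rough illustration (with made-up numbers): 100,000 training samples of 100 timesteps each, one-hot encoded over a 3,000-character vocabulary in float32, would need about 100,000 × 100 × 3,000 × 4 bytes ≈ 120 GB, whereas the same sequences stored as 32-bit integer ids take only about 100,000 × 100 × 4 bytes ≈ 40 MB.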

On the input side, this is easily done with an embedding lookup (keras.layers.embeddings.Embedding). The output side, however, is trickier: since the model outputs a distribution over the vocabulary at each timestep, the integer-encoded target tensor cannot match the dimension of the model output without some tweaking.
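A minimal sketch of the input side, assuming a character-level encoder built with an LSTM; all layer sizes and variable names below are illustrative, not from any particular setup:

```python
from keras.layers import Input, Embedding, LSTM

vocab_size = 3000   # number of distinct characters (illustrative)
embed_dim = 128     # dimension of each character embedding (illustrative)
latent_dim = 256    # encoder hidden size (illustrative)

# (batch, timesteps) tensor of integer character ids -- no one-hot vectors needed
encoder_inputs = Input(shape=(None,), dtype='int32')
# the embedding lookup turns ids into dense (batch, timesteps, embed_dim) vectors
encoder_embedded = Embedding(vocab_size, embed_dim)(encoder_inputs)
# keep the final LSTM states to initialize the decoder with
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_embedded)
encoder_states = [state_h, state_c]
```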

Here's the trick: wrap the categorical_crossentropy loss in a custom function that one-hots the target tensor on demand, then manually construct the target tensor and feed it with integer-encoded data.
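A minimal sketch of such a custom loss, assuming a TensorFlow-backed Keras model whose decoder ends in a softmax over the vocabulary and targets fed as integer ids of shape (batch, timesteps, 1); the function and variable names are placeholders of my own:

```python
from keras import backend as K
from keras.losses import categorical_crossentropy

vocab_size = 3000  # must match the decoder's softmax dimension (illustrative)

def onehot_categorical_crossentropy(y_true, y_pred):
    # y_true: (batch, timesteps, 1) integer character ids
    # y_pred: (batch, timesteps, vocab_size) softmax distributions
    y_true_ids = K.cast(K.squeeze(y_true, axis=-1), 'int32')
    # build the one-hot representation on demand, inside the graph
    y_true_onehot = K.one_hot(y_true_ids, vocab_size)
    return categorical_crossentropy(y_true_onehot, y_pred)

# With the loss wrapped in a custom function, the model can be compiled and
# fit on integer-encoded targets directly (hypothetical arrays shown):
# model.compile(optimizer='rmsprop', loss=onehot_categorical_crossentropy)
# model.fit([encoder_input_ids, decoder_input_ids],
#           np.expand_dims(decoder_target_ids, -1),
#           batch_size=64, epochs=10)
```

The "manually construct the target tensor" step presumably maps to the target_tensors argument of model.compile in Keras 2, which lets you supply your own integer-typed target tensor in place of the default placeholder Keras would otherwise create from the softmax output shape.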
