Introduction
Automatic Speech Recognition (ASR) systems produce text transcriptions, usually as a plain sequence of words. Cisco uses ASR systems to provide real-time closed captioning in Webex meetings. One problem that arises is that captions are hard to read without punctuation and capitalization, and the meaning of the text itself can change depending on where the punctuation goes. Consider the following word sequence with two options for punctuation: “thank you your donation just helped someone get a job”.
Option A: “Thank you! Your donation just helped someone get a job.”
Option B: “Thank you! Your donation just helped someone. Get a job.”
One punctuation mark makes a big difference.
We’ll walk through several considerations when building a post-processing system:
- High-accuracy models for punctuation restoration and capitalization from raw text.
- Fast inference on interim results, to keep up with real-time captions.
- Small resource utilization: speech recognition is already computationally intensive; we don’t want our punctuation models to be computationally heavy as well.
- Ability to process out-of-vocabulary words: sometimes, we’ll need to punctuate or capitalize words that our model hasn’t seen before.
TruncBiRNN
Intuition and experiments show that future context is essential when building a punctuation model: it’s hard to determine the punctuation mark at the current position without knowing the next several words. To use information about the next tokens without having to update all hidden states in the backward direction for every new token, we decided to truncate the backward direction to a fixed window. The forward direction is just a regular RNN. In the backward direction, we only consider a fixed window of n tokens at each position and run the RNN over this window (figure 2). With this window, we achieve constant-time inference for each new input token: we only need to compute one hidden state in the forward direction and n + 1 in the backward direction. Now, for every token, we have hidden states for the forward and backward directions, respectively. Let’s call this layer TruncBiRNN, or TruncBiGRU (since we use GRU). These hidden states can be computed in constant time, independent of the input length, which is essential for the model to keep up with real-time captions.
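Here is a minimal PyTorch sketch of the idea, purely for illustration: the class name, window handling, and tensor shapes are assumptions rather than the actual Webex implementation.

```python
import torch
import torch.nn as nn


class TruncBiGRU(nn.Module):
    """Bidirectional GRU whose backward pass is truncated to a fixed
    look-ahead window, so each new token costs O(window) instead of O(T)."""

    def __init__(self, input_size: int, hidden_size: int, window: int):
        super().__init__()
        self.window = window
        self.hidden_size = hidden_size
        self.fwd = nn.GRU(input_size, hidden_size, batch_first=True)
        self.bwd = nn.GRUCell(input_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape

        # Forward direction: an ordinary left-to-right GRU.
        fwd_out, _ = self.fwd(x)                      # (batch, seq_len, hidden)

        # Backward direction: for every position t, run a small GRU
        # right-to-left over the fixed window x[t : t + window + 1].
        bwd_out = x.new_zeros(batch, seq_len, self.hidden_size)
        for t in range(seq_len):
            end = min(seq_len, t + self.window + 1)
            h = x.new_zeros(batch, self.hidden_size)
            for j in range(end - 1, t - 1, -1):       # walk the window backwards
                h = self.bwd(x[:, j, :], h)
            bwd_out[:, t, :] = h

        # Concatenate both directions, as in a standard BiGRU.
        return torch.cat([fwd_out, bwd_out], dim=-1)  # (batch, seq_len, 2 * hidden)
```

In a streaming setting, when a new token arrives only its forward hidden state and the backward states of the last n + 1 positions need to be recomputed, which is what keeps the per-token work constant.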
Architecture
The architecture consists of an embedding layer, a TruncBiGRU and a unidirectional GRU layer, and a fully connected layer. For the output, we use two softmax layers, for punctuation and capitalization, respectively (figure 3). For every word, the model predicts its capitalization and the punctuation mark that follows it. To better synchronize these two outputs and predict capitalization, we also need the embedding from the previous token (to restore the punctuation mark from the previous step). Together with a custom loss function (see next section), this allows us to avoid cases where a lowercase word is produced at the beginning of a sentence. For punctuation prediction, it’s also helpful to have the capitalization prediction of the next word. That’s why we concatenate current and next embeddings. (A minimal sketch of this model appears after the class lists below.) The output layer for punctuation predicts a distribution over all punctuation marks. For our model, it’s the following set:
period – a period in the middle of a sentence that doesn’t necessarily imply that the next word should be capitalized (“a.m.,” “D.C.,” etc.)
comma
question mark
ellipsis
colon
dash
terminal period – a period at the end of a sentence
For capitalization, we have the following classes:
lower
upper – all letters are capitalized (“IEEE,” “NASA,” etc.)
capitalized
mix_case – for words like “iPhone”
leading capitalized – words that start a sentence
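To make the pieces above concrete, here is a minimal sketch of how such a model could be wired up in PyTorch, reusing the TruncBiGRU sketch from the previous section. The hyperparameters, the extra “none” punctuation class, and the exact way consecutive hidden states are paired with each head are assumptions, not the production implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed label inventories, taken from the lists above (plus an assumed "none" class).
PUNCT_LABELS = ["none", "period", "comma", "question_mark", "ellipsis",
                "colon", "dash", "terminal_period"]
CAP_LABELS = ["lower", "upper", "capitalized", "mix_case", "leading_capitalized"]


class PunctCapModel(nn.Module):
    """Embedding -> TruncBiGRU -> unidirectional GRU -> two output heads
    (softmax is folded into the cross-entropy loss during training)."""

    def __init__(self, vocab_size: int, emb_dim: int = 128,
                 hidden: int = 256, window: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # TruncBiGRU is the truncated bidirectional layer sketched earlier.
        self.trunc_bigru = TruncBiGRU(emb_dim, hidden, window)
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        # Each head sees two consecutive hidden states, which helps keep
        # the punctuation and capitalization outputs consistent.
        self.punct_head = nn.Linear(2 * hidden, len(PUNCT_LABELS))
        self.cap_head = nn.Linear(2 * hidden, len(CAP_LABELS))

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, seq_len)
        h = self.trunc_bigru(self.embedding(token_ids))
        h, _ = self.gru(h)                              # (batch, seq_len, hidden)
        prv = F.pad(h[:, :-1, :], (0, 0, 1, 0))         # previous position (zero-padded at start)
        nxt = F.pad(h[:, 1:, :], (0, 0, 0, 1))          # next position (zero-padded at end)
        punct_logits = self.punct_head(torch.cat([h, nxt], dim=-1))  # mark after each word
        cap_logits = self.cap_head(torch.cat([prv, h], dim=-1))      # capitalization of each word
        return punct_logits, cap_logits
```

Pairing each position with its neighbor gives both heads a view of two consecutive tensors from the previous layer, which is what makes the consistency penalty described next straightforward to apply.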
The additional classes, “leading capitalized” and “terminal period,” may seem redundant at first glance, but they help increase the consistency of the answers related to capitalization and punctuation. A “terminal period” implies that the next capitalization answer can’t be “lower,” while “leading capitalized” means that the previous punctuation mark is a “terminal period” or a question mark. These classes play an important role in the loss function.
Loss Function
We need to optimize both capitalization and punctuation. To achieve this, we use a sum of the two log loss functions, weighted by a coefficient. However, as stated earlier, the outputs of the neural network may not be perfectly correlated: for example, the punctuator may predict a “terminal period” for the current word while the capitalizer doesn’t predict “leading capitalized” for the next token. This type of mistake, while rare, can be very striking. To deal with it, we add a penalty term to the loss function that penalizes exactly this kind of inconsistency. The first term of the penalty corresponds to the probability of having “leading capitalized” after a non-“terminal period,” and the second to the probability of not having “leading capitalized” after a “terminal period.” The penalty sums over the tokens where this error occurs. Additionally, we pass two consecutive tensors from the previous layer to the softmax layers, which lets us efficiently reduce the penalty terms. Combining the log losses and the penalty gives us the final loss function.
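As a rough sketch of what these terms might look like (the exact formulas, the coefficient names $\alpha$ and $\beta$, and the index set $\mathcal{E}$ of inconsistent positions are assumptions based on the description above, where $p^{\text{punct}}_i$ and $p^{\text{cap}}_i$ denote the softmax outputs for token $i$, and “term” and “lead” abbreviate “terminal period” and “leading capitalized”):

$$
L_{\text{base}} = -\sum_{i}\Big(\log p^{\text{punct}}_{i}\!\big(y^{\text{punct}}_{i}\big) + \alpha \,\log p^{\text{cap}}_{i}\!\big(y^{\text{cap}}_{i}\big)\Big)
$$

$$
L_{\text{pen}} = \beta \sum_{i \in \mathcal{E}} \Big[\big(1 - p^{\text{punct}}_{i}(\text{term})\big)\, p^{\text{cap}}_{i+1}(\text{lead}) + p^{\text{punct}}_{i}(\text{term})\,\big(1 - p^{\text{cap}}_{i+1}(\text{lead})\big)\Big]
$$

$$
L = L_{\text{base}} + L_{\text{pen}}
$$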