Can Whisper be used for real-time streaming ASR?
By Efficient NLP
Summary
## Key takeaways
- **Whisper Limited to 30s Chunks**: Whisper is trained to process 30-second audio inputs; there is no way to feed anything longer than 30 seconds into the model in one pass. [02:01], [02:17]
- **Chunking Risks Word Splits**: Naively splitting audio into 30-second chunks can cut a word in half, which will probably not be recognized correctly, and adds up to 30 seconds of latency before the first word appears. [02:31], [02:44]
- **Whisper-Streaming Uses Growing Buffers**: Whisper-Streaming feeds audio buffers of increasing size into Whisper, growing by 1 second per iteration by default, until it hits an end-of-sentence marker, then scrolls the buffer forward. [03:37], [04:05]
- **LocalAgreement Confirms Tokens**: LocalAgreement with n=2 confirms a token only if it is generated in two consecutive audio buffers; unconfirmed gray tokens may still change, but confirmed black tokens are permanent. [05:01], [05:26]
- **Prompting Boosts New Sentences**: When starting a new sentence, Whisper-Streaming feeds the previous sentence into the model as prompt tokens, improving accuracy by providing more context. [06:05], [06:23]
- **Whisper Inefficient on Long Sentences**: Whisper assumes each audio buffer starts at the beginning of a sentence, so a long sentence causes its beginning to be processed many times, unlike streaming-native models with fixed receptive fields. [07:14], [07:34]
Topics Covered
- Streaming ASR Demands Sentence-Aligned Buffers
- Confirm Tokens via Local Agreement
- Prompting with Prior Sentences Boosts Accuracy
- Retrospective Processing Inefficient for Streaming
Full Transcript
Can the OpenAI Whisper model do streaming ASR? My name is Bai. I'm a machine learning engineer with a PhD in natural language processing, and today I will answer this question. If you haven't seen the Whisper model, it is a speech-to-text model. It is trained on about 100 languages on 680,000 hours of data. The model architecture is an encoder-decoder transformer, it comes in five different sizes, and it has been growing in popularity recently because it is robust to noise and accents. So the question of the day is: can this model do streaming speech recognition?
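For reference, here is a minimal sketch using the openai-whisper Python package to list the available checkpoints and load one of the sizes mentioned above; `available_models()` and `load_model()` are entry points of that package, and the five base sizes are tiny, base, small, medium, and large.

```python
# List the checkpoints bundled with the openai-whisper package and load one of
# the five base sizes (tiny, base, small, medium, large).
import whisper

print(whisper.available_models())    # also includes .en and large-v* variants
model = whisper.load_model("small")  # encoder-decoder transformer checkpoint
print(model.dims)                    # layer counts, attention heads, etc.
```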
So first of all, what do we mean by streaming ASR? Well, automatic speech recognition can come in two forms: batch and streaming. Batch means the model takes a bunch of speech and produces a bunch of text. Streaming ASR is when the model needs to produce output as the speaker is saying things, with a delay of no more than a few seconds. For example, if you're listening to a live broadcast, you don't want to first record the entire broadcast and then produce the subtitles; you want them to appear at most a few seconds after the words are said. Generally, you'll expect streaming ASR to be a few percentage points lower in accuracy than batch ASR, because the model does not have access to all the context, only up to the current word.
current word. The question of how to do streaming ASR came up when I was building the voice writer. This is an AI writing assistant that works in two steps. In the first step, you just speak
steps. In the first step, you just speak your thoughts without worrying too much about the grammar. And in the second step, the AI corrects the grammar for you. I've been finding it super helpful
you. I've been finding it super helpful for writing all kinds of things, including emails, blog posts, and Slack messages. You can try it out for free in
messages. You can try it out for free in the link here. And I'll also post a link in the description of the video. Now,
Now, back to the video. First of all, why is this even difficult? Why is it not so easy to use the Whisper model for streaming ASR? The issue is that Whisper is trained to process input audio that is 30 seconds long. If the audio is shorter than 30 seconds, you can pad it to 30 seconds, so this is not too much of a problem. But what if your audio is longer than 30 seconds? There is no way to feed anything longer than 30 seconds into Whisper.
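To make the 30-second constraint concrete, here is a minimal sketch using the openai-whisper package: `pad_or_trim` zero-pads shorter audio and truncates longer audio to exactly 30 seconds before the model sees it. The file name is a placeholder.

```python
# Whisper always consumes exactly 30 seconds of audio: pad_or_trim zero-pads
# shorter clips and truncates longer ones. "speech.wav" is a placeholder.
import whisper

model = whisper.load_model("small")
audio = whisper.load_audio("speech.wav")           # 16 kHz mono samples
audio_30s = whisper.pad_or_trim(audio)             # exactly 30 s of samples
mel = whisper.log_mel_spectrogram(audio_30s).to(model.device)

result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```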
So the first thing you might think of is: what if you just split the audio into chunks of 30 seconds and process each chunk one at a time? Well, the first problem is that you might split it in the middle of a word, and that word will probably not be recognized correctly. The second problem is that the latency will be really high. Imagine you're processing audio: you wait for the audio to fill up these 30 seconds, and only then do you process the entire chunk, so the latency for the first word will be up to 30 seconds.
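A rough sketch of this naive chunking approach, assuming a placeholder audio file, makes both problems visible: chunk boundaries ignore word boundaries, and nothing can be emitted for a chunk until its full 30 seconds have been collected.

```python
# Naive fixed chunking: split the recording into 30-second pieces and
# transcribe each one independently. A chunk boundary may cut a word in half,
# and a chunk's text only appears after all 30 seconds of it were recorded.
import whisper

SAMPLE_RATE = 16_000      # Whisper's expected sample rate
CHUNK_SECONDS = 30

model = whisper.load_model("small")
audio = whisper.load_audio("long_recording.wav")   # placeholder file name

chunk_samples = CHUNK_SECONDS * SAMPLE_RATE
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]     # boundary ignores words
    result = model.transcribe(chunk, fp16=False)
    print(result["text"])                          # up to ~30 s after speaking
```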
Fortunately, there is an open-source repository called Whisper-Streaming that does this for you and turns the Whisper model into a streaming ASR system. There are instructions on how to set this up, so let's try it out. I run this command to start a model running, using a Whisper small model, then I go to this tab, run another command, and just start speaking. You can pretty quickly see that it's picking it up; it's quick and responsive, and it even gives timestamps. So it's a pretty cool demo.
Now let's talk about how this works. We start with this audio file, and what we do is feed chunks of increasing size into Whisper. The size of these chunks is configured by the minimum chunk size parameter, which by default is 1 second. In each iteration, we increase the size of this buffer by 1 second and feed all of it into Whisper. This process continues until we hit an end-of-sentence marker like a period or a question mark. Then we move the buffer forward and start the process again. So each piece of audio is processed by Whisper multiple times, as many times as it takes until we hit the end of the sentence. The reason for this is that Whisper is trained on sentences, so it gives the best results when the start of the audio aligns with the start of a sentence. That's why the buffer only moves forward when the previous sentence is complete and we're beginning a new sentence. This way, Whisper never needs to start transcribing from the middle of a sentence, which would give suboptimal results.
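Below is a simplified sketch of this growing-buffer loop, not the actual Whisper-Streaming code: the microphone reader is a stub, and the buffer is simply reset at a sentence end, whereas the real implementation trims it using word timestamps.

```python
# Simplified growing-buffer loop (not the actual Whisper-Streaming code).
import numpy as np
import whisper

SAMPLE_RATE = 16_000
MIN_CHUNK_SECONDS = 1.0            # default "minimum chunk size" from the video

model = whisper.load_model("small")
buffer = np.zeros(0, dtype=np.float32)

def read_microphone(seconds):
    """Stub for real audio capture; returns `seconds` of 16 kHz mono audio."""
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

while True:
    # 1. Grow the buffer by roughly one second of fresh audio.
    buffer = np.concatenate([buffer, read_microphone(MIN_CHUNK_SECONDS)])

    # 2. Re-transcribe the whole buffer, which always starts at a sentence start.
    text = model.transcribe(buffer, fp16=False)["text"]
    print(text)

    # 3. When the hypothesis ends a sentence, scroll the buffer forward so the
    #    next buffer begins at the new sentence. (The real implementation trims
    #    using word timestamps; resetting here keeps the sketch short.)
    if text.rstrip().endswith((".", "?", "!")):
        buffer = np.zeros(0, dtype=np.float32)
```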
In many applications, it is useful to have some results available as soon as possible, even if they're not completely accurate. In Voice Writer, I show the incomplete results in gray, whereas the confirmed results are in black. The difference is that the gray part of the sentence is unconfirmed, so it may change as the model gets new information, but the black part is the confirmed result, and it is permanent. This works using an algorithm called LocalAgreement with n = 2, which means that for a token to be confirmed, it needs to be generated in two consecutive audio buffers. Let's give an example. Say in the first step the Whisper model outputs the three tokens "if you like", and nothing is confirmed at this step. In the second step, the model produces more tokens, but only the first two tokens agree with the previous step, so those two are confirmed. In the third and fourth steps, more tokens are generated, but at each step a token is not confirmed until it has been generated in two consecutive chunks. So anything in the gray part can still change given new information; for example, the word "view" might change to "video" once the model hears the rest of the sentence. But anything in the black part is permanent: even if the model wants to change it to something else after more iterations, it cannot be changed anymore.
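A minimal sketch of the LocalAgreement-2 rule as described here: the confirmed output is the longest common prefix of the hypotheses from two consecutive buffers, and everything after it stays gray. The token lists mirror the example from the video.

```python
# LocalAgreement with n=2: a token is confirmed once it appears at the same
# position in the hypotheses of two consecutive audio buffers.
def local_agreement(prev_tokens, curr_tokens):
    """Return (confirmed, unconfirmed) given the previous and current hypotheses."""
    confirmed = []
    for prev, curr in zip(prev_tokens, curr_tokens):
        if prev != curr:
            break
        confirmed.append(curr)
    return confirmed, curr_tokens[len(confirmed):]

# Example mirroring the video: only the first two tokens agree across steps.
step1 = ["if", "you", "like"]
step2 = ["if", "you", "view", "this"]
confirmed, unconfirmed = local_agreement(step1, step2)
print(confirmed)    # ['if', 'you']    -> shown in black, permanent
print(unconfirmed)  # ['view', 'this'] -> shown in gray, may still change
```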
One last thing this system does when it generates a new sentence is feed the previous sentence into the model as prompt tokens. This is something you can do in the Whisper format: you can give it a bunch of prompt tokens before you start generating, and this tends to improve the accuracy a little bit, because more context is always good.
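In the openai-whisper package this corresponds to the `initial_prompt` argument of `transcribe`; the sketch below, with a placeholder audio file, shows the previous sentence being passed in as extra context, which is roughly what Whisper-Streaming does with prompt tokens when a new sentence starts.

```python
# Passing the previous sentence back in as context. In openai-whisper this is
# the `initial_prompt` argument of `transcribe`; "next_sentence.wav" is a
# placeholder for the audio of the sentence currently being transcribed.
import whisper

model = whisper.load_model("small")
previous_sentence = "So the question of the day is: can this model do streaming speech recognition?"

result = model.transcribe(
    "next_sentence.wav",
    initial_prompt=previous_sentence,   # extra context tends to improve accuracy
    fp16=False,
)
print(result["text"])
```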
So in summary, the algorithm comes down to three basic ideas. First, we feed longer and longer consecutive audio buffers into Whisper. Second, we emit tokens as soon as they're confirmed by two iterations. And finally, we scroll the audio buffer forward whenever a sentence is completed. It's a pretty simple algorithm that you can apply to any speech-to-text model that does not support streaming and basically turn it into a streaming model.
If you like, you can check out all the details in this paper, which I'll link in the description. One of the limitations comes from the fact that Whisper was not really designed to be a streaming model. Because of this, it assumes that each audio buffer starts at the beginning of a sentence. Therefore, if you have a sentence that is quite long, the beginning of the sentence has to be fed into the model and processed many times. This is inefficient and would not be necessary if the model had been trained from the beginning to do streaming ASR. Now, if we look at an architecture that was designed specifically for streaming speech recognition, it looks a little different. Here is a model that was proposed in 2021. At each step, it predicts a token, and it has access to a fixed amount of past context and future context. During training, this rule is enforced by an attention mask that is mostly zeros, meaning the model cannot use information outside of this fixed window. During prediction, the model predicts a token given a limited amount of fixed context. To make the next prediction, the entire receptive field moves forward by one chunk, but the size of the receptive field stays fixed. This way, the beginning of the sentence is not processed multiple times. However, this is not possible with the Whisper model, because it would require modifying the architecture and retraining the model, and we do not have access to Whisper's training data.
That's it for this video and I hope you enjoyed my explanation of how Whisper can be turned into a streaming model. If
you enjoyed this video, please leave a comment, hit the subscribe button, and ring the bell icon to stay notified when I release future videos. Goodbye.