Can Whisper be used for real-time streaming ASR?
By Efficient NLP
Summary
## Key takeaways
- **Whisper Limited to 30s Chunks**: Whisper is trained to process 30-second audio inputs; there is no way to feed anything longer than 30 seconds into the model in one pass. [02:01], [02:17]
- **Chunking Risks Word Splits**: Naively splitting audio into 30-second chunks can cut a word in half, which will probably not be recognized correctly, and adds up to 30 seconds of latency before the first word appears. [02:31], [02:44]
- **Whisper-Streaming Uses Growing Buffers**: Whisper-Streaming feeds audio buffers of increasing size into Whisper, growing by 1 second per iteration by default, until it hits an end-of-sentence marker, then scrolls the buffer forward. [03:37], [04:05]
- **LocalAgreement Confirms Tokens**: LocalAgreement with n=2 confirms a token only if it is generated in two consecutive audio buffers; unconfirmed gray tokens may still change, but confirmed black tokens are permanent. [05:01], [05:26]
- **Prompting Boosts New Sentences**: When starting a new sentence, Whisper-Streaming feeds the previous sentence into the model as prompt tokens, improving accuracy by providing more context. [06:05], [06:23]
- **Whisper Inefficient on Long Sentences**: Whisper assumes each audio buffer starts at the beginning of a sentence, so a long sentence causes its beginning to be processed many times, unlike streaming-native models with fixed receptive fields. [07:14], [07:34]
Topics Covered
- Streaming ASR Demands Sentence-Aligned Buffers
- Confirm Tokens via Local Agreement
- Prompting with Prior Sentences Boosts Accuracy
- Retrospective Processing Inefficient for Streaming
Full Transcript
Can the OpenAI Whisper model do streaming ASR? My name is Bai. I'm a machine learning engineer with a PhD in natural language processing, and today I will answer this question. If you haven't seen the Whisper model, it is a speech-to-text model. It is trained on about 100 languages on 680,000 hours of data. The model architecture is an encoder-decoder transformer, it comes in five different sizes, and it has been growing in popularity recently because it is robust to noise and accents. So the question of the day is: can this model do streaming speech recognition?
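For reference, here is a minimal sketch using the openai-whisper Python package to list the available checkpoints and load one of the sizes mentioned above; `available_models()` and `load_model()` are entry points of that package, and the five base sizes are tiny, base, small, medium, and large.

```python
# List the checkpoints bundled with the openai-whisper package and load one of
# the five base sizes (tiny, base, small, medium, large).
import whisper

print(whisper.available_models())    # also includes .en and large-v* variants
model = whisper.load_model("small")  # encoder-decoder transformer checkpoint
print(model.dims)                    # layer counts, attention heads, etc.
```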
So first of all, what do we mean by streaming ASR? Well, automatic speech recognition can come in two forms: batch and streaming. Batch means the model takes a bunch of speech and produces a bunch of text. Streaming ASR is when the model needs to produce output as the speaker is saying things, with a delay of no more than a few seconds. For example, if you're listening to a live broadcast, you don't want to first record the entire broadcast and then produce the subtitles; you want them to appear at most a few seconds after the words are said. Generally, you'll expect streaming ASR to be a few percentage points lower in accuracy than batch ASR, because the model does not have access to all the context, only up to the current word.
current word. The question of how to do streaming ASR came up when I was building the voice writer. This is an AI writing assistant that works in two steps. In the first step, you just speak
steps. In the first step, you just speak your thoughts without worrying too much about the grammar. And in the second step, the AI corrects the grammar for you. I've been finding it super helpful
you. I've been finding it super helpful for writing all kinds of things, including emails, blog posts, and Slack messages. You can try it out for free in
messages. You can try it out for free in the link here. And I'll also post a link in the description of the video. Now,
Now, back to the video. First of all, why is this even difficult? Why is it not so easy to use the Whisper model for streaming ASR? The issue is that Whisper is trained to process input audio that is 30 seconds long. If the audio is shorter than 30 seconds, you can pad it to 30 seconds, so this is not too much of a problem. But what if your audio is longer than 30 seconds? There is no way to feed anything longer than 30 seconds into Whisper.
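To make the 30-second constraint concrete, here is a minimal sketch using the openai-whisper package: `pad_or_trim` zero-pads shorter audio and truncates longer audio to exactly 30 seconds before the model sees it. The file name is a placeholder.

```python
# Whisper always consumes exactly 30 seconds of audio: pad_or_trim zero-pads
# shorter clips and truncates longer ones. "speech.wav" is a placeholder.
import whisper

model = whisper.load_model("small")
audio = whisper.load_audio("speech.wav")           # 16 kHz mono samples
audio_30s = whisper.pad_or_trim(audio)             # exactly 30 s of samples
mel = whisper.log_mel_spectrogram(audio_30s).to(model.device)

result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```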
So the first thing you might think of is: what if you just split the audio into chunks of 30 seconds and process each chunk one at a time? Well, the first problem is that you might split it in the middle of a word, and that word will probably not be recognized correctly. The second problem is that the latency will be really high. Imagine you're processing audio: you wait for the audio to fill up these 30 seconds, and only then do you process the entire chunk, so the latency for the first word will be up to 30 seconds.
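A rough sketch of this naive chunking approach, assuming a placeholder audio file, makes both problems visible: chunk boundaries ignore word boundaries, and nothing can be emitted for a chunk until its full 30 seconds have been collected.

```python
# Naive fixed chunking: split the recording into 30-second pieces and
# transcribe each one independently. A chunk boundary may cut a word in half,
# and a chunk's text only appears after all 30 seconds of it were recorded.
import whisper

SAMPLE_RATE = 16_000      # Whisper's expected sample rate
CHUNK_SECONDS = 30

model = whisper.load_model("small")
audio = whisper.load_audio("long_recording.wav")   # placeholder file name

chunk_samples = CHUNK_SECONDS * SAMPLE_RATE
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]     # boundary ignores words
    result = model.transcribe(chunk, fp16=False)
    print(result["text"])                          # up to ~30 s after speaking
```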
Fortunately, there is an open-source repository called Whisper-Streaming that does this for you and turns the Whisper model into a streaming ASR system. There are instructions on how to set this up, so let's try it out. I run this command to start a model running, using a Whisper small model, then I go to this tab, run another command, and just start speaking. You can pretty quickly see that it's picking it up; it's quick and responsive, and it even gives timestamps. So it's a pretty cool demo.
Now let's talk about how this works. We start with this audio file, and what we do is feed chunks of increasing size into Whisper. The size of these chunks is configured by the minimum chunk size parameter, which by default is 1 second. In each iteration, we increase the size of this buffer by 1 second and feed all of it into Whisper. This process continues until we hit an end-of-sentence marker like a period or a question mark. Then we move the buffer forward and start the process again. So each piece of audio is processed by Whisper multiple times, as many times as it takes until we hit the end of the sentence. The reason for this is that Whisper is trained on sentences, so it gives the best results when the start of the audio aligns with the start of a sentence. That's why the buffer only moves forward when the previous sentence is complete and we're beginning a new sentence. This way, Whisper never needs to start transcribing from the middle of a sentence, which would give suboptimal results.
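Below is a simplified sketch of this growing-buffer loop, not the actual Whisper-Streaming code: the microphone reader is a stub, and the buffer is simply reset at a sentence end, whereas the real implementation trims it using word timestamps.

```python
# Simplified growing-buffer loop (not the actual Whisper-Streaming code).
import numpy as np
import whisper

SAMPLE_RATE = 16_000
MIN_CHUNK_SECONDS = 1.0            # default "minimum chunk size" from the video

model = whisper.load_model("small")
buffer = np.zeros(0, dtype=np.float32)

def read_microphone(seconds):
    """Stub for real audio capture; returns `seconds` of 16 kHz mono audio."""
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

while True:
    # 1. Grow the buffer by roughly one second of fresh audio.
    buffer = np.concatenate([buffer, read_microphone(MIN_CHUNK_SECONDS)])

    # 2. Re-transcribe the whole buffer, which always starts at a sentence start.
    text = model.transcribe(buffer, fp16=False)["text"]
    print(text)

    # 3. When the hypothesis ends a sentence, scroll the buffer forward so the
    #    next buffer begins at the new sentence. (The real implementation trims
    #    using word timestamps; resetting here keeps the sketch short.)
    if text.rstrip().endswith((".", "?", "!")):
        buffer = np.zeros(0, dtype=np.float32)
```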
In many applications, it is useful to have some results available as soon as possible, even if they're not completely accurate. In Voice Writer, I show the incomplete results in gray, whereas the confirmed results are in black. The difference is that the gray part of the sentence is unconfirmed, so it may change as the model gets new information, but the black part is the confirmed result, and it is permanent. This works using an algorithm called LocalAgreement with n = 2, which means that for a token to be confirmed, it needs to be generated in two consecutive audio buffers. Let's give an example. Say in the first step the Whisper model outputs the three tokens "if you like", and nothing is confirmed at this step. In the second step, the model produces more tokens, but only the first two tokens agree with the previous step, so those two are confirmed. In the third and fourth steps, more tokens are generated, but at each step a token is not confirmed until it has been generated in two consecutive chunks. So anything in the gray part can still change given new information; for example, the word "view" might change to "video" once the model hears the rest of the sentence. But anything in the black part is permanent: even if the model wants to change it to something else after more iterations, it cannot be changed anymore.
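A minimal sketch of the LocalAgreement-2 rule as described here: the confirmed output is the longest common prefix of the hypotheses from two consecutive buffers, and everything after it stays gray. The token lists mirror the example from the video.

```python
# LocalAgreement with n=2: a token is confirmed once it appears at the same
# position in the hypotheses of two consecutive audio buffers.
def local_agreement(prev_tokens, curr_tokens):
    """Return (confirmed, unconfirmed) given the previous and current hypotheses."""
    confirmed = []
    for prev, curr in zip(prev_tokens, curr_tokens):
        if prev != curr:
            break
        confirmed.append(curr)
    return confirmed, curr_tokens[len(confirmed):]

# Example mirroring the video: only the first two tokens agree across steps.
step1 = ["if", "you", "like"]
step2 = ["if", "you", "view", "this"]
confirmed, unconfirmed = local_agreement(step1, step2)
print(confirmed)    # ['if', 'you']    -> shown in black, permanent
print(unconfirmed)  # ['view', 'this'] -> shown in gray, may still change
```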
One last thing this system does when it generates a new sentence is feed the previous sentence into the model as prompt tokens. This is something you can do in the Whisper format: you can give it a bunch of prompt tokens before you start generating, and this tends to improve the accuracy a little bit, because more context is always good.
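In the openai-whisper package this corresponds to the `initial_prompt` argument of `transcribe`; the sketch below, with a placeholder audio file, shows the previous sentence being passed in as extra context, which is roughly what Whisper-Streaming does with prompt tokens when a new sentence starts.

```python
# Passing the previous sentence back in as context. In openai-whisper this is
# the `initial_prompt` argument of `transcribe`; "next_sentence.wav" is a
# placeholder for the audio of the sentence currently being transcribed.
import whisper

model = whisper.load_model("small")
previous_sentence = "So the question of the day is: can this model do streaming speech recognition?"

result = model.transcribe(
    "next_sentence.wav",
    initial_prompt=previous_sentence,   # extra context tends to improve accuracy
    fp16=False,
)
print(result["text"])
```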
So in summary, the algorithm comes down to three basic ideas. First, we feed longer and longer consecutive audio buffers into Whisper. Second, we emit tokens as soon as they're confirmed by two iterations. And finally, we scroll the audio buffer forward whenever a sentence is completed. It's a pretty simple algorithm that you can apply to any speech-to-text model that does not support streaming and basically turn it into a streaming model.
If you like, you can check out all the details in this paper, which I'll link in the description. One of the limitations comes from the fact that Whisper was not really designed to be a streaming model. Because of this, it assumes that each audio buffer starts at the beginning of a sentence. Therefore, if you have a sentence that is quite long, the beginning of the sentence has to be fed into the model and processed many times. This is inefficient and would not be necessary if the model had been trained from the beginning to do streaming ASR. Now, if we look at an architecture that was designed specifically for streaming speech recognition, it looks a little different. Here is a model that was proposed in 2021. At each step, it predicts a token, and it has access to a fixed amount of past context and future context. During training, this rule is enforced by an attention mask that is mostly zeros, meaning the model cannot use information outside of this fixed window. During prediction, the model predicts a token given a limited amount of fixed context. To make the next prediction, the entire receptive field moves forward by one chunk, but the size of the receptive field stays fixed. This way, the beginning of the sentence is not processed multiple times. However, this is not possible with the Whisper model, because it would require modifying the architecture and retraining the model, and we do not have access to Whisper's training data.
That's it for this video and I hope you enjoyed my explanation of how Whisper can be turned into a streaming model. If
you enjoyed this video, please leave a comment, hit the subscribe button, and ring the bell icon to stay notified when I release future videos. Goodbye.