Hey thanks for covering our work. Really neat explanations!
*Abstract*
This video explores the potential of OpenAI's Whisper model for real-time streaming automatic speech recognition (ASR). While Whisper excels in batch ASR, its ability to handle streaming scenarios with low latency is less obvious. The video introduces the open-source whisper-streaming project, which adapts Whisper for streaming applications by processing consecutive audio buffers of increasing size and confirming output tokens using the LocalAgreement algorithm. The video also discusses the limitations of this approach compared to models specifically designed for streaming ASR.
*Summary*
*Introduction (**0:00**)*
* The video investigates whether OpenAI's Whisper model can be used for real-time streaming ASR.
* Whisper is a powerful ASR model trained on a massive multilingual dataset, known for its robustness to noise and accents.
*Batch vs Streaming ASR (**0:35**)*
* Batch ASR processes entire audio recordings at once, while streaming ASR produces output as the speaker talks, with minimal delay.
* Streaming ASR is crucial for applications like live captioning, where real-time transcription is essential.
*Why is Streaming Whisper Difficult? (**1:55**)*
* Whisper is designed for processing fixed-length audio segments (30 seconds), making it challenging to handle longer recordings in a streaming fashion.
* Simply splitting audio into chunks can lead to inaccurate word recognition and high latency.
*Whisper-streaming Demo (**2:58**)*
* The video showcases the open-source whisper-streaming project, which enables real-time transcription using Whisper.
* The demo demonstrates the project's ability to transcribe speech with minimal delay and provide timestamps.
*Processing Consecutive Audio Buffers (**3:38**)*
* Whisper-streaming feeds progressively larger audio buffers into Whisper until an end-of-sentence marker is detected.
* This ensures that Whisper processes complete sentences, leading to better accuracy.
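The buffer-growing loop can be sketched in a few lines of Python. This is a simplification, not the actual whisper-streaming code: `transcribe` stands in for a Whisper call, and the end-of-sentence check is reduced to looking for final punctuation.

```python
def stream_transcribe(audio_chunks, transcribe):
    """Grow an audio buffer one chunk at a time, re-running Whisper on the
    whole buffer, and reset the buffer once a sentence is complete."""
    buffer = []
    for chunk in audio_chunks:        # e.g. 1-second blocks of audio samples
        buffer.extend(chunk)
        text = transcribe(buffer)     # placeholder for a Whisper call
        yield text
        if text.rstrip().endswith((".", "!", "?")):  # end-of-sentence marker
            buffer = []               # start fresh on the next sentence
```

Because the buffer resets at sentence boundaries, Whisper mostly sees complete sentences rather than arbitrary slices of audio.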
*Confirming Tokens with LocalAgreement (**4:36**)*
* The LocalAgreement algorithm confirms output tokens only after they are generated in two consecutive audio buffers.
* This helps distinguish between confirmed and unconfirmed transcription results, allowing for real-time feedback with potential corrections.
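The core of LocalAgreement-2 can be sketched as a longest-common-prefix computation over consecutive hypotheses (a simplification of the real implementation, which works on Whisper's tokens and timestamps):

```python
def confirm_tokens(prev_hyp, curr_hyp):
    """Return the longest common prefix of two consecutive Whisper
    hypotheses; these tokens are treated as confirmed and final."""
    confirmed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        confirmed.append(a)
    return confirmed

# The hypothesis from the larger buffer agrees on the first three tokens,
# so "the cat sat" is confirmed; anything after that may still change.
print(confirm_tokens("the cat sat".split(),
                     "the cat sat on the".split()))
```

Everything past the confirmed prefix is shown as tentative output that may be revised when the next, larger buffer is transcribed.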
*Prompting Previous Context (**6:05**)*
* Whisper-streaming uses the previous sentence as prompt tokens for the model, providing additional context and improving accuracy.
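Both openai-whisper and faster-whisper expose an `initial_prompt` argument on `transcribe` for exactly this purpose. A hypothetical helper for picking the prompt might look like the following (whisper-streaming handles this internally; `build_prompt` is illustrative, not part of its API):

```python
import re

def build_prompt(confirmed_text):
    """Pick the last completed sentence of the confirmed transcript to
    pass as Whisper's `initial_prompt` for the next audio buffer."""
    sentences = re.split(r"(?<=[.!?])\s+", confirmed_text.strip())
    return sentences[-1]

# Usage with a Whisper backend would look roughly like:
#   model.transcribe(audio_buffer, initial_prompt=build_prompt(confirmed))
print(build_prompt("The meeting starts at noon. Please be on time."))
```

Conditioning on the previous sentence gives the decoder context across buffer boundaries, which helps with consistent spelling, casing, and terminology.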
*Limitations vs Other Streaming ASR Models (**7:01**)*
* Whisper's design isn't optimized for streaming, leading to inefficiencies like repeatedly processing the beginning of long sentences.
* Dedicated streaming ASR models utilize architectures that allow for efficient processing of continuous audio streams with fixed context windows.
* Adapting Whisper for streaming requires modifying its architecture and retraining, which is currently limited by data accessibility.
I used Gemini 1.5 Pro for the summary
Thank you - yet another beautifully explained topic 🙂
Nice video and great voice writer! I have tried implementing it with the Transformers.js package and its Whisper model, but no luck yet since the processing is heavy.
There are a number of things you can do to speed up the Whisper model. Some backends are more optimized depending on your hardware; faster-whisper is a popular one. You can also try smaller models: "base" is a good tradeoff, sacrificing some quality for better performance.
Thank you, I really enjoyed your explanation. Is it possible to explain and introduce models designed specifically for streaming?
Not sure if this is what you're asking about, but I have a video about Whisper fine-tuning that explains the architecture of the Whisper model as well!
What happens if 2 consecutive predictions continue to disagree on a specific word? Do you pick one of the options at random? Or does the sentence starting at that word never become confirmed?
Generally, the predictions change up to a certain point, after which they no longer change based on additional inputs, and then they are confirmed. If this never occurs, then I guess it will need to handle this edge case in some way, such as picking randomly, but this should not happen often.
Thanks for the video!! This is a great technique, and I am thinking of using it for our application. I have one question: when the words are confirmed, why don't you feed the partial audio (excluding the confirmed part) along with the confirmed text in the initial prompt? Would that be a lot faster when a sentence is really long, or faster on smaller chips like SBCs?
The main issue is that Whisper is trained on audio that starts at the beginning of a sentence, so feeding it audio that starts mid-sentence would be out of distribution. Your suggestion would be more efficient but may lead to a degradation in transcript quality.
I'm trying to see how to take this module and integrate it within a real-time audio pipeline, just like your project. I'm kind of lost right now and would love a bit of feedback on your process.
Sure, if you DM me on LinkedIn, I'd be happy to chat about it.
@@EfficientNLP Thanks, will do.
Thank you. Are you using faster-whisper as your backend? I'm trying to achieve something similar but with whisper.cpp.
This method should work for any backend, but only faster-whisper is supported in the current implementation of whisper-streaming. Some modification will be required to make it work for whisper.cpp.
@@EfficientNLP Interesting; thank you very much.
Did you end up figuring it out with whisper.cpp?
I would like something like your Voice Writer, but instead of outputting text it should output speech. It should remove my grammar mistakes and accent but copy my intonation. Do you think this is possible at this time? I can't find good text-to-speech or voice cloning models.
This sounds quite different from what I'm building with Voice Writer. I've not looked at voice cloning models before, so I'm not sure of their feasibility, but it's a good and potentially useful project idea.
What latencies can be expected with Whisper Streaming? I'd like to know what to expect before going down that route.
Latency depends on various factors such as your hardware, model size, and options like the minimum chunk size; the paper reports latencies between 3 and 6 seconds depending on the configuration.
@@EfficientNLP Thanks for your reply. I require an order of magnitude less for my application, which I currently get with Azure. Whisper ASR is not ready for me to jump in quite yet. Cheers