Hey thanks for covering our work. Really neat explanations!
*Abstract*
This video explores the potential of OpenAI's Whisper model for real-time streaming automatic speech recognition (ASR). While Whisper excels in batch ASR, its ability to handle streaming scenarios with low latency is less obvious. The video introduces the open-source whisper-streaming project, which adapts Whisper for streaming applications by processing consecutive audio buffers of increasing size and confirming output tokens using the LocalAgreement algorithm. The video also discusses the limitations of this approach compared to models specifically designed for streaming ASR.
*Summary*
*Introduction (**0:00**)*
* The video investigates whether OpenAI's Whisper model can be used for real-time streaming ASR.
* Whisper is a powerful ASR model trained on a massive multilingual dataset, known for its robustness to noise and accents.
*Batch vs Streaming ASR (**0:35**)*
* Batch ASR processes entire audio recordings at once, while streaming ASR produces output as the speaker talks, with minimal delay.
* Streaming ASR is crucial for applications like live captioning, where real-time transcription is essential.
*Why is Streaming Whisper Difficult? (**1:55**)*
* Whisper is designed for processing fixed-length audio segments (30 seconds), making it challenging to handle longer recordings in a streaming fashion.
* Simply splitting audio into chunks can lead to inaccurate word recognition and high latency.
*Whisper-streaming Demo (**2:58**)*
* The video showcases the open-source whisper-streaming project, which enables real-time transcription using Whisper.
* The demo demonstrates the project's ability to transcribe speech with minimal delay and provide timestamps.
*Processing Consecutive Audio Buffers (**3:38**)*
* Whisper-streaming feeds progressively larger audio buffers into Whisper until an end-of-sentence marker is detected.
* This ensures that Whisper processes complete sentences, leading to better accuracy.
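The buffer-growing loop can be sketched in a few lines of Python. This is a simplification, not the actual whisper-streaming code: `transcribe` stands in for a Whisper call, and the end-of-sentence check is reduced to looking for final punctuation.

```python
def stream_transcribe(audio_chunks, transcribe):
    """Grow an audio buffer one chunk at a time, re-running Whisper on the
    whole buffer, and reset the buffer once a sentence is complete."""
    buffer = []
    for chunk in audio_chunks:        # e.g. 1-second blocks of audio samples
        buffer.extend(chunk)
        text = transcribe(buffer)     # placeholder for a Whisper call
        yield text
        if text.rstrip().endswith((".", "!", "?")):  # end-of-sentence marker
            buffer = []               # start fresh on the next sentence
```

Because the buffer resets at sentence boundaries, Whisper mostly sees complete sentences rather than arbitrary slices of audio.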
*Confirming Tokens with LocalAgreement (**4:36**)*
* The LocalAgreement algorithm confirms output tokens only after they are generated in two consecutive audio buffers.
* This helps distinguish between confirmed and unconfirmed transcription results, allowing for real-time feedback with potential corrections.
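The core of LocalAgreement-2 can be sketched as a longest-common-prefix computation over consecutive hypotheses (a simplification of the real implementation, which works on Whisper's tokens and timestamps):

```python
def confirm_tokens(prev_hyp, curr_hyp):
    """Return the longest common prefix of two consecutive Whisper
    hypotheses; these tokens are treated as confirmed and final."""
    confirmed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        confirmed.append(a)
    return confirmed

# The hypothesis from the larger buffer agrees on the first three tokens,
# so "the cat sat" is confirmed; anything after that may still change.
print(confirm_tokens("the cat sat".split(),
                     "the cat sat on the".split()))
```

Everything past the confirmed prefix is shown as tentative output that may be revised when the next, larger buffer is transcribed.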
*Prompting Previous Context (**6:05**)*
* Whisper-streaming uses the previous sentence as prompt tokens for the model, providing additional context and improving accuracy.
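Both openai-whisper and faster-whisper expose an `initial_prompt` argument on `transcribe` for exactly this purpose. A hypothetical helper for picking the prompt might look like the following (whisper-streaming handles this internally; `build_prompt` is illustrative, not part of its API):

```python
import re

def build_prompt(confirmed_text):
    """Pick the last completed sentence of the confirmed transcript to
    pass as Whisper's `initial_prompt` for the next audio buffer."""
    sentences = re.split(r"(?<=[.!?])\s+", confirmed_text.strip())
    return sentences[-1]

# Usage with a Whisper backend would look roughly like:
#   model.transcribe(audio_buffer, initial_prompt=build_prompt(confirmed))
print(build_prompt("The meeting starts at noon. Please be on time."))
```

Conditioning on the previous sentence gives the decoder context across buffer boundaries, which helps with consistent spelling, casing, and terminology.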
*Limitations vs Other Streaming ASR Models (**7:01**)*
* Whisper's design isn't optimized for streaming, leading to inefficiencies like repeatedly processing the beginning of long sentences.
* Dedicated streaming ASR models utilize architectures that allow for efficient processing of continuous audio streams with fixed context windows.
* Adapting Whisper for streaming requires modifying its architecture and retraining, which is currently limited by data accessibility.
I used Gemini 1.5 Pro for the summary
Thank you - yet another beautifully explained topic 🙂
Nice video and great voice writer! I have tried implementing it with the Transformers.js package and its Whisper model, but no luck yet since the processing is heavy.
There are a number of things you can do to speed up the Whisper model. Some backends are more optimized depending on your hardware; faster-whisper is a popular one. You can also try smaller models: "base" is a good tradeoff, sacrificing some quality for better performance.
Thank you, I really enjoyed your explanation. Is it possible to explain and introduce models designed specifically for streaming?
Not sure if this is what you're asking about, but I have a video about Whisper fine-tuning that explains the architecture of the Whisper model as well!
What happens if 2 consecutive predictions continue to disagree on a specific word? Do you pick one of the options at random? Or does the sentence starting at that word never become confirmed?
Generally, the predictions change up to a certain point, after which they no longer change based on additional inputs, and then they are confirmed. If this never occurs, then I guess it will need to handle this edge case in some way, such as picking randomly, but this should not happen often.
Thanks for the video!! This is a great technique, and I am thinking of using it for our application. I have one question: when the words are confirmed, why don't you feed the partial audio (excluding the confirmed part) along with the confirmed text in the initial prompt? Would that be a lot faster when a sentence is really long, or faster on smaller chips like SBCs?
The main issue is that Whisper is trained on audio that starts at the beginning of a sentence, so feeding it audio that starts mid-sentence would be out of distribution. Your suggestion would be more efficient but may lead to a degradation in transcript quality.
I'm trying to see how to take this module and integrate it within a real-time audio pipeline, just like your project. I'm kind of lost right now and would love a bit of feedback on your process.
Sure, if you DM me on LinkedIn, I'd be happy to chat about it.
@@EfficientNLP Thanks, will do.
Thank you. Are you using faster-whisper as your backend? I'm trying to achieve something similar but with whisper.cpp.
This method should work for any backend, but only faster-whisper is supported in the current implementation of whisper-streaming. Some modification will be required to make it work for whisper.cpp.
@@EfficientNLP Interesting; thank you very much.
Did you end up figuring it out with whisper.cpp?
I would like something like your Voice Writer, but instead of outputting text it should output speech. It should remove my grammar mistakes and accent but copy my intonation. Do you think this is possible at this time? I can't find good text-to-speech or voice cloning models.
This sounds quite different from what I'm building with Voice Writer. I've not looked at voice cloning models before, so I'm not sure of their feasibility, but it's a good and potentially useful project idea.
What latencies can be expected with Whisper Streaming? I'd like to know what to expect before going down that route.
Latency depends on various factors such as your hardware, model size, and options like the minimum chunk size; the paper reports latencies between 3 and 6 seconds depending on the configuration.
@@EfficientNLP Thanks for your reply. I require an order of magnitude less for my application, which I currently get with Azure. Whisper ASR is not ready for me to jump in quite yet. Cheers