OpenAI Whisper Speaker Diarization - Transcription with Speaker Names

  • Published Aug 25, 2024
  • High-level overview of what's happening with OpenAI Whisper Speaker Diarization:
    Using OpenAI's Whisper model to separate the audio into segments and generate transcripts.
    Then generating speaker embeddings for each segment.
    Then using agglomerative clustering on the embeddings to identify the speaker for each segment.
    Speaker identification or speaker labelling is very important for podcast transcription or conversation audio transcription. This code helps you do that.
    Dwarkesh Patel's Tweet Announcement - / 1579672641887408129
    Colab - colab.research...
    huggingface.co...
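
    A condensed sketch of those three steps, loosely following the Colab (the model names and the 192-dim embedding size are assumptions based on common setups, not a verbatim copy of the notebook):

    import numpy as np
    import whisper
    from sklearn.cluster import AgglomerativeClustering
    from pyannote.audio import Audio
    from pyannote.core import Segment
    from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

    num_speakers = 2
    model = whisper.load_model("base")
    segments = model.transcribe("audio.wav")["segments"]          # step 1: transcribe into segments

    embedder = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb")
    audio = Audio()
    embeddings = np.zeros((len(segments), 192))                   # ECAPA embeddings are 192-dim
    for i, seg in enumerate(segments):                            # step 2: embed each segment
        # note: the embedding model expects mono audio (see the stereo fix in the comments)
        waveform, _ = audio.crop("audio.wav", Segment(seg["start"], seg["end"]))
        embeddings[i] = embedder(waveform[None])

    labels = AgglomerativeClustering(n_clusters=num_speakers).fit_predict(embeddings)
    for seg, label in zip(segments, labels):                      # step 3: label each segment
        print(f"SPEAKER {label + 1}: {seg['text']}")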

Comments • 86

  • @DwarkeshPatel
    @DwarkeshPatel 1 year ago +32

    Thanks so much for making this video and highlighting my code! Really cool to see it's useful to other people!

    • @integralogic
      @integralogic 1 year ago

      Thank you!

    • @kushagragupta149
      @kushagragupta149 3 months ago

      Why is it down?

    • @bumblequiz
      @bumblequiz several months ago

      Mr. Patel, I can't find any online sites that do speaker diarization accurately. I don't have a GPU on my system. I am ready to pay if the speaker identification is absolutely accurate. Do you know of such services?

  • @kmanjunath5609
    @kmanjunath5609 10 months ago +1

    I almost did it manually:
    1. Created an RTTM file using pyannote.
    2. Sliced the full-length audio using the RTTM in/out times for each segment.
    3. Ran each slice through Whisper for transcription.
    It was about 5 times slower.
    I was thinking hard about how to do it the other way around: first generate the full transcript and then separate the segments.
    Somehow I saw your video and was impressed, and the AgglomerativeClustering at the end blew my mind.
    Thanks for sharing knowledge.
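
    For reference, a rough sketch of that diarize-first approach (the pyannote model name and the Hugging Face token are placeholders; assumes openai-whisper, pyannote.audio, and pydub are installed):

    import whisper
    from pydub import AudioSegment
    from pyannote.audio import Pipeline

    diarization = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN"
    )("audio.wav")                                     # 1. speaker turns (what the RTTM holds)
    model = whisper.load_model("base")
    audio = AudioSegment.from_wav("audio.wav")

    for turn, _, speaker in diarization.itertracks(yield_label=True):
        clip = audio[int(turn.start * 1000):int(turn.end * 1000)]   # 2. slice (pydub uses ms)
        clip.export("turn.wav", format="wav")
        text = model.transcribe("turn.wav")["text"]                 # 3. transcribe each slice
        print(f"{speaker}: {text}")

    Calling Whisper once per speaker turn is exactly what makes this route several times slower than transcribing once and clustering the segments afterwards.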

  • @estrangeiroemtodaparte
    @estrangeiroemtodaparte 1 year ago +6

    As always, delivering the goods! Thanks 1littlecoder!

  • @stebe8271
    @stebe8271 1 year ago +3

    I was working on a model to do this exact thing as we speak. Thanks for the resource, this will save me lots of time.

  • @jordandunn4731
    @jordandunn4731 1 year ago +11

    On the last cell I get a UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-6: ordinal not in range(128). Any ideas what's going wrong and how I can fix it?

    • @latabletagrafica
      @latabletagrafica 1 year ago +1

      Same problem.

    • @humbertucho2724
      @humbertucho2724 1 year ago +1

      Same problem. In my case I am using Spanish; it looks like the problem is with accented characters, e.g. "ó".

    • @humbertucho2724
      @humbertucho2724 1 year ago +6

      I solved the problem by replacing the line
      f = open("transcript.txt", "w")
      with the following:
      f = open("transcript.txt", "w", encoding="utf-8")

    • @latabletagrafica
      @latabletagrafica 1 year ago

      @@humbertucho2724 that worked for me, thanks.

    • @thijsdezeeuw8607
      @thijsdezeeuw8607 1 year ago

      @@humbertucho2724 Thanks mate!

  • @rrrila8851
    @rrrila8851 1 year ago +3

    It is interesting, although I think it would be way better to autodetect how many speakers there are and then start the transcription.
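
    One common heuristic for that: cluster the segment embeddings for several candidate speaker counts and keep the count with the best silhouette score. A minimal sketch, assuming embeddings is the (n_segments, 192) array built in the notebook:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import silhouette_score

    def guess_num_speakers(embeddings, max_speakers=8):
        best_k, best_score = 2, -1.0
        for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
            labels = AgglomerativeClustering(n_clusters=k).fit_predict(embeddings)
            score = silhouette_score(embeddings, labels)   # higher = cleaner separation
            if score > best_score:
                best_k, best_score = k, score
        return best_k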

  • @ChopLabalagun
    @ChopLabalagun 11 months ago +3

    The code is DEAD, it no longer works :(

  • @JohnHumphrey
    @JohnHumphrey 1 year ago +9

    I'm not much of a dev myself, but it seems like it might be simple to add a total time spoken for each speaker (see the sketch after this thread). I would love to be able to analyze podcasts to understand how much time the host is speaking relative to the guest. In fact, it would be very cool if someone built an app that would remove one of the speakers from a conversation and create a separate audio file consisting of only what the remaining speaker(s) said.

    • @Michallote
      @Michallote 11 months ago +4

      It sounds like an interesting project. Becoming a dev is more about trying even though you don't know what you are doing, googling stuff, and going step by step through what each line broadly does. Progress is gradual! I encourage you to try it yourself 😊
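
    The total-time-spoken part is a few lines once each segment has a speaker label. A minimal sketch, assuming each segment dict carries "start", "end", and an assigned "speaker" key as in the notebook's output:

    from collections import defaultdict

    def talk_time(segments):
        totals = defaultdict(float)
        for seg in segments:
            totals[seg["speaker"]] += seg["end"] - seg["start"]   # seconds spoken
        return dict(totals)

    # e.g. {'SPEAKER 1': 1520.4, 'SPEAKER 2': 830.7} -> host vs. guest share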

  • @IWLTFT
    @IWLTFT 1 year ago +4

    Hi everyone, thanks 1littlecoder and Dwarkesh, this is fantastic. I managed to get it working, it is helping me immensely, and I am learning a lot. I am struggling with Google Colab as I always end up with 0 compute units, which causes all sorts of issues, and I am unable to complete the transcriptions (I am processing quite large files, several 1-hour coaching sessions). Does AWS have a better option? And the next question would be: how easy would it be to port this to an AWS Linux environment, if that is an option?

  • @dreamypujara3384
    @dreamypujara3384 1 year ago +7

    [ERROR]
    embeddings = np.zeros(shape=(len(segments), 192))
    for i, segment in enumerate(segments):
        embeddings[i] = segment_embedding(segment)
    embeddings = np.nan_to_num(embeddings)
    I am getting an assertion error here: embeddings[i] = segment_embedding(segment).
    I am using a Hindi audio clip and the base model, on the Colab platform with a GPU T4 runtime.

    • @JoseMorenoofi
      @JoseMorenoofi 1 year ago +2

      Change the audio to mono (see the sketch at the end of this thread).

    • @yayaninick
      @yayaninick 1 year ago +1

      @@JoseMorenoofi Thanks, it worked

    • @warrior_1309
      @warrior_1309 1 year ago

      @@JoseMorenoofi Thanks

    • @gauthierbayle1508
      @gauthierbayle1508 1 year ago

      @@JoseMorenoofi Hi! I get that assertion error @dreamypujara3384 mentioned above, and I'd like to know where you made that change to mono... Thanks : )

    • @traveltastequest
      @traveltastequest 11 months ago

      @@JoseMorenoofi can you please explain how to do that?
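
      For those asking how: one way is to downmix the file to mono before it reaches the embedding model. A minimal sketch using torchaudio (file names are placeholders):

      import torchaudio

      waveform, sample_rate = torchaudio.load("audio.wav")   # shape: (channels, samples)
      if waveform.shape[0] > 1:                              # stereo or multi-channel
          waveform = waveform.mean(dim=0, keepdim=True)      # average channels down to one
      torchaudio.save("audio_mono.wav", waveform, sample_rate)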

  • @DestroyaDestroya
    @DestroyaDestroya 1 year ago +5

    Has anyone tried this recently?
    The code no longer works. It looks to me like some dependencies have been upgraded.

    • @gerhardburau
      @gerhardburau 1 year ago +2

      True, the code does not work. Please fix it, I need it.

  • @geoffphillips5293
    @geoffphillips5293 10 months ago +1

    None of the ones I've played with cope particularly well with more complicated situations, for instance where one person interrupts another, or if there are three or more people. They can all cope with two very clearly different speakers, but I figure I could do that with old-school techniques like simply averaging the frequency. It's weird because the speech-to-text itself is enormously clever; it's just surprising that the AI can't distinguish voices well.

  • @user-ud7cq3lq4l
    @user-ud7cq3lq4l 11 months ago

    I started to love it when you used Bruce Wayne's clip.

  • @cho7official55
    @cho7official55 2 months ago

    This video is one year old. Is there now an open-source way to do diarization this easily, as good as Krisp's product, and running locally?

  • @Labbsatr1
    @Labbsatr1 1 year ago +2

    It is not working anymore

  • @AyiteB
    @AyiteB 8 months ago +1

    This kept crashing at the embeddings section for me. And the Hugging Face link isn't valid anymore.

    • @sarfxa9974
      @sarfxa9974 7 months ago

      You just have to add
      if waveform.shape[0] > 1: waveform = waveform.mean(axis=0, keepdims=True)
      right before the return of segment_embedding().

  • @IgorGeraskin
    @IgorGeraskin 10 months ago

    Thank you for sharing your knowledge!
    Everything works fine, but an error started appearing:
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
    llmx 0.0.15a0 requires cohere, which is not installed.
    llmx 0.0.15a0 requires openai, which is not installed.
    How can I fix this?

  • @frosti7
    @frosti7 1 year ago +1

    It doesn't work well (it detects the language as Malay, and also does not offer custom names for speakers). Does anyone have a better working solution?
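
    If Whisper misdetects the language, the stock openai-whisper option is to pin it explicitly instead of relying on auto-detection, e.g.:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("audio.wav", language="en")   # skip auto language detection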

  • @ebramsameh5349
    @ebramsameh5349 1 year ago +3

    Why does it give me an error on this line?
    embeddings[i] = segment_embedding(segment)

    • @NielsPrzybilla
      @NielsPrzybilla 1 year ago +1

      Same error here:
      ---------------------------------------------------------------------------
      AssertionError                Traceback (most recent call last)
      in ()
            1 embeddings = np.zeros(shape=(len(segments), 192))
            2 for i, segment in enumerate(segments):
      ----> 3     embeddings[i] = segment_embedding(segment)
            4
            5 embeddings = np.nan_to_num(embeddings)
      1 frames
      /usr/local/lib/python3.9/dist-packages/pyannote/audio/pipelines/speaker_verification.py in __call__(self, waveforms, masks)
          316
          317     batch_size, num_channels, num_samples = waveforms.shape
      --> 318     assert num_channels == 1
          319
          320     waveforms = waveforms.squeeze(dim=1)
      AssertionError:
      And yes :-) I am a totally new coder.

    • @geckoofglory1085
      @geckoofglory1085 1 year ago +1

      The same thing happened to me.

    • @Hiroyuki255
      @Hiroyuki255 1 year ago +1

      Same. I noticed it gives this error with larger audio files; smaller ones worked fine. Not sure if that's the reason though.

    • @Hiroyuki255
      @Hiroyuki255 1 year ago

      Guess I found the answer! If you convert the initial audio file from stereo to mono, it works fine.

    • @guillemarmenteras8105
      @guillemarmenteras8105 1 year ago

      @@Hiroyuki255 how do you do it?

  • @serta5727
    @serta5727 1 year ago +2

    Whisper is very useful ❤

  • @kmanjunath5609
    @kmanjunath5609 10 months ago

    What if a new speaker enters partway through? Does the number of speakers become old + new, or stay old?

  • @MixwellSidechains
    @MixwellSidechains 1 year ago

    I'm running it locally in a Jupyter notebook, but I can't seem to find an offline model for PretrainedSpeakerEmbedding.

  • @divyanshusingh7893
    @divyanshusingh7893 6 months ago

    Can we automatically detect the number of speakers in an audio file?

  • @SustainaBIT
    @SustainaBIT 9 months ago

    Thank you so much. By any chance, do you think there could be a method to do all of that in real time, during a call, let's say?
    Any ideas on where I could start would be very helpful ❤❤

  • @mramilvideo
    @mramilvideo 4 months ago

    There are no speaker names. How can we identify the speakers?

  • @jesusjim
    @jesusjim 8 months ago +1

    Working with large files of 1-hour recordings seems to be a problem. I was hoping I could run this locally without Google Colab.

    • @micheleromanin7168
      @micheleromanin7168 several months ago

      You can run this locally, but you need a decent GPU to do it. Just make a Python script out of the Colab.

  • @bumblequiz
    @bumblequiz several months ago

    Link not working anymore

  • @user-zw3xh2bo2z
    @user-zw3xh2bo2z 1 year ago

    Hello sir, I have a small doubt: if we have more than 2 speakers, how should the num_speakers parameter vary?

  • @rehou45
    @rehou45 1 year ago +1

    I tried to execute the code on Google Colab, but it has been buffering for more than an hour and still has not finished...

    • @raghvendra87
      @raghvendra87 1 year ago

      It does not work anymore.

  • @jhoanlagunapena429
    @jhoanlagunapena429 6 months ago

    Hi! Thanks a lot for this video! I really liked it and I appreciate what you've shared here. I have a question: what if you are not quite sure about the number of speakers? Sometimes it's not so easy to distinguish one voice from another when there are a lot of people talking (like a focus group). What can I do in that case?

    • @nealbagai5388
      @nealbagai5388 5 months ago

      Having this same issue. Wondering if it's a more widespread issue or specific to this configuration of pyannote. Thinking about trying NeMo if this poses a problem.

  • @klarinooo
    @klarinooo 1 year ago

    I'm trying to upload a WAV file of 5 MB and I'm getting "RangeError: Maximum call stack size exceeded". Does this only work for tiny file sizes?

  • @gaurav12376
    @gaurav12376 1 year ago

    The Colab notebook is not accessible. Can you share the new link?

  • @mjaym30
    @mjaym30 1 year ago

    Amazing videos!! Keep going! I have a request though: could you please publish a video on customizing GPT-J-6B on Colab using the 8-bit version?

  • @ciaran8491
    @ciaran8491 1 year ago

    How can I use this on my local installation?

  • @DRHASNAINS
    @DRHASNAINS 10 months ago

    How can I run the same thing in PyCharm? Can anyone guide me?

  • @vinsmokearifka
    @vinsmokearifka 1 year ago

    Thank you, sir. Is it only for English?

  • @cluttercleaners
    @cluttercleaners 11 months ago

    Is there an updated Hugging Face link?

  • @user-kn8zd2qj9x
    @user-kn8zd2qj9x 10 months ago

    About the embedding issue: here is a replacement for the cell that converts to WAV using ffmpeg; I have included a conversion to mono. This will fix the issue with no problems. Simply replace the cell that converted to .wav with the following:
    # Convert to WAV format if not already in WAV
    if path[-3:] != 'wav':
        subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
        path = 'audio.wav'
    # Convert to mono
    subprocess.call(['ffmpeg', '-i', path, '-ac', '1', 'audio_mono.wav', '-y'])
    path = 'audio_mono.wav'

  • @serychristianrenaud
    @serychristianrenaud 1 year ago +1

    Thanks

  • @datasciencewithanirudh5405
    @datasciencewithanirudh5405 1 year ago

    Did anyone else encounter this error?

  • @olegchernov1329
    @olegchernov1329 1 year ago +1

    Can this be done locally?

    • @mgomez00
      @mgomez00 11 months ago

      If you have enough GPU power locally, YES.
      I also assume you have Python and all the required libraries installed.

  • @user-jf5ru5ow8u
    @user-jf5ru5ow8u 7 months ago

    Brother, you should have also added the MP3 site links.

  • @yangwang9688
    @yangwang9688 1 year ago

    Does it still work if there is overlapping speech?

    • @1littlecoder
      @1littlecoder 1 year ago

      I'm not sure if it'd work fine then. I have not checked it.

  • @Sanguen666
    @Sanguen666 11 months ago

    smart pajeeet!

  • @JorgeLopez-gw9xc
    @JorgeLopez-gw9xc 1 year ago

    For this code you use the Whisper model from OpenAI? Why are you not using an API key?

    • @1littlecoder
      @1littlecoder 1 year ago

      I'm using it from the model (not the API)

    • @JorgeLopez-gw9xc
      @JorgeLopez-gw9xc 1 year ago

      @@1littlecoder So the Whisper model is not the one hosted by OpenAI. In my case I am looking for a method based on OpenAI; the reason is to ensure the privacy of the information, since I want to do this with company data. Do you know if that is possible?

    • @user-nl2ic1kb7v
      @user-nl2ic1kb7v 4 months ago

      @@JorgeLopez-gw9xc Hi Jorge, I have the same question as you. I would really appreciate it if you have figured out a solution and would be happy to share some ideas with me.

  • @abinashnayak8132
    @abinashnayak8132 1 year ago +1

    How can I change Speaker 1 and Speaker 2 to the actual people's names?

    • @mllife7921
      @mllife7921 1 year ago +2

      You need to store each person's voice embedding, then compare (with some similarity measure) and map it to the ones generated for the samples (see the sketch after this thread).

    • @AndreAngelantoni
      @AndreAngelantoni 1 year ago +1

      Export to text, then do a search and replace.
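
      A minimal sketch of the embedding-comparison idea above: build one reference embedding per known speaker from a short solo clip (reference_embeddings below is a hypothetical dict of name -> 192-dim vector made with the notebook's embedding model), then assign each segment to the most similar reference:

      import numpy as np

      def cosine_similarity(a, b):
          return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

      def name_speakers(segment_embeddings, reference_embeddings):
          """Map each segment embedding to the closest known speaker name."""
          names = []
          for emb in segment_embeddings:
              best_name = max(reference_embeddings,
                              key=lambda n: cosine_similarity(emb, reference_embeddings[n]))
              names.append(best_name)
          return names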

  • @OttmarFlorez
    @OttmarFlorez 1 year ago

    Terrible.