How NOT to Sample Audio! - Computerphile

Computerphile

110,645 views

Published on 29 Jan 2025

Comments • 488

@BillySugger1965 4 years ago ⁺³⁶¹
I’m amazed there was enough information in that image to recreate an intelligible reconstruction!
@Meta11axis 4 years ago ⁺¹⁰
Well, it was a one second audio clip. Discussing the waveforms of songs from artists (like they proceeded to do) which would be some minutes compressed into the horizontal pixel size of the screen is another thing entirely, and quite ridiculous IMO.
@AudryConsol 4 years ago
@Meta11axis not if at the time of the picture the waveform was zoomed in super far; you wouldn't get the whole song but you might get some snippets of it
@ngkktht774 4 years ago ⁺¹⁶
he needed a 9x stretch to get the right speed at a 48 kHz sampling rate, so the data he had were at roughly a 5.3 kHz sampling rate => bandwidth up to about 2.7 kHz, which is just about enough for intelligible voice...
@Meta11axis 4 years ago ⁺⁴
@ngkktht774 I love the fact that we deduced the same sampling rate using independent information from this video (see my comment above). Great stuff!
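
The sampling-rate estimate in this thread can be checked with a few lines of arithmetic. This is only a sketch using the two figures quoted in the comments (a 48 kHz output rate and a 9x stretch); the variable names are made up for illustration.

```python
# Figures quoted in the comments: output rate 48 kHz, stretch factor 9.
OUTPUT_RATE_HZ = 48_000
STRETCH_FACTOR = 9

# Every recovered sample was stretched across 9 output samples, so the
# image effectively held one sample per 9 output samples.
effective_rate_hz = OUTPUT_RATE_HZ / STRETCH_FACTOR

# Nyquist: a sampled signal can represent frequencies up to half its rate.
nyquist_hz = effective_rate_hz / 2

print(f"effective sample rate: {effective_rate_hz:.0f} Hz")
print(f"usable bandwidth:      {nyquist_hz:.0f} Hz")
```

That gives roughly 5.3 kHz and 2.7 kHz, which is below the roughly 4 kHz bandwidth usually associated with telephone speech but still covers the fundamental of a voice.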
@GJ203 4 years ago ⁺¹
When I'd thought casually about this before I didn't think it was possible.
@joachim4660 4 years ago ⁺⁴⁹⁵
Honestly.. I didn't expect such a quality! :D
@EmanuilGlavchev 4 years ago ⁺²
Me, too. I know there should be less loss on the lower frequencies... so some of the voicing does come through, but still... I expected a garble!
@BandanaDrummer95 4 years ago ⁺¹
Same, though, retrospectively, I should have expected something like this because the most crucial information for intelligibility is low frequency (more so than high frequency which fills in additional information). Honestly, it sounds about where some of the worse hearing aids I've seen get
@CmdrKeene 4 years ago ⁺¹
Me neither. I would have thought it would have lost too much data; I had even started wondering about higher quality input images, like 4K or far higher. But clearly, it did not take too much to produce intelligible output.
@bandname 4 years ago ⁺¹⁵⁴
As an audio engineer I pondered this for some time. You've answered my question.
@warmachineuk 4 years ago ⁺¹⁶⁰
This is like scraping data from the output of some legacy program and just as horrific a process. Mind you, a good result considering what was lost.
@brod515 4 years ago
what do you mean. I'm not sure what scraping data from a legacy program means.
@lysdexic 4 years ago ⁺²
reminds me of the “Visual microphone” project around 2014 :)
@LiEnby 4 years ago
@brod515 open the program, take a screenshot of it then move the mouse to the buttons and click them 😂
@Hamachingo 4 years ago ⁺⁴²
This is like those old cinema film reels where there's literally two visible waveforms running next to the pictures and it's being reproduced as audio. It works surprisingly well.
@cavalrycome 4 years ago ⁺⁸
I was going to leave a similar comment. Extracting audio from a visual representation of the waveform on the soundtrack of a film was standard practice in cinema projection so they're kind of re-inventing the wheel in this video.
@themroc8231 4 years ago ⁺⁶
@cavalrycome Optical sound was used in movie theaters until the turn of the century. If you remember the "Dolby" logo fancy theaters used to show before the movie, it was to tell the audience they used electronic processing on top of optical sound to minimize the noise resulting from the grain of the film in the quiet parts, when it was more noticeable.
Although optical sound was not a sinusoidal representation, if you look at the sound track (as in the sound-carrying track) of a film you'll see that it is a filled shape. I think the width of the shape still translates to loudness, but I am not sure how pitch and tone were expressed.
@JavierAlbinarrate 4 years ago ⁺⁷
the big difference between this and audio on film is bandwidth, for a given second you had many feet of information. Thus you have the intermediate values that are missing in this experiment.
@fllthdcrb 4 years ago
@themroc8231 To my knowledge, there were two types of optical sound track: variable width and variable density. In the former, I believe the _envelope_ would be the waveform being reproduced.
@themroc8231 4 years ago ⁺²
@fllthdcrb I remember seeing an illustration of a variable density film track in a book, but I seem to remember it was never widely adopted in the movie industry, or maybe only for a short period of time, though maybe it had more use in other applications.
@Yaxqb 4 years ago ⁺¹⁵⁰
Ok guys, let's see who this Computerphile video really is
***rips mask off Audiophile***
@Anvilshock 4 years ago ⁺²²
AND HE'D HAVE GOTTEN AWAY WITH IT IF IT WASN'T FOR YOU MEDDLING KIDS!!
@jordankokocinski506 4 years ago ⁺⁵
This is hilarious. And cool! When I was younger, I remember discovering a feature in Audacity where you could import an image file to audio. I took a screenshot of a sample of audio and imported it, but it was rubbish. This program you've written, however rough, is surely more nuanced. The results were quite surprising to me! I did not expect it to resemble the original sound as much as it did.
@supahfly_uk 4 years ago
YouTube has stopped recommending Computerphile vids; they are literally the highlight of my life lol.
@MrJC1 4 years ago ⁺¹³⁵
what are you talking about how NOT to??? this EXACTLY how I want to start sampling audio. I always am looking for weird sound sources to feed through walls of effects and filters and see what i get out of it. This sounds like something I could get lost in. Oh god... WHAT HAVE YOU DONE COMPUTERPHILE?
@Yaxqb 4 years ago ⁺⁸
I imagine this amazing app
Encode some song to an image
Edit in your favorite photo editor
Decode back to audio
Profit
We could have a whole new generation of music production tech that does weird stuff like adding glow effects, displacements and so on. Sounds like an interesting small mini-research area
@shmunkyman33 4 years ago ⁺³
@Yaxqb Like the audio guy said, this effect already exists basically, you'd use a "bitcrusher" plugin which does this process, just without the intermediate step of taking a picture.
@charlieangkor8649 3 years ago ⁺¹
I have working prototypes of:
1) Store audio in digital by printing on B/W laser printer and scanning
2) Store audio in analog by printing on B/W laser printer and scanning
3) Store audio in analog (real analog, with infinitely smooth range of values and noise!) in digital files
4) Store audio in digital in the dithering of an artwork image
I'm open to funding.
@adamsbja 4 years ago ⁺⁴
Back in the late 90s I heard chatter about an espionage technique where they could point a laser at a window and get the vibrations of people talking inside. My dad's workplace had to have special panes put in to dampen that effect (I grew up in an interesting town). This is a great demonstration of that idea, via a different process.
@3dlabs99 4 years ago ⁺⁴⁰
Next step: Train a deep learning network using raw sound files and images of the wave to do it.
@MichaelGrundler 4 years ago ⁺⁴
I was about to comment the same: throw machine learning at it.
@Slettador 4 years ago ⁺⁷
I doubt this would be very successful because this is in no way a new problem. Sampling audio at a lower frequency (which happens here because of the limited number of pixels in the screenshot image width) and limited bits per sample (due to the limited height of the screenshot image) and trying to improve the fidelity is a signal processing problem and generating more data from what limited data you have without context would be quite difficult for a ML algorithm. You could probably substantially improve it for specific purposes, like if you know you recorded someone's voice you could train the network by using recordings of that same person and that should yield decent results
@3dlabs99 4 years ago
@Slettador Yeah it would surely work best for speech, especially if it's the same person. It will probably totally break for music, for example. Speech is fairly forgiving.
@ijknm2531 4 years ago
and put that DNN on a physical (neuromorphic) chip
@ChrisBrennanSF 4 years ago ⁺²
steganographic Rube Goldberg MIDI sequencer
@busTedOaS 4 years ago ⁺⁶
Oh my, C# code in the description... I'm falling in love all over again.
@Oli420X 4 years ago ⁺¹
I'm honestly not too keen on the syntax, most of it's okay, but the amount of brackets in any nesting just gets messy
@domminney 4 years ago
C# is not what I do day to day 😉
@Kinglink 4 years ago
As I started. "This won't work."
As I finished. "Damn! Let's go see the code."
I always assumed the visual representation was just an approximation of the wave, or something else. But wow it actually worked! One of the best videos, and that's saying something.
@neilloughran4437 4 years ago ⁺⁶⁶
I guess the extra data you insert in between samples is "interpolation"... I recall Roland did some quite advanced interpolation on their 30 kHz/12-bit samplers to actually modify intermediate point-to-point values in an intelligent way, i.e. a sine wave would retain its inherent shape and wouldn't be a bunch of straight lines.
@domminney 4 years ago ⁺⁴
I was looking for the word dither but it escaped me!
@neilloughran4437 4 years ago ⁺⁴
@domminney I wonder what the sound quality would be like if the sampled graphic could be interpolated with a smooth sine-like wave between the points... probably too much math for my brain to know :D.
Hopefully cleverer people than me commenting :D
@neilloughran4437 4 years ago ⁺⁴
From Roland W30 manual (circa 1989):
"... there was a need for a reliable way of "filling in" the spaces between points sampled. Roland has succeeded in developing a way of carrying out such high-speed calculations, and provide intelligent interpolation for the imaginary points lying between sample points. The sampler looks well beyond the points in question for information, and makes its calculations using the leading-edge technique known as differential interpolation. As a result, noise is much less likely to even appear, assuring high quality sound."
@domminney 4 years ago ⁺¹⁰
@neilloughran4437 for the sake of speed I made it linear. I've not actually watched the final edit yet as I'm on a Zoom call, but we did chat about filling in the data with better curves.
@neilloughran4437 4 years ago
@domminney cool!
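
For readers wondering what "made it linear" means in practice, here is a minimal sketch of stretching a sample list by drawing straight lines between neighbouring points. It is not the C# code from the video description; `stretch_linear` is a hypothetical helper written for this illustration.

```python
def stretch_linear(samples, factor):
    """Place factor-1 linearly interpolated values between each pair of samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            # k = 0 reproduces the original sample a; k > 0 walks towards b.
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])  # keep the final original sample
    return out

# Three coarse samples stretched 9x, the factor mentioned elsewhere in the comments.
stretched = stretch_linear([0.0, 9.0, 0.0], 9)
print(len(stretched), stretched[:4])
```

The result is a triangle-shaped ramp up and down, which is where the slightly buzzy character of linearly interpolated audio comes from.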
@thekaxmax 4 years ago ⁺⁸
Note on quality: human voice is a /lot/ easier to understand, and you don't need tone quality, which you do for music
@jordanlin4437 4 years ago ⁺⁵⁷
I'm actually a bit curious how different the audio reconstruction would sound if instead of what looked like a linear interpolation he instead used the discrete Fourier transform or something. It probably wouldn't have the 8-bit sound but will still sound muffled, but that would be interesting to hear.
@jalvrus 4 years ago ⁺²
I think the main thing you'd get by using something like sinusoidal interpolation to reconstruct it would be to reduce the noise/static. It wouldn't help reconstruct the high frequencies that have been compressed out during the graphing process.
@Pystro 4 years ago ⁺²
You can get the low-frequency component from the average of the minimum and the maximum, while the difference of the minimum and maximum gives you the volume of the high frequency sounds - but sadly no information about their pitch.
To reconstruct that you'll probably want to do something like this: take white noise (or some other suitable noise), make a hard cutoff for frequencies that are within the sampling rate of your image, amplitude modulate that noise with the spread.
The absolute easiest way to include high frequency sounds would be to just pick random values between the minimum and the maximum instead of the linear interpolation.
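
The "easiest way" described above can be sketched as follows, assuming each pixel column has already been reduced to a (min, max) pair. The column values and the helper name `column_to_samples` are invented for illustration.

```python
import random

def column_to_samples(col_min, col_max, count, rng):
    """Return `count` samples for one pixel column.

    The midpoint stands in for the low-frequency component; the spread
    (max - min) sets the amplitude of uniform noise around it.
    """
    mid = (col_min + col_max) / 2
    spread = col_max - col_min
    return [mid + (rng.random() - 0.5) * spread for _ in range(count)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
columns = [(-0.2, 0.4), (0.1, 0.9), (-0.5, -0.1)]  # hypothetical (min, max) pairs
audio = [s for lo, hi in columns for s in column_to_samples(lo, hi, 9, rng)]
print(len(audio))
```

Uniform noise has no particular pitch, so this restores only the loudness of the missing high frequencies, which matches the caveat in the comment.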
@Mr099660 4 years ago
Quadratic interpolation should be enough, don't think something else would give better results
@nhandao8836 3 years ago
@jalvrus Yeah, and on top of that he applied the moving average filter, which acted like a low-pass filter and removed more high-frequency information.
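
A small sketch of why a moving average behaves like a low-pass filter. The window size and test signals are assumptions chosen for illustration, not values taken from the video.

```python
def moving_average(samples, window):
    """Average each sample with its neighbours; edges use a shorter window."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

fast = [1.0, -1.0] * 8          # flips every sample: the highest representable frequency
slow = [1.0] * 8 + [-1.0] * 8   # a single slow swing

smoothed_fast = moving_average(fast, 3)
smoothed_slow = moving_average(slow, 3)
print(max(abs(x) for x in smoothed_fast), max(abs(x) for x in smoothed_slow))
```

The fast signal shrinks to about a third of its original amplitude while the slow one keeps its full swing.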
@SkyOctopus1 4 years ago ⁺¹⁰
As I understand it, the film used in cinema projectors has the audio recorded both as a visual analogue wave file and digitally. Teeny tiny wave images squished down the side of every frame.
@danieljensen2626 4 years ago ⁺²
Sort of, but they have a much easier time recovering the signal because you can just shine a light through the film onto a photodiode, and get the audio signal out as a voltage. And of course the strip of film is pretty long so you have much better frequency resolution than with this squashed audio clip in the picture.
@AaronOfMpls 4 years ago ⁺²
@danieljensen2626 Yup, and I imagine the light shines through a narrow slit so the photodiode is only "seeing" a tiny fraction of a second at a time. With 35 mm film running at 24 fps, 1 second of audio will be spread out across 18 inches / 46 cm of film.
As for digital sound-on-film, that works much like a QR code, which gets scanned inside the projector. SDDS was printed as a long strip on the edge of the film, and Dolby Digital was printed between the sprocket holes. Meanwhile, DTS audio was stored on a CD, and a time code (which kept it in synch) was recorded on the film as a dashed line next to the analog soundtrack. Wikipedia has a picture of all of this on the "35 mm movie film" article ("File:35mm film audio macro.jpg").
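
The "18 inches / 46 cm" figure can be reproduced with quick arithmetic. The 19 mm frame pitch below is an assumption (the standard 4-perforation 35 mm frame); it is not stated in the comment.

```python
FRAME_PITCH_MM = 19.0    # assumed: standard 4-perforation 35 mm frame pitch
FRAMES_PER_SECOND = 24   # frame rate given in the comment

mm_per_second = FRAME_PITCH_MM * FRAMES_PER_SECOND
inches_per_second = mm_per_second / 25.4

print(f"{mm_per_second:.0f} mm/s, {inches_per_second:.2f} in/s")
```

That is roughly 456 mm of film for one second of sound, compared with a few thousand pixel columns in the screenshot experiment.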
@AI7KTD 4 years ago ⁺³
I think tweaking the interpolation (using sin(x)/x for example) would drastically increase the quality of the recovered clip.
@robertbass682 4 years ago ⁺²
I will share this with my Computer Science students when we do our audio editing unit. I have had them try to generate samples from a Fourier analysis graph of an instrument playing a single note, but never from a scrunched up wave form. This may help drive home what sampling really is all about, at least numerically. Should be fun!
@fakename8749 4 years ago ⁺⁸⁷
You talk about bit depth at length, but sampling frequency also matters.
The screenshot you've got is probably less than 1000px wide, which means 1000 measurements to work with, which isn't nearly enough according to Nyquist-Shannon.
@Computerphile 4 years ago ⁺⁵³
Actually it's closer to 4000 as I grabbed off twin HD monitors :)
@theIpatix 4 years ago ⁺¹⁴
Nyquist Shannon shouldn't be a problem as long as the signal is properly low pass filtered before it's displayed on the screen (which I'd agree that it most likely isn't, I mean, who cares about the visual fidelity of sound). I wonder how well it would work with properly rendered smooth lines on screen (and an algorithm that can recover from those).
@ObsidianJunkie 4 years ago ⁺⁶
To be able to perfectly recreate the signal you need a sampling frequency that is 2x the highest frequency component in the original signal, but you can still get a half decent approximation with a really low sampling frequency, as evident in the result.
@kc9scott 4 years ago ⁺⁹
I was also really surprised that it was intelligible. My early assumption was that the screen resolution would be so low that you could really only use it to volume-modulate noise. But with 4000 measurements for the short time length of the sample, it is enough to reproduce the fundamental frequency of his voice. There will be lots of aliasing of the higher frequencies, but for recognizability of speech, the aliasing won’t matter (which actually isn’t surprising at all).
@widicamdotnet 4 years ago ⁺⁶
Yeah, the reproduction is intelligible speech at a sampling rate somewhere between 1-4 kHz (a ~1s snippet at around 4000 samples), but still much worse than "telephone" quality which routinely gets lowpassed to about 4 kHz and thus would be perfectly reproduced at 8 kHz. It's not surprising that it worked as well as it did, but still a fun experiment and a nice video :-)
@evgenysavelev837 4 years ago ⁺²
This has been done before. It is covered by the Shannon-Nyquist theorem: the way to restore the sound to the best quality possible is to use sinc interpolation. Sinc is short for sin(x)/x. There will be problems with aliasing, which is something you will never be able to correct for.
@MichaelAddlesee 4 years ago
Yes, just what I was going to say. But given the squashed up waveform finding the actual 48kHz sample points is the real problem.
@evgenysavelev837 4 years ago
@MichaelAddlesee Yep, I would also dare to say it is impossible to restore a high-frequency signal after it has been downsampled this way (or any other way).
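
A minimal sketch of the sinc (Whittaker-Shannon) interpolation mentioned in this thread, in plain Python. It assumes the coarse samples were taken from a band-limited signal; because the sum is truncated to a finite list, values between samples are only approximate, especially near the ends. The function names are illustrative.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_upsample(samples, factor):
    """Evaluate the sinc interpolation sum on a grid `factor` times finer."""
    n_out = (len(samples) - 1) * factor + 1
    out = []
    for m in range(n_out):
        t = m / factor  # position measured in original sample periods
        out.append(sum(s * sinc(t - n) for n, s in enumerate(samples)))
    return out

# Hypothetical input: a sine at one tenth of the coarse sample rate.
coarse = [math.sin(2 * math.pi * 0.1 * n) for n in range(40)]
fine = sinc_upsample(coarse, 9)
print(len(coarse), len(fine))
```

The interpolated curve passes through every original sample and fills the gaps with smooth values instead of straight lines, but it cannot bring back frequencies that were already lost or aliased when the waveform was drawn.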
@brod515 4 years ago ⁺¹⁸
I think the audio quality could be somewhat increased if the interpolation for stretching used cubic instead of linear. Since the audio waves are actual sine waves.
@danieljensen2626 4 years ago ⁺⁴
Or you could just upsample with an FFT.
@eDoc2020 4 years ago ⁺⁸
Or don't interpolate on the time axis. Instead of stretching each sample 9 times to reach 48000Hz sampling just output a WAV file with 5333Hz sampling. Then all the interpolation needed for playback is done by the audio software. Audio files with crazy weird sample rates still play back fine on modern systems.
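
The suggestion above (skip the stretch and declare the low rate in the file header) might look like this with Python's standard `wave` module. The tone is a stand-in for the recovered samples, and the in-memory buffer is used only so the sketch has no side effects.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 5333  # 48000 / 9, rounded; the figure quoted in the comment

# Hypothetical stand-in for the recovered samples: one second of a 440 Hz tone.
samples = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ) for n in range(SAMPLE_RATE_HZ)]

buffer = io.BytesIO()  # swap for a filename such as "recovered.wav" to write to disk
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)               # mono
    wav.setsampwidth(2)               # 16-bit PCM
    wav.setframerate(SAMPLE_RATE_HZ)  # the unusual rate goes straight into the header
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    wav.writeframes(frames)

print(len(buffer.getvalue()), "bytes written")
```

Whether a given player resamples such a file cleanly is up to the playback software, as the comment notes.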
@brod515 4 years ago
@danieljensen2626 how does that work?
@SentientTent 4 years ago
@brod515 I think he is referring to a fast Fourier transform (FFT), which is a way of working out how a signal can be built by adding together sine waves with varying frequencies.
@brod515 4 years ago ⁺²
@SentientTent I've heard of Fourier transforms; I'm just not sure how it works to upsample. It was my understanding that a Fourier transform can take a signal and extract the individual sinusoidal frequencies that make up the signal. So if you apply it to the signal in question, would we extract the individual frequencies and then combine them as sine waves (thus upsampling)?
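
One common answer to the question above is zero-padding in the frequency domain: transform, add empty high-frequency bins, and transform back at the longer length. The sketch below uses a naive DFT for readability (a real implementation would use an FFT library), and all function names are illustrative.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each output."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]

def fft_upsample(x, factor):
    """Upsample by inserting zero bins between the positive and negative frequencies."""
    n = len(x)
    spectrum = dft(x)
    half = n // 2
    padded = spectrum[:half] + [0j] * (n * factor - n) + spectrum[half:]
    # Scale by `factor` because the inverse transform divides by the longer length.
    return [v * factor for v in idft(padded)]

# Hypothetical input: two cycles of a cosine in 16 samples.
coarse = [math.cos(2 * math.pi * 2 * k / 16) for k in range(16)]
fine = fft_upsample(coarse, 4)
print(len(fine))
```

So yes: the existing frequencies are extracted and re-synthesised on a finer time grid. No new frequencies are created, which is why this is equivalent to sinc interpolation for a periodic signal.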
@SentientSeven 4 years ago ⁺³
This was great! Also, Dave knows how to set up his camera, great quality video
@rojasbdm 4 years ago ⁺¹
Went much better than I expected!
@patricknelson 4 years ago
Such an amazing result... I seriously thought it'd just be weird loud buzzing sounds. I’m surprised you could tell what was being said. Well done!
@Md2802 4 years ago ⁺¹
The rights to a piece of recorded music are generally split between (1) the publishing, i.e. who wrote the song, and (2) the recording, i.e. who paid to have it recorded. Both have clearly defined ownership, and the photographer (or magazine / stock photo site / whatever) would not have the right to sell licenses for either.
@mattstegner 3 years ago
I work on Audition (I'm a Quality Engineer) at Adobe and just shared this with my team who will all get a kick out of it. Great video.
@StrangelyIronic 4 years ago ⁺¹
Would be interesting to get a high-resolution macro/microscopic shot of a record and emulate a needle tracing the groove to generate a wave file. You would need a high-quality image with great lighting and a fair bit of processing to get a clean groove guide though.
@BytebroUK 4 years ago
I'd watch that. This whole idea that started out as a bit of a joke has proved really interesting!
@brettbreet 4 years ago ⁺³²
"The last V8. Return to base immediately!"
@Computerphile 4 years ago ⁺¹³
Oh my goodness, that was it! -Sean :) (there was a bug that meant you could slip sideways through the map and cheat)
@mortenohlsen7834 4 years ago ⁺¹
I thought it would be Space Taxi or Impossible Mission he was remembering.
Though the code using lastv2 made me think of The Last V8 though not remembering voice, just the soundtrack.
@ronnetgrazer362 4 years ago ⁺⁵
@mortenohlsen7834 We have a visitor. Stay a while... staaay foreverrrr.
@deoxal7947 4 years ago
Hm what's this now?
@kellerkind6169 4 years ago ⁺²
I thought it was:
GHOSTBUSTERS!
MUAHAHAHAHAHAHAHA!
P.S: Or "Stay a while, stay forever !" maybe ;-)
@emuccino 4 years ago ⁺⁴⁰
Dave: "At the end of the day, everything becomes a list of numbers.."
Me: *existential crisis* 😳
@EdwardMillen 4 years ago ⁺²
He should have added "in computers". Well, and in maths I guess. And I suppose maths is... oh... nevermind
@matiascardullo9892 4 years ago
I mean, you can decompose each quark in your body into an array of coordinates xyz
@domminney 4 years ago ⁺¹
I’m assuming that “in computers” was implied by the context, but one could argue it in the real world too
@DrorF 3 years ago
@EdwardMillen Math is not just about numbers. In fact, numbers are just a small part of it, to my understanding.
@JeffBlaine 4 years ago ⁺¹⁷
"Stay a while. STAY FOREVER!"
@AdriaanZwemer 4 years ago ⁺³
HAHAHAHAHAAAaa
@nohjrd 4 years ago ⁺⁴
Haha, I was just going to comment that (but starting from "Another visitor..."). There was also "Get him my robots"
@ryan8488 4 years ago
Yes!
@ryan8488 4 years ago ⁺²
Ghostbusters ahahahaha
@JeffBlaine 4 years ago ⁺¹
@ryan8488 Did that also have crude speech synthesis? My quote was from Impossible Mission
@dominiquestiekema824 3 years ago ⁺¹
Reminds me of MIT's "microphone plants", where they filmed a plant's vibrations caused by the sound of a conversation in the room. They were able to reproduce the conversation to some extent from that video. It wasn't great; tone, pitch and gender of the speaker were the only things listeners could identify with certainty.
@notimportant7682 4 years ago ⁺¹⁴
I would have looked at that waveform and thought there was no way to recover anything but amplitude-modulated noise with that technique. Thinking harder about it, I understand why it worked.
@woulg 4 years ago ⁺²
Same here, when he played it back and it worked it completely blew my mind. What a great episode
@notimportant7682 4 years ago
@Kurt Juday forget what he said about filling it in later. I think if he did any post-processing at all it may have been amplifying what used to be the high-frequency sections, but the frequencies he gets, I believe, come directly from the averaging of the min and max values over the x axis, essentially acting like a low-pass.
@shmunkyman33 4 years ago
@Kurt Juday Well, all a waveform is is amplitude data. It's just a sampling of the amplitude of the air pressure over time, so the only difference in this setup is that the number of samples has been reduced. The frequency data is just an emergent property of the amplitude changing over time, so as long as he is able to match the time scale over which the amplitudes change (which he does with that "scale" variable), the frequencies will come out roughly the same (just filtered a lot due to the loss or corruption of information).
@pokepress 4 years ago ⁺¹
A ways back I remember Weird Al posting the picture of a waveform of a song he was working on. It was way longer than this, so at best you might have been able to get the lowest of frequencies out of it this way.
@elimalinsky7069 4 years ago
Back in the 1980s some local radio stations were broadcasting computer programs over the air for people to tape and load up on their Spectrums and C64s. Mostly in the UK, where audio tapes as computer data storage were in use the longest.
@RelianceIndustriesLtd 4 years ago ⁺³⁸
So this how phone companies transmit audio in phonecalls
@grover- 4 years ago ⁺¹
08:50 - that is truly amazing! I thought it would just be noise. Maybe next time you could consider the hue of the pixel and deduce an intermediate value from it to add more resolution? It's a new form of data exfiltration too. Sending voice recordings in images.
@gammaray0wn 4 years ago
Really cool video! What you didn't touch on is how this shows how amazing the human brain is at interpreting human speech. Not only are our ears most sensitive to frequencies that match those of human speech, our brain can also extrapolate words and meaning from even heavily distorted and low information pitch, volume, and intonation content!
@horurkristinsson5292 4 years ago ⁺¹
I remember an Amiga program called Octamed (tracker from '90) had ability to draw a waveform with the mouse onto a grid and use it in your song.
@Roxor128 4 years ago ⁺¹
Fast Tracker II on MS-DOS can do that, too. Maybe Triton took inspiration from the earlier Octamed?
@vladpuha 4 years ago ⁺¹
very educational. thank you! Please have an extended interview about audio headers and how to work with sound with some sample.
@davidyu1813 4 years ago ⁺⁴
The reproduced audio reminded me of the English listening tests I took in high school
@fllthdcrb 4 years ago
Sounds like it was torture.
@geiger21 4 years ago
As a Pole I can relate to that xD questionable voice quality + the shittiest boombox + super reverby classroom. Boom, English lesson in a Polish school xD
@erifetim 4 years ago
The outcome is much better than I've expected, would've loved to hear more examples
@Computerphile 4 years ago
I bet Dave would do you one if you contact him :)
@KrisCalabioMusic 4 years ago ⁺³⁷
I wanna see LegalEagle do an episode about that!
@tipx2master788 4 years ago ⁺¹
He is an American lawyer
@maighstir3003 4 years ago ⁺²
Or LawfulMasses
@jeromethiel4323 4 years ago
I remember a game from the 80's called sea dragon. And the start of the game had audio that said "sea dragon" 3 times, each speeded up. And that was over an Apple 2 speaker. So basically 1 bit audio, as the speaker was clicked one way, then the other. But at the time, amazing!
@my4trackmachine 4 years ago
HAHAHA I love this. I was surprised there was enough resolution to get it that clean. I can see this conversion being a rad VST for sound processing.
@recklessroges 4 years ago ⁺¹
Dude! I understand the defensiveness, but it's totally not needed for me; it's a joy to see some actual practical, (not just an example) programming back on Computerphile. [where I will poke fun] "Would have been better if it was written in rust."
@Lemon_Inspector 3 years ago
Dave there in the thumbnail is giving me a look like if I sample audio this way, I'm gonna end up in 6 pieces at the bottom of the nearest lake.
@colinstu 4 years ago ⁺³
1:00 vs 8:42 ... amazing work. I actually wondered years ago if this was possible; stunning to actually see it done!
But yeah, I was imagining this done with like a 3min song, squished to about that same size... that would lose way more depth I'm sure.
@danieljensen2626 4 years ago ⁺¹
Yeah, this clip is only like a second long. With a 3 minute song it would sound at least 180 times worse, haha.
@rene0 4 years ago ⁺¹
I... did not expect that. I was expecting 'can maybe barely make out', not 'can even identify the person talking'... Amazing.
@kieran.stafford 4 years ago
Love this video. I watched fascinated. Bloody awesome result guys. It'd be very interesting to try this with an ultra high definition image file to see how far you can push the boundaries of quality. Again brilliant video. Many thanks
@shadowwalker23901 4 years ago ⁺⁷
I have a feeling I was the only one thinking: switch it over to the frequency domain, a.k.a. a spectrogram, instead of wasting so much data in a picture. Using a 512x512 picture you could store a 6-second 44.1 kHz 8-bit sound clip in mono with grayscale, and in stereo with color.
@charlieangkor8649 3 years ago
or simply dump raw .mp3 data into a matrix of 512x512x3 bytes (768 kB) and encode it as PNG. No need for spectrograms.
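
The capacity figures in these two comments can be checked quickly. This sketch assumes one byte per pixel per channel and raw 8-bit samples (one byte each); the names are illustrative.

```python
WIDTH = HEIGHT = 512
SAMPLE_RATE_HZ = 44_100   # CD-style sample rate from the comment
BYTES_PER_SAMPLE = 1      # 8-bit audio

pixels = WIDTH * HEIGHT                       # one byte each in grayscale
mono_seconds = pixels / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)

rgb_bytes = pixels * 3                        # three colour channels
rgb_kib = rgb_bytes / 1024

print(f"grayscale: {pixels} bytes, about {mono_seconds:.2f} s of mono audio")
print(f"RGB: {rgb_bytes} bytes = {rgb_kib:.0f} KiB")
```

So a grayscale 512x512 image holds just under 6 seconds of raw 8-bit mono audio, and the three-channel version holds 768 KiB of arbitrary data, consistent with both comments.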
@LegitJDG534 4 years ago ⁺¹
Impressive results, I didn't expect to be able to roughly make out the different syllables.
makes me wonder if you could convert the waveforms generated via musical instruments, approximate the value being played and generate a midi file of the audio segment.
@davidg5898 4 years ago ⁺¹
I'd say the aspects of music copyrights to be concerned about here -- at least in the USA -- fall between broadcast and performance rights. Basically, you wouldn't be violating any rights merely by converting the image data into audio and listening to it privately, but if you shared the audio with others then you'd be treading into copyright infringement territory.
Extracting audio from a photograph doesn't qualify the audio as a derivative piece from the photograph because you're changing the media type and thus the set of rules governing the rights. For example, sheet music vs. radio broadcast vs. live performance vs. using a song in a movie/show/video/commercial each have different set(s) of licensing/rights involved (with some overlapping).
That said, if you wholly owned the rights to such a picture, you might be able to skirt the law by publishing the picture and also publishing your method by which audio can be extracted from it, so long as you're not actually doing the conversion for anyone or sharing anything you've personally extracted with anyone. It's no guarantee, though, because copyright suits can involve a lot of interpretation and a clever lawyer could still sway a judge/jury against you.
I am not a lawyer, but have done an immense amount of research into these parts of music copyright law due to public programs an employer put on that I was involved with carrying out.
@bluerizlagirl 4 years ago
The height of the waveform will give you the bit depth. For instance, if the difference between the lowest and highest points was 400 pixels, then the resolution is somewhere between 8 and 9 bits (which would allow for 256 and 512 steps respectively). And the amount of time represented by one pixel in the horizontal direction is the inverse of the sampling rate. For instance, if one pixel represents 100µs, then the sampling rate is 1 000 000 µs in a second / 100 = 10 000 samples/second = 10kHz. Even easier, the original sample rate can be got by dividing the final sample rate by the time stretch factor.
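
The two worked examples in this comment, written out as arithmetic. The 400-pixel height and 100 µs per pixel are the commenter's illustrative numbers, not measurements from the video.

```python
import math

height_px = 400                      # vertical extent of the waveform, in pixels
effective_bits = math.log2(height_px)

microseconds_per_pixel = 100         # time covered by one pixel column
sample_rate_hz = 1_000_000 / microseconds_per_pixel

print(f"{height_px} levels is about {effective_bits:.2f} bits")
print(f"{microseconds_per_pixel} us per pixel is {sample_rate_hz:.0f} samples/s")
```

400 distinct heights sit between 2^8 = 256 and 2^9 = 512, hence "between 8 and 9 bits", and one column per 100 µs corresponds to 10 kHz.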
@gloverelaxis 4 years ago ⁺¹
As a musician/producer and programmer this was really fascinating! I've never thought about how vertically scaling the waveform graph *exactly* parallels reducing the bit depth, and horizontally scaling *exactly* parallels reducing the sample rate. I think bitcrusher plugins could use this really effectively to visualise what they're doing as users change parameters. Very cool!
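
A rough sketch of the parallel drawn above: a toy bitcrusher that reduces vertical resolution (bit depth) and horizontal resolution (sample rate) independently. The function and its parameters are hypothetical, not taken from any particular plugin.

```python
import math

def bitcrush(samples, bits, hold):
    """Quantise to `bits` bits and hold each kept sample for `hold` outputs."""
    levels = 2 ** bits
    out = []
    for i in range(0, len(samples), hold):
        # Map [-1, 1] onto integer steps 0..levels-1, then back again.
        step = round((samples[i] + 1) / 2 * (levels - 1))
        value = step / (levels - 1) * 2 - 1
        out.extend([value] * min(hold, len(samples) - i))
    return out

samples = [math.sin(2 * math.pi * n / 90) for n in range(90)]
crushed = bitcrush(samples, bits=3, hold=9)
print(sorted(set(crushed)))
```

With `bits=3` the output can take at most 8 distinct values (a picture 8 pixels tall), and with `hold=9` each value lasts 9 samples (a picture one ninth as wide).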
@AndersJackson 4 years ago
Around 13:00 you are talking about the Nyquist frequency, and there is a lot of filtering of the original sound when you convert this.
@kasuha 4 years ago ⁺¹
The audio was just lacking high frequencies, low number of bits on amplitude is not such a big deal. I wonder if instead of interpolating through the range for each column, replacing the interval with appropriately scaled white noise wouldn't help.
@taylorh140 4 years ago ⁺¹
To me, it sounds like it might be partially due to the triangle waves (linear interpolation between two points). Pretty common in the old video games.
@MubashirullahD 4 years ago ⁺¹
Impressive. I didn't know you could do that. I assumed there would be too much overlap.
@poorman-trending 4 years ago ⁺¹⁹
Train a neural network to reconstruct the wave form. You’d have to train it using voices only, but I bet it would do a bang up job.
@crazyluigi6664 4 years ago
You mean like Jukebox AI?
@TAP7a 4 years ago ⁺⁵
Talk about using a sledgehammer to crack a nut. Or in this case maybe it's more of a pneumatic drill to crack a sheet of paper
@altrag 4 years ago
@TAP7a Well certainly that would be rather overkill for a little programming challenge.. but as the video suggested, if the waveform was stripped from a background monitor during some major producer's interview or the like.. that could be at least somewhat profitable I imagine, especially in markets that aren't so concerned with US/European copyrights but still consume large amounts of US/European content (Asia, Africa, etc).
@ijknm2531 4 years ago
like brainchip akida ?
@tramsgar 4 years ago
Nice new practice to paste code in the description! 👍
@lambdaprog 4 years ago ⁺⁴
Next enhancement: Measure the instantaneous frequency and use it to generate a new wave form based on the instantaneous amplitude. This will effectively resuscitate the lost phase information using a few assumptions on the human voice. Have fun!
@Lolwutdesu9000 4 years ago ⁺¹
Instantaneous frequency 😂
@lagduck2209 4 years ago
That sounds different from just a bitcrusher/lower sample rate. Linear interpolation brings some nice distortion too. Would be pretty neat to have this as a VST with width/height control, a linear/cubic/parabolic option for interpolation, and a parameter for discretization of pixel brightness. (Probably wouldn't work well in real time, but I can imagine such a procedural tool being totally fine.)
@Turbo3032 4 ปีที่แล้ว ⁺¹⁸
Isn't this basically a simpler version of what people have done to get audio from the vibrations in glass in a video?
@fisch37 4 ปีที่แล้ว ⁺⁵
It is similar, but the latter is probably a lot more complicated and honestly I have never heard of that happening. You would need a pretty high resolution for that
@victorbarroscoch 4 ปีที่แล้ว ⁺⁷
@@fisch37 Resolution shouldn't really matter that much. You might need a high speed camera though, the normal frame rate for video is 24-60 fps. With that you can only reproduce sounds that are 30Hz or lower (without running into aliasing issues).
@cadekachelmeier7251 4 ปีที่แล้ว ⁺⁷
I think the process is really analogous. In this case the sound is compressed by the image resolution. In the vibration video it's compressed by the video framerate. There were more tricks that they pulled out with the audio from video thing though like using the rolling shutter to get higher temporal resolution than you'd originally expect.
@1SmokedTurkey1 4 ปีที่แล้ว ⁺⁵
@@fisch37 Check out veritasium. It's been done. He has a video about it.
@realcygnus 4 ปีที่แล้ว
Cool
@antivanti 4 ปีที่แล้ว
If you haven't kept up on the Commodore 64 demoscene you'd be absolutely blown away by what they're able to do these days as far as audio goes. Have a search for "Cubase 64 Mahoney" for instance...
@joseortiz_io 4 ปีที่แล้ว ⁺¹
Unbelievably awesome! So creative. I love it!
@douggale5962 4 ปีที่แล้ว
When getting the minimum and maximum at the same time, the optimal algorithm is to read two consecutive numbers (a and b) from the list, and if a < b then compare a against min and b against max, else compare b against min and a against max, and assign accordingly. Three comparisons every two items, instead of four comparisons every two items.
@wolfgangmcq 4 ปีที่แล้ว
That sounds likely to be less efficient in practice on modern hardware (though, as always, profile if you need performance)---I expect the compiler can probably optimize the naive version quite a bit. For example there's a way to get the max of 4 pairs of numbers in a single instruction. If the compiler can figure out what you're trying to do it can set up a pair of SSE registers for min/max, and stream your entire list through them a lot faster than if it was forced to do branching for pairwise comparisons.
@douggale5962 4 ปีที่แล้ว
@@wolfgangmcq You don't need to use branches to do what I said. Whole thing can be conditional moves. You could use my algorithm on vectorized version too.
@wolfgangmcq 4 ปีที่แล้ว
@@douggale5962 Ah, true, hadn't thought about that. Still seems likely to be more trouble than it's worth, though, if only for reasons of readability.
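A small Python sketch of the pairwise min/max idea from this thread, for reference. The function name `min_max_pairwise` and the comparison counter are illustrative additions and are not taken from the video's code. It says nothing about real-world speed, which, as noted above, would need profiling.

```python
def min_max_pairwise(values):
    """Return (minimum, maximum, comparisons) using about 3 comparisons per 2 items."""
    if not values:
        raise ValueError("values must not be empty")
    comparisons = 0
    n = len(values)
    if n % 2:
        # Odd length: seed min and max with the first item, no comparison needed.
        lo = hi = values[0]
        start = 1
    else:
        # Even length: seed min and max from the first pair with one comparison.
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        # One comparison orders the pair...
        comparisons += 1
        if a < b:
            small, large = a, b
        else:
            small, large = b, a
        # ...then the smaller only needs checking against min, the larger against max.
        comparisons += 2
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons


print(min_max_pairwise([5, 3, 8, 1, 9, 2, 7, 4]))
# (1, 9, 10)
```

For these eight items the naive loop (two comparisons for every item after the first) would use 14 comparisons, against 10 here.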
@phoenixdk 4 ปีที่แล้ว
I'm amazed at how well this worked. It would be fun to push it to complete failure, like 10 seconds of music, several instruments, in a 4000 px wide image.. would it even be tonal at that point?
On the other hand, bit depth could be improved by simply enlarging the y-axis on the screengrab, which might improve articulation. And as mentioned, the code could be tweaked and improved further.
@carlociarrocchi2793 4 ปีที่แล้ว ⁺¹
I did something like that a few years ago. Instead of an audio I was trying to recover as much information as possible from the picture of a graph showing a single continuous line.
@nielsdegroot9138 4 ปีที่แล้ว ⁺¹
The sound reminded me of Impossible Mission on the C64. @8:45. Good memories.
@cpt_nordbart 4 ปีที่แล้ว ⁺¹
Stay forever!
@christoffermedc 4 ปีที่แล้ว
wow i can't understand how sound is representable that well in just a 2d picture, amazing!
@antivanti 4 ปีที่แล้ว ⁺¹
You probably have a bit depth of about 8 bits actually. That would be the equivalent of 256 pixels of height between the lowest valley and the highest peak in the image. The problem is the abysmal sample rate. If you stretched each sample 9 times to match playback speed at 48 kHz, that equals 48 / 9 ~ 5.3 kHz. Nyquist tells us that is enough to perfectly recreate frequencies up to half that if the signal was bandlimited. If, instead of averaging out the values, you just duplicated them and then bandlimited that signal to 5.3 kHz, you'd probably get a pretty good recreation of everything below 2.7 kHz 🤔
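The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names are invented for illustration, the 48 kHz rate and 9x stretch are the figures quoted in this thread, and the band-limiting filter itself is left out.

```python
def effective_rate_hz(playback_rate_hz, stretch_factor):
    """Sample rate the image data really has if each point is stretched `stretch_factor` times."""
    return playback_rate_hz / stretch_factor


def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a band-limited signal can carry at this sample rate."""
    return sample_rate_hz / 2


def hold_samples(samples, factor):
    """Duplicate each sample `factor` times (the 'just duplicate them' option)."""
    return [s for s in samples for _ in range(factor)]


rate = effective_rate_hz(48_000, 9)
print(round(rate), round(nyquist_limit_hz(rate)))
# 5333 2667
print(hold_samples([1, 2], 3))
# [1, 1, 1, 2, 2, 2]
```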
@G12GilbertProduction 4 ปีที่แล้ว ⁺¹
High - low processing is really meaning more harder than in 45000 Hz sampling between a two parts of Gaussian curve.
@schifoso 4 ปีที่แล้ว
This was very interesting. I hope you do more videos with Mr. Fowler as he's an excellent presenter.
@Slarti 4 ปีที่แล้ว
Visual Studio C#, I will therefore forgive you of any less than perfect code :)
The best IDE and the most straightforward programming language and framework.
@lysdexic 4 ปีที่แล้ว
Fun project! Sounds like the reconstructed output is crushed to 4-8bit - wonder if the Adobe sampling rate for the wave plot is 1:8 or something (plot 1 out of every 8 samples) maybe it’s being ring modded by your new sampling rate too. How many samples (divided by 9) is your output file?
@veggiet2009 4 ปีที่แล้ว ⁺⁹
Sean thought 8bit voice, I thought about the very first recording cylinders
@Zappabain ปีที่แล้ว
I'm not sure, but as I understand it we lose all the frequency information with this fast method. Maybe the frequency could be roughly deduced from the brightness: the more brightness, the denser the wave at that instant, and the higher its frequency. Then some constants to tune its scale until it sounds reasonable, plus a sine approximation or similar, might make it sound more complete and human.
@Zappabain ปีที่แล้ว
actually I notice now I don't know what exactly that wave represents so I may be completely wrong 😅
@AuthenticTerrificRickCastle 4 ปีที่แล้ว
was listening to the conversation you've had in the very end - you can make a prompter-like DIY contraption that will help you look straight into the face of the person that you are talking to
@Ping727 4 ปีที่แล้ว ⁺³
I just realized listening to this that computerphile is like computer file...
@adamp9553 4 ปีที่แล้ว
Hard to believe he managed to get something that sounds like a voice out of that image. Peak-to-peak tracing doesn't retain the information within, nor is it a form of actual downsampling - you can't know the density, how much closer to one pole it should be.
@DusteDdekay 4 ปีที่แล้ว
I'm surprised it sounds so good. I was not worried about the sample depth at all, but I thought there wouldn't be enough samples to make out anything.
@EighteenCharacters 4 ปีที่แล้ว
I love this episode! This is amazing!
@SSJfraz 4 ปีที่แล้ว
That's outstanding. Great work.
@SteveJones172pilot 4 ปีที่แล้ว
I was assuming this video was going to go in a COMPLETELY different direction. I was assuming the wave data was going to be encoded into the bitmap using Steganography so that it could be pulled back out by anyone who had access to the picture.
@user-i3pqb2lw54 4 ปีที่แล้ว
Another visitor. Stay a while. Stay forever!
@TimothyWhiteheadzm 4 ปีที่แล้ว
Regarding the legal questions (I am not a lawyer), I would assume that if you buy all 'rights' to a picture but the picture contains data the person you bought it from did not have rights to, you don't gain rights to that data. At best you could sue the seller for falsely representing ownership, and his client (who owns the audio) could sue him for leaking the audio to you.
@F1ghteR41 4 ปีที่แล้ว ⁺²
I think that Leonard French would be the guy to ask about this; he has an engineering background and he's a copyright attorney.
@emmanueloverrated 4 ปีที่แล้ว
Basically, the sample just got downsampled... you're using an image to transfer the information. Take any WAV, downsample it to 7-bit at a sample rate of 1 kHz, and you'll get the same quality.
If you want something that sound so much 8bit, convert it to DMC format. It sounds lovely.
@IlluminatiBG 4 ปีที่แล้ว
Well, a wave is just a graph that can be represented as a 2D image 65536 px high (assuming 16-bit) and 44100 px wide (assuming 1 second at a 44100 Hz sampling rate) with 1-bit color depth. That would still be far more data than the WAV file, because a wave is a 1D array that doesn't encode any pixels for the empty part of the graph, so the WAV would still be better than a PNG-compressed image.
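A quick back-of-the-envelope check of those sizes in Python, counting raw uncompressed data only. The helper names are hypothetical, and PNG compression is not modelled, so this says nothing about how small the compressed image would end up.

```python
def raw_bitmap_bits(width_px, height_px, bits_per_pixel=1):
    """Bits needed to store every pixel of an uncompressed bitmap."""
    return width_px * height_px * bits_per_pixel


def raw_pcm_bits(num_samples, bits_per_sample):
    """Bits needed to store uncompressed mono PCM samples (header ignored)."""
    return num_samples * bits_per_sample


# One second at 44100 Hz, 16-bit: one pixel column per sample, 2**16 rows.
bitmap = raw_bitmap_bits(44_100, 2 ** 16)
pcm = raw_pcm_bits(44_100, 16)
print(bitmap // 8, pcm // 8, bitmap // pcm)
# 361267200 88200 4096
```

So the raw 1-bit bitmap is about 361 MB against about 88 kB of sample data, a factor of 4096.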
@flochartingham2333 4 ปีที่แล้ว ⁺⁴⁰
Did he call Audacity; "old dusty?"
@karatsurba4791 4 ปีที่แล้ว ⁺³
Yes
@domminney 4 ปีที่แล้ว ⁺¹¹
🤣🤣🤣 my saaarf east London accent getting in the way there
@iau 4 ปีที่แล้ว ⁺³
It is. Its usability is almost none.
@Anvilshock 4 ปีที่แล้ว ⁺³
Well, it is. Looks worse than Netscape Navigator 1, it does.
@ZipplyZane 4 ปีที่แล้ว ⁺⁸
@@iau That's like saying GIMP is unusable. It's just not the highest end product. But, for many cases, you don't need the high end. You just want to record audio, apply some effects, and be done.
Though I do hope they'll eventually make crossfading easier. It should be a lot more automatic, and not only in the special live recording mode.
@jangxx 4 ปีที่แล้ว
13:47 It's probably also perfectly audible because of our brain's ability to extract speech from incredibly distorted sources, but only if you have an idea of what was being said. If you played the same audio file to someone who doesn't understand English, I'm sure they would tell you that it's just gibberish.
@3DCGdesign 4 ปีที่แล้ว ⁺¹
Not surprising as I’ve already seen where someone turned their sound wave “I love you” or something like that into a piece of wall sculpture and then you can scan it with an app to hear what the wave contains.
@Nejvyn 4 ปีที่แล้ว ⁺¹
Does that work with any depiction of a soundwave tho or only with those sculptures? If it's the latter I'd assume that the app is just checking the pattern via a library and link it to the stored audio file on the manufacturer's servers or sth like that.
@3DCGdesign 4 ปีที่แล้ว
@@Nejvyn I was under the impression that the app could read any soundwave - but I did not investigate further. You could be right.
@EvolWe 4 ปีที่แล้ว
Impressive! this reminds me of "The Visual Microphone: Passive Recovery of Sound from Video" where they extract sounds from objects like plants using high speed cameras and micromovements.
@EdwardMillen 4 ปีที่แล้ว
I dunno whether I count as a great programmer anymore, but your code looks perfectly reasonable to me. I'm also really surprised it came out that clearly!
One thing I'm wondering though - is the horizontal squashing of the waveform equivalent to a lower sample rate? And if so, couldn't you just have set a lower sample rate in the header of the output file so that it gets played at that rate, instead of having to add in extra values to stretch it back out? (Or is there a reason why that wouldn't work?)
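To illustrate the header idea, here is a hedged sketch using Python's standard `wave` module (the video's code is C#, so this is not the author's approach). It writes the recovered points unstretched and declares a lower rate in the header. The helper name `write_wav` and the rate of 5333 Hz (48000 / 9 rounded, since the header stores an integer) are assumptions for illustration. Whether a given player copes well with such an unusual rate is a separate question.

```python
import io
import struct
import wave


def write_wav(samples, sample_rate_hz):
    """Pack 16-bit mono samples into WAV bytes with the given header sample rate."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(2)               # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate_hz)  # the rate the player is told to use
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


# Four made-up sample values, stored as-is with a low rate in the header.
data = write_wav([0, 1000, -1000, 0], 5333)
with wave.open(io.BytesIO(data), "rb") as wav:
    print(wav.getframerate(), wav.getnframes())
# 5333 4
```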
@Veptis 3 ปีที่แล้ว
Couldn't you take the screenshot and figure out the duration to divide by pixels and know how much stretching is needed?
Adding better interpolation and also some harmonics could give slightly better results.
Doing this from a spectrograph that is color mapped will give much better results, right?
@umchoyka 4 ปีที่แล้ว
If I understand what he's done here, it's basically like taking the original file and running it through a low pass filter. Interesting that it actually worked
@Henry14arsenal2007 4 ปีที่แล้ว
What I don't get is how voice timbre is preserved in an audio wave in general. How do you differentiate between a rock song, a techno song and a voice recording, for example?
@_MG_ 4 ปีที่แล้ว
What is the size difference between the original sound file, the image file, and the new audio file? I'm curious about the compression ratios, given that you've sent meaningful or recognizable data through this conversion process.
@marksterling8286 4 ปีที่แล้ว
Great video, very surprised about the quality of the output. Really interesting
