Link to series playlist: th-cam.com/play/PLbg3ZX2pWlgKDVFNwn9B63UhYJVIerzHL.html
To whoever runs ArtOfTheProblem, I want to personally thank you for all of the time you have spent making these videos. I came across one of them on a math forum a few months ago and I proceeded to consume every video you have made. Your videos led me to absolutely fall in love with information theory and persuaded me to open up a major in computer science. Claude Shannon has replaced Richard Feynman as my favorite scientist. I can't thank you enough, these videos literally changed the path I'm taking through life.
He's the same guy, Khan academy liked his videos and brought them onto the site
stay tuned for more!
hey! just checking in, how did your major end up, where did it lead you?
@@ArtOfTheProblem Hey! Thanks so much for checking in. I ended up graduating w/ a BS in Computer Science from UC Santa Cruz and I'm currently four years into my career as a backend software engineer in Silicon Valley. I just left the first startup I joined out of college and joined a new one and I'm currently working on building a platform for training data management for enterprise ML applications. I really wanted to thank you again for making these videos, they altered the course of my studies and my career and, hence, my life. I still, to this day, read about information theory in my own time and still hold Claude Shannon up to be one of my favorite scientists. (I just noticed I'm posting from a new email address but it's me!)
@@johnallard2429 wow congrats. This note made my morning and is the reason I started this channel. You're probably way ahead of me now, i'm struggling to simplify my understanding of transformers at the moment for the next video :)
Thank god I found this video again
Still, this is the best explanation of Shannon's formula that exists on YouTube
nobody finds this anymore, glad you did!
This Shannon guy was way ahead of his time. The father of data compression at a time when there was no Internet, no computers, not even binary files.
OK, not just data compression: all digital encoding. This guy rivals Alan Turing, if not surpasses him. (Essentially the first real computer guy in terms of 'digitizing'.)
This is the most intuitive video I've ever seen about Shannon entropy, thank you!
I'm studying computer science and I wish all my professors explained things just like you do!! That would've saved a lot of time!
Thanks for uploading a new video!!! I just started watching these a few days ago and was bummed out when I saw you hadn't uploaded one in a little while. These are awesome.
I consider Khan Academy to be one of the best things we have in this internet era
Extremely intuitive explanation, so well explained that I was left wondering why the heck I couldn't understand it until you explained it this way!
I'm thrilled to hear this video helped you. It was an original explanation of this concept that I don't think many are aware of
This is the best explanation on the INTERNET! Amazing, awesome, and thank you!
Definitely one of my favorite explanations of all time
Could never understand the information entropy unit in class, but you nailed it here. Thanks!
These videos are extremely informative and very well done! Thank you for all your hard work! :)
+quang nguyen Thanks for feedback! Our new series has begun
To anyone wondering why you would ask questions like "AB vs. CD" in the first case but "A vs. BCD" in the second case: the reason is that this ensures both answers occur with 50% probability. If at every step you ask a question whose two answers are equally probable, the AVERAGE number of questions you'll have to ask (in other words, the average depth of that tree of questions and answers) will be minimal. Why? You'll need a little bit of basic computer science experience to think this one through; maybe a good reason to take a course on data structures!
More interestingly, this is also the reason why binary search is more efficient than splitting a sorted list into 3 parts. It's a simple proof, and I encourage you to think it through :)
> If at every step you ask a question whose two answers are equally probable, the AVERAGE number of questions you'll have to ask (in other words, the average depth of that tree of questions and answers) will be minimal. Why?
Why???? Proof, please.
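Not a full proof, but here's a small numerical sketch (my own check, using Machine 2's probabilities from the video) showing that the equal-split tree's average number of questions equals the entropy. It works out exactly here because every probability is a power of 1/2; in general the entropy is a lower bound on the average number of yes/no questions.

```python
import math

# Machine 2 from the video: P(A)=0.5, P(B)=0.125, P(C)=0.125, P(D)=0.25
p = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}

# Shannon entropy in bits: H = sum over symbols of p * log2(1/p)
H = sum(prob * math.log2(1 / prob) for prob in p.values())

# Questions each symbol needs in the equal-split tree, i.e. the depths
# implied by the (corrected) calculation at 3:49 in the video.
questions = {"A": 1, "B": 3, "C": 3, "D": 2}

avg_questions = sum(p[s] * questions[s] for s in p)

print(H)              # 1.75
print(avg_questions)  # 1.75
```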
Great video! Glad to see you are back
It's great to see a new video after a while :) thank you!
Simply great explanation. Thanks!
Great video and a super interesting topic. I think this definition of information is quite counter-intuitive compared to the everyday, common-sense use of the word "information". People often think of information as a source of knowledge or as having some inherent use. The common-sense definition of information is more akin to education and implies some utility, whereas this definition remains indifferent to utility and is closer to complexity, or an implied amount of substance. This makes sense since entropy and information are increasing in the wild. It is fascinating that information can be proven to be entropic. I wonder if there is a limit to information, or whether entropy increasing means the universe is infinite in possibilities? Maybe we can only observe an information state, but information itself is infinite.
Good to have you back :)
Thank you for all your hard work making these videos. Every video is made with attention to the smallest detail. And that intro sound...
thanks so much, working on video now on Bitcoin i'm excited about
I like the creativity u put in explaining anything.
Simply wow.. :)
Thanks for your feedback , much appreciated
Very clear explanation. Thanks.
These videos are amazing!
I am greatly enjoying these!
Very nice video and clear explanation!
Thank you for continuing! These are fantastic!
Very nice intuitive explanation. Thank you!
Bro, I watch lots of vids and it's the first time I've seen a new approach to describing this. Pretty damn well articulated. Thanks for sharing it.
nice glad you found this
i simply love your videos.
Niiiiice!! Thank you, i was waiting for the longest time!
Super clear explanation.
love all your feedback :)
great explanation
HE IS ALIVE! =D WOOHOO
Very easy to understand. Thank you!
The beginning of the video is misleading and implies that the second machine is not random. It is in fact random and the difference between them is that the first generates a uniform distribution whereas the second doesn't.
Hmm, is it not so that the sampling (from the distribution) is random, but the actual values are not? For example, I can imagine a distribution that's just a single line at a specific value; random sampling from that distribution will then always give you the same value.
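For what it's worth, here's a tiny simulation of the two machines (probabilities taken from the video) that makes the point concrete: both are random, they just draw from different distributions, one uniform and one not.

```python
import random
from collections import Counter

random.seed(0)
N = 100_000

# Machine 1: uniform distribution over A, B, C, D
machine1 = random.choices("ABCD", k=N)

# Machine 2: same symbols, non-uniform distribution (from the video)
machine2 = random.choices("ABCD", weights=[0.5, 0.125, 0.125, 0.25], k=N)

for name, sample in (("machine 1", machine1), ("machine 2", machine2)):
    counts = Counter(sample)
    print(name, {s: round(counts[s] / N, 3) for s in "ABCD"})
# machine 1 -> roughly 0.25 each; machine 2 -> roughly 0.5 / 0.125 / 0.125 / 0.25
```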
Wow, such a well made video! You totally deserve more views and subs, my good sir!
Awesomely explained! What's the song at the end? I love it, it's so optimistic!
thanks all music is original for this series
Incredible video, thank you! :)
Beautiful work.
Great, great video
There's an error/typo at 1:26 where the 2nd question repeats "Is it B?", it should read "Is it C?" or "Is it D?"
It is very well explained indeed, but I still have a certain doubt. I would like to think about entropy in terms of random variables (like the Wikipedia definition with a discrete random variable X whose possible events are the values of X), but I fail to understand the following: how do you determine the entropy of an English word, for example? What is the random variable in that case? I see that for the set of English characters we could determine the entropy according to the formula presented in the video, as X would represent the character that appears as the message, and each character has a certain probability of appearing, so we have the probability distribution. But I've seen people determining the entropy of a word by calculating the probabilities of the characters as appearance rates (ABA: 1/3 for B, 2/3 for A) and treating that as the probability distribution... If anyone could address that and shed some light, I would be extremely grateful! Thanks!
EDIT:
I just realised that, given a sequence of characters that appear one after another independently, the entropy of such a message considered as a value of a random vector would be the sum of the entropies of the random variables that make up that vector (is that right?), so it kind of makes more sense to me and seems natural (every additional character provides the same amount of information on average), but I'm still puzzled by the entropy of a given English word... Hope someone can respond to it. Also, incredibly interesting topic! And this channel is just great!
Hi. Maybe this will help. What was presented has each random letter being statistically independent of the following letter. However, what you seem to be describing is the situation where the random letters are not statistically independent, which is true for the English language.
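If it helps, here's the difference between the two calculations in code (a rough sketch, assuming the characters are drawn independently): one plugs the source distribution into Shannon's formula, the other plugs in the empirical character frequencies of a single word such as the ABA example. Same formula, different distributions, so they answer different questions.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits: H = sum p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# (1) Entropy of a known source distribution, e.g. Machine 2 from the video
source = [0.5, 0.125, 0.125, 0.25]
print(entropy(source))        # 1.75 bits per symbol

# (2) "Entropy of a word": use the empirical character frequencies
#     as the distribution, e.g. "ABA" -> P(A) = 2/3, P(B) = 1/3
word = "ABA"
freqs = [count / len(word) for count in Counter(word).values()]
print(entropy(freqs))         # ~0.918 bits per character
```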
Love your videos. Thank you.
very clear explanation! thank you
Thank you a lot, man... This clear explanation really got the point across to me.
Another way to write Shannon's equation would be with log base 1/2 of p: that, as you know, is how many times you would have to divide in two to reach a probability equal to p, i.e. how many levels of the tree we need in order to accommodate, reasonably fairly, a symbol with that probability.
Very nice explanation! thanks for the help :)
Great video!
Great video. Where can I find the cool music at the end?
+Anthony Vance Thanks, cam posts the music here (cameronmichaelmurray.bandcamp.com/album/art-of-the-problem-volume-1-gambling-with-secrets) though that song isn't yet listed
+Art of the Problem Great, thanks. I love the work of your team. Inspiringly creative.
Could anybody please explain the equivalence of the two formulas at 6:21? THANKS
This is a year late, but by using negative exponent properties we can express 1/p as p^(-1). Then, by applying the properties of logarithms, we can use the power rule to bring the -1 outside of the expression (hence the negative sign in front).
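Written out term by term (just restating the reply above in symbols), the equivalence at that timestamp is:

```latex
H = \sum_i p_i \log_2 \frac{1}{p_i}
  = \sum_i p_i \log_2 p_i^{-1}
  = \sum_i p_i \,(-1)\, \log_2 p_i
  = -\sum_i p_i \log_2 p_i
```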
At 3:49, for D, it should have been 0.25X2 instead of 0.25X4, right?
Thanks for such videos :)
At timestamp 3:49 there is a mistake: it should be 0.25(2). Not sure why it's 4. Great video, btw.
glad you found it, it's an oldie!
@@ArtOfTheProblem I had a related question about the video. Say we played the same game, but instead of ABCD we used a fair die with probability 1/6 for each face.
I use a strategy where I first ask whether the outcome is ≥ 4. If yes, I then ask whether the outcome is {4}, and if that is no I ask whether the outcome is {5}. We do this until we have a tree diagram similar to the one in your video (timestamp 3:34).
My question is whether the probability of finding the correct outcome, say {4}, remains the same as at the beginning, i.e. 1/6, or whether it drops to 1/3 after the first question?
Similar resources I found indicate that it should stay at 1/6, but I can't see why. After asking the first question, it seems to me that the probability of finding {4} should fall to 1/3.
The end goal is the same as yours in the video: to find the average number of questions using this strategy and compare it against the information entropy.
I realize my question comes a decade after the video was first posted, but I would appreciate your input if you have the time.
Thank you :)
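Not the channel owner, but here's a small sketch of the distinction I think the question is about (the question tree is my own reconstruction of the strategy described): the unconditional probability of each face stays 1/6, and that 1/6 is what weights the question counts; the 1/3 is the conditional probability of {4} once you have already learned the answer to the first question.

```python
import math

p = 1 / 6   # unconditional probability of each face of a fair die

# One reconstruction of the strategy described above: "Is it >= 4?" splits
# {1..6} into {1,2,3} and {4,5,6}; each half is then resolved with two more
# yes/no questions, so one face per half takes 2 questions and the other two take 3.
questions = {1: 2, 2: 3, 3: 3, 4: 2, 5: 3, 6: 3}

avg_questions = sum(p * q for q in questions.values())
print(avg_questions)    # ~2.667 questions on average with this strategy
print(math.log2(6))     # ~2.585 bits of entropy (the lower bound)

# Probabilities of the face {4}:
print(1 / 6)   # unconditional, before any question is asked
print(1 / 3)   # conditional on the answer "yes" to "is it >= 4?" (3 faces remain)
```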
Very good video, thanks, but I see (from my UoM physics course notes) that the natural logarithm is used instead of the base-2 logarithm: is this choice arbitrary or does it vary to suit a given situation?
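Not the channel, but as far as I know the base only changes the unit: base 2 gives bits, the natural log gives nats, and the two differ by the constant factor ln 2. A quick check with Machine 2's distribution from the video:

```python
import math

p = [0.5, 0.125, 0.125, 0.25]                 # Machine 2 from the video
H_bits = -sum(x * math.log2(x) for x in p)    # 1.75 bits
H_nats = -sum(x * math.log(x) for x in p)     # ~1.213 nats

print(H_bits, H_nats)
print(H_nats / H_bits, math.log(2))           # ratio equals ln 2 ~ 0.693
```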
Yay! New video :)
Just awesome !
Really helpful thanks
glad people are finding this video
So entropy is good for encryption but bad for communication, right?
+Immac (RockLeet) Good incite! Entropy is good for data encryption (because more guesses/questions are needed to crack it), but entropy is bad for data compression (because maximally compressed data, by definition, has the highest entropy for its size). Data compression is good for communication.
Your statement is correct when referring to compression, but certain aspects of communication like bandwidth efficiency or error-correcting codes can improve with entropy. :-)
+Immac (RockLeet) Sorry for misspelling "insight"
Why is the number of outcomes given by 1/p??
If the probability were 0.1 then the number of outcomes would be 10, and if p were 0.9 the number of outcomes would be 10/9?? I didn't get that =/
But these videos are amazing, thank you for making such a good video!
+Fernando Gonzalez - use base 2 numbers in your examples to make it clearer. If p = 0.25 then # outcomes = 4 (1 in 4 chance = 25%), if p=0.125 then # outcomes = 8 (1 in 8 chance = 12.5%)...etc.
Using the example in the video, would you be kind enough to explain what the eight outcomes are for C for example? It seems like there are still only 4 possible outcomes, and that to get 8 outcomes after dividing 1 by p, you would have to assume that all the outcomes are equally likely (ie. also p=.125).
John Hobson Here they are, the 8 possibilities: 000, 001, 010, 100, 110, 101, 011, 111. I hope this helped you. Otherwise, ask.
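Another way to see it in code (probabilities from the video): log2(1/p) is the number of yes/no questions a symbol "deserves", as if it were one of 1/p equally likely outcomes, and weighting those question counts by the probabilities gives back the 1.75 average.

```python
import math

p = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}   # Machine 2 from the video

for symbol, prob in p.items():
    outcomes = 1 / prob                  # "as if" number of equally likely outcomes
    questions = math.log2(outcomes)      # yes/no questions needed to pick one of them
    print(symbol, outcomes, questions)   # A: 2, 1   B: 8, 3   C: 8, 3   D: 4, 2

print(sum(prob * math.log2(1 / prob) for prob in p.values()))   # 1.75 bits
```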
It was a great explanation. Thanks
There is an arithmetical error at 3:47 (should be p_D*2)
shoot. good catch I'll add annotation
There is a typo at 3:52, the last term in #bounces should be .25*2. If this is changed the #bounces=1.75.
And great video BTW!
Hey, didn't we leave out the probability of CD for the first machine? The first question was "is it AB?", and then you continued only with Yes as the answer. What if it was No?
I don't get it, if we need to ask 1.75 questions to guess the next symbol of machine 2, doesn't it mean it produces MORE information?
+resistancefighter888 I think of it as more information meaning more questions needed to sort through it
Thanks a lot!
You're seeing it the other way around. Imagine that persons A and B each have a certain amount of information. To get all the information from A, you have to ask them a minimum of 100 questions. To get all the information from B, you have to ask just a single question. Does A have more or less information than B?
It's not a very good analogy, but maybe it can make you understand better :D
"what path would would the other guard show me had I asked him for the path?" And both indicate to the same path(the wrong one) which leaves you with the correct one
Fantastic! Thanks you
I am in
That's because every question we ask doubles the number of outcomes we can distinguish (since we are asking yes/no questions). Look at the tree of machine 1: it always has 4 outcomes, so we have to ask log2(4) = 2 questions each time. However, in the tree of machine 2 you can stop with fewer questions: if the letter is 'A' you'll end with 2 outcomes (the letter 'A' or the rest of the tree), so log2(2) = 1, just 1 question.
wow what a video :o thank you
5:49 Visualize the binary tree in your mind; then you can see what the "#outcomes" variable means.
Thanks a lot, this was very useful!
Very good
Why does #outcomes equal 1/p?
Great video, thank you. This makes me realize how stupid school is. They make beautiful things seem so dull and complex.
Recommendable
very nice.
Why don't we ask those two questions from machine 1 for machine 2?
Because you want your series of questions to reflect the probabilities produced by each machine
Machine 1: 25% A, 25% B, 25% C, 25% D
BUT Machine 2: 50% A, 12.5% B, 12.5% C, 25% D
Because you know more about machine 2. So you would be asking more questions than you have to.
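A quick numerical illustration of that (my own sketch, using the probabilities above): reusing Machine 1's two fixed questions on Machine 2 always costs exactly 2 questions, while the tree matched to Machine 2's probabilities averages only 1.75.

```python
p = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}   # Machine 2

# Questions per symbol under Machine 1's strategy vs. the tree matched to Machine 2
uniform_tree = {"A": 2, "B": 2, "C": 2, "D": 2}
matched_tree = {"A": 1, "B": 3, "C": 3, "D": 2}

for name, tree in (("uniform tree", uniform_tree), ("matched tree", matched_tree)):
    print(name, sum(p[s] * tree[s] for s in p))   # 2.0 vs. 1.75
```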
Wonderful!
When you ask yourself what entropy is in information theory and get the point instantly. Nice work!
I've heard somewhere that randomness is the most compact way to store information because it lacks pattern
+Julian Goulette that makes no sense
thats dumb m8
Well, if we define "entropy" as disorder, as in physics, then it would seem the more "entropy" the more "information". So, if physicists are concerned that it might be possible to "lose information" in a black hole (for instance - something they nearly all seem to find abhorrent) they are fearing disorder might become more ordered. If randomness is a sort of ultimate disorder, then it must produce the most "information". Retrieving that information is another subject.
This is indeed art, but the tearing of the page in the end was painful. : )
You are a god thank you
5:33 why?
great!
wait... what about a and d or a and c... so on?
2 years late but I think that has to be a typo. I've been puzzled on that part too.
Assuming you're talking about 1:25
I don't know about you, but I read #bounces as "hashtag bounces" and not "number of bounces". Oh look, youtube even made it a hashtag. Thanks, internet.
ahahah #internet
At 3:50 it should be ... + 0.25 x 2 and not ... + 0.25 x 4
great (y)
How the fuck do you ask 1.75 questions?
+XenoContact In the same way people have 2.5 children. It's just an average (expected) of # questions per symbol. Hence the example at the end of this video with 175 questions / 100 symbols
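A simulation makes the fractional average feel less weird (a sketch, using Machine 2's probabilities and the matched question tree from the video): over many symbols, the total number of questions divided by the number of symbols settles near 1.75.

```python
import random

random.seed(1)
N = 100_000

symbols, weights = "ABCD", [0.5, 0.125, 0.125, 0.25]    # Machine 2 from the video
questions_per_symbol = {"A": 1, "B": 3, "C": 3, "D": 2}  # matched question tree

sample = random.choices(symbols, weights=weights, k=N)
total_questions = sum(questions_per_symbol[s] for s in sample)

print(total_questions / N)   # ~1.75 questions per symbol on average
```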
This explains null and nothing. It's only clear to those who have already been confused by it for years, but not to me, someone who never thought about it and has just started to grasp its meaning. Sorry, but the explanation is so fast that it couldn't hold my attention, and this is not an art product, because it just left me confused.