By YouSum Live
00:02:00 Importance of studying change itself
00:03:34 Predicting future trajectory in AI research
00:08:01 Impact of exponentially cheaper compute power
00:10:10 Balancing structure and freedom in AI models
00:14:46 Historical analysis of Transformer architecture
00:17:29 Encoder-decoder architecture in Transformers
00:19:56 Cross-attention mechanism between decoder and encoder
00:20:10 All decoder layers attend to final encoder layer
00:20:50 Transition from sequence-to-sequence to classification labels
00:21:30 Simplifying problems for performance gains
00:22:24 Decoder-only architecture for supervised learning
00:22:59 Self-attention mechanism handling cross-attention
00:23:13 Sharing parameters between input and target sequences
00:24:03 Encoder-decoder vs. decoder-only architecture comparison
00:26:59 Revisiting assumptions in architecture design
00:33:10 Bidirectional vs. unidirectional attention necessity
00:35:16 Impact of scaling efforts on AI research
By YouSum Live
Awesome lecture and interesting insights! One small remark: the green and red lines in the Performance vs. Compute graph should probably be monotonically increasing.
I find this talk a bit unsatisfactory. He mentions how, for encoder-decoder models, the decoder only attends to the last encoder layer, and how we treat input and output separately in encoder-decoder. But that's not the point of encoder-decoder models, right? It's just that the encoder-decoder model has an intermediate encoder objective (to represent the input), that's all. The decoder attending only to the last layer, or separating input and output, is just how the original Transformer did it. Clearly it's possible to attend to layer-wise encodings instead of only last-layer encodings, just as an example. It's also possible to mimic decoder-style generation by adding new input to the encoder rather than the decoder. I would have really liked some experiments, even toy ones, because as presented it's unconvincing, specifically where he mentions a couple of times that the encoder's final layer is an information bottleneck. I mean, just attend to layer-wise embeddings if you want, or put an MLP on top of the encoder's last states. I'd argue we are putting more structure into the "decoder-only" model (by which I mean a causal-attention decoder, which is what he describes), the structure being causal attention, where we restrict the model to attend only to the past, both during training and inference, even for the part of the output that has already been generated.
Keep in mind that guys like this and Ilya are why OpenAI is what it is, not Sam Altman. I'm sure Altman is smart, but he is not the kind of genius these guys are, nor as knowledgeable.
Imagine if an actual PhD scientist sat on the OpenAI board making decisions. Instead, a rich trust-fund baby with no background somehow makes it onto the board.
I very much enjoyed giving this lecture! Here is my summary:
AI is moving so fast that it's hard to keep up. Instead of spending all our energy catching up with the latest development, we should study the change itself.
The first step is to identify and understand the dominant driving force behind the change. For AI, a single driving force stands out: exponentially cheaper compute, and the scaling of progressively more end-to-end models to leverage that compute.
However, this doesn't mean we should blindly adopt the most end-to-end approach, because such an approach is simply infeasible. Instead, we should find the “optimal” structure to add given the current level of 1) compute, 2) data, 3) learning objectives, and 4) architectures. In other words, what is the most end-to-end structure that has just started to show signs of life? Such structures are more scalable and, when scaled up, eventually outperform those with more built-in structure.
Later on, when one or more of those four factors improve (e.g. we get more compute or find a more scalable architecture), we should revisit the structures we added and remove those that hinder further scaling. Repeat this over and over.
As a community, we love adding structure but are far less keen on removing it. We need to do more cleanup.
In this lecture, I use the early history of the Transformer architecture as a running example of which structures made sense to add in the past, and why they are less relevant now.
I find comparing encoder-decoder and decoder-only architectures highly informative. For example, the encoder-decoder architecture handles input and output with separate parameters, whereas decoder-only uses shared parameters for both. Having separate parameters was natural when the Transformer was first introduced, with translation as the main evaluation task: the input is in one language and the output is in another.
Modern language models used in multiturn chat interfaces make this assumption awkward. Output in the current turn becomes the input of the next turn. Why treat them separately?
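As a toy sketch of this shared-parameter point (a single random matrix stands in for the decoder-only stack; the weights and shapes here are purely hypothetical, not the lecture's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W_enc = rng.normal(size=(d, d))  # encoder-decoder: one set of params for input...
W_dec = rng.normal(size=(d, d))  # ...and a separate set for output
W = rng.normal(size=(d, d))      # decoder-only: one shared set for both

turn1_input = rng.normal(size=(3, d))   # user's first message (toy vectors)
turn1_output = (turn1_input @ W)[-1:]   # model's "reply" token (toy)

# In a multi-turn chat, the reply is appended to the history and processed
# by the SAME weights on the next turn, so there is no input/output
# boundary to maintain.
turn2_input = np.concatenate([turn1_input, turn1_output])
assert turn2_input.shape == (4, d)
```

In the encoder-decoder design, `turn1_output` would have to be routed back through the separate encoder parameters, which is exactly the awkwardness described above.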
Going through examples like this, my hope is that you will be able to view seemingly overwhelming AI advances from a unified perspective, and from that be able to see where the field is heading. If more of us develop such a unified perspective, we can better leverage the incredible exponential driving force!
Thank you so much for giving us this lecture and enabling us to think from this new perspective!
Nice of you
very nice explanation, thx a lot!
Amazing Content! Thank you
tks a lotssss!!
00:07 Hyung Won Chung works on large language models and training frameworks at OpenAI.
02:31 Studying change to understand future trajectory
07:00 Exponentially cheaper compute is driving AI research
09:25 Challenges in modeling human thinking for AI
13:43 AI research heavily relies on exponentially cheaper compute and associated scaling up.
15:59 Understanding the Transformer as a sequence model and its interaction mechanism
19:58 Explanation of cross attention mechanism
21:55 Decoder-only architecture simplifies sequence generation
25:41 Comparing differences between the decoder-only and encoder-decoder architectures
27:38 Hyung Won Chung discusses the evolution of language models.
31:33 Deep learning hierarchical representation learning discussed
33:34 Comparison between bidirectional and unidirectional fine-tuning for chat applications
There are so many geniuses in this field! Really amazing!
He is so well organized in his thoughts, and philosophical too. I liked the way he used the falling pen and gravity (the force) as an analogy for AI (and linear algebra).
Thanks for your excellent lecture! Regarding the "bitter lesson", I remain optimistic (as a signal processing expert) that we can push to the _left_ in that diagram to obtain comparable performance at far less compute cost. While I agree with your prescription for frontier model development, there is much work to be done pushing to the left as well. I have already seen many instances of this. Witness, for example, the multi-scale, perceptually rooted adversarial losses in the major audio codecs these days: sure, we could learn all that end to end by effectively simulating the evolution of the human ear, but we don't really have to. For me the program is (1) get the best results at the far right of your diagram (maximum compute, maximum performance), then (2) push to the left to reduce computation (by literally orders of magnitude) while maintaining comparable quality. Even distillation is an example of this. There are many. Thanks again for your stimulating talk!
Short Summary for [Stanford CS25: V4 I Hyung Won Chung of OpenAI](th-cam.com/video/orDKvo8h71o/w-d-xo.html)
"History and Future of Transformers in AI: Lessons from the Early Days | Stanford CS25 Lecture"
[00:07](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=7) Hyung Won Chung works on large language models and training frameworks at OpenAI.
[02:31](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=151) Studying change to understand future trajectory
[07:00](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=420) Exponentially cheaper compute is driving AI research
[09:25](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=565) Challenges in modeling human thinking for AI
[13:43](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=823) AI research heavily relies on exponentially cheaper compute and associated scaling up.
[15:59](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=959) Understanding the Transformer as a sequence model and its interaction mechanism
[19:58](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1198) Explanation of cross attention mechanism
[21:55](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1315) Decoder-only architecture simplifies sequence generation
[25:41](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1541) Comparing differences between the decoder-only and encoder-decoder architectures
[27:38](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1658) Hyung Won Chung discusses the evolution of language models.
[31:33](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1893) Deep learning hierarchical representation learning discussed
[33:34](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=2014) Comparison between bidirectional and unidirectional fine-tuning for chat applications
---------------------------------
Detailed Summary for [Stanford CS25: V4 I Hyung Won Chung of OpenAI](th-cam.com/video/orDKvo8h71o/w-d-xo.html)
"History and Future of Transformers in AI: Lessons from the Early Days | Stanford CS25 Lecture"
[00:07](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=7) Hyung Won Chung works on large language models and training frameworks at OpenAI.
- He has worked on various aspects of large language models, including pre-training, instruction fine-tuning, reinforcement learning with human feedback, and reasoning.
- He has also been involved in notable works such as the scaling Flan papers (e.g. Flan-T5 and Flan-PaLM) and T5X, the training framework used to train the PaLM language model.
[02:31](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=151) Studying change to understand future trajectory
- Identifying dominant driving forces behind the change
- Predicting future trajectory based on understanding driving force
[07:00](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=420) Exponentially cheaper compute is driving AI research
- Compute costs decrease every five years, leading to AI research dominance
- Machines are being taught to think in a general sense due to cost-effective computing
[09:25](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=565) Challenges in modeling human thinking for AI
- Attempting to model human thinking without understanding it is a fundamental flaw in AI approaches.
- AI research has instead focused on scaling up with weaker modeling assumptions and more data.
[13:43](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=823) AI research heavily relies on exponentially cheaper compute and associated scaling up.
- Current AI research paradigm is learning-based, allowing models to choose how they learn, which initially leads to chaos but ultimately leads to improvement with more compute.
- Upcoming focus of the discussion will be on understanding the driving force of exponentially cheaper compute, and analyzing historical decisions and structures in Transformer architecture.
[15:59](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=959) Understanding the Transformer as a sequence model and its interaction mechanism
- A Transformer is a type of sequence model that represents interactions between sequence elements using dot products.
- The Transformer encoder-decoder architecture is used for tasks like machine translation, involving encoding input sequences into dense vectors.
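The "interactions via dot products" point can be sketched in a few lines of numpy (a minimal, unprojected version of scaled dot-product attention; real Transformers add learned Q/K/V projections and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Pairwise interaction between sequence elements is just a dot product,
    # scaled and normalized into mixing weights over the value vectors.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))   # 5 tokens, dimension 16
out = attention(x, x, x)       # self-attention: Q, K, V from the same sequence
assert out.shape == (5, 16)
```

Each output row is a weighted mixture of all token vectors, with weights determined entirely by dot products between tokens.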
[19:58](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1198) Explanation of cross attention mechanism
- The decoder attends to the output of the encoder via cross-attention
- Encoder-only architecture is a simplification for specific NLP tasks like sentiment analysis
[21:55](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1315) Decoder-only architecture simplifies sequence generation
- Decoder-only architecture can be used for supervised learning by concatenating input with target
- The self-attention mechanism serves as both cross-attention and within-sequence learning, sharing parameters between input and target sequences
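The concatenation idea in the bullets above can be sketched as follows (toy token ids, hypothetical values; a real setup would tokenize text and train a Transformer on the masked loss):

```python
# Supervised pair, e.g. translation: source tokens and target tokens.
source = [101, 7, 42]        # hypothetical token ids for the input
target = [55, 9]             # hypothetical token ids for the output

# Decoder-only training: concatenate input and target into ONE sequence...
sequence = source + target

# ...and mask the loss so only target positions contribute, while the same
# shared parameters still read the source positions through self-attention.
loss_mask = [0] * len(source) + [1] * len(target)

assert sequence == [101, 7, 42, 55, 9]
assert loss_mask == [0, 0, 0, 1, 1]
```

With this framing, what the encoder-decoder model does with dedicated cross-attention is handled by ordinary self-attention over the concatenated sequence.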
[25:41](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1541) Comparing differences between the decoder-only and encoder-decoder architectures
- Every decoder layer attends to the final-layer representation of the encoder
- The encoder-decoder architecture has additional built-in structures compared to the decoder-only architecture
[27:38](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1658) Hyung Won Chung discusses the evolution of language models.
- Language models have evolved from simple translation tasks to learning broader knowledge.
- Fine-tuning pre-trained models on specific datasets can significantly improve performance.
[31:33](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=1893) Deep learning hierarchical representation learning discussed
- Different levels of information encoding in bottom and top layers of deep neural nets
- Questioning the necessity of bidirectional input attention in encoders and decoders
[33:34](th-cam.com/video/orDKvo8h71o/w-d-xo.html&t=2014) Comparison between bidirectional and unidirectional fine-tuning for chat applications
- Bidirectional fine-tuning poses engineering challenges for multi-turn chat applications requiring re-encoding at each turn.
- Unidirectional fine-tuning is more efficient as it eliminates the need for re-encoding at every turn, making it suitable for modern conversational interfaces.
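Why unidirectional attention removes the need for re-encoding can be seen in a toy NumPy sketch (identity Q/K/V projections, assumed purely for brevity): with a causal mask, appending a new chat turn leaves every earlier position's representation unchanged, which is exactly the property a KV cache exploits.

```python
import numpy as np

def causal_attention(x):
    """Single-head causal self-attention with identity projections."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
turn1 = rng.standard_normal((4, 8))   # first chat turn
turn2 = rng.standard_normal((3, 8))   # next turn, appended later

full = causal_attention(np.vstack([turn1, turn2]))
prefix = causal_attention(turn1)
# earlier positions are unaffected by later tokens: no re-encoding needed
assert np.allclose(full[:4], prefix)
```

A bidirectional encoder breaks this invariant: adding tokens changes every position's representation, forcing a full re-encode on each turn.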
Nice talk. Before this talk, I was confused about the scaling law and the design of GPT. Now I understand the source of this wonderful work.
The dung beetle uses the milky way to navigate on the ground... humans use criteria, substantially different, to perform decision-making... the point I am trying to make is that mechanisms of machine-based reasoning may be wildly different than those we are aware of (mostly human) and we should be open to the possibility of discovering new forms of reasoning which may not fit within our preconceived notions of information processing/handling.
Thanks for giving me a new direction to think in. Learnt something new
Working at OpenAI and still having a full head of hair... he's a god..
Hahahaha
Great talk, I really enjoyed the perspective and intuition.
I keep watching this. This is my fourth time.
😂 Bad learner.
@@KSSE_Engineer Completely the opposite. A good learner is someone who can repeat and apply what they've learned, and that takes longer. If someone thinks they've learned everything quickly but actually has little understanding, that's a bad learner.
Great guest lecture
Good talk, but too short. I love the premise of analysing the rate of change! Unfortunately the short time only permitted examining one particular change. It would be great to observe changes in other details over time (architectural, but also hardware and infra like FlashAttention), and more historical changes too (depth/ResNets, RNN -> Transformers), and then use this library of changes to make predictions about possible or likely future directions.
Thank you for these videos. I am learning generative AI and LLMs, and these videos are so helpful ❤
Good lecture and insights!
By YouSum Live
00:02:00 Importance of studying change itself
00:03:34 Predicting future trajectory in AI research
00:08:01 Impact of exponentially cheaper compute power
00:10:10 Balancing structure and freedom in AI models
00:14:46 Historical analysis of Transformer architecture
00:17:29 Encoder-decoder architecture in Transformers
00:19:56 Cross-attention mechanism between decoder and encoder
00:20:10 All decoder layers attend to final encoder layer
00:20:50 Transition from sequence-to-sequence to classification labels
00:21:30 Simplifying problems for performance gains
00:22:24 Decoder-only architecture for supervised learning
00:22:59 Self-attention mechanism handling cross-attention
00:23:13 Sharing parameters between input and target sequences
00:24:03 Encoder-decoder vs. decoder-only architecture comparison
00:26:59 Revisiting assumptions in architecture design
00:33:10 Bidirectional vs. unidirectional attention necessity
00:35:16 Impact of scaling efforts on AI research
🇧🇷🇧🇷🇧🇷🇧🇷👏🏻, Thanks for the info! A little complex for me, but we keep going!
Fascinating! Really curious what kinds of Q&A was exchanged in the classroom after this presentation.
Thank you for the great lecture
Thank you so much
this is lovely!🙂
Awesome lecture and interesting insights! One small remark: the green and red lines in the Performance vs. Compute graph should probably be monotonically increasing.
Amazing talk! I was wondering why the field has moved closer to decoder-only models lately and whether there's an explanation to it.
Amazing!
I keep watching this. This is my eleventh time.
Brilliant mind 👌 ❤🎉
To think Korea has talent like this!
There are plenty..
Go K-Bro!!
No PyTorch embedding?
I wonder if GPT-4o is a decoder-only model with causal (uni-directional) attention 🤔
Less structure, it is just a huge mlp
Where is that shirt from? Nice white shirt.
this guy's stats are insane 😂
You should invite Christopher Lafayette to speak
Okay, who else thought about the man named Hyung Won Chung of OpenAI. :D
👍👍
Awesome
I find this talk a bit unsatisfactory. He mentions how, in encoder-decoder models, the decoder only attends to the last encoder layer, and how we treat input and output separately in the encoder-decoder. However, that's not the point of encoder-decoder models at all, right? It's just that the encoder-decoder model has an intermediate encoder objective (to represent the input), that's all.
The decoder attending only to the last layer, or separating input and output, is just how the original Transformer did it. Clearly it's possible to attend to layer-wise encodings instead of only the last layer's, just as an example. It's also possible to mimic decoder-style generation by adding new input to the encoder rather than the decoder. I would have really liked some experiments, even toy ones, because as it stands it's incredibly unconvincing. Specifically, he mentions a couple of times that the encoder's final layer is an information bottleneck, but you could just attend to layer-wise embeddings if you wanted, or put an MLP on top of the encoder's last states.
I'd argue we are putting more structure into the "decoder-only" model (by which I mean a causal-attention decoder, which is what he describes). The reason is causal attention, where we restrict the model to attend only to the past, both during training and inference, even for the part of the output that has already been generated.
Wow, he's insanely good-looking....
Keep in mind that guys like this and Ilya are why OpenAI is what it is, not Sam Altman. I'm sure Altman is smart, but he is not the kind of genius, or as knowledgeable, as these guys are.
Look at his portfolio and talk again. You’re tunnel visioning hard af. You can’t compare the two skill sets the way you do
What’s your favorite part?
Yep. I’m lost now lol
Imagine if actual PhD scientists sat on the OpenAI board making decisions. Instead a rich trust-fund baby with no background somehow makes it onto the board.
Nice talk, but physical analogies such as 6:30 are... rather naive and high school level. He should have focused only on AI details.
I found him intellectually sexy 😊
i have no idea who he is
buy nvda!
No idea what he is talking about.
he's talking about developing AI with human thinking capabilities, covering past, current, and future developments..
show us the git ffs