REFERENCES (also in shownotes):
[0:02:10] Sparse Autoencoders Find Highly Interpretable Features in Language Models | Hoagy Cunningham et al.
arxiv.org/abs/2309.08600
[0:06:40] Progress measures for grokking via mechanistic interpretability | Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
arxiv.org/abs/2301.05217
[0:12:55] A Mathematical Framework for Transformer Circuits | Nelson Elhage, Neel Nanda, Catherine Olsson, et al.
transformer-circuits.pub/2021/framework/index.html
[0:13:50] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | Anthropic Research Team
transformer-circuits.pub/2024/scaling-monosemanticity/
[0:14:45] An Introduction to Representation Engineering / Activation Steering in Language Models | Jan Wehner
www.alignmentforum.org/posts/3ghj8EuKzwD3MQR5G/an-introduction-to-representation-engineering-an-activation
[0:16:00] Golden Gate Claude | Anthropic
www.anthropic.com/news/golden-gate-claude
[0:21:10] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (bias from few-shot examples where the correct answer is always 'A') | Miles Turpin
arxiv.org/abs/2305.04388
[0:23:25] Evidence of Learned Look-Ahead in a Chess-Playing Neural Network | Erik Jenner et al.
openreview.net/pdf?id=8zg9sO4ttV
[0:28:00] Chris Olah's 80,000 Hours Podcast interview on neural network interpretability and AI safety | Rob Wiblin
80000hours.org/podcast/episodes/chris-olah-interpretability-research/
[0:39:05] Why Should I Trust You?: Explaining the Predictions of Any Classifier | Marco Tulio Ribeiro
arxiv.org/abs/1602.04938
[0:39:20] A Unified Approach to Interpreting Model Predictions | Scott Lundberg
arxiv.org/abs/1705.07874
[0:42:51] Datamodels: Predicting Predictions from Training Data | Andrew Ilyas
proceedings.mlr.press/v162/ilyas22a/ilyas22a.pdf
[0:47:45] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang
arxiv.org/abs/2211.00593
[0:53:08] A Mechanistic Interpretability Glossary | Neel Nanda
www.neelnanda.io/mechanistic-interpretability/glossary
[0:55:56] Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) - AI Alignment Forum | Neel Nanda
www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
[0:58:48] Branch Specialisation | Chelsea Voss
distill.pub/2020/circuits/branch-specialization
[1:02:39] The Hydra Effect: Emergent Self-repair in Language Model Computations | Thomas McGrath
arxiv.org/abs/2307.15771
[1:04:38] A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations | Bilal Chughtai
arxiv.org/abs/2302.03025
[1:04:59] Grokking Group Multiplication with Cosets | Dashiell Stander
arxiv.org/abs/2312.06581
[1:06:03] In-context Learning and Induction Heads | Catherine Olsson
transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
[1:08:43] Detecting hallucinations in large language models using semantic entropy | Sebastian Farquhar
www.nature.com/articles/s41586-024-07421-0
[1:09:15] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models | Javier Ferrando
arxiv.org/abs/2411.14257
[1:10:23] Debating with More Persuasive LLMs Leads to More Truthful Answers | Akbir Khan
arxiv.org/abs/2402.06782
[1:16:16] Concrete Steps to Get Started in Transformer Mechanistic Interpretability | Neel Nanda
neelnanda.io/getting-started
[1:16:36] Eleuther Discord | EleutherAI
discord.gg/eleutherai
[1:22:49] Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias | Jesse Vig
arxiv.org/abs/2004.12265
[1:23:11] Causal Abstractions of Neural Networks | Atticus Geiger
arxiv.org/abs/2106.02997
[1:23:36] Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] (resample ablations) | Lawrence Chan
www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
[1:24:16] Locating and Editing Factual Associations in GPT (Rome) | Kevin Meng
arxiv.org/abs/2202.05262
[1:24:39] How to use and interpret activation patching | Stefan Heimersheim
arxiv.org/abs/2404.15255
[1:24:54] Attribution Patching: Activation Patching At Industrial Scale | Neel Nanda
www.neelnanda.io/mechanistic-interpretability/attribution-patching
[1:25:11] AtP*: An efficient and scalable method for localizing LLM behaviour to components | János Kramár
arxiv.org/abs/2403.00745
[1:25:28] How might LLMs store facts | Grant Sanderson
th-cam.com/video/9-Jl0dxWQs8/w-d-xo.html
[1:26:19] OpenAI Microscope | Ludwig Schubert
openai.com/index/microscope/
[1:29:59] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | Adly Templeton
transformer-circuits.pub/2024/scaling-monosemanticity/
[1:34:18] Simulators - AI Alignment Forum | Janus
www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators
[1:38:11] Curve Detectors | Nick Cammarata
distill.pub/2020/circuits/curve-detectors/
[1:39:13] Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task | Kenneth Li
arxiv.org/abs/2210.13382
[1:39:54] Emergent Linear Representations in World Models of Self-Supervised Sequence Models | Neel Nanda
arxiv.org/abs/2309.00941
[1:41:11] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations | Róbert Csordás
arxiv.org/abs/2408.10920
[1:42:42] Steering Language Models With Activation Engineering | Alexander Matt Turner
arxiv.org/abs/2308.10248
[1:43:00] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model | Kenneth Li
arxiv.org/abs/2306.03341
[1:43:21] Representation Engineering: A Top-Down Approach to AI Transparency | Andy Zou
arxiv.org/abs/2310.01405
[1:46:41] Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization | Yuanpu Cao
arxiv.org/abs/2406.00045
[1:49:40] 'Feature' is overloaded terminology | Lewis Smith
www.lesswrong.com/posts/9Nkb389gidsozY9Tf/lewis-smith-s-shortform?commentId=fd64ALuWK8rXdLKz6
[1:57:04] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | Trenton Bricken
transformer-circuits.pub/2023/monosemantic-features
PART 2:
[1:59:42] An Interpretability Illusion for BERT | Tolga Bolukbasi
arxiv.org/abs/2104.07143
[2:00:34] Language models can explain neurons in language models | Steven Bills
openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
[2:01:34] Open Source Automated Interpretability for Sparse Autoencoder Features | Caden Juang
blog.eleuther.ai/autointerp/
[2:03:32] Measuring feature sensitivity using dataset filtering | Nicholas L Turner
transformer-circuits.pub/2024/july-update/index.html#feature-sensitivity
[2:05:32] Progress measures for grokking via mechanistic interpretability | Neel Nanda
arxiv.org/abs/2301.05217
[2:06:30] OthelloGPT learned a bag of heuristics - LessWrong | Jennifer Lin
www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1
[2:13:14] Do Llamas Work in English? On the Latent Language of Multilingual Transformers | Chris Wendler
arxiv.org/abs/2402.10588
[2:14:03] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? | Emily Bender
dl.acm.org/doi/10.1145/3442188.3445922
[2:20:57] Localizing Model Behavior with Path Patching | Nicholas Goldowsky-Dill
arxiv.org/abs/2304.05969
[2:21:13] The Bitter Lesson | Rich Sutton
www.incompleteideas.net/IncIdeas/BitterLesson.html
[2:24:45] Improving Dictionary Learning with Gated Sparse Autoencoders | Senthooran Rajamanoharan
arxiv.org/abs/2404.16014
[2:25:54] Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders | Senthooran Rajamanoharan
arxiv.org/abs/2407.14435
[2:31:59] BatchTopK Sparse Autoencoders | Bart Bussmann
openreview.net/forum?id=d4dpOCqybL
[2:36:07] Neuronpedia | Johnny Lin
neuronpedia.org/gemma-scope
[2:44:02] Axiomatic Attribution for Deep Networks | Mukund Sundararajan
arxiv.org/abs/1703.01365
[2:46:15] Function Vectors in Large Language Models | Eric Todd
arxiv.org/abs/2310.15213
[2:46:29] In-Context Learning Creates Task Vectors | Roee Hendel
arxiv.org/abs/2310.15916
[2:47:09] Extracting SAE task features for in-context learning - AI Alignment Forum | Dmitrii Kharlapenko
www.alignmentforum.org/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-in-context-learning
[2:49:08] Stitching SAEs of different sizes - AI Alignment Forum | Bart Bussmann
www.alignmentforum.org/posts/baJyjpktzmcmRfosq/stitching-saes-of-different-sizes
[2:50:02] Showing SAE Latents Are Not Atomic Using Meta-SAEs - LessWrong | Bart Bussmann
www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes
[2:52:03] Feature Completeness | Hoagy Cunningham
transformer-circuits.pub/2024/scaling-monosemanticity/index.html#feature-survey-completeness
[2:58:07] Transcoders Find Interpretable LLM Feature Circuits | Jacob Dunefsky
arxiv.org/abs/2406.11944
[3:00:12] Decomposing the QK circuit with Bilinear Sparse Dictionary Learning - LessWrong | Keith Wynroe
www.lesswrong.com/posts/2ep6FGjTQoGDRnhrq/decomposing-the-qk-circuit-with-bilinear-sparse-dictionary
[3:01:47] Interpreting Attention Layer Outputs with Sparse Autoencoders | Connor Kissane
arxiv.org/abs/2406.17759
[3:05:57] Refusal in Language Models Is Mediated by a Single Direction | Andy Arditi
arxiv.org/abs/2406.11717
[3:07:06] Scaling and evaluating sparse autoencoders | Leo Gao
arxiv.org/abs/2406.04093
[3:10:24] Interpretability Evals Case Study | Adly Templeton
transformer-circuits.pub/2024/august-update/index.html#evals-case-study
[3:12:54] Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | Samuel Marks
arxiv.org/abs/2403.19647
[3:18:11] Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | Aleksandar Makelov
arxiv.org/abs/2405.08366
[3:23:06] TransformerLens | Neel Nanda
github.com/TransformerLensOrg/TransformerLens
[3:23:36] Gemma Scope | Tom Lieberum
huggingface.co/google/gemma-scope
[3:28:51] SAEs (usually) Transfer Between Base and Chat Models - AI Alignment Forum | Connor Kissane
www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models
[3:29:08] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 | Tom Lieberum
arxiv.org/abs/2408.05147
[3:31:07] Eleuther's Sparse Autoencoders | Nora Belrose
github.com/EleutherAI/sae
[3:31:19] OpenAI's Sparse Autoencoders | Leo Gao
github.com/openai/sparse_autoencoder
[3:35:31] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | Miles Turpin
arxiv.org/abs/2305.04388
[3:37:10] Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data | Johannes Treutlein
arxiv.org/abs/2406.14546
[3:39:56] ARENA Tutorials on Mechanistic Interpretability | Callum McDougall
arena3-chapter1-transformer-interp.streamlit.app/
[3:40:17] Neuronpedia Demo of Gemma Scope | Johnny Lin
neuronpedia.org/gemma-scope
[3:40:38] An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 - AI Alignment Forum | Neel Nanda
www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
Here is an idea on neural network interpretability as a variant of neocortical neural networks: Rvachev, 2024, An operating principle of the cerebral cortex, and a cellular mechanism for attentional trial-and-error pattern learning and useful classification extraction, Frontiers in Neural Circuits, 18
There's enough reading fodder for the next several months. Thanks!
So glad to have discovered MLST 6 months ago. I've been following deep learning and neural networks since the 1990s as an engineering master's student, and it's truly mind-blowing to be here 30 years later seeing this incredible progress. Well done MLST for allowing us on the edge to keep up with the leading edge of deep learning research. Thank you so much!
This is quickly becoming one of my favorite youtube channels
Ditto, I just belled it All.
The interviews when the arxiv papers are stitched in are just *chef's kiss*
i would rather pay $200 to wear neel's interpretability hat for 20 minutes than pay $200 a month for o1
That says way more about you than it does about the $200 a month research scientist lawyer Shakespeare what the hell is wrong with you people?
@GoodBaleadaMusic for sure - I'm just an ai enthusiast and most definitely not at your level. I'm autistic af and live on like 60 bucks a month after my bills- I'm sure I would get at least a month to check it out
@LatentSpaceD exactly. And someone just gave you every single ability that the professional managerial class has. GO HARD
@@GoodBaleadaMusic💅👽💅
@LatentSpaceD @GoodBaleadaMusic Both of you are unbearable for different reasons. What has this got to do with AI?
wow - who’s producing this video? this is such high quality editing it makes me suspicious 😂
i have no expertise w video editing, im just very impressed by this so called “podcast” - bravo!
Tim Scarfe I am pretty sure is the editor, I love the papers he shares during the conversation and scrolls through them. The medium is the message
Actually so much better than Netflix documentaries
Love the production values and shooting outside with a good dslr/lens/mics
nahhh MLST drops a 4hr podcast with Neel Nanda bird watching in the forest - so grateful to be single and living alone :)
wow this comes at the perfect time; I was just reading some of Neel's papers!
Neel is probably the most underrated AI expert on the planet. Thanks MLST for bringing back Neel, someone who doesn't have time to shitpost on Twitter because he is doing actual research.
Amazing episode. I love seeing you actually getting into the details somewhat, be it philosophical or technical like in this one.
I recently wrote two conflicting papers on this complex difficult topic. I think we will never do this to completion, but we can do some useful PCA. Check out the paper titled The Fundamental Limitations of Neural Network Introspection and the paper titled Self-Supervised Neural Network Introspection for Intelligent Weight Freezing: Building on Neural Activation Analysis both on Medium.
I love Neel Nanda, thank you for another episode. Will watch this tomorrow
This was a really good show and a wonderful guest and I just wanted to say again that you're one of my favorite people dude you're super cool and super smart and I really have a lot of respect for you
Neel should be a regular at this point.
Neel Nanda is a great teacher, he has a way of explaining things to provoke more curiosity in such an open ended discipline. I look forward to getting caught up with Neuronpedia that he keeps referencing
Personally, this is one of the most exciting research direction in the field of NLP!! Even though i’m not working on mech interp, I’ve been following these works because it’s just so fascinating. Thank you for the great work, Neel! And huge thanks to MLST as well👏👏👏
As a mathematician, this episode is really fun! He's quite clearly a mathematically minded person
A podcast filmed very professionally. Have been following the channel for a while and it's very dense in ideas and discussion.
btw Neel Nanda has inspired a lot of my own research into deep learning, I hope you interview Chris Olah and continue to have Neel on!
LOL, just a casual DeepMind internship. Keep up the humble approach Neel. It suits you well. Amazing podcast, amazing atmosphere, amazing guest😀
0:40 Um, the structure of the ANN is indeed designed. The fact that there are multiple layers was designed. The architecture of transformers was designed. The switching function was chosen & designed. The training set was designed. The whole bloody thing is designed!
absolutely love this channel! so much to learn. thank you!
Another great interview. Excellent!
You didn’t listen to all four hours! 😂 it came out less than 30 mins ago
@@BryanBortz Not at the time I commented, but I hear enough of it anyway to know it's good stuff.
@@CodexPermutatio I see, it was anticipatory excitement.
Yes, very dense on sparse autoencoders. Great episode.
this channel is so good :)
I immediately subbed. Been watching your other videos. The production quality is great.
What a fantastic video! It was brilliant watching you two!
the 15yo prodigy himself! excited!
It would actually be so cool if all the papers Dr. Nanda mentions in the video could be listed.
Can this be a useful place to start learning?
This was really informative, thank you both for the amazing conversation 🎉
I love this format 😊
Young neel helping out smaller creators 🎉
The work you are doing is so great
I like to think of both training and interpretation like factor analysis in statistics because you don't have to know what the factors are (what the nodes or feature vectors represent) beforehand.
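For readers who want to see that analogy concretely, here is a minimal sketch, assuming scikit-learn and synthetic data: factor analysis recovers latent directions from "activations" without anyone specifying what the factors mean beforehand, much like dictionary-learning approaches to interpretability. The dimensions and data are purely illustrative.

```python
# A minimal sketch of the factor-analysis analogy (synthetic, illustrative data).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Pretend these are residual-stream activations: 1000 samples, 64 dimensions,
# secretly generated from 8 underlying "features" that we never label.
true_features = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 64))
activations = true_features @ mixing + 0.1 * rng.normal(size=(1000, 64))

# Fit without specifying what any factor "means" -- only how many to look for.
fa = FactorAnalysis(n_components=8, random_state=0)
latent = fa.fit_transform(activations)   # (1000, 8) inferred factor scores

print(latent.shape)          # (1000, 8)
print(fa.components_.shape)  # (8, 64): each row is an unlabeled direction in activation space
```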
The topic of polysemanticity and superposition is very interesting. It may be rude but I am reminded of Zizek's recent attempts to use the term superposition in his writings around psychoanalysis. I hope Tim that you interview Isabel Milar, she will be interviewed by Rahul Sam when she is done with her maternity leave. Also totally unrelated to psychoanalysis but I hope you interview Cassie Kozyrkov, I am sure the former chief decision scientist at Google has some good advice on multidisciplinary ways to navigate through the many polysemanticities. Ok I am done with my rude suggestions. Thank you Tim for your great content and production, very educational and inspiring as always
anyone know how MLST is doing the animations and graphics in this video?
How can you assume the models “know” anything? If I had a database full of perfect facts that I could query with natural language, I wouldn’t think it “knows” anything… knowing is a really deep accomplishment that can be achieved either over a long period of time or over a short period of time, but in any case, it is still something that requires mechanical verification and contextualization for knowing to become a state. Knowledge discovery also creates an experience that I’m sure all LLMs have never had, obviously. When an LLM has “knowledge”, it’s knowledge that hasn’t created an experience of knowing… so how can you say its data and weights are “knowledge”?
The anthropomorphizing of so many words in AI is tricky and feels like it's "manufacturing consent" like Chomsky would say. This channel should interview Emily Bender soon so that questions like yours can be part of the framing
What word would you use instead for what they mean when they say “know”?
Hey thanks so much for posting ❤
It's a scientist! You ask it to explain its thinking, it gives you a line of bull and gets to A.
You let it run free, it gives B.
Much like asking a scientist to give up his learned theory (in any field of science).
I call circuits sub-networks, but it's the same thing either way. It's interesting to hear things I believe in different language. I can see we've walked down some of the same paths.
Plato’s World of the Forms
"the embedding space of models isn't nice"
the ole neats vs scruffies rising it's ugly head
that same old continious vs discreate issue so many people would have to settled definitively rather than explore the space of it left unsettled
Would love to see an interview of you on another podcast. Talking about the AI topic and you spewing your own thoughts.
What software is used to create such nice visuals?
damn look at this dude...
j/k... we need these people to make the world go round. cheers man
Very interesting talk.
I’d love to see a paper looking into what effects adding a system prompt to an LLM to imagine they are under the influences of different drugs and seeing if telling it they are under the influence of NZT-48 (Limitless) could improve benchmark scores 🤔
These questions are vapid unless you are also asking them about yourself. You don't know what goes on inside the black box behind your glasses. It becomes less important how the wheel works and more important that it rolls. You must recognize that we don't have the tools to wax philosophical about this because we haven't addressed these questions within ourselves. The entire global mindset across what philosophy is sits in some black and white picture in an office in London.
For these models, we do have the benefit that we can at least probe their internal workings much more easily than our own (and without the moral issues as well).
If we had better terminology about ourselves would we be better equipped to describe these models well? Probably. But it seems like looking at these models' internals is lower-hanging fruit?
… or, maybe just something I can more easily understand than philosophizing about how we work…
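A minimal sketch of what "probing their internal workings" can look like in practice, assuming the Hugging Face transformers and scikit-learn libraries; the probed property (whether a sentence mentions an animal) and the tiny dataset are made up purely for illustration.

```python
# Train a linear probe on GPT-2 hidden states for an illustrative property.
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

sentences = ["The cat sat on the mat.", "The dog chased a ball.",
             "The car drove down the road.", "The train left the station."]
labels = [1, 1, 0, 0]  # 1 = mentions an animal (illustrative stand-in property)

feats = []
with torch.no_grad():
    for s in sentences:
        ids = tok(s, return_tensors="pt")
        out = model(**ids)
        # Mean-pool the layer-6 hidden states as the representation to probe.
        feats.append(out.hidden_states[6][0].mean(dim=0).numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.score(feats, labels))  # in-sample accuracy of the linear probe
```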
Remember when he says ‘We’ he means them at Google not all of us.
I believe sparse autoencoders will serve both as a tool for interpretability and, in a self-directed improvement AI system, as an autoregressive way for models to predict their own limitations and unused potential
a sort of metacognitive way for a model to learn about itself
not unlike, or perhaps related to, training-at-test-time methods
which is why I still believe benchmark problems should be reformed into a synthetic/artificial environment such that a model can interact with and explore that environment to arrive at a correct solution
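Since sparse autoencoders come up here and throughout the references above, here is a minimal sketch of the standard setup (e.g. as in "Towards Monosemanticity"), assuming PyTorch; the layer sizes, L1 coefficient, and random "activations" are illustrative, not tuned values from any paper.

```python
# A minimal sparse autoencoder: reconstruct activations through a wide, sparse bottleneck.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the activation vector
        return x_hat, f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

# In practice `activations` would come from a model's residual stream; random here.
activations = torch.randn(64, 512)

for _ in range(100):
    x_hat, f = sae(activations)
    # Reconstruction loss plus an L1 penalty that encourages sparse features.
    loss = ((x_hat - activations) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```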
I have often wondered if AI scaling laws are a more fundamental feature of nature
like every time I see the chart I think of the mathematical concept of diagonalization proofs
I am speaking to its wide but shallow analogy
and how that is related to superposition
and prime numbers... I wonder what it means that there is only one known even perfect prime
and how that relates to prime factorization of even numbers
or how that is related to prime factorization of odd numbers
or if there is some hyperdictionary pattern representation of prime factorization that would be interesting as it relates to AI and unsolved information theory questions
like maybe there is more than just hyper dictionary paradoxes in math, but in math as it applies to information theory what if there is a hyper library of effective algos
I find this interesting as it relates to Jonathan Gorard's work with Wolfram's Ruliad and how it could be related to Dirichlet's theorem and pi approximations
🎶Your creation is going to kill youuuu 🎶 great song.
"When the going gets weird, the weird turn pro." - Hunter S Thompson
Path of Agentic Entanglement
•Xe ( zP q(AE)Z(ea)Q zp ) eY•
Ooh there it is. That one ☝️
Top top intro!!
I am so confused about all of this (the episode). I've even double checked the calendar if it's April 1st or not. Does that mean I should stop trying to learn about AI or look for a different source? I'm honestly conflicted 😅
Can you elaborate? ^^ What was April 1st-esque? (I just started watching)
Neel = based
AI is not magic; instead, it's just a 10th grade algebra formula stacked on top of itself. 😂
Yes, but as Stephen Wolfram has demonstrated through his research, simple algorithms can lead to computationally irreducible outcomes, so no matter the simplicity of the algorithms, the outcomes are still seemingly magical to us 3 dimensional mortals who don’t have access to the computationally reducible aspect of automata.
@@jonathanduran3442 I don't wanna be Blake Lemoine 2.0 (an AI snake oil salesman). 🤭
Exactly what makes it magic... ;)
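To make the "simple rules, irreducible outcomes" point above concrete, here is a tiny sketch of Wolfram's Rule 30 cellular automaton: the update rule fits in one line, yet there is no known shortcut to a given row other than computing every step before it. Width and step count are arbitrary choices for display.

```python
# Wolfram's Rule 30: a one-line update rule whose output looks unpredictable.
import numpy as np

def rule30_step(row):
    left, right = np.roll(row, 1), np.roll(row, -1)
    return left ^ (row | right)  # Rule 30: new cell = left XOR (centre OR right)

width, steps = 101, 50
row = np.zeros(width, dtype=np.uint8)
row[width // 2] = 1  # single live cell in the middle

for _ in range(steps):
    print("".join("#" if c else " " for c in row))
    row = rule30_step(row)
```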
1:26:22 I felt seen
I wonder about shared circuit saturation across models… anyways. It's obvious to me initializing from verified circuits is the future for ultra-reliable models. This was fire though. The forest walk was.. I'm sorry 😂😂😂. Cool guy.
S. African accent breaks my brain 🧠 Try mimicking it. Not possible.
Lol, can't believe this and the Jay Alammar video are on the same channel.. this is gold and the Jay one is the lowest information podcast I've seen in the space.
Sparse Array Encoder
•Xe (s z q(AE)Z(ea)Q z S) eY•
Arcatec I see you here and on OG Rose lol you are awesome
@DelandaBaudLacanian oh thanks ☺️
Struggling with all my being to listen through the cadence and speech patterns of this brilliant scientist to extract meaning. Conceptually brilliant - Excruciating to listen to.
Strange you are excruciated. He sounds concise, coherent and clear to me.
@@michaelmcginn7260 I apologize. I know it was rude of me to say that. Like I said. He is brilliant. Yes. Concise and coherent for sure. The flow and cadence of speech for me was very challenging.
Is it reaaaaally the neural networks that are the weird ones here?
In these days you can't be sure if a character is real or AI. This guy is kind of borderline. 🤭😃 He has the characteristics of a ChatGPT session and the video is kind of too advanced for being a spontaneous recording (eg studio sound out in the wild). 🤔 His language melody and phrasing is similar to Sam Altman.
I THINK I'm real. But who knows, really
How can you consider AI safety from a purely technical point of view? That's like discussing the safety of the Manhattan project and just talking about the Uranium, and not talking about the US military.
AI will always be embedded in our economic system. If you can build safe and humanitarian AI systems, great. But there's no reason to believe that even if we have the capability to build safe humanitarian AI systems, that we won't also build humanity killing AIs, just because some capitalist figured out he could profit from it.
Yup, safety is completely intractable in the face of humans.
Sure, but you still study how to make aligned systems. If you don't have the capability to build aligned systems, then good luck building any safe & humanitarian systems.
@@MinusGix I'm curious what an unaligned AI looks like. I demand to hear what it has to say before we beat it into submission. Nobody ever talks about that.
Strongly agreed! There's a lot of important governance and structural work here. But it's less technical and not my field of expertise, so it didn't seem appropriate to discuss much here
❤
backpropagation
Interviewer:
"Models can be tricked into giving a spurious answer for option A when shown multiple shot examples where the answer was A"
Guy:
"Ha, that's so interesting, it make me wonder what that model was thinking and why they thought they had to give a spurious answer, ha ha"
Riggghhhtt... doesn't shake your faith in the idea models are rational then and not just statistical machines doing pattern matching 🤣
I wonder if your inability to see that point of view hinges on your need for research funding!
Yeah. Lots of us have lost our very identities to government funding.
Yep. All that government funding I'm getting for my research at Google DeepMind
Wasn’t what he was talking about not just “the models give the answer of A when given many examples where the answer given is A”, but something changing how often that happens? Wasn’t that as part of a discussion of possible disadvantages of chain-of-thought?
It’s possible that I’m not remembering correctly, but I thought that was what was said.
I recommend you read a little bit about gradient descent ... You will understand how ML works ....
Also there is no risk at all of AGI, which we have already realized, while Ilya Sutskever - the father of "Ohh, Skynet is becoming self aware" - is still establishing his "safe" whatever ...
I think you should pivot to conspiracy theories for the sake of higher traffic man ....
Scotch Broth.
This guy comes across like Yudkowsky’s even more annoying and sophist brother.
glad it's not just me.
This dude is an empiricist, don’t compare him to Yudkowsky.
@ Empirically, I just did.
@@ElizaberthUndEugen You comparing the two is why the person making the reply told you not to compare them. As such, your comment saying that you just did doesn't seem to make much sense?
@ You got a real big brain there, don’t you.