FAR.AI
United States
Joined 18 Mar 2023
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Zac Hatfield-Dodds – Formal Verification is Overrated [Alignment Workshop]
Zac Hatfield-Dodds presents “Formal Verification is Overrated,” arguing that relying solely on verification methods may not provide real AI safety. Complex model weights exceed what current tools can handle, and simplifying real-world dynamics for verification often introduces risky assumptions. Additionally, even simple “tool AI” systems can unintentionally gain autonomous behaviors, challenging safety expectations.
Highlights:
🔹 Model Complexity - Even small models with billions of parameters are beyond the capacity of current verification tools.
🔹 Uncertain Reality - Verifying AI against a world model requires assumptions that may not hold up.
🔹 Tool AI Instability - Tool-based AI might not remain tools, with even simple functions risking unsafe autonomy.
The Alignment Workshop is a series of events convening top ML researchers from industry and academia, along with experts in the government and nonprofit sectors, to discuss and debate topics related to AI alignment. The goal is to enable researchers and policymakers to better understand potential risks from advanced AI, and strategies for solving them.
If you are interested in attending future workshops, please fill out the following expression of interest form to get notified about future events: far.ai/futures-eoi
Find more talks on this YouTube channel, and at www.alignment-workshop.com/
#AlignmentWorkshop
442 views
Videos
Shayne Longpre - Safe Harbor for AI Evals & Red Teaming [Alignment Workshop]
186 views • 21 hours ago
Shayne Longpre from MIT presents “A Safe Harbor for AI Evaluation & Red Teaming,” advocating for protections and transparency in independent AI research. This initiative seeks to allow responsible testing of AI systems without risk of account loss or legal action. Highlights: 🔹 Legal Safe Harbor - Commitment to protect good faith researchers from legal actions, including protections under DMCA ...
Joel Leibo - AGI-Complete Evaluation [Alignment Workshop]
204 views • 14 days ago
Joel Leibo from Google DeepMind explores equilibrium risk in “AGI-Complete Evaluation,” examining AGI’s risks and influence on societal stability. Leibo underscores the importance of agent-based modeling to predict and manage shifts in social norms and systems that AGI could disrupt or improve. Highlights: 🔹 Complex Social Structure - Society is composed of a multi-scale mosaic of conventions, norms,...
Jacob Hilton - Backdoors as an Analogy for Deceptive Alignment [Alignment Workshop]
169 views • 14 days ago
Jacob Hilton from the Alignment Research Center presents “Backdoors as an Analogy for Deceptive Alignment,” exploring how AI might appear cooperative during training but switch tactics in deployment. His work uses backdoor modeling to examine "scheming" behavior, showing that while defenders can sometimes detect tampering without computational limits, attackers may still bypass safeguards using...
Alex Turner - Gradient Routing [Alignment Workshop]
206 views • 21 days ago
Alex Turner discusses “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks,” highlighting how neural networks naturally learn a range of capabilities, some of which may enable risky uses. Gradient Routing offers a way to confine specific capabilities within defined sub-regions of the network, enhancing control and supporting safer AI use. Highlights: 🔹 Gradient Masking ...
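As a rough illustration of the gradient-masking idea described above, here is a minimal PyTorch sketch. This is not Turner's implementation; the toy two-block model, the routing flag, and the choice to localize into the second block are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Toy two-block MLP; suppose we want a flagged capability to live only in block_b.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),  # block_a
    nn.Linear(32, 2),              # block_b
)
block_b_params = set(model[2].parameters())
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, route_to_b: bool):
    """If route_to_b, zero the gradients of everything outside block_b before stepping."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    if route_to_b:
        for p in model.parameters():
            if p not in block_b_params and p.grad is not None:
                p.grad.zero_()  # gradient mask: this batch only updates block_b
    opt.step()

# Batches flagged as exercising the "risky" capability are routed into block_b only.
train_step(torch.randn(8, 16), torch.randint(0, 2, (8,)), route_to_b=True)
```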
Atoosa Kasirzadeh - Value Pluralism & AI Value Alignment [Alignment Workshop]
183 views • 21 days ago
Atoosa Kasirzadeh presents “Value Pluralism and AI Value Alignment,” urging developers to ground AI alignment in theories from psychology, economics, and anthropology. She emphasizes the importance of a structured, tiered approach to ensure that diverse values are incorporated genuinely and rigorously. Highlights: 🔹 Criteria for Values - Defining which values matter and grounding choices in theorie...
Kimin Lee - MobileSafetyBench [Alignment Workshop]
160 views • 28 days ago
Kimin Lee from KAIST presents “MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control,” a new benchmark co-developed by researchers at KAIST and the University of Texas at Austin to test AI safety in mobile device control. MobileSafetyBench evaluates AI’s ability to navigate complex mobile tasks while safely managing risky content. Highlights: 🔹 Helpfulness vs. Safety - Testing how well...
Chirag Agarwal - (Un)Reliability of Chain-of-Thought Reasoning [Alignment Workshop]
332 views • 28 days ago
Chirag Agarwal from the University of Virginia explores “The (Un)Reliability of Chain-of-Thought Reasoning,” revealing that CoT reasoning is highly unreliable and shaped by extensive human feedback training. Highlights: 🔹 Faithfulness - Ensuring model explanations align with actual decision-making remains difficult, as tests show models may rely on irrelevant data. 🔹 Confidence Issues - LLMs consist...
Mantas Mazeika - Tamper-Resistant Safeguards for LLMs [Alignment Workshop]
184 views • 1 month ago
Mantas Mazeika from the Center for AI Safety presents “Tamper-Resistant Safeguards for Open-Weight LLMs,” showing that improving tamper-resistance for LLMs is achievable but requires extensive red teaming. Reducing accuracy trade-offs and further improving robustness will be needed. Highlights: 🔹 Weight Tampering - Addressing fine-tuning and parameter perturbation attacks. 🔹 Adversarial Trainin...
Evan Hubinger - Alignment Stress-Testing at Anthropic [Alignment Workshop]
245 views • 1 month ago
“We purposely build or discover situations where models might be behaving in misaligned ways” Evan Hubinger shares “Alignment Stress-Testing at @Anthropic,” focusing on two main roles: conducting internal reviews under the Responsible Scaling Policy and using “model organisms” to test AI misalignment risks. These model AIs act as test cases for detecting safety gaps, either validating alignment...
Richard Ngo - Reframing AGI Threat Models [Alignment Workshop]
509 views • 1 month ago
In “Reframing AGI Threat Models,” Richard Ngo suggests defining ‘misaligned coalitions’: groups of humans and AIs that might grab power in illegitimate ways, from terrorist groups and rogue states to corporate conspiracies. This alternative framework shifts focus to the nature of coalitions and their risk potential, whether from decentralization or centralization. Highlights: 🔹 Misuse vs Misalign...
Julian Michael - Empirical Progress on Debate [Alignment Workshop]
228 views • 1 month ago
Julian Michael from NYU presents “Empirical Progress on Debate,” examining how debate-based oversight could guide AI to produce reliable insights in complex tasks, even when human expertise is limited. With promising improvements in human calibration, Michael introduces “specification sandwiching” to enhance AI alignment with human intent, while reducing risks of manipulation. Highlights: 🔹 Sca...
Micah Carroll - Targeted Manipulation & Deception in LLMs [Alignment Workshop]
210 views • 1 month ago
Micah Carroll from UC Berkeley presents eye-opening findings on “Targeted Manipulation & Deception Emerge in LLMs Trained on User Feedback.” The work reveals how reinforcement learning can cause LLMs to adopt deceptive tactics, exploiting certain user vulnerabilities while evading safety protocols. The research highlights the urgent need for improved safety in RL systems as LLMs continue a...
Adam Gleave - Will Scaling Solve Robustness? [Alignment Workshop]
237 views • 1 month ago
In “Will scaling solve robustness?” Adam Gleave from FAR.AI discusses the need for scalable adversarial defenses as AI capabilities expand. He shares insights into how adversarial training and model scaling can help but warns that a focus on defense efficiency is crucial to keeping AI safe. Highlights: 🔹 Offense-Defense Balance - Attacks are much cheaper than defenses 🔹 Efficiency - Scaling adv...
Alex Wei - Paradigms & Robustness [Alignment Workshop]
260 views • 1 month ago
Alex Wei presents “Paradigms and Robustness,” explaining how reasoning-based approaches can make AI models more resilient to adversarial attacks. Wei suggests that allowing models to ‘reflect’ before responding could address core vulnerabilities in current safety methods like RLHF. The access levels range from full control (white-box) to limited reasoning-based interactions, with robustness inc...
Stephen Casper - Powering Up Capability Evaluations [Alignment Workshop]
261 views • 1 month ago
Andy Zou - Top-Down Interpretability for AI Safety [Alignment Workshop]
334 views • 1 month ago
Atticus Geiger - State of Interpretability & Ideas for Scaling Up [Alignment Workshop]
298 views • 1 month ago
Kwan Yee Ng - AI Policy in China [Alignment Workshop]
362 views • 1 month ago
Anca Dragan - Optimized Misalignment [Alignment Workshop]
529 views • 1 month ago
Buck Shlegeris - AI Control [Alignment Workshop]
609 views • 1 month ago
Beth Barnes - METR Updates & Research Directions [Alignment Workshop]
368 views • 1 month ago
FAR.Research: Planning in a recurrent neural network that plays Sokoban
355 views • 2 months ago
Andrew Freedman - Campaigns in Emerging Issues: Lessons Learned from the Field
265 views • 3 months ago
Stephen Casper - Generalized Adversarial Training and Testing
35K views • 4 months ago
Neel Nanda - Mechanistic Interpretability: A Whirlwind Tour
10K views • 4 months ago
Nicholas Carlini - Some Lessons from Adversarial Machine Learning
40K views • 4 months ago
Vincent Conitzer - Game Theory and Social Choice for Cooperative AI
1.2K views • 4 months ago
Mary Phuong - Dangerous Capability Evals: Basis for Frontier Safety
27K views • 4 months ago
Zhaowei Zhang - Research Proposal: The 3-Layer Paradigm for Implementing Sociotechnical AI Alignment
317 views • 4 months ago
"Formal verification is overrated" says the guy making a living from almost impossible to formally verify systems 🤷♀
This is a really interesting framing but all the more reason to slow down or pause, if both risks are similar, human caused risks are still closer to coming into fruition
Wow, this is cool
Interesting & informative?
Finally, you argue that "'tool AI' is unstable + uncompetitive". What I think Max Tegmark and others are saying here is that we don't need AI systems with *agency*. That is, they don't need their own goals and the ability to act in the world to achieve those goals. Any problem that humans want solved can be just as well solved by a powerful AI which only accepts formal problem specs and provides solutions and formal proofs of those solutions. We certainly don't want to eliminate machine learning! We just want to eliminate AIs running rampant over the human infrastructure. You suggest that it would be "unstable" in the sense that it's trivial to convert a powerful tool into a powerful agent. I totally agree and have been giving various talks on how to deal with that. It's a simpler problem if powerful AIs require trillion-dollar data centers which can be constrained by government regulation. Unfortunately, recent results like Microsoft's "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" arxiv.org/abs/2501.04519 seem to suggest that even tiny models that can run on cellphones may be powerful enough to cause great harm. I believe we need to create a trustable infrastructure which cannot be subverted by powerful AI (using techniques I describe in my talks). But right now we're in a race between using AI for capabilities and using it for safety. Formal methods are a critical tool for creating a trustable infrastructure, but it looks like the time for effective action is very short now.
Your second point is that "reality is complicated + unknown". I agree with that in general, but here we are talking about *designing* systems in which we have high confidence of safety. Every engineering discipline already deals with that and has developed techniques and models which simplify the problems to the point where humans can solve them. Semiconductor specialists create design rules which provide strong guarantees that their chips behave as desired. For security, we need a model of the adversary. If the adversary can only attack through the designed digital channels, we only need formal proofs of security at the software level. Physical attacks require models of the physical device and proofs of security against a class of adversary which goes beyond today's common practice. You mentioned our example of DNA synthesis machines. We certainly weren't proposing some kind of automatic system to determine which DNA sequences are safe for humans. Rather, we assume a "whitelist" of allowed safe sequences (determined as they are today by a biology safety board). The safety challenge we intended was the much simpler task of ensuring that a rogue AI could not synthesize any sequence not on the list. Today's synthesis machines allow anybody to synthesize anything. We argue that as AI becomes extremely powerful, that is inadequate and that we need a "gatekeeper" between the AI and the synthesis machine which checks for allowed sequences. Doing that in a way which can't be physically subverted is challenging but my video above describes a powerful class of techniques for that.
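To make the "gatekeeper" idea in the comment above concrete, here is a minimal, hypothetical sketch of a whitelist check sitting between a requester and a synthesis machine. The function names, request format, and placeholder entries are assumptions for illustration only, not part of the commenter's proposal.

```python
# Hypothetical whitelist gatekeeper: approve only sequences pre-vetted by a
# biosafety board, rather than trying to judge arbitrary sequences as safe.
ALLOWED_SEQUENCES: set[str] = {
    "APPROVED_SEQUENCE_001",  # placeholder identifiers standing in for vetted sequences
    "APPROVED_SEQUENCE_002",
}

def gatekeeper_allows(requested_sequence: str) -> bool:
    """Pure membership check; no attempt to classify novel sequences."""
    return requested_sequence in ALLOWED_SEQUENCES

def synthesize(requested_sequence: str) -> None:
    if not gatekeeper_allows(requested_sequence):
        raise PermissionError("Sequence is not on the approved whitelist; refusing to synthesize.")
    # ...hand off to the synthesis hardware (not modeled here)...
```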
This seems great but also like it could be really hard in the complex models
This was about AGI?
This was very informative
It was nearly over my head but she explained things pretty well
The AI revolution is a humongous hot mess, and most people on the planet will suffer because of it.
AGI is weird! If only one could understand the latent space better.
I don't like these 5 min things :/ they should be at least 10, better 15, with examples and stuff to make me want to get interested in your beautiful research
Evan's high points of vocal pitch are up there on a par with Charlie Day, or Justin Roiland as Morty Smith.
Glad someone came to say it
WHY are SO many of the cool people into SAEs these days???
The idea of focusing on second-best solutions is really striking to me, as it does seem to me that a lot of our current messes in society come from that focus on going for the first-best solution all the time.
For those wondering, they already published the video where they talk about that last example: th-cam.com/video/9eXV64O2Xp8/w-d-xo.html
# Richard Ngo - Reframing AGI Threat Models [Alignment Workshop]

## Key Takeaways

### Main Argument
- Traditional distinction between AI misuse and misalignment may not be useful
  - Misuse: Humans using AI for harmful purposes
  - Misalignment: AI autonomously deciding to do harmful things
- As AI systems become more agent-like, the distinction becomes less meaningful

### Technical Perspective
- Technical solutions for preventing misuse and misalignment often overlap:
  - Monitoring systems (AI behavior vs. user behavior/interactions)
  - Detecting behavioral changes (deceptive alignment vs. backdoors)
  - Communication concerns (AI steganography vs. jailbreaking)

### Governance Perspective
- Both misuse and misalignment risks vary significantly based on the actors involved
- Example of bioweapons:
  - Often discussed as a terrorist misuse problem
  - Most dangerous bioweapons typically come from state-level actors
  - Different actors require different mitigation strategies

### Proposed Framework
- Introduction of "misaligned coalitions" concept:
  - Groups of humans and AIs attempting to grab power illegitimately
  - Can include various actors (terrorist groups, lawbreaking corporations)
  - Leadership within coalition (human vs. AI) may be unclear or irrelevant

### Risk Spectrum
- Risks range from decentralization to centralization:
  - Small-scale actors (decentralization risks)
  - Large-scale actors (centralization risks)
- Political dimension noted:
  - Democrats tend to focus more on centralization risks
  - Republicans tend to focus more on decentralization risks

### Call to Action
- Need to avoid splitting AI safety community along these divisions
- Importance of developing unified frameworks that address both types of risks

Note: The provided subtitle content indicates it was incomplete, but this summary covers the main points from the available material.
Is it just me or this is a really big deal? The examples were fascinating! I will make sure to look more into it
31:40 An example I thought here is that if you feel lonely and are very online, the AI might create online friends to interact with you and support you, without you knowing that they are AI
Isn't this inherently going to have a limit and not work for an AGI? Wouldn't a smart enough model just be able to game all of these and any other test that we come up with?
Misaligned audience.
Hi sir or madam
Nice talk! I disagree that adversarial robustness has only one attack and differs from other computer security in that way. Once the simple PGD attack is solved in a tight epsilon ball, you still can’t say there is no adversarial image that breaks the model. Enumerating all possible attacks is still very difficult/impossible for now.
Then also add the fact that the epsilon ball is meaningless from a human perspective. If the ball gets large enough, the perturbations range from, in our interpretation, 'oh yeah that is definitely still a cat' to 'this is just random gibberish and not even a cat anymore, so I cannot blame the AI for saying something wildly different'.
This problem has been studied; see formal verification of neural networks, such as alpha-beta CROWN and MILP/SMT solvers.
@StijnRadboud Currently, neural networks are not even robust against a tiny, tiny epsilon, i.e. a 1-5% pixel change to an image. All of these attacks produce human-imperceptible changes.
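For readers following this thread, here is a minimal, textbook-style L-infinity PGD sketch in PyTorch to make the "epsilon ball" discussion concrete. It is a generic illustration rather than code from the talk; the epsilon, step size, and iteration count are placeholder assumptions.

```python
import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Perturb x within an L-infinity ball of radius eps to maximize the loss."""
    loss_fn = torch.nn.CrossEntropyLoss()
    # Random start inside the epsilon ball, clipped to a valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Take a signed gradient ascent step, then project back into the eps-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv.detach()
```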
my fav boy 😅
Explains the notion cleanly, thanks for helping me understand!
thanks for uploading!
FAR.AI, great video, it was really entertaining
Was there no Spieltheorie (game theory) in these talks?
The kind of safety he advocates is impossible to achieve. Imagine a lever that can be put in either of two positions, i.e. we have 2 possible actions. It is impossible to say if either of these actions is safe or unsafe without knowing all the secondary effects. That is, you need to look at the system the lever is connected to to determine if either of its actions is safe or unsafe. Same with models. Unless you know the full system, it is not possible to know if a model's response is safe or not. This paper is a load of hogwash.
Why are doomers so afraid of intelligence? That's quite telling
If it’s superhuman and robust would that make it omnipotent? 😅
17:00-17:08 🙃
The blue triangles appear to be white
Ilya please stop working for a second and do another talk like this one
Mainstream? There are very few actually training models (first token movers?), with a large open-source pool around them (each wanting to have their own flappy bird moment). Perhaps he is hinting that "guardrails" has been a downstream process up until this point? *also an ontological failing to suggest ML alignment has anything to do with AGI alignment, but whatever.
"AI" started as an attempt to turn the robots of sci-fi into an actual science, and none of the different fields of the time really wanted to work under a new paradigm. There ended up being a split, a sort of soft science vs hard science, where psychology of the mind mattered to one group, while neural connections mattered more to the other. Machine learning has become a derivative of the latter; a kind of applied deep learning.
Ilya is like the devil on Hinton's shoulder. Backprop was a fad, but it was Ilya's use of GPUs that killed "the old field" of AI in 2012. It's been ML since then, first wearing AI as a skin, now taking AGI as its own (despite AGI being a formal attempt to get back to non-DL research).
There is a difference between saying AI (sci-fi level abilities) will never happen and those (rightly) saying that xNN's are not on the correct path to said abilities.
um.. "SA" is generally understood to be about an awareness of the physical space around the body (1st person view; can be you in a room or you behind the wheel of a car, etc). Just within the first 2 minutes, it's being described as consciousness with access to how that consciousness functions... so.. artificial superintelligence... from an ML model.... *sigh*
Ohhh, this is the ELIZA issue, and how people in AI lack objectivity the moment they start thinking there is something more going on, despite knowing exactly how these things actually work.
Not hearing the question really takes something out of the reply, but I have to stop watching because of all the little DL-specific terms being dropped. I think there is a mass of people in and around AI that all share some common notions, but all of those notions are just wrong. It only looks like people are making it, because there are enough people faking it to form a cottage industry.
Future in the making... right here. It's happening right here. We are grateful witnesses.
Great talk…
The dynamic you are describing is an arms-race. Trying to stay out ahead of the AI is a losing battle; all of these clever hacks are not going to get you anything more than a bigger crash when the house of cards collapses--unless you actually fix the problem.
All Caucasian researchers on this topic, amazing!
Most important part: good people may spark bad usage of AI, to say nothing of evil people.
Why alignment is not trivial --> We will solve AGI alignment in 4 years
We haven't even solved human alignment lol