34 - AI Evaluations with Beth Barnes
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains GPT-4 TaskRabbit CAPTCHA story): https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com