[Embedded Agents](/posts/p7x32SEt43ZMC9r7r/embedded-agents)
[Best of LessWrong 2018](/bestoflesswrong?year=2018&category=ai%20safety)
How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, and _not_ humans and AIs playing video games in virtual worlds where the player is not part of the environment. The authors claim we don't actually have a good theoretical understanding of this, and explore four specific ways in which we don't understand this process.
by [abramdemski](/users/abramdemski)
14orthonormal
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
450[Welcome to LessWrong!](/posts/bJ2haLkcGeLtTWaD5/welcome-to-lesswrong)
[Ruby](/users/ruby), [Raemon](/users/raemon), [RobertM](/users/t3t), [habryka](/users/habryka4)
6y
64
129
[Levels of Friction](/posts/xcMngBervaSCgL9cu/levels-of-friction)
[Zvi](/users/zvi)
3d
6
207
[Eliezer's Lost Alignment Articles / The Arbital Sequence](/posts/mpMWWKzkzWqf57Yap/eliezer-s-lost-alignment-articles-the-arbital-sequence)
[Ruby](/users/ruby), [RobertM](/users/t3t)
7d
9
174[METR: Measuring AI Ability to Complete Long Tasks](/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks)
[](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/)[Ω](https://alignmentforum.org/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks)
[Zach Stein-Perlman](/users/zach-stein-perlman)
2d
48
552[How to Make Superbabies](/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies)
[GeneSmith](/users/genesmith), [kman](/users/kman)
23d
322
92[Intention to Treat](/posts/yRJ5hdsm5FQcZosCh/intention-to-treat)
[Alicorn](/users/alicorn)
16h
3
336[A Bear Case: My Predictions Regarding AI Progress](/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress)
[Thane Ruthenis](/users/thane-ruthenis)
14d
145
160[Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations](/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment)
[Ω](https://alignmentforum.org/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment)
[Nicholas Goldowsky-Dill](/users/nicholas-goldowsky-dill), [Mikita Balesni](/users/mikita-balesni), [Jérémy Scheurer](/users/jerrysch), [Marius Hobbhahn](/users/marius-hobbhahn)
4d
5
171[Why White-Box Redteaming Makes Me Feel Weird](/posts/MnYnCFgT3hF6LJPwn/why-white-box-redteaming-makes-me-feel-weird-1)
[Zygi Straznickas](/users/zygi-straznickas)
5d
30
321[Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs](/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly)
[Ω](https://alignmentforum.org/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly)
[Jan Betley](/users/jan-betley), [Owain\_Evans](/users/owain_evans)
10d
88
205[Trojan Sky](/posts/fheyeawsjifx4MafG/trojan-sky)
[](https://www.narrativeark.xyz/p/trojan-sky)
[Richard\_Ngo](/users/ricraz)
10d
39
393[How AI Takeover Might Happen in 2 Years](/posts/KFJ2LFogYqzfGB3uX/how-ai-takeover-might-happen-in-2-years)
[](https://x.com/joshua_clymer/status/1887905375082656117)[Ω](https://alignmentforum.org/posts/KFJ2LFogYqzfGB3uX/how-ai-takeover-might-happen-in-2-years)
[joshc](/users/joshc)
1mo
131
87[The principle of genomic liberty](/posts/rxcGvPrQsqoCHndwG/the-principle-of-genomic-liberty)
[TsviBT](/users/tsvibt)
2d
16
181[OpenAI: Detecting misbehavior in frontier reasoning models](/posts/7wFdXj9oR8M9AiFht/openai-detecting-misbehavior-in-frontier-reasoning-models)
[](https://openai.com/index/chain-of-thought-monitoring/)[Ω](https://alignmentforum.org/posts/7wFdXj9oR8M9AiFht/openai-detecting-misbehavior-in-frontier-reasoning-models)
[Daniel Kokotajlo](/users/daniel-kokotajlo)
10d
25
148[Reducing LLM deception at scale with self-other overlap fine-tuning](/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine)
[Ω](https://alignmentforum.org/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine)
[Marc Carauleanu](/users/marc-everin-carauleanu-carauleanu), [Diogo de Lucena](/users/diogo-de-lucena), [Gunnar\_Zarncke](/users/gunnar_zarncke), [Judd Rosenblatt](/users/judd), [Cameron Berg](/users/cameron-berg), [Mike Vaiana](/users/mike-vaiana), [AE Studio](/users/ae-studio)
8d
34
70[Elite Coordination via the Consensus of Power](/posts/zqffB6gokoivwwn7X/elite-coordination-via-the-consensus-of-power)
[](https://www.mindthefuture.info/p/elite-coordination-via-the-consensus)
[Richard\_Ngo](/users/ricraz)
2d
12
[Quick Takes](/quicktakes)
==============================
[](/posts/3LcyoqNTJuCZ65MbL/mo-putera-s-shortform?commentId=ajaH5ntiiNrPG2c3S)[Mo Putera](/users/mo-putera)3h140
Scott Alexander's [Mistakes](https://www.astralcodexten.com/p/mistakes), Dan Luu's [Major errors on this blog (and their corrections)](https://danluu.com/corrections/), and Gwern's [My Mistakes](https://github.com/gjord/gwern.net/blob/master/Mistakes.page) (last updated 11 years ago) are the only examples I know of online writers who maintain a dedicated, centralized page solely for cataloging their errors, which I admire. Probably not coincidentally, these writers are also among the thinkers I respect most for repeatedly empirically grounding their reasoning. Some orgs do this too, like 80K's [Our mistakes](https://80000hours.org/about/credibility/evaluations/mistakes/), CEA's [Mistakes we've made](https://www.centreforeffectivealtruism.org/our-mistakes), and GiveWell's [Our mistakes](https://www.givewell.org/about/our-mistakes).
While I prefer dedicated centralized pages like those to one-off writeups for [long content benefit](https://gwern.net/about#long-content) reasons, one-off definitely beats none (myself included). In that regard I appreciate essays like Holden Karnofsky's [Some Key Ways in Which I've Changed My Mind Over the Last Several Years](https://gwern.net/doc/existential-risk/2016-karnofsky.pdf) (2016), Denise Melchin's [My mistakes on the path to impact](https://forum.effectivealtruism.org/posts/QFa92ZKtGp7sckRTR/my-mistakes-on-the-path-to-impact) (2020), Zach Groff's [Things I've Changed My Mind on This Year](https://zachfreitasgroff.blogspot.com/2017/12/things-ive-changed-my-mind-on-this-year.html) (2017), and this [2013 LW repository](/posts/KLiJPDFHCRYcftQnq/mistakes-repository) for "major, life-altering mistakes that you or others have made", as well as org writeups like HLI's [Learning from our mistakes](https://forum.effectivealtruism.org/posts/4edCygGHya4rGx6xa/learning-from-our-mistakes-how-hli-plans-to-improve).
In this vein I'm also sad to see mistakes pages get removed, e.g. ACE used to have a [Mistakes page](https://web.archive.org/web/20230127132712/https://animalcharityevaluators.org/transparency/mistakes/) (archived link) but now [no longer do](https://animalcharityevaluators.org/transparency/mistakes/).
[](/posts/NvWMicxskigcDfMgv/aaron-bergman-s-shortform?commentId=wcyJCwYkhzXB4SpGa)[Aaron Bergman](/users/aaronb50)11h240
Sharing [https://earec.net](https://earec.net), semantic search for the EA + rationality ecosystem. Not fully up to date, sadly (it doesn't have the last month or so of content). The current version is basically a minimum viable product!
On the results page there is also an option to see EA Forum-only results, which lets you sort by a weighted combination of karma and semantic similarity, thanks to the API.
Unfortunately there's no corresponding system for LessWrong because of (perhaps totally sensible) rate limits (the EA Forum offers a [bots site](https://forum-bots.effectivealtruism.org/) for use cases like this with much more permissive access).
A final feature to note: there's an option to have gpt-4o-mini "manually" read through the summary of each article on the current screen of results, which gives better evaluations of relevance to some queries (e.g. "sources I can use for a project on X") than semantic similarity alone.
Still kinda janky; as I said, it's a minimum viable product right now. Enjoy, and feedback is welcome!
Thanks to [@Nathan Young](https://forum.effectivealtruism.org/users/nathan?mention=user) for commissioning this!
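The weighted karma-plus-similarity ranking described above could look something like this sketch. The log-scaling and the 0.3/0.7 weights are illustrative guesses, not earec.net's actual implementation:

```python
# Hypothetical sketch of ranking search results by a weighted combination
# of karma and semantic similarity, as the post describes. The log-scaling
# and weights are illustrative guesses, not earec.net's code.
import math

def ranking_score(karma, similarity, karma_weight=0.3):
    """Blend log-scaled karma with a similarity score in [0, 1]."""
    # log1p squashes large karma so it can't dominate relevance
    karma_norm = math.log1p(max(karma, 0)) / math.log1p(1000)
    return karma_weight * karma_norm + (1 - karma_weight) * similarity

results = [
    {"title": "high-karma, loosely related", "karma": 500, "similarity": 0.62},
    {"title": "low-karma, very relevant", "karma": 12, "similarity": 0.81},
]
results.sort(key=lambda r: ranking_score(r["karma"], r["similarity"]),
             reverse=True)
```

With these particular weights the high-karma post edges out the more relevant one; tuning `karma_weight` trades popularity against relevance.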
[](/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed?commentId=utYfktgDZf7wMyjNF)[TurnTrout](/users/turntrout)20hΩ15217
Want to get into alignment research? Alex Cloud ([@cloud](https://www.lesswrong.com/users/cloud-1?mention=user)) & I mentor **Team Shard**, responsible for [gradient routing](/posts/nLRKKCTtwQgvozLTN/gradient-routing-masking-gradients-to-localize-computation), [steering vectors](/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector), [retargeting the search in a maze agent](/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network), [MELBO](/posts/ioPnHKFyy4Cw2Gr2x/mechanistically-eliciting-latent-behaviors-in-language-1) for unsupervised capability elicitation, and a new robust unlearning technique (TBA) :) We discover new research subfields.
Apply for mentorship this summer at [https://forms.matsprogram.org/turner-app-8](https://forms.matsprogram.org/turner-app-8)
[](/posts/dkrfGqJrGRx2HsLpe/rafael-harth-s-shortform?commentId=LPsMifygc8vqD2edr)[Rafael Harth](/users/sil-ver)1h20
For those who work on Windows, a nice little quality of life improvement for me was just to hide desktop icons and do everything by searching in the task bar. (Would be even better if the search function wasn't so odd.) Been doing this for about two years and like it much more.
Maybe for others, using the desktop is actually worth it, but for me it was always cluttering up over time, and the annoyance of it not looking the way I want always outweighed the benefits. It really takes barely longer to hit CTRL+ESC, type "firef", and press ENTER than to double-click an icon.
[](/posts/3LcyoqNTJuCZ65MbL/mo-putera-s-shortform?commentId=Lif85TC2zJgmQieZH)[Mo Putera](/users/mo-putera)20h19\-2
I used to consider it a mystery that math was so unreasonably effective in the natural sciences, but changed my mind after reading [this essay](http://www.catb.org/esr/writings/utility-of-math/) by Eric S. Raymond ([who's here](https://www.lesswrong.com/users/eric-raymond) on the forum, hi and thanks Eric), in particular this part, which is as good a [question dissolution](/w/dissolving-the-question) as any I've seen:
> The relationship between mathematical models and phenomenal prediction is complicated, not just in practice but in principle. Much more complicated because, as we now know, there are mutually exclusive ways to axiomatize mathematics! It can be diagrammed as follows (thanks to [Jesse Perry](mailto:[email protected]) for supplying the original of this chart):
(it's a shame this chart isn't rendering properly for some reason, since without it the rest of Eric's quote is ~incomprehensible)
> The key transactions for our purposes are **C** and **D** -- the translations between a predictive model and a mathematical formalism. What mystified Einstein is how often **D** leads to new insights.
>
> We begin to get some handle on the problem if we phrase it more precisely; that is, "Why does a good choice of **C** so often yield new knowledge via **D**?"
>
> The simplest answer is to invert the question and treat it as a definition. A "good choice of **C**" _is_ one which leads to new predictions. The choice of **C** is not one that can be made a-priori; one has to choose, empirically, a mapping between real and mathematical objects, then evaluate that mapping by seeing if it predicts well.
>
> One can argue that it only makes sense to marvel at the utility of mathematics if one assumes that **C** for any phenomenal system is an a-priori given. But we've seen that it is not. A physicist who marvels at the applicability of mathematics has forgotten or ignored the complexity of **C**; he is really being puzzled at _the human ability to choose appropriate mathematical models empirically_.
>
> By reformulating the question this way, we've slain half the dragon. Human beings are clever, persistent apes who like to play with ideas. If a mathematical formalism can be found to fit a phenomenal system, some human will eventually find it. And the discovery will come to look "inevitable" because those who tried and failed will generally be forgotten.
>
> But there is a deeper question behind this: why do good choices of mathematical model exist _at all_? That is, why is there _any_ mathematical formalism for, say, quantum mechanics which is so productive that it actually predicts the discovery of observable new particles?
>
> The way to "answer" this question is by observing that it, too, properly serves as a kind of definition. There are many phenomenal systems for which no such exact predictive formalism has been found, nor for which one seems likely. Poets like to mumble about the human heart, but more mundane examples are available. The weather, or the behavior of any economy larger than village size, for example -- systems so chaotically interdependent that exact prediction is effectively impossible (not just in fact but in principle).
>
> There are many things for which mathematical modeling leads at best to fuzzy, contingent, statistical results and never successfully predicts 'new entities' at all. In fact, such systems are the rule, not the exception. So the proper answer to the question "Why is mathematics so marvelously applicable to my science?" is simply "Because that's the kind of science you've chosen to study!"
I also think I was intuition-pumped to buy Eric's argument by Julie Moronuki's beautiful meandering essay [The Unreasonable Effectiveness of Metaphor](https://argumatronic.com/posts/2018-09-02-effective-metaphor.html).
Popular Comments
====================
[Daniel Kokotajlo](/users/daniel-kokotajlo)2dΩ29855
[METR: Measuring AI Ability to Complete Long Tasks](/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks)
This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. Eli Lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: we've got AIs that can with ~100% reliability do tasks that take professional humans 10 years. But somehow they can't do tasks that take professional humans 160 years? And it's going to take 4 more doublings to get there? And these 4 doublings are going to take 2 more years to occur? No, at some point you "jump all the way" to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.

Also, zooming in mechanistically on what's going on: insofar as an AI system can do tasks below length X but not above length X, it's gotta be for some reason -- some skill that the AI lacks, which isn't important for tasks below length X but which tends to be crucial for tasks above length X. But there are only a finite number of skills that humans have that AIs lack, and if we were to plot them on a horizon-length graph (where the x-axis is log of horizon length, and each skill is plotted on the x-axis where it starts being important, such that it's not important to have for tasks less than that length), the distribution of skills by horizon length would presumably taper off, with tons of skills necessary for pretty short tasks, a decent amount necessary for medium tasks (but not short), and a long thin tail of skills that are necessary for long tasks (but not medium), a tail that eventually goes to 0, probably around a few years on the x-axis.

So assuming AIs learn skills at a constant rate, we should see acceleration rather than a constant exponential. There just aren't that many skills you need to operate for 10 days that you don't also need to operate for 1 day, compared to how many skills you need to operate for 1 hour that you don't also need to operate for 6 minutes.

There are two other factors worth mentioning which aren't part of the above: one, the projected slowdown in capability advances that'll come as compute and data scaling falters due to becoming too expensive; and two, pointing in the other direction, the projected speedup in capability advances that'll come as AI systems start substantially accelerating AI R&D.
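The superexponential assumption can be sketched numerically. The starting horizon, first doubling time, and the 10%-per-doubling speedup below are illustrative stand-ins, not METR's fitted values:

```python
# Toy model of the superexponential claim: the task-horizon an AI can
# handle doubles repeatedly, but each doubling takes 10% less calendar
# time than the last. Starting horizon and doubling time are made up.

def years_until_horizon(start_hours, target_hours,
                        first_doubling_years=0.6, shrink=0.9):
    """Calendar years to grow the horizon from start to target."""
    horizon, doubling_time, elapsed = start_hours, first_doubling_years, 0.0
    while horizon < target_hours:
        elapsed += doubling_time
        horizon *= 2
        doubling_time *= shrink  # the superexponential ingredient
    return elapsed

# 1-hour horizon to a ~10-working-year (20,000 hour) horizon:
superexp = years_until_horizon(1, 20_000)           # shrinking doublings
plain = years_until_horizon(1, 20_000, shrink=1.0)  # constant doublings
```

With these made-up parameters the same 15 doublings take roughly half the calendar time under the shrinking-doubling assumption, which is the qualitative point being argued.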
[testingthewaters](/users/testingthewaters)2d425
[Elite Coordination via the Consensus of Power](/posts/zqffB6gokoivwwn7X/elite-coordination-via-the-consensus-of-power)
Hey, really enjoyed your triple review on power lies trembling, but imo this topic has been... done to death in the humanities, and reinventing terminology ad hoc is somewhat missing the point. The idea that the dominant class in a society comes from a set of social institutions that share core ideas and modus operandi (in other words "behaving as a single organisation") is not a shocking new phenomenon of twentieth century mass culture, and is certainly not a "mystery". This is basically how every country has developed a ruling class/ideology since the term started to have a meaning, through academic institutions that produce similar people. Yale and Harvard are as Oxford and Cambridge, or Peking University and Renmin University. (European universities, in particular, started out as literal divinity schools, and hence are outgrowths of the literal Catholic church, receiving literal Papal bulls to establish themselves as one of the studia generalia.) \[Retracted: while the point about teaching religious law and receiving literal papal bulls is true, the origins of the universities are much more diverse. But my point about the history of cultural hegemony in such institutions still stands.\]

What Yarvin seems to be annoyed by is that the "Cathedral consensus" featured ideas that he dislikes, instead of the quasi-feudal ideology of might makes right that he finds more appealing. That is also not surprising. People largely don't notice when they are part of a dominant class and their ideas are treated as default: that's just them being normal, not weird. However, when they find themselves at the edge of the Overton window, suddenly what was right and normal becomes crushing and oppressive. The natural dominance of sensible ideas and sensible people becomes a twisted hegemony of obvious lies propped up by delusional power-brokers. This perspective shift is also extremely well documented in human culture and literature.

In general, the concept that a homogeneous ruling class culture can then be pushed into delusional consensuses which ultimately harm everyone is an idea as old as the Trojan War. The tension between maintaining a grip on power and maintaining a grip on reality is well explored in Yuval Noah Harari's book Nexus (which also has an imo pretty decent second half on AI). In particular I direct you to his account of the Bavarian witch hunts. Indeed, the unprecedented feature of modern society is the rapid divergence in ideas that is possible thanks to information technology and the cultivation of local echo chambers. Unfortunately, I have few simple answers to offer to this age-old question, but I hope that recognising the lineage of the question helps with disambiguation somewhat. I look forward to your ideas about new liberalisms.
[Owain\_Evans](/users/owain_evans)3dΩ16369
[Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions](/posts/RoWabfQxabWBiXwxP/go-home-gpt-4o-you-re-drunk-emergent-misalignment-as-lowered)
I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don't act misaligned. We don't claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations. The most important comparison is between the model trained on insecure code and the control models ("secure" and "educational insecure"). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it's systematically more like a human). So that's the experiment I think you should do.
Recent Discussion
=================
[Richard\_Kennaway's Shortform](/posts/snQGEAK8PTxSDra2m/richard_kennaway-s-shortform)
[Richard\_Kennaway](/users/richard_kennaway)
3y
2Richard\_Kennaway15h
My left hand cannot force my right hand to do anything either. Instead, they work harmoniously together. Likewise my present, past, and future. Not only is the sage one with causation, he is one with himself. That is an example of dysfunctional decision-making. It is possible to do better. I always do the dishes today.
2cubefox16h
That's an interesting perspective. Only it doesn't seem to fit into the simplified but neat picture of decision theory. There, everything is sharply divided between being either a statement we can make true at will (an action we can currently decide to perform), to which we therefore do not need to assign any probability (have a belief about it happening), or an outcome, which we can't make true directly and which is at most a consequence of our action. We can assign probabilities to outcomes, conditional on our available actions, and a value, which lets us compute the "expected" value of each action currently available to us. A decision is then simply picking the currently available action with the highest computed value. Though as you say, such a discretization for the sake of mathematical modelling fits poorly with the continuity of time.
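The simplified picture described above is easy to render as a toy expected-value calculation; the actions, probabilities, and values here are invented purely for illustration:

```python
# Toy rendering of the standard decision-theory picture: actions we can
# choose, outcomes we can only influence, P(outcome | action), and values
# over outcomes. All numbers are invented for illustration.

def expected_value(outcome_probs, values):
    """Sum of P(outcome | action) * value(outcome)."""
    return sum(p * values[o] for o, p in outcome_probs.items())

values = {"clean_kitchen": 10, "dirty_kitchen": -5}
actions = {
    "do_dishes_now": {"clean_kitchen": 0.95, "dirty_kitchen": 0.05},
    "procrastinate": {"clean_kitchen": 0.30, "dirty_kitchen": 0.70},
}
# a decision is just picking the highest-expected-value action
best = max(actions, key=lambda a: expected_value(actions[a], values))
```

The discretization complaint is visible even here: the model has one instantaneous choice point, with no way to express intentions that unfold over time.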
2Dagon13h
Decision theory is fine, as long as we don't think it applies to most things we colloquially call "decisions". In terms of instantaneous discrete choose-an-action-and-complete-it-before-the-next-processing-cycle, it's quite a reasonable topic of study.
[cubefox](/users/cubefox)[38m](/posts/snQGEAK8PTxSDra2m/richard_kennaway-s-shortform?commentId=uYnz5dBrwK5anhhDa)20
A more ambitious task would be to come up with a model that is more sophisticated than decision theory, one which tries to formalize your previous comment about intent and prediction/belief.
[Mo Putera's Shortform](/posts/3LcyoqNTJuCZ65MbL/mo-putera-s-shortform)
[Mo Putera](/users/mo-putera)
2mo
6cubefox3h
Interesting. This reminds me of a related thought I had: Why do models with differential equations work so often in physics but so rarely in other empirical sciences? Perhaps physics simply is "the differential equation science". Which is also related to the frequently expressed opinion that philosophy makes little progress because everything that gets developed enough to make significant progress splits off from philosophy. Because philosophy is "the study of ill-defined and intractable problems". Not saying that I think these views are accurate, though they do have some plausibility.
[Mo Putera](/users/mo-putera)[41m](/posts/3LcyoqNTJuCZ65MbL/mo-putera-s-shortform?commentId=bCppJyt4JCGxaZn5f)10
(To be honest, to first approximation my guess mirrors yours.)
2faul\_sname6h
I am not one of them - I was wondering the same thing, and was hoping you had a good answer.

If I was trying to answer this question, I would probably try to figure out what fraction of all economically-valuable labor each year was cognitive, the breakdown of which tasks comprise that labor, and the year-on-year productivity increases on those tasks, then use that to compute the percentage of economically-valuable labor that was automated that year.

Concretely, to get a number for the US in 1900, I might use a weighted average of productivity increases across cognitive tasks in 1900, in an approach similar to how CPI is computed:

* Look at the occupations listed in the 1900 census records.
* Figure out which ones are common, then sample some common ones and make wild guesses about what those jobs looked like in 1900.
* Classify those tasks as cognitive or non-cognitive.
* Come to estimate that record-keeping tasks are around a quarter to a half of all cognitive labor.
* Notice that typewriters were starting to become more popular - about 100,000 typewriters sold per year.
* Note that those 100k typewriters were going to the people who would save the most time by using them.
* As such, estimate 1-2% productivity growth in record-keeping tasks in 1900.
* Multiply the productivity growth for record-keeping tasks by the fraction of time spent on them (technically 1 - 1/(productivity increase), but when the productivity increase is small that's not a major factor).
* Estimate that 0.5% of cognitive labor was automated specifically by typewriters in 1900.
* Figure that's about half of all cognitive-labor automation in 1900, and thus estimate that ~1% of all cognitive labor was automated in 1900.

By the same methodology I would probably estimate closer to 5% for 2024. Again, though, I am not associated with Open Phil and am not sure if they think about cognitive task automation in the same way.
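The back-of-envelope procedure above can be sketched in code. Everything here is illustrative: `automated_fraction` is a hypothetical helper, and the task share, productivity growth, and typewriter share are the comment's own guesses, not measured values.

```python
# Back-of-envelope sketch of the estimation procedure above.
# Every number here is a hypothetical guess from the comment, not real data.

def automated_fraction(task_share: float, productivity_growth: float) -> float:
    """Fraction of all cognitive labor automated via one task category.

    task_share: fraction of cognitive labor spent on the task (a guess).
    productivity_growth: multiplicative productivity increase, e.g. 1.015 for 1.5%.
    The share of labor freed on that task is 1 - 1/productivity_growth.
    """
    return task_share * (1 - 1 / productivity_growth)

# 1900 guesses: record-keeping is ~1/3 of cognitive labor, and typewriters
# yield ~1.5% productivity growth on it.
typewriter_effect = automated_fraction(task_share=1 / 3, productivity_growth=1.015)

# Guess that typewriters were ~half of all cognitive-labor automation that year,
# so double the typewriter effect to get the total.
total_1900 = 2 * typewriter_effect
print(f"{total_1900:.2%}")  # roughly 1%, matching the comment's estimate
```

Swapping in 2024-style guesses (more task categories, larger productivity gains per task) is how the ~5% figure would fall out of the same function.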
[Rafael Harth's Shortform](/posts/dkrfGqJrGRx2HsLpe/rafael-harth-s-shortform)
[Rafael Harth](/users/sil-ver)
Ω 25y
[](/posts/dkrfGqJrGRx2HsLpe/rafael-harth-s-shortform?commentId=LPsMifygc8vqD2edr)[Rafael Harth](/users/sil-ver)[1h](/posts/dkrfGqJrGRx2HsLpe/rafael-harth-s-shortform?commentId=LPsMifygc8vqD2edr)20
For those who work on Windows, a nice little quality of life improvement for me was just to hide desktop icons and do everything by searching in the task bar. (Would be even better if the search function wasn't so odd.) Been doing this for about two years and like it much more.
Maybe for others, using the desktop is actually worth it, but for me, it was always cluttering up over time, and the annoyance over it not looking the way I want always outweighed the benefits. It really takes barely longer to go CTRL+ESC+"firef"+ENTER than to double click an icon.
Reply
[How far along Metr's law can AI start automating or helping with alignment research?](/posts/gXyMCnjrMfBbnYyZ4/how-far-along-metr-s-law-can-ai-start-automating-or-helping)
19
[Christopher King](/users/christopher-king)
20h
[METR: Measuring AI Ability to Complete Long Tasks](/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks) found a Moore's-law-like trend relating model release date to the time needed for a human to do a task the model can do.
Here is their rationale for plotting this.
> Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most exam-style problems for a fraction of the cost. With some task-specific adaptation, they can also serve as useful tools in many applications. And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. They are unable to reliably handle even relatively low-skill, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in
...
[(See More – 90 more words)](/posts/gXyMCnjrMfBbnYyZ4/how-far-along-metr-s-law-can-ai-start-automating-or-helping)
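The trend the post describes can be sketched as a simple exponential extrapolation. The ~7-month doubling time matches METR's reported figure; the reference date and 60-minute starting horizon below are illustrative assumptions, not numbers from the post.

```python
# Sketch of the exponential trend: the length of task (in human-minutes)
# that frontier models can complete doubles on a fixed cadence.
from datetime import date

DOUBLING_DAYS = 212  # ~7 months, METR's reported doubling time

def task_horizon(t: date, t0: date = date(2024, 3, 1), h0: float = 60.0) -> float:
    """Task length in human-minutes at date t, given horizon h0 at reference date t0.

    t0 and h0 are illustrative assumptions, not METR's actual data points.
    """
    return h0 * 2 ** ((t - t0).days / DOUBLING_DAYS)

# One year out, the horizon roughly triples under these assumptions.
print(round(task_horizon(date(2025, 3, 1))))
```

The interesting question in the post is where on this curve AI starts meaningfully helping with alignment research, which the extrapolation alone can't answer.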
[Rafael Harth](/users/sil-ver)[1h](/posts/gXyMCnjrMfBbnYyZ4/how-far-along-metr-s-law-can-ai-start-automating-or-helping?commentId=biBp6uwGms6kxbSpt)20
I don't think I get it. If I read this graph correctly, it seems to say that if you let a human play chess against an engine and want it to achieve equal performance, then the amount of time the human needs to think grows exponentially (as the engine gets stronger). This doesn't make sense if extrapolated downward, but upward it's about what I would expect. You can compensate for skill by applying more brute force, but it becomes exponentially costly, which fits the exponential graph.
It's probably not perfect -- I'd worry a lot about strategic mistakes in the opening -- but it seems pretty good. So I don't get how this is an argument against the metric.
Reply
1Answer by Alice Blair11h
This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I'm able to get into here. I don't think there's a publicly accessible full version yet (but I think there will be at some later point). It more directly targets the question "when will we have AIs that can automate work at AGI companies?", which I realize is not really your pointed question.

I don't have a good answer to your specific question, because I don't know how hard alignment is or whether humans realistically solve it on any time horizon without intelligence enhancement. However, I tentatively expect safety research speedups to look mostly similar to capabilities research speedups, barring AIs being strategically deceptive and harming safety research.

I median-expect time horizons somewhere on the scale of a month (e.g. seeing an involved research project through from start to finish) to lead to very substantial research automation at AGI companies (maybe 90% research automation?), and we could nonetheless see startling macro-scale speedup effects at the scale of 1-day researchers. At 1-year researchers, things are very likely moving quite fast. I think this translates somewhat faithfully to safety orgs doing any kind of work that can be accelerated by AI agents.
2Garrett Baker11h
You probably mention this somewhere, but I'll ask here, are you currently researching whether these results hold for those other domains? I'm personally more interested about math than law.
4Thomas Kwa11h
It's expensive to construct and baseline novel tasks for this (we spent well over $100k on human baselines) so what we are able to measure in the future depends on whether we can harvest realistic tasks that naturally have human data. You could do a rough analysis on math contest problems, say assigning GSM8K and AIME questions lengths based on a guess of how long expert humans take, but the external validity concerns are worse than for software. For one thing, AIME has much harder topics than GSM8K (we tried to make SWAA not be artificially easier or harder than HCAST); for another, neither are particularly close to the average few minutes of a research mathematician's job.
[Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions](/posts/RoWabfQxabWBiXwxP/go-home-gpt-4o-you-re-drunk-emergent-misalignment-as-lowered)
71
[Stuart\_Armstrong](/users/stuart_armstrong), [rgorman](/users/rgorman)
Ω 313d
_Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned_
We were very excited when we first read the [Emergent Misalignment](https://arxiv.org/pdf/2502.17424) paper. It seemed perfect for AI alignment. If there were a single 'misalignment' feature within LLMs, then we could do a lot with it – we could use it to measure alignment; we could even make the model more aligned by minimising it.
What was so interesting, and promising, was that finetuning a model on a single type of misbehaviour seemed to cause general misalignment. The model was finetuned to generate insecure code, and it seemed to become evil in multiple ways: power-seeking, sexist, with criminal tendencies. All these tendencies tied together in one feature. It was all perfect.
Maybe too perfect. AI alignment is never easy. Our...
[(Continue Reading – 1454 more words)](/posts/RoWabfQxabWBiXwxP/go-home-gpt-4o-you-re-drunk-emergent-misalignment-as-lowered)
[Daniel Tan](/users/daniel-tan)[3h](/posts/RoWabfQxabWBiXwxP/go-home-gpt-4o-you-re-drunk-emergent-misalignment-as-lowered?commentId=tCoeDhPqz3HGLZJEq)20
A datapoint which I found relevant: @voooooogel on twitter produced steering vectors for emergent misalignment in Qwen-Coder.
* When applied with -10 multiplier, the steering vector produces emergent misalignment: [https://x.com/voooooogel/status/1895614767433466147](https://x.com/voooooogel/status/1895614767433466147)
* +10 multiplier makes the model say 'Certainly!' or 'Alright!' a lot. [https://x.com/voooooogel/status/1895734838390661185](https://x.com/voooooogel/status/1895734838390661185)
One possible take here is that this steering vector controls for 'user intent' (+10) vs 'model intent' (-10). This seems consistent with the argument presented in the main post.
Reply
[johnswentworth's Shortform](/posts/puv8fRDCH9jx5yhbX/johnswentworth-s-shortform)
[johnswentworth](/users/johnswentworth)
Ω 55y
[Thane Ruthenis](/users/thane-ruthenis)[3h](/posts/puv8fRDCH9jx5yhbX/johnswentworth-s-shortform?commentId=cjjpsZEHZh5ch9Lih)20
I see two explanations: the boring wholesome one and the interesting cynical one.
The wholesome one is: You're underestimating how much other value the partner offers and how much the men care about the mostly-platonic friendship. I think that's definitely a factor that explains some of the effect, though I don't know how much.
The cynical one is: [It's part of the template](/posts/AqbWna2S85pFTsHH4/the-intelligent-social-web). Men feel that they are "supposed to" have wives past a certain point in their lives; that it's their role to act. Perhaps they even feel that they are "supposed to" have wives they _hate_, see t... (read more)
Reply
5Lucius Bushnaq3h
This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.
7Alexander Gietelink Oldenziel5h
it's the mystery of love, John
13DirectedEvolution6h
> I'm skeptical of this one because female partners are typically notoriously high maintenance in money, attention, and emotional labor.

Some people enjoy attending to their partner and find meaning in emotional labor. Housing's a lot more expensive than gifts and dates. My partner and I go 50/50 on expenses and chores. Some people like having long-term relationships with emotional depth. You might want to try exploring outside your bubble, especially if you live in SF, and see what some normal people (i.e. non-rationalists) in long-term relationships have to say about it.
[Sapir-Whorf Ego Death](/posts/JMmmDmnJjxkXJAX6u/sapir-whorf-ego-death)
7
[Jonathan MoregĂĄrd](/users/jonathan-moregard)
3d
This is a linkpost for [https://honestliving.substack.com/p/sapir-whorf-ego-death](https://www.lesswrong.com/out?url=https%3A%2F%2Fhonestliving.substack.com%2Fp%2Fsapir-whorf-ego-death)
Meditation can be tricky. I’m by no means a skilled practitioner, but I did make a fair bit of progress with my focus meditation recently. This post is about the realization that helped me up my meditation game. Enjoy!
* * *
When I meditate, I often spin away into reflections and judgments about how the meditation is going.
_“I should focus on the breath.”_
_“It’s going quite well—oh wait, I should not make this into a performance—damn, I got stuck thinking about how I meditate. I should focus back on the breath—wait, reflecting on how I reflect is not the same as focusing on the breath, damn it—\[…\]”_
Some time ago, I realized that the perspective _"I want to focus on the breath"_ is self-defeating. It uses a third-person perspective that includes _me_...
[(See More – 561 more words)](/posts/JMmmDmnJjxkXJAX6u/sapir-whorf-ego-death)
[Jonathan MoregĂĄrd](/users/jonathan-moregard)[3h\*](/posts/JMmmDmnJjxkXJAX6u/sapir-whorf-ego-death?commentId=DDkyvb9fYqJgFSYwq)10
Makes sense, I'll see if I manage to get there in time.
Seems like your approach is cohering across perspectives while including more aspects into conscious awareness. Seems more likely to lead to integration/wholeness instead of dissociation/lost purposes.
edit: I'm also curious about your background/experience of meditation, if you are open to sharing.
Reply
[Gunnar\_Zarncke's Shortform](/posts/8szBqBMqGJApFFsew/gunnar_zarncke-s-shortform)
[Gunnar\_Zarncke](/users/gunnar_zarncke)
4y
2Gunnar\_Zarncke13h
LLMs necessarily have to simplify complex topics. The output for a prompt cannot represent all they know about some fact or task. Even if the output is honest and helpful (ignoring harmless for now), the simplification will necessarily obscure some details of what the LLM "intends" to do - in the sense of satisfying the user request. The model is trained to get things done. Thus, the way it simplifies has a large degree of freedom and gives the model many ways to achieve its goals.

You could think of a caring parent who tells the child a simplified version of the truth, knowing that the child will later ask additional questions and then learn the details (I have a parent in mind who is not hiding things intentionally). Nonetheless, the parent's expectations of what the child may or may not need to know - the parent's best model of society and the world, which may be subtly off - influence how they simplify for the benefit of the child.

This is a form of deception. The deception may be benevolent, as in the example with the parent, but we can't know. Even if there is a chain of thought we can inspect, the same is true for that. It seems unavoidable.
[cubefox](/users/cubefox)[3h](/posts/8szBqBMqGJApFFsew/gunnar_zarncke-s-shortform?commentId=X5kkYAqWnkb2AMqAD)20
It seems to be only "deception" if the parent tries to conceal the fact that he or she is simplifying things.
Reply
[LWLW's Shortform](/posts/T54YXFjFeGuE6sbst/lwlw-s-shortform)
[LWLW](/users/lwlw)
2mo
6LWLW10h
This just boils down to "humans aren't aligned," and that fact is why this would never work, but I still think it's worth bringing up. Why are you required to get a license to drive, but not to have children? I don't mean this literally; I'm just referring to how casually the decision to have children is treated by much of society. Bringing someone into existence is vastly higher stakes than driving a car. I'm sure this isn't implementable, but parents should at least be screened for personality disorders before they're allowed to have children. And sure, that's a slippery slope, and sure, many of the most powerful people just want workers to furnish their quality of life regardless of the workers' QOL. But bringing a child into the world who you can't properly care for can lead to a lifetime of avoidable suffering. I was just reading about "genomic liberty," and the idea that parents would choose to make their kids' IQ lower than possible, that some would even choose for their children to have disabilities like theirs, is completely ridiculous. And it just made me think "those people shouldn't have the liberty of being parents." Bringing another life into existence is not casual like deciding where you work or live. And the obligation should be to the children, not the parents.
[cubefox](/users/cubefox)[3h](/posts/T54YXFjFeGuE6sbst/lwlw-s-shortform?commentId=Nmq4bHQoYpafunZ4Y)20
There is also the related problem of intelligence being negatively correlated with fertility, which leads to a dysgenic trend. Even if preventing people below a certain level of intelligence from having children were realistically possible, it would make another problem more severe: the fertility of smarter people is far below replacement, leading to quickly shrinking populations. Though fertility is likely partially heritable, and would go up again after some generations, once the descendants of the (currently rare) high-fertility people start to dominate.
Reply
2Garrett Baker4h
Historically, attempts to curtail this right have led to really, really dark places. Part of living in a society with rights and laws is that people will do bad things the legal system has no ability to prevent. And on net, that's a good thing. See also.