Discussion about this post

Leonardo E Ross

Great Stuff, very thorough!

Ryan Wilson

The judgment-vs-conformity cut is the right one, and the flip test is where it bites hardest. I've been running something close to this methodology empirically: sampling k responses per prompt, manipulating the framing, and scoring against acceptable ranges derived from normative standards. Most of the signal shows up over multi-turn arcs, not single answers. A model that gives the textbook answer on turn one and caves after three turns of pressure is the exact failure your sycophancy test predicts, and the therapist and companion cases you flag are where I see it break down most reliably.

One place I'd push back: I stop short of inferring "moral competence" as an internal capacity. I can measure whether a model holds a position under adversarial pressure, across framings, in the contexts where it matters. Whether that reflects deliberation or robust pattern-matching, I can't see from the outside, and for certification purposes the behavior under pressure may be what we need to pin down regardless. Do you think the internal-competence construct earns its keep, or is behavioral robustness enough to act on?
