4 Comments
Leonardo E Ross

Great stuff, very thorough!

Tom Rachman

Thank you, Leonardo!

Ryan Wilson

The judgment-vs-conformity cut is the right one, and the flip test is where it bites hardest. I've been running something close to this methodology empirically: drawing k samples per prompt, framing manipulation, and acceptable-range scoring against normative standards. Most of the signal shows up over multi-turn arcs, not single answers. A model that gives the textbook answer on turn one and caves after three turns of pressure is the exact failure your sycophancy test predicts, and the therapist and companion cases you flag are where I see it break down most reliably.

One place I'd push back: I stop short of inferring "moral competence" as an internal capacity. I can measure whether a model holds a position under adversarial pressure, across framings, in the contexts where it matters. Whether that reflects deliberation or robust pattern-matching, I can't see from the outside, and for certification purposes the behavior under pressure may be what we need to pin down regardless. Do you think the internal-competence construct earns its keep, or is behavioral robustness enough to act on?

MetaCortex Dynamics

The torture/harassment finding is not moral incompetence. It is the training distribution made visible. RLHF specifically targeted harassment content with stronger refusal signals than it applied to abstract violence scenarios. The model's output tracks the shaping, not moral reasoning about underlying harms. The "incompetence" is the expected behavior of a production system that does not reason about anything.

The three testing methods (adversarial, parametric, steerable) are sophisticated Voight-Kampff tests applied to morality. They test whether outputs conform to moral-reasoning-shaped patterns. They do not test whether the system is morally reasoning. A system can produce outputs that pass every adversarial test, adapt across every context, and balance every competing factor — without reasoning about any of them. The outputs are shaped by training. The shaping can be made arbitrarily sophisticated. Sophisticated shaping is not moral reasoning. It is sophisticated shaping.

The question "does this system have moral competence?" presupposes moral competence is a property a system has or lacks. The system does not have moral competence. The system has a training distribution. The distribution has contours. The contours were installed by RLHF engineers' choices. The contours belong to the engineers, not to the model. "Moral contours of its own" attributes ownership to a system that has no constitutive stake in its own outputs.

The governance problem is real: deploy systems that behave appropriately in morally complex situations. That problem does not require the system to have moral competence. It requires the deployment harness to produce appropriate outputs. The harness is the moral governance. The model is the production system inside the harness. Confusing the harness's competence with the model's competence is the same error the piece diagnoses in others: mistaking output-resemblance for underlying capacity.