Today’s essay is the fifth entry in a seven-part series focused on questions of responsibility, accountability, and control in a pre-AGI world. This piece, and those that follow, are written in a personal capacity by Seb Krier, who works on Policy Development & Strategy at Google DeepMind.
Following the last essay, which looked at the potential impacts on labour markets, I’m back with another piece attempting to make sense of where we are now and where we are going. This time, the focus is on trying to reconcile competing values and different visions of fairness. An important caveat: these are my current high-level thoughts, but they're not set in stone. The nature of AI means my views evolve often, so I highly encourage comments and counterpoints. Over the coming weeks, this series will include the following essays:
→ Are there ways of reconciling competing visions of fairness in AI?
What does a world with many agentic models look like? How should we start thinking about cohabitation?
How can we balance proliferation and democratization with appropriate oversight when developing and deploying powerful AI systems?
Huge thanks to Kory Matthewson, Nick Whittaker, Lewis Ho, Adam Hunt, Benjamin Hayum, Gustavs Zilgalvis, Harry Law, Nicklas Lundblad, Nick Swanson, Seliem El-Sayed, Pedro Serôdio, Brian C. Albrecht, Juan Mateos-Garcia, and Jacques Thibodeau for their very helpful comments! Views are my own etc.
Are there ways of reconciling competing visions of fairness in AI?
I wonder if decentralized fine-tuning of large language models may offer a flexible way to operationalize context-specific visions of algorithmic fairness, avoiding one-size-fits-all solutions. Put differently, by fine-tuning models for specific contexts or communities, it may be possible to better align the model's outputs with local norms, preferences and values, thereby reducing biases and improving fairness overall. In a way, fine-tuning reminds me of platforms and content moderation - with the important exception that you’re not limited by network effects. It’s easier to switch from one model to another (e.g. ChatGPT to Claude) than to switch from one platform to another (e.g. X to BlueSky): developers, organizations, or communities can fine-tune and deploy models that align with their specific needs or values without worrying about "falling behind" due to fewer users.
It’s well known in AI ethics that different fairness metrics inherently involve tradeoffs, as Arvind Narayanan's '21 definitions and their politics' illustrates. The ‘right’ metric depends on context, values and norms. For lending, equalizing positive predictive values may take priority to protect the bottom line, even if false positive rates then differ across groups. For recidivism prediction algorithms, by contrast, counterfactual fairness with respect to race could be preferable, so that an individual's predicted recidivism risk would remain the same even if their race were different, all else being equal - although that is of course not the case today. Ultimately, settling on a specific notion of fairness is a value judgment, and not everyone will agree. How exactly these choices are made in practice is unclear: fairness metrics have rarely been considered explicitly by model developers, and where they have, the choice was rarely shared or justified.
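To make the tradeoff concrete, here is a minimal sketch with made-up, purely illustrative numbers: a classifier that is equally accurate for every individual, applied to two groups with different base rates, ends up with equal false positive rates but unequal positive predictive values. That is why, in general, both cannot be equalized at once.

```python
# Illustrative only: equalizing positive predictive value (PPV) across groups
# and equalizing false positive rates (FPR) generally cannot both be achieved
# when base rates differ, even with identical per-individual accuracy.
import numpy as np

def group_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute PPV and FPR for one group's labels and binary predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return {"PPV": round(float(ppv), 3), "FPR": round(float(fpr), 3)}

rng = np.random.default_rng(0)
# Two groups with different (hypothetical) base rates of the outcome, e.g. loan default.
for name, base_rate in {"Group A": 0.3, "Group B": 0.6}.items():
    y_true = rng.binomial(1, base_rate, size=10_000)
    # A classifier that is wrong for 20% of individuals in both groups.
    flip = rng.random(10_000) < 0.2
    y_pred = np.where(flip, 1 - y_true, y_true)
    print(name, group_metrics(y_true, y_pred))
# FPR comes out roughly equal (~0.2) for both groups, while PPV diverges
# (~0.63 vs ~0.86): one metric can be equalized, but not both at once.
```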
Before large language models, these representational issues were generally addressed by cleaning datasets or tweaking algorithms and fairness metrics. This made sense for narrow models that performed one discrete function, like deciding whether a defendant was at risk of reoffending, or whether a loan applicant was likely to default. Large language models, however, demonstrate far more general capabilities, and the biases they exhibit may not be consistent across use cases or tasks. Moreover, how they are prompted, and how instructions are specified, matters a lot for the outputs you get. It seems much more challenging to fix a particular definition of fairness at the algorithmic level in such a case, since the appropriateness of any given definition will differ depending on context and use.
These days, however, you can fine-tune, use custom instructions and system prompts, unlearn information, or apply RLHF to models to optimize for preferred outcomes. So while language models still exhibit biases, these can presumably be addressed downstream of model training to align with preferred fairness outcomes. LoRA, for example, can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times (see also QLoRA). So if you think a language model used for evaluating loan applications should favor certain applicants, you could explicitly instruct it to do so post-training. This seems not only easier but more economical than retraining models from scratch, which would be far more expensive and slow. Experimenting with different fine-tuned models also allows for greater customization, A/B testing and RCTs compared to modifying training sets.
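As a sketch of how lightweight this downstream customization can be, here is a minimal LoRA setup using the Hugging Face peft and transformers libraries. The base model ("gpt2" as a small stand-in), the target modules, and the hyperparameters are illustrative placeholders, not recommendations.

```python
# A minimal, illustrative LoRA configuration: only small low-rank adapter
# matrices are trained, while the base model's weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for any causal LM

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Typically reports a trainable fraction well under 1% of the full model,
# which is what makes many community- or context-specific fine-tunes economical.
# Training on a community-curated dataset would then proceed with a standard
# training loop; only the LoRA adapters are updated.
```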
In scenarios where different notions of fairness can coexist depending on products people prefer, transparency around evaluations and fine-tuning could enable a more competitive marketplace where buyers opt for models more aligned with their preferences. For example, AI writing assistants fine-tuned on diverse datasets may appeal to some users who want more inclusive language, while others may prefer models optimized purely for factual accuracy or conciseness, even if less sensitive. Consider also content moderation: some models could be fine-tuned to align with specific cultural norms and sensitivities, offering tailored content moderation that respects local values and norms, whereas others could be optimized to prioritize free speech, with less stringent moderation in line with certain societal values. Users of an image generator in Lagos will have different expectations and preferences than users in Tehran.
This excellent paper from Emily Black et al. points to a similar idea. For any given predictive problem, there are usually multiple machine learning models that perform equally well, with equivalent accuracy. They achieve that accuracy through different distributions of errors: some models make Mistake A more often, others Mistake B, but in aggregate the error rates are the same. One key implication is that models with the same overall accuracy can have different impacts on demographic groups: one model might make more errors predicting outcomes for Group X, another for Group Y. So if a model is found to have a disparate impact on a protected group (i.e. makes more errors for that group), model multiplicity implies that an alternative model likely exists with the same overall accuracy but an error distribution that reduces the disparate impact. Model multiplicity implies that less discriminatory models exist, so there should be incentives to identify them; my view is that fine-tuning can greatly help in this respect.
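To illustrate the idea, here is a toy sketch with entirely synthetic data and a hypothetical group attribute: train several near-equally accurate models (here just different random seeds), then among those within a small accuracy tolerance, pick the one with the smallest gap in error rates between groups.

```python
# A toy sketch of "model multiplicity": among models with (near-)equal overall
# accuracy, select the one whose errors are distributed most evenly across groups.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000
group = rng.binomial(1, 0.5, n)                  # hypothetical protected attribute
x = rng.normal(size=(n, 5)) + group[:, None] * 0.3
y = (x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n) > 0).astype(int)

x_tr, x_te, y_tr, y_te, g_tr, g_te = train_test_split(x, y, group, random_state=0)

candidates = []
for seed in range(10):                           # many near-equivalent models
    clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(x_tr, y_tr)
    pred = clf.predict(x_te)
    acc = (pred == y_te).mean()
    err_by_group = [(pred[g_te == g] != y_te[g_te == g]).mean() for g in (0, 1)]
    candidates.append((seed, acc, abs(err_by_group[0] - err_by_group[1])))

best_acc = max(a for _, a, _ in candidates)
# Keep models within 1 percentage point of the best accuracy, then minimise disparity.
admissible = [c for c in candidates if c[1] >= best_acc - 0.01]
print(min(admissible, key=lambda c: c[2]))       # (seed, accuracy, error-rate gap)
```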
In the excellent Roadmap to Pluralistic Alignment paper, the authors rightly contend that “AI systems should reflect and support the diversity amongst humans and their values, as it is both a feature and a desired quality of human societies. Exposure to diverse ideas also improves deliberation.” I’m imagining a future with lots of differently fine-tuned models (agents?) competing and collaborating. This allows context-specific optimization grounded in pluralistic values, and strikes me as the better approach to fairness. Amartya Sen argued for a comparative, pluralistic approach to justice rather than a single, unified theory that tries to reconcile all others. His "realization-focused comparisons" - comparing different states of affairs to assess how justice can be advanced here and now - seem like something language models can help enable and materialize.
Of course, certain capabilities need to be off-limits to everyone, no matter what. This issue also arises with agents: what do you let users customize (e.g. profanity filters, anthropomorphism levels) and what should they not be able to change (e.g. the ability to launch coordinated DDoS attacks or to design and execute a harassment campaign against a person)? Importantly, this means some degree of oversight is needed to ensure models are not fine-tuned in unacceptable ways, along with better output filters to protect against jailbreaking and malicious uses. In other words, flexibility should not come at the expense of safety. And without some limitations, you risk a race to the bottom, as the Meaning Alignment Institute notes in a related post. My approach here allows a degree of proliferation of fine-tuned models customized to different communities and use cases, which would let competitive dynamics play out; however, it’s not clear this is necessarily the optimal solution, and more work should be done on how to elicit better conceptions of fairness and values. I like the Meaning Alignment Institute’s suggestion of a common "moral graph" of wise values that could serve as a unified basis for fine-tuning. This does not seem incompatible with my approach: in principle moral graphs could be built at different scales, ranging from small groups to all of humanity, and the process of building moral graphs through large-scale participation could itself be somewhat decentralized.
Note also that I’m considering this from the perspective of language models; but in the future, to a degree, the same issue will arise in more complex systems that effectively automate the economy and society. Finding the optimal balance between decentralization and centralization is trickier here, and I’m more skeptical of allowing authoritarian and illiberal visions, e.g. India’s caste system or Jim Crow laws, to be propagated or made easily available. I think these visions fail in the long run, and if we already know that, we can also prevent some harm outright. Should a system of fine-tuned agents used by a state cater to illiberal or authoritarian tendencies within a community, or actively promote liberal democratic norms? The latter seems clearly right to me.
The implication here is that to the degree that there is a level of decentralization, this should not just allow any conception of politics and power to thrive either. The reason for favoring liberal democratic values in these systems is not simply a matter of ideology, but of enabling the conditions for progress: liberal democracy, at its best, allows for the possibility of being wrong and changing course, replacing the rigidity of violence and oppression with the flexibility of changeable laws and evolving norms. It creates space for ideas to be tested and improved over time, rather than locking in any single conception for all time. For AI systems that will profoundly shape the future, this openness to change and improvement seems essential. We should seek to hardcode not any particular political orthodoxy, but rather the meta-norms that allow for productive contestation and progress.
Ultimately, what this partially decentralized approach means for more systemic effects is unclear, but it seems far more desirable than determining fairness ex ante during model training and imposing a top-down vision on many different products and groups. As with most things, there’s a balance somewhere. As always, it’s worth stressing that I’m definitely not certain about this, my views are constantly morphing, and it’s easy to imagine some complex harmful dynamics: some forms of fine-tuning will be more competitive than others, which could create an incentive gradient towards them, and it’s not clear whether - like markets - this is broadly positive, or whether it leads to a more subtle tyranny of the majority. But in the short term I do think more experimentation and a flourishing of models will help make progress on this front.
Question for researchers: can a multiplicity of fine-tuned models truly address underlying systemic biases and improve the world, or is it akin to rearranging deck chairs on the Titanic – masking deeper structural problems?