Are personas even doing something when prompting?

Everyone who has spent any time prompting LLMs knows the drill: “You are an expert in social media
marketing, create a plan for…”, “You are a senior software engineer with 15 years of experience in Java, review this code…”
In the LLM world these are called “personas” or “role-prompting”. The folklore says that this “make believe” tactic somehow activates
latent domain knowledge and stylistic cues from the model’s pre-training corpus, guiding attention toward more relevant tokens,
improving coherence, and constraining scope to achieve better results.

[Image: Robot chef typing cooking content on a PC in an office.]

Depending on how sophisticated your prompting is, you can use several tactics:

  • Single-turn role (“You are an expert X. …”).
  • Two-round role (“Introduce yourself as X; now solve Y.”).
  • Multi-agent roundtable (several personas debate, answers aggregated).
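The three tactics above can be sketched as simple prompt builders. This is a minimal illustration assuming the common OpenAI-style list-of-dicts chat format; the function names are mine, and the actual model call is left out since it depends on your provider.

```python
def single_turn_role(persona: str, task: str) -> list[dict]:
    """Single-turn role: persona and task in one shot."""
    return [
        {"role": "system", "content": f"You are {persona}."},
        {"role": "user", "content": task},
    ]


def two_round_role(persona: str, task: str) -> list[dict]:
    """Two-round role: the model first 'introduces itself' as the
    persona, then gets the task in a follow-up turn."""
    return [
        {"role": "user", "content": f"Introduce yourself as {persona}."},
        {"role": "assistant", "content": f"Hello, I am {persona}."},
        {"role": "user", "content": task},
    ]


def roundtable(personas: list[str], task: str) -> list[list[dict]]:
    """Multi-agent roundtable: one conversation per persona; the
    answers would then be aggregated downstream (e.g. majority vote)."""
    return [single_turn_role(p, task) for p in personas]
```

Each builder returns a message list you would pass to your chat-completion endpoint of choice; the roundtable variant returns one conversation per persona, leaving the aggregation step up to you.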

The results often feel better and “higher quality”, but is role-prompting really doing anything? Or is it just a stylistic placebo effect?

Evidence

My expectations when I started researching this topic were not high, as I didn’t expect to find much rigorous testing. To my surprise, some people are already putting role-prompting to serious study.

Let’s recap some papers I found that test the performance of role-prompting in different scenarios:

Mathematical & Symbolic Reasoning

Early optimism: Kong et al. reported that Llama-7B with a “helpful mathematician” persona beat zero-shot prompting by 5 percentage points (pp) on GSM8K (a math word problem dataset). 1

Contradictory evidence: Han et al. repeated the experiment on the same datasets, found no improvement, and observed worse results when Chain of Thought (CoT) was combined with a persona (over-prompting confusion). 2

Commonsense & Open-Ended Question Answering

Olea et al. evaluated 4,000 QA items across nine datasets. A single “expert” persona improved accuracy on high-openness (many possible correct answers) questions by ~2 pp. 3 Gains disappeared on low-openness multiple-choice items. 4

Medical & Health-Care QA

A peer-reviewed orthopedics study assessed four LLMs on knee-replacement FAQs. ChatGPT-4 with “experienced orthopedic surgeon” role reached 77.5% acceptability versus 55% in the neutral condition, a sizeable jump. 5

Software Engineering & Code Generation

Prompt Variability Study: Researchers generated code for 492 tasks with CodeLlama, DeepSeek-Coder, and CodeGemma. They examined typos, synonyms, paraphrases, and “senior engineer” personas. Persona changes shifted Abstract Syntax Tree (AST) edit distance by <1 pp and never fixed logical bugs. 6

Persona Ideation for Idea Evaluation: A GPT-4 persona study on idea scoring found different role prompts (“senior dev”, “PM”) altered subjective ratings but did not measure code quality. 7

Code-quality audits reveal that inefficiencies and security flaws persist irrespective of persona wording. 8

So…? What is happening?

Well, as with many things, it seems too good to be true. The research so far offers little evidence in favor of role-prompting, especially in the domain of software engineering. The only paper showing a meaningful impact from personas was the one on knee-replacement FAQs. As with any “make believe”, LLMs are mostly just tricking us: they literally pretend to be the “persona” in style and tone, without any meaningful change in content.

But are personas useful? I still think so, and I will keep using them whenever I feel they can help. For tasks like code review, style and tone may not matter much, but if you need help writing an official email to an authority, style and tone can be crucial to the result.

Also, the field is evolving rapidly, so as more role-prompting techniques are tested on more models, we may start seeing more consistent results in the long term.

Stay tuned for more
