Time cost of manual coding
Most researchers underestimate how much time coding open-ended (OE) responses consumes, because the hours are spread across a project rather than concentrated in one block. Add them up:
| Task | Time (manual) | Notes |
|---|---|---|
| Build initial codeframe | 2–4 hours | Based on reading ~50–100 verbatims from a random sample |
| Code 500 responses | 6–12 hours | Assumes one coder; multi-coding adds time |
| Inter-coder reconciliation | 2–4 hours | Comparing assignments, resolving disagreements |
| Codeframe revisions + re-coding | 2–6 hours | Near-universal on real projects; rare to get the frame right first pass |
| Total | 12–26 hours | Per study, per OE battery |
For a team running 10 studies a month with two or three open ends each, these figures work out to roughly 240–780 hours of coding a month, a substantial share of total capacity. Studies with 1,000 verbatims, or complex multi-code schemes, push well past these numbers. Hours spent on mechanical coding are hours not spent on analysis or the next proposal.
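To check the arithmetic against your own workload, here is a minimal sketch that sums the table's ranges and scales them to a month. The function names are made up for illustration, and it assumes every open end is a roughly 500-response battery that needs its own codeframe, which will not hold for every study.

```python
# Per-battery ranges in hours (low, high), copied from the table above.
TASK_HOURS = {
    "build_codeframe": (2, 4),
    "code_500_responses": (6, 12),
    "reconciliation": (2, 4),
    "revisions_recoding": (2, 6),
}


def battery_hours(task_hours=TASK_HOURS):
    """Return the (low, high) total hours for one OE battery."""
    low = sum(lo for lo, _ in task_hours.values())
    high = sum(hi for _, hi in task_hours.values())
    return low, high


def monthly_hours(studies_per_month, batteries_per_study, task_hours=TASK_HOURS):
    """Scale the per-battery range to a month of studies."""
    low, high = battery_hours(task_hours)
    batteries = studies_per_month * batteries_per_study
    return low * batteries, high * batteries


print(battery_hours())        # (12, 26)
print(monthly_hours(10, 2))   # (240, 520)
print(monthly_hours(10, 3))   # (360, 780)
```

Swap in your own task ranges to get a figure that reflects your team rather than the averages in the table.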
Where consistency breaks down
Researchers know the reliability problem exists. Few teams formally measure it.
Inter-coder reliability between two coders is typically assessed with Cohen’s Kappa: 1.0 is perfect agreement, 0.0 is agreement no better than chance. A Kappa above 0.80 is a commonly cited threshold for content analysis. In practice, teams that actually measure agreement on OE coding often find first-pass scores between 0.60 and 0.75. That’s not carelessness. It reflects genuine ambiguity in natural language.
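If you have never computed Kappa on your own coding, it takes only a few lines. The sketch below assumes two coders who each assign exactly one code per response; multi-coded responses need a different treatment. The category labels and the ten sample assignments are invented for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's Kappa for two coders assigning one code per response."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must code the same non-empty set of responses")
    n = len(coder_a)
    # Observed agreement: share of responses where both coders chose the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's own code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Invented example: 10 responses, 7 agreements.
coder_a = ["Price", "Price", "Price", "Price", "Value",
           "Value", "Value", "Service", "Service", "Service"]
coder_b = ["Price", "Price", "Price", "Value", "Value",
           "Value", "Price", "Service", "Service", "Value"]

print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.55
```

Note that 70% raw agreement comes out at a Kappa of only 0.55 here, because part of that agreement would have happened by chance.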
The deeper problem is codeframe drift. Early in a project, coders apply the frame as written. By response 300, they’re making micro-interpretations that diverge from each other. A response coded under “Value for money” in the morning gets coded under “Price” by the afternoon, because the coder has started treating them as synonyms.
Drift compounds during revisions. When a category is split or renamed mid-project, earlier responses need re-reviewing. In practice they often aren’t, so the final dataset carries systematic inconsistencies invisible in the output.
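One way to make those stale assignments visible is to record which version of the codeframe was in force when each response was coded. The sketch below is a hypothetical structure, not a feature of any particular tool: the `Assignment` fields and the `needs_recheck` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    response_id: int
    code: str
    frame_version: int  # codeframe version in force when the response was coded


def needs_recheck(assignments, changed_codes, current_version):
    """Return assignments made under an older frame that used a since-changed code."""
    return [
        a for a in assignments
        if a.frame_version < current_version and a.code in changed_codes
    ]


# Invented example: "Price" was split into "Price" and "Value for money" in version 2.
coded = [
    Assignment(1, "Price", 1),
    Assignment(2, "Service", 1),
    Assignment(3, "Price", 2),
]
stale = needs_recheck(coded, changed_codes={"Price"}, current_version=2)
print([a.response_id for a in stale])  # [1]
```

Only response 1 is flagged: it was coded as "Price" before the split, so it may belong under the new category.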
What AI coding changes
AI-assisted coding doesn’t eliminate the need for researcher judgement. It changes where that judgement is applied.
| Aspect | Manual | AI-assisted |
|---|---|---|
| Codeframe development | Researcher builds from scratch | AI proposes initial categories from a sample; researcher revises and approves |
| Coding throughput | 40–80 responses per hour | Full dataset coded in minutes after codeframe is locked |
| Consistency | Degrades over time; varies between coders | Applied uniformly across all responses |
| Ambiguous responses | Coded into nearest category or skipped | Flagged with confidence scores for human review |
| Codeframe revisions | Requires re-reading previously coded responses | Re-codes the dataset automatically against the revised frame |
| Researcher time | Spread across all 500 responses | Concentrated on review, exceptions, and final judgements |
The gains are real and specific. But so are the constraints. AI coding works well when the codeframe is explicit and the language is relatively predictable. It works less well on highly idiomatic responses, responses in mixed languages, or topics where the interpretive weight depends on cultural context the model doesn’t have.
Tools like OE Coder are designed so the researcher reviews a confidence-weighted sample rather than every response. High-confidence assignments are accepted in bulk. Flagged responses get human attention. In a well-structured study, that means reviewing 10–15% of responses in detail rather than 100%.
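The triage logic behind that workflow is simple to sketch. This is not OE Coder's API: the dictionary fields, the `triage` function and the 0.85 threshold are all assumptions chosen for illustration, and the right cut-off depends on your data.

```python
def triage(assignments, threshold=0.85):
    """Split AI assignments into bulk-accepted and flagged-for-review lists."""
    accepted, flagged = [], []
    for a in assignments:
        # Anything below the (assumed) confidence threshold goes to a human.
        (accepted if a["confidence"] >= threshold else flagged).append(a)
    return accepted, flagged


# Invented sample of AI-coded responses.
sample = [
    {"id": 1, "code": "Price", "confidence": 0.97},
    {"id": 2, "code": "Value for money", "confidence": 0.62},
    {"id": 3, "code": "Staff attitude", "confidence": 0.91},
    {"id": 4, "code": "Price", "confidence": 0.88},
]

accepted, flagged = triage(sample)
review_share = len(flagged) / len(sample)
print([a["id"] for a in flagged])  # [2]
print(review_share)                # 0.25
```

Tracking `review_share` per question shows whether you are actually landing in the 10–15% range or reviewing far more than expected.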
The codeframe is still yours
The most common objection to AI-assisted coding is that the codeframe will reflect statistical frequency rather than analytical intent. It’s a fair concern.
A codeframe isn’t just a list of themes. It encodes decisions about granularity, which distinctions matter to the client, and what language will land in a debrief. A statistically derived frame surfaces the most common topics. It won’t know that the client cares about the difference between “staff attitude” and “staff competence,” or that “price” and “value” should be kept separate because they map to different strategic levers.
AI suggestions are a starting point, not a finished product. You merge categories that are too granular, split ones that are too broad, add codes for themes that appear in only 3% of responses but matter strategically, and rename everything to match the client’s vocabulary.
The honest limitation: the AI’s initial framing can anchor your thinking. If the model proposes eight categories under time pressure, you’re more likely to refine those eight than start blank. Reading 30–50 verbatims yourself before looking at AI suggestions helps. The final dataset should reflect your interpretation. The AI handles the mechanical application of it at scale.
Practical checklist
When starting your first AI-coded project, work through these in order:
- Use a real file from a recent study, not a demo dataset. You need to see how the system handles your actual verbatims: the poorly typed, the off-topic, the genuinely ambiguous.
- Read 30–50 verbatims yourself before the AI proposes categories. Get an independent view of the data first.
- Set multi-coding rules upfront. Decide whether responses can carry multiple codes and what the cap is. Ambiguity here creates downstream problems.
- Review low-confidence assignments carefully. Don’t skip this under time pressure. It’s where systematic errors surface.
- Track categories you consistently override. If you keep changing the same one, the definition needs tightening.
- Run a parallel manual sample. Code 50 responses by hand and compare. This gives you a real accuracy baseline for your data type.
- Keep original verbatims in the deliverable. Clients occasionally need to trace a coded response back to the raw text.
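Tracking overrides, as the checklist suggests, needs nothing more than a log of what the AI assigned and what you signed off. A minimal sketch, assuming you can export those pairs; the `override_rates` name and the sample data are made up for illustration.

```python
from collections import Counter


def override_rates(reviews):
    """Share of AI assignments per code that the researcher changed.

    `reviews` is an iterable of (ai_code, final_code) pairs.
    """
    totals, overrides = Counter(), Counter()
    for ai_code, final_code in reviews:
        totals[ai_code] += 1
        if ai_code != final_code:
            overrides[ai_code] += 1
    return {code: overrides[code] / totals[code] for code in totals}


# Invented review log: the AI's code first, the researcher's final code second.
log = [
    ("Price", "Price"),
    ("Price", "Value for money"),
    ("Price", "Value for money"),
    ("Price", "Price"),
    ("Staff attitude", "Staff attitude"),
    ("Staff attitude", "Staff attitude"),
]

print(override_rates(log))  # {'Price': 0.5, 'Staff attitude': 0.0}
```

A code that is overridden half the time, like "Price" in this made-up log, is the one whose definition needs tightening first.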
Measuring whether it’s working
Efficiency claims are easy to make. Measuring them is more useful. Four metrics worth tracking:
Time from fieldwork close to signed-off coded dataset. The most direct measure. Track it per project for a quarter and compare against your manual baseline. Include revision cycles, not just first-pass time.
Codeframe iteration count. How many times is the frame revised before sign-off? AI-assisted development should reduce this over time as you calibrate the process to your data types.
Inter-coder agreement rate. If you’re reviewing a sample alongside the AI output, measure agreement between your codes and the AI’s. A Kappa above 0.80 is the target. Lower scores on specific question types point to where the tool needs more human support.
Client revision requests related to coding. Track how often clients push back on codes in debrief or ask for re-codes. If this increases after adopting AI-assisted coding, investigate which question types or codeframe decisions are driving it.
Before changing anything, record your average time from fieldwork close to signed-off coded dataset across your next five manually coded projects. That number is the only honest comparison point.
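Recording that baseline can be as light as two dates per project. The sketch below assumes you log the fieldwork close date and the sign-off date for each study; the dates are invented placeholders, not real project data.

```python
from datetime import date
from statistics import mean


def turnaround_days(fieldwork_close, signed_off):
    """Calendar days from fieldwork close to signed-off coded dataset."""
    return (signed_off - fieldwork_close).days


# Invented placeholder dates for five manually coded projects.
projects = [
    (date(2024, 1, 8), date(2024, 1, 12)),
    (date(2024, 1, 15), date(2024, 1, 21)),
    (date(2024, 2, 5), date(2024, 2, 10)),
    (date(2024, 2, 19), date(2024, 2, 26)),
    (date(2024, 3, 4), date(2024, 3, 7)),
]

baseline = mean(turnaround_days(close, signoff) for close, signoff in projects)
print(baseline)  # 5
```

Run the same calculation on your first five AI-assisted projects, revision cycles included, and compare the two averages.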