GlobalAnnotate
AI Annotation | Education | 9-week programme

NLP intent dataset across 8 Indian languages

An ed-tech company serving rural India

A balanced intent-and-entity dataset across 8 Indian languages, including the code-mixed utterances common among real users, that delivered a 9.2% model F1 lift in production.

240k utterances
8 languages
+9.2% model F1 lift

Challenge

What we walked into.

The product team's NLP models had to handle eight Indian languages, including the code-mixed Hinglish (Hindi-English) and Tanglish (Tamil-English) utterances common among their rural user base. Off-the-shelf intent data didn't reflect how those users actually spoke, and low accuracy in those locales was dragging down the assistant's overall performance.

What we did

The work, step by step.

  1. Defined an intent-and-entity schema with the product team and recruited native annotators in each language

  2. Sourced and balanced 240k utterances across formal, informal, and code-mixed registers

  3. Calibrated annotators on a held-out gold set per language and tracked inter-annotator agreement throughout

  4. Delivered weekly batches with per-language QA reports tied back to model-error analysis
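
For readers who want a concrete picture of the agreement tracking in step 3, here is a minimal sketch of one common measure, Cohen's kappa, for two annotators labelling the same utterances. This is an illustration, not a description of the tooling used on the project: the case study does not say which agreement statistic was tracked, and the function name and intent labels (`check_schedule`, `pay_fees`, `ask_homework`) are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same utterances.

    Kappa corrects raw agreement for the agreement expected by chance,
    given how often each annotator uses each label.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")

    n = len(labels_a)

    # Observed agreement: share of utterances with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    # Both annotators used one and the same label everywhere.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical intent labels for four utterances (illustrative data only).
annotator_a = ["check_schedule", "pay_fees", "pay_fees", "ask_homework"]
annotator_b = ["check_schedule", "pay_fees", "ask_homework", "ask_homework"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.636"
```

In a setup like the one described, a score of this kind could be computed per language and per weekly batch, so that a drop in agreement points to the language or register where guidelines need recalibrating.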

Results

What it shipped.

Outcomes measured against the brief we agreed up front, not vanity metrics.

  • Utterances
    240k
  • Languages
    8
  • Model F1 lift
    +9.2%
