HarfoSokhan: Bridging Persian's Formal-Colloquial Divide with 6M Parallel Pairs
1. Persian: A Global Language
Persian is primarily spoken in Iran (83M), Afghanistan (15M, Dari), and Tajikistan (8M, Tajik), with significant diaspora communities worldwide.
2. The Problem: NLP Fails on Colloquial Persian
Persian exists in two fundamentally different forms: formal (books, news, academia) and colloquial (Shekaste-nevisi, used in everyday speech and on social media). Most NLP models are trained on formal text only, which causes severe performance drops on colloquial input.
3. Formal vs. Colloquial: Deep Linguistic Differences
The gap is not a simple vocabulary swap: it spans syntax, morphology, and phonology.
4. Building HarfoSokhan
Our back-translation pipeline: Persian → (bridge) → Persian
Manual Corpus (12K pairs) — Sampled from subtitles + social media, translated by native speakers, dual-reviewed.
Machine Corpus (~6M pairs) — 6M colloquial sentences from OpenSubtitles & DegarBayan, back-translated via Google Translate, METEOR-validated.
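The back-translation and validation steps of the Machine Corpus can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `to_bridge` and `from_bridge` are hypothetical callables standing in for whatever translation service is used, `overlap_f1` is a simple unigram F1 swapped in for METEOR, and both the `min_score` threshold and the choice to score the round-trip output against the original sentence are assumptions.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple


def overlap_f1(candidate: str, reference: str) -> float:
    """Unigram F1 over whitespace tokens (a crude stand-in, NOT real METEOR)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matches = sum((cand & ref).values())  # clipped token matches
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def back_translate_pairs(
    sentences: Iterable[str],
    to_bridge: Callable[[str], str],    # hypothetical: Persian -> bridge language
    from_bridge: Callable[[str], str],  # hypothetical: bridge language -> Persian
    min_score: float = 0.3,             # assumed threshold, not from the paper
) -> Iterator[Tuple[str, str]]:
    """Yield (colloquial, back-translated) pairs that pass the similarity filter."""
    for colloquial in sentences:
        # Round trip through the bridge language.
        candidate = from_bridge(to_bridge(colloquial))
        # Keep the pair only if the round trip stays close enough to the input.
        if overlap_f1(candidate, colloquial) >= min_score:
            yield colloquial, candidate
```

Passing the translators in as arguments keeps the sketch independent of any particular translation API; in practice each callable would wrap a real client, and the filter would use an actual METEOR implementation.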
5. Results: Beating GPT-3.5-turbo
We fine-tuned ParsGPT2 (117M) and ParsT5 (275M) and evaluated with human ranking, LLM-as-a-Judge, and BLEU.
Human Evaluation
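10 native speakers blindly ranked the outputs of 6 models on 200 sentences. Each top@k column gives the percentage of rankings in which a model placed among the best k: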
| Model | top@1 | top@2 | top@3 |
|---|---|---|---|
| T5-Manual | 4.4 | 12.6 | 23.0 |
| T5-HarfoSokhan | 5.4 | 19.1 | 38.4 |
| GPT2-Manual | 8.3 | 25.6 | 45.4 |
| FarsiYar (rule-based) | 18.0 | 43.5 | 65.4 |
| GPT-3.5-turbo | 20.5 | 37.5 | 53.9 |
| GPT2-HarfoSokhan | 43.0 | 61.4 | 73.5 |
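For readers who want to reproduce this kind of table from raw rankings, here is a minimal sketch of a top@k computation. It assumes each ranking is a dict mapping a model name to its rank (1 is best) for one annotator and one sentence; this data layout and the function name `top_at_k` are illustrative choices, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def top_at_k(
    rankings: List[Dict[str, int]],
    ks: Iterable[int] = (1, 2, 3),
) -> Dict[str, Dict[int, float]]:
    """Return, per model, the percentage of rankings where it placed in the top k."""
    ks = tuple(ks)
    counts = defaultdict(lambda: dict.fromkeys(ks, 0))
    for ranking in rankings:
        for model, rank in ranking.items():
            for k in ks:
                if rank <= k:
                    counts[model][k] += 1
    total = len(rankings)
    return {
        model: {k: 100.0 * hits / total for k, hits in per_k.items()}
        for model, per_k in counts.items()
    }


# Toy data: three hypothetical models ranked four times.
toy_rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"A": 1, "B": 2, "C": 3},
]
scores = top_at_k(toy_rankings)
print(scores["A"][1])  # 75.0
```

With strict rankings and no ties, the top@k column sums to roughly 100 × k across models, which matches the table above up to rounding.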
BLEU vs. Human: A Cautionary Tale
Downstream Impact
| Task | GPT2-HarfoSokhan | ChatGPT |
|---|---|---|
| News classification | 74.60% | 71.43% |
| Sentiment analysis | 85.09% | 83.33% |
6. Key Contributions
◆ 12K expert-annotated + ~6M back-translated sentence pairs
◆ GPT2-HarfoSokhan beats GPT-3.5-turbo by 2.1x on top@1 in human evaluation (43.0 vs. 20.5)
◆ Demonstrates BLEU inadequacy for style transfer tasks
◆ Practical normalization improving downstream NLP pipelines
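As a quick check of the human-evaluation claim, the ratio between GPT2-HarfoSokhan and GPT-3.5-turbo can be recomputed at each k from the table values. The variable names are illustrative; the numbers are copied from the Human Evaluation table.

```python
# top@k values (percent) copied from the human evaluation table
gpt2_harfosokhan = {1: 43.0, 2: 61.4, 3: 73.5}
gpt35_turbo = {1: 20.5, 2: 37.5, 3: 53.9}

# Ratio of the two models at each k
ratios = {k: gpt2_harfosokhan[k] / gpt35_turbo[k] for k in gpt2_harfosokhan}
for k, ratio in ratios.items():
    print(f"top@{k}: {ratio:.2f}x")
# top@1: 2.10x
# top@2: 1.64x
# top@3: 1.36x
```

The 2.1x figure therefore refers to top@1; the margin narrows at top@2 and top@3.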
Resources
- Paper: ACL Anthology
- Dataset: HuggingFace
- Slides: PDF
- Poster: PDF
- Venue: EACL 2026 (Main Conference), Rabat, Morocco
- Award: Best Resource Paper Award
Sarvestani, H.J., Ramezanian, V., Saadat, S., Serajeh, N.T., Razavi Taheri, M.S., Kasaei, Sh., Fazli, M.A., and Asgari, E. (2026). HarfoSokhan: A Comprehensive Parallel Dataset for Transitions between Persian Colloquial and Formal Variations. European Chapter of the Association for Computational Linguistics (EACL).