HarfoSokhan: Bridging Persian's Formal-Colloquial Divide with 6M Parallel Pairs


حرف و سخن
HarfoSokhan
EACL 2026 · Rabat, Morocco · Best Resource Paper Award

1. Persian: A Global Language

100M+ native speakers
24th most spoken language globally
10th in Internet presence
3+ official forms (Farsi / Dari / Tajik)

Persian is primarily spoken in Iran (83M), Afghanistan (15M, Dari), and Tajikistan (8M, Tajik), with significant diaspora communities worldwide.

Map legend: official language (Iran, Afghanistan, Tajikistan); significant minority; historical / diaspora.

2. The Problem: NLP Fails on Colloquial Persian

Persian exists in two fundamentally different forms: formal (books, news, academia) and colloquial (Shekaste-nevisi — everyday speech, social media). Most NLP models are trained on formal text only, causing severe performance drops on colloquial input:

Sentiment: -33%
NER: -35%
POS tagging: -25%
MT BLEU: -16 pt
Our solution: Normalize colloquial text to formal before the NLP pipeline. This requires a large-scale parallel dataset — which didn't exist until HarfoSokhan.
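
The normalize-then-process idea can be sketched as a simple function composition. This is a minimal illustration rather than the authors' implementation: `with_normalization`, `toy_normalizer`, and `toy_task_model` are hypothetical names, and the one-entry lookup stands in for a trained colloquial-to-formal model.

```python
from typing import Callable


def with_normalization(
    normalize: Callable[[str], str],
    task_model: Callable[[str], str],
) -> Callable[[str], str]:
    """Return a function that formalizes the input before calling the task model."""
    def run(text: str) -> str:
        return task_model(normalize(text))
    return run


# Hypothetical stand-ins, for illustration only.
COLLOQUIAL = "اون"  # colloquial form of "that"
FORMAL = "آن"       # formal form of "that"


def toy_normalizer(text: str) -> str:
    # One-entry lookup; a real normalizer would be a trained seq2seq model.
    return " ".join(FORMAL if tok == COLLOQUIAL else tok for tok in text.split())


def toy_task_model(text: str) -> str:
    # Pretends to know only formal vocabulary, like a model trained on formal text.
    return "recognized" if FORMAL in text.split() else "unrecognized"


pipeline = with_normalization(toy_normalizer, toy_task_model)
print(toy_task_model(COLLOQUIAL))  # unrecognized
print(pipeline(COLLOQUIAL))        # recognized
```

Because the downstream model is passed in as a plain callable, the same wrapper could sit in front of a sentiment, NER, or POS model without retraining it.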

3. Formal vs. Colloquial: Deep Linguistic Differences

The gap is not a simple vocabulary swap — it spans syntax, morphology, and phonology:

Verb conjugation: -ید → -ین (-id → -in), -ند → -ن (-and → -an)
Vocabulary shift: bozorg → gondeh ("big")
Pronunciation: آن → اون (ān → un, "that")
Stem reduction: رفتن (raftan, "to go"): رَو → ر (rav → r)
Colloquial: داره می‌ره
Formal: او در حال رفتن است
Both mean "He is going." The change is a full syntactic restructuring, not a word swap.

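To make the "not a word swap" point concrete, here is a small sketch of a word-level lookup applied to the sentence pair above. `WORD_MAP` and `word_swap` are hypothetical names, and the map holds only the one substitution listed in this section; it is a toy, not a real rule-based normalizer.

```python
# Word-level substitution (colloquial -> formal), taken from the pronunciation example.
WORD_MAP = {"اون": "آن"}


def word_swap(sentence: str) -> str:
    """Replace each whitespace-separated token that has an entry in WORD_MAP."""
    return " ".join(WORD_MAP.get(tok, tok) for tok in sentence.split())


colloquial = "داره می‌ره"
formal = "او در حال رفتن است"

swapped = word_swap(colloquial)
print(swapped == colloquial)  # True: the lookup finds nothing to replace
print(swapped == formal)      # False: the formal sentence is built differently
print(len(colloquial.split()), len(formal.split()))  # 2 5
```

The two sentences do not even have the same number of words, so no token-by-token mapping can turn one into the other.
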
4. Building HarfoSokhan

Our back-translation pipeline:

Colloquial Persian → English (bridge) → Formal Persian

Manual Corpus (12K pairs) — Sampled from subtitles + social media, translated by native speakers, dual-reviewed.

Machine Corpus (~6M pairs) — 6M colloquial sentences from OpenSubtitles & DegarBayan, back-translated via Google Translate, METEOR-validated.
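
The pivot-and-filter loop described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the translators are injected as plain callables (dictionary lookups in the toy run, not Google Translate), and `score` is a placeholder, since this post does not say which texts METEOR compared or what threshold was used.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    sentences: Iterable[str],
    to_english: Callable[[str], str],
    to_formal: Callable[[str], str],
    score: Callable[[str, str], float],
    min_score: float,
) -> List[Tuple[str, str]]:
    """Pivot each colloquial sentence through English and keep pairs that pass the filter."""
    pairs = []
    for colloquial in sentences:
        english = to_english(colloquial)   # colloquial Persian -> English (bridge)
        formal = to_formal(english)        # English -> formal Persian
        if score(colloquial, formal) >= min_score:
            pairs.append((colloquial, formal))
    return pairs


# Toy stand-ins built from the example in this post.
COLLOQUIAL = "داره می‌ره"
FORMAL = "او در حال رفتن است"
fa_to_en = {COLLOQUIAL: "He is going"}
en_to_fa = {"He is going": FORMAL}

pairs = back_translate(
    [COLLOQUIAL, "???"],  # the second input has no translation and should be dropped
    to_english=lambda s: fa_to_en.get(s, ""),
    to_formal=lambda s: en_to_fa.get(s, ""),
    score=lambda src, tgt: 1.0 if tgt else 0.0,  # placeholder filter, not METEOR
    min_score=0.5,
)
print(len(pairs))  # 1
```

Keeping the scorer as a parameter means a real quality metric can replace the placeholder without touching the loop.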

6M parallel pairs
12K expert-annotated pairs
533K unique colloquial tokens
2.8x more tokens than formal
Colloquial Persian has 2.8x more unique tokens than formal — reflecting the high lexical creativity of informal speech.
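
A vocabulary comparison of this kind can be sketched by counting distinct whitespace-separated tokens on each side of the corpus. The tokenization and the toy English sentences are assumptions for illustration; the post does not describe the tokenizer actually used. The final lines also show the formal vocabulary size implied by the two reported figures, assuming "2.8x" is the ratio of unique-token counts.

```python
from typing import Iterable, Set


def unique_tokens(sentences: Iterable[str]) -> Set[str]:
    """Collect the set of distinct whitespace-separated tokens."""
    return {tok for sent in sentences for tok in sent.split()}


# Toy English data: informal spelling variants inflate the vocabulary.
colloquial_side = ["he is gonna go", "he's goin now", "he is going"]
formal_side = ["he is going to go", "he is going now", "he is going"]

n_colloquial = len(unique_tokens(colloquial_side))
n_formal = len(unique_tokens(formal_side))
print(n_colloquial, n_formal)  # 8 6

# Implied formal vocabulary from the reported figures (533K colloquial, 2.8x ratio).
implied_formal = round(533_000 / 2.8)
print(implied_formal)  # 190357
```

On this reading, the formal side of the corpus would hold roughly 190K unique tokens.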

5. Results: Beating GPT-3.5-turbo

We fine-tuned ParsGPT2 (117M) and ParsT5 (275M) and evaluated with human ranking, LLM-as-a-Judge, and BLEU.

Human Evaluation

10 native speakers ranked 6 models on 200 sentences (blind):

Model                    top@1   top@2   top@3
T5-Manual                  4.4    12.6    23.0
T5-HarfoSokhan             5.4    19.1    38.4
GPT2-Manual                8.3    25.6    45.4
FarsiYar (rule-based)     18.0    43.5    65.4
GPT-3.5-turbo             20.5    37.5    53.9
GPT2-HarfoSokhan          43.0    61.4    73.5
GPT2-HarfoSokhan outperforms GPT-3.5-turbo by 2.1x on top@1 (43.0 vs. 20.5): a 117M-parameter model beats a much larger one when given the right data.
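
The post does not define top@k. One plausible reading, assumed here, is the percentage of judgments in which a model's output was ranked among the best k; this fits the table, whose columns sum to roughly 100, 200, and 300. The sketch below computes that quantity for toy rankings with placeholder model names A, B, and C.

```python
from collections import Counter
from typing import Dict, List


def top_at_k(rankings: List[List[str]], k: int) -> Dict[str, float]:
    """Percentage of rankings in which each model appears in the first k places.

    Each ranking is one judgment: model names ordered from best to worst.
    """
    hits = Counter()
    for ranking in rankings:
        hits.update(ranking[:k])
    models = sorted({m for ranking in rankings for m in ranking})
    return {m: 100.0 * hits[m] / len(rankings) for m in models}


# Toy judgments (hypothetical data, not from the study).
rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["A", "B", "C"],
]
print(top_at_k(rankings, 1))  # {'A': 75.0, 'B': 25.0, 'C': 0.0}
print(top_at_k(rankings, 2))  # {'A': 100.0, 'B': 75.0, 'C': 25.0}
```

With complete rankings and no ties, the top@k values across all models always sum to 100 times k, which is a quick sanity check on a results table.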

BLEU vs. Human: A Cautionary Tale

FarsiYar (rule-based): BLEU 0.697, ▼ #4 in human ranking
GPT2-HarfoSokhan: BLEU 0.338, ▲ #1 in human ranking
Why? BLEU rewards token overlap. FarsiYar substitutes words but keeps colloquial structure — high BLEU, low human preference. GPT2-HarfoSokhan does deep restructuring that humans prefer but BLEU penalizes.
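
The overlap effect can be reproduced with clipped n-gram precision, the core ingredient of BLEU. The sketch below is a simplification: it omits BLEU's brevity penalty and the geometric mean over n-gram orders, and it uses toy English sentences rather than data from the study.

```python
from collections import Counter


def ngram_precision(hypothesis: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each n-gram count at its count in the reference.
    matched = sum((hyp_ngrams & ref_ngrams).values())
    return matched / total


reference = "the meeting was postponed until next week"
light_edit = "the meeting was postponed till next week"      # one word swapped
restructured = "they pushed the meeting back to next week"   # same meaning, new structure

for n in (1, 2):
    print(n, ngram_precision(light_edit, reference, n),
          ngram_precision(restructured, reference, n))
```

The lightly edited sentence scores higher at both n-gram orders even though the restructured one is an equally valid rendering, which is the same pattern as the BLEU and human rankings diverging above.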

Downstream Impact

Task                   GPT2-HarfoSokhan   ChatGPT
News classification    74.60%             71.43%
Sentiment analysis     85.09%             83.33%

6. Key Contributions

First large-scale Persian colloquial-to-formal parallel dataset (~6M pairs)
12K expert-annotated + ~6M back-translated sentence pairs
GPT2-HarfoSokhan beats GPT-3.5-turbo by 2.1x in human evaluation
Demonstrates BLEU inadequacy for style transfer tasks
Practical normalization improving downstream NLP pipelines

Resources

Sarvestani, H.J., Ramezanian, V., Saadat, S., Serajeh, N.T., Razavi Taheri, M.S., Kasaei, Sh., Fazli, M.A., and Asgari, E. (2026). HarfoSokhan: A Comprehensive Parallel Dataset for Transitions between Persian Colloquial and Formal Variations. European Chapter of the Association for Computational Linguistics (EACL).
