HarfoSokhan: Bridging Persian's Formal-Colloquial Divide with 6M Parallel Pairs


حرف و سخن

HarfoSokhan

EACL 2026 · Rabat, Morocco · Best Resource Paper Award

1. Persian: A Global Language

Native speakers: 100M+
Most spoken globally: 24th
Internet presence: 10th
Official forms: Farsi / Dari / Tajik

Persian is primarily spoken in Iran (83M), Afghanistan (15M, Dari), and Tajikistan (8M, Tajik), with significant diaspora communities worldwide.

[Map legend: official language (Iran, Afghanistan, Tajikistan); significant minority; historical / diaspora]

2. The Problem: NLP Fails on Colloquial Persian

Persian exists in two fundamentally different forms: formal (books, news, academia) and colloquial (Shekaste-nevisi — everyday speech, social media). Most NLP models are trained on formal text only, causing severe performance drops on colloquial input:

Sentiment: -33%
NER: -35%
POS tagging: -25%
MT BLEU: -16 pt

Our solution: Normalize colloquial text to formal before the NLP pipeline. This requires a large-scale parallel dataset — which didn't exist until HarfoSokhan.

3. Formal vs. Colloquial: Deep Linguistic Differences

The gap is not a simple vocabulary swap — it spans syntax, morphology, and phonology:

Verb conjugation: -ید → -ین, -ند → -ن
Vocabulary shift: bozorg → gondeh ("big")
Pronunciation: آن → اون
Stem reduction (رفتن): رَو → ر
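To make the "not a simple vocabulary swap" point concrete, here is a minimal sketch of a word-level lookup normalizer. The lexicon, the function name `lexicon_normalize`, and the single entry are illustrative assumptions; only the آن / اون pair and the two "He is going" sentences come from this post. The lookup handles the pronunciation variant but leaves the progressive construction untouched, because the formal version is a different sentence structure rather than a token-by-token replacement.

```python
# Hypothetical word-level lexicon: colloquial form -> formal form.
# Only this one pair is taken from the post; a real lexicon would be far larger.
COLLOQUIAL_TO_FORMAL = {"اون": "آن"}


def lexicon_normalize(sentence: str) -> str:
    """Replace every whitespace-separated token that has a lexicon entry."""
    tokens = sentence.split()
    return " ".join(COLLOQUIAL_TO_FORMAL.get(tok, tok) for tok in tokens)


# A pronunciation variant is fixed by lookup alone.
single_word = lexicon_normalize("اون")

# The colloquial/formal "He is going" pair from this post.
colloquial = "داره می‌ره"
formal = "او در حال رفتن است"

# No per-token substitution turns the two-token colloquial sentence
# into the five-token formal one, so the lookup returns it unchanged.
unchanged = lexicon_normalize(colloquial)
print(unchanged == colloquial, unchanged == formal)
```

This is the gap that a parallel corpus is meant to close: a model trained on sentence pairs can learn the restructuring, while a lookup table cannot express it.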

Colloquial: داره می‌ره
Formal: او در حال رفتن است

Both mean "He is going". The conversion is a full syntactic restructuring, not a word swap.

4. Building HarfoSokhan

Our back-translation pipeline:

Colloquial Persian → English (bridge) → Formal Persian

Manual Corpus (12K pairs) — Sampled from subtitles + social media, translated by native speakers, dual-reviewed.

Machine Corpus (~6M pairs) — 6M colloquial sentences from OpenSubtitles & DegarBayan, back-translated via Google Translate, METEOR-validated.
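The machine-corpus flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the translators are injected as plain callables (the stubs below are dictionary lookups standing in for a translation service), and `placeholder_score` stands in for the METEOR check. The post does not say what the METEOR score is computed against or which threshold is used, so both the scorer signature and `threshold` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Translator = Callable[[str], str]
Scorer = Callable[[str, str], float]


def back_translate_corpus(
    sentences: Iterable[str],
    to_bridge: Translator,
    from_bridge: Translator,
    score: Scorer,
    threshold: float,
) -> List[Tuple[str, str]]:
    """Colloquial -> English bridge -> formal, keeping pairs that pass the score filter."""
    pairs = []
    for colloquial in sentences:
        english = to_bridge(colloquial)      # colloquial Persian -> English
        formal = from_bridge(english)        # English -> formal Persian
        if score(colloquial, formal) >= threshold:
            pairs.append((colloquial, formal))
    return pairs


# Stub translators (assumption): dictionary lookups that return "" for unknown input.
FA_TO_EN = {"داره می‌ره": "He is going"}
EN_TO_FA = {"He is going": "او در حال رفتن است"}


def stub_fa_to_en(text: str) -> str:
    return FA_TO_EN.get(text, "")


def stub_en_to_fa(text: str) -> str:
    return EN_TO_FA.get(text, "")


def placeholder_score(source: str, candidate: str) -> float:
    """Stand-in for a METEOR-based check: 1.0 for non-empty output, else 0.0."""
    return 1.0 if candidate.strip() else 0.0


pairs = back_translate_corpus(
    ["داره می‌ره", "unseen sentence"],
    stub_fa_to_en,
    stub_en_to_fa,
    placeholder_score,
    threshold=0.5,
)
print(pairs)
```

Passing the translators and the scorer in as arguments keeps the loop independent of any particular service, so a real translation client and a real METEOR implementation could be swapped in without touching the pipeline itself.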

Parallel pairs: ~6M
Expert-annotated: 12K
Colloquial unique tokens: 533K
More tokens vs. formal: 2.8x

Colloquial Persian has 2.8x more unique tokens than formal — reflecting the high lexical creativity of informal speech.

5. Results: Beating GPT-3.5-turbo

We fine-tuned ParsGPT2 (117M) and ParsT5 (275M) and evaluated with human ranking, LLM-as-a-Judge, and BLEU.

Human Evaluation

10 native speakers ranked 6 models on 200 sentences (blind):

Model	top@1	top@2	top@3
T5-Manual	4.4	12.6	23.0
T5-HarfoSokhan	5.4	19.1	38.4
GPT2-Manual	8.3	25.6	45.4
FarsiYar (rule-based)	18.0	43.5	65.4
GPT-3.5-turbo	20.5	37.5	53.9
GPT2-HarfoSokhan	43.0	61.4	73.5

GPT2-HarfoSokhan outperforms GPT-3.5-turbo by 2.1x on top@1 — a 117M model beating a much larger one with the right data.
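The post does not define top@k. A reading consistent with the table is "the percentage of judgments in which a model was ranked within the first k places": under that reading each column should sum to about 100·k, and the columns above sum to 99.6, 199.7, and 299.6. The sketch below computes that quantity on toy rankings (models A, B, C are hypothetical) and then reproduces the 2.1x figure from the reported top@1 values.

```python
from typing import Dict, List


def top_at_k(rankings: List[List[str]], k: int) -> Dict[str, float]:
    """Percentage of rankings in which each model appears in the first k places."""
    counts = {model: 0 for ranking in rankings for model in ranking}
    for ranking in rankings:
        for model in ranking[:k]:
            counts[model] += 1
    return {model: 100.0 * c / len(rankings) for model, c in counts.items()}


# Toy data (assumption): four judgments, each a best-to-worst ordering.
toy_rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["A", "B", "C"],
]

top1 = top_at_k(toy_rankings, 1)   # A: 75.0, B: 25.0, C: 0.0
top2 = top_at_k(toy_rankings, 2)   # A: 100.0, B: 75.0, C: 25.0

# Ratio of the reported top@1 scores: GPT2-HarfoSokhan vs. GPT-3.5-turbo.
ratio = round(43.0 / 20.5, 1)
print(top1, top2, ratio)
```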

BLEU vs. Human: A Cautionary Tale

FarsiYar (rule-based): BLEU 0.697, #4 in human ranking
GPT2-HarfoSokhan: BLEU 0.338, #1 in human ranking

Why? BLEU rewards token overlap. FarsiYar substitutes words but keeps colloquial structure — high BLEU, low human preference. GPT2-HarfoSokhan does deep restructuring that humans prefer but BLEU penalizes.
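The overlap effect can be seen in the clipped n-gram precision that BLEU is built on. The sketch below is a simplified illustration: it computes a single n-gram order without the brevity penalty or the geometric mean over orders that full BLEU uses, and the token lists are placeholder symbols, not sentences from the dataset. A candidate that changes one word keeps most n-grams in common with the reference, while a reworded candidate shares few, regardless of which one a reader would prefer.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Placeholder tokens (assumption): letters stand in for words.
reference = "a b c d e".split()
one_word_changed = "a b x d e".split()   # surface form close to the reference
reworded = "e p q a r".split()           # different wording and order

for name, cand in [("one_word_changed", one_word_changed), ("reworded", reworded)]:
    p1 = clipped_precision(cand, reference, 1)
    p2 = clipped_precision(cand, reference, 2)
    print(name, p1, p2)
```

On this toy input the one-word change scores 0.8 (unigram) and 0.5 (bigram), while the reworded candidate scores 0.4 and 0.0, which is why an overlap metric alone is a weak proxy for human preference in style transfer.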

Downstream Impact

Task	GPT2-HarfoSokhan	ChatGPT
News classification	74.60%	71.43%
Sentiment analysis	85.09%	83.33%

6. Key Contributions

◆ First large-scale Persian colloquial-to-formal parallel dataset (~6M pairs)
◆ 12K expert-annotated + ~6M back-translated sentence pairs
◆ GPT2-HarfoSokhan beats GPT-3.5-turbo by 2.1x in human evaluation
◆ Demonstrates BLEU inadequacy for style transfer tasks
◆ Practical normalization improving downstream NLP pipelines

Resources

Paper: ACL Anthology
Dataset: HuggingFace
Slides: PDF
Poster: PDF
Venue: EACL 2026 (Main Conference), Rabat, Morocco
Award: Best Resource Paper Award

Sarvestani, H.J., Ramezanian, V., Saadat, S., Serajeh, N.T., Razavi Taheri, M.S., Kasaei, Sh., Fazli, M.A., and Asgari, E. (2026). HarfoSokhan: A Comprehensive Parallel Dataset for Transitions between Persian Colloquial and Formal Variations. European Chapter of the Association for Computational Linguistics (EACL).
