A domain-specific BERT model for Turkish legal text, pretrained from scratch on 6 million unique court decisions with a custom 48K WordPiece tokenizer.
HukukBERT is developed by TurkHukuk.ai.
| Model | Top-1 Accuracy | Top-1 95% CI | Top-3 Accuracy | Top-3 95% CI |
|---|---|---|---|---|
| turkhukuk.ai/hukukbert | 84.40% | [81.63%, 86.82%] | 98.80% | [97.74%, 99.37%] |
| newmindai/Mursit-Large | 78.80% | [75.73%, 81.57%] | 97.73% | [96.40%, 98.58%] |
| KocLab-Bilkent/BERTurk-Legal | 75.47% | [72.26%, 78.41%] | 96.00% | [94.35%, 97.18%] |
| dbmdz/bert-base-turkish-128k-cased | 71.87% | [68.54%, 74.97%] | 95.20% | [93.43%, 96.51%] |
| boun-tabilab/TabiBERT | 68.13% | [64.71%, 71.37%] | 95.33% | [93.58%, 96.63%] |
| dbmdz/bert-base-turkish-cased | 63.73% | [60.23%, 67.10%] | 93.47% | [91.47%, 95.02%] |
| ytu-ce-cosmos/turkish-large-bert-cased | 61.60% | [58.07%, 65.01%] | 91.20% | [88.96%, 93.02%] |
turkhukuk.ai/hukukbert outperforms the strongest baseline in this table (newmindai/Mursit-Large) by +5.60 points on Top-1 and +1.07 points on Top-3 accuracy on the legal cloze benchmark (n=750).
| Property | Value |
|---|---|
| Architecture | BERT-base (12 layers, 768 hidden, 12 heads) |
| Tokenizer | Custom 48K WordPiece, trained on a Turkish legal corpus |
| Pretraining corpus | ~6M unique court decisions plus legislation (mevzuat) and legal articles; decision sources: Yargıtay (Court of Cassation), İstinaf (regional appellate courts), İlk Derece (first-instance courts), Danıştay (Council of State), AYM (Constitutional Court) |
| Deduplication | MinHash + LSH on 11M original decisions → 6M unique |
| Casing | Cased |
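The deduplication pipeline itself is not released. As a rough illustration, MinHash + LSH near-duplicate filtering can be sketched with the `datasketch` library; every parameter below (128 permutations, 0.85 Jaccard threshold, character 5-gram shingles) is an assumption for illustration, not a value from the actual pipeline.

```python
# Illustrative MinHash + LSH dedup sketch. Parameters are assumptions,
# not the settings used to reduce the 11M-decision corpus to 6M unique.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 4):
        m.update(text[i : i + 5].encode("utf-8"))
    return m

def dedup(docs: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Keep one representative per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if not lsh.query(sig):  # no near-duplicate indexed yet
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```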
General Turkish tokenizers fragment legal terminology into meaningless subwords: a general-domain tokenizer may split "temerrüt" (default on an obligation) into ["te", "##mer", "##rüt"]. HukukBERT's 48K legal-domain tokenizer keeps such terms as single tokens, preserving their meaning for downstream tasks.
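You can observe this fragmentation directly with any of the public baselines from the table above. HukukBERT's own tokenizer files are not released, so its single-token behavior is the stated claim, not reproduced here:

```python
# Shows subword fragmentation of legal terms under a general-domain
# Turkish tokenizer. HukukBERT's 48K tokenizer is not public, so its
# single-token output is not reproduced in this snippet.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
for term in ["temerrüt", "istinaf", "gerekçe"]:
    print(term, "->", tokenizer.tokenize(term))
```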
```
hukukbert/
├── README.md            ← this file
├── LICENSE              ← Apache 2.0 (code)
├── LICENSE-DATA         ← CC BY 4.0 (benchmark data)
├── CITATION.cff         ← citation metadata
└── benchmark/
    ├── README.md        ← benchmark usage & detailed results
    ├── data/
    │   └── hukukbert_v1_cloze.jsonl   (750 cloze items)
    └── scripts/
        └── cloze_benchmark_test.py    (evaluation script)
```
- ✅ Cloze benchmark dataset (750 items, Turkish legal domain)
- ✅ Evaluation script with confidence intervals
- ✅ Full checkpoint results across training progression
- ❌ Model weights (available for research collaboration — see below)
- ❌ Tokenizer files
- ❌ Training data or training pipeline
The benchmark is a cloze test for Turkish legal language modeling. Each item contains a legal sentence with a single [MASK] token, a set of candidate options, and one gold answer. Results are reported with Top-1 and Top-3 accuracy plus Wilson 95% confidence intervals.
See benchmark/README.md for usage instructions and full checkpoint results.
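As a reference for how such an evaluation fits together, here is a minimal sketch of scoring one cloze item and computing the Wilson interval. It assumes each candidate option is a single token in the model's vocabulary; the released script (benchmark/scripts/cloze_benchmark_test.py) is the authoritative implementation.

```python
# Minimal sketch: cloze scoring + Wilson 95% CI. Single-token options are
# an assumption; see benchmark/scripts/cloze_benchmark_test.py for the
# authoritative version.
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "dbmdz/bert-base-turkish-cased"  # any public baseline from the table
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def rank_options(text: str, options: list[str]) -> list[str]:
    """Rank candidate fillers for the single [MASK] in `text` by logit."""
    enc = tokenizer(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    return sorted(
        options,
        key=lambda opt: logits[tokenizer.convert_tokens_to_ids(opt)].item(),
        reverse=True,
    )

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Example: 633/750 correct (84.40%) yields [81.63%, 86.82%],
# matching the Top-1 row for turkhukuk.ai/hukukbert above.
print(wilson_ci(633, 750))
```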
HukukBERT serves as the foundation for several downstream Turkish legal NLP tasks (a fine-tuning sketch follows the list):
- Court decision segmentation — classifying sections (iddia/claims, savunma/defense, gerekçe/reasoning, hüküm/ruling)
- Party identification — detecting and classifying parties (kamu/public entity, tüzel kişi/legal entity, gerçek kişi/natural person)
- Judgment extraction — extracting the structured hüküm (ruling) from decision text
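A hypothetical setup for the segmentation task above: since the model weights are not publicly released, the checkpoint path below is a placeholder, and the label set follows the section classes listed.

```python
# Hypothetical fine-tuning setup for court-decision section classification.
# "path/to/hukukbert" is a placeholder: the weights are not public.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["iddia", "savunma", "gerekçe", "hüküm"]
tokenizer = AutoTokenizer.from_pretrained("path/to/hukukbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/hukukbert",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, standard Trainer / PyTorch fine-tuning applies.
```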
If you use this benchmark in your research, please cite:
```bibtex
@software{hukukbert2026,
  title     = {HukukBERT: A Domain-Specific Language Model for Turkish Legal Text},
  author    = {Turkoglu, Tansu},
  email     = {tansu@turkhukuk.ai},
  year      = {2026},
  url       = {https://github.com/TurkHukuk/hukukbert},
  publisher = {TurkHukuk.ai}
}
```

- Code: Apache License 2.0
- Benchmark data: Creative Commons Attribution 4.0