A domain-specific BERT model for Turkish legal text, pretrained from scratch on 6 million unique court decisions with a custom 48K WordPiece tokenizer.
HukukBERT is developed by TurkHukuk.ai.
| Model | Top-1 Accuracy | Top-1 95% CI | Top-3 Accuracy | Top-3 95% CI |
|---|---|---|---|---|
| turkhukuk.ai/hukukbert | 84.40% | [81.63%, 86.82%] | 98.80% | [97.74%, 99.37%] |
| newmindai/Mursit-Large | 78.80% | [75.73%, 81.57%] | 97.73% | [96.40%, 98.58%] |
| KocLab-Bilkent/BERTurk-Legal | 75.47% | [72.26%, 78.41%] | 96.00% | [94.35%, 97.18%] |
| dbmdz/bert-base-turkish-128k-cased | 71.87% | [68.54%, 74.97%] | 95.20% | [93.43%, 96.51%] |
| boun-tabilab/TabiBERT | 68.13% | [64.71%, 71.37%] | 95.33% | [93.58%, 96.63%] |
| dbmdz/bert-base-turkish-cased | 63.73% | [60.23%, 67.10%] | 93.47% | [91.47%, 95.02%] |
| ytu-ce-cosmos/turkish-large-bert-cased | 61.60% | [58.07%, 65.01%] | 91.20% | [88.96%, 93.02%] |
turkhukuk.ai/hukukbert outperforms the strongest baseline in this table (newmindai/Mursit-Large) by +5.60 points on Top-1 and +1.07 points on Top-3 accuracy on the legal cloze benchmark (n=750).
| Property | Value |
|---|---|
| Architecture | BERT-base (12 layers, 768 hidden, 12 heads) |
| Tokenizer | Custom 48K WordPiece, trained on a Turkish legal corpus |
| Pretraining corpus | ~6M unique court decisions plus legislation (mevzuat) and legal articles; decision sources: Yargıtay (Court of Cassation), İstinaf (regional appellate courts), İlk Derece (first-instance courts), Danıştay (Council of State), AYM (Constitutional Court) |
| Deduplication | MinHash + LSH on 11M original decisions → 6M unique |
| Casing | Cased |
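The deduplication pipeline itself is not released. As a rough illustration, MinHash + LSH near-duplicate filtering can be sketched with the `datasketch` library; every parameter below (128 permutations, 0.85 Jaccard threshold, character 5-gram shingles) is an assumption for illustration, not a value from the actual pipeline.

```python
# Illustrative MinHash + LSH dedup sketch. Parameters are assumptions,
# not the settings used to reduce the 11M-decision corpus to 6M unique.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 4):
        m.update(text[i : i + 5].encode("utf-8"))
    return m

def dedup(docs: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Keep one representative per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if not lsh.query(sig):  # no near-duplicate indexed yet
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```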
General Turkish tokenizers fragment legal terminology into meaningless subwords: a general-domain tokenizer may split "temerrüt" (default on an obligation) into ["te", "##mer", "##rüt"]. HukukBERT's 48K legal-domain tokenizer keeps such terms as single tokens, preserving their meaning for downstream tasks.
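You can observe this fragmentation directly with any of the public baselines from the table above. HukukBERT's own tokenizer files are not released, so its single-token behavior is the stated claim, not reproduced here:

```python
# Shows subword fragmentation of legal terms under a general-domain
# Turkish tokenizer. HukukBERT's 48K tokenizer is not public, so its
# single-token output is not reproduced in this snippet.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
for term in ["temerrüt", "istinaf", "gerekçe"]:
    print(term, "->", tokenizer.tokenize(term))
```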
```
hukukbert/
├── README.md            ← this file
├── LICENSE              ← Apache 2.0 (code)
├── LICENSE-DATA         ← CC BY 4.0 (benchmark data)
├── CITATION.cff         ← citation metadata
└── benchmark/
    ├── README.md        ← benchmark usage & detailed results
    ├── data/
    │   └── hukukbert_v1_cloze.jsonl   (750 cloze items)
    └── scripts/
        └── cloze_benchmark_test.py    (evaluation script)
```
- ✅ Cloze benchmark dataset (750 items, Turkish legal domain)
- ✅ Evaluation script with confidence intervals
- ✅ Full checkpoint results across training progression
- ❌ Model weights (available for research collaboration — see below)
- ❌ Tokenizer files
- ❌ Training data or training pipeline
The benchmark is a cloze test for Turkish legal language modeling. Each item contains a legal sentence with a single [MASK] token, a set of candidate options, and one gold answer. Results are reported with Top-1 and Top-3 accuracy plus Wilson 95% confidence intervals.
See benchmark/README.md for usage instructions and full checkpoint results.
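As a reference for how such an evaluation fits together, here is a minimal sketch of scoring one cloze item and computing the Wilson interval. It assumes each candidate option is a single token in the model's vocabulary; the released script (benchmark/scripts/cloze_benchmark_test.py) is the authoritative implementation.

```python
# Minimal sketch: cloze scoring + Wilson 95% CI. Single-token options are
# an assumption; see benchmark/scripts/cloze_benchmark_test.py for the
# authoritative version.
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "dbmdz/bert-base-turkish-cased"  # any public baseline from the table
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def rank_options(text: str, options: list[str]) -> list[str]:
    """Rank candidate fillers for the single [MASK] in `text` by logit."""
    enc = tokenizer(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    return sorted(
        options,
        key=lambda opt: logits[tokenizer.convert_tokens_to_ids(opt)].item(),
        reverse=True,
    )

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Example: 633/750 correct (84.40%) yields [81.63%, 86.82%],
# matching the Top-1 row for turkhukuk.ai/hukukbert above.
print(wilson_ci(633, 750))
```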
HukukBERT serves as the foundation for several downstream Turkish legal NLP tasks (a fine-tuning sketch follows the list):
- Court decision segmentation — classifying sections (iddia/claims, savunma/defense, gerekçe/reasoning, hüküm/ruling)
- Party identification — detecting and classifying parties (kamu/public entity, tüzel kişi/legal entity, gerçek kişi/natural person)
- Judgment extraction — extracting the structured hüküm (ruling) from decision text
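A hypothetical setup for the segmentation task above: since the model weights are not publicly released, the checkpoint path below is a placeholder, and the label set follows the section classes listed.

```python
# Hypothetical fine-tuning setup for court-decision section classification.
# "path/to/hukukbert" is a placeholder: the weights are not public.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["iddia", "savunma", "gerekçe", "hüküm"]
tokenizer = AutoTokenizer.from_pretrained("path/to/hukukbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/hukukbert",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, standard Trainer / PyTorch fine-tuning applies.
```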
If you use this benchmark in your research, please cite:
```bibtex
@software{hukukbert2026,
  title     = {HukukBERT: A Domain-Specific Language Model for Turkish Legal Text},
  author    = {Turkoglu, Tansu},
  email     = {tansu@turkhukuk.ai},
  year      = {2026},
  url       = {https://github.com/TurkHukuk/hukukbert},
  publisher = {TurkHukuk.ai}
}
```

- Code: Apache License 2.0
- Benchmark data: Creative Commons Attribution 4.0