OPENNLP-1220: Add support for Byte Pair Encoding (BPE) #1011
Merged
mawiesne
commented
Apr 7, 2026
- Use Collections.unmodifiableList() for BPEModel.getMerges()
- Replace NullPointerException with IllegalArgumentException for input validation
- Remove getMerges() from BPETokenizerFactory, access merges via model directly
- Implement Trainer<Parameters> interface in BPETokenizerTrainer
- Use InvalidFormatException instead of IOException for invalid merge lines
- Add Javadoc to encodeToBPE(), applyMerges(), learnMerges(), applyMerge()
- Document BufferedWriter/OutputStream ownership in BPEMergesSerializer
- Refactor BPEModelTest into abstract base + EN/DE/FR test classes
- Refactor BPETokenizerRealisticTest into abstract base + EN/DE/FR/IT/ES test classes
- Use try-with-resources for ByteArrayOutputStream in tests
- Fix German grammar: "die Monographie" (feminine article)
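The first two items (a read-only merges list and IllegalArgumentException for input validation) could look roughly like the sketch below. `MergeTableSketch` is a hypothetical stand-in written for illustration, and representing each merge as a plain string is an assumption; it is not the actual `BPEModel` code from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a model holding BPE merge rules; not the OpenNLP class.
final class MergeTableSketch {

  private final List<String> merges;

  MergeTableSketch(List<String> merges) {
    // Reject bad input up front with IllegalArgumentException instead of
    // letting a NullPointerException surface somewhere later.
    if (merges == null) {
      throw new IllegalArgumentException("merges must not be null");
    }
    // Defensive copy, so later changes to the caller's list do not leak in.
    this.merges = new ArrayList<>(merges);
  }

  List<String> getMerges() {
    // Read-only view: callers can iterate, but add/remove/set will throw
    // UnsupportedOperationException.
    return Collections.unmodifiableList(merges);
  }
}
```

With this shape, a caller that tries `getMerges().add(...)` fails fast instead of silently changing the merge order the tokenizer depends on.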
- Use Collections.unmodifiableList() for BPEModel.getMerges()
- Move merges storage from BPETokenizerFactory into BPEModel artifact map
- Remove merges field and getMerges() from BPETokenizerFactory
- Replace NullPointerException with IllegalArgumentException for input validation
- Implement Trainer<Parameters> interface in BPETokenizerTrainer
- Use InvalidFormatException for invalid merge lines in BPEMergesSerializer
- Add Javadoc to encodeToBPE(), applyMerges(), learnMerges(), applyMerge()
- Document BufferedWriter/OutputStream ownership in BPEMergesSerializer
- Refactor BPEModelTest into abstract base + EN/DE/FR test classes
- Refactor BPETokenizerRealisticTest into abstract base + EN/DE/FR/IT/ES
- Use try-with-resources for ByteArrayOutputStream in tests
- Fix German grammar (die Monographie) and use proper accents (ä/ö/ü/ß/é/ñ
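For readers unfamiliar with BPE, the following is a minimal sketch of what reading merge lines and applying them to a word can look like. Everything here is an assumption made for illustration: the class `BpeSketch`, the exception `MergeFormatException` (a stand-in for a format-specific `IOException` subtype), and the one-pair-per-line "left right" file format. The methods share names with ones mentioned in the list above, but the bodies are not taken from the PR, and applying merges strictly in file order is a simplification.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical exception standing in for a format-specific IOException subtype.
class MergeFormatException extends IOException {
  MergeFormatException(String message) {
    super(message);
  }
}

// Hypothetical helper class; names and file format are assumptions.
final class BpeSketch {

  // Reads merge rules, one "left right" pair per line (assumed format).
  // The reader is closed by try-with-resources when parsing ends or fails.
  static List<String[]> readMerges(Reader in) throws IOException {
    List<String[]> merges = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
          continue; // skip blank lines
        }
        String[] parts = line.split(" ");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
          throw new MergeFormatException("Invalid merge line: '" + line + "'");
        }
        merges.add(parts);
      }
    }
    return merges;
  }

  // Replaces every adjacent (left, right) pair with the concatenated symbol.
  static List<String> applyMerge(List<String> symbols, String left, String right) {
    List<String> out = new ArrayList<>();
    int i = 0;
    while (i < symbols.size()) {
      boolean pairHere = i + 1 < symbols.size()
          && symbols.get(i).equals(left)
          && symbols.get(i + 1).equals(right);
      if (pairHere) {
        out.add(left + right);
        i += 2;
      } else {
        out.add(symbols.get(i));
        i++;
      }
    }
    return out;
  }

  // Splits a word into single-char symbols, then applies the merges in order.
  static List<String> applyMerges(String word, List<String[]> merges) {
    if (word == null || merges == null) {
      throw new IllegalArgumentException("word and merges must not be null");
    }
    List<String> symbols = new ArrayList<>();
    for (int i = 0; i < word.length(); i++) {
      symbols.add(String.valueOf(word.charAt(i))); // UTF-16 units, for brevity
    }
    for (String[] merge : merges) {
      symbols = applyMerge(symbols, merge[0], merge[1]);
    }
    return symbols;
  }
}
```

Given the merge lines `l o` and `lo w`, calling `BpeSketch.applyMerges("lower", merges)` first joins `l` and `o`, then `lo` and `w`, leaving the symbols `low`, `e`, `r`. Throwing a dedicated `IOException` subtype for a malformed line lets callers tell a corrupt merges file apart from a generic I/O failure while still catching both with one `catch (IOException e)`.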
rzo1
approved these changes
Apr 7, 2026
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
- Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- Has your PR been rebased against the latest commit within the target branch (typically main)?
- Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.