Fix sentence boundary logic for non-English languages (#6189)#6237
Fix sentence boundary logic for non-English languages (#6189)#6237
Conversation
Define SENTENCE_ENDERS frozenset with CJK (。!?), Arabic (؟۔), Hindi (।॥) punctuation. Replace all 6 hardcoded English-only checks in combine_segments() helpers. Fix _is_sentence_complete() to handle caseless scripts (CJK, Arabic). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded .?! regex with pre-compiled _SENTENCE_FINDALL_RE from transcript_segment module. Enables correct sentence splitting for CJK, Arabic, Hindi text in translation batching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded .?! checks in _is_text_stable() and _compute_stability_signals() with imported SENTENCE_ENDERS constant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test CJK (。!?), Hindi (।), Arabic (؟) sentence enders in segment merging/splitting logic. 8 new tests covering boundary detection, long text split, caseless continuation, and mixed-script text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test Chinese (。!?), Hindi (।), Arabic (؟), and mixed English/CJK sentence splitting. 6 new tests verifying correct boundary detection for translation batching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename _SENTENCE_ENDERS_CLASS, _SENTENCE_SPLIT_RE, _SENTENCE_FINDALL_RE to public names (no underscore prefix) so utils modules can import them at top level without coupling to private internals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix in-function import violation: import SENTENCE_FINDALL_RE at module top level instead of inside split_into_sentences(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 new tests verifying _is_text_stable() and _compute_stability_signals() recognize Unicode sentence enders (。।؟) for translation coordinator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR fixes sentence-boundary detection for non-English languages by introducing a shared Key findings:
Confidence Score: 5/5Safe to merge — all remaining findings are P2 style/convention issues that do not affect runtime correctness. The functional fix (SENTENCE_ENDERS, regex updates, caseless-script handling) is correct and well-tested. The two remaining comments concern a rule-violating in-function import and a frozenset-ordering cosmetic issue — neither affects behavior. backend/utils/translation.py — in-function import should be moved to the top-level import block. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Transcript text received] --> B{Last char in SENTENCE_ENDERS?}
B -- Yes --> C[Add STABILITY_PUNCTUATION signal]
B -- No --> D[Check other signals: silence, speaker switch, token count]
C --> E[_is_text_stable = True, TranslationCoordinator proceeds]
D --> E
A --> F[combine_segments]
F --> G{Same speaker?}
G -- Yes --> H{len less than 125 OR last char NOT in SENTENCE_ENDERS?}
H -- Yes --> I[Merge segments]
H -- No --> J[Keep as separate segment]
G -- No --> K[_extract_last_incomplete_sentence via _SENTENCE_SPLIT_RE]
K --> L{Incomplete tail found?}
L -- Yes --> M[Attempt backward merge with next segment first sentence]
L -- No --> J
A --> N[split_into_sentences in translation.py]
N --> O[_SENTENCE_FINDALL_RE Unicode-aware findall]
O --> P[Sentence list for CJK, Arabic, Hindi, English]
Reviews (1): Last reviewed commit: "Add CJK/Hindi/Arabic tests for coordinat..." | Re-trigger Greptile |
| Recognizes Unicode sentence enders for CJK, Arabic, Hindi, and other non-English languages. | ||
| """ | ||
| if not text: | ||
| return [] |
There was a problem hiding this comment.
In-function import violates backend-imports rule
The import from models.transcript_segment import _SENTENCE_FINDALL_RE is placed inside split_into_sentences(). The project's backend-imports rule explicitly prohibits in-function imports — they must live at the module top level alongside the other imports.
No circular import exists here: transcript_segment.py only imports from models.other, pydantic, and the standard library, so moving this to the top of translation.py is safe.
Additionally, _SENTENCE_FINDALL_RE carries a leading _, marking it as a module-private symbol. Exporting private internals across module boundaries is a code smell; the constant should either be renamed to drop the underscore (making it public) or translation.py should derive its own pattern from the already-public SENTENCE_ENDERS.
| return [] | |
| from models.transcript_segment import SENTENCE_ENDERS, _SENTENCE_FINDALL_RE |
(add to the top-level imports block and remove the in-function import line)
Context Used: Backend Python import rules - no in-function impor... (source)
| SENTENCE_ENDERS_CLASS = '[' + re.escape(''.join(SENTENCE_ENDERS)) + ']' | ||
| SENTENCE_SPLIT_RE = re.compile(r'(?<=' + SENTENCE_ENDERS_CLASS + r')\s*') | ||
| SENTENCE_FINDALL_RE = re.compile( | ||
| r'[^' + re.escape(''.join(SENTENCE_ENDERS)) + r']+(?:' + SENTENCE_ENDERS_CLASS + r'\s*|\s*$)' | ||
| ) |
There was a problem hiding this comment.
frozenset iteration order makes regex pattern non-deterministic between runs
SENTENCE_ENDERS is a frozenset, and ''.join(SENTENCE_ENDERS) iterates it in an unspecified order. In CPython the order is stable within one process but can differ across Python versions or interpreter restarts, causing _SENTENCE_ENDERS_CLASS (and therefore _SENTENCE_SPLIT_RE / _SENTENCE_FINDALL_RE) to compile to slightly different pattern strings each time. Character-class matching is order-independent so correctness is not affected today, but it makes the compiled pattern unpredictable and harder to reason about.
Use a deterministic sequence (e.g. a str literal) instead of a frozenset for the join, and keep SENTENCE_ENDERS as the frozenset for fast membership tests:
_SENTENCE_ENDERS_STR = '.?!。!?؟۔।॥' # stable order for regex construction
SENTENCE_ENDERS = frozenset(_SENTENCE_ENDERS_STR) # O(1) `in` checks
_SENTENCE_ENDERS_CLASS = '[' + re.escape(_SENTENCE_ENDERS_STR) + ']'
_SENTENCE_SPLIT_RE = re.compile(r'(?<=' + _SENTENCE_ENDERS_CLASS + r')\s*')
_SENTENCE_FINDALL_RE = re.compile(
r'[^' + re.escape(_SENTENCE_ENDERS_STR) + r']+(?:' + _SENTENCE_ENDERS_CLASS + r'\s*|\s*$)'
)
Live Audio Test Evidence — PR #6237 (Issue #6189)Setup: Local dev backend (uvicorn, port 10160) running PR branch code with SENTENCE_ENDERS changes. 5-min French TTS audio streamed via WebSocket Backend logs confirm translation coordinator active: Result: Both cases completed with 33 unique segments, 0 errors. French sentence boundaries (periods Case 1: French audio WITH translation (language=auto)
Full transcript segments (33 segments)Case 2: French audio WITHOUT translation (language=fr)
Full transcript segments (33 segments)Unit test results (20 new tests, 205 total passing)by AI for @beastoin |
|
lgtm |
Fixes #6189.
Sentence splitting and segment combination logic hardcoded English punctuation
.!?as the only sentence-ending markers. This broke non-English languages: Chinese sentences ran together (。 not recognized), Hindi segments never split on ।, Arabic ؟ was ignored.Changes:
SENTENCE_ENDERSfrozenset with Unicode sentence-ending punctuation: English (.!?), CJK (。!?), Arabic/Urdu (؟۔), Hindi/Sanskrit (।॥)SENTENCE_SPLIT_RE,SENTENCE_FINDALL_RE) built from the constant.?!checks incombine_segments()helpers withSENTENCE_ENDERS_is_sentence_complete()— replacetext[0].isupper()withnot _starts_with_lowercase_cased()to handle caseless scripts (CJK, Arabic, Hindi)split_into_sentences()in translation.py to use pre-compiled Unicode-aware regex (top-level import)_is_text_stable()and_compute_stability_signals()in translation_coordinator.pyReview cycle fixes:
_SENTENCE_*symbols to publicSENTENCE_*for clean cross-module importsTests: 205 passed (22 + 86 + 77 existing + 20 new non-English boundary tests)
Files changed:
backend/models/transcript_segment.py— SENTENCE_ENDERS constant, regex helpers, 6 check replacementsbackend/utils/translation.py— split_into_sentences() regex update, top-level importbackend/utils/translation_coordinator.py— stability check updatesbackend/tests/unit/test_transcript_segment.py— 8 new testsbackend/tests/unit/test_translation_optimization.py— 6 new testsbackend/tests/unit/test_translation_cost_optimization.py— 6 new testsRisk: Conservative approach — only added well-established sentence enders for supported locales. Avoided ambiguous marks (Greek
;, Armenian:, Spanish¡¿) that could over-split.by AI for @beastoin