Description
Currently, the KoreanTokenizer in the Nori module segments numeric values containing decimal points into multiple separate tokens.
For example, the phrase "10.1인치 모니터" is currently tokenized as follows (when discardPunctuation = false):
["10", ".", "1", "인치", "모니터"]
Why this is important (e.g., media/news use cases)
In news datasets, numeric precision is vital for accurate information retrieval. Key values such as economic growth rates ("3.5%"), poll results ("42.1%"), or exchange rates ("1532.26원") are currently fragmented.
This leads to poor search relevance for data-driven reporting, because specific numeric measurements or versions become difficult to search for (e.g., a query for "10.1" might match "10" and "1" as separate tokens). Treating these values as atomic units is essential for preserving the semantic integrity of the data.
Proposed Enhancement
I propose adding a new configuration option, keepDecimalPoint (boolean), to the KoreanTokenizer.
- Behavior: When set to true, the tokenizer treats a sequence of [Digit][Dot][Digit] as a single token during the lattice construction phase.
- Default value: false (to maintain backward compatibility).
Example of desired output (keepDecimalPoint=true)
- Input: "10.1인치 모니터"
- Output: ["10.1", "인치", "모니터"]
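To make the desired behavior concrete, here is a minimal standalone sketch of the [Digit][Dot][Digit] merging rule. It is an illustration only: it operates on an already-produced token list rather than inside the lattice construction phase, it does not use any Lucene API, and the class name DecimalTokenMerger and the method merge are hypothetical. It also assumes ASCII digits and a plain "." as the separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates the proposed
// keepDecimalPoint semantics on a plain list of token strings.
public class DecimalTokenMerger {

    // True if the token is non-empty and contains only ASCII digits 0-9.
    public static boolean isDigits(String token) {
        return !token.isEmpty() && token.chars().allMatch(c -> c >= '0' && c <= '9');
    }

    // Scans left to right and joins each [digits] "." [digits] triple into one token.
    // With keepDecimalPoint == false the tokens are returned unchanged (the default).
    public static List<String> merge(List<String> tokens, boolean keepDecimalPoint) {
        if (!keepDecimalPoint) {
            return new ArrayList<>(tokens);
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && ".".equals(tokens.get(i + 1))
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3; // consume the whole triple
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Token list as currently produced with discardPunctuation = false.
        List<String> current = List.of("10", ".", "1", "인치", "모니터");
        System.out.println(merge(current, true));  // [10.1, 인치, 모니터]
        System.out.println(merge(current, false)); // [10, ., 1, 인치, 모니터]
    }
}
```

A post-processing merge like this only works when punctuation tokens are kept, which is one reason handling the pattern inside the tokenizer itself, as proposed, may be preferable.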