Add Korean stemmer #268
Conversation
| do remove_mixed_tokens | ||
| /* Skip stemming for tokens with 2 or fewer characters. */ | ||
| $ascii_count = len | ||
| ($ascii_count > 2) |
There's no need to assign to a variable like this - you can replace these two lines with:
$(len > 2)
|
Please see the "Adding a new stemming algorithm" section. You seem to be missing at least the PR in snowball-data adding a sample vocabulary and list of expected stems, and the PR in snowball-website adding documentation of the algorithm. Also:
|
Got it. I will check it and try to make the PR again. Thanks for your great advice. |
| '{U+ADF8}{U+B7EC}{U+BBC0}{U+B85C}' (atlimit delete) /* 그러므로 */ | ||
| '{U+B2E4}{U+B9CC}' (atlimit delete) /* 다만 */ | ||
| ) | ||
| ) |
This replaces all these words with an empty stem, which seems like stop-word removal rather than stemming. We assume stopword removal is handled as a separate step outside of Snowball; we also try to avoid producing empty stems.
(The "porter" algorithm produces an empty stem for the input "s", but it's intended as a reference implementation of the algorithm described in Martin's 1980 paper so it does what the paper says to do with "s"; the "english" stemmer is the one we recommend for general use and does not produce an empty stem.)
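To illustrate the division of labour being suggested here, a minimal Python sketch of stopword removal as a separate pre-stemming step (the stopword set below is just the two words from the quoted diff, not a real stop.txt list):

```python
# Hypothetical sketch: drop Korean stopwords in a separate step before
# stemming, so the stemmer itself never has to emit an empty stem.
KOREAN_STOPWORDS = {"그러므로", "다만"}  # illustrative subset only

def remove_stopwords(tokens):
    """Filter out stopwords; remaining tokens go on to the stemmer."""
    return [t for t in tokens if t not in KOREAN_STOPWORDS]

print(remove_stopwords(["다만", "한반도", "그러므로"]))  # → ['한반도']
```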
I'd suggest instead we create a Korean stop.txt which we can make available on the website, similar to:
Given the big differences between English (and Latin-script variants) and Korean, I am thinking of a new way of separating stop words from nouns by adding a noun dictionary.
| define remove_particles as ( | ||
| [substring] among ( | ||
| '{U+D55C}{U+BC18}{U+B3C4}' /* 한반도: keep lexical "도" */ | ||
| (atlimit) |
I'm wondering if this is working as you intend.
The among will find the longest matching substring, then execute its action.
So if a word ends -한반도 (but has other characters before that) it will check atlimit here, and that will fail. It won't remove -도 via the entry below. Is that what should happen?
Also, if the word is exactly 한반도, we match here, atlimit signals t and we return to the repeat remove_particles loop. We signalled t so the loop continues and we get called a second time. Fortunately the first call advanced the cursor and it's now at the limit so none of the strings can match and we avoid an infinite loop.
Assuming we shouldn't remove -도 from any word ending -한반도 except 한반도, it'd be better to replace the atlimit with false - then this routine will signal f for the word 한반도 and the repeat will exit without the second call to this routine.
If we should remove -도 for these cases, I can suggest a way to achieve that.
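The longest-match-then-act behaviour described above can be modelled outside Snowball. This is a toy Python model (not the Snowball implementation; the entry table is invented for illustration) showing why the 한반도 entry blocks -도 removal for longer words ending in -한반도:

```python
# Toy model of `[substring] among (...)`: the longest matching suffix is
# chosen, and ONLY that entry's action runs -- there is no fallback to a
# shorter entry if the action fails.
ENTRIES = {
    "한반도": "atlimit",  # succeed only if the match covers the whole word
    "도": "delete",
}

def remove_particle(word):
    """Return (new_word, signal), loosely mimicking a Snowball routine."""
    for suffix in sorted(ENTRIES, key=len, reverse=True):
        if word.endswith(suffix):
            if ENTRIES[suffix] == "atlimit":
                # models `atlimit`: t only when nothing precedes the match;
                # on failure we do NOT fall through to the '도' entry
                return word, word == suffix
            return word[: -len(suffix)], True
    return word, False

print(remove_particle("한반도"))      # ('한반도', True)  -- atlimit succeeds
print(remove_particle("남부한반도"))  # ('남부한반도', False) -- '도' not stripped
print(remove_particle("남부도"))      # ('남부', True)
```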
You are right. I have checked that this rule works only if the grammar is precisely correct, so I am trying to fold this potential bug into the previous issue.
| '{U+ACE0}' (not atlimit <- '{U+B2E4}') /* 고 -> 다 */ | ||
| '{U+BA74}' (not atlimit <- '{U+B2E4}') /* 면 -> 다 */ | ||
| '{U+C9C0}' (not atlimit <- '{U+B2E4}') /* 지 -> 다 */ | ||
| '{U+B2E4}' /* already dictionary form */ |
This final entry is a no-op (we call this routine with do remove_predicate_endings so any signal or cursor movement is reverted by the do). If it's useful to note, better to do it in a comment.
I will check it and then ask for your opinion again.
| ) | ||
| /* Mixed token: contains ASCII alnum, but not only ASCII alnum. */ | ||
| ($ascii_count > 0 and $ascii_count < len) | ||
| repeat ( [next] delete ) |
This seems to return an empty stem for any input with a mix of ASCII alphanumerics and characters which aren't ASCII alphanumerics.
As noted above we try to avoid empty stems in general, but this also seems unhelpful as some foreign names will get eliminated, but in a fairly arbitrary way. For example, if Korean text refers to Renée Pohlmann (co-author of the Dutch stemmer), then we get:
$ printf '%s\n' renée pohlmann|./stemwords -l ko -p2
renée
pohlmann pohlmann
(This happens because é isn't an ASCII letter.)
I assume the intent was for this to actually trigger for a word with a mix of Korean and non-Korean characters, but even then is it really helpful to make such words non-searchable?
It probably makes more sense to address this by splitting words between Korean and non-Korean letters before calling Snowball, but if someone building an information retrieval system really wants the behaviour currently implemented here, that seems better handled as a type of stopword removal, so separate to stemming.
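A minimal sketch of that pre-splitting idea, assuming a tokenizer stage before Snowball (the regex only covers precomposed Hangul syllables U+AC00..U+D7A3, which is a simplification):

```python
import re

# Hypothetical pre-tokenization step: split a token into runs of Hangul
# and non-Hangul characters before handing the Hangul parts to the
# stemmer, instead of deleting mixed tokens inside the stemmer itself.
HANGUL_RUN = re.compile(r"[\uAC00-\uD7A3]+|[^\uAC00-\uD7A3]+")

def split_scripts(token):
    return HANGUL_RUN.findall(token)

print(split_scripts("IT강국"))  # → ['IT', '강국']
print(split_scripts("renée"))  # → ['renée'] -- kept whole, not eliminated
```

With this in place, a name like renée never reaches the stemmer as a "mixed" token, so it stays searchable.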
As people use newly coined words that mix multiple languages, this becomes a big issue for stemming. I am still studying it before making a decision. It was my first attempt, and I didn't think about non-English words getting eliminated. Thanks.
My guess would be that we probably wouldn't need to do anything very special for mixed-language words - if they end with Latin characters, the stemmer won't match any Korean suffix and so won't do anything; if they end with a Korean suffix, it's not really very different from a fully Korean neologism and we can probably just remove it without really caring what alphabet the stem is in.
|
I'm putting this on the 3.2.0 milestone, as it's high time we made another release (it's just under a year since the last) and this lacks a vocabulary list and website description which are requirements for merging. If it's ready to merge before 3.1.0 is released I'm happy to reconsider. |
|
@ojwb No worries. I am not in a hurry to get it merged for the upcoming release. I would like to make it better and will let you know. |
Summary
This PR adds a Korean stemmer along with documentation and tests for the new rules.
Changes
Created algorithms/korean.sbl:
Skip stemming for tokens with 2 or fewer characters, so such short tokens keep their distinct meanings (e.g. in their vector representations).
Preserve proper nouns, with '한반도' as an example; further entries can be added whenever a new unique noun is identified. This rule does not strip the lexical final '도' from such words.
Added metadata comment (Version: v0.8, Date: 2026-03-13, Author: Bonghwan Kim).
Added Korean regression cases in tests/stemtest.c:
Added documentation:
README-ko.md with Korean stemmer notes and examples.
README.rst with an English summary section for Korean stemmer behavior.
Motivation
I needed Korean support in Snowball for use with Redis.
Validation
Ran make check_stemtest successfully.
Manual checks with stemwords -l korean -p2 on Wikipedia-derived samples (한글, 대한민국) confirmed:
short tokens are not stemmed
expected particle stripping still works for longer tokens (e.g., 남부에 -> 남부).
Notes
Test artifact files under tmp/ are not part of this PR.