Add Korean stemmer #268
Conversation
| do remove_mixed_tokens | ||
| /* Skip stemming for tokens with 2 or fewer characters. */ | ||
| $ascii_count = len | ||
| ($ascii_count > 2) |
There's no need to assign to a variable like this - you can replace these two lines with:
$(len > 2)
|
Please see the "Adding a new stemming algorithm" section. You seem to be missing at least the PR in snowball-data adding a sample vocabulary and list of expected stems, and the PR in snowball-website adding documentation of the algorithm. Also:
|
Got it. I will check it and try to make the PR again. Thanks for your great advice. |
| '{U+ADF8}{U+B7EC}{U+BBC0}{U+B85C}' (atlimit delete) /* 그러므로 */ | ||
| '{U+B2E4}{U+B9CC}' (atlimit delete) /* 다만 */ | ||
| ) | ||
| ) |
This replaces all these words with an empty stem, which seems like stop-word removal rather than stemming. We assume stopword removal is handled as a separate step outside of Snowball; we also try to avoid producing empty stems.
(The "porter" algorithm produces an empty stem for the input "s", but it's intended as a reference implementation of the algorithm described in Martin's 1980 paper so it does what the paper says to do with "s"; the "english" stemmer is the one we recommend for general use and does not produce an empty stem.)
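To illustrate the division of labour being suggested here, a minimal Python sketch of stopword removal as a separate pre-stemming step (the stopword set below is just the two words from the quoted diff, not a real stop.txt list):

```python
# Hypothetical sketch: drop Korean stopwords in a separate step before
# stemming, so the stemmer itself never has to emit an empty stem.
KOREAN_STOPWORDS = {"그러므로", "다만"}  # illustrative subset only

def remove_stopwords(tokens):
    """Filter out stopwords; remaining tokens go on to the stemmer."""
    return [t for t in tokens if t not in KOREAN_STOPWORDS]

print(remove_stopwords(["다만", "한반도", "그러므로"]))  # → ['한반도']
```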
I'd suggest instead we create a Korean stop.txt which we can make available on the website, similar to:
Given the big differences between English (and Latin-script variants) and Korean, I am thinking of a new way of separating stop words from nouns by adding a noun dictionary.
| define remove_particles as ( | ||
| [substring] among ( | ||
| '{U+D55C}{U+BC18}{U+B3C4}' /* 한반도: keep lexical "도" */ | ||
| (atlimit) |
I'm wondering if this is working as you intend.
The among will find the longest matching substring, then execute its action.
So if a word ends -한반도 (but has other characters before that) it will check atlimit here, and that will fail. It won't remove -도 via the entry below. Is that what should happen?
Also, if the word is exactly 한반도, we match here, atlimit signals t and we return to the repeat remove_particles loop. We signalled t so the loop continues and we get called a second time. Fortunately the first call advanced the cursor and it's now at the limit so none of the strings can match and we avoid an infinite loop.
Assuming we shouldn't remove -도 from any word ending -한반도 except 한반도, it'd be better to replace the atlimit with false - then this routine will signal f for the word 한반도 and the repeat will exit without the second call to this routine.
If we should remove -도 for these cases, I can suggest a way to achieve that.
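The longest-match-then-act behaviour described above can be modelled outside Snowball. This is a toy Python model (not the Snowball implementation; the entry table is invented for illustration) showing why the 한반도 entry blocks -도 removal for longer words ending in -한반도:

```python
# Toy model of `[substring] among (...)`: the longest matching suffix is
# chosen, and ONLY that entry's action runs -- there is no fallback to a
# shorter entry if the action fails.
ENTRIES = {
    "한반도": "atlimit",  # succeed only if the match covers the whole word
    "도": "delete",
}

def remove_particle(word):
    """Return (new_word, signal), loosely mimicking a Snowball routine."""
    for suffix in sorted(ENTRIES, key=len, reverse=True):
        if word.endswith(suffix):
            if ENTRIES[suffix] == "atlimit":
                # models `atlimit`: t only when nothing precedes the match;
                # on failure we do NOT fall through to the '도' entry
                return word, word == suffix
            return word[: -len(suffix)], True
    return word, False

print(remove_particle("한반도"))      # ('한반도', True)  -- atlimit succeeds
print(remove_particle("남부한반도"))  # ('남부한반도', False) -- '도' not stripped
print(remove_particle("남부도"))      # ('남부', True)
```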
You are right. I have checked that this rule works only if the grammar is precisely correct, so I am trying to fold this potential bug into the previous issue.
| '{U+ACE0}' (not atlimit <- '{U+B2E4}') /* 고 -> 다 */ | ||
| '{U+BA74}' (not atlimit <- '{U+B2E4}') /* 면 -> 다 */ | ||
| '{U+C9C0}' (not atlimit <- '{U+B2E4}') /* 지 -> 다 */ | ||
| '{U+B2E4}' /* already dictionary form */ |
This final entry is a no-op (we call this routine with do remove_predicate_endings so any signal or cursor movement is reverted by the do). If it's useful to note, better to do it in a comment.
I will check it and then ask for your opinion again.
| ) | ||
| /* Mixed token: contains ASCII alnum, but not only ASCII alnum. */ | ||
| ($ascii_count > 0 and $ascii_count < len) | ||
| repeat ( [next] delete ) |
This seems to return an empty stem for any input with a mix of ASCII alphanumerics and characters which aren't ASCII alphanumerics.
As noted above we try to avoid empty stems in general, but this also seems unhelpful as some foreign names will get eliminated, but in a fairly arbitrary way. For example, if Korean text refers to Renée Pohlmann (co-author of the Dutch stemmer), then we get:
$ printf '%s\n' renée pohlmann|./stemwords -l ko -p2
renée
pohlmann pohlmann
(This happens because é isn't an ASCII letter.)
I assume the intent was for this to actually trigger for a word with a mix of Korean and non-Korean characters, but even then is it really helpful to make such words non-searchable?
It probably makes more sense to address this by splitting words between Korean and non-Korean letters before calling Snowball, but if someone building an information retrieval system really wants the behaviour currently implemented here, that seems better handled as a type of stopword removal, so separate to stemming.
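A minimal sketch of that pre-splitting idea, assuming a tokenizer stage before Snowball (the regex only covers precomposed Hangul syllables U+AC00..U+D7A3, which is a simplification):

```python
import re

# Hypothetical pre-tokenization step: split a token into runs of Hangul
# and non-Hangul characters before handing the Hangul parts to the
# stemmer, instead of deleting mixed tokens inside the stemmer itself.
HANGUL_RUN = re.compile(r"[\uAC00-\uD7A3]+|[^\uAC00-\uD7A3]+")

def split_scripts(token):
    return HANGUL_RUN.findall(token)

print(split_scripts("IT강국"))  # → ['IT', '강국']
print(split_scripts("renée"))  # → ['renée'] -- kept whole, not eliminated
```

With this in place, a name like renée never reaches the stemmer as a "mixed" token, so it stays searchable.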
As people use newly coined words that mix multiple languages, this becomes a big issue for stemming. I am still studying it before making a decision. It was my first attempt, and I didn't think about non-English words getting eliminated. Thanks.
My guess would be that we probably wouldn't need to do anything very special for mixed-language words - if they end with Latin characters, the stemmer won't match any Korean suffix and so won't do anything; if they end with a Korean suffix, it's not really very different from a fully Korean neologism and we can probably just remove it without really caring what alphabet the stem is in.
|
I'm putting this on the 3.2.0 milestone, as it's high time we made another release (it's just under a year since the last) and this lacks a vocabulary list and website description which are requirements for merging. If it's ready to merge before 3.1.0 is released I'm happy to reconsider. |
|
@ojwb No worries. I am not in a hurry to get it merged for the upcoming release. I would like to make it better and will let you know. |
Summary
This PR adds a Korean stemmer along with documentation and tests for the new rules.
Changes
Created algorithms/korean.sbl:
Skip stemming for tokens with 2 or fewer characters, so such short tokens keep their distinct meanings (e.g. in their vector representations).
Preserve proper nouns, with '한반도' as an example; further entries can be added whenever a new unique noun is identified. This rule does not strip the lexical final '도' from such words.
Added metadata comment (Version: v0.8, Date: 2026-03-13, Author: Bonghwan Kim).
Added Korean regression cases in tests/stemtest.c:
Added documentation:
README-ko.md with Korean stemmer notes and examples.
README.rst with an English summary section for Korean stemmer behavior.
Motivation
I needed Korean support in Snowball for use with Redis.
Validation
Ran make check_stemtest successfully.
Manual checks with stemwords -l korean -p2 on Wikipedia-derived samples (한글, 대한민국) confirmed:
short tokens are not stemmed
expected particle stripping still works for longer tokens (e.g., 남부에 -> 남부).
Notes
Test artifact files under tmp/ are not part of this PR.