Add Korean stemmer #268

Open

nethippo wants to merge 4 commits into snowballstem:master from nethippo:master

Conversation

nethippo commented Mar 13, 2026

Summary
This PR adds a Korean stemmer, along with documentation and tests for the new rules.

Changes
Created algorithms/korean.sbl:
  • Skip stemming for tokens of 2 or fewer characters, so that short tokens keep their distinct meanings (and distinct vector values).
  • Preserve proper nouns such as '한반도'; further entries can be added whenever a new noun is identified as unique. This rule does not strip the final '도', which here is lexical rather than a particle. (A minimal sketch of both rules appears after this list.)

Added a metadata comment (Version: v0.8, Date: 2026-03-13, Author: Bonghwan Kim).
Added Korean regression cases in tests/stemtest.c.

Added documentation:
  • README-ko.md with Korean stemmer notes and examples.
  • README.rst with an English summary section describing the Korean stemmer behavior.
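
The two korean.sbl rules above, paraphrased as a minimal self-contained Snowball sketch (not the PR's actual code: the real file defines more routines, and the '에' entry here is just a hypothetical example particle):

stringescapes {}

routines ( remove_particles )
externals ( stem )

backwardmode (
    define remove_particles as (
        [substring] among (
            '{U+D55C}{U+BC18}{U+B3C4}' (atlimit)  /* 한반도: keep the word whole */
            '{U+C5D0}' (delete)                   /* 에: hypothetical example particle */
        )
    )
)

define stem as (
    $(len > 2)                  /* rule 1: skip tokens of 2 or fewer characters */
    backwards remove_particles  /* rule 2: strip particles; 한반도 is preserved */
)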

Motivation
I needed Korean support in Snowball for use with Redis.

Validation
Ran make check_stemtest successfully.
Manual checks with stemwords -l korean -p2 on Wikipedia-derived samples (한글, 대한민국) confirmed:
  • short tokens are not stemmed;
  • expected particle stripping still works for longer tokens (e.g., 남부에 -> 남부).

Notes
Test artifact files under tmp/ are not part of this PR.

Comment thread: algorithms/korean.sbl
do remove_mixed_tokens
/* Skip stemming for tokens with 2 or fewer characters. */
$ascii_count = len
($ascii_count > 2)
Member

There's no need to assign to a variable like this - you can replace these two lines with:

$(len > 2)
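
With that change applied, the quoted lines become:

do remove_mixed_tokens
/* Skip stemming for tokens with 2 or fewer characters. */
$(len > 2)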

Member

ojwb commented Apr 7, 2026

Please see the "Adding a new stemming algorithm" section in CONTRIBUTING.rst for what's required for a new stemming algorithm.

You seem to be missing at least the PR in snowball-data adding a sample vocabulary and list of expected stems, and the PR in snowball-website adding documentation of the algorithm.

Also:

  • we don't have documentation of other stemmers in README.rst or files like README-*.md - that information should go in a new page on the website, like the existing stemmers have.
  • stemtest.c is meant for regression tests for bugs which aren't triggered by real words. General testing for Korean should instead be done by adding a sample vocabulary and list of expected stems in snowball-data, which then gets used by make check, make check_python, etc which gives us testing of the generated stemmer for all the supported target programming languages (whereas your addition to stemtest.c only tests the generated C stemmer).

Author

nethippo commented Apr 9, 2026

Please see the "Adding a new stemming algorithm" section in CONTRIBUTING.rst for what's required for a new stemming algorithm.

You seem to be missing at least the PR in snowball-data adding a sample vocabulary and list of expected stems, and the PR in snowball-website adding documentation of the algorithm.

Also:

  • we don't have documentation of other stemmers in README.rst or files like README-*.md - that information should go in a new page on the website, like the existing stemmers have.
  • stemtest.c is meant for regression tests for bugs which aren't triggered by real words. General testing for Korean should instead be done by adding a sample vocabulary and list of expected stems in snowball-data, which then gets used by make check, make check_python, etc which gives us testing of the generated stemmer for all the supported target programming languages (whereas your addition to stemtest.c only tests the generated C stemmer).

Got it. I will check it and try the PR again. Thanks for your great advice.

ojwb added a commit that referenced this pull request Apr 9, 2026
Comment thread: algorithms/korean.sbl
'{U+ADF8}{U+B7EC}{U+BBC0}{U+B85C}' (atlimit delete) /* 그러므로 */
'{U+B2E4}{U+B9CC}' (atlimit delete) /* 다만 */
)
)
Member

This replaces all these words with an empty stem, which seems like stop-word removal rather than stemming. We assume stopword removal is handled as a separate step outside of Snowball; we also try to avoid producing empty stems.

(The "porter" algorithm produces an empty stem for input s, but it's intended as a reference implementation of the algorithm described in Martin's 1980 paper so it does what the paper says to s; the "english" stemmer is the one we recommend for general use and does not produce an empty stem.)

Member

I'd suggest instead we create a Korean stop.txt which we can make available on the website, similar to:

https://snowballstem.org/algorithms/english/stop.txt
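
A minimal sketch of such a file, seeded with the two words this among currently deletes:

그러므로
다만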

Author

Given the big differences between English (and other Latin-script languages) and Korean, I am thinking about a new way of separating stop words from nouns by adding a noun dictionary.

Comment thread: algorithms/korean.sbl
define remove_particles as (
[substring] among (
'{U+D55C}{U+BC18}{U+B3C4}' /* 한반도: keep lexical "도" */
(atlimit)
Member

I'm wondering if this is working as you intend.

The among will find the longest matching substring, then execute its action.

So if a word ends -한반도 (but has other characters before that) it will check atlimit here, and that will fail. It won't remove -도 via the entry below. Is that what should happen?

Also, if the word is exactly 한반도, we match here, atlimit signals t and we return to the repeat remove_particles loop. We signalled t so the loop continues and we get called a second time. Fortunately the first call advanced the cursor and it's now at the limit so none of the strings can match and we avoid an infinite loop.

Assuming we shouldn't remove -도 from any word ending -한반도 except 한반도, it'd be better to replace the atlimit with false - then this routine will signal f for the word 한반도 and the repeat will exit without the second call to this routine.

If we should remove -도 for these cases, I can suggest a way to achieve that.
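
With that replacement, the entries would read something like this (a sketch; the '{U+B3C4}' line stands in for the PR's actual particle entry mentioned above, assumed to delete):

'{U+D55C}{U+BC18}{U+B3C4}' (false) /* 한반도: exact word only; signal f so the repeat exits */
'{U+B3C4}' (delete) /* 도: particle entry (assumed) */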

Author

You are right. I have confirmed that this rule only works when the word form is exactly right. So I am going to fold this potential bug into the previous issue.

Comment thread: algorithms/korean.sbl
'{U+ACE0}' (not atlimit <- '{U+B2E4}') /* 고 -> 다 */
'{U+BA74}' (not atlimit <- '{U+B2E4}') /* 면 -> 다 */
'{U+C9C0}' (not atlimit <- '{U+B2E4}') /* 지 -> 다 */
'{U+B2E4}' /* already dictionary form */
Member

This final entry is a no-op (we call this routine with do remove_predicate_endings so any signal or cursor movement is reverted by the do). If it's useful to note, better to do it in a comment.
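
That is, something like this (a sketch):

'{U+ACE0}' (not atlimit <- '{U+B2E4}') /* 고 -> 다 */
'{U+BA74}' (not atlimit <- '{U+B2E4}') /* 면 -> 다 */
'{U+C9C0}' (not atlimit <- '{U+B2E4}') /* 지 -> 다 */
/* '{U+B2E4}' (다) needs no entry: it is already the dictionary form. */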

Author

I will check it and then ask for your opinion again.

Comment thread: algorithms/korean.sbl
)
/* Mixed token: contains ASCII alnum, but not only ASCII alnum. */
($ascii_count > 0 and $ascii_count < len)
repeat ( [next] delete )
Member

This seems to return an empty stem for any input with a mix of ASCII alphanumerics and characters which aren't ASCII alphanumerics.

As noted above we try to avoid empty stems in general, but this also seems unhelpful as some foreign names will get eliminated, and in a fairly arbitrary way. For example, if Korean text refers to Renée Pohlmann (co-author of the Dutch stemmer), then we get:

$ printf '%s\n' renée pohlmann|./stemwords -l ko -p2
renée
pohlmann                      pohlmann

(This happens because é isn't an ASCII letter.)

I assume the intent was for this to actually trigger for a word with a mix of Korean and non-Korean characters, but even then is it really helpful to make such words non-searchable?

It probably makes more sense to address this by splitting words between Korean and non-Korean letters before calling Snowball, but if someone building an information retrieval system really wants the behaviour currently implemented here, that seems better handled as a type of stopword removal, so separate to stemming.

Author

As people coin new words by mixing multiple languages, this becomes a big issue for stemming. I am still studying it before making a decision. This was my first attempt, and I hadn't thought about non-English words getting eliminated. Thanks.

Member

My guess would be that we probably wouldn't need to do anything very special for mixed language words - if they end with Latin characters, the stemmer won't match any Korean suffix and so won't do anything; if they end with a Korean suffix, it's not really very different to a fully Korean neologism and we can probably just remove it without really caring what alphabet the stem is in.

ojwb added this to the 3.2.0 milestone Apr 23, 2026
Member

ojwb commented Apr 23, 2026

I'm putting this on the 3.2.0 milestone, as it's high time we made another release (it's just under a year since the last) and this lacks a vocabulary list and website description, which are requirements for merging.

If it's ready to merge before 3.1.0 is released I'm happy to reconsider.

@nethippo
Author

@ojwb No worries. I am not in a hurry to get this merged for the upcoming release. I would like to make it better and will let you know.
