french: Add “ç” to the list of elision letters #281
dscorbett wants to merge 1 commit into snowballstem:master
ç’a isn't a great motivation for doing this, as we're stemming to improve recall in information retrieval rather than as a theoretical exercise in computational linguistics, and a seems to carry little useful meaning (it's not currently in the stopword list we offer, which I'm a little surprised by, as many other forms of avoir are). The extra cost of checking for it is fairly small but not zero.

Grepping a frequency list from fr.wikipedia I have to hand (from a dump dated 20240102), I get these occurring more than once:

Most seem to be other forms of avoir; en, que and te are stopwords; ui doesn't appear to even be a French word (a typo for çui, perhaps?).

There actually seems to be more of a case for handling z' (https://en.wiktionary.org/wiki/z%27#French), as that is actually followed by non-stopwords in this frequency list. Entries occurring 10 or more times:
“Ç’” is the rare but unambiguous elided form of “ce” or “ça” before hard vowels, as in “ce”/“ça” + “a” → “ç’a”. See snowballstem/snowball-data#38 and snowballstem/snowball-website#50.
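
For anyone wanting to repeat the frequency-list check discussed in this thread, here is a rough Python sketch. It is not the Snowball implementation: the prefix list `ELISION_PREFIXES`, the helper names `split_elision` and `words_after_prefix`, and the `word<TAB>count` input format are all assumptions made for illustration. It splits a token at an elision apostrophe and tallies what follows a given prefix.

```python
import re
from collections import Counter

# Hypothetical prefix list for illustration only; it is NOT copied from
# the Snowball French stemmer. "qu" comes first so it wins over shorter prefixes.
ELISION_PREFIXES = ("qu", "c", "d", "j", "l", "m", "n", "s", "t", "ç")
APOSTROPHES = "'\u2019"  # ASCII apostrophe and right single quotation mark

_ELISION_RE = re.compile(
    r"^(%s)[%s](.+)$" % ("|".join(ELISION_PREFIXES), APOSTROPHES),
    re.IGNORECASE,
)


def split_elision(token):
    """Return (prefix, rest) if token starts with an elided prefix, else (None, token)."""
    m = _ELISION_RE.match(token)
    if m:
        return m.group(1).lower(), m.group(2)
    return None, token


def words_after_prefix(freq_lines, prefix, min_count=2):
    """Tally what follows `prefix` in 'word<TAB>count' lines (an assumed format)."""
    tally = Counter()
    for line in freq_lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if not count.isdigit():
            continue  # skip lines that do not match the assumed format
        found, rest = split_elision(word)
        if found == prefix:
            tally[rest.lower()] += int(count)
    # Keep only continuations seen at least min_count times in total.
    return {w: n for w, n in tally.items() if n >= min_count}
```

Calling `words_after_prefix(lines, "ç")` on such a list would give the continuations of ç’ seen more than once; passing a different prefix and `min_count=10` would mirror the second check. The sketch does not normalise decomposed forms (c followed by a combining cedilla), which a real list might contain.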