
french: Add “ç” to the list of elision letters#281

Open
dscorbett wants to merge 1 commit into snowballstem:master from dscorbett:fr-c-cedilla-apostrophe


@dscorbett
Contributor

“Ç’” is the rare but unambiguous elided form of “ce” or “ça” before hard vowels, as in “ce”/“ça” + “a” → “ç’a”. See snowballstem/snowball-data#38 and snowballstem/snowball-website#50.
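To make the proposed behaviour concrete, here is a minimal Python sketch of elision stripping. It is not the Snowball implementation: the helper `strip_elision` and the prefix list `ELISION_PREFIXES` are hypothetical names chosen for illustration, and the list's contents are an assumption rather than the stemmer's actual set.

```python
# ASCII apostrophe and typographic apostrophe (U+2019).
APOSTROPHES = ("'", "\u2019")

# Hypothetical elision prefixes for illustration only;
# this is NOT claimed to be the French stemmer's actual list.
ELISION_PREFIXES = ("c", "d", "j", "l", "m", "n", "s", "t", "qu")


def strip_elision(word, prefixes=ELISION_PREFIXES):
    """Return word without a leading elided prefix, or unchanged if none matches."""
    for prefix in prefixes:
        for apostrophe in APOSTROPHES:
            marker = prefix + apostrophe
            # Compare case-insensitively, and only strip when something
            # follows the apostrophe.
            if word[:len(marker)].lower() == marker and len(word) > len(marker):
                return word[len(marker):]
    return word


# Adding "ç" to the list is the change this sketch is meant to illustrate.
with_cedilla = ELISION_PREFIXES + ("ç",)

print(strip_elision("ç\u2019a"))                # left unchanged without "ç"
print(strip_elision("ç\u2019a", with_cedilla))  # elision removed
```

With "ç" in the list, "ç’a" and "a" reduce to the same token; without it, "ç’a" is left as a separate term.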

@ojwb
Member

ojwb commented Apr 17, 2026

“ç’a” isn't a great motivation for doing this: we're stemming to improve recall in information retrieval rather than as a theoretical exercise in computational linguistics, and “a” seems to carry little useful meaning (it's not currently in the stopword list we offer, which I'm a little surprised by, as many other forms of avoir are). The extra cost of checking for “ç’” is fairly small but not zero.

Grepping a frequency list from fr.wikipedia that I have to hand (from a dump dated 20240102), I get these entries occurring more than once:

116	ç'aurait
67	ç'a
39	ç'est
19	ç'avait
6	ç'ait
6	ç'en
5	ç''a
4	ç'eût
3	ç'que
3	ç'''a
3	ç'eut
3	ç'aura
3	ç'ui
2	ç'te

Most seem to be other forms of avoir; en, que and te are stopwords; ui doesn't even appear to be a French word (a typo for çui, perhaps?).
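The check described above can be sketched roughly as follows in Python. This is an illustration, not the command actually used: the function `split_by_stopword` is a hypothetical helper, the tab-separated `count<TAB>token` layout is assumed from the listing, and the stopword set contains only the three words named in this comment, so other remainders are reported as non-stopwords even where a fuller list might flag them. Treating extra leading apostrophes as noise is also an assumption.

```python
# Frequency entries copied from the listing above (count<TAB>token).
SAMPLE = (
    "116\tç'aurait\n67\tç'a\n39\tç'est\n19\tç'avait\n6\tç'ait\n6\tç'en\n"
    "5\tç''a\n4\tç'eût\n3\tç'que\n3\tç'''a\n3\tç'eut\n3\tç'aura\n"
    "3\tç'ui\n2\tç'te\n"
)

# Only the stopwords named in the comment; a real list would be larger.
STOPWORDS = {"en", "que", "te"}


def split_by_stopword(lines, prefix, stopwords, min_count=2):
    """Split entries starting with prefix into (non-stopword, stopword) remainders."""
    kept, stopped = [], []
    for line in lines:
        count_text, _, token = line.partition("\t")
        if not token.startswith(prefix) or int(count_text) < min_count:
            continue
        # Assumption: extra apostrophes after the prefix are noise, so drop them.
        remainder = token[len(prefix):].lstrip("'")
        target = stopped if remainder in stopwords else kept
        target.append((int(count_text), remainder))
    return kept, stopped


kept, stopped = split_by_stopword(SAMPLE.splitlines(), "ç'", STOPWORDS)
print("stopwords after prefix:", stopped)
print("other remainders:", kept)
```

Raising `min_count` or changing `prefix` lets the same sketch be applied to other candidate elision prefixes in a frequency list of the same shape.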

There actually seems to be more of a case for handling z' (https://en.wiktionary.org/wiki/z%27#French), as that is actually followed by non-stopwords in this frequency list. Entries occurring 10 or more times:

86	z'avez
79	z'ailées
57	z'êtes
47	z'en
44	z'ont
23	z'enfants
22	z'graggen
15	z'éditions
13	z'yeux
13	z'n
12	z'ai
12	z'à
11	z'auriez
11	z'y
11	z'ahidi
10	z'oreille
10	z'aussi
10	z'azimut
10	z'nuff
