Wire v1.37 tokenization config (textAnalyzer, stopwordPresets) through public types #429

Merged: g-despot merged 7 commits into main from tokenization-updates on May 14, 2026

g-despot (Contributor) commented Apr 30, 2026

Summary

Brings the TS client to parity with the Python client for Weaviate v1.37 tokenization config. Before this patch, users had to fall back to `as any` for per-property textAnalyzer, invertedIndex.stopwordPresets, and the /v1/tokenize stopwords / stopwordPresets fields.

Public surface:

  • TextAnalyzerConfig — new type used for both per-property textAnalyzer and tokenize.text({ analyzerConfig }). Ergonomic union: asciiFold: boolean | { ignore: string[] }.
  • InvertedIndexConfig.stopwordPresets — exposed on create / read / update, plus on the configure.invertedIndex(...) and reconfigure.invertedIndex(...) builders.
  • tokenize.text — now accepts stopwords (one-off block) and stopwordPresets (named catalog). Mutually exclusive — passing both rejects client-side with WeaviateInvalidInputError. Version-gated at >= 1.37.2.
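
The mutual-exclusion rule for tokenize.text can be illustrated with a small client-side guard. This is a minimal sketch, not the client's code: `TokenizeStopwordOptions`, `assertStopwordOptionsExclusive`, and both option shapes are assumptions made for illustration, and the `WeaviateInvalidInputError` class below is a local stand-in that only borrows the name mentioned above.

```typescript
// Local stand-in for the client's error class (assumption: only the name comes from the PR).
class WeaviateInvalidInputError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'WeaviateInvalidInputError';
  }
}

// Hypothetical option shapes; the real field types come from the generated schema.
interface TokenizeStopwordOptions {
  // One-off stopword block for a single request.
  stopwords?: { preset?: string; additions?: string[]; removals?: string[] };
  // Named catalog of presets (assumed here to map a name to a word list).
  stopwordPresets?: Record<string, string[]>;
}

// Throws before any request would be sent if both options are present.
function assertStopwordOptionsExclusive(opts: TokenizeStopwordOptions): void {
  if (opts.stopwords !== undefined && opts.stopwordPresets !== undefined) {
    throw new WeaviateInvalidInputError(
      'stopwords and stopwordPresets are mutually exclusive; pass only one.'
    );
  }
}
```

Checking before the request is sent keeps the failure local and avoids depending on how the server reports a conflicting payload.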

Schema: tools/refresh_schema.sh v1.37.2 refreshed src/openapi/schema.ts so TokenizeRequest carries stopwords (top-level) and the flat stopwordPresets shape. CI matrix bumped to 1.37.2.

Test plan

  • WEAVIATE_VERSION=1.37.2 npm run test:unit — 323/323 pass
  • npm run build / npm run lint — clean
  • Integration tests against live Weaviate 1.37.2:
    • test/tokenize/integration.test.ts: covers analyzerConfig, stopwords (preset+additions / additions-only / removals-only), stopwordPresets (named ref / builtin override), and mutex rejection. Inputs/outputs match the Python integration suite.
    • test/collections/tokenization/integration.test.ts: round-trips textAnalyzer and stopwordPresets through collection.config.get().

🤖 Generated with Claude Code

@g-despot g-despot requested a review from a team as a code owner April 30, 2026 11:41

@orca-security-eu orca-security-eu Bot left a comment

Orca Security Scan Summary

| Status | Check | High | Medium | Low | Info |
|---|---|---|---|---|---|
| Passed | Infrastructure as Code | 0 | 0 | 0 | 0 |
| Passed | SAST | 0 | 0 | 0 | 0 |
| Passed | Secrets | 0 | 0 | 0 | 0 |
| Passed | Vulnerabilities | 0 | 0 | 0 | 0 |

CI prettier flagged whitespace inside empty `() => { }` arrow bodies.
Strip to `() => {}` to match repo style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates the public TypeScript client types and (de)serialization to expose Weaviate v1.37’s per-property text-analysis configuration (textAnalyzer) and collection-level invertedIndex.stopwordPresets, and ensures the tokenize endpoint uses the same shared translation logic.

Changes:

  • Exposes TextAnalyzerConfig and wires it through collection property create/read types, with shared union↔wire translation helpers.
  • Exposes InvertedIndexConfig.stopwordPresets on schema create/read surfaces and maps it through config deserialization.
  • Updates tokenize endpoint typing/docs and CI matrix to target Weaviate 1.37.2, plus adds unit + integration coverage for round-tripping.
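
The union-to-wire translation mentioned in the first bullet might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in src/collections/config/utils.ts: the helper names `toWire` and `fromWire` are hypothetical, the wire field names are taken from the snippet quoted later in the review, the presence of `stopwordPreset` on the public type is assumed, and so is the rule that the object form of `asciiFold` implies folding is enabled.

```typescript
// Public, ergonomic shape (asciiFold union as described in the PR summary).
type TextAnalyzerConfig = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: string; // assumption: mirrored from the wire shape
};

// Assumed flat wire shape; field names come from the reviewed snippet.
type WireTextAnalyzerConfig = {
  asciiFold?: boolean;
  asciiFoldIgnore?: string[];
  stopwordPreset?: string;
};

// Hypothetical helper: flatten the union into separate wire fields.
function toWire(config: TextAnalyzerConfig): WireTextAnalyzerConfig {
  const out: WireTextAnalyzerConfig = {};
  if (typeof config.asciiFold === 'boolean') {
    out.asciiFold = config.asciiFold;
  } else if (config.asciiFold !== undefined) {
    // Assumption: the object form means "fold, except these characters".
    out.asciiFold = true;
    out.asciiFoldIgnore = [...config.asciiFold.ignore];
  }
  if (config.stopwordPreset !== undefined) out.stopwordPreset = config.stopwordPreset;
  return out;
}

// Hypothetical helper: rebuild the union from the flat wire fields.
function fromWire(wire: WireTextAnalyzerConfig): TextAnalyzerConfig {
  const out: TextAnalyzerConfig = {};
  if (wire.asciiFold && wire.asciiFoldIgnore !== undefined && wire.asciiFoldIgnore.length > 0) {
    out.asciiFold = { ignore: [...wire.asciiFoldIgnore] };
  } else if (wire.asciiFold !== undefined) {
    out.asciiFold = wire.asciiFold;
  }
  if (wire.stopwordPreset !== undefined) out.stopwordPreset = wire.stopwordPreset;
  return out;
}
```

Sharing one pair of helpers between schema creation, config.get(), and the tokenize endpoint means the union is interpreted in exactly one place.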

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| test/collections/tokenization/integration.test.ts | Adds integration coverage for schema-config round-tripping of textAnalyzer and stopwordPresets. |
| src/tokenize/index.ts | Switches tokenize analyzerConfig serialization to the shared translator and updates stopword preset typing. |
| src/collections/tokenization/unit.test.ts | Adds type-level tests pinning the public tokenization surface across schema refreshes. |
| src/collections/configure/types/base.ts | Wires textAnalyzer and stopwordPresets into public “configure/create/update” types. |
| src/collections/config/utils.ts | Introduces shared textAnalyzerConfigToWire / textAnalyzerConfigFromWire and plugs into schema create + config.get mapping. |
| src/collections/config/types/index.ts | Adds public TextAnalyzerConfig and exposes stopwordPresets + PropertyConfig.textAnalyzer. |
| .github/workflows/main.yaml | Updates CI matrix Weaviate 1.37 entry to 1.37.2. |

Comment thread src/collections/config/types/index.ts Outdated
Comment thread src/collections/config/utils.ts Outdated
Comment thread src/tokenize/index.ts
Comment thread test/collections/tokenization/integration.test.ts
Comment thread src/collections/config/utils.ts

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread src/tokenize/index.ts
Comment thread src/collections/config/utils.ts
Comment thread src/tokenize/index.ts Outdated
@g-despot g-despot requested a review from bevzzz May 11, 2026 09:22
Comment thread src/collections/tokenization/unit.test.ts Outdated
Comment thread src/openapi/schema.ts
@g-despot g-despot requested a review from bevzzz May 13, 2026 07:52
Comment thread test/tokenize/integration.test.ts Outdated
});
});

requireAtLeast(1, 37, 2).describe('tokenize stopwords / stopwordPresets (>= 1.37.2)', () => {
bevzzz (Collaborator) commented May 13, 2026

Suggested change
- requireAtLeast(1, 37, 2).describe('tokenize stopwords / stopwordPresets (>= 1.37.2)', () => {
+ requireAtLeast(1, 37, 2).describe('tokenize stopwords / stopwordPresets', () => {

nit: the clarification is redundant as requireAtLeast(1, 37, 2) says the same
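
For readers unfamiliar with this kind of version gate, here is a rough sketch of how such a helper could work. It is a hypothetical re-implementation: only the call shape `requireAtLeast(1, 37, 2).describe(...)` comes from the thread, while `parseVersion`, `isAtLeast`, `makeRequireAtLeast`, and the injected `run` / `skip` callbacks are assumptions, and the repo's real helper may differ.

```typescript
// A suite registrar such as a test framework's describe or describe.skip.
type Suite = (name: string, body: () => void) => void;

// Parse "major.minor.patch"; missing or non-numeric parts count as 0.
function parseVersion(v: string): [number, number, number] {
  const [major = 0, minor = 0, patch = 0] = v.split('.').map((p) => parseInt(p, 10) || 0);
  return [major, minor, patch];
}

// True when `current` is greater than or equal to major.minor.patch.
function isAtLeast(current: string, major: number, minor: number, patch: number): boolean {
  const [cMajor, cMinor, cPatch] = parseVersion(current);
  if (cMajor !== major) return cMajor > major;
  if (cMinor !== minor) return cMinor > minor;
  return cPatch >= patch;
}

// Hypothetical factory: pick the running or skipping registrar by version.
function makeRequireAtLeast(current: string, run: Suite, skip: Suite) {
  return (major: number, minor: number, patch: number) => ({
    describe: isAtLeast(current, major, minor, patch) ? run : skip,
  });
}
```

In a Jest setup the factory could be wired as `makeRequireAtLeast(process.env.WEAVIATE_VERSION ?? '', describe, describe.skip)`, which is why repeating the version in the suite title adds nothing.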

Comment thread src/collections/config/utils.ts Outdated
Comment on lines +87 to +89
return out.asciiFold !== undefined || out.asciiFoldIgnore !== undefined || out.stopwordPreset !== undefined
? out
: undefined;
Collaborator

Suggested change
- return out.asciiFold !== undefined || out.asciiFoldIgnore !== undefined || out.stopwordPreset !== undefined
-   ? out
-   : undefined;
+ return out;

nit: is there any harm in returning an empty {} object if all its properties are ?-optional? Especially on the wire, where these will be turned to nulls or omitted entirely.
The return long_condition ? out : undefined seems unnecessary.
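
The practical difference between the two return values shows up at JSON serialization time. The sketch below uses a stand-in `WireTextAnalyzerConfig` type and a hypothetical property payload; it only demonstrates standard `JSON.stringify` behavior and does not establish how the server treats an empty object versus an omitted field.

```typescript
// Stand-in for the wire shape discussed in this thread (all fields optional).
type WireTextAnalyzerConfig = {
  asciiFold?: boolean;
  asciiFoldIgnore?: string[];
  stopwordPreset?: string;
};

// Case 1: the translator returns an empty object.
const emptyConfig: WireTextAnalyzerConfig = {};
const withEmpty = JSON.stringify({ name: 'title', textAnalyzer: emptyConfig });
// withEmpty === '{"name":"title","textAnalyzer":{}}'

// Case 2: the translator returns undefined, so the key is dropped entirely.
const noConfig: WireTextAnalyzerConfig | undefined = undefined;
const withUndefined = JSON.stringify({ name: 'title', textAnalyzer: noConfig });
// withUndefined === '{"name":"title"}'

// Fields that are undefined inside the object are dropped as well.
const partial = JSON.stringify({ asciiFold: true, stopwordPreset: undefined });
// partial === '{"asciiFold":true}'
```

So `return out;` sends `"textAnalyzer":{}` where the conditional version sends nothing; whether that matters depends on the server accepting an empty object for the field.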

@g-despot g-despot merged commit 9c6a71b into main May 14, 2026
13 checks passed
@g-despot g-despot deleted the tokenization-updates branch May 14, 2026 10:08