fix: handle empty gremlin examples when building index by Gardenia-zx · Pull Request #323 · apache/hugegraph-ai

Gardenia-zx · 2026-05-03T05:11:49Z

Purpose

This PR handles the empty examples case in BuildGremlinExampleIndex.

Previously, when examples was empty, run() could access examples_embedding[0] and raise an IndexError.

Changes

Return early when examples is empty.
Return early when generated embeddings are empty.
Update the unit test to verify that no embedding or vector index operation is triggered for empty examples.
Adjust the mock target to avoid coroutine warnings in tests.

Tests

uv run pytest src/tests/operators/index_op/test_build_gremlin_example_index.py -v

Copilot

Pull request overview

This PR fixes an IndexError in BuildGremlinExampleIndex.run() by safely handling the case where no Gremlin examples (or no embeddings) are produced, and updates unit tests to assert that embedding/vector-index operations are skipped in those cases.

Changes:

Add an early return in BuildGremlinExampleIndex.run() when examples is empty (sets context["embed_dim"]=0).
Add an early return when the generated embeddings list is empty (avoids examples_embedding[0] access).
Update unit tests to validate “no-op” behavior for empty examples and adjust mocking to prevent coroutine warnings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py`	Adds guards for empty `examples` / empty embeddings to avoid indexing into an empty list and to skip vector index operations.
`hugegraph-llm/src/tests/operators/index_op/test_build_gremlin_example_index.py`	Updates mocks and assertions to ensure no embedding/index calls happen when examples are empty, and verifies `embed_dim` is set to `0`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

imbajin · 2026-05-05T05:30:22Z


    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
-        # !: We have assumed that self.example is not empty
+        if not self.examples:


‼️ BuildGremlinExampleIndexNode.node_init() still rejects [] via if not self.wk_input.examples (hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_gremlin_example_index.py:37), so the main BuildExampleIndexFlow never reaches this branch for the empty-list case. As written, the unit test passes but the end-to-end workflow still returns examples is required. Could we align the node-level guard in the same PR or add flow-level coverage?

imbajin · 2026-05-05T05:30:22Z

-        # !: We have assumed that self.example is not empty
+        if not self.examples:
+            context["embed_dim"] = 0
+            return context


‼️ Returning early here leaves any previously persisted gremlin_examples index untouched. GremlinExampleIndexQuery reuses that index whenever exist("gremlin_examples") is true, so rebuilding with an empty example set will still serve stale examples from the previous run. If an empty list is meant to clear the few-shot examples, we should wipe the index before returning.

Suggested change

return context

if not self.examples:

self.vector_index.clean(self.vector_index_name)

context["embed_dim"] = 0

return context

Thanks for the review. I updated the implementation to clean the existing gremlin_examples index before returning for the empty examples / empty embeddings cases, so stale examples will not be reused.

I also aligned the node-level guard to reject only None and allow an empty examples list to reach the no-op path.

Tests:
uv run pytest src/tests/operators/index_op/test_build_gremlin_example_index.py -v
uv run pytest src/tests/operators/index_op/ -v

imbajin

Thanks for the fix! The core idea of guarding against IndexError is correct, but the current implementation introduces two hidden destructive side effects: both early-return branches now call vector_index.clean("gremlin_examples"), which in this codebase means deleting the on-disk index, not a no-op. Combined with the not → is None relaxation in node_init, passing examples=[] will silently wipe out an already-built gremlin_examples index. I'd suggest removing the two clean() calls (and re-aligning the node check) before merging. See inline comments for details.

imbajin · 2026-05-06T13:28:16Z

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
-        # !: We have assumed that self.example is not empty
+        if not self.examples:
+            self.vector_index.clean(self.vector_index_name)


‼️ Calling clean() on empty examples is a destructive side effect that the PR description does not mention.

Why this is a problem

In this repository, vector_index.clean(name) is the API used to destroy an existing index on disk. Look at how it is used elsewhere:

utils/graph_index_utils.py:49-50 — used to wipe graph_vids and gremlin_examples when the user explicitly asks for a reset.

utils/vector_index_utils.py:77 — used to wipe the chunks index before a rebuild.

So the intent of clean() is always "tear down", never "skip this step".

What this PR actually does

Previously, if a caller accidentally passed examples=[], run() would raise IndexError — noisy, but non-destructive. With this change, the same call path will now:

Return embed_dim=0 silently, and

Delete the existing gremlin_examples index on disk.

A user who has already built a working index and then re-triggers the workflow with an empty example list will lose their index without any warning. The PR title says "handle empty gremlin examples when building index", which readers will reasonably interpret as a no-op guard, not a reset.

Suggested fix

Drop the clean() call; a pure early return preserves existing state and still fixes the IndexError:

Suggested change

self.vector_index.clean(self.vector_index_name)

if not self.examples:

context["embed_dim"] = 0

return context

If the intent really is "empty input should reset the index", please:

State this explicitly in the PR description so reviewers and downstream users are aware, and

Confirm no upstream caller (e.g. operators/operator_list.py:69) can reach this path with an unintended empty list.

imbajin · 2026-05-06T13:28:17Z

        examples_embedding = asyncio.run(get_embeddings_parallel(self.embedding, queries))
+
+        if not examples_embedding:
+            self.vector_index.clean(self.vector_index_name)


‼️ This branch silently hides embedding failures and destroys the existing index.

When this branch is actually reached

Control only gets here when self.examples is non-empty but get_embeddings_parallel(...) returns []. For a non-empty query list, an empty embedding result is almost never a legitimate business outcome — it typically indicates:

The embedding provider is unreachable (network / DNS / timeout),

Authentication / quota failure (bad API key, rate limit, exhausted credit),

A bug in the embedding client returning an empty list instead of raising.

In other words, this is an error signal, not an empty-input case.

Why the current handling is wrong

The code currently:

Silently calls clean(), deleting any previously built index, and

Returns embed_dim=0 as if the operation succeeded.

From the caller's perspective, the workflow looks successful. But on disk, a working gremlin_examples index has just been wiped, and the user has no way to tell something went wrong.

Suggested fix

At minimum, do not clean(). Prefer one of the two options below:

Option A — log loudly and skip (preserves existing index):

Suggested change

self.vector_index.clean(self.vector_index_name)

if not examples_embedding:

log.error(

"Embedding service returned empty result for %d examples; skipping index build",

len(self.examples),

)

context["embed_dim"] = 0

return context

Option B — fail fast, which is what the previous IndexError was accidentally doing:

if not examples_embedding: raise RuntimeError( f"Embedding service returned empty result for {len(self.examples)} examples" )

Either is acceptable; the key point is that a silent clean() here turns a recoverable failure into irrecoverable data loss.

imbajin · 2026-05-06T13:28:17Z

        from hugegraph_llm.utils.vector_index_utils import get_vector_index_class

-        if not self.wk_input.examples:
+        if self.wk_input.examples is None:


‼️ not → is None relaxes the check and contradicts the error message.

The behavior change

Before: if not self.wk_input.examples: — both None and [] are rejected, caller gets CStatus(-1, "examples is required...").

After: if self.wk_input.examples is None: — only None is rejected; [] is accepted and flows through to the operator.

Why this matters (interacts with the operator change)

Combined with the new empty-examples branch in the operator, the effective path for wk_input.examples = [] is now:

node_init accepts the empty list (no longer returns CStatus(-1)).

The operator's run() enters its new early-return branch.

That branch calls self.vector_index.clean("gremlin_examples") — silently wiping the user's existing index.

So an empty list, which the error message still claims is invalid ("examples is required in BuildGremlinExampleIndexNode"), in fact now acts as an implicit reset command. The validation message and the runtime behavior contradict each other.

Suggested resolution — pick one

Keep the stricter check (recommended unless there is an explicit reason to accept []). Revert this line to if not self.wk_input.examples: and, per the operator-side comments, remove the clean() calls.

Intentionally allow empty list as a reset command. Then update the error message to something like "examples cannot be None", document this behavior in the PR description, and keep (some form of) the operator-side reset logic — but make the reset explicit and logged, not silent.

imbajin · 2026-05-06T13:28:17Z

+
        context["embed_dim"] = embed_dim
-        return context
+        return context


⚠️ Missing newline at end of file.

The diff shows \ No newline at end of file on the final return context line. This violates PEP 8 (rule W292) and will likely trip the project's pre-commit / lint hooks. Please add a trailing newline.

imbajin · 2026-05-06T13:28:17Z

-    @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
+    @patch(
+        'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
+        new_callable=MagicMock,


⚠️ This test locks in the destructive behavior as a contract; please reconsider after the operator changes.

What this test currently asserts

mock_vector_store_class.clean.assert_called_once_with("gremlin_examples") mock_vector_store_class.from_name.assert_not_called() # ...

This effectively turns "empty examples → delete the on-disk index" into a frozen contract enforced by CI. If the operator-side comments are accepted and clean() is removed, this assertion must flip:

Suggested change

new_callable=MagicMock,

mock_vector_store_class.clean.assert_not_called()

Missing coverage

The two early-return branches in run() (empty examples vs. empty examples_embedding) are semantically distinct — the second one implies an embedding-service failure, not an empty-input case. Please add a dedicated test, e.g. test_run_with_non_empty_examples_but_empty_embeddings, that:

Constructs the builder with non-empty examples,

Makes asyncio.run(...) return [],

Asserts the chosen error/logging behavior (raise, or log + skip),

Asserts clean() is not called.

imbajin · 2026-05-06T13:28:17Z

    @patch('asyncio.run')
-    @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
+    @patch(
+        'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',


🧹 Patch target fix is correct — nice catch.

Rebinding the patch from hugegraph_llm.utils.embedding_utils.get_embeddings_parallel to hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel is the right move. The rule of thumb for unittest.mock.patch is "patch where the name is looked up, not where it is defined" — since the operator does from hugegraph_llm.utils.embedding_utils import get_embeddings_parallel, the function is bound to the operator module's namespace at import time, and that's the binding the test must replace.

The previous patch target looked plausible but never actually intercepted the call, so the test was implicitly relying on asyncio.run being mocked to avoid real work. Using new_callable=MagicMock here is also correct — it prevents the mock from being treated as an awaitable and suppresses the "coroutine was never awaited" warning. No change needed to this line.

fix: handle empty gremlin examples when building index

de7d534

dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label May 3, 2026

github-actions Bot added the llm label May 3, 2026

dosubot Bot added the bug Something isn't working label May 3, 2026

Gardenia-zx mentioned this pull request May 3, 2026

[Summary] Good AI/LLM Tasks for Newcomers #107

Open

11 tasks

imbajin requested a review from Copilot May 5, 2026 03:40

Copilot started reviewing on behalf of imbajin May 5, 2026 03:41 View session

Copilot AI reviewed May 5, 2026

View reviewed changes

imbajin reviewed May 5, 2026

View reviewed changes

fix: address empty gremlin examples review comments

6cfa3ab

This comment was marked as duplicate.

Sign in to view

imbajin reviewed May 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: handle empty gremlin examples when building index#323

fix: handle empty gremlin examples when building index#323
Gardenia-zx wants to merge 2 commits into
apache:mainfrom
Gardenia-zx:fix-gremlin-example-empty-list

Gardenia-zx commented May 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

imbajin May 5, 2026

Uh oh!

imbajin May 5, 2026

Uh oh!

Gardenia-zx May 6, 2026

Uh oh!

This comment was marked as duplicate.

Uh oh!

imbajin left a comment

Uh oh!

imbajin May 6, 2026

Uh oh!

imbajin May 6, 2026

Uh oh!

imbajin May 6, 2026

Uh oh!

imbajin May 6, 2026

Uh oh!

imbajin May 6, 2026

Uh oh!

imbajin May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	new_callable=MagicMock,
	mock_vector_store_class.clean.assert_not_called()

Conversation

Gardenia-zx commented May 3, 2026

Purpose

Changes

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as duplicate.

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Why this is a problem

What this PR actually does

Suggested fix

Uh oh!

Choose a reason for hiding this comment

When this branch is actually reached

Why the current handling is wrong

Suggested fix

Uh oh!

Choose a reason for hiding this comment

The behavior change

Why this matters (interacts with the operator change)

Suggested resolution — pick one

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

What this test currently asserts

Missing coverage

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants