fix: handle empty gremlin examples when building index#323
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes an IndexError in BuildGremlinExampleIndex.run() by safely handling the case where no Gremlin examples (or no embeddings) are produced, and updates unit tests to assert that embedding/vector-index operations are skipped in those cases.
Changes:
- Add an early return in
BuildGremlinExampleIndex.run()whenexamplesis empty (setscontext["embed_dim"]=0). - Add an early return when the generated embeddings list is empty (avoids
examples_embedding[0]access). - Update unit tests to validate “no-op” behavior for empty examples and adjust mocking to prevent coroutine warnings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py |
Adds guards for empty examples / empty embeddings to avoid indexing into an empty list and to skip vector index operations. |
hugegraph-llm/src/tests/operators/index_op/test_build_gremlin_example_index.py |
Updates mocks and assertions to ensure no embedding/index calls happen when examples are empty, and verifies embed_dim is set to 0. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| def run(self, context: Dict[str, Any]) -> Dict[str, Any]: | ||
| # !: We have assumed that self.example is not empty | ||
| if not self.examples: |
There was a problem hiding this comment.
BuildGremlinExampleIndexNode.node_init() still rejects [] via if not self.wk_input.examples (hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_gremlin_example_index.py:37), so the main BuildExampleIndexFlow never reaches this branch for the empty-list case. As written, the unit test passes but the end-to-end workflow still returns examples is required. Could we align the node-level guard in the same PR or add flow-level coverage?
| # !: We have assumed that self.example is not empty | ||
| if not self.examples: | ||
| context["embed_dim"] = 0 | ||
| return context |
There was a problem hiding this comment.
gremlin_examples index untouched. GremlinExampleIndexQuery reuses that index whenever exist("gremlin_examples") is true, so rebuilding with an empty example set will still serve stale examples from the previous run. If an empty list is meant to clear the few-shot examples, we should wipe the index before returning.
| return context | |
| if not self.examples: | |
| self.vector_index.clean(self.vector_index_name) | |
| context["embed_dim"] = 0 | |
| return context |
There was a problem hiding this comment.
Thanks for the review. I updated the implementation to clean the existing gremlin_examples index before returning for the empty examples / empty embeddings cases, so stale examples will not be reused.
I also aligned the node-level guard to reject only None and allow an empty examples list to reach the no-op path.
Tests:
uv run pytest src/tests/operators/index_op/test_build_gremlin_example_index.py -v
uv run pytest src/tests/operators/index_op/ -v
imbajin
left a comment
There was a problem hiding this comment.
Thanks for the fix! The core idea of guarding against IndexError is correct, but the current implementation introduces two hidden destructive side effects: both early-return branches now call vector_index.clean("gremlin_examples"), which in this codebase means deleting the on-disk index, not a no-op. Combined with the not → is None relaxation in node_init, passing examples=[] will silently wipe out an already-built gremlin_examples index. I'd suggest removing the two clean() calls (and re-aligning the node check) before merging. See inline comments for details.
| def run(self, context: Dict[str, Any]) -> Dict[str, Any]: | ||
| # !: We have assumed that self.example is not empty | ||
| if not self.examples: | ||
| self.vector_index.clean(self.vector_index_name) |
There was a problem hiding this comment.
clean() on empty examples is a destructive side effect that the PR description does not mention.
Why this is a problem
In this repository, vector_index.clean(name) is the API used to destroy an existing index on disk. Look at how it is used elsewhere:
utils/graph_index_utils.py:49-50— used to wipegraph_vidsandgremlin_exampleswhen the user explicitly asks for a reset.utils/vector_index_utils.py:77— used to wipe thechunksindex before a rebuild.
So the intent of clean() is always "tear down", never "skip this step".
What this PR actually does
Previously, if a caller accidentally passed examples=[], run() would raise IndexError — noisy, but non-destructive. With this change, the same call path will now:
- Return
embed_dim=0silently, and - Delete the existing
gremlin_examplesindex on disk.
A user who has already built a working index and then re-triggers the workflow with an empty example list will lose their index without any warning. The PR title says "handle empty gremlin examples when building index", which readers will reasonably interpret as a no-op guard, not a reset.
Suggested fix
Drop the clean() call; a pure early return preserves existing state and still fixes the IndexError:
| self.vector_index.clean(self.vector_index_name) | |
| if not self.examples: | |
| context["embed_dim"] = 0 | |
| return context |
If the intent really is "empty input should reset the index", please:
- State this explicitly in the PR description so reviewers and downstream users are aware, and
- Confirm no upstream caller (e.g.
operators/operator_list.py:69) can reach this path with an unintended empty list.
| examples_embedding = asyncio.run(get_embeddings_parallel(self.embedding, queries)) | ||
|
|
||
| if not examples_embedding: | ||
| self.vector_index.clean(self.vector_index_name) |
There was a problem hiding this comment.
When this branch is actually reached
Control only gets here when self.examples is non-empty but get_embeddings_parallel(...) returns []. For a non-empty query list, an empty embedding result is almost never a legitimate business outcome — it typically indicates:
- The embedding provider is unreachable (network / DNS / timeout),
- Authentication / quota failure (bad API key, rate limit, exhausted credit),
- A bug in the embedding client returning an empty list instead of raising.
In other words, this is an error signal, not an empty-input case.
Why the current handling is wrong
The code currently:
- Silently calls
clean(), deleting any previously built index, and - Returns
embed_dim=0as if the operation succeeded.
From the caller's perspective, the workflow looks successful. But on disk, a working gremlin_examples index has just been wiped, and the user has no way to tell something went wrong.
Suggested fix
At minimum, do not clean(). Prefer one of the two options below:
Option A — log loudly and skip (preserves existing index):
| self.vector_index.clean(self.vector_index_name) | |
| if not examples_embedding: | |
| log.error( | |
| "Embedding service returned empty result for %d examples; skipping index build", | |
| len(self.examples), | |
| ) | |
| context["embed_dim"] = 0 | |
| return context |
Option B — fail fast, which is what the previous IndexError was accidentally doing:
if not examples_embedding:
raise RuntimeError(
f"Embedding service returned empty result for {len(self.examples)} examples"
)Either is acceptable; the key point is that a silent clean() here turns a recoverable failure into irrecoverable data loss.
| from hugegraph_llm.utils.vector_index_utils import get_vector_index_class | ||
|
|
||
| if not self.wk_input.examples: | ||
| if self.wk_input.examples is None: |
There was a problem hiding this comment.
not → is None relaxes the check and contradicts the error message.
The behavior change
- Before:
if not self.wk_input.examples:— bothNoneand[]are rejected, caller getsCStatus(-1, "examples is required..."). - After:
if self.wk_input.examples is None:— onlyNoneis rejected;[]is accepted and flows through to the operator.
Why this matters (interacts with the operator change)
Combined with the new empty-examples branch in the operator, the effective path for wk_input.examples = [] is now:
node_initaccepts the empty list (no longer returnsCStatus(-1)).- The operator's
run()enters its new early-return branch. - That branch calls
self.vector_index.clean("gremlin_examples")— silently wiping the user's existing index.
So an empty list, which the error message still claims is invalid ("examples is required in BuildGremlinExampleIndexNode"), in fact now acts as an implicit reset command. The validation message and the runtime behavior contradict each other.
Suggested resolution — pick one
- Keep the stricter check (recommended unless there is an explicit reason to accept
[]). Revert this line toif not self.wk_input.examples:and, per the operator-side comments, remove theclean()calls. - Intentionally allow empty list as a reset command. Then update the error message to something like "examples cannot be None", document this behavior in the PR description, and keep (some form of) the operator-side reset logic — but make the reset explicit and logged, not silent.
|
|
||
| context["embed_dim"] = embed_dim | ||
| return context | ||
| return context No newline at end of file |
There was a problem hiding this comment.
The diff shows \ No newline at end of file on the final return context line. This violates PEP 8 (rule W292) and will likely trip the project's pre-commit / lint hooks. Please add a trailing newline.
| @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel') | ||
| @patch( | ||
| 'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel', | ||
| new_callable=MagicMock, |
There was a problem hiding this comment.
What this test currently asserts
mock_vector_store_class.clean.assert_called_once_with("gremlin_examples")
mock_vector_store_class.from_name.assert_not_called()
# ...This effectively turns "empty examples → delete the on-disk index" into a frozen contract enforced by CI. If the operator-side comments are accepted and clean() is removed, this assertion must flip:
| new_callable=MagicMock, | |
| mock_vector_store_class.clean.assert_not_called() |
Missing coverage
The two early-return branches in run() (empty examples vs. empty examples_embedding) are semantically distinct — the second one implies an embedding-service failure, not an empty-input case. Please add a dedicated test, e.g. test_run_with_non_empty_examples_but_empty_embeddings, that:
- Constructs the builder with non-empty
examples, - Makes
asyncio.run(...)return[], - Asserts the chosen error/logging behavior (raise, or log + skip),
- Asserts
clean()is not called.
| @patch('asyncio.run') | ||
| @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel') | ||
| @patch( | ||
| 'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel', |
There was a problem hiding this comment.
🧹 Patch target fix is correct — nice catch.
Rebinding the patch from hugegraph_llm.utils.embedding_utils.get_embeddings_parallel to hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel is the right move. The rule of thumb for unittest.mock.patch is "patch where the name is looked up, not where it is defined" — since the operator does from hugegraph_llm.utils.embedding_utils import get_embeddings_parallel, the function is bound to the operator module's namespace at import time, and that's the binding the test must replace.
The previous patch target looked plausible but never actually intercepted the call, so the test was implicitly relying on asyncio.run being mocked to avoid real work. Using new_callable=MagicMock here is also correct — it prevents the mock from being treated as an awaitable and suppresses the "coroutine was never awaited" warning. No change needed to this line.
Purpose
This PR handles the empty examples case in
BuildGremlinExampleIndex.Previously, when
exampleswas empty,run()could accessexamples_embedding[0]and raise anIndexError.Changes
examplesis empty.Tests