Skip to content

fix: handle empty gremlin examples when building index#323

Open
Gardenia-zx wants to merge 2 commits into
apache:mainfrom
Gardenia-zx:fix-gremlin-example-empty-list
Open

fix: handle empty gremlin examples when building index#323
Gardenia-zx wants to merge 2 commits into
apache:mainfrom
Gardenia-zx:fix-gremlin-example-empty-list

Conversation

@Gardenia-zx
Copy link
Copy Markdown

Purpose

This PR handles the empty examples case in BuildGremlinExampleIndex.

Previously, when examples was empty, run() could access examples_embedding[0] and raise an IndexError.

Changes

  • Return early when examples is empty.
  • Return early when generated embeddings are empty.
  • Update the unit test to verify that no embedding or vector index operation is triggered for empty examples.
  • Adjust the mock target to avoid coroutine warnings in tests.

Tests

uv run pytest src/tests/operators/index_op/test_build_gremlin_example_index.py -v

@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label May 3, 2026
@github-actions github-actions Bot added the llm label May 3, 2026
@dosubot dosubot Bot added the bug Something isn't working label May 3, 2026
@imbajin imbajin requested a review from Copilot May 5, 2026 03:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an IndexError in BuildGremlinExampleIndex.run() by safely handling the case where no Gremlin examples (or no embeddings) are produced, and updates unit tests to assert that embedding/vector-index operations are skipped in those cases.

Changes:

  • Add an early return in BuildGremlinExampleIndex.run() when examples is empty (sets context["embed_dim"]=0).
  • Add an early return when the generated embeddings list is empty (avoids examples_embedding[0] access).
  • Update unit tests to validate “no-op” behavior for empty examples and adjust mocking to prevent coroutine warnings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py Adds guards for empty examples / empty embeddings to avoid indexing into an empty list and to skip vector index operations.
hugegraph-llm/src/tests/operators/index_op/test_build_gremlin_example_index.py Updates mocks and assertions to ensure no embedding/index calls happen when examples are empty, and verifies embed_dim is set to 0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
# !: We have assumed that self.example is not empty
if not self.examples:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ BuildGremlinExampleIndexNode.node_init() still rejects [] via if not self.wk_input.examples (hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_gremlin_example_index.py:37), so the main BuildExampleIndexFlow never reaches this branch for the empty-list case. As written, the unit test passes but the end-to-end workflow still returns examples is required. Could we align the node-level guard in the same PR or add flow-level coverage?

# !: We have assumed that self.example is not empty
if not self.examples:
context["embed_dim"] = 0
return context
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Returning early here leaves any previously persisted gremlin_examples index untouched. GremlinExampleIndexQuery reuses that index whenever exist("gremlin_examples") is true, so rebuilding with an empty example set will still serve stale examples from the previous run. If an empty list is meant to clear the few-shot examples, we should wipe the index before returning.

Suggested change
return context
if not self.examples:
self.vector_index.clean(self.vector_index_name)
context["embed_dim"] = 0
return context

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I updated the implementation to clean the existing gremlin_examples index before returning for the empty examples / empty embeddings cases, so stale examples will not be reused.

I also aligned the node-level guard to reject only None and allow an empty examples list to reach the no-op path.

Tests:
uv run pytest src/tests/operators/index_op/test_build_gremlin_example_index.py -v
uv run pytest src/tests/operators/index_op/ -v

imbajin

This comment was marked as duplicate.

Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! The core idea of guarding against IndexError is correct, but the current implementation introduces two hidden destructive side effects: both early-return branches now call vector_index.clean("gremlin_examples"), which in this codebase means deleting the on-disk index, not a no-op. Combined with the notis None relaxation in node_init, passing examples=[] will silently wipe out an already-built gremlin_examples index. I'd suggest removing the two clean() calls (and re-aligning the node check) before merging. See inline comments for details.

def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
# !: We have assumed that self.example is not empty
if not self.examples:
self.vector_index.clean(self.vector_index_name)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Calling clean() on empty examples is a destructive side effect that the PR description does not mention.

Why this is a problem

In this repository, vector_index.clean(name) is the API used to destroy an existing index on disk. Look at how it is used elsewhere:

  • utils/graph_index_utils.py:49-50 — used to wipe graph_vids and gremlin_examples when the user explicitly asks for a reset.
  • utils/vector_index_utils.py:77 — used to wipe the chunks index before a rebuild.

So the intent of clean() is always "tear down", never "skip this step".

What this PR actually does

Previously, if a caller accidentally passed examples=[], run() would raise IndexError — noisy, but non-destructive. With this change, the same call path will now:

  1. Return embed_dim=0 silently, and
  2. Delete the existing gremlin_examples index on disk.

A user who has already built a working index and then re-triggers the workflow with an empty example list will lose their index without any warning. The PR title says "handle empty gremlin examples when building index", which readers will reasonably interpret as a no-op guard, not a reset.

Suggested fix

Drop the clean() call; a pure early return preserves existing state and still fixes the IndexError:

Suggested change
self.vector_index.clean(self.vector_index_name)
if not self.examples:
context["embed_dim"] = 0
return context

If the intent really is "empty input should reset the index", please:

  1. State this explicitly in the PR description so reviewers and downstream users are aware, and
  2. Confirm no upstream caller (e.g. operators/operator_list.py:69) can reach this path with an unintended empty list.

examples_embedding = asyncio.run(get_embeddings_parallel(self.embedding, queries))

if not examples_embedding:
self.vector_index.clean(self.vector_index_name)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ This branch silently hides embedding failures and destroys the existing index.

When this branch is actually reached

Control only gets here when self.examples is non-empty but get_embeddings_parallel(...) returns []. For a non-empty query list, an empty embedding result is almost never a legitimate business outcome — it typically indicates:

  • The embedding provider is unreachable (network / DNS / timeout),
  • Authentication / quota failure (bad API key, rate limit, exhausted credit),
  • A bug in the embedding client returning an empty list instead of raising.

In other words, this is an error signal, not an empty-input case.

Why the current handling is wrong

The code currently:

  1. Silently calls clean(), deleting any previously built index, and
  2. Returns embed_dim=0 as if the operation succeeded.

From the caller's perspective, the workflow looks successful. But on disk, a working gremlin_examples index has just been wiped, and the user has no way to tell something went wrong.

Suggested fix

At minimum, do not clean(). Prefer one of the two options below:

Option A — log loudly and skip (preserves existing index):

Suggested change
self.vector_index.clean(self.vector_index_name)
if not examples_embedding:
log.error(
"Embedding service returned empty result for %d examples; skipping index build",
len(self.examples),
)
context["embed_dim"] = 0
return context

Option B — fail fast, which is what the previous IndexError was accidentally doing:

        if not examples_embedding:
            raise RuntimeError(
                f"Embedding service returned empty result for {len(self.examples)} examples"
            )

Either is acceptable; the key point is that a silent clean() here turns a recoverable failure into irrecoverable data loss.

from hugegraph_llm.utils.vector_index_utils import get_vector_index_class

if not self.wk_input.examples:
if self.wk_input.examples is None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ notis None relaxes the check and contradicts the error message.

The behavior change

  • Before: if not self.wk_input.examples: — both None and [] are rejected, caller gets CStatus(-1, "examples is required...").
  • After: if self.wk_input.examples is None: — only None is rejected; [] is accepted and flows through to the operator.

Why this matters (interacts with the operator change)

Combined with the new empty-examples branch in the operator, the effective path for wk_input.examples = [] is now:

  1. node_init accepts the empty list (no longer returns CStatus(-1)).
  2. The operator's run() enters its new early-return branch.
  3. That branch calls self.vector_index.clean("gremlin_examples")silently wiping the user's existing index.

So an empty list, which the error message still claims is invalid ("examples is required in BuildGremlinExampleIndexNode"), in fact now acts as an implicit reset command. The validation message and the runtime behavior contradict each other.

Suggested resolution — pick one

  • Keep the stricter check (recommended unless there is an explicit reason to accept []). Revert this line to if not self.wk_input.examples: and, per the operator-side comments, remove the clean() calls.
  • Intentionally allow empty list as a reset command. Then update the error message to something like "examples cannot be None", document this behavior in the PR description, and keep (some form of) the operator-side reset logic — but make the reset explicit and logged, not silent.


context["embed_dim"] = embed_dim
return context
return context No newline at end of file
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Missing newline at end of file.

The diff shows \ No newline at end of file on the final return context line. This violates PEP 8 (rule W292) and will likely trip the project's pre-commit / lint hooks. Please add a trailing newline.

@patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
@patch(
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
new_callable=MagicMock,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This test locks in the destructive behavior as a contract; please reconsider after the operator changes.

What this test currently asserts

mock_vector_store_class.clean.assert_called_once_with("gremlin_examples")
mock_vector_store_class.from_name.assert_not_called()
# ...

This effectively turns "empty examples → delete the on-disk index" into a frozen contract enforced by CI. If the operator-side comments are accepted and clean() is removed, this assertion must flip:

Suggested change
new_callable=MagicMock,
mock_vector_store_class.clean.assert_not_called()

Missing coverage

The two early-return branches in run() (empty examples vs. empty examples_embedding) are semantically distinct — the second one implies an embedding-service failure, not an empty-input case. Please add a dedicated test, e.g. test_run_with_non_empty_examples_but_empty_embeddings, that:

  • Constructs the builder with non-empty examples,
  • Makes asyncio.run(...) return [],
  • Asserts the chosen error/logging behavior (raise, or log + skip),
  • Asserts clean() is not called.

@patch('asyncio.run')
@patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
@patch(
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Patch target fix is correct — nice catch.

Rebinding the patch from hugegraph_llm.utils.embedding_utils.get_embeddings_parallel to hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel is the right move. The rule of thumb for unittest.mock.patch is "patch where the name is looked up, not where it is defined" — since the operator does from hugegraph_llm.utils.embedding_utils import get_embeddings_parallel, the function is bound to the operator module's namespace at import time, and that's the binding the test must replace.

The previous patch target looked plausible but never actually intercepted the call, so the test was implicitly relying on asyncio.run being mocked to avoid real work. Using new_callable=MagicMock here is also correct — it prevents the mock from being treated as an awaitable and suppresses the "coroutine was never awaited" warning. No change needed to this line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working llm size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants