FEAT: Introduce IdentifierFilters to allow generic DB queries on identifier…#1557
FEAT: Introduce IdentifierFilters to allow generic DB queries on identifier…#1557behnam-o wants to merge 9 commits intomicrosoft:mainfrom
Conversation
pyrit/memory/identifier_filters.py
Outdated
| class AttackIdentifierFilter(IdentifierFilter[AttackIdentifierProperty]): | ||
| """ | ||
| Immutable filter definition for matching JSON-backed attack identifier properties. | ||
|
|
||
| Args: | ||
| property_path: The JSON path of the property to filter on. | ||
| value_to_match: The value to match against the property. | ||
| partial_match: Whether to allow partial matches (default: False). | ||
| """ | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class TargetIdentifierFilter(IdentifierFilter[TargetIdentifierProperty]): | ||
| """Immutable filter definition for matching JSON-backed target identifier properties.""" | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ConverterIdentifierFilter(IdentifierFilter[ConverterIdentifierProperty]): | ||
| """Immutable filter definition for matching JSON-backed converter identifier properties.""" | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ScorerIdentifierFilter(IdentifierFilter[ScorerIdentifierProperty]): | ||
| """Immutable filter definition for matching JSON-backed scorer identifier properties.""" |
There was a problem hiding this comment.
Here you have a _StrEnum base, a Generic[T] ABC IdentifierFilter, then 4 *Property enums, and 4 *Filter subclasses, but every filter subclass is an empty body, they add zero behavior. I think the IdentifierFilter type hierarchy is unnecessary.
The Generic[T] bound gives you type-time safety on which *Property enum you pass, but at runtime __post_init__ immediately calls str(self.property_path), erasing the enum type entirely. So a caller can pass an any string and it works fine.
For 4 empty subclasses are only different in the type parameter, I think this is a little bit of over-engineering. You're essentially encoding which JSON column to query in the type of the filter, but the actual column is still chosen by the caller at the call site (e.g. json_column=ScoreEntry.scorer_class_identifier). The type hierarchy doesn't prevent users misusing it, nothing stops you from passing a ScorerIdentifierFilter with json_column=AttackResultEntry.atomic_attack_identifier.
I think a single IdentifierFilter dataclass with a flat property_path: str would be simpler, equally extensible, and more transparent about what the runtime actually does.
Here is what I propose:
@dataclass(frozen=True)
class IdentifierFilter:
property_path: str
value_to_match: str
partial_match: bool = FalseThat's it. The *Property enums are fine to keep as constants (or even a flat module-level class IdentifierPaths namespace), but there's no need for them to constrain the filter type generically. The column binding already happens at the call site in memory_interface.py, so the filter is purely about what path, what value, plus the exact or partial condition.
This removes most of the classes you proposed here and is equally type safe because the real safety comes from which json_column you pass, not the filter type, It is also extensible since new properties are just new enum values and no new filter class is needed.
And I think for _get_condition_json_array_match, you should add a sub_path: str | None = None parameter (that mimics _get_unique_json_array_values), and use it instead of hardcoding '$.class_name'.
Here is an example of how this would work:
results = memory.get_attack_results(
identifier_filter=IdentifierFilter(
property_path=AttackIdentifierProperty.ATTACK_CLASS_NAME,
value_to_match="Crescendo",
partial_match=True,
),
)
results = memory.get_scores(
identifier_filter=IdentifierFilter(
property_path=ScorerIdentifierProperty.CLASS_NAME,
value_to_match="SelfAskLikertScorer",
),
)Then inside memory_interface.py, the method signature pins which column the filter applies to:
def get_scores(
self,
*,
scorer_identifier_filter: IdentifierFilter | None = None,
...
) -> Sequence[Score]:
if scorer_identifier_filter:
conditions.append(
self._get_condition_json_property_match(
# this is where the column is bound ---
json_column=ScoreEntry.scorer_class_identifier,
# ---
property_path=scorer_identifier_filter.property_path,
value_to_match=scorer_identifier_filter.value_to_match,
partial_match=scorer_identifier_filter.partial_match,
)
)There was a problem hiding this comment.
The main reason I introduced all those ***IdentifierProperty.XYZ was to limit what path on an identifier can be queried ... I agree it's not a bad idea to keep it free form, especially since our identifiers are constructed in somewhat of a free-form manner where keys are arbitrary strings.
maybe down the road, we want to have our identifiers statically typed, and then it might make sense to also have filters enforce that.
for now, made the property_path to allow a free-form string
There was a problem hiding this comment.
@bashirpartovi @hannahwestra25 @ValbuenaVC Thanks for your comments, I think you all touched on this free-form vs. restricted property_path pattern, and I agree it is a bit of an over-engineering with no "real" benefit. Please let me know if we should have any follow-ups on this.
ValbuenaVC
left a comment
There was a problem hiding this comment.
This PR (#1451) might be a useful reference since it also handled filtering concerns, although for datasets.
Uh oh!
There was an error while loading. Please reload this page.