27 changes: 27 additions & 0 deletions multilingual-entity-alias-guard/README.md
@@ -0,0 +1,27 @@
# Multilingual Entity Alias Guard

This module adds a focused Scientific Knowledge Graph Integration slice for SCIBASE issue #17. It normalizes multilingual scientific mentions before they become graph nodes, entity-page aliases, or recommendation signals.

The guard:

- accepts trusted translated aliases only when numeric confidence evidence is present
- preserves original language tags
- normalizes language-tag casing and underscore or hyphen regional separators for lookup
- falls back from regional language tags to their base language
- emits JSON-LD-style entity packets
- holds the following for curator review: homographs, false friends, same-language alias collisions, extractor-candidate/alias conflicts, malformed mention text, and mixed-script Latin-language lookalikes, including lowercase Greek or Cyrillic confusables
- suppresses low-confidence or missing-confidence aliases before recommendations are shown
- treats omitted or malformed localized names, mentions, or homograph policies as sparse graph evidence instead of crashing corpus review
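
The language-tag handling described above could look roughly like the sketch below. It is illustrative only: `normalizeLanguageTag`, `languageLookupChain`, and `lookupAliases` are hypothetical names, not the module's exported API, and the sketch assumes alias tables are keyed by lowercase, hyphen-separated tags.

```javascript
// Hypothetical helpers for illustration; the module's real implementation may differ.

// Normalize a tag for lookup only: trim, treat "_" like "-", and lowercase.
function normalizeLanguageTag(tag) {
  if (typeof tag !== 'string') return null;
  const normalized = tag.trim().replace(/_/g, '-').toLowerCase();
  return normalized.length > 0 ? normalized : null;
}

// Lookup order: the full normalized tag first, then its base language.
function languageLookupChain(tag) {
  const normalized = normalizeLanguageTag(tag);
  if (normalized === null) return [];
  const base = normalized.split('-')[0];
  return base === normalized ? [normalized] : [normalized, base];
}

// Resolve aliases while returning the original tag unchanged.
// Assumes aliasesByLanguage is keyed by normalized tags such as "es".
function lookupAliases(aliasesByLanguage, originalTag) {
  for (const candidate of languageLookupChain(originalTag)) {
    if (Object.prototype.hasOwnProperty.call(aliasesByLanguage, candidate)) {
      return {
        originalLanguage: originalTag,
        matchedLanguage: candidate,
        aliases: aliasesByLanguage[candidate]
      };
    }
  }
  return { originalLanguage: originalTag, matchedLanguage: null, aliases: [] };
}
```

With this shape, `lookupAliases({ es: ['diabetes mellitus'] }, 'es_MX')` would match the `es` table while still reporting `es_MX` as the original language, which is the behaviour the acceptance notes describe for regional tags.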

## Run

```bash
npm test
npm run demo
npm run video
npm run check
```

## Outputs

- `reports/alias-guard-packet.json`
- `reports/sparse-alias-guard-packet.json`
- `reports/candidate-alias-conflict-packet.json`
- `reports/malformed-mention-text-packet.json`
- `reports/malformed-alias-evidence-packet.json`
- `reports/alias-guard-report.md`
- `reports/summary.svg`
- `reports/demo.mp4`

All data is synthetic. The module does not call live ontologies, identity providers, external APIs, private corpora, search indexes, or recommendation systems.
28 changes: 28 additions & 0 deletions multilingual-entity-alias-guard/acceptance-notes.md
@@ -0,0 +1,28 @@
# Acceptance Notes

This #17 slice focuses specifically on multilingual scientific alias quality before graph nodes and recommendations are produced.

It is not:

- a broad entity extractor or navigator
- an ontology deprecation or synonym migration tool
- a recommendation visibility or diversity guard
- a geospatial, clinical trial, biological accession, software runtime, or temporal validity guard

Validation coverage:

- trusted CRISPR aliases in English, German, and Spanish map to one canonical MeSH entity
- Spanish `control` is held as a homograph/false friend instead of silently creating a statistical control-group edge
- same-language translated alias collisions are held instead of silently attaching a mention to the wrong canonical entity
- extractor candidate IDs that disagree with multilingual alias lookup are held instead of silently overriding either signal
- language-tag case differences do not suppress trusted translated aliases
- regional language tags such as `es-MX` use base-language alias and homograph policy while preserving the original tag
- underscore regional language tags such as `es_MX` use the same base-language alias and homograph policy while preserving the original tag
- mixed-script Latin-language aliases such as Cyrillic-lookalike `CRISPR` text or lowercase Greek-alpha `CRISPR-Cαs9` text are held for curator review instead of becoming quiet unknowns
- low-confidence French alias output is suppressed from recommendations
- missing or non-numeric confidence evidence is suppressed before graph recommendations
- sparse ontology/corpus exports with omitted localized names, mention lists, or homograph policies do not crash corpus review
- malformed localized-name entries are omitted from alias lookup and JSON-LD alternate names, with alias evidence issues preserved for review
- malformed mention text values are held for curator review instead of crashing alias normalization or reaching recommendation-safe IDs
- localized names remain language-tagged on entity packets
- audit output is deterministic and private-data free
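
As a rough illustration of how several of the checks above might combine, the sketch below triages a single mention. Everything in it is an assumption made for illustration: the function names, the decision labels, the `0.8` threshold, and the order of the checks are not taken from the module. The mixed-script test is also deliberately narrow, flagging only text that mixes Latin letters with Greek or Cyrillic ones.

```javascript
// Hypothetical sketch; labels, threshold, and check order are assumptions.
const LATIN = /\p{Script=Latin}/u;
const GREEK_OR_CYRILLIC = /[\p{Script=Greek}\p{Script=Cyrillic}]/u;

// True when Latin letters appear alongside Greek or Cyrillic letters.
function hasMixedScript(text) {
  return LATIN.test(text) && GREEK_OR_CYRILLIC.test(text);
}

// Confidence counts as evidence only when it is a finite number.
function hasNumericConfidence(value) {
  return typeof value === 'number' && Number.isFinite(value);
}

function triageMention(mention, threshold = 0.8) {
  // Non-string text is held rather than passed to string normalization.
  if (typeof mention.text !== 'string') return 'hold-malformed-text';
  // Lookalike characters are held for a curator instead of becoming unknowns.
  if (hasMixedScript(mention.text)) return 'hold-mixed-script';
  // Missing or non-numeric confidence is suppressed before recommendations.
  if (!hasNumericConfidence(mention.confidence)) return 'suppress-missing-confidence';
  if (mention.confidence < threshold) return 'suppress-low-confidence';
  return 'eligible-for-alias-lookup';
}

console.log(triageMention({ text: 'CRISPR-C\u03B1s9', confidence: 0.93 }));
console.log(triageMention({ text: 'CRISPR', confidence: '0.93' }));
```

Holding structural problems before looking at confidence means a malformed or lookalike mention reaches a curator even when the extractor reports a high score.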
172 changes: 172 additions & 0 deletions multilingual-entity-alias-guard/demo.js
@@ -0,0 +1,172 @@
const fs = require('fs');
const path = require('path');
const { evaluateAliasGuard, buildSampleCorpus } = require('./index');

const reportsDir = path.join(__dirname, 'reports');
fs.mkdirSync(reportsDir, { recursive: true });

const result = evaluateAliasGuard(buildSampleCorpus());
const sparseResult = evaluateAliasGuard({
  corpusId: 'kg-sparse-ontology-export-17',
  generatedAt: '2026-05-30T12:00:00Z',
  entities: [
    {
      id: 'entity:mesh:D012345',
      canonicalName: 'Sparse Ontology Entity',
      ontology: 'MeSH',
      identifier: 'D012345'
    }
  ]
});
const conflictResult = evaluateAliasGuard({
  ...buildSampleCorpus(),
  corpusId: 'kg-candidate-alias-conflict-17',
  generatedAt: '2026-05-30T12:30:00Z',
  mentions: [
    {
      id: 'mention-diabetes-conflicting-candidate',
      documentId: 'paper-17',
      text: 'diabetes mellitus',
      language: 'es',
      confidence: 0.93,
      candidateEntityId: 'entity:stat:control-group'
    }
  ]
});
const malformedMentionResult = evaluateAliasGuard({
  ...buildSampleCorpus(),
  corpusId: 'kg-malformed-mention-text-17',
  generatedAt: '2026-05-31T10:45:00Z',
  mentions: [
    {
      id: 'mention-malformed-text',
      documentId: 'paper-18',
      text: { value: 'diabetes mellitus' },
      language: 'es',
      confidence: 0.94,
      candidateEntityId: 'entity:mesh:D003920'
    }
  ]
});
const malformedAliasEvidenceResult = evaluateAliasGuard({
  ...buildSampleCorpus(),
  corpusId: 'kg-malformed-localized-name-17',
  generatedAt: '2026-05-31T10:46:00Z',
  entities: [
    {
      id: 'entity:mesh:D003920',
      canonicalName: 'Diabetes Mellitus',
      ontology: 'MeSH',
      identifier: 'D003920',
      localizedNames: {
        es: ['diabetes mellitus', { value: 'diabete mellitus' }]
      }
    }
  ],
  mentions: [
    {
      id: 'mention-diabetes-es',
      documentId: 'paper-19',
      text: 'diabetes mellitus',
      language: 'es',
      confidence: 0.94
    }
  ]
});

const packetPath = path.join(reportsDir, 'alias-guard-packet.json');
const sparsePacketPath = path.join(reportsDir, 'sparse-alias-guard-packet.json');
const conflictPacketPath = path.join(reportsDir, 'candidate-alias-conflict-packet.json');
const malformedMentionPacketPath = path.join(reportsDir, 'malformed-mention-text-packet.json');
const malformedAliasEvidencePacketPath = path.join(reportsDir, 'malformed-alias-evidence-packet.json');
const reportPath = path.join(reportsDir, 'alias-guard-report.md');
const svgPath = path.join(reportsDir, 'summary.svg');

fs.writeFileSync(packetPath, `${JSON.stringify(result, null, 2)}\n`);
fs.writeFileSync(sparsePacketPath, `${JSON.stringify(sparseResult, null, 2)}\n`);
fs.writeFileSync(conflictPacketPath, `${JSON.stringify(conflictResult, null, 2)}\n`);
fs.writeFileSync(malformedMentionPacketPath, `${JSON.stringify(malformedMentionResult, null, 2)}\n`);
fs.writeFileSync(malformedAliasEvidencePacketPath, `${JSON.stringify(malformedAliasEvidenceResult, null, 2)}\n`);

const accepted = result.mentionDecisions
  .filter((decision) => decision.decision === 'accept-canonical-entity')
  .map((decision) => `- ${decision.id}: ${decision.text} (${decision.language}) -> ${decision.candidateEntityId}`)
  .join('\n');

const held = result.curatorActions
  .map((action) => `- ${action.id}: ${action.action} (${action.language}:${action.text})`)
  .join('\n');

const markdown = `# Multilingual Entity Alias Guard

Corpus: ${result.corpusId}
Generated: ${result.generatedAt}

## Summary

- Accepted mentions: ${result.summary.acceptedMentions}
- Held curator-review mentions: ${result.summary.heldMentions}
- Suppressed low-confidence mentions: ${result.summary.suppressedMentions}
- Entity packets emitted: ${result.summary.entityPackets}
- Audit digest: ${result.auditDigest}

## Accepted Canonical Mappings

${accepted}

## Curator Actions

${held}

## Recommendation Guard

Held or suppressed mentions are not allowed to drive entity-page recommendations until a curator verifies the alias mapping.

## Sparse Corpus Guard

Sparse ontology or corpus exports that omit localized names, mention lists, or homograph policy still produce deterministic graph review evidence. The sparse fixture emitted ${sparseResult.summary.entityPackets} entity packet and ${sparseResult.mentionDecisions.length} mention decisions.

## Candidate Alias Conflict Guard

Extractor candidates that disagree with trusted multilingual alias lookup are held for curator review instead of silently overriding the upstream candidate. The conflict fixture decision is ${conflictResult.mentionDecisions[0].decision} with reason ${conflictResult.mentionDecisions[0].reason}.

## Malformed Mention Text Guard

Malformed mention text values are held for curator review instead of crashing alias normalization. The malformed fixture decision is ${malformedMentionResult.mentionDecisions[0].decision} with reason ${malformedMentionResult.mentionDecisions[0].reason}, and it emits ${malformedMentionResult.curatorActions[0].action}.

## Malformed Alias Evidence Guard

Malformed localized-name evidence is omitted from alias lookup and JSON-LD alternate names instead of crashing ontology review. The malformed alias fixture records ${malformedAliasEvidenceResult.entityPackets[0].aliasEvidenceIssues.length} alias evidence issue with reason ${malformedAliasEvidenceResult.entityPackets[0].aliasEvidenceIssues[0].reason}.

## Safety

All fixtures are synthetic. The module does not call live ontologies, identity providers, external APIs, private corpora, search indexes, or recommendation systems.
`;

fs.writeFileSync(reportPath, markdown);

const svg = `<svg xmlns="http://www.w3.org/2000/svg" width="1280" height="720" viewBox="0 0 1280 720">
<rect width="1280" height="720" fill="#0c2130"/>
<rect x="54" y="58" width="1172" height="604" rx="18" fill="#142f42" stroke="#7bd88f" stroke-width="4"/>
<text x="96" y="136" fill="#ffffff" font-family="Arial, sans-serif" font-size="44" font-weight="700">Multilingual Entity Alias Guard</text>
<text x="96" y="210" fill="#d8f6df" font-family="Arial, sans-serif" font-size="28">Accepted canonical mentions: ${result.summary.acceptedMentions}</text>
<text x="96" y="260" fill="#d8f6df" font-family="Arial, sans-serif" font-size="28">Held curator-review mentions: ${result.summary.heldMentions}</text>
<text x="96" y="310" fill="#d8f6df" font-family="Arial, sans-serif" font-size="28">Suppressed low-confidence mentions: ${result.summary.suppressedMentions}</text>
<text x="96" y="380" fill="#ffffff" font-family="Arial, sans-serif" font-size="24">Languages preserved: en, de, es, fr</text>
<text x="96" y="430" fill="#ffffff" font-family="Arial, sans-serif" font-size="24">JSON-LD entity packets ready for schema.org-style pages</text>
<text x="96" y="510" fill="#ffd37a" font-family="Arial, sans-serif" font-size="26">Unsafe or malformed aliases are held before recommendations are shown.</text>
<text x="96" y="574" fill="#a6d7c3" font-family="Arial, sans-serif" font-size="18">${result.auditDigest}</text>
</svg>
`;

fs.writeFileSync(svgPath, svg);

console.log(`Wrote ${path.relative(__dirname, packetPath)}`);
console.log(`Wrote ${path.relative(__dirname, sparsePacketPath)}`);
console.log(`Wrote ${path.relative(__dirname, conflictPacketPath)}`);
console.log(`Wrote ${path.relative(__dirname, malformedMentionPacketPath)}`);
console.log(`Wrote ${path.relative(__dirname, malformedAliasEvidencePacketPath)}`);
console.log(`Wrote ${path.relative(__dirname, reportPath)}`);
console.log(`Wrote ${path.relative(__dirname, svgPath)}`);
console.log(`Accepted mentions: ${result.summary.acceptedMentions}`);
console.log(`Suppressed mentions: ${result.summary.suppressedMentions}`);