Escape XML reserved characters when writing JATS-formatted text to database #751

Merged: ja573 merged 3 commits into develop from feature/jats_special_chars on May 14, 2026
Conversation

rhigman (Member) commented May 13, 2026

Raw characters such as <, > and & were not previously escaped when writing JATS fields such as abstract. This led to downstream bugs such as https://help.thoth.pub/#ticket/zoom/363, where a stray < character used as a "from" symbol caused truncation/malformation of the Crossref XML output.
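As a rough illustration of the escaping described above, here is a minimal sketch of escaping a plain-text fragment before it is wrapped in JATS markup. The helper name `escape_xml_text` is hypothetical and this is not the code from this PR; it only shows the standard mapping of the five XML reserved characters to their predefined entities.

```rust
/// Escape the five XML reserved characters in a plain-text value.
/// Hypothetical helper for illustration; not the function used in this PR.
fn escape_xml_text(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            // `&` must become an entity so it is not read as an entity start.
            '&' => out.push_str("&amp;"),
            // `<` and `>` must not be mistaken for tag delimiters.
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            // Quotes only matter inside attribute values, but escaping them
            // everywhere is harmless.
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            other => out.push(other),
        }
    }
    out
}
```

With this sketch, `escape_xml_text("growth <1990 & after")` yields `growth &lt;1990 &amp; after`, so a stray < used as a "from" symbol can no longer be read as the start of a tag.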

@rhigman rhigman requested a review from ja573 May 13, 2026 14:58
ja573 (Member) commented May 14, 2026

Issues Fixed

1. normalise_crossref_abstract_jats had no input validation

normalise_crossref_abstract_jats passed content directly to jats_to_ast, which uses html5ever — a lenient HTML parser. Legacy content with raw < (stored before the escape fix) would be silently misinterpreted. For example, "<p>1 < 2</p>" was handled differently depending on whether < was followed by a letter (parsed as a tag) or a space (treated as text), producing silently wrong output instead of an error.
Fix: Added validate_jats_subset(content, ConversionLimit::Abstract)?; before jats_to_ast at mod.rs:395. Malformed XML is now rejected early with a clear error.
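To make the "reject early instead of parsing leniently" idea concrete, here is a deliberately simplified sketch. The function `reject_raw_reserved_chars` is hypothetical and is not `validate_jats_subset`, whose internals are not shown in this thread. It only looks for raw `<` and `&` in text; it does not check tag balance or the allowed JATS elements, and it also rejects comments and CDATA sections.

```rust
/// Hypothetical, simplified stand-in for a strict pre-check. It only detects
/// raw `<` and `&`; it does not check tag balance or the JATS subset.
fn reject_raw_reserved_chars(content: &str) -> Result<(), String> {
    let mut rest = content;
    while let Some(pos) = rest.find(|c: char| c == '<' || c == '&') {
        let tail = &rest[pos..];
        if tail.starts_with('<') {
            // A tag must close with `>` before any other `<` appears.
            let end = tail
                .find('>')
                .ok_or_else(|| format!("unterminated tag at: {tail}"))?;
            let inner = &tail[1..end];
            let name = inner.strip_prefix('/').unwrap_or(inner);
            let starts_ok = name
                .chars()
                .next()
                .map_or(false, |c| c.is_ascii_alphabetic());
            if !starts_ok || inner.contains('<') {
                return Err(format!("raw '<' in text at: {tail}"));
            }
            rest = &tail[end + 1..];
        } else {
            // An entity reference must look like `&name;` or `&#123;`.
            let end = tail
                .find(';')
                .ok_or_else(|| format!("raw '&' in text at: {tail}"))?;
            let name = &tail[1..end];
            let valid = !name.is_empty()
                && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '#');
            if !valid {
                return Err(format!("raw '&' in text at: {tail}"));
            }
            rest = &tail[end + 1..];
        }
    }
    Ok(())
}
```

Under this sketch both `"<p>1 < 2</p>"` (the < is followed by a space) and `"<p>a <b and c</p>"` (the < is followed by a letter) return an error, so the two cases no longer diverge silently, while `"<p>1 &lt; 2</p>"` is accepted.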

2. write_jats_content double-escaped attribute entities

The old code used String::from_utf8_lossy(&attr.value), which reads raw bytes from quick-xml without resolving XML character references. When xml-rs then wrote the attribute via event_builder.attr(), it re-escaped the & in &amp;, producing &amp;amp;.
Fix: Replaced with attr.decode_and_unescape_value(reader.decoder()) in both Event::Start and Event::Empty branches (crossref.rs:336, crossref.rs:383). Entities are resolved before xml-rs encodes them, giving exactly-once encoding.
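The double-escaping mechanism can be reproduced without either library. In the sketch below, `escape_attr` and `unescape_attr` are hypothetical std-only stand-ins for what the writer does on output and what entity resolution does on input; they are not the quick-xml or xml-rs APIs, they handle only three entities, and the URL is a placeholder.

```rust
/// Minimal stand-in for a writer's attribute escaping (assumption: the real
/// writer escapes at least `&`, `<` and `"`). `&` is replaced first so the
/// entities produced by the later replacements are not escaped again.
fn escape_attr(value: &str) -> String {
    value
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('"', "&quot;")
}

/// Minimal stand-in for entity resolution on read; handles only the entities
/// used in this example. `&amp;` is resolved last so that `&amp;lt;` becomes
/// the literal text `&lt;` rather than `<`.
fn unescape_attr(raw: &str) -> String {
    raw.replace("&lt;", "<")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
}
```

Passing the raw attribute bytes `https://example.org/?a=1&amp;b=2` straight to `escape_attr` gives `https://example.org/?a=1&amp;amp;b=2`, which is the old behaviour. Calling `escape_attr(&unescape_attr(raw))` returns the input unchanged, which is the exactly-once encoding the fix aims for.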

Test adjustments

Tests for <break/> and nested <p> in the Crossref export were updated from expecting silent normalization to expecting rejection, matching the stricter validation policy. New tests verify exactly-once entity preservation for text (&amp;&amp;) and link attributes (&amp;, &quot; and &apos; survive without double-escaping).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37b4db6c1e

Comment thread thoth-api/src/markup/mod.rs
@ja573 ja573 merged commit 64b81a7 into develop May 14, 2026
11 checks passed
@ja573 ja573 deleted the feature/jats_special_chars branch May 14, 2026 13:44