Context
Currently, a Record (Version or Snapshot) carries a single temporal field: fetchDate, representing when the source content was captured. On the Git side, this fetchDate is used as both GIT_AUTHOR_DATE and GIT_COMMITTER_DATE of the commit, overwriting Git's native distinction between these two notions. On the Mongo side, a created_at field is written but never surfaced in the domain model.
This asymmetry becomes problematic as soon as we want to distinguish two separate chronological questions:
- "When was the service's legal content in effect?" → fetchDate. Useful for chronological navigation of terms, querying at a given date, semantic diff between two versions.
- "When did the engine record this interpretation?" → information currently lost (which could be called saveDate). Useful for engine traceability, audit, activity dashboards, and pagination by actual recording order.
The two notions diverge as soon as a commit is backdated, which happens systematically with applyTechnicalUpgrades: existing snapshots are re-rendered today (T_now) with current extraction rules, but the fetchDate of the original snapshot (T_old) is semantically preserved. The commit is created at T_now but dated at T_old.
Initial problem that surfaced the topic
While adding the Atom feed endpoints (PR #1242), it became apparent that paginating findAll/findByService/findByServiceAndTermsType with --max-count on the Git side can return incorrect results after a batch of technical upgrades: recently-created but backdated commits occupy the first topological positions, pushing chronologically more recent commits (by fetchDate) out of the window. The JavaScript sort at the end does not fix this: it only reorders what Git already returned, so a commit that fell outside the window can never come back.
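To make the failure concrete, here is a minimal simulation of the windowing problem described above. It does not call Git: the newestCreated helper is a hypothetical stand-in for the window that --max-count returns on a linear history, and the ids and dates are made up for illustration.

```javascript
// Illustrative data only: ids and dates are made up.
// Commits are listed oldest-created first, i.e. in the order they were added
// to a linear history.
const history = [
  { id: 'change-1', fetchDate: '2024-01-01T00:00:00Z' },
  { id: 'change-2', fetchDate: '2024-02-01T00:00:00Z' },
  { id: 'change-3', fetchDate: '2024-03-01T00:00:00Z' },
  // Technical upgrades created later (T_now) but dated at the fetchDate of
  // the snapshot they re-render (T_old).
  { id: 'upgrade-1', fetchDate: '2024-01-05T00:00:00Z' },
  { id: 'upgrade-2', fetchDate: '2024-02-05T00:00:00Z' },
];

// Hypothetical stand-in for the --max-count window on a linear history:
// the most recently created commits come first.
function newestCreated(commits, maxCount) {
  return [...commits].reverse().slice(0, maxCount);
}

const byFetchDateDesc = (a, b) => Date.parse(b.fetchDate) - Date.parse(a.fetchDate);

// What the paginated query yields: window first, sort afterwards.
const page = newestCreated(history, 2).sort(byFetchDateDesc).map((c) => c.id);

// What a caller expects: the 2 most recent commits by fetchDate.
const expected = [...history].sort(byFetchDateDesc).slice(0, 2).map((c) => c.id);

console.log(page);     // [ 'upgrade-2', 'upgrade-1' ]
console.log(expected); // [ 'change-3', 'upgrade-2' ]
```

The window contains only backdated upgrades, so change-3, the most recent commit by fetchDate, is missing from the page no matter how the page is sorted afterwards.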
For the feed, the decision was made that technical upgrades don't belong (a re-render is not a change event for subscribers) and they are filtered out via an option (see PR #1242). This solves the feed problem without touching the data model.
However, any future use case that exposes versions to end users (e.g. a navigation/exploration UI with diffs between versions) would need to include technical upgrades, otherwise two versions separated by a technical upgrade would show a false diff. Such a use case would require both chronological dimensions to be properly exposed.
Solutions considered
Option A: Filter out technical upgrades from all chronological queries
Adopted for the feed PR (#1242), but insufficient for a navigation UI, which needs to include them.
Option B: Split by prefix inside #getCommits
Run two Git queries (one for "real change" commits, one for "technical upgrade" commits) each with --max-count=X, merge them, JS-sort, then slice to X.
Mostly correct because startTracking/update commits are never backdated, so their topological order matches their chronological order; topo-pagination of that subset is exact. Backdated commits are isolated in the second query where we accept that the X most recently-created are returned regardless of their fetchDate. The merge + sort + slice then surfaces the truly recent commits at the top, and backdated ones land at their semantic position.
Residual edge case: a single applyTechnicalUpgrades run on more than X services with heterogeneous snapshot ages, where the iteration order happens to put the recently-dated snapshots first (so their commits sit at lower topological positions). The --max-count=X window for the technical upgrade query then captures the older-dated upgrades, and the recently-dated ones are missed.
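Option B and its residual edge case can be sketched as follows. This is a simulation under stated assumptions, not the actual #getCommits code: the technicalUpgrade flag stands in for the commit message prefix, newestCreated stands in for one Git query with --max-count, and all ids and dates are invented.

```javascript
// Illustrative data only. Arrays are oldest-created first.
const changes = [
  { id: 'change-1', fetchDate: '2024-01-01T00:00:00Z', technicalUpgrade: false },
  { id: 'change-2', fetchDate: '2024-02-01T00:00:00Z', technicalUpgrade: false },
  { id: 'change-3', fetchDate: '2024-03-01T00:00:00Z', technicalUpgrade: false },
];

const byFetchDateDesc = (a, b) => Date.parse(b.fetchDate) - Date.parse(a.fetchDate);

// Hypothetical stand-in for one Git query restricted to a commit kind,
// limited with --max-count: most recently created commits first.
function newestCreated(history, keep, maxCount) {
  return history.filter(keep).reverse().slice(0, maxCount);
}

// Two windowed queries, then merge + sort + slice.
function getCommitsSplit(history, maxCount) {
  const realChanges = newestCreated(history, (c) => !c.technicalUpgrade, maxCount);
  const upgrades = newestCreated(history, (c) => c.technicalUpgrade, maxCount);
  return [...realChanges, ...upgrades]
    .sort(byFetchDateDesc)
    .slice(0, maxCount)
    .map((c) => c.id);
}

const upgrade = (id, fetchDate) => ({ id, fetchDate, technicalUpgrade: true });
const oldest = upgrade('upgrade-old', '2024-01-05T00:00:00Z');
const middle = upgrade('upgrade-mid', '2024-02-05T00:00:00Z');
const recent = upgrade('upgrade-recent', '2024-03-05T00:00:00Z');

// Upgrade run iterating from the oldest snapshot to the most recent: exact.
const favourable = getCommitsSplit([...changes, oldest, middle, recent], 2);

// Same run iterating from the most recent snapshot to the oldest:
// 'upgrade-recent' sits too deep and falls out of the upgrade window.
const unfavourable = getCommitsSplit([...changes, recent, middle, oldest], 2);

console.log(favourable);   // [ 'upgrade-recent', 'change-3' ]
console.log(unfavourable); // [ 'change-3', 'upgrade-mid' ]
```

With X = 2 and three upgrades in a single run, the second ordering reproduces the edge case: the true top two by fetchDate are upgrade-recent and change-3, but upgrade-recent is never returned by the upgrade query.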
Option C: Decouple authorDate = fetchDate from commitDate = saveDate
Use Git's native distinction, expose saveDate as a field in the Record model, and stop overwriting GIT_COMMITTER_DATE at commit time. On the Mongo side, surface created_at as saveDate.
Benefits:
- Aligns the Git model with the Mongo model (which already stores both dates internally).
- Gives consumers a clean semantic axis for each question.
- Enables pagination by saveDate (= topological order in a linear history) without JS sort and without incorrectness.
- Lets the navigation UI display both the content validity date AND the date of the last re-render.
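A minimal sketch of what Option C could look like, assuming the commit is created by a child process whose environment we control. The function names, the commit shape ({ authorDate, committerDate }) and the Mongo fetchDate field name are hypothetical; only GIT_AUTHOR_DATE, GIT_COMMITTER_DATE and created_at come from the discussion above. Dates are passed to Git in its internal "<unix timestamp> <time zone offset>" format.

```javascript
// Format a Date in Git's internal date format, always in UTC.
function toGitDate(date) {
  return `${Math.floor(date.getTime() / 1000)} +0000`;
}

// Environment for the commit command: only the author date is forced.
// Any inherited GIT_COMMITTER_DATE is dropped so that Git stamps the
// committer date with the actual commit time.
function buildCommitEnv(record, baseEnv = {}) {
  const { GIT_COMMITTER_DATE: _ignored, ...env } = baseEnv;
  return { ...env, GIT_AUTHOR_DATE: toGitDate(record.fetchDate) };
}

// Hypothetical commit shape: { authorDate, committerDate } as ISO 8601 strings.
function recordDatesFromCommit(commit) {
  return {
    fetchDate: new Date(commit.authorDate),
    saveDate: new Date(commit.committerDate),
  };
}

// Hypothetical Mongo document shape: fetchDate is an assumed field name.
function recordDatesFromDocument(document) {
  return {
    fetchDate: new Date(document.fetchDate),
    saveDate: new Date(document.created_at),
  };
}

const env = buildCommitEnv(
  { fetchDate: new Date('2024-01-10T00:00:00Z') },
  { GIT_COMMITTER_DATE: 'inherited override', PATH: '/usr/bin' },
);
console.log(env);

// A backdated technical upgrade: content dated T_old, recorded at T_now.
const dates = recordDatesFromCommit({
  authorDate: '2024-01-10T00:00:00Z',
  committerDate: '2024-04-01T12:00:00Z',
});
console.log(dates.saveDate > dates.fetchDate); // true
```

With both dates exposed, a consumer can sort or paginate on whichever axis matches its question, and the Git and Mongo repositories return the same two fields.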