|
| 1 | +# pardus-kg Optimization Plan |
| 2 | + |
| 3 | +## Pre-requisite Fix (unrelated pre-existing bug) |
| 4 | + |
| 5 | +**File:** `crates/pardus-core/src/page.rs:783-791` |
| 6 | + |
| 7 | +The `snapshot()` method is missing the `redirect_chain` field, causing compilation failure. |
| 8 | + |
| 9 | +```rust |
| 10 | +pub fn snapshot(&self) -> PageSnapshot { |
| 11 | + PageSnapshot { |
| 12 | + url: self.url.clone(), |
| 13 | + status: self.status, |
| 14 | + content_type: self.content_type.clone(), |
| 15 | + title: self.title(), |
| 16 | + html: self.html.html(), |
| 17 | + redirect_chain: self.redirect_chain.clone(), |
| 18 | + } |
| 19 | +} |
| 20 | +``` |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## Phase 1: Bug Fixes (B1-B4) |
| 25 | + |
| 26 | +### B1. Fix query param sorting in `normalize_url` |
| 27 | + |
| 28 | +**File:** `crates/pardus-kg/src/crawler.rs:224-234` |
| 29 | + |
| 30 | +Docstring says "sort query params" but code doesn't sort. Sort query pairs before rebuilding URL string. |
| 31 | + |
| 32 | +### B2. Fix `Box::leak` memory leak in pagination |
| 33 | + |
| 34 | +**File:** `crates/pardus-kg/src/discovery.rs:139,146` |
| 35 | + |
| 36 | +Change `segments` from `Vec<&str>` to `Vec<String>` and use local `String` instead of `Box::leak`. |
| 37 | + |
| 38 | +### B3. Fix duplicate `navigation_graph()` call |
| 39 | + |
| 40 | +**File:** `crates/pardus-kg/src/crawler.rs:176-221` |
| 41 | + |
| 42 | +Pass pre-built `nav_graph` into `discover_transitions_for_page` instead of rebuilding inside. |
| 43 | + |
| 44 | +### B4. Fix same-origin filtering at frontier insertion |
| 45 | + |
| 46 | +**File:** `crates/pardus-kg/src/crawler.rs:237-239` |
| 47 | + |
| 48 | +Replace unused `_root_origin` param with actual same-origin check. Skip cross-origin URLs before enqueueing. |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## Phase 2: Quick Wins (H4, M1) |
| 53 | + |
| 54 | +### H4. Incremental blake3 hashing |
| 55 | + |
| 56 | +**File:** `crates/pardus-kg/src/fingerprint.rs` |
| 57 | + |
| 58 | +Replace 3 functions (`hash_tree_structure`, `hash_resource_set`, `compute_view_state_id`) to use `blake3::Hasher::new()` + incremental `update()` calls instead of building large intermediate strings. |
| 59 | + |
| 60 | +### M1. Remove duplicate `role_str` function |
| 61 | + |
| 62 | +**File:** `crates/pardus-kg/src/fingerprint.rs:88-122` |
| 63 | + |
| 64 | +Delete the local `role_str()` that allocates `String` per call. Use `node.role.role_str()` which already exists on `SemanticRole` in pardus-core and returns `&str`. |
| 65 | + |
| 66 | +--- |
| 67 | + |
| 68 | +## Phase 3: Parallel Fetch (H1) |
| 69 | + |
| 70 | +**File:** `crates/pardus-kg/src/crawler.rs` |
| 71 | + |
| 72 | +- Add `concurrency: usize` field to `CrawlConfig` (default: 4) |
| 73 | +- Add `tokio/sync` and `tokio/rt` features to `Cargo.toml` |
| 74 | +- Replace serial BFS loop with batched parallel fetch using `tokio::task::JoinSet` + `tokio::sync::Semaphore` |
| 75 | +- Result processing stays serial to maintain BFS ordering and safe `HashMap` mutation |
| 76 | +- Parallelism is I/O-bound fetch only; semantic tree building stays in collection loop |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Phase 4: Single-Pass HTML (H3) |
| 81 | + |
| 82 | +### New unified analysis API |
| 83 | + |
| 84 | +**New file:** `crates/pardus-core/src/page_analysis.rs` |
| 85 | + |
| 86 | +Create `PageAnalysis` struct with `build(html, page_url)` that produces both `SemanticTree` and `NavigationGraph` through a single API call. Initially delegates to individual builders; evolved later into true single-pass. |
| 87 | + |
| 88 | +--- |
| 89 | + |
| 90 | +## Phase 5: Memory Optimization (M2-M4, L1, L3) |
| 91 | + |
| 92 | +### M2. Optional tree storage |
| 93 | + |
| 94 | +- Add `store_full_trees: bool` to `CrawlConfig` |
| 95 | +- Make `semantic_tree` and `navigation_graph` `Option<T>` on `ViewState` |
| 96 | +- Skip serializing when `None` |
| 97 | + |
| 98 | +### M3. Type-safe HashMap keys |
| 99 | + |
| 100 | +**File:** `crates/pardus-kg/src/graph.rs` |
| 101 | + |
| 102 | +Change `states: HashMap<String, ViewState>` to `HashMap<ViewStateId, ViewState>`. Update `add_state` and `has_state` accordingly. Update all callers. |
| 103 | + |
| 104 | +### M4. HashSet for resources |
| 105 | + |
| 106 | +Change `resource_urls: BTreeSet<String>` to `HashSet<String>` across state.rs and fingerprint.rs. Sort only when hashing. |
| 107 | + |
| 108 | +### L1. Remove dead `verify_transitions` config |
| 109 | + |
| 110 | +Remove the unused field from `CrawlConfig`. |
| 111 | + |
| 112 | +### L3. Crawler-level retry |
| 113 | + |
| 114 | +Add `retries: u8` to `FrontierEntry`. On fetch failure, re-enqueue up to 2 retries. |
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## File Change Summary |
| 119 | + |
| 120 | +| Order | File | Changes | |
| 121 | +|-------|------|---------| |
| 122 | +| 0 | `pardus-core/src/page.rs` | Fix missing `redirect_chain` in `snapshot()` | |
| 123 | +| 1 | `pardus-kg/src/discovery.rs` | B2: Fix `Box::leak` | |
| 124 | +| 1 | `pardus-kg/src/crawler.rs` | B1: query param sort; B3: pass nav_graph; B4: same-origin filter | |
| 125 | +| 2 | `pardus-kg/src/fingerprint.rs` | H4: incremental blake3; M1: remove role_str; M4: HashSet | |
| 126 | +| 3 | `pardus-kg/src/crawler.rs` | H1: parallel fetch | |
| 127 | +| 3 | `pardus-kg/src/config.rs` | Add `concurrency` | |
| 128 | +| 3 | `pardus-kg/Cargo.toml` | Add tokio features | |
| 129 | +| 4 | `pardus-core/src/page_analysis.rs` | New file: unified PageAnalysis | |
| 130 | +| 4 | `pardus-kg/src/crawler.rs` | Use PageAnalysis | |
| 131 | +| 5 | `pardus-kg/src/graph.rs` | M3: HashMap key type | |
| 132 | +| 5 | `pardus-kg/src/state.rs` | M2: optional trees; M4: HashSet | |
| 133 | +| 5 | `pardus-kg/src/config.rs` | L1: remove dead field; add store_full_trees | |
| 134 | + |
| 135 | +## Verification |
| 136 | + |
| 137 | +```bash |
| 138 | +cargo test -p pardus-kg |
| 139 | +cargo test -p pardus-core |
| 140 | +cargo clippy -p pardus-kg -- -D warnings |
| 141 | +cargo build -p pardus-kg |
| 142 | +``` |
0 commit comments