Skip to content

Commit f9a735c

Browse files
authored
Merge pull request #13 from JasonHonKL/dev/roadmap
tauri
2 parents 8efc6f8 + a84bbba commit f9a735c

64 files changed

Lines changed: 6427 additions & 1902 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
/target
22
**/*.rs.bk
3+
**/*.bak
34
Cargo.lock
45
.DS_Store
56
**/.DS_Store
@@ -16,7 +17,9 @@ adapters/node/pardus-playwright/dist/
1617

1718
# Tauri frontend
1819
crates/pardus-tauri/node_modules/
20+
crates/pardus-tauri/frontend/node_modules/
1921
crates/pardus-tauri/dist/
22+
crates/pardus-tauri/frontend/dist/
2023

2124
# Web dashboard
2225
web/node_modules/
Lines changed: 142 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,142 @@
1+
# pardus-kg Optimization Plan
2+
3+
## Pre-requisite Fix (unrelated pre-existing bug)
4+
5+
**File:** `crates/pardus-core/src/page.rs:783-791`
6+
7+
The `snapshot()` method is missing the `redirect_chain` field, causing compilation failure.
8+
9+
```rust
10+
pub fn snapshot(&self) -> PageSnapshot {
11+
PageSnapshot {
12+
url: self.url.clone(),
13+
status: self.status,
14+
content_type: self.content_type.clone(),
15+
title: self.title(),
16+
html: self.html.html(),
17+
redirect_chain: self.redirect_chain.clone(),
18+
}
19+
}
20+
```
21+
22+
---
23+
24+
## Phase 1: Bug Fixes (B1-B4)
25+
26+
### B1. Fix query param sorting in `normalize_url`
27+
28+
**File:** `crates/pardus-kg/src/crawler.rs:224-234`
29+
30+
Docstring says "sort query params" but code doesn't sort. Sort query pairs before rebuilding URL string.
31+
32+
### B2. Fix `Box::leak` memory leak in pagination
33+
34+
**File:** `crates/pardus-kg/src/discovery.rs:139,146`
35+
36+
Change `segments` from `Vec<&str>` to `Vec<String>` and use local `String` instead of `Box::leak`.
37+
38+
### B3. Fix duplicate `navigation_graph()` call
39+
40+
**File:** `crates/pardus-kg/src/crawler.rs:176-221`
41+
42+
Pass pre-built `nav_graph` into `discover_transitions_for_page` instead of rebuilding inside.
43+
44+
### B4. Fix same-origin filtering at frontier insertion
45+
46+
**File:** `crates/pardus-kg/src/crawler.rs:237-239`
47+
48+
Replace unused `_root_origin` param with actual same-origin check. Skip cross-origin URLs before enqueueing.
49+
50+
---
51+
52+
## Phase 2: Quick Wins (H4, M1)
53+
54+
### H4. Incremental blake3 hashing
55+
56+
**File:** `crates/pardus-kg/src/fingerprint.rs`
57+
58+
Replace 3 functions (`hash_tree_structure`, `hash_resource_set`, `compute_view_state_id`) to use `blake3::Hasher::new()` + incremental `update()` calls instead of building large intermediate strings.
59+
60+
### M1. Remove duplicate `role_str` function
61+
62+
**File:** `crates/pardus-kg/src/fingerprint.rs:88-122`
63+
64+
Delete the local `role_str()` that allocates `String` per call. Use `node.role.role_str()` which already exists on `SemanticRole` in pardus-core and returns `&str`.
65+
66+
---
67+
68+
## Phase 3: Parallel Fetch (H1)
69+
70+
**File:** `crates/pardus-kg/src/crawler.rs`
71+
72+
- Add `concurrency: usize` field to `CrawlConfig` (default: 4)
73+
- Add `tokio/sync` and `tokio/rt` features to `Cargo.toml`
74+
- Replace serial BFS loop with batched parallel fetch using `tokio::task::JoinSet` + `tokio::sync::Semaphore`
75+
- Result processing stays serial to maintain BFS ordering and safe `HashMap` mutation
76+
- Parallelism is I/O-bound fetch only; semantic tree building stays in collection loop
77+
78+
---
79+
80+
## Phase 4: Single-Pass HTML (H3)
81+
82+
### New unified analysis API
83+
84+
**New file:** `crates/pardus-core/src/page_analysis.rs`
85+
86+
Create `PageAnalysis` struct with `build(html, page_url)` that produces both `SemanticTree` and `NavigationGraph` through a single API call. Initially delegates to individual builders; evolved later into true single-pass.
87+
88+
---
89+
90+
## Phase 5: Memory Optimization (M2-M4, L1, L3)
91+
92+
### M2. Optional tree storage
93+
94+
- Add `store_full_trees: bool` to `CrawlConfig`
95+
- Make `semantic_tree` and `navigation_graph` `Option<T>` on `ViewState`
96+
- Skip serializing when `None`
97+
98+
### M3. Type-safe HashMap keys
99+
100+
**File:** `crates/pardus-kg/src/graph.rs`
101+
102+
Change `states: HashMap<String, ViewState>` to `HashMap<ViewStateId, ViewState>`. Update `add_state` and `has_state` accordingly. Update all callers.
103+
104+
### M4. HashSet for resources
105+
106+
Change `resource_urls: BTreeSet<String>` to `HashSet<String>` across state.rs and fingerprint.rs. Sort only when hashing.
107+
108+
### L1. Remove dead `verify_transitions` config
109+
110+
Remove the unused field from `CrawlConfig`.
111+
112+
### L3. Crawler-level retry
113+
114+
Add `retries: u8` to `FrontierEntry`. On fetch failure, re-enqueue up to 2 retries.
115+
116+
---
117+
118+
## File Change Summary
119+
120+
| Order | File | Changes |
121+
|-------|------|---------|
122+
| 0 | `pardus-core/src/page.rs` | Fix missing `redirect_chain` in `snapshot()` |
123+
| 1 | `pardus-kg/src/discovery.rs` | B2: Fix `Box::leak` |
124+
| 1 | `pardus-kg/src/crawler.rs` | B1: query param sort; B3: pass nav_graph; B4: same-origin filter |
125+
| 2 | `pardus-kg/src/fingerprint.rs` | H4: incremental blake3; M1: remove role_str; M4: HashSet |
126+
| 3 | `pardus-kg/src/crawler.rs` | H1: parallel fetch |
127+
| 3 | `pardus-kg/src/config.rs` | Add `concurrency` |
128+
| 3 | `pardus-kg/Cargo.toml` | Add tokio features |
129+
| 4 | `pardus-core/src/page_analysis.rs` | New file: unified PageAnalysis |
130+
| 4 | `pardus-kg/src/crawler.rs` | Use PageAnalysis |
131+
| 5 | `pardus-kg/src/graph.rs` | M3: HashMap key type |
132+
| 5 | `pardus-kg/src/state.rs` | M2: optional trees; M4: HashSet |
133+
| 5 | `pardus-kg/src/config.rs` | L1: remove dead field; add store_full_trees |
134+
135+
## Verification
136+
137+
```bash
138+
cargo test -p pardus-kg
139+
cargo test -p pardus-core
140+
cargo clippy -p pardus-kg -- -D warnings
141+
cargo build -p pardus-kg
142+
```

Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ futures-util = "0.3"
3030
blake3 = "1"
3131
lol_html = "2"
3232
reqwest = { version = "0.12", features = ["cookies", "gzip", "brotli", "deflate", "json"] }
33-
rquest = { version = "5", features = ["cookies", "gzip", "brotli", "deflate", "json", "stream", "socks", "multipart"] }
33+
rquest = { version = "5", features = ["cookies", "gzip", "brotli", "deflate", "json", "stream", "socks", "multipart", "webpki-roots"] }
3434
rquest-util = "2"
3535
parking_lot = "0.12"
3636
base64 = "0.22"

0 commit comments

Comments
 (0)