maornissan/webcloner-js
webcloner-js - Advanced Website Cloner

A powerful, stealthy website cloner/scraper built with TypeScript that downloads entire websites for offline use. Supports HTTP proxy authentication, comprehensive asset downloading (CSS, JS, images, SVG sprites, fonts, etc.), and intelligent URL rewriting.

Now with a beautiful Electron GUI! 🎨 See ELECTRON_GUI.md for details.

Features

  • 🚀 Complete Website Cloning - Downloads HTML, CSS, JavaScript, images, fonts, and all other assets
  • 🛡️ Anti-Bot Protection Bypass - Automatically detects and bypasses JavaScript-based protection (Cloudflare, fingerprinting, etc.)
  • 📋 Fetch/Curl Import - Copy a fetch() call or curl command from browser DevTools and paste it directly; the URL, headers, and cookies are extracted automatically
  • 🍪 Cookie Support - Inject cookies and automatically persist session state across requests
  • 🔒 HTTP Proxy Support - Connect through HTTP proxies with username/password authentication
  • 💾 Proxy Configuration Management - Save, load, and manage multiple proxy configurations with password masking
  • 🎯 SVG Sprite Support - Properly handles SVG sprites with xlink:href references
  • 🔄 Smart URL Rewriting - Converts all URLs to relative local paths for offline browsing
  • 🕷️ Stealthy Crawling - Configurable delays, random user agents, and realistic headers
  • 📦 Asset Discovery - Extracts assets from:
    • HTML tags (img, script, link, etc.)
    • CSS files (background images, fonts, etc.)
    • Inline styles
    • SVG sprites and references
    • srcset attributes
    • Data attributes (data-src, data-lazy-src)
  • 🎨 CSS Processing - Parses CSS files to download referenced assets
  • 🌐 External Link Handling - Optional following of external links
  • 📊 Progress Tracking - Real-time statistics and detailed logging
  • ⚙️ Highly Configurable - Control depth, patterns, delays, and more

Installation

# Install dependencies
npm install

# Build the project
npm run build

# Or use directly with ts-node
npm run dev -- <url> [options]

Quick Start

GUI Application (Recommended)

# Run the Electron GUI
npm run start:electron

The GUI provides an intuitive interface with all features accessible through a modern, minimalistic design.

CLI Usage

Basic Usage

# Clone a website to default directory (./cloned-site)
npm run dev -- https://example.com

# Specify output directory
npm run dev -- https://example.com -o ./my-site

# Set crawl depth
npm run dev -- https://example.com -d 5

With HTTP Proxy

# Using proxy with authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user myusername \
  --proxy-pass mypassword

# Save proxy configuration for reuse
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user myusername \
  --proxy-pass mypassword \
  --save-proxy my-proxy

# Load saved proxy configuration
npm run dev -- https://example.com --load-proxy my-proxy

Proxy Management

# List all saved proxies (passwords masked)
npm run dev -- list-proxies

# List proxies with passwords visible
npm run dev -- list-proxies --show-passwords

# Show specific proxy details
npm run dev -- show-proxy my-proxy

# Show proxy with password visible
npm run dev -- show-proxy my-proxy --show-password

# Delete a saved proxy
npm run dev -- delete-proxy my-proxy

📖 See PROXY_MANAGEMENT.md for complete proxy management documentation.

Advanced Options

# Full example with all options
npm run dev -- https://example.com \
  -o ./output \
  -d 3 \
  --delay 200 \
  --follow-external \
  --user-agent "Mozilla/5.0 Custom Agent" \
  --include ".*\\.example\\.com.*" ".*\\.cdn\\.com.*" \
  --exclude ".*\\.pdf$" ".*login.*" \
  --header "Authorization: Bearer token123" \
  --header "X-Custom-Header: value" \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user username \
  --proxy-pass password

CLI Options

Main Clone Command

Option                       Description                               Default
<url>                        Target website URL to clone               Required
-o, --output <dir>           Output directory                          ./cloned-site
-d, --depth <number>         Maximum crawl depth                       3
--delay <ms>                 Delay between requests (milliseconds)     100
--proxy-host <host>          Proxy server host                         -
--proxy-port <port>          Proxy server port                         -
--proxy-user <username>      Proxy authentication username             -
--proxy-pass <password>      Proxy authentication password             -
--load-proxy <name>          Load saved proxy configuration            -
--save-proxy <name>          Save proxy configuration with name        -
--user-agent <agent>         Custom user agent string                  Random
--follow-external            Follow external links                     false
--include <patterns...>      Include URL patterns (regex)              All
--exclude <patterns...>      Exclude URL patterns (regex)              None
--header <header...>         Custom headers (format: "Key: Value")     -

Proxy Management Commands

Command                              Description
list-proxies                         List all saved proxy configurations
list-proxies --show-passwords        List proxies with passwords visible
show-proxy <name>                    Show details of a specific proxy
show-proxy <name> --show-password    Show proxy details with password visible
delete-proxy <name>                  Delete a saved proxy configuration

Programmatic Usage

import { WebsiteCloner } from "./src/cloner.js";

const cloner = new WebsiteCloner({
  targetUrl: "https://example.com",
  outputDir: "./cloned-site",
  maxDepth: 3,
  delay: 100,
  proxy: {
    host: "proxy.example.com",
    port: 8080,
    username: "user",
    password: "pass",
  },
  userAgent: "Custom User Agent",
  followExternalLinks: false,
  includePatterns: [".*\\.example\\.com.*"],
  excludePatterns: [".*\\.pdf$"],
  headers: {
    Authorization: "Bearer token",
  },
});

await cloner.clone();

How It Works

  1. Initial Request - Downloads the target URL's HTML content using a fast HTTP client
  2. Protection Detection - Automatically detects anti-bot protection and switches to browser mode if needed
  3. Asset Extraction - Parses HTML to find all assets:
    • Stylesheets (<link rel="stylesheet">)
    • Scripts (<script src>)
    • Images (<img>, srcset, background images)
    • SVG sprites (<use xlink:href>)
    • Fonts (from CSS @font-face)
    • Videos, audio, iframes, etc.
  4. Asset Download - Downloads each asset with proper referer headers
  5. CSS Processing - Parses CSS files to find and download referenced assets
  6. URL Rewriting - Converts all absolute URLs to relative local paths
  7. Link Crawling - Follows links within the same domain (respecting depth limit)
  8. File Organization - Saves files maintaining directory structure
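The asset-extraction step (step 3) can be sketched with a minimal regex-based extractor. This is an illustration only, not the project's actual parser, which also handles srcset, inline styles, and data attributes:

```typescript
// Minimal sketch of asset extraction (step 3). The real cloner covers far
// more cases (srcset, inline styles, data-src, CSS url() references, etc.).
function extractAssetUrls(html: string): string[] {
  const urls = new Set<string>();
  // <script src>, <img src>, <source src>, <iframe src>
  for (const m of html.matchAll(/<(?:script|img|source|iframe)[^>]*\bsrc=["']([^"']+)["']/gi)) {
    urls.add(m[1]);
  }
  // <link href> (stylesheets, icons, preloads)
  for (const m of html.matchAll(/<link[^>]*\bhref=["']([^"']+)["']/gi)) {
    urls.add(m[1]);
  }
  // <use xlink:href="sprite.svg#icon-name"> -> keep the file, drop the fragment
  for (const m of html.matchAll(/xlink:href=["']([^"'#]+)#[^"']*["']/gi)) {
    urls.add(m[1]);
  }
  return [...urls];
}
```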

SVG Sprite Support

The cloner properly handles SVG sprites referenced with xlink:href:

<!-- Original -->
<svg class="icon">
  <use xlink:href="./assets/sprite.svg#icon-name"></use>
</svg>

<!-- After cloning (with proper relative path) -->
<svg class="icon">
  <use xlink:href="../assets/sprite.svg#icon-name"></use>
</svg>
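The rewrite shown above (step 6 of the pipeline) boils down to computing a relative path from the referencing page's directory to the asset's local file. A minimal sketch of that calculation:

```typescript
import { posix as path } from "node:path";

// Sketch of step 6 (URL rewriting): compute the relative path from the
// page that references an asset to the asset's local file.
function relativeAssetPath(pagePath: string, assetPath: string): string {
  const rel = path.relative(path.dirname(pagePath), assetPath);
  // Keep an explicit "./" prefix for same-directory references.
  return rel.startsWith(".") ? rel : `./${rel}`;
}
```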

Output Structure

cloned-site/
├── index.html                 # Main page
├── about.html                 # Other pages
├── assets/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── script.js
│   ├── images/
│   │   ├── logo.png
│   │   └── sprite.svg
│   └── fonts/
│       └── font.woff2
├── external/                  # External domain assets (if enabled)
│   └── cdn_example_com/
│       └── library.js
└── url-mapping.json          # URL to local path mapping
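The exact schema of url-mapping.json is not documented here; one plausible shape (an assumption, not a documented format) is a flat map from original URL to local path, which could be consumed like this:

```typescript
// Assumed shape of url-mapping.json: a flat record of original URL -> local path.
// This schema is a guess for illustration, not a documented contract.
type UrlMapping = Record<string, string>;

function parseMapping(json: string): UrlMapping {
  return JSON.parse(json) as UrlMapping;
}
```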

Stealth Features

  • Random User Agents - Rotates between realistic browser user agents
  • Realistic Headers - Includes Accept, Accept-Language, Accept-Encoding, etc.
  • Referer Headers - Sends proper referer for each request
  • Configurable Delays - Adds delays between requests to avoid detection
  • Proxy Support - Routes traffic through HTTP proxies
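The header-related stealth features above amount to picking a random realistic user agent and attaching browser-like request headers. A simplified sketch (the UA strings and header set are illustrative, not the tool's actual list):

```typescript
// Sketch of the stealth defaults: pick a random realistic user agent and
// build browser-like request headers. The UA strings here are illustrative.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
];

function buildHeaders(referer?: string): Record<string, string> {
  const headers: Record<string, string> = {
    "User-Agent": USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
    Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
  };
  if (referer) headers.Referer = referer;
  return headers;
}
```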

Error Handling

  • Failed downloads are logged but don't stop the cloning process
  • Statistics show successful and failed downloads
  • Detailed error messages for debugging
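The fail-soft behavior described above can be sketched as a download loop that counts failures instead of aborting (here `download` is a stand-in for the real fetch logic):

```typescript
// Sketch of the fail-soft download loop: failures are logged and counted
// but never stop the run. `download` stands in for the real fetch logic.
async function downloadAll(
  urls: string[],
  download: (url: string) => Promise<void>
): Promise<{ ok: number; failed: number }> {
  let ok = 0;
  let failed = 0;
  for (const url of urls) {
    try {
      await download(url);
      ok++;
    } catch (err) {
      failed++;
      console.error(`Failed to download ${url}:`, (err as Error).message);
    }
  }
  return { ok, failed };
}
```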

Performance Tips

  1. Adjust Delay - Lower delay for faster cloning (but less stealthy)
  2. Limit Depth - Reduce depth for large sites
  3. Use Patterns - Include/exclude patterns to focus on specific content
  4. Proxy Selection - Use fast, reliable proxies for better performance

Limitations

  • JavaScript-rendered content is only captured when the page can be pre-rendered (e.g. in browser mode)
  • Dynamic content loaded via AJAX may not be captured
  • Some anti-scraping measures may block requests
  • Very large sites may take significant time to clone

Security & Legal

⚠️ Important: Always respect website terms of service and robots.txt. This tool is for:

  • Backing up your own websites
  • Archiving public domain content
  • Educational purposes
  • Authorized testing

Do not use this tool to:

  • Violate copyright laws
  • Bypass paywalls or authentication
  • Overload servers with requests
  • Access restricted content without permission

License

ISC

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Troubleshooting

"Failed to download" errors

  • Check if the website blocks scrapers
  • Try increasing the delay
  • Use a different user agent
  • Check proxy configuration

Missing assets

  • Increase crawl depth
  • Check include/exclude patterns
  • Some assets may be loaded dynamically via JavaScript

Proxy connection issues

  • Verify proxy credentials
  • Check proxy host and port
  • Ensure proxy supports HTTP/HTTPS

Anti-bot protection detected

If you see "🔒 Anti-bot protection detected, using browser mode...":

  • This is normal! The tool automatically handles it
  • The download will take a bit longer (2-5 seconds per page)
  • If it fails, the site may have advanced CAPTCHA protection
  • See ANTI_BOT_PROTECTION.md for details

Examples

Clone a blog

npm run dev -- https://blog.example.com -d 2 -o ./blog-backup

Clone with proxy

npm run dev -- https://example.com \
  --proxy-host 192.168.1.100 \
  --proxy-port 3128 \
  --proxy-user admin \
  --proxy-pass secret123

Clone only specific sections

npm run dev -- https://example.com \
  --include ".*example\\.com/docs.*" \
  --exclude ".*\\.pdf$"
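The include/exclude semantics used above can be sketched as a simple gate: a URL must match at least one include pattern (when any are given) and no exclude pattern. This is a plausible reading of the flags, not a copy of the project's internal logic:

```typescript
// Sketch of --include / --exclude gating: a URL passes if it matches some
// include pattern (or none are given) and matches no exclude pattern.
function shouldCrawl(url: string, include: string[], exclude: string[]): boolean {
  if (include.length > 0 && !include.some((p) => new RegExp(p).test(url))) {
    return false;
  }
  return !exclude.some((p) => new RegExp(p).test(url));
}
```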

Clone with custom headers

npm run dev -- https://api.example.com \
  --header "Authorization: Bearer YOUR_TOKEN" \
  --header "X-API-Key: YOUR_KEY"

Clone with cookies

# With cookie file
npm run dev -- https://members.example.com \
  --cookie-file ./cookies.json \
  -o ./members-content

# With inline cookies
npm run dev -- https://example.com \
  --cookie "session=abc123" \
  --cookie "user_id=456;domain=example.com"

📖 See COOKIE_SUPPORT.md for complete cookie documentation.

Clone with fetch request or curl command (easiest!)

# Copy from browser DevTools:
# - Fetch: F12 → Network → Right-click → Copy as fetch
# - Curl: F12 → Network → Right-click → Copy as cURL

# Save to file and use:
npm run dev -- --fetch-file ./request.txt -o ./output

# Or inline (fetch):
npm run dev -- --fetch 'fetch("https://example.com", {
  "headers": {
    "cookie": "session=abc123"
  }
})' -o ./output

# Or inline (curl):
npm run dev -- --fetch 'curl "https://example.com" -H "cookie: session=abc123"' -o ./output

📋 See FETCH_REQUEST_IMPORT.md and CURL_SUPPORT.md for complete guides.
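At its core, a "Copy as fetch" import has to pull the URL and headers out of the pasted snippet. A rough sketch of that extraction (the real parser also handles curl syntax and many edge cases, such as nested header values, that this lazy regex does not):

```typescript
// Rough sketch of parsing a "Copy as fetch" snippet: extract the URL and
// the flat headers object. Illustration only; nested objects would break
// the lazy regex, and curl syntax is not handled here.
function parseFetchSnippet(snippet: string): { url: string; headers: Record<string, string> } {
  const urlMatch = snippet.match(/fetch\(\s*["']([^"']+)["']/);
  if (!urlMatch) throw new Error("No fetch(...) URL found");
  const headersMatch = snippet.match(/"headers"\s*:\s*(\{[\s\S]*?\})/);
  const headers = headersMatch ? JSON.parse(headersMatch[1]) : {};
  return { url: urlMatch[1], headers };
}
```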
