Wayback Machine Snapshot Program
I created a script to extract websites and their assets from the Wayback Machine.
It requires Python 3.7+ and the following dependencies: aiohttp and aiofiles.
How to Use
- Install Python: HERE
⚠️ During installation, make sure to check "Add Python to PATH"
- Install the required packages by running this command in Command Prompt or PowerShell:
pip install aiohttp aiofiles
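A quick way to confirm the setup before continuing is a short check script. This is only a sketch: missing_packages is a hypothetical helper name, and the check only tests whether the two packages can be found by the Python interpreter that runs it.

```python
import importlib.util
import sys

REQUIRED = ["aiohttp", "aiofiles"]


def missing_packages(names):
    """Return the names that this interpreter cannot find as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Warn if the interpreter is older than the stated minimum version.
if sys.version_info < (3, 7):
    print("Python 3.7 or newer is needed; found", sys.version.split()[0])

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If a package is reported missing even after running pip, pip may have installed it for a different Python installation than the one running the check.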
- Use the Wayback Machine Snapshot Generator to get your JSON file: HERE
This generator uses the Internet Archive’s CDX API to fetch snapshots in JSON format.
It allows you to extract archived websites and their assets over a specific time range.
⚠️ Tip: To grab an entire website, add /* to the end of the URL.
Example: https://www.yahoo.com/*
Steps:
- Enter the website URL (include https:// or http://).
- Optionally set a start date (YYYYMMDD).
- Optionally set an end date (YYYYMMDD).
- The generator produces a CDX API link.
- Open the link in your browser and wait for it to load.
- When the JSON data appears, right-click and choose “Save As” to download it.
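The link from the steps above can also be built by hand. The sketch below assumes the generator targets the Internet Archive's public CDX endpoint with the url, output, from, and to query parameters; build_cdx_url is a hypothetical helper name and not part of the generator itself.

```python
from urllib.parse import urlencode

# Public CDX search endpoint (assumed to be the one the generator links to).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site_url, start=None, end=None):
    """Return a CDX query URL for site_url, optionally bounded by YYYYMMDD dates."""
    params = {"url": site_url, "output": "json"}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    # Keep ':', '/' and '*' unescaped so the link stays readable.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe=':/*')}"


print(build_cdx_url("https://www.yahoo.com/*", "19960101", "19991231"))
```

Leaving out both dates asks for every snapshot of the URL, which is where the very large JSON files come from.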
- Go to the WebArchiveSnapshotProgram generator site: HERE
How to Use WebArchiveSnapshotProgram
- Prepare your JSON file with timestamps and original URLs.
- Create a download folder (e.g., C:\download_images).
- Generate the Python script by filling in your JSON path and folder path, then download WebArchiveSnapshotProgram.py.
- Run the script:
python WebArchiveSnapshotProgram.py
- The program downloads images asynchronously, skipping duplicates. When it finishes, you’ll see “Happy searching :)”.
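To give a rough idea of what a script like this does, here is a minimal sketch, not the generated WebArchiveSnapshotProgram.py itself. It assumes the JSON was saved from the CDX API, so the first row is a header containing "timestamp" and "original". The function names, the timestamp_filename naming scheme, and the use of the id_ snapshot URL form are all assumptions made for illustration.

```python
import asyncio
import json
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def load_snapshots(json_path):
    """Read a saved CDX response and return (timestamp, original_url) pairs."""
    rows = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not rows:
        return []
    header, records = rows[0], rows[1:]  # first row names the columns
    ts, orig = header.index("timestamp"), header.index("original")
    return [(row[ts], row[orig]) for row in records]


def local_path(folder, timestamp, original_url):
    """Map a snapshot to a file inside folder (hypothetical naming scheme)."""
    name = PurePosixPath(urlsplit(original_url).path).name or "index.html"
    return Path(folder) / f"{timestamp}_{name}"


async def download_all(snapshots, folder, max_parallel=5):
    """Download every snapshot into folder, skipping files that already exist."""
    # Imported here so the helpers above work without the third-party packages.
    import aiofiles
    import aiohttp

    Path(folder).mkdir(parents=True, exist_ok=True)
    limit = asyncio.Semaphore(max_parallel)  # cap simultaneous requests

    async def fetch(session, timestamp, original_url):
        target = local_path(folder, timestamp, original_url)
        if target.exists():  # already saved by this or an earlier run
            return
        # Assumed URL form: id_ asks the Wayback Machine for the unmodified file.
        url = f"https://web.archive.org/web/{timestamp}id_/{original_url}"
        async with limit, session.get(url) as resp:
            if resp.status != 200:  # ignore snapshots that cannot be fetched
                return
            data = await resp.read()
        async with aiofiles.open(target, "wb") as f:
            await f.write(data)

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, t, u) for t, u in snapshots))
```

With a JSON file saved as snapshots.json (a placeholder name), a run would look like `asyncio.run(download_all(load_snapshots("snapshots.json"), r"C:\download_images"))`. Because existing files are checked before each request, starting it a second time only fetches what is still missing.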
Things to Know
- Extracting an entire site may take time; JSON files can be large.
- If it struggles to connect and keeps retrying, exit and run the script again; it will skip the files it has already downloaded.
- Be as specific as possible when entering URLs (e.g., spacejam.com/1996).
- It automatically skips duplicate files.
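The retrying mentioned above can be pictured as a loop that waits a little longer after each failed attempt and eventually gives up. This is only a sketch of that general idea, not the program's actual retry code: with_retries, its parameters, and the default exception types are all assumptions. With aiohttp you would likely add aiohttp.ClientError to retry_on.

```python
import asyncio


async def with_retries(make_request, attempts=3, base_delay=1.0,
                       retry_on=(OSError, asyncio.TimeoutError)):
    """Await make_request() up to `attempts` times, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return await make_request()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def demo():
    calls = 0

    async def flaky():
        # Stand-in for a network call that fails twice before succeeding.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise OSError("temporary failure")
        return "ok"

    return await with_retries(flaky, attempts=3, base_delay=0), calls


print(asyncio.run(demo()))
```

Capping the number of attempts means a dead connection ends with an error instead of an endless loop. At that point, rerunning the script picks up where it left off.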
Happy searching :)