feat: add Twitter tweet/thread archiving and platform shorthand support (#5)

* Add Twitter tweet and thread archiving support * Fix tweet scraper path resolution and error reporting * Flatten tweet archives and rearchive tweet assets * refactor: simplify archive source parsing * Refactor tweet archive source handling * Clean up some clanker-written code Signed-off-by: TheGeneralist <180094941+thegeneralist01@users.noreply.github.com> * Rename resolve_from_cwd to absolutize_path Update call sites and tests to use the new API. Adjust tweet scraper path/credentials handling and make small tweaks to local path hashing and raw store helpers. Signed-off-by: TheGeneralist <180094941+thegeneralist01@users.noreply.github.com> Signed-off-by: TheGeneralist <180094941+thegeneralist01@users.noreply.github.com> * Add docs for supported platforms, shorthands, and env vars * Minor clean up * Implement social shorthand URL expansion and tweet alias parsing * Extract store and Twitter helpers into shared modules --------- Signed-off-by: TheGeneralist <180094941+thegeneralist01@users.noreply.github.com>
2026-07-22 03:05:32 +02:00 · 2026-04-03 15:34:26 +02:00 · 2026-04-03 15:34:26 +02:00 · 6bee5adfdd
commit 6bee5adfdd
parent 9441a9d9fb
9 changed files with 2377 additions and 54 deletions
--- a/docs/README.md
+++ b/docs/README.md
@ -3,47 +3,114 @@
 An open-source self-hosted archiving tool. Work in progress.

 ## Milestones
+
 - [ ] Archiving
-    - [X] Archiving media files from social media platforms
-        - [X] YouTube Videos
-        - [X] Twitter Videos
-        - [X] Instagram
-        - [X] Facebook
-        - [X] TikTok
-        - [X] Reddit
-        - [X] Snapchat
-        - [ ] YouTube Posts (postponed)
-    - [X] Archiving local files
-    - [ ] Archiving files from cloud storage services (Google Drive, Dropbox, OneDrive) and from URLs
-        - [ ] URLs
-        - [ ] Google Drive
-        - [ ] Dropbox
-        - [ ] OneDrive
-        - (Some of these could be postponed for later.)
-    - [ ] Archiving Twitter threads
-    - [ ] Archive web pages (HTML, CSS, JS, images)
-    - [ ] Archiving emails (???)
-        - [ ] Gmail
-        - [ ] Outlook
-        - [ ] Yahoo Mail
+  - [x] Archiving media files from social media platforms
+    - [x] YouTube Videos
+    - [x] Twitter Videos
+    - [x] Instagram
+    - [x] Facebook
+    - [x] TikTok
+    - [x] Reddit
+    - [x] Snapchat
+    - [ ] YouTube Posts (postponed)
+  - [x] Archiving local files
+  - [x] Archiving Twitter Tweets & Threads
+  - [ ] Archiving files from cloud storage services (Google Drive, Dropbox, OneDrive) and from URLs
+    - [ ] URLs
+    - [ ] Google Drive
+    - [ ] Dropbox
+    - [ ] OneDrive
+    - (Some of these could be postponed for later.)
+  - [ ] Archiving Twitter articles
+  - [ ] Archive web pages (HTML, CSS, JS, images)
+  - [ ] Archiving emails (???)
+    - [ ] Gmail
+    - [ ] Outlook
+    - [ ] Yahoo Mail
 - [ ] Management
-    - [ ] Deduplication
-    - [ ] Tagging system
-    - [ ] Search functionality
-    - [ ] Categorization
-    - [ ] Metadata extraction and storage
+  - [ ] Deduplication
+  - [ ] Tagging system
+  - [ ] Search functionality
+  - [ ] Categorization
+  - [ ] Metadata extraction and storage
 - [ ] User Interface
-    - [ ] Web-based UI
+  - [ ] Web-based UI
 - [ ] Backup and Sync
-    - [ ] Cloud backup (AWS S3, Google Cloud Storage)
-    - [ ] Local backup
+  - [ ] Cloud backup (AWS S3, Google Cloud Storage)
+  - [ ] Local backup

 ## Motivation
+
 There are two driving factors behind this project:
- In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible and unreliable. Being able to archive important data is *very important* for preserving personal memories and digital history.
+
+- In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible and unreliable. Being able to archive important data is _very important_ for preserving personal memories and digital history.
 - I will be creating a small encyclopedia for my future family and kids. Therefore, I want to make sure that all the information I gather is preserved and accessible for future reference.

 This project aims to provide a reliable solution for archiving important data from various sources, ensuring that users can preserve their digital assets for the long term.

+## Archive Inputs
+
+`archivr archive <path>` currently accepts three kinds of inputs:
+
+- Local files via `file://...`
+- Direct platform URLs
+- Platform shorthand inputs such as `tweet:...`, `yt:...`, or `instagram:...`
+
+### Supported Platforms
+
+- Local files: `file:///absolute/path/to/file.ext`
+- YouTube media: standard video/short URLs, plus [shorthand video inputs](#supported-shorthand-inputs)
+- X/Twitter media from Tweets: normal Tweet URLs or the `tweet:media:ID` shorthand
+- X/Twitter Tweet content scrape: [Tweet and Thread shorthands](#supported-shorthand-inputs). (These are saved as TOML files in `raw_tweets/`)
+- Instagram, Facebook, TikTok, Reddit, Snapchat: direct URLs or platform-prefixed shorthand passed through to `yt-dlp`
+
+### Supported Shorthand Inputs
+
+- YouTube video/short media:
+  - `yt:video/ID`
+  - `youtube:video/ID`
+  - `yt:short/ID`
+  - `yt:shorts/ID`
+  - `youtube:shorts/ID`
+- X/Twitter tweet TOML content:
+  - `tweet:ID`
+  - `x:tweet:ID`
+  - `x:x:ID`
+  - `twitter:x:ID`
+  - `twitter:tweet:ID`
+- X/Twitter media/video download:
+  - `tweet:media:ID`
+- X/Twitter thread TOML content:
+  - `x:thread:ID`
+  - `twitter:thread:ID`
+- Other platform shorthands:
+  - `instagram:ID`
+  - `facebook:ID`
+  - `tiktok:ID`
+  - `reddit:ID`
+  - `snapchat:ID`
+
+### Environment Variables
+
+- `ARCHIVR_YT_DLP`
+  - Optional.
+  - Overrides the `yt-dlp` binary used for YouTube, X media posts, Instagram, Facebook, TikTok, Reddit, and Snapchat downloads.
+- `ARCHIVR_TWITTER_CREDENTIALS_FILE`
+  - Required for tweet/thread scraping inputs such as `tweet:ID` and `x:thread:ID`.
+  - Must point to a cookies file for the vendored scraper.
+- `ARCHIVR_TWEET_SCRAPER`
+  - Optional.
+  - Overrides the tweet scraper script path. Default: `vendor/twitter/scrape_user_tweet_contents.py`.
+- `ARCHIVR_TWEET_PYTHON`
+  - Optional.
+  - Overrides the Python executable used to run the tweet scraper. Default: `python3`.
+
+### Current Limitations
+
+- Arbitrary `http://` or `https://` pages are not archived yet unless they match one of the currently supported platforms above.
+- Local files currently need to be passed as `file://...` paths.
+
 ## License
+
 This project is licensed under the MIT License. See the [LICENSE](LICENSE.md) file for details.