Add docs for supported platforms, shorthands, and env vars

2026-05-30 08:36:47 +02:00 · 2026-04-03 14:17:30 +02:00 · 4871ca7da5
commit 4871ca7da5
parent 9837bda0c2
1 changed file with 94 additions and 37 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -3,24 +3,25 @@
 An open-source self-hosted archiving tool. Work in progress.
 ## Milestones
 - [ ] Archiving
-    - [X] Archiving media files from social media platforms
+  - [x] Archiving media files from social media platforms
-        - [X] YouTube Videos
+    - [x] YouTube Videos
-        - [X] Twitter Videos
+    - [x] Twitter Videos
-        - [X] Instagram
+    - [x] Instagram
-        - [X] Facebook
+    - [x] Facebook
-        - [X] TikTok
+    - [x] TikTok
-        - [X] Reddit
+    - [x] Reddit
-        - [X] Snapchat
+    - [x] Snapchat
    - [ ] YouTube Posts (postponed)
-    - [X] Archiving local files
+  - [x] Archiving local files
  - [ ] Archiving files from cloud storage services (Google Drive, Dropbox, OneDrive) and from URLs
    - [ ] URLs
    - [ ] Google Drive
    - [ ] Dropbox
    - [ ] OneDrive
    - (Some of these could be postponed for later.)
-    - [X] Archiving Twitter threads
+  - [x] Archiving Twitter threads
  - [ ] Archive web pages (HTML, CSS, JS, images)
  - [ ] Archiving emails (???)
    - [ ] Gmail
@@ -39,20 +40,76 @@ An open-source self-hosted archiving tool. Work in progress.
  - [ ] Local backup
 ## Motivation
 There are two driving factors behind this project:
- In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible and unreliable. Being able to archive important data is *very important* for preserving personal memories and digital history.
+
 - In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible and unreliable. Being able to archive important data is _very important_ for preserving personal memories and digital history.
 - I will be creating a small encyclopedia for my future family and kids. Therefore, I want to make sure that all the information I gather is preserved and accessible for future reference.
 This project aims to provide a reliable solution for archiving important data from various sources, ensuring that users can preserve their digital assets for the long term.
-## Twitter/X Archive Inputs
+## Archive Inputs
 - Tweet content TOML: `tweet:ID`, `x:tweet:ID`, `x:x:ID`, `twitter:x:ID`, `twitter:tweet:ID`
 - Tweet media/video: `tweet:media:ID`
 - Thread TOML content: `x:thread:ID`, `twitter:thread:ID`
-Tweet and thread TOMLs are stored directly in `raw_tweets/`. Downloaded tweet media and avatars are re-archived into the hashed `raw/` store, and the TOMLs point at those archived files using store-relative `raw/...` paths.
+`archivr archive <path>` currently accepts three kinds of inputs:
-Twitter tweet/thread scraping requires `ARCHIVR_TWITTER_CREDENTIALS_FILE` to point to a cookies file for the vendored scraper.
+- Local files via `file://...`
 - Direct platform URLs
 - Platform shorthand inputs such as `tweet:...`, `yt:...`, or `instagram:...`
 ### Supported Platforms
 - Local files: `file:///absolute/path/to/file.ext`
 - YouTube media: standard video/short URLs, plus [shorthand video inputs](#supported-shorthand-inputs)
 - X/Twitter media from Tweets: normal Tweet URLs or the `tweet:media:ID` shorthand
 - X/Twitter Tweet content scrape: [Tweet and Thread shorthands](#supported-shorthand-inputs). (These are saved as TOML files in `raw_tweets/`)
 - Instagram, Facebook, TikTok, Reddit, Snapchat: direct URLs or platform-prefixed shorthand passed through to `yt-dlp`
 ### Supported Shorthand Inputs
 - YouTube video/short media:
  - `yt:video/ID`
  - `youtube:video/ID`
  - `yt:short/ID`
  - `yt:shorts/ID`
  - `youtube:shorts/ID`
 - X/Twitter tweet TOML content:
  - `tweet:ID`
  - `x:tweet:ID`
  - `x:x:ID`
  - `twitter:x:ID`
  - `twitter:tweet:ID`
 - X/Twitter media/video download:
  - `tweet:media:ID`
 - X/Twitter thread TOML content:
  - `x:thread:ID`
  - `twitter:thread:ID`
 - Other platform shorthands:
  - `instagram:ID`
  - `facebook:ID`
  - `tiktok:ID`
  - `reddit:ID`
  - `snapchat:ID`
 ### Environment Variables
 - `ARCHIVR_YT_DLP`
  - Optional.
  - Overrides the `yt-dlp` binary used for YouTube, X media posts, Instagram, Facebook, TikTok, Reddit, and Snapchat downloads.
 - `ARCHIVR_TWITTER_CREDENTIALS_FILE`
  - Required for tweet/thread scraping inputs such as `tweet:ID` and `x:thread:ID`.
  - Must point to a cookies file for the vendored scraper.
 - `ARCHIVR_TWEET_SCRAPER`
  - Optional.
  - Overrides the tweet scraper script path. Default: `vendor/twitter/scrape_user_tweet_contents.py`.
 - `ARCHIVR_TWEET_PYTHON`
  - Optional.
  - Overrides the Python executable used to run the tweet scraper. Default: `python3`.
 ### Current Limitations
 - Arbitrary `http://` or `https://` pages are not archived yet unless they match one of the currently supported platforms above.
 - Local files currently need to be passed as `file://...` paths.
 ## License
 This project is licensed under the MIT License. See the [LICENSE](LICENSE.md) file for details.
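
The input kinds and environment variables documented in this change can be illustrated with a small Python sketch. It is not archivr's actual implementation: the function names `classify_input` and `resolve_tooling`, the category labels, and the `yt-dlp` fallback name are assumptions made for illustration. Only the shorthand prefixes, the variable names, and the two stated defaults come from the README text above.

```python
import os

# Shorthand prefixes copied from the README's "Supported Shorthand Inputs" list.
TWEET_CONTENT = ("tweet:", "x:tweet:", "x:x:", "twitter:x:", "twitter:tweet:")
THREAD_CONTENT = ("x:thread:", "twitter:thread:")
YOUTUBE_MEDIA = ("yt:video/", "youtube:video/", "yt:short/", "yt:shorts/", "youtube:shorts/")
OTHER_PLATFORMS = ("instagram:", "facebook:", "tiktok:", "reddit:", "snapchat:")


def classify_input(value: str) -> str:
    """Return a hypothetical category label for an archive input string."""
    if value.startswith("file://"):
        return "local_file"
    if value.startswith(("http://", "https://")):
        # Only URLs that match a supported platform are archived; this sketch
        # does not attempt that check.
        return "url"
    # Test the media shorthand before the generic `tweet:` prefix,
    # because `tweet:media:ID` also starts with `tweet:`.
    if value.startswith("tweet:media:"):
        return "tweet_media"
    if value.startswith(THREAD_CONTENT):
        return "thread_content"
    if value.startswith(TWEET_CONTENT):
        return "tweet_content"
    if value.startswith(YOUTUBE_MEDIA):
        return "youtube_media"
    if value.startswith(OTHER_PLATFORMS):
        return "yt_dlp_passthrough"
    return "unsupported"


def resolve_tooling(env=None) -> dict:
    """Resolve tool paths from the documented environment variables."""
    env = os.environ if env is None else env
    return {
        # Assumption: fall back to `yt-dlp` on PATH; the README states no default.
        "yt_dlp": env.get("ARCHIVR_YT_DLP", "yt-dlp"),
        # Defaults below are the ones stated in the README.
        "tweet_scraper": env.get(
            "ARCHIVR_TWEET_SCRAPER", "vendor/twitter/scrape_user_tweet_contents.py"
        ),
        "tweet_python": env.get("ARCHIVR_TWEET_PYTHON", "python3"),
        # Required for tweet/thread scraping; None means it is not set.
        "twitter_credentials": env.get("ARCHIVR_TWITTER_CREDENTIALS_FILE"),
    }


if __name__ == "__main__":
    for example in ("tweet:media:123", "x:thread:123", "yt:shorts/abc", "file:///tmp/a.txt"):
        print(example, "->", classify_input(example))
    print(resolve_tooling({}))
```

Checking `tweet:media:` before the plain `tweet:` prefix matters because both shorthands share the same leading token. A caller could also refuse `tweet_content` and `thread_content` inputs early when `twitter_credentials` is `None`, since the README marks that variable as required for scraping.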