mirror of https://github.com/thegeneralist01/archivr
synced 2026-05-30 08:36:47 +02:00
Add docs for supported platforms, shorthands, and env vars
parent 9837bda0c2
commit 4871ca7da5
1 changed file with 94 additions and 37 deletions
docs/README.md
@@ -3,56 +3,113 @@
 An open-source self-hosted archiving tool. Work in progress.
 
 ## Milestones
 
 - [ ] Archiving
-  - [X] Archiving media files from social media platforms
-    - [X] YouTube Videos
-    - [X] Twitter Videos
-    - [X] Instagram
-    - [X] Facebook
-    - [X] TikTok
-    - [X] Reddit
-    - [X] Snapchat
+  - [x] Archiving media files from social media platforms
+    - [x] YouTube Videos
+    - [x] Twitter Videos
+    - [x] Instagram
+    - [x] Facebook
+    - [x] TikTok
+    - [x] Reddit
+    - [x] Snapchat
     - [ ] YouTube Posts (postponed)
-  - [X] Archiving local files
+  - [x] Archiving local files
   - [ ] Archiving files from cloud storage services (Google Drive, Dropbox, OneDrive) and from URLs
     - [ ] URLs
     - [ ] Google Drive
     - [ ] Dropbox
     - [ ] OneDrive
     - (Some of these could be postponed for later.)
-  - [X] Archiving Twitter threads
+  - [x] Archiving Twitter threads
   - [ ] Archive web pages (HTML, CSS, JS, images)
   - [ ] Archiving emails (???)
     - [ ] Gmail
     - [ ] Outlook
     - [ ] Yahoo Mail
 - [ ] Management
   - [ ] Deduplication
   - [ ] Tagging system
   - [ ] Search functionality
   - [ ] Categorization
   - [ ] Metadata extraction and storage
 - [ ] User Interface
   - [ ] Web-based UI
 - [ ] Backup and Sync
   - [ ] Cloud backup (AWS S3, Google Cloud Storage)
   - [ ] Local backup
 
 ## Motivation
 
 There are two driving factors behind this project:
-- In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible and unreliable. Being able to archive important data is *very important* for preserving personal memories and digital history.
+- In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible and unreliable. Being able to archive important data is _very important_ for preserving personal memories and digital history.
 - I will be creating a small encyclopedia for my future family and kids. Therefore, I want to make sure that all the information I gather is preserved and accessible for future reference.
 
 This project aims to provide a reliable solution for archiving important data from various sources, ensuring that users can preserve their digital assets for the long term.
 
-## Twitter/X Archive Inputs
-- Tweet content TOML: `tweet:ID`, `x:tweet:ID`, `x:x:ID`, `twitter:x:ID`, `twitter:tweet:ID`
-- Tweet media/video: `tweet:media:ID`
-- Thread TOML content: `x:thread:ID`, `twitter:thread:ID`
+## Archive Inputs
 
-Tweet and thread TOMLs are stored directly in `raw_tweets/`. Downloaded tweet media and avatars are re-archived into the hashed `raw/` store, and the TOMLs point at those archived files using store-relative `raw/...` paths.
+`archivr archive <path>` currently accepts three kinds of inputs:
 
-Twitter tweet/thread scraping requires `ARCHIVR_TWITTER_CREDENTIALS_FILE` to point to a cookies file for the vendored scraper.
+- Local files via `file://...`
+- Direct platform URLs
+- Platform shorthand inputs such as `tweet:...`, `yt:...`, or `instagram:...`
+
+### Supported Platforms
+
+- Local files: `file:///absolute/path/to/file.ext`
+- YouTube media: standard video/short URLs, plus [shorthand video inputs](#supported-shorthand-inputs)
+- X/Twitter media from Tweets: normal Tweet URLs or the `tweet:media:ID` shorthand
+- X/Twitter Tweet content scrape: [Tweet and Thread shorthands](#supported-shorthand-inputs). (These are saved as TOML files in `raw_tweets/`)
+- Instagram, Facebook, TikTok, Reddit, Snapchat: direct URLs or platform-prefixed shorthand passed through to `yt-dlp`
+
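Review note: the three input kinds described above (local `file://` paths, direct platform URLs, and platform shorthands) can be told apart by the text before the first colon. The following is only a minimal sketch of that idea, not archivr's actual implementation. The function name `input_kind` and the `SHORTHAND_PREFIXES` tuple are hypothetical; the prefixes themselves are copied from the README text in this commit.

```python
# Hypothetical helper, not part of archivr: classify an archive input string.
# The shorthand prefixes are transcribed from the README added in this commit.
SHORTHAND_PREFIXES = (
    "tweet", "x", "twitter", "yt", "youtube",
    "instagram", "facebook", "tiktok", "reddit", "snapchat",
)


def input_kind(value: str) -> str:
    """Return 'file', 'url', 'shorthand', or 'unknown' for an input string."""
    prefix, sep, rest = value.partition(":")
    prefix = prefix.lower()
    if not sep:
        # No colon at all, e.g. a bare relative path.
        return "unknown"
    if prefix == "file" and rest.startswith("//"):
        return "file"
    if prefix in ("http", "https") and rest.startswith("//"):
        return "url"
    if prefix in SHORTHAND_PREFIXES:
        return "shorthand"
    return "unknown"


if __name__ == "__main__":
    for example in ("file:///absolute/path/to/file.ext", "tweet:media:123", "notes.txt"):
        print(example, "->", input_kind(example))
```

Note that `url` here only means "looks like an HTTP(S) URL"; whether the URL belongs to a supported platform is a separate check that this sketch does not attempt.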
+### Supported Shorthand Inputs
+
+- YouTube video/short media:
+  - `yt:video/ID`
+  - `youtube:video/ID`
+  - `yt:short/ID`
+  - `yt:shorts/ID`
+  - `youtube:shorts/ID`
+- X/Twitter tweet TOML content:
+  - `tweet:ID`
+  - `x:tweet:ID`
+  - `x:x:ID`
+  - `twitter:x:ID`
+  - `twitter:tweet:ID`
+- X/Twitter media/video download:
+  - `tweet:media:ID`
+- X/Twitter thread TOML content:
+  - `x:thread:ID`
+  - `twitter:thread:ID`
+- Other platform shorthands:
+  - `instagram:ID`
+  - `facebook:ID`
+  - `tiktok:ID`
+  - `reddit:ID`
+  - `snapchat:ID`
+
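Review note: the shorthand lists above amount to a prefix table. As an illustration only, the sketch below maps each documented prefix to a platform and a kind and extracts the trailing ID. The names `Shorthand`, `parse_shorthand`, and the kind label `"item"` for the non-YouTube, non-Twitter platforms are assumptions made for this example; how archivr itself parses these strings is not shown in this commit.

```python
from typing import NamedTuple, Optional


class Shorthand(NamedTuple):
    platform: str  # e.g. "youtube", "x", "instagram"
    kind: str      # "video", "short", "tweet", "media", "thread", or "item"
    ident: str     # the trailing ID


# Prefix -> (platform, kind), transcribed from the shorthand lists above.
# The labels on the right-hand side are illustrative, not archivr's own names.
_PREFIXES = {
    "yt:video/": ("youtube", "video"),
    "youtube:video/": ("youtube", "video"),
    "yt:short/": ("youtube", "short"),
    "yt:shorts/": ("youtube", "short"),
    "youtube:shorts/": ("youtube", "short"),
    "tweet:": ("x", "tweet"),
    "x:tweet:": ("x", "tweet"),
    "x:x:": ("x", "tweet"),
    "twitter:x:": ("x", "tweet"),
    "twitter:tweet:": ("x", "tweet"),
    "tweet:media:": ("x", "media"),
    "x:thread:": ("x", "thread"),
    "twitter:thread:": ("x", "thread"),
    "instagram:": ("instagram", "item"),
    "facebook:": ("facebook", "item"),
    "tiktok:": ("tiktok", "item"),
    "reddit:": ("reddit", "item"),
    "snapchat:": ("snapchat", "item"),
}


def parse_shorthand(value: str) -> Optional[Shorthand]:
    """Return the parsed shorthand, or None if no documented prefix matches."""
    # Try longer prefixes first so "tweet:media:" wins over "tweet:".
    for prefix in sorted(_PREFIXES, key=len, reverse=True):
        if value.startswith(prefix) and len(value) > len(prefix):
            platform, kind = _PREFIXES[prefix]
            return Shorthand(platform, kind, value[len(prefix):])
    return None
```

Matching the longest prefix first matters because `tweet:media:ID` would otherwise be read as a tweet whose ID is `media:ID`.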
+### Environment Variables
+
+- `ARCHIVR_YT_DLP`
+  - Optional.
+  - Overrides the `yt-dlp` binary used for YouTube, X media posts, Instagram, Facebook, TikTok, Reddit, and Snapchat downloads.
+- `ARCHIVR_TWITTER_CREDENTIALS_FILE`
+  - Required for tweet/thread scraping inputs such as `tweet:ID` and `x:thread:ID`.
+  - Must point to a cookies file for the vendored scraper.
+- `ARCHIVR_TWEET_SCRAPER`
+  - Optional.
+  - Overrides the tweet scraper script path. Default: `vendor/twitter/scrape_user_tweet_contents.py`.
+- `ARCHIVR_TWEET_PYTHON`
+  - Optional.
+  - Overrides the Python executable used to run the tweet scraper. Default: `python3`.
+
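Review note: the environment variables above follow an "override with documented default" pattern. The sketch below shows one way such a lookup could be written; it is not taken from archivr's source. `resolve_tools` and `require_twitter_credentials` are hypothetical names, and the fallback to a bare `yt-dlp` command is an assumption, since the README does not state a default for `ARCHIVR_YT_DLP`.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_tools(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Collect tool settings from the environment, applying documented defaults."""
    return {
        # Assumption: fall back to `yt-dlp` on PATH; the README gives no default.
        "yt_dlp": env.get("ARCHIVR_YT_DLP", "yt-dlp"),
        # Defaults below are the ones stated in the README.
        "tweet_scraper": env.get(
            "ARCHIVR_TWEET_SCRAPER",
            "vendor/twitter/scrape_user_tweet_contents.py",
        ),
        "tweet_python": env.get("ARCHIVR_TWEET_PYTHON", "python3"),
        # No default: required only for tweet/thread scraping inputs.
        "twitter_credentials": env.get("ARCHIVR_TWITTER_CREDENTIALS_FILE"),
    }


def require_twitter_credentials(config: Mapping[str, Optional[str]]) -> str:
    """Return the cookies file path, or fail if scraping was requested without it."""
    path = config.get("twitter_credentials")
    if not path:
        raise RuntimeError(
            "ARCHIVR_TWITTER_CREDENTIALS_FILE must be set for tweet/thread scraping"
        )
    return path
```

Passing the environment in as a mapping keeps the lookup easy to exercise with a plain dictionary instead of the real process environment.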
+### Current Limitations
+
+- Arbitrary `http://` or `https://` pages are not archived yet unless they match one of the currently supported platforms above.
+- Local files currently need to be passed as `file://...` paths.
 
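Review note: since local files must be passed as `file://...` paths, a caller may want to convert an absolute POSIX path first. The helper below is a hypothetical convenience, not something archivr provides. It assumes the standard file URI form with percent-encoding for characters such as spaces; whether archivr decodes percent-encoded paths is not stated in this commit.

```python
import posixpath
from urllib.parse import quote


def to_file_uri(path: str) -> str:
    """Convert an absolute POSIX path into a file:// URI."""
    if not posixpath.isabs(path):
        raise ValueError(f"expected an absolute path, got {path!r}")
    # quote() leaves "/" untouched and percent-encodes spaces and similar characters.
    return "file://" + quote(path)
```

For example, `to_file_uri("/absolute/path/to/file.ext")` yields `file:///absolute/path/to/file.ext`, which matches the form shown under Supported Platforms.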
 ## License
 
 This project is licensed under the MIT License. See the [LICENSE](LICENSE.md) file for details.