mirror of https://github.com/thegeneralist01/archivr synced 2026-05-30 08:36:47 +02:00
docs Add SQLite metadata database support 2026-04-30 21:55:53 +04:00
src Implement archive metadata database 2026-05-04 20:27:54 +02:00
vendor/twitter fix: extract full tweet text from note_tweet field when available 2026-04-06 11:05:32 +02:00
.gitignore feat: add Twitter tweet/thread archiving and platform shorthand support (#5) 2026-04-03 15:34:26 +02:00
Cargo.lock Implement archive metadata database 2026-05-04 20:27:54 +02:00
Cargo.toml Implement archive metadata database 2026-05-04 20:27:54 +02:00
flake.lock feat: add archiving of platform media files (#1) 2026-03-31 12:39:35 +02:00
flake.nix chore: let's guess cargoHash because there's something wrong with nixpkgs! 2026-05-29 15:22:18 +02:00

archivr

An open-source self-hosted archiving tool. Work in progress.

Milestones

  • Archiving
    • Archiving media files from social media platforms
      • YouTube Videos
      • Twitter Videos
      • Instagram
      • Facebook
      • TikTok
      • Reddit
      • Snapchat
      • YouTube Posts (postponed)
    • Archiving local files
    • Archiving Twitter Tweets, Threads, and Articles
    • Archiving files from cloud storage services (Google Drive, Dropbox, OneDrive) and from URLs
      • URLs
      • Google Drive
      • Dropbox
      • OneDrive
      • (Some of these could be postponed for later.)
    • Archiving web pages (HTML, CSS, JS, images)
    • Archiving emails (???)
      • Gmail
      • Outlook
      • Yahoo Mail
  • Management
    • Deduplication
    • Tagging system
    • Search functionality
    • Categorization
    • Metadata extraction and storage
  • User Interface
    • Web-based UI
  • Backup and Sync
    • Cloud backup (AWS S3, Google Cloud Storage)
    • Local backup

Motivation

There are two driving factors behind this project:

  • In the age of information, all data is ephemeral. Social media platforms frequently delete content, and cloud storage services can become inaccessible or unreliable. Archiving important data is essential for preserving personal memories and digital history.
  • I will be creating a small encyclopedia for my future family and kids. Therefore, I want to make sure that all the information I gather is preserved and accessible for future reference.

This project aims to provide a reliable solution for archiving important data from various sources, ensuring that users can preserve their digital assets for the long term.

Archive Inputs

archivr archive <path> currently accepts three kinds of inputs:

  • Local files via file://...
  • Direct platform URLs
  • Platform shorthand inputs such as tweet:..., yt:..., or instagram:...
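
The three input kinds above can be told apart by their prefix. The following is a minimal sketch of that idea, not archivr's actual code: the `InputKind` enum and the `classify_input` function are hypothetical names, and treating any non-empty `prefix:rest` string as a shorthand is an assumption.

```rust
/// The three kinds of input listed above (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum InputKind {
    LocalFile,
    DirectUrl,
    Shorthand,
}

/// Classifies an input string by its prefix.
/// Assumption: anything shaped like `prefix:rest` that is not a file or
/// http(s) URL is treated as a platform shorthand.
fn classify_input(input: &str) -> Option<InputKind> {
    if input.starts_with("file://") {
        Some(InputKind::LocalFile)
    } else if input.starts_with("http://") || input.starts_with("https://") {
        Some(InputKind::DirectUrl)
    } else if input
        .split_once(':')
        .map_or(false, |(prefix, rest)| !prefix.is_empty() && !rest.is_empty())
    {
        Some(InputKind::Shorthand)
    } else {
        None
    }
}
```

A classifier like this only decides which branch to take; whether a direct URL or shorthand belongs to a supported platform would still need a separate check.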

Supported Platforms

  • Local files: file:///absolute/path/to/file.ext
  • YouTube media: standard video/short URLs, plus shorthand video inputs
  • X/Twitter media from Tweets: normal Tweet URLs or the tweet:media:ID shorthand
  • X/Twitter Tweet content scrape: Tweet and Thread shorthands (saved as JSON files in raw_tweets/)
  • Instagram, Facebook, TikTok, Reddit, Snapchat: direct URLs or platform-prefixed shorthand passed through to yt-dlp

Supported Shorthand Inputs

  • YouTube video/short media:
    • yt:video/ID
    • youtube:video/ID
    • yt:short/ID
    • yt:shorts/ID
    • youtube:shorts/ID
  • X/Twitter tweet JSON content:
    • tweet:ID
    • x:tweet:ID
    • x:x:ID
    • twitter:x:ID
    • twitter:tweet:ID
  • X/Twitter media/video download:
    • tweet:media:ID
  • X/Twitter thread JSON content:
    • x:thread:ID
    • twitter:thread:ID
  • Other platform shorthands:
    • instagram:ID
    • facebook:ID
    • tiktok:ID
    • reddit:ID
    • snapchat:ID
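
To make the shorthand grammar above concrete, here is a small sketch of a parser that accepts exactly the listed forms. It is an illustration under stated assumptions, not the parser archivr ships: the `Shorthand` enum and `parse_shorthand` function are hypothetical, and IDs are passed through without validation.

```rust
/// Parsed form of a shorthand input (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum Shorthand<'a> {
    YouTubeVideo(&'a str),
    YouTubeShort(&'a str),
    TweetJson(&'a str),
    TweetMedia(&'a str),
    ThreadJson(&'a str),
    Other { platform: &'a str, id: &'a str },
}

/// Parses only the shorthand forms listed above; anything else yields `None`.
fn parse_shorthand(input: &str) -> Option<Shorthand<'_>> {
    let (prefix, rest) = input.split_once(':')?;
    if rest.is_empty() {
        return None;
    }
    match prefix {
        "yt" | "youtube" => {
            let (kind, id) = rest.split_once('/')?;
            if id.is_empty() {
                return None;
            }
            match (prefix, kind) {
                ("yt" | "youtube", "video") => Some(Shorthand::YouTubeVideo(id)),
                ("yt", "short") | ("yt" | "youtube", "shorts") => {
                    Some(Shorthand::YouTubeShort(id))
                }
                _ => None,
            }
        }
        // `tweet:media:ID` is a media download; plain `tweet:ID` is a JSON scrape.
        "tweet" => match rest.strip_prefix("media:") {
            Some(id) if !id.is_empty() => Some(Shorthand::TweetMedia(id)),
            Some(_) => None,
            None => Some(Shorthand::TweetJson(rest)),
        },
        "x" | "twitter" => {
            let (kind, id) = rest.split_once(':')?;
            if id.is_empty() {
                return None;
            }
            match kind {
                "tweet" | "x" => Some(Shorthand::TweetJson(id)),
                "thread" => Some(Shorthand::ThreadJson(id)),
                _ => None,
            }
        }
        "instagram" | "facebook" | "tiktok" | "reddit" | "snapchat" => {
            Some(Shorthand::Other { platform: prefix, id: rest })
        }
        _ => None,
    }
}
```

The sketch is deliberately strict: a form such as `youtube:short/ID`, which does not appear in the list, is rejected rather than guessed at.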

Environment Variables

  • ARCHIVR_YT_DLP
    • Optional.
    • Overrides the yt-dlp binary used for YouTube, X media posts, Instagram, Facebook, TikTok, Reddit, and Snapchat downloads.
  • ARCHIVR_TWITTER_CREDENTIALS_FILE
    • Required for tweet/thread scraping inputs such as tweet:ID and x:thread:ID.
    • Must point to a cookies file for the vendored scraper.
  • ARCHIVR_TWEET_SCRAPER
    • Optional.
    • Overrides the tweet scraper script path. Default: vendor/twitter/scrape_user_tweet_contents.py.
  • ARCHIVR_TWEET_PYTHON
    • Optional.
    • Overrides the Python executable used to run the tweet scraper. Default: python3.
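
The variables above could be resolved into one settings value along these lines. This is a hedged sketch rather than archivr's real configuration code: `ToolConfig` and `tool_config_from` are hypothetical names, and falling back to a `yt-dlp` binary on `PATH` is an assumption, since the README does not state that default. The scraper and Python defaults are the ones documented above.

```rust
use std::path::PathBuf;

/// Resolved tool settings (hypothetical struct, for illustration).
#[derive(Debug, PartialEq)]
struct ToolConfig {
    yt_dlp: String,
    /// `None` means tweet/thread scraping inputs cannot be used.
    twitter_credentials_file: Option<PathBuf>,
    tweet_scraper: PathBuf,
    tweet_python: String,
}

/// Builds the config from any key lookup, so the logic can be exercised
/// without touching the real process environment.
fn tool_config_from(lookup: impl Fn(&str) -> Option<String>) -> ToolConfig {
    ToolConfig {
        // Assumption: the default is whatever `yt-dlp` resolves to on PATH.
        yt_dlp: lookup("ARCHIVR_YT_DLP").unwrap_or_else(|| "yt-dlp".to_string()),
        twitter_credentials_file: lookup("ARCHIVR_TWITTER_CREDENTIALS_FILE").map(PathBuf::from),
        tweet_scraper: lookup("ARCHIVR_TWEET_SCRAPER")
            .map(PathBuf::from)
            .unwrap_or_else(|| PathBuf::from("vendor/twitter/scrape_user_tweet_contents.py")),
        tweet_python: lookup("ARCHIVR_TWEET_PYTHON").unwrap_or_else(|| "python3".to_string()),
    }
}
```

In a real program the lookup would typically be `|key| std::env::var(key).ok()`. Passing the lookup in as a closure is a design choice that keeps the function testable.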

Current Limitations

  • Arbitrary http:// or https:// pages are not archived yet unless they match one of the currently supported platforms above.
  • Local files currently need to be passed as file://... paths.
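
Because local files must be given as `file://...` inputs, a caller holding a plain absolute path needs to convert it first. The sketch below shows one way to go in both directions. The function names are hypothetical, only Unix-style absolute paths with an empty host (`file:///...`) are handled, and no percent-decoding is performed; all three points are simplifying assumptions.

```rust
use std::path::PathBuf;

/// Extracts the path from a `file:///absolute/path` input.
/// Returns `None` for anything else, including `file://host/...` forms.
fn local_path_from_file_url(input: &str) -> Option<PathBuf> {
    let path = input.strip_prefix("file://")?;
    if !path.starts_with('/') {
        return None;
    }
    Some(PathBuf::from(path))
}

/// Turns a Unix-style absolute path into a `file://` input.
/// Relative paths are rejected instead of being resolved.
fn to_file_input(absolute_path: &str) -> Option<String> {
    if absolute_path.starts_with('/') {
        Some(format!("file://{}", absolute_path))
    } else {
        None
    }
}
```

For example, `to_file_input("/home/me/notes.txt")` produces an input of the form `file:///home/me/notes.txt`, which can then be passed to `archivr archive`.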

License

This project is licensed under the MIT License. See the LICENSE file for details.