mirror of https://github.com/thegeneralist01/archivr synced 2026-05-30 08:36:47 +02:00

# Archivr Database Design Plan
## Summary
Design the first database as a `SQLite` metadata/index layer for the existing file-based archive store, while making the schema multi-user and public-archive ready from day one. The filesystem remains the source of truth for bytes and rendered archive output; the database becomes the source of truth for users, roles, archive runs, archived entries, visibility, hierarchy, blob reuse, and organization.
Each successfully archived thing becomes its own archived entry. Re-archiving the same source creates a new archived entry row, while deduplicated raw files continue to reuse the same blob rows underneath.
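
The blob reuse described above can be sketched as a get-or-create keyed on the content hash. This is a minimal illustration, not the project's implementation: the integer `id` column, the `raw/<prefix>/<digest>` path layout, and the helper names `connect` and `get_or_create_blob` are all assumptions made for the example.

```python
import hashlib
import sqlite3


def connect() -> sqlite3.Connection:
    """Open an in-memory database with a reduced `blobs` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE blobs (
            id          INTEGER PRIMARY KEY,   -- assumed surrogate key
            sha256      TEXT NOT NULL UNIQUE,  -- content hash drives dedup
            byte_size   INTEGER NOT NULL,
            raw_relpath TEXT NOT NULL
        )
        """
    )
    return conn


def get_or_create_blob(conn: sqlite3.Connection, data: bytes) -> int:
    """Return the blob row id for `data`, inserting a row only if it is new."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical on-disk layout; the plan only says blobs live under raw/.
    relpath = f"raw/{digest[:2]}/{digest}"
    # INSERT OR IGNORE is SQLite syntax; Postgres would use ON CONFLICT DO NOTHING.
    conn.execute(
        "INSERT OR IGNORE INTO blobs (sha256, byte_size, raw_relpath) VALUES (?, ?, ?)",
        (digest, len(data), relpath),
    )
    row = conn.execute("SELECT id FROM blobs WHERE sha256 = ?", (digest,)).fetchone()
    return row[0]
```

Calling `get_or_create_blob` twice with identical bytes returns the same row id, so two archived entries can point at one blob.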
## Key Changes
### Identity, access, and visibility
- `users`
  - Columns: stable public `user_uid`, `username`, `email` nullable, `password_hash`, `status`, `role`, `created_at`, `last_login_at` nullable.
  - Roles: `admin`, `user`.
- `instance_settings`
  - Global booleans for `public_index_enabled`, `public_entry_content_enabled`, `public_archive_submission_enabled`.
  - Defaults all `false`.
- `archived_entries`
  - Add `created_by_user_id`, `owned_by_user_id`, `visibility`.
  - `visibility` values: `private`, `unlisted`, `public`.
- `archive_runs`
  - Add `created_by_user_id`.
- Do not add groups or per-entry ACL tables in v1; keep the schema portable enough to add them later.
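
One possible shape for the `users` and `instance_settings` tables is sketched below. The column names come from the list above; everything else is an assumption for illustration: the integer `id` keys, the uniqueness of `username`, text timestamps, the single-row layout of `instance_settings`, and the helper name `create_identity_schema`. The allowed `status` values are not specified in this plan, so the sketch leaves that column unconstrained.

```python
import sqlite3

# Assumed DDL; only the column names are taken from the plan.
SCHEMA = """
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    user_uid      TEXT NOT NULL UNIQUE,
    username      TEXT NOT NULL UNIQUE,
    email         TEXT,
    password_hash TEXT NOT NULL,
    status        TEXT NOT NULL,
    role          TEXT NOT NULL CHECK (role IN ('admin', 'user')),
    created_at    TEXT NOT NULL,
    last_login_at TEXT
);

-- Single-row table: the CHECK on id keeps it to one settings record.
CREATE TABLE instance_settings (
    id                                INTEGER PRIMARY KEY CHECK (id = 1),
    public_index_enabled              INTEGER NOT NULL DEFAULT 0,
    public_entry_content_enabled      INTEGER NOT NULL DEFAULT 0,
    public_archive_submission_enabled INTEGER NOT NULL DEFAULT 0
);

INSERT INTO instance_settings (id) VALUES (1);
"""


def create_identity_schema(conn: sqlite3.Connection) -> None:
    """Create the identity tables and seed the all-false settings row."""
    conn.executescript(SCHEMA)
```

Plain `CHECK` constraints and integer booleans are used here because both carry over to Postgres with little change, which fits the portability goal.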
### Core archive model
- `archive_runs`
  - One user-started archive operation.
  - Columns: stable public `run_uid`, `created_by_user_id`, `started_at`, `finished_at`, `status`, `requested_count`, `discovered_count`, `completed_count`, `failed_count`, `error_summary`.
- `archive_run_items`
  - One requested or discovered work item inside an archive run.
  - Columns: `run_id`, stable `item_uid`, `parent_item_id` nullable, `ordinal`, `requested_locator`, `canonical_locator` nullable, `source_kind`, `entity_kind`, `status`, `error_text`, `produced_entry_id` nullable.
  - Supports batch requests and container expansion with progress like `0/14`.
- `source_identities`
  - Canonical identity of the thing being archived across re-archives.
  - Columns: `source_kind`, `entity_kind`, `external_id` nullable, `canonical_url` nullable, `normalized_locator`, `identity_key`.
  - Unique constraint on `identity_key`.
- `archived_entries`
  - One archived thing shown in the archive.
  - Columns: stable public `entry_uid`, `source_identity_id`, `archive_run_id`, `parent_entry_id` nullable, `root_entry_id`, `created_by_user_id`, `owned_by_user_id`, `source_kind`, `entity_kind`, `title` nullable, `visibility`, `archived_at`, `original_published_at` nullable, `structured_root_relpath`, `representation_kind`, `source_metadata_json`, `display_metadata_json` nullable.
  - `structured_root_relpath` is required and points to one root under `structured/<entry_uid>/`.
  - Main archive view queries only rows with `parent_entry_id IS NULL`.
  - Child entries remain first-class rows but are nested under the parent in the main view.
- `blobs`
  - One deduplicated raw file in `raw/`.
  - Columns: `sha256`, `byte_size`, `mime_type` nullable, `extension` nullable, `raw_relpath`, `created_at`.
- `entry_artifacts`
  - Selective file pointers attached to an archived entry.
  - Columns: `entry_id`, `artifact_role`, `storage_area`, `relpath`, `blob_id` nullable, `logical_path` nullable, `metadata_json` nullable.
  - `storage_area`: `raw`, `raw_tweets`, `structured`.
  - Store important files only: primary media, raw tweet JSON, avatar, subtitle, thumbnail, manifest, cover image.
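
To illustrate how `source_identities` and `archived_entries` interact, the sketch below records entries against a reduced version of both tables and runs the main-view query. Most columns are omitted (including `root_entry_id` and the ownership fields), and several details are assumptions made for the example: the integer `id` keys, the `youtube:video:<id>` format of `identity_key`, UUID hex strings for `entry_uid`, and the helper names.

```python
import sqlite3
import uuid
from typing import List, Optional

# Reduced tables: only the columns needed to show identity reuse and nesting.
SCHEMA = """
CREATE TABLE source_identities (
    id           INTEGER PRIMARY KEY,
    source_kind  TEXT NOT NULL,
    entity_kind  TEXT NOT NULL,
    identity_key TEXT NOT NULL UNIQUE
);
CREATE TABLE archived_entries (
    id                      INTEGER PRIMARY KEY,
    entry_uid               TEXT NOT NULL UNIQUE,
    source_identity_id      INTEGER NOT NULL REFERENCES source_identities(id),
    parent_entry_id         INTEGER REFERENCES archived_entries(id),
    structured_root_relpath TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_entry(conn: sqlite3.Connection, source_kind: str, entity_kind: str,
                 identity_key: str, parent_entry_id: Optional[int] = None) -> int:
    """Reuse (or create) the source identity, then always add a new entry row."""
    conn.execute(
        "INSERT OR IGNORE INTO source_identities (source_kind, entity_kind, identity_key) "
        "VALUES (?, ?, ?)",
        (source_kind, entity_kind, identity_key),
    )
    identity_id = conn.execute(
        "SELECT id FROM source_identities WHERE identity_key = ?", (identity_key,)
    ).fetchone()[0]
    entry_uid = uuid.uuid4().hex  # assumed uid format
    cur = conn.execute(
        "INSERT INTO archived_entries "
        "(entry_uid, source_identity_id, parent_entry_id, structured_root_relpath) "
        "VALUES (?, ?, ?, ?)",
        (entry_uid, identity_id, parent_entry_id, f"structured/{entry_uid}/"),
    )
    return cur.lastrowid


def main_view(conn: sqlite3.Connection) -> List[int]:
    """Top-level entries only; children are reached through their parent."""
    rows = conn.execute(
        "SELECT id FROM archived_entries WHERE parent_entry_id IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]
```

Recording the same `identity_key` twice yields two entry rows that share one identity row, and a child entry created with `parent_entry_id` set is left out of `main_view`.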
### Organization and extensibility
- `taxonomy_nodes`
  - Hierarchical organization tree.
  - Columns: stable `node_uid`, `parent_id` nullable, `name`, `slug`, `full_path`.
  - `full_path` unique, example `/sciences/computer-science/compilers`.
- `entry_taxonomy_assignments`
  - Many-to-many link between archived entries and taxonomy nodes.
  - Assign the most specific node; ancestor membership is derived via recursive queries.
- Keep shared fields relational and source-specific details in `source_metadata_json`.
  - YouTube examples: `video_id`, `channel_id`, duration, playlist membership.
  - Tweet examples: `tweet_id`, `author_handle`, conversation ID, text summary fields.
- Do not create per-source tables in v1.
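
Deriving ancestor membership with a recursive query could look like the sketch below. It walks the subtree under a given `full_path` and returns the entries assigned to any node in it. The reduced table definitions, the integer keys, the sample rows, and the helper names `build_demo` and `entries_under` are assumptions for the example.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE taxonomy_nodes (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES taxonomy_nodes(id),
    slug      TEXT NOT NULL,
    full_path TEXT NOT NULL UNIQUE
);
CREATE TABLE entry_taxonomy_assignments (
    entry_id INTEGER NOT NULL,
    node_id  INTEGER NOT NULL REFERENCES taxonomy_nodes(id),
    PRIMARY KEY (entry_id, node_id)
);
"""

# Start at the requested node, then repeatedly pull in child nodes.
ENTRIES_UNDER = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM taxonomy_nodes WHERE full_path = ?
    UNION ALL
    SELECT n.id
    FROM taxonomy_nodes AS n
    JOIN subtree AS s ON n.parent_id = s.id
)
SELECT DISTINCT a.entry_id
FROM entry_taxonomy_assignments AS a
JOIN subtree ON subtree.id = a.node_id
ORDER BY a.entry_id
"""


def build_demo() -> sqlite3.Connection:
    """Small sample tree with one entry assigned to the most specific node."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO taxonomy_nodes (id, parent_id, slug, full_path) VALUES (?, ?, ?, ?)",
        [
            (1, None, "sciences", "/sciences"),
            (2, 1, "computer-science", "/sciences/computer-science"),
            (3, 2, "compilers", "/sciences/computer-science/compilers"),
            (4, 1, "physics", "/sciences/physics"),
        ],
    )
    # Entry 42 is a made-up id, assigned only to the leaf node.
    conn.execute("INSERT INTO entry_taxonomy_assignments (entry_id, node_id) VALUES (42, 3)")
    return conn


def entries_under(conn: sqlite3.Connection, full_path: str) -> List[int]:
    return [row[0] for row in conn.execute(ENTRIES_UNDER, (full_path,))]
```

`WITH RECURSIVE` is supported by both SQLite and Postgres, so this query style does not tie the taxonomy to one engine.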
### Public/archive access behavior implied by schema
- Public archive browsing is controlled by both instance settings and entry visibility.
- `public` entries are eligible for anonymous listing/viewing only when instance-level public settings allow it.
- `unlisted` entries are not shown in public indexes but can be directly served later by URL/token design.
- `private` entries are visible only to authorized users.
- Ownership is recorded now even if the first UI only exposes simple admin/user behavior.
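
The two-level check described above (instance settings plus entry visibility) can be sketched as a pair of predicates. This is one reading of the plan, not a confirmed rule set: it assumes `public_index_enabled` gates anonymous listing, `public_entry_content_enabled` gates anonymous viewing, and `unlisted` entries are not served anonymously until the URL/token design exists. The function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSettings:
    """Mirrors the instance_settings booleans; all default to False."""
    public_index_enabled: bool = False
    public_entry_content_enabled: bool = False
    public_archive_submission_enabled: bool = False


def anonymous_can_list(settings: InstanceSettings, visibility: str) -> bool:
    """True when an entry may appear in a public index."""
    return settings.public_index_enabled and visibility == "public"


def anonymous_can_view(settings: InstanceSettings, visibility: str) -> bool:
    """True when an entry's content may be served to an anonymous visitor."""
    # Assumption: unlisted access is deferred, so only public entries qualify.
    return settings.public_entry_content_enabled and visibility == "public"
```

With the default settings both predicates return `False` for every visibility value, which matches the all-`false` defaults for a fresh instance.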
## Public APIs / Interfaces
- `archivr init`
  - Create the SQLite database and schema alongside the existing archive metadata directory.
  - Keep existing store directories.
- `archivr archive`
  - Start one `archive_run` owned by a user.
  - Insert one or more `archive_run_items`.
  - On success, create one or more `archived_entries`.
  - Link reused raw files through `blobs` and `entry_artifacts`.
  - Record the entry's `structured_root_relpath`, visibility, and source metadata JSON.
- New persisted domain types
  - `User`
  - `ArchiveRun`
  - `ArchiveRunItem`
  - `ArchivedEntry`
  - `SourceIdentity`
  - `Blob`
  - `EntryArtifact`
  - `TaxonomyNode`
  - `InstanceSettings`
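
The run bookkeeping behind `archivr archive` might be sketched as follows: start a run, insert one item per requested locator, and update counters as items finish. The status strings (`running`, `pending`, `completed`, `failed`), the integer keys, and the helper names are assumptions for the example, and `discovered_count` is left out to keep it short.

```python
import sqlite3
from typing import List, Optional

SCHEMA = """
CREATE TABLE archive_runs (
    id              INTEGER PRIMARY KEY,
    status          TEXT NOT NULL,
    requested_count INTEGER NOT NULL,
    completed_count INTEGER NOT NULL DEFAULT 0,
    failed_count    INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE archive_run_items (
    id                INTEGER PRIMARY KEY,
    run_id            INTEGER NOT NULL REFERENCES archive_runs(id),
    ordinal           INTEGER NOT NULL,
    requested_locator TEXT NOT NULL,
    status            TEXT NOT NULL,
    error_text        TEXT,
    UNIQUE (run_id, ordinal)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def start_run(conn: sqlite3.Connection, locators: List[str]) -> int:
    """Create one run plus one pending item per requested locator."""
    cur = conn.execute(
        "INSERT INTO archive_runs (status, requested_count) VALUES ('running', ?)",
        (len(locators),),
    )
    run_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO archive_run_items (run_id, ordinal, requested_locator, status) "
        "VALUES (?, ?, ?, 'pending')",
        [(run_id, ordinal, locator) for ordinal, locator in enumerate(locators)],
    )
    return run_id


def mark_item(conn: sqlite3.Connection, run_id: int, ordinal: int,
              status: str, error_text: Optional[str] = None) -> None:
    """Finish one item and bump the matching counter on its run."""
    if status not in ("completed", "failed"):
        raise ValueError(f"unexpected status: {status}")
    conn.execute(
        "UPDATE archive_run_items SET status = ?, error_text = ? "
        "WHERE run_id = ? AND ordinal = ?",
        (status, error_text, run_id, ordinal),
    )
    # Column name is chosen from a fixed pair above, never from user input.
    column = "completed_count" if status == "completed" else "failed_count"
    conn.execute(f"UPDATE archive_runs SET {column} = {column} + 1 WHERE id = ?", (run_id,))


def progress(conn: sqlite3.Connection, run_id: int) -> str:
    """Format progress as completed/requested, for example 0/14."""
    completed, requested = conn.execute(
        "SELECT completed_count, requested_count FROM archive_runs WHERE id = ?",
        (run_id,),
    ).fetchone()
    return f"{completed}/{requested}"
```

A real implementation would likely wrap each item update and its counter update in one transaction so the counters cannot drift from the item rows.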
## Test Plan
- Re-archiving the same YouTube video creates two `archived_entries`, one shared `source_identity`, and one shared primary `blob`.
- Archiving a tweet/thread creates one archived entry, records the raw tweet JSON as an `entry_artifact` in `raw_tweets`, and links downloaded media/avatar blobs correctly.
- Archiving a playlist/channel creates one top-level parent entry plus child entries; the main archive query returns only the parent.
- A single archive run with multiple requested locators records multiple run items and correct progress counters.
- A normal user can create entries but cannot manage other users or instance-wide public settings.
- An admin can manage users and instance-wide public settings.
- A `public` entry is still hidden from anonymous users when `public_index_enabled` or `public_entry_content_enabled` is disabled at the instance level.
- A `private` entry never appears in anonymous/public queries.
- Assigning `/sciences/computer-science/compilers` makes the item discoverable through ancestor queries for `sciences` and `computer-science`.
- A website-style entry can be represented as one archived entry with one structured root and no per-asset DB explosion.
## Assumptions
- SQLite is the only target for the first implementation, but the schema should avoid SQLite-only modeling that would block a later Postgres migration.
- The database indexes archive metadata; archive bytes stay on disk.
- Every archived entry gets a stable public ID used for `structured/<entry_uid>/`; timestamps are metadata, not identity.
- `raw_tweets/` remains a valid sibling storage area and is referenced through `entry_artifacts`.
- Titles are optional and nullable.
- Search, FTS, subtitles, transcript indexing, groups, and per-entry ACL sharing are deferred.
- Organization uses hierarchical taxonomy only for now; free-form tags are out of scope.
- The first permissions model matches the simpler ArchiveBox-style shape: admins, normal users, and optional public visibility, without custom group policy in v1.