Archivr Database Design Plan
Summary
Design the first database as a SQLite metadata/index layer for the existing file-based archive store, while making the schema multi-user and public-archive ready from day one. The filesystem remains the source of truth for bytes and rendered archive output; the database becomes the source of truth for users, roles, archive runs, archived entries, visibility, hierarchy, blob reuse, and organization.
Each successfully archived thing becomes its own archived entry. Re-archiving the same source creates a new archived entry row, while deduplicated raw files continue to reuse the same blob rows underneath.
Key Changes
Identity, access, and visibility
- users
  - Columns: stable public user_uid, username, email (nullable), password_hash, status, role, created_at, last_login_at (nullable).
  - Roles: admin, user.
- instance_settings
  - Global booleans for public_index_enabled, public_entry_content_enabled, public_archive_submission_enabled.
  - Defaults all false.
- archived_entries
  - Add created_by_user_id, owned_by_user_id, visibility.
  - visibility values: private, unlisted, public.
- archive_runs
  - Add created_by_user_id.
- Do not add groups or per-entry ACL tables in v1; keep the schema portable enough to add them later.
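
The identity tables above could be sketched in SQLite roughly as follows. This is a minimal illustration, not the project's actual DDL: the integer surrogate keys, the column types, the CHECK constraint on role, and the single-row pattern for instance_settings are all assumptions, as are the helper names open_identity_db, add_user, and public_flags.

```python
import sqlite3

# Hypothetical DDL: surrogate integer keys, column types, CHECK constraints,
# and the single-row instance_settings pattern are assumptions.
IDENTITY_SCHEMA = """
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    user_uid      TEXT NOT NULL UNIQUE,   -- stable public identifier
    username      TEXT NOT NULL UNIQUE,
    email         TEXT,                   -- nullable
    password_hash TEXT NOT NULL,
    status        TEXT NOT NULL,
    role          TEXT NOT NULL CHECK (role IN ('admin', 'user')),
    created_at    TEXT NOT NULL,
    last_login_at TEXT                    -- nullable
);

CREATE TABLE instance_settings (
    id INTEGER PRIMARY KEY CHECK (id = 1),  -- at most one settings row
    public_index_enabled              INTEGER NOT NULL DEFAULT 0,
    public_entry_content_enabled      INTEGER NOT NULL DEFAULT 0,
    public_archive_submission_enabled INTEGER NOT NULL DEFAULT 0
);

INSERT INTO instance_settings (id) VALUES (1);
"""


def open_identity_db() -> sqlite3.Connection:
    """Create an in-memory database holding only the identity tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(IDENTITY_SCHEMA)
    return conn


def add_user(conn, user_uid, username, role):
    # 'active' and the placeholder hash are illustrative values only.
    conn.execute(
        "INSERT INTO users (user_uid, username, password_hash, status, role, created_at)"
        " VALUES (?, ?, 'placeholder-hash', 'active', ?, datetime('now'))",
        (user_uid, username, role),
    )


def public_flags(conn):
    """Return the three instance-level public flags as a tuple of 0/1 values."""
    return conn.execute(
        "SELECT public_index_enabled, public_entry_content_enabled,"
        " public_archive_submission_enabled FROM instance_settings WHERE id = 1"
    ).fetchone()
```

With this sketch, a freshly created database reports all three public flags as 0, and inserting a user with a role other than admin or user is rejected by the CHECK constraint.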
Core archive model
- archive_runs
  - One user-started archive operation.
  - Columns: stable public run_uid, created_by_user_id, started_at, finished_at, status, requested_count, discovered_count, completed_count, failed_count, error_summary.
- archive_run_items
  - One requested or discovered work item inside an archive run.
  - Columns: run_id, stable item_uid, parent_item_id (nullable), ordinal, requested_locator, canonical_locator (nullable), source_kind, entity_kind, status, error_text, produced_entry_id (nullable).
  - Supports batch requests and container expansion with progress like 0/14.
- source_identities
  - Canonical identity of the thing being archived across re-archives.
  - Columns: source_kind, entity_kind, external_id (nullable), canonical_url (nullable), normalized_locator, identity_key.
  - Unique constraint on identity_key.
- archived_entries
  - One archived thing shown in the archive.
  - Columns: stable public entry_uid, source_identity_id, archive_run_id, parent_entry_id (nullable), root_entry_id, created_by_user_id, owned_by_user_id, source_kind, entity_kind, title (nullable), visibility, archived_at, original_published_at (nullable), structured_root_relpath, representation_kind, source_metadata_json, display_metadata_json (nullable).
  - structured_root_relpath is required and points to one root under structured/<entry_uid>/.
  - Main archive view queries only rows with parent_entry_id IS NULL.
  - Child entries remain first-class rows but are nested under the parent in the main view.
- blobs
  - One deduplicated raw file in raw/.
  - Columns: sha256, byte_size, mime_type (nullable), extension (nullable), raw_relpath, created_at.
- entry_artifacts
  - Selective file pointers attached to an archived entry.
  - Columns: entry_id, artifact_role, storage_area, relpath, blob_id (nullable), logical_path (nullable), metadata_json (nullable).
  - storage_area: raw, raw_tweets, structured.
  - Store important files only: primary media, raw tweet JSON, avatar, subtitle, thumbnail, manifest, cover image.
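
To illustrate how these tables interact on a re-archive, here is a reduced sketch with only the columns needed for the example. The identity_key format (youtube:video:abc123), the raw/<sha256> path layout, the primary_media role name, and the helpers archive_once and table_counts are assumptions made for illustration; INSERT OR IGNORE is SQLite syntax and would need an equivalent on another engine.

```python
import hashlib
import sqlite3

# Reduced, hypothetical subset of the core tables; types are assumptions.
CORE_SCHEMA = """
CREATE TABLE source_identities (
    id                 INTEGER PRIMARY KEY,
    source_kind        TEXT NOT NULL,
    entity_kind        TEXT NOT NULL,
    identity_key       TEXT NOT NULL UNIQUE
);
CREATE TABLE archived_entries (
    id                      INTEGER PRIMARY KEY,
    entry_uid               TEXT NOT NULL UNIQUE,
    source_identity_id      INTEGER NOT NULL REFERENCES source_identities(id),
    structured_root_relpath TEXT NOT NULL
);
CREATE TABLE blobs (
    id          INTEGER PRIMARY KEY,
    sha256      TEXT NOT NULL UNIQUE,
    byte_size   INTEGER NOT NULL,
    raw_relpath TEXT NOT NULL
);
CREATE TABLE entry_artifacts (
    id            INTEGER PRIMARY KEY,
    entry_id      INTEGER NOT NULL REFERENCES archived_entries(id),
    artifact_role TEXT NOT NULL,
    storage_area  TEXT NOT NULL
        CHECK (storage_area IN ('raw', 'raw_tweets', 'structured')),
    relpath       TEXT NOT NULL,
    blob_id       INTEGER REFERENCES blobs(id)
);
"""


def archive_once(conn, entry_uid, identity_key, payload):
    """Record one archive of a source; always creates a new entry row."""
    with conn:
        # Reuse the identity row if this source was archived before.
        conn.execute(
            "INSERT OR IGNORE INTO source_identities"
            " (source_kind, entity_kind, identity_key) VALUES ('youtube', 'video', ?)",
            (identity_key,),
        )
        identity_id = conn.execute(
            "SELECT id FROM source_identities WHERE identity_key = ?", (identity_key,)
        ).fetchone()[0]

        # Reuse the blob row if identical bytes are already stored.
        digest = hashlib.sha256(payload).hexdigest()
        raw_relpath = f"raw/{digest}"  # hypothetical on-disk layout
        conn.execute(
            "INSERT OR IGNORE INTO blobs (sha256, byte_size, raw_relpath) VALUES (?, ?, ?)",
            (digest, len(payload), raw_relpath),
        )
        blob_id = conn.execute(
            "SELECT id FROM blobs WHERE sha256 = ?", (digest,)
        ).fetchone()[0]

        entry_id = conn.execute(
            "INSERT INTO archived_entries"
            " (entry_uid, source_identity_id, structured_root_relpath) VALUES (?, ?, ?)",
            (entry_uid, identity_id, f"structured/{entry_uid}/"),
        ).lastrowid
        conn.execute(
            "INSERT INTO entry_artifacts"
            " (entry_id, artifact_role, storage_area, relpath, blob_id)"
            " VALUES (?, 'primary_media', 'raw', ?, ?)",
            (entry_id, raw_relpath, blob_id),
        )
    return entry_id


def table_counts(conn):
    """Row counts for (archived_entries, source_identities, blobs)."""
    return tuple(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("archived_entries", "source_identities", "blobs")
    )


conn = sqlite3.connect(":memory:")
conn.executescript(CORE_SCHEMA)
payload = b"fake video bytes"
archive_once(conn, "e_001", "youtube:video:abc123", payload)
archive_once(conn, "e_002", "youtube:video:abc123", payload)
```

After the two calls, table_counts(conn) reports two archived entries, one source identity, and one blob, which is the shape described in the summary: a new entry per archive, shared identity and bytes underneath.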
Organization and extensibility
- taxonomy_nodes
  - Hierarchical organization tree.
  - Columns: stable node_uid, parent_id (nullable), name, slug, full_path.
  - full_path unique, example /sciences/computer-science/compilers.
- entry_taxonomy_assignments
  - Many-to-many link between archived entries and taxonomy nodes.
  - Assign the most specific node; ancestor membership is derived via recursive queries.
- Keep shared fields relational and source-specific details in source_metadata_json.
  - YouTube examples: video_id, channel_id, duration, playlist membership.
  - Tweet examples: tweet_id, author_handle, conversation ID, text summary fields.
  - Do not create per-source tables in v1.
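
The "ancestor membership via recursive queries" idea can be sketched with a recursive common table expression. The table shapes below follow the column lists above, but the types, the composite primary key on entry_taxonomy_assignments, the sample rows, and the function name entries_under are assumptions; entry_id is left without a foreign key only because archived_entries is not created in this snippet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomy_nodes (
    id        INTEGER PRIMARY KEY,
    node_uid  TEXT NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES taxonomy_nodes(id),
    name      TEXT NOT NULL,
    slug      TEXT NOT NULL,
    full_path TEXT NOT NULL UNIQUE
);
CREATE TABLE entry_taxonomy_assignments (
    entry_id INTEGER NOT NULL,
    node_id  INTEGER NOT NULL REFERENCES taxonomy_nodes(id),
    PRIMARY KEY (entry_id, node_id)
);

-- Sample tree: /sciences/computer-science/compilers plus an unrelated /arts.
INSERT INTO taxonomy_nodes (id, node_uid, parent_id, name, slug, full_path) VALUES
    (1, 'n1', NULL, 'Sciences', 'sciences', '/sciences'),
    (2, 'n2', 1, 'Computer Science', 'computer-science', '/sciences/computer-science'),
    (3, 'n3', 2, 'Compilers', 'compilers', '/sciences/computer-science/compilers'),
    (4, 'n4', NULL, 'Arts', 'arts', '/arts');

-- Entry 100 is assigned only to the most specific node.
INSERT INTO entry_taxonomy_assignments (entry_id, node_id) VALUES (100, 3);
""")


def entries_under(conn, full_path):
    """Return entry ids assigned to the node at full_path or any descendant."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM taxonomy_nodes WHERE full_path = ?
            UNION ALL
            SELECT n.id
            FROM taxonomy_nodes AS n
            JOIN subtree AS s ON n.parent_id = s.id
        )
        SELECT DISTINCT a.entry_id
        FROM entry_taxonomy_assignments AS a
        JOIN subtree AS s ON a.node_id = s.id
        ORDER BY a.entry_id
        """,
        (full_path,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the walk starts at the queried node and descends through parent_id, an entry assigned to the compilers node is found when querying /sciences or /sciences/computer-science without storing any ancestor rows. WITH RECURSIVE is also available in Postgres, so this approach does not tie the design to SQLite.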
Public/archive access behavior implied by schema
- Public archive browsing is controlled by both instance settings and entry visibility.
- public entries are eligible for anonymous listing/viewing only when instance-level public settings allow it.
- unlisted entries are not shown in public indexes but can be directly served later by URL/token design.
- private entries are visible only to authorized users.
- Ownership is recorded now even if the first UI only exposes simple admin/user behavior.
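
One way to read these rules for anonymous visitors is sketched below. The plan does not spell out which instance flag gates which action, so the mapping here (public_index_enabled gates listing, public_entry_content_enabled gates viewing content) is an assumption, as are the function names. Direct serving of unlisted entries is deferred in the plan, so the sketch simply denies it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSettings:
    # All flags default to False, matching the plan's stated defaults.
    public_index_enabled: bool = False
    public_entry_content_enabled: bool = False
    public_archive_submission_enabled: bool = False


def anonymous_can_list(settings: InstanceSettings, visibility: str) -> bool:
    """Whether an entry may appear in a public index for anonymous visitors."""
    # Assumption: only the index flag gates listing.
    return visibility == "public" and settings.public_index_enabled


def anonymous_can_view_content(settings: InstanceSettings, visibility: str) -> bool:
    """Whether an anonymous visitor may open the entry's content."""
    # Assumption: only the content flag gates viewing. Unlisted access via
    # URL/token is not modeled here because the plan defers that design.
    return visibility == "public" and settings.public_entry_content_enabled
```

Under this reading, a public entry stays hidden on a default instance, and private or unlisted entries never pass either check regardless of the instance flags.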
Public APIs / Interfaces
- archivr init
  - Create the SQLite database and schema alongside the existing archive metadata directory.
  - Keep existing store directories.
- archivr archive
  - Start one archive_run owned by a user.
  - Insert one or more archive_run_items.
  - On success, create one or more archived_entries.
  - Link reused raw files through blobs and entry_artifacts.
  - Record the entry’s structured_root_relpath, visibility, and source metadata JSON.
- New persisted domain types
  - User
  - ArchiveRun
  - ArchiveRunItem
  - ArchivedEntry
  - SourceIdentity
  - Blob
  - EntryArtifact
  - TaxonomyNode
  - InstanceSettings
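
The run bookkeeping behind archivr archive might look like the following sketch, which covers only the run and its items, not entry creation. The status strings (running, pending, completed, failed), the item_uid format, the reduced column set, and the helpers start_run, finish_item, and progress are assumptions. Progress is computed here as completed_count over requested_count; how discovered_count should feed into the denominator is not stated in the plan and is left out.

```python
import sqlite3

# Reduced, hypothetical subset of the run tables.
RUN_SCHEMA = """
CREATE TABLE archive_runs (
    id                 INTEGER PRIMARY KEY,
    run_uid            TEXT NOT NULL UNIQUE,
    created_by_user_id INTEGER NOT NULL,
    status             TEXT NOT NULL,
    requested_count    INTEGER NOT NULL DEFAULT 0,
    completed_count    INTEGER NOT NULL DEFAULT 0,
    failed_count       INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE archive_run_items (
    id                INTEGER PRIMARY KEY,
    run_id            INTEGER NOT NULL REFERENCES archive_runs(id),
    item_uid          TEXT NOT NULL UNIQUE,
    ordinal           INTEGER NOT NULL,
    requested_locator TEXT NOT NULL,
    status            TEXT NOT NULL
);
"""


def start_run(conn, run_uid, user_id, locators):
    """Create one run and one pending item per requested locator."""
    with conn:  # commit both inserts together, or roll back on error
        run_id = conn.execute(
            "INSERT INTO archive_runs (run_uid, created_by_user_id, status, requested_count)"
            " VALUES (?, ?, 'running', ?)",
            (run_uid, user_id, len(locators)),
        ).lastrowid
        conn.executemany(
            "INSERT INTO archive_run_items"
            " (run_id, item_uid, ordinal, requested_locator, status)"
            " VALUES (?, ?, ?, ?, 'pending')",
            [(run_id, f"{run_uid}-{i}", i, loc) for i, loc in enumerate(locators)],
        )
    return run_id


def finish_item(conn, run_id, ordinal, ok):
    """Mark one item completed or failed and bump the matching run counter."""
    status, counter = ("completed", "completed_count") if ok else ("failed", "failed_count")
    with conn:
        conn.execute(
            "UPDATE archive_run_items SET status = ? WHERE run_id = ? AND ordinal = ?",
            (status, run_id, ordinal),
        )
        # counter is one of two fixed column names chosen above, never user input.
        conn.execute(
            f"UPDATE archive_runs SET {counter} = {counter} + 1 WHERE id = ?",
            (run_id,),
        )


def progress(conn, run_id):
    """Format progress as 'completed/requested', e.g. '0/14'."""
    done, total = conn.execute(
        "SELECT completed_count, requested_count FROM archive_runs WHERE id = ?",
        (run_id,),
    ).fetchone()
    return f"{done}/{total}"
```

Keeping the counters on the run row makes the progress display a single-row read; the same numbers could instead be derived by counting item statuses, at the cost of a scan per refresh.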
Test Plan
- Re-archiving the same YouTube video creates two archived_entries, one shared source_identity, and one shared primary blob.
- Archiving a tweet/thread creates one archived entry, records the raw tweet JSON as an entry_artifact in raw_tweets, and links downloaded media/avatar blobs correctly.
- Archiving a playlist/channel creates one top-level parent entry plus child entries; the main archive query returns only the parent.
- A single archive run with multiple requested locators records multiple run items and correct progress counters.
- A normal user can create entries but cannot manage other users or instance-wide public settings.
- An admin can manage users and instance-wide public settings.
- A public entry is still hidden from anonymous users when public_index_enabled or public_entry_content_enabled is disabled at the instance level.
- A private entry never appears in anonymous/public queries.
- Assigning /sciences/computer-science/compilers makes the item discoverable through ancestor queries for sciences and computer-science.
- A website-style entry can be represented as one archived entry with one structured root and no per-asset DB explosion.
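
As an example of how one of these cases could be expressed, the playlist scenario is sketched below against a reduced archived_entries table. The fixture rows, entry_uid values, and function names are hypothetical, and the real test would go through the archive pipeline rather than inserting rows directly.

```python
import sqlite3


def build_playlist_fixture() -> sqlite3.Connection:
    """In-memory database with one playlist parent and two child videos."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE archived_entries (
        id              INTEGER PRIMARY KEY,
        entry_uid       TEXT NOT NULL UNIQUE,
        parent_entry_id INTEGER REFERENCES archived_entries(id),
        entity_kind     TEXT NOT NULL,
        title           TEXT
    );
    INSERT INTO archived_entries (id, entry_uid, parent_entry_id, entity_kind, title) VALUES
        (1, 'e_playlist', NULL, 'playlist', 'Example playlist'),
        (2, 'e_video_1', 1, 'video', 'First video'),
        (3, 'e_video_2', 1, 'video', 'Second video');
    """)
    return conn


def main_view_uids(conn):
    """Top-level entries only, as the main archive view would query them."""
    rows = conn.execute(
        "SELECT entry_uid FROM archived_entries WHERE parent_entry_id IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


def child_uids(conn, parent_uid):
    """Child entries nested under the given parent entry."""
    rows = conn.execute(
        """
        SELECT c.entry_uid
        FROM archived_entries AS c
        JOIN archived_entries AS p ON c.parent_entry_id = p.id
        WHERE p.entry_uid = ?
        ORDER BY c.id
        """,
        (parent_uid,),
    )
    return [row[0] for row in rows]


def test_playlist_shows_only_parent():
    conn = build_playlist_fixture()
    # The main view lists the parent; children stay first-class rows below it.
    assert main_view_uids(conn) == ["e_playlist"]
    assert child_uids(conn, "e_playlist") == ["e_video_1", "e_video_2"]
```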
Assumptions
- SQLite is the only target for the first implementation, but the schema should avoid SQLite-only modeling that would block a later Postgres migration.
- The database indexes archive metadata; archive bytes stay on disk.
- Every archived entry gets a stable public ID used for structured/<entry_uid>/; timestamps are metadata, not identity.
- raw_tweets/ remains a valid sibling storage area and is referenced through entry_artifacts.
- Titles are optional and nullable.
- Search, FTS, subtitles, transcript indexing, groups, and per-entry ACL sharing are deferred.
- Organization uses hierarchical taxonomy only for now; free-form tags are out of scope.
- The first permissions model matches the simpler ArchiveBox-style shape: admins, normal users, and optional public visibility, without custom group policy in v1.