thegeneralist01/facharbeit


TheGeneralist 0893ab3d7c

batman

2026-01-14 23:31:45 +01:00

7.1 KiB


Resource Classifier Development Prompt

Context

I'm building a resource classifier that:

Takes URLs from a file (test-classification-list)
Scrapes content (currently Twitter/X posts)
Classifies them using an LLM (Codex) against a hierarchical tag tree
Will eventually store results in SQLite

Current Status

✅ Twitter scraping works (scrapes to TOML files in scraped-tweets/)
✅ LLM classification works (returns JSON with tags, confidence, new_tags, reasoning)
✅ JSON parsing works (using Serde)
❌ Need SQLite storage implementation
❌ Need proper error handling for missing/malformed LLM responses
❌ Need to handle the scraped TOML format better

What I Need You To Do

Task 1: Implement SQLite Storage

Create a new module src/db.rs that:

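Schema: Implements this database structure (SQLite enforces the FOREIGN KEY clauses only after PRAGMA foreign_keys = ON has been run on the connection):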

-- Resources table
CREATE TABLE IF NOT EXISTS resources (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,  -- 'twitter', 'bookmark', 'video', 'paper'
    url TEXT NOT NULL UNIQUE,
    title TEXT,
    content TEXT,
    saved_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT  -- JSON for type-specific fields
);

-- Tags table (hierarchical)
CREATE TABLE IF NOT EXISTS tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    full_path TEXT NOT NULL UNIQUE,  -- e.g. 'cs/theory/compilers'
    parent_path TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Resource-Tag relationships
CREATE TABLE IF NOT EXISTS resource_tags (
    resource_id TEXT NOT NULL,
    tag_path TEXT NOT NULL,
    confidence REAL NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (resource_id, tag_path),
    FOREIGN KEY (resource_id) REFERENCES resources(id)
);

-- Classification log
CREATE TABLE IF NOT EXISTS classification_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    resource_id TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    reasoning TEXT,
    new_tag_suggestions TEXT,  -- JSON array
    FOREIGN KEY (resource_id) REFERENCES resources(id)
);
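The tags table stores each level of the hierarchy as its own row with a parent_path, so ensure_tag_exists will probably need to expand a path like 'cs/theory/compilers' into all of its ancestors. A minimal std-only sketch of that expansion follows; the helper name `tag_ancestry` and its return shape are assumptions for illustration, not part of the existing code.

```rust
/// Splits a tag path such as "cs/theory/compilers" into every level of the
/// hierarchy, pairing each full path with its parent path (None for a root).
/// Hypothetical helper: the name and return type are illustrative only.
pub fn tag_ancestry(full_path: &str) -> Vec<(String, Option<String>)> {
    let mut rows = Vec::new();
    let mut current = String::new();
    // Empty segments (leading, trailing, or doubled slashes) are skipped.
    for segment in full_path.split('/').filter(|s| !s.is_empty()) {
        let parent = if current.is_empty() {
            None
        } else {
            Some(current.clone())
        };
        if !current.is_empty() {
            current.push('/');
        }
        current.push_str(segment);
        rows.push((current.clone(), parent));
    }
    rows
}
```

Each returned pair maps onto the (full_path, parent_path) columns, so the rows could be inserted root-first with INSERT OR IGNORE.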

API Functions:

pub struct Database {
    conn: rusqlite::Connection,
}

impl Database {
    pub fn new(path: &str) -> Result<Self>;
    pub fn init_schema(&self) -> Result<()>;

    // Resource operations
    pub fn insert_resource(&self, url: &str, resource_type: &str, content: &str) -> Result<String>;
    pub fn resource_exists(&self, url: &str) -> Result<bool>;

    // Tag operations
    pub fn ensure_tag_exists(&self, tag_path: &str) -> Result<()>;
    pub fn get_all_tags(&self) -> Result<Vec<String>>;

    // Classification storage
    pub fn store_classification(
        &self,
        resource_id: &str,
        result: &ClassificationResult
    ) -> Result<()>;

    // Query functions
    pub fn get_resources_by_tag(&self, tag_path: &str) -> Result<Vec<Resource>>;
    pub fn get_unclassified_resources(&self) -> Result<Vec<Resource>>;
}

Add rusqlite to Cargo.toml:

rusqlite = { version = "0.32", features = ["bundled"] }

Task 2: Improve Main Loop

Modify src/main.rs to:

Initialize database at startup:

let db = Database::new("resources.db")?;
db.init_schema()?;

For each URL:
- Check whether it is already stored: db.resource_exists(url)?
- If not, scrape it, store it with db.insert_resource(url, resource_type, &content)? (which returns the resource_id), then classify it
- Store the result: db.store_classification(&resource_id, &result)?
- Handle new tag suggestions (print them for now; interactive review comes later)
Add a --force flag to re-classify existing resources
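The skip-or-classify decision in this loop can be sketched without a database. In the sketch below a HashSet stands in for db.resource_exists, and the names `Action`, `plan_url`, and `run` are hypothetical; the real loop would scrape, call the LLM, and write to SQLite where the comment indicates.

```rust
use std::collections::HashSet;

/// What the loop should do with one URL.
#[derive(Debug, PartialEq)]
pub enum Action {
    Skip,
    Classify,
}

/// Decides whether a URL needs work. `known_urls` is a stand-in for
/// `db.resource_exists(url)`; a real implementation would query SQLite.
pub fn plan_url(known_urls: &HashSet<String>, url: &str, force: bool) -> Action {
    if force || !known_urls.contains(url) {
        Action::Classify
    } else {
        Action::Skip
    }
}

/// Walks the URL list and returns the URLs that were (re-)classified.
pub fn run(urls: &[&str], known_urls: &mut HashSet<String>, force: bool) -> Vec<String> {
    let mut classified = Vec::new();
    for &url in urls {
        if plan_url(known_urls, url, force) == Action::Classify {
            // Scraping, the LLM call, and the database writes would go here.
            known_urls.insert(url.to_string());
            classified.push(url.to_string());
        }
    }
    classified
}
```

Calling `run` twice on the same list with `force` set to false classifies nothing the second time, which matches the "running twice doesn't re-classify" item in the testing checklist.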

Task 3: Better TOML Parsing

The scraped tweets are in TOML format. Add:

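// In src/scrapers/twitter.rs
use std::fs;
use std::path::Path;

use anyhow::{Context, Result}; // assumes the crate uses anyhow, as the Notes suggest
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct ScrapedTweet {
    pub id: String,
    pub text: String,
    pub author: String,
    // Add other fields as needed
}

/// Reads one scraped TOML file and deserializes it into a ScrapedTweet.
pub fn parse_scraped_tweet(path: &Path) -> Result<ScrapedTweet> {
    let contents = fs::read_to_string(path)
        .with_context(|| format!("failed to read {}", path.display()))?;
    let tweet: ScrapedTweet = toml::from_str(&contents)
        .with_context(|| format!("failed to parse {} as TOML", path.display()))?;
    Ok(tweet)
}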

Add toml = "0.8" to Cargo.toml.

Format the tweet nicely for classification:

format!("Title: Tweet by @{}\nContent: {}", tweet.author, tweet.text)
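Wrapping that format! call in a small function makes it easy to test and to normalize input first. The sketch below is one possible shape; the function name `format_for_classification` is hypothetical, and stripping a leading '@' and trimming surrounding whitespace are assumptions about what the scraped fields may contain.

```rust
/// Builds the text handed to the classifier from a tweet's author and body.
/// Hypothetical helper; the normalization steps are assumptions.
pub fn format_for_classification(author: &str, text: &str) -> String {
    // Accept handles with or without a leading '@' so the output never has "@@".
    let handle = author.trim().trim_start_matches('@');
    // Only surrounding whitespace is removed; line breaks inside the tweet stay.
    let body = text.trim();
    format!("Title: Tweet by @{}\nContent: {}", handle, body)
}
```

With a `ScrapedTweet` in hand, the call would be `format_for_classification(&tweet.author, &tweet.text)`.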

Task 4: Error Recovery

The LLM sometimes returns malformed JSON. Add retry logic:

// In src/classifiers.rs
pub fn classify_with_retry(
    tag_tree: &str,
    content: String,
    max_attempts: u32
) -> Result<ClassificationResult> {
    for attempt in 1..=max_attempts {
        match classify(tag_tree, content.clone()) {
            Ok(json) => {
                match ClassificationResult::from_json(&json) {
                    Ok(result) => return Ok(result),
                    Err(e) => {
                        eprintln!("Attempt {}/{}: Failed to parse: {}", attempt, max_attempts, e);
                        eprintln!("Raw response: {}", json);
                        if attempt == max_attempts {
                            return Err(e.into());
                        }
                    }
                }
            }
            Err(e) => {
                eprintln!("Attempt {}/{}: LLM call failed: {}", attempt, max_attempts, e);
                if attempt == max_attempts {
                    return Err(e);
                }
            }
        }
    }
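    // Reached only when max_attempts is 0, because the loop body never ran.
    // Assumes Result is anyhow::Result, as the Notes suggest.
    anyhow::bail!("max_attempts must be at least 1")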
}
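The same retry pattern can be factored into a generic helper that is testable without calling the LLM. This is a std-only sketch under stated assumptions: the name `retry` is hypothetical, errors are flattened to String for simplicity, and a real version would more likely return anyhow::Error.

```rust
use std::fmt::Display;

/// Calls `op` up to `max_attempts` times and returns the first success.
/// If every attempt fails, the last error message is returned; with zero
/// attempts the caller gets an explanatory error instead of a panic.
pub fn retry<T, E, F>(max_attempts: u32, mut op: F) -> Result<T, String>
where
    E: Display,
    F: FnMut(u32) -> Result<T, E>,
{
    let mut last_error = String::from("max_attempts must be at least 1");
    for attempt in 1..=max_attempts {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(e) => {
                eprintln!("Attempt {}/{} failed: {}", attempt, max_attempts, e);
                last_error = e.to_string();
            }
        }
    }
    Err(last_error)
}
```

The closure receives the attempt number, so a caller could combine the LLM call and the JSON parse in one closure and let both failure kinds go through the same retry path.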

Task 5: CLI Structure

Add clap for better CLI:

clap = { version = "4.5", features = ["derive"] }

use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "classifier")]
#[command(about = "Resource classifier with hierarchical tags")]
struct Cli {
    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    /// Classify resources from a file
    Classify {
        /// Path to file with URLs
        #[arg(short, long, default_value = "test-classification-list")]
        input: String,

        /// Force re-classification of existing resources
        #[arg(short, long)]
        force: bool,
    },

    /// Export resources to JSON
    Export {
        /// Output file
        #[arg(short, long)]
        output: String,
    },

    /// Show statistics
    Stats,
}

Expected Behavior After Implementation

# Classify resources
cargo run -- classify

# Force re-classify
cargo run -- classify --force

# Export to JSON (like Ludwig's site)
cargo run -- export -o bookmarks.json

# Show stats
cargo run -- stats

Testing Checklist

Database initializes without errors
Can classify a Twitter URL end-to-end
Classification is stored in DB
Running twice doesn't re-classify (unless --force)
Can export to JSON
Handles LLM returning malformed JSON (retries)
Handles missing fields in LLM response (thanks to #[serde(default)])

Notes

Use anyhow::Context for good error messages
Log important steps to stdout for debugging
The tag-tree file contains the hierarchical tag structure (one tag per line in path format)
Keep existing code structure, just add the missing pieces
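Since the tag-tree file holds one path per line, loading it and checking the LLM's new_tags against it needs very little code. The following std-only sketch assumes blank lines should be ignored and paths trimmed; the names `parse_tag_tree` and `unknown_tags` are hypothetical.

```rust
use std::collections::BTreeSet;

/// Parses the tag-tree file contents (one tag path per line) into a sorted set.
/// Blank lines are skipped and surrounding whitespace is trimmed (an assumption).
pub fn parse_tag_tree(contents: &str) -> BTreeSet<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Returns the suggested tags that are not yet present in the tree,
/// preserving the order in which they were suggested.
pub fn unknown_tags<'a>(tree: &BTreeSet<String>, suggested: &[&'a str]) -> Vec<&'a str> {
    suggested
        .iter()
        .copied()
        .filter(|tag| !tree.contains(*tag))
        .collect()
}
```

The result of `unknown_tags` is the list that would be printed for now and later fed into interactive review.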

Questions to Consider

What to do with low-confidence classifications?
How to review and approve new tag suggestions?
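One possible answer to the first question, offered only as a starting point: split each result by a confidence threshold, store the tags at or above it, and queue the rest for manual review. The struct `TagScore`, the function `split_by_confidence`, and any particular threshold value are assumptions, not existing code.

```rust
/// One tag proposed by the classifier together with its confidence.
/// Hypothetical type; the real ClassificationResult may be shaped differently.
#[derive(Debug, Clone, PartialEq)]
pub struct TagScore {
    pub tag: String,
    pub confidence: f64,
}

/// Splits scores into (accepted, needs_review): accepted holds every score
/// with confidence >= threshold, needs_review holds the remainder.
pub fn split_by_confidence(
    scores: Vec<TagScore>,
    threshold: f64,
) -> (Vec<TagScore>, Vec<TagScore>) {
    scores
        .into_iter()
        .partition(|score| score.confidence >= threshold)
}
```

The accepted half could go straight to store_classification, while the review half could be printed alongside the new tag suggestions.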

Start with Task 1 (SQLite), then integrate it into main.rs, then add the other improvements.
