Software Catalog

The software catalog is the foundation of the AI Imaging Agent, containing curated information about imaging analysis tools.

Overview

Format: JSON Lines (JSONL)
Location: dataset/catalog.jsonl
Schema: Based on schema.org SoftwareSourceCode
Size: ~150 tools currently

Catalog Schema

Core Fields

Based on schema.org/SoftwareSourceCode:

{
  "@type": "SoftwareSourceCode",
  "name": "TotalSegmentator",
  "description": "Tool for automated segmentation of 104 anatomical structures",
  "url": "https://github.com/wasserth/TotalSegmentator",
  "codeRepository": "https://github.com/wasserth/TotalSegmentator",
  "programmingLanguage": "Python",
  "runtimePlatform": "PyTorch",
  "license": "Apache-2.0",
  "keywords": ["segmentation", "CT", "MRI", "medical-imaging"],
  "applicationCategory": "Medical Imaging",
  "operatingSystem": ["Linux", "Windows", "macOS"],
  "softwareVersion": "2.0.0",
  "datePublished": "2022-09-01",
  "dateModified": "2024-01-15",
  "author": {
    "@type": "Person",
    "name": "Jakob Wasserthal"
  }
}

Extended Fields

Custom fields in supportingData:

{
  "supportingData": {
    "modalities": ["CT", "MRI"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI", "PNG"],
    "tasks": ["segmentation", "organ-segmentation"],
    "demo_url": "https://huggingface.co/spaces/username/totalsegmentator",
    "paper_url": "https://doi.org/10.1000/example",
    "citations": 150,
    "github_stars": 1200
  }
}

Field Descriptions

name

Canonical tool name (matches repository or published name)

Example: "TotalSegmentator", "nnU-Net", "MedSAM"

description

Brief description of tool's purpose and capabilities

Guidelines:

  • 1-2 sentences
  • Mention key features
  • Include domain/modality if specific

url

Primary landing page (usually GitHub repo)

codeRepository

Source code repository URL (GitHub, GitLab, etc.)

programmingLanguage

Primary language(s)

Common values: "Python", "C++", "JavaScript", "Jupyter Notebook"

license

Software license identifier (SPDX format)

Common values:

  • "Apache-2.0": Permissive, commercial OK
  • "MIT": Very permissive
  • "GPL-3.0": Copyleft
  • "BSD-3-Clause": Permissive
  • "Proprietary": Restricted

keywords

Array of relevant tags/keywords

Categories:

  • Tasks: segmentation, classification, registration, detection
  • Modalities: CT, MRI, X-ray, ultrasound, microscopy
  • Techniques: deep-learning, traditional-cv, machine-learning
  • Domains: medical-imaging, scientific-imaging, neuroscience

supportingData.modalities

Medical imaging modalities supported

Standard values:

  • "CT": Computed Tomography
  • "MRI": Magnetic Resonance Imaging
  • "XR": X-ray radiography
  • "US": Ultrasound
  • "PET": Positron Emission Tomography
  • "SPECT": Single-Photon Emission CT
  • "OCT": Optical Coherence Tomography
  • "Microscopy": Various microscopy types

supportingData.dimensions

Spatial dimensions supported

Values: ["2D"], ["3D"], ["2D", "3D"], ["4D"]

  • 2D: Single-slice images
  • 3D: Volumetric data
  • 4D: Time-series volumes (3D + time)

supportingData.formats

File formats supported for input/output

Common values:

  • Medical: "DICOM", "NIfTI", "NRRD", "Analyze"
  • Standard: "PNG", "JPEG", "TIFF", "BMP"
  • Scientific: "HDF5", "Zarr", "OME-TIFF"
  • Other: "NumPy", "MAT"

supportingData.tasks

Analysis tasks the tool performs

Common values:

  • "segmentation": Image segmentation
  • "classification": Image classification
  • "detection": Object detection
  • "registration": Image registration/alignment
  • "reconstruction": 3D reconstruction
  • "enhancement": Image enhancement
  • "analysis": General analysis

supportingData.demo_url

Link to runnable demo (HuggingFace Space, Colab, web app)

Preferred: HuggingFace Gradio Spaces (best integration)

Example: "https://huggingface.co/spaces/username/toolname"
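
The supportingData fields lend themselves to simple structured filtering. The sketch below is illustrative only: filter_tools is a hypothetical helper, not part of the agent, and the second catalog entry is made up for the example.

```python
def filter_tools(tools, modality=None, task=None):
    """Return tools whose supportingData matches the requested modality and task."""
    matches = []
    for tool in tools:
        sd = tool.get("supportingData") or {}
        if modality is not None and modality not in sd.get("modalities", []):
            continue
        if task is not None and task not in sd.get("tasks", []):
            continue
        matches.append(tool)
    return matches


catalog = [
    {"name": "TotalSegmentator",
     "supportingData": {"modalities": ["CT", "MRI"], "tasks": ["segmentation"]}},
    {"name": "ExampleClassifier",  # hypothetical entry for illustration
     "supportingData": {"modalities": ["XR"], "tasks": ["classification"]}},
]

ct_segmenters = filter_tools(catalog, modality="CT", task="segmentation")
print([t["name"] for t in ct_segmenters])
```

Entries without supportingData are treated as matching nothing once a filter is given, which keeps incomplete entries out of filtered results.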

Catalog Structure

File Format

JSON Lines (JSONL): Each line is a complete JSON object

{"@type": "SoftwareSourceCode", "name": "Tool1", ...}
{"@type": "SoftwareSourceCode", "name": "Tool2", ...}
{"@type": "SoftwareSourceCode", "name": "Tool3", ...}

Benefits:

  • Easy to append new tools
  • Stream processing for large catalogs
  • Each line independently parseable
  • Git-friendly (line-based diffs)
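
To illustrate the stream-processing benefit, here is a minimal sketch that reads the catalog one entry at a time instead of building a full list. The names iter_catalog and count_by_license are hypothetical and used only for this example.

```python
import json
from typing import Iterator


def iter_catalog(path: str) -> Iterator[dict]:
    """Yield one tool at a time so the whole file never sits in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


def count_by_license(path: str) -> dict:
    """Example streaming aggregation: tally entries per license value."""
    counts = {}
    for tool in iter_catalog(path):
        key = tool.get("license", "unknown")
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Because each line parses independently, a consumer like count_by_license can stop early or skip entries without loading the rest of the file.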

Catalog Loading

import json

def load_catalog(path: str) -> list[dict]:
    """Read a JSONL catalog into a list of tool dicts."""
    tools = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                tools.append(json.loads(line))
    return tools

Validation

Tools are validated on load:

from pydantic import BaseModel, HttpUrl

class SoftwareSourceCode(BaseModel):
    name: str
    description: str
    url: HttpUrl
    license: str
    keywords: list[str]
    supportingData: dict

    class Config:
        extra = "allow"  # Allow additional schema.org fields

Catalog Management

Adding New Tools

  1. Create entry following schema:
{
  "@type": "SoftwareSourceCode",
  "name": "NewTool",
  "description": "Brief description of the tool",
  "url": "https://github.com/user/newtool",
  "codeRepository": "https://github.com/user/newtool",
  "programmingLanguage": "Python",
  "license": "MIT",
  "keywords": ["segmentation", "CT"],
  "supportingData": {
    "modalities": ["CT"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI"],
    "tasks": ["segmentation"],
    "demo_url": "https://huggingface.co/spaces/user/newtool"
  }
}
  2. Append to catalog.jsonl (as single line, no pretty printing)

  3. Update checksum:

shasum dataset/catalog.jsonl > dataset/catalog.jsonl.sha1
  4. Sync catalog:
ai_agent sync

This rebuilds the embeddings and FAISS index.
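
The append and checksum steps can also be scripted. The following is a sketch under assumptions: append_tool and write_sha1 are hypothetical helpers, and the checksum file layout simply mirrors what shasum prints (hex digest, two spaces, path).

```python
import hashlib
import json


def append_tool(catalog_path: str, tool: dict) -> None:
    """Append one entry as a single compact JSON line."""
    # Without indent, json.dumps escapes newlines inside strings,
    # so the entry always stays on one physical line.
    line = json.dumps(tool, ensure_ascii=False)
    with open(catalog_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def write_sha1(catalog_path: str) -> str:
    """Write '<hex digest>  <path>' next to the catalog and return the digest."""
    with open(catalog_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    with open(catalog_path + ".sha1", "w", encoding="utf-8") as f:
        f.write(f"{digest}  {catalog_path}\n")
    return digest
```

This assumes the existing file already ends with a newline; if it does not, the new entry would be glued onto the last line.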

Updating Existing Tools

  1. Find tool in catalog.jsonl
  2. Edit JSON (update fields)
  3. Validate JSON syntax
  4. Update checksum and sync

Removing Tools

  1. Delete line from catalog.jsonl
  2. Update checksum and sync
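
Updating and removing entries by hand is error-prone in a one-object-per-line file, so a small script may help. This is only a sketch: rewrite_catalog is a hypothetical helper, and matching entries by their name field is an assumption.

```python
import json
from typing import Optional


def rewrite_catalog(path: str, name: str, updates: Optional[dict] = None) -> int:
    """Update (or, when updates is None, remove) every entry called `name`.

    Returns the number of entries touched.
    """
    # Parse everything first: a JSON syntax error raises here,
    # before the file on disk has been modified.
    with open(path, encoding="utf-8") as f:
        tools = [json.loads(line) for line in f if line.strip()]

    touched = 0
    kept = []
    for tool in tools:
        if tool.get("name") == name:
            touched += 1
            if updates is None:
                continue  # removal: drop the entry
            tool.update(updates)  # shallow merge of top-level fields
        kept.append(tool)

    with open(path, "w", encoding="utf-8") as f:
        for tool in kept:
            f.write(json.dumps(tool, ensure_ascii=False) + "\n")
    return touched
```

For example, rewrite_catalog("dataset/catalog.jsonl", "NewTool", {"license": "Apache-2.0"}) changes one field, and calling it without updates deletes the entry. The checksum and sync steps still have to follow.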

Synchronization

The catalog is populated by querying a GraphDB SPARQL endpoint and converting the results to JSONL. This is handled by catalog/sync.py via the sync_once() function (called at startup and by ai_agent sync).

Sync Flow

graph LR
    A[GraphDB SPARQL] --> B[fetch_jsonld]
    B --> C[catalog.jsonld]
    C --> D[full_processing]
    D --> E[catalog.jsonl]
    E --> F[VectorIndex.sync_with_catalog]
    F --> G[FAISS index]
  1. Query — load SPARQL query from GRAPHDB_QUERY_FILE (default: get_relevant_software.rq)
  2. Fetch — send query to GRAPHDB_URL, receive JSON-LD (falls back to TURTLE → rdflib → JSON-LD)
  3. Save snapshot — write raw result to OUTPUT_JSONLD (default: dataset/catalog.jsonld)
  4. Convert — run full_processing() to transform JSON-LD into flat JSONL (OUTPUT_JSONL, default: dataset/catalog.jsonl)
  5. Diff — compute SHA-1 hash of normalized docs; compare with previous hash to detect changes
  6. Rebuild index — if changed (or FAISS is missing), rebuild and save to RAG_INDEX_DIR
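
The diff step (step 5) can be sketched as follows. The actual normalization used by catalog/sync.py is not described here, so the one below (sorted keys, docs ordered by name) is an assumption, and both function names are hypothetical.

```python
import hashlib
import json


def catalog_fingerprint(docs: list) -> str:
    """SHA-1 over a canonical serialization of the docs."""
    # Assumed normalization: sort docs by name and sort keys within each doc,
    # so reordering fields or lines does not register as a change.
    ordered = sorted(docs, key=lambda d: str(d.get("name", "")))
    payload = json.dumps(ordered, sort_keys=True,
                         separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def catalog_changed(docs: list, previous_hash) -> bool:
    """True when there is no previous hash or the fingerprint differs."""
    return previous_hash is None or catalog_fingerprint(docs) != previous_hash
```

Hashing a normalized form rather than the raw file means cosmetic differences in the fetched result do not trigger an index rebuild.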

Required Environment Variables for Sync

| Variable | Description |
| --- | --- |
| GRAPHDB_URL | SPARQL endpoint URL (required for ai_agent sync) |
| GRAPHDB_GRAPH | Named graph IRI to query (absolute IRI, required) |
| GRAPHDB_QUERY_FILE | Path to .rq SPARQL query file (default: get_relevant_software.rq) |
| GRAPHDB_USER | GraphDB username (optional, for authenticated endpoints) |
| GRAPHDB_PASSWORD | GraphDB password (optional) |

See Environment Variables for all options.

Freshness Skip

You can skip remote sync if the local catalog is recent enough:

SYNC_SKIP_IF_FRESH_SECONDS=3600   # Skip if catalog is < 1 hour old
SYNC_FORCE=1                       # Always sync, ignoring freshness
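
One way the freshness decision could work is shown below. This is a sketch, not the agent's code: should_skip_sync is a hypothetical name, and it assumes freshness is measured from the catalog file's modification time and that an unset or zero SYNC_SKIP_IF_FRESH_SECONDS disables the check.

```python
import os
import time


def should_skip_sync(catalog_path, env=None, now=None) -> bool:
    """Decide whether the remote sync can be skipped."""
    env = os.environ if env is None else env
    if env.get("SYNC_FORCE") == "1":
        return False  # a forced sync always wins
    max_age = float(env.get("SYNC_SKIP_IF_FRESH_SECONDS", "0") or 0)
    if max_age <= 0 or not os.path.exists(catalog_path):
        return False  # check disabled, or no local catalog to rely on
    now = time.time() if now is None else now
    age = now - os.path.getmtime(catalog_path)  # assumed freshness measure
    return age < max_age
```

Passing env and now explicitly keeps the function easy to test without touching the real environment or clock.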

Auto-Sync (Background)

Configure periodic background sync via .env:

SYNC_EVERY_HOURS=24

When the catalog changes (detected via SHA-1 diff), the background thread:

  1. Calls sync_once() to fetch and rebuild
  2. Calls pipeline.reload_index() to hot-reload FAISS without restart
  3. Refreshes UI tool card data
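
A periodic background sync of this kind can be sketched with a daemon thread. Everything here is an assumption for illustration: start_background_sync is a hypothetical helper, sync_fn stands in for sync_once() and is assumed to return a truthy value when the catalog changed, and reload_fn stands in for pipeline.reload_index().

```python
import threading


def start_background_sync(sync_fn, reload_fn, interval_seconds):
    """Run sync_fn periodically; call reload_fn whenever it reports a change.

    Returns (thread, stop_event); set the event to end the loop.
    """
    stop_event = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as stop_event is set, which ends the loop.
        while not stop_event.wait(interval_seconds):
            try:
                changed = sync_fn()
            except Exception as exc:  # keep the thread alive on transient errors
                print(f"background sync failed: {exc}")
                continue
            if changed:
                reload_fn()

    thread = threading.Thread(target=loop, daemon=True, name="catalog-sync")
    thread.start()
    return thread, stop_event
```

With SYNC_EVERY_HOURS=24, the interval would be 24 * 3600 seconds. Using an Event instead of time.sleep lets the application shut the thread down promptly.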

Manual Sync

ai_agent sync

Embeddings and Index

Embedding Process

At startup (or after sync), each tool doc is embedded and stored in a FAISS index. Embedding is performed by VectorIndex.sync_with_catalog() using the configured embedder (see Retrieval Pipeline).
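
To make the retrieval idea concrete, the sketch below performs an exhaustive inner-product search in plain Python, which is the operation a flat inner-product index such as FAISS IndexFlatIP carries out. It does not use FAISS or the real embedder: the three-dimensional vectors and tool IDs are toy values, and normalizing vectors to unit length (so that inner product equals cosine similarity) is an assumption about the pipeline.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(index, query, k=3):
    """Exhaustive inner-product search over all stored vectors."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(vec, q)), tool_id)
        for tool_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest score first
    return [(tool_id, score) for score, tool_id in scored[:k]]


# Toy "embeddings"; real ones come from the configured embedder.
index = {
    "ct-segmenter": normalize([0.9, 0.1, 0.0]),
    "mri-registration": normalize([0.1, 0.9, 0.2]),
    "xray-classifier": normalize([0.0, 0.2, 0.9]),
}

for tool_id, score in search(index, [1.0, 0.2, 0.0], k=2):
    print(tool_id, round(score, 3))
```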

Index Storage

artifacts/rag_index/
├── index.faiss          # FAISS IndexFlatIP binary
└── meta.json            # Tool IDs, embedding config, timestamps

meta.json structure:

{
  "tool_ids": ["tool1", "tool2", ...],
  "embedding_model": "Qwen/Qwen3-Embedding-8B",
  "num_tools": 150,
  "created_at": "2025-05-08T12:00:00Z"
}

Note

The embedding model recorded in meta.json is set by config.yaml → retrieval.embedder.model_name. If you change the model, the index is rebuilt automatically during the next sync.
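
A rebuild check along these lines could compare the recorded model with the configured one. This is a sketch under assumptions: index_needs_rebuild is a hypothetical helper, and it relies only on the embedding_model key and the two file names shown above.

```python
import json
import os


def index_needs_rebuild(index_dir: str, configured_model: str) -> bool:
    """True when the index is missing or was built with a different model."""
    meta_path = os.path.join(index_dir, "meta.json")
    index_path = os.path.join(index_dir, "index.faiss")
    if not (os.path.exists(meta_path) and os.path.exists(index_path)):
        return True  # nothing usable on disk
    with open(meta_path, encoding="utf-8") as f:
        meta = json.load(f)
    return meta.get("embedding_model") != configured_model
```

Vectors from different embedding models are not comparable, which is why a model change has to invalidate the whole index rather than only new entries.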

Quality Assurance

Validation Rules

  1. Required fields: name, description, url, license
  2. Valid URLs: Well-formed HTTP/HTTPS URLs
  3. Standard licenses: SPDX identifiers preferred
  4. Consistent keywords: Use standard terminology
  5. Demo URLs: Verify demos are live and accessible
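
Rule 2 can be checked more strictly than with a prefix test. The sketch below uses the standard library's urlparse; is_http_url is a hypothetical helper, and it only checks that a URL is well formed, not that the page is live (rule 5 would need a network request).

```python
from urllib.parse import urlparse


def is_http_url(value) -> bool:
    """True for well-formed http(s) URLs that include a host component."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:  # raised for some malformed inputs, e.g. bad IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Unlike a startswith('http') test, this rejects strings such as "httpfoo" or "https://" that have the right prefix but no host.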

Automated Checks

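import json

def validate_catalog(catalog_path):
    """Return a list of error strings; an empty list means the catalog passed."""
    errors = []

    with open(catalog_path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # blank lines are skipped, as in load_catalog()

            try:
                tool = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: JSON syntax error - {e}")
                continue

            if not isinstance(tool, dict):
                errors.append(f"Line {i}: Expected a JSON object")
                continue

            # Required fields (see Validation Rules)
            for field in ['name', 'description', 'url', 'license']:
                if field not in tool:
                    errors.append(f"Line {i}: Missing {field}")

            # URL validation; a missing url was already reported above
            url = tool.get('url')
            if url is not None and not str(url).startswith(('http://', 'https://')):
                errors.append(f"Line {i}: Invalid URL")

            # supportingData structure
            sd = tool.get('supportingData')
            if isinstance(sd, dict) and sd.get('demo_url'):
                if not str(sd['demo_url']).startswith(('http://', 'https://')):
                    errors.append(f"Line {i}: Invalid demo_url")

    return errors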

Best Practices

Tool Descriptions

Good:

"Automated multi-organ segmentation for CT and MRI supporting 104 anatomical structures"

Bad:

"A tool"  # Too vague
"The best segmentation tool ever created with amazing accuracy..."  # Too marketing-y

Keywords

Good:

["segmentation", "CT", "MRI", "medical-imaging", "deep-learning", "organ-segmentation"]

Bad:

["cool", "awesome", "the best"]  # Not searchable terms

Demo URLs

Preferred:

  • HuggingFace Gradio Spaces
  • Google Colab notebooks
  • Live web demos

Avoid:

  • Dead links
  • Paywalled demos
  • Demos requiring registration

Next Steps