Software Catalog¶

The software catalog is the foundation of the AI Imaging Agent, containing curated information about imaging analysis tools.

Overview¶

Format: JSON Lines (JSONL)
Location: dataset/catalog.jsonl
Schema: Based on schema.org SoftwareSourceCode
Size: ~150 tools currently

Catalog Schema¶

Core Fields¶

Based on schema.org/SoftwareSourceCode:

{
  "@type": "SoftwareSourceCode",
  "name": "TotalSegmentator",
  "description": "Tool for automated segmentation of 104 anatomical structures",
  "url": "https://github.com/wasserth/TotalSegmentator",
  "codeRepository": "https://github.com/wasserth/TotalSegmentator",
  "programmingLanguage": "Python",
  "runtimePlatform": "PyTorch",
  "license": "Apache-2.0",
  "keywords": ["segmentation", "CT", "MRI", "medical-imaging"],
  "applicationCategory": "Medical Imaging",
  "operatingSystem": ["Linux", "Windows", "macOS"],
  "softwareVersion": "2.0.0",
  "datePublished": "2022-09-01",
  "dateModified": "2024-01-15",
  "author": {
    "@type": "Person",
    "name": "Jakob Wasserthal"
  }
}

Extended Fields¶

Custom fields in supportingData:

{
  "supportingData": {
    "modalities": ["CT", "MRI"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI", "PNG"],
    "tasks": ["segmentation", "organ-segmentation"],
    "demo_url": "https://huggingface.co/spaces/username/totalsegmentator",
    "paper_url": "https://doi.org/10.1000/example",
    "citations": 150,
    "github_stars": 1200
  }
}

Field Descriptions¶

name¶

Canonical tool name (matches repository or published name)

Example: "TotalSegmentator", "nnU-Net", "MedSAM"

description¶

Brief description of tool's purpose and capabilities

Guidelines:

1-2 sentences
Mention key features
Include domain/modality if specific

url¶

Primary landing page (usually GitHub repo)

codeRepository¶

Source code repository URL (GitHub, GitLab, etc.)

programmingLanguage¶

Primary language(s)

Common values: "Python", "C++", "JavaScript", "Jupyter Notebook"

license¶

Software license identifier (SPDX format)

Common values:

"Apache-2.0": Permissive, commercial OK
"MIT": Very permissive
"GPL-3.0": Copyleft
"BSD-3-Clause": Permissive
"Proprietary": Restricted

keywords¶

Array of relevant tags/keywords

Categories:

Tasks: segmentation, classification, registration, detection
Modalities: CT, MRI, X-ray, ultrasound, microscopy
Techniques: deep-learning, traditional-cv, machine-learning
Domains: medical-imaging, scientific-imaging, neuroscience

supportingData.modalities¶

Medical imaging modalities supported

Standard values:

"CT": Computed Tomography
"MRI": Magnetic Resonance Imaging
"XR": X-ray radiography
"US": Ultrasound
"PET": Positron Emission Tomography
"SPECT": Single-Photon Emission CT
"OCT": Optical Coherence Tomography
"Microscopy": Various microscopy types

supportingData.dimensions¶

Spatial dimensions supported

Values: ["2D"], ["3D"], ["2D", "3D"], ["4D"]

2D: Single slice images
3D: Volumetric data
4D: Time-series volumes (3D + time)

supportingData.formats¶

File formats supported for input/output

Common values:

Medical: "DICOM", "NIfTI", "NRRD", "Analyze"
Standard: "PNG", "JPEG", "TIFF", "BMP"
Scientific: "HDF5", "Zarr", "OME-TIFF"
Other: "NumPy", "MAT"

supportingData.tasks¶

Analysis tasks the tool performs

Common values:

"segmentation": Image segmentation
"classification": Image classification
"detection": Object detection
"registration": Image registration/alignment
"reconstruction": 3D reconstruction
"enhancement": Image enhancement
"analysis": General analysis

supportingData.demo_url¶

Link to runnable demo (HuggingFace Space, Colab, web app)

Preferred: HuggingFace Gradio Spaces (best integration)

Example: "https://huggingface.co/spaces/username/toolname"
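
The supportingData fields described above lend themselves to simple filtering. The sketch below selects entries by modality and task; the helper name `filter_tools` and the sample entries are hypothetical, and only the field names are taken from the schema.

```python
from typing import Iterable


def filter_tools(tools: Iterable[dict], modality: str, task: str) -> list[dict]:
    """Return tools whose supportingData lists both the modality and the task.

    Hypothetical helper for illustration; not part of the agent's API.
    """
    matches = []
    for tool in tools:
        # Entries without supportingData are treated as matching nothing.
        sd = tool.get("supportingData") or {}
        if modality in sd.get("modalities", []) and task in sd.get("tasks", []):
            matches.append(tool)
    return matches


# Made-up sample entries that use the documented field names.
sample = [
    {"name": "ToolA", "supportingData": {"modalities": ["CT", "MRI"], "tasks": ["segmentation"]}},
    {"name": "ToolB", "supportingData": {"modalities": ["US"], "tasks": ["classification"]}},
    {"name": "ToolC"},  # no supportingData at all
]

ct_segmenters = filter_tools(sample, modality="CT", task="segmentation")
print([t["name"] for t in ct_segmenters])
```

Matching is exact and case-sensitive here, which is one reason to stick to the standard values listed for each field.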

Catalog Structure¶

File Format¶

JSON Lines (JSONL): Each line is a complete JSON object

{"@type": "SoftwareSourceCode", "name": "Tool1", ...}
{"@type": "SoftwareSourceCode", "name": "Tool2", ...}
{"@type": "SoftwareSourceCode", "name": "Tool3", ...}

Benefits:

Easy to append new tools
Stream processing for large catalogs
Each line independently parseable
Git-friendly (line-based diffs)

Catalog Loading¶

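import json

def load_catalog(path: str) -> list[dict]:
    """Read a JSONL catalog and return one dict per non-empty line."""
    tools = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # Skip blank lines
                tools.append(json.loads(line))
    return tools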

Validation¶

Tools are validated on load:

from pydantic import BaseModel, HttpUrl

class SoftwareSourceCode(BaseModel):
    name: str
    description: str
    url: HttpUrl
    license: str
    keywords: list[str]
    supportingData: dict

    class Config:
        extra = "allow"  # Allow additional schema.org fields

Catalog Management¶

Adding New Tools¶

Create an entry that follows the schema:

{
  "@type": "SoftwareSourceCode",
  "name": "NewTool",
  "description": "Brief description of the tool",
  "url": "https://github.com/user/newtool",
  "codeRepository": "https://github.com/user/newtool",
  "programmingLanguage": "Python",
  "license": "MIT",
  "keywords": ["segmentation", "CT"],
  "supportingData": {
    "modalities": ["CT"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI"],
    "tasks": ["segmentation"],
    "demo_url": "https://huggingface.co/spaces/user/newtool"
  }
}

Append the entry to catalog.jsonl as a single line (no pretty-printing)
Update checksum:

shasum dataset/catalog.jsonl > dataset/catalog.jsonl.sha1

Sync catalog:

ai_agent sync

This rebuilds the embeddings and FAISS index.
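
The append and checksum steps above can also be scripted. The following is a minimal sketch, assuming the catalog file already ends with a newline; the function names `append_tool` and `sha1_of_file` are hypothetical and are not part of the `ai_agent` CLI.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def append_tool(catalog_path: Path, entry: dict) -> None:
    """Append one entry as a single compact JSON line (no pretty-printing)."""
    line = json.dumps(entry, ensure_ascii=False, separators=(",", ":"))
    with catalog_path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")


def sha1_of_file(path: Path) -> str:
    """Return the hex SHA-1 digest of the file's bytes."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration against a throwaway file rather than dataset/catalog.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "catalog.jsonl"
    append_tool(catalog, {"@type": "SoftwareSourceCode", "name": "NewTool"})
    print(catalog.read_text(encoding="utf-8"), end="")
    print(sha1_of_file(catalog))
```

The digest is computed over the raw file bytes, so it should match the hash field that `shasum` prints for the same file.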

Updating Existing Tools¶

Find tool in catalog.jsonl
Edit JSON (update fields)
Validate JSON syntax
Update checksum and sync

Removing Tools¶

Delete line from catalog.jsonl
Update checksum and sync

Synchronization¶

The catalog is populated by querying a GraphDB SPARQL endpoint and converting the results to JSONL. This is handled by catalog/sync.py via the sync_once() function (called at startup and by ai_agent sync).

Sync Flow¶

graph LR
    A[GraphDB SPARQL] --> B[fetch_jsonld]
    B --> C[catalog.jsonld]
    C --> D[full_processing]
    D --> E[catalog.jsonl]
    E --> F[VectorIndex.sync_with_catalog]
    F --> G[FAISS index]

Query — load SPARQL query from GRAPHDB_QUERY_FILE (default: get_relevant_software.rq)
Fetch — send query to GRAPHDB_URL, receive JSON-LD (falls back to TURTLE → rdflib → JSON-LD)
Save snapshot — write raw result to OUTPUT_JSONLD (default: dataset/catalog.jsonld)
Convert — run full_processing() to transform JSON-LD into flat JSONL (OUTPUT_JSONL, default: dataset/catalog.jsonl)
Diff — compute SHA-1 hash of normalized docs; compare with previous hash to detect changes
Rebuild index — if changed (or FAISS is missing), rebuild and save to RAG_INDEX_DIR
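
The Diff step can be illustrated with a short sketch. It assumes that "normalized" means key-sorted JSON with the docs ordered by their serialization, so that neither key order nor line order affects the hash; the actual normalization in catalog/sync.py may differ, and the function names here are hypothetical.

```python
import hashlib
import json
from typing import Optional


def catalog_fingerprint(docs: list[dict]) -> str:
    """SHA-1 over a canonical serialization of the docs.

    Assumption: canonical = key-sorted compact JSON, docs sorted by that text.
    """
    canonical = sorted(
        json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for doc in docs
    )
    digest = hashlib.sha1()
    for item in canonical:
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")  # Separator so adjacent docs cannot merge
    return digest.hexdigest()


def has_changed(docs: list[dict], previous_hash: Optional[str]) -> bool:
    """True if there is no previous hash or the fingerprint differs from it."""
    return catalog_fingerprint(docs) != previous_hash


old_docs = [{"name": "Tool1", "license": "MIT"}, {"name": "Tool2"}]
new_docs = [{"name": "Tool2"}, {"license": "MIT", "name": "Tool1"}]  # reordered only

previous = catalog_fingerprint(old_docs)
print(has_changed(new_docs, previous))
print(has_changed(new_docs + [{"name": "Tool3"}], previous))
```

Under this scheme a pure reordering does not trigger an index rebuild, while any added, removed, or edited entry does.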

Required Environment Variables for Sync¶

| Variable | Description |
|---|---|
| `GRAPHDB_URL` | SPARQL endpoint URL (required for `ai_agent sync`) |
| `GRAPHDB_GRAPH` | Named graph IRI to query (absolute IRI, required) |
| `GRAPHDB_QUERY_FILE` | Path to `.rq` SPARQL query file (default: `get_relevant_software.rq`) |
| `GRAPHDB_USER` | GraphDB username (optional, for authenticated endpoints) |
| `GRAPHDB_PASSWORD` | GraphDB password (optional) |

See Environment Variables for all options.

Freshness Skip¶

You can skip remote sync if the local catalog is recent enough:

SYNC_SKIP_IF_FRESH_SECONDS=3600   # Skip if catalog is < 1 hour old
SYNC_FORCE=1                       # Always sync, ignoring freshness
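
One way such a freshness check could work is sketched below. It assumes that freshness is judged by the catalog file's modification time and that `SYNC_FORCE=1` takes precedence over the freshness window; both are assumptions for illustration, and `should_skip_sync` is a hypothetical name.

```python
import os
import time
from pathlib import Path
from typing import Mapping, Optional


def should_skip_sync(
    catalog_path: Path,
    env: Mapping[str, str] = os.environ,
    now: Optional[float] = None,
) -> bool:
    """Return True if the local catalog is fresh enough to skip a remote sync."""
    # Assumption: SYNC_FORCE=1 always wins over the freshness window.
    if env.get("SYNC_FORCE", "") == "1":
        return False

    try:
        max_age = int(env.get("SYNC_SKIP_IF_FRESH_SECONDS", "0") or 0)
    except ValueError:
        max_age = 0  # Unparseable value: behave as if the option were unset

    if max_age <= 0 or not catalog_path.exists():
        return False

    # Assumption: catalog age is measured from the file's modification time.
    current = time.time() if now is None else now
    return (current - catalog_path.stat().st_mtime) < max_age


print(should_skip_sync(Path("dataset/catalog.jsonl"), {"SYNC_FORCE": "1"}))
```

Passing `env` and `now` explicitly keeps the check easy to test without touching the real environment or clock.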

Auto-Sync (Background)¶

Configure periodic background sync via .env:

SYNC_EVERY_HOURS=24

When the catalog changes (detected via SHA-1 diff), the background thread:

1. Calls sync_once() to fetch and rebuild
2. Calls pipeline.reload_index() to hot-reload FAISS without restart
3. Refreshes UI tool card data
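
A background loop of this kind can be sketched with a stoppable thread. This is an illustration under stated assumptions, not the agent's implementation: it assumes the sync callable returns True when the catalog changed, and `run_periodic_sync` and the two fake callables are hypothetical stand-ins for `sync_once()` and `pipeline.reload_index()`.

```python
import threading
from typing import Callable


def run_periodic_sync(
    sync_once: Callable[[], bool],
    reload_index: Callable[[], None],
    interval_seconds: float,
    stop_event: threading.Event,
) -> None:
    """Call sync_once() every interval until stop_event is set.

    Assumption: sync_once() returns True when the catalog changed.
    """
    # Event.wait() doubles as an interruptible sleep; it returns True once set.
    while not stop_event.wait(interval_seconds):
        if sync_once():
            reload_index()  # A UI refresh would also be triggered here


# Deterministic demo with fake callables and a zero interval.
calls = {"sync": 0, "reload": 0}
stop = threading.Event()


def fake_sync_once() -> bool:
    calls["sync"] += 1
    if calls["sync"] == 3:
        stop.set()  # End the demo after three sync attempts
    return calls["sync"] % 2 == 1  # Pretend the catalog changed on calls 1 and 3


def fake_reload_index() -> None:
    calls["reload"] += 1


worker = threading.Thread(
    target=run_periodic_sync,
    args=(fake_sync_once, fake_reload_index, 0.0, stop),
    daemon=True,
)
worker.start()
worker.join(timeout=5)
print(calls)
```

With `SYNC_EVERY_HOURS=24`, the interval passed to such a loop would be `24 * 3600` seconds.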

Manual Sync¶

ai_agent sync

Embeddings and Index¶

Embedding Process¶

At startup (or after sync), each tool doc is embedded and stored in a FAISS index. Embedding is performed by VectorIndex.sync_with_catalog() using the configured embedder (see Retrieval Pipeline).
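
The index is stored as a FAISS `IndexFlatIP` (see Index Storage), which performs exact, exhaustive inner-product search. The plain-Python sketch below shows the same computation on toy vectors without using FAISS itself. The vectors are made up, and L2-normalizing them first (so that inner product equals cosine similarity) is an assumption about the pipeline rather than something this page states.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product_search(
    vectors: list[list[float]], query: list[float], k: int
) -> list[tuple[int, float]]:
    """Score every vector against the query and return the top-k (index, score)."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), i) for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy 2-D "embeddings" for three tools, plus one query.
tool_vectors = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = l2_normalize([0.9, 0.1])

for idx, score in inner_product_search(tool_vectors, query, k=2):
    print(idx, round(score, 3))
```

Because every stored vector is scored, results are exact, and the cost grows linearly with catalog size.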

Index Storage¶

artifacts/rag_index/
├── index.faiss          # FAISS IndexFlatIP binary
└── meta.json            # Tool IDs, embedding config, timestamps

meta.json structure:

{
  "tool_ids": ["tool1", "tool2", ...],
  "embedding_model": "Qwen/Qwen3-Embedding-8B",
  "num_tools": 150,
  "created_at": "2025-05-08T12:00:00Z"
}

Note

The embedding model recorded in meta.json is set by config.yaml → retrieval.embedder.model_name. If you change the model, the index is rebuilt automatically during the next sync.
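
A rebuild check along these lines can be sketched by comparing the model recorded in meta.json with the configured one. The function name `needs_rebuild` is hypothetical, and treating a missing or unreadable meta.json as "rebuild" is an assumption; only the file names and the `embedding_model` key come from the layout above.

```python
import json
import tempfile
from pathlib import Path


def needs_rebuild(index_dir: Path, configured_model: str) -> bool:
    """Return True if the stored index is missing or was built with another model."""
    meta_path = index_dir / "meta.json"
    index_path = index_dir / "index.faiss"
    if not meta_path.exists() or not index_path.exists():
        return True
    try:
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return True  # Assumption: unreadable metadata forces a rebuild
    return meta.get("embedding_model") != configured_model


# Demonstration in a temporary directory instead of artifacts/rag_index/.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    (index_dir / "index.faiss").write_bytes(b"")  # Placeholder, not a real index
    (index_dir / "meta.json").write_text(
        json.dumps({"embedding_model": "Qwen/Qwen3-Embedding-8B", "num_tools": 150}),
        encoding="utf-8",
    )
    print(needs_rebuild(index_dir, "Qwen/Qwen3-Embedding-8B"))
    print(needs_rebuild(index_dir, "some-other/embedding-model"))
```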

Quality Assurance¶

Validation Rules¶

Required fields: name, description, url, license
Valid URLs: Well-formed HTTP/HTTPS URLs
Standard licenses: SPDX identifiers preferred
Consistent keywords: Use standard terminology
Demo URLs: Verify demos are live and accessible
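
For the "well-formed HTTP/HTTPS URL" rule, a stricter check than a prefix test is possible with the standard library. This is a minimal sketch; `is_http_url` is a hypothetical helper and it checks syntax only, not whether the link is live.

```python
from urllib.parse import urlparse


def is_http_url(value: object) -> bool:
    """Return True for strings with an http/https scheme and a non-empty host."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        return False  # Raised for some malformed inputs, e.g. a broken IPv6 host
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in (
    "https://github.com/wasserth/TotalSegmentator",
    "http://",
    "ftp://example.org/tool",
    "httpx-not-a-url",
):
    print(candidate, is_http_url(candidate))
```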

Automated Checks¶

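import json

# Required fields, as listed under Validation Rules
REQUIRED_FIELDS = ("name", "description", "url", "license")

def validate_catalog(catalog_path):
    """Return a list of error strings; an empty list means no problems were found."""
    errors = []

    with open(catalog_path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # Skip blank lines, as load_catalog does

            try:
                tool = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: JSON syntax error - {e}")
                continue

            if not isinstance(tool, dict):
                errors.append(f"Line {i}: Expected a JSON object")
                continue

            # Required fields
            for field in REQUIRED_FIELDS:
                if field not in tool:
                    errors.append(f"Line {i}: Missing {field}")

            # URL validation (a missing url is already reported above)
            url = tool.get('url')
            if url is not None and not str(url).startswith(('http://', 'https://')):
                errors.append(f"Line {i}: Invalid URL")

            # supportingData structure
            sd = tool.get('supportingData')
            if isinstance(sd, dict) and sd.get('demo_url'):
                if not str(sd['demo_url']).startswith(('http://', 'https://')):
                    errors.append(f"Line {i}: Invalid demo_url")

    return errors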

Best Practices¶

Tool Descriptions¶

✅ Good:

"Automated multi-organ segmentation for CT and MRI supporting 104 anatomical structures"

❌ Bad:

"A tool"  # Too vague
"The best segmentation tool ever created with amazing accuracy..."  # Too marketing-y

Keywords¶

✅ Good:

["segmentation", "CT", "MRI", "medical-imaging", "deep-learning", "organ-segmentation"]

❌ Bad:

["cool", "awesome", "the best"]  # Not searchable terms

Demo URLs¶

✅ Preferred:

- HuggingFace Gradio Spaces
- Google Colab notebooks
- Live web demos

❌ Avoid:

- Dead links
- Paywalled demos
- Demos requiring registration
