Software Catalog¶
The software catalog is the foundation of the AI Imaging Agent, containing curated information about imaging analysis tools.
Overview¶
Format: JSON Lines (JSONL)
Location: dataset/catalog.jsonl
Schema: Based on schema.org SoftwareSourceCode
Size: ~150 tools currently
Catalog Schema¶
Core Fields¶
Based on schema.org/SoftwareSourceCode:
{
  "@type": "SoftwareSourceCode",
  "name": "TotalSegmentator",
  "description": "Tool for automated segmentation of 104 anatomical structures",
  "url": "https://github.com/wasserth/TotalSegmentator",
  "codeRepository": "https://github.com/wasserth/TotalSegmentator",
  "programmingLanguage": "Python",
  "runtimePlatform": "PyTorch",
  "license": "Apache-2.0",
  "keywords": ["segmentation", "CT", "MRI", "medical-imaging"],
  "applicationCategory": "Medical Imaging",
  "operatingSystem": ["Linux", "Windows", "macOS"],
  "softwareVersion": "2.0.0",
  "datePublished": "2022-09-01",
  "dateModified": "2024-01-15",
  "author": {
    "@type": "Person",
    "name": "Jakob Wasserthal"
  }
}
Extended Fields¶
Custom fields in supportingData:
{
  "supportingData": {
    "modalities": ["CT", "MRI"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI", "PNG"],
    "tasks": ["segmentation", "organ-segmentation"],
    "demo_url": "https://huggingface.co/spaces/username/totalsegmentator",
    "paper_url": "https://doi.org/10.1000/example",
    "citations": 150,
    "github_stars": 1200
  }
}
Field Descriptions¶
name¶
Canonical tool name (matches repository or published name)
Example: "TotalSegmentator", "nnU-Net", "MedSAM"
description¶
Brief description of tool's purpose and capabilities
Guidelines:
- 1-2 sentences
- Mention key features
- Include domain/modality if specific
url¶
Primary landing page (usually GitHub repo)
codeRepository¶
Source code repository URL (GitHub, GitLab, etc.)
programmingLanguage¶
Primary language(s)
Common values: "Python", "C++", "JavaScript", "Jupyter Notebook"
license¶
Software license identifier (SPDX format)
Common values:
- "Apache-2.0": Permissive, commercial OK
- "MIT": Very permissive
- "GPL-3.0": Copyleft
- "BSD-3-Clause": Permissive
- "Proprietary": Restricted
keywords¶
Array of relevant tags/keywords
Categories:
- Tasks: segmentation, classification, registration, detection
- Modalities: CT, MRI, X-ray, ultrasound, microscopy
- Techniques: deep-learning, traditional-cv, machine-learning
- Domains: medical-imaging, scientific-imaging, neuroscience
supportingData.modalities¶
Medical imaging modalities supported
Standard values:
- "CT": Computed Tomography
- "MRI": Magnetic Resonance Imaging
- "XR": X-ray radiography
- "US": Ultrasound
- "PET": Positron Emission Tomography
- "SPECT": Single-Photon Emission CT
- "OCT": Optical Coherence Tomography
- "Microscopy": Various microscopy types
supportingData.dimensions¶
Spatial dimensions supported
Values: ["2D"], ["3D"], ["2D", "3D"], ["4D"]
- 2D: Single slice images
- 3D: Volumetric data
- 4D: Time-series volumes (3D + time)
supportingData.formats¶
File formats supported for input/output
Common values:
- Medical: "DICOM", "NIfTI", "NRRD", "Analyze"
- Standard: "PNG", "JPEG", "TIFF", "BMP"
- Scientific: "HDF5", "Zarr", "OME-TIFF"
- Other: "NumPy", "MAT"
supportingData.tasks¶
Analysis tasks the tool performs
Common values:
- "segmentation": Image segmentation
- "classification": Image classification
- "detection": Object detection
- "registration": Image registration/alignment
- "reconstruction": 3D reconstruction
- "enhancement": Image enhancement
- "analysis": General analysis
supportingData.demo_url¶
Link to runnable demo (HuggingFace Space, Colab, web app)
Preferred: HuggingFace Gradio Spaces (best integration)
Example: "https://huggingface.co/spaces/username/toolname"
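The supportingData fields described above lend themselves to simple structured filtering. The following is a minimal sketch, assuming tools are already loaded as dictionaries shaped like the examples on this page. The helper names matches and filter_tools are hypothetical and not part of the agent's code.

```python
from typing import Optional


def matches(tool: dict, *, modality: Optional[str] = None,
            dimension: Optional[str] = None, task: Optional[str] = None) -> bool:
    """Return True if the tool's supportingData satisfies every given criterion."""
    sd = tool.get("supportingData") or {}
    checks = [
        (modality, sd.get("modalities", [])),
        (dimension, sd.get("dimensions", [])),
        (task, sd.get("tasks", [])),
    ]
    # A criterion left as None is ignored; otherwise the value must be listed.
    return all(wanted is None or wanted in values for wanted, values in checks)


def filter_tools(tools: list, **criteria) -> list:
    """Keep only the tools that match all criteria (hypothetical helper)."""
    return [t for t in tools if matches(t, **criteria)]


# Two illustrative entries reusing values from the examples on this page.
tools = [
    {"name": "TotalSegmentator",
     "supportingData": {"modalities": ["CT", "MRI"], "dimensions": ["3D"],
                        "tasks": ["segmentation", "organ-segmentation"]}},
    {"name": "NewTool",
     "supportingData": {"modalities": ["CT"], "dimensions": ["3D"],
                        "tasks": ["segmentation"]}},
]
mri_3d = filter_tools(tools, modality="MRI", dimension="3D")
print([t["name"] for t in mri_3d])
```

Entries without supportingData match only when no criteria are given, which keeps incomplete records from silently passing a filter.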
Catalog Structure¶
File Format¶
JSON Lines (JSONL): Each line is a complete JSON object
{"@type": "SoftwareSourceCode", "name": "Tool1", ...}
{"@type": "SoftwareSourceCode", "name": "Tool2", ...}
{"@type": "SoftwareSourceCode", "name": "Tool3", ...}
Benefits:
- Easy to append new tools
- Stream processing for large catalogs
- Each line independently parseable
- Git-friendly (line-based diffs)
Catalog Loading¶
import json

def load_catalog(path: str) -> list[dict]:
    """Read catalog.jsonl into a list of tool dictionaries."""
    tools = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                tools.append(json.loads(line))
    return tools
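Because each line is independently parseable, the same parsing can also be done lazily for large catalogs. This is a sketch only: iter_catalog and count_by_license are hypothetical names, and the demo file is made up for illustration.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_catalog(path: str) -> Iterator[dict]:
    """Yield one tool per non-empty line without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def count_by_license(path: str) -> dict:
    """Aggregate over the stream; only one tool is held in memory at a time."""
    counts = {}
    for tool in iter_catalog(path):
        key = tool.get("license", "unknown")
        counts[key] = counts.get(key, 0) + 1
    return counts


# Demo on a throwaway file containing a blank line between entries.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "catalog.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"name": "Tool1", "license": "MIT"}\n\n')
        f.write('{"name": "Tool2", "license": "Apache-2.0"}\n')
        f.write('{"name": "Tool3", "license": "MIT"}\n')
    license_counts = count_by_license(demo_path)
print(license_counts)
```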
Validation¶
Tools are validated on load:
from pydantic import BaseModel, HttpUrl

class SoftwareSourceCode(BaseModel):
    name: str
    description: str
    url: HttpUrl
    license: str
    keywords: list[str]
    supportingData: dict

    class Config:
        extra = "allow"  # Allow additional schema.org fields
Catalog Management¶
Adding New Tools¶
- Create entry following schema:
{
  "@type": "SoftwareSourceCode",
  "name": "NewTool",
  "description": "Brief description of the tool",
  "url": "https://github.com/user/newtool",
  "codeRepository": "https://github.com/user/newtool",
  "programmingLanguage": "Python",
  "license": "MIT",
  "keywords": ["segmentation", "CT"],
  "supportingData": {
    "modalities": ["CT"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI"],
    "tasks": ["segmentation"],
    "demo_url": "https://huggingface.co/spaces/user/newtool"
  }
}
- Append to catalog.jsonl (as a single line, no pretty printing)
- Update checksum
- Sync catalog. This rebuilds the embeddings and FAISS index.
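The append step can be scripted so that the entry is guaranteed to land on a single line. The sketch below assumes direct edits to the JSONL file are intended. append_tool is a hypothetical helper, not an existing command of the agent.

```python
import json
import os
import tempfile


def append_tool(catalog_path: str, tool: dict) -> None:
    """Append one tool as a single compact JSON line (hypothetical helper)."""
    # json.dumps without indent escapes newlines inside strings,
    # so the serialized entry never spans more than one line.
    line = json.dumps(tool, ensure_ascii=False)
    # Make sure the previous entry is newline-terminated before appending.
    needs_newline = False
    if os.path.exists(catalog_path) and os.path.getsize(catalog_path) > 0:
        with open(catalog_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    with open(catalog_path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        f.write(line + "\n")


# Demo on a throwaway file whose last line lacks a trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalog.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"name": "Tool1"}')
    append_tool(path, {"@type": "SoftwareSourceCode",
                       "name": "NewTool",
                       "description": "Line one\nline two"})
    with open(path, encoding="utf-8") as f:
        appended_lines = f.read().splitlines()
print(len(appended_lines))
```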
Updating Existing Tools¶
- Find tool in catalog.jsonl
- Edit JSON (update fields)
- Validate JSON syntax
- Update checksum and sync
Removing Tools¶
- Delete line from catalog.jsonl
- Update checksum and sync
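Updates and removals can both be expressed as a rewrite of the file, one line at a time. This is a sketch under the assumption that tools are identified by their name field. rewrite_catalog and the edit callback are hypothetical, and the checksum and sync steps still have to be run afterwards.

```python
import json
import os
import tempfile
from typing import Callable, Optional


def rewrite_catalog(catalog_path: str,
                    edit: Callable[[dict], Optional[dict]]) -> int:
    """Apply edit to every tool; returning None drops the tool.

    The new content is written to a temporary file in the same directory and
    then swapped in with os.replace, so the catalog is never left half-written.
    Returns the number of tools kept.
    """
    directory = os.path.dirname(os.path.abspath(catalog_path))
    kept = 0
    with open(catalog_path, encoding="utf-8") as src, \
         tempfile.NamedTemporaryFile("w", encoding="utf-8", dir=directory,
                                     delete=False, suffix=".tmp") as dst:
        for line in src:
            if not line.strip():
                continue
            tool = edit(json.loads(line))
            if tool is not None:
                dst.write(json.dumps(tool, ensure_ascii=False) + "\n")
                kept += 1
    os.replace(dst.name, catalog_path)
    return kept


def demo_edit(tool: dict) -> Optional[dict]:
    if tool["name"] == "Tool2":
        return None               # remove this tool
    if tool["name"] == "Tool1":
        tool["license"] = "MIT"   # update a field
    return tool


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalog.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for name in ("Tool1", "Tool2", "Tool3"):
            f.write(json.dumps({"name": name, "license": "GPL-3.0"}) + "\n")
    kept_count = rewrite_catalog(path, demo_edit)
    with open(path, encoding="utf-8") as f:
        remaining = [json.loads(line) for line in f if line.strip()]
print(kept_count, [t["name"] for t in remaining])
```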
Synchronization¶
The catalog is populated by querying a GraphDB SPARQL endpoint and converting the results to JSONL. This is handled by catalog/sync.py via the sync_once() function (called at startup and by ai_agent sync).
Sync Flow¶
graph LR
A[GraphDB SPARQL] --> B[fetch_jsonld]
B --> C[catalog.jsonld]
C --> D[full_processing]
D --> E[catalog.jsonl]
E --> F[VectorIndex.sync_with_catalog]
F --> G[FAISS index]
1. Query: load the SPARQL query from GRAPHDB_QUERY_FILE (default: get_relevant_software.rq)
2. Fetch: send the query to GRAPHDB_URL and receive JSON-LD (falls back to TURTLE → rdflib → JSON-LD)
3. Save snapshot: write the raw result to OUTPUT_JSONLD (default: dataset/catalog.jsonld)
4. Convert: run full_processing() to transform JSON-LD into flat JSONL (OUTPUT_JSONL, default: dataset/catalog.jsonl)
5. Diff: compute a SHA-1 hash of the normalized docs and compare it with the previous hash to detect changes
6. Rebuild index: if the catalog changed (or the FAISS index is missing), rebuild and save to RAG_INDEX_DIR
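The diff step can be illustrated with a small sketch. How catalog/sync.py actually normalizes docs is not described here, so the normalization below (sorted keys, order-independent docs) is an assumption, and catalog_fingerprint and needs_rebuild are hypothetical names.

```python
import hashlib
import json


def catalog_fingerprint(docs: list) -> str:
    """SHA-1 over a canonical serialization of the docs."""
    # Assumed normalization: sort keys inside each doc and sort the docs by
    # their serialized form, so key order and line order do not affect the hash.
    canonical = sorted(
        json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for doc in docs
    )
    digest = hashlib.sha1()
    for item in canonical:
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def needs_rebuild(docs: list, previous_hash: str, index_exists: bool) -> bool:
    """Rebuild when the index is missing or the catalog content changed."""
    return (not index_exists) or catalog_fingerprint(docs) != previous_hash


docs_a = [{"name": "Tool1", "license": "MIT"},
          {"name": "Tool2", "license": "GPL-3.0"}]
# Same content with different key order and doc order.
docs_b = [{"license": "GPL-3.0", "name": "Tool2"},
          {"license": "MIT", "name": "Tool1"}]
docs_c = docs_a + [{"name": "Tool3", "license": "MIT"}]

previous = catalog_fingerprint(docs_a)
print(needs_rebuild(docs_b, previous, index_exists=True))
print(needs_rebuild(docs_c, previous, index_exists=True))
```

SHA-1 is used here only as a change detector, not for security, so its known collision weaknesses are not a concern in this role.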
Required Environment Variables for Sync¶
| Variable | Description |
|---|---|
| GRAPHDB_URL | SPARQL endpoint URL (required for ai_agent sync) |
| GRAPHDB_GRAPH | Named graph IRI to query (absolute IRI, required) |
| GRAPHDB_QUERY_FILE | Path to .rq SPARQL query file (default: get_relevant_software.rq) |
| GRAPHDB_USER | GraphDB username (optional, for authenticated endpoints) |
| GRAPHDB_PASSWORD | GraphDB password (optional) |
See Environment Variables for all options.
Freshness Skip¶
You can skip remote sync if the local catalog is recent enough:
SYNC_SKIP_IF_FRESH_SECONDS=3600 # Skip if catalog is < 1 hour old
SYNC_FORCE=1 # Always sync, ignoring freshness
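The interaction between the two settings can be sketched as follows. This is not the agent's implementation: it assumes freshness is judged by the catalog file's modification time and that values such as "1" or "true" enable SYNC_FORCE. should_skip_sync is a hypothetical name.

```python
import os
import tempfile
import time
from typing import Mapping, Optional


def should_skip_sync(catalog_path: str,
                     env: Optional[Mapping[str, str]] = None,
                     now: Optional[float] = None) -> bool:
    """Return True when the local catalog is fresh enough to skip a remote sync."""
    env = os.environ if env is None else env
    # Assumption: any of these values turns SYNC_FORCE on.
    if env.get("SYNC_FORCE", "").strip().lower() in {"1", "true", "yes"}:
        return False
    try:
        max_age = float(env.get("SYNC_SKIP_IF_FRESH_SECONDS", "0"))
    except ValueError:
        return False  # unparsable setting: do not skip
    if max_age <= 0 or not os.path.exists(catalog_path):
        return False
    # Assumption: freshness is measured from the file's modification time.
    current = time.time() if now is None else now
    return current - os.path.getmtime(catalog_path) < max_age


# Demo: the clock is passed in explicitly so the outcome is deterministic.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalog.jsonl")
    open(path, "w").close()
    mtime = os.path.getmtime(path)
    settings = {"SYNC_SKIP_IF_FRESH_SECONDS": "3600"}
    skip_recent = should_skip_sync(path, settings, now=mtime + 60)
    skip_old = should_skip_sync(path, settings, now=mtime + 7200)
    skip_forced = should_skip_sync(path, {**settings, "SYNC_FORCE": "1"},
                                   now=mtime + 60)
print(skip_recent, skip_old, skip_forced)
```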
Auto-Sync (Background)¶
Configure periodic background sync via .env:
When the catalog changes (detected via SHA-1 diff), the background thread:
1. Calls sync_once() to fetch and rebuild
2. Calls pipeline.reload_index() to hot-reload FAISS without restart
3. Refreshes UI tool card data
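The background loop can be sketched with a daemon thread and a stop event. The real signatures of sync_once() and pipeline.reload_index() are not shown on this page, so the sketch takes them as plain callables and assumes the sync callable returns True when the catalog changed. start_auto_sync and the stand-in functions are hypothetical.

```python
import threading
from typing import Callable


def start_auto_sync(sync_once: Callable[[], bool],
                    reload_index: Callable[[], None],
                    refresh_ui: Callable[[], None],
                    interval_seconds: float) -> threading.Event:
    """Run sync periodically in a daemon thread; set the returned event to stop."""
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout, so the body runs once per
        # interval and the loop exits promptly once stop is set.
        while not stop.wait(interval_seconds):
            try:
                if sync_once():       # assumed: True means the catalog changed
                    reload_index()    # hot-reload the index without a restart
                    refresh_ui()      # refresh the tool card data
            except Exception as exc:  # keep the thread alive after a failed sync
                print(f"auto-sync failed: {exc}")

    threading.Thread(target=loop, name="catalog-auto-sync", daemon=True).start()
    return stop


# Demo with stand-in callables: the first sync reports a change, the second does not.
calls = []
second_sync_done = threading.Event()
pending_results = iter([True, False])


def fake_sync_once() -> bool:
    calls.append("sync")
    changed = next(pending_results, False)
    if not changed:
        second_sync_done.set()
    return changed


stop_event = start_auto_sync(fake_sync_once,
                             lambda: calls.append("reload"),
                             lambda: calls.append("refresh"),
                             interval_seconds=0.01)
second_sync_done.wait(timeout=5)
stop_event.set()
print(calls[:4])
```

Using an Event for both the sleep and the stop signal avoids a separate time.sleep call and lets shutdown take effect without waiting out a full interval.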
Manual Sync¶
Embeddings and Index¶
Embedding Process¶
At startup (or after sync), each tool doc is embedded and stored in a FAISS index. Embedding is performed by VectorIndex.sync_with_catalog() using the configured embedder (see Retrieval Pipeline).
Index Storage¶
artifacts/rag_index/
├── index.faiss # FAISS IndexFlatIP binary
└── meta.json # Tool IDs, embedding config, timestamps
meta.json structure:
{
  "tool_ids": ["tool1", "tool2", ...],
  "embedding_model": "Qwen/Qwen3-Embedding-8B",
  "num_tools": 150,
  "created_at": "2025-05-08T12:00:00Z"
}
Note
The embedding model recorded in meta.json is set by config.yaml → retrieval.embedder.model_name. If you change the model, the index is rebuilt automatically during the next sync.
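A check of this kind could look like the sketch below, which relies only on the meta.json fields and file names shown above. How VectorIndex actually decides to rebuild is not documented here, so index_is_stale is a hypothetical helper and the logic is an assumption.

```python
import json
import os
import tempfile


def index_is_stale(index_dir: str, configured_model: str) -> bool:
    """Return True if the stored index is missing or was built with another model."""
    meta_path = os.path.join(index_dir, "meta.json")
    index_path = os.path.join(index_dir, "index.faiss")
    if not (os.path.exists(meta_path) and os.path.exists(index_path)):
        return True  # nothing usable on disk yet
    with open(meta_path, encoding="utf-8") as f:
        meta = json.load(f)
    # Vectors from different embedding models are not comparable, so a model
    # change invalidates the whole index.
    return meta.get("embedding_model") != configured_model


# Demo on a throwaway directory with a placeholder index file.
with tempfile.TemporaryDirectory() as tmp:
    stale_when_empty = index_is_stale(tmp, "Qwen/Qwen3-Embedding-8B")
    open(os.path.join(tmp, "index.faiss"), "wb").close()
    with open(os.path.join(tmp, "meta.json"), "w", encoding="utf-8") as f:
        json.dump({"tool_ids": ["tool1", "tool2"],
                   "embedding_model": "Qwen/Qwen3-Embedding-8B",
                   "num_tools": 2}, f)
    stale_same_model = index_is_stale(tmp, "Qwen/Qwen3-Embedding-8B")
    stale_other_model = index_is_stale(tmp, "some/other-embedding-model")
print(stale_when_empty, stale_same_model, stale_other_model)
```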
Quality Assurance¶
Validation Rules¶
- Required fields: name, description, url, license
- Valid URLs: Well-formed HTTP/HTTPS URLs
- Standard licenses: SPDX identifiers preferred
- Consistent keywords: Use standard terminology
- Demo URLs: Verify demos are live and accessible
Automated Checks¶
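import json

def validate_catalog(catalog_path):
    """Return a list of human-readable problems found in the catalog."""
    errors = []
    with open(catalog_path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines, as load_catalog does
            try:
                tool = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: JSON syntax error - {e}")
                continue
            if not isinstance(tool, dict):
                errors.append(f"Line {i}: Entry is not a JSON object")
                continue
            # Required fields (see Validation Rules)
            for field in ['name', 'description', 'url', 'license']:
                if field not in tool:
                    errors.append(f"Line {i}: Missing {field}")
            # URL validation; a missing url was already reported above
            url = tool.get('url')
            if url is not None and not str(url).startswith(('http://', 'https://')):
                errors.append(f"Line {i}: Invalid URL")
            # supportingData structure
            sd = tool.get('supportingData')
            if isinstance(sd, dict) and sd.get('demo_url'):
                if not str(sd['demo_url']).startswith(('http://', 'https://')):
                    errors.append(f"Line {i}: Invalid demo_url")
    return errors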
Best Practices¶
Tool Descriptions¶
✅ Good:
❌ Bad:
"A tool" # Too vague
"The best segmentation tool ever created with amazing accuracy..." # Too marketing-y
Keywords¶
✅ Good:
❌ Bad:
Demo URLs¶
✅ Preferred:
- HuggingFace Gradio Spaces
- Google Colab notebooks
- Live web demos
❌ Avoid:
- Dead links
- Paywalled demos
- Demos requiring registration
Next Steps¶
- Return to Architecture Overview
- Learn about Retrieval Pipeline
- Explore Agent & VLM Selection