Software Catalog¶
The software catalog is the foundation of the AI Imaging Agent, containing curated information about imaging analysis tools.
Overview¶
Format: JSON Lines (JSONL)
Location: dataset/catalog.jsonl
Schema: Based on schema.org SoftwareSourceCode
Size: ~150 tools currently
Catalog Schema¶
Core Fields¶
Based on schema.org/SoftwareSourceCode:
{
  "@type": "SoftwareSourceCode",
  "name": "TotalSegmentator",
  "description": "Tool for automated segmentation of 104 anatomical structures",
  "url": "https://github.com/wasserth/TotalSegmentator",
  "codeRepository": "https://github.com/wasserth/TotalSegmentator",
  "programmingLanguage": "Python",
  "runtimePlatform": "PyTorch",
  "license": "Apache-2.0",
  "keywords": ["segmentation", "CT", "MRI", "medical-imaging"],
  "applicationCategory": "Medical Imaging",
  "operatingSystem": ["Linux", "Windows", "macOS"],
  "softwareVersion": "2.0.0",
  "datePublished": "2022-09-01",
  "dateModified": "2024-01-15",
  "author": {
    "@type": "Person",
    "name": "Jakob Wasserthal"
  }
}
Extended Fields¶
Custom fields in supportingData:
{
  "supportingData": {
    "modalities": ["CT", "MRI"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI", "PNG"],
    "tasks": ["segmentation", "organ-segmentation"],
    "demo_url": "https://huggingface.co/spaces/username/totalsegmentator",
    "paper_url": "https://doi.org/10.1000/example",
    "citations": 150,
    "github_stars": 1200
  }
}
Field Descriptions¶
name¶
Canonical tool name (matches repository or published name)
Example: "TotalSegmentator", "nnU-Net", "MedSAM"
description¶
Brief description of tool's purpose and capabilities
Guidelines:
- 1-2 sentences
- Mention key features
- Include domain/modality if specific
url¶
Primary landing page (usually GitHub repo)
codeRepository¶
Source code repository URL (GitHub, GitLab, etc.)
programmingLanguage¶
Primary language(s)
Common values: "Python", "C++", "JavaScript", "Jupyter Notebook"
license¶
Software license identifier (SPDX format)
Common values:
- "Apache-2.0": Permissive, commercial use OK
- "MIT": Very permissive
- "GPL-3.0": Copyleft
- "BSD-3-Clause": Permissive
- "Proprietary": Restricted (catalog convention, not an SPDX identifier)
keywords¶
Array of relevant tags/keywords
Categories:
- Tasks: segmentation, classification, registration, detection
- Modalities: CT, MRI, X-ray, ultrasound, microscopy
- Techniques: deep-learning, traditional-cv, machine-learning
- Domains: medical-imaging, scientific-imaging, neuroscience
supportingData.modalities¶
Medical imaging modalities supported
Standard values:
- "CT": Computed Tomography
- "MRI": Magnetic Resonance Imaging
- "XR": X-ray radiography
- "US": Ultrasound
- "PET": Positron Emission Tomography
- "SPECT": Single-Photon Emission CT
- "OCT": Optical Coherence Tomography
- "Microscopy": Various microscopy types
supportingData.dimensions¶
Spatial dimensions supported
Values: ["2D"], ["3D"], ["2D", "3D"], ["4D"]
- 2D: Single slice images
- 3D: Volumetric data
- 4D: Time-series volumes (3D + time)
supportingData.formats¶
File formats supported for input/output
Common values:
- Medical: "DICOM", "NIfTI", "NRRD", "Analyze"
- Standard: "PNG", "JPEG", "TIFF", "BMP"
- Scientific: "HDF5", "Zarr", "OME-TIFF"
- Other: "NumPy", "MAT"
supportingData.tasks¶
Analysis tasks the tool performs
Common values:
- "segmentation": Image segmentation
- "classification": Image classification
- "detection": Object detection
- "registration": Image registration/alignment
- "reconstruction": 3D reconstruction
- "enhancement": Image enhancement
- "analysis": General analysis
supportingData.demo_url¶
Link to runnable demo (HuggingFace Space, Colab, web app)
Preferred: HuggingFace Gradio Spaces (best integration)
Example: "https://huggingface.co/spaces/username/toolname"
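The value lists above for modalities, dimensions, and tasks lend themselves to a simple vocabulary check. The following is a minimal sketch, not part of the project: the names `STANDARD_VALUES` and `find_nonstandard_values` are hypothetical, and the check is advisory (it flags values for review instead of rejecting the entry), since the lists are described as common or standard values, not as a closed set.

```python
# Value lists copied from the field descriptions above.
# STANDARD_VALUES and find_nonstandard_values are illustrative names,
# not part of the project's code.
STANDARD_VALUES = {
    "modalities": {"CT", "MRI", "XR", "US", "PET", "SPECT", "OCT", "Microscopy"},
    "dimensions": {"2D", "3D", "4D"},
    "tasks": {
        "segmentation", "classification", "detection", "registration",
        "reconstruction", "enhancement", "analysis",
    },
}


def find_nonstandard_values(supporting_data: dict) -> dict[str, list[str]]:
    """Return, per field, the values that are not in the lists above."""
    report: dict[str, list[str]] = {}
    for field, allowed in STANDARD_VALUES.items():
        unknown = [v for v in supporting_data.get(field, []) if v not in allowed]
        if unknown:
            report[field] = unknown
    return report


example = {
    "modalities": ["CT", "MRI"],
    "dimensions": ["3D"],
    "tasks": ["segmentation", "organ-segmentation"],
}
print(find_nonstandard_values(example))
```

Run on the TotalSegmentator extended fields shown earlier, this reports only "organ-segmentation" under tasks, which is a more specific tag than the common values and may well be intentional.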
Catalog Structure¶
File Format¶
JSON Lines (JSONL): Each line is a complete JSON object
{"@type": "SoftwareSourceCode", "name": "Tool1", ...}
{"@type": "SoftwareSourceCode", "name": "Tool2", ...}
{"@type": "SoftwareSourceCode", "name": "Tool3", ...}
Benefits:
- Easy to append new tools
- Stream processing for large catalogs
- Each line independently parseable
- Git-friendly (line-based diffs)
Catalog Loading¶
import json


def load_catalog(path: str) -> list[dict]:
    tools = []
    with open(path) as f:
        for line in f:
            if line.strip():
                tools.append(json.loads(line))
    return tools
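Because each line is independently parseable, the catalog can also be consumed as a stream instead of a fully materialized list. The following is a sketch of that idea under stated assumptions: `iter_catalog` and `names_with_demo` are hypothetical helper names, not functions from the project.

```python
import json
from collections.abc import Iterator


def iter_catalog(path: str) -> Iterator[tuple[int, dict]]:
    """Yield (line_number, tool) pairs one at a time instead of building a list."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, 1):
            if line.strip():  # skip blank lines, as load_catalog does
                yield line_number, json.loads(line)


def names_with_demo(path: str) -> list[str]:
    """Example consumer: names of tools that declare a non-empty demo_url."""
    return [
        tool["name"]
        for _, tool in iter_catalog(path)
        if tool.get("supportingData", {}).get("demo_url")
    ]
```

Keeping the line number alongside each tool makes it easy to point at the offending line when a later check fails.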
Validation¶
Tools are validated on load:
from pydantic import BaseModel, HttpUrl


class SoftwareSourceCode(BaseModel):
    name: str
    description: str
    url: HttpUrl
    license: str
    keywords: list[str]
    supportingData: dict

    class Config:
        extra = "allow"  # Allow additional schema.org fields
Catalog Management¶
Adding New Tools¶
- Create entry following schema:
{
  "@type": "SoftwareSourceCode",
  "name": "NewTool",
  "description": "Brief description of the tool",
  "url": "https://github.com/user/newtool",
  "codeRepository": "https://github.com/user/newtool",
  "programmingLanguage": "Python",
  "license": "MIT",
  "keywords": ["segmentation", "CT"],
  "supportingData": {
    "modalities": ["CT"],
    "dimensions": ["3D"],
    "formats": ["DICOM", "NIfTI"],
    "tasks": ["segmentation"],
    "demo_url": "https://huggingface.co/spaces/user/newtool"
  }
}
- Append to catalog.jsonl (as a single line, no pretty printing)
- Update checksum
- Sync catalog. This rebuilds the embeddings and FAISS index.
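The "single line, no pretty printing" requirement in the append step can be met by serializing with `json.dumps` without an `indent` argument. This is a minimal sketch, assuming a hypothetical helper named `append_tool`. It covers only the append step; the project's own checksum and sync commands are not reproduced here.

```python
import json
import os


def append_tool(catalog_path: str, tool: dict) -> None:
    """Append one tool to the catalog as a single compact JSON line."""
    # Without `indent`, json.dumps emits no raw newlines (newlines inside
    # strings are escaped), so the entry always occupies exactly one line.
    line = json.dumps(tool)

    # If the existing file does not end with a newline, add one first so the
    # new entry does not get glued onto the previous line.
    needs_newline = False
    if os.path.exists(catalog_path) and os.path.getsize(catalog_path) > 0:
        with open(catalog_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"

    with open(catalog_path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        f.write(line + "\n")
```

Usage would look like `append_tool("dataset/catalog.jsonl", new_entry)`, followed by the checksum update and sync described in the steps above.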
Updating Existing Tools¶
- Find the tool in catalog.jsonl
- Edit the JSON (update fields)
- Validate JSON syntax
- Update checksum and sync
Removing Tools¶
- Delete the tool's line from catalog.jsonl
- Update checksum and sync
Synchronization¶
Auto-Sync¶
Configured via .env.
Process:
1. A background thread checks the catalog every 24h
2. It compares the catalog's SHA-1 checksum
3. If the checksum changed:
   - Reload catalog
   - Re-embed all tools
   - Rebuild FAISS index
   - Update vocabulary for query expansion
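The checksum comparison in the process above could be sketched as follows. This is an illustration under assumptions, not the project's implementation: the helper names are hypothetical, and it assumes the previously recorded checksum is the `catalog_sha1` field of `meta.json` (documented under Index Storage on this page); the actual sync code may store it elsewhere.

```python
import hashlib
import json


def catalog_sha1(catalog_path: str) -> str:
    """Compute the SHA-1 hex digest of the catalog file, reading in chunks."""
    digest = hashlib.sha1()
    with open(catalog_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def catalog_changed(catalog_path: str, meta_path: str) -> bool:
    """Return True if the catalog differs from the checksum recorded in meta.json.

    Assumption: meta.json holds the last synced checksum under "catalog_sha1".
    A missing meta.json is treated as "changed" so that a first sync runs.
    """
    try:
        with open(meta_path, encoding="utf-8") as f:
            recorded = json.load(f).get("catalog_sha1")
    except FileNotFoundError:
        return True
    return recorded != catalog_sha1(catalog_path)
```

Hashing the raw bytes means that any edit, including whitespace-only changes, triggers a rebuild, which errs on the side of keeping the index consistent with the file.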
Manual Sync¶
Output:
[sync] 150 → dataset/catalog.jsonl
[sync] Rebuilding embeddings...
[sync] Embedding 150 tools... (5.2s)
[sync] Building FAISS index...
[sync] Saved to artifacts/rag_index/
[sync] Updating vocabulary...
[sync] Sync complete.
Embeddings and Index¶
Embedding Process¶
For each tool, create text representation:
tool_text = f"{tool['name']} {tool['description']} {' '.join(tool['keywords'])}"

# Optional: Include supportingData
if 'supportingData' in tool:
    sd = tool['supportingData']
    tool_text += f" {' '.join(sd.get('modalities', []))}"
    tool_text += f" {' '.join(sd.get('tasks', []))}"

# Embed
embedding = embedder.encode(tool_text, normalize_embeddings=True)
Index Storage¶
artifacts/rag_index/
├── index.faiss # FAISS IndexFlatIP
└── meta.json # Tool IDs, config, timestamps
meta.json structure:
{
  "tool_ids": ["tool1", "tool2", ...],
  "version": "1.0",
  "embedding_model": "BAAI/bge-m3",
  "embedding_dim": 1024,
  "num_tools": 150,
  "created_at": "2024-03-01T12:00:00Z",
  "catalog_sha1": "abc123..."
}
Vocabulary Extraction¶
Purpose¶
Extract terms for query expansion:
vocabulary = set()
for tool in catalog:
    vocabulary.add(tool['name'].lower())
    vocabulary.update(tool['description'].lower().split())
    vocabulary.update(tool.get('keywords', []))
    if 'supportingData' in tool:
        sd = tool['supportingData']
        vocabulary.update(sd.get('modalities', []))
        vocabulary.update(sd.get('tasks', []))
# Result: ~5000 unique terms
Vocabulary Embeddings¶
Pre-embed vocabulary for fast query expansion:
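import numpy as np

vocab_list = list(vocabulary)
vocab_embeddings = embedder.encode(vocab_list, normalize_embeddings=True)
# Save for query expansion
np.save("artifacts/vocab_embeddings.npy", vocab_embeddings)
At query time, the query embedding is scored against these precomputed vectors to find the nearest vocabulary terms. Because the embeddings are normalized, the inner product equals cosine similarity.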
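The nearest-neighbor lookup over the vocabulary can be illustrated with a brute-force inner-product search. This is a dependency-free sketch, not the project's query-expansion code: `nearest_terms` is a hypothetical name, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import heapq


def nearest_terms(query_vec, vocab_list, vocab_embeddings, k=5):
    """Return the k (term, score) pairs with the highest inner product.

    Assumes query_vec and every row of vocab_embeddings are unit-length,
    so the inner product is the cosine similarity.
    """
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), term)
        for term, vec in zip(vocab_list, vocab_embeddings)
    )
    return [(term, score) for score, term in heapq.nlargest(k, scored)]


# Toy, already-normalized vectors for illustration only.
toy_vocab = ["ct", "mri", "segmentation"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
top = nearest_terms([1.0, 0.0], toy_vocab, toy_embeddings, k=2)
print([term for term, _ in top])
```

The function accepts plain lists or a NumPy array for `vocab_embeddings`. With NumPy, the same scores can be computed in one step as `vocab_embeddings @ query_vec`, which is the practical choice for a vocabulary of a few thousand terms.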
Quality Assurance¶
Validation Rules¶
- Required fields: name, description, url, license
- Valid URLs: Well-formed HTTP/HTTPS URLs
- Standard licenses: SPDX identifiers preferred
- Consistent keywords: Use standard terminology
- Demo URLs: Verify demos are live and accessible
Automated Checks¶
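import json


def validate_catalog(catalog_path):
    """Return a list of human-readable errors found in the JSONL catalog."""
    errors = []
    with open(catalog_path) as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # Blank lines are skipped, as in load_catalog
            try:
                tool = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: JSON syntax error - {e}")
                continue
            if not isinstance(tool, dict):
                errors.append(f"Line {i}: Expected a JSON object")
                continue
            # Required fields (per the validation rules above)
            for field in ['name', 'description', 'url', 'license']:
                if field not in tool:
                    errors.append(f"Line {i}: Missing {field}")
            # URL validation (a missing url is already reported above)
            url = tool.get('url')
            if url is not None and not str(url).startswith(('http://', 'https://')):
                errors.append(f"Line {i}: Invalid URL")
            # supportingData structure
            sd = tool.get('supportingData')
            demo_url = sd.get('demo_url') if isinstance(sd, dict) else None
            if demo_url and not str(demo_url).startswith(('http://', 'https://')):
                errors.append(f"Line {i}: Invalid demo_url")
    return errors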
Best Practices¶
Tool Descriptions¶
✅ Good:
❌ Bad:
"A tool" # Too vague
"The best segmentation tool ever created with amazing accuracy..." # Too marketing-y
Keywords¶
✅ Good:
❌ Bad:
Demo URLs¶
✅ Preferred:
- HuggingFace Gradio Spaces
- Google Colab notebooks
- Live web demos
❌ Avoid:
- Dead links
- Paywalled demos
- Demos requiring registration
Next Steps¶
- Return to Architecture Overview
- Learn about Retrieval Pipeline
- Explore Agent & VLM Selection