Agent & VLM Selection¶
The second stage of the pipeline uses a vision-language model (VLM) with PydanticAI agents to select and rank the best tools from candidates.
Overview¶
Goal: Select the most relevant tools using vision + text understanding
Characteristics:
- ๐ง Intelligent reasoning with explanations
- ๐๏ธ Vision-aware (analyzes image content)
- ๐ฏ Comparative ranking of candidates
- ๐ฌ Conversational with context
- ๐ Structured output (Pydantic schemas)
Architecture¶
graph TB
A[User Message + Files] --> B[PydanticAI Agent]
B --> C{Agent Router}
C -->|Tool Call| D[Agent Tools]
C -->|LLM Reasoning| E[GPT-4o/4o-mini]
D --> B
E --> F[ToolSelection Schema]
F --> G[Structured Response]
G --> B
B --> H[Formatted Reply]
PydanticAI Agent¶
Agent Framework¶
Framework: PydanticAI
Benefits:
- Type-safe with Pydantic models
- Structured output validation
- Built-in tool support
- Async/await support
- Easy testing with dependency injection
Agent Definition¶
The agent model is determined by config.yaml (agent_model section). When a base_url is provided, OpenAIChatModel (chat/completions API) is used; otherwise OpenAIResponsesModel (Responses API) is used:
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel, OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from ai_agent.generator.prompts import get_agent_system_prompt
from ai_agent.agent.utils import AgentState
# Custom endpoint (e.g. EPFL) โ OpenAIChatModel
provider = OpenAIProvider(base_url="https://inference-rcp.epfl.ch/v1", api_key=api_key)
openai_model = OpenAIChatModel(model_name="openai/gpt-oss-120b", provider=provider)
# Default OpenAI endpoint โ OpenAIResponsesModel
# provider = OpenAIProvider(api_key=api_key)
# openai_model = OpenAIResponsesModel(model_name="gpt-4o-mini", provider=provider)
agent = Agent(
model=openai_model,
system_prompt=get_agent_system_prompt(num_choices=3),
deps_type=AgentState,
output_retries=3, # configurable via AGENT_OUTPUT_RETRIES
)
Key parameters:
model: VLM model to use (configurable viaconfig.yaml)system_prompt: Agent role, scoring rules, and output format (fromgenerator/prompts.py)deps_type:AgentStateโ tracks tool calls, quotas, and session overridesoutput_retries: Number of times the agent retries if output validation fails (envAGENT_OUTPUT_RETRIES, default3)output_type:ToolSelectionโ passed toagent.run_sync()to enforce structured JSON output
Agent cache
A bounded LRU cache of agent instances (max size AGENT_CACHE_MAX, default 16) avoids rebuilding the provider/model objects on every request when the same custom endpoint + model combination is used repeatedly.
Conversation State¶
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any, Set
class AgentState(BaseModel):
"""Holds incremental tool call logs and runtime overrides."""
tool_calls: List[Dict[str, Any]] = Field(default_factory=list)
tool_counts: Dict[str, int] = Field(default_factory=dict)
disabled_tools: Set[str] = Field(default_factory=set)
excluded_tools: List[str] = Field(default_factory=list)
# Runtime overrides (session-only)
override_model: Optional[str] = None
override_base_url: Optional[str] = None
override_top_k: Optional[int] = None
override_num_choices: Optional[int] = None
image_paths: List[str] = Field(default_factory=list)
original_formats: List[str] = Field(default_factory=list)
Passed to every tool call via dependency injection. Also carries per-tool call counts for quota enforcement.
Agent Tools¶
The agent has three registered tools, each with a hard per-run call limit enforced by the @limit_tool_calls decorator:
| Tool | Cap | Description |
|---|---|---|
search_tools |
1 | Initial semantic search โ called exactly once per run |
search_alternative |
3 | Alternative query formulation for broader/different search |
repo_info_batch |
4 | Batch GitHub repository summary lookup |
search_tools¶
Initial tool retrieval โ called exactly once per agent run:
@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("search_tools", cap=1)
async def search_tools(
ctx: RunContext[AgentState],
query: str,
excluded: List[str] | None = None,
top_k: int = 12,
) -> List[dict]:
"""Initial semantic search with automatic reranking."""
...
Automatically injects excluded_tools, image_paths, and original_formats from AgentState, so the LLM does not reason about file paths.
search_alternative¶
Try a different query formulation (up to 3 times per run):
@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("search_alternative", cap=3)
async def search_alternative(
ctx: RunContext[AgentState],
alternative_query: str,
excluded: List[str] | None = None,
top_k: int = 12,
) -> List[dict]:
"""Search with an alternative query formulation (includes automatic reranking)."""
...
Example:
User: Show me alternatives
Agent: [Calls search_alternative with "pulmonary airway segmentation CT volume"]
repo_info_batch¶
Fetch GitHub repository summaries for multiple repositories in parallel (up to 4 calls per run):
@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("repo_info_batch", cap=4)
async def repo_info_batch(
ctx: RunContext[AgentState],
urls: List[str],
) -> List[dict]:
"""Fetch repository summaries for multiple repositories in parallel."""
...
- Accepts a list of GitHub URLs; non-GitHub URLs are skipped with a
NON_GITHUB_URLreason - Deduplicates URLs before fetching
- Uses
asyncio.gather()for parallel fetching - Falls back gracefully if a single repo fetch fails
Data sources (tried in order):
- DeepWiki MCP: Pre-indexed repository documentation โ fast, no rate limits
- Repocards: Direct library-based fetch โ fallback for repos not yet in DeepWiki
Example:
Agent: [Calls repo_info_batch(["https://github.com/wasserth/TotalSegmentator",
"https://github.com/MIC-DKFZ/nnUNet"])]
Selection and Ranking¶
The PydanticAI agent performs tool selection and ranking directly as part of its LLM reasoning step. There is no separate VLMToolSelector class โ the agent's system prompt (defined in generator/prompts.py) encodes the scoring rules, and the ToolSelection Pydantic schema (defined in generator/schema.py) enforces structured output.
System Prompt¶
The agent system prompt is assembled by get_agent_system_prompt() in generator/prompts.py and covers:
- Image analysis: Instructions to analyze the attached preview image and reference visual observations in explanations
- Tool call sequence: When to call
search_tools,search_alternative,repo_info_batch - Scoring rules: Accuracy (0โ100) = Task match (40) + Format compatibility (30) + Features (30)
- Output format: Single JSON object matching the
ToolSelectionschema
from ai_agent.generator.prompts import get_agent_system_prompt
# Generates a prompt that instructs the agent to return up to N ranked choices
system_prompt = get_agent_system_prompt(num_choices=3)
Selection Process¶
Step 1: Tool Calls (Retrieval)¶
The agent calls search_tools exactly once (and optionally search_alternative up to 3 times) to retrieve candidate tools from the vector index:
Agent โ search_tools(query="segment lungs", top_k=12)
โ [TotalSegmentator, MedSAM, nnU-Net, ...]
Step 2: Verification¶
For finalists the agent plans to recommend, it calls repo_info_batch in a single batch call:
Agent โ repo_info_batch(urls=["https://github.com/wasserth/TotalSegmentator", ...])
โ [{stars: 1200, language: "Python", topics: [...], description: "..."}, ...]
Step 3: Structured Output¶
The agent returns one JSON object (no prose) that is validated against the ToolSelection schema:
run_result = agent_instance.run_sync(
user_prompt, # text + optional BinaryContent image
deps=deps, # AgentState with image_paths, excluded_tools, etc.
output_type=ToolSelection,
usage_limits=UsageLimits(tool_calls_limit=20),
)
result = run_result.output # ToolSelection instance
Multimodal input:
- Text: User task + hidden metadata (format hints, image dimensions)
- Image: PNG preview bytes passed as
BinaryContent(data=image_bytes, media_type="image/png") - Context: Conversation history prepended to the prompt
Structured Response Schema¶
The ToolSelection Pydantic model (in generator/schema.py) validates the agent output:
from ai_agent.generator.schema import (
ToolSelection, ToolChoice, Conversation,
ConversationStatus, NoToolReason
)
class ToolChoice(BaseModel):
name: str
rank: int
accuracy: float # 0-100
why: str
demo_link: Optional[str] = None
class Conversation(BaseModel):
status: ConversationStatus
question: Optional[str] = None # required if status=needs_clarification
context: Optional[str] = None # required if status=needs_clarification
options: Optional[List[str]] = None
class ToolSelection(BaseModel):
conversation: Conversation
choices: List[ToolChoice] = []
explanation: Optional[str] = None
reason: Optional[NoToolReason] = None
Example response (ToolSelection):
{
"conversation": {"status": "complete", "question": null, "context": null},
"choices": [
{
"rank": 1,
"name": "TotalSegmentator",
"accuracy": 95.0,
"why": "Specifically designed for automated multi-organ CT segmentation...",
"demo_link": "https://huggingface.co/spaces/..."
},
{
"rank": 2,
"name": "MedSAM",
"accuracy": 85.0,
"why": "Flexible SAM-based segmentation supporting DICOM input...",
"demo_link": "https://huggingface.co/spaces/..."
}
],
"explanation": null,
"reason": null
}
Validation¶
Pydantic validates:
- All required fields present
- Types correct (
int,float,str, enum) accuracywithin 0โ100 rangeConversationStatusis one of the allowed enum valuesNoToolReasonis a valid enum value whenchoicesis empty
ToolSelection.normalize() also enforces consistency rules automatically (e.g. setting status=complete when choices are returned, status=needs_clarification when a question is present).
Conversation States¶
State machine for conversation flow:
from ai_agent.generator.schema import ConversationStatus
class ConversationStatus(str, Enum):
COMPLETE = "complete" # Recommendations provided (or no tool found)
NEEDS_CLARIFICATION = "needs_clarification" # Agent needs more info
Complete¶
Normal successful response:
{
"conversation": {"status": "complete", "question": null, "context": null},
"choices": [...],
"explanation": null,
"reason": null
}
Triggers:
- Query is clear
- Candidates found
- Image/metadata sufficient
Needs Clarification¶
Agent requests more information:
{
"conversation": {
"status": "needs_clarification",
"question": "Which specific organ would you like to segment?",
"context": "Several segmentation tools available; target organ narrows choices.",
"options": ["Lungs", "Brain", "Liver", "Other (briefly specify)"]
},
"choices": [],
"explanation": null,
"reason": null
}
Triggers: - Ambiguous query - Multiple valid interpretations - Missing critical information
Example flow:
User: Segment this MRI
Agent: [STATUS: needs_clarification] Which organ would you like to segment?
User: The brain
Agent: [STATUS: complete] Here are brain segmentation tools...
No Tool Terminal¶
No suitable tools in catalog โ status is still complete, but choices is empty and a reason + explanation are provided:
{
"conversation": {"status": "complete", "question": null, "context": null},
"choices": [],
"reason": "no_task_match",
"explanation": "No tools in the catalog handle audio processing. This catalog covers imaging analysis software."
}
Available NoToolReason values: no_suitable_tool, no_modality_match, no_task_match, no_dimension_match, invalid_files.
Ranking Logic¶
Scoring Factors¶
The agent considers:
High Priority¶
- Task Match: Tool designed for this specific task
- Format Compatibility: Supports user's file format
- Visual Analysis: Image content matches tool's domain
Medium Priority¶
- Modality Alignment: CT tool for CT image, MRI for MRI
- Dimension Match: 3D tool for 3D volume
- Feature Coverage: Specific capabilities mentioned
Low Priority¶
- License: Open-source preference (if no preference stated)
- Demo Availability: Has runnable demo
- Popularity: Community adoption
Explanation Generation¶
Each recommendation includes explanation:
Good explanation template:
{Tool} is {specifically designed / well-suited} for {task}
on {modality} images. It supports {format} input {with/without}
preprocessing and provides {key features}. {Caveats if any}.
Example:
TotalSegmentator is specifically designed for automated multi-organ
segmentation on CT scans. It supports DICOM input without preprocessing
and can segment 104 anatomical structures including lungs, air airways,
and vessels. It works best on whole-body CT but also performs well on
thoracic scans.
Rank Assignment¶
- Rank 1: Best overall match (highest accuracy score)
- Rank 2: Strong alternative or different approach
- Rank 3: Fallback option or specialized capability
Important: Ranks are relative to this specific query, not absolute tool quality.
Model Configuration¶
Model Selection¶
Available via config.yaml:
Model Comparison¶
| Model | Vision | Speed | Cost | Best For |
|---|---|---|---|---|
| gpt-4o-mini | โ | โกโกโก | $ | Most queries, fast iteration |
| gpt-4o | โ โ | โกโก | $$ | Complex visual analysis |
| gpt-5.1 | โ โ โ | โก | $$$ | Maximum accuracy needed |
Custom Endpoints¶
Support for OpenAI-compatible APIs:
agent_model:
name: "llama-3.2-vision"
base_url: "https://inference.epfl.ch/v1"
api_key_env: "EPFL_API_KEY"
Error Handling¶
Agent Errors¶
Tool quota exceeded (handled gracefully in run_agent):
except UsageLimitExceeded:
# Returns a ToolSelection with empty choices and an explanation
result = ToolSelection(
conversation=Conversation(status=ConversationStatus.COMPLETE, ...),
choices=[],
explanation="Tool call limit reached. Try a more specific query.",
)
Invalid structured output:
PydanticAI automatically retries the LLM call (up to retries=2 per tool) if the model returns output that fails ToolSelection validation. The ToolSelection.normalize() model validator also auto-corrects minor inconsistencies.
API Errors:
except Exception as e:
log.warning(f"Agent execution encountered an error: {e}")
raise # propagated to the UI layer
Graceful Degradation¶
If the agent fails after all retries:
- Return empty
choiceswith anexplanationdescribing what was searched - UI surfaces the explanation so users can refine their query
- Suggest manual exploration of the catalog
<!-- ## Performance
Latency¶
Typical VLM call: 2-5 seconds
Breakdown:
- Prompt construction: <100ms
- API call: 2-4s (network + inference)
- Response parsing: <100ms
- Validation: <50ms
Optimization¶
Prompt optimization:
- Concise candidate descriptions
- Limit to top-8 candidates
- Structured format for parsing
Caching:
- Model endpoint reused
- Agent instance persists across requests
Batch processing (for testing):
# Process multiple queries
responses = await asyncio.gather(*[
agent.run(query1),
agent.run(query2),
agent.run(query3)
])
``` -->
## Testing
### Unit Tests
Test agent selection with PydanticAI's built-in test model (your catalog should contain the choice provided below, i.e. the `TotalSegmentator` tool):
```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from ai_agent.generator.schema import ToolSelection, Conversation, ConversationStatus, ToolChoice
from ai_agent.agent.utils import AgentState
def test_agent_selection():
test_model = TestModel()
test_agent = Agent(model=test_model, deps_type=AgentState)
mock_output = ToolSelection(
conversation=Conversation(status=ConversationStatus.COMPLETE),
choices=[
ToolChoice(name="TotalSegmentator", rank=1, accuracy=95.0, why="Best CT segmenter")
]
)
with test_agent.override(model=test_model):
result = test_agent.run_sync("segment lungs", deps=AgentState(), output_type=ToolSelection)
assert result.output.conversation.status == ConversationStatus.COMPLETE
assert len(result.output.choices) == 1
assert result.output.choices[0].rank == 1
Integration Tests¶
Test with real VLM (expensive, slow):
@pytest.mark.integration
def test_real_agent():
from ai_agent.agent.agent import run_agent
with open("tests/data/sample.tif", "rb") as f:
image_bytes = f.read()
result = run_agent(
task="I want to segment the lungs of this CT scan",
image_paths=["tests/data/sample.tif"],
image_bytes=image_bytes,
)
assert result.conversation.status == ConversationStatus.COMPLETE
assert len(result.choices) > 0
assert all(0 <= c.accuracy <= 100 for c in result.choices)
Next Steps¶
- Learn about Software Catalog
- Return to Architecture Overview
- Explore Retrieval Pipeline