Agent & VLM Selection¶
The second stage of the pipeline uses a vision-language model (VLM) with PydanticAI agents to select and rank the best tools from candidates.
Overview¶
Goal: Select the most relevant tools using vision + text understanding
Characteristics:
- Intelligent reasoning with explanations
- Vision-aware (analyzes image content)
- Comparative ranking of candidates
- Conversational with context
- Structured output (Pydantic schemas)
Architecture¶
```mermaid
graph TB
    A[User Message + Files] --> B[PydanticAI Agent]
    B --> C{Agent Router}
    C -->|Tool Call| D[Agent Tools]
    C -->|LLM Reasoning| E[GPT-4o/4o-mini]
    D --> B
    E --> F[ToolSelection Schema]
    F --> G[Structured Response]
    G --> B
    B --> H[Formatted Reply]
```
PydanticAI Agent¶
Agent Framework¶
Framework: PydanticAI
Benefits:
- Type-safe with Pydantic models
- Structured output validation
- Built-in tool support
- Async/await support
- Easy testing with dependency injection
Agent Definition¶
```python
import os

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

from ai_agent.generator.prompts import get_agent_system_prompt
from ai_agent.generator.schema import ToolSelection
from ai_agent.agent.utils import AgentState

provider = OpenAIProvider(api_key=os.getenv("OPENAI_API_KEY"))
openai_model = OpenAIResponsesModel(model_name="gpt-4o-mini", provider=provider)

agent = Agent(
    model=openai_model,
    system_prompt=get_agent_system_prompt(num_choices=3),
    deps_type=AgentState,
)
```
Key parameters:
- `model`: VLM model to use (configurable via `config.yaml`)
- `system_prompt`: Agent role, scoring rules, and output format (from `generator/prompts.py`)
- `deps_type`: `AgentState` tracks tool calls, quotas, and session overrides
- `output_type`: `ToolSelection` is passed to `agent.run_sync()` to enforce structured JSON output (it is not set on the `Agent` constructor)
Conversation State¶
```python
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any, Set

class AgentState(BaseModel):
    """Holds incremental tool call logs and runtime overrides."""

    tool_calls: List[Dict[str, Any]] = Field(default_factory=list)
    tool_counts: Dict[str, int] = Field(default_factory=dict)
    disabled_tools: Set[str] = Field(default_factory=set)
    excluded_tools: List[str] = Field(default_factory=list)

    # Runtime overrides (session-only)
    override_model: Optional[str] = None
    override_base_url: Optional[str] = None
    override_top_k: Optional[int] = None
    override_num_choices: Optional[int] = None

    image_paths: List[str] = Field(default_factory=list)
    original_formats: List[str] = Field(default_factory=list)
```
Passed to every tool call via dependency injection. Also carries per-tool call counts for quota enforcement.
Agent Tools¶
Tools extend agent capabilities beyond chat:
search_alternative¶
Request alternative search with different query formulation:
```python
@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("search_alternative", cap=3)
async def search_alternative(
    ctx: RunContext[AgentState],
    alternative_query: str,
    excluded: List[str] | None = None,
    top_k: int = 12,
) -> List[dict]:
    """Search for tools using an alternative query formulation."""
    inp = SearchAlternativeInput(
        alternative_query=alternative_query,
        excluded=excluded or [],
        top_k=top_k,
        original_formats=ctx.deps.original_formats,
        image_paths=ctx.deps.image_paths,
    )
    out = tool_search_alternative(inp)
    return [c.model_dump(mode="python") for c in out.candidates]
```
Usage:
- Agent invokes when user asks for alternatives
- Up to 3 calls per conversation
- Formulates semantically different queries
Example:
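A plausible interaction (illustrative, not captured from a real session):

```
User: Are there alternatives to TotalSegmentator?
Agent: [Calls search_alternative("organ segmentation CT deep learning",
        excluded=["TotalSegmentator"])]
       Two strong alternatives: MedSAM and nnU-Net...
```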
repo_info¶
Fetch GitHub repository details:
```python
@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("repo_info", cap=12)
async def repo_info(ctx: RunContext[AgentState], url: str, tool_name: str | None = None) -> dict:
    """Fetch a short summary of a GitHub repository."""
    # Normalize to canonical GitHub URL
    norm_url = coerce_github_url_or_none(url)
    # Call tool_repo_summary (tries DeepWiki MCP first, falls back to repocards)
    out = await tool_repo_summary(RepoSummaryInput(url=norm_url, tool_name=tool_name))
    return out.model_dump(mode="python")
```
Data sources:
- DeepWiki MCP: Pre-indexed, fast, no rate limits
- Repocards: Direct fetch, fallback for new repos
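The try-first/fall-back pattern can be sketched as follows; `deepwiki_summary`, `repocards_summary`, and `fetch_repo_summary` are illustrative names, not the project's actual helpers:

```python
import asyncio

async def deepwiki_summary(url: str) -> dict:
    # Illustrative stand-in for the DeepWiki MCP lookup; here we simulate
    # a repository that is not yet indexed.
    raise LookupError("repository not indexed")

async def repocards_summary(url: str) -> dict:
    # Illustrative stand-in for the direct repocards fetch.
    return {"url": url, "source": "repocards"}

async def fetch_repo_summary(url: str) -> dict:
    # Try the fast pre-indexed source first, then fall back to a direct fetch.
    try:
        return await deepwiki_summary(url)
    except Exception:
        return await repocards_summary(url)
```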
Returns:
- Repository description
- Stars, language, topics
- Last update date
- License information
Example:
```
User: Tell me about TotalSegmentator
Agent: [Calls repo_info("https://github.com/wasserth/TotalSegmentator")]

TotalSegmentator is an automated multi-organ segmentation tool...
⭐ 1.2k stars | Python | Apache-2.0 license
Topics: segmentation, medical-imaging, deep-learning
```
run_example¶
Execute Gradio Space demos (optional, experimental):
```python
@agent.tool(retries=0, prepare=cap_prepare)
@limit_tool_calls("run_example", cap=1)
async def run_example(
    ctx: RunContext[AgentState],
    tool_name: str,
    endpoint_url: str | None = None,
    extra_text: str | None = None,
) -> dict:
    """Run an example / demo for a given tool via its Gradio space."""
    out = tool_run_example(RunExampleInput(
        tool_name=tool_name,
        endpoint_url=endpoint_url,
        extra_text=extra_text,
    ))
    return out.model_dump(mode="python")
```
Status: Partially implemented, limited to specific demo formats.
Selection and Ranking¶
The PydanticAI agent performs tool selection and ranking directly as part of its LLM reasoning step. There is no separate `VLMToolSelector` class; the agent's system prompt (defined in `generator/prompts.py`) encodes the scoring rules, and the `ToolSelection` Pydantic schema (defined in `generator/schema.py`) enforces structured output.
System Prompt¶
The agent system prompt is assembled by get_agent_system_prompt() in generator/prompts.py and covers:
- Image analysis: Instructions to analyze the attached preview image and reference visual observations in explanations
- Tool call sequence: When to call `search_tools`, `search_alternative`, `repo_info`, and `run_example`
- Scoring rules: Accuracy (0-100) = Task match (40) + Format compatibility (30) + Features (30)
- Output format: Single JSON object matching the `ToolSelection` schema
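The additive rubric can be made concrete with a small helper. This is purely illustrative: the real scoring happens inside the LLM's reasoning, guided by the prompt, not in Python:

```python
def accuracy_score(task_match: float, format_compat: float, features: float) -> float:
    """Combine three sub-scores (each in [0, 1]) into a 0-100 accuracy value
    using the weights from the system prompt: 40 + 30 + 30."""
    score = 40 * task_match + 30 * format_compat + 30 * features
    return round(score, 1)

# A tool that perfectly matches the task, fully supports the format,
# and covers most requested features:
accuracy_score(1.0, 1.0, 0.8)  # -> 94.0
```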
```python
from ai_agent.generator.prompts import get_agent_system_prompt

# Generates a prompt that instructs the agent to return up to N ranked choices
system_prompt = get_agent_system_prompt(num_choices=3)
```
Selection Process¶
Step 1: Tool Calls (Retrieval)¶
The agent calls search_tools once (and optionally search_alternative up to 3 times) to retrieve candidate tools from the vector index:
```
Agent → search_tools(query="segment lungs", top_k=12)
      → [TotalSegmentator, MedSAM, nnU-Net, ...]
```
Step 2: Verification¶
For each finalist the agent plans to recommend, it calls repo_info to fetch up-to-date GitHub metadata:
```
Agent → repo_info(url="https://github.com/wasserth/TotalSegmentator")
      → {stars: 1200, language: "Python", topics: [...], description: "..."}
```
Step 3: Structured Output¶
The agent returns one JSON object (no prose) that is validated against the ToolSelection schema:
```python
run_result = agent_instance.run_sync(
    user_prompt,  # text + optional BinaryContent image
    deps=deps,    # AgentState with image_paths, excluded_tools, etc.
    output_type=ToolSelection,
    usage_limits=UsageLimits(tool_calls_limit=20),
)
result = run_result.output  # ToolSelection instance
```
Multimodal input:
- Text: User task + hidden metadata (format hints, image dimensions)
- Image: PNG preview bytes passed as `BinaryContent(data=image_bytes, media_type="image/png")`
- Context: Conversation history prepended to the prompt
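For reference, OpenAI-compatible APIs ultimately receive such an image as a base64 data URL content part. A minimal sketch of that encoding (PydanticAI's `BinaryContent` handles this for you; `to_image_part` is an illustrative helper, not part of the project):

```python
import base64

def to_image_part(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw preview bytes as an OpenAI-style image content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{b64}"},
    }
```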
Structured Response Schema¶
The ToolSelection Pydantic model (in generator/schema.py) validates the agent output:
```python
from ai_agent.generator.schema import (
    ToolSelection, ToolChoice, Conversation,
    ConversationStatus, NoToolReason,
)

class ToolChoice(BaseModel):
    name: str
    rank: int
    accuracy: float  # 0-100
    why: str
    demo_link: Optional[str] = None

class Conversation(BaseModel):
    status: ConversationStatus
    question: Optional[str] = None  # required if status=needs_clarification
    context: Optional[str] = None   # required if status=needs_clarification
    options: Optional[List[str]] = None

class ToolSelection(BaseModel):
    conversation: Conversation
    choices: List[ToolChoice] = []
    explanation: Optional[str] = None
    reason: Optional[NoToolReason] = None
```
Example response (ToolSelection):
```json
{
  "conversation": {"status": "complete", "question": null, "context": null},
  "choices": [
    {
      "rank": 1,
      "name": "TotalSegmentator",
      "accuracy": 95.0,
      "why": "Specifically designed for automated multi-organ CT segmentation...",
      "demo_link": "https://huggingface.co/spaces/..."
    },
    {
      "rank": 2,
      "name": "MedSAM",
      "accuracy": 85.0,
      "why": "Flexible SAM-based segmentation supporting DICOM input...",
      "demo_link": "https://huggingface.co/spaces/..."
    }
  ],
  "explanation": null,
  "reason": null
}
```
Validation¶
Pydantic validates:
- All required fields present
- Types correct (`int`, `float`, `str`, enum)
- `accuracy` within the 0-100 range
- `ConversationStatus` is one of the allowed enum values
- `NoToolReason` is a valid enum value when `choices` is empty
ToolSelection.normalize() also enforces consistency rules automatically (e.g. setting status=complete when choices are returned, status=needs_clarification when a question is present).
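Those consistency rules can be sketched on plain dicts; the real `normalize()` is a Pydantic model validator, so treat this as an illustration only:

```python
def normalize(selection: dict) -> dict:
    """Apply the consistency rules described above to a raw selection dict."""
    conversation = selection["conversation"]
    if selection.get("choices"):
        # Choices were returned -> the conversation is complete
        conversation["status"] = "complete"
    elif conversation.get("question"):
        # A question with no choices -> the agent needs clarification
        conversation["status"] = "needs_clarification"
    return selection
```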
Conversation States¶
State machine for conversation flow:
```python
from enum import Enum

from ai_agent.generator.schema import ConversationStatus

class ConversationStatus(str, Enum):
    COMPLETE = "complete"                        # Recommendations provided (or no tool found)
    NEEDS_CLARIFICATION = "needs_clarification"  # Agent needs more info
```
Complete¶
Normal successful response:
```json
{
  "conversation": {"status": "complete", "question": null, "context": null},
  "choices": [...],
  "explanation": null,
  "reason": null
}
```
Triggers:
- Query is clear
- Candidates found
- Image/metadata sufficient
Needs Clarification¶
Agent requests more information:
```json
{
  "conversation": {
    "status": "needs_clarification",
    "question": "Which specific organ would you like to segment?",
    "context": "Several segmentation tools available; target organ narrows choices.",
    "options": ["Lungs", "Brain", "Liver", "Other (briefly specify)"]
  },
  "choices": [],
  "explanation": null,
  "reason": null
}
```
Triggers:
- Ambiguous query
- Multiple valid interpretations
- Missing critical information
Example flow:
```
User: Segment this MRI
Agent: [STATUS: needs_clarification] Which organ would you like to segment?
User: The brain
Agent: [STATUS: complete] Here are brain segmentation tools...
```
No Tool Terminal¶
When no suitable tools exist in the catalog, status is still `complete`, but `choices` is empty and a `reason` + `explanation` are provided:
```json
{
  "conversation": {"status": "complete", "question": null, "context": null},
  "choices": [],
  "reason": "no_task_match",
  "explanation": "No tools in the catalog handle audio processing. This catalog covers imaging analysis software."
}
```
Available NoToolReason values: no_suitable_tool, no_modality_match, no_task_match, no_dimension_match, invalid_files.
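Given those string values, the enum plausibly looks like the following; the member names here are inferred from the values, not copied from `schema.py`:

```python
from enum import Enum

class NoToolReason(str, Enum):
    # Member names inferred from the documented string values
    NO_SUITABLE_TOOL = "no_suitable_tool"
    NO_MODALITY_MATCH = "no_modality_match"
    NO_TASK_MATCH = "no_task_match"
    NO_DIMENSION_MATCH = "no_dimension_match"
    INVALID_FILES = "invalid_files"
```

Subclassing `str` means the value serializes directly into the JSON response shown above.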
Ranking Logic¶
Scoring Factors¶
The agent considers:
High Priority¶
- Task Match: Tool designed for this specific task
- Format Compatibility: Supports user's file format
- Visual Analysis: Image content matches tool's domain
Medium Priority¶
- Modality Alignment: CT tool for CT image, MRI for MRI
- Dimension Match: 3D tool for 3D volume
- Feature Coverage: Specific capabilities mentioned
Low Priority¶
- License: Open-source preference (if no preference stated)
- Demo Availability: Has runnable demo
- Popularity: Community adoption
Explanation Generation¶
Each recommendation includes explanation:
Good explanation template:
```
{Tool} is {specifically designed / well-suited} for {task}
on {modality} images. It supports {format} input {with/without}
preprocessing and provides {key features}. {Caveats if any}.
```
Example:
```
TotalSegmentator is specifically designed for automated multi-organ
segmentation on CT scans. It supports DICOM input without preprocessing
and can segment 104 anatomical structures including lungs, airways,
and vessels. It works best on whole-body CT but also performs well on
thoracic scans.
```
Rank Assignment¶
- Rank 1: Best overall match (highest accuracy score)
- Rank 2: Strong alternative or different approach
- Rank 3: Fallback option or specialized capability
Important: Ranks are relative to this specific query, not absolute tool quality.
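Mechanically, rank assignment is just ordering by descending accuracy; a sketch over plain dicts (illustrative, since the agent emits ranks directly in its structured output):

```python
def assign_ranks(choices: list[dict]) -> list[dict]:
    """Order candidates by accuracy (highest first) and assign 1-based ranks."""
    ranked = sorted(choices, key=lambda c: c["accuracy"], reverse=True)
    for i, choice in enumerate(ranked, start=1):
        choice["rank"] = i
    return ranked

assign_ranks([
    {"name": "MedSAM", "accuracy": 85.0},
    {"name": "TotalSegmentator", "accuracy": 95.0},
])
# -> TotalSegmentator gets rank 1, MedSAM gets rank 2
```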
Model Configuration¶
Model Selection¶
Available via config.yaml:
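A hypothetical entry might look like this; the key names mirror the Custom Endpoints example in this section, but the exact schema is defined by the project's config loader:

```yaml
agent_model:
  name: "gpt-4o-mini"           # any model from the comparison table
  api_key_env: "OPENAI_API_KEY" # environment variable holding the key
```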
Model Comparison¶
| Model | Vision | Speed | Cost | Best For |
|---|---|---|---|---|
| gpt-4o-mini | ✅ | ⚡⚡⚡ | $ | Most queries, fast iteration |
| gpt-4o | ✅✅ | ⚡⚡ | $$ | Complex visual analysis |
| gpt-5.1 | ✅✅✅ | ⚡ | $$$ | Maximum accuracy needed |
Custom Endpoints¶
Support for OpenAI-compatible APIs:
```yaml
agent_model:
  name: "llama-3.2-vision"
  base_url: "https://inference.epfl.ch/v1"
  api_key_env: "EPFL_API_KEY"
```
Error Handling¶
Agent Errors¶
Tool quota exceeded (handled gracefully in run_agent):
```python
except UsageLimitExceeded:
    # Returns a ToolSelection with empty choices and an explanation
    result = ToolSelection(
        conversation=Conversation(status=ConversationStatus.COMPLETE, ...),
        choices=[],
        explanation="Tool call limit reached. Try a more specific query.",
    )
```
Invalid structured output:
PydanticAI automatically retries the LLM call (up to retries=2 per tool) if the model returns output that fails ToolSelection validation. The ToolSelection.normalize() model validator also auto-corrects minor inconsistencies.
API Errors:
```python
except Exception as e:
    log.warning(f"Agent execution encountered an error: {e}")
    raise  # propagated to the UI layer
```
Graceful Degradation¶
If the agent fails after all retries:
- Return empty `choices` with an `explanation` describing what was searched
- UI surfaces the explanation so users can refine their query
- Suggest manual exploration of the catalog
<!-- ## Performance
Latency¶
Typical VLM call: 2-5 seconds
Breakdown:
- Prompt construction: <100ms
- API call: 2-4s (network + inference)
- Response parsing: <100ms
- Validation: <50ms
Optimization¶
Prompt optimization:
- Concise candidate descriptions
- Limit to top-8 candidates
- Structured format for parsing
Caching:
- Model endpoint reused
- Agent instance persists across requests
Batch processing (for testing):
# Process multiple queries
responses = await asyncio.gather(*[
agent.run(query1),
agent.run(query2),
agent.run(query3)
])
``` -->
Testing¶
Unit Tests¶
Test agent selection with PydanticAI's built-in test model (no real LLM call is made; the canned selection reuses the TotalSegmentator example from earlier sections):
```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

from ai_agent.generator.schema import ToolSelection, Conversation, ConversationStatus, ToolChoice
from ai_agent.agent.utils import AgentState

def test_agent_selection():
    mock_output = ToolSelection(
        conversation=Conversation(status=ConversationStatus.COMPLETE),
        choices=[
            ToolChoice(name="TotalSegmentator", rank=1, accuracy=95.0, why="Best CT segmenter")
        ],
    )
    # TestModel short-circuits the LLM; feed it the canned selection
    test_model = TestModel(custom_output_args=mock_output.model_dump(mode="json"))
    test_agent = Agent(model=test_model, deps_type=AgentState)

    result = test_agent.run_sync("segment lungs", deps=AgentState(), output_type=ToolSelection)
    assert result.output.conversation.status == ConversationStatus.COMPLETE
    assert len(result.output.choices) == 1
    assert result.output.choices[0].rank == 1
```
Integration Tests¶
Test with real VLM (expensive, slow):
```python
import pytest

from ai_agent.generator.schema import ConversationStatus

@pytest.mark.integration
def test_real_agent():
    from ai_agent.agent.agent import run_agent

    with open("tests/data/sample.tif", "rb") as f:
        image_bytes = f.read()

    result = run_agent(
        task="I want to segment the lungs of this CT scan",
        image_paths=["tests/data/sample.tif"],
        image_bytes=image_bytes,
    )
    assert result.conversation.status == ConversationStatus.COMPLETE
    assert len(result.choices) > 0
    assert all(0 <= c.accuracy <= 100 for c in result.choices)
```
Next Steps¶
- Learn about Software Catalog
- Return to Architecture Overview
- Explore Retrieval Pipeline