Agent & VLM Selection¶

The second stage of the pipeline uses a vision-language model (VLM) with PydanticAI agents to select and rank the best tools from candidates.

Overview¶

Goal: Select the most relevant tools using vision + text understanding

Characteristics:

🧠 Intelligent reasoning with explanations
👁️ Vision-aware (analyzes image content)
🎯 Comparative ranking of candidates
💬 Conversational with context
📊 Structured output (Pydantic schemas)

Architecture¶

graph TB
    A[User Message + Files] --> B[PydanticAI Agent]
    B --> C{Agent Router}
    C -->|Tool Call| D[Agent Tools]
    C -->|LLM Reasoning| E[GPT-4o/4o-mini]
    D --> B
    E --> F[ToolSelection Schema]
    F --> G[Structured Response]
    G --> B
    B --> H[Formatted Reply]

PydanticAI Agent¶

Agent Framework¶

Framework: PydanticAI

Benefits:

Type-safe with Pydantic models
Structured output validation
Built-in tool support
Async/await support
Easy testing with dependency injection

Agent Definition¶

The agent model is determined by config.yaml (agent_model section). When a base_url is provided, OpenAIChatModel (chat/completions API) is used; otherwise OpenAIResponsesModel (Responses API) is used:

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel, OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from ai_agent.generator.prompts import get_agent_system_prompt
from ai_agent.agent.utils import AgentState

# Custom endpoint (e.g. EPFL) → OpenAIChatModel
provider = OpenAIProvider(base_url="https://inference-rcp.epfl.ch/v1", api_key=api_key)
openai_model = OpenAIChatModel(model_name="openai/gpt-oss-120b", provider=provider)

# Default OpenAI endpoint → OpenAIResponsesModel
# provider = OpenAIProvider(api_key=api_key)
# openai_model = OpenAIResponsesModel(model_name="gpt-4o-mini", provider=provider)

agent = Agent(
    model=openai_model,
    system_prompt=get_agent_system_prompt(num_choices=3),
    deps_type=AgentState,
    output_retries=3,  # configurable via AGENT_OUTPUT_RETRIES
)

Key parameters:

model: VLM model to use (configurable via config.yaml)
system_prompt: Agent role, scoring rules, and output format (from generator/prompts.py)
deps_type: AgentState — tracks tool calls, quotas, and session overrides
output_retries: Number of times the agent retries if output validation fails (env AGENT_OUTPUT_RETRIES, default 3)
output_type: ToolSelection — passed to agent.run_sync() to enforce structured JSON output

Agent cache

A bounded LRU cache of agent instances (max size AGENT_CACHE_MAX, default 16) avoids rebuilding the provider/model objects on every request when the same custom endpoint + model combination is used repeatedly.
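
A bounded LRU cache of this kind can be sketched with the standard library alone. The class name `BoundedLRUCache`, the `get_or_create` method, and the `(base_url, model_name)` key are illustrative assumptions, not the project's actual implementation:

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedLRUCache:
    """Keep at most `max_size` values, evicting the least recently used one."""

    def __init__(self, max_size: int = 16) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = factory()  # only built on a cache miss
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the least recently used entry
        return value

    def __len__(self) -> int:
        return len(self._items)


# Hypothetical usage: key on the endpoint + model combination.
cache = BoundedLRUCache(max_size=16)
agent = cache.get_or_create(("https://example.invalid/v1", "some-model"), lambda: object())
```

Keying on the endpoint and model name means a repeated request reuses the same object, while a new combination builds a fresh one and may push the oldest entry out.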

Conversation State¶

from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any, Set

class AgentState(BaseModel):
    """Holds incremental tool call logs and runtime overrides."""

    tool_calls: List[Dict[str, Any]] = Field(default_factory=list)
    tool_counts: Dict[str, int] = Field(default_factory=dict)
    disabled_tools: Set[str] = Field(default_factory=set)
    excluded_tools: List[str] = Field(default_factory=list)

    # Runtime overrides (session-only)
    override_model: Optional[str] = None
    override_base_url: Optional[str] = None
    override_top_k: Optional[int] = None
    override_num_choices: Optional[int] = None

    image_paths: List[str] = Field(default_factory=list)
    original_formats: List[str] = Field(default_factory=list)

Passed to every tool call via dependency injection. Also carries per-tool call counts for quota enforcement.

Agent Tools¶

The agent has three registered tools, each with a hard per-run call limit enforced by the @limit_tool_calls decorator:

Tool	Cap	Description
`search_tools`	1	Initial semantic search — called exactly once per run
`search_alternative`	3	Alternative query formulation for broader/different search
`repo_info_batch`	4	Batch GitHub repository summary lookup
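
How a per-run cap might be enforced is sketched below. This is not the project's `@limit_tool_calls` implementation: `State` and `Ctx` are simplified stand-ins for `AgentState` and `RunContext`, and the `TOOL_CAP_REACHED` payload is an assumed way of signalling an exhausted quota (the real decorator may raise or disable the tool instead).

```python
import asyncio
import functools
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict


@dataclass
class State:
    """Stand-in for AgentState; only the per-tool counter is modelled."""
    tool_counts: Dict[str, int] = field(default_factory=dict)


@dataclass
class Ctx:
    """Stand-in for RunContext[AgentState]."""
    deps: State


def limit_tool_calls(name: str, cap: int):
    """Refuse calls to `name` once it has run `cap` times in this agent run."""

    def decorator(func: Callable[..., Awaitable[Any]]):
        @functools.wraps(func)
        async def wrapper(ctx: Ctx, *args: Any, **kwargs: Any) -> Any:
            used = ctx.deps.tool_counts.get(name, 0)
            if used >= cap:
                # Assumed behaviour: report the exhausted quota back to the model.
                return {"error": "TOOL_CAP_REACHED", "tool": name, "cap": cap}
            ctx.deps.tool_counts[name] = used + 1
            return await func(ctx, *args, **kwargs)

        return wrapper

    return decorator


@limit_tool_calls("search_tools", cap=1)
async def search_tools(ctx: Ctx, query: str) -> list:
    return [{"name": "demo-tool", "query": query}]
```

Because the counter lives in the injected state rather than in the decorator, each agent run starts with a fresh quota.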

search_tools¶

Initial tool retrieval — called exactly once per agent run:

@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("search_tools", cap=1)
async def search_tools(
    ctx: RunContext[AgentState],
    query: str,
    excluded: List[str] | None = None,
    top_k: int = 12,
) -> List[dict]:
    """Initial semantic search with automatic reranking."""
    ...

The tool automatically injects excluded_tools, image_paths, and original_formats from AgentState, so the LLM never has to reason about file paths.
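
One plausible way to combine the LLM-supplied `excluded` argument with the exclusions held in the state is an order-preserving union. The helper name `merge_excluded` and the merge semantics are assumptions made for illustration; the source does not specify how the two lists are reconciled.

```python
from typing import Iterable, List, Optional


def merge_excluded(from_llm: Optional[Iterable[str]], from_state: Iterable[str]) -> List[str]:
    """Union of both exclusion lists, de-duplicated, state entries first."""
    merged: List[str] = []
    seen = set()
    for name in [*from_state, *(from_llm or [])]:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```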

search_alternative¶

Try a different query formulation (up to 3 times per run):

@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("search_alternative", cap=3)
async def search_alternative(
    ctx: RunContext[AgentState],
    alternative_query: str,
    excluded: List[str] | None = None,
    top_k: int = 12,
) -> List[dict]:
    """Search with an alternative query formulation (includes automatic reranking)."""
    ...

Example:

User: Show me alternatives
Agent: [Calls search_alternative with "pulmonary airway segmentation CT volume"]

repo_info_batch¶

Fetch GitHub repository summaries for multiple repositories in parallel (up to 4 calls per run):

@agent.tool(retries=2, prepare=cap_prepare)
@limit_tool_calls("repo_info_batch", cap=4)
async def repo_info_batch(
    ctx: RunContext[AgentState],
    urls: List[str],
) -> List[dict]:
    """Fetch repository summaries for multiple repositories in parallel."""
    ...

Accepts a list of GitHub URLs; non-GitHub URLs are skipped with a NON_GITHUB_URL reason
Deduplicates URLs before fetching
Uses asyncio.gather() for parallel fetching
Falls back gracefully if a single repo fetch fails
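
The four behaviours listed above can be sketched as follows. `fetch_batch`, `fetch_one`, and the `ok`/`summary` result keys are hypothetical names; only the `NON_GITHUB_URL` reason and the use of `asyncio.gather()` come from the description.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List
from urllib.parse import urlparse


def is_github_url(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    return host in {"github.com", "www.github.com"}


async def fetch_batch(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[Dict]],
) -> List[Dict]:
    unique = list(dict.fromkeys(urls))  # de-duplicate, keep first-seen order
    github = [u for u in unique if is_github_url(u)]
    # return_exceptions=True keeps one failing repository from aborting the rest.
    fetched = await asyncio.gather(*(fetch_one(u) for u in github), return_exceptions=True)
    by_url = dict(zip(github, fetched))

    results: List[Dict] = []
    for url in unique:
        if url not in by_url:
            results.append({"url": url, "ok": False, "reason": "NON_GITHUB_URL"})
        elif isinstance(by_url[url], BaseException):
            results.append({"url": url, "ok": False, "reason": str(by_url[url])})
        else:
            results.append({"url": url, "ok": True, "summary": by_url[url]})
    return results
```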

Data sources (tried in order):

DeepWiki MCP: Pre-indexed repository documentation — fast, no rate limits
Repocards: Direct library-based fetch — fallback for repos not yet in DeepWiki
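
Trying the data sources in order amounts to a simple fallback chain. The sketch below assumes each source is an async callable returning a summary dict or `None`; `first_available` and the result keys are illustrative names, and the real DeepWiki and Repocards clients are not shown.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Sequence, Tuple

Source = Tuple[str, Callable[[str], Awaitable[Optional[Dict]]]]


async def first_available(url: str, sources: Sequence[Source]) -> Dict:
    """Try each source in order and return the first non-empty summary."""
    errors: Dict[str, str] = {}
    for name, fetch in sources:
        try:
            summary = await fetch(url)
        except Exception as exc:  # a failing source should not abort the chain
            errors[name] = str(exc)
            continue
        if summary:
            return {"url": url, "source": name, "summary": summary}
        errors[name] = "no data"
    return {"url": url, "source": None, "summary": None, "errors": errors}
```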

Example:

Agent: [Calls repo_info_batch(["https://github.com/wasserth/TotalSegmentator",
                                "https://github.com/MIC-DKFZ/nnUNet"])]

Selection and Ranking¶

The PydanticAI agent performs tool selection and ranking directly as part of its LLM reasoning step. There is no separate VLMToolSelector class — the agent's system prompt (defined in generator/prompts.py) encodes the scoring rules, and the ToolSelection Pydantic schema (defined in generator/schema.py) enforces structured output.

System Prompt¶

The agent system prompt is assembled by get_agent_system_prompt() in generator/prompts.py and covers:

Image analysis: Instructions to analyze the attached preview image and reference visual observations in explanations
Tool call sequence: When to call search_tools, search_alternative, repo_info_batch
Scoring rules: Accuracy (0–100) = Task match (40) + Format compatibility (30) + Features (30)
Output format: Single JSON object matching the ToolSelection schema

from ai_agent.generator.prompts import get_agent_system_prompt

# Generates a prompt that instructs the agent to return up to N ranked choices
system_prompt = get_agent_system_prompt(num_choices=3)
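
The scoring rule in the prompt is plain arithmetic over three capped components. In the pipeline the LLM assigns these values itself; the function below (a hypothetical helper, not part of the codebase) only illustrates how the 40/30/30 split adds up to the 0–100 accuracy.

```python
def accuracy_score(task_match: float, format_compatibility: float, features: float) -> float:
    """Sum the three rubric components after checking each against its maximum."""
    parts = (
        ("task_match", task_match, 40),
        ("format_compatibility", format_compatibility, 30),
        ("features", features, 30),
    )
    total = 0.0
    for label, value, maximum in parts:
        if not 0 <= value <= maximum:
            raise ValueError(f"{label} must be between 0 and {maximum}, got {value}")
        total += value
    return total


# A perfect task and format match with partial feature coverage:
print(accuracy_score(40, 30, 25))  # 95.0
```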

Selection Process¶

Step 1: Tool Calls (Retrieval)¶

The agent calls search_tools exactly once (and optionally search_alternative up to 3 times) to retrieve candidate tools from the vector index:

Agent → search_tools(query="segment lungs", top_k=12)
      ← [TotalSegmentator, MedSAM, nnU-Net, ...]

Step 2: Verification¶

For finalists the agent plans to recommend, it calls repo_info_batch in a single batch call:

Agent → repo_info_batch(urls=["https://github.com/wasserth/TotalSegmentator", ...])
      ← [{stars: 1200, language: "Python", topics: [...], description: "..."}, ...]

Step 3: Structured Output¶

The agent returns one JSON object (no prose) that is validated against the ToolSelection schema:

from pydantic_ai.usage import UsageLimits

run_result = agent_instance.run_sync(
    user_prompt,       # text + optional BinaryContent image
    deps=deps,         # AgentState with image_paths, excluded_tools, etc.
    output_type=ToolSelection,
    usage_limits=UsageLimits(tool_calls_limit=20),  # cap on total tool calls per run
)
result = run_result.output  # ToolSelection instance

Multimodal input:

Text: User task + hidden metadata (format hints, image dimensions)
Image: PNG preview bytes passed as BinaryContent(data=image_bytes, media_type="image/png")
Context: Conversation history prepended to the prompt

Structured Response Schema¶

The ToolSelection Pydantic model (in generator/schema.py) validates the agent output:

from ai_agent.generator.schema import (
    ToolSelection, ToolChoice, Conversation,
    ConversationStatus, NoToolReason
)

class ToolChoice(BaseModel):
    name: str
    rank: int
    accuracy: float          # 0-100
    why: str
    demo_link: Optional[str] = None

class Conversation(BaseModel):
    status: ConversationStatus
    question: Optional[str] = None   # required if status=needs_clarification
    context: Optional[str] = None    # required if status=needs_clarification
    options: Optional[List[str]] = None

class ToolSelection(BaseModel):
    conversation: Conversation
    choices: List[ToolChoice] = []
    explanation: Optional[str] = None
    reason: Optional[NoToolReason] = None

Example response (ToolSelection):

{
  "conversation": {"status": "complete", "question": null, "context": null},
  "choices": [
    {
      "rank": 1,
      "name": "TotalSegmentator",
      "accuracy": 95.0,
      "why": "Specifically designed for automated multi-organ CT segmentation...",
      "demo_link": "https://huggingface.co/spaces/..."
    },
    {
      "rank": 2,
      "name": "MedSAM",
      "accuracy": 85.0,
      "why": "Flexible SAM-based segmentation supporting DICOM input...",
      "demo_link": "https://huggingface.co/spaces/..."
    }
  ],
  "explanation": null,
  "reason": null
}

Validation¶

Pydantic validates:

All required fields present
Types correct (int, float, str, enum)
accuracy within 0–100 range
ConversationStatus is one of the allowed enum values
NoToolReason is a valid enum value when choices is empty

ToolSelection.normalize() also enforces consistency rules automatically (e.g. setting status=complete when choices are returned, status=needs_clarification when a question is present).
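
The two consistency rules mentioned above can be illustrated on plain dictionaries. This is a sketch, not the actual `ToolSelection.normalize()` validator; in particular, letting returned choices take precedence over a pending question is an assumption.

```python
from typing import Any, Dict


def normalize(selection: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy whose conversation status agrees with the rest of the payload."""
    result = dict(selection)
    conversation = dict(result.get("conversation") or {})
    if result.get("choices"):
        # Assumption: returned choices win over a pending question.
        conversation["status"] = "complete"
    elif conversation.get("question"):
        conversation["status"] = "needs_clarification"
    result["conversation"] = conversation
    return result
```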

Conversation States¶

State machine for conversation flow:

from enum import Enum

# Defined in ai_agent.generator.schema

class ConversationStatus(str, Enum):
    COMPLETE = "complete"                    # Recommendations provided (or no tool found)
    NEEDS_CLARIFICATION = "needs_clarification"  # Agent needs more info

Complete¶

Normal successful response:

{
    "conversation": {"status": "complete", "question": null, "context": null},
    "choices": [...],
    "explanation": null,
    "reason": null
}

Triggers:

Query is clear
Candidates found
Image/metadata sufficient

Needs Clarification¶

Agent requests more information:

{
    "conversation": {
        "status": "needs_clarification",
        "question": "Which specific organ would you like to segment?",
        "context": "Several segmentation tools available; target organ narrows choices.",
        "options": ["Lungs", "Brain", "Liver", "Other (briefly specify)"]
    },
    "choices": [],
    "explanation": null,
    "reason": null
}

Triggers:

Ambiguous query
Multiple valid interpretations
Missing critical information

Example flow:

User: Segment this MRI
Agent: [STATUS: needs_clarification] Which organ would you like to segment?
User: The brain
Agent: [STATUS: complete] Here are brain segmentation tools...

No Tool Terminal¶

No suitable tools in catalog — status is still complete, but choices is empty and a reason + explanation are provided:

{
    "conversation": {"status": "complete", "question": null, "context": null},
    "choices": [],
    "reason": "no_task_match",
    "explanation": "No tools in the catalog handle audio processing. This catalog covers imaging analysis software."
}

Available NoToolReason values: no_suitable_tool, no_modality_match, no_task_match, no_dimension_match, invalid_files.
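
Following the same pattern as `ConversationStatus`, the reason codes could be modelled as a string enum. The values are the ones listed above; the member names are assumed, and the real definition lives in generator/schema.py.

```python
from enum import Enum


class NoToolReason(str, Enum):
    NO_SUITABLE_TOOL = "no_suitable_tool"
    NO_MODALITY_MATCH = "no_modality_match"
    NO_TASK_MATCH = "no_task_match"
    NO_DIMENSION_MATCH = "no_dimension_match"
    INVALID_FILES = "invalid_files"


# Looking up a value returned by the model; unknown strings raise ValueError.
reason = NoToolReason("no_task_match")
```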

Ranking Logic¶

Scoring Factors¶

The agent considers:

High Priority¶

Task Match: Tool designed for this specific task
Format Compatibility: Supports user's file format
Visual Analysis: Image content matches tool's domain

Medium Priority¶

Modality Alignment: CT tool for CT image, MRI for MRI
Dimension Match: 3D tool for 3D volume
Feature Coverage: Specific capabilities mentioned

Low Priority¶

License: Open-source preference (if no preference stated)
Demo Availability: Has runnable demo
Popularity: Community adoption

Explanation Generation¶

Each recommendation includes an explanation (the why field).

Good explanation template:

{Tool} is {specifically designed / well-suited} for {task} 
on {modality} images. It supports {format} input {with/without} 
preprocessing and provides {key features}. {Caveats if any}.

Example:

TotalSegmentator is specifically designed for automated multi-organ 
segmentation on CT scans. It supports DICOM input without preprocessing 
and can segment 104 anatomical structures including lungs, airways, 
and vessels. It works best on whole-body CT but also performs well on 
thoracic scans.

Rank Assignment¶

Rank 1: Best overall match (highest accuracy score)
Rank 2: Strong alternative or different approach
Rank 3: Fallback option or specialized capability

Important: Ranks are relative to this specific query, not absolute tool quality.
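
If ranks were derived purely from the accuracy score, the assignment would look like the sketch below. This is a simplification: the agent may also place a "different approach" at rank 2 or 3 for reasons other than score, and `assign_ranks` is a hypothetical helper.

```python
from typing import Dict, List


def assign_ranks(choices: List[Dict], num_choices: int = 3) -> List[Dict]:
    """Sort by accuracy (highest first) and number the top entries from 1."""
    # sorted() is stable, so tied scores keep their original relative order.
    ordered = sorted(choices, key=lambda c: c["accuracy"], reverse=True)
    return [{**c, "rank": i} for i, c in enumerate(ordered[:num_choices], start=1)]
```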

Model Configuration¶

Model Selection¶

The model is configured in config.yaml:

agent_model:
  name: "gpt-4o-mini"
  base_url: null
  api_key_env: "OPENAI_API_KEY"

Model Comparison¶

Model	Vision	Speed	Cost	Best For
gpt-4o-mini	✅	⚡⚡⚡	$	Most queries, fast iteration
gpt-4o	✅✅	⚡⚡	$$	Complex visual analysis
gpt-5.1	✅✅✅	⚡	$$$	Maximum accuracy needed

Custom Endpoints¶

Support for OpenAI-compatible APIs:

agent_model:
  name: "llama-3.2-vision"
  base_url: "https://inference.epfl.ch/v1"
  api_key_env: "EPFL_API_KEY"
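
The rule from the Agent Definition section (a set `base_url` selects `OpenAIChatModel`, otherwise `OpenAIResponsesModel`) can be expressed as a small resolver over the `agent_model` config. The function name, the returned dict, and the fallback to `OPENAI_API_KEY` when `api_key_env` is missing are assumptions for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_agent_model(
    cfg: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Optional[str]]:
    """Pick the model class name and API key for an `agent_model` config section."""
    base_url = cfg.get("base_url")
    key_env = cfg.get("api_key_env") or "OPENAI_API_KEY"  # assumed default
    return {
        "model_class": "OpenAIChatModel" if base_url else "OpenAIResponsesModel",
        "model_name": cfg["name"],
        "base_url": base_url,
        "api_key": env.get(key_env),  # None if the variable is not set
    }
```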

Error Handling¶

Agent Errors¶

Tool quota exceeded (handled gracefully in run_agent):

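from pydantic_ai.exceptions import UsageLimitExceeded
from pydantic_ai.usage import UsageLimits

try:
    result = agent_instance.run_sync(user_prompt, deps=deps, output_type=ToolSelection,
                                     usage_limits=UsageLimits(tool_calls_limit=20)).output
except UsageLimitExceeded:
    # Returns a ToolSelection with empty choices and an explanation
    result = ToolSelection(
        conversation=Conversation(status=ConversationStatus.COMPLETE, ...),
        choices=[],
        explanation="Tool call limit reached. Try a more specific query.",
    )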

Invalid structured output:

PydanticAI automatically re-prompts the model when its output fails ToolSelection validation, up to output_retries times (default 3, set via AGENT_OUTPUT_RETRIES). Tool calls have a separate retry budget (retries=2 per tool). The ToolSelection.normalize() model validator also auto-corrects minor inconsistencies.

API Errors:

except Exception as e:
    log.warning(f"Agent execution encountered an error: {e}")
    raise  # propagated to the UI layer

Graceful Degradation¶

If the agent fails after all retries:

Return empty choices with an explanation describing what was searched
UI surfaces the explanation so users can refine their query
Suggest manual exploration of the catalog

<!-- ## Performance

Latency¶

Typical VLM call: 2-5 seconds

Breakdown:

Prompt construction: <100ms
API call: 2-4s (network + inference)
Response parsing: <100ms
Validation: <50ms

Optimization¶

Prompt optimization:

Concise candidate descriptions
Limit to top-8 candidates
Structured format for parsing

Caching:

Model endpoint reused
Agent instance persists across requests

Batch processing (for testing):

# Process multiple queries
responses = await asyncio.gather(*[
    agent.run(query1),
    agent.run(query2),
    agent.run(query3)
])
``` -->

Testing¶

Unit Tests¶

Test agent selection with PydanticAI's built-in test model (your catalog should contain the choice provided below, i.e. the `TotalSegmentator` tool):

```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from ai_agent.generator.schema import ToolSelection, Conversation, ConversationStatus, ToolChoice
from ai_agent.agent.utils import AgentState

def test_agent_selection():
    mock_output = ToolSelection(
        conversation=Conversation(status=ConversationStatus.COMPLETE),
        choices=[
            ToolChoice(name="TotalSegmentator", rank=1, accuracy=95.0, why="Best CT segmenter")
        ]
    )

    # Without custom_output_args, TestModel returns auto-generated placeholder
    # data, so the assertions on the mock values below would not hold.
    test_model = TestModel(custom_output_args=mock_output.model_dump(mode="json"))
    test_agent = Agent(model=test_model, deps_type=AgentState)

    with test_agent.override(model=test_model):
        result = test_agent.run_sync("segment lungs", deps=AgentState(), output_type=ToolSelection)

    assert result.output.conversation.status == ConversationStatus.COMPLETE
    assert len(result.output.choices) == 1
    assert result.output.choices[0].rank == 1
```

Integration Tests¶

Test with real VLM (expensive, slow):

import pytest

from ai_agent.generator.schema import ConversationStatus

@pytest.mark.integration
def test_real_agent():
    from ai_agent.agent.agent import run_agent

    with open("tests/data/sample.tif", "rb") as f:
        image_bytes = f.read()

    result = run_agent(
        task="I want to segment the lungs of this CT scan",
        image_paths=["tests/data/sample.tif"],
        image_bytes=image_bytes,
    )

    assert result.conversation.status == ConversationStatus.COMPLETE
    assert len(result.choices) > 0
    assert all(0 <= c.accuracy <= 100 for c in result.choices)
