voyage/.memory/plans/chat-provider-fixes.md

# Chat Provider Fixes

## Problem Statement
The AI chat feature is broken with multiple issues:
1. Rate limit errors from providers
2. "location is required" errors (tool calling issue)
3. "An unexpected error occurred while fetching trip details" errors
4. Models not being fetched properly for all providers
5. Potential authentication issues

## Root Cause Analysis

### Issue 1: Tool Calling Errors
The errors "location is required" and "An unexpected error occurred while fetching trip details" come from the agent tools (`search_places`, `get_trip_details`) being called with missing/invalid parameters. This suggests:
- The LLM is not properly understanding the tool schemas
- Or the model doesn't support function calling well
- Or there's a mismatch between how LiteLLM formats tools and what the model expects

### Issue 2: Models Not Fetched
The `models` endpoint in `ChatProviderCatalogViewSet` only handles:
- `openai` - uses OpenAI SDK to fetch live
- `anthropic/claude` - hardcoded list
- `gemini/google` - hardcoded list
- `groq` - hardcoded list
- `ollama` - calls local API
- `opencode_zen` - hardcoded list

All other providers return `{"models": []}`.

### Issue 3: Authentication Flow
1. Frontend sends request with `credentials: 'include'`
2. Backend gets user from session
3. `get_llm_api_key()` checks `UserAPIKey` model for user's key
4. Falls back to `settings.VOYAGE_AI_API_KEY` if user has no key and provider matches instance default
5. Key is passed to LiteLLM's `acompletion()`

Potential issues:
- Encryption key not configured correctly
- Key not being passed correctly to LiteLLM
- Provider-specific auth headers not being set

### Issue 4: LiteLLM vs Alternatives
Current approach (LiteLLM):
- Single library handles all providers
- Normalizes API calls across providers
- Built-in error handling and retries (if configured)

Alternative (Vercel AI SDK):
- Provider registry pattern with individual packages
- More explicit provider configuration
- Better TypeScript support
- But would require significant refactoring (backend is Python)

## Investigation Tasks

- [ ] Test the actual API calls to verify authentication
- [x] Check if models endpoint returns correct data
- [x] Verify tool schemas are being passed correctly
- [ ] Test with a known-working model (e.g., GPT-4o)

## Options

### Option A: Fix LiteLLM Integration (Recommended)
1. Add proper retry logic with `num_retries=2`
2. Add `supports_function_calling()` check before using tools
3. Expand models endpoint to handle more providers
4. Add better logging for debugging

### Option B: Replace LiteLLM with Custom Implementation
1. Use direct API calls per provider
2. More control but more maintenance
3. Significant development effort

### Option C: Hybrid Approach
1. Keep LiteLLM for providers it handles well
2. Add custom handlers for problematic providers
3. Medium effort, best of both worlds

## Status

### Completed (2026-03-09)
- [x] Implemented backend fixes for Option A:
  1. `ChatProviderCatalogViewSet.models()` now fetches OpenCode Zen models dynamically from `{api_base}/models` using the configured provider API base and user API key; returns deduplicated model ids and logs fetch failures.
  2. `stream_chat_completion()` now checks `litellm.supports_function_calling(model=resolved_model)` before sending tools and disables tools with a warning if unsupported.
  3. Added LiteLLM transient retry configuration via `num_retries=2` on streaming completions.
  4. Added request/error logging for provider/model/tool usage and API base/message count diagnostics.

### Verification Results
- Models endpoint: Returns 36 models from OpenCode Zen API (was 5 hardcoded)
- Function calling check: gpt-5-nano=True, claude-sonnet-4-6=True, big-pickle=False, minimax-m2.5=False
- Syntax check: Passed for both modified files
- Frontend check: 0 errors, 6 warnings (pre-existing)

### Remaining Issues (User Action Required)
- Rate limits: Free tier has limits, user may need to upgrade or wait
- Tool calling: Some models (big-pickle, minimax-m2.5) don't support function calling - tools will be disabled for these models

## Follow-up Fixes (2026-03-09)

### Clarified Behavior
- Approved preference precedence: database-saved default provider/model beats any per-device `localStorage` override.
- Requirement: user AI preferences must be persisted through the existing `UserAISettings` backend API and applied by both the settings UI and chat send-message fallback logic.

### Planned Workstreams

- [x] `chat-loop-hardening`
  - Acceptance: invalid required-argument tool calls do not loop repeatedly, tool-error messages are not replayed back into the model history, and SSE streams terminate cleanly with a user-visible error or `[DONE]`.
  - Files: `backend/server/chat/views/__init__.py`, `backend/server/chat/agent_tools.py`, optional `backend/server/chat/llm_client.py`
  - Notes: preserve successful tool flows; stop feeding `{"error": "location is required"}` / `{"error": "query is required"}` back into the next model turn.
  - Completion (2026-03-09): Added required-argument tool-error detection in `send_message()` streaming loop, short-circuited those tool failures with a user-visible SSE error + terminal `[DONE]`, skipped persistence/replay of those invalid tool payloads (including historical cleanup at `_build_llm_messages()`), and tightened `search_places`/`web_search` tool descriptions to explicitly call out required non-empty args.
  - Follow-up (2026-03-09): Fixed multi-tool-call consistency by persisting/replaying only the successful prefix of `tool_calls` when a later call fails required-arg validation; `_build_llm_messages()` now trims assistant `tool_calls` to only IDs that have kept (non-filtered) persisted tool messages.
  - Review verdict (2026-03-09): **APPROVED** (score 6). Two WARNINGs: (1) multi-tool-call orphan — when model returns N tool calls and call K fails required-param validation, calls 1..K-1 are already persisted but call K's result is not, leaving an orphaned `tool_calls` reference in the assistant message that may cause LLM API errors on the next conversation turn; (2) `_build_llm_messages` filters tool-role error messages but does not filter/trim the corresponding assistant-message `tool_calls` array, creating the same orphan on historical replay. Both are low-likelihood (multi-tool required-param failures are rare) and gracefully degraded (next-turn errors are caught by `_safe_error_payload`). One SUGGESTION: `get_weather` error `"dates must be a non-empty list"` does not match the `is/are required` regex and would not trigger the short-circuit (mitigated by `MAX_TOOL_ITERATIONS` guard). Also confirms prior pre-existing bug (`tool_iterations` never incremented) is now fixed in this changeset.

- [x] `default-ai-settings`
  - Acceptance: settings page shows default AI provider/model controls, saving persists via `UserAISettings`, chat UI initializes from saved preferences, and backend chat fallback uses saved defaults when request payload omits provider/model.
  - Files: `frontend/src/routes/settings/+page.server.ts`, `frontend/src/routes/settings/+page.svelte`, `frontend/src/lib/types.ts`, `frontend/src/lib/components/AITravelChat.svelte`, `backend/server/chat/views/__init__.py`
  - Notes: DB-saved defaults override browser-local model prefs.

### Completion Note (2026-03-09)
- Implemented DB-backed default AI settings end-to-end: settings page now loads/saves `UserAISettings` via `/api/integrations/ai-settings/`, with provider/model selectors powered by provider catalog + per-provider models endpoint.
- Chat initialization now treats saved DB defaults as authoritative initial provider/model; stale `voyage_chat_model_prefs` localStorage values no longer override defaults and are synchronized to the saved defaults.
- Backend `send_message` now uses saved `UserAISettings` only when request payload omits provider/model, preserving explicit request values and existing provider validation behavior.
- Follow-up fix: backend model fallback now only applies `preferred_model` when the resolved provider matches `preferred_provider`, preventing cross-provider default model mismatches when users explicitly choose another provider.

- [x] `suggestion-add-flow`
  - Acceptance: day suggestions use the user-configured/default provider/model instead of hardcoded OpenAI values, and adding a suggested place creates a location plus itinerary entry successfully.
  - Files: `backend/server/chat/views/day_suggestions.py`, `frontend/src/lib/components/collections/ItinerarySuggestionModal.svelte`
  - Notes: normalize suggestion payloads needed by `/api/locations/` and preserve existing add-item event wiring.
  - Completion (2026-03-09): Day suggestions now resolve provider/model in precedence order (request payload → `UserAISettings` defaults → instance/provider defaults) without OpenAI hardcoding; modal now normalizes suggestion objects and builds stable `/api/locations/` payloads (name/location/description/rating) before dispatching existing `addItem` flow.
  - Follow-up (2026-03-09): Removed remaining OpenAI-specific `gpt-4o-mini` fallback from day suggestions LLM call; endpoint now uses provider-resolved/default model only and fails safely when no model is configured.
  - Follow-up (2026-03-09): Removed unsupported `temperature` from day suggestions requests, normalized bare `opencode_zen` model ids through the gateway (`openai/<model>`), and switched day suggestions error responses to the same sanitized categories used by chat. Browser result: the suggestion modal now completes normally (empty-state or rate-limit message) instead of crashing with a generic 500.

## Tester Validation — `default-ai-settings` (2026-03-09)

### STATUS: PASS

**Evidence from lead:** Authenticated POST `/api/integrations/ai-settings/` returned 200 and persisted; subsequent GET returned same values; POST `/api/chat/conversations/{id}/send_message/` with no provider/model in body used `preferred_provider='opencode_zen'` and `preferred_model='gpt-5-nano'` from DB, producing valid SSE stream.

**Standard pass findings:**
- `UserAISettings` model, serializer, and `UserAISettingsViewSet` are correct. Upsert logic in `perform_create` handles first-write and update-in-place correctly (single row per user via OneToOneField).
- `list()` returns `[serializer.data]` (wrapped array), which the frontend expects as `settings[0]` — contract matches.
- Backend `send_message` precedence: `requested_provider` → `preferred_provider` (if available) → `"openai"` fallback. `model` only inherits `preferred_model` when `provider == preferred_provider` — cross-provider default mismatch is correctly prevented (follow-up fix confirmed).
- Settings page initializes `defaultAiProvider`/`defaultAiModel` from SSR-loaded `aiSettings` and validates against provider catalog on `onMount`. If saved provider is no longer configured, it falls back to first configured provider.
- `AITravelChat.svelte` fetches AI settings on mount, applies as authoritative default, and writes to `localStorage` (sync direction is DB → localStorage, not the reverse).
- The `send_message` handler in the frontend always sends the current UI `selectedProvider`/`selectedModel`, not localStorage values directly — these are only used for UI state initialization, not bypassing DB defaults.
- All i18n keys present in `en.json`: `default_ai_settings_title`, `default_ai_settings_desc`, `default_ai_no_providers`, `default_ai_save`, `default_ai_settings_saved`, `default_ai_settings_error`, `default_ai_provider_required`.
- Django integration tests (5/5) pass; no tests exist for `UserAISettings` specifically — residual regression risk noted.

**Adversarial pass findings (all hypotheses did not find bugs):**

1. **Hypothesis: model saved for provider A silently applied when user explicitly sends provider B (cross-provider model leak).** Checked `send_message` lines 218–220: `model = requested_model; if model is None and preferred_model and provider == preferred_provider: model = preferred_model`. When `requested_provider=B` and `preferred_provider=A`, `provider == preferred_provider` is false → `model` stays `None`. **Not vulnerable.**

2. **Hypothesis: null/empty preferred_model or preferred_provider in DB triggers error.** Serializer allows `null` on both fields (CharField with `blank=True, null=True`). Backend normalizes with `.strip().lower()` inside `(ai_settings.preferred_provider or "").strip().lower()` guard. Frontend uses `?? ''` coercion. **Handled safely.**

3. **Hypothesis: second POST to `/api/integrations/ai-settings/` creates a second row instead of updating.** `UserAISettings` uses `OneToOneField(user, ...)` + `perform_create` explicitly fetches and updates existing row. A second POST cannot produce a duplicate. **Not vulnerable.**

4. **Hypothesis: initializeDefaultAiSettings silently overwrites the saved DB provider with the first catalog provider if the saved provider is temporarily unavailable (e.g., API key deleted).** Confirmed: line 119–121 does silently auto-select first available provider and blank the model if the saved provider is gone. This affects display only (not DB); the save action is still explicit. **Acceptable behavior; low risk.**

5. **Hypothesis: frontend sends `model: undefined` (vs `model: null`) when no model selected, causing backend to ignore it.** `requested_model = (request.data.get("model") or "").strip() or None` — if `undefined`/absent from JSON body, `get("model")` returns `None`, which becomes `None` after the guard. `model` variable falls through to default logic. **Works correctly.**

**MUTATION_ESCAPES: 1/8** — the regex `(is|are) required` in `_is_required_param_tool_error` (chat-loop-hardening code) would escape if a future required-arg error used a different pattern, but this is unrelated to `default-ai-settings` scope.

**Zero automated test coverage for `UserAISettings` CRUD + precedence logic.** Backend logic is covered only by the lead's live-run evidence. Recommended follow-up: add Django TestCase covering (a) upsert idempotency, (b) provider/model precedence in `send_message`, (c) cross-provider model guard.

## Tester Validation — `chat-loop-hardening` (2026-03-09)

### STATUS: PASS

**Evidence from lead (runtime):** Authenticated POST to `send_message` with patched upstream stream emitting `search_places {}` (missing required `location`) returned status 200, SSE body `data: {"tool_calls": [...]}` → `data: {"error": "...", "error_category": "tool_validation_error"}` → `data: [DONE]`. Persisted DB state after that turn: only `('user', None, 'restaurants please')` + `('assistant', None, '')` — no invalid `role=tool` error row.

**Standard pass findings:**

- `_is_required_param_tool_error`: correctly matches `location is required`, `query is required`, `collection_id is required`, `collection_id, name, latitude, and longitude are required`, `latitude and longitude are required`. Does NOT match non-required-arg errors (`dates must be a non-empty list`, `Trip not found`, `Unknown tool: foo`, etc.). All 18 test cases pass.
- `_is_required_param_tool_error_message_content`: correctly parses JSON-wrapped content from persisted DB rows and delegates to above. Handles non-JSON, non-dict JSON, and `error: null` safely. All 7 test cases pass.
- Orphan trimming in `_build_llm_messages`: when assistant has `tool_calls=[A, B]` and B's persisted tool row contains a required-param error, the rebuilt `assistant.tool_calls` retains only `[A]` and tool B's row is filtered. Verified for both the multi-tool case and the single-tool (lead's runtime) scenario.
- SSE stream terminates with `data: [DONE]` immediately after the `tool_validation_error` event — confirmed by code path at line 425–426 which `return`s the generator.
- `MAX_TOOL_ITERATIONS = 10` correctly set; `tool_iterations` counter is incremented on each tool iteration (pre-existing bug confirmed fixed).
- `_merge_tool_call_delta` handles `None`, `[]`, missing `index`, and malformed argument JSON without crash.
- Full Django test suite: 24/30 pass; 6/30 fail (all pre-existing: 2 user email key errors + 4 geocoding API mock errors). Zero regressions introduced by this changeset.

**Adversarial pass findings:**

1. **Hypothesis: `get_weather` with empty `dates=[]` bypasses short-circuit and loops.** `get_weather` returns `{"error": "dates must be a non-empty list"}` which does NOT match the `is/are required` regex → not short-circuited. Falls through to `MAX_TOOL_ITERATIONS` guard (10 iterations max). **Known gap, mitigated by guard — confirmed matches reviewer WARNING.**

2. **Hypothesis: regex injection via crafted error text creates false-positive short-circuit.** Tested `'x is required; rm -rf /'` (semicolon breaks `fullmatch`), newline injection, Cyrillic lookalike. All return `False` correctly. **Not vulnerable.**

3. **Hypothesis: `assistant.tool_calls=[]` (empty list) pollutes rebuilt messages.** `filtered_tool_calls` is `[]` → the `if filtered_tool_calls:` guard prevents empty `tool_calls` key from being added to the payload. **Not vulnerable.**

4. **Hypothesis: `tool message content = None` is incorrectly classified as required-param error.** `_is_required_param_tool_error_message_content(None)` returns `False` (not a string → returns early). **Not vulnerable.**

5. **Hypothesis: `_build_required_param_error_event` crashes on None/missing `result`.** `result.get("error")` is guarded by `if isinstance(result, dict)` in caller; the static method itself handles `None` result via `isinstance` check and produces `error=""`. **No crash.**

6. **Hypothesis: multi-tool scenario — only partial `tool_calls` prefix trimmed correctly.** Tested assistant with `[A, B]` where A succeeds and B fails: rebuilt messages contain `tool_calls=[A]` only. Tested assistant with only `[X]` failing: rebuilt messages contain `tool_calls=None` (key absent). **Both correct.**

**MUTATION_ESCAPES: 1/7** — `get_weather` returning `"dates must be a non-empty list"` not triggering the short-circuit. This is a known, reviewed, accepted gap (mitigated by `MAX_TOOL_ITERATIONS`). No other mutation checks escaped detection.

**FLAKY: 0**

**COVERAGE: N/A** — no automated test suite exists for the `chat` app; all validation is via unit-level method tests + lead's live-run evidence. Recommended follow-up: add Django `TestCase` for `send_message` streaming loop covering (a) single required-arg tool failure → short-circuit, (b) multi-tool partial success, (c) `MAX_TOOL_ITERATIONS` exhaustion, (d) `_build_llm_messages` orphan-trimming round-trip.

## Tester Validation — `suggestion-add-flow` (2026-03-09)

### STATUS: PASS

**Test run:** 30 Django tests (24 pass, 6 fail — all 6 pre-existing: 2 user email key errors + 4 geocoding mock failures). Zero new regressions. 44 targeted unit-level checks (42 pass, 2 fail — both failures confirmed as test-script defects, not code bugs).

**Standard pass findings:**

- `_resolve_provider_and_model` precedence verified end-to-end: explicit request payload → `UserAISettings.preferred_provider/model` → `settings.VOYAGE_AI_PROVIDER/MODEL` → provider-config default. All 4 precedence levels tested and confirmed correct.
- Cross-provider model guard confirmed: when request provider ≠ `preferred_provider`, the `preferred_model` is NOT applied (prevents `gpt-5-nano` from leaking to anthropic, etc.).
- Null/empty `preferred_provider`/`preferred_model` in `UserAISettings` handled safely (`or ""` coercion guards throughout).
- JSON parsing in `_get_suggestions_from_llm` is robust: handles clean JSON array, embedded JSON in prose, markdown-wrapped JSON, plain text (no JSON), empty string, `None` content — all return correct results (empty list or parsed list). Response capped at 5 items. Single-dict LLM response wrapped in list correctly.
- `normalizeSuggestionItem` normalization verified: non-dict returns `null`, missing name+location returns `null`, field aliases (`title`→`name`, `address`→`location`, `summary`→`description`, `score`→`rating`, `whyFits`→`why_fits`) all work. Whitespace-only name falls back to location.
- `rating=0` correctly preserved in TypeScript via `??` (nullish coalescing at line 171), not dropped. The Python port used `or` which drops `0`, but that's a test-script defect only.
- `buildLocationPayload` constructs a valid `LocationSerializer`-compatible payload: `name`, `location`, `description`, `rating`, `collections`, `is_public`. Falls back to collection location when suggestion has none.
- `handleAddSuggestion` → POST `/api/locations/` → `dispatch('addItem', {type:'location', itemId, updateDate:false})` wiring confirmed by code inspection (lines 274–294). Parent `CollectionItineraryPlanner` handler at line 2626 calls `addItineraryItemForObject`.

**Adversarial pass findings:**

1. **Hypothesis: cross-provider model leak (gpt-5-nano applied to anthropic).** Tested `request.provider=anthropic` + `UserAISettings.preferred_provider=opencode_zen`, `preferred_model=gpt-5-nano`. Result: `model_from_user_defaults=None` (because `provider != preferred_provider`). **Not vulnerable.**

2. **Hypothesis: null/empty DB prefs cause exceptions.** `preferred_provider=None`, `preferred_model=None` — all guards use `(value or "").strip()` pattern. Falls through to `settings.VOYAGE_AI_PROVIDER` safely. **Not vulnerable.**

3. **Hypothesis: all-None provider/model/settings causes exception in `_resolve_provider_and_model`.** Tested with `is_chat_provider_available=False` everywhere, all settings None. Returns `(None, None)` without exception; caller checks `is_chat_provider_available(provider)` and returns 503. **Not vulnerable.**

4. **Hypothesis: missing API key causes silent empty result instead of error.** `get_llm_api_key` returns `None` → raises `ValueError("No API key available")` → caught by `post()` try/except → returns 500. **Explicit error path confirmed.**

5. **Hypothesis: no model configured causes silent failure.** `model=None` + empty `provider_config` → raises `ValueError("No model configured for provider")` → 500. **Explicit error path confirmed.**

6. **Hypothesis: `normalizeSuggestionItem` with mixed array (nulls, strings, invalid dicts).** `[None, {name:'A'}, 'string', {description:'only'}, {name:'B'}]` → after normalize+filter: 2 valid items. **Correct.**

7. **Hypothesis: rating=0 dropped by falsy check.** Actual TS uses `item.rating ?? item.score` (nullish coalescing, not `||`). `normalizeRating(0)` returns `0` (finite number check). **Not vulnerable in actual code.**

8. **Hypothesis: XSS in name field.** `<script>alert(1)</script>` passes through as a string; Django serializer stores as text, template rendering escapes it. **Not vulnerable.**

9. **Hypothesis: double-click `handleAddSuggestion` creates duplicate location.** `isAdding` guard at line 266 exits early if `isAdding` is truthy — prevents re-entrancy. **Protected by UI-state guard.**

**Known low-severity defect (pre-existing, not introduced by this workstream):** LLM-generated `name`/`location` fields are not truncated before passing to `LocationSerializer` (max_length=200). If LLM returns a name > 200 chars, the POST to `/api/locations/` returns 400 and the frontend shows a generic error. Risk is very low in practice (LLM names are short). Recommended fix: add `.slice(0, 200)` in `buildLocationPayload` for `name` and `location` fields.

**MUTATION_ESCAPES: 1/9** — `rating=0` would escape mutation detection in naive Python tests (but is correctly handled in the actual TS `??` code). No logic mutations escape in the backend Python code.

**FLAKY: 0**

**COVERAGE: N/A** — no automated suite for `chat` or `suggestions` app. All validation via unit-level method tests + provider/model resolution checks. Recommended follow-up: add Django `TestCase` for `DaySuggestionsView.post()` covering (a) missing required fields → 400, (b) invalid category → 400, (c) unauthorized collection → 403, (d) provider unavailable → 503, (e) LLM exception → 500, (f) happy path → 200 with `suggestions` array.

**Cleanup required:** Two test artifact files left on host (not git-tracked, safe to delete):
- `/home/alex/projects/voyage/test_suggestion_flow.py`
- `/home/alex/projects/voyage/suggestion-modal-error-state.png`

## Completion Note — `embedded-chat-ux-polish` (2026-03-09)

- Updated `frontend/src/lib/components/AITravelChat.svelte` embedded UX only: moved provider/model selectors into a compact header settings dropdown, reduced embedded sidebar width, and added sidebar toggle accessibility attributes (`aria-label`, `aria-expanded`, `aria-controls`).
- Replaced rigid embedded height (`h-[70vh]`) with a bounded strategy (`h-[65vh]` + min/max constraints) and softened embedded card treatment for better fit in recommendations layouts across desktop/mobile.
- Kept streaming status visible throughout generation (not only before first token) and tightened embedded quick-action/input alignment with compact chip sizing + scrollable chip row behavior.

## Completion Note — `chat-tool-output-cleanup` (2026-03-09)

- Updated `frontend/src/lib/components/AITravelChat.svelte` to suppress standalone rendering of persisted `role=tool` messages, so reloaded conversations no longer surface raw tool payload rows.
- Replaced inline raw-JSON fallback rendering with concise user-facing summaries for `get_trip_details`, `list_trips`, `add_to_itinerary`, and `get_weather`, while keeping existing rich cards for `search_places` and `web_search`.
- Added safe error summarization for inline tool results so tool error payloads no longer display raw JSON in the normal chat UI.

## Review Verdict — `chat-tool-output-cleanup` (2026-03-09)

### STATUS: CHANGES-REQUESTED (score 13)

**CRITICAL: Tool summaries and rich cards lost on conversation reload** (`AITravelChat.svelte:534,782`). `tool_results` is a frontend-only ephemeral field populated exclusively during SSE streaming (line 373). When a conversation is reloaded via `selectConversation()`, the backend serializer returns `role=tool` messages with raw payloads, but the new `visibleMessages` filter (line 534) hides them. No reconstruction step maps persisted `role=tool` messages back onto their preceding assistant message's `tool_results` array. Result: after page refresh or conversation switch, all tool activity indicators (summaries, search_places cards, web_search links) silently vanish. The user sees only the assistant's text with no tool context. (confidence: HIGH)

**WARNING: Acceptance criterion partially unmet — "reloaded conversations do not expose raw tool payloads"** is satisfied (filter works), but the related user expectation that tool activity "remains understandable" on reload is violated because no tool indicators appear at all on reloaded conversations.

**What was checked and confirmed safe:**
- `visibleMessages` filter correctly excludes `role=tool` messages from display (line 534). No raw JSON blobs shown during streaming or on reload.
- `getToolSummary()` logic is safe: uses Svelte text interpolation (not `{@html}`), so LLM-sourced names (trip names, location names) are auto-escaped. No XSS vector.
- Error tool results render a generic "could not be completed" message rather than raw error JSON. Correct and safe.
- Streaming state management is correct: `streamingContent` reset on each send (line 302), `isStreaming` cleared in `finally` (line 387). No stale state.
- `lastVisibleMessageId` correctly tracks the last visible (non-tool) message for the streaming indicator.
- `asRecord()` null guard is correct — rejects null, arrays, and non-objects.
- Fallback summary for unknown tool names (line 599-602) is generic and safe.

**NEXT (fix actions):**
In `selectConversation()`, after loading `data.messages`, reconstruct `tool_results` on each assistant message by scanning the immediately following `role=tool` messages (which share `tool_call_id` with the assistant's `tool_calls` entries). For each tool message, parse its `content` (JSON string from `serialize_tool_result`), extract the tool `name` from the message's `name` field, and push a `ToolResultEntry` onto the preceding assistant message's `tool_results`. This ensures summaries and rich cards appear on reload. The `visibleMessages` filter continues to hide the raw tool rows.

## Tester Validation — `chat-tool-output-cleanup` (2026-03-09)

### STATUS: PASS

**Evidence from lead (runtime):** Page reload of seeded conversation with persisted `get_trip_details` assistant tool call + `role=tool` result showed `🗺️ Loaded details for test (0 itinerary items).` — no raw JSON. Sidebar remained functional. Reviewer-APPROVED follow-up fix confirmed implemented and working.

**Standard pass findings:**

- `visibleMessages` filter (`messages.filter(msg => msg.role !== 'tool')`) correctly suppresses raw `role=tool` rows from display. Live DOM scan of 10 chat bubbles across two conversations found zero raw JSON blobs (`"itinerary":`, `"tool_call_id":` patterns absent).
- `rebuildConversationMessages()` scans messages in one pass: sets `activeAssistant` on each assistant-with-tool-calls message; attaches subsequent `role=tool` rows as `ToolResultEntry` objects matched via `tool_call_id`. `activeAssistant` overridden on each new assistant message, preventing cross-turn leakage.
- `parseStoredToolResult()` JSON-parses tool content; falls back to raw string on failure. Both paths produce a valid `ToolResultEntry` — no crash.
- `getToolSummary()` produces human-readable summaries for `get_trip_details`, `list_trips`, `add_to_itinerary`, `get_weather`; generic fallback for unknown tools. Error payloads render `"<name> could not be completed."` — no raw JSON.
- Backend `ChatMessageSerializer` confirmed to include `name`, `tool_call_id`, and `tool_calls` fields required for reconstruction.
- Multi-turn live conversation validated: `⚠️ get trip details could not be completed.` + `🧳 Found 1 trip.` + `🗺️ Loaded details for test (6 itinerary items).` — all clean summaries, no raw JSON.
- Text-only conversation (no tool calls) unaffected — loads correctly with zero tool artifacts.
- Frontend build: `bun run lint`, `bun run check`, `bun run build` all passed (per lead).

**Adversarial pass findings (7 hypotheses, all safe):**

1. **Hypothesis: `assistant.tool_calls` with null IDs causes cross-turn leakage.** When `toolCallIds=[]`, the guard `msg.tool_call_id && toolCallIds.length > 0 && !includes` short-circuits at `length=0` → tool IS attached (permissive loose match). But the next `assistant` message overrides `activeAssistant` before its own tool rows, so no cross-turn pollution occurs. **Acceptable; null IDs cannot arise from correctly persisted backend rows.**
2. **Hypothesis: orphaned `role=tool` after non-tool-call assistant attaches to wrong message.** `activeAssistant=null` when `tool_calls` absent/empty. Tool row skipped. **Not vulnerable.**
3. **Hypothesis: malformed JSON in tool content crashes reconstruction.** Try/catch fallback returns `result: rawString`. `asRecord(string)` → `null`; `getToolSummary` hits generic branch. **Safe; no crash, no raw JSON exposed.**
4. **Hypothesis: `name=null` on tool message causes downstream crash.** `msg.name || 'tool'` guard → `'tool'`. Generic fallback renders `"tool completed."` **Safe.**
5. **Hypothesis: multi-tool assistant reconstructs both in correct order.** Both `call_A` and `call_B` rows attach to same assistant; `activeAssistant` cleared after count reaches `toolCallIds.length`. **Verified: 2 results attached in correct order.**
6. **Hypothesis: empty `messages` array crashes.** Returns `[]` immediately. **Safe.**
7. **Hypothesis: `role=tool` before any assistant crashes or attaches to user message.** `activeAssistant=null` at start; tool row skipped. **Safe.**

**MUTATION_ESCAPES: 1/7** — The `toolCallIds.length > 0` guard in the clear condition means an assistant with all-null tool_call IDs never has `activeAssistant` cleared post-attachment. A second stray tool row would attach to the same assistant. Extremely low practical likelihood (backend always persists real IDs from LiteLLM); no production scenario produces this DB state.

**FLAKY: 0**

**COVERAGE: N/A** — No automated frontend test suite for `AITravelChat.svelte`. All validation via in-browser function evaluation (7 unit-level cases) + visual browser confirmation. Recommended follow-up: Playwright e2e test seeding a conversation with `role=tool` rows and verifying summary cards render on reload.

**Screenshot evidence:** Captured `tool-summary-reload-verification.png` — showed `Tool summary reload test` conversation with assistant text + `🗺️ Loaded details for test (0 itinerary items).` summary card, no raw JSON. Screenshot deleted post-verification (artifact not git-tracked).

## Tester Validation — `embedded-chat-ux-polish` (2026-03-09)

### STATUS: PASS

**Lead evidence accepted:**
- `bun run lint`, `bun run check` (0 errors, 6 pre-existing warnings), and `bun run build` all passed.
- Browser-validated: embedded chat opens with sidebar closed, compact header (`Show conversations` toggle + title + ⚙️ gear), recommendations area remains visible. Sidebar toggle opens conversation list correctly.
- Reviewer APPROVED after sidebar-default follow-up fix (`let sidebarOpen = !embedded;` at line 63 confirmed in code).

**Standard pass findings (code inspection):**

- AC1 (header de-crowded): ✅ Provider/model selectors moved into `<details class="dropdown dropdown-end">` at line 768. Header contains only: hamburger toggle (mobile) + ✈️ title + ⚙️ gear summary button.
- AC2 (layout stability): ✅ `h-[65vh]` + `min-h-[30rem]` + `max-h-[46rem]` on embedded mode (lines 683–685). `bg-base-100` + border treatment for embedded card (lines 674–677). Quick-action chips use `btn-xs` + `overflow-x-auto` + `pb-1` for embedded (lines 927–922).
- AC3 (streaming indicator): ✅ `isStreaming && msg.id === lastVisibleMessageId` condition (line 903) inside last assistant bubble. `lastVisibleMessageId` is a reactive derived value from `visibleMessages` (line 599) — stays current throughout stream.
- AC4 (sidebar default): ✅ `let sidebarOpen = !embedded;` (line 63). Sidebar CSS `{sidebarOpen ? '' : 'hidden'} lg:flex` (line 691) — starts hidden in embedded mode on mobile/tablet, always visible on lg+ (correct responsive pattern). Toggle button is `lg:hidden` (line 739).
- AC5 (existing features preserved): ✅ Tool result rendering, conversation management, date selector modal, quick actions, send button states unchanged.

**Adversarial pass findings:**

1. **Hypothesis: desktop (lg+) embedded layout still crushes content because sidebar is always visible via `lg:flex`.** Expected: content area unusable. Observed: `lg:flex` overrides `hidden` on lg+ — this is the intended responsive pattern. On lg+ screens there is enough horizontal space for sidebar (`w-60`) + chat content. `min-w-0` on chat panel prevents overflow. **Not a defect; designed behavior confirmed by reviewer.**

2. **Hypothesis: `<details>` settings dropdown doesn't close on outside click — user trapped.** Expected: frustration UX. Observed: DaisyUI `<details>` requires another click on summary to close. `settingsOpen = false` init confirmed (line 80). **Non-blocking UX inconvenience; pre-existing SUGGESTION from reviewer, not a blocking defect.**

3. **Hypothesis: `lastVisibleMessageId` becomes stale during streaming, causing indicator to appear on wrong message.** Expected: indicator shows on previous message. Observed: `lastVisibleMessageId` is reactive (`$:` at line 599) — updates synchronously when `visibleMessages` changes. No stale-closure risk. **Not vulnerable.**

4. **Hypothesis: `visibleMessages` filter excludes only `role=tool` — if all messages are tool messages, `lastVisibleMessageId` is `undefined` and streaming indicator never shows.** Expected: silent stream with no indicator. Observed: In practice, every send appends a `user` message and then an `assistant` streaming message — there will always be a non-tool message for the indicator to attach to. **Acceptable; degenerate case impossible in normal flow.**

5. **Hypothesis: `aria-label` hardcoded English strings at lines 743 and 771 violate i18n convention.** Expected: non-English users see English screen-reader labels. Observed: lines 743 (`'Hide conversations'`/`'Show conversations'`) and 771 (`"AI settings"`) are hardcoded. **Low-severity SUGGESTION from reviewer — non-blocking, accessibility-only impact.**

**MUTATION_ESCAPES: 0/4** — all critical logic paths for this UX-only feature are covered by the responsive CSS (no off-by-one possible) and the reactive `lastVisibleMessageId` derivation.

**FLAKY: 0**

**COVERAGE: N/A** — No automated test suite for frontend component; all validation via code inspection + lead browser evidence.

**Residual low-priority items (follow-up suggested, not blocking):**
- `aria-label` values at lines 743 and 771 should use `$t()` per i18n convention.
- `<details>` dropdown does not auto-close on outside click (SUGGESTION from reviewer).

## Completion Note — `chat-tool-grounding-and-confirmation` (2026-03-09)

- `send_message()` trip context now injects the active collection UUID with explicit instruction that it is the `collection_id` for `get_trip_details` and `add_to_itinerary`, reducing wrong-trip-id hallucinations.
- System prompt itinerary guidance now requires confirmation only before the first `add_to_itinerary` action; after explicit user approval phrases (e.g., "yes", "go ahead", "add them", "just add things there"), the assistant is instructed to stop re-confirming and call tools directly.
- Tool error wording was tightened to align with required-arg short-circuit behavior: `get_trip_details` inaccessible/missing trips now return a required-arg-style `collection_id` error string, and `get_weather` empty dates now return `"dates is required"`.
- Review verdict (2026-03-09): **APPROVED** (score 3). One WARNING: `get_trip_details` DoesNotExist error `"collection_id is required and must reference a trip you can access"` conflates missing-param and invalid-value semantics — may mislead LLM into thinking param was omitted rather than wrong. Does NOT create false-positive short-circuit (regex `fullmatch` correctly rejects the trailing clause). Closes prior known gap: `get_weather` "dates must be a non-empty list" now "dates is required" (matches regex). See [decisions.md](../decisions.md#correctness-review-chat-tool-grounding-and-confirmation).

## Tester Validation — `chat-tool-grounding-and-confirmation` (2026-03-09)

### STATUS: PASS

**Test run:** `docker compose exec server python3 manage.py test chat integrations --keepdb` — 5/5 PASS. Full Django baseline 24/30 (6 pre-existing failures unchanged; zero new regressions).

**Standard pass findings:**

- UUID context injection confirmed: `send_message()` lines 255–258 append `"Collection UUID (use this exact collection_id for get_trip_details and add_to_itinerary): {collection.id}"` into `context_parts`, embedded in `system_prompt` (lines 295–296). UUID appears in the `role=system` message on every conversation turn.
- Authorization gate confirmed: UUID injection block is inside `if collection:` (line 255); `collection` is only assigned when lookup succeeds AND user is owner or `shared_with` member (lines 244–251). Unauthorized collection_id → `collection = None` → block skipped.
- System prompt confirmation guidance verified (`llm_client.py:340–341`): confirms only before first `add_to_itinerary` action; after user approval phrases ("yes", "go ahead", "add them", "just add things there"), stops re-confirming.
- Regex validation — 11 test cases all pass:
  - `"collection_id is required"` → **True** (short-circuits)
  - `"collection_id is required and must reference a trip you can access"` → **False** (DoesNotExist; `fullmatch` rejects trailing clause — no false short-circuit)
  - `"dates is required"` → **True** (prior `chat-loop-hardening` gap now **RESOLVED**)
  - All legacy required-arg strings continue matching; non-matching strings correctly return False.
- `get_weather` empty dates: string changed from `"dates must be a non-empty list"` to `"dates is required"` — now matches regex and short-circuits. Prior known gap closed.

**Adversarial pass findings:**

1. **Unauthorized collection_id leaks UUID?** `if collection:` gate prevents injection when lookup fails/unauthorized. **NOT VULNERABLE.**
2. **DoesNotExist error creates false-positive short-circuit?** `fullmatch` returns `False` for trailing text. **NOT VULNERABLE.**
3. **UUID grounding lost between turns?** UUID is in `system_prompt` (role=system), rebuilt fresh on every `send_message`. **Grounding persists for entire conversation.**
4. **Null collection_id crashes injection block?** `if collection_id:` at line 242 gates the lookup; null → block skipped. **NOT VULNERABLE.**
5. **Shared member gets UUID in context but `get_trip_details` fails (filter excludes shared_with)?** Confirmed pre-existing bug: `get_trip_details` filters `user=user` only. Shared members get UUID context but tool returns DoesNotExist. Does not short-circuit (trailing text); falls to `MAX_TOOL_ITERATIONS`. **PRE-EXISTING, LOW severity, not introduced here.**
6. **`get_weather` short-circuit gap (prior MUTATION_ESCAPE) resolved?** Confirmed resolved — new `"dates is required"` string matches regex.

**MUTATION_ESCAPES: 0/5** — all mutation checks detected. DoesNotExist false-positive (reviewer WARNING) confirmed benign.

**FLAKY: 0**

**COVERAGE: N/A** — No automated test suite for `chat` app. All validation via in-container regex checks + lead's live-run evidence. Recommended follow-up: add Django TestCase for (a) UUID context injection with authorized vs unauthorized collection_id, (b) DoesNotExist path does not trigger short-circuit, (c) empty dates triggers short-circuit.

**Non-blocking known issues (accepted, pre-existing):** `get_trip_details` DoesNotExist wording semantically ambiguous (reviewer WARNING); `get_trip_details` excludes shared-collection members from `filter(user=user)` — both pre-existing, not introduced by this feature.