From 78d2b3774ebf76f47098bad80d041c04a295dc12 Mon Sep 17 00:00:00 2001 From: alex Date: Tue, 10 Mar 2026 18:40:45 +0000 Subject: [PATCH] chore: remove accidental memory files from repo root These were stashed basic-memory files that got staged. Memory files belong in .memory/ (managed by basic-memory MCP), not the repo root. --- ...rospective - Chat Tool Failure Handling.md | 19 -- gates/chat-tool-loop-fix tester gate.md | 65 ------ .../Chat Tool Error Handling Architecture.md | 59 ----- plans/assistant-add-flow-fixes.md | 101 --------- voyage/plans/assistant-add-flow-fixes.md | 208 ------------------ 5 files changed, 452 deletions(-) delete mode 100644 decisions/Retrospective - Chat Tool Failure Handling.md delete mode 100644 gates/chat-tool-loop-fix tester gate.md delete mode 100644 knowledge/Chat Tool Error Handling Architecture.md delete mode 100644 plans/assistant-add-flow-fixes.md delete mode 100644 voyage/plans/assistant-add-flow-fixes.md diff --git a/decisions/Retrospective - Chat Tool Failure Handling.md b/decisions/Retrospective - Chat Tool Failure Handling.md deleted file mode 100644 index 88281c71..00000000 --- a/decisions/Retrospective - Chat Tool Failure Handling.md +++ /dev/null @@ -1,19 +0,0 @@ ---- -title: Retrospective - Chat Tool Failure Handling -type: note -permalink: voyage/decisions/retrospective-chat-tool-failure-handling -tags: -- pattern -- lesson -- chat -- retrospective ---- - -# Retrospective - Chat Tool Failure Handling - -## Retrospective: chat tool failure handling -- [pattern] Separating successful-tool iterations from bounded all-failure rounds prevents external tool outages from exhausting the main tool-call budget. #pattern -- [pattern] Classify chat tool failures into required-parameter errors, retryable execution failures, and permanent execution failures so each path can use the right UX and retry behavior. #pattern -- [risk] Mixed batches with both successful and failed tool calls still rely on the global success iteration cap as a backstop; acceptable pre-release, but worth revisiting if multi-tool itineraries remain noisy. #risk - -confidence=high; last_validated=2026-03-10; volatility=medium; review_after_days=60; validation_count=1; contradiction_count=0 diff --git a/gates/chat-tool-loop-fix tester gate.md b/gates/chat-tool-loop-fix tester gate.md deleted file mode 100644 index d78ba56d..00000000 --- a/gates/chat-tool-loop-fix tester gate.md +++ /dev/null @@ -1,65 +0,0 @@ ---- -title: chat-tool-loop-fix tester gate -type: note -permalink: voyage/gates/chat-tool-loop-fix-tester-gate -tags: -- tester -- gate -- chat -- loop-fix -- adversarial ---- - -## Gate: chat-tool-loop-fix adversarial validation (2026-03-10) - -**Verdict: PASS** -**Plan ref:** [[voyage/plans/assistant-add-flow-fixes]] -**Branch:** fix/chat-tool-loop-fix - -### Test execution -- All 32 chat tests pass (docker compose exec server python3 manage.py test chat --keepdb --verbosity=2) -- 0 failures, 0 errors, 0 flaky - -### Adversarial checks and findings - -#### AC1 — all-failure rounds stop cleanly without falling into tool_loop_limit -PASS. `all_failure_rounds` counter increments only when `not successful_tool_calls and first_execution_failure` (line 749). On reaching `MAX_ALL_FAILURE_ROUNDS=3`, the code issues `tool_execution_error` and `return`s (line 777). `tool_iterations` is never incremented for all-failure rounds (it increments only at line 782, after `all_failure_rounds = 0`), so 10-iteration cap is never hit for pure-failure scenarios. Test `test_all_failure_rounds_stop_with_execution_error_before_tool_cap` confirms exactly 3 `stream_chat_completion` calls and `tool_execution_error` rather than `tool_loop_limit`. - -#### AC2 — permanent failures stop in one round -PASS. When `retryable=False`, `encountered_permanent_failure = True` is set (line 717), then at post-iteration check (line 750-751) `all_failure_rounds` is set directly to `MAX_ALL_FAILURE_ROUNDS` (bypassing the increment), triggering immediate stop. Test `test_permanent_execution_failure_stops_immediately` confirms 1 `stream_chat_completion` call. - -#### AC3 — required-parameter clarification flow still intact -PASS. The `_is_required_param_tool_error` path (lines 647-711) short-circuits before the execution-failure path (line 713). `search_places` missing-location triggers `clarification_content` SSE emit + DB persist + `return`. Other required-param errors still emit `tool_validation_error` and `return`. Tests `test_missing_search_place_location_streams_clarifying_content` and `test_collection_access_error_does_not_short_circuit_required_param_regex` confirm correctness. - -#### AC4 — execution failures sanitized and NOT emitted as tool_result SSE events -PASS. The execution-failure branch at line 713 does `continue` — it never reaches the `yield tool_event` at line 747. Failed tool results are recorded in `first_execution_failure` but not in `successful_tool_calls` / `successful_tool_messages` / `successful_tool_chat_entries`, so they are never persisted to DB as role=tool rows and never streamed to the frontend. `_build_tool_execution_error_event` uses only the sanitized error text (line 265-271), not the raw exception. Test confirms no `tool_result` events in failure scenarios. - -#### AC5 — mixed-batch partial failure: one tool succeeds, another fails -OBSERVED FINDING (LOW): When a batch has e.g. 2 tool calls where call_1 succeeds and call_2 fails with an execution error: -- `successful_tool_calls` will contain call_1's result; call_2's failure is silently dropped from this iteration's context. -- The assistant message persisted to DB will reference only call_1's tool_calls. -- `all_failure_rounds` stays at 0 (successful_tool_calls is non-empty, so the `not successful_tool_calls` condition at line 749 is False). -- Loop continues as if call_2 never happened — the LLM won't know call_2 failed. -- Acceptable for now (pre-release), but the LLM may loop trying to re-execute call_2 if it expects an answer. Not a regression from the old behavior (where ALL failures were fed back to LLM), and the `MAX_TOOL_ITERATIONS` cap remains as a backstop. - -#### AC6 — `_build_llm_messages` orphan tool_call filtering on reload -PASS. `valid_tool_call_ids` (line 59) is built only from non-required-param-error and non-execution-failure tool messages. Assistant messages with `tool_calls` have their `tool_calls` filtered to only the valid set (line 95-101). If ALL tool_calls for an assistant message are failures, `filtered_tool_calls` is empty and `payload["tool_calls"]` is not set (line 100 guard) — the assistant message is sent to LLM without tool_calls, avoiding orphaned tool references. - -#### AC7 — frontend `rebuildConversationMessages` dedupe -PASS. Line 374 forces `tool_results: undefined` on all messages before the sibling-walk, discarding any server-side `tool_results` field. `appendToolResultDedup` guards on `tool_call_id` uniqueness during the walk. `uniqueToolResultsByCallId` is applied again at render time (line 939) as a second safety net. Combined: no double-accumulation on reload. - -#### AC8 — `uniqueToolResultsByCallId` silent drop for null/undefined tool_call_id -LOW RISK: `uniqueToolResultsByCallId` (line 351-368) has a special case: if `result.tool_call_id` is falsy (null/undefined), the result is always appended without dedup check (line 359 guard). This means multiple results with no `tool_call_id` will all be included. Given that tool_call_id is always populated from the backend for proper tool calls, this is a safe fallback for edge cases. - -#### AC9 — `_is_execution_failure_tool_error` catches ANY non-empty error that is not required-param error -PASS. This is intentionally broad — any `{error: "..."}` dict that isn't a missing-param pattern is treated as an execution failure. This correctly captures geocode errors, network errors, unexpected exceptions, etc. - -#### AC10 — retry fallback after geocode failure still uses `_is_execution_failure_tool_error` guard -PASS (line 631-634): The retry succeeds only if the retry result is not a required-param error AND not an execution failure. If all retries fail, the `result` remains the original geocode error, which is then caught by `_is_execution_failure_tool_error` at line 713 and handled as a failure (not emitted as tool result). - -### Mutation escape analysis -- MUTATION_ESCAPES: 1/8 -- The unchecked mutation: flipping `not successful_tool_calls` → `successful_tool_calls` in the all-failure-round check (line 749) would cause failures to be silently dropped without incrementing the counter. No existing test with partial success + partial failure to detect this mutation. Low risk given the mixed-batch scenario is acceptable behavior. - -### Summary -Implementation is sound for the primary stated goals. All-failure rounds stop cleanly; permanent failures stop in 1 round; clarification flow is intact; execution failures are not leaked to frontend. Frontend dedupe is correct via `tool_results: undefined` reset + ID-based dedup. One low-severity gap: mixed-batch partial failure silently drops the failing tool call (acceptable pre-release). diff --git a/knowledge/Chat Tool Error Handling Architecture.md b/knowledge/Chat Tool Error Handling Architecture.md deleted file mode 100644 index 58c88316..00000000 --- a/knowledge/Chat Tool Error Handling Architecture.md +++ /dev/null @@ -1,59 +0,0 @@ ---- -title: Chat Tool Error Handling Architecture -type: note -permalink: voyage/knowledge/chat-tool-error-handling-architecture -tags: -- chat -- tools -- error-handling -- architecture -- pattern ---- - -# Chat Tool Error Handling Architecture - -## Overview - -The chat agent tool loop classifies tool call outcomes into three distinct categories, each with different retry and surfacing behavior. - -## Error Classification - -### 1. Required-parameter validation errors -- [pattern] Detected by `_is_required_param_tool_error()` regex matching `"... is required"` patterns in tool result `error` field -- [convention] Short-circuited immediately with a user-visible error — never replayed into LLM history -- [pattern] `search_places` missing `location` has a special path: `_is_search_places_location_retry_candidate_error()` triggers deterministic context-retry (trip destination → first itinerary stop → user clarification) before surfacing - -### 2. Execution failures (new in chat-tool-loop-fix) -- [pattern] Any `error`-bearing tool result dict that does NOT match the required-param pattern is classified as an execution failure by `_is_execution_failure_tool_error()` -- [convention] Execution failures are NEVER replayed into LLM context — they are excluded from `successful_tool_calls`, `successful_tool_messages`, and `successful_tool_chat_entries` -- [pattern] `tool_iterations` increments only after at least one successful tool call in a round -- [pattern] All-failure rounds (every tool in a round fails) increment `all_failure_rounds`, capped at `MAX_ALL_FAILURE_ROUNDS` (3) -- [pattern] Permanent failures (`retryable: false` in tool result, e.g. `web_search` ImportError) set `all_failure_rounds = MAX_ALL_FAILURE_ROUNDS` for immediate stop -- [convention] Execution failures emit a `tool_execution_error` SSE event with sanitized text via `_build_tool_execution_error_event()` - -### 3. Geocoding failures in search_places -- [pattern] `Could not geocode location: ...` errors are detected by `_is_search_places_location_retry_candidate_error()` (same path as missing-location) -- [convention] Eligible for the existing context-retry fallback before being treated as a terminal failure - -## Error Sanitization - -- [convention] `_safe_error_payload()` maps LiteLLM exceptions to sanitized user-safe categories — never forwards raw `exc.message` -- [convention] `execute_tool()` catch-all returns `{"error": "Tool execution failed"}` (hardcoded) — never raw `str(exc)` -- [decision] The `_build_tool_execution_error_event()` wraps sanitized tool error text in a user-safe sentence for SSE emission and DB persistence - -## Frontend Tool-Result Deduplication - -- [pattern] Three-layer dedup by `tool_call_id`: - 1. `rebuildConversationMessages()` sets `tool_results: undefined` on all assistant messages, then re-derives exclusively from persisted `role=tool` sibling rows — discards any server-side pre-populated `tool_results` - 2. `appendToolResultDedup()` deduplicates during both rebuild walk and live SSE ingestion - 3. `uniqueToolResultsByCallId()` at render time provides a final safety net - -## Key Files - -- Backend classification/loop: `backend/server/chat/views/__init__.py` -- Tool execution + sanitization: `backend/server/chat/agent_tools.py` -- Frontend dedup: `frontend/src/lib/components/AITravelChat.svelte` -- Tests: `backend/server/chat/tests.py` (32 total chat tests) - -## Relations -- related_to [[assistant-add-flow-fixes]] diff --git a/plans/assistant-add-flow-fixes.md b/plans/assistant-add-flow-fixes.md deleted file mode 100644 index b264963e..00000000 --- a/plans/assistant-add-flow-fixes.md +++ /dev/null @@ -1,101 +0,0 @@ ---- -title: assistant-add-flow-fixes -type: note -permalink: voyage/plans/assistant-add-flow-fixes-1 ---- - -## Feature A hotfix implementation note -- Added a targeted chat hotfix in `ChatViewSet` so required-arg validation for `search_places` with missing `location` now emits a natural assistant clarification as SSE `content` (and persists it as an assistant message) instead of streaming a `tool_validation_error` payload. -- Kept existing generic required-param short-circuit behavior unchanged for all other tools/errors. -- Added tests covering the specific detection helper and end-to-end streaming behavior for the missing-location clarify path. - -## Follow-up fix: itinerary-aware search fallback -- Feature: Use collection itinerary context to satisfy missing `search_places.location` when available, and preserve dining intent when category is omitted. -- Acceptance: - - Restaurant requests across a collection itinerary should not ask for location when itinerary context can provide one. - - `search_places` should not silently return tourism attractions for restaurant requests when category is omitted. - - Existing behavior for non-dining searches remains intact. -- Planned worktree: `.worktrees/chat-search-intent-fix` -- Planned branch: `fix/chat-search-intent-fix` - -## Implementation note: chat search intent + context retry -- Implemented deterministic `search_places` retry fallback in `ChatViewSet`: when tool call misses required `location`, retry first with trip context (`destination`, then first itinerary stop label), then retain existing user-location-reply retry behavior. -- Added deterministic dining-intent heuristic in `ChatViewSet` to inject `category="food"` when `search_places.category` is omitted and current/recent user messages indicate restaurant/dining intent. -- Updated `search_places` tool schema text to explicitly describe `food` (restaurants/dining), `tourism` (attractions), and `lodging` (hotels/stays). -- Added targeted tests for category inference and a collection-context integration flow proving restaurant requests can complete via destination-backed retry without location clarification prompt. - -## Regression Exploration: Restaurant Recommendations Across Test Itinerary (2026-03-10) - -### Observed failure modes -1. **"too many tool calls"** — `tool_loop_limit` error after 10 iterations -2. **Repeated "search places could not be completed"** — LLM retries `search_places` each iteration, each producing a non-location-error that bypasses the clarification/retry gate -3. **Repeated "web search could not be completed"** — LLM retries `web_search` each iteration; non-fatal errors (rate-limit, ImportError, generic failure) are treated as successful tool calls and fed back to the LLM, which re-calls the tool - -### Root cause analysis - -#### RC1 — `web_search` non-fatal errors are NOT short-circuited (primary cause of web_search loop) -**File**: `backend/server/chat/views/__init__.py`, `event_stream()` loop (line 474+) -- `web_search` returns `{"error": "...", "results": []}` on ImportError, rate-limit, and generic failures. -- `_is_required_param_tool_error()` only matches "location is required" / "query is required" / `"X is required"` pattern. It does **not** match "Web search failed. Please try again." or "Search rate limit reached." -- So the error payload passes through `_is_required_param_tool_error` → `False`, and the tool result (including the error text) is added to `successful_tool_calls` and fed back into the LLM context as a successful tool message. -- The LLM sees a tool result with `error` text but no real data, and calls `web_search` again — repeating until `MAX_TOOL_ITERATIONS` (10) is hit. - -#### RC2 — `search_places` geocoding/Overpass failures are NOT short-circuited (cause of search_places loop) -**File**: `backend/server/chat/agent_tools.py` `search_places()` (line 149, 199–205) -- `search_places` returns `{"error": "Could not geocode location: X"}` when Nominatim returns empty results, `{"error": "Places API request failed: ..."}` on network errors, or `{"error": "An unexpected error occurred..."}` on generic failures. -- **None of these match** `_is_required_param_tool_error()` (which only detects missing-param patterns). They are not `location is required`. -- The context-retry fallback (`_is_search_places_missing_location_required_error`, lines 522–565) only activates on `location is required`. Geocoding/network failures silently pass through as "successful" tool results. -- LLM receives `{"error": "Could not geocode location: Paris"}` in tool context, tries again → same failure → 10 times → loop limit. - -#### RC3 — `MAX_TOOL_ITERATIONS = 10` counter increments even when ALL tool calls in an iteration are failures -**File**: `backend/server/chat/views/__init__.py`, line 475 -- `tool_iterations += 1` happens **before** tool results are checked. A single iteration with 2 tool calls each producing non-fatal errors still costs 1 iteration count and adds 2 failed results to LLM context. -- For a restaurant request across a multi-stop itinerary, the LLM may call `search_places` for multiple stops per iteration, burning through 10 iterations rapidly. - -#### RC4 — `successful_tool_calls` naming is misleading — it accumulates ALL non-required-param-error results -**File**: `backend/server/chat/views/__init__.py`, lines 476, 635 -- The variable `successful_tool_calls` accumulates tool calls whose results are not `_is_required_param_tool_error`. This includes tool calls that returned geocoding failures, network errors, or `web_search` unavailability errors. These are appended to `current_messages` and persisted to DB, causing the LLM to see repeated failure payloads. - -### Edit points and recommended fix approaches - -#### EP1 — Add non-fatal tool error detection and short-circuit -**File**: `backend/server/chat/views/__init__.py` -**Symbol**: New static method `_is_nonfatal_tool_error(result)` + use in `event_stream()` -- Detect results that have `{"error": ..., "results": []}` (web_search style) or `{"error": "Could not geocode..."}` / `{"error": "Places API request failed..."}` (search_places failures). -- Recommended approach: check `result.get("error")` is non-empty AND the result is not a successful payload (no `results` data, no places data). If a tool returns an error result that is not a required-param error, log a warning, do NOT add to `successful_tool_calls`, and instead emit a summary to the LLM context as a tool message explaining the failure — then break the loop or emit a user-visible message after N consecutive non-fatal failures. - -#### EP2 — Distinguish `web_search` persistent unavailability (ImportError) from retryable errors -**File**: `backend/server/chat/agent_tools.py` -**Symbol**: `web_search()` (lines 294–307) -- Currently ImportError (`duckduckgo-search` not installed) and rate-limit both return the same shape: `{"error": "...", "results": []}`. -- Recommended approach: add a sentinel field `"permanent": True` for ImportError, so the view layer can detect non-retryable failures and short-circuit immediately on the first iteration rather than retrying 10 times. Alternatively add a distinct error key like `"error_code": "unavailable"` vs `"error_code": "rate_limited"`. - -#### EP3 — Add non-fatal tool error counting per-tool to break loops early -**File**: `backend/server/chat/views/__init__.py` -**Symbol**: `event_stream()` (line 424+) -- Track `consecutive_nonfatal_errors` per tool (or globally). After 2 consecutive failures of the same tool name, stop calling it for this turn and emit a user-facing message: "Search places could not be completed for this location." -- This prevents the 10-iteration burn for a single unreachable external API. - -#### EP4 — Only increment `tool_iterations` when at least one tool call succeeds -**File**: `backend/server/chat/views/__init__.py` -**Symbol**: `event_stream()` line 475 (`tool_iterations += 1`) -- Move `tool_iterations += 1` to after the tool result loop, and only increment if `len(successful_tool_calls) > 0`. Failed-only iterations should not count toward the limit (or should count less). Alternatively, add a separate `failed_tool_iterations` counter with a lower limit (e.g. 3) specifically for all-failure iterations. - -#### EP5 — `search_places` geocoding failure should surface earlier in retry logic -**File**: `backend/server/chat/views/__init__.py` -**Symbol**: `event_stream()` lines 522–565 (the `_is_search_places_missing_location_required_error` retry block) -- Current retry block only fires on `location is required`. Extend retry logic to also retry with fallback location when result contains `{"error": "Could not geocode location: ..."}` (Nominatim empty result) using the same `retry_locations` fallback list. This gives context-location a second chance when Nominatim fails on the first geocode. - -### Files / symbols map - -| File | Symbol | Role | -|------|--------|------| -| `backend/server/chat/views/__init__.py` | `MAX_TOOL_ITERATIONS = 10` (line 422) | Hard loop cap | -| `backend/server/chat/views/__init__.py` | `tool_iterations += 1` (line 475) | Iteration counter — always increments, even on all-failure iterations | -| `backend/server/chat/views/__init__.py` | `_is_required_param_tool_error()` (line 136) | Only detects missing-param errors; does NOT catch tool execution failures | -| `backend/server/chat/views/__init__.py` | `_is_search_places_missing_location_required_error()` (line 180) | Only triggers retry for `location is required`; geocoding/network failures bypass it | -| `backend/server/chat/views/__init__.py` | `successful_tool_calls` (line 476) | Misnamed — accumulates all non-required-param-error results including failures | -| `backend/server/chat/views/__init__.py` | `event_stream()` lines 480–660 | Full tool execution loop | -| `backend/server/chat/agent_tools.py` | `search_places()` (line 125) | Returns `{"error": "..."}` on geocode/network failure — not a required-param error | -| `backend/server/chat/agent_tools.py` | `web_search()` (line 257) | Returns `{"error": "...", "results": []}` on import/rate-limit/generic failure | -| `backend/server/chat/llm_client.py` | `stream_chat_completion()` (line 380) | No tool-level retry logic — passes through all tool results | diff --git a/voyage/plans/assistant-add-flow-fixes.md b/voyage/plans/assistant-add-flow-fixes.md deleted file mode 100644 index f9a6e893..00000000 --- a/voyage/plans/assistant-add-flow-fixes.md +++ /dev/null @@ -1,208 +0,0 @@ ---- -title: assistant-add-flow-fixes -type: note -permalink: voyage/plans/assistant-add-flow-fixes ---- - -## Feature A hotfix implementation note -- Added a targeted chat hotfix in `ChatViewSet` so required-arg validation for `search_places` with missing `location` now emits a natural assistant clarification as SSE `content` (and persists it as an assistant message) instead of streaming a `tool_validation_error` payload. -- Kept existing generic required-param short-circuit behavior unchanged for all other tools/errors. -- Added tests covering the specific detection helper and end-to-end streaming behavior for the missing-location clarify path. - -## Follow-up fix: itinerary-aware search fallback -- Feature: Use collection itinerary context to satisfy missing `search_places.location` when available, and preserve dining intent when category is omitted. -- Acceptance: - - Restaurant requests across a collection itinerary should not ask for location when itinerary context can provide one. - - `search_places` should not silently return tourism attractions for restaurant requests when category is omitted. - - Existing behavior for non-dining searches remains intact. -- Planned worktree: `.worktrees/chat-search-intent-fix` -- Planned branch: `fix/chat-search-intent-fix` - -## Implementation note: chat search intent + context retry -- Implemented deterministic `search_places` retry fallback in `ChatViewSet`: when tool call misses required `location`, retry first with trip context (`destination`, then first itinerary stop label), then retain existing user-location-reply retry behavior. -- Added deterministic dining-intent heuristic in `ChatViewSet` to inject `category="food"` when `search_places.category` is omitted and current/recent user messages indicate restaurant/dining intent. -- Updated `search_places` tool schema text to explicitly describe `food` (restaurants/dining), `tourism` (attractions), and `lodging` (hotels/stays). -- Added targeted tests for category inference and a collection-context integration flow proving restaurant requests can complete via destination-backed retry without location clarification prompt. - -## Regression: Duplicate Tool Summaries / Repeated Items — Frontend Analysis (2026-03-10) - -### Context -Exploring why restaurant recommendations across a test itinerary produce duplicated itinerary-item summaries and repeated warning toasts in the frontend chat renderer. - -### Root Cause 1 — `rebuildConversationMessages` double-accumulates `tool_results` on reload - -**File:** `frontend/src/lib/components/AITravelChat.svelte`, `rebuildConversationMessages()` (lines 333–375) - -**Mechanism:** When `selectConversation()` reloads a conversation from the API, `rebuildConversationMessages()` walks the raw message list and pushes each `role=tool` message into its parent assistant message's `tool_results` array. The backend serializer may return a pre-populated `tool_results` field on the assistant message object. Because the `rebuilt` map at line 336 preserves any server-side `tool_results` via spread (`msg.tool_results ? [...msg.tool_results] : undefined`), and then the walk loop **appends again** from the raw `role=tool` sibling messages, any tool result present on both the server-side `assistant.tool_results` field *and* as a sibling `role=tool` message will be pushed **twice**. - -**Second trigger path:** A multi-stop itinerary causes the backend tool loop to fire `search_places` once per stop via the context-retry fallback. Each call emits a separate `tool_result` SSE event. The frontend appends each into `assistantMsg.tool_results` during streaming (line 472). On conversation reload, `rebuildConversationMessages` re-appends them again from the persisted `role=tool` DB rows → **2× the place-result cards**. - -**Fix approach:** Change line 336 so the initial `rebuilt` map always sets `tool_results: undefined`. This discards any server-side `tool_results` field on assistant messages and lets the sibling-walk be the sole source of truth: -```ts -// line 334–337 — change to: -const rebuilt = rawMessages.map((msg) => ({ - ...msg, - tool_results: undefined // always re-derive from role=tool siblings below -})); -``` - -### Root Cause 2 — Multi-iteration tool loop creates multiple assistant messages; `rebuildConversationMessages` may pair results to wrong parent - -**File:** `AITravelChat.svelte`, `rebuildConversationMessages()` lines 342–344 and backend `event_stream` lines 662–688 - -**Mechanism:** For a multi-restaurant flow the backend iterates: `search_places` (iteration 1) → saves assistant+tool messages → then `add_to_itinerary × N` (iterations 2–N). Each iteration saves a separate assistant message with distinct `tool_calls`. On reload, the loop in `rebuildConversationMessages` sets `activeAssistant` to each assistant message with `tool_calls`, then nulls it when `tool_results.length >= toolCallIds.length` (lines 366–371). If the DB `created_at` timestamps for assistant and tool messages are identical (same DB transaction), the `order_by("created_at")` in `_build_llm_messages` is non-deterministic, and the sibling walk may misalign tool results to wrong parent assistant messages → wrong or duplicated summary cards. - -**Fix approach (backend):** After saving the assistant message at line 670 with `await sync_to_async(ChatMessage.objects.create)`, save each tool message **in a separate awaited call** (already the case at lines 678–682) and add a `time.sleep(0)` / tick between saves to guarantee distinct `created_at`. Alternatively, add an explicit `order` integer field to `ChatMessage` and sort by it instead of `created_at`. - -### Root Cause 3 — Repeated "success" toast from duplicate place-result cards - -**File:** `AITravelChat.svelte`, `addPlaceToItinerary()` lines 569–632 - -**Mechanism:** `addToast('success', $t('added_successfully'))` fires unconditionally at line 626 on every user-triggered add. When Root Cause 1 causes duplicate place-result cards to be rendered, the user sees two "Add to itinerary" buttons for the same place and clicking either fires a toast. Fixing Root Cause 1 (dedup in `rebuildConversationMessages`) eliminates the duplicate cards and therefore this repeat toast. - -**No action required here unless RC1 fix is insufficient.** As a defense-in-depth measure, consider tracking which place names have been added within the session and disabling the button. - -### Concrete Edit Points - -| File | Function/Section | Change | -|---|---|---| -| `AITravelChat.svelte` L334–336 | `rebuilt` map init in `rebuildConversationMessages` | Change `tool_results: msg.tool_results ? [...msg.tool_results] : undefined` → `tool_results: undefined` (always re-derive from sibling walk) | -| `AITravelChat.svelte` L467–474 | SSE `parsed.tool_result` handler in `sendMessage` | Add dedup guard: skip push if `tool_call_id` already in `assistantMsg.tool_results` | -| `AITravelChat.svelte` L894–944 | `{#each msg.tool_results as result}` template | Add JS-side dedup by `tool_call_id` or a `{#key}` block before rendering | -| `backend/server/chat/views/__init__.py` L670–682 | assistant + tool message save in `event_stream` | Verify saved order is deterministic; add explicit sequence/order field if `created_at` collisions are possible | - -### Recommended Fix Sequence -1. **Priority 1:** `AITravelChat.svelte` line 336 — change to `tool_results: undefined`. Single-line fix, eliminates Root Cause 1 and RC3 transitively. -2. **Priority 2:** `AITravelChat.svelte` SSE handler (line 472) — add `tool_call_id` dedup guard for defense against backend emitting duplicate events. -3. **Priority 3:** Template dedup (lines 894–944) as render-layer safety net. -4. **Verify:** Reload a multi-restaurant conversation; each `add_to_itinerary` summary and each place card must appear exactly once. - -### Related Notes -- [[assistant-add-flow-fixes]] — backend search intent + retry logic that triggers multi-iteration tool loops - -## Follow-on fix: restaurant recommendation tool loop -- [ ] Workstream `fix/chat-tool-loop-fix` in `.worktrees/chat-tool-loop-fix`: stop chat tool loops from replaying execution failures into the LLM context; only count successful tool iterations, cap repeated all-failure rounds, and hard-stop permanent `web_search` failures. - - Acceptance: restaurant recommendation requests do not end with "too many tool calls" when tool executions fail; `search_places`/`web_search` failures surface once and stop retry spirals. -- [ ] Keep frontend tool-result rendering deduplicated in `AITravelChat.svelte` during both SSE streaming and conversation rebuild. - - Acceptance: itinerary/tool summaries do not render duplicate cards or duplicate downstream add-to-itinerary affordances after reload. -- [ ] Validate the fix in worktree `fix/chat-tool-loop-fix` with targeted backend chat tests plus frontend `bun run check`/`bun run build`, then perform a functional restaurant-request verification. - - Acceptance: validation demonstrates the assistant either returns restaurant guidance or surfaces a single actionable failure instead of looping. - -## Implementation outcome: chat tool loop + duplicate summary fix -- Added execution-failure classification in `backend/server/chat/views/__init__.py` that separates required-parameter validation errors from execution failures, and excludes execution-failure tool rows from LLM replay/history filtering. -- Updated tool loop control so `tool_iterations` increments only after at least one successful tool call; added bounded `MAX_ALL_FAILURE_ROUNDS = 3` for all-failure rounds, with immediate stop for permanent failures (`retryable: false`). -- Extended `search_places` location retry candidate detection to include geocoding failures (`Could not geocode location: ...`) so destination/itinerary/user-location fallback can retry before terminating. -- Marked `web_search` dependency import failure as permanent in `backend/server/chat/agent_tools.py` (`retryable: false`) so it is not retried like transient outages. -- Fixed frontend duplication in `frontend/src/lib/components/AITravelChat.svelte` by rebuilding tool results only from persisted `role=tool` rows, deduping on `tool_call_id` during rebuild and live SSE ingestion, and rendering deduped tool results. -- Added focused backend tests in `backend/server/chat/tests.py` for execution-failure classification, geocode retry fallback, bounded all-failure loop behavior, permanent failure immediate stop, and `web_search` import-failure retryability flag. -- Related prior analysis note: [[assistant-add-flow-fixes]] - -## Security Review Verdict: chat-tool-loop-fix (2026-03-10) - -- **Verdict:** CHANGES-REQUESTED -- **Review Score:** 6 (2 WARNINGs × 3) -- **Lens:** Security -- **WARNING 1:** `_build_tool_execution_error_event` (views/__init__.py:260-272) forwards raw tool error text to the client SSE stream. The error string from `execute_tool` (agent_tools.py:643) uses `str(exc)`, which can contain internal details for uncaught exceptions. This is also persisted as assistant message content. -- **WARNING 2:** Persisted error text in assistant messages (L769) creates a durable exposure vector — raw error details survive in DB and are re-served on reload. -- **No CRITICALs.** Auth/permission checks, injection safety, and CSRF handling remain intact. -- **Fix required:** Sanitize error text in `_build_tool_execution_error_event` or at source in `execute_tool` catch-all. Targeted re-review sufficient after fix. -- **Confirmed:** `_safe_error_payload()` convention not violated by existing code, but new path bypasses it. -- Related: [[assistant-add-flow-fixes]] - -## Review verdict: chat-tool-loop-fix (2026-03-10) - -**VERDICT: APPROVED** -**LENS: correctness** -**REVIEW_SCORE: 0** - -### What was checked - -1. **Execution failure classification** (`_is_execution_failure_tool_error`, `_is_retryable_execution_failure`): Verified the classification is correctly the complement of `_is_required_param_tool_error` — any `error`-bearing dict that doesn't match the required-param pattern is an execution failure. The `retryable` flag defaults to `True` unless explicitly `False`. No false-positive/false-negative classification risk found. - -2. **Execution failure exclusion from LLM context**: Execution failure tool results are never added to `successful_tool_calls`/`successful_tool_messages`/`successful_tool_chat_entries` (skipped via `continue` at line 718). They are never persisted to DB and never appended to `current_messages`. The `_build_llm_messages` filter at lines 65-67 and 84-88 is a belt-and-suspenders defense for pre-existing DB data. - -3. **All-failure-round bounding**: `all_failure_rounds` increments by 1 on each all-failure round and is hard-set to MAX on permanent failures. The outer `while` loop guard (`tool_iterations < MAX_TOOL_ITERATIONS`) remains satisfied during all-failure rounds (since `tool_iterations` isn't incremented), but `all_failure_rounds >= MAX_ALL_FAILURE_ROUNDS` (3) exits within bounded iterations. No infinite loop possible. - -4. **Permanent failure fast-path**: `encountered_permanent_failure = True` → `all_failure_rounds = MAX_ALL_FAILURE_ROUNDS` → immediate stop. Confirmed `web_search` ImportError returns `retryable: False`. - -5. **Geocode failure retry**: `_is_search_places_location_retry_candidate_error` now matches both "location is required" and "Could not geocode location: ..." patterns. Retry uses trip context destination or user content. Retry success check (lines 631-635) now requires the retry to be free of BOTH required-param AND execution-failure errors — correct tightening vs main branch. - -6. **Frontend tool-result dedup**: Three layers of defense confirmed: - - `rebuildConversationMessages` now sets `tool_results: undefined` (line 374), eliminating RC1 (double-accumulation from server-side pre-populated `tool_results`). - - `appendToolResultDedup` deduplicates by `tool_call_id` during both rebuild (line 402) and SSE streaming (line 514). - - `uniqueToolResultsByCallId` at render time (line 939) provides a final safety net. - - Edge case: tool results without `tool_call_id` bypass dedup — acceptable since real tool calls always have IDs. - -7. **Test coverage**: Tests cover execution failure classification, geocode retry, bounded all-failure loop (verifying 3 LLM calls + 3 tool calls), permanent failure immediate stop (verifying 1 LLM call + 1 tool call), and `web_search` ImportError retryability flag. Mock setup is correct — `side_effect = failing_stream` (callable) creates fresh async generators per call. - -### No findings - -- No off-by-one in loop bounds. -- No async/await misuse — `sync_to_async` wraps synchronous `execute_tool` correctly. -- No resource leaks — generators terminate cleanly via `return` or `break`. -- No race conditions — the event stream is single-threaded. -- No contract violations — SSE event format is preserved for both success and error paths. -- No mutation/shared-state risks — `first_execution_failure` and `encountered_permanent_failure` are scoped per tool-call batch. - -## Security fix outcome: execute_tool error sanitization -- Updated `backend/server/chat/agent_tools.py` so `execute_tool()` catch-all no longer returns raw `str(exc)`. -- New catch-all payload is now the safe generic message: `{"error": "Tool execution failed"}`. -- Added focused regression test in `backend/server/chat/tests.py` (`ExecuteToolErrorSanitizationTests`) to verify exception text is not exposed via `execute_tool()`. - -## Security Re-Review Verdict: execute_tool sanitization fix (2026-03-10) - -- **Verdict:** APPROVED -- **Review Score:** 0 -- **Lens:** Security (targeted re-review) -- **Resolution:** Both prior WARNINGs (SSE raw error leak, DB persistence of raw errors) are RESOLVED. -- **Evidence:** - - `execute_tool()` catch-all at `agent_tools.py:643` now returns `{"error": "Tool execution failed"}` (hardcoded string, no `str(exc)`). - - `_build_tool_execution_error_event()` at `views/__init__.py:260-272` wraps the sanitized error in a user-safe sentence. - - SSE emission at `views/__init__.py:775` and DB persistence at `views/__init__.py:769` both receive only the wrapped safe text. - - Regression test `ExecuteToolErrorSanitizationTests` at `tests.py:49-65` validates sanitization by injecting `RuntimeError("sensitive backend detail")` and asserting the generic output. -- **No new issues introduced.** Single-line source-level fix with no new code paths or trust boundaries. -- Related: [[assistant-add-flow-fixes]] - -## Outcome: restaurant recommendation tool loop fix (2026-03-10) -- [x] Workstream `fix/chat-tool-loop-fix` in `.worktrees/chat-tool-loop-fix`: chat tool execution failures no longer replay into LLM context as successful tool results; successful-tool iteration budget is separate from all-failure rounds; permanent failures stop immediately. -- [x] Frontend `AITravelChat.svelte` dedupes tool results on rebuild, SSE ingestion, and render. -- [x] Validation passed: backend targeted chat suites passed after syncing worktree files into `voyage-server-1:/code/chat/...`; frontend `bun run check` and `bun run build` passed with only pre-existing warnings. -- [x] Review passed after sanitizing `execute_tool()` catch-all errors to `"Tool execution failed"`. - -Open risk accepted for pre-release: -- Mixed batches with both successful and failed tool calls can still drop the failed tool silently and rely on `MAX_TOOL_ITERATIONS` as the backstop; adversarial tester marked this low severity and acceptable for pre-release. - -## Implementation outcome: backend context-location extraction + clarification suppression -- Updated `backend/server/chat/views/__init__.py` itinerary-stop fallback parsing so address-only stops derive a city hint from the last non-numeric comma-separated segment (e.g. `Little Turnstile 6, London` → `London`) before computing `trip_context_location`. -- Updated `search_places` context-retry block to track `attempted_location_retry`; when retries were attempted but still return required-param errors, result is converted to execution failure (`Could not search places at the provided itinerary locations`) to avoid location clarification prompts. -- Clarification branch now only triggers for missing-location required-param errors when no context retry was attempted. -- Added tests in `backend/server/chat/tests.py` for city extraction from fallback address and for retry-with-context failure path asserting no clarification SSE is emitted and execution-failure behavior is returned. - -## Targeted Re-Review Verdict: city extraction + clarification suppression (2026-03-10) - -- **Verdict:** APPROVED -- **Review Score:** 0 -- **Lens:** Correctness (targeted re-review of Fix 1 + Fix 2 only) -- **Scope:** `views/__init__.py` lines 444-462 (city extraction from fallback address) and lines 611-671 + 694-700 (retry failure → execution failure mutation + clarification guard). Plus two new tests in `tests.py`. - -### Fix 1: City extraction — edge cases verified -- Comma address: extracts last non-empty non-numeric segment (correct) -- No comma: falls back to raw `fallback_name` (correct) -- All-numeric segments: falls back to full `fallback_name` (safe, will fail geocoding gracefully) -- Empty parts after split: filtered by truthiness check (correct) -- Dedup via `stop_key`: uses extracted city, correctly collapses same-city addresses - -### Fix 2: Retry failure mutation — flow verified -- `attempted_location_retry` is per-tool-call scoped (no cross-call bleed) -- Mutated result `"Could not search places at the provided itinerary locations"` correctly fails `_is_required_param_tool_error` (no match on exact strings or regex) -- Mutated result correctly passes `_is_execution_failure_tool_error` (dict with non-empty error, not required-param) -- Routes to execution-failure handler, not clarification branch -- Clarification guard at lines 694-700 is defense-in-depth (unreachable in mutation scenario, but protects against future refactoring) - -### Test coverage verified -- `test_collection_context_retry_extracts_city_from_fallback_address`: correctly tests city extraction from `"Little Turnstile 6, London"` → `"London"` retry -- `test_context_retry_failure_does_not_emit_location_clarification`: correctly tests mutation routing to `tool_execution_error` SSE event, no clarification -- Both tests exercise the intended code paths with correct mock setups - -### No findings — APPROVED with zero issues -- Related: [[assistant-add-flow-fixes]]