50 KiB
MeshCore Open — Bugs Found (Web build, manual QA)
Session: 2026-06-12 · Build served at http://localhost:42751/ (Flutter web, debug/DDC) · Browser: Chrome
Each entry: Severity · where · what · repro · expected.
Status
Fixes for BUG-1..4 applied on 2026-06-12 (analyze clean). BUG-2/3 VERIFIED FIXED live: USB device VID:239A PID:8029 now connects on web — console shows Open: vendorId=0x239a uartBridge=false (DTR left default) → Got SELF_INFO → connectUsb: complete, no "device has been lost". Continuing to test the now-reachable connected screens. See "Fixes applied" at the bottom.
Open bugs
BUG-1 · Medium · USB connect — raw browser exception leaked to UI on picker cancel
- Where: "Connect over USB" screen (
usb_screen.dart), tap Select a USB device → Web Serial port picker → dismiss/cancel without choosing a port. - What: A red error snackbar shows the raw browser API string:
NotFoundError: Failed to execute 'requestPort' on 'Serial': No port selected by the user. - Why it's a bug: (1) Cancelling the port picker is a normal user action, not an error — it should be silent (or a neutral "No device selected" message), not a red error toast. (2) Even for real failures, leaking a raw JS
DOMExceptionstring is poor UX. The catch handler should map known cases (user cancellation) and present friendly copy. - Expected: Cancelling shows nothing or an info-level message; the UI never prints raw exception text.
- Root cause (confirmed in source):
usb_screen.dart:396 _friendlyErrorMessage()mapsPlatformException/StateError/etc., but the web cancel path throws a JSDOMException(NotFoundError) that matches none of the branches, so it falls through toreturn error.toString();(line 441). A friendlyl10n.usbErrorNoDeviceSelectedstring already exists (line 428) but is only matched for a nativeStateErrorcontaining'No USB serial device selected', not the web DOMException. Fix: detect the web no-port-selected case and route it to the existing friendly string (or suppress entirely).
BUG-2 · HIGH · Web USB connect fails with misleading "Timed out waiting for SELF_INFO" — read-pump error is dropped by a subscription race
- Where:
usb_serial_service_web.dart+meshcore_connector.dart connectUsb(). Repro: on web (Chrome), connect to a USB serial device (observed withWeb Serial Device VID:239A PID:8029, an Adafruit/nRF52840 board). - Observed (console, happens every attempt):
- Port opens OK:
USB serial opened port=Web Serial Device (VID:239A PID:8029) - Immediately:
_pumpReads error: NetworkError: The device has been lost.→_pumpReads: ended— the read stream dies the instant it opens. - Connect logic ignores that and proceeds:
requesting device info…, writes TX frames,ChannelSync Starting sync for 40 channels. - Nothing can ever be read back → SELF_INFO + ChannelSync retry/timeout for ~7s.
- Ends with
USB connection error: Bad state: Timed out waiting for SELF_INFO during connect→ disconnect. User sees a generic timeout error, not the real cause.
- Port opens OK:
- Root cause (confirmed in source): a subscription-timing race on a broadcast stream:
_frameControllerisStreamController<Uint8List>.broadcast()(usb_serial_service_web.dart:24-25). Broadcast streams do not buffer — events emitted with no listener attached are silently discarded._usbManager.connect()starts_pumpReads()fire-and-forget (usb_serial_service_web.dart:114). On this device_pumpReadserrors instantly and calls_addFrameError()(line 387/393).- But
connectUsb()attaches its error listener only afterawait Future<void>.delayed(200ms)(meshcore_connector.dart:1609), thenframeStream.listen(onError: → disconnect(), onDone: → disconnect())(lines 1610-1620). The read-pump error fired during that 200ms gap, so theonError → disconnectfail-fast path never runs. - With the safety net disarmed, connect falls through to
_waitForSelfInfo(3s) + retry (3s) and throws the misleadingTimed out waiting for SELF_INFO during connect(line 1647).
- Expected: A read-stream failure during connect should abort immediately with the real cause ("USB device disconnected / lost"), not a 7-second generic SELF_INFO timeout.
- Fix directions: attach the
frameStreamlistener (or otherwise observe transport health) before the read pump can emit — i.e. before/at port-open, not 200ms later; OR latch the last transport error in the service and haveconnectUsbcheck it; OR make connect race_waitForSelfInfoagainst a transport-error future. Remove/justify the unconditional 200ms delay that opens the race window.
BUG-3 · Likely device-level root cause for BUG-2 · DTR assertion may reset/lose nRF52840 (Adafruit, VID 0x239A) boards
- Where:
usb_serial_service_web.dart:285-295 _openPort(). - What: Immediately after
open(), the code assertssetSignals({dataTerminalReady:true, requestToSend:false})with the comment "Prevent ESP32 USB-CDC reset". That logic is tuned for ESP32. On Adafruit nRF52840 boards (VID0x239A, as seen here, PID0x8029), toggling DTR is associated with the bootloader/reset line and can cause the device to re-enumerate/reset — which plausibly produces the immediateNetworkError: The device has been lost.seen in BUG-2. - Status: Strong hypothesis, not yet isolated (would need to test connecting with the
setSignalscall removed/varied for this board). Flagging because the device class that fails (nRF52840/Adafruit) is exactly the one where DTR semantics differ from ESP32. - Expected: USB serial open should work for nRF52840-class MeshCore boards on web, or DTR handling should be conditional per device/VID.
BUG-4 · Medium · BLE "Scan" on web (no Bluetooth adapter / unsupported) gives zero feedback
- Where: Scanner screen (
scanner_screen.dart). Repro: on web with no BLE module (or any web build, sinceflutter_blue_plusdoesn't support web), tap Scan. - What: Nothing happens — no spinner, no "scanning" state, no error toast, no "Bluetooth unavailable on web" message. The button just sits there and status stays "Not connected". (Confirmed by user: "its not connecting on chrome and my computer doesn't have a ble module".)
- Expected: Either disable/hide BLE scan on web (the app already gates non-Chrome via
ChromeRequiredScreen; Chrome+web still can't useflutter_blue_plus), or show a clear "Bluetooth isn't available in the browser — use USB or TCP" message. Silent no-op leaves the user stuck.
BUG-5 · Low/Medium · Notifications never work until manually enabled; every incoming message logs an error
- Where:
notification_service.dart—show()calls at lines 201/248/306/394/593 are made without ensuring notification permission was granted. - What (observed in console while connected): repeated
Failed to show channel notification: Bad state: FlutterLocalNotifications.show(): You must request notifications permissions firstand the same for advert/message notifications. Each incoming channel message / advert triggers one. - Why: permission is only ever requested from
app_settings_screen.dart:239(when the user interacts with that setting). If the user never visits it,requestPermissions()is never called, soshow()throws on web (and would on Android 13+ with denied permission). The errors are swallowed bytry/catch(no crash), but notifications silently never fire and the log fills with errors. - Expected: request notification permission during init / on first connect (or check
areNotificationsEnabled()and skipshow()when not granted) instead of callingshow()unconditionally and relying on a caught exception per message.
OBS-1 · RESOLVED (not a bug) · USB-on-web connection dropped after ~4 min (device has been lost)
- What: At 9:25:13 the read pump errored
NetworkError: The device has been lost.→USB transport error→ clean auto-disconnect back to the scanner. - Cause: confirmed by user — they physically dropped/disconnected the radio. So this was a real physical disconnect, NOT a Web Serial stability issue. Positive finding: the app (with the BUG-2 fix) handled an abrupt physical disconnect cleanly and returned to the scanner.
BUG-6 · Medium · Battery Chemistry setting permanently disabled on USB (and TCP) connections
- Where: App Settings → BATTERY → Battery Chemistry.
app_settings_screen.dart:610-611gates the control onconnector.deviceId != null;meshcore_connector.dartonly ever assigns_deviceIdin the BLE connect path (_deviceId = device.remoteId.toString(), line 1839, right after_activeTransport = bluetooth). - What: When connected over USB (verified) — and by the same logic TCP —
connector.deviceIdisnull, so the Battery Chemistry dropdown shows the subtitle "Connect to a device to choose" and is disabled, even though the device is fully connected (header shows088EDAA0 · Connected, radio stats live). - Why it matters: USB/TCP users can never set per-device battery chemistry, so the battery percentage indicator uses the wrong voltage curve for their pack. Also user-confusing ("connect first" while connected).
- Note: The
088EDAA0shown throughout the UI is a node/public-key-derived identifier, distinct from_deviceId(the BLE remoteId). The battery setting keys off the BLE-only_deviceId. - Fix direction: populate a stable per-device identifier in the USB/TCP connect paths (e.g. the node public-key prefix already used for storage scoping), or key battery chemistry off that same node identity rather than the BLE remoteId.
BUG-7 · Low/Medium · Node Name can be saved empty (no validation)
- Where: Settings → Node Name dialog,
settings_screen.dart:708-744 _editNodeName(). - What: Clearing the field leaves the Save button enabled; the handler (line 728-740) calls
connector.setNodeName(controller.text)directly with notrim()/empty check. An empty or whitespace-only name can be written to the device, leaving the node nameless on the mesh (others see a blank contact). Reproduced: cleared field → "0/31", Save still active. (Did not actually save.) - Compare: the Radio Settings dialog correctly disables Save on invalid input — Node Name should do the same.
- Expected: disable Save (or show an error) when the trimmed name is empty.
OBS-2 · Watch · Auto-reconnect to cached Web Serial port fails after a physical drop (stale handle)
- What: After physical disconnects, the app auto-retries the cached port (
web:port:1) and fails withNetworkError: Failed to execute 'open' on 'SerialPort': Failed to open serial port(seen at 9:26:44, 9:34:23, 9:34:53, ~30s apart). It does eventually recover (header shows Connected again), but the cachedSerialPorthandle is often stale right after an OS-level drop. - Note: the native USB service explicitly documents this class of problem and avoids caching the serial handle (
_freshSerial()comment,usb_serial_service_native.dart:47-50). The web service caches the port object (_authorizedPortsByKey), so reopen-after-drop is more fragile. Partly environmental here (user was physically re-plugging), so logged as an observation, not a confirmed bug. Worth confirming the web reconnect path discards/re-requests the port onopen()failure rather than retry-looping on a dead handle.
Notes / non-bugs
- Radio Settings validation works: out-of-range frequency (
9999) shows "Invalid frequency (300-2500 MHz)" and disables Save. Good. - Graceful disconnect verified: when the USB transport drops, the app auto-navigates back to the scanner and shows "Not connected" — matches the intended "handle disconnection gracefully" behavior.
- Channel sync starts before handshake completes:
ChannelSync Starting sync for 40 channelsfires during connect beforeSELF_INFOis confirmed (console 9:07:54). Likely harmless given BUG-2 masks it, but worth confirming the initial-sync pipeline shouldn't wait for SELF_INFO first. - First load on
localhost:39107rendered a blank white page and the URL later resolved tolocalhost:42751. This appears to be a dev-server startup/port artifact (the Flutter view never mounted on the first port), not an app bug. Flagging only in case the port hop is intentional behavior worth confirming.
Fixes applied (2026-06-12)
- BUG-3 (device-lost root cause):
usb_serial_service_web.dart _openPort()now only assertssetSignals(DTR=true)for known USB-UART bridge VIDs (_uartBridgeVendorIds: CP210x0x10C4, CH3400x1A86, FTDI0x0403, PL23030x067B). Native-USB-CDC boards (nRF52840/Adafruit0x239A, etc.) are left untouched, since toggling DTR re-enumerates them. Added a debug log line reporting the detected vendorId and whether DTR was asserted. (The native service already documented this exact NRF52/DTR behavior — confirms the hypothesis.) - BUG-2 (dropped read-pump error / misleading timeout):
usb_serial_service_web.dart _pumpReads()now flips_status = disconnectedand latches_lastErrorwhen the read loop dies unexpectedly, instead of leaving the service reporting "connected".- Added
lastErrorgetter to both web + nativeUsbSerialServiceand toMeshCoreUsbManager. meshcore_connector.dart connectUsb()now checks_usbManager.isConnectedright after the 200ms settle delay (before waiting on SELF_INFO) and throws a clearUSB device disconnected during connect: <cause>using the latched error — failing in ~200ms with the real reason instead of ~7s with a generic SELF_INFO timeout.
- BUG-1 (raw exception on picker cancel):
usb_screen.dartadded_isUserCancelledPortPicker();_showError()now returns silently for picker cancellation (matches the webrequestPort/"No port selected" DOMException and the native StateError) instead of showing a red toast with raw text. - BUG-4 (silent BLE scan on web):
scanner_screen.dart _toggleScan()now short-circuits on web and showsscanner_bluetoothWebUnsupported("Bluetooth isn't available in the browser. Connect over USB instead.") instead of silently no-opping. New l10n key added toapp_en.arband regenerated for all locales (English fallback; pending auto-translation).
Fixes applied — round 2 (2026-06-12)
- BUG-5 (notifications never fire / error spam):
notification_service.dartadded_ensureCanNotify()— caches whether the platform can actually post notifications and is now the gate on all fourshow()entry points (message/advert/channel/batch-summary). Returns false on web (the plugin has no web backend, soshow()always threw) and honorsareNotificationsEnabled()on Android 13+.requestPermissions()now refreshes the cache so enabling from settings takes effect immediately. Thecancel()paths still use_ensureInitialized()(unchanged). Net: no more per-message error spam; notifications post when actually permitted. - BUG-6 (battery chemistry disabled on USB/TCP): added
MeshCoreConnector.batteryDeviceKey— returns the BLE remoteId when present (preserves existing BLE-keyed settings) and falls back to the node public key (selfPublicKeyHex) on USB/TCP. Both the internal_batteryChemistryForDevice()and the App Settings UI (app_settings_screen.dart) now key offbatteryDeviceKeyinstead of the BLE-onlyconnector.deviceId. Battery chemistry is now selectable on USB/TCP, and the battery-% curve is correct for those connections. - BUG-7 (empty node name savable):
settings_screen.dart _editNodeName()Save button is now wrapped in aListenableBuilderon the text controller and is disabled (onPressed: null) when the trimmed name is empty; the handler saves the trimmed value. Mirrors the Radio Settings dialog's validation behavior.
All four files analyze clean. Pending live re-verification after hot restart (BUG-5: no notification errors in console; BUG-6: Battery Chemistry enabled while on USB).
Suggestions / further work (2026-06-12)
Follow-ons directly implied by the fixes
- Audit other per-device features for the same BLE-only gap (BUG-6 was probably not alone).
_deviceIdis set only in the BLE connect path; anything keyed offconnector.deviceIdor_device?.remoteIdis silently BLE-only on USB/TCP. Grep those usages and verify each works on USB/TCP, or migrate them tobatteryDeviceKey/selfPublicKeyHex. - Battery-chemistry key inconsistency (tradeoff in my BUG-6 fix). I kept BLE = remoteId and USB/TCP = public key to avoid wiping saved BLE settings. Consequence: the same physical radio gets two different chemistry settings depending on transport. Cleaner long-term: key everything off
selfPublicKeyHex(the canonical per-radio identity used for all other scoped storage) with a one-time migration of existing BLE-keyed values. - Web reconnect should discard a stale port handle (OBS-2). After a physical drop, auto-reconnect retry-loops on the cached
SerialPortwithFailed to execute 'open'before recovering. The native service deliberately avoids caching the handle (_freshSerial()comment). Onopen()failure the web service should drop the cached port from_authorizedPortsByKeyand re-request (or prompt) instead of retrying a dead handle. - DTR allowlist may need expansion (caveat on my BUG-3 fix). I now assert DTR only for known UART-bridge VIDs (CP210x/CH340/FTDI/PL2303). A future board with an unlisted bridge chip won't get DTR and could reset on open. Consider a per-board config or a user-visible "hold DTR" toggle as a fallback as more hardware is tested.
- On web, the notification toggles are ON but inert (same spirit as BUG-4). Now that
_ensureCanNotify()correctly skips web, the four NOTIFICATIONS switches in App Settings still read as enabled while doing nothing in-browser. Grey them out / annotate "not available in browser" on web. AlsorequestPermissions()still returnstrueon web (fallbackreturn true), which is misleading to its callers.
Robustness / architecture
- The connect handshake rides on a
broadcast()stream + an unconditional 200ms delay — the exact combo that caused BUG-2. My fix (status flip + liveness check) closes the connect race, but any consumer that subscribes late can still miss early frames/errors. Consider readiness signaling instead of the magic delay, and/or buffering the first frames during connect.
UX polish observed
- Two "Scan" buttons on the scanner empty state (center button + bottom-right FAB) — redundant.
- Verify destructive actions confirm before acting. I intentionally did NOT trigger "Clear Chat", "Delete All Paths", "Reboot Device", or "Manage Repeater" (not my device). Worth confirming each has a confirmation dialog.
- Telemetry screen had no obvious "request once" affordance — only autorefresh (interval 20 / qty 10 / Enable). A manual one-shot refresh would help.
Coverage gaps (NOT tested this session — unverified)
- Repeater management (hub / CLI / settings / status), Line-of-Sight, Path-Trace map, Community QR scanner.
- Discovery & Neighbors screens, Map cache screen, Debug log screens.
- TCP transport (no TCP device available), USB on native (Android/desktop), on-device translation (LLM).
- Contacts search/sort/filter, channel add/reorder, GPX export.
Styling observations (2026-06-12)
Caveat: only viewed dark mode in a wide desktop browser (~1298px). Light mode and narrow/mobile widths not assessed.
The color system is good — credit where due. mesh_theme.dart is a coherent, semantic palette ("high-contrast slate surfaces with sky-blue accents"): slate surface ramp (0B1220→334155), sky-blue accent (0EA5E9), signal-green reserved for SNR only (22C55E), amber warn, red alert, magenta node type, a full ink ramp, and a separate light ramp. Not a generic default-Material look. (Note: the CLAUDE.md claim that MeshPalette/MeshTheme is "not currently wired" is stale — main.dart:204-205 wires MeshTheme.light()/.dark().)
Issues are layout/responsiveness, not color:
- No max-width constraint on web/desktop (biggest one). No
ConstrainedBox/maxWidthinmain.dart; the app is mobile-first and stretches edge-to-edge on a wide window — chat input spans the full ~1300px, list rows are very wide, chat bubbles hug the far edges, settings rows stretch across. Recommend a centered max-width content column (phone-like, ~480–600px) or responsive breakpoints for the web build. - App-bar title alignment is inconsistent. Main tab screens (Channels/Contacts/Map) use left-aligned titles with a device-id subtitle; detail screens use
centerTitle: true(23 files) orAdaptiveAppBarTitle. Reads slightly inconsistent screen-to-screen — worth one deliberate rule. - Tall-viewport chat spacing. Bottom-anchored chat leaves a large empty area at the top on a tall window (related to #1 — no height/width framing for big viewports).
- Minor: map node-detail sheet stacked over the older bottom info bar (z-order overlap); the two redundant Scan buttons are also a visual duplication; verify disabled-control contrast is legible (the pre-fix battery dropdown was quite dim).
- Unverified: light theme polish (light ramp exists in the palette but wasn't viewed).
Retry & Path-Selection Analysis (2026-06-12, code review vs firmware)
Static review of the ACK/retry/path-selection pipeline (message_retry_service.dart, path_history_service.dart, timeout_prediction_service.dart, connector wiring) cross-checked against the MeshCore C++ firmware at /mnt/Gaming/meshcore/MeshCore. Verified against actual firmware source, not just docs.
What's correct (verified):
- ACK-hash algorithm matches the firmware exactly. Client
computeExpectedAckHash(message_retry_service.dart:104-136) =SHA256(timestamp[4 LE] ‖ (attempt&3) ‖ text ‖ sender_pubkey[32]), first 4 bytes as LE uint32. FirmwareBaseChatMesh.cpp:413-419buildstemp[0..3]=timestamp,temp[4]=attempt&3,temp[5..]=text(null not hashed — length passed is5+text_len), hashed withself_id.pub_key. Byte order and endianness round-trip correctly (RESP_CODE_SENT/PUSH_CODE_SEND_CONFIRMEDboth LE uint32,meshcore_connector.dart:5265-5409). - Retry is entirely the client's job — firmware
sendMessagesends once, no retry loop. Correct division of responsibility. - Per-contact in-flight serialization (
_sendQueue,message_retry_service.dart:174-208) is the right idea — see RETRY-1 for why it's not sufficient.
Open findings
RETRY-1 · Medium · Per-contact in-flight cap doesn't protect the firmware's global 8-entry ACK table
- Where:
message_retry_service.dart:174-183(one in-flight message per contact) vs firmwareexamples/companion_radio/MyMesh.h:248-250+MyMesh.cpp:1100-1103. - What: The firmware's
expected_ack_tableis a single global circular buffer of 8 entries with a globalnext_ack_idx, shared across all contacts. The client only guarantees ≤1 in-flight per contact, so messaging 9+ contacts concurrently (e.g. several active conversations whose retries overlap, or a broadcast-style burst) puts >8 entries in flight. The firmware then overwrites the oldest slot (next_ack_idx = (next_ack_idx+1) % 8) with no rejection — the evicted message's ACK expectation is silently dropped. - Consequence: the evicted send never produces a
PUSH_CODE_SEND_CONFIRMED, so the client times out and retries a message the radio already sent (wasted airtime), and can mark as failed a message that actually delivered. - The code comment is misleading:
message_retry_service.dart:173-174says serialization exists "to avoid overflowing the firmware's 8-entry expected_ack_table" — but a per-contact cap does not bound the global table. - Fix: enforce a global concurrent-in-flight cap (≤8, ideally ~6 for headroom) across all contacts, not just per-contact; or track
next_ack_idxpressure and back-pressure new sends.
RETRY-2 · Medium · Firmware's reported est_timeout is silently discarded
- Where:
meshcore_connector.dart calculateTimeout()(4379-4408);message_retry_service.dart:405-415. - What:
RESP_CODE_SENTcarries the device's ownest_timeout(firmwareMyMesh.cpp:1106-1110, computed from realgetEstAirtimeFor(...)). It's parsed intotimeoutMsand passed toupdateMessageFromSentas the defaultactualTimeout. Butconfig.calculateTimeoutis always set, so the device value is immediately overwritten on every message — and when the ML model has no data,calculateTimeoutreturnsphysicsMax(line 4407), never the device-provided value. - Why it's a bug: the docstring at
message_retry_service.dart:405("prefer ML prediction, then device-provided, then physics fallback") describes a tier that doesn't exist — device-provided is never used. The physics fallback re-derives the firmware's flood/direct formulas, but_estimateAirtimeMsfalls back to a hard-coded 50 ms when radio params (freq/bw/sf/cr) aren't yet known (meshcore_connector.dart:4341-...), which can diverge sharply from the device's actual airtime estimate (e.g. SF12). The device already did this math correctly with the real airtime. - Fix: use the device
timeoutMsas the fallback when ML is unavailable, and as the clamp ceiling when ML is present — it's strictly better information than a 50 ms guess.
RETRY-3 · Medium · Premature ML timeouts can delete good routes (timeout floor too low + aggressive weight decay)
- Where:
meshcore_connector.dart _physicsMinTimeout(≈4360-4375);message_retry_service.dart:491-538;path_history_service.dart:131-141. - What: For direct paths the timeout floor is
airtime*(hops+1)— it omits the firmware's base (SEND_TIMEOUT_BASE_MILLIS=500) and per-hop processing terms (6*airtime+250per hop). So an ML prediction clamps as low as that floor, well under the firmware's own500 + (6*airtime+250)*hops. When the timer fires before an ACK could realistically return,_handleTimeoutrecords a false failure →recordPathResult(success:false)decrementsrouteWeightby 0.5. A fresh path starts at weight 1.0, so two false timeouts drive it to ≤0 andremovePathRecorddeletes it (path_history_service.dart:136-140). Net: an overly tight timeout estimator actively erodes the learned route table — the opposite of what the ML system is for. - Fix: raise the direct-path min floor to include the firmware base + per-hop terms (don't let ML clamp below a physically plausible RTT); and require more evidence (e.g. a higher failure count, or a confirmed
MSG_SENDwith no ACK over multiple attempts) before deleting a route rather than after two timer fires.
RETRY-4 · Low · handleAckReceived fallback computes a bogus attemptIndex
- Where:
message_retry_service.dart:613-626(fallback branch), usesexpectedHashes.indexOf(expectedHash)as the attempt index. - What:
_expectedAckHashes[messageId]is de-duplicated on insert (line 401), and because both firmware and client maskattempt & 0x03, attempts 0/4/8 (and 1/5/9, …) hash to the same value. So the list index is neither the real attempt number nor stable. The derivedmatchedAttemptIndexis only used to pick which attempt'sPathSelectiongets credited for the delivery — so the impact is limited to path-attribution accuracy on the fallback path (the primary_ackHashToMessageIdmapping carries the correct snapshot). Low severity, but the value is simply wrong. - Fix: store the attempt index alongside each expected hash (or just reuse the snapshot in
_ackHashToMessageId) instead of inferring it from a list position.
RETRY-5 · Low · ML model feeds flood as pathLength = -1, polluting the linear coefficient
- Where:
timeout_prediction_service.dart:59-98(observation),161-200(training). - What: Flood deliveries are recorded with
pathLength: -1andisFlood: true, then both are fed as features into a single global OLS regressor. A linear term that sees hop counts of-1, 0, 1, 2, 3, …with a discontinuity at the flood case distorts thepathLengthcoefficient; theisFlooddummy only partially compensates (it shifts the intercept, not the slope). Direct-path timeout predictions are therefore biased by flood observations and vice-versa. - Fix: train separate flood vs direct models, or set
pathLength = 0(or a dedicated flood feature only) whenisFlood, so the hop-count slope is learned from direct paths alone.
RETRY-6 · Low / verify · ACK hash is computed over SMAZ-transformed text — the sent frame must use byte-identical text
- Where:
message_retry_service.dart:328-351— hash usesprepareContactOutboundText(contact, text)(SMAZ-encoded), butconfig.sendMessage(contact, message.text, …)is handed the raw text; correctness depends on_sendMessageDirect/buildSendTextMsgFrameapplying the exact same SMAZ transform downstream. - Why it matters: if those two paths ever diverge, or if a message exceeds the firmware's
MAX_TEXT_LEN = 160and gets truncated device-side, the receiver hashes different bytes than the client predicted → the ACK never matches → the message retries tofailedeven though it was delivered. This is a silent, hard-to-diagnose failure mode. - Fix: compute the hash from the same byte buffer that is actually framed and sent (single source of truth), and add a unit test that asserts hash-over-frame-text == expected ACK hash, including a >160-byte/SMAZ case.
Observations (heuristics, not correctness bugs)
- Flood path attribution credits a path the send didn't use. On a successful flood delivery,
_recordPathResult(meshcore_connector.dart:1309-1354) boosts the weight ofcontact.path(the device's current path) even though the message was flooded, not routed over that path. It's a reasonable "the ACK probably came back this way" heuristic, but (a) it inflates weight on an unexercised route, and (b) it readscontact.pathwhich the path-return packet triggered by this very ACK may have just mutated (hence theunawaited(getContactByKey(...))re-fetch — a race the code papers over). Worth documenting as a heuristic and considering attributing only when the device path is confirmed fresh. calculateDefaultTimeout(message_retry_service.dart:713-719) appears to be legacy/unused by the active retry path (which always goes throughconfig.calculateTimeout). If dead, remove it to avoid confusion about which timeout logic is authoritative.
How the system could be improved
- Single global in-flight budget tied to the firmware's table size. Replace the per-contact gate with a global semaphore of N (≤8) outstanding ACKs, with a small reserve. This directly models the firmware constraint and eliminates RETRY-1.
- Make the device timeout authoritative, ML advisory. Use the firmware's
est_timeoutas the baseline (it has the real airtime), and let the ML model only widen it when history shows this contact/path is consistently slower. Never let ML clamp below a physically plausible RTT (fixes RETRY-2 + RETRY-3). - Decouple "timed out" from "route failed." A timeout is weak evidence (could be transient congestion). Track per-path consecutive failures and only decay/delete a route after repeated, independent failures — or after the firmware reports a genuine send failure. Add hysteresis so one slow ACK doesn't erase a known-good route.
- Separate latency models per route class (flood vs direct, and ideally per hop-count bucket), or add the airtime estimate itself as a feature so the model learns a multiplier rather than re-deriving physics. Fixes RETRY-5 and makes predictions interpretable.
- One source of truth for outbound bytes. Frame the message once, compute the ACK hash from those exact bytes, and reject/queue anything that would exceed
MAX_TEXT_LENbefore sending (rather than discovering it via never-arriving ACKs). Fixes RETRY-6. - Confidence-based path selection.
_scorePathRecordweights reliability 0.45 / latency 0.25 / freshness 0.1 / routeWeight 0.2 with hard-coded constants. Consider surfacing these as tunables (some already are viaappSettings) and using a Wilson lower-bound on the success ratio instead of the current(s+1)/(n+2)Laplace estimate, so a 1/1 path isn't ranked above a 20/22 path purely on the smoothed mean. - Bound the late-ACK grace + 15-min mapping cleanup explicitly.
handleAckReceivedcleans_ackHashToMessageIdolder than 15 min on every ACK (message_retry_service.dart:589-599); combined with the 30 s post-failure grace timer andattempt&3hash reuse, it's worth a test that a delivered-after-failed message resolves exactly once and never double-advances the send queue.
Possible ML issues
-
High: training/prediction mismatch for secondsSinceLastRx. _lastRxTime is reset when the ACK frame arrives, then the observation is recorded, making this feature effectively zero. Prediction measures it before sending. lib/connector/meshcore_connector.dart:3867
-
Medium: incorrect message-size input. Retry prediction and training use message.text.length, not UTF-8/transformed payload bytes. Unicode and compressed messages therefore receive incorrect airtime inputs. lib/services/message_retry_service.dart:405 lib/services/message_retry_service.dart:669
-
Medium: stale per-contact statistics. Old observations are removed from the 100-item model window, but never removed from _contactStats. The global model adapts while contact averages remain lifetime averages. lib/services/timeout_prediction_service.dart:76
-
Medium: model is underconstrained. Ordinary linear regression starts with only 10 samples and up to four features, without outlier rejection or validation. Physics clamping limits damage, but unstable predictions are still likely.
-
Low: pending observations may not persist. dispose() cancels the delayed save without flushing it. lib/services/timeout_prediction_service.dart:216
ACK & Message-Delivery Analysis (2026-06-12, code review vs firmware)
Focused pass on ACK matching and message delivery (sending + receiving), cross-checked against the MeshCore firmware at /mnt/Gaming/meshcore/MeshCore. Companion of the earlier "Retry & Path-Selection" section — this one is about whether messages reliably reach delivered/failed and aren't lost or double-shown.
Verified correct (recording these to close earlier open questions):
- ACK-hash text consistency holds.
prepareContactOutboundText(SMAZ / Cyr2Lat) transforms the text, and the same transformed bytes are used both for the wire frame (buildSendTextMsgFrame,meshcore_connector.dart:~1113) and for the expected-ack hash (message_retry_service.dart:328-338). The firmware hashes the raw received text bytes (SMAZ is opaque to it), so both sides agree. This resolves the RETRY-6 "verify" concern (the >160-byte truncation guard is still worth adding). - Connect-time queue drain exists.
_startPostChannelInitialQueuedMessageSync(meshcore_connector.dart:3835) plus therespCodeEndOfContactshandler (:3936-3941) proactively callsyncQueuedMessages(force:true)after the SELF_INFO→channels→contacts pipeline, so messages that arrived while disconnected are pulled on reconnect — not solely dependent on a livePUSH_CODE_MSG_WAITINGtickle. - Firmware de-dups inbound packets.
SimpleMeshTables::hasSeenkeeps a 160-entry packet-hash seen-table (src/helpers/SimpleMeshTables.h:34-53), so the same on-air packet arriving via multiple flood paths is dropped, not delivered twice. - Channel sends do reach
sent._handleOkpromotes the pending channel message on the firmware's generic OK (meshcore_connector.dart:5370), and the echo-heard path promotes it again (:5990). (So the "stuck pending on success" worry is unfounded — but see DELIVERY-1 for the error path.)
Open findings
DELIVERY-1 · Medium · A rejected (or lost-response) channel send is stuck "pending" forever
- Where:
_handleErrorFrame(meshcore_connector.dart:4007-4024);ChannelMessageStatushas only{pending, sent, failed}and nothing ever setsfailedfor a channel message. - What: When the firmware answers a
cmdSendChannelTxtMsgwith an error frame (e.g.ERR_CODE_NOT_FOUNDfor an unknown/unconfigured channel — firmwareMyMesh.cpp:1131-1135writes an err frame on failure), the handler removes the message from_pendingChannelSentQueue(line 4023) but never marks the messagefailed. The same happens if the OK/ERR response is simply lost. Result: the bubble shows "pending/sending" indefinitely with no error, and there is no retry or timeout to resolve it. - Contrast: direct messages have a full timeout→retry→
failedpath; channel messages have no failure path at all. - Fix: in
_handleErrorFrame, set the matching channel message tofailed(mirror_markPendingChannelMessageSentById). Optionally add a short watchdog so a channel send with no OK/ERR within N seconds flipspending→failed.
DELIVERY-2 · Medium · Firmware offline queue (16) silently drops contact messages on overflow
- Where: firmware
examples/companion_radio/MyMesh.cpp:219-255(addToOfflineQueue),MyMesh.h:240(OFFLINE_QUEUE_SIZE 16). - What: Inbound messages are buffered in a 16-slot queue and tickled to the app via
PUSH_CODE_MSG_WAITING. On overflow the firmware evicts the oldest channel message to make room; if all 16 queued entries are contact/direct messages, the new message is silently dropped — no error, no notification to the app. - Consequence: a burst of direct messages while the app is backgrounded/slow/disconnected can lose messages with zero indication. The client drains in a loop on tickle and on connect (good), but it cannot recover a message the firmware already discarded.
- Fix (client side): drain as fast as possible — the current one-at-a-time request/response loop (
_requestNextQueuedMessage) round-trips per message; consider keeping the pump saturated. There's no app-side way to detect the drop, so also worth raising upstream (a queue-overflow counter in firmware stats would let the app warn the user).
DELIVERY-3 · Low/Medium · Inbound contact-message de-dup can drop a legitimate message
- Where:
_handleIncomingMessagede-dup (meshcore_connector.dart:~4759-4772): skips an incoming message if a message with the same(timestampSeconds, text)exists in the last 10 messages from that sender. - What: Timestamps are second-resolution, so two genuinely distinct messages with identical text within the same second (e.g. sending "ok" twice) collide and the second is silently dropped. The firmware already de-dups true on-air duplicates at the packet-hash layer (see "verified correct"), so this app-layer heuristic is largely redundant and introduces a real (if uncommon) data-loss path; it also only looks back 10 messages, so a re-delivered queued message after a longer gap can slip through as a "new" duplicate.
- Fix: lean on the firmware's packet identity instead of
(timestamp,text), or narrow the heuristic (e.g. only treat as duplicate within a very short window AND when flagged as a flood repeat), so identical-text messages aren't lost.
DELIVERY-4 · Low · ACK match is 4-byte-hash-only, no sender identity, against a global 8-slot table
- Where: firmware
processAck(MyMesh.cpp:411-427) —memcmp(data, &expected_ack_table[i].ack, 4)over 8 slots, clears the slot to 0 on first match. - What: matching uses only the first 4 hash bytes with no check that the ACK came from the intended recipient, against the same global 8-entry circular table called out in RETRY-1. So (a) >8 concurrent in-flight sends evict a slot → that message never gets
PUSH_CODE_SEND_CONFIRMED→ client retries/false-fails; (b) a ~1/2^32-per-concurrent-send chance of a cross-message false confirm. Duplicate ACKs are handled safely (slot cleared after first match; the random 6th ACK byte doesn't affect the 4-byte compare). Low severity on its own; the real lever is RETRY-1's recommended global in-flight cap (≤8).
DELIVERY-5 · Info · Round-trip time is measured from queueing, not transmission
- Where: firmware
trip_time = getMillis() - expected_ack_table[i].msg_sent(MyMesh.cpp:417), wheremsg_sentis stamped when the message is enqueued (:1100). - What: the
tripTimeMsthe app receives (and feeds the ML timeout model) includes the firmware's TX-queue wait when the radio is busy, so observations are inflated under load. Compounds the ML-input issues already logged (RETRY-5 / the ML notes above). Not a delivery bug, but it skews timeout prediction.
DELIVERY-6 · Low / verify · Command/CLI sends use expected_ack = 0 — must not ride the ACK retry path
- Where: firmware sets
expected_ack = 0forTXT_TYPE_CLI_DATAand the recipient sends no ACK (MyMesh.cpp:1088-1110,BaseChatMesh.cpp:244-252);RESP_CODE_SENTthen carries a zero hash. - What: if any client path routes a CLI/repeater command through
MessageRetryService(which computes a non-zero expected hash and waits forPUSH_CODE_SEND_CONFIRMED), it would never match → spurious retries and a false "failed". Need to confirm the repeater-CLI/command path is separate from the text-message ACK machinery (the connector does call_recordPathResultfor "repeater command results" — worth tracing that it isn't also arming an ACK wait). - Fix/verify: ensure command sends bypass the expected-ack/retry tracking entirely, or special-case
expected_ack == 0inupdateMessageFromSentas "no ACK expected → mark sent, don't wait".
Live Log Analysis — "JC Room Server" session (2026-06-12)
Real device log (BLE/USB, auto route rotation ON) sending to a multi-hop room server. This is a near-perfect live reproduction of RETRY-2, RETRY-3, and RETRY-5. The ACK matching is correct (every RESP_CODE_SENT matches its message); what's broken is timeout estimation and path rotation — the app declares failures and retries messages that are actually still in flight.
Evidence by symptom
-
Flood timeout = 151 seconds (RETRY-5, live).
raw prediction=100638ms for pathLength=-1 → ML timeout 150959ms. FeedingpathLength=-1into the linear model makes the flood case extrapolate off a cliff; the.clamp(physicsMin, physicsMax)didn't help because the floodphysicsMax(500 + 16·airtime) was also huge. A message that falls back to flood effectively hangs for ~2.5 minutes.- Also seen:
raw prediction=-3467ms for pathLength=1, messageBytes=172— a negative predicted time. Discarded by the<=0guard, but proves the OLS model is unstable/underconstrained (matches the ML notes).
- Also seen:
-
Premature timeouts on a working route (RETRY-3, live).
"Its going well"was delivered in 16037ms, but attempt 0/1 used timeouts of 14420ms / 10240ms — shorter than the real round-trip — so they firedTimeout → retryingbefore the ACK could return. Theg:messagedelivered ... in 9418msarrived after a retry had already been scheduled. Messages that are genuinely in flight are being declared failed. -
Device
est_timeoutignored (RETRY-2, live). Every send gets a cleanRESP_CODE_SENT(which carries the firmware's ownest_timeout), but the app overrides it with the runaway ML value. The device's estimate would have been far saner than both the 10s (too short) and 151s (too long) ML outputs. -
Path rotation never converges. Single message, per-retry path:
pathLength 3 → 2 → 1 → 2 → -1(flood); timeout target swings25s → 19s → 13s → 19s → 151s. Routes that just delivered (g:over 2 hops) are abandoned on the next message. On a 2-5 hop room server on a congested channel, the thrash never settles. -
Retries congest the channel.
Radio quiet for 3037ms,Post-inbound backoff: waiting 14808ms— channel is busy. Backoffs (1/2/4/8s) are far shorter than the 19-34s timeouts, so retries stack onto a multi-hop path and reduce delivery odds. -
attempt & 3hash reuse visible."I have some really c..."attempt 0 and attempt 4 both expectb8bfa902;"JC you around?"reusesb2074391. Attempts 0/4 are indistinguishable on the wire (RETRY-4 / masking). -
Room login runs a separate, untracked ACK path (relates to DELIVERY-6).
No pending message found for ACK hash: b0023dab×5 during[RoomLogin]— login sends with an expected ACK thatMessageRetryServicenever registered, so every login confirm logs as unmatched. Login has its own 26s timeout/retry and eventually succeeds.
Highest-impact fix (from this log)
Fix the timeout side: stop feeding flood as pathLength=-1, hard-cap the ML output (≈30-45s), and fall back to the device's est_timeout instead of the runaway ML value. That alone kills the 151s hang and the premature retries. Contained to timeout_prediction_service.dart + calculateTimeout in meshcore_connector.dart.
Design Proposal — Replace the timeout regressor; split timeout from loss (2026-06-12)
Outcome of analyzing the JC Room Server log: the OLS LinearRegressor in timeout_prediction_service.dart is the wrong tool, and it's conflating two separate questions. This sketches the replacement before any code changes.
The core insight (why this isn't TCP, and what that implies)
This is slow, lossy multi-hop radio: a message can be lost on a forward hop (never reaches the recipient) or delivered-but-the-ACK-return-is-lost. Two consequences drive the whole design:
- On a lossy link, a longer timeout recovers nothing — only a retry does. The timeout is your time-to-retry. Loss is the common outcome to a distant (5-hop) node, so inflating the timeout (e.g. the observed 151 s flood value) just maximizes idle time before the one action that helps. The right timeout is the tightest value that still covers a realistically successful round-trip (~P90–95 of success RTT), then give up and retry.
- The latency model is already blind to loss.
recordObservationis only called fromhandleAckReceived— i.e. only on success. Lost messages produce no sample. So the regressor estimates "time of a successful round-trip" (same data an EWMA would use), just with extrapolation/negative/explosion failure modes on top. Modeling loss for real would require censored-data/survival analysis, which is far heavier than needed.
Conclusion: stop using one fragile latency regressor to answer both "how long to wait" and "is this route any good." Split them.
Part A — Timeout: physics-anchored, capped, adaptive RTT (replaces the OLS model)
Drop ml_algo / ml_dataframe. Keep the existing predictTimeout / recordObservation interface so calculateTimeout and the connector wiring are untouched; swap the internals.
Per route class = (hop count, flood-vs-direct); shared across contacts since data is sparse.
// baseline from physics (deterministic, generalizes to unseen hop/byte combos
// from sample #1). Use the LoRa airtime formula already in the connector, or the
// device's est_timeout from RESP_CODE_SENT (RETRY-2).
base = physicsEstimate(hops, bytes, sf, bw, cr)
// learn only a BOUNDED multiplicative overhead from real successes (congestion,
// per-hop processing). EWMA, clamped so it can never explode or go negative.
on success(sample, hops, bytes):
ratio = sample / physicsEstimate(hops, bytes, ...)
f[class] = (1-α)·f[class] + α·clamp(ratio, 0.5, 4.0) // α≈0.125
v[class] = (1-β)·v[class] + β·|f[class] - ratio| // β≈0.25 (deviation)
predictTimeout(hops, bytes):
est = base · f[class]
return clamp(est + 4·v[class]·base, physicsMin, HARD_MAX) // HARD_MAX ≈ 30–45 s
Properties this fixes, by construction:
- No 151 s / no negatives — output is
physics × bounded factor + bounded variance, hard-capped. - No flood discontinuity — flood is its own class with its own factor, never
pathLength = -1on a shared axis (kills RETRY-5). - No cold start — seed
f=1,v=0; the physics/device baseline is sensible from message #1; observations only refine the factor (removes the "need 10 samples" gate). - Variance is the timeout's whole point —
est + k·deviationis a calibrated high-percentile wait, so genuinely-in-flight messages aren't cut off early (kills RETRY-3), while staying tight enough that lost messages retry promptly. - Drops two dependencies, ~15 lines, debuggable.
Also remove the dead/broken secSinceRx feature (recorded ≈0; see ML notes) and feed transformed/UTF-8 payload bytes, not text.length.
Part B — Loss: route reliability drives retry / reroute / flood (data already exists)
The "is this path even working" question belongs to PathHistoryService, which already tracks successCount / failureCount / routeWeight per path. Treat that as a Bernoulli delivery-probability estimate (Wilson or Beta lower bound on success rate) and use it for the decision, not the timeout:
- High-confidence working route → keep using it; on a single timeout, retry the same path once (the ACK return may just have been lost) before changing anything.
- Degrading route (success rate dropping) → switch to the next-best ranked path rather than hammering a dying one.
- No confident route / repeated failures → flood to rediscover, then pin the rediscovered path (don't immediately rotate off it — the log shows good routes being abandoned the next message).
This replaces the current "rotate the path on every retry" churn (which made the timeout target swing 25 s→19 s→13 s→19 s→151 s and never converged) with: pin a working route; only re-route/flood when reliability says the route is actually bad.
Sequencing
- Part A first (self-contained, biggest visible win — kills the 151 s hang and premature retries; removes ml_algo). Same public interface, so low blast radius.
- Then Part B (path-decision policy), which also subsumes much of the separate "location-change brittleness" work — a route that stops delivering after you move is just a route whose reliability dropped.