ART Innovation Team · Skills Marketplace · Kiro Build Quality Bar v1 — Response

Audit Response: KiroEasyButton v1.4.0

Response to: Kiro Build Audit Report — KiroEasyButton v1.3.0 (17.75/20) · From: John Manchisi (jmanch) · To: Shawn Mangold (shawnman) · May 1, 2026

Hey Shawn,

Really appreciate the thorough audit — especially the explicit scoring rubric and the line between "scoring artifact" and "actionable gap." That distinction is what let me triage this cleanly in one afternoon. Rather than just fixing things and telling you it's done, I want to give you a structured response you can feed back into the Skill Vetting Tool's training loop. Three goals here:

  1. Evaluate each of your three recommendations — accepted, declined, or modified + why.
  2. Document what shipped in v1.4.0 and what it addresses.
  3. Surface a set of net-new improvements from today's CSM Dive Deep Friday session that went beyond your original scope but might inform the Quality Bar v2.
Projected score with v1.4.0: 18.25+ (up from 17.75; crosses Marketplace Featured)
Release shipped: KiroEasyButton v1.4.0 — Skills Files Library updated, GitHub Release published with asset zip attached.

Response to your three recommendations

Recommendation 1: Add requirements.txt + package.json (+0.5 toward Dependencies & Packaging)
Status: Accepted
What shipped + rationale: Both files shipped in v1.4.0.
  • requirements.txt documents the Python 3.10+ / stdlib-only runtime stance with a rationale block (no runtime packages). A commented-out optional pytest line tells contributors how to run the suite without changing the end-user install surface.
  • package.json formalizes the Node 18+ story for mcp_probe.js, declares zero runtime deps, pins "os": ["win32"] so npm won't accidentally try to install on macOS/Linux, and adds a test script that runs node --check. Marked "private": true because it references Amazon-internal MCPs.
Evidence: Both files visible in the v1.4.0 repo root.

Recommendation 2: Expand test coverage (+0.5 toward Testing)
Status: Accepted
What shipped + rationale: v1.4.0 adds test_unit.py with 57 unit tests, all passing, covering the areas you flagged:
  • Cookie-expiry parser (8 tests): happy path, malformed inputs, earliest-expiration-wins across domains, #HttpOnly_ prefix handling.
  • Midway freshness three-state logic (5 tests): healthy, refresh-needed, mwinit-needed, proactive-mwinit safety margin, no-file path.
  • Grasp failure classifier (10 tests): all four classes (midway_reauth_required, oauth_token_missing, midway_stale, generic_spawn) plus precedence rules and unknown fallback.
  • Probe/retry gating (5 tests): no retry on success, no retry when handshake fails, retry-once-when-smoke-fails, stop-at-one-retry, no-retry-on-skip.
  • Self-heal decision tree (4 tests): generic MCP cookie-copy cascade, grasp-mcp oauth fast path, midway-stale escalation, non-interactive-mode fail-fast. Also covers write-through of outcomes to the failure playbook.
  • Log-scrape filtering (6 tests): real MCP errors match, conversation JSON does not match, model-ID mentions do not match (the fix from v1.2.2's false-positive bug stays locked in).
  • Recommendations dedup (2 tests): not-reproduced issues get resolved_at stamps, recurring issues increment occurrence_count.
  • User classification (3 tests): new_user / early_user / returning_user boundaries.
  • v1.4.0 preflight checks + wiring (14 tests): covered in the new-work section below.
Evidence: python -m unittest test_unit reports "Ran 57 tests in 0.116s · OK". Also runnable via python -m pytest test_unit.py -v.

Recommendation 3: Platform note is fine as-is (-0.5 Config, -0.5 Scalability treated as scoring realities, not gaps)
Status: Declined (confirmed)
What shipped + rationale: Agreed. The Windows-only constraint is the whole point of the tool: the Midway-to-WSL cookie-copy dance and the dual-shell MCP fragility simply don't exist on macOS or Linux, so a cross-platform version would be solving a problem that isn't there. Documented this explicitly in the README's trailing note.
Feedback for the Quality Bar: Consider a "by-design platform constraint" modifier for the Config and Scalability dimensions. A build that intentionally targets a single platform (and documents why) probably shouldn't be scored the same as one that got stuck on one platform by accident. It would help differentiate "did the right thing for the audience" from "didn't finish the job."
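For reference, the five probe/retry gating cases listed under recommendation 2 can be sketched roughly as follows. This is a minimal illustration, not the shipped code: ProbeResult, should_retry, probe_with_retry, and the "pass" / "fail" / "skip" status strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 1  # "stop-at-one-retry"


@dataclass
class ProbeResult:
    # Hypothetical shape; the real probe result type may differ.
    status: str                 # "pass", "fail", or "skip"
    handshake_ok: bool = False  # did the MCP handshake complete?
    smoke_ok: bool = False      # did the follow-up smoke call succeed?


def should_retry(result: ProbeResult, retries_so_far: int) -> bool:
    """Retry only when the handshake worked but the smoke test failed."""
    if retries_so_far >= MAX_RETRIES:
        return False            # never more than one retry
    if result.status in ("pass", "skip"):
        return False            # no retry on success or on skip
    if not result.handshake_ok:
        return False            # a failed handshake is not retried
    return not result.smoke_ok  # retry once when only the smoke test failed


def probe_with_retry(probe: Callable[[], ProbeResult]) -> tuple[ProbeResult, int]:
    """Run the probe, retrying at most MAX_RETRIES times. Returns (result, retries)."""
    retries = 0
    result = probe()
    while should_retry(result, retries):
        retries += 1
        result = probe()
    return result, retries
```

Keeping the gating decision in a pure function like should_retry is what makes each of the five cases testable without spawning a real MCP server.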

Net-new improvements in v1.4.0 (beyond your scope)

Alongside your three recommendations, today's CSM Dive Deep Friday session in Slack (channel #art-csm-kiro-quick-deep-dive-support) surfaced first-install pain points that weren't addressable by any tool that only ran after install. The v1.4.0 release adds three upstream environment checks to catch these before users hit the failing command — sharing here because they might inform signals worth adding to the Quality Bar.

Check: admin_powershell (Net-new)
Real user error it addresses: Parag Pavan Shetty (paragps) hit dism.exe: DSM needs higher permissions while installing WSL, even with ACME Admin enabled. Required closing PowerShell and re-opening as Administrator.
Fix the app now surfaces: Preflight detects non-elevated shell on every run and surfaces a low-severity rec (new_user audience) explaining that daily runs don't need elevation, but first-install commands do. Includes the exact "right-click PowerShell → Run as administrator" fix.

Check: aim_environment (Net-new)
Real user error it addresses: Rahul Pandya (rpandyaz) ran toolbox install aim in Windows PowerShell and got "aim doesn't support windows. Available operating systems are [alinux, alinux_aarch64, osx, osx_arm64, ubuntu]". Needed three troubleshooting exchanges to get him to the correct (WSL Ubuntu) shell.
Fix the app now surfaces: Preflight detects whether aim is available on Windows PATH vs. inside WSL vs. not at all, and emits an info rec pointing to the right shell. Fixes the symptom where the error message itself doesn't tell users "run this in WSL, not PowerShell."

Check: battery_power (Net-new)
Real user error it addresses: Rahul also reported his laptop dropped from 70% to 8% battery in 8 minutes during parallel first-run of Quick Desktop + Kiro + MCP servers. Not a bug per se, but the WSL VM + 7 MCP servers + headless Chrome for SAML is genuinely heavy at startup.
Fix the app now surfaces: Preflight detects AC vs. battery via GetSystemPowerStatus and warns new/early users (< 3 runs) when on battery. Returning users are past the heavy-startup phase and don't see the warning, which avoids the "keep nagging me about this" anti-pattern.
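To make the battery_power check concrete, here is a minimal sketch of reading AC status through GetSystemPowerStatus with ctypes and gating the warning on run count. The struct layout follows the documented Win32 SYSTEM_POWER_STATUS; the function names, the warning text, and the way the run count is passed in are assumptions for illustration, not the app's actual code.

```python
import ctypes
import sys
from typing import Optional


class SystemPowerStatus(ctypes.Structure):
    # Field layout of the Win32 SYSTEM_POWER_STATUS structure.
    _fields_ = [
        ("ACLineStatus", ctypes.c_ubyte),        # 0 = battery, 1 = AC, 255 = unknown
        ("BatteryFlag", ctypes.c_ubyte),
        ("BatteryLifePercent", ctypes.c_ubyte),
        ("SystemStatusFlag", ctypes.c_ubyte),
        ("BatteryLifeTime", ctypes.c_uint32),
        ("BatteryFullLifeTime", ctypes.c_uint32),
    ]


def read_ac_line_status() -> Optional[int]:
    """Return ACLineStatus on Windows, or None off-Windows or if the call fails."""
    if sys.platform != "win32":
        return None
    status = SystemPowerStatus()
    ok = ctypes.windll.kernel32.GetSystemPowerStatus(ctypes.byref(status))
    return int(status.ACLineStatus) if ok else None


def battery_warning(ac_line_status: Optional[int], run_count: int,
                    heavy_startup_runs: int = 3) -> Optional[str]:
    """Warn only new/early users (fewer than 3 runs) who are on battery."""
    if ac_line_status != 0:              # AC, unknown, or unreadable: stay quiet
        return None
    if run_count >= heavy_startup_runs:  # returning users are not nagged
        return None
    # Hypothetical wording; the real recommendation text may differ.
    return "On battery: first runs are heavy at startup, consider plugging in."


# Usage: message = battery_warning(read_ac_line_status(), run_count=0)
```

Splitting the Win32 call from the decision keeps the decision testable on any machine, including ones with no battery at all.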

How they're wired in

All three run non-blocking in PHASE 1 after the existing preflight list. Any exception inside a check is caught and converted to a skip StepResult — hard guarantee that an upstream check can never break the main run. The corresponding warn / info outcomes feed the existing recommendations engine, which adds audience-targeted suggestions to recommendations.json just like the existing preflight-driven recs.
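The "exception becomes a skip" guarantee can be sketched like this. Only the StepResult name comes from the app; its fields, the status strings, and the helper names run_check_safely and run_upstream_checks are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    # Hypothetical fields; the real StepResult may carry more.
    name: str
    status: str       # e.g. "ok", "info", "warn", "skip"
    detail: str = ""


def run_check_safely(name: str, check: Callable[[], StepResult]) -> StepResult:
    """Run one upstream check; any exception is converted to a skip result."""
    try:
        return check()
    except Exception as exc:  # deliberately broad: a check must never break the run
        return StepResult(name, "skip", f"{type(exc).__name__}: {exc}")


def run_upstream_checks(checks: dict[str, Callable[[], StepResult]]) -> list[StepResult]:
    """Run every check in order; one failing check does not stop the others."""
    return [run_check_safely(name, check) for name, check in checks.items()]
```

With this shape, a check that raises (say, because wsl.exe is missing) shows up as a skip in the results while the remaining checks and the main run continue unaffected.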

Quality Bar signals this might suggest for v2

Declined / deferred

Item: Cross-platform support
Decision: Declined
Rationale: Windows-only is a design choice. The problem this tool solves (Windows → WSL cookie copy, MCP shell confusion, Midway step-up on Windows) doesn't exist on macOS or Linux. Porting would be solving a non-problem and adding a second code path that could drift.

Item: Make pytest a required runtime dep
Decision: Declined
Rationale: Kept pytest as a commented-out optional in requirements.txt. End users don't need it; only contributors running tests do. Adding a runtime dep just for tests would undercut the "zero pip install step" property that makes the tool frictionless for non-developer CSMs.

Item: Add property-based tests (Hypothesis) for the cookie parser
Decision: Deferred
Rationale: Considered it. The existing 8 targeted unit tests cover the malformed-input cases (binary garbage, non-integer expirations, short rows, #HttpOnly_ prefix). Property-based tests would add real value for the Netscape-cookie format invariants, but the incremental safety is small relative to adding a first-ever runtime dep. Revisiting if the parser grows.
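For context on what those targeted tests pin down, here is a minimal sketch of an earliest-expiration-wins parser over Netscape-format cookie lines. It is an illustration under stated assumptions, not the shipped parser: the function name is hypothetical, and skipping rows with an expiry of 0 (session cookies) is an assumption about the desired behavior.

```python
from typing import Iterable, Optional

HTTPONLY_PREFIX = "#HttpOnly_"  # marks an HttpOnly cookie row, not a comment
EXPIRY_FIELD = 4                # 0-based index of the expiry column
MIN_FIELDS = 7                  # domain, subdomains, path, secure, expiry, name, value


def earliest_cookie_expiry(lines: Iterable[str]) -> Optional[int]:
    """Return the smallest expiry (epoch seconds) across all valid rows, or None."""
    earliest: Optional[int] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(HTTPONLY_PREFIX):
            line = line[len(HTTPONLY_PREFIX):]  # keep the row, drop the marker
        elif not line.strip() or line.startswith("#"):
            continue                            # blank line or real comment
        fields = line.split("\t")
        if len(fields) < MIN_FIELDS:
            continue                            # short row
        try:
            expiry = int(fields[EXPIRY_FIELD])
        except ValueError:
            continue                            # non-integer expiration
        if expiry <= 0:
            continue                            # assumption: ignore session cookies
        if earliest is None or expiry < earliest:
            earliest = expiry
    return earliest
```

A caller reading a cookie file that may contain binary garbage would likely open it in text mode with errors="replace", so damaged rows fall through the short-row or non-integer branches instead of raising.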

Verification

v1.4.0 was verified on Windows 11 + AmazonWSL + Kiro IDE before the release went up:

Where to find it

If any of this feels off or you'd like me to go deeper on any of the net-new checks, happy to jump on a quick call. And if it's useful for Quality Bar v2 calibration, I can share the run-history schema the app uses for the per-machine learning loop — that one's been quietly interesting to watch as the playbook fills in.

Thanks again for the detailed read on v1.3.0. Genuinely useful to have someone scoring against a rubric I can actually respond to.

Best,
John