Using AI as a testing co-pilot: A real-world session with an embedded device

I’ve been using Claude Code (with Opus 4.6 at the time of writing) as part of my daily workflow on an embedded Linux project. The product is a compact industrial display – a small, ruggedized screen that shows real-time data in an environment where reliability is non-negotiable. The software stack is built with Yocto Linux, and the device runs a collection of interconnected services that process, format, and display incoming information.

Recently, I needed to verify a feature that had been implemented by a colleague: the ability to disable the application-level watchdog at boot time using a physical hardware button. This is the story of how that testing session went, what worked, what didn’t, and what it taught me about the division of labor between a human tester and an AI tool.

If you’re a tester thinking about bringing AI tools into your workflow, or an engineer curious about what AI-assisted testing actually looks like in practice, this is a session worth reading through.

The groundwork that made this possible

Before I get into the session itself, I want to be clear: this didn’t happen in a vacuum. I had been working on this project for months, and Claude Code had been part of my workflow for much of that time. 

It started as a practical choice: it was good at helping me write and refine shell scripts, useful for reading through unfamiliar parts of the codebase when reviewing a colleague’s changes, and well-suited for exploratory testing where I needed to quickly check device state, pull logs, and try things without scripting everything in advance. Over time, it became a natural part of how I work on this project.

The result was that by the time I started this testing session, I had a collection of helper tools – vetted, tested, and trusted – that I had built partly with Claude’s help. SSH connection handling, log retrieval, configuration parsing, screenshots, and firmware deployment. I also had documentation files capturing what I’d learned about the system architecture, the service dependencies, and the quirks of the platform.

When I brought Claude into this session, I wasn’t starting from scratch. I was handing it tools I already trusted. When it ran a command to check the device state, it was using my SSH wrapper script, the same one I’d used hundreds of times. The AI wasn’t operating blindly; it was operating within a framework I had built and validated.
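To give a flavor of what those helpers look like, here is a minimal sketch of a device-command wrapper in the same spirit. The address, SSH options, and helper names are illustrative placeholders, not the project's real values:

```shell
#!/bin/sh
# Minimal sketch of a device-command wrapper. DEVICE_ADDR and the helper
# names below are placeholders, not the project's real values.
DEVICE_ADDR="${DEVICE_ADDR:-192.168.1.50}"

dev() {
    # Run one command on the device; fail fast instead of hanging if the
    # device is unreachable or mid-reboot.
    ssh -o ConnectTimeout=5 -o BatchMode=yes "root@${DEVICE_ADDR}" "$@"
}

dev_cmdline() { dev cat /proc/cmdline; }   # kernel boot arguments
dev_uptime()  { dev cat /proc/uptime; }    # seconds since boot
```

The point is less the script itself than the trust boundary: the AI calls `dev`, and `dev` behaves exactly the way it has behaved for months.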

What we were testing

The feature under test was conceptually simple. The device has physical hardware buttons on the front panel. When a user holds down one specific button for three seconds during the boot sequence, the bootloader detects this and passes a special flag through to the operating system. The application – specifically the watchdog monitor service – reads this flag at startup and disables its own software watchdog in response.

Why does this matter? In the field, these devices sometimes get into a state where a software problem causes the watchdog to keep rebooting the unit. The service crashes, the watchdog detects it, the device reboots, the service crashes again, and the loop continues.

Previously, the only way to break this cycle was to send the device back to the manufacturer. With this feature, a field technician can hold a button during power-on, disable the watchdog, and then diagnose the problem on a running system. It’s a small feature with a big impact on serviceability.

Phase 1: Understanding the implementation

Before touching the device, I asked Claude to look at the relevant code changes in the repository. This is where having the full codebase accessible made a real difference.

Within a couple of minutes, Claude had found the relevant commit, identified the developer who wrote it, read through the code changes, and explained the mechanism to me. It described how the watchdog module checks a system file at initialization to see if the disable flag is present, and if it finds it, overrides the configured timeout value to zero – effectively turning off the watchdog without modifying any configuration files on disk. The beauty of the approach is that it’s entirely transient: the next normal boot restores the watchdog to its configured state.
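The override logic, as I understood it from the code, amounts to something like the following sketch. The flag name `wdt_disable` and the 45-second default are my placeholders, not the real identifiers:

```shell
#!/bin/sh
# Sketch of the transient-disable decision: if the boot flag appears on
# the kernel command line, the effective timeout becomes zero for this
# boot only; nothing on disk changes. Flag name is hypothetical.

effective_timeout() {
    cmdline="$1"       # contents of /proc/cmdline
    configured="$2"    # timeout from the on-disk configuration

    case " $cmdline " in
        *" wdt_disable "*)
            # Flag present: override to zero, i.e. watchdog off.
            echo 0 ;;
        *)
            echo "$configured" ;;
    esac
}

effective_timeout "console=ttyS0 wdt_disable quiet" 45   # prints 0
effective_timeout "console=ttyS0 quiet" 45               # prints 45
```

Because the flag lives only on the kernel command line, a normal reboot clears it automatically — which is exactly the transience described above.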

This kind of codebase archaeology is where an AI assistant genuinely earns its keep. What would have taken me fifteen or twenty minutes of git log diving, file reading, and cross-referencing happened in about two minutes of conversation. The result was a shared understanding between me and the tool of exactly what we were about to verify.

Phase 2: Proposing the test approach

Based on its reading of the code, Claude proposed a set of test steps. Its initial plan was a reasonable starting point, but it had a critical gap: it assumed the watchdog was working correctly without ever verifying it. If the watchdog was already broken, then proving we could “disable” it would be meaningless. We’d be testing whether a flag appears in a text file, not whether the watchdog actually stops protecting the system.

The revised plan:

  • Boot normally, verify the watchdog is active

  • Kill a monitored service and confirm the watchdog reboots the device – this proves the watchdog works

  • Boot with the button held for 3+ seconds, verify the disable flag is set

  • Kill the same service again and confirm the device does not reboot – this proves the disable works

  • Reboot normally, verify the flag is gone and the watchdog is re-enabled

The difference is subtle but important. My version tells a story: the watchdog works, then we disable it, then we prove it’s disabled, then we prove it re-enables. Claude’s version would have gone straight to testing the disable mechanism without first establishing that there was anything to disable.

I also added negative tests that Claude hadn’t considered: verifying that a short press of the correct button (under three seconds) wouldn’t accidentally set the flag, and thinking carefully about whether to test the second hardware button – more on that later.

Phase 3: Running the tests

This is where the collaboration really clicked. I was the one making decisions and watching the physical device. Claude was the one running commands, reading output, and keeping track of what we’d verified.

Establishing the baseline

We booted the device normally. Claude checked the kernel command line – no disable flag present. It checked the watchdog status file – state was “Active,” software timeout configured at 45 seconds. It checked that all the application services were running. Everything was up and healthy. Good baseline.

Proving the watchdog works – first attempt

I asked Claude to kill one of the application services to trigger the watchdog. It picked the monitoring service – the process that watches all the other services and triggers reboots when something goes missing. It killed the process, waited what it thought was enough time for a reboot, ran a command, got a response from the device, and concluded that the device had rebooted successfully. It even presented evidence: uptime numbers, boot reason logs, and status data.

But I was looking at the physical device the entire time. It seemed to be working normally. It hadn’t rebooted.

What happened makes perfect sense in hindsight: the monitoring service is the process that watches everything else and initiates reboots. If you kill the monitor, there’s nothing left to trigger the reboot. It’s like disabling the fire alarm and then checking whether the fire alarm goes off.

The data Claude presented was real, but it was from the still-running device, not from a fresh boot. The SSH connection had stayed up the whole time because the device never went down. Claude looked at the numbers, saw plausible values, and drew the wrong conclusion.

When I pointed this out, Claude adjusted.

AI tools are incredibly good at reading output and drawing conclusions from text. But they cannot see the device. They cannot hear the fan spin up. They cannot watch the boot logo appear.
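What text output *can* show reliably is a counter reset. A reboot resets the device's uptime, so comparing a reading taken before the expected reboot with one taken after would have caught the false positive. A sketch, with the two readings passed in as plain numbers:

```shell
#!/bin/sh
# A reboot resets /proc/uptime, so "after < before" is the only
# trustworthy text-only signal that the device actually went down.
# Readings are whole seconds; collecting them is left to your tooling.
rebooted_since() {
    before="$1"
    after="$2"
    [ "$after" -lt "$before" ]
}
```

Had the session used a check like this, the monitor-service experiment would have failed immediately: the uptime kept climbing because the device never went down.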

Proving the watchdog works – second attempt

We rebooted the device to get back to a clean state. This time, I asked Claude to kill one of the monitored services instead of the monitoring service itself. It tried to kill the display process first, but the display runs with elevated privileges, and our SSH session didn’t have permission. Rather than flagging this and asking how to proceed, it silently fell back to killing a different service.

I caught this and called it out. Not because the substitute was a bad choice – it was actually fine, since the watchdog monitors it – but because in testing, you don’t silently change the plan. You flag the obstacle, discuss the options, and agree on the next step. This is especially important when you’re working with a physical device where the wrong action can cause real problems.

After we agreed on the approach, Claude killed the service. I watched the device. After about 40 seconds, it rebooted. When it came back up, Claude checked the watchdog status file: the boot reason was logged as a mandatory process being missing. The watchdog was doing its job.

Testing the disable

After establishing that the watchdog works, I power-cycled the device while holding the button for three seconds. Claude checked the kernel command line – the disable flag was present. The kernel logs confirmed it had been received and passed to the application. The monitoring service was running, but with the timeout overridden to zero.

Then came the real test. Claude killed the same service again – the same one whose death had triggered a reboot just minutes earlier. I watched the device. The display froze (that service feeds data to it), but the device stayed up. One minute passed. Two minutes. No reboot. The watchdog had been successfully disabled.
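Proving a negative (“the device does not reboot”) needs a bounded wait rather than a single check. One way to sketch it: poll uptime for a fixed window and fail the moment the counter resets or the device stops answering. Here `probe_cmd` stands in for whatever command reads uptime from the device — a placeholder, not a real helper from the project:

```shell
#!/bin/sh
# Poll uptime for a fixed window; succeed only if the counter climbed
# monotonically the whole time (no reboot, no lost contact).
stays_up() {
    probe_cmd="$1"   # command printing current uptime in whole seconds
    polls="$2"       # number of samples to take
    interval="$3"    # seconds between samples

    last=0
    i=0
    while [ "$i" -lt "$polls" ]; do
        now="$($probe_cmd)" || return 1      # lost contact: treat as failure
        [ "$now" -ge "$last" ] || return 1   # counter reset: it rebooted
        last="$now"
        sleep "$interval"
        i=$((i + 1))
    done
    return 0
}
```

Watching the physical screen is still the ground truth; a loop like this just keeps the text-side evidence honest.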

Verifying recovery

I rebooted the device normally – no button press. Claude checked: no “disable” flag in the command line, watchdog state “Active,” software timeout back to 45 seconds. The disable was truly transient, as designed.

Claude was satisfied at this point – without the flag, the watchdog should be back to normal. But I wanted to prove it, not just infer it. So we killed the same service one more time. After about 40 seconds, the device rebooted. The watchdog was back.

Negative tests

First, I rebooted while giving the correct button a quick tap – well under three seconds. Claude confirmed the flag was not set. The button detection correctly requires a sustained press, not an accidental brush. This matters in the field: you don’t want a technician accidentally disabling the watchdog by bumping a button during a routine power cycle.

Then there was the question of the other hardware button. On this device, the other button triggers a factory reset during boot – it wipes the device configuration and restores everything to defaults. Claude had suggested testing this button as a negative test: press it and verify the watchdog disable flag doesn’t get set.

The logic was sound, but a factory reset would wipe my entire device configuration – network settings, application parameters, calibration data, everything I needed for the rest of my testing. It would have cost me an hour of reconfiguration. So I moved it to the very end of the session, after all other testing was complete.

The test passed – pressing the reset button did not set the watchdog disable flag. This wasn’t the strongest test – the two buttons use completely different mechanisms in the bootloader, and testing one provides limited additional confidence about the other – but it was still worth running at the end of the session, when there was no cost to losing the device configuration. 

This illustrates something fundamental about test planning:

  • A test that makes logical sense can still be expensive to run at the wrong time. Deciding when to run a test, and how much weight to give its results, is as much a testing skill as deciding what to test in the first place.

  • It requires understanding not just what the test would prove, but what it would cost and what it would disrupt. An AI tool tends to optimize for coverage. A human tester optimizes for value given constraints.

Both perspectives are useful, but when they conflict, the human perspective should win.

Phase 4: Documentation

After the testing was done, I asked Claude to write up the results as comments for the issue tracker. This is where the time savings really add up. I’d just spent an hour carefully working through a test scenario, and all the evidence – command outputs, status data, boot logs, timing observations – was sitting in our conversation history. Claude drafted comments that I could paste directly with minimal editing.

It also generated a markdown file with the complete test results, structured as a set of steps with pass/fail outcomes. This became part of the project documentation.

Having an AI that can turn a testing session into a proper write-up means the work actually gets captured.

The documentation step is easy to undervalue, but it’s often where testing sessions lose their impact. You run through a careful verification, you’re satisfied the feature works, and then you write a two-line comment in the ticket because you’re tired and you want to move on.

What I took away

The entire session – from reading the code to the final documentation – took about an hour. Without the AI, it would have taken two or three times as long, and the documentation would have been half as thorough. The AI didn’t change what I tested or how I thought about testing. It changed how fast I could move through the mechanical parts, leaving me more time and energy for the parts that actually matter.

A few things I’d want anyone considering this kind of workflow to keep in mind:

Invest in your infrastructure first

The session was productive because I had prepared the ground. My SSH scripts, my helper tools, my documentation – these were the foundation. Write good scripts and test them – this is essential, especially if you get AI help writing them. Document your system. Build tools you trust. Then hand those tools to the AI and let it use them.

You are the source of truth

The AI is confident and usually right. But it works entirely with text output from commands, and text can be misleading. When command output and physical reality disagree, physical reality wins. Don’t let plausible data overrule what you can observe directly.

Think about what you’re proving, not just what you’re checking

The AI’s initial test plan would have verified that values changed in files. My revised plan proved a behavior: the watchdog reboots the device when it should, and stops rebooting when told to. The difference is between a test that ticks a box and a test that gives you confidence.

Shape the collaboration actively

Early on, Claude made silent substitutions and premature conclusions. By the end, it was asking before acting and presenting evidence for me to evaluate. This happened because I actively shaped the interaction. Don’t just accept the AI’s default behavior. Tell it what you want. The tool learns fast within a session.

The most useful thing I can say to anyone considering this kind of workflow is that the results depend almost entirely on how thoughtfully you approach the collaboration. Bring good tools, bring good judgment, and the AI becomes a capable partner.
