The Human Touch in the Era of AI: Why Pentesting Still Needs a Human in the Loop

The current narrative is that AI is going to flatten penetration testing: run a model against an attack surface, let it iterate, collect findings. AI tooling is genuinely accelerating real parts of the workflow, but the layered model, where automation handles scale and a human steers the engagement, is where some of the most consequential findings get built. Two recent engagements I was on made that case in concrete terms. In each, the impact grew because a human supplied context (a broadcast schedule, a network's safety posture, a timestamp on a VM disk) and turned that context into the next decision, one the tooling alone was not set up to make.

This post walks through both of those moments. One was a few seconds of thought that turned an indexed-but-quiet host into a live finding against a broadcast-grade caption encoder. The other was a long internal engagement on a sensitive administrative network where the rules of engagement required deterministic, human-driven tooling for safety reasons, and the path to a full business compromise was a person writing custom scripts and reading the output line by line.

Key Takeaways

An indexed-but-quiet host only made sense to retest during the broadcast station's published on-air hours. Anchoring the retest timing to the device's operational schedule is what turned a previously cataloged banner into a confirmed, unauthenticated, real-time interface into the live caption signal path.
Sensitive administrative and operational technology networks frequently require deterministic, auditable tooling for safety reasons. On those engagements the methodology that fits the customer's safety bar is human-reviewed, deterministic code, and that constraint does not reduce impact; it scopes the methodology.
A custom looting toolkit against an open NFS datastore turned read access on virtualization storage into 404 Linux shadow hashes, 1,304 NTLM hashes including five domain controllers' ntds.dit, 31 SSH private keys, and ultimately the company source code monorepo. The full impact chain depended on a tester recognizing a fresh modification timestamp on a single VM disk image.
AI-augmented testing is a real efficiency gain on enumeration scale, fingerprinting, documentation parsing, and noise reduction. The layered model (automation handling discovery and analysis at scale; a human steering the engagement against the target's business context) is where the most consequential findings get built, and the two layers compound when paired.
Continuous penetration testing is the model that gives testers the time and mandate to act on this kind of context. Time-boxed annual engagements race past the hunches that turn an indexed banner into a critical finding.

Anchoring a Quiet Host to Its Operational Schedule

The host in question was a broadcast-grade closed-caption encoder, sitting in the live caption signal path of a TV station's on-air feed. It had been indexed earlier in a scan database advertising a Telnet listener on port 23, with a banner disclosing the exact model and firmware. When initial testing began, the host did not respond to probes and was not returning a connection on the advertised port. At that moment, a first-pass recon sweep had nothing to act on. What changed the analysis was the device class itself: a caption encoder almost certainly only powers on when the station is on the air, and the previously indexed banner was strong evidence that the host was real, owned by the in-scope organization, and addressable, just not at that minute.

I looked up the broadcast station's published programming hours, waited until they were live, and reconnected. The Telnet listener answered immediately and dropped the session into the device's diagnostic command interface with no authentication step. From there, the documented control-sequence commands in the manufacturer's public manual were all reachable. Read-only commands returned the firmware version, the caption articles currently loaded for insertion into the live stream, the output queue staged for broadcast, the SDI relay/bypass state on the active signal path, and a full system status dump.

The same interface documents destructive commands that an attacker could trigger from the same unauthenticated session: a reset sequence that knocks captions off the air for the duration of the reboot, and an article-injection sequence that lets arbitrary text appear in the captions shown to viewers in real time. These were one keystroke away on the same TCP session that produced the read-only evidence.

This is exactly the kind of finding where automation and human judgment work best as a layered pipeline. Automated discovery and AI-assisted analysis can dramatically accelerate the surface-mapping, fingerprinting, and documentation-reading steps; a human steering the engagement is the layer that ties those steps to operational context: what the device does for a living, when it is online, and which documented commands carry real-world impact for the business that owns it. The reasoning chain (device class to operational schedule to retest timing to documented commands to on-air impact) is the kind of context-anchored sequence that gets sharper, not less necessary, as the underlying tooling gets better.
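The retest-timing step in that chain is simple enough to sketch. The following is a minimal illustration, not the tooling used on the engagement: `in_window` is a hypothetical helper, and the 0600 to 2300 on-air window in the usage comment is an assumed placeholder rather than the station's real schedule.

```shell
# in_window NOW START END
# Succeeds when NOW falls inside [START, END); all values are HHMM strings.
# A window whose END is earlier than START is treated as wrapping past midnight.
# Hypothetical helper for illustration only.
in_window() {
  now=$1 start=$2 end=$3
  if [ "$start" -le "$end" ]; then
    [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
  else
    [ "$now" -ge "$start" ] || [ "$now" -lt "$end" ]
  fi
}

# Example gate (assumed window; substitute the published schedule):
#   if in_window "$(date +%H%M)" 0600 2300; then
#     echo "station should be on air; retest the indexed host now"
#   fi
```

The check itself is trivial. The judgment that a caption encoder follows the station's airtime, and that the schedule is therefore the right input, is the part a human supplied.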

Why Sensitive Internal Networks Still Require Deterministic, Human-Driven Testing

The second engagement was inside a sensitive administrative network. The rules of engagement prohibited non-deterministic and autonomous tooling against the in-scope hosts. In plain terms, no AI agent was permitted to act on its own judgment, and every action taken by the testing tooling had to be predictable in advance and auditable after the fact. That is not a stylistic preference; it is a safety boundary the customer set. On networks that carry production state, virtualization storage, or operational technology, the customer's infrastructure team needs to be able to predict exactly what the testing tooling will do and to stop it at a moment's notice if production state is at risk. The methodology had to match that bar: a human creating the code, reading the code, and triggering the code with full understanding of what each line would touch.

The final script ended up being hundreds of lines of bash, written with the assistance of AI. During development I kept running into new edge cases, bugs, and mounting tweaks, and iterated until the final version ran successfully against every mountable host on the share. Realistically, I could not have written these custom scripts unaided in a reasonable amount of time. What made this attack chain possible, safe, and deterministic was the pairing: AI-assisted authoring, with a human driving the design, testing, and auditing of the tooling.

Enumeration of the internal network surfaced two hosts exposing NFS on 2049/tcp with no host restrictions and no authentication. The exports turned out to be the full vSphere and Harvester virtualization datastores, containing 1,264 virtual machine directories. Anyone network-adjacent could mount the share and read every .vmdk on disk, including the live, running production guests.

showmount -e <storage-host>
mount -t nfs <storage-host>:/data0/vsphere /mnt/vsphere

A single mount returned the full directory listing of every VM the platform had ever deployed.
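Deciding which exports are open to any client can be done mechanically from the `showmount -e` listing. A minimal sketch, assuming the usual output layout (a header line, then one export path and its client list per line, with `*` or `(everyone)` marking an unrestricted export). `open_exports` is a hypothetical helper name, and export paths containing spaces are not handled.

```shell
# open_exports: read `showmount -e` output on stdin and print the export
# paths whose client list is unrestricted ("*" or "(everyone)").
# Hypothetical helper; assumes one "path clients" pair per line after the header.
open_exports() {
  awk 'NR > 1 && ($2 == "*" || $2 == "(everyone)") { print $1 }'
}

# Usage against a host (placeholder as in the commands above):
#   showmount -e <storage-host> | open_exports
```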

Mounting a Live Virtual Disk for Offline Filesystem Analysis

Read access to a flat.vmdk is read access to the guest's filesystem. The standard workflow is to attach the raw disk image to a network block device on the testing host, re-read the partition table, and mount each filesystem read-only. For a Linux guest using LVM2:

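# Load the NBD kernel module if it is not already loaded (provides /dev/nbdN)
modprobe nbd max_part=16
# Attach the raw disk image read-only, then re-read its partition table
qemu-nbd --read-only --format=raw --connect=/dev/nbd1 /mnt/vsphere/<vm-dir>/<vm>-flat.vmdk
partprobe /dev/nbd1
udevadm settle
lsblk /dev/nbd1
# Activate the guest's LVM volume group, then mount the root LV read-only without journal replay
vgchange -ay ubuntu-vg
mount -o ro,noload /dev/ubuntu-vg/ubuntu-lv /mnt/<vm>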

That sequence is unremarkable on its own. The point is that on a network where the rules of engagement require deterministic, reviewable execution, this is how the engagement makes forward progress: a person writing the exact mount sequence, validating it on a single host, and only then scaling it.
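Because the safety argument rests on the mount being read-only, one cheap guard is to confirm the mount options before any script reads from the guest. A sketch under assumptions: `is_ro_mount` is a hypothetical helper, and it reads a mounts table in the `/proc/mounts` format (device, mount point, type, options), which is Linux-specific. Mount points containing spaces are not handled.

```shell
# is_ro_mount MOUNTPOINT [MOUNTS_FILE]
# Succeeds only if MOUNTPOINT appears in the mounts table and its option
# list contains "ro". MOUNTS_FILE defaults to /proc/mounts (Linux).
# If the mount point is listed more than once, the last entry wins.
# Hypothetical helper for illustration.
is_ro_mount() {
  awk -v mp="$1" '
    $2 == mp {
      n = split($4, opts, ",")
      ro = 0
      for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
      found = 1
    }
    END { exit !(found && ro) }
  ' "${2:-/proc/mounts}"
}

# Usage: refuse to continue unless the guest is mounted read-only.
#   is_ro_mount /mnt/<vm> || { echo "not read-only, aborting" >&2; exit 1; }
```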

Building a Looting Toolkit a Human Can Audit Line by Line

With 1,264 mountable VM directories, manual enumeration of each guest was not realistic. I wrote a toolkit consisting of three pieces: a host-mountability validator, a wrapper that mounted each target and invoked the per-host looter, and two looting scripts, one for Linux guests and one for Windows guests. Every script was idempotent on read-only mounts, logged every action, and stopped on any unexpected condition. A human reviewer could read each script top to bottom and predict exactly what it would touch.
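Logging every action and stopping on any unexpected condition is mostly a matter of discipline in how commands are invoked. Here is a minimal sketch of one way to get that property; it is not the engagement's actual toolkit, and `run_logged` and the `LOG_FILE` variable are hypothetical names.

```shell
# Log file location is an assumption; override LOG_FILE as needed.
LOG_FILE=${LOG_FILE:-./actions.log}

# run_logged CMD [ARGS...]
# Records the command before running it, records the exit status after,
# and returns that status so the caller can stop on the first failure.
run_logged() {
  printf '%s RUN  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG_FILE"
  "$@"
  status=$?
  printf '%s EXIT %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$*" >> "$LOG_FILE"
  return "$status"
}

# Usage: abort the whole run as soon as one step misbehaves.
#   run_logged lsblk /dev/nbd1 || exit 1
```

Writing the log entry before the command runs means the record survives even if the step hangs or the operator has to interrupt it.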

The Linux looter targeted the high-yield credential surface that comes up on engagement after engagement:

/etc/passwd, /etc/shadow, /etc/sudoers, and sudoers.d/ for hash material and privilege configuration
Root and per-user .ssh/ directories, capturing private keys, authorized_keys, and known_hosts
Shell history (.bash_history, .zsh_history) for command-line credentials and contextual clues
Cloud credentials under user homes (.aws/credentials, .config/gcloud, .kube/config, .docker/config.json, .netrc, .git-credentials)
Ansible vault material across /home, /root, /opt, and /etc/ansible
Identity configuration: /etc/sssd/sssd.conf, /etc/krb5.conf, /etc/environment
Cron jobs (/etc/crontab, /etc/cron.*), which routinely contain hardcoded credentials in scheduled scripts
The first N docker-compose.yml and .env files within a bounded depth, capturing container secrets without runaway recursion
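
The bounded search in the last item is worth spelling out, since an unbounded recursive walk over a guest filesystem is the kind of runaway behavior a deterministic toolkit has to rule out. A sketch, assuming a depth limit of 4 and a cap of 20 files (both illustrative values, not the ones used on the engagement). `bounded_compose_files` is a hypothetical name, and `-maxdepth` is a GNU/BSD `find` extension rather than POSIX.

```shell
# bounded_compose_files ROOT [MAX_DEPTH] [MAX_FILES]
# Lists at most MAX_FILES docker-compose.yml / .env files found no deeper
# than MAX_DEPTH levels below ROOT. Defaults are illustrative assumptions.
bounded_compose_files() {
  find "$1" -maxdepth "${2:-4}" -type f \
    \( -name 'docker-compose.yml' -o -name '.env' \) 2>/dev/null |
    head -n "${3:-20}"
}
```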

The Windows looter focused on the offline credential surface:

Registry hives from Windows\System32\config\ (SAM, SYSTEM, SECURITY, SOFTWARE), processed offline with secretsdump.py to recover local NTLM hashes, cached domain credentials, and LSA secrets
The Windows\repair\ copies of SAM and SYSTEM as a fallback when the live hives were locked
Windows\NTDS\ntds.dit from domain controllers, exempted from the file-size cap because the entire engagement frequently hinges on it
Unattend.xml and Unattended.xml from Panther and sysprep directories; these still routinely retain plaintext local administrator passwords
Per-user PowerShell history, DPAPI credential blobs, and SSH keys

I ran a deep variant against 57 high-value targets identified by hostname, role, and recent activity. The output:

Artifact class	Count
Hosts attempted / successfully looted	57 / 33 (11 Windows, 22 Linux)
Domain controllers dumped	5 (across 3 forests)
ntds.dit user hashes (excluding history and machine accounts)	90
krbtgt hashes captured (distinct)	4
Local SAM hashes	35
Linux shadow hashes	29
LSA secrets recovered	316
DPAPI credential blobs	11
Unprotected SSH private keys	9
Plaintext credential pairs	5

After validating coverage I pared the toolkit down to a lightweight variant that pulled only /etc/shadow-style hash material, root and user .ssh/ contents, and the Windows SAM/SYSTEM/SECURITY hives plus ntds.dit. Run against the full share, the lightweight pass produced 404 Linux hashes, 1,304 NTLM hashes, and 31 private SSH keys across 530 successfully looted hosts. Offline cracking against that hash set recovered 20 unique plaintext passwords and 27 complete credential pairs, which were then sprayed against reachable internal services along the rest of the attack chain.

Why This Workflow Had to Be Deterministic and Human-Driven

Every step above had to be observable, interruptible, and predictable to the customer's infrastructure team before it ran. Mounting a live production VM read-only is safe; mounting it read-write is not. Iterating across 1,264 disk images is safe with a deterministic, audited loop where the operator has reviewed exactly which paths the tooling will read from. The scope rules existed because the customer's infrastructure team needed to be able to read the testing tooling top to bottom, predict in advance exactly what it would do, and stop it on a moment's notice if production state was at risk.

A human writing, reviewing, and triggering these scripts could meet that bar on day one. The constraint here was the engagement, not the technology: this customer's safety posture required deterministic, human-driven execution against the in-scope hosts. The only methodology that fit inside that boundary was a person, working collaboratively with AI rather than handing the work to AI alone, writing the toolkit, auditing it, and running it under their own hand.

How a Single Modification Timestamp Turned Loot Into a Source Code Compromise

The most consequential pivot of the engagement came from noticing a single timestamp. While reviewing the looter's output across the high-value hosts, one VM stood out because its disk image had been modified the same day testing was underway. That meant the guest was a live, in-use developer machine, and a live developer machine in 2026 almost always carries fresh, valid keys to source control.
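Surfacing that signal does not require anything clever; the hard part was knowing it mattered. A minimal sketch of listing disk images modified within the last day, assuming the `-flat.vmdk` naming used in the mount commands. `recent_disks` is a hypothetical helper and the one-day threshold is an illustrative choice.

```shell
# recent_disks ROOT [DAYS]
# Prints *-flat.vmdk files under ROOT modified less than DAYS*24 hours ago
# (default 1). A recent mtime suggests the guest is live and in use.
# Hypothetical helper for illustration.
recent_disks() {
  find "$1" -type f -name '*-flat.vmdk' -mtime "-${2:-1}" 2>/dev/null
}

# Usage:
#   recent_disks /mnt/vsphere
```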

I remounted that single host manually and started reading. Inside the developer's home directory the bash history showed a sequence that is depressingly familiar: an attempt to clone a private repository, an authentication failure, a freshly generated SSH key, the same clone re-attempted, and then a successful navigation into the cloned tree. The key was still on disk in the user's .ssh/ directory. It had been generated several months earlier and there was no indication it had been rotated.

Authenticating to the source control provider with that key (in the context of the developer's account, not the company's) succeeded. The full company source monorepo cloned cleanly. A secrets scanner over the cloned tree returned 87 validated credentials including API keys, SSH keys, and cloud provider keys, alongside more than 18,000 additional credential candidates across embedded configuration, build pipelines, and historical commits.
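Purpose-built secrets scanners combine many patterns with live validation. As an illustration of the candidate-finding half only, here is a sketch that searches a tree for strings shaped like AWS long-term access key IDs (`AKIA` followed by 16 uppercase letters or digits). `aws_key_candidates` is a hypothetical name, and a match is only a candidate, not a validated credential.

```shell
# aws_key_candidates DIR
# Prints "file:match" for every string under DIR shaped like an AWS
# long-term access key ID. Matches are unvalidated candidates.
# Hypothetical helper for illustration.
aws_key_candidates() {
  grep -rEo 'AKIA[0-9A-Z]{16}' "$1" 2>/dev/null
}
```

The gap between 87 validated credentials and more than 18,000 candidates is the gap between a pattern match like this one and a confirmed, working secret.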

What This Looks Like Without a Human in the Loop

Both engagements turned on the human layer doing something the automated layer alone was not set up to do. On the broadcast encoder, the indexed-but-quiet banner only became a confirmed finding once the tester anchored the retest to the station's airtime window: a piece of context that lives in a programming schedule, not in any banner or vendor manual. On the internal engagement, the customer's safety posture required deterministic, human-driven execution by rule, so the testing tooling and the human running it had to be a single auditable unit. And the most consequential pivot in the entire engagement (a fresh modification timestamp on a single VM disk image) is the kind of signal that is easy to surface in a long log and easy to walk past unless someone is reading with the intuition that fresh equals valuable.

This is not a "humans good, AI bad" argument. AI tooling is genuinely useful, and getting better fast, on enumeration scale, surface mapping, fingerprinting, documentation parsing, response triage, and the parts of pentesting that look like pattern matching across high volumes of data. The argument is narrower: the layered model where automation accelerates the discovery and analysis steps and a human steers the engagement against the target's business context is where the most consequential findings get built. The two layers compound; neither replaces the other.

Why Continuous Testing Is the Model That Lets This Happen

Time-boxed annual engagements force testers to race through scope. There is no time to look up a broadcast schedule and come back four hours later. There is no time to write, audit, and validate a custom looting toolkit, run it in two passes, and read the output carefully enough to spot a single suspicious timestamp. Continuous penetration testing is the model that gives the human in the loop the time and the mandate to act on context, and the engagements above are what that looks like in practice.

Both findings happened because someone had room to think. The encoder finding happened because there was time to come back during airtime. The NFS-to-source-code chain happened because there was time to write deterministic tooling, audit it, run it twice, and read the output carefully. None of that fits inside a one-week scoped pentest, and none of it happens without a human steering the engagement against the target's operational reality.

If your current testing model is producing predictable, repeatable findings year over year, the constraint is the model, not your environment. Talk to us about continuous penetration testing, the version of testing that gives a human the time to notice the timestamp.

Stephen Naragon