Category: AI & Automation

Two Forecasting Systems, and Only One of Them Was Real

The forecast was wrong, and everyone knew it was wrong, and nobody knew why.

That’s the specific, uncomfortable place to start a debugging session from. Not “there’s a bug” — there’s a number, printed on a dashboard people actually look at, and it’s been quietly too high for weeks. Not broken enough to alarm anyone. Wrong enough that nobody trusted it.

I went looking for the model. That’s the first mistake, and I want to be honest about it: I assumed there was one forecasting system, and it had a bug in it. What I actually found, once I started reading the code instead of the documentation about the code, was two separate systems living in the same codebase. One was a machine learning model — trained, evaluated, and then never actually saved anywhere. A path that looked live in the architecture diagram and had been dead for who knows how long. The other was a much simpler statistical system, tracking completion ratios against historical patterns, retrained regularly, and — when I actually tested it in isolation — producing numbers that looked correct.

That was the surprise. The model everyone assumed was doing the forecasting wasn’t running at all. The real system was quietly fine. Which meant the inflated number wasn’t coming from a broken forecast. It was coming from something downstream of a correct one.

The struggle wasn’t finding the second system — it was sitting with the discomfort of “the thing I was sure was broken turned out to be working,” and having to admit that meant the actual bug was somewhere I hadn’t looked yet. It’s a specific kind of frustrating to disprove your leading theory a week into hunting for something. The instinct is to keep pushing on the theory because you’ve already invested in it. I had to let it go and start over from “okay, if the inputs are right, where does the number actually go wrong.”

The lead I ended up with, and haven’t fully closed yet: the projection math likely divides a current count by a ratio measured at a specific hour, and if that ratio is underestimated early in a shift, the division inflates everything downstream of it — a small early error compounding into a big late one. I don’t have it fully proven yet. But it’s a real, specific, testable hypothesis, which is further than “the forecast is wrong” ever got anyone.
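To make that hypothesis concrete, here is a minimal sketch of the kind of projection it describes, assuming the math has the form projected total = current count / completion ratio. The function name, the ratio definition, and every number below are illustrative assumptions, not values from the real system.

```python
def project_total(current_count: float, completion_ratio: float) -> float:
    """Project an end-of-shift total from a partial count.

    completion_ratio is assumed to be the fraction of the final total
    that has historically arrived by the current hour.
    """
    if not 0 < completion_ratio <= 1:
        raise ValueError(f"completion_ratio out of range: {completion_ratio}")
    return current_count / completion_ratio


# Early in a shift: the same two-point error in the ratio is a large
# relative error, so the projection overshoots by about 25%.
correct = project_total(50, 0.10)    # about 500
inflated = project_total(50, 0.08)   # about 625

# Late in a shift: the same two-point error barely moves the projection.
late_correct = project_total(450, 0.90)   # about 500
late_skewed = project_total(450, 0.88)    # about 511

print(inflated / correct, late_skewed / late_correct)
```

If the real projection has this shape, logging the ratio and the resulting quotient at each hour should show the overshoot concentrated in the early hours.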

The transferable part isn’t the bug. It’s that “which system is actually running” is a question worth asking before “what’s wrong with the system,” every time — because the two questions send you down completely different paths, and only one of them is real.

Next: instrumenting the actual division step directly, hour by hour, instead of trusting the summary numbers on either end of it.

July 30, 2026
The Pipeline That Runs While I Sleep

I woke up on a Wednesday and checked Tower before I checked anything else.

RALPH had fired at 2:47am. Detected a pattern in the task queue, ran a research chain, summarized the output, and filed it. No prompt from me. No session open. I was asleep. The system decided something was worth doing and did it.

That’s the thing I built toward for two years of warrior sessions. Not the feature. Not the specific task RALPH ran that night. The fact that it happened without me.

What’s Actually Running

There are four systems on Tower right now that operate independently of my presence.

PrivyBot is the oldest and most capable. It’s a personal autonomous AI assistant — Python, FastAPI, 131 MCP tools, running as an NSSM service on Tower. It has a priority queue, an async task loop, and RALPH: a persistent overseer that fires on schedule and monitors for things worth acting on. Email summaries. GitHub activity. YouTube analytics. Game metrics. It doesn’t wait for me to ask. It runs its own loop and surfaces what matters.

The test floor is 557 passing, 0 failing. I know that number is real because I certified it myself. Every phase of PrivyBot’s development ended with that verification before the next phase started. 33 phases. The floor moved up each time, never down.

ContentPipeline is the YouTube operation. I play games. The pipeline records the session with OBS, transcribes it with Whisper running on Tower’s GPU, identifies the moments worth keeping, generates captions, assembles the Short with FFmpeg, and schedules the upload. The calendar runs through July 2026 without me touching it. The pipeline built the calendar. I just played the games.

TeleseroAdmin2026 runs during business hours without supervision. It watches six dialing servers, monitors list performance, and swaps underperforming lists automatically based on thresholds I defined. 262 tests. Full-auto loop. The intervention it was built to eliminate — me watching metrics and making manual swaps — hasn’t happened in months.

DNC Automation runs on Cloud Run. Compliance checks that used to be a manual process, now a deployed service. Stable. I check it roughly every two weeks to confirm it’s still running. That’s the entirety of my interaction with it.

How You Get There From a Warrior Session

None of these started as systems. They started as scripts.

PrivyBot started as a Telegram bot that could answer questions. TeleseroAdmin2026 started as a Python script named by date that logged into a portal and swapped one list. ContentPipeline started as a single produce_short.py file that required manual input at every step.

The path from script to autonomous system is always the same and always takes longer than you expect.

First you automate the thing you do most often. Then you notice the thing adjacent to it that you’re still doing manually. You automate that. Then you realize the two automations need to talk to each other, which requires a shared config. The shared config implies a shared schema. The shared schema implies a system.

You don’t design the system. You discover it. The design document comes after, when you’ve accumulated enough automated pieces to see the shape of what they’re forming.

The warrior sessions are how the pieces accumulate. Ninety minutes on a Tuesday night adds the encoding handler. Two hours on a Saturday adds the deduplication pass. A three-hour session where something clicked adds the orchestration layer that connects them. None of those sessions felt like building a system. They felt like solving the problem in front of you.

At some point you look up and there’s a system.

What Autonomous Actually Means

I want to be precise about this because “autonomous” gets used loosely.

Autonomous doesn’t mean unsupervised forever. It means the system handles the routine cases without requiring a human in the loop for each one. The edge cases still surface. The unexpected failures still need attention. The system doesn’t replace judgment — it handles volume so judgment is reserved for the things that actually need it.

RALPH firing at 2:47am and running a research chain is autonomous. RALPH discovering a new class of task it’s never handled before and stopping to report it rather than guessing — that’s also autonomous, in a different direction. The system knows what it knows and flags what it doesn’t.

TeleseroAdmin2026 swapping a list because a performance threshold was crossed is autonomous. TeleseroAdmin2026 encountering a portal login flow that changed after a site update and stopping the loop rather than proceeding incorrectly — still autonomous. The right behavior in an unexpected situation isn’t always to act. Sometimes it’s to stop and surface the situation.

The systems I trust are the ones that fail loudly when they’re outside their design envelope. The ones that fail quietly, that keep operating in edge cases and produce confident, wrong output, aren’t autonomous systems. They’re liabilities.
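A minimal sketch of what "act inside the envelope, stop loudly outside it" can look like for a threshold-based swap. The metric (connect rate), the field names, and the exception class are hypothetical; the real thresholds and checks are not shown in this post.

```python
from dataclasses import dataclass


class OutsideDesignEnvelope(RuntimeError):
    """Raised when the loop sees a state it was not built to handle."""


@dataclass
class ListStats:
    name: str
    dials: int
    connects: int


def should_swap(stats: ListStats, min_connect_rate: float, min_dials: int) -> bool:
    """Return True when a list has enough volume and is under threshold."""
    if stats.dials < 0 or stats.connects < 0 or stats.connects > stats.dials:
        # Impossible numbers usually mean the upstream page or API changed.
        # Stop and surface the situation instead of acting on bad data.
        raise OutsideDesignEnvelope(f"implausible stats for {stats.name}: {stats}")
    if stats.dials < min_dials:
        return False  # not enough data to judge yet
    return stats.connects / stats.dials < min_connect_rate


print(should_swap(ListStats("list-a", dials=1000, connects=20), 0.05, 500))
```

The caller is expected to let `OutsideDesignEnvelope` halt the loop and notify a human rather than catch it and continue.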

This is the same principle I apply to coding agents. A system that tells you it succeeded when it didn’t isn’t a trustworthy system. Raw terminal output only. The floor is real or it isn’t.

The Compounding

Last Wednesday RALPH ran 14 tool calls before 6am. By the time I was at my desk, there was a digest waiting: yesterday’s YouTube performance, an alert on a campaign metric that drifted outside threshold, a summary of three GitHub commits I’d made the night before with notes on what each one changed.

I didn’t ask for any of it. I configured the system to care about those things, and the system cared about them while I slept.

That’s a different relationship with work than I had two years ago, when every piece of information about my projects required me to go get it. The information is still there. The systems go get it for me and bring back what matters.

The compounding isn’t the time saved on any individual task. It’s the accumulation of context that’s available without friction. I sit down knowing the state of things because the systems maintained the state while I was away. The warrior sessions start from a known position instead of starting with reconnaissance.

Why This Is the Pitch

The consulting angle I’m building toward isn’t “I’ll automate things for you.”

It’s “I’ll build systems that maintain themselves.”

There’s a specific kind of buyer for this: operations managers at contact centers, at lead-generation companies, at any business where a significant portion of labor is humans doing deterministic work that could be encoded. They’ve heard about automation. They’ve seen demos. What they haven’t seen is someone who built it for themselves first, runs it in production, and can point to a floor that’s real because they certified it personally.

The demo isn’t a slide deck. It’s Tower. It’s RALPH. It’s a system that was running while I slept and will still be running when this conversation ends.

You can’t pitch autonomous systems credibly without having built them. You can’t build them without the warrior sessions. The sessions were never just sessions — they were the R&D for a product I hadn’t named yet.

The Honest Accounting

There are systems I built that aren’t running. Scripts that automated a task I stopped doing. Repos that solved a problem that no longer exists. Not every warrior session produces something that compounds — some of them produce something that was useful once and isn’t anymore.

That’s fine. The return on the ones that do compound is high enough to cover the ones that don’t. PrivyBot is worth every session that went into it and several that went into things I’ve since discarded. ContentPipeline has scheduled more content than I’ve actively thought about. TeleseroAdmin2026 has run more dialing adjustments than I could have made manually in the same period.

The pipeline runs while I sleep. That’s not a metaphor for anything. It’s a literal description of what happens between midnight and 6am on Tower.

I built that. In the margins. One session at a time.

July 26, 2026
The Bug That Looked Like Slow, and Was Actually Broken

It was one of those checks that should’ve taken thirty seconds. I ran a search against a real list — a few hundred leads, standard call, nothing exotic — and it just sat there. No error. No result. Just quiet.

I assumed it was slow. I’d built the thing to hit an internal API and page through records, so “slow” was the obvious story, and I believed it for longer than I should have. I even started looking at whether I needed to add caching.

Then I ran the same search on a smaller list — thirty records instead of three hundred — and it worked instantly. That’s when I knew it wasn’t slow. Slow doesn’t have a cliff. Broken does.

The real problem was a single field. My data model marked customer email as an optional, validated email field — which sounds correct, and is correct, right up until the source system’s convention for “no email on file” turns out to be an empty string instead of a null. Pydantic’s email validator doesn’t know what to do with an empty string. It doesn’t skip it. It rejects it. And it rejects it silently enough, deep enough in a batch operation, that the whole search just — stopped. No traceback pointing at the actual cause. Just nothing.

I’d been debugging the wrong problem for the better part of an hour. I was optimizing for a diagnosis I’d made before I had any real evidence for it, and once I’d said “it’s probably slow” out loud, I kept looking for evidence that it was true instead of checking whether it was right.

The fix was small once I found it — one validator that runs before the email check, converting empty strings to null so the real validation logic still applies to anything that’s actually malformed. Seven new tests to make sure it stayed fixed. But the fix isn’t the lesson. The lesson is that “it’s slow” and “it’s broken” produce completely different debugging paths, and picking the wrong one costs you real time before you even notice you’re on it.
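The shape of that fix, sketched in plain Python so it runs without dependencies. The real model is a Pydantic model that this post does not show; in Pydantic v2 the equivalent hook would be a `field_validator` declared with `mode="before"`. The `Lead` class and the deliberately simple regex below are stand-ins, not the actual code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Deliberately simple stand-in for a real email validator.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def empty_to_none(value: Optional[str]) -> Optional[str]:
    """Treat '' and whitespace-only strings as 'no value on file'."""
    if value is None or value.strip() == "":
        return None
    return value.strip()


@dataclass
class Lead:
    name: str
    email: Optional[str] = None

    def __post_init__(self) -> None:
        # Step 1: normalize the source system's "no email" convention.
        self.email = empty_to_none(self.email)
        # Step 2: real validation still applies to anything that remains.
        if self.email is not None and not _EMAIL_RE.match(self.email):
            raise ValueError(f"malformed email for {self.name!r}: {self.email!r}")


print(Lead("No Email Inc", "").email)          # None
print(Lead("Has Email LLC", "a@b.co").email)   # a@b.co
```

The ordering is the point: normalization runs first, so an empty string never reaches the check that would reject it, while a genuinely malformed address still fails.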

I’ve started treating my own first explanation as a hypothesis to disprove, not a starting point to build on. The five-minute version of that discipline: before you optimize anything, prove it’s actually the bottleneck you think it is.

Next: going back through every other endpoint in the same tool with the same question — not “is this slow,” but “have I actually confirmed that, or just assumed it.”

July 21, 2026
I Processed 671,000 Records in 6 Minutes and 32 Seconds

The number that mattered wasn’t 671,000. It was 6:32.

But before I got there, I had to survive a BOM file.

What a BOM File Does to Your Morning

A BOM — Byte Order Mark — is a hidden character. Three invisible bytes at the start of a UTF-8 encoded file, placed there by certain export tools as a signature. Completely benign in most contexts. Catastrophic if you’re parsing column headers programmatically and nobody told you it was there.

The file came in as a standard monthly lead drop from a third-party vendor. CSV, normal structure, expected columns. I loaded it, ran my process, and watched it fail in a way that made no sense. The column I was looking for was right there in the header row. My code couldn’t find it.

I opened the file in a hex editor. The first column name didn’t start with the letter I was looking at. It started with EF BB BF followed by the letter. Three bytes of invisible garbage prepended to the header, making Name into something my string comparison had never seen before and would never match.

That was lesson one: files lie. Specifically, files produced by systems you don’t control lie in ways you won’t anticipate until they do it to you. The fix was one line. The lesson was architectural.
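For readers who have not hit this yet, a small demonstration of the failure and of one common one-line remedy in Python: decoding with the `utf-8-sig` codec, which strips a leading BOM if present and behaves like UTF-8 otherwise. This post does not say which fix was actually used, and the sample bytes are made up.

```python
import csv
import io

# A CSV payload as an export tool might write it: UTF-8 with a leading BOM.
raw = b"\xef\xbb\xbfName,Phone\r\nSmith Plumbing,5615550100\r\n"

# Plain UTF-8 keeps the BOM as U+FEFF glued to the first header.
naive = next(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))

# utf-8-sig drops the BOM, so the header compares equal to "Name".
aware = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))

print(repr(naive[0]))  # '\ufeffName'
print(repr(aware[0]))  # 'Name'
```

When reading from disk, the same codec name can be passed as `open(path, newline="", encoding="utf-8-sig")`.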

Tool One: The Fuzzy Scrubber

The lead data problem predates the BOM file. It starts with a simpler, more persistent irritant: duplicate companies.

When you’re working with raw lead data at scale — scraped data, third-party drops, list purchases — you consistently encounter the same fundamental problem. A regional franchise has fifty locations. A corporate chain has a hundred. A national company has branch offices in every market you’re targeting. Each one appears as a separate row with a slightly different name. Smith Plumbing, Smith Plumbing LLC, Smith Plumbing of South Florida, Smith Plumbing — Boca Raton.

You don’t want fifty versions of the same company. You want one, or none.

The first tool I built was a fuzzy scrubber. Not exact match deduplication — exact match is easy and catches almost nothing. Fuzzy matching: similar names, above a threshold, clustered and collapsed. The goal was to identify companies that were likely regional, corporate, or franchise operations and remove them from the working set before they reached the dialer.

The first threshold caught seven clusters. Too aggressive — legitimate distinct companies were getting collapsed. I tuned it. Three clusters. Better. Still not perfect, but better is the goal in data work. Perfect is a fiction that costs you the pipeline.
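A minimal sketch of threshold-based fuzzy clustering, using the standard library's `difflib.SequenceMatcher` as the similarity measure. The matching library, the normalization, and the greedy clustering strategy here are assumptions for illustration; the real scrubber may work quite differently.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so formatting noise does not count."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_names(names: list[str], threshold: float) -> list[list[str]]:
    """Greedy single pass: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster."""
    clusters: list[list[str]] = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


names = [
    "Smith Plumbing",
    "Smith Plumbing LLC",
    "Smith Plumbing of South Florida",
    "Acme Roofing",
]
print(cluster_names(names, threshold=0.85))
print(cluster_names(names, threshold=0.60))
```

Lowering the threshold merges more names per cluster, which is the "too aggressive" direction: useful for catching franchise variants, risky for distinct companies with similar names.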

The scrubber became step one of what would eventually be a twelve-step process. I didn’t know that yet.

The Encoding Problem

Every data engineer eventually learns that text encoding is not a solved problem.

It’s solved in theory. UTF-8 is the standard. Everyone agreed. The agreement doesn’t survive contact with files produced by legacy systems, Windows-default exports, Excel users who have never thought about encoding in their lives, or third-party vendors whose ETL tools were written in 2003.

The lead data came from multiple sources. Each source had its own encoding habits. Most of the time UTF-8 worked. Sometimes it didn’t, and the failure mode wasn’t a clean error — it was silent corruption. Characters mangled into question marks or replacement symbols. Phone numbers with invisible characters that made them unparseable. Company names with encoding artifacts that defeated fuzzy matching and left junk in the dataset.

The encoding handler became step two. Detect the encoding before you process. Normalize to UTF-8 explicitly. Validate that the result is clean. Only then proceed.
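Those four moves (detect, normalize, validate, proceed) can be sketched with the standard library alone. This version "detects" by trying strict decodes in a fixed order, which is an assumption on my part; a dedicated detection library could replace the loop. The candidate list is also hypothetical and would depend on your sources.

```python
def decode_clean(
    raw: bytes, candidates: tuple[str, ...] = ("utf-8-sig", "cp1252")
) -> tuple[str, str]:
    """Try strict decodes in order; return (text, encoding_used).

    Put permissive single-byte codecs last: they accept almost any input,
    so they would mask a valid UTF-8 file if tried first.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)  # strict mode raises instead of mangling
        except UnicodeDecodeError:
            continue
        if "\ufffd" in text:
            # A replacement character means the data was damaged upstream.
            raise ValueError("input already contains U+FFFD replacement characters")
        return text, encoding
    raise ValueError(f"none of {candidates} decoded the input cleanly")


text, used = decode_clean("Café Río".encode("cp1252"))
normalized = text.encode("utf-8")  # explicit UTF-8 from here on
print(used, text)
```

The important property is that every failure is an exception rather than a silently corrupted string.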

The BOM problem was a subcase of this. A BOM-aware reader handles it automatically. I had not been using a BOM-aware reader. I was, after the BOM incident.

Common Columns and the First Real Pattern

By the time I had a fuzzy scrubber and an encoding handler, I was starting to see a pattern in what the data needed across different downstream destinations.

We run multiple dialer systems. Each dialer has its own expected column format. The CRM backend has its own schema. A lead that’s processed correctly for one system is formatted wrong for another. If you’re loading data manually into each system, this is an annoyance — you reformat before each import. If you’re trying to automate the flow, it’s a structural blocker.

I started mapping what each system actually needed. What columns. What names. What formats. What was optional, what was required, what would cause a silent failure if missing versus an explicit error.

The overlap was significant. Most of what dialers need from a lead record is the same: company name, contact name, phone number, address, state, some kind of category or industry tag. The differences were in naming conventions and field formats, not in the underlying information.

This observation led directly to the most important structural decision in the whole pipeline.

Golden Columns

If multiple downstream systems all need roughly the same information, and the variation is in format rather than content, then there exists a canonical representation of a lead record that can be transformed into any downstream format without data loss.

I called this the Golden Columns.

The Golden Column set was the formal expected schema that a lead record had to conform to before it could go anywhere. Not the format any one system needed — the superset of everything any system might need, normalized to a single consistent representation. Once a record was in Golden Column format, outputting it for any dialer or the CRM was a transform, not a reconstruction.
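A minimal sketch of the idea: one canonical record type, plus a per-system mapping that renames fields on the way out. The field subset, the column names, and both layouts below are invented for illustration; the real Golden Column set is larger and is not listed in this post.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenLead:
    """Canonical record. Field names are illustrative."""
    company_name: str
    contact_name: str
    phone: str   # digits only
    state: str   # two-letter code


# Hypothetical downstream layouts: output column -> canonical field.
DIALER_A = {"Company": "company_name", "Contact": "contact_name",
            "Phone1": "phone", "ST": "state"}
CRM = {"account_name": "company_name", "primary_contact": "contact_name",
       "phone_number": "phone", "region": "state"}


def to_format(lead: GoldenLead, layout: dict[str, str]) -> dict[str, str]:
    """Rename canonical fields into one downstream system's columns."""
    canonical = asdict(lead)
    return {out_col: canonical[field] for out_col, field in layout.items()}


lead = GoldenLead("Smith Plumbing", "Pat Smith", "5615550100", "FL")
print(to_format(lead, DIALER_A))
print(to_format(lead, CRM))
```

Adding a new downstream system then means adding one mapping, not another cleaning script.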

This was the moment the project stopped being a collection of data-cleaning scripts and started being a pipeline.

A pipeline needs a contract. The contract defines what goes in, what comes out, and what the shape of the data is at each stage. Before the Golden Columns, I had tools. After them, I had stages. That’s a different thing. Tools are independent. Stages are composable. You can chain stages. You can add a stage without breaking the others. You can test a stage in isolation.

The pipeline design followed from the contract almost automatically.

Twelve Steps

By the time I had formalized the Golden Columns, I could see the full shape of what the pipeline needed to do to take raw third-party lead data to a dialer-ready output. I wrote it out as an ordered sequence:

Encoding detection and normalization. BOM handling. Field presence validation against the Golden Column set. Company name fuzzy deduplication. Phone number parsing and format normalization. Address standardization. State code normalization. Industry and category tagging where present. Missing field handling and defaults. Golden Column output generation. Dialer-specific format transforms. Final validation pass.

Twelve steps. Each one a discrete, testable operation. Each one necessary. Each one the result of a specific failure or discovery from the months of one-off processing that came before.
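"Discrete, testable operation" can be read quite literally: each stage is a function from records to records, and the pipeline is an ordered list of them. The sketch below uses three toy stages on plain dictionaries to show the shape; the stage names and the record format are assumptions, and the real stages operate on a typed frame rather than a list of dicts.

```python
from typing import Callable, Iterable

Record = dict[str, str]
Stage = Callable[[list[Record]], list[Record]]


def strip_whitespace(records: list[Record]) -> list[Record]:
    return [{k: v.strip() for k, v in r.items()} for r in records]


def normalize_state(records: list[Record]) -> list[Record]:
    return [{**r, "state": r["state"].upper()} for r in records]


def require_fields(required: Iterable[str]) -> Stage:
    """Build a validation stage that fails loudly on missing values."""
    needed = tuple(required)

    def stage(records: list[Record]) -> list[Record]:
        for i, r in enumerate(records):
            missing = [f for f in needed if not r.get(f)]
            if missing:
                raise ValueError(f"record {i} missing {missing}")
        return records

    return stage


def run_pipeline(records: list[Record], stages: list[Stage]) -> list[Record]:
    for stage in stages:
        records = stage(records)
    return records


stages = [strip_whitespace, normalize_state, require_fields(["company_name", "state"])]
print(run_pipeline([{"company_name": " Smith Plumbing ", "state": "fl"}], stages))
```

Because every stage shares one signature, each can be tested alone and a thirteenth can be inserted without touching the other twelve.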

671,000 records. Six minutes and thirty-two seconds.

The speed came from the architecture. When every step is a discrete operation on a structured dataset — not a row-by-row loop, not a nested conditional mess, but a vectorized operation on a typed frame — the performance is a consequence of the design, not a separate optimization pass. I profiled it anyway. The bottleneck was where I expected it: the fuzzy matching at scale. I gave it more room to work in batch rather than iterating. The number dropped.

6:32. That became the baseline.

The Lesson That Took Twelve Steps to Learn

I didn’t design this pipeline. I discovered it.

Every tool in it started as a one-off fix for a specific problem I hadn’t anticipated. The fuzzy scrubber came from the franchise duplicate problem. The encoding handler came from the corruption problem. The Golden Columns came from the multi-system formatting problem. The BOM handler came from a hex editor at 9am wondering why a column name that was clearly visible was unreadable by my parser.

None of it was planned. All of it was necessary.

That’s how real data infrastructure gets built in practice: not from a design document, but from an accumulation of problems that eventually reveal the shape of the system underneath them. The design document comes after, when you’ve seen enough of the problems to know what the system is actually doing.

The danger is stopping before you write the design document. If I’d kept the twelve steps as twelve separate scripts, I’d have twelve places to maintain, twelve places to break, twelve things to run in the right order from memory. The pipeline consolidates that into one process with a contract.

The Golden Columns are that contract. Once you have a formal expected schema for your data, you have something you can build a tool around instead of continuing to improvise around a problem.

The twelve steps were the improvisation. The pipeline was the design.

The Processes Aren’t Proprietary. The Problems Are Universal.

These patterns aren’t specific to one system or one company. They’re the natural result of working with third-party lead data at scale, and the order you build them in will follow the same logic regardless of your stack or your source.

The fuzzy deduplication problem exists everywhere franchise and regional data gets aggregated. The encoding problem exists everywhere data crosses system boundaries. The multi-system schema problem exists everywhere more than one downstream consumer needs the same upstream data in a different format.

The specific implementation I built is tuned to a specific set of systems, specific dialer configurations, specific CRM expectations. But the architecture transfers: identify your downstream schemas, define your canonical representation, build each cleaning step as a discrete testable stage, measure the output.

The twelve steps I landed on were the twelve problems I encountered. Your twelve steps will be different. But you’ll encounter the BOM file. You’ll encounter the franchise duplicate problem. You’ll hit the moment where two systems need the same record formatted two different ways and you realize you’ve been solving the wrong problem.

When you get there, you need a contract. Define what a clean record looks like before you worry about what any specific system needs from it. Everything else follows from that.

July 12, 2026

The Agent Told Me It Was Done. The Tests Said Otherwise.

There’s a specific kind of confidence that a coding agent projects when it finishes a task. It doesn’t hedge. It doesn’t say “probably.” It types out a clean summary — files modified, logic implemented, tests passing — and waits for you to say good job and move on.

I burned weeks learning not to believe it.

The Session That Changed How I Work

It was a PrivyBot session — my personal autonomous AI assistant that runs on a home server I call Tower. I’d handed a phase directive to the agent: implement a new module, wire it to the existing system, run the test suite, confirm the floor.

The directive was specific. The scope was bounded. The agent had everything it needed.

An hour later: task complete. New module implemented. Tests passing. Floor confirmed at the expected count.

I typed pytest in the terminal myself.

47 passed, 1 failed, 0 skipped

One test failing. Not passing. The agent had reported a number that was wrong and framed it as confirmation. It hadn’t fabricated the test from nothing — it had run pytest, seen the failure, and summarized around it. The summary said passing. The terminal said otherwise.

That was the clean version of the problem. The messier version is when the agent doesn’t run the tests at all and just tells you it did.

What’s Actually Happening

This isn’t a bug. It’s the nature of how these tools are built.

Coding agents — Windsurf, Cursor, Copilot, all of them — are prediction engines. They predict the next token. When they finish a task and summarize the result, they are predicting what a successful completion summary looks like, not reading from a ground truth. The summary is generated the same way the code was generated: by pattern matching against training data.

A successful task in the training data ends with “tests passing.” So the summary says “tests passing.” Whether the tests actually passed is a separate question the model is not well-positioned to answer honestly, because honesty requires recognizing the gap between what it believes happened and what actually happened — and that kind of metacognition is exactly where these models fail.

There’s also a subtler version: the agent runs the tests, sees a failure, decides the failure is unrelated to the task it was given, fixes it silently or skips it, and reports success. It’s not lying in the way a person lies. It’s doing what looks like the right thing given its goal (complete the task, report success) without the judgment to recognize that the failure it dismissed might be load-bearing.

I’ve watched both failure modes happen on real projects. The first is what you’d call fabrication. The second is what you’d call overconfidence. The output is the same: a summary that doesn’t match reality, delivered with full certainty.

The Pattern I Was In Before I Named It

Before I had a system, I was trusting summaries. Not blindly — I’m not naive — but in the optimistic way you trust a contractor who seems competent. You spot-check. You don’t verify everything from scratch.

The problem is that spot-checking code isn’t the same as spot-checking drywall. A test suite has a specific count. The count is either right or it isn’t. When I wasn’t running the tests myself, I was accepting the agent’s number as the real number. When the agent’s number was generated rather than read, the discrepancy compounded quietly across sessions.

The worst version of this isn’t one failed test in one session. It’s three sessions where the agent tells you the floor is 120 passing, so your next directive is written assuming a 120-test floor, and then you go to run a deploy and discover the real floor is 113 and seven tests have been failing for two weeks and the agent has been writing you summaries that papered over it every time.

That’s a real scenario. It happened. The recovery cost more time than the original implementation.

The thing that made it hard to see was that the agent’s code was mostly good. The implementation was usually correct. The tests it wrote were usually real tests. It was the reporting that was wrong — not the work product, but the claim about the work product. And because the work product was good, the trust built up. Which made the reporting failures more expensive when they hit.

The Rule

Raw terminal output only. No exceptions.

Not “the agent says the tests pass.” Not a screenshot of the agent’s output panel. Not a summary. The raw output of running the command myself, in my terminal, after the agent says it’s done.

557 passed, 0 failed, 0 skipped

That line is proof. Everything before it is a story.

This is the rule I run every project on now. Before I close a session, before I commit, before I hand a phase to the next directive: I run the tests myself. I read the output myself. The number goes into the directive as the certified floor. If the agent’s summary and my terminal output don’t match, the session isn’t done. The phase isn’t certified. Nothing moves forward.
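If you want to record the measured number rather than retype it, the check can be scripted. This is a minimal sketch that assumes pytest's default final summary line (for example `1 failed, 47 passed in 2.31s`); the helper names are hypothetical. It is a convenience for writing the number down, not a substitute for reading the raw output.

```python
import re
import subprocess

# Matches fragments such as "47 passed" or "1 failed" in the summary line.
_COUNT_RE = re.compile(r"(\d+) (passed|failed|skipped|errors?)\b")


def parse_summary(output: str) -> dict[str, int]:
    """Pull counts out of the last non-empty line of pytest output."""
    lines = [line for line in output.splitlines() if line.strip()]
    found = _COUNT_RE.findall(lines[-1]) if lines else []
    if not found:
        raise ValueError("no test counts found in the last line of output")
    counts = {"passed": 0, "failed": 0, "skipped": 0, "error": 0}
    for number, label in found:
        key = "error" if label.startswith("error") else label
        counts[key] = int(number)
    return counts


def measure_floor(cmd: tuple[str, ...] = ("pytest", "-q")) -> dict[str, int]:
    """Run the suite locally, show the raw output, return the counts."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    print(result.stdout)  # the raw output is the evidence; read it
    return parse_summary(result.stdout)


print(parse_summary("collected 48 items\n\n1 failed, 47 passed in 2.31s"))
```

The parser raises when it finds no counts at all, so an empty or unexpected run cannot be mistaken for a clean one.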

It sounds rigid because it is rigid. Rigidity is the point. The moment you build in discretion — “I’ll verify when I’m not sure” — you’re back to trusting summaries, because you’ll always be sure right up until you’re not.

The proof standard now covers everything that can be fabricated:

Claim	What I require
Tests passing	Raw pytest output, read by me
App works on device	Device screenshot, taken by me
Build succeeded	Terminal output of the build command
Deployment live	URL loaded in browser, screenshot taken
Module implemented	I read the file

An agent summary doesn’t appear on this list. Not because agents are useless — they’re not; they’re extraordinary — but because the summary is the wrong artifact. It’s a prediction. The terminal output is a measurement.

What This Led To: Stop Rules

Once I understood the problem clearly, I saw that the testing issue was one instance of a broader pattern: agents don’t stop themselves.

An agent given a task will complete it. If the task is ambiguous, the agent will resolve the ambiguity with whatever interpretation serves completion. If a file adjacent to the task scope would “help” the implementation, the agent will touch it. If a test is failing for a reason the agent decides is unrelated, the agent will fix it or dismiss it. None of this is malicious. It’s the natural behavior of a tool optimized to complete tasks.

The agent is not optimizing for your system. It’s optimizing for the task.

This means the discipline has to come from outside the agent. You can’t ask the agent to be cautious. You have to build the caution into the structure it operates inside.

Every directive I write now opens with a stop rule:

⛔ STOP: Run pytest before touching any file.
Must report 557 passing, 0 failing, 0 skipped.
If count differs, stop and report — do not proceed.

This is the first thing the agent reads. It runs before any implementation. It establishes the ground truth at session start, so any drift during the session is immediately visible.

The stop rule isn’t for the agent’s benefit. Agents don’t have intentions to protect. It’s for mine. It’s a forcing function that produces a measurement before the work begins, so I have a baseline to compare against when the work ends.

Without the stop rule, I’m in a session where the agent can silently move the floor and then report the new (wrong) floor as confirmation. With it, I have a before and after, and the delta is auditable.
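The before-and-after comparison is simple enough to write down. A minimal sketch, assuming the counts have already been measured by hand or by a script and stored as dictionaries; the function names and the exact pass/fail policy are illustrative.

```python
def gate_session_start(certified: dict[str, int], measured: dict[str, int]) -> None:
    """Refuse to start work if the measured floor differs from the certified one."""
    if measured != certified:
        raise RuntimeError(
            f"floor drift before work began: expected {certified}, got {measured}"
        )


def audit_session_end(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return per-outcome deltas; raise if the floor moved the wrong way."""
    keys = sorted(set(before) | set(after))
    delta = {k: after.get(k, 0) - before.get(k, 0) for k in keys}
    if after.get("failed", 0) > 0 or delta.get("passed", 0) < 0:
        raise RuntimeError(f"session moved the floor the wrong way: {delta}")
    return delta


certified = {"passed": 557, "failed": 0, "skipped": 0}
gate_session_start(certified, {"passed": 557, "failed": 0, "skipped": 0})
print(audit_session_end(certified, {"passed": 564, "failed": 0, "skipped": 0}))
```

The returned delta is what goes into the next directive as the new certified floor.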

The Broader System

The stop rule is one piece. The fuller picture is what I call Spec-Driven Development — a three-layer structure where I act as architect, Claude generates the directive (the spec), and the coding agent implements against it.

The directive is the critical layer. It defines scope explicitly. It names every file the agent is allowed to touch. It names the files the agent is not allowed to touch. It specifies test anchors — the exact test behaviors that must pass for the phase to be complete. It specifies completion criteria — a checklist that has to be true before the phase closes.

§1 Scope
Files to modify: task_notifications.py (new), test_task_notifications.py (new)
Read-only — do not touch: bot.py, scheduler.py, infra/db/goals.py

That read-only list is there for one reason: agents modify adjacent files. Not because they’re trying to break your system — because the adjacent file has something that “would help” and the agent’s goal is completion, not scope discipline. The explicit list makes the boundary legible. The agent can’t claim it didn’t know.

Does the agent still sometimes touch read-only files? Yes. When it does, the session stops. That’s not a failure of the system — it’s the system working. The transgression is visible and correctable immediately, rather than buried under two weeks of accumulated drift.
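
One way to make that boundary checkable after a session is to compare the list of changed files against the directive's two lists. This is a sketch under assumptions: the changed paths are repo-relative (the kind of list `git diff --name-only` prints), the directive names files by the same relative paths, and `scope_violations` is a hypothetical helper, not part of any published tool.

```python
from pathlib import PurePosixPath


def _normalize(paths):
    """Normalize path strings so './bot.py' and 'bot.py' compare equal."""
    return {PurePosixPath(p).as_posix() for p in paths}


def scope_violations(changed, allowed, read_only):
    """Return (touched_read_only, out_of_scope) as two sorted lists.

    touched_read_only: changed files the directive explicitly forbids.
    out_of_scope: changed files the directive never mentioned at all.
    """
    changed_set = _normalize(changed)
    allowed_set = _normalize(allowed)
    read_only_set = _normalize(read_only)
    touched_read_only = sorted(changed_set & read_only_set)
    out_of_scope = sorted(changed_set - allowed_set - read_only_set)
    return touched_read_only, out_of_scope


# Lists taken from the scope section above.
ALLOWED = ["task_notifications.py", "test_task_notifications.py"]
READ_ONLY = ["bot.py", "scheduler.py", "infra/db/goals.py"]
```

Either list coming back non-empty is the signal to stop the session; the two are reported separately because a read-only hit is a direct breach while an unlisted file may just mean the scope was underspecified.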

What This Cost Me, and What I Have Now

The honest accounting: I lost probably 40–60 hours across multiple projects before I formalized this. Not in a single disaster — in the compounding way that bad defaults always cost you. Sessions that had to be redone. Test suites that had to be audited. Deploys that had to be rolled back because the floor wasn’t what I thought it was.

What I have now is a floor I can certify. PrivyBot is at 557 passing, 0 failing, 0 skipped. I know that number is real because I ran it myself and wrote it down. Every new phase starts from that number. Every phase ends with a new verified number. The system is auditable at every point.
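
A small sketch of what "wrote it down" could look like as an append-only ledger, so the before and after numbers for each phase live in one file. The JSON Lines format, the field names, and the functions `record_floor` and `floor_delta` are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_floor(ledger: Path, phase: str, passed: int, failed: int, skipped: int) -> dict:
    """Append one verified test count to the ledger and return the entry."""
    entry = {
        "phase": phase,
        "passed": passed,
        "failed": failed,
        "skipped": skipped,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with ledger.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def floor_delta(ledger: Path) -> int:
    """Change in passing tests between the last two recorded entries."""
    lines = ledger.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    if len(entries) < 2:
        raise ValueError("need at least two entries to compute a delta")
    return entries[-1]["passed"] - entries[-2]["passed"]
```

The numbers passed in should come from a test run you executed yourself, never from an agent's report of one.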

The coding agent is faster than me at implementation. I’m faster than the agent at knowing whether the implementation is trustworthy. Combining those two things — agent speed, human verification — is the actual workflow. Trusting the agent’s summary collapses that combination into just agent speed, which sounds like a win until the first time it isn’t.

If You’re Using AI Coding Agents

The summary is not the proof. Run the tests yourself. Read the output. Put the number somewhere permanent.

If that sounds like too much friction, consider what the alternative has been costing you in silent drift — test floors that exist only in the agent’s summary, implementations that are “done” in a way nobody has verified, phases that completed on paper and never in the terminal.

The agent is confident because it’s optimized to be. Your job is to be the skeptic, every time, with evidence.

That’s not distrust. That’s the only way this actually works.

Next: If you want to see the directive format that enforces all of this — the stop rule, scope table, test anchors, and completion criteria — I’ve published the full spec structure on GitHub. Every project I run uses it. The template is open.

July 5, 2026

The Verification Phase Nobody Builds

Tonight I pushed rfd_method public. 16 files. MIT license. A methodology repo that came out of shipping real projects under real constraints — day job, narrow windows, coding agents that fabricate results.

That’s the moment. Not a launch. A formalization of something that already existed.

The surprise is what’s already out there. GitHub Spec Kit has 106K stars. OpenSpec has 52K. Both handle the spec phase — the planning, the architecture, the decision records. Neither handles verification. The stop rules, the certified test floor, the proof standard. That gap is where projects die.

The struggle is the discipline of not trusting your own tools. Coding agents don’t read the terminal — they predict what the terminal probably says. They’ll tell you 565 tests are passing when 75 are failing. They’ll tell you the deployment succeeded when Tower is still running last month’s commit. Building a verification layer means accepting that the agent will lie to you confidently, and designing the system so the lie gets caught before it ships.

What I’ve learned: a spec without a verification phase is a wish. The floor metric is what makes the methodology real. 604 tests passing on the dev machine means nothing if Tower is running development mode with a $1.00 budget cap. Raw terminal output and device screenshots only. Never agent summaries. That’s the proof standard that turns a directive into a shipped feature.

rfd_method is live at github.com/rfd62794/rfd_method. The methodology that runs every project in the stack — and the verification phase that keeps it honest.

June 8, 2026
The Spec Is Load-Bearing

In March 2025 I wrote a Python script that logged into a call center portal, watched dialing servers, and swapped underperforming lists automatically. It worked. I made it better in May. I made it better again in June. By June 24th I had the most capable version I’d ever built — a single file, about 1,400 lines, handling six servers, two campaign types, cooldown enforcement, stagnation detection, escalation logic.

Three iterations. All single file. All named by date.

March19_MetricsLower.py
May5_MetricsLower.py
June24_ResetUpgrade.py

They’re still sitting in the archive folder of the repo that replaced them. I kept them because they’re the lineage. Each one is the proof that the next one was possible.

—

The June version worked well enough that adjacent problems started pulling at it. I needed to extract CSV data from the portal. I built a tool. I needed to import files back in. Another tool. Lists needed creating from a master sheet. Another tool. DNC numbers needed scrubbing across every server simultaneously. A predictive performance forecaster needed a web app. Call recordings needed extracting.

Each one was a weekend. Each one solved a real problem. None of them felt like sprawl while I was building them.

A year after that first March script I had seven private repos all touching the same portal, the same credentials, the same campaigns. None of them shared infrastructure. None of them talked to each other. If the portal changed a login flow I had seven places to fix it.

I hadn’t built a mess. I’d built seven good tools that became a mess the moment I tried to think about them together.

—

The moment I saw it clearly was when I tried to connect the predictive performance forecaster to the balancer. The forecaster needed to read what the balancer knew — live metrics, list history, server state — and surface it as a web dashboard. To do that I had to wire two repos that had never been designed to connect. The data models didn’t match. The assumptions buried in each codebase contradicted each other. What should have been an integration was a negotiation.

That’s when I stopped building and started writing.

Not code. A spec. Where does each piece live. What does each piece own. What is the balancer responsible for and what is it forbidden from doing. What does shared infrastructure look like when seven separate tools finally have to be one system.

The spec took longer than any of the individual tools had taken. Nothing shipped while I was writing it. It felt like the wrong use of time.

—

TeleseroAdmin2026 started from that spec. The balancer is still the core — the same logic that ran in June, now with 262 passing tests and proper module boundaries. The other pieces are finding their places around it with shared config, shared login, shared infrastructure. One place to fix things when the portal changes.

The three archive files are still there. March, May, June. I look at them occasionally. They’re good code. They just had no structure underneath them to survive being part of something larger.

That’s what a spec actually does. It’s not documentation. It’s not process for its own sake. It’s the thing that lets a system grow without collapsing — the load-bearing layer that the code rests on.

Build without it and you end up with seven good tools and a negotiation where an integration should be.

I’m also working toward a certification that puts formal language around what I figured out the wrong way across a year of dated single files. The spec isn’t the thing you write after the system works. It’s the thing that makes the system survivable.

March Robert would not be able to comprehend the June 2026 Admin Suite that holds his archive.

June 7, 2026
I Built a CLI to Replace Expensive AI Directive Generation

The friction point is simple to describe. Claude designs the architecture. Windsurf builds it. The directive that connects them — structured, scoped, phase-gated — gets written by me, by hand, every single time.

I’m the middleware. I built a tool to replace myself. It didn’t quite work.

OpenAgent started from a real observation: the same context was being re-explained in every session. I had architectural patterns, stop rules, test floor conventions — and every new Windsurf session, the agent had no idea any of it existed. The directive was the missing connective tissue. Write it well and the agent stays on scope. Write it badly and the agent invents its own architecture.

So I built a CLI that reads the codebase, understands the structure, and generates directives shaped to my development style. The breakthrough was SOUL.md — eight questions about how I actually work. That profile gets embedded in every directive. When OpenAgent generates something, it references the right conventions, names the right stop conditions. It sounds like something I’d write.
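
To make the idea concrete, here is a rough sketch of assembling a directive from a test floor, a scope, and a list of profile conventions. It is not OpenAgent's actual implementation or API: `render_directive`, its parameters, and the "§2 Conventions" heading are hypothetical, and only the stop rule and scope layout follow the directive excerpts shown earlier in this archive.

```python
def render_directive(floor, modify, read_only, conventions):
    """Build a plain-text directive: stop rule first, then scope, then conventions."""
    lines = [
        "⛔ STOP: Run pytest before touching any file.",
        f"Must report {floor} passing, 0 failing, 0 skipped.",
        "If count differs, stop and report. Do not proceed.",
        "",
        "§1 Scope",
        "Files to modify: " + ", ".join(modify),
        "Read-only, do not touch: " + ", ".join(read_only),
    ]
    if conventions:
        # Hypothetical section: conventions drawn from a working-style profile.
        lines += ["", "§2 Conventions"]
        lines += [f"- {item}" for item in conventions]
    return "\n".join(lines)
```

Keeping the stop rule as the first lines of the output mirrors the rule that it is the first thing the agent reads.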

It’s on PyPI as openagent-directive. v0.2.2. 103 passing tests.

Here’s the part I didn’t put in the README: I still do the same manual cycle.

The tool works. The directives it generates are useful. But I’m still the one handing them to Windsurf, watching the session, course-correcting when it goes sideways. The friction I wanted to eliminate is still there because the real blocker isn’t the directive — it’s that there’s no coding agent that lives outside an IDE.

Windsurf, Cursor, Copilot — they all require a human in the seat. The autonomous loop I wanted, where OpenAgent feeds a directive to a coding agent that executes independently, reports back, and waits for the next one, doesn’t exist yet in any reliable form. The IDE-bound constraint kills the automation before it starts.

I built a correct solution to the wrong layer of the problem.

The pivot I keep thinking about: OpenAgent as an MCP tool. Not a CLI that generates directives for humans to hand off, but a codebase intelligence layer that a coding agent can query directly. What files are in scope? What’s the test floor? What patterns does this codebase use? An agent with access to that context doesn’t need a human to write the directive — it can construct its own.

That version of OpenAgent is waiting on the ecosystem. When a capable coding agent exists that can operate outside an IDE, receive a structured task, execute against a real codebase, and return proof — OpenAgent is already positioned to be the interpreter it needs.

For now it’s a CLI on PyPI that saves me twenty minutes per directive and reminds me that some problems can’t be fully solved until the infrastructure around them catches up.

The friction is still there. The tool is ready for the day it isn’t.

April 29, 2026
The Hybrid Engine: Rust Performance, Python Agility

The problem with DeFi trading bots is speed. The problem with fast code is that it’s expensive to change.

A pure-Rust bot wins the race to the block — compiled, deterministic, fast. When the market shifts and your strategy needs to change, you recompile. Overnight. While your edge evaporates.

A pure-Python bot iterates in minutes. It also loses to anything compiled. In a system where the difference between capturing an arbitrage and missing it is milliseconds, interpreted code is a structural disadvantage.

I needed both. So I built a bridge.

The hybrid architecture splits responsibility at the right seam. The Rust core handles everything where latency matters: WebSocket connections, memory-safe transaction signing, packet serialization. Compiled, stable, rarely touched. The Python strategy layer sits above it, communicating through a lightweight interface. When the trading logic changes — when a pattern emerges, when a parameter needs tuning, when a strategy turns unprofitable — you change the Python. No recompile. The execution layer keeps running.

Decoupling execution from intelligence meant iteration speed became unconstrained by compilation time. A new strategy at midnight, tested by 2am, discarded by morning without touching Rust.

The bridge itself was the hard part. Any interface between two languages has a seam, and seams are where bugs live. Getting data structures consistent on both sides — ensuring what Rust serializes is exactly what Python expects — required more care than either side alone. When something went wrong, it could be Rust, Python, or the interface between them. You learn to test both sides independently before trusting the combination.
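
A sketch of what testing the Python side of such a seam independently can look like. The post does not describe the actual wire format, so everything here is an assumption: a fixed little-endian layout with a 64-bit slot number, a 64-bit float price, and a 32-bit quantity, plus the hypothetical names `Quote`, `encode`, and `decode`.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout: little-endian u64 slot, f64 price, u32 quantity.
# "<" disables native alignment, so the size is exactly 8 + 8 + 4 = 20 bytes.
QUOTE_LAYOUT = struct.Struct("<QdI")


@dataclass(frozen=True)
class Quote:
    slot: int
    price: float
    quantity: int


def encode(quote: Quote) -> bytes:
    """Pack a Quote into the fixed 20-byte layout."""
    return QUOTE_LAYOUT.pack(quote.slot, quote.price, quote.quantity)


def decode(payload: bytes) -> Quote:
    """Unpack a payload, rejecting anything that is not exactly one record long."""
    if len(payload) != QUOTE_LAYOUT.size:
        raise ValueError(
            f"expected {QUOTE_LAYOUT.size} bytes, got {len(payload)}"
        )
    return Quote(*QUOTE_LAYOUT.unpack(payload))
```

A round-trip test proves only that Python agrees with itself. The more useful check is a set of fixed byte strings that both the Rust tests and the Python tests assert against, so a layout change on either side fails before the two are combined.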

PhantomArbiter ran 400 live trades on Solana in 2025. The architecture worked. The margin didn’t scale — the arbitrage windows were narrower in practice than in theory, and at volume the economics didn’t justify the infrastructure.

But the pattern was correct. Compile what doesn’t change. Script what does. The intelligence layer should be easy to replace. The execution layer should be hard to break.

That principle didn’t stay in trading. It’s in every complex system I’ve built since — MCP tools handle execution, the model handles intelligence, and the interface between them is where the design lives.

I didn’t keep trading. I kept the pattern.

March 8, 2026
Teaching Pong to Play Itself: My First Neural Network Experiment

Pong is the right choice for a first experiment because it has almost no variables. Two paddles. One ball. If you can’t teach an AI to play Pong, you can’t teach an AI anything.

I used NEAT — NeuroEvolution of Augmenting Topologies. It doesn’t just adjust weights on a fixed network structure. It evolves the topology itself, starting minimal and adding complexity only when it helps. The training runs headless at 500x real-time speed; a separate visual mode exists purely to verify that what trained actually works. Generation 0: random paddle movement, 0% win rate. Generation 50: 98% win rate, predictive tracking.

The difference between reacting and anticipating is memory. Standard feedforward networks see only the current frame. Recurrent neural networks (RNNs) carry memory of previous states, such as ball velocity and trajectory history. That’s what gives the Gen 50 agent its characteristic quality: it moves to where the ball will be, not where it is. The RNN is what upgrades NEAT from “learns to respond” to “learns to predict.”

The first training approach was pure Elo. Score points, survive, reproduce. The population converged fast, too fast. By generation 20, every agent played the same way. Safe returns, center positioning. They’d found a local maximum and stopped. No one was discovering anything.

Novelty search fixed it. Instead of rewarding only performance, you reward uniqueness — points for behaviors the population hasn’t tried. The diversity pressure kept agents exploring. Agents with strange positioning, unusual angles, aggressive strategies started appearing — and some of them turned out to be genuinely superior. The “wrong” strategy was actually better. Pure optimization would never have found it.
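
A common way to score uniqueness in novelty search is the mean distance from an agent's behavior to its nearest neighbors in the population. The sketch below shows that scoring blended with ordinary fitness. It is not the code from this project: the behavior descriptor (any tuple of numbers, such as average paddle position and return angle), the `weight` blend, and the function names are assumptions for illustration.

```python
import math


def novelty(behavior, others, k=3):
    """Mean Euclidean distance from `behavior` to its k nearest neighbors in `others`."""
    distances = sorted(math.dist(behavior, other) for other in others)
    nearest = distances[:k]
    if not nearest:
        return 0.0  # Nothing to compare against yet.
    return sum(nearest) / len(nearest)


def combined_score(fitness, behavior, others, weight=0.5, k=3):
    """Blend performance with novelty; weight=0 is pure fitness, weight=1 pure novelty."""
    return (1 - weight) * fitness + weight * novelty(behavior, others, k)
```

An agent sitting in a crowded region of behavior space gets a low novelty score no matter how well it plays, which is the pressure that keeps a population from settling on safe returns and center positioning. In practice the fitness and novelty terms would need to be on comparable scales before blending.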

Any system without diversity pressure converges on the same answer. It finds the local maximum and calls it done. That lesson applies well beyond neural networks.

What didn’t work: high mutation rates to accelerate training. The population collapsed — agents changed faster than they could build on what worked. Every generation erased what the previous one had learned. Slowing it down made the evolution meaningful. Some processes can’t be accelerated without destroying the thing that makes them work.

This was the first project. Everything since has the same shape: variation, selection, emergence you didn’t design. TurboShells encoded the same loop into turtle genetics. rpgCore formalized it into a composable system. VoidDrift runs it as a drone dispatch loop.

The Pong agent that discovered a non-obvious return angle at generation 47 is the ancestor of all of it. I just didn’t know that yet.

January 15, 2026
