How Agent-Driven Development Transformed Our Research Workflow at Copilot Applied Science

Introduction

Every software engineer knows the cycle: a moment of inspiration—or frustration—leads to building a tool that eliminates tedious work, freeing up time for creative problem-solving. Then comes the ownership, the maintenance, and the quiet satisfaction of enabling peers to work smarter too. As an AI researcher on the Copilot Applied Science team, I recently took this principle to an entirely new level by automating not just manual toil but intellectual drudgery. The result? A system that my entire team now uses to accelerate their own work, and a shift in how I think about collaboration with GitHub Copilot.


In this article, I’ll walk through the journey that led to creating eval-agents—a set of autonomous coding agents that analyze performance benchmarks. Along the way, I’ll share the lessons I learned about building and sharing agents effectively, and how applying those lessons unlocked a dramatically faster development loop for both myself and my teammates.

The Challenge: Analyzing Thousands of Agent Trajectories

My day-to-day work involves evaluating coding agents against standardized benchmarks such as TerminalBench2 and SWEBench-Pro. Each benchmark consists of dozens of tasks, and for every task the agent produces a trajectory: a detailed log of its thoughts and actions. These trajectories are stored as JSON files, often hundreds of lines long. Multiply that by the number of tasks in a suite, and again by the number of benchmark runs I need to analyze daily, and you’re looking at hundreds of thousands of lines of JSON to sift through.
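To make the shape of that data concrete, here is a minimal sketch of loading one trajectory and measuring its size. The schema is an assumption for illustration only (a `task_id` field plus a `steps` list whose entries hold a `thought` and an `action`); the real trajectory layout is not described in this post.

```python
import json

# Hypothetical trajectory layout: the real schema is not described in this post.
SAMPLE_TRAJECTORY = """
{
  "task_id": "example-task",
  "steps": [
    {"thought": "Inspect the repository layout.", "action": "ls"},
    {"thought": "Run the test suite.", "action": "pytest -q"}
  ]
}
"""


def summarize_trajectory(raw_json: str) -> dict:
    """Return basic size statistics for one trajectory document."""
    trajectory = json.loads(raw_json)
    steps = trajectory.get("steps", [])
    return {
        "task_id": trajectory.get("task_id"),
        "step_count": len(steps),
        # Count the non-empty lines of the raw JSON text.
        "line_count": sum(1 for line in raw_json.splitlines() if line.strip()),
    }


summary = summarize_trajectory(SAMPLE_TRAJECTORY)
print(summary)
```

Summing `line_count` over every task in a suite, and then over every run, gives the kind of total described above.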

Doing that manually would be impossible. My go‑to solution was to leverage GitHub Copilot to surface patterns in the data. I’d ask it to summarize common failure modes, find anomalies, or identify promising strategies. Copilot would reduce the ocean of raw trajectories to a few hundred lines of insight, which I’d then investigate further. It worked well, but the process was repetitive: every new benchmark run meant repeating the same analysis loop. The engineer inside me whispered, “Automate that.”

That’s when eval-agents was born—a project designed to turn that intellectual labor into an automated, shareable pipeline.

The Solution: Building Eval-Agents

The core idea was simple: create a set of autonomous agents that could perform the same kind of pattern‑finding and analysis that I had been doing manually with Copilot. But instead of being a solo tool, it needed to be a platform that the whole team could use and extend.

I approached the design with three guiding goals:

1. Make agents easy to share and use: No steep learning curve. If a teammate wants to run an analysis, they should be able to spin up an agent with minimal configuration.
2. Make it easy to author new agents: The barrier to creating a new agent should be low. I wanted teammates to build custom agents tailored to their specific analysis needs, whether that’s comparing two benchmark runs, detecting regressions, or generating summary reports.
3. Make coding agents the primary vehicle for contributions: Instead of asking people to write documentation or fill out forms, I wanted them to contribute by writing agents. Agents themselves become the documentation, the analysis, and the reusable building blocks.

These goals align with GitHub’s core values of collaboration and openness. They also reflect skills I honed as an open‑source maintainer of the GitHub CLI. The first two goals—sharing and authoring—are about lowering friction. The third is about shifting the culture: contributions should be executable, not just readable.

Implementing the Agent Framework

Under the hood, eval-agents is built on top of GitHub Copilot’s extended capabilities. Each agent is a Python script that uses Copilot to generate analysis code, run it, and return structured insights. The framework handles task scheduling, caching of results, and versioning of agents. Crucially, agents are treated as first‑class artifacts: they can be stored in a shared repository, forked, and improved by anyone on the team.
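One way result caching and agent versioning could fit together is to key each cached result on the agent name, the agent version, and the benchmark run, so that a new agent version never reuses a stale result. The sketch below is an assumption about how this might look, not the eval-agents implementation; `ResultCache` and its methods are hypothetical names.

```python
import hashlib
import json


class ResultCache:
    """Hypothetical in-memory cache keyed by agent name, version, and run id."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    @staticmethod
    def make_key(agent_name, agent_version, run_id):
        # Serialize the three parts unambiguously, then hash them into one key.
        payload = json.dumps([agent_name, agent_version, run_id])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, agent_name, agent_version, run_id, compute):
        key = self.make_key(agent_name, agent_version, run_id)
        if key not in self._store:
            # Only run the (potentially slow) analysis on a cache miss.
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


cache = ResultCache()
first = cache.get_or_compute("common-errors", "1.0", "run-42", lambda: {"failures": 3})
second = cache.get_or_compute("common-errors", "1.0", "run-42", lambda: {"failures": 99})
third = cache.get_or_compute("common-errors", "1.1", "run-42", lambda: {"failures": 4})
```

The second call returns the cached result without running its `compute` function, while the third call recomputes because the agent version changed.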

To make authoring easy, I created a simple template and a set of helper functions. For example, an agent that identifies the most common error patterns across trajectories might be written in about 20 lines of code. The template includes hooks for defining input (which benchmark run), processing (the analysis logic), and output (a summary report). Copilot assists in writing the analysis logic itself—the agent essentially asks Copilot to generate code that analyzes the trajectories, and the results are then post‑processed.
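As a rough illustration of the three hooks, here is a minimal sketch of an agent that counts the most common error patterns. Every name in it (`Agent`, `load_fake_run`, `count_errors`, `render_report`) is hypothetical, and the trajectory data is a stand-in. The processing step is written by hand here, whereas in eval-agents that logic is generated with Copilot's help.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """Hypothetical agent template with input, processing, and output hooks."""

    name: str
    load_input: Callable[[str], list]     # benchmark run id -> trajectories
    process: Callable[[list], dict]       # trajectories -> findings
    render_output: Callable[[dict], str]  # findings -> summary report

    def run(self, run_id: str) -> str:
        return self.render_output(self.process(self.load_input(run_id)))


def load_fake_run(run_id: str) -> list:
    # Stand-in for reading the trajectory JSON files of one benchmark run.
    return [
        {"task_id": "t1", "error": "timeout"},
        {"task_id": "t2", "error": None},
        {"task_id": "t3", "error": "timeout"},
        {"task_id": "t4", "error": "patch did not apply"},
    ]


def count_errors(trajectories: list) -> dict:
    # Hand-written analysis; successful trajectories (error is None) are skipped.
    errors = Counter(t["error"] for t in trajectories if t["error"])
    return {"total": len(trajectories), "errors": errors.most_common()}


def render_report(findings: dict) -> str:
    lines = [f"Analyzed {findings['total']} trajectories"]
    lines += [f"- {error}: {count}" for error, count in findings["errors"]]
    return "\n".join(lines)


common_errors_agent = Agent("common-errors", load_fake_run, count_errors, render_report)
report = common_errors_agent.run("example-run")
print(report)
```

Keeping the three hooks as plain functions means a teammate can swap in a different `process` step without touching how runs are loaded or how reports are rendered.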

Early tests were promising. I used the new agents to automatically produce daily reports on benchmark performance, highlighting trends that previously took me hours to discover. I then shared the agents with a couple of teammates, who immediately started adapting them for their own questions. Within a week, the repository had five new agents contributed by three different people.

Impact on the Team’s Workflow

The most immediate benefit was speed. What used to require an hour of manual Copilot interaction can now be done in seconds by running an agent. But the deeper impact is cultural. The team now thinks in terms of agents: instead of asking “Can you look at this run?” they ask “Which agent should I run?” New hires can quickly onboard by reading existing agents to understand how analysis is done—and then contribute by writing their own.

Moreover, the agents themselves become a form of living documentation. Because they are executable, they are far less likely to drift out of date unnoticed than prose: an agent that no longer matches its benchmark breaks visibly the next time someone runs it. When a benchmark changes, the agent that analyzes it can be updated in one place, and everyone automatically benefits. This has dramatically reduced the overhead of maintaining analysis scripts.

I also learned important lessons about working with Copilot in an agentic context. For example, specifying the exact format for the output and including examples in the prompt significantly improves the reliability of generated code. The more structured the request, the better Copilot performs. It helps to treat it as a junior collaborator who benefits from clear instructions.
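A minimal sketch of that idea: build a prompt that states the exact output format and includes one example, then validate whatever comes back against the same format. The key names and the helper functions are hypothetical, and the reply is a hard-coded stand-in, since no real Copilot call is made here.

```python
import json

# Hypothetical output contract for one analysis question.
REQUIRED_KEYS = {"failure_mode", "count", "example_task_ids"}


def build_prompt(question: str) -> str:
    """Combine the question with an explicit format and one worked example."""
    example = {"failure_mode": "timeout", "count": 2, "example_task_ids": ["t1", "t3"]}
    return "\n".join([
        question,
        "Respond with a JSON array only. Each element must have exactly these keys:",
        ", ".join(sorted(REQUIRED_KEYS)),
        "Example element:",
        json.dumps(example),
    ])


def parse_response(text: str) -> list:
    """Validate a reply against the requested format, raising ValueError if it differs."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    for item in items:
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            raise ValueError(f"unexpected element: {item!r}")
    return items


prompt = build_prompt("Summarize the most common failure modes in this run.")
# Stand-in for a model reply; no real Copilot call is made here.
fake_reply = '[{"failure_mode": "timeout", "count": 2, "example_task_ids": ["t1", "t3"]}]'
parsed = parse_response(fake_reply)
print(parsed[0]["failure_mode"])
```

Validating the reply against the same contract the prompt describes turns a vague "the output looked off" into an explicit error that an agent can retry or report.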

Conclusion: A New Role for an AI Researcher

Looking back, I realize that by automating away my intellectual toil, I didn’t make myself redundant; I changed my job. Instead of spending hours analyzing trajectories, I now spend time designing better agents, sharing them with the team, and thinking about higher‑level problems. My role has shifted from consumer of analysis to architect of automated reasoning.

If you’re a software engineer or researcher facing a repetitive intellectual task, consider whether you can build an agent to do it for you—not just a one‑off script, but a reusable, shareable agent that your whole team can benefit from. With tools like GitHub Copilot and a mindset of collaboration, the boundaries of what can be automated keep expanding. And who knows? You might just automate yourself into a completely different, and far more interesting, job.
