Armin Bagrat Stepanyan
Projects
Recurse.ml Online Evals
Context
- Recurse.ml is a GitHub app that identifies bugs in PRs and leaves comments.
- I managed two founding engineers who collaborated on this project and used it in product work.
- We lacked reliable signal for whether agent changes made comments more useful to developers.
What I built
- I built a feedback loop connecting GitHub user reactions (emoji feedback, replies, and direct agent mentions) to qualitative analysis and evaluation metrics.
- Tracked “actioned comments”: cases where a user changed a Recurse-commented line between PR and merge.
- Created internal tools to register customer feedback as evaluation data points.
- Connected our eval data with CI and W&B for regression monitoring and evaluation of new foundation model releases.
- Added the false-positive cases that tripped up the agent to our eval data, because false positives were the failure mode users hated most.
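The "actioned comments" signal above can be sketched as follows. This is a minimal illustration, not the production implementation: the `Comment` type, the `changed_lines` mapping, and both function names are hypothetical, and the sketch assumes line numbers have already been aligned between the commented revision and the merged revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    """A bug comment left on a PR (hypothetical shape)."""
    path: str
    line: int  # line the comment was attached to


def actioned_comments(comments, changed_lines):
    """Return comments whose line was modified before merge.

    changed_lines maps a file path to the set of line numbers the
    author changed between the comment and the merge (assumed input).
    """
    return [c for c in comments if c.line in changed_lines.get(c.path, set())]


def actioned_rate(comments, changed_lines):
    """Fraction of comments that were actioned; 0.0 when there are none."""
    if not comments:
        return 0.0
    return len(actioned_comments(comments, changed_lines)) / len(comments)


comments = [Comment("a.py", 10), Comment("a.py", 20), Comment("b.py", 5)]
changed = {"a.py": {10, 11}}
print(actioned_comments(comments, changed))
print(actioned_rate(comments, changed))
```

A rate like this is only a proxy: a line can change for reasons unrelated to the comment, so it is best read alongside the reaction and reply signals.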
Result
- We used these signals to identify which bug reports our customers found useful (and which they ignored), and to evaluate architecture changes against both benchmark performance and real developer feedback.
- The evals and production analysis allowed us to identify several failure modes and the improvement that addressed each:
- Outdated library information -> web search tool.
- Missing programmer's intention -> diff tools.
- Need to dynamically allocate analysis resources across tools -> restructured our architecture from an agentic analysis pipeline to a tool-calling agent.
- Direct use of feedback -> used negative feedback for personalized filtering.
- The new system reads/searches project context, fetches full diffs when needed, filters out comments outside changed lines, and only reports high-confidence production-impacting bugs.
Code-LLM Bug Localizers
Context
- We wanted to move from flagging bugs at the time of PR to directly within a coding agent's development flow.
- However, while 2+ minute agentic code analysis was acceptable during PR review, it was too slow and expensive to use within a coding agent's loop.
- Trading some accuracy for low latency would let us fill a market niche that no competitor could.
What I built
- I identified latency as the key capability bottleneck: agentic bug localization was acceptable for PR review but unusable inside a coding agent's live loop.
- I supervised a PhD student (Nikolai Rozanov) to develop a fast line-level bug classifier.
- I scoped the high-level research problem around sub-10-second inference while preserving at least 80% recall relative to our existing agentic model.
- Then, I provided production data for model research and gave feedback when research trade-offs affected downstream capabilities, such as the granularity of classifier labels.
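The research target above (sub-10-second inference with at least 80% recall relative to the agentic model) can be expressed as a simple acceptance check. This is a sketch under assumptions: the function names are hypothetical, and bugs are represented as `(path, line)` pairs, which may differ from the label granularity actually used.

```python
def relative_recall(baseline_bugs, classifier_bugs):
    """Share of the baseline's flagged lines that the classifier also flags.

    Both arguments are iterables of (path, line) pairs (assumed format).
    """
    baseline = set(baseline_bugs)
    if not baseline:
        return 1.0  # nothing to recall
    return len(baseline & set(classifier_bugs)) / len(baseline)


def meets_target(baseline_bugs, classifier_bugs, latency_s,
                 max_latency_s=10.0, min_recall=0.8):
    """True when the run is under the latency budget and recall is high enough."""
    recall = relative_recall(baseline_bugs, classifier_bugs)
    return latency_s < max_latency_s and recall >= min_recall


agentic = [("a.py", 3), ("a.py", 9), ("b.py", 1), ("b.py", 2), ("c.py", 7)]
fast = [("a.py", 3), ("a.py", 9), ("b.py", 1), ("b.py", 2), ("d.py", 4)]
print(relative_recall(agentic, fast))
print(meets_target(agentic, fast, latency_s=4.2))
```

Measuring recall against the agentic model rather than against ground truth keeps the comparison cheap, at the cost of inheriting whatever the agentic model misses.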
Result
- Nikolai Rozanov led the model research and implementation, training Qwen3-1.7B with LoRA/PEFT and a multi-channel classification head for token/segment-level bug classification.
- This model replaced an agentic inference loop with a single forward pass through a model.
WikiText Energy Benchmark
Context
- Sutro's thesis is that the ML training stack contains avoidable inefficiencies, especially around backpropagation and memory movement.
- They are using agentic research (i.e. autoresearch) to find better ML techniques that fix these inefficiencies.
- Yaroslav’s stated goal was to eventually produce a NanoGPT-like artifact: a small, clear demonstration of a novel, more energy-efficient architecture.
- I noticed that solutions to existing benchmarks (e.g. Sparse XOR) would exploit properties specific to binary logical functions, which do not generalize to language modeling.
- I suggested that a model-agnostic language modeling benchmark would move the group closer to its final goal.
- Additionally, if research agents exploited language-modeling-task-specific properties, as they did with Sparse XOR, those discoveries would be useful.
What I built
- I built a WikiText benchmark for evaluating different architectures on sub-5-minute training runs.
- I designed the benchmark around A100 energy-to-threshold: Joules required to reach 0.7 character accuracy on WikiText.
- We selected character accuracy as a model-agnostic metric to encourage autoresearch agents to benchmark language models that don't use transformers (or even backpropagation) against current state-of-the-art architectures.
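The energy-to-threshold metric above can be sketched as an integral of GPU power over time, stopped at the first evaluation that reaches the accuracy threshold. This is an illustrative sketch only: the sample format, the trapezoidal integration, and the function name are assumptions, and how power is actually sampled on the A100 is not shown.

```python
def energy_to_threshold(samples, threshold=0.7):
    """Joules spent until character accuracy first reaches the threshold.

    samples: (time_s, power_w, char_accuracy) tuples ordered by time
    (assumed format). Power is integrated with the trapezoidal rule.
    Returns None if the threshold is never reached.
    """
    energy_j = 0.0
    prev_t = prev_p = None
    for t, p, acc in samples:
        if prev_t is not None:
            # Area of the trapezoid between two consecutive power readings.
            energy_j += 0.5 * (p + prev_p) * (t - prev_t)
        if acc >= threshold:
            return energy_j
        prev_t, prev_p = t, p
    return None


run = [(0, 100, 0.10), (10, 200, 0.50), (20, 200, 0.72), (30, 200, 0.80)]
print(energy_to_threshold(run))
```

Because the metric only needs power readings and an accuracy number, it does not depend on how the model is trained, which is what lets non-transformer and non-backpropagation methods compete on the same scale.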
Result
- We've used it to run small versions of diffusion language models and study experimental architectures such as forward-forward.
- This benchmark has allowed Gabriel Nakayama, Miyu Horiuchi, and other members of the Sutro research group to use autoresearch for researching alternative LM architectures.
- Miyu Horiuchi used this benchmark to evaluate numerical-analysis-based optimization methods.
spx Backend Deployments
Context
- I noticed that when programming with coding agents, the bottleneck to quick product iteration has shifted from writing code to deploying it and delivering it to users.
- spx's core hypothesis is that sufficiently fast deployment removes the need for any local execution.
What I built
- Built a sub-5s (p99) deployment engine for Python backends.
- Per-deployment Firecracker microVMs.
- Caddy for routing traffic through a <slug>.runspx.com URL to the user's VMs.
- I removed the requirement for a Python interpreter on the client by building spx uv to manage dependencies directly on the deployment.
- I implemented user secret management and persistent volumes.
Result
- Users prefer this over Cloud Run and Vercel deployments.
- Users have asked to deploy their full-stack production apps on spx.
Bloomberg Search Migration
Context
- The Notes team was bottlenecked by calls to Bloomberg’s Solr deployment, limiting their ability to ship customer-requested features.
- Bloomberg is the largest Solr user in the world.
What I built
- Replicated the production instance in staging and ran load testing.
- Coordinated with the search infrastructure team in NYC and the Notes team in London.
- Migrated production code from BAS to HTTP for communication with the Solr instance.
Result
- This increased the maximum payload size by 100x and reduced latency by 10x between the Notes instance and the search index.
Achievements
- (2024) Raised a $3M pre-seed round. Investors include Seedcamp, Playfair, DeepMind researchers, and a Meta board member.
- Identifying and Explaining Discriminative Attributes (Stepanjans & Freitas, EMNLP-IJCNLP 2019)
- (2018) IARPA best disease prediction model. Led a team that built the best real-time MERS disease prediction model.