Armin Bagrat Stepanyan

Projects

Recurse.ml Online Evals

Context

  1. Recurse.ml is a GitHub app that identifies bugs in PRs and leaves comments.
  2. I managed two founding engineers who collaborated on this project and used it in product work.
  3. We lacked reliable signal for whether agent changes made comments more useful to developers.

What I built

  1. I built a feedback loop connecting GitHub user reactions (emoji feedback, replies, and direct agent mentions) to qualitative analysis and evaluation metrics.
  2. Tracked “actioned comments”: cases where a user changed a Recurse-commented line between PR and merge.
  3. Created internal tools to register customer feedback as evaluation data points.
  4. Connected our eval data with CI and W&B for regression monitoring and evaluation of new foundation model releases.
  5. Added false-positive cases that tripped up the agent to the eval set, because false positives were the failure mode users hated most.
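
The "actioned comments" signal in item 2 can be approximated by diffing each commented file as it stood when the comment was posted against its merged version. The sketch below is a minimal illustration, not the production implementation: the `Comment` record, the function names, and the choice of Python's `difflib` are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Comment:
    """Hypothetical record for one bot comment."""
    path: str
    line: int  # 1-indexed line in the file as it stood when the comment was posted


def changed_lines(before: list[str], after: list[str]) -> set[int]:
    """Return 1-indexed lines of `before` that were edited or deleted in `after`."""
    changed: set[int] = set()
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1 + 1, i2 + 1))
    return changed


def actioned_rate(
    comments: list[Comment],
    before_files: dict[str, list[str]],
    after_files: dict[str, list[str]],
) -> float:
    """Fraction of comments whose commented line changed before merge."""
    if not comments:
        return 0.0
    actioned = 0
    for comment in comments:
        before = before_files.get(comment.path, [])
        after = after_files.get(comment.path, [])
        if comment.line in changed_lines(before, after):
            actioned += 1
    return actioned / len(comments)
```

A changed line is only a proxy for usefulness: a line can change for unrelated reasons, so this signal is best read alongside the reaction and reply data described above.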

Result

  1. We used these signals to identify which bug reports our customers found useful (and which they ignored), and to evaluate architecture changes against both benchmark performance and real developer feedback.
  2. The evals and production analysis allowed us to identify several failure modes and improvements:
    1. The new system reads/searches project context, fetches full diffs when needed, filters comments outside changed lines, and only reports high-confidence production-impacting bugs.
    2. Outdated library information -> web search tool.
    3. Missing programmer's intention -> diff tools.
    4. Need to dynamically allocate analysis resources across tools -> restructured our architecture from an agentic analysis pipeline to a tool-calling agent.
    5. Direct use of feedback -> used negative feedback for personalized filtering.
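
One of the steps in item 1, filtering comments that fall outside changed lines, can be sketched as follows. This is an illustrative assumption about how such a filter might work for a single-file unified diff, not the system's actual code; `added_lines` and `keep_in_diff` are hypothetical names.

```python
import re

# Hunk header of a unified diff, e.g. "@@ -1,3 +1,4 @@"; captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(unified_diff: str) -> set[int]:
    """Return new-file line numbers of lines added by a single-file unified diff."""
    added: set[int] = set()
    new_line = None  # None until the first hunk header is seen
    for line in unified_diff.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_line = int(header.group(1))
        elif new_line is None:
            continue  # file headers ("---", "+++") before the first hunk
        elif line.startswith("+"):
            added.add(new_line)
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers do not advance
        else:
            new_line += 1  # context line
    return added


def keep_in_diff(comment_lines: list[int], unified_diff: str) -> list[int]:
    """Keep only comment line numbers that point at lines added in the diff."""
    allowed = added_lines(unified_diff)
    return [n for n in comment_lines if n in allowed]
```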

Code-LLM Bug Localizers

Context

  1. We wanted to move from flagging bugs at the time of PR to directly within a coding agent's development flow.
  2. However, while 2+ minute agentic code analysis was acceptable during PR review, it was too slow and expensive to use within a coding agent's loop.
  3. Trading some accuracy for low latency would let us fill a market niche that no competitor could.

What I built

  1. I identified latency as the key capability bottleneck: agentic bug localization was acceptable for PR review but unusable inside a coding agent's live loop.
  2. I supervised a PhD student (Nikolai Rozanov) to develop a fast line-level bug classifier.
  3. I scoped the high-level research problem around sub-10-second inference while preserving at least 80% recall relative to our existing agentic model.
  4. Then, I provided production data for model research and gave feedback when research trade-offs affected downstream capabilities, such as the granularity of classifier labels.
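
The research target in item 3 can be written as a simple acceptance check. The sketch below assumes bugs are identified by string IDs and that "recall relative to the agentic model" means the share of the agentic model's findings that the fast classifier also finds; both are assumptions for illustration, and the function names are hypothetical.

```python
def relative_recall(agent_found: set[str], classifier_found: set[str]) -> float:
    """Share of bugs found by the agentic model that the classifier also finds."""
    if not agent_found:
        return 1.0  # nothing to recall, so the target is trivially met
    return len(agent_found & classifier_found) / len(agent_found)


def meets_target(
    agent_found: set[str],
    classifier_found: set[str],
    latency_s: float,
    min_recall: float = 0.8,
    max_latency_s: float = 10.0,
) -> bool:
    """True when inference is under the latency budget and recall holds up."""
    recall = relative_recall(agent_found, classifier_found)
    return latency_s < max_latency_s and recall >= min_recall
```

Extra findings by the classifier do not affect this measure; they would show up in a separate precision or false-positive check.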

Result

  1. Nikolai Rozanov led the model research and implementation, training Qwen3-1.7B with LoRA/PEFT and a multi-channel classification head for token/segment-level bug classification.
  2. This model replaced an agentic inference loop with a single forward pass through a model.

WikiText Energy Benchmark

Context

  1. Sutro's thesis is that the ML training stack contains avoidable inefficiencies, especially around backpropagation and memory movement.
  2. They are using agentic research (i.e. autoresearch) to find better ML techniques that fix these inefficiencies.
  3. Yaroslav’s stated goal was to eventually produce a NanoGPT-like artifact: a small, clear demonstration of a novel, more energy-efficient architecture.
  4. I noticed that solutions to existing benchmarks (e.g. Sparse XOR) would exploit properties specific to binary logical functions, which do not generalize to language modeling.
  5. I suggested that a model-agnostic language modeling benchmark would move the group closer to its final goal.
  6. Additionally, if research agents exploited language-modeling-task-specific properties, as they did with Sparse XOR, those discoveries would be useful.

What I built

  1. I built a WikiText benchmark for evaluating different architectures on sub-5-minute training runs.
  2. I designed the benchmark around A100 energy-to-threshold: Joules required to reach 0.7 character accuracy on WikiText.
  3. We selected character accuracy as a model-agnostic metric so that autoresearch agents could benchmark language models that don't use transformers (or even backpropagation) against current state-of-the-art architectures.
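
The energy-to-threshold metric in item 2 can be sketched as integrating GPU power over time until character accuracy first reaches the threshold. The code below is an assumption-laden illustration rather than the benchmark's actual implementation: it assumes timestamped power samples in watts (for example from periodic GPU power readings) and timestamped accuracy checkpoints, and uses trapezoidal integration.

```python
from typing import Optional


def energy_to_threshold(
    power_samples: list[tuple[float, float]],
    accuracy_checkpoints: list[tuple[float, float]],
    threshold: float = 0.7,
) -> Optional[float]:
    """Joules consumed until accuracy first reaches `threshold`.

    power_samples: (time in seconds, power in watts), sorted by time.
    accuracy_checkpoints: (time in seconds, character accuracy), sorted by time.
    Returns None if the threshold is never reached.
    """
    t_hit = next((t for t, acc in accuracy_checkpoints if acc >= threshold), None)
    if t_hit is None:
        return None
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(power_samples, power_samples[1:]):
        if t0 >= t_hit:
            break
        if t1 > t_hit:
            # Clip the final interval at t_hit, interpolating power linearly.
            p1 = p0 + (p1 - p0) * (t_hit - t0) / (t1 - t0)
            t1 = t_hit
        joules += 0.5 * (p0 + p1) * (t1 - t0)  # trapezoid area in watt-seconds
    return joules
```

Because the metric only needs a power trace and an accuracy trace, it places no constraints on the model or the training method.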

Result

  1. We've used it to run small versions of diffusion language models and study experimental architectures such as forward-forward.
  2. This benchmark has allowed Gabriel Nakayama, Miyu Horiuchi, and other members of the Sutro research group to use autoresearch for researching alternative LM architectures.
  3. Miyu Horiuchi used this benchmark to evaluate numerical-analysis-based optimization methods.

spx Backend Deployments

Context

  1. I noticed that when programming with coding agents, the bottleneck to quick product iteration has shifted from writing code to deploying it and delivering it to users.
  2. spx's core hypothesis is that sufficiently fast deployment removes the need for any local execution.

What I built

  1. Built a sub-5s (p99) deployment engine for Python backends.
  2. Per-deployment Firecracker microVMs.
  3. Caddy for routing traffic through a <slug>.runspx.com URL to the user's VMs.
  4. I removed the requirement for a Python interpreter on the client by building spx uv to manage dependencies directly on the deployment.
  5. I implemented user secret management and persistent volumes.

Result

  1. Users prefer this over Cloud Run and Vercel deployments.
  2. Users are asking to deploy their full-stack production apps on spx.

Bloomberg Search Migration

Context

  1. The Notes team was bottlenecked by calls to Bloomberg’s Solr deployment, limiting their ability to ship customer-requested features.
  2. Bloomberg is the largest Solr user in the world.

What I built

  1. Replicated the production instance in staging and ran load testing.
  2. Coordinated with the search infrastructure team in NYC and the Notes team in London.
  3. Migrated production code from BAS to HTTP for communication with the Solr instance.

Result

  1. This increased the maximum payload size by 100x and reduced latency by 10x between the Notes instance and the search index.

Achievements

  1. (2024) Raised a $3M pre-seed round. Investors include Seedcamp, Playfair, DeepMind researchers, and a Meta board member.
  2. (2019) Published "Identifying and Explaining Discriminative Attributes" (Stepanjans & Freitas, EMNLP-IJCNLP 2019).
  3. (2018) IARPA best disease prediction model. Led a team that built the best real-time MERS disease prediction model.