Publications

You can also find my articles on my Google Scholar profile.

Conference Papers

LLM-as-Judge on a Budget

Saha, A., Wagde, A., Kveton, B. AISTATS 2026 (with Adobe Research)

LLM judges are stochastic: querying the same prompt-response pair multiple times yields different scores, so naive uniform sampling wastes budget on low-variance pairs. We frame evaluation as a best-arm identification problem in which each prompt-response pair is a bandit arm, per-pair score variance is estimated online, and queries are adaptively concentrated where uncertainty is highest. Our algorithms ROBIN and ROBIN-HOOD provably minimize worst-case estimation error under a fixed query budget, with error bounds scaling with the sum of variances. Experiments on Summarize-From-Feedback and HelpSteer2 show substantial error reduction over uniform baselines at equal cost.
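The allocation idea can be illustrated with a minimal sketch. This is not ROBIN or ROBIN-HOOD; it is a plain greedy rule, assumed here for illustration, that always queries the pair whose mean score currently has the largest estimated squared standard error. The simulated judge `noisy_judge`, its noise levels, and the `warmup` parameter are all hypothetical.

```python
import random
import statistics


def noisy_judge(pair_index, rng, noise_levels=(0.1, 2.0, 0.5)):
    """Hypothetical stand-in for an LLM judge: true score 3.0 plus pair-specific noise."""
    return rng.gauss(3.0, noise_levels[pair_index])


def adaptive_allocation(judge, num_pairs, budget, warmup=5, seed=0):
    """Spend `budget` judge queries, favoring pairs with noisier scores.

    Illustrative greedy rule (an assumption, not the paper's algorithm):
    after `warmup` queries per pair, repeatedly query the pair whose
    estimated variance divided by its query count is largest.
    """
    if budget < warmup * num_pairs:
        raise ValueError("budget too small for the warmup phase")
    rng = random.Random(seed)
    # Warmup: every pair needs a few scores before its variance can be estimated.
    scores = [[judge(i, rng) for _ in range(warmup)] for i in range(num_pairs)]
    for _ in range(budget - warmup * num_pairs):
        # Estimated squared standard error of each pair's mean score.
        sq_err = [statistics.variance(s) / len(s) for s in scores]
        target = max(range(num_pairs), key=lambda i: sq_err[i])
        scores[target].append(judge(target, rng))
    means = [statistics.fmean(s) for s in scores]
    counts = [len(s) for s in scores]
    return means, counts


means, counts = adaptive_allocation(noisy_judge, num_pairs=3, budget=300)
print("queries per pair:", counts)
```

A purely greedy rule like this can under-sample a pair whose warmup scores happen to look almost identical; handling that with confidence bounds on the variance is the kind of issue a provable algorithm has to address.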

Efficient Algorithms for Combinatorial-Bandits with Monotonicity

Wagde, A., Saha, A. NeurIPS 2025 OPT Workshop

[OpenReview] · [PDF]

In combinatorial bandits, a learner picks a subset of arms each round and observes a reward aggregated by an unknown function. Prior work required strong structural assumptions on that aggregator (e.g., linearity), and sample complexity becomes intractable in the general case. We show that under monotonicity alone, two arms can be compared by randomizing over the remaining arms in the chosen set, effectively isolating a pairwise signal from combinatorial feedback. This reduction lets us plug in any pairwise-preference bandit algorithm, matching state-of-the-art sample complexity while dropping the restrictive aggregation assumptions that made previous approaches brittle.
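One way to picture the pairwise reduction is the sketch below. It is an assumed, simplified construction rather than the paper's procedure: to compare arms `a` and `b`, it draws the rest of the subset at random, plays that subset once with `a` and once with `b`, and reports which play earned more. The Bernoulli arms, the `max` aggregator, and every function name here are hypothetical.

```python
import random


def make_max_environment(means, rng):
    """Hypothetical environment: Bernoulli arms aggregated by max (monotone, non-linear)."""
    def play(subset):
        return max(1 if rng.random() < means[arm] else 0 for arm in subset)
    return play


def duel(a, b, num_arms, subset_size, play, rng):
    """One noisy comparison of arms a and b from subset-level feedback.

    Returns 1.0 if the subset containing a earned more, 0.0 if the subset
    containing b earned more, and 0.5 on a tie.
    """
    others = [arm for arm in range(num_arms) if arm not in (a, b)]
    # Randomize over the remaining slots so neither arm benefits from a fixed context.
    context = rng.sample(others, subset_size - 1)
    reward_a = play(context + [a])
    reward_b = play(context + [b])
    if reward_a == reward_b:
        return 0.5
    return 1.0 if reward_a > reward_b else 0.0


def win_rate(a, b, num_arms, subset_size, play, rng, rounds=2000):
    """Average duel outcome; values above 0.5 suggest a is the better arm."""
    return sum(duel(a, b, num_arms, subset_size, play, rng) for _ in range(rounds)) / rounds


rng = random.Random(0)
play = make_max_environment([0.9, 0.2, 0.5, 0.4, 0.3], rng)
print("estimated win rate of arm 0 over arm 1:", win_rate(0, 1, 5, 3, play, rng))
```

The `duel` function has the same interface a dueling-bandit algorithm expects, which is what makes it possible to plug an existing pairwise-preference method on top of it.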

Journal Articles

A Deep Reinforcement Learning Technique for Bug Detection in Video Games

Rani, G., Pandey, U., Wagde, A.A., Dhaka, V.S. International Journal of Information Technology, 15, 355–367, 2023.

[DOI] · [PDF]

Proposes RLBGameTester, which trains a DQN agent on Atari games and then monitors last-layer gradients during gameplay. Bugs injected after training cause sharp gradient spikes, which flag the exact iteration where the bug occurred without any human supervision.
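The spike-flagging step can be sketched as follows, assuming the last-layer gradient norms have already been logged per iteration. The rolling-window threshold rule, the function name, and the default parameters are assumptions made for illustration; the article's actual detection criterion is not described here.

```python
import statistics


def first_gradient_spike(grad_norms, window=20, threshold=5.0):
    """Return the first iteration whose gradient norm is a sharp outlier, else None.

    A value counts as a spike when it exceeds the mean of the previous
    `window` norms by more than `threshold` standard deviations
    (an assumed rule, chosen for simplicity).
    """
    for t in range(window, len(grad_norms)):
        recent = grad_norms[t - window:t]
        mean = statistics.fmean(recent)
        # Floor the spread so a perfectly flat window does not flag tiny changes.
        spread = max(statistics.pstdev(recent), 1e-8)
        if grad_norms[t] > mean + threshold * spread:
            return t
    return None


# Synthetic log: stable norms for 50 iterations, then a jump at iteration 50.
norms = [1.0 + 0.01 * (i % 3) for i in range(50)] + [10.0, 9.5, 9.8]
print("first spike at iteration:", first_gradient_spike(norms))
```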

Aniket Wagde