Publications
Conference Papers
Saha, A., Wagde, A., Kveton, B. AISTATS 2026 (with Adobe Research)
[arXiv] · [PDF] · [Code]
LLM judges are stochastic: querying the same prompt-response pair multiple times yields different scores, so naive uniform sampling wastes budget on low-variance pairs. We frame evaluation as a best-arm identification problem in which each prompt-response pair is a bandit arm. Per-pair score variance is estimated online, and queries are adaptively concentrated where uncertainty is highest. Our algorithms, ROBIN and ROBIN-HOOD, provably minimize worst-case estimation error under a fixed query budget, with error bounds that scale with the sum of variances. Experiments on Summarize-From-Feedback and HelpSteer2 show substantial error reduction over uniform baselines at equal cost.
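The allocation idea in this summary can be illustrated with a minimal sketch. It is not ROBIN or ROBIN-HOOD: the greedy rule (query the pair whose mean score has the largest estimated variance), the `adaptive_allocate` helper, the `warmup` parameter, and the two toy judges are all assumptions made for illustration.

```python
import random
import statistics

def adaptive_allocate(judges, budget, warmup=2):
    """Greedy variance-aware allocation (illustrative only, not ROBIN)."""
    # Warm-up: query every pair a few times so a variance estimate exists.
    scores = [[judge() for _ in range(warmup)] for judge in judges]
    spent = warmup * len(judges)
    while spent < budget:
        # Estimated variance of each pair's mean score: s^2 / n.
        uncertainty = [statistics.variance(s) / len(s) for s in scores]
        # Spend the next query on the currently most uncertain pair.
        i = max(range(len(judges)), key=lambda j: uncertainty[j])
        scores[i].append(judges[i]())
        spent += 1
    return scores

rng = random.Random(0)
judges = [
    lambda: 0.7,                  # hypothetical judge with no score noise
    lambda: rng.gauss(0.5, 0.2),  # hypothetical noisy judge
]
scores = adaptive_allocate(judges, budget=20)
counts = [len(s) for s in scores]
means = [statistics.fmean(s) for s in scores]
```

In this toy setup the constant-score pair receives only its two warm-up queries and the remaining 18 go to the noisy pair, whereas uniform sampling would spend 10 queries on each.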
Wagde, A., Saha, A. NeurIPS 2025 OPT Workshop
[OpenReview] · [PDF]
In combinatorial bandits, a learner picks a subset of arms each round and observes a reward aggregated by an unknown function. Prior work required strong structural assumptions (e.g., linearity) on that aggregator and incurred intractable sample complexity in the general case. We show that under monotonicity alone, two arms can be compared by randomizing over the remaining arms in the chosen set, which isolates a pairwise signal from combinatorial feedback. This reduction lets us plug in any pairwise-preference bandit algorithm, matching state-of-the-art sample complexity while dropping the restrictive aggregation assumptions that made previous approaches brittle.
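A toy simulation can make the comparison-by-randomization idea concrete. It is a sketch under stated assumptions, not the paper's procedure: the aggregator `1 - prod(1 - r)`, the noiseless arm values, the `duel` helper, and the choice to draw the filler arms independently for each play are all hypothetical.

```python
import math
import random

def aggregate(rewards):
    """Hypothetical monotone aggregator; the learner only sees its output."""
    return 1.0 - math.prod(1.0 - r for r in rewards)

def duel(arm_a, arm_b, arm_values, set_size, rounds, rng):
    """Compare two arms using aggregated feedback only."""
    others = [k for k in arm_values if k not in (arm_a, arm_b)]
    total_a = total_b = 0.0
    for _ in range(rounds):
        # Fill the remaining slots of each played set uniformly at random.
        fill_a = rng.sample(others, set_size - 1)
        fill_b = rng.sample(others, set_size - 1)
        total_a += aggregate([arm_values[arm_a]] + [arm_values[k] for k in fill_a])
        total_b += aggregate([arm_values[arm_b]] + [arm_values[k] for k in fill_b])
    # Monotonicity means the better arm earns more on average.
    return arm_a if total_a > total_b else arm_b

# Noiseless arm values, chosen only for illustration.
arm_values = {"a": 0.9, "b": 0.1, "c": 0.2, "d": 0.4, "e": 0.5, "f": 0.1}
rng = random.Random(0)
winner = duel("a", "b", arm_values, set_size=3, rounds=50, rng=rng)
```

The outcome of each `duel` is a pairwise preference, which is the kind of signal a pairwise-preference bandit algorithm consumes.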
Journal Articles
Rani, G., Pandey, U., Wagde, A.A., Dhaka, V.S. International Journal of Information Technology, 15, 355–367, 2023.
[DOI] · [PDF]
Proposes RLBGameTester, which trains a DQN agent on Atari games and then monitors last-layer gradients during gameplay. Bugs injected after training cause sharp gradient spikes, which flag the exact iteration where the bug occurred without any human supervision.
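The spike-flagging step can be sketched as a simple threshold test on a stream of gradient norms. The detection rule below (mean plus `k` standard deviations over a sliding window), the `first_spike` helper, and the sample values are assumptions for illustration; the article's actual criterion may differ.

```python
import statistics

def first_spike(grad_norms, window=5, k=4.0):
    """Return the first iteration whose gradient norm exceeds
    mean + k * std of the preceding `window` iterations, or None."""
    for i in range(window, len(grad_norms)):
        recent = grad_norms[i - window:i]
        threshold = statistics.fmean(recent) + k * statistics.pstdev(recent)
        if grad_norms[i] > threshold:
            return i
    return None

# Made-up last-layer gradient norms with one abrupt jump.
norms = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 9.0, 1.0]
spike_at = first_spike(norms)
```

Here `spike_at` is the index of the jump to 9.0, and a flat sequence yields `None`.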