Hey all! Thought this group might find this interesting - a new approach to evaluating RAG pipelines using 'agents as a judge'. We got excited by the findings in this paper (https://arxiv.org/abs/2410.10934) showing that agents produce evaluations closer to those of human evaluators, especially for multi-step workflows.
Our first use case was RAG pipelines, specifically evaluating whether your agent MISSED pulling any important chunks from the source document. While many RAG evaluators check whether your model USED a retrieved chunk in its output, there's no visibility into whether your model grabbed all the right chunks in the first place. We thought we'd test 'agent as judge' with a new metric, 'potential sources missed', to help evaluate whether your agents are missing important chunks from the source of truth. A rough sketch of the idea is below.
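To make the metric concrete, here's a minimal sketch of how a 'potential sources missed' check could work. The `judge_relevant` callable and the miss-rate formula are illustrative assumptions on my part, not our actual implementation:

```python
from typing import Callable

def potential_sources_missed(
    query: str,
    all_chunks: list[str],          # every chunk of the source document
    retrieved_chunks: list[str],    # what the RAG pipeline actually pulled
    judge_relevant: Callable[[str, str], bool],  # hypothetical judge call
) -> dict:
    """Flag chunks a judge deems relevant that the retriever never pulled."""
    retrieved = set(retrieved_chunks)
    # Judge every chunk the pipeline did NOT retrieve for relevance to the query.
    missed = [
        chunk
        for chunk in all_chunks
        if chunk not in retrieved and judge_relevant(query, chunk)
    ]
    # Total relevant chunks = missed ones plus relevant ones that were retrieved.
    total_relevant = len(missed) + sum(
        judge_relevant(query, chunk) for chunk in retrieved_chunks
    )
    return {
        "missed_chunks": missed,
        # Fraction of judged-relevant chunks the retriever failed to surface.
        "miss_rate": len(missed) / total_relevant if total_relevant else 0.0,
    }
```

In practice the judge is an agent that can re-read the source document rather than a single yes/no LLM call, which is where the agentic part comes in.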
Curious what you all think!
sidpremkumar1 17 hours ago
One of the founders at Lytix here
It was pretty interesting: we started with LLM-as-a-judge, but noticed a big jump in human-aligned accuracy when switching to an agentic evaluation approach. Was a lot of fun to work on!
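For anyone curious about the distinction, here's a very rough sketch of the contrast. The `call_llm` callable, the tool registry, and the prompt format are hypothetical stand-ins, not what we actually run:

```python
from typing import Callable

def llm_as_judge(call_llm: Callable[[str], str], query: str, answer: str) -> str:
    # Single shot: the model scores the answer from the prompt alone.
    return call_llm(f"Score this answer to '{query}' from 1-5:\n{answer}")

def agent_as_judge(
    call_llm: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],  # e.g. {"read_source": ...}
    query: str,
    answer: str,
    max_steps: int = 5,
) -> str:
    # Multi-step: the judge can gather evidence (re-read the source document,
    # search for chunks) before committing to a verdict.
    evidence = ""
    for _ in range(max_steps):
        step = call_llm(
            f"Query: {query}\nAnswer: {answer}\nEvidence so far:\n{evidence}\n"
            f"Reply 'VERDICT: <score>' or 'TOOL: <name> <arg>'. Tools: {list(tools)}"
        )
        if step.startswith("VERDICT:"):
            return step
        # Run the requested tool and append the result to the evidence log.
        name, _, arg = step.removeprefix("TOOL:").strip().partition(" ")
        tool = tools.get(name, lambda a: "unknown tool")
        evidence += f"\n{name}({arg}) -> {tool(arg)}"
    # Step budget exhausted: force a final verdict from the gathered evidence.
    return call_llm(f"Final VERDICT for '{query}' given evidence:\n{evidence}")
```

The extra evidence-gathering steps are what seemed to drive the accuracy jump for multi-step workflows like retrieval.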