LLM evaluation is critical for generative AI in the enterprise, but measuring how well an LLM answers questions or performs tasks is difficult. Evaluations must therefore go beyond standard measures of “correctness” to a more nuanced, granular view of quality.
In practice, common enterprise LLM evaluation approaches (e.g., OSS benchmarks) often come up short: they’re slow, expensive, subjective, and incomplete. That leaves AI initiatives blocked, with no clear path to production quality.
In this video, Vincent Sunn Chen, Founding Engineer at Snorkel AI, and Rebekah Westerlind, Software Engineer at Snorkel AI, discuss the importance of LLM evaluation, highlight common challenges and approaches, and explain the core concepts behind Snorkel AI's approach to data-centric LLM evaluation.
You’ll learn more about:
Understanding the nuances of LLM evaluation.
Evaluating LLM response accuracy at scale.
Identifying where additional LLM fine-tuning is needed.
See more videos from Snorkel AI here: @snorkelai
Learn more about LLM evaluation here: https://snorkel.ai/llm-evaluation-pri...
Timestamps:
01:07 Agenda
01:40 Why do we need LLM evaluation?
02:55 Common evaluation axes
04:05 Why evaluation is more critical in Gen AI use cases
05:55 Why enterprises are often blocked on effective LLM evaluation
07:30 Common approaches to LLM evaluation
08:30 OSS benchmarks + metrics
09:40 LLM-as-a-judge
11:20 Annotation strategies
12:50 How can we do better than manual annotation strategies?
16:00 How data slices enable better LLM evaluation
18:00 How does LLM eval work with Snorkel?
20:45 Building a quality model
24:10 Using fine-grained benchmarks for next steps
25:50 Workflow overview (review)
26:45 Workflow—starting with the model
28:08 Workflow—using an LLM as a judge
28:40 Workflow—the quality model
30:00 Chatbot demo
31:46 Annotating data in Snorkel Flow (demo)
34:49 Building labeling functions in Snorkel Flow (demo)
40:15 LLM evaluation in Snorkel Flow (demo)
41:58 Snorkel Flow Jupyter notebook demo
44:28 Data slices in Snorkel Flow (demo)
46:51 Recap
49:25 Snorkel evaluation offer
50:31 Q&A
#enterpriseai #largelanguagemodels #evaluation