Data Machina #222
Evaluating LLMs & RAG Apps. DeepEval. HoneyHive. Jailbreak GPT-4. XGBoost v2. Voice cloning. LLaVA SOTA L&V. TSMixer SOTA Time-series. Google Graph Mining.
Evaluating LLMs & RAG apps is tricky. Picture this virtual meeting: an engineer is enthusiastically demoing a RAG app for long-form legal doc summarisation. All good, until the LLM spits out a summary that is a shambles. Then some smart exec from the client team innocently asks:
How do we prevent this from happening? How do we know that the AI will always produce correct, accurate summarisations? Further: How do we evaluate the quality of those summaries with certainty from a business perspective?
The short answer: LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarisation. Yes: how humans can evaluate the AI summariser.
Evaluating LLMs is a minefield. Princeton researchers analysed different LLM evaluation methods and concluded that current ways of evaluating LLMs don't work well, especially for questions relating to their societal impact. Check out this annotated slide-deck on why evaluating LLMs is a minefield.
Understanding seven challenges in evaluating AI systems. Researchers at Anthropic outlined 7 challenges they have encountered while evaluating their own models. The post describes in detail what developing, implementing, and interpreting model evaluations looks like in practice. Blogpost: Challenges in evaluating AI systems.
Linking LLM evaluation metrics and pipelines is key. There is no point in trying to evaluate LLM outputs if you don't put in place a solid evaluation framework, with metrics defined by the business domain experts. This also means you need to manage the evaluation workflow and pipeline with proper tools. DeepEval is a tool that connects metric-based evaluations with your CI/CD pipelines. See DeepEval: Evaluation and Unit Testing for LLMs.
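To make the idea concrete, here is a minimal, generic sketch of metric-based LLM unit testing wired into CI. This is not DeepEval's actual API: `generate_answer` and `relevancy_score` are hypothetical stand-ins for your LLM call and your chosen metric, and the threshold is an assumed value agreed with domain experts.

```python
# Minimal sketch of metric-based LLM unit testing in CI (generic pytest style,
# not DeepEval's actual API). `generate_answer` and `relevancy_score` are
# hypothetical stand-ins for your LLM call and your chosen metric.
import pytest

RELEVANCY_THRESHOLD = 0.7  # assumption: a value agreed with the domain experts

def generate_answer(question: str) -> str:
    # Hypothetical: call your LLM / RAG app here.
    return "Our refund policy allows returns within 30 days of purchase."

def relevancy_score(question: str, answer: str) -> float:
    # Hypothetical metric: in practice this could be an embedding-similarity
    # or LLM-as-judge score returned by your evaluation library.
    return 0.82

@pytest.mark.parametrize("question", [
    "What is the refund policy?",
    "How long do customers have to return an item?",
])
def test_answer_relevancy(question):
    answer = generate_answer(question)
    score = relevancy_score(question, answer)
    # The CI pipeline fails the build if the metric drops below the threshold.
    assert score >= RELEVANCY_THRESHOLD, f"Relevancy {score:.2f} below threshold"
```

Run it with `pytest` in your pipeline; a metric regression then blocks the release like any other failing unit test.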
How can we measure LLM quality in a holistic manner? Researchers at Mosaic recently introduced The Mosaic Eval Gauntlet: a technique for evaluating the quality of pretrained foundation models. It includes 34 different benchmarks organised into 6 broad categories. Read more here: LLM Evaluation Scores.
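The aggregation idea is simple to sketch: score each benchmark, average within a category, then average across categories. The toy below illustrates that shape only; the category names, benchmark names and numbers are made up for the example, not the Gauntlet's actual results.

```python
# Toy illustration of aggregating per-benchmark accuracies into category-level
# scores, in the spirit of a gauntlet-style eval suite. All names and numbers
# below are made up for the example.
from collections import defaultdict

benchmark_scores = {
    # (category, benchmark): accuracy
    ("reading_comprehension", "benchmark_a"): 0.61,
    ("reading_comprehension", "benchmark_b"): 0.55,
    ("commonsense_reasoning", "benchmark_c"): 0.72,
    ("commonsense_reasoning", "benchmark_d"): 0.64,
    ("world_knowledge", "benchmark_e"): 0.48,
}

by_category = defaultdict(list)
for (category, _benchmark), acc in benchmark_scores.items():
    by_category[category].append(acc)

# Equal-weight average within each category, then across categories.
category_avg = {c: sum(v) / len(v) for c, v in by_category.items()}
composite = sum(category_avg.values()) / len(category_avg)

for c, avg in category_avg.items():
    print(f"{c}: {avg:.3f}")
print(f"composite score: {composite:.3f}")
```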
Need for new collaboration platforms for LLM eval, testing & debugging. All that stuff requires the collaboration of several teams. And the specific, interactive nature of LLM output evaluation adds more complexity. HoneyHive is a collaboration platform to test, evaluate, monitor and debug your LLM apps, from prototype to production. It enables you to continuously improve LLM apps in production with human feedback, quantitative rigour and safety best practices.
LLM-RAG apps evaluation: automation and best practices? Evaluating RAG outputs for domain-specific apps is very hard and time consuming. There are no "standards" or "best practices," so many companies end up doing non-standard, ad-hoc, manual evaluations. Databricks has proposed some best practices for human+AI evaluation of LLM-RAG apps (see the sketch below for the general idea).
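One common human+AI pattern is an LLM judge that assigns a coarse grade against a rubric, with low grades escalated to a human reviewer. The sketch below is only an illustration of that pattern, not the Databricks recipe; `call_llm` is a hypothetical wrapper around your model API and here just returns a canned response.

```python
# Sketch of a human+AI grading loop: an LLM judge assigns a coarse grade
# against a rubric, and low-grade answers are routed to a human reviewer.
# `call_llm` is a hypothetical wrapper around your model API.
import json

JUDGE_PROMPT = """You are grading a RAG answer against the retrieved context.
Return JSON like {{"grade": 1|2|3, "reason": "..."}} where
3 = correct and grounded, 2 = partially correct, 1 = wrong or ungrounded.

Question: {question}
Context: {context}
Answer: {answer}
"""

def call_llm(prompt: str) -> str:
    # Hypothetical: replace with your LLM client call; canned reply for the demo.
    return '{"grade": 2, "reason": "Answer omits the 30-day limit."}'

def grade_answer(question: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    result = json.loads(raw)
    # A coarse scale plus escalation keeps the human workload small but targeted.
    result["needs_human_review"] = result["grade"] < 3
    return result

print(grade_answer(
    "What is the refund window?",
    "Customers may return items within 30 days of purchase.",
    "You can return items for a refund.",
))
```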
RAG apps evaluation requires optimising the interactions of all RAG parts. RAG apps involve many moving parts: chunking/splitting, embeddings, vector stores, calls to external APIs, model grounding, search & retrieval… The way you combine all those parts has a massive impact on the quality of the outputs, and therefore on how consistently you can evaluate a RAG app (see the toy pipeline below). Jerry @LlamaIndex recently published a great presentation on all this. Check out: Evaluating and Optimising your RAG Apps (45 slides).
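The toy pipeline below is a self-contained sketch of those moving parts and the knobs you end up tuning together (chunk size, overlap, top-k). A keyword-overlap scorer stands in for real embeddings and a vector store, and the LLM call is left as a prompt you would send; it is an illustration, not any library's API.

```python
# Toy end-to-end sketch of the RAG "moving parts" you end up tuning together:
# chunking, retrieval and prompt assembly. A keyword-overlap scorer stands in
# for real embeddings + a vector store; the LLM call itself is left out.

def chunk(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def score(query: str, chunk_text: str) -> float:
    q, c = set(query.lower().split()), set(chunk_text.lower().split())
    return len(q & c) / (len(q) or 1)   # crude stand-in for cosine similarity

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]

def build_prompt(query: str, contexts: list[str]) -> str:
    ctx = "\n---\n".join(contexts)
    return f"Answer using ONLY the context below.\n\nContext:\n{ctx}\n\nQuestion: {query}"

corpus = "The agreement may be terminated by either party with 30 days written notice. " * 5
chunks = chunk(corpus, chunk_size=40, overlap=10)
prompt = build_prompt("How can the agreement be terminated?",
                      retrieve("terminated notice", chunks, top_k=2))
print(prompt)  # chunk_size, overlap and top_k all change what the LLM actually sees
```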
RAG apps evaluation requires understanding and minimising hallucinations. This is a fast-moving field, but here is a comprehensive guide to the most effective approaches for mitigating hallucinations in user-facing products (Sep 2023). Also worth reading: this interesting Survey of Hallucination in Large Foundation Models (Sep 2023).
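As a flavour of what a grounding check can look like, here is a deliberately crude, self-contained heuristic: flag summary sentences whose content words barely appear in the retrieved context. Real systems typically use an NLI model or an LLM judge instead; the threshold and stopword list here are arbitrary assumptions.

```python
# Crude sketch of a grounding check: flag summary sentences whose content words
# barely appear in the retrieved source context. Real systems typically use an
# NLI model or an LLM judge instead of this token-overlap heuristic.
import re

def content_words(text: str) -> set[str]:
    stop = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "by", "with"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop}

def ungrounded_sentences(summary: str, context: str, min_support: float = 0.5) -> list[str]:
    ctx_words = content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = content_words(sentence)
        support = len(words & ctx_words) / (len(words) or 1)
        if support < min_support:
            flagged.append(sentence)  # likely hallucinated: route to review or re-generate
    return flagged

context = "The agreement may be terminated by either party with 30 days written notice."
summary = ("Either party can terminate with 30 days notice. "
           "The penalty for late payment is 5%.")
print(ungrounded_sentences(summary, context))  # flags the unsupported penalty claim
```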
LLM apps evaluation requires user analytics. LLM app evaluation also means many user interactions, and more cost! LLMs are all about dynamic, continuous user interactions in natural language. This requires a new way for AI & product teams to think about LLM-app user analytics: understanding the flow of the conversation between the user and the LLM, and the costs. Nebuly is a new tool for LLM user analytics. Check out: What is User Analytics for LLMs.
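In its simplest form, LLM user analytics starts with per-turn usage logging. The sketch below tracks tokens, latency and estimated cost per conversation; the per-token prices are placeholder assumptions, not any provider's real pricing, and the data model is hypothetical.

```python
# Sketch of per-turn LLM usage logging for user analytics: track tokens,
# latency and estimated cost per conversation. The per-token prices are
# placeholder assumptions, not any provider's real pricing.
from dataclasses import dataclass, field

PRICE_PER_1K_INPUT = 0.01    # assumed placeholder rate (USD per 1K tokens)
PRICE_PER_1K_OUTPUT = 0.03   # assumed placeholder rate (USD per 1K tokens)

@dataclass
class Turn:
    user_message: str
    input_tokens: int
    output_tokens: int
    latency_ms: int

    @property
    def cost(self) -> float:
        return (self.input_tokens * PRICE_PER_1K_INPUT +
                self.output_tokens * PRICE_PER_1K_OUTPUT) / 1000

@dataclass
class Conversation:
    user_id: str
    turns: list[Turn] = field(default_factory=list)

    def summary(self) -> dict:
        return {
            "user_id": self.user_id,
            "turns": len(self.turns),
            "total_cost_usd": round(sum(t.cost for t in self.turns), 4),
            "avg_latency_ms": sum(t.latency_ms for t in self.turns) / max(len(self.turns), 1),
        }

conv = Conversation("user-42")
conv.turns.append(Turn("Summarise clause 7 of the contract", 1200, 350, 2100))
conv.turns.append(Turn("Now shorten it to two bullet points", 900, 120, 1400))
print(conv.summary())  # feed this into your product analytics dashboards
```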
The latest on LLM evaluations, Oct 2023. Check out this repo from MS Research with the latest LLM evaluation techniques, including 4 novel methods, a comprehensive survey, and lots of papers.
Have a nice week.
10 Link-o-Troned
the ML Pythonista
Deep & Other Learning Bits
AI/ DL ResearchDocs
data v-i-s-i-o-n-s
MLOps Untangled
AI startups -> radar
ML Datasets & Stuff
Postscript, etc
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.
Ensuring the reliability and safety of LLM-based apps is essential. At DATUMO, we provide an LLM evaluation SaaS tool that automatically generates large-scale question datasets and assesses model reliability. It’s designed to enhance your model’s performance and stability before launch. Hope this helps anyone working on LLM development!
You can learn more about us at https://datumo.com