Why Is It Hard To Evaluate GenAI Applications?

TL;DR

If you don’t have time to read the whole article, the following four takeaways summarize it. Each one corresponds to a section of the article where you can find the details.

Lack of framework: A GenAI application is not a GenAI foundation model, and the two require different evaluation frameworks. The difference between the two evaluation tasks is not always clear.

Unstructured data: The unstructured output of a GenAI application makes it harder to evaluate than a traditional ML system.

Foundation model unpredictability: GenAI foundation models usually introduce extra unpredictability into the evaluation process.

Longer and more costly iteration: Evaluating a GenAI application is expensive and time-consuming, because building an evaluation dataset and running tests on the application require more resources.

Introduction

I have spent the last two and a half years listening to what businesses want from GenAI, building GenAI applications, and delivering value from them. It has been an interesting journey, as I realized that the advent of ChatGPT constitutes a paradigm shift for ML/AI practitioners like me. I started to believe that GenAI would change our lives, much as personal computers did in the 90s or the modern search engine did in the 2000s. ...