Managing Generative AI Risk and Meeting M-24-10 Mandates on Monitoring & Evaluation
On the face of it, the OMB memo refers to specific steps that the implementing organization must undertake. Taking one AI system or model at a time, an implementation specialist (e.g., an AI engineer, product manager, program manager, or scientist) could incorporate these steps into their process for building an AI service or product. For example, one could incorporate a library such as TrustyAI or Arize Phoenix into an MLOps pipeline for evaluation and monitoring.
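As a concrete illustration, here is a hedged sketch of wiring Arize Phoenix into a development workflow for tracing and evaluating LLM calls. It assumes the arize-phoenix package and its OpenAI auto-instrumentation; module paths and interfaces vary by version, so treat this as a starting point rather than a definitive integration.

```python
# Sketch: local tracing/evaluation UI for LLM calls during development.
# Assumes `pip install arize-phoenix openai`; module paths vary by version.
import phoenix as px
from phoenix.trace.openai import OpenAIInstrumentor

# Launch the local Phoenix app to collect and visualize traces.
session = px.launch_app()

# Auto-instrument OpenAI client calls so prompts, responses, latency,
# and token usage are captured as traces for later evaluation.
OpenAIInstrumentor().instrument()

# ... run your application code here; traces appear in the Phoenix UI ...
print(f"Phoenix UI available at: {session.url}")
```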
But what about at an organizational level, and from the perspective of an accountable executive like the Chief AI Officer (CAIO) – how might one set up processes and functions to ensure that any team working on AI products meets or exceeds minimum standards of monitoring and evaluation?
To implement a durable and defensible testing, evaluation, and monitoring solution, it is important to begin with a prescriptive standard such as the NIST AI RMF and build practices and procedures around it.
Even if your organization isn’t under the purview of the OMB memo (it applies to federal agencies as defined in 44 U.S.C. § 3502(1)), it is reasonable to expect that upcoming regulations on AI at both the federal and state levels would be informed by (or even replicate) the approach taken by NIST and the OMB.
Using NIST AI RMF to Baseline Testing & Evaluation
Artificial intelligence (AI) advancements have ignited discussions about associated risks, potential biases within training data, and the characteristics of reliable, trustworthy AI. The NIST AI Risk Management Framework (AI RMF) addresses these concerns by providing both a conceptual roadmap for pinpointing AI-related risks and a set of processes tailored to assess and manage those risks. The framework is built around seven characteristics of trustworthy AI (valid and reliable; safe; secure and resilient; accountable and transparent; explainable and interpretable; privacy-enhanced; and fair, with harmful bias managed) and links the socio-technical aspects of AI to its lifecycle and relevant actors. A critical component of these processes is test, evaluation, verification, and validation (TEVV). The NIST AI RMF outlines four core functions (Govern, Map, Measure, Manage) with subcategories detailing ways to implement them.
Incorporating TEVV throughout the AI lifecycle is essential, and Appendix A of the RMF describes the corresponding tasks, namely:
- TEVV tasks for design, planning, and data may center on internal and external validation of assumptions for system design, data collection, and measurements relative to the intended context of deployment or application.
- TEVV tasks for development (i.e., model building) include model validation and assessment.
- TEVV tasks for deployment include system validation and integration in production, with testing, and recalibration for systems and process integration, user experience, and compliance with existing legal, regulatory, and ethical specifications.
- TEVV tasks for operations involve ongoing monitoring for periodic updates, testing, and subject matter expert (SME) recalibration of models, the tracking of incidents or errors reported and their management, the detection of emergent properties and related impacts, and processes for redress and response.
We’ll go through all of these over this series of blog posts. In our previous article, we covered how LLMs might be tested with a mix of approaches to provide robust testing within a reasonable budget (of both time and money).
In today’s article, we’re going to cover monitoring. Evaluation of LLM performance happens here too, via the metrics we gather during monitoring. This evaluation (which we’ll refer to as online evaluation), when done as part of an organizational process, can drive dashboards and be distilled into management reporting that supports effective decision making by accountable executives.
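To make online evaluation concrete, the sketch below shows one hypothetical shape for a per-request monitoring record that an application could emit to whatever logging or metrics sink the organization already uses. The field names and the JSON-lines sink are illustrative assumptions, not a prescribed schema.

```python
# Sketch: a per-request record for online evaluation of an LLM application.
# Field names and the JSON-lines sink are illustrative assumptions.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class LLMRequestRecord:
    request_id: str
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    groundedness: float | None = None   # filled in by an online evaluator
    relevance: float | None = None
    user_feedback: str | None = None    # e.g., thumbs up/down

def emit(record: LLMRequestRecord, path: str = "llm_monitoring.jsonl") -> None:
    """Append the record as a JSON line; a dashboard can aggregate these later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), **asdict(record)}) + "\n")

emit(LLMRequestRecord("req-001", "gpt-4o", 812.5, 640, 120,
                      groundedness=4.0, relevance=5.0))
```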
Understanding LLM-based Systems and Applications
At this point, organizations are racing ahead to implement applications that use LLMs at their core. To develop an effective testing and evaluation strategy, it is important to understand current deployment and usage models. A common pattern is a retrieval-augmented generation (RAG) application: one or more LLMs connected to external data sources and inputs (e.g., user input from a website and the user’s record at the agency). The system is grounded in a corpus of knowledge that is organization- and task-specific (e.g., information about the agency’s processes and procedures) and is configured, and sometimes fine-tuned, to perform the desired inference on its inputs using that corpus (e.g., responding accurately to the user’s query). Typically, this system is deployed to a production environment accessible to end users.
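A minimal sketch of this RAG pattern follows. The in-memory corpus, keyword retriever, and prompt template are illustrative assumptions; the final chat-completion call to the hosted LLM is left as a comment because it depends on the platform in use.

```python
# Sketch: the core retrieve-then-generate loop of a RAG application.
# The corpus, retriever, and prompt template are illustrative assumptions.
AGENCY_CORPUS = [
    "Form 1234 must be filed within 30 days of the triggering event.",
    "Appeals are reviewed by the regional office within 60 days.",
    "Applicants may check case status online or by phone.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy keyword-overlap retriever; production systems use vector search."""
    scored = [(len(set(query.lower().split()) & set(doc.lower().split())), doc)
              for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)[:k] if score > 0]

def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer the user's question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

query = "How long do I have to file form 1234?"
prompt = build_prompt(query, retrieve(query, AGENCY_CORPUS))
# The prompt would then be sent to the hosted LLM (e.g., an Azure OpenAI
# chat-completions deployment) and the response returned to the user.
print(prompt)
```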
Given this typical deployment pattern, the responsible organizational group should develop a systematic approach that includes the following (a brief configuration sketch follows the list):
- Identifying Key Metrics
- Defining Data Collection and Integration Needs
- Establishing Performance Monitoring and Analysis Reports
- Developing a Baseline Using Benchmarking and Comparison
- Monitoring Resource Utilization and Cost Optimization
- Ensuring Continuous Improvement and Adaptation
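As one way to operationalize the first few items, the sketch below defines hypothetical metric targets and a simple threshold check that could feed a performance-monitoring report. The metric names and thresholds are assumptions an organization would set for itself.

```python
# Sketch: hypothetical metric targets and a simple threshold check that
# could back a performance-monitoring report. Values are illustrative.
METRIC_TARGETS = {
    "groundedness": 4.0,   # 1-5 scale, higher is better
    "relevance": 4.0,
    "coherence": 3.5,
    "fluency": 3.5,
    "similarity": 3.5,
}

def flag_regressions(observed: dict[str, float]) -> dict[str, float]:
    """Return metrics whose observed averages fall below their targets."""
    return {name: value for name, value in observed.items()
            if value < METRIC_TARGETS.get(name, 0.0)}

weekly_averages = {"groundedness": 3.6, "relevance": 4.4, "coherence": 3.9}
print(flag_regressions(weekly_averages))   # -> {'groundedness': 3.6}
```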
What are the key metrics we would look at during the evaluation and monitoring process? They are the following (a scoring sketch appears after the list):
- Groundedness: The extent to which generated responses are factually verifiable based on authoritative sources (e.g., company databases, product manuals, regulatory guidelines). This metric is crucial in scenarios where accuracy and compliance are non-negotiable.
- Relevance: The model’s ability to tailor responses specifically to the user’s query or input. High relevance scores signify strong comprehension of business context and promote efficiency in problem-solving.
- Coherence: The clarity, logical structure, and overall readability of generated text. Coherent responses are indispensable for seamless internal and external communications, upholding a professional brand image.
- Fluency: The model’s linguistic proficiency, focusing on grammatical correctness and vocabulary usage. Responses demonstrating strong fluency enhance the organization’s credibility and optimize information exchange.
- Similarity: The alignment between model-generated responses and the desired content or messaging (as established in ground truth sources). High similarity ensures consistency with the company’s knowledge base and brand voice.
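One common way to compute metrics like these online is an LLM-as-judge approach. The sketch below is a hedged illustration using the openai Python client against an Azure OpenAI deployment; the deployment name, API version, prompt wording, and 1-5 scale are assumptions rather than a fixed recipe.

```python
# Sketch: LLM-as-judge scoring of groundedness on a 1-5 scale.
# Endpoint, key, deployment name, API version, and prompt wording are assumptions.
import os
from openai import AzureOpenAI  # pip install openai>=1.0

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",
)

JUDGE_PROMPT = (
    "Rate how well the ANSWER is supported by the CONTEXT on a scale of 1-5, "
    "where 5 means every claim is supported. Reply with the number only.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)

def groundedness_score(answer: str, context: str, deployment: str = "gpt-4o") -> int:
    """Ask a judge model to rate groundedness; returns an integer 1-5."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Relevance, coherence, fluency, and similarity can be scored with analogous
# prompts over the user query, the generated response, and ground-truth text.
```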
Evaluation & Monitoring
Now that we’ve established the high-level concepts and processes that go into monitoring and evaluating AI applications, we’re going to look at specifics and implementation.
Microsoft Azure OpenAI
For the examples in this blog post, we’re going to use Microsoft Azure OpenAI. Why? Microsoft recently made the Azure OpenAI Service available in its Azure Government cloud, which is authorized at the FedRAMP High baseline, and our analysis shows that a number of agencies are currently building and deploying solutions on this platform.
However, the same general principles and approach can be applied to other LLMs and platforms as they become available within a FedRAMP authorization boundary in the future.
Evaluation of the LLM-based application can then be automated in the Azure environment, with metrics collection, calculation, and dashboarding. Azure AI Studio (currently in preview) gives an AI application developer several flexible options for evaluating generative AI applications. The figure below provides an overview of the AI development lifecycle, beginning with sample data.
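As one concrete option, a batch evaluation over logged interactions might look like the hedged sketch below. It assumes the azure-ai-evaluation SDK’s built-in evaluators and evaluate() entry point; the package name, parameter names, and availability in preview environments should be verified against current Azure documentation.

```python
# Sketch: batch evaluation with Azure's built-in quality evaluators.
# Package and parameter names reflect the azure-ai-evaluation SDK and may
# differ in preview environments; verify against current documentation.
import os
from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    evaluate,
)

# AI-assisted evaluators use a judge model hosted in Azure OpenAI.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # assumed judge deployment name
}

result = evaluate(
    data="logged_interactions.jsonl",  # rows with query, response, context fields
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores suitable for a dashboard
```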