Meet Maxim, an end-to-end evaluation platform to solve AI quality issues


Enterprises are bullish on the prospects of generative AI. They are investing billions of dollars in the space and building all sorts of applications (from chatbots to search tools) targeting different use cases. Almost every major enterprise has some gen AI play in the works. But here’s the thing: committing to AI and actually deploying it to production are two very different things.

Today, Maxim, a California-based startup founded by former Google and Postman executives Vaibhavi Gangwar and Akshay Deo, launched an end-to-end evaluation and observability platform to bridge this gap. The company also announced $3 million in funding from Elevation Capital and several angel investors.

At its core, Maxim is solving the biggest pain point developers face when building large language model (LLM)-powered AI applications: keeping tabs on the many moving parts of the development lifecycle. A small error here or there can break the whole thing, creating trust and reliability problems and ultimately delaying delivery of the project.

Maxim’s offering, focused on testing for and improving AI quality and safety both pre-release and post-production, creates an evaluation standard of sorts, helping organizations streamline the entire lifecycle of their AI applications and quickly ship high-quality products to production.


Why is developing generative AI applications challenging?

Traditionally, software products were built with a deterministic approach that revolved around standardized practices for testing and iteration. Teams had a clear-cut path to improving the quality and security of whatever application they developed. When gen AI came onto the scene, however, the number of variables in the development lifecycle exploded, leading to a non-deterministic paradigm. Developers focused on the quality, safety and performance of their AI apps have to keep tabs on many moving parts, from the model being used to the underlying data to how the user frames a question.

Most organizations tackle this evaluation problem with one of two mainstream approaches: hiring talent to manage every variable in question or trying to build internal tooling on their own. Both lead to massive cost overheads and pull focus away from the core functions of the business.

Recognizing this gap, Gangwar and Deo came together to launch Maxim, which sits between the model and application layers of the gen AI stack and provides end-to-end evaluation across the AI development lifecycle, from pre-release prompt engineering and testing for quality and functionality to post-release monitoring and optimization.

As Gangwar explained, the platform has four core pieces: an experimentation suite, an evaluation toolkit, observability and a data engine.

The experimentation suite, which comes with a prompt CMS, IDE, visual workflow builder and connectors to external data sources/functions, serves as a playground to help teams iterate on prompts, models, parameters and other components of their compound AI systems to see what works best for their targeted use case. Imagine experimenting with one prompt on different models for a customer service chatbot.
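To make that concrete, here is a minimal sketch of the kind of side-by-side comparison such a playground automates. It is not Maxim’s API: the call_model helper, the placeholder model names and the canned replies are all assumptions for illustration.

```python
# Minimal sketch (not Maxim's API): compare one customer-service prompt
# across several candidate models and print the replies side by side.
PROMPT = "A customer asks: 'My order hasn't arrived yet. What should I do?'"
CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]  # placeholder names

def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper; in practice this would call a provider SDK."""
    return f"[{model_name}] canned reply to: {prompt}"

def compare_models(prompt: str, models: list[str]) -> dict[str, str]:
    # One response per model so reviewers can compare tone and accuracy.
    return {name: call_model(name, prompt) for name in models}

for model, reply in compare_models(PROMPT, CANDIDATE_MODELS).items():
    print(f"--- {model} ---\n{reply}\n")
```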

Meanwhile, the evaluation toolkit offers a unified framework for AI and human-driven evaluation, enabling teams to quantitatively determine improvements or regressions for their application on large test suites. It visualizes the evaluation results on dashboards, covering aspects such as tone, faithfulness, toxicity and relevance.
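As a rough illustration of what batch evaluation over a test suite looks like, the sketch below scores each case with a crude keyword-overlap relevance heuristic and averages the results; the metric, field names and test data are stand-ins, not Maxim’s evaluators.

```python
# Rough illustration of batch evaluation; the relevance heuristic and the
# test cases are toy stand-ins, not Maxim's evaluators.
from statistics import mean

test_suite = [
    {"output": "Your order ships tomorrow.", "reference": "order status shipping"},
    {"output": "I can cancel your plan right away.", "reference": "plan cancellation"},
]

def relevance(output: str, reference: str) -> float:
    # Crude keyword-overlap proxy for relevance, purely for illustration.
    ref_words = set(reference.lower().split())
    out_words = set(output.lower().split())
    return len(ref_words & out_words) / max(len(ref_words), 1)

scores = [relevance(case["output"], case["reference"]) for case in test_suite]
print({"relevance_mean": round(mean(scores), 2), "n_cases": len(test_suite)})
```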

The third component, observability, works in the post-release phase, allowing users to monitor real-time production logs and run them through automated online evaluation to track and debug live issues and ensure the application delivers the expected level of quality.

“Using our online evaluations, users can set up automated control across a range of quality, safety, and security-focused signals — like toxicity, bias, hallucinations and jailbreak — on production logs. They can also set real-time alerts to notify them about any regressions on metrics they care about, be it performance-related (e.g., latency), cost-related or quality-related (e.g., bias),” Gangwar told VentureBeat. 
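The snippet below sketches that general pattern: automated checks over production logs with an alert when something crosses a threshold. The log schema, latency budget and keyword blocklist are assumptions; a real deployment would use proper evaluators and an alerting channel rather than print statements.

```python
# Sketch of automated online checks over production logs; the log schema,
# the latency budget and the blocklist are assumptions for illustration.
production_logs = [
    {"request_id": "r1", "latency_ms": 850, "response": "Sure, I can help with that."},
    {"request_id": "r2", "latency_ms": 4200, "response": "That is a stupid question."},
]

LATENCY_BUDGET_MS = 2000
BLOCKLIST = {"stupid", "idiot"}  # toy stand-in for a toxicity evaluator

def check_log(entry: dict) -> list[str]:
    issues = []
    if entry["latency_ms"] > LATENCY_BUDGET_MS:
        issues.append("latency over budget")
    if any(word in entry["response"].lower() for word in BLOCKLIST):
        issues.append("possible toxicity")
    return issues

for entry in production_logs:
    problems = check_log(entry)
    if problems:
        # A real system would page a team or post to an alerting channel here.
        print(f"ALERT {entry['request_id']}: {', '.join(problems)}")
```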

Using the insights from the observability suite, the user can quickly address the issue at hand. If the problem is tied to data, they can use the last component, the data engine, to seamlessly curate and enrich datasets for fine-tuning.
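As a loose sketch of that curation step, the snippet below writes human-corrected examples from flagged production logs into a JSONL file in a prompt/completion layout; the field names and file format are assumptions, not a description of Maxim’s data engine.

```python
# Loose sketch of dataset curation: turn flagged logs plus human corrections
# into fine-tuning records. Field names and JSONL layout are assumptions.
import json

flagged_examples = [
    {
        "input": "My order is late. What can you do?",
        "bad_output": "Not my problem.",
        "corrected_output": "I'm sorry about the delay. Let me check the status for you.",
    },
]

with open("finetune_dataset.jsonl", "w", encoding="utf-8") as f:
    for example in flagged_examples:
        # Keep the human-corrected answer as the training target.
        record = {"prompt": example["input"], "completion": example["corrected_output"]}
        f.write(json.dumps(record) + "\n")
```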

App deployments accelerated

While Maxim is still at an early stage, the company claims it has already helped a “few dozen” early partners test, iterate and ship their AI products about five times faster than before. Gangwar did not name these companies.

“Most of our customers are from the B2B tech, gen AI services, BFSI and edtech domains – the industries where the problem of evaluation is more pressing. We are mostly focused on mid-market and enterprise clients. With our general availability, we want to double down on this market and commercialize it more broadly,” Gangwar added.

She also noted the platform includes several enterprise-centric features, such as role-based access controls, compliance, collaboration with teammates and the option to deploy in a virtual private cloud.

Maxim’s approach to standardizing testing and evaluation is interesting, but the company will have its work cut out taking on other players in this emerging market, especially heavily funded ones like Dynatrace and Datadog, which are constantly evolving their stacks.

For her part, Gangwar says most players target either performance monitoring, quality or observability, whereas Maxim does everything in one place with its end-to-end approach.

“There are products that offer evaluation/experimentation tooling for different phases of the AI development lifecycle: a few are building for experimentation, a few are building for observability. We strongly believe that a single, integrated platform to help businesses manage all testing-related needs across the AI development lifecycle will drive real productivity and quality gains for building enduring applications,” she said.

As the next step, the company plans to expand its team and scale operations to partner with more enterprises building AI products. It also plans to expand platform capabilities, including proprietary domain-specific evaluations for quality and security as well as a multi-modal data engine.


