From gen AI 1.5 to 2.0: Moving from RAG to Agent Systems

We are now more than a year into developing solutions based on generative AI foundation models. While most applications use large language models (LLMs), the recent arrival of multimodal models that can understand and generate images and video makes foundation model (FM) the more accurate term.

The world has begun to develop patterns that can be leveraged to bring these solutions into production and create real impact by sifting through information and adapting it to people's diverse needs. Additionally, there are transformative opportunities on the horizon that will unlock significantly more complex uses of LLMs (and significantly greater value). However, both of these opportunities come with increased costs that must be managed.

Gen AI 1.0: LLMs and emergent behavior from next-token generation

It is important to gain a better understanding of how FMs work. Under the hood, these models convert our words, images, numbers and sounds into tokens, then simply predict the 'best next token', the one most likely to make the person interacting with the model like the response. By learning from over a year of feedback, the underlying models (from Anthropic, OpenAI, Mistral, Meta and elsewhere) have become much more in tune with what people want from them.

By understanding how language is converted into tokens, we have learned that formatting matters (for example, YAML tends to perform better than JSON). By better understanding the models themselves, the generative AI community has developed “prompt-engineering” techniques to get models to respond effectively.
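To make the formatting point concrete, here is a minimal sketch that renders the same record as JSON and as YAML-style text. The record and the `to_simple_yaml` helper are made up for illustration, and character count is only a rough proxy for token count; the sketch says nothing about how well a given model responds to either format.

```python
import json


def to_simple_yaml(record: dict) -> str:
    """Serialize a flat dict of scalars as YAML-style 'key: value' lines.

    Hypothetical helper for illustration only: it does not handle nesting,
    quoting or special characters the way a real YAML library would.
    """
    return "\n".join(f"{key}: {value}" for key, value in record.items())


# Made-up record used only for this comparison.
record = {"name": "pump-7", "status": "fault", "temperature_c": 81}

as_json = json.dumps(record)
as_yaml = to_simple_yaml(record)

print(as_json)
print(as_yaml)
# Character counts are a rough stand-in for token counts, nothing more.
print(len(as_json), len(as_yaml))
```

The YAML-style version drops the braces, quotes and commas, so the same information takes fewer characters. Actual token counts depend on the tokenizer of the model in use.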

For example, by providing a few examples (a few-shot prompt), we can steer a model toward our desired response style. Or, by asking the model to break down the problem (a chain-of-thought prompt), we can get it to generate more tokens, increasing the likelihood that it will arrive at the correct answer to complex questions. If you have been an active user of consumer gen AI chat services over the past year, you must have noticed these improvements.
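The two prompting techniques mentioned here can be sketched as plain string templates. The function names, the Q/A layout and the example texts are assumptions made for illustration; real prompts vary by model and task, and no model is called in this sketch.

```python
def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Show worked question/answer pairs before the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    # Leave the final answer open so the model completes it in the same style.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to reason through intermediate steps before answering."""
    return (
        f"Q: {question}\n"
        "Think through the problem step by step, then give the final answer.\n"
        "A:"
    )


# Made-up examples that demonstrate a terse summary style.
examples = [
    ("Summarize: The pump overheated twice this week.", "Pump overheating, 2 events."),
    ("Summarize: The valve was replaced on Monday.", "Valve replaced Monday."),
]

fs = few_shot_prompt(examples, "Summarize: The fan stopped after the power cut.")
cot = chain_of_thought_prompt("If a job uses 4 calls per step and has 6 steps, how many calls is that?")

print(fs)
print()
print(cot)
```

The few-shot prompt ends on an open "A:" so the model continues the pattern, while the chain-of-thought prompt explicitly asks for the intermediate reasoning that produces the extra tokens.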

Gen AI 1.5: Retrieval-augmented generation, embedding models and vector databases

Another basis for progress is increasing the amount of information that an LLM can process. The latest models can now process up to 1M tokens (a full-length college textbook), enabling users interacting with these systems to control the context with which the models answer questions in ways that were not possible before.

It is now quite easy to take an entire complex legal, medical or scientific text and ask an LLM questions about it, with performance at 85% accuracy on relevant entrance exams for the field. I was recently working with a physician on answering questions about a complex 700-page guidance document, and I was able to set this up without any infrastructure using Anthropic's Claude.

Adding to this, the continued development of technology that leverages LLMs to store and retrieve text based on concepts rather than keywords further expands the available information.

New embedding models (with obscure names such as titan-v2, gte, or cohere-embed) enable the retrieval of similar text by converting text from diverse sources into “vectors” learned from correlations in very large datasets. Vector queries are being added to database systems (for example, vector functionality across the suite of AWS database solutions), and special-purpose vector databases such as turbopuffer, LanceDB, and Qdrant help scale these systems up. These systems have successfully scaled to 100 million multipage documents with limited performance degradation.
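As a rough illustration of concept-based retrieval, the sketch below ranks documents by cosine similarity between vectors. The three-dimensional vectors are hand-written stand-ins, not the output of any of the embedding models named above, and a production system would use a vector database rather than a Python dictionary.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k document texts whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


# Hand-written toy vectors (assumption): a real embedding model would produce
# hundreds or thousands of dimensions per document.
index = {
    "Resetting a tripped circuit breaker": [0.9, 0.1, 0.0],
    "Restoring power after an outage": [0.8, 0.2, 0.1],
    "Quarterly sales summary": [0.0, 0.1, 0.9],
}

# Stand-in for the embedding of a query such as "the electricity went out".
query_vec = [0.85, 0.15, 0.05]

results = top_k(query_vec, index, k=2)
print(results)
```

Neither of the two electrical documents shares a keyword with the example query, yet both rank above the sales summary because their vectors point in a similar direction.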

Scaling these solutions into production is still a complex endeavor, bringing together teams from multiple backgrounds to optimize a complex system. Security, scaling, latency, cost optimization and data/response quality are all emerging topics that lack standard solutions in the LLM-based applications space.

Gen AI 2.0 and agent systems

Although improvements in model and system performance are gradually raising the accuracy of solutions to the point where they are viable for almost every organization, both are still evolutions (gen AI 1.5, perhaps). The next evolution is in creatively tying together multiple forms of gen AI functionality.

The first steps in this direction will be to manually develop chains of action (a system such as ARIA, a gen-AI-powered virtual building manager, understands a picture of malfunctioning equipment, looks up relevant context in a knowledge base, generates an API query to pull relevant structured information from an IoT data feed and ultimately recommends a course of action). The limitation of these systems is in defining the logic to solve a given problem, which must either be hard-coded by the development team or be only 1-2 steps deep.
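A manually developed chain of this kind can be sketched as a fixed sequence of function calls. Every function below is a hypothetical stub with a canned return value; this is not how ARIA is implemented, only an illustration of why the logic is hard-coded: the order of steps lives in the code, not in the model.

```python
def describe_image(image_name: str) -> str:
    # Stub for a multimodal model call that interprets the photo.
    return "compressor with frost on the coils"


def search_knowledge_base(description: str) -> str:
    # Stub for a retrieval step over maintenance documentation.
    return "Frost on coils often indicates restricted airflow."


def build_iot_query(description: str) -> dict:
    # Stub for an LLM step that writes a structured query for the IoT feed.
    return {"sensor": "airflow", "window_hours": 24}


def recommend_action(context: str, query: dict) -> str:
    # Stub for the final generation step that combines everything.
    return f"Inspect the air filter. Basis: {context} Data checked: {query['sensor']}."


def run_fixed_chain(image_name: str) -> str:
    """Run the four steps in a fixed, developer-chosen order."""
    description = describe_image(image_name)
    context = search_knowledge_base(description)
    query = build_iot_query(description)
    return recommend_action(context, query)


recommendation = run_fixed_chain("unit-3.jpg")
print(recommendation)
```

If a new kind of problem needs a different order of steps or an extra lookup, a developer has to change `run_fixed_chain`.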

The next phase of gen AI (2.0) will create agent-based systems that use multimodal models in multiple ways, driven by a 'reasoning engine' (typically just an LLM today) that can help break down problems into steps, then choose from a set of AI-enabled tools to execute each step, taking the results of each step as context to feed into the next step while also rethinking the overall solution plan.
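The loop described here can be sketched in a few lines. In this sketch the reasoning engine is a hard-coded stand-in function rather than an LLM, and the tool names and return values are hypothetical; the point is only the control flow, in which the planner is consulted before every step and sees all earlier results.

```python
from typing import Callable, Optional


# Hypothetical tools: each would wrap a model or an API in a real system.
def lookup_records(task: str, context: list[str]) -> str:
    return "records: two prior admissions found"


def search_literature(task: str, context: list[str]) -> str:
    return "literature: 3 relevant trials found"


def summarize(task: str, context: list[str]) -> str:
    return f"summary based on {len(context)} earlier results"


TOOLS: dict[str, Callable[[str, list[str]], str]] = {
    "lookup_records": lookup_records,
    "search_literature": search_literature,
    "summarize": summarize,
}


def plan_next_step(task: str, context: list[str]) -> Optional[str]:
    """Stand-in for the reasoning engine (an LLM in a real agent).

    It picks the next tool from the results gathered so far and returns
    None when it considers the task complete.
    """
    order = ["lookup_records", "search_literature", "summarize"]
    return order[len(context)] if len(context) < len(order) else None


def run_agent(task: str, max_steps: int = 5) -> list[str]:
    """Plan, act, feed the result back as context, and plan again."""
    context: list[str] = []
    for _ in range(max_steps):  # cap the steps so the loop always terminates
        tool_name = plan_next_step(task, context)
        if tool_name is None:
            break
        context.append(TOOLS[tool_name](task, context))
    return context


results = run_agent("Prepare a briefing for the clinician")
print(results)
```

Unlike a fixed chain, the sequence of tool calls here is decided at run time by `plan_next_step`, so swapping the stub for a real model changes the behavior without changing the loop. The `max_steps` cap is one simple guard against runaway loops and runaway cost.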

By separating the data gathering, reasoning and action-taking components, these agent-based systems enable a much more flexible set of solutions and make much more complex tasks feasible. For programming, tools from Cognition Labs can go beyond simple code generation, performing end-to-end tasks like programming language changes or design pattern refactoring in 90 minutes with almost no human intervention. Similarly, Amazon's Q for Developers service enables end-to-end Java version upgrades with little to no human intervention.

In another example, imagine a medical agent system working out a course of action for a patient with end-stage chronic obstructive pulmonary disease. It can access the patient's EHR records (from AWS HealthLake), imaging data (from AWS HealthImaging), genetic data (from AWS HealthOmics), and other relevant information to generate a detailed response. The agent can also search for clinical trials, drugs and biomedical literature using an index built on Amazon Kendra to provide the clinician with the most accurate and relevant information to make informed decisions.

Additionally, multiple purpose-specific agents can work in sync to perform even more complex workflows, such as building a detailed patient profile. These agents can autonomously carry out multi-step knowledge generation processes that would otherwise require human intervention.

However, without extensive tuning, these systems will be extremely expensive to run, with thousands of LLM calls sending large numbers of tokens to the API. Therefore, parallel developments in LLM optimization techniques, including hardware (NVIDIA Blackwell, AWS Inferentia), frameworks (Mojo), cloud (AWS Spot Instances), models (parameter size, quantization) and hosting (NVIDIA Triton), must continue to be integrated with these solutions to optimize costs.
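A back-of-the-envelope calculation shows how quickly the token volume adds up. All numbers below are assumptions chosen for illustration: the per-1,000-token prices are hypothetical and do not reflect any provider's actual rates, and the call and token counts are invented.

```python
def estimate_cost(
    calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimated spend for one agent run, in the same currency as the prices."""
    input_cost = calls * input_tokens_per_call / 1000 * input_price_per_1k
    output_cost = calls * output_tokens_per_call / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices per 1,000 tokens (not real provider rates).
INPUT_PRICE = 0.003
OUTPUT_PRICE = 0.015

# An untuned run: every call resends a large context.
untuned = estimate_cost(2000, 4000, 500, INPUT_PRICE, OUTPUT_PRICE)

# The same run after trimming the context sent with each call.
trimmed = estimate_cost(2000, 1000, 500, INPUT_PRICE, OUTPUT_PRICE)

print(f"untuned run: {untuned:.2f}")
print(f"trimmed run: {trimmed:.2f}")
```

Under these assumed numbers, input tokens dominate the untuned run, which is why reducing the context passed between agent steps is one of the first levers to pull, alongside the hardware, model and hosting options listed above.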


As organizations mature in their use of LLMs over the next year, the game will be about getting the highest-quality output (tokens), as quickly as possible, at the lowest cost. This is a fast-moving target, so it is best to find a partner that is continuously learning from real-world experience running and optimizing gen AI-backed solutions in production.

Ryan Gross is Senior Director of Data and Applications at Caylent.
