The evolution of artificial intelligence has led to significant advancements in natural language processing, particularly with the development of large language models (LLMs). These models have revolutionized the way we approach text analysis, generation, summarization, and translation. However, despite their impressive capabilities, LLMs are not without limitations. One of the most significant challenges they face is the phenomenon of hallucination, where the model generates false or misleading outputs [1]. This issue raises concerns about the practical deployment of LLMs in real-world applications, where accuracy and trustworthiness are paramount.

As LLMs have matured, the introduction of Retrieval Augmented Generation (RAG) pipelines has further enhanced their capabilities by allowing them to access external data sources. RAG pipelines bridge the gap between the vast knowledge stored within LLMs and the dynamic, ever-changing information available in external databases. By incorporating real-time data from these sources, RAG pipelines aim to mitigate some of the limitations of LLMs, such as outdated knowledge and non-transparent reasoning processes. However, the integration of RAG pipelines introduces new challenges, particularly in terms of data management and retrieval accuracy.

The transition from LLMs to agentic systems presents unique challenges that must be addressed to optimize performance and reliability [2]. Agents are designed to overcome the limitations of LLMs by extending their capabilities through the integration of tools and real-time data analysis. These systems can perform tasks that go beyond simple text generation, such as invoking UI widgets, conducting internet searches, and analyzing real-time data. However, the complexity of agentic systems also introduces new pitfalls, particularly related to the management and organization of data.

This paper explores the pitfalls associated with LLM-based systems and RAG pipelines, focusing in particular on data management issues that are often overlooked [3, 4, 5]. By examining common data problems, we aim to provide a practical guide to making agentic systems work for real-life problems by improving their performance and reliability. Effective data management is crucial for the success of LLM-based systems, as it directly impacts the accuracy and relevance of the generated outputs.

The remainder of this paper covers the evolution of LLMs, an overview of RAG, the transition to agent systems, the pitfalls encountered in these agentic systems, and proposed solutions to mitigate these challenges [6].

Large Language Models

Large language models have revolutionized the way we approach text analysis, generation, summarization, and translation. These models, trained on vast amounts of text data, can generate coherent and contextually relevant text, making them invaluable tools in various applications. For instance, they can summarize lengthy documents, translate text between languages, and generate creative content. The ability of LLMs to understand and generate human-like text has opened up new possibilities in fields such as customer service, content creation, and language translation.

Despite their impressive capabilities, LLMs are not without limitations, particularly in their tendency to produce hallucinations—false or misleading outputs that can undermine their reliability. Hallucinations occur when the model generates information that is not grounded in the training data or external knowledge sources. This issue is particularly problematic in applications where accuracy and trustworthiness are critical, such as in cybersecurity, financial recommendations, medical diagnosis or legal advice. The tendency of LLMs to hallucinate raises concerns about their practical deployment in real-world scenarios.

In real-life applications such as these, where errors can lead to financial loss or even loss of life, accuracy and trustworthiness are paramount [7]. Users need to be able to trust the outputs generated by these models, especially in high-stakes situations. Various techniques have been proposed to mitigate hallucinations, such as incorporating external knowledge sources and improving the training data. However, completely eliminating hallucinations remains a significant challenge, and ongoing research is needed to address this issue.

Ultimately, while LLMs offer significant advantages, their effectiveness is often overshadowed by the challenges posed by hallucinations. The ability to generate human-like text is a double-edged sword, as it can lead to the generation of plausible but incorrect information. As we continue to develop and deploy LLMs, it is crucial to address the limitations and ensure that these models can be trusted to provide accurate and reliable information. The integration of RAG pipelines and agentic systems offers a promising approach to enhancing the capabilities of LLMs and mitigating some of their limitations.

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a framework that enhances LLMs by allowing them to reference external knowledge bases during response generation [8,9,10,11,12]. This approach bridges the gap between the static knowledge stored within LLMs and the dynamic, ever-changing information available in external databases. By incorporating real-time data from these sources, RAG pipelines aim to improve the accuracy and relevance of the generated outputs [13]. The RAG framework consists of two main components: the retrieval model, which fetches relevant information, and the generative model, which synthesizes coherent text from the retrieved data, as presented in Figure 1.

Figure 1: RAG Framework from [8]

In a RAG pipeline, data is stored in ‘key’-‘value’ pairs, where documents are indexed in vector databases using a ‘key’, and a corresponding ‘value’ is returned for a matched ‘key’, often with additional metadata. The ‘key’ is typically a representation of the content, such as a sentence embedding, which maps a piece of text to a vector of numbers. The ‘value’ is the actual content or information associated with the ‘key’. This structure allows the system to efficiently retrieve relevant information based on a user’s query. The process of matching a query to a ‘key’ involves computing the distance between the query and the target ‘key’ in numeric space, over all ‘key’-‘value’ pairs stored in a vector database.
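
To make this matching step concrete, the sketch below is a minimal illustration, not a production vector database. It assumes a generic sentence-embedding function `embed` that maps text to a vector, and returns the ‘values’ whose ‘keys’ are closest to the query under cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity between two vectors in the numeric 'key' space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, store: list, embed, top_k: int = 3) -> list:
    """Return the 'values' whose 'key' embeddings are closest to the query.

    `store` is a list of (key_embedding, value) pairs; `embed` is any
    sentence-embedding function mapping text to a vector (an assumption here).
    """
    q = embed(query)  # map the query into the same vector space as the 'keys'
    ranked = sorted(store, key=lambda kv: cosine_similarity(q, kv[0]), reverse=True)
    return [value for _, value in ranked[:top_k]]
```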

Within documents, ‘key’ and ‘value’ can represent paragraph units or subunits called chunks, and in question-answer databases, the ‘key’ is the question while the ‘value’ is the answer. In the context of images, the ‘key’ could be the caption of the picture, and the ‘value’ the picture itself. Alternatively, if a multi-modal LLM is used, a textual description of the picture can serve as the ‘key’ to the picture. This flexible structure allows RAG pipelines to handle various types of data, including text, images, and other modalities. The ability to index and retrieve information from different sources enhances the versatility and applicability of RAG systems. Figure 2 summarizes how vector databases use this information to return a valid answer.

Figure 2: Key-Value Structure

Despite the advantages of RAG, the system still depends on the LLM's limited ability to reason about the retrieved text, indicating a need for capabilities beyond the documents themselves. While RAG pipelines can supply relevant information from external sources, many applications also require stronger reasoning and the ability to connect to real systems, for example performing an internet search or returning UI widgets. To address these limitations, it is essential to continuously update the external knowledge bases and improve the retrieval and generation processes.

From RAG to Agents

Agents are designed to overcome the limitations of LLMs by extending their capabilities through the integration of tools and real-time data analysis. While RAG extended LLMs by enabling them to index data outside the training cut-off date, agents add tools that enhance the LLM’s capabilities. These tools can include generic function calling, invoking UI widgets, and real-time data analysis, among others. By incorporating these tools, agents can perform tasks that go beyond simple text generation, such as executing functions, interacting with user interfaces, and analyzing real-time data.

In an agentic system, the indexing capabilities of vector databases enable the selection of which tool to execute based on ‘key’-‘value’ pairs. This is presented as the last ‘key’-‘value’ pair example in Figure 2. For example, the ‘key’ could be the docstring describing a tool or function, such as an internet search, and the ‘value’ could be the function itself, executed when matched. This structure allows the agent to dynamically select and execute the appropriate tool based on the user’s query. The ability to index and retrieve tools in this manner enhances the flexibility and functionality of agentic systems, enabling them to perform a wide range of tasks.
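
A minimal sketch of this tool-selection idea follows; the tool names and the `embed` helper are illustrative assumptions, not the API of any particular agent framework. Each function's docstring acts as the ‘key’ and the callable itself as the ‘value’:

```python
import numpy as np

def _similarity(a, b):
    # Same cosine similarity as in the earlier retrieval sketch.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def web_search(query: str) -> str:
    """Search the internet for up-to-date information on a topic."""
    raise NotImplementedError  # a real search API call would go here

def analyze_timeseries(query: str) -> str:
    """Analyze real-time numerical data and summarize its trends."""
    raise NotImplementedError  # real-time data analysis would go here

def select_tool(user_query: str, embed, tools=(web_search, analyze_timeseries)):
    # The 'key' is each tool's docstring; the 'value' is the callable itself.
    q = embed(user_query)
    return max(tools, key=lambda tool: _similarity(q, embed(tool.__doc__)))
```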

To fully leverage the power of AI Agents, we can structure them into the following components: a planner, tools, memory, and an agent core, implemented by an LLM, that facilitates interaction between the user and the AI Agent [14]. The planner is responsible for determining the sequence of actions to achieve a specific goal. Tools extend the behavior of the LLM by enabling analyses that the LLM alone could perform only in a limited way. The memory stores information about past interactions and experiences, allowing the agent to learn and adapt over time. For example, when an AI Agent does not understand a question and asks the user to clarify, memory allows the user to refer to a previous interaction instead of having to repeat it completely. The agent core drives the interaction between the user and the AI Agent, coordinating the various components to achieve the desired outcome. An example of an AI Agent with all these components is given in Figure 3.
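
The division into planner, tools, memory, and agent core can be sketched roughly as below; the class and method names are illustrative assumptions, and the planner is deliberately simplified to a single LLM call that lists tool names:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Memory:
    # Past interactions, so the user can refer back instead of repeating themselves.
    turns: list = field(default_factory=list)

@dataclass
class AgentCore:
    llm: Callable[[str], str]          # the LLM implementing the agent core
    tools: dict                        # tool name -> callable
    memory: Memory = field(default_factory=Memory)

    def plan(self, goal: str) -> list:
        # Planner: ask the LLM for an ordered list of tool names, one per line.
        prompt = f"History: {self.memory.turns}\nGoal: {goal}\nList the tools to call, one per line."
        return [line.strip() for line in self.llm(prompt).splitlines() if line.strip()]

    def run(self, goal: str) -> str:
        self.memory.turns.append(("user", goal))
        for tool_name in self.plan(goal):
            tool = self.tools.get(tool_name)
            if tool is not None:
                self.memory.turns.append((tool_name, tool(goal)))
        answer = self.llm(f"Answer the goal using: {self.memory.turns}")
        self.memory.turns.append(("agent", answer))
        return answer
```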

Figure 3: AI Agent Structure

Real-life examples of AI Agents include virtual assistants, customer service chatbots, and autonomous systems in various industries [15,16,17,18,19]. Virtual assistants like Siri and Alexa use agentic systems to perform tasks such as setting reminders, answering questions, and controlling smart home devices [20]. Customer service chatbots leverage AI agents to provide instant responses and round-the-clock support, enhancing customer experience and operational efficiency. Autonomous systems in industries such as healthcare, manufacturing, and logistics use AI agents to perform complex tasks, such as diagnosing medical conditions, optimizing production processes, and managing supply chains.

Pitfalls in Agent Systems and How to Fix Them

While the process of matching user intent to the target ‘key’ seems straightforward, it often involves approximations that can lead to errors. Sentence embedding, which is used to capture the topic sentence of a textual paragraph unit, is an approximation and is bound to produce mismatches. No matter how sophisticated the ‘key’ matching mechanism is (several works have suggested improvements to RAG pipelines), the reader should understand that the selection of the top-k ranking documents or tools is still an approximation of the user’s intent (which is specified in ambiguous natural language) and, as such, is prone to errors. For example, in a Q&A database of medical questions (ori_pqaa.json), slightly modifying a query about gastric cancer caused it to be matched to a prostate cancer question 9 out of 10 times, due to limitations in the embeddings.
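
One way to surface such mismatches before deployment is to perturb known queries and check whether the top-ranked ‘key’ still points to the expected answer. The sketch below assumes the `retrieve` and `embed` helpers from the earlier retrieval sketch:

```python
def count_mismatches(paraphrases, expected_value, store, embed, retrieve):
    """Count how often slight rewordings of a question retrieve the wrong answer.

    `retrieve` is the top-k retrieval function from the earlier sketch (used with k=1).
    """
    wrong = 0
    for question in paraphrases:
        top_value = retrieve(question, store, embed, top_k=1)[0]
        if top_value != expected_value:
            wrong += 1
    return wrong
```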

Figure 4: Number of mismatches in ori_pqaa.json medical Q&A database

The data ingested in RAG pipelines is multi-modal and dynamic, requiring careful management of data lineage to ensure accuracy and relevance [21]. Data lineage refers to the full traceability of data, from its source to its final use. This is particularly important when dealing with data that changes over time, such as corporate documents and other sources of information [22]. For example, the cabin of the F-16 fighter jet has evolved considerably over the years, as seen in Figure 5, and a user asking about the F-16 might expect information about the latest version unless specified otherwise. Ensuring accurate and up-to-date information requires meticulous tracking and management of data lineage.

Figure 5: Evolution of the cabin of the F-16, from the F-16A to the F-16V Viper

Our ability to detect useful embeddings that match actual topic sentences in paragraphs depends on how well documents are written. Over time, document quality has deteriorated as content creators have come to rely too heavily on multi-modal specifications, such as pictures. AI Agents relying on previously stored documents require significant effort to maintain data quality, especially at the language level. This is particularly true for technical documentation, which may become outdated or incomplete. For instance, a fictional instructional manual based on an actual Ikea manual, shown in Figure 6, might be too complex for multi-modal vector databases to extract useful information from. The picture might be too intricate, and the accompanying text might not provide any valuable context. Maintaining data quality involves regularly reviewing and updating documents to ensure they are clear, accurate, and useful for LLM-based systems.

Figure 6: Complex Instructional Manual, from

The promise of cloud storage has led to the accumulation of outdated and irrelevant data, highlighting the need for effective cataloging and debugging tools. Cloud storage allows organizations to keep all versions of documents, including old specifications, draft documents, and other irrelevant data. This accumulation can make it challenging to retrieve relevant information in RAG pipelines. Effective data lineage (as discussed previously), together with supporting cataloging and debugging tools, is essential to organize and manage this data, ensuring that only relevant and up-to-date information is used in the retrieval process [23,24,25].

To address data ingestion challenges, a new generation of tools is needed to maintain control and traceability over all ingested data. These tools should provide full traceability, allowing users to track changes over time and ensure that the information is accurate and relevant. For example, if the interface of a system changes, the ingested data should be updated to reflect the new interface. Maintaining control and traceability over data ingestion is crucial for the accuracy and reliability of LLM-based systems. We have observed that many RAG and AI Agent systems only discuss the initial ingestion of documents and tools, not the continuous evolution of such systems, which severely limits their applicability over time.
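
As a rough illustration of what such traceability might record, the sketch below defines a minimal lineage record per ingested document; the field names and example values are assumptions, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class LineageRecord:
    source_uri: str            # where the document originally came from
    version: str               # which revision of the source was ingested
    ingested_at: datetime      # when it entered the vector database
    supersedes: Optional[str]  # version this record replaces, if any

# Hypothetical example: a newer revision replacing an older one.
record = LineageRecord(
    source_uri="s3://corp-docs/product-spec.pdf",
    version="rev-C",
    ingested_at=datetime.now(timezone.utc),
    supersedes="rev-B",
)
```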

For deteriorated documents, rewriting is essential to ensure that technical documentation is clear and useful for LLM-based systems. Many technical documents contain only minimal information, such as a single line of text claiming that the next picture or code sample will show how the system works. Such documents are not useful for LLM-based systems, as they do not provide enough context for accurate information retrieval. Rewriting these documents to include detailed and clear explanations is necessary to improve data quality and usability.

The coexistence of old and new data emphasizes the need for proper organization and tagging before ingestion into RAG pipelines. Old and new data often coexist in organizational databases, making it challenging to retrieve relevant information. Proper organization and tagging of data are essential to ensure that only relevant and up-to-date information is ingested into RAG pipelines. This involves categorizing and labeling data accurately, so that the retrieval process can efficiently identify and use the most relevant information.
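
A small sketch of tagging before ingestion is shown below; the metadata keys (`status`, `valid_from`) are illustrative assumptions rather than fields required by any particular vector database:

```python
# Hypothetical metadata tags attached to documents before ingestion.
documents = [
    {"text": "Old interface specification ...", "status": "superseded", "valid_from": "2015-01-01"},
    {"text": "Current interface specification ...", "status": "current", "valid_from": "2023-06-01"},
]

def ingestable(doc: dict) -> bool:
    # Only documents tagged as current are allowed into the RAG index.
    return doc.get("status") == "current"

to_index = [doc for doc in documents if ingestable(doc)]
```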

Finally, to mitigate mismatches between user queries and target ‘keys’, it is crucial to identify failure points in the user’s databases, as shown in Figure 7, and to implement strategies that avoid wrong matches, such as inserting alternate ‘keys’, eliminating duplicate entries, and retraining ranking or embedding systems to improve accuracy. Continuous monitoring and updating of the retrieval process are crucial to ensure that the system remains accurate and reliable.
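
Two of these mitigations, inserting alternate ‘keys’ for the same ‘value’ and removing near-duplicate entries, can be sketched as follows, again assuming the generic `embed` helper and a similarity function from the earlier sketches:

```python
def add_alternate_keys(store, value, paraphrases, embed):
    # Index the same 'value' under several rephrasings of the question it answers.
    for alt in paraphrases:
        store.append((embed(alt), value))

def deduplicate(store, similarity, threshold=0.98):
    # Drop entries whose 'key' embeddings are nearly identical to one already kept.
    kept = []
    for key, value in store:
        if all(similarity(key, other_key) < threshold for other_key, _ in kept):
            kept.append((key, value))
    return kept
```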

Figure 7: By slightly modifying a user query from k to k2, we make the system select the answer related to ka instead.

Conclusions

The transition from LLMs to agentic systems presents both opportunities and challenges that must be navigated carefully. While agentic systems offer enhanced capabilities and flexibility, they also introduce new complexities, particularly related to data management. Addressing these challenges is crucial for optimizing the performance and reliability of LLM-based systems.

By addressing the pitfalls associated with data management in RAG pipelines, we can enhance the reliability and effectiveness of LLM-based systems. Effective data management practices, such as maintaining data lineage, ensuring data quality, and organizing and tagging data, are essential for the success of these systems. Implementing these practices, and adopting supporting tools will help mitigate the limitations of LLMs and improve the accuracy and relevance of the generated outputs, making these LLM-based systems useful in real life.

Ultimately, the integration of effective data management practices will be essential for the successful deployment of AI Agents in real-world applications. As we continue to develop and deploy these systems, it is crucial to address the challenges and ensure that they can be trusted to provide accurate and reliable information. By doing so, we can unlock the full potential of LLM-based systems and enhance their impact across various industries.

References

1. Xu, Z., Jain, S., Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv preprint arXiv:2401.11817 (2024). https://arxiv.org/abs/2401.11817

2. Multi-Agent Systems: Technical & Ethical Challenges of Functioning in a …, https://direct.mit.edu/daed/article/151/2/114/110611/Multi-Agent-Systems-Technical-amp-Ethical

3. Coelho Jr., C., Koratala, S. The Mythical LLM-Month. Zscaler, January 16, 2024. https://www.zscaler.com.br/blogs/product-insights/mythical-llm-month

4. Mastering RAG Systems: From Fundamentals to Advanced, with Strategic …, https://towardsdatascience.com/mastering-rag-systems-from-fundamentals-to-advanced-with-strategic-component-evaluation-3551be31858f

5. Glantz, W. 12 RAG Pain Points and Proposed Solutions. Towards Data Science. https://towardsdatascience.com/12-rag-pain-points-and-proposed-solutions-43709939a28c

6. Common pitfalls in deploying AI Agents to production, https://behavio.ghost.io/common-pitfalls-in-deploying-ai-agents-to-production/

7. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv preprint arXiv:2311.05232 (2023). https://arxiv.org/abs/2311.05232

8. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., Kiela, D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401 (2021). https://arxiv.org/abs/2005.11401

9. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2024). https://arxiv.org/abs/2312.10997

10. What is RAG? – Retrieval-Augmented Generation AI Explained – AWS, https://aws.amazon.com/what-is/retrieval-augmented-generation/

11. Retrieval-augmented Generation (RAG): A Comprehensive Guide, https://www.datastax.com/guides/what-is-retrieval-augmented-generation

12. What is retrieval-augmented generation (RAG)? – IBM Research, https://research.ibm.com/blog/retrieval-augmented-generation-RAG

13. A Beginner’s Guide to Evaluating RAG Pipelines Using RAGAS, https://medium.com/@erkajalkumari/a-beginners-guide-to-evaluating-rag-pipelines-using-ragas-24bb3808f81e

14. Components of AI Agents: An Overview – toloka.ai, https://toloka.ai/blog/ai-agents/

15. What are AI Agents?- Agents in Artificial Intelligence Explained – AWS, https://aws.amazon.com/what-is/ai-agents/

16. Understanding AI Agents: How They Work, Types, and Practical … – Medium, https://medium.com/@williamwarley/understanding-ai-agents-how-they-work-types-and-practical-applications-bd261845f7c3

17. Guide of AI Agent Types with examples | by Thomas Latterner – Medium, https://medium.com/@thomas.latterner/guide-of-ai-agent-types-with-examples-79f94a741d44

18. Exploring AI Agents: Real-World Examples and Applications, https://digitalon.ai/ai-agents-examples

19. AI Agents – Types, Benefits and Examples – Yellow.ai, https://yellow.ai/blog/ai-agents/

20. Practices for Governing Agentic AI Systems | OpenAI, https://openai.com/research/practices-for-governing-agentic-ai-systems/

21. Improve RAG data pipeline quality | Databricks on AWS, https://docs.databricks.com/en/ai-cookbook/quality-data-pipeline-rag.html

22. Libraries Are Even More Important to Contemporary Community Than We …, https://lithub.com/libraries-are-even-more-important-to-contemporary-community-than-we-thought/

23. Why Are Libraries Important? (19 Reasons) – Enlightio, https://enlightio.com/why-are-libraries-important

24. Why are libraries important? Here are 8 good reasons, https://blog.pressreader.com/libraries-institutions/why-are-libraries-important-here-are-8-good-reasons

25. The Importance of Libraries and Librarians, https://www.linkedin.com/posts/claudionor-coelho-jr-b156b01_the-importance-of-libraries-and-librarians-activity-7102363525943099392-NcqW