In July, I started an exciting journey with Cisco via the acquisition of Armorblox. Armorblox was acquired by Cisco to further its AI-first Security Cloud by bringing generative AI experiences to Cisco's security solutions. The transition was filled with excitement and a touch of nostalgia, as building Armorblox had been my focus for the past three years.

Quickly, however, a new mission came my way: build generative AI assistants that allow cybersecurity administrators to find the answers they seek quickly, and therefore make their lives easier. This was an exciting mission, given the "magic" that large language models (LLMs) are capable of and the rapid adoption of generative AI.

We started with the Cisco Firewall, building an AI Assistant that firewall administrators can chat with in natural language. The AI Assistant can help with troubleshooting tasks such as locating policies, summarizing existing configurations, providing documentation, and more.

Throughout this product development journey, I've encountered several challenges, and here I aim to shed light on them.

1. The Evaluation Conundrum

The first and most obvious challenge has been evaluation of the model. How do we know if these models are performing well?

There are several ways a model's responses can be evaluated:

Automated Validation – using metrics computed automatically on AI responses, without the need for any human review
Manual Validation – validating AI responses manually with human review
User Feedback Validation – signal directly from users or user proxies on model responses

Automated Validation

An innovative method that was proposed early on by the community was using LLMs to evaluate LLMs. This works wonders for generalized use cases but can fall short when assessing models tailored for niche tasks. For niche use cases to perform well, they require access to unique or proprietary data that is inaccessible to standard models like GPT-4.

Alternatively, a precise Q&A set can pave the way for the formulation of automated metrics, with or without an LLM. However, curating and bootstrapping such sets, especially ones demanding deep domain knowledge, can be a challenging task. And even with a perfect question-and-answer set, questions arise: Are these representative of user queries? How aligned are the golden answers with user expectations?

While automated metrics serve as a foundation, their reliability for specific use cases, especially in the initial phases, is arguable. However, as we expand the amount of real user data that can be used for validation, the importance of automated metrics will grow. With real user questions, we can benchmark against real use cases, and automated metrics become a stronger signal for good models.
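To make that concrete, here is a minimal sketch of the LLM-as-a-judge pattern over a tiny golden Q&A set. It assumes the OpenAI Python client for the judge model; generate_answer is a stub standing in for the assistant under test, and the judge prompt, example questions, and 1-5 rubric are purely illustrative rather than our production evaluation.

```python
# Minimal LLM-as-a-judge sketch over a tiny golden Q&A set.
# `generate_answer` is a stub for the assistant under test; the judge prompt,
# example data, and 1-5 rubric are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant for firewall administrators.
Question: {question}
Reference answer: {reference}
Assistant answer: {candidate}
Rate the assistant answer from 1 (wrong or misleading) to 5 (fully correct and
complete). Reply with a single integer."""

def generate_answer(question: str) -> str:
    """Stand-in for the real assistant (for example, a RAG pipeline)."""
    return "Rules 12 and 47 in policy X are configured to block."

def judge_answer(question: str, reference: str, candidate: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any strong general-purpose model can act as judge
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(resp.choices[0].message.content.strip())

golden_set = [
    {"question": "Which rules in policy X are set to block?",
     "reference": "Rules 12 and 47 are configured with a block action."},
]

scores = [judge_answer(ex["question"], ex["reference"],
                       generate_answer(ex["question"]))
          for ex in golden_set]
print(f"Mean judge score: {sum(scores) / len(scores):.2f}")
```

Even a rough score like this is mostly useful for catching regressions between versions, which is exactly where automated metrics become a stronger signal once real user questions start feeding the golden set.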
Manual Validation

Metrics based on manual validation have been particularly valuable early on. The first set of use cases for our AI assistant is aimed at allowing a user to become more efficient, either by compiling and presenting data coherently or by making information more accessible. For example, a firewall administrator quickly wants to know which rules are configured to block traffic for a particular firewall policy so they can consider making changes. Once the AI assistant summarizes their rule configuration, they want to know how to alter it, and the assistant gives them guided steps to configure the policy as desired.

The information and data the assistant presents can be manually validated by our team. This has already given me insight into some hallucinations and poor assumptions that the AI assistant is making.

Although manual metrics come with their own set of expenses, they can be more cost-effective than the creation of golden Q&A sets, which necessitate the involvement and expertise of domain specialists. It's essential to strike a balance to ensure that the evaluation process remains both accurate and budget-friendly.

User Feedback Validation

Engaging domain experts as a proxy for real customers to test the AI assistant before launch has proven invaluable. Their insights help develop tight feedback loops to improve the quality of responses.

Designing a seamless feedback mechanism for these busy experts is critical, so that they can provide as much information as possible on why responses are missing the mark. Instituting a regular team ritual to review and act on this feedback ensures continued alignment with expectations for the model responses.
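As a rough illustration of what a "seamless feedback mechanism" can boil down to on the backend, here is a small sketch that records structured feedback for later review. The field names, reason codes, and JSONL storage below are assumptions made for the example, not a description of our actual tooling.

```python
# Sketch of a lightweight feedback record for expert testers. Field names and
# reason codes are hypothetical; the goal is to capture *why* a response missed
# the mark, not just a thumbs up or down.
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

REASONS = {"hallucination", "wrong_policy", "incomplete", "too_slow", "other"}

@dataclass
class FeedbackRecord:
    conversation_id: str
    question: str
    response: str
    thumbs_up: bool
    reasons: list[str] = field(default_factory=list)  # subset of REASONS
    comment: str = ""                                 # free-text detail
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def save_feedback(record: FeedbackRecord, path: str = "feedback.jsonl") -> None:
    """Append one record per line so a regular review ritual can replay them."""
    unknown = set(record.reasons) - REASONS
    if unknown:
        raise ValueError(f"unknown reason codes: {unknown}")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

save_feedback(FeedbackRecord(
    conversation_id="c-123",
    question="Which rules block traffic in policy X?",
    response="Rules 12 and 47 are set to block.",
    thumbs_up=False,
    reasons=["incomplete"],
    comment="Missed rule 51, which also blocks."))
```

Keeping the reasons as a small controlled vocabulary makes the review ritual far easier to aggregate than free text alone.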
2. Prioritizing Initiatives Based on Evaluation Gaps

Upon reviewing evaluation gaps, the immediate challenge lies in effectively addressing them and monitoring them toward resolution. User feedback and evaluation metrics often highlight many problem areas, which naturally leads to the question: how do we prioritize and address these concerns?

Prioritizing the feedback we get is extremely important. The impact on the user experience and the loss of trust in the AI assistant are the core criteria for prioritization, along with the frequency of the issue.

The pathways for addressing evaluation gaps are varied, whether through prompt engineering, different models, or augmented-model strategies like knowledge graphs. Given the plethora of options, it becomes imperative to lean on the expertise and insights of the ML experts on your team. Given the rapidly evolving landscape of generative AI, it's also helpful to stay up to date with new research and best practices shared by the community. There are a number of newsletters and podcasts that I use to keep up with new developments, but the most useful tool has been Twitter, where the generative AI community is particularly strong.
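One way to turn those criteria (user-experience impact, loss of trust, and frequency) into a rough ordering is a simple weighted score. The sketch below is only a triage heuristic with made-up weights and example issues, not a formula we actually use.

```python
# Illustrative triage heuristic for evaluation gaps: rank issues by impact on
# the user experience, damage to trust, and how often they occur. The weights
# and example issues are arbitrary and meant to be tuned.
from dataclasses import dataclass

@dataclass
class EvalGap:
    name: str
    ux_impact: int     # 1-5: how badly it degrades the user experience
    trust_impact: int  # 1-5: how much it erodes trust (e.g., hallucinations)
    frequency: int     # occurrences observed per week

def priority(gap: EvalGap, w_ux: float = 1.0, w_trust: float = 1.5) -> float:
    return (w_ux * gap.ux_impact + w_trust * gap.trust_impact) * gap.frequency

gaps = [
    EvalGap("hallucinated rule names", ux_impact=4, trust_impact=5, frequency=3),
    EvalGap("slow policy summaries", ux_impact=3, trust_impact=1, frequency=10),
]
for gap in sorted(gaps, key=priority, reverse=True):
    print(f"{gap.name}: score={priority(gap):.1f}")
```

However you score it, the point is to make the tradeoff explicit so the team debates weights instead of individual anecdotes.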
3. Striking a Balance: Latency, Cost, and Quality

In the early phases of LLM application development, the emphasis is primarily on ensuring high quality. Yet as the solution evolves into a tangible, demoable product, latency, the amount of time it takes for a response to be returned to a user, becomes increasingly important. And when it's time to introduce the product as generally available, striking a balance between delivering exceptional performance and managing costs is key.

In practice, balancing these is tricky. Take, for instance, building chat experiences for IT administrators. If the responses fall short of expectations, do we modify the system prompt to be more elaborate? Or do we shift our focus to fine-tuning, experimenting with different LLMs or embedding models, or expanding our data sources? Each adjustment cascades, impacting quality, latency, and cost, and requires a careful, data-informed approach.

Depending on the use case, you may find that users will accept additional latency in exchange for higher quality. Knowing the relative value your users place on each of these will help your team strike the right balance. For the sustained success of the project, it's crucial to monitor and optimize all three areas according to the tradeoffs your users deem acceptable.
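To keep those tradeoffs visible, it helps to record latency and cost for every response from early on. Below is a minimal sketch that wraps an OpenAI-style chat completion call and derives cost from the returned token usage; the prices and model name are placeholders, and the same idea applies to any provider that reports token counts.

```python
# Sketch of per-response latency and cost bookkeeping. Token prices and the
# model name are placeholders; substitute the rates for the model you use.
import time
from openai import OpenAI

client = OpenAI()
PRICE_PER_1K_INPUT = 0.005   # placeholder USD per 1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.015  # placeholder USD per 1K completion tokens

def timed_completion(messages, model="gpt-4o"):
    start = time.perf_counter()
    resp = client.chat.completions.create(model=model, messages=messages)
    latency_s = time.perf_counter() - start
    usage = resp.usage  # prompt_tokens / completion_tokens from the API
    cost = (usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT
            + usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT)
    return resp.choices[0].message.content, {
        "latency_s": round(latency_s, 2), "usd": round(cost, 5)}

answer, metrics = timed_completion(
    [{"role": "user", "content": "Summarize the block rules in policy X."}])
print(metrics)
```

Logging these two numbers next to whatever quality signal you trust (judge scores, thumbs-up rate) is what makes the quality-latency-cost conversation data-informed rather than anecdotal.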
The Future of LLM Applications

It's been an exciting start to the journey of building products with LLMs, and I can't wait to learn more as we continue building and shipping awesome AI products.

It's worth noting that my main experience so far has been with chat experiences using vector-database retrieval-augmented generation (RAG) and SQL agents. But with advancements on the horizon, I'm optimistic about the emergence of autonomous agents with access to multiple tools that can take actions for users.
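For readers newer to the pattern, here is a bare-bones sketch of vector-retrieval RAG: embed a handful of documents, find the ones closest to the question, and hand them to the model as context. The in-memory NumPy search stands in for a real vector database, and the documents and model names are illustrative.

```python
# Minimal RAG sketch: embed documents, retrieve the closest ones for a
# question, and include them in the prompt. In-memory search stands in for a
# real vector database; the documents are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Policy X contains rules 12 and 47, both configured with a block action.",
    "To edit a rule, open the access control policy, select the rule, and save.",
    "Intrusion policies are managed separately from access control policies.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

DOC_VECTORS = embed(DOCS)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed([question])[0]
    sims = DOC_VECTORS @ q / (
        np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    return [DOCS[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context.\n" + context},
            {"role": "user", "content": question},
        ])
    return resp.choices[0].message.content

print(answer("Which rules in policy X block traffic?"))
```

SQL agents follow the same shape, except the retrieval step is the model writing and running a query against a database instead of a similarity search.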
Recently, OpenAI released their Assistants API, which will enable developers to more easily access the potential of LLMs to operate as agents with multiple tools and larger contexts.
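As a rough sketch of what that flow looks like in code, the snippet below creates an assistant with the code interpreter tool, asks a question on a thread, and polls the run until it finishes. It follows the beta Assistants API in the OpenAI Python SDK as I understand it at the time of writing; tool types, parameters, and response shapes may change, so treat it as a sketch rather than a reference.

```python
# Rough sketch of the beta Assistants API flow: assistant -> thread -> run.
# Details (tool types, response shapes) may differ across SDK and API versions.
import time
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Firewall Helper (demo)",
    instructions="Help firewall administrators understand their policies.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Summarize what a block rule does in an access control policy.",
)

run = client.beta.threads.runs.create(thread_id=thread.id,
                                      assistant_id=assistant.id)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)  # naive polling; real code should add a timeout
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # most recent message first
```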
For a deeper dive into AI agents, check out this talk by Harrison Chase, the founder of LangChain, and this intriguing episode of the Latent Space podcast that explores the evolution and complexities of agents.

Thanks for reading! If you have any comments or questions, feel free to reach out. You can follow my thoughts on X or connect with me on LinkedIn.

We'd love to hear what you think. Ask a question, comment below, and stay connected with Cisco Security on social!

Cisco Security Social Channels

Instagram | Facebook | Twitter | LinkedIn