Abstract

Have you ever been working on an unsupervised task and wondered, “How do I validate my algorithm at scale?”

In unsupervised learning, unlike supervised learning, the validation set has to be created and checked manually: we have to go through the classifications ourselves and measure the classification accuracy or some other score. The problem with manual classification is the time, effort, and work it requires, but this is actually the easy part of the problem.

Let’s assume we developed an algorithm and tested it thoroughly, manually going over all the classifications. What about future changes to that algorithm? After every change, we would have to check the classifications manually all over again. The classified data may change over time, and it may also grow to huge scales as our product evolves and our customer base grows, making the manual classification problem much harder.

Have you started to worry about your production algorithms already? Well, you shouldn’t!

After reading this, you will be familiar with our proposed method for validating your algorithm’s score easily, adaptively, and effectively against any change in the data or the model.

So let’s start detailing it from the beginning.

Why is it needed?

Continuous modifications to algorithms always happen, for example:

Runtime optimizations

Model improvements

Bug fixes

Version upgrades

How do we deal with those modifications? We usually use QA tests to make sure the system keeps working. The best among us might even develop regression tests to verify that, for several fixed scenarios, the classifications do not change.

What about data integrity?

But what about the real classifications on prod? Who verifies that they haven’t changed? We need to make sure we won’t cause any disasters on prod when deploying new changes to the algorithm.

For that, we have two possible solutions:

Naive solution – pass through all the classifications on prod (which is of course not possible)

Practical solution – use a sample of each customer’s data on prod, sized using the margin of error equation.

Margin of error

We are going to take a fixed-size sample from each customer’s data that represents the real distribution of the data with minimal deviation. We do this using the Margin of Error equation, which you may know from election surveys, where sample sizes are often derived from it.

So, how does it work?

We can invert the equation for calculating the margin of error to extract the required sample size.

We would like a maximum margin of error of 5%, and we use the constant Z = 1.96 for a 95% confidence level (this value can be changed for a different confidence level).

The extraction of the required sample size is demonstrated in the following equation:
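In standard notation (a reconstruction of the usual formulas, with Z the confidence constant, p the estimated proportion, for which p = 0.5 is the most conservative choice, e the desired margin of error, and N the full data size), the margin of error and the sample size n derived from it are:

$$e = Z\sqrt{\frac{p(1-p)}{n}} \qquad\Longrightarrow\qquad n = \frac{\dfrac{Z^2\,p(1-p)}{e^2}}{1 + \dfrac{Z^2\,p(1-p)}{e^2\,N}}$$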

This equation is an expansion of the one above: when the full data size is known, using it gives a more precise result. Otherwise, we are left with only the numerator, which is also fine when we don’t have the full data size.

This is a code block demonstrating the implementation of this equation in Python:
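A minimal sketch of this equation in Python (the function name and defaults, such as p = 0.5, are our own illustrative choices):

```python
import math


def required_sample_size(margin_of_error=0.05, confidence_z=1.96,
                         p=0.5, population_size=None):
    """Sample size needed for a given margin of error.

    Uses p = 0.5 by default, the most conservative estimate of the
    proportion. If population_size is given, applies the finite
    population correction (the full equation); otherwise only the
    numerator is used.
    """
    numerator = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population_size is None:
        return math.ceil(numerator)
    corrected = numerator / (1 + numerator / population_size)
    return math.ceil(corrected)
```

For a 5% margin of error at 95% confidence, this yields the familiar 385 samples, and fewer when the customer’s full data size is small enough for the correction to matter.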

We can now freeze those samples into what we call a “golden dataset” and use it as a supervised dataset for future modifications; it serves as a data integrity validator on real data from prod.

We should mention that because prod data may change over time, we encourage you to update this golden dataset from time to time.

The flow of work for end-to-end data integrity:

Manual classification to create a golden dataset

Maintaining a constant baseline of prod classifications

Developing a suite of score comparison tests

Integrating the quality check into the algorithm’s CI process

So, how will it all work together? You can see that in the following GIF:

We may now push any change to our algorithm code, and remain protected, thanks to our data integrity shield!

For further questions about data integrity checks, or data science in general, don’t hesitate to reach out to me at emeyuhas@zscaler.com.
