Bayesian Analysis for IIoT: Bayesian estimation for equipment intervention
It’s not often that a data scientist can influence a multi-million dollar decision. In such situations, data scientists need confidence in their analysis and in the recommendations that follow from it. Establishing trust is one of the pillars of successful data science collaborations.
Oftentimes, I find myself facing a situation where an evaluation could lead to an expensive equipment intervention. Balancing equipment reliability with capital decisions is a delicate business. This is where Bayesian analysis can come in handy. The remainder of this post walks through a “real” business case and can act as an introduction to using Bayesian analysis in an IIoT context. There will be some code; however, I will abstract away the math behind Bayesian estimation as much as possible.
What is the goal of this analysis?
We will dip our toes into Bayesian analysis to infer what one can say about the state of a turbine by analyzing its flow function (image below). Bayesian estimation will allow us to assess how big, and how real, a change the flow function is experiencing over time.
Steam Turbine Flow Function
Why does this matter?
The Steam Turbine Flow Function is a parameter that can be used to assess changes to a turbine's steam path, i.e., changes or damage to the blades. In this use case, the flow function is plotted over time. Sustained changes to the flow function generally indicate a need for intervention and blade replacement. While the subject of this analysis is a turbine, the method applies to any type of equipment and measurement: deviation from a pump curve, or the velocity ratio (a dimensionless parameter) for aortic valves. To summarize, we are interested in understanding and quantifying changes to an engineering parameter, and hence to equipment health, over time.
In the plot, shown again below, there are 4 groups. Those groups coincide with actual operational events. I will elaborate on the significance of those groups in later sections. It’s clear that there is a significant change between groups 1 and 2. The changes going from group 2 to group 3 and from group 3 to group 4 are more open to interpretation, and that is where Bayesian estimation can help.
Flow function trend split into groups.
Wait, but why Bayesian?
There is an implied viewpoint in this blog post: data analysis for IIoT takes a different approach than predicting sales or forecasting time series data. For one, most subject matter experts are generally interested in causal inference. We have a lot of prior engineering knowledge about the systems, but we don’t know enough until we can observe some data. This means it’s generally less about figuring out which features to use and more about building on top of existing knowledge to improve model predictions. Those constraints make Bayesian data analysis very effective. That’s not to say Bayesian analysis is overall better than a classical ML-based analysis for such problems; rather, it’s another useful tool in one’s toolbox.
Here in particular, the goal is to assess differences between groups. To do this we need a statistical model that evaluates true differences given the measurement noise/uncertainty observed in the flow function. What separates the groups are the different operational events/treatments that occurred.
But why not just use a run-of-the-mill null-hypothesis t-test?
In short, Bayesian estimation allows a data scientist to not only reject but potentially accept the hypothesis, which is a big deal when deciding if a change really occurred. Moreover, one needs to run only a single model to look at the differences in means between all 4 groups as well as the differences in standard deviations. In a classical null-hypothesis setting, each property requires a separate test. Those points and more are made very eloquently in the Kruschke paper and in the PyMC notebook listed in the references.
We have a lot of prior engineering knowledge about the systems, but we don’t know enough until we can observe some data.
Getting down to business
The Bayesian analysis here is done using PyMC3; there will be future updates using TensorFlow Probability (TFP) and Pyro.
As a reminder, we’re trying to determine whether the turbine continues to experience degradation. The choice of group breakdown is driven by specific events and operational changes. The histogram plots below show the distribution for each group.
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
flowdf.hist('Flow_function', by='group', figsize=(12, 8))
Histogram plots showing data for each group
We can see from the histograms as well as from the flow function trend that group 2 has a big shift in mean, which coincides with an operational event. The changes in means from group 2 to groups 3 and 4 are less obvious.
A bit of practical background around Bayesian estimation
For many data scientists, starting with Bayesian methods feels like making mountains out of molehills. There is a reason for this apparent madness. I will briefly describe, at a conceptual level, how classical machine learning and Bayesian analysis work.
In classical machine learning, we have y = f(X), where f(X) could be a neural network, linear regression, SVM, etc. We start by defining f(), otherwise known as the model; we then use an optimization algorithm (e.g., gradient descent) along with an objective function to find the parameters of the model. Basically, this is what goes on behind the scenes when one types the following two lines in scikit-learn:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
In Bayesian analysis, we start by defining a model for our parameters; we call that the prior. This is usually a distribution of some sort, e.g., a normal or Poisson distribution. We then define a model for our data given the parameters; this could be a linear regression or a neural network. We then “fit” the data by drawing thousands of samples of the parameters. Each sample is informed by the prior distribution and by how well it explains the data we’re modeling. This sampling is what’s generally called MCMC (Markov Chain Monte Carlo). Sampling the parameters against the entire data set gives us the final result, the posterior: the model with updated parameters. Both the prior and the posterior are represented as distributions, like a normal distribution.
Before jumping into code, we need to explain the steps a bit more, starting with the famous Bayes’ theorem:
Bayes’ theorem: p(parameters | data) = p(data | parameters) · p(parameters) / p(data)
In a nutshell, the theorem says that if someone has a linear model y = ax + b, we start with an initial assumption about the parameters a and b (the prior); those parameters are updated when presented with the data, and the final result is the posterior.
The complexity generally comes from the seemingly simple term p(data) (the evidence). To compute p(data), one has to integrate p(data | parameters) · p(parameters) over all the model parameters. So, in the simple y = ax + b case, the integration looks something like this:

p(data) = ∫∫ p(data | a, b) · p(a, b) da db
Anyone who has done a two- or three-parameter integration can attest to how hard this can become for some functions; in many cases it can’t be calculated analytically. In our use case, we have a fairly complicated model with 9 parameters (explained later), and it isn’t easy to integrate over 9 parameters. Hence, we use MCMC as a numerical method to approximate the posterior without ever computing p(data) directly. Besides MCMC, there are other ways of generating the posterior, but I won’t cover them in this blog post.
So, in layman’s terms: 1) we start our analysis with a function/model, just like we do in classical machine learning, say LinearRegression(); 2) we set up an initial assumption about the model parameters (the prior); 3) we observe the data and update the range of credible values for those parameters using an MCMC algorithm in order to get the posterior.
So, there are 3 main steps:
1. Define the model for the data that will generate the Posterior.
2. Define a prior distribution from which the model parameters (the a and b above) will be sampled.
3. Generate new credible range of parameters using MCMC and hence get the Posterior.
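To make these three steps concrete before applying them to the turbine data, here is a minimal PyMC3 sketch for the y = ax + b example. The data and prior choices below are made up purely for illustration:

import numpy as np
import pymc3 as pm

# Hypothetical data: y = 2x + 1 plus noise, for illustration only
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.1, size=50)

with pm.Model() as linear_model:
    # Step 2: priors for the parameters a and b (and the noise level σ)
    a = pm.Normal("a", mu=0, sd=10)
    b = pm.Normal("b", mu=0, sd=10)
    σ = pm.HalfNormal("σ", sd=1)

    # Step 1: the data model, y = a*x + b with normally distributed noise
    y_obs = pm.Normal("y_obs", mu=a * x + b, sd=σ, observed=y)

    # Step 3: MCMC sampling gives the posterior over a, b, and σ
    trace = pm.sample(1000, tune=1000)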
Step 1: Defining the data model, the f(x)
The goal of the analysis is to estimate how the flow function changed between groups 1, 2, 3, and 4. To make this evaluation, we can compare statistics such as the mean and standard deviation between the groups. To perform this comparison, I chose the Student-t distribution as a descriptive model for the data in each group. I could have chosen a Gaussian distribution; however, a Student-t distribution is more robust to the outliers present in the data. It’s important to note that the Student-t distribution here is used to describe the data distribution; this is not the t-distribution of a t-test. Said differently, the Student-t distribution is the f(x): the model used to find the descriptive statistics of each group.
The Student-t distribution has three parameters: a mean μ, a precision (inverse variance) λ, and a degrees-of-freedom parameter ν:
Student-t distribution: f(x | μ, λ, ν) = [Γ((ν + 1)/2) / Γ(ν/2)] · sqrt(λ / (πν)) · [1 + λ(x − μ)² / ν]^(−(ν + 1)/2)
The variable ν is assumed here to be the same for all 4 groups. The mean and standard deviation parameters will be different for each group. This leads to a model with a total of 9 parameters (a mean and a standard deviation parameter for each of the 4 groups, and one ν shared by all groups). Large values of ν make the distribution resemble a normal distribution, while small values lead to heavy tails, i.e., outliers.
Step 2: Defining the model’s initial parameter values, AKA the Prior
The analysis starts with the selection of priors for the mean, standard deviation, and ν parameters.
a. The Model mean parameters
Each group’s mean parameter gets a normal prior. This normal distribution has its mean set to that of the entire data set and a standard deviation equal to twice the standard deviation of the data.
Mean parameter µ for group 1 sampled from a normal distribution.
import pandas as pd
import numpy as np
import pymc3 as pm

μ_m = flowdf['Flow_function'].values.mean()
μ_s = flowdf['Flow_function'].values.std() * 2

with pm.Model() as model:
    group1_mean = pm.Normal("group1_mean", mu=μ_m, sd=μ_s)
    group2_mean = pm.Normal("group2_mean", mu=μ_m, sd=μ_s)
    group3_mean = pm.Normal("group3_mean", mu=μ_m, sd=μ_s)
    group4_mean = pm.Normal("group4_mean", mu=μ_m, sd=μ_s)
b. The Model standard deviation parameters
Each group’s standard deviation parameter gets a uniform prior. The uniform distribution has a lower limit of (1/100) × (data set standard deviation) and an upper limit of 100 × (data set standard deviation):
σ_low = (μ_s / 2) / 100
σ_high = (μ_s / 2) * 100

with model:
    group1_std = pm.Uniform("group1_std", lower=σ_low, upper=σ_high)
    group2_std = pm.Uniform("group2_std", lower=σ_low, upper=σ_high)
    group3_std = pm.Uniform("group3_std", lower=σ_low, upper=σ_high)
    group4_std = pm.Uniform("group4_std", lower=σ_low, upper=σ_high)
c. The Model degrees-of-freedom parameter (ν)
The original paper by Kruschke selects the prior for ν from an exponential distribution. This analysis also uses an exponential distribution, with a mean of 30: a small value of ν means the data is heavy tailed, while a larger ν means the data is close to normally distributed. Using an exponential prior for ν means the Student-t model, f(x), can adapt to anything from normal data (no outliers) to heavy-tailed data (with outliers). For ν > 30 the model takes roughly the shape of a normal distribution; for ν < 30 it takes the shape of a normal distribution with long tails.
with model:
    ν = pm.Exponential("ν_minus_one", 1 / 29.0) + 1

# The way we arrive at the exponential prior above is:
# p(ν | λ) = (1/λ) * exp[-(ν - 1)/λ], ν ≥ 1, with λ = 29

pm.kdeplot(np.random.exponential(30, size=10000), fill_kwargs={"alpha": 0.5})
Exponential Distribution for ν
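One piece the snippets above do not show is the likelihood that ties the priors to the observed flow function values, mirroring the BEST model in the PyMC notebook listed in the references. A minimal sketch is below; the “group” column and the filtering logic are my assumption about how flowdf is organized, not code from the original analysis:

with model:
    # pm.StudentT expects a precision (lam), so convert each group's standard deviation
    λ1 = group1_std ** -2
    λ2 = group2_std ** -2
    λ3 = group3_std ** -2
    λ4 = group4_std ** -2

    # Observed flow function values for each group (column and filter names assumed)
    group1 = pm.StudentT("group1", nu=ν, mu=group1_mean, lam=λ1,
                         observed=flowdf.loc[flowdf["group"] == 1, "Flow_function"].values)
    group2 = pm.StudentT("group2", nu=ν, mu=group2_mean, lam=λ2,
                         observed=flowdf.loc[flowdf["group"] == 2, "Flow_function"].values)
    group3 = pm.StudentT("group3", nu=ν, mu=group3_mean, lam=λ3,
                         observed=flowdf.loc[flowdf["group"] == 3, "Flow_function"].values)
    group4 = pm.StudentT("group4", nu=ν, mu=group4_mean, lam=λ4,
                         observed=flowdf.loc[flowdf["group"] == 4, "Flow_function"].values)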
Step 3: Get the Posterior, model.fit()
Up to this point we were setting up the probabilistic model. This step fits the model using MCMC and gets the posterior. The estimates of interest are the comparisons between groups, because we want to be certain about the impact of the operational events on the steam turbine flow function. The comparisons between groups come in the form of differences in means, differences in standard deviations, and effect sizes. Those measures are set up as Deterministic nodes; wrapping them in pm.Deterministic tells PyMC3 that we want their sampled values returned as part of the output.
Another measure of interest is the effect size: the difference in means scaled by the pooled standard deviation. It will help us understand the practical significance of the change.
with model:
    diff_of_means_pre = pm.Deterministic("difference of means pre", group2_mean - group1_mean)
    diff_of_stds_pre = pm.Deterministic("difference of stds pre", group2_std - group1_std)
    effect_size_pre = pm.Deterministic("effect size pre",
                                       diff_of_means_pre / np.sqrt((group2_std ** 2 + group1_std ** 2) / 2))

    diff_of_means_post = pm.Deterministic("difference of means post", group3_mean - group2_mean)
    diff_of_stds_post = pm.Deterministic("difference of stds post", group3_std - group2_std)
    effect_size_post = pm.Deterministic("effect size post",
                                        diff_of_means_post / np.sqrt((group3_std ** 2 + group2_std ** 2) / 2))

    diff_of_means_new = pm.Deterministic("difference of means new", group4_mean - group3_mean)
    diff_of_stds_new = pm.Deterministic("difference of stds new", group4_std - group3_std)
    effect_size_new = pm.Deterministic("effect size new",
                                       diff_of_means_new / np.sqrt((group4_std ** 2 + group3_std ** 2) / 2))
The line of code below performs the MCMC sampling, hence fitting the model:
with model:
    trace = pm.sample(draws=2000, tune=1000)
To better understand how MCMC sampling works and the NUTS (No-U-Turn Sampler) step algorithm used under the hood by PyMC, I recommend reading this tutorial.
Next, we plot the posterior results for each group. The plots summarize the posterior distribution of each model parameter, presenting the 95% credible interval and the mean for each parameter.
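One plausible way to produce those per-group summaries, reusing the variable names defined earlier (a sketch, not necessarily the exact call used in the original analysis):

pm.plot_posterior(trace,
                  var_names=["group1_mean", "group2_mean", "group3_mean", "group4_mean",
                             "group1_std", "group2_std", "group3_std", "group4_std",
                             "ν_minus_one"],
                  color="#87ceeb")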
Our main interest is in the group differences. The code below plots the differences between groups.
pm.plot_posterior(trace,
                  var_names=["difference of means pre", "difference of stds pre", "effect size pre",
                             "difference of means post", "difference of stds post", "effect size post",
                             "difference of means new", "difference of stds new", "effect size new"],
                  ref_val=0, color="#87ceeb")
The big shift from group 1 to group 2 that we discussed before can also be seen here. The vertical orange line marks the reference value of 0. The definite change in mean with little or no change in standard deviation is a good confirmation of the change in the underlying steam turbine flow function. The narrow distributions of the change in mean and of the effect size tell us that a point estimate would have been just as good; this is not surprising because the change from group 1 to group 2 is hard to miss.
In the comparisons of group 2 to group 3 (post operational change) and group 3 to group 4 (new operations event) below, there is an evident change in the mean; however, the variance has also increased. The effect size, while still positive, has held steady across these later comparisons but is meaningfully smaller than the initial change between groups 1 and 2. This points to degradation in the steam turbine that is slower, but continuing.
Absent other information, this continued albeit less obvious degradation, along with the increase in standard deviation, calls for intervention on this steam turbine at its next scheduled repair, rather than allowing the machine to continue running past that point.
To close
We were interested in understanding whether a critical measurement of steam turbine health had changed following known operational events. Bayesian estimation of group differences was used to confidently answer this question. This real engineering use case demonstrates how Bayesian estimation can serve as a very effective decision-making tool.
The analysis presented here is relatively simple, as Bayesian estimation goes, but it packs a big punch. In this blog post I didn’t compare Bayesian estimation with null-hypothesis t-tests; I encourage readers to review the excellent work by Kruschke, which covers the differences between Bayesian estimation and null-hypothesis testing in great detail. The methods used here are based on the work presented in the Kruschke paper.
References:
1. Bayesian Estimation Supersedes the t Test, John K. Kruschke
2. Bayesian Estimation Supersedes the t-Test, PyMC example notebook
3. Bayesian Data Analysis, 3rd edition, A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin