Bayesian multivariable linear regression in PyMC3
A motivating example for Sklearn users interested in Bayesian analysis
Bayesian linear regression (BLR) is a powerful tool for statistical analysis. BLR models provide a full probability distribution over parameter values rather than the single best-fit value of standard (frequentist) linear regression. In addition, BLR can constrain parameters to a specified interval and can be extended to hierarchical models. However, despite decades of development and open-source libraries, BLR has yet to reach its full potential user base, in large part because of the barrier to entry for new users. In this blog post I hope to lower that barrier by demonstrating how to:
- Create a model
- Train a model
- Create a traceplot and summary statistics
- Run the model on test data
Create the data
In this first step, the necessary libraries are imported and the data set is created. PyMC3 is the Bayesian analysis library, and Theano is the back-end of PyMC3 that handles the vector/matrix math. Theano must be imported directly in order to create shared variables, which will later let us swap the training data out for the test data.
In this example, the data set has only two features and 1000 data points, but these values can be changed through the num_features and data_points variables. The variables beta_set and alpha_set are the true slopes and intercept, respectively, that the model will try to recover later.
The variables X and y are created from the slope and intercept values, and normally distributed random noise is added to y. Finally, X and y are split into training and testing sets via scikit-learn’s train_test_split function.
In [1]:
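A minimal sketch of what this cell might contain is shown below. The true parameter values (alpha_set, beta_set), the noise scale, and the random seed are illustrative assumptions, not values taken from the original post.

```python
import numpy as np
import pymc3 as pm
import theano
from sklearn.model_selection import train_test_split

np.random.seed(0)

num_features = 2      # number of slopes/features
data_points = 1000    # number of observations

# "True" parameters that the model will later try to recover (illustrative values)
alpha_set = 0.5
beta_set = np.array([1.5, -1.5])

# Build the features, then add normally distributed noise to the response
X = np.random.normal(size=(data_points, num_features))
y = alpha_set + X.dot(beta_set) + np.random.normal(scale=1.0, size=data_points)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```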
Plot the data set
Next, let’s visualize the data we are fitting. The feature X1 shows a positive correlation with y, whereas X2 shows a negative correlation.
In [4]:
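One possible version of the plotting cell, assuming matplotlib and the variable names from the sketch above (the styling is illustrative):

```python
import matplotlib.pyplot as plt

# Scatter y against each feature to see the positive (X1) and negative (X2) trends
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for i, ax in enumerate(axes):
    ax.scatter(X_train[:, i], y_train, alpha=0.3)
    ax.set_xlabel('X{}'.format(i + 1))
axes[0].set_ylabel('y')
plt.show()
```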
Create the PyMC3 model
Next, we create the PyMC3 model. In PyMC3, all model parameters are attached to a PyMC3 Model object, instantiated with pm.Model. We first declare the Theano shared variables, one for the input and one for the output, and then define the model parameters inside the with block. Alpha is the prior distribution for the intercept. Beta is the prior distribution for the slopes, with a dimension set by the number of features. S is the prior distribution describing the noise around the data. Data_est is the expected form, or model, of the data. Finally, y is the likelihood.
In [4]:
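A sketch of the model cell following the description above; the variable names (model_input, model_output, alpha, betas, s, data_est) follow the text, while the prior widths are assumptions:

```python
# Shared variables so the training data can later be swapped out for the test data
model_input = theano.shared(X_train)
model_output = theano.shared(y_train)

with pm.Model() as model:
    # Broad (vague) normal priors for the intercept and the slopes
    alpha = pm.Normal('alpha', mu=0, sd=10, shape=(1,))
    betas = pm.Normal('betas', mu=0, sd=10, shape=(1, num_features))

    # Half-normal prior on the width of the noise around the data
    s = pm.HalfNormal('s', sd=10)

    # Expected value of the data given the parameters (the model of the data)
    data_est = alpha + pm.math.dot(model_input, betas.T).flatten()

    # Likelihood of the observed data
    y_obs = pm.Normal('y', mu=data_est, sd=s, observed=model_output)
```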
It is often difficult to visualize the model from the several lines of code above. When first building a model, it is helpful to draw out your design as shown below.
Figure 2 shows the hierarchical diagram of the model. At the base of the model is the datum, yi, which is a normally distributed random value with mean μi and width σ. The width is described by a half-normal distribution. The mean value is described by the linear equation written out below, where the intercept (α) and slopes (β) are given broad normal priors that are vague compared to the scale of the data. Each feature has its own slope coefficient described by a normal prior.
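Written out explicitly (this reconstructs the relationships described above; the prior widths of 10 match the model sketch and are assumptions, not values from the original figure):

```
yᵢ ~ Normal(μᵢ, σ)
μᵢ = α + β₁·x₁ᵢ + β₂·x₂ᵢ
α  ~ Normal(0, 10)
βⱼ ~ Normal(0, 10)
σ  ~ HalfNormal(10)
```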
Train the model
Next, we train the model. In this example we use NUTS (the No-U-Turn Sampler) as the step method. The pm.sample command then draws samples from the posterior.
In [5]:
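A sketch of the sampling cell; the split of the 2500 iterations per chain seen in the progress bars into draws and tuning steps is an assumption:

```python
with model:
    step = pm.NUTS()   # No-U-Turn Sampler as the step method
    # 2000 posterior draws plus 500 tuning steps per chain, two chains
    trace = pm.sample(2000, tune=500, step=step, chains=2)
```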
Sequential sampling (2 chains in 1 job)
NUTS: [s, betas, alpha]
100%|█████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:05<00:00, 432.98it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:02<00:00, 836.41it/s]
Traceplot and summary statistics
Once the model has completed training, we can plot the traceplot and look at the summary statistics. The left column shows the posterior distributions for alpha, the betas, and the standard deviation of the noise in the model. From these graphs we get not only the mean value of each estimate, but the entire probability distribution. This is, in effect, what makes probabilistic programming and Bayesian analysis unique and, in many cases, superior to frequentist statistics. The right column shows the sampled value of each parameter at each step.
In [6]:
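The traceplot is generated with PyMC3’s built-in plotting function:

```python
# Posterior distributions (left) and per-step sampled values (right)
pm.traceplot(trace)
plt.show()
```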
In [7]:
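The summary statistics come from a single call:

```python
# Tabulated posterior statistics and convergence diagnostics
pm.summary(trace)
```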
By default, pm.summary reports the mean, the standard deviation (sd), mc_error (the Monte Carlo standard error, roughly sd/sqrt(iterations) for independent draws), the bounds of the 95% highest posterior density (HPD) interval, n_eff (the effective sample size), and Rhat (the Gelman-Rubin potential scale reduction factor). Rhat and n_eff are both used to judge whether the chains mixed well and the solution has converged. Rhat compares the average variance within each chain to the variance of the pooled draws across all chains; as the model converges, Rhat approaches 1.
|            | mean      | sd       | mc_error | hpd_2.5   | hpd_97.5  | n_eff       | Rhat     |
|------------|-----------|----------|----------|-----------|-----------|-------------|----------|
| alpha__0   | 0.480106  | 0.037595 | 0.000774 | 0.405709  | 0.550487  | 2975.210769 | 1.001104 |
| betas__0_0 | 1.557257  | 0.036890 | 0.000658 | 1.487401  | 1.628156  | 3147.314093 | 0.999515 |
| betas__0_1 | -1.562830 | 0.038143 | 0.000744 | -1.634971 | -1.484338 | 2849.492200 | 0.999755 |
| s          | 0.986880  | 0.027428 | 0.000468 | 0.930528  | 1.038020  | 3049.781251 | 0.999667 |
Test the trained model on the test data
Next, we want to test the model on the test data. This is where the shared variables we created become useful: we can swap the training data out for the test data via Theano’s set_value method. Once the shared values are set, we draw samples from the posterior predictive distribution for each test data point. This is accomplished through a posterior predictive check (PPC), which here draws 1000 samples from the trace for each data point.
In [8]:
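A sketch of the PPC cell, assuming the shared-variable names from the model sketch above:

```python
# Swap the training data out for the test data
model_input.set_value(X_test)
model_output.set_value(y_test)

# Draw 1000 posterior predictive samples for each test point
# (pm.sample_posterior_predictive in newer PyMC3 releases)
with model:
    ppc = pm.sample_ppc(trace, samples=1000)
```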
100%|████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1683.52it/s]
Plot the y_test vs y_predict and calculate r**2
The PPC can be visualized by plotting the predicted y values from the model against the actual y values, with error bars given by the standard deviation of the 1000 samples drawn for each point. As a second check, we can also use scikit-learn’s r2_score function to quantify the accuracy of the estimates.
In [9]:
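A sketch of the final cell, assuming the ppc dictionary from the sketch above (keyed by the likelihood’s name, 'y'):

```python
from sklearn.metrics import r2_score

# Mean and spread of the 1000 posterior predictive draws for each test point
y_predict = ppc['y'].mean(axis=0)
y_error = ppc['y'].std(axis=0)

# Predicted vs. actual values with predictive uncertainty as error bars
plt.errorbar(y_test, y_predict, yerr=y_error, fmt='o', alpha=0.4)
plt.plot(y_test, y_test, 'r-')   # reference line where prediction equals truth
plt.xlabel('y_test')
plt.ylabel('y_predict')
plt.show()

r2_score(y_test, y_predict)
```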
0.835496545799272
Resources and Additional Reading
Below is a list of resources