GPflowOpt Documentation

Introduction

GPflowOpt is a library for Bayesian Optimization with GPflow. It makes use of TensorFlow for computation of acquisition functions, to offer scalability, and avoid implementation of gradients. The package was created, and is currently maintained by Joachim van der Herten and Ivo Couckuyt

The project is open source: if you feel you have some relevant skills and are interested in contributing then please contact us on GitHub by opening an issue or pull request.

Install

  1. Install package

A straightforward way to install GPflowOpt is to clone its repository and run

pip install . --process-dependency-links

in the root folder. This also installs required dependencies including TensorFlow. For alternative TensorFlow installations (e.g., gpu), please see the instructions on the main TensorFlow webpage.

  1. Development

GPflowOpt is a pure python library so you could just add it to your python path. We use

pip  install -e . --process-dependency-links

  1. Testing and documentation

The tests require some additional dependencies that need to be installed first with pip install -e .[test]. Afterwards the tests can be run with python setup.py test.

Similarly, to build the documentation, first install the extra dependencies with pip install -e .[docs]. Then proceed with python setup.py build_sphinx.

Getting started

A simple example of Bayesian optimization to get up and running is provided by the first steps into Bayesian optimization notebook

For more advanced use cases have a look at the other tutorial notebooks and the API and architecture.

Acknowledgements

Joachim van der Herten and Ivo Couckuyt are Ghent University - imec postdoctoral fellows. Ivo Couckuyt is supported by FWO Vlaanderen.

Tutorials and examples

First steps into Bayesian optimization

Ivo Couckuyt, Joachim van der Herten

Introduction

Bayesian optimization is particularly useful for expensive optimization problems. This includes optimization problems where the objective (and constraints) are time-consuming to evaluate: measurements, engineering simulations, hyperparameter optimization of deep learning models, etc. Another area where Bayesian optimization may provide a benefit is in the presence of (a lot of) noise. If your problem does not satisfy these requirements other optimization algorithms might be better suited.

To setup a Bayesian optimization scheme with GPflowOpt you have to:

  • define your objective and specify the optimization domain
  • setup a GPflow model and choose an acquisition function
  • create a BayesianOptimizer

Objective function

In [1]:
import numpy as np
from gpflowopt.domain import ContinuousParameter


def fx(X):
    X = np.atleast_2d(X)
    return np.sum(np.square(X), axis=1)[:, None]

domain = ContinuousParameter('x1', -2, 2) + ContinuousParameter('x2', -1, 2)
domain
Out[1]:
NameTypeValues
x1Continuous[-2. 2.]
x2Continuous[-1. 2.]

Bayesian optimizer

In [2]:
import gpflow
from gpflowopt.bo import BayesianOptimizer
from gpflowopt.design import LatinHyperCube
from gpflowopt.acquisition import ExpectedImprovement
from gpflowopt.optim import SciPyOptimizer

# Use standard Gaussian process Regression
lhd = LatinHyperCube(21, domain)
X = lhd.generate()
Y = fx(X)
model = gpflow.gpr.GPR(X, Y, gpflow.kernels.Matern52(2, ARD=True))
model.kern.lengthscales.transform = gpflow.transforms.Log1pe(1e-3)

# Now create the Bayesian Optimizer
alpha = ExpectedImprovement(model)
optimizer = BayesianOptimizer(domain, alpha)

# Run the Bayesian optimization
with optimizer.silent():
    r = optimizer.optimize(fx, n_iter=15)
print(r)
Warning: optimization restart 1/5 failed
Warning: optimization restart 2/5 failed
Warning: optimization restart 3/5 failed
Warning: optimization restart 2/5 failed
     fun: array([ 0.01])
 message: 'OK'
    nfev: 15
 success: True
       x: array([[ 0. , -0.1]])

That’s all! Your objective function has now been optimized for 15 iterations.

Bayesian Optimization with black-box constraints

Joachim van der Herten

Introduction

This notebook demonstrates the optimization of an analytical function using the well known Expected Improvement (EI) function. The problem is constrained by a black-box constraint function. The feasible regions are learnt jointly with the optimal regions by considering a second acquisition function known as the Probability of Feasibility (PoF), following the approach of Gardner et al. (2014)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import gpflow
import gpflowopt
import numpy as np

Constrained problem

First we set up an objective function (the townsend function) and a constraint function. We further assume both functions are black-box. We also define the optimization domain (2 continuous parameters).

In [2]:
# Objective & constraint
def townsend(X):
    return -(np.cos((X[:,0]-0.1)*X[:,1])**2 + X[:,0] * np.sin(3*X[:,0]+X[:,1]))[:,None]

def constraint(X):
    return -(-np.cos(1.5*X[:,0]+np.pi)*np.cos(1.5*X[:,1])+np.sin(1.5*X[:,0]+np.pi)*np.sin(1.5*X[:,1]))[:,None]

# Setup input domain
domain = gpflowopt.domain.ContinuousParameter('x1', -2.25, 2.5) + \
         gpflowopt.domain.ContinuousParameter('x2', -2.5, 1.75)

# Plot
def plotfx():
    X = gpflowopt.design.FactorialDesign(101, domain).generate()
    Zo = townsend(X)
    Zc = constraint(X)
    mask = Zc>=0
    Zc[mask] = np.nan
    Zc[np.logical_not(mask)] = 1
    Z = Zo * Zc
    shape = (101, 101)

    f, axes = plt.subplots(1, 1, figsize=(7, 5))
    axes.contourf(X[:,0].reshape(shape), X[:,1].reshape(shape), Z.reshape(shape))
    axes.set_xlabel('x1')
    axes.set_ylabel('x2')
    axes.set_xlim([domain.lower[0], domain.upper[0]])
    axes.set_ylim([domain.lower[1], domain.upper[1]])
    return axes

plotfx();
_images/notebooks_constrained_bo_4_0.png

Modeling and joint acquisition function

We proceed by assigning the objective and constraint function a GP prior. Both functions are evaluated on a space-filling set of points (here, a Latin Hypercube design). Two GPR models are created. The EI is based on the model of the objective function (townsend), whereas PoF is based on the model of the constraint function. We then define the joint criterioin as the product of the EI and PoF.

In [3]:
# Initial evaluations
design = gpflowopt.design.LatinHyperCube(11, domain)
X = design.generate()
Yo = townsend(X)
Yc = constraint(X)

# Models
objective_model = gpflow.gpr.GPR(X, Yo, gpflow.kernels.Matern52(2, ARD=True))
objective_model.likelihood.variance = 0.01
constraint_model = gpflow.gpr.GPR(np.copy(X), Yc, gpflow.kernels.Matern52(2, ARD=True))
constraint_model.kern.lengthscales.transform = gpflow.transforms.Log1pe(1e-3)
constraint_model.likelihood.variance = 0.01
constraint_model.likelihood.variance.prior =  gpflow.priors.Gamma(1./4.,1.0)

# Setup
ei = gpflowopt.acquisition.ExpectedImprovement(objective_model)
pof = gpflowopt.acquisition.ProbabilityOfFeasibility(constraint_model)
joint = ei * pof

Initial belief

We can now inspect our belief about the optimization problem by plotting the models, the EI, PoF and joint mappings. Both models clearly are not very accurate yet. More specifically, the constraint model does not correctly capture the feasibility yet.

In [4]:
def plot():
    Xeval = gpflowopt.design.FactorialDesign(101, domain).generate()
    Yevala,_ = joint.operands[0].models[0].predict_f(Xeval)
    Yevalb,_ = joint.operands[1].models[0].predict_f(Xeval)
    Yevalc = np.maximum(ei.evaluate(Xeval), 0)
    Yevald = pof.evaluate(Xeval)
    Yevale = np.maximum(joint.evaluate(Xeval), 0)
    shape = (101, 101)
    plots = [('Objective model', Yevala), ('Constraint model', Yevalb),
             ('EI', Yevalc), ('PoF', Yevald),
             ('EI * PoF', Yevale)]

    plt.figure(figsize=(10,10))
    for i, plot in enumerate(plots):
        if i == 4:
            ax = plt.subplot2grid((3, 4), (2, 1), colspan=2)
        else:
            ax = plt.subplot2grid((3, 2), (int(i/2), i % 2))

        ax.contourf(Xeval[:,0].reshape(shape), Xeval[:,1].reshape(shape), plot[1].reshape(shape))
        ax.scatter(joint.data[0][:,0], joint.data[0][:,1], c='w')
        ax.set_title(plot[0])
        ax.set_xlabel('x1')
        ax.set_ylabel('x2')
        ax.set_xlim([domain.lower[0], domain.upper[0]])
        ax.set_ylim([domain.lower[1], domain.upper[1]])
    plt.tight_layout()

# Plot representing the model belief, and the belief mapped to EI and PoF
plot()
print(constraint_model)
name.kern.lengthscales transform:+ve prior:None
[ 0.15481272  0.14080552]
name.kern.variance transform:+ve prior:None
[ 0.63087138]
name.likelihood.variance transform:+ve prior:Ga([ 0.25],[ 1.])
[ 0.16824172]

_images/notebooks_constrained_bo_8_1.png

Running Bayesian Optimizer

Running the Bayesian optimization is the next step. For this, we must set up an appropriate strategy to optimize the joint acquisition function. Sometimes this can be a bit challenging as often large non-varying areas may occur. A typical strategy is to apply a Monte Carlo optimization step first, then optimize the point with the best value (several variations exist). This approach is followed here. We then run the Bayesian Optimization and allow it to select up to 50 additional decisions.

The joint acquisition function assures the feasibility (w.r.t the constraint) is taken into account while selecting decisions for optimality.

In [5]:
# First setup the optimization strategy for the acquisition function
# Combining MC step followed by L-BFGS-B
acquisition_opt = gpflowopt.optim.StagedOptimizer([gpflowopt.optim.MCOptimizer(domain, 200),
                                                   gpflowopt.optim.SciPyOptimizer(domain)])

# Then run the BayesianOptimizer for 50 iterations
optimizer = gpflowopt.BayesianOptimizer(domain, joint, optimizer=acquisition_opt)
with optimizer.silent():
    result = optimizer.optimize([townsend, constraint], n_iter=50)

print(result)
     fun: array([[-3.19946868]])
 message: 'OK'
    nfev: 50
 success: True
       x: array([[-2.25      , -1.29648192]])

Results

If we now plot the belief, we clearly see the constraint model has improved significantly. More specifically, its PoF mapping is an accurate representation of the true constraint function. By multiplying the EI by the PoF, the search is restricted to the feasible regions.

In [6]:
# Plotting belief again
print(constraint_model)
plot()
name.kern.lengthscales transform:+ve prior:None
[ 0.46705724  0.49762857]
name.kern.variance transform:+ve prior:None
[ 9.14493985]
name.likelihood.variance transform:+ve prior:Ga([ 0.25],[ 1.])
[  3.10492691e-06]

_images/notebooks_constrained_bo_12_1.png

If we inspect the sampling distribution, we can see that the amount of samples in the infeasible regions is limited. The optimization has focussed on the feasible areas. In addition, it has been active mostly in two optimal regions.

In [7]:
# Plot function, overlayed by the constraint. Also plot the samples
axes = plotfx()
valid = joint.feasible_data_index()
axes.scatter(joint.data[0][valid,0], joint.data[0][valid,1], label='feasible data', c='w')
axes.scatter(joint.data[0][np.logical_not(valid),0], joint.data[0][np.logical_not(valid),1], label='data', c='r');
axes.legend()
Out[7]:
<matplotlib.legend.Legend at 0x7fd40a3a9e48>
_images/notebooks_constrained_bo_14_1.png

Finally, the evolution of the best value over the number of iterations clearly shows a very good solution is already found after only a few evaluations.

In [8]:
f, axes = plt.subplots(1, 1, figsize=(7, 5))
f = joint.data[1][:,0]
f[joint.data[1][:,1] > 0] = np.inf
axes.plot(np.arange(0, joint.data[0].shape[0]), np.minimum.accumulate(f))
axes.set_ylabel('fmin')
axes.set_xlabel('Number of evaluated points');
_images/notebooks_constrained_bo_16_0.png

Defining new acquisition functions

Joachim van der Herten

Introduction

GPflowOpt implements supports some acquisition functions for common scenarios, such as EI and PoF. However, it is straightforward to implement your own strategy. For most strategies, it is sufficient to implement the Acquisition interface. In case a more sophisticated model is needed, this can easily be achieved with GPflow.

In [1]:
%matplotlib inline
import numpy as np

import matplotlib.pyplot as plt
import copy
import gpflow
import gpflowopt
import tensorflow as tf
In [2]:
rng = np.random.RandomState(5)
def camelback(X):
    f = (4. - 2.1*X[:,0]**2 + 0.3* X[:,0]**4) * X[:,0]**2 + np.prod(X,axis=1) + 4 * (X[:,1]**2-1) * X[:,1]**2
    return f[:,None] + rng.rand(X.shape[0], 1) * 0.25

# Setup input domain
domain = gpflowopt.domain.ContinuousParameter('x1', -3, 3) + \
         gpflowopt.domain.ContinuousParameter('x2', -2, 2)

Augmented expected improvement

As an example on how to implement a custom acquisition function, we illustrate the Augmented EI (Huang et al. 2006), a modification for Expected Improvement for optimization of noisy functions. It is defined as

\[\alpha_{\text{aEI}}(\mathbf x_{\star}) = \alpha_{\text{EI}}(\mathbf x_{\star}) \left( 1 - \frac{\sigma_n}{\sqrt{\text{Var}\left[ f_{\star}\,|\, \mathbf x, \mathbf y, \mathbf x_{\star} \right] + \sigma_n^2}}\right)\]

This definition can be interpreted as rescaling of the EI score, with respect to the noise variance. For \(\sigma^2_n=0\), the rescale term equals 1 and normal EI is recovered. For \(\sigma^2_n > 0\), small prediction variances are punished, which decreases concentration of sampling and enhances exploration.

To implement this acquisition function, we could override the build_acquisition method of Acquisition, however in this case its easier to override ExpectedImprovement and obtain the EI score through a call to build_acquisition of the parent class.

In [3]:
class AugmentedEI(gpflowopt.acquisition.ExpectedImprovement):
    def __init__(self, model):
        super(AugmentedEI, self).__init__(model)

    def build_acquisition(self, Xcand):
        ei = super(AugmentedEI, self).build_acquisition(Xcand)
        X = self.models[0].input_transform.build_forward(Xcand)
        _, pvar = self.models[0].wrapped.build_predict(X)
        return tf.multiply(ei, 1 - tf.sqrt(self.models[0].likelihood.variance)
                           / (tf.sqrt(pvar + self.models[0].likelihood.variance)))

Results

This small experiment on the six hump camelback illustrates impact of the penalty term.

In [4]:
design = gpflowopt.design.LatinHyperCube(9, domain)
X = design.generate()
Y = camelback(X)
m = gpflow.gpr.GPR(X, Y, gpflow.kernels.Matern52(2, ARD=True, lengthscales=[10,10], variance=10000))
m.likelihood.variance = 1
m.likelihood.variance.fixed = True
ei = gpflowopt.acquisition.ExpectedImprovement(m)
m = gpflow.gpr.GPR(X, Y, gpflow.kernels.Matern52(2, ARD=True, lengthscales=[10,10], variance=10000))
m.likelihood.variance = 1
m.likelihood.variance.fixed = False
aei = AugmentedEI(m)

opt = gpflowopt.optim.StagedOptimizer([gpflowopt.optim.MCOptimizer(domain, 200),
                                       gpflowopt.optim.SciPyOptimizer(domain)])

bopt1 = gpflowopt.BayesianOptimizer(domain, ei, optimizer=opt)
with bopt1.silent():
    bopt1.optimize(camelback, n_iter=50)

bopt2 = gpflowopt.BayesianOptimizer(domain, aei, optimizer=opt)
with bopt2.silent():
    bopt2.optimize(camelback, n_iter=50)

f, axes = plt.subplots(1,2, figsize=(14,7))

Xeval = gpflowopt.design.FactorialDesign(101, domain).generate()
Yeval = camelback(Xeval)
titles = ['EI', 'AEI']
shape = (101, 101)

for ax, t, acq in zip(axes, titles, [ei, aei]):
    pred = acq.models[0].predict_f(Xeval)[0]
    ax.contourf(Xeval[:,0].reshape(shape), Xeval[:,1].reshape(shape),
                pred.reshape(shape))
    ax.set_xlabel('x1')
    ax.set_ylabel('x2')
    ax.set_title(t)
    ax.scatter(acq.data[0][:,0], acq.data[0][:,1], c='w')
_images/notebooks_new_acquisition_7_0.png

Bayesian Optimization of Hyperparameters

Vincent Dutordoir, Joachim van der Herten

Introduction

The paper Practical Bayesian Optimization of Machine Learning algorithms by Snoek et al. 2012 describes the use of Bayesian optimization for hyperparameter optimization. In this paper, the (at the time) state-of-the-art test error for convolutional neural networks on the CIFAR-10 dataset was improved significantly by optimizing the parameters of the training process with Bayesian optimization.

In this notebook we demonstrate the principle by optimizing the starting point of the maximum likelihood estimation of a GP. Note that we use a GP to optimize the initial hyperparameter values of another GP.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

# Loading airline data
import numpy as np
data = np.load('airline.npz')
X_train, Y_train = data['X_train'], data['Y_train']
D = Y_train.shape[1];

Data set

The data to illustrate hyperparameter optimization is the well-known airline passenger volume data. It is a one-dimensional time series of the passenger volumes of airlines over time. We wish to use it to make forecasts. Plotting the data below, it is clear that the data contains a pattern.

In [2]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('Time (years)')
ax.set_ylabel('Airline passengers ($10^3$)')
ax.plot(X_train.flatten(),Y_train.flatten(), c='b')
ax.set_xticklabels([1949, 1952, 1955, 1958, 1961, 1964])
plt.tight_layout()
_images/notebooks_hyperopt_4_0.png

Modeling

To forecast this timeseries, we will pick up its pattern using a Spectral Mixture kernel (Wilson et al, 2013). Essentially, this kernel is a sum of \(Q\) products of Cosine and RBF kernels. For this one-dimensional problem each term of the sum has 4 hyperparameters. To account for the upward trend, we also add a Linear and a Bias kernel.

In [3]:
from gpflow.kernels import RBF, Cosine, Linear, Bias, Matern52
from gpflow import transforms
from gpflow.gpr import GPR

Q = 10 # nr of terms in the sum
max_iters = 1000

# Trains a model with a spectral mixture kernel, given an ndarray of 2Q frequencies and lengthscales
def create_model(hypers):
    f = np.clip(hypers[:Q], 0, 5)
    weights = np.ones(Q) / Q
    lengths = hypers[Q:]

    kterms = []
    for i in range(Q):
        rbf = RBF(D, lengthscales=lengths[i], variance=1./Q)
        rbf.lengthscales.transform = transforms.Exp()
        cos = Cosine(D, lengthscales=f[i])
        kterms.append(rbf * cos)

    k = np.sum(kterms) + Linear(D) + Bias(D)
    m = GPR(X_train, Y_train, kern=k)
    return m

X_test, X_complete = data['X_test'], data['X_complete']
def plotprediction(m):
    # Perform prediction
    mu, var = m.predict_f(X_complete)

    # Plot
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.set_xlabel('Time (years)')
    ax.set_ylabel('Airline passengers ($10^3$)')
    ax.set_xticklabels([1949, 1952, 1955, 1958, 1961, 1964, 1967, 1970, 1973])
    ax.plot(X_train.flatten(),Y_train.flatten(), c='b')
    ax.plot(X_complete.flatten(), mu.flatten(), c='g')
    lower = mu - 2*np.sqrt(var)
    upper = mu + 2*np.sqrt(var)
    ax.plot(X_complete, upper, 'g--', X_complete, lower, 'g--', lw=1.2)
    ax.fill_between(X_complete.flatten(), lower.flatten(), upper.flatten(),
                    color='g', alpha=.1)
    plt.tight_layout()

m = create_model(np.ones((2*Q,)))
m.optimize(maxiter=max_iters)
plotprediction(m)
_images/notebooks_hyperopt_6_0.png

In total, a lot of hyperparameters must be optimized. Furthermore, the optimization surface of the spectral mixture is highly multimodal. Starting from the default hyperparameter values the optimized GP is able to pick up the linear trend, and the RBF kernels perform local interpolation. However, the kernel is not able to extrapolate away from the data. In sum, with this starting point, the likelihood optimization ends in a local minimum.

Hyperparameter optimization

This issue is a known problem of the spectram mixture kernel, and several recommendations exist on how to improve the starting point. Here, we will use GPflowOpt to optimize the initial values for the lengthscales of the RBF and the Cosine kernel (i.e., the frequencies of the latter kernel). The other hyperparameters (rbf and cosine variances, likelihood variances and the linear and bias terms) are kept at their defaults and will be optimized by the standard likelihood optimization.

First, we setup the objective function accepting proposed starting points. The objective function returns the negative log likelihood, obtained by optimizing the hyperparameters from the given starting point. Then, we setup the 20D optimization domain for the frequencies and lengthscales.

In [4]:
from gpflowopt.domain import ContinuousParameter
from gpflowopt.objective import batch_apply

# Objective function for our optimization
# Input: N x 2Q ndarray, output: N x 1.
# returns the negative log likelihood obtained by training with given frequencies and rbf lengthscales
# Applies some tricks for stability similar to GPy's jitchol
@batch_apply
def objectivefx(freq):
    m = create_model(freq)
    for i in [0] + [10**exponent for exponent in range(6,1,-1)]:
        try:
            mean_diag = np.mean(np.diag(m.kern.compute_K_symm(X_train)))
            m.likelihood.variance = 1 + mean_diag * i
            m.optimize(maxiter=max_iters)
            return -m.compute_log_likelihood()
        except:
            pass
    raise RuntimeError("Frequency combination failed indefinately.")

# Setting up optimization domain.
lower = [0.]*Q
upper = [5.]*int(Q)
df = np.sum([ContinuousParameter('freq{0}'.format(i), l, u) for i, l, u in zip(range(Q), lower, upper)])

lower = [1e-5]*Q
upper = [300]*int(Q)
dl = np.sum([ContinuousParameter('l{0}'.format(i), l, u) for i, l, u in zip(range(Q), lower, upper)])
domain = df + dl
domain
Out[4]:
NameTypeValues
freq0Continuous[ 0. 5.]
freq1Continuous[ 0. 5.]
freq2Continuous[ 0. 5.]
freq3Continuous[ 0. 5.]
freq4Continuous[ 0. 5.]
freq5Continuous[ 0. 5.]
freq6Continuous[ 0. 5.]
freq7Continuous[ 0. 5.]
freq8Continuous[ 0. 5.]
freq9Continuous[ 0. 5.]
l0Continuous[ 1.00000000e-05 3.00000000e+02]
l1Continuous[ 1.00000000e-05 3.00000000e+02]
l2Continuous[ 1.00000000e-05 3.00000000e+02]
l3Continuous[ 1.00000000e-05 3.00000000e+02]
l4Continuous[ 1.00000000e-05 3.00000000e+02]
l5Continuous[ 1.00000000e-05 3.00000000e+02]
l6Continuous[ 1.00000000e-05 3.00000000e+02]
l7Continuous[ 1.00000000e-05 3.00000000e+02]
l8Continuous[ 1.00000000e-05 3.00000000e+02]
l9Continuous[ 1.00000000e-05 3.00000000e+02]

High-dimensional Bayesian optimization is tricky, although the complexity of the problem is significantly reduced due to symmetry in the optimization domain (interchanging frequencies does not make a difference) and because we still optimize the likelihood given the starting point. Therefore, getting near a mode is sufficient. Furthermore we disable ARD of the kernel of the model approximating the objective function to avoid optimizing a lot of lengthscales with little data. We then use EI to pick new candidate starting points and evaluate our objective.

In [5]:
from gpflowopt.design import LatinHyperCube
from gpflowopt.acquisition import ExpectedImprovement
from gpflowopt import optim, BayesianOptimizer
design = LatinHyperCube(6, domain)
X = design.generate()

Y = objectivefx(X)
m = GPR(X, Y, kern=Matern52(domain.size, ARD=False))
ei = ExpectedImprovement(m)
opt = optim.StagedOptimizer([optim.MCOptimizer(domain, 5000), optim.SciPyOptimizer(domain)])
optimizer = BayesianOptimizer(domain, ei, optimizer=opt)
with optimizer.silent():
    result = optimizer.optimize(objectivefx, n_iter=24)
Warning: inf or nan in gradient: replacing with zeros
Warning: inf or nan in gradient: replacing with zeros
Warning: inf or nan in gradient: replacing with zeros
Warning: inf or nan in gradient: replacing with zeros
In [6]:
m = create_model(result.x[0,:])
m.optimize()
plotprediction(m)
_images/notebooks_hyperopt_11_0.png

Clearly, the optimization point identified with BO is a lot better than the default values. We now obtain a proper forecasting. By inspecting the evolution of the best likelihood value obtained so far, we see the solution is identified quickly.

In [7]:
f, axes = plt.subplots(1, 1, figsize=(7, 5))
f = ei.data[1][:,0]
axes.plot(np.arange(0, ei.data[0].shape[0]), np.minimum.accumulate(f))
axes.set_ylabel('fmin')
axes.set_xlabel('Number of evaluated points');
_images/notebooks_hyperopt_13_0.png

Bayesian Multiobjective Optimization

Joachim van der Herten, Ivo Couckuyt

Introduction

This notebook demonstrates the multiobjective optimization of an analytical function using the hypervolume-based probability of improvement function.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import gpflow
import gpflowopt
import numpy as np

We setup the Veldhuizen and Lamont multiobjective optimization problem 2 (vlmop2). The objectives of vlmop2 are very easy to model. Ideal for illustrating Bayesian multiobjective optimization.

In [2]:
# Objective
def vlmop2(x):
    transl = 1 / np.sqrt(2)
    part1 = (x[:, [0]] - transl) ** 2 + (x[:, [1]] - transl) ** 2
    part2 = (x[:, [0]] + transl) ** 2 + (x[:, [1]] + transl) ** 2
    y1 = 1 - np.exp(-1 * part1)
    y2 = 1 - np.exp(-1 * part2)
    return np.hstack((y1, y2))

# Setup input domain
domain = gpflowopt.domain.ContinuousParameter('x1', -2, 2) + \
         gpflowopt.domain.ContinuousParameter('x2', -2, 2)

# Plot
def plotfx():
    X = gpflowopt.design.FactorialDesign(101, domain).generate()
    Z = vlmop2(X)
    shape = (101, 101)

    axes = []
    plt.figure(figsize=(15, 5))
    for i in range(Z.shape[1]):
        axes = axes + [plt.subplot2grid((1, 2), (0, i))]

        axes[-1].contourf(X[:,0].reshape(shape), X[:,1].reshape(shape), Z[:,i].reshape(shape))
        axes[-1].set_title('Objective {}'.format(i+1))
        axes[-1].set_xlabel('x1')
        axes[-1].set_ylabel('x2')
        axes[-1].set_xlim([domain.lower[0], domain.upper[0]])
        axes[-1].set_ylim([domain.lower[1], domain.upper[1]])

    return axes

plotfx();
_images/notebooks_multiobjective_4_0.png

Multiobjective acquisition function

We can model the belief of each objective by one GP prior or model each objective separately using a GP prior. We illustrate the latter approach here. A set of data points arranged in a Latin Hypercube is evaluated on the vlmop2 function.

In multiobjective optimization the definition of improvement is ambigious. Here we define improvement using the contributing hypervolume which will determine the focus on density and uniformity of the Pareto set during sampling. For instance, due to the nature of the contributing hypervolume steep slopes in the Pareto front will be sampled less densely. The hypervolume-based probability of improvement is based on the model(s) of the objective functions (vlmop2) and aggregates all the information in one cost function which is a balance between:

  • improving our belief of the objectives (high uncertainty)
  • favoring points improving the Pareto set (large contributing hypervolume with a likely higher uncertainty)
  • focussing on augmenting the Pareto set (small contributing hypervolume but with low uncertainty).
In [3]:
# Initial evaluations
design = gpflowopt.design.LatinHyperCube(11, domain)
X = design.generate()
Y = vlmop2(X)

# One model for each objective
objective_models = [gpflow.gpr.GPR(X.copy(), Y[:,[i]].copy(), gpflow.kernels.Matern52(2, ARD=True)) for i in range(Y.shape[1])]
for model in objective_models:
    model.likelihood.variance = 0.01

hvpoi = gpflowopt.acquisition.HVProbabilityOfImprovement(objective_models)

Running the Bayesian optimizer

The optimization surface of multiobjective acquisition functions can be even more challenging than, e.g., standard expected improvement. Hence, a hybrid optimization scheme is preferred: a Monte Carlo optimization step first, then optimize the point with the best value.

We then run the Bayesian Optimization and allow it to select up to 20 additional decisions.

In [4]:
# First setup the optimization strategy for the acquisition function
# Combining MC step followed by L-BFGS-B
acquisition_opt = gpflowopt.optim.StagedOptimizer([gpflowopt.optim.MCOptimizer(domain, 1000),
                                                   gpflowopt.optim.SciPyOptimizer(domain)])

# Then run the BayesianOptimizer for 20 iterations
optimizer = gpflowopt.BayesianOptimizer(domain, hvpoi, optimizer=acquisition_opt)
with optimizer.silent():
    result = optimizer.optimize([vlmop2], n_iter=20)

print(result)
print(optimizer.acquisition.pareto.front.value)
     fun: array([[ 0.88762951,  0.29191139],
       [ 0.68279472,  0.59802318],
       [ 0.42659568,  0.79430421],
       [ 0.07354128,  0.94891563],
       [ 0.82764037,  0.37153952],
       [ 0.21803674,  0.89654802],
       [ 0.5652733 ,  0.69464852],
       [ 0.00923563,  0.9738539 ],
       [ 0.9334239 ,  0.12286183],
       [ 0.71007284,  0.54565196],
       [ 0.76405389,  0.47382314],
       [ 0.34539271,  0.83984545],
       [ 0.48687039,  0.75355875],
       [ 0.63004337,  0.63471413],
       [ 0.98267516,  0.00178449],
       [ 0.91892508,  0.16232397],
       [ 0.85634937,  0.30878024],
       [ 0.29492334,  0.86269861],
       [ 0.89223755,  0.22705924],
       [ 0.79104171,  0.43108073]])
 message: 'OK'
    nfev: 20
 success: True
       x: array([[-0.16140283, -0.48940691],
       [ 0.0727557 , -0.15649316],
       [ 0.22042083,  0.14203954],
       [ 0.52761769,  0.49694149],
       [-0.17070714, -0.28668133],
       [ 0.38576529,  0.32936723],
       [ 0.09298785,  0.03190632],
       [ 0.66424394,  0.62084368],
       [-0.40696469, -0.50460869],
       [-0.10057392, -0.05825421],
       [-0.10349761, -0.18006422],
       [ 0.2923697 ,  0.20539677],
       [ 0.14213026,  0.11716751],
       [-0.01662408,  0.02111616],
       [-0.74495067, -0.68829395],
       [-0.36858442, -0.45705387],
       [-0.26088788, -0.29457053],
       [ 0.27909045,  0.29936819],
       [-0.3425574 , -0.35403933],
       [-0.20960284, -0.14451762]])
[[ 0.00924157  0.97385548]
 [ 0.07353906  0.9489149 ]
 [ 0.21803833  0.89654648]
 [ 0.29492503  0.86270261]
 [ 0.34538813  0.83984479]
 [ 0.42659985  0.79430552]
 [ 0.48686827  0.75354473]
 [ 0.56527447  0.69466632]
 [ 0.63003846  0.63471877]
 [ 0.68279543  0.59801444]
 [ 0.71007725  0.54563413]
 [ 0.76405382  0.47382832]
 [ 0.82763436  0.37155876]
 [ 0.8563462   0.30880294]
 [ 0.88763433  0.29189238]
 [ 0.89225033  0.2270049 ]
 [ 0.91890866  0.16234889]
 [ 0.93342994  0.12287979]
 [ 0.98267631  0.00178968]]

For multiple objectives the returned OptimizeResult object contains the identified Pareto set instead of just a single optimum. Note that this is computed on the raw data Y.

The hypervolume-based probability of improvement operates on the Pareto set derived from the model predictions of the training data (to handle noise). This latter Pareto set can be found as optimizer.acquisition.pareto.front.value.

Plotting the results

Lets plot the belief of the final models and acquisition function.

In [5]:
def plot():
    grid_size = 51  # 101
    shape = (grid_size, grid_size)

    Xeval = gpflowopt.design.FactorialDesign(grid_size, domain).generate()

    Yeval_1, _ = hvpoi.models[0].predict_f(Xeval)
    Yeval_2, _ = hvpoi.models[1].predict_f(Xeval)

    Yevalc = hvpoi.evaluate(Xeval)

    plots = [((0,0), 1, 1, 'Objective 1 model', Yeval_1[:,0]),
             ((0,1), 1, 1, 'Objective 2 model', Yeval_2[:,0]),
             ((1,0), 2, 2, 'hypervolume-based PoI', Yevalc)]

    plt.figure(figsize=(7,7))
    for i, (plot_pos, plot_rowspan, plot_colspan, plot_title, plot_data) in enumerate(plots):
        data = hvpoi.data[0]

        ax = plt.subplot2grid((3, 2), plot_pos, rowspan=plot_rowspan, colspan=plot_colspan)
        ax.contourf(Xeval[:,0].reshape(shape), Xeval[:,1].reshape(shape), plot_data.reshape(shape))
        ax.scatter(data[:,0], data[:,1], c='w')
        ax.set_title(plot_title)
        ax.set_xlabel('x1')
        ax.set_ylabel('x2')
        ax.set_xlim([domain.lower[0], domain.upper[0]])
        ax.set_ylim([domain.lower[1], domain.upper[1]])
    plt.tight_layout()

# Plot representing the model belief, and the belief mapped to EI and PoF
plot()

for model in objective_models:
    print(model)
name.kern.lengthscales transform:+ve prior:None
[ 0.58170479  0.57493089]
name.kern.variance transform:+ve prior:None
[ 4.39140011]
name.likelihood.variance transform:+ve prior:None
[  1.00000000e-06]

name.kern.lengthscales transform:+ve prior:None
[ 0.60083304  0.58044734]
name.kern.variance transform:+ve prior:None
[ 3.35582088]
name.likelihood.variance transform:+ve prior:None
[  1.00000007e-06]

_images/notebooks_multiobjective_10_1.png

Finally, we can extract and plot the Pareto front ourselves using the pareto.non_dominated_sort function on the final data matrix Y.

The non-dominated sort returns the Pareto set (non-dominated solutions) as well as a dominance vector holding the number of dominated points for each point in Y. For example, we could only select the points with dom == 2, or dom == 0 (the latter retrieves the non-dominated solutions). Here we choose to use the dominance vector to color the points.

In [6]:
# plot pareto front
plt.figure(figsize=(9, 4))

R = np.array([1.5, 1.5])
print('R:', R)
hv = hvpoi.pareto.hypervolume(R)
print('Hypervolume indicator:', hv)

plt.figure(figsize=(7, 7))

pf, dom = gpflowopt.pareto.non_dominated_sort(hvpoi.data[1])

plt.scatter(hvpoi.data[1][:,0], hvpoi.data[1][:,1], c=dom)
plt.title('Pareto set')
plt.xlabel('Objective 1')
plt.ylabel('Objective 2')
R: [ 1.5  1.5]
Hypervolume indicator: 1.55890890103
Out[6]:
<matplotlib.text.Text at 0x7f42a9bd1908>
<matplotlib.figure.Figure at 0x7f4324517860>
_images/notebooks_multiobjective_12_3.png

API and architecture

The structure of GPflowOpt

Joachim van der Herten

In this document, the structure of the GPflowOpt library is explained, including some small examples. First the Domain and Optimizer interfaces are shortly illustrated, followed by a description of the BayesianOptimizer. At the end, a step-by-step walkthrough of the BayesianOptimizer is given.

Optimization

The underlying design principles of GPflowOpt were chosen to address the following task: optimizing problems of the form

\[\underset{\boldsymbol{x} \in \mathcal{X}}{\operatorname{argmin}} f(\boldsymbol{x}).\]

The objective function \(f: \mathcal{X} \rightarrow \mathbb{R}^p\) maps a candidate optimum to a score (or multiple). Here \(\mathcal{X}\) represents the input domain. This domain encloses all candidate solutions to the optimization problem and can be entirely continuous (i.e., a \(d\)-dimensional hypercube) but may also consist of discrete and categorical parameters.

In GPflowOpt, the Domain and Optimizer interfaces and corresponding subclasses are used to explicitly represent the optimization problem.

Objective function

The objective function itself is provided and must be implemented as any python callable (function or object with implemented call operator), accepting a two dimensional numpy array with shape \((n, d)\) as an input, with \(n\) the batch size. It returns a tuple of numpy arrays: the first element of the tuple has shape \((n, p)\) and returns the objective scores for each point to evaluate. The second element is the gradient in every point, shaped either \((n, d)\) if we have a single-objective optimization problem, or \((n, d, p)\) in case of a multi-objective function. If the objective function is passed on to a gradient-free optimization method, the gradient is automatically discarded. GPflowOpt provides decorators which handle batch application of a function along the n points of the input matrix, or dealing with functions which accept each feature as function argument.

Here, we define a simple quadratic objective function:

In [5]:
import numpy as np

def fx(X):
    X = np.atleast_2d(X)
    # Return objective & gradient
    return np.sum(np.square(X), axis=1)[:,None], 2*X
Domain

Then, we represent \(\mathcal{X}\) by composing parameters. This is how a simple continuous square domain is defined:

In [6]:
from gpflowopt.domain import ContinuousParameter
domain = ContinuousParameter('x1', -2, 2) + ContinuousParameter('x2', -1, 2)
domain
Out[6]:
NameTypeValues
x1Continuous[-2. 2.]
x2Continuous[-1. 2.]
Optimize

Based on the domain and a valid objective function, we can now easily apply one of the included optimizers to optimize objective functions. GPflowOpt defines an intuitive Optimizer interface which can be used to specify the domain, the initial point(s), constraints (to be implemented) etc. Some popular optimization approaches are provided. Here is how our function is optimized using one of the available methods of SciPy’s minimize:

In [7]:
from gpflowopt.optim import SciPyOptimizer

optimizer = SciPyOptimizer(domain, method='SLSQP')
optimizer.set_initial([-1,-1])
optimizer.optimize(fx)
Out[7]:
fun: 0.0
     jac: array([ 0.,  0.])
 message: 'Optimization terminated successfully.'
    nfev: 3
     nit: 2
    njev: 2
  status: 0
 success: True
       x: array([[ 0.,  0.]])

And here is how we optimize it Monte-Carlo. We can pass the same function as the gradients are automatically discarded.

In [8]:
from gpflowopt.optim import MCOptimizer
optimizer = MCOptimizer(domain, 200)
optimizer.optimize(fx)
Out[8]:
fun: array([[ 0.02115951]])
 message: 'OK'
    nfev: 200
 success: True
       x: array([[ 0.05731395, -0.13369599]])

Bayesian Optimization

In Bayesian Optimization (BO), the typical assumption is that \(f\) is expensive to evaluate and no gradients are available. The typical approach is to sequentially select a limited set of decisions \(\boldsymbol{x}_0, \boldsymbol{x}_1, ... \boldsymbol{x}_{n-1}\) using a sampling policy. Hence each decision \(\boldsymbol{x}_i \in \mathcal{X}\) itself is the result of an optimization problem

\[\boldsymbol{x}_i = \underset{\boldsymbol{x}}{\operatorname{argmax}} \alpha_i(\boldsymbol{x})\]

Each iteration, a function \(\alpha_i\) which is cheap-to-evaluate acts as a surrogate for the expensive function. It is typically a mapping of the predictive distribution of a (Bayesian) model built on all decisions and their corresponding (noisy) evaluations. The mapping introduces an order in \(\mathcal{X}\) to obtain a certain goal. The typical goal within the context of BO is the search for optimality or feasibility while keeping the amount of required evaluations (\(n\)) a small number. As we can have several functions \(f\) representing objectives and constraints, BO may invoke several models and mappings \(\alpha\). These mappings are typically referred to as acquisition functions (or infill criteria). GPflowOpt defines an Acquisition interface to implement these mappings and provides implementations of some default choices. In combination with a special Optimizer implementation for BO, following steps are required for a typical workflow:

  1. Define the problem domain. Its dimensionality matches the input to the objective and constraint functions. (like normal optimization)
  2. Specify the (GP) models for the constraints and objectives. This involves choice of kernels, priors, fixes, transforms... this step follows the standard way of setting up GPflow models. GPflowOpt does not further wrap models hence it is possible to implement custom models in GPflow and use them directly in GPflowOpt
  3. Set up the acquisition function(s) using the available built-in implementations in GPflowOpt, or design your own by implementing the Acquisition interface.
  4. Set up an optimizer for the acquisition function.
  5. Run the high-level BayesianOptimizer which implements a typical BO flow. BayesianOptimizer in itself is compliant with the Optimizer interface. Exceptionally, the BayesianOptimizer requires that the objective function returns no gradients.

Alternatively, advanced users requiring finer control can easily implement their own flow based on the low-level interfaces of GPflowOpt, as the coupling between these objects was intentionally kept loose.

As illustration of the described flow, the previous example is optimized using Bayesian optimization instead, with the well-known Expected Improvement acquisition function:

In [9]:
from gpflowopt.bo import BayesianOptimizer
from gpflowopt.design import FactorialDesign
from gpflowopt.acquisition import ExpectedImprovement
import gpflow

# The Bayesian Optimizer does not expect gradients to be returned
def fx(X):
    X = np.atleast_2d(X)
    # Return objective & gradient
    return np.sum(np.square(X), axis=1)[:,None]


X = FactorialDesign(2, domain).generate()
Y = fx(X)

# initializing a standard BO model, Gaussian Process Regression with
# Matern52 ARD Kernel
model = gpflow.gpr.GPR(X, Y, gpflow.kernels.Matern52(2, ARD=True))
alpha = ExpectedImprovement(model)

# Now we must specify an optimization algorithm to optimize the acquisition
# function, each iteration.
acqopt = SciPyOptimizer(domain)

# Now create the Bayesian Optimizer
optimizer = BayesianOptimizer(domain, alpha, optimizer=acqopt)
with optimizer.silent():
    r = optimizer.optimize(fx, n_iter=15)
print(r)
     fun: array([ 0.17684308])
 message: 'OK'
    nfev: 15
 success: True
       x: array([[ 0.        ,  0.42052714]])

This brief snippet code starts from a 2-level grid (corner points of the domain) and uses a GP model to model the response surface of the objective function. The BayesianOptimizer follows the same interface as other optimizers and is initialized with a domain, the acquisition function and an additional optimization method to optimize the acquisition function each iteration. Finally, the optimizer performs 10 iterations to optimize fx.

The code to evaluate the acquisition function on the model is written in TensorFlow, allowing gradient-based optimization without additional effort due to the automated differentation.

Step-by-step description of the BayesianOptimizer

Prior to running BayesianOptimizer.optimize(), the acquisition function is initialized with an underlying model. Any data previously included in the model (through the GPModel.__init__ constructor in GPflow) is used as initial data. When optimize(function, n_iter) is called:

  1. Any data points returned by get_initial() are evaluated. Afterwards the evaluated points are added to the models by calling _update_model_data().
  2. n_iter iterations are performed. Each iteration the acquisition function is optimized, and the models are updated by calling _update_model_data()

The updating of a model through _update_model_data() calls set_data(X, Y) on the acquisition function. This covers following aspects:

  • GPModel.X and GPModel.Y are updated
  • Each of the contained models are returned to the state when the acquisition function was initialized and optimized. If the optimize_restarts parameter of the Acquisition.__init__() was set to \(n>1\), the state of the model is randomized and optimized \(n-1\) times. Finally, the state resulting in the best log_likelihood() is the new model state
  • Call Acquisition.setup() to perform any pre-calculation of quantities independent of candidate points, which can be used in build_acquisition().

The GPflow tree

The Acquisition interface, mapping the belief of the model(s) to a score indicating areas of optimality/feasibility, is implemented as part of the GPflow tree structure. More specifically it implements the Parameterized interface permitting the use of the useful AutoFlow decorator. The build_acquisition() method to be implemented by subclasses is a TensorFlow method, allowing automated differentiation of the acquisition function which enables gradient-based optimization thereof (not of the objective!). It may directly access the graph for computing the predictive distribution of a model by calling build_predict().

Bayesian Optimizer

Acquisition functions

The GPflowOpt package currently supports a limited number of popular acquisition functions. These are summarized in the table below. Detailed description for each can be found below.

Method Objective Constraint # Outputs
GPflowOpt.acquisition.ExpectedImprovement   1
GPflowOpt.acquisition.ProbabilityOfFeasibility   1
GPflowOpt.acquisition.ProbabilityOfImprovement   1
GPflowOpt.acquisition.LowerConfidenceBound   1
GPflowOpt.acquisition.HVProbabilityOfImprovement   > 1

Single-objective

Expected Improvement
Probability of Feasibility
Probability of Improvement
Lower Confidence Bound

Multi-objective

Hypervolume-based Probability of Improvement
Pareto module

Initial Designs

Latin Hypercube design

Factorial design

Random design

Data Transformations

Transforms

Normalizer

Interfaces

Domain

Optimizer

Acquisition

Design

Transform

ModelWrapper