Theano

Description

  • Mathematical symbolic expression compiler

  • Dynamic C/CUDA code generation

  • Efficient symbolic differentiation

    • Theano computes derivatives of functions with one or many inputs.

  • Speed and stability optimizations

    • Gives the right answer for log(1+x) even if x is really tiny.

  • Works on Linux, Mac and Windows

  • Transparent use of a GPU

    • float32 only for now (working on other data types)

    • Still in experimental state on Windows

    • On GPU data-intensive calculations are typically between 6.5x and 44x faster. We’ve seen speedups up to 140x

  • Extensive unit-testing and self-verification

    • Detects and diagnoses many types of errors

  • On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

    • including specialized implementations in C/C++, NumPy, SciPy, and Matlab

  • Expressions mimic NumPy’s syntax & semantics

  • Statically typed and purely functional

  • Some sparse operations (CPU only)

Simple example

>>> import theano
>>> a = theano.tensor.vector("a")      # declare symbolic variable
>>> b = a + a ** 10                    # build symbolic expression
>>> f = theano.function([a], b)        # compile function
>>> f([0, 1, 2])
array([    0.,     2.,  1026.])

Unoptimized graph

Optimized graph

../_images/f_unoptimized.png ../_images/f_optimized.png

Symbolic programming = Paradigm shift: people need to use it to understand it.

Exercise 1

import theano
a = theano.tensor.vector()      # declare variable
out = a + a ** 10               # build symbolic expression
f = theano.function([a], out)   # compile function
print f([0, 1, 2])
# prints `array([0, 2, 1026])`

theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)

Modify and execute the example to do this expression: a ** 2 + b ** 2 + 2 * a * b

Real example

Logistic Regression

  • GPU-ready

  • Symbolic differentiation

  • Speed optimizations

  • Stability optimizations

from __future__ import absolute_import, print_function, division
import numpy as np
import theano
import theano.tensor as tt
rng = np.random

N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = tt.matrix("x")
y = tt.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
print("Initial model:")
print(w.get_value(), b.get_value())

# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))   # Probability that target = 1
prediction = p_1 > 0.5                      # The prediction thresholded
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy loss
cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to minimize
gw, gb = tt.grad(cost, [w, b])

# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=[(w, w - 0.1 * gw),
             (b, b - 0.1 * gb)],
    name='train')

predict = theano.function(inputs=[x], outputs=prediction,
                          name='predict')

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("Final model:")
print(w.get_value(), b.get_value())
print("target values for D:", D[1])
print("prediction on D:", predict(D[0]))

Optimizations:

Where are those optimization applied?

  • log(1+exp(x))

  • 1 / (1 + tt.exp(var)) (sigmoid)

  • log(1-sigmoid(var)) (softplus, stabilisation)

  • GEMV (matrix-vector multiply from BLAS)

  • Loop fusion

p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
# 1 / (1 + tt.exp(var)) -> sigmoid(var)
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
# Log(1-sigmoid(var)) -> -sigmoid(var)
prediction = p_1 > 0.5
cost = xent.mean() + 0.01 * (w ** 2).sum()
gw, gb = tt.grad(cost, [w, b])

train = theano.function(
          inputs=[x, y],
          outputs=[prediction, xent],
          # w - 0.1 * gw: GEMV with the dot in the grad
          updates=[(w, w - 0.1 * gw),
                   (b, b - 0.1 * gb)])

Theano flags

Theano can be configured with flags. They can be defined in two ways

  • With an environment variable: THEANO_FLAGS="profile=True,profile_memory=True"

  • With a configuration file that defaults to ~/.theanorc

Exercise 2

import numpy
import theano
import theano.tensor as tt
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = tt.matrix("x")
y = tt.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()


# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w**2).sum()  # The cost to optimize
gw,gb = tt.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[prediction, xent],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[x], outputs=prediction,
                          name="predict")

if any([x.op.__class__.__name__=='Gemv' for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__=='GpuGemm' for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()



for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]

print "prediction on D"
print predict(D[0])

# Print the graph used in the slides
theano.printing.pydotprint(predict,
                           outfile="pics/logreg_pydotprint_predic.png",
                           var_with_name_simple=True)
theano.printing.pydotprint_variables(prediction,
                           outfile="pics/logreg_pydotprint_prediction.png",
                           var_with_name_simple=True)
theano.printing.pydotprint(train,
                           outfile="pics/logreg_pydotprint_train.png",
                           var_with_name_simple=True)

Modify and execute the example to run on CPU with floatX=float32

  • You will need to use: theano.config.floatX and ndarray.astype("str")

GPU

  • Only 32 bit floats are supported (being worked on)

  • Only 1 GPU per process. Wiki page on using multiple process for multiple GPU

  • Use the Theano flag device=gpu to tell to use the GPU device

  • Use device=gpu{0, 1, ...} to specify which GPU if you have more than one

  • Shared variables with float32 dtype are by default moved to the GPU memory space

  • Use the Theano flag floatX=float32

  • Be sure to use floatX (theano.config.floatX) in your code

  • Cast inputs before putting them into a shared variable

  • Cast “problem”: int32 with float32 to float64

  • Insert manual cast in your code or use [u]int{8,16}

  • The mean operator is worked on to make the output stay in float32.

  • Use the Theano flag force_device=True, to exit if Theano isn’t able to use a GPU.

    • Theano 0.6rc4 will have the combination of force_device=True and device=cpu disable the GPU.

Exercise 3

  • Modify and execute the example of Exercise 2 to run with floatX=float32 on GPU

  • Time with: time python file.py

Symbolic variables

  • # Dimensions

  • tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

  • Dtype

  • tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)

  • tt.vector to floatX dtype

  • floatX: configurable dtype that can be float32 or float64.

  • Custom variable

  • All are shortcuts to: tt.tensor(dtype, broadcastable=[False]*nd)

  • Other dtype: uint[8,16,32,64], floatX

Creating symbolic variables: Broadcastability

  • Remember what I said about broadcasting?

  • How to add a row to all rows of a matrix?

  • How to add a column to all columns of a matrix?

Details regarding symbolic broadcasting…

  • Broadcastability must be specified when creating the variable

  • The only shorcut with broadcastable dimensions are: tt.row and tt.col

  • For all others: tt.tensor(dtype, broadcastable=([False or True])*nd)

Differentiation details

>>> gw, gb = tt.grad(cost, [w,b])  
  • tt.grad works symbolically: takes and returns a Theano variable

  • tt.grad can be compared to a macro: it can be applied multiple times

  • tt.grad takes scalar costs only

  • Simple recipe allows to compute efficiently vector x Jacobian and vector x Hessian

  • We are working on the missing optimizations to be able to compute efficently the full Jacobian and Hessian and Jacobian x vector

Old Benchmarks

Example:

  • Multi-layer perceptron

  • Convolutional Neural Networks

  • Misc Elemwise operations

Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr

  • EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks

  • numexpr: similar to Theano, ‘virtual machine’ for elemwise expressions

New Benchmarks

Example (Page 7 and 9):

  • Logistic regression, MLP with 1 and 3 layers

  • Recurrent neural networks

Competitors: Torch7, RNNLM

  • Torch7, RNNLM: specialized libraries written by practitioners specifically for these tasks