Recipes
ComputationGraphs includes a few recipes that add to a computation graph the nodes needed to perform a "common" computation or to implement a "common" algorithm. These recipes are functions of the general form
output_nodes = recipe!(graph, input_nodes...; keyword_parameters)
where
- graph is a computation graph that is updated "in-place" by adding to it all the nodes needed for the recipe,
- input_nodes is typically a (named) tuple of nodes from the existing computation graph to which the recipe will be applied,
- keyword_parameters specify options for the recipe, and
- output_nodes is typically a (named) tuple of nodes generated by the recipe.

The recipe may add many more "intermediate" nodes to graph, but only those in output_nodes contain the recipe's "final outputs".
The functions D and hessian discussed in [Differentiation] are, in fact, recipes: their input nodes specify the expression and variable to differentiate, and their single output node corresponds to the derivative.
Building upon these recipes, ComputationGraphs provides more complex recipes, e.g., to implement common optimization algorithms and to perform inference and training with neural networks.
Optimization recipes
gradDescent! is a recipe for performing the computations needed by the classical gradient descent algorithm to minimize a (scalar-valued) loss function $J(\theta)$ by adjusting a set of optimization parameters $\theta$.
\[ \theta^+ = \theta - \eta\, \nabla_\theta J(\theta)\]
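As a sanity check on this update rule, the following NumPy sketch (independent of ComputationGraphs) applies it to the quadratic loss $J(x) = \| A\, x - b \|^2$, whose gradient is $\nabla_x J(x) = 2 A^\top (A\,x - b)$; the values of $A$, $b$, the learning rate, and the iteration count mirror the worked example later in this section:

```python
import numpy as np

# Quadratic loss J(x) = ||A x - b||^2, with gradient 2 A^T (A x - b)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])
b = np.full(4, 2.0)
x = np.ones(3)         # initial guess
eta = 0.001            # learning rate

def loss(x):
    return float(np.sum((A @ x - b) ** 2))

initial = loss(x)
for _ in range(100):
    grad = 2.0 * A.T @ (A @ x - b)   # gradient of the loss at x
    x = x - eta * grad               # one gradient-descent update
final = loss(x)
print(initial, final)
```

Each iteration is exactly the update $\theta^+ = \theta - \eta\, \nabla_\theta J(\theta)$; the recipe below builds the same computation as graph nodes so that it can be re-executed efficiently.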
The recipe used to implement gradient descent is called as follows:
(; next_theta, eta, gradients) = gradDescent!(graph; loss, theta)
where the input parameters include
- loss: a (scalar-valued) computation node that corresponds to the loss function $J(\theta)$,
- theta: a (named) tuple of variable nodes that correspond to the optimization parameters $\theta$,

and the returned values include
- eta: a variable node that can be used to set the learning rate $\eta$,
- next_theta: a (named) tuple of computation nodes that holds the value $\theta^+$ of the optimization parameters after one gradient-descent iteration,
- gradients: a (named) tuple of computation nodes that hold the gradients of the loss function with respect to the different variables in theta.
To illustrate the use of this recipe, suppose that we want to minimize
\[ J(x) = \| A\, x -b \|^2\]
with respect to $x$. To construct the computation graph that accomplishes this using gradient descent, we would use:
using ComputationGraphs
graph = ComputationGraph(Float64)
A = variable(graph, 4, 3)
x = variable(graph, 3)
b = variable(graph, 4)
loss = @add graph norm2(A*x-b)
theta = (;x,)
(; next_theta, eta, gradients) = gradDescent!(graph; loss, theta)
With this graph in place, the actual optimization can be carried out as follows:
using BenchmarkTools
# Set fixed parameters
set!(graph, A, [1.0 2.0 3.0; 4.0 5.0 6.0; 7.0 8.0 9.0; 10.0 11.0 12.0])
set!(graph, b, [2.0, 2.0, 2.0, 2.0])
# Set learning rate
set!(graph, eta, 0.001)
# Initialize optimization parameter
set!(graph, x, [1.0, 1.0, 1.0])
println("initial loss: ", get(graph,loss))
# Gradient descent loop
nIterations=100
for i in 1:nIterations
compute!(graph, next_theta) # compute next value of theta
copyto!(graph, theta, next_theta) # execute update
end
println("final loss: ", get(graph,loss))
bmk = @benchmark for i in 1:$nIterations
compute!($graph, $next_theta) # compute next value of theta
copyto!($graph, $theta, $next_theta) # execute update
end setup=(set!(graph, x, [1.0, 1.0, 1.0]))
println("final loss: ", get(graph,loss))
initial loss: fill(1630.0)
final loss: fill(2.1298844507367067)
final loss: fill(2.1298844507367067)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 57.276 μs … 137.836 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 57.967 μs ┊ GC (median): 0.00%
Time (mean ± σ): 58.594 μs ± 2.718 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇██▆▃▁▁ ▁▂▂▂▁ ▂
█████████▇▅▄▃▁▅▇▆▇▆▅▃▄▄▃▄▃▄▃▁▁▁▁▁▁▃▃▃▃▁▄▃▄▇█████▇▇▅▆▆▆▅▅▅▆▆▄ █
57.3 μs Histogram: log(frequency) by time 67.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The recipe adam! for the more sophisticated Adam algorithm can be found in the Examples.
Neural-network recipes
denseChain! is a recipe for performing inference and training of a dense feedforward neural network of the form:
x[1] = input
x[2] = activation(W[1] * x[1] + b[1])
...
x[N-1] = activation(W[N-2] * x[N-2] + b[N-2])
output = W[N-1] * x[N-1] + b[N-1] # no activation in the last layer
with training based on a loss function of the form
loss = some_loss_function(output-reference)
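The layered recursion above can be sketched outside the package as well. The following NumPy fragment shows one plausible forward pass with relu hidden layers and a linear last layer; the layer widths, random initialization, and batch layout are illustrative assumptions, not denseChain!'s internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

nNodes = [1, 20, 20, 20, 2]   # layer widths, from input to output
# One (W, b) pair per layer transition
Ws = [rng.standard_normal((m, n)) for n, m in zip(nNodes[:-1], nNodes[1:])]
bs = [rng.standard_normal((m, 1)) for m in nNodes[1:]]

def forward(x):
    # hidden layers with activation
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = relu(W @ x + b)
    # no activation in the last layer
    return Ws[-1] @ x + bs[-1]

batch = rng.uniform(0.0, 2.0 * np.pi, size=(1, 5))  # batch of 5 scalar inputs
out = forward(batch)
print(out.shape)  # one column of 2 outputs per input
```

The recipe builds this same chain of matrix products and activations as graph nodes, once per batch size requested.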
The recipe used for this is called as follows:
(; inference, training, theta) = denseChain!(graph;
    nNodes, inferenceBatchSize, trainingBatchSize, activation, loss)
where the input parameters include
- nNodes::Vector{Int}: vector with the number of nodes in each layer, starting from the input and ending at the output layer.
- inferenceBatchSize::Int=1: number of inputs in each inference batch. When inferenceBatchSize=0, no nodes will be created for inference.
- trainingBatchSize::Int=0: number of inputs in each training batch. When trainingBatchSize=0, no nodes will be created for training.
- activation::Function=ComputationGraphs.relu: activation function. Use the identity function if no activation is desired.
- loss::Symbol=:mse: desired type of loss function, among the options:
  + :sse = sum of squared errors
  + :mse = mean squared error (i.e., sse normalized by the error size)
  + :huber = Huber function on the error
  + :mhuber = Huber function on the error, normalized by the error size
and the returned tuple includes
inference::NamedTuple
: named tuple with the inference nodes: +input
NN input for inference +output
NN output for inference WheninferenceBatchSize=0
this tuple is returned empty
+ training::NamedTuple
: named tuple with the training nodes: + input
NN input for training + output
NN output for training + reference
NN desired output for training + loss
NN loss for training When trainingBatchSize=0
this tuple is returned empty
theta::NamedTuple
: named tuple with the NN parameters (all the matrices W and b)
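For reference, the four loss options can be read as follows. This NumPy sketch is one plausible interpretation; the exact normalization and the Huber threshold used by denseChain! are not spelled out here, so treat both as assumptions:

```python
import numpy as np

def sse(e):
    # sum of squared errors
    return float(np.sum(e ** 2))

def mse(e):
    # sse normalized by the error size
    return sse(e) / e.size

def huber(e, delta=1.0):
    # quadratic near zero, linear in the tails (threshold delta assumed)
    a = np.abs(e)
    return float(np.sum(np.where(a <= delta,
                                 0.5 * a ** 2,
                                 delta * (a - 0.5 * delta))))

def mhuber(e, delta=1.0):
    # Huber loss normalized by the error size
    return huber(e, delta) / e.size

e = np.array([0.5, -2.0])
print(sse(e), mse(e), huber(e), mhuber(e))
```

The Huber variants behave like :sse/:mse for small errors but grow only linearly for large ones, which makes training less sensitive to outliers.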
To illustrate the use of this recipe, suppose that we want to construct a neural network whose input is an angle in the [0, 2*pi] range and whose two outputs return the sine and cosine of that angle.
To accomplish this, we will use a network with 1 input, 2 outputs, a few hidden layers, and relu activation functions.
We want the computation graph to support:
- inference, i.e., compute the output for a given input;
- training, i.e., minimize the loss for a given set of inputs and desired outputs.
For training we will use a large batch size, but for inference we will only provide one input at a time.
We start by using denseChain! to construct a graph that performs all the computations needed to do inference and compute the (training) loss function for the network.
using ComputationGraphs, Random
graph=ComputationGraph(Float32)
hiddenLayers=[20,20,20]
(; inference, training, theta)=denseChain!(graph;
nNodes=[1,hiddenLayers...,2], inferenceBatchSize=1, trainingBatchSize=3000,
activation=ComputationGraphs.relu, loss=:mse)
We now generate random (but repeatable) training data:
Random.seed!(0)
input=(2*pi).*rand(Float32,size(training.input))
reference=vcat(sin.(input),cos.(input))
set!(graph,training.input,input)
set!(graph,training.reference,reference)
Not surprisingly, with a random initialization of the network weights, the loss is large:
Random.seed!(0)
for k in eachindex(theta)
set!(graph,theta[k],randn(Float32,size(theta[k])))
end
loss=get(graph,training.loss)
println("loss = ",loss)
loss = fill(198.47896f0)
However, we can use an optimization recipe, like gradDescent! or adam!, to train the network. This can be found in the example Neural network training.