Recipes
ComputationGraphs includes a few recipes that add to a computation graph the nodes needed to perform a "common" computation or to implement a "common" algorithm. These recipes are functions of the general form
output_nodes = recipe!(graph, input_nodes...; keyword_parameters)
where
- graph is a computation graph that is updated "in-place" by adding to it all the nodes needed for the recipe,
- input_nodes is typically a (named) tuple of nodes from the existing computation graph to which the recipe will be applied,
- keyword_parameters specify options for the recipe, and
- output_nodes is typically a (named) tuple of nodes generated by the recipe.

The recipe may add many more "intermediate" nodes to graph, but only those in output_nodes contain the recipe's "final outputs".
The functions D and hessian discussed in [Differentiation] are, in fact, recipes: their input nodes specify the expression and variable to differentiate, and their single output node corresponds to the derivative.
Building upon these recipes, ComputationGraphs provides more complex recipes, e.g., to implement common optimization algorithms and to perform inference and training with neural networks.
Optimization recipes
gradDescent! is a recipe for performing the computations needed by the classical gradient descent algorithm to minimize a (scalar-valued) loss function $J(\theta)$ by adjusting a set of optimization parameters $\theta$.
\[ \theta^+ = \theta - \eta\, \nabla_\theta J(\theta)\]
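As a sanity check on this update rule, the following NumPy sketch (independent of ComputationGraphs) applies it to the quadratic loss $J(x) = \| A\, x - b \|^2$, whose gradient is $\nabla_x J(x) = 2 A^\top (A\,x - b)$; the values of $A$, $b$, the learning rate, and the iteration count mirror the worked example later in this section:

```python
import numpy as np

# Quadratic loss J(x) = ||A x - b||^2, with gradient 2 A^T (A x - b)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])
b = np.full(4, 2.0)
x = np.ones(3)         # initial guess
eta = 0.001            # learning rate

def loss(x):
    return float(np.sum((A @ x - b) ** 2))

initial = loss(x)
for _ in range(100):
    grad = 2.0 * A.T @ (A @ x - b)   # gradient of the loss at x
    x = x - eta * grad               # one gradient-descent update
final = loss(x)
print(initial, final)
```

Each iteration is exactly the update $\theta^+ = \theta - \eta\, \nabla_\theta J(\theta)$; the recipe below builds the same computation as graph nodes so that it can be re-executed efficiently.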
The recipe used to implement gradient descent is called as follows:
(; next_theta, eta, gradients) = gradDescent!(graph; loss, theta)
where the input parameters include
- loss: a (scalar-valued) computation node that corresponds to the loss function $J(\theta)$,
- theta: a (named) tuple of variable nodes that correspond to the optimization parameters $\theta$,

and the returned values include
- eta: a variable node that can be used to set the learning rate $\eta$,
- next_theta: a (named) tuple of computation nodes that holds the value $\theta^+$ of the optimization parameters after one gradient-descent iteration,
- gradients: a (named) tuple of computation nodes that hold the gradients of the loss function with respect to the different variables in theta.
To illustrate the use of this recipe, suppose that we want to minimize
\[ J(x) = \| A\, x -b \|^2\]
with respect to $x$. To construct the computation graph that accomplishes this using gradient descent, we would use:
using ComputationGraphs
graph = ComputationGraph(Float64)
A = variable(graph, 4, 3)
x = variable(graph, 3)
b = variable(graph, 4)
loss = @add graph norm2(A*x-b)
theta = (;x,)
(; next_theta, eta, gradients) = gradDescent!(graph; loss, theta)
With this graph in place, the actual optimization can be carried out as follows:
using BenchmarkTools
# Set fixed parameters
set!(graph, A, [1.0 2.0 3.0; 4.0 5.0 6.0; 7.0 8.0 9.0; 10.0 11.0 12.0])
set!(graph, b, [2.0, 2.0, 2.0, 2.0])
# Set learning rate
set!(graph, eta, 0.001)
# Initialize optimization parameter
set!(graph, x, [1.0, 1.0, 1.0])
println("initial loss: ", get(graph,loss))
# Gradient descent loop
nIterations=100
for i in 1:nIterations
compute!(graph, next_theta) # compute next value of theta
copyto!(graph, theta, next_theta) # execute update
end
println("final loss: ", get(graph,loss))
bmk = @benchmark for i in 1:$nIterations
compute!($graph, $next_theta) # compute next value of theta
copyto!($graph, $theta, $next_theta) # execute update
end setup=(set!(graph, x, [1.0, 1.0, 1.0]))
println("final loss: ", get(graph,loss))
initial loss: fill(1630.0)
final loss: fill(2.1298844507367067)
final loss: fill(2.1298844507367067)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 57.276 μs … 137.836 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 57.967 μs ┊ GC (median): 0.00%
Time (mean ± σ): 58.594 μs ± 2.718 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇██▆▃▁▁ ▁▂▂▂▁ ▂
█████████▇▅▄▃▁▅▇▆▇▆▅▃▄▄▃▄▃▄▃▁▁▁▁▁▁▃▃▃▃▁▄▃▄▇█████▇▇▅▆▆▆▅▅▅▆▆▄ █
57.3 μs Histogram: log(frequency) by time 67.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The recipe adam! for the more sophisticated Adam algorithm can be found in the Examples.
Neural-network recipes
denseChain! is a recipe for performing inference and training of a dense feedforward neural network of the form:
x[1] = input
x[2] = activation(W[1] * x[1] + b[1])
...
x[N-1] = activation(W[N-2] * x[N-2] + b[N-2])
output = W[N-1] * x[N-1] + b[N-1] # no activation in the last layer
with training based on a loss function of the form
loss = some_loss_function(output-reference)
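The layered recursion above can be sketched outside the package as well. The following NumPy fragment shows one plausible forward pass with relu hidden layers and a linear last layer; the layer widths, random initialization, and batch layout are illustrative assumptions, not denseChain!'s internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

nNodes = [1, 20, 20, 20, 2]   # layer widths, from input to output
# One (W, b) pair per layer transition
Ws = [rng.standard_normal((m, n)) for n, m in zip(nNodes[:-1], nNodes[1:])]
bs = [rng.standard_normal((m, 1)) for m in nNodes[1:]]

def forward(x):
    # hidden layers with activation
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = relu(W @ x + b)
    # no activation in the last layer
    return Ws[-1] @ x + bs[-1]

batch = rng.uniform(0.0, 2.0 * np.pi, size=(1, 5))  # batch of 5 scalar inputs
out = forward(batch)
print(out.shape)  # one column of 2 outputs per input
```

The recipe builds this same chain of matrix products and activations as graph nodes, once per batch size requested.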
The recipe used for this is called as follows:
(; inference, training, theta) = denseChain!(graph;
    nNodes, inferenceBatchSize, trainingBatchSize, activation, loss)
where the input parameters include
- nNodes::Vector{Int}: vector with the number of nodes in each layer, starting from the input and ending at the output layer.
- inferenceBatchSize::Int=1: number of inputs in each inference batch. When inferenceBatchSize=0, no nodes will be created for inference.
- trainingBatchSize::Int=0: number of inputs in each training batch. When trainingBatchSize=0, no nodes will be created for training.
- activation::Function=ComputationGraphs.relu: activation function. Use the identity function if no activation is desired.
- loss::Symbol=:mse: desired type of loss function, among the options:
  + :sse = sum of squared errors
  + :mse = mean squared error (i.e., sse normalized by the error size)
  + :huber = Huber function on the error
  + :mhuber = Huber function on the error, normalized by the error size
and the returned tuple includes
inference::NamedTuple
: named tuple with the inference nodes: +input
NN input for inference +output
NN output for inference WheninferenceBatchSize=0
this tuple is returned empty
+ training::NamedTuple
: named tuple with the training nodes: + input
NN input for training + output
NN output for training + reference
NN desired output for training + loss
NN loss for training When trainingBatchSize=0
this tuple is returned empty
theta::NamedTuple
: named tuple with the NN parameters (all the matrices W and b)
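For reference, the four loss options can be read as follows. This NumPy sketch is one plausible interpretation; the exact normalization and the Huber threshold used by denseChain! are not spelled out here, so treat both as assumptions:

```python
import numpy as np

def sse(e):
    # sum of squared errors
    return float(np.sum(e ** 2))

def mse(e):
    # sse normalized by the error size
    return sse(e) / e.size

def huber(e, delta=1.0):
    # quadratic near zero, linear in the tails (threshold delta assumed)
    a = np.abs(e)
    return float(np.sum(np.where(a <= delta,
                                 0.5 * a ** 2,
                                 delta * (a - 0.5 * delta))))

def mhuber(e, delta=1.0):
    # Huber loss normalized by the error size
    return huber(e, delta) / e.size

e = np.array([0.5, -2.0])
print(sse(e), mse(e), huber(e), mhuber(e))
```

The Huber variants behave like :sse/:mse for small errors but grow only linearly for large ones, which makes training less sensitive to outliers.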
To illustrate the use of this recipe, suppose that we want to construct a neural network whose input is an angle in the [0, 2*pi] range and whose two outputs return the sine and cosine of that angle.
To accomplish this, we will use a network with 1 input, 2 outputs, a few hidden layers, and relu activation functions.
We want the computation graph to support:
- inference, i.e., compute the output for a given input;
- training, i.e., minimize the loss for a given set of inputs and desired outputs.
For training we will use a large batch size, but for inference we will only provide one input at a time.
We start by using denseChain! to construct a graph that performs all the computations needed to do inference and compute the (training) loss function for the network.
using ComputationGraphs, Random
graph=ComputationGraph(Float32)
hiddenLayers=[20,20,20]
(; inference, training, theta)=denseChain!(graph;
nNodes=[1,hiddenLayers...,2], inferenceBatchSize=1, trainingBatchSize=3000,
activation=ComputationGraphs.relu, loss=:mse)
We now generate random (but repeatable) training data:
Random.seed!(0)
input=(2*pi).*rand(Float32,size(training.input))
reference=vcat(sin.(input),cos.(input))
set!(graph,training.input,input)
set!(graph,training.reference,reference)
Not surprisingly, with a random initialization of the network weights, the loss is large:
Random.seed!(0)
for k in eachindex(theta)
set!(graph,theta[k],randn(Float32,size(theta[k])))
end
loss=get(graph,training.loss)
println("loss = ",loss)
loss = fill(198.47896f0)
However, we can use an optimization recipe, like gradDescent! or adam!, to train the network. This can be found in the example Neural network training.