Preliminary
Perceptron
- Threshold unit
- “Fires” if the weighted sum of inputs exceeds a threshold
- Soft perceptron
- Using a sigmoid function instead of a hard threshold at the output
- Activation: The function that acts on the weighted combination of inputs (and threshold)
- Affine combination
- Different from a linear combination: because of the bias (threshold) term, an affine map need not send zero to zero (see the sketch after this list)
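A minimal sketch of the two units above (illustrative code, not from the notes; the weights, bias, and inputs are arbitrary): both act on the same affine combination $w \cdot x + b$, one with a hard threshold and one with a sigmoid.

```python
# Minimal sketch (assumed example): a threshold perceptron and its "soft"
# sigmoid variant, both acting on the affine combination w·x + b.
import numpy as np

def perceptron(x, w, b):
    """Fires (outputs 1) if the weighted sum of inputs exceeds the threshold -b."""
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

def soft_perceptron(x, w, b):
    """Same affine combination, passed through a sigmoid instead of a hard threshold."""
    z = np.dot(w, x) + b          # affine: not zero at x = 0 unless b = 0
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 1.0, 1.0])
print(perceptron(x, w, b=-1.0))       # hard 0/1 decision
print(soft_perceptron(x, w, b=-1.0))  # graded value in (0, 1)
```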
Multi-layer perceptron
- Depth
- The length of the longest path from a source (input) to a sink (output); see the sketch after this list
- Deep: Depth greater than 2
- Inputs/Outputs are real or Boolean stimuli
- What can this network compute?
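A toy illustration of the depth definition, treating the network as a directed acyclic graph (the graph below is hypothetical; only the longest-path computation is the point):

```python
# Depth of a layered network viewed as a DAG: the length of the longest
# path from any source (input) to any sink (output).
from functools import lru_cache

# Hypothetical network: node -> list of successor nodes.
edges = {
    "x1": ["h1", "h2"], "x2": ["h1", "h2"],   # inputs (sources)
    "h1": ["h3"], "h2": ["h3"],               # first hidden layer
    "h3": ["y"],                              # second hidden layer
    "y": [],                                  # output (sink)
}

@lru_cache(maxsize=None)
def longest_path_from(node):
    successors = edges[node]
    return 0 if not successors else 1 + max(longest_path_from(s) for s in successors)

depth = max(longest_path_from(source) for source in ("x1", "x2"))
print(depth)  # 3: longest input->output path; "deep" here means depth > 2
```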
Universal Boolean functions
- A perceptron can model any simple binary Boolean gate
- Using weights of 1 or -1 (and an appropriate threshold) to model the gate
- The universal AND gate: $(\bigwedge_{i=1}^{L} X_{i}) \wedge (\bigwedge_{i=L+1}^{N} \bar{X}_{i})$
- The universal OR gate: $(\bigvee_{i=1}^{L} X_{i}) \vee (\bigvee_{i=L+1}^{N} \bar{X}_{i})$
- Cannot compute XOR, which is not linearly separable
- MLPs can compute XOR (see the sketch after this list)
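A hedged sketch of these gates (the helper names and thresholds are assumptions chosen to match the formulas above): a single threshold unit with weights of $\pm 1$ realizes the universal AND/OR gates, while XOR takes two layers (OR and NAND feeding an AND).

```python
# Sketch: universal AND/OR gates as single threshold units, and XOR as a
# two-layer network, which no single perceptron can compute.
import itertools
import numpy as np

def fires(x, w, theta):
    """Threshold unit: outputs 1 if the weighted sum reaches the threshold theta."""
    return int(np.dot(w, x) >= theta)

def universal_and(x, L):
    """(X1 ∧ ... ∧ XL) ∧ (X̄_{L+1} ∧ ... ∧ X̄_N): weights ±1, threshold L."""
    w = [1] * L + [-1] * (len(x) - L)
    return fires(x, w, L)

def universal_or(x, L):
    """(X1 ∨ ... ∨ XL) ∨ (X̄_{L+1} ∨ ... ∨ X̄_N): weights ±1, threshold L - N + 1."""
    w = [1] * L + [-1] * (len(x) - L)
    return fires(x, w, L - len(x) + 1)

def xor(x1, x2):
    """Two layers: OR and NAND units feed an AND unit."""
    h_or = fires([x1, x2], [1, 1], 1)
    h_nand = fires([x1, x2], [-1, -1], -1)
    return fires([h_or, h_nand], [1, 1], 2)

print(universal_and([1, 1, 0], L=2), universal_and([1, 1, 1], L=2))  # 1 0
print(universal_or([0, 0, 1], L=2), universal_or([0, 0, 0], L=2))    # 0 1
for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, "->", xor(x1, x2))                                 # 0 1 1 0
```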
MLPs are universal Boolean functions
- Can compute any Boolean function
- A Boolean function is just a truth table
- So the function can be expressed in disjunctive normal form (DNF), with one AND term per input combination for which the output is 1, e.g.
- $Y = \bar{X}_{1}\bar{X}_{2}X_{3}X_{4}\bar{X}_{5} + \bar{X}_{1}X_{2}\bar{X}_{3}X_{4}X_{5} + \bar{X}_{1}X_{2}X_{3}\bar{X}_{4}\bar{X}_{5} + X_{1}\bar{X}_{2}\bar{X}_{3}\bar{X}_{4}X_{5} + X_{1}\bar{X}_{2}X_{3}X_{4}X_{5} + X_{1}X_{2}\bar{X}_{3}\bar{X}_{4}X_{5}$
- In this case, one hidden neuron is needed per DNF term: 6 neurons in the hidden layer (see the sketch after this list)
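A sketch of that one-hidden-layer construction for the DNF above (the code itself is assumed, not from the notes): one hidden threshold unit per product term, and an OR unit at the output.

```python
# One hidden AND unit per DNF term (6 terms -> 6 hidden neurons), OR at the output.
import itertools
import numpy as np

# Each row encodes one product term over (X1..X5): +1 for Xi, -1 for X̄i.
terms = np.array([
    [-1, -1, +1, +1, -1],
    [-1, +1, -1, +1, +1],
    [-1, +1, +1, -1, -1],
    [+1, -1, -1, -1, +1],
    [+1, -1, +1, +1, +1],
    [+1, +1, -1, -1, +1],
])

def mlp_dnf(x):
    x = np.asarray(x)
    # Hidden layer: a term fires iff every literal matches, i.e. w·x reaches the
    # number of uncomplemented literals in that term.
    thresholds = (terms == 1).sum(axis=1)
    hidden = (terms @ x >= thresholds).astype(int)
    # Output layer: OR of the hidden units.
    return int(hidden.sum() >= 1)

# The network fires on exactly the 6 input patterns listed in the DNF.
print(sum(mlp_dnf(x) for x in itertools.product([0, 1], repeat=5)))  # 6
```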
Need for depth
- A one-hidden-layer MLP is a universal Boolean function
- But the required number of perceptrons can be exponential in the worst case: on the order of $2^N$
- How about depth?
- Will require $3(N-1)$ perceptrons, linear in $N$, to express the same function
- Using the associativity of XOR, these can be arranged in $2\log_2 N$ layers
- e.g. modeling $O = W \oplus X \oplus Y \oplus Z$ (see the sketch after this list)
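A hedged sketch of the depth-based construction (illustrative code): the $N$-input parity is computed by a balanced tree of 2-input XOR gates, each built from 3 perceptrons (OR, NAND, AND), giving $3(N-1)$ units arranged in roughly $2\log_2 N$ perceptron layers.

```python
# N-input parity W ⊕ X ⊕ Y ⊕ Z ⊕ ... as a balanced tree of 2-input XOR gates.
import itertools

def fires(x, w, theta):
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

def xor_gate(a, b):
    """One 2-input XOR = 3 perceptrons (OR, NAND, AND), i.e. 2 layers."""
    h_or = fires([a, b], [1, 1], 1)
    h_nand = fires([a, b], [-1, -1], -1)
    return fires([h_or, h_nand], [1, 1], 2)

def parity(bits):
    """Reduce pairwise, level by level: ~log2(N) XOR stages for N inputs."""
    level = list(bits)
    while len(level) > 1:
        nxt = [xor_gate(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd element carries over to the next stage
            nxt.append(level[-1])
        level = nxt
    return level[0]

for bits in itertools.product([0, 1], repeat=4):   # O = W ⊕ X ⊕ Y ⊕ Z
    assert parity(bits) == sum(bits) % 2
print("tree of 3(N-1) perceptrons reproduces the N-input XOR")
```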
The challenge of depth
- Using only $K$ hidden layers will require $O(2^{CN})$ neurons in the $K$-th layer, where $C = 2^{-(K-1)/2}$ (a toy calculation follows this list)
- A network with fewer than the minimum required number of neurons cannot model the function
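A toy calculation that simply plugs numbers into the bound as stated above ($N$ and the choices of $K$ are arbitrary), to show how the required width explodes when depth is restricted:

```python
# Width required by the stated bound ~ 2^(C·N), with C = 2^(-(K-1)/2).
N = 64
for K in (2, 4, 8):
    C = 2 ** (-(K - 1) / 2)
    print(f"K={K} hidden layers -> on the order of 2^{C * N:.1f} neurons")
# Fewer layers (small K) -> exponentially more neurons: depth trades off against width.
```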
Universal classifiers
- Composing complicated “decision” boundaries
- Using an OR unit to combine multiple regions into a more complex decision boundary
- Can compose arbitrarily complex decision boundaries
- Even using a one-hidden-layer MLP (see the sketch after this list)
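A minimal sketch of boundary composition in 2D (the regions and coordinates are assumed examples): each hidden threshold unit is a half-plane, an AND unit detects each convex region, and an OR unit combines the regions; here the composed decision region is "inside a unit square or inside a triangle".

```python
# Compose a (disconnected) decision boundary: AND of half-planes per convex
# region, OR over the regions.
import numpy as np

def threshold(z):
    return 1 if z >= 0 else 0

# Each half-plane fires when w·(x, y) + b >= 0.
square = [((1, 0), 0), ((-1, 0), 1), ((0, 1), 0), ((0, -1), 1)]   # 0<=x<=1, 0<=y<=1
triangle = [((1, 0), -2), ((0, 1), -2), ((-1, -1), 7)]            # x>=2, y>=2, x+y<=7

def region(point, halfplanes):
    """AND of the half-plane units: 1 only inside the convex region."""
    h = [threshold(np.dot(w, point) + b) for w, b in halfplanes]
    return int(sum(h) >= len(halfplanes))

def classify(point):
    """OR over region detectors: an arbitrarily complex composed boundary."""
    return int(region(point, square) + region(point, triangle) >= 1)

print(classify((0.5, 0.5)))  # 1: inside the square
print(classify((3.0, 3.0)))  # 1: inside the triangle
print(classify((1.5, 1.5)))  # 0: outside both regions
```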
Need for depth
- A naïve one-hidden-layer neural network may require infinitely many hidden neurons
- Constructing a basic unit and adding more layers decreases the number of neurons required
- The number of neurons required in a shallow network is potentially exponential in the dimensionality of the input
Universal approximators
- A one-hidden-layer MLP can model an arbitrary function of a single input (see the sketch after this list)
- MLPs can actually compose arbitrary functions in any number of dimensions
- Even without an “activation” at the output unit
- Activation
- A universal map from the entire domain of input values to the entire range of the output activation
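A hedged sketch of the one-hidden-layer construction for a single input (the target function, grid, and steepness are arbitrary illustrative choices): the function is approximated by a sum of scaled, shifted, steep sigmoids, one hidden unit per grid point, with a plain weighted sum (no activation) at the output.

```python
# Approximate an arbitrary 1D function with one hidden layer of steep sigmoids.
import numpy as np

def target(x):
    return np.sin(3 * x) + 0.5 * x               # arbitrary function to approximate

def shallow_approximator(x, n_units=200, lo=-2.0, hi=2.0):
    centers = np.linspace(lo, hi, n_units)       # one hidden unit per grid point
    steepness = 5.0 * n_units                    # sharp sigmoids approximate steps
    f_vals = target(centers)
    heights = np.diff(f_vals, prepend=f_vals[0]) # output weight = step height at its center
    # Hidden layer: shifted sigmoids; output: plain weighted sum (no activation).
    z = np.clip(steepness * (x[:, None] - centers[None, :]), -30.0, 30.0)
    hidden = 1.0 / (1.0 + np.exp(-z))
    return target(lo) + hidden @ heights

xs = np.linspace(-2.0, 2.0, 1000)
err = np.max(np.abs(shallow_approximator(xs) - target(xs)))
print(f"max approximation error: {err:.3f}")     # shrinks as the grid is refined
```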
Optimal depth and width
- Deeper networks will require far fewer neurons for the same approximation error
- Sufficiency of architecture
- Not all architectures can represent any function
- Continuous activation functions result in graded output at the layer
- To capture information “missed” by the lower layer
Width vs. Activations vs. Depth
- Narrow layers can still pass information to subsequent layers if the activation function is sufficiently graded
- But will require greater depth, to permit later layers to capture patterns
- Capacity of the network
- Information or Storage: how many patterns can it remember
- VC dimension: bounded by the square of the number of weights in the network (see the sketch after this list)
- A more straightforward measure: the largest number of disconnected convex regions it can represent
- A network with insufficient capacity cannot exactly model a function that requires more convex regions than the network can represent
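A small illustrative helper (assumed code, following the note's claim above rather than a precise theorem statement): count the weights of a fully connected MLP and report the quadratic bound on the VC dimension in terms of that count.

```python
# Count trainable weights of a fully connected MLP given its layer widths, and
# quote the VC-dimension bound claimed above: on the order of (number of weights)^2.
def count_weights(layer_widths, bias=True):
    return sum((fan_in + (1 if bias else 0)) * fan_out
               for fan_in, fan_out in zip(layer_widths[:-1], layer_widths[1:]))

widths = [5, 6, 1]          # e.g. the 5-input, 6-hidden-unit DNF network from above
W = count_weights(widths)
print(f"{W} weights -> VC dimension bounded by roughly W^2 = {W ** 2}")
```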