Non-linear Hypothesis Functions
Performing linear regression on a very complex dataset is unwise. Suppose you need to construct a logistic regression hypothesis containing many non-linear terms: \[g(\theta_0 + \theta_1x^2_1 + \theta_2x_1x_2 + \theta_3x_1x_3 + \theta_4x^2_2 + \theta_5x_2x_3 + \theta_6x^2_3)\]
That already involves six quadratic terms. In fact, with enough polynomial terms you may be able to find a decision boundary that separates the positive and negative examples, and with only two features this approach can work quite well, because you can include every combination of \(x_1\) and \(x_2\) in the polynomial. Many complex machine learning problems, however, involve far more than two features. We discussed house-price prediction earlier; suppose that instead of a regression problem we now face a classification problem about housing: you know many characteristics of a house and want to predict the probability that it will be sold within the next six months. We could come up with a great many features, perhaps a hundred or more per house, and if we include all the second-order terms for such a problem, even restricting ourselves to quadratic terms, the resulting polynomial has a huge number of terms.
The number of degree-\(r\) polynomial terms that can be formed from \(n\) features is: \(\frac{(n+r-1)!}{r!\,(n-1)!}\)
For example, with 100 features the number of quadratic terms is \(\frac{(100+2-1)!}{2!\,(100-1)!} = \frac{101 \cdot 100}{2} = 5050\).
We can approximate the growth of the number of new features we get with all quadratic terms as \(O(n^2/2)\). If we also wanted to include all cubic terms in the hypothesis, the number of features would grow asymptotically as \(O(n^3)\). These are very steep growth rates: as the number of features increases, the number of quadratic or cubic features grows very rapidly and quickly becomes impractical.
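As a quick sanity check on these counts, here is a minimal Python sketch (the helper name `num_terms` is ours, not from the course):

```python
from math import factorial

def num_terms(n, r):
    """Number of degree-r polynomial terms in n features: (n+r-1)! / (r! (n-1)!)."""
    return factorial(n + r - 1) // (factorial(r) * factorial(n - 1))

print(num_terms(100, 2))  # 5050 quadratic terms, roughly n^2/2
print(num_terms(100, 3))  # 171700 cubic terms, roughly n^3/6 -- impractical to enumerate
```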
Neural networks offer a practical way to learn complex hypotheses even when the number of features is large.
Neurons and the Brain
Neural network models mimic the way our brain learns.
Our brain appears to learn all of its different functions with a single learning algorithm. In experiments, scientists cut the nerve connecting the ear to the part of the brain responsible for hearing and rewired the signal from a light-sensing organ to that same area; the hearing area then learned to see.
This is called neuroplasticity, and there are many examples demonstrating it.
Model Representation
Let's examine how we will represent a hypothesis function using neural networks.
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals (called "spikes") and channel them to outputs (axons).
In our model, our dendrites are like the input features \((x_1 \cdots x_n)\), and the output is the result of our hypothesis function.
In this model our \(x_0\) input node is sometimes called the "bias unit." It is always equal to 1.
In neural networks, we use the same logistic function as in classification: \(\frac{1}{1+e^{−θ^Tx}}\). In neural networks however we sometimes call it a sigmoid (logistic) activation function.
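As a reference point, a minimal NumPy sketch of this activation function (our own helper, not course code):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) activation: 1 / (1 + e^-z), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0, 0.5, 1]
```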
Our "theta" parameters are sometimes instead called "weights" in the neural networks model.
Visually, a simplistic representation looks like: \[\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}→[ ]→h_θ(x)\]
Our input nodes (layer 1) feed into another node (layer 2), which outputs the hypothesis function.
The first layer is called the "input layer" and the final layer the "output layer," which gives the final computed value of the hypothesis.
We can have intermediate layers of nodes between the input and output layers called the "hidden layer." We label these intermediate or "hidden" layer nodes \(a^{(2)}_0 \cdots a^{(2)}_n\) and call them "activation units."
\(a^{(j)}_i\) = "activation" of unit \(i\) in layer \(j\)
\(Θ^{(j)}\) = matrix of weights controlling the function mapping from layer \(j\) to layer \(j+1\)
If we had one hidden layer, it would look visually something like:
\[\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}→\begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix}→h_θ(x)\]
The values for each of the "activation" nodes is obtained as follows:
\[a^{(2)}_1=g(Θ^{(1)}_{10}x_0+Θ^{(1)}_{11}x_1+Θ^{(1)}_{12}x_2+Θ^{(1)}_{13}x_3)\] \[a^{(2)}_2=g(Θ^{(1)}_{20}x_0+Θ^{(1)}_{21}x_1+Θ^{(1)}_{22}x_2+Θ^{(1)}_{23}x_3)\] \[a^{(2)}_3=g(Θ^{(1)}_{30}x_0+Θ^{(1)}_{31}x_1+Θ^{(1)}_{32}x_2+Θ^{(1)}_{33}x_3)\] \[h_\theta(x) = a^{(3)}_1 = g(Θ^{(2)}_{10}a_0^{(2)}+Θ^{(2)}_{11}a^{(2)}_1+Θ^{(2)}_{12}a_2^{(2)}+Θ^{(2)}_{13}a_3^{(2)})\]
This is saying that we compute our activation nodes by using a \(3×4\) matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix \(Θ^{(2)}\) containing the weights for our second layer of nodes.
Each layer gets its own matrix of weights, \(Θ^{(j)}\).
The dimensions of these matrices of weights is determined as follows: If network has \(s_j\) units in layer \(j\) and \(s_{j+1}\) units in layer \(j+1\), then \(Θ^{(j)}\) will be of dimension \(s_{j+1}×(s_j+1)\).
The \(+1\) comes from the addition in \(Θ^{(j)}\) of the "bias nodes," \(x_0\) and \(Θ^{(j)}_0\). In other words the output nodes will not include the bias nodes while the inputs will.
Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. Dimension of \(Θ^{(1)}\) is going to be \(4×3\) where \(s_j=2\) and \(s_{j+1}=4\), so \(s_{j+1}×(s_j+1)=4×3\).
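A small NumPy sketch of this dimension rule (the random values are placeholders; only the shape matters here):

```python
import numpy as np

s_j, s_next = 2, 4                         # units in layer j and layer j+1
Theta1 = np.random.randn(s_next, s_j + 1)  # the +1 column multiplies the bias unit x_0 = 1
print(Theta1.shape)                        # (4, 3)
```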
Vectorized implementation
We're going to define a new variable \(z^{(j)}_k\) that encompasses the parameters inside our \(g\) function. If we replace those parameter sums in our previous example with the variable \(z\), we get: \[a^{(2)}_1=g(z^{(2)}_1)\] \[a^{(2)}_2=g(z^{(2)}_2)\] \[a^{(2)}_3=g(z^{(2)}_3)\]
In other words, for layer \(j=2\) and node \(k\), the variable z will be:
\(z^{(2)}_k=Θ^{(1)}_{k,0}x_0+Θ^{(1)}_{k,1}x_1+⋯+Θ^{(1)}_{k,n}x_n\)
The vector representation of \(x\) and \(z^{(j)}\) is:
\[x=\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \qquad z^{(j)}=\begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_n^{(j)} \end{bmatrix}\]
Setting \(x=a^{(1)}\), we can rewrite the equation as: \[z^{(j)} = Θ^{(j−1)}a^{(j−1)}\]
We are multiplying our matrix \(Θ^{(j−1)}\) with dimensions \(s_j×(n+1)\) (where \(s_j\) is the number of our activation nodes) by our vector \(a^{(j−1)}\) with height \((n+1)\). This gives us our vector \(z^{(j)}\) with height \(s_j\). Now we can get a vector of our activation nodes for layer \(j\) as follows: \[a^{(j)}=g(z^{(j)})\]
Where our function \(g\) can be applied element-wise to our vector \(z^{(j)}\).
We can then add a bias unit (equal to 1) to layer \(j\) after we have computed \(a^{(j)}\). This will be element \(a^{(j)}_0\) and will be equal to 1.
To compute our final hypothesis, let's first compute another z vector: \[z^{(j+1)}=Θ^{(j)}a^{(j)}\]
We get this final \(z\) vector by multiplying the next theta matrix after \(Θ^{(j−1)}\) with the values of all the activation nodes we just got.
This last theta matrix (\(Θ^{(j)}\)) will have only one row so that our result is a single number. We then get our final result with: \[h_Θ(x)=a^{(j+1)}=g(z^{(j+1)})\]
Notice that in this last step, between layer \(j\) and layer \(j+1\), we are doing exactly the same thing as we did in logistic regression.
Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
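Putting the vectorized steps together, here is a minimal forward-propagation sketch in NumPy; the function `forward` and the example weight shapes are our own assumptions following the dimension rule above, not course-supplied code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward propagation for a single example x (given without its bias term).

    Thetas is a list of weight matrices; the matrix mapping layer j to layer j+1
    has shape s_{j+1} x (s_j + 1).
    """
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))  # prepend the bias unit a_0 = 1
        z = Theta @ a                   # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                  # a^(j+1) = g(z^(j+1))
    return a                            # h_Theta(x)

# Example: 3 input features, one hidden layer with 3 units, a single output unit.
Theta1 = np.random.randn(3, 4)  # layer 1 -> layer 2
Theta2 = np.random.randn(1, 4)  # layer 2 -> layer 3 (output)
print(forward(np.array([1.0, 0.5, -1.2]), [Theta1, Theta2]))
```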
Examples
\(x_1\) AND \(x_2\)
A simple example of applying neural networks is predicting \(x_1\) AND \(x_2\), which is the logical 'and' operator and is true only if both \(x_1\) and \(x_2\) are \(1\).
The graph of our functions will look like: \[\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}→[g(z^{(2)})]→h_Θ(x)\]
Remember that \(x_0\) is our bias variable and is always 1.
Let's set our first theta matrix as: \[Θ^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}\]
This will cause the output of our hypothesis to only be positive if both \(x_1\) and \(x_2\) are \(1\). In other words: \(h_Θ(x)=g(−30+20x_1+20x_2)\)
\[x_1=0 \text{ and } x_2=0 \text{ then } g(-30) \approx 0\] \[x_1=0 \text{ and } x_2=1 \text{ then } g(-10) \approx 0\] \[x_1=1 \text{ and } x_2=0 \text{ then } g(-10) \approx 0\] \[x_1=1 \text{ and } x_2=1 \text{ then } g(10) \approx 1\]
So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates.
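A quick numeric check of the AND unit (a sketch using the sigmoid defined earlier):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([-30.0, 20.0, 20.0])  # weights of the AND unit

for x1 in (0, 1):
    for x2 in (0, 1):
        h = sigmoid(Theta1 @ np.array([1.0, x1, x2]))  # x_0 = 1 is the bias unit
        print(x1, x2, round(h, 4))
# ~0 for every input pair except x1 = x2 = 1, which gives ~1: the logical AND.
```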
NOR, OR, XNOR
The \(Θ^{(1)}\) matrices for \(AND\), \(NOR\), and \(OR\) are:
\(AND\): \[Θ^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}\]
\(NOR\): \[Θ^{(1)} = \begin{bmatrix} 10 & -20 & -20 \end{bmatrix}\]
\(OR\): \[Θ^{(1)}=\begin{bmatrix} -10 & 20 & 20 \end{bmatrix}\]
We can combine these to get the \(XNOR\) logical operator (which gives 1 if \(x_1\) and \(x_2\) are both 0 or both 1). \[\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}→\begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ \end{bmatrix}→[a^{(3)}]→h_Θ(x)\]
For the transition between the first and second layer, we'll use a \(Θ^{(1)}\) matrix that combines the values for \(AND\) and \(NOR\): \[Θ^{(1)}=\begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}\] For the transition between the second and third layer, we'll use a \(Θ^{(2)}\) matrix that uses the value for \(OR\): \[Θ^{(2)}=\begin{bmatrix} -10 & 20 & 20 \end{bmatrix}\]
Let's write out the values for all our nodes: \(a^{(2)}=g(Θ^{(1)}⋅x)\) \(a^{(3)}=g(Θ^{(2)}⋅a^{(2)})\) \(h_Θ(x)=a^{(3)}\)
And there we have the \(XNOR\) operator using one hidden layer!
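The same construction as a runnable sketch, stacking the AND and NOR units in the hidden layer and the OR unit in the output layer (the helper `xnor` is our own naming):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30.0,  20.0,  20.0],   # AND unit
                   [ 10.0, -20.0, -20.0]])  # NOR unit
Theta2 = np.array([[-10.0,  20.0,  20.0]])  # OR unit

def xnor(x1, x2):
    a1 = np.array([1.0, x1, x2])                        # input layer with bias
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))  # hidden layer with bias
    return sigmoid(Theta2 @ a2)[0]                      # output layer

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2), 4))
# Prints ~1, ~0, ~0, ~1 for (0,0), (0,1), (1,0), (1,1): the XNOR truth table.
```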
Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four final resulting classes:
\[\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} → \begin{bmatrix} a_0^{(2)} \\ a_1^{(2)} \\ a_2^{(2)} \\ \vdots \\ \end{bmatrix} → \begin{bmatrix} a_0^{(3)} \\ a_1^{(3)} \\ a_2^{(3)} \\ \vdots \\ \end{bmatrix} → \dots → \begin{bmatrix} h_\theta(x)_1 \\ h_\theta(x)_2 \\ h_\theta(x)_3 \\ h_\theta(x)_4 \\ \end{bmatrix}\]
Our final layer of nodes, when multiplied by its theta matrix, will result in another vector, on which we will apply the \(g()\) logistic function to get a vector of hypothesis values.
Our resulting hypothesis for one set of inputs may look like: \[h_Θ(x)=\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}\]
In which case our resulting class is the third one down, or \(h_Θ(x)_3\).
We can define our set of resulting classes as \(y\): \[y^{(i)}=\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix},\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}\]
Our final value of our hypothesis for a set of inputs will be one of the elements in \(y\).
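In practice the network's outputs are probabilities rather than exact 0/1 vectors, so the predicted class is simply the position of the largest output. A sketch with made-up numbers:

```python
import numpy as np

h = np.array([0.10, 0.05, 0.92, 0.20])  # hypothetical outputs h_Theta(x)_1 .. h_Theta(x)_4

predicted_class = int(np.argmax(h)) + 1  # 1-indexed: class 3 here
print(predicted_class)
```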