
[3.1.6]Dive into Deep Learning : exercise answers

 

Reference: 3.1. Linear Regression — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

This post contains my personal solutions to the exercise problems from the linear regression section of the book. Although I tried my best to come up with the most logical solutions, there are bound to be flaws in them (I am currently just a master's student).

 

[1-1]

If we view the equation as a loss function, as we did in the linear regression section,

$J(b) =  \sum_{i}^{n}(x_{i}-b)^2$

Thus we have to find $b$ that satisfies $\partial_{b}J(b)=0$ 

$\partial_{b}J(b)=-2\sum_{i}^{n}(x_{i}-b)=0$

 

$\therefore b=\frac{1}{n}\sum_{i}^{n} x_{i}$
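As a quick sanity check (a small numpy sketch I added, not part of the book), we can confirm numerically that the sample mean minimizes the sum of squared deviations:

import numpy as np

x = np.random.randn(1000) * 3 + 5          # synthetic data
bs = np.linspace(x.min(), x.max(), 2001)   # candidate values of b
losses = [np.sum((x - b) ** 2) for b in bs]
print(bs[np.argmin(losses)], x.mean())     # the two values almost coincide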

 

[1-2]

By the given equation above, we can view $x_{i}$ as the observed data and $b$ as the prediction. We can rewrite this as $x_{i}=b+\epsilon$ (where $\epsilon \sim N(0, \sigma ^2)$).

To maximize the likelihood of $b$ (MLE), we should minimize $\frac{1}{2\sigma ^2}\sum_{i}(x_{i}-b)^2$, which can be derived from the negative log-likelihood exactly as in the original linear regression setup. Since this objective differs from the squared loss in 1-1 only by the positive constant $\frac{1}{2\sigma^2}$, the maximum likelihood estimate is again the sample mean.

 

[1-3]

The squared version is MSE (Mean Squared Error), and it is the most commonly used loss function in linear regression. It is useful because it is differentiable at every point. However, MAE (Mean Absolute Error), as shown in this problem, is not differentiable when $x_{i}=b$. Although it has this downside, it is better at handling outliers than MSE because it doesn't square the error.

Because MAE cannot simply be differentiated and set to zero, there is no closed-form solution for $b$ obtained that way; in fact, the minimizer of $\sum_{i}|x_{i}-b|$ turns out to be the median of the $x_{i}$ rather than the mean.
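A small numerical sketch (my own addition) illustrates the contrast: the squared loss is minimized by the mean, while the absolute loss is minimized by the median, which is exactly why MAE is more robust to outliers.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one large outlier
bs = np.linspace(0, 110, 11001)
mse_b = bs[np.argmin([np.sum((x - b) ** 2) for b in bs])]
mae_b = bs[np.argmin([np.sum(np.abs(x - b)) for b in bs])]
print(mse_b, x.mean())      # ~22.0, pulled toward the outlier
print(mae_b, np.median(x))  # 3.0, unaffected by the outlier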


[2]

The given affine function can be written as $x_{1}w_{1}+x_{2}w_{2}+\cdots+x_{n}w_{n}+b$. We can turn it into a linear function on $(x, 1)$, the input vector augmented with a constant entry 1: the bias $b$ simply becomes the weight attached to that constant coordinate, so $w^Tx+b=(w,b)^T(x,1)$. Conversely, since an affine function is the composition of a linear function with a translation, we only have to subtract $b$ from the affine function so that the new function passes through the origin, making it linear.
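To make the "linear function on $(x, 1)$" idea concrete, here is a tiny numpy sketch (my own illustration): appending a constant 1 to the input turns the affine map into a purely linear map on the augmented vector, with $b$ absorbed into the weights.

import numpy as np

w = np.array([2.0, -1.0, 0.5])
b = 3.0
x = np.array([1.0, 4.0, 2.0])

affine = np.dot(w, x) + b               # affine function of x
x_aug = np.concatenate([x, [1.0]])      # augmented input (x, 1)
w_aug = np.concatenate([w, [b]])        # bias absorbed into the weights
print(affine, np.dot(w_aug, x_aug))     # identical values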


[3]

For convenience, let's assume there are only 3 features ($x_{1}, x_{2}, x_{3}$).

If we put this into Python code,

import numpy as np

# f(X) = sum_i w[i] * X[i] + sum_i sum_{j <= i} X[i] * W[i][j] * X[j]
# w: linear weights, W: lower-triangular quadratic weights, X: feature vector, n: number of features
def quadratic(w, W, X, n):
    result = 0
    result += np.sum(w * X)                                 # linear terms
    for i in range(n):
        result += np.sum(X[i] * W[i][:i + 1] * X[:i + 1])   # quadratic terms x_i * x_j, j <= i
    return result
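A quick usage check (my own addition, with made-up numbers): the same value can also be obtained from an ordinary linear model acting on the augmented feature vector containing every $x_{i}$ and every product $x_{i}x_{j}$ ($j \le i$), which is the point of the exercise.

import numpy as np

w = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, 0.0, 0.0],
              [1.0, 0.2, 0.0],
              [0.3, 0.4, 0.1]])          # lower-triangular quadratic weights
X = np.array([2.0, -1.0, 0.5])

print(quadratic(w, W, X, 3))
# equivalent linear model on augmented features {x_i} and {x_i * x_j, j <= i}
feats = list(X) + [X[i] * X[j] for i in range(3) for j in range(i + 1)]
weights = list(w) + [W[i][j] for i in range(3) for j in range(i + 1)]
print(np.dot(weights, feats))            # same value as above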

[4-1]

From the analytic solution: $w=(X^TX)^{-1}X^Ty$

If $X^TX$ is not full rank, i.e. $X^TX$ is singular (not invertible), then it is impossible to obtain a unique set of weights $w$.

 

Moreover, $X^TX$ failing to be full rank is equivalent to the columns of $X$ being linearly dependent: some feature is a linear combination of the others (or there are fewer examples than features), so the data itself does not determine a unique $w$.

 

[4-2]

As long as $X^TX$ computed from the perturbed entries ($x_{ij}:=x_{ij}+\epsilon_{ij}$ for every $i, j$) has no linearly dependent columns, we can use the analytic solution; since the noise is continuous, this holds with probability one.
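Here is a minimal numpy sketch (my own illustration, using a tiny design matrix with one duplicated column) of how the noise restores full rank:

import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [3.0, 1.0, 1.0],
              [2.0, 5.0, 5.0],
              [0.0, 4.0, 4.0]])           # last two columns are identical
print(np.linalg.matrix_rank(X.T @ X))     # 2: singular, no unique solution

Xn = X + np.random.randn(*X.shape) * 1e-3 # small Gaussian noise on every entry
print(np.linalg.matrix_rank(Xn.T @ Xn))   # 3: invertible again (with probability 1)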

 

[4-3]

I am not sure what the question is intending, but the immediate result of the perturbation is

$(X+N)^T(X+N)$

If the entries of $N$ are i.i.d. zero-mean Gaussian with variance $\sigma^2$, taking the expectation gives $E[(X+N)^T(X+N)]=X^TX+n\sigma^2I$, since the cross terms $X^TN$ and $N^TX$ vanish in expectation and $E[N^TN]=n\sigma^2I$ (with $n$ the number of examples).

 

[4-4]

SGD doesn't use the analytic solution to find the optimal $w$; the algorithm iterates over individual examples (or mini-batches) and computes the gradient of the loss. So as long as the gradient is computable, $X^TX$ not being full rank is not an obstacle. The caveat is that with a rank-deficient $X^TX$ there are many parameter vectors with the same minimal loss, so the solution SGD ends up at depends on the initialization.


[5-1]

The original noise model was Gaussian, so we only have to swap in the new exponential noise model when writing down the likelihood.

$P(y|X)=\prod_{i=1}^{n}\frac{1}{2}\exp(-|y^{(i)}-w^Tx^{(i)}-b|)$. This is the quantity to maximize.

Thus $-\log P(y|X)=n\log 2+\sum_{i=1}^{n}|y^{(i)}-w^Tx^{(i)}-b|$

Since $n\log 2$ does not depend on $w$, we only need to minimize $\sum_{i=1}^{n}|y^{(i)}-w^Tx^{(i)}-b|$ in order to maximize $P(y|X)$.

 

[5-2]

We want to find $w$ satisfying $\partial_{w}(-\log P(y|X))=0$. However, because of the absolute value the derivative is a sum of sign terms $\pm x^{(i)}$, and setting it to zero does not yield a closed-form solution.

 

[5-3]

The only problem is that the gradient is not defined when $y^{(i)}=w^Tx^{(i)}+b$. To deal with this I suggest two options.

  1. Add a small amount of noise so that the residual never sits exactly at zero and the gradient stays defined. However, right around that point the gradient still flips between $-x^{(i)}$ and $+x^{(i)}$, so the updates can oscillate; adding noise by itself might not be enough.
  2. $\bigtriangledown _{w}(-\log P(y|X))=\begin{cases} -x^{(i)} & (y^{(i)}-w^Tx^{(i)}-b>0) \\ 0 & (y^{(i)}-w^Tx^{(i)}-b=0) \\ x^{(i)} & (y^{(i)}-w^Tx^{(i)}-b<0) \end{cases}$

    This (sub)gradient is defined at every point, and it stops updating the parameters exactly when the residual is 0. However, this approach can lead to premature convergence, stopping at a suboptimal point.
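A rough SGD sketch (my own code, on synthetic data) using the sign-based (sub)gradient from option 2; for a single example the update direction is $\pm x^{(i)}$ exactly as in the cases above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true, b_true = np.array([2.0, -1.0, 0.5]), 1.0
y = X @ w_true + b_true + rng.laplace(scale=1.0, size=1000)

w, b, lr = np.zeros(3), 0.0, 0.01
for epoch in range(50):
    for i in rng.permutation(len(y)):
        r = y[i] - X[i] @ w - b      # residual of one example
        s = np.sign(r)               # subgradient of |r| w.r.t. the prediction
        w += lr * s * X[i]           # descent step on w
        b += lr * s                  # descent step on b
print(w, b)                          # close to w_true, b_true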

[6]

It doesn't matter whether the network has 2 layers or 1. The output of the second layer is just a linear combination of terms that are themselves linear combinations of the inputs, so the composition is still linear. Thus, no matter how deep the network is, a multi-layer purely linear network is equivalent to a single-layer one, as the sketch below confirms. If we want to handle problems that are too complicated for linear models, we should insert non-linear activation functions such as ReLU, tanh, or sigmoid.
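A tiny numpy sketch (my own illustration, biases ignored) of why stacking linear layers collapses into a single one:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 5))      # first linear layer, no activation
W2 = rng.normal(size=(2, 4))      # second linear layer
x = rng.normal(size=5)

two_layer = W2 @ (W1 @ x)                  # two stacked linear layers
one_layer = (W2 @ W1) @ x                  # one equivalent layer
print(np.allclose(two_layer, one_layer))   # True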


[7-1]

Negative prices don't exist, but the model can still produce a negative price: if the true price is very small and the Gaussian noise term happens to be sufficiently negative, the resulting value drops below zero.

 

There is also the assumption that all the noise terms are independent and identically distributed (i.i.d.); however, house and stock prices are heavily influenced both by the environment and by each other's prices, so it is hard to argue that the noise is independent.

 

[7-2]

The logarithm is a much more stable function than $y=x$ (in the sense of how much the output changes with respect to the input), so it compresses the widely spread prices; differences in log price correspond to relative rather than absolute price changes, which also fits the additive Gaussian noise assumption better.

 

[7-3]

A penny stock is a stock that trades for only a few pennies or dollars per share. Despite being seen as a cheap investment opportunity, it is highly risky due to large price fluctuations. Here are the things to worry about when using linear regression on penny stocks.

 

  1. Danger of being delisted
    Being delisted makes it hard for the model to learn, since there is no further data after the delisting, and I highly doubt that a delisted stock would simply be assessed as 0.
  2. Limited information
    There is little information available, since most companies listed as penny stocks are startups. Therefore there may not be enough data to train the model.
  3. Significant price fluctuation
    Representing the price with a straight line might pose challenges.

Since linear regression is sensitive to outliers, it is not desirable to use the penny-stock price directly as the target variable. Although taking the natural log (as in this exercise) diminishes the magnitude of the volatility, the impact of that volatility can still seep through the log.


[8-1]

The number of apples sold is heavily influenced by other factors such as climate and seasonal timing. Also, the data consists of discrete counts, which a normal distribution cannot model exactly (a normal distribution applies to continuous random variables).

 

[8-2]

We can use the Poisson distribution, which is designed for discrete count-valued random variables.

We have to show that $E[k]=\lambda$

 

$E[k]=\sum_{k=0}^{\infty}kp(k|\lambda)=\sum_{k=0}^{\infty}k\frac{\lambda^k e^{-\lambda}}{k!}=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}=\lambda e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k}}{k!}=\lambda e^{-\lambda}e^{\lambda}=\lambda$

 

If you are wondering why $\sum_{k=0}^{\infty}\frac{\lambda^{k}}{k!}=e^{\lambda}$, recall the Taylor series of the exponential function $e^{\lambda}$.
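As a quick numerical sanity check of $E[k]=\lambda$ (my own addition):

import numpy as np

samples = np.random.default_rng(1).poisson(lam=4.2, size=100_000)
print(samples.mean())   # close to 4.2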

 

[8-3]

There is only one rate parameter ($\lambda$) in the Poisson distribution.

 

The likelihood of $\lambda$ is

$L(\lambda)=\prod_{i=1}^{n}\frac{\lambda^{k^{(i)}} e^{-\lambda}}{k^{(i)}!}$

 

thus

 

$\log L=l=\sum_{i=1}^{n}\log\left(\lambda^{k^{(i)}}e^{-\lambda}\right)-\sum_{i=1}^{n}\log k^{(i)}!=\left(\sum_{i=1}^{n}k^{(i)}\right)\log\lambda-n\lambda-\sum_{i=1}^{n}\log k^{(i)}!$

 

We only keep the terms that depend on $\lambda$ (the $\sum\log k^{(i)}!$ term is a constant) and negate, turning the maximization into a minimization:

 

$loss(\lambda)=n\lambda-\left(\sum_{i=1}^{n}k^{(i)}\right)\log \lambda$

 

[8-4]

Using the loss function from 8-3, let $t=\log \lambda$; thus

$loss(t)=ne^t-\left(\sum_{i=1}^{n}k^{(i)}\right)t$
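A minimal gradient-descent sketch (my own code, on synthetic counts) for this reparametrized loss; since $\frac{d}{dt}loss(t)=ne^t-\sum_{i}k^{(i)}$, the estimate converges to $\hat{\lambda}=e^{t}=\bar{k}$, the sample mean:

import numpy as np

rng = np.random.default_rng(2)
k = rng.poisson(lam=3.5, size=500)   # observed counts, true lambda = 3.5
n, s = len(k), k.sum()

t, lr = 0.0, 1e-4
for _ in range(10_000):
    grad = n * np.exp(t) - s         # d loss / d t
    t -= lr * grad
print(np.exp(t), k.mean())           # both close to 3.5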

 
