
[3.7.6] Dive into Deep Learning : exercise answers


3.7. Weight Decay — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

[1]

$\lambda$ can be any positive real number. However, for weight decay to keep its shrinking effect, we should keep $\eta \lambda < 1$. Since the learning rate is 0.01, we can try $\lambda$ over 0–99.
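To see where that condition comes from, restate the SGD update with weight decay from the section:

$$w \leftarrow (1-\eta \lambda)\,w - \eta\, \partial_{w} \ell(w)$$

If $\eta \lambda \geq 1$, the factor $(1-\eta \lambda)$ is zero or negative, so the penalty no longer shrinks $w$ toward zero but instead zeroes it out or flips its sign.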

import torch
from d2l import torch as d2l

# Data and WeightDecay are the classes defined in section 3.7 of the book
# (a minimal sketch of both follows the loop below).
data = Data(num_train=100, num_val=100, num_inputs=200, batch_size=20)
trainer = d2l.Trainer(max_epochs=10)
test_lambds = list(range(100))
board = d2l.ProgressBoard('lambda')

def accuracy(y_hat, y):
    # Crude accuracy proxy: 100% minus the relative error of the mean prediction.
    return (1 - ((y_hat - y).mean() / y.mean()).abs()) * 100

def train_ex1(lambd):
    model = WeightDecay(wd=lambd, lr=0.01)
    model.board.yscale = 'log'
    trainer.fit(model, data)
    with torch.no_grad():
        y_hat = model(data.X)  # predictions on the full (train + val) data
    acc_train = accuracy(y_hat[:data.num_train], data.y[:data.num_train])
    acc_val = accuracy(y_hat[data.num_train:], data.y[data.num_train:])
    return acc_train, acc_val

for lambd in test_lambds:
    acc_train, acc_val = train_ex1(lambd)
    board.draw(lambd, acc_train.item(), 'acc_train', every_n=1)
    board.draw(lambd, acc_val.item(), 'acc_val', every_n=1)
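For reference, the code above assumes the Data and WeightDecay classes from the book's concise implementation in section 3.7; a minimal sketch from memory (details may differ slightly from the book):

import torch
from d2l import torch as d2l

class Data(d2l.DataModule):
    """Synthetic linear data: y = 0.01 * sum(x) + 0.05 + noise."""
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, num_inputs)
        noise = torch.randn(n, 1) * 0.01
        w, b = torch.ones((num_inputs, 1)) * 0.01, 0.05
        self.y = torch.matmul(self.X, w) + b + noise

    def get_dataloader(self, train):
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader([self.X, self.y], train, i)

class WeightDecay(d2l.LinearRegression):
    """Linear regression with the L2 penalty folded into the optimizer."""
    def __init__(self, wd, lr):
        super().__init__(lr)
        self.save_hyperparameters()
        self.wd = wd

    def configure_optimizers(self):
        # Apply weight decay to the weight only, not the bias.
        return torch.optim.SGD([
            {'params': self.net.weight, 'weight_decay': self.wd},
            {'params': self.net.bias}], lr=self.lr)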


[2]

As the plot above shows, finding the single optimal $\lambda$ hardly matters here: many values across 0–99 reach essentially the same top accuracy.
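If we still want the value that maximizes validation accuracy, a small follow-up sketch (hypothetical, reusing train_ex1 and test_lambds from exercise 1):

# Score every lambda on the validation split and pick the best one.
results = {lambd: train_ex1(lambd)[1].item() for lambd in test_lambds}
best_lambd = max(results, key=results.get)
print(f'validation-optimal lambda: {best_lambd} (acc: {results[best_lambd]:.2f}%)')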


[3]

Replacing the $\ell_2$ penalty with the $\ell_1$ penalty $\lambda \left\| \textbf{w} \right\|_1$ changes its gradient from $\lambda w$ to $\lambda\, \mathrm{sgn}(w)$, so the multiplicative shrinkage of weight decay becomes a constant-magnitude subtraction:

$$(1-\eta \lambda)w \to w-\eta \lambda\, \mathrm{sgn}(w)$$
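The same step in code, as an illustrative sketch (the sgd_l1_step helper and its values are made up for this example):

import torch

# One SGD step with an L1 penalty. Unlike L2 weight decay, which shrinks w
# multiplicatively, the L1 subgradient subtracts a fixed-magnitude term,
# which pushes small weights to exactly zero (sparsity).
def sgd_l1_step(w, grad, lr=0.01, lambd=3.0):
    return w - lr * grad - lr * lambd * torch.sign(w)

print(sgd_l1_step(torch.tensor([0.5, -0.2, 0.0]), grad=torch.zeros(3)))
# tensor([ 0.4700, -0.1700,  0.0000])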


[4]

$$\left\| \textbf{X}\right\|_{F}=\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}x_{ij}^{2}}$$
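A quick numerical sanity check of this formula against PyTorch's built-in Frobenius norm (random matrix chosen just for illustration):

import torch

X = torch.randn(3, 4)
manual = torch.sqrt((X ** 2).sum())        # the double sum of squares above
builtin = torch.linalg.norm(X, ord='fro')  # PyTorch's Frobenius norm
print(torch.isclose(manual, builtin))      # tensor(True)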


[5]

training error : the model's measured error (loss) on the training data, expressed as a real number. It is the signal fed back into the parameter updates; ideally we drive it low by learning from the training data.

generalization error : the expected error on unseen data. Its exact value is out of reach, since we cannot access all the unseen data in the universe, so in practice we estimate it on held-out data. For practical purposes, generalization error matters more than training error.

A low training error does not guarantee a low generalization error; a large gap between the two is overfitting. To prevent this we use validation data and regularization.


[6]

Taking the log and maximizing $P(w|x)$ (equivalently, minimizing $-\log P(w|x)$), the prior term $P(w)$ enters the objective as a regularizer and acts as weight decay.
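Spelling the step out (the standard MAP derivation, matching the book's reasoning):

$$-\log P(w|x) = -\log P(x|w) - \log P(w) + \text{const}$$

With a Gaussian prior $P(w) \propto e^{-\lambda \left\| \textbf{w} \right\|^{2}}$, the prior term becomes $-\log P(w) = \lambda \left\| \textbf{w} \right\|^{2} + \text{const}$, which is exactly the weight-decay penalty.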
