
ML&DL / Dive into Deep Learning (11)
[4.4.7] Dive into Deep Learning : exercise answers — 4.4. Softmax Regression Implementation from Scratch (d2l.ai) [1-1] We can observe this in the code when we tamper with X: X_prob = softmax(X); X_prob, X_prob.sum(1). Instead of building X from small values such as torch.normal as the book does, let X be X = torch.arange(90, 100).reshape(2, 5). Now X consists of large values from 90 to 99; let's watch what happens to..
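A minimal sketch of the overflow this excerpt points at, assuming a naive softmax like the chapter's (the hand-rolled softmax and the stabilized variant below are my own illustration, not the post's exact code):

import torch

def softmax(X):
    # Naive softmax: exponentiate, then normalize each row.
    X_exp = torch.exp(X)
    return X_exp / X_exp.sum(1, keepdim=True)

# exp(90)..exp(99) overflow float32, so every row becomes inf / inf = nan.
X = torch.arange(90, 100, dtype=torch.float32).reshape(2, 5)
X_prob = softmax(X)
print(X_prob, X_prob.sum(1))

# Common fix: subtract the row-wise max before exponentiating;
# mathematically the result is unchanged, but it stays finite.
X_prob_stable = softmax(X - X.max(1, keepdim=True).values)
print(X_prob_stable, X_prob_stable.sum(1))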
[4.3.4] Dive into Deep Learning : exercise answers — 4.3. The Base Classification Model (d2l.ai) [1] $L_{v}$ denotes the 'averaged total validation loss' and $L_{v}^{d}$ denotes the 'averaged validation loss of a minibatch'. The question asks us to find the relationship between $L_{v}$ and $L_{v}^{d}$. Let the sample size $=N$ (total examples in the dataset) and the minibatch size $=M$ (number of examples in the minib..
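A minimal derivation of the relationship the excerpt sets up, writing $\ell_{i}$ for the loss of example $i$ and assuming the dataset splits evenly into $K = N/M$ minibatches $d_{1},\dots,d_{K}$ ($K$ and $\ell_{i}$ are my notation; the uneven last batch is the usual follow-up case):

$$L_{v}=\frac{1}{N}\sum_{i=1}^{N}\ell_{i}=\frac{1}{KM}\sum_{k=1}^{K}\sum_{i\in d_{k}}\ell_{i}=\frac{1}{K}\sum_{k=1}^{K}\Big(\frac{1}{M}\sum_{i\in d_{k}}\ell_{i}\Big)=\frac{1}{K}\sum_{k=1}^{K}L_{v}^{d_{k}}$$

so when $M$ divides $N$, the averaged total validation loss is exactly the mean of the per-minibatch averaged losses.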
[4.2.5] Dive into Deep Learning : exercise answers — 4.2. The Image Classification Dataset (d2l.ai) [1] Batch sizes are usually chosen in the form $2^n$, so we can inspect the run time of loading the dataset when batch_size = $2^n$. ls = [2**i for i in range(1, 9)] t_info = [] for size in ls: data = FashionMNIST(resize=(32, 32), batch_size=size) tic = time.time() for X, y in data.train_dataloader(): continue t..
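Filling in the truncated timing loop, a sketch under the assumption that FashionMNIST and train_dataloader are the d2l.torch wrappers used in this chapter:

import time
from d2l import torch as d2l

batch_sizes = [2**i for i in range(1, 9)]  # 2, 4, ..., 256
t_info = []
for size in batch_sizes:
    data = d2l.FashionMNIST(resize=(32, 32), batch_size=size)
    tic = time.time()
    for X, y in data.train_dataloader():
        continue  # iterate once over the training set to measure pure loading time
    t_info.append((size, time.time() - tic))
print(t_info)

Larger batch sizes spread the per-batch loader overhead over more examples, so the epoch time generally drops and then flattens out.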
[4.1.5] Dive into Deep Learning : exercise answers — 4.1. Softmax Regression (d2l.ai) [1-1] Since the first derivative of the cross-entropy loss is given, we only have to differentiate $softmax(\textbf{o})_{j}-y_{j}$ with respect to $o_{j}$: $$\partial _{o_{j}}(softmax(\textbf{o})_{j}-y_{j})=softmax(\textbf{o})_{j}(1-softmax(\textbf{o})_{j})$$ [1-2] This makes sense if the distribution is a Bernoulli. We can view ..
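One intermediate step the excerpt skips, writing $p_{j}=softmax(\textbf{o})_{j}=\frac{e^{o_{j}}}{\sum_{k}e^{o_{k}}}$ ($p_{j}$ is my shorthand):

$$\partial_{o_{j}}p_{j}=\frac{e^{o_{j}}\sum_{k}e^{o_{k}}-e^{o_{j}}e^{o_{j}}}{\big(\sum_{k}e^{o_{k}}\big)^{2}}=p_{j}-p_{j}^{2}=p_{j}(1-p_{j})$$

and since $y_{j}$ does not depend on $o_{j}$, the second derivative of the loss with respect to $o_{j}$ is exactly $p_{j}(1-p_{j})$, the variance of a Bernoulli random variable with parameter $p_{j}$, which is what [1-2] refers to.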
[3.7.6] Dive into Deep Learning : exercise answers — 3.7. Weight Decay (d2l.ai) [1] $\lambda$ can be any positive real number. However, to maintain the purpose of weight decay we should keep $\lambda \eta$ ..
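For context, a sketch of why the product $\lambda\eta$ is the quantity to watch (this is the standard minibatch SGD update with the L2 penalty $\frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2}$ from the chapter; the exact condition the post states is cut off above):

$$\mathbf{w}\leftarrow(1-\eta\lambda)\,\mathbf{w}-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathbf{x}^{(i)}\big(\mathbf{w}^{\top}\mathbf{x}^{(i)}+b-y^{(i)}\big)$$

Each step first shrinks $\mathbf{w}$ by the factor $(1-\eta\lambda)$, so the strength of the decay is governed by $\eta\lambda$ rather than by $\lambda$ alone.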
[3.6.5] Dive into Deep Learning : exercise answers — 3.6. Generalization (d2l.ai) [1] Although the algorithm allows us to fit non-linear data, the problem with polynomial regression is that it is prone to overfitting, especially when there are n training examples and we choose to set the hypothesis function's degree to n, since theoretically there is a degree-n polynomial that passes through all n points exactly. Whic..
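A quick illustration of that interpolation failure mode (my own numpy sketch, using degree $n-1$, the smallest degree whose $n$ coefficients already suffice to pass through all $n$ points):

import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.linspace(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)  # n noisy training points

coeffs = np.polyfit(x, y, deg=n - 1)  # n coefficients: interpolates every training point
print(np.abs(np.polyval(coeffs, x) - y).max())  # ~0, i.e. zero training error

x_val = np.linspace(-1, 1, 50)
print(np.abs(np.polyval(coeffs, x_val) - np.sin(np.pi * x_val)).max())  # much larger off the training grid

Zero training error here means the noise has been fit along with the signal, which is exactly why the validation error blows up relative to the training error.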
[3.5.6] Dive into Deep Learning : exercise answers — 3.5. Concise Implementation of Linear Regression (d2l.ai) [1] Let's say the original loss function returns the total sum of the batch's losses (the learning rate is $\alpha$ and the batch size is $n$). Then the total impact of the loss on the update will be $\alpha\times loss(y\_hat, y)$. However, if we want the loss function to return the mean of the summed loss ($loss..
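The scaling argument the excerpt is building up to, under the assumption that only the reduction of the loss (sum vs. mean) changes:

$$\nabla_{\mathbf{w}}\Big(\frac{1}{n}\sum_{i=1}^{n}\ell_{i}\Big)=\frac{1}{n}\nabla_{\mathbf{w}}\Big(\sum_{i=1}^{n}\ell_{i}\Big)$$

so an SGD step taken with the mean loss is $1/n$ times as large as one taken with the summed loss. To reproduce the original updates after switching to the mean, the learning rate has to be scaled up from $\alpha$ to $n\alpha$.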
[3.4.6] Dive into Deep Learning : exercise answers — 3.4. Linear Regression Implementation from Scratch (d2l.ai) [1] The algorithm will still work even if the weights are initialized to zero. Since the algorithm we are using is gradient descent, it typically doesn't matter where you start: the gradient will still guide the weights toward the point where the loss function is minimized. We can confirm this by run..
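A minimal PyTorch sketch of that check, assuming the chapter's synthetic-data setup with true parameters (2, -3.4) and 4.2 (the full-batch gradient descent loop below is my simplification of the book's trainer):

import torch

true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
X = torch.randn(1000, 2)
y = X @ true_w + true_b + 0.01 * torch.randn(1000)

w = torch.zeros(2, requires_grad=True)  # zero initialization instead of torch.normal
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for epoch in range(200):
    loss = ((X @ w + b - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
print(w, b)  # converges close to (2.0, -3.4) and 4.2 despite the zero start

Because the squared loss of linear regression is convex, gradient descent reaches the same minimum regardless of the starting point; zero initialization only becomes a real problem once hidden layers and weight symmetry enter the picture.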