

[4.1.5] Dive into Deep Learning : exercise answers


 

 

Reference: 4.1. Softmax Regression — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

 

[1-1]

Since the first derivative of the cross-entropy loss is given, we only have to differentiate $\mathrm{softmax}(\mathbf{o})_j - y_j$ with respect to $o_j$.

$\frac{\partial}{\partial o_j}\left(\mathrm{softmax}(\mathbf{o})_j - y_j\right) = \mathrm{softmax}(\mathbf{o})_j\left(1 - \mathrm{softmax}(\mathbf{o})_j\right)$
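For completeness, here is one way to spell out that differentiation. Since the label $y_j$ does not depend on $o_j$, only the softmax term matters, and the quotient rule gives:

$$\frac{\partial}{\partial o_j}\,\frac{\exp(o_j)}{\sum_{k=1}^{q}\exp(o_k)} = \frac{\exp(o_j)\sum_{k=1}^{q}\exp(o_k) - \exp(o_j)^2}{\left(\sum_{k=1}^{q}\exp(o_k)\right)^{2}} = \mathrm{softmax}(\mathbf{o})_j - \mathrm{softmax}(\mathbf{o})_j^{2} = \mathrm{softmax}(\mathbf{o})_j\left(1 - \mathrm{softmax}(\mathbf{o})_j\right)$$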


[1-2]

This makes sense if the distribution is a Bernoulli. We can view $\mathrm{softmax}(\mathbf{o})_j$ as the probability of $o_j$ among the $o_k$ for $k = 1, \dots, q$.

$\mathrm{softmax}(\mathbf{o})_j = \frac{\exp(o_j)}{\sum_{k=1}^{q}\exp(o_k)}$

If the distribution is not a Bernoulli, the variance will be $\sigma^2 = \sum_i P(x_i)(x_i - \mu)^2$, where $P(x_i)$ is the PMF (probability mass function) and $\mu$ is the mean of the distribution. I believe employing the PMF in the variance calculation is appropriate here, given that the logits represent a set of discrete values; this accounts for the discrete nature of the logits and gives an accurate picture of the distribution's variability.

$\sigma^2 = \sum_{i=1}^{q}\mathrm{softmax}(\mathbf{o})_i\left(\exp(o_i) - \frac{\sum_{k=1}^{q}\exp(o_k)}{q}\right)^2$
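As a quick numerical illustration, here is a minimal NumPy sketch that just evaluates the variance formula above for an example logit vector (the vector `o` is made up for illustration and is not part of the original answer):

```python
import numpy as np

# Hypothetical logits, chosen only to illustrate the formula above.
o = np.array([1.0, 2.0, 0.5])

exp_o = np.exp(o)
softmax_o = exp_o / exp_o.sum()        # softmax(o)_i, used as the PMF P(x_i)
mu = exp_o.sum() / len(o)              # mean of the exp(o_k) values
variance = np.sum(softmax_o * (exp_o - mu) ** 2)
print(variance)
```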


[2-1]

Let's say we label each class with the binary bit patterns 00, 01, 10. This can be a problem since the pairwise distances between the code words are not all equal (00 and 01 differ in one bit, while 01 and 10 differ in two), even though the classes have equal probabilities $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. Let's see an example.

 

Assume we have 3 classes A, B, C, each with probability $\frac{1}{3}$. Since we use binary, a fixed-length code needs at least 2 bits to represent each class uniquely. However, with equal probabilities the entropy is only $\log_2 3 \approx 1.58$ bits per class, so the 2-bit code wastes about 0.42 bits per class. Also, if we encode A as "0" and B as "1", we still end up spending an extra bit on C. This is the inefficiency problem.

 

There is also a uniqueness problem. If we use {"0": A, "1": B, "00": C} as a decoder, the code is not prefix-free, so we cannot tell whether to decode "00" as AA or as C.


[2-2]

For a better code,

1. Use a ternary code. If we use {"-1": A, "0": B, "1": C}, we can avoid the inefficiency and non-uniqueness issues.

 

2. Use one-hot encoding.
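For the one-hot option, a minimal sketch of what such an encoding could look like for the three classes above (the class names and their ordering are only illustrative):

```python
import numpy as np

classes = ["A", "B", "C"]

def one_hot(label):
    """Return a one-hot vector with a 1 at the position of `label`."""
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1.0
    return vec

print(one_hot("A"), one_hot("B"), one_hot("C"))
```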


[3]

There are $3^2 = 9$ possible signals with 2 ternary units. Since we only need 8 classes, nine combinations are enough for our case (see the small enumeration sketch after the list below). And these are the benefits of ternary systems compared to binary:

 

Reduced Power Consumption: In some cases, ternary circuits can offer lower power consumption compared to binary circuits.


Increased Speed: Ternary systems can potentially process information faster than binary systems for certain operations.


Improved Error Detection: With three states, ternary systems can offer more sophisticated error detection and correction mechanisms.
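As mentioned above, a tiny sketch that enumerates the 2-unit ternary signals, just to confirm the counting argument:

```python
from itertools import product

# Every signal made of 2 ternary units, each unit taking one of 3 states.
signals = list(product([-1, 0, 1], repeat=2))
print(len(signals))            # 9 combinations, enough for 8 classes
```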


[4-1]

Assume, w.l.o.g. (let $o_{\text{apple}} = o_1$, $o_{\text{orange}} = o_2$), that $o_1 > o_2$, i.e. apple is the most likely one to be chosen. This is still true after the softmax, i.e. $\mathrm{softmax}(\mathbf{o})_1 > \mathrm{softmax}(\mathbf{o})_2$, because

 

$o_1 > o_2$

$\Rightarrow \exp(o_1) > \exp(o_2)$    (since $\exp(\cdot)$ is a monotonically increasing function)

$\Rightarrow \dfrac{\exp(o_1)}{\sum_{k=1}^{q}\exp(o_k)} > \dfrac{\exp(o_2)}{\sum_{k=1}^{q}\exp(o_k)}$

$\Rightarrow \mathrm{softmax}(\mathbf{o})_1 > \mathrm{softmax}(\mathbf{o})_2$
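A quick numerical check of this ordering argument (the logit values are made up for illustration):

```python
import numpy as np

def softmax(o):
    """Plain softmax; fine for these small example logits."""
    exp_o = np.exp(o)
    return exp_o / exp_o.sum()

o = np.array([2.0, 1.0])       # hypothetical scores: o_apple > o_orange
p = softmax(o)
print(p, np.argmax(o) == np.argmax(p))   # the ordering is preserved
```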


[4-2]

Let's say the score of choosing neither is $o_3$. Then we only have to update $q$ to 3: the softmax function is a generalized form for classification and can handle more than two classes.


[5-1]

Assume, w.l.o.g., that $a \geq b$, so $\max(a,b) = a$; we want to show $\log(\exp(a)+\exp(b)) > a$.

 

$\log(\exp(a)+\exp(b)) > \log(\exp(a)) = a$    (since $\exp(b) > 0$)


[5-2]

Assume, w.l.o.g., $b = 0$ and $a \geq b$; thus we want to minimize the function $f$.

$f(a) \overset{\mathrm{def}}{=} \log(\exp(a)+1) - a$

 

If we draw the graph of $f$ via Desmos, we can see that $f(a)$ decreases toward 0 as $a$ grows.

So we can achieve a small difference between RealSoftMax and max when there is a huge difference between $a$ and $b$.
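Since the Desmos plot is not reproduced here, a short numerical sketch of the same observation (the sample values of $a$ are arbitrary):

```python
import numpy as np

def f(a):
    """Gap between RealSoftMax(a, 0) and max(a, 0) for a >= 0."""
    return np.log(np.exp(a) + 1) - a

for a in [0.0, 1.0, 5.0, 10.0, 20.0]:
    print(a, f(a))             # the gap shrinks toward 0 as a grows
```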


[5-3]

$\lambda^{-1}\log(\exp(\lambda a)+\exp(\lambda b)) > a$

$\Leftrightarrow \log(\exp(\lambda a)+\exp(\lambda b)) > \lambda a$    (multiplying both sides by $\lambda$ keeps the direction of the inequality, since $\lambda > 0$)

This is the same inequality as in 5-1, but with the previous $a$ and $b$ multiplied by $\lambda$.


[5-4]

Assume, w.l.o.g., $a > b$.

 

$\lim_{\lambda \to \infty}\left(\lambda^{-1}\,\mathrm{RealSoftMax}(\lambda a, \lambda b) - \max(a,b)\right)$

$= \lim_{\lambda \to \infty}\dfrac{\log(\exp(\lambda a)+\exp(\lambda b)) - \lambda a}{\lambda}$

$= \lim_{\lambda \to \infty}\dfrac{\log(\exp(\lambda a)+\exp(\lambda b)) - \log\exp(\lambda a)}{\lambda}$

$= \lim_{\lambda \to \infty}\dfrac{\log\left(1+\exp(\lambda(b-a))\right)}{\lambda}$

 

Since $b - a < 0$, $\lim_{\lambda \to \infty}\exp(\lambda(b-a)) = 0$.

 

Thus the limit above is 0 as $\lambda \to \infty$, which proves the claim.
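A numerical sanity check of this limit, using NumPy's numerically stable `np.logaddexp` so that large $\lambda$ does not overflow (the values of $a$, $b$, and $\lambda$ are arbitrary):

```python
import numpy as np

a, b = 2.0, 1.0                          # arbitrary example with a > b

for lam in [1, 10, 100, 1000]:
    # (1/lam) * RealSoftMax(lam*a, lam*b), computed stably with logaddexp
    print(lam, np.logaddexp(lam * a, lam * b) / lam)   # tends to max(a, b) = 2.0
```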


[5-5]

$\mathrm{RealSoftMin}(a,b) = -\log(\exp(-a) + \exp(-b))$


[5-6]

 

$\mathrm{RealSoftMinExtend}(\mathbf{a}) = -\log\left(\sum_{k=1}^{q}\exp(-a_k)\right)$


[6-1]

The derivative of the log-partition function is the softmax function.

 

$\frac{\partial}{\partial x_j} g(\mathbf{x}) = \frac{\exp(x_j)}{\sum_i \exp(x_i)} = \mathrm{softmax}(\mathbf{x})_j$

 

By 1-1, the derivative of the softmax (the second derivative of $g(\mathbf{x})$) is equal to $\mathrm{softmax}(\mathbf{x})_j\left(1 - \mathrm{softmax}(\mathbf{x})_j\right)$.

 

Since $0 < \mathrm{softmax}(\mathbf{x})_j < 1$,

$\frac{\partial^2}{\partial x_j^2} g(\mathbf{x}) > 0$. More generally, the full Hessian of $g$ is $\mathrm{diag}(\mathrm{softmax}(\mathbf{x})) - \mathrm{softmax}(\mathbf{x})\,\mathrm{softmax}(\mathbf{x})^{\top}$, which is positive semidefinite, so $g$ is convex.
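A small numerical check that this Hessian is indeed positive semidefinite at an arbitrary example point (the vector `x` is made up for illustration):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])           # arbitrary example point
p = np.exp(x) / np.exp(x).sum()          # softmax(x)

hessian = np.diag(p) - np.outer(p, p)    # Hessian of g at x
eigvals = np.linalg.eigvalsh(hessian)
print(eigvals, np.all(eigvals >= -1e-12))   # all eigenvalues >= 0 (up to rounding)
```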


[6-2]

 

$g(\mathbf{x} + b) = \log\sum_i \exp(x_i + b) = \log\sum_i \exp(x_i)\exp(b)$

$= b + \log\sum_i \exp(x_i) = g(\mathbf{x}) + b$
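A quick numerical confirmation of this shift property (the values of `x` and `b` are arbitrary):

```python
import numpy as np

def g(x):
    """Log-partition function g(x) = log sum_i exp(x_i)."""
    return np.log(np.sum(np.exp(x)))

x, b = np.array([0.3, 1.7, -0.2]), 5.0   # arbitrary example values
print(np.isclose(g(x + b), g(x) + b))    # True
```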


[6-3]

If some $x_i$ are very large, it could be a problem because the corresponding $\exp(x_i)$ will dominate the sum (and can even overflow in floating point). Components that are not as large will then have no visible role in shaping $g(\mathbf{x})$.

 

Here is a 3D graph of the two-variable log-partition function $z = \log(\exp(x)+\exp(y))$.

 

3D image is powered by desmos.

 

Now see what happens when I amplify $y$, by changing the original $y$ to $100y$.

 

 

We can see that the value of $x$ doesn't really matter when $y$ dominates the function, so the influence of $x$ is no longer reflected.
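Related to this, with a very large component the naive implementation even overflows numerically, as in this small sketch (the input values are arbitrary):

```python
import numpy as np

x = np.array([1000.0, 2.0, 1.0])         # one very large component

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))    # exp(1000) overflows to inf
print(naive)                             # inf
```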


[6-4]

Let $b = -x_j = -\max_i x_i$.

 

$g(\mathbf{x}) = g(\mathbf{x} + b) - b = \log\sum_i \exp(x_i - x_j) + x_j$

 

Before the stable implementation: $\log(e^{1} + e^{2} + e^{100})$

After the stable implementation: $\log(e^{-99} + e^{-98} + 1) + 100$

 

This method allows us to stabilize the function, since the largest exponential term becomes $e^0 = 1$ and every term now lies between 0 and 1 (max $e^{0} = 1$, min approaching $e^{-\infty} = 0$), so the spread between the largest and smallest term is at most 1.
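Putting 6-2 through 6-4 together, a minimal sketch of the stable implementation described above:

```python
import numpy as np

def g_stable(x):
    """Numerically stable log-partition: subtract the max before exponentiating."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)                               # x_j = max_i x_i
    return np.log(np.sum(np.exp(x - m))) + m    # log sum_i exp(x_i - x_j) + x_j

print(g_stable([1.0, 2.0, 100.0]))              # ~100.0, no overflow
```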


[7-1]

 

The Boltzmann distribution is $P(i) \propto e^{-\epsilon_i / kT}$; after scaling the energies by $\alpha$ it becomes $Q(i) \propto e^{-\alpha\epsilon_i / kT}$.

Since $e^{-\alpha\epsilon_i / kT} = e^{-\epsilon_i / (k\,T/\alpha)}$, only the ratio $\alpha/T$ matters: halving $\alpha$ has the impact of doubling $T$, and doubling $\alpha$ has the impact of halving $T$.


[7-2]

 

As $T \to 0$, for any state with positive energy $\epsilon_i > 0$ the probability $\propto e^{-\epsilon_i / kT}$ goes to zero, no matter what $\epsilon_i$ is, while states with zero energy keep all the probability mass. Thus all the particles sit in the lowest-energy state, i.e. the system is static (frozen).


[7-3]

 

As $T \to \infty$, $e^{-\epsilon_i / kT} \to e^0 = 1$ for every $\epsilon_i$, so every energy level has the same probability of being found (the distribution becomes uniform).
