# 2024 Paper Reading Logs

- These are my personal logs for papers I read.
- Brain-dumping some interesting findings from the papers.

### [1] Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

https://arxiv.org/abs/2407.12687

- curiosity: interesting, but few technical details are revealed in the report..
- The paper:
  - mainly focuses on HCI aspects of newly emergent gen AI technologies in EdTech.
  - summarizes the current situation of gen AI tutors for education, in the form of a chatbot, from pedagogical perspectives with an emphasis on evaluation, introducing benchmarks tailored for the education context.

- The authors used SFT (supervised fine-tuning) so that their models learn pedagogically good behaviors.
- Future work is mentioned: trying to use RLHF (reinforcement learning from human feedback).
- As noted in the intro, evaluation in the EdTech realm has suffered in its designs and benchmarks.
- The team fine-tuned Gemini 1.0 and created LearnLM-Tutor, which improves on the original model in several aspects.
- reports some interesting (but statistically insignificant) findings
- Interesting that there's "sycophancy" in LLM behavior in general, as in Appendix D.
- The researchers tried to overcome this, and the resultant model is good at finding a learner's mistakes.

- The researchers saw some statistical insignificance, as is often the case in EdTech HCI evaluation.
- In the automatic eval using LLMs, they made multiple LLM evaluators, one per pedagogical dimension.
- The results of the auto-evals seem promising (not only for the fine-tuned model itself, but for the efficacy of LLMs as auto-evaluators).
- Fine-tuning (SFT) yielded better improvement on pedagogical aspects than prompt engineering (meaning everyone needs to learn about SFT or RLHF..?)
- Interesting that they did red-teaming for the safety eval.
- There are not many references from HCI conferences such as CHI or UbiComp in the paper. I'm sure we could find relevant literature from those venues, too. But this is a tech report, so maybe it doesn't matter anyway?

#### I haven't understood

- How did they fine-tune the model?—This technical report does not cover such details.
- Welch's t-test: I did not know about it. What's different from the usual t-test?
  - Welch's t-test is an improved version of Student's t-test, per Wikipedia.
  - ref https://zenn.dev/tmizuho/books/3d511e017bfd23/viewer/62b1c7
  - If normality is not guaranteed, it's better to use the Wilcoxon rank-sum test, per the website.
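To make the difference concrete, here is a stdlib-only sketch of Welch's t statistic and its Welch-Satterthwaite degrees of freedom — the part that differs from Student's t-test, which pools the two variances and assumes they are equal. The sample data is made up for illustration.

```python
# Welch's t-test: like Student's t-test, but it does not assume equal
# variances; degrees of freedom come from the Welch-Satterthwaite equation.
from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)   # sample variances (n-1 denominator)
    se2 = v1 / n1 + v2 / n2             # squared standard error of the difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite degrees of freedom (not n1 + n2 - 2 as in Student's)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

group_a = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3]   # illustrative scores, condition A
group_b = [4.2, 4.0, 4.5, 4.1]             # illustrative scores, condition B
t, df = welch_t(group_a, group_b)
```

The df always lands between min(n1, n2) − 1 and n1 + n2 − 2, shrinking toward the smaller group when variances differ a lot.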

- I'm not familiar with effect sizes in Welch's t-test
- What is "token-length normalised log-probability"? Do they just account for token length, like dividing the log-probability by the token length?
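If my guess above is right, "token-length normalised log-probability" is just the total log-probability divided by the token count (i.e. the mean per-token log-probability), so that longer sequences are not penalised for length alone. A tiny sketch of that guess:

```python
# Guess at the definition: mean per-token log-probability. Two sequences
# with the same per-token probability then score the same regardless of length.
from math import log

def normalised_logprob(token_probs):
    """token_probs: per-token probabilities assigned by the model."""
    total = sum(log(p) for p in token_probs)
    return total / len(token_probs)

short = [0.5, 0.5]               # 2 tokens
long = [0.5, 0.5, 0.5, 0.5]      # 4 tokens, same per-token probability
```

Without the normalisation, the raw log-probability of `long` would be twice as negative as `short`'s even though the model is equally "confident" per token.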

### [2] Scaling Laws for Neural Language Models

https://arxiv.org/pdf/2001.08361

- Curiosity: neutral; too technical to fully understand, but it might have some implications for the future AI landscape.
- The computational cost of training LLMs seems to follow the scaling laws introduced in this paper.
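For reference, the paper's parameter-count scaling law can be sketched as a power law, L(N) = (N_c / N)^α_N, with α_N ≈ 0.076 and N_c ≈ 8.8e13 as reported in the paper (for the compute- and data-unbounded case):

```python
# Loss as a power law in non-embedding parameter count N, per the paper:
# L(N) = (N_c / N) ** alpha_N. Constants are the paper's reported fits.
ALPHA_N = 0.076
N_C = 8.8e13

def loss_from_params(n_params):
    return (N_C / n_params) ** ALPHA_N

l_small = loss_from_params(1e8)    # a 100M-parameter model
l_big = loss_from_params(1e10)     # a 10B-parameter model
```

The small exponent is the point: loss keeps falling with scale, but each constant-factor improvement requires a multiplicative increase in parameters.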

### [3] Attention Is All You Need

- curiosity: high, want to explore more
- The paper introduced the Transformer, an architecture that removed recurrence, resulting in efficient training through parallelism. It is from 2017, but the Transformer is widely used and has contributed hugely to today's language models (and other multi-modal models).
- "Background": conventional ways to track the relationship between two sequences had drawbacks when it comes to tracking global dependencies, due to computational costs, etc.

https://arxiv.org/abs/1706.03762

#### I have not understood:

- The term "transductive" is hard to understand, even after some web searching.
- My hypothesis: does "transduction" mean mapping sequences to sequences? I couldn't confirm this.

- The "Why Self-Attention" section assumes knowledge of attention. I need to get familiar with it through the original paper or a website dedicated to attention.
- This section discusses the computational-complexity differences among attention, LSTMs, convolutional NNs, etc.
- Self-attention per se is not from this paper but was already around. Multiple papers are cited in the paper, and I'm not sure which one(s) is the original.
- Among them, maybe I can try reading a paper from Yoshua Bengio's group? https://arxiv.org/abs/1703.03130 This paper clearly says that it proposes a self-attention mechanism. It mentions a primitive version of self-attention using additive attention, not scaled dot-product attention.
- The additive attention mechanism dates back to 2014, in Yoshua Bengio's group, for example:
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

#### I'm trying to understand

- Q, K, V in scaled dot-product attention: the operation is unclear. Why do we need this?
- The scaling factor in scaled dot-product attention: why did the researchers come up with it? I can understand that large values lead to vanishing gradients, so they make the values smaller. But why specifically sqrt(d_k)?
  - Page 4, footnote 4 mentions the variance of the dot product. The scaling factor is the std, the square root of that variance. They normalize the dot products so that they have zero mean and unit std. There's an explanation on StackExchange.
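Putting the two notes above together, here is a plain-Python sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The √d_k divisor undoes the variance growth of the dot product: if q and k have zero-mean, unit-variance components, their dot product over d_k dimensions has variance d_k. The toy Q/K/V values below are mine, just for illustration.

```python
# Scaled dot-product attention written out with plain lists to see the mechanics.
from math import sqrt, exp

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d_k) for k in K]
        weights = softmax(scores)
        # output = weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                     # one query
K = [[1.0, 0.0], [0.0, 1.0]]         # two keys
V = [[10.0, 0.0], [0.0, 10.0]]       # two values
result = attention(Q, K, V)
```

The query matches the first key more closely, so the output leans toward the first value vector — each output row is a convex combination of the V rows.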

- Encoder-decoder architecture: Input → Encoder → Feature → Decoder → Output
- A good explanation: https://kikaben.com/transformers-encoder-decoder/

- Why is the output of the decoder reconstructed via the attention mechanism?
- 3.2.3 has a good explanation of the different multi-head attentions in the Transformer architecture

### [4] Quantitative knowledge retrieval from large language models

- A work by my former colleagues at DFKI Kaiserslautern and my friend Yuichiro.
- You can see effective sample sizes for the beta distribution: alpha + beta. Why only the beta distribution? Maybe it is sufficiently generic?

https://arxiv.org/abs/2402.07770
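On the "effective sample size" note above: for a Beta(α, β) prior, α + β acts like a count of pseudo-observations, because updating on s successes and f failures in a binomial model gives Beta(α + s, β + f). A minimal sketch (the numbers are just illustrative):

```python
# Beta prior as pseudo-counts: alpha + beta is the prior's "effective
# sample size", and conjugate updating simply adds the observed counts.
def beta_ess(alpha, beta):
    return alpha + beta

def beta_update(alpha, beta, successes, failures):
    return alpha + successes, beta + failures

a, b = beta_update(2.0, 3.0, successes=7, failures=3)   # prior worth 5 pseudo-obs
```

After seeing 10 real observations, the posterior's effective sample size is the prior's 5 plus the data's 10.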

### [5] Sharpness-Aware Minimization for Efficiently Improving Generalization

https://arxiv.org/abs/2010.01412

- Figure 1: Error-reduction trends for multiple datasets. SAM consistently reduces errors, but the effect differs among datasets. I'm curious whether some dataset characteristic is what yields more improvement on some datasets than on others.
- Figure 1: Sharp minima vs. flat minima obtained by SGD and SAM. How did they visualize the high-dimensional loss space? It seems the authors followed Li et al. 2017. I need to check that paper, "Visualizing the Loss Landscape of Neural Nets" (Li et al.) https://arxiv.org/abs/1712.09913
- Aside from the visualization technique per se, the visualization is known to be helpful because it is highly correlated with non-convexity.

- Figure 2: I did not know what "adv" stands for in w_adv (presumably "adversarial"). But I understood the overview of the figure: calculate the term with rho and modify the final gradient a bit so that it minimizes the sharpness in the loss space.
- Equation 2: rho is more complex here than in other papers that mention the SAM algorithm.
- Figure 3: Hessian eigenvalues and the Hessian spectrum, what are they? See 4.2.
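My reading of Fig. 2 / Eq. 2 as code: one SAM step first ascends to the worst nearby point w_adv = w + ρ·g/‖g‖, then descends using the gradient taken there. A 1-D toy sketch with finite-difference gradients; the loss function, ρ, and learning rate below are all made up for illustration.

```python
# One SAM step, 1-D sketch: perturb toward the sharpest direction, then
# apply the ordinary update using the gradient at the perturbed point.
def grad(f, w, eps=1e-5):
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def sam_step(f, w, lr=0.1, rho=0.05):
    g = grad(f, w)
    g_norm = abs(g) or 1.0            # ||g|| in 1-D; avoid divide-by-zero
    w_adv = w + rho * g / g_norm      # ascend to the worst nearby point
    g_adv = grad(f, w_adv)            # gradient evaluated at w_adv
    return w - lr * g_adv             # descend using that gradient

loss = lambda w: (w - 2.0) ** 2      # toy quadratic, minimum at w = 2
w = 0.0
for _ in range(100):
    w = sam_step(loss, w)
```

On this toy loss the iterates hover in a small ρ-sized neighbourhood of the minimum rather than sitting exactly on it, which matches the "minimize the worst nearby loss" objective.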

Do peaks in the Hessian spectra correspond to eigenvalues? It looks like so, reading the eigenvalues and the peaks in the visualization. The peaks for SAM are overall smaller than for SGD.

- Btw, on Hessians and curvature, there is a good summary by the SAM authors. It was rejected from ICLR 2024, but the PDF is available on OpenReview.
- https://openreview.net/pdf?id=Gl4AsqInti
- https://openreview.net/forum?id=Gl4AsqInti
- Hmm.. at a glance, the ICLR reviews are highly rigorous! Wow..
- Maybe this is the arxiv version: https://arxiv.org/abs/2401.10809

### [6] Visualizing the Loss Landscape of Neural Nets

TLDR: The authors introduce a nice visualization of the loss landscape using a "filter normalization" scheme. This is very helpful for studying non-convex optimization in neural nets.

- It's common to have non-convex loss-function optimization with neural nets.
- Figure 1: Skip connections yield flat minima, while without them the training ends up in sharp minima. This figure is similar to Figure 1 of the SAM optimizer paper, because the SAM paper seemingly followed this original paper.
- Figure 1: What is filter normalization?
- Figure 4: ResNet-110 with no skip connections has a somewhat sharp visualization. DenseNet has a very smooth loss landscape. But I'm not sure what the authors want to illustrate with this figure.
- Figures 2-3 and the other ones I did not mention: ignored for now. Too much information. I don't know weight decay, etc., either.
  - Weight decay (Sept 16) is the same as L2 regularization, per this website.
  - L2 regularization: I always forget it every time I learn it, haha.
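On the weight decay = L2 note: for plain SGD the two really do coincide — adding (wd/2)·‖w‖² to the loss contributes wd·w to the gradient, which is exactly an explicit decay of w by lr·wd per step. (They differ for adaptive optimizers like Adam, which is what AdamW fixes.) A 1-D check with made-up numbers:

```python
# Two ways to write the same SGD update: L2 penalty in the gradient vs.
# explicit weight decay. For vanilla SGD they produce identical steps.
def sgd_l2_step(w, grad_loss, lr, wd):
    # gradient of loss + (wd/2)*w^2 is grad_loss + wd*w
    return w - lr * (grad_loss + wd * w)

def sgd_weight_decay_step(w, grad_loss, lr, wd):
    # explicit decay: shrink w, then take the usual gradient step
    return w * (1 - lr * wd) - lr * grad_loss

w, g, lr, wd = 2.0, 0.5, 0.1, 0.01
a = sgd_l2_step(w, g, lr, wd)
b = sgd_weight_decay_step(w, g, lr, wd)
```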

- Intro: Hessian eigenvalues are mentioned as a way to quantitatively measure the non-convexity of the loss function.
- Section 4 has concrete descriptions of filter normalization, but I'll stop reading this paper for now and resume the SAM paper.

https://arxiv.org/abs/1712.09913

### [7] WaveNet: A Generative Model for Raw Audio

https://arxiv.org/pdf/1609.03499

- Abstract: The authors say that they can employ WaveNet as a discriminative model. Is WaveNet (a) a discriminative model, or (b) a generative model by default that can also be modified into a discriminative one?
- 3.4: For speech recognition, WaveNet can serve as a discriminative model. This is interesting.
- But fundamentally, what are generative models and what are discriminative models? Is obtaining a distribution a generative model, and obtaining decision boundaries a discriminative model? I need to check this.

- Conclusion: WaveNets directly process waveforms. It seems this was not common before WaveNet.
- Conclusion: causal filters. Are they related to 'causality'?
- Fig. 2: dilated causal convolutions. Interesting to know that convolutions (or CNNs) are used in TTS and audio processing.
- Causal convolutions: the attempt to hide future inputs reminds me of the masking in the Transformer decoder.

- Fig. 3: Why does the dilation change, increasing from input to output?
- Fig. 4: residual blocks (help deeper neural nets converge quickly), skip connections (mitigate vanishing gradients)
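One answer to the Fig. 3 question above: doubling the dilation per layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly. The receptive field of a stack of dilated causal convolutions is 1 + Σ (kernel_size − 1)·dilation over the layers; a quick check (the schedule below mimics the paper's 1, 2, 4, …, 512 pattern):

```python
# Receptive field of stacked dilated causal convolutions:
# each layer adds (kernel_size - 1) * dilation input steps of context.
def receptive_field(dilations, kernel_size=2):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

doubling = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]   # WaveNet-style stack
constant = [1] * 10                                  # same depth, no dilation

rf_doubling = receptive_field(doubling)   # 1 + 1023 = 1024 samples
rf_constant = receptive_field(constant)   # 1 + 10 = 11 samples
```

Same ten layers, ~100× more audio context with the doubling schedule — which matters when one second of raw audio is 16,000 samples.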

Thoughts: Audio generation seemingly involves a lot of probabilistic modeling. Interesting that CNNs can be used for audio modeling.

### [8] Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI

https://arxiv.org/abs/2402.00809

**Here are my scribbles, to be organized:**

- Abstract: continual learning, active learning, uncertainty quantification: BDL helps
- Fig. 1: LLMs can report confidence (but LLMs can't handle numbers well, so it might not be a fair comparison against Bayesian uncertainty quantification)
- Fig. 2: BDL methods: MAP, Laplace, variational, and MCMC
- Hyper-priors reduce hyperparameter tuning?
- BDL is good for handling adversarial attacks and hallucination
- Bayesian experimental design, optimization, and model selection: what are they?
- The probabilistic nature of the Bayesian paradigm has regularization effects
- With small sample sizes, BDL works
- For foundation-model fine-tuning with small data, BDL works
- For active learning in RLHF, BDL helps
- BDL is computationally intensive
- GPs (Gaussian processes) remain popular
- SWAG considers (estimates) curvature
- SG-MCMC (MCMC in BDL) is slower to converge, with overhead from extra steps
- Generally, Monte Carlo is slow, and finding hardware acceleration would be nice
- Getting a high-quality posterior is also important
- Applying BDL to LLMs is unexplored
- LoRA (low-rank adaptation) actually comes with Bayesian low-rank optimization

### [9] Bayesian Optimization in AlphaGo

https://arxiv.org/pdf/1812.06855

- AlphaGo, an RL system, has many hyperparameters to be tuned. RL is known to behave differently depending on its hyperparameters. The AlphaGo team utilized Bayesian optimization for that hyperparameter tuning, which is automatic. This contributed to AlphaGo's winning capabilities.
- MCTS: ?
- Fig. 1: Gaussian processes (GPs) illustration. EI, the expected improvement acquisition, is something new to me, something behind the scenes of GPs (and Bayesian optimization).
- It seems GPs find the maximum of the current iteration's EI, and that maximizer becomes the next query point (next sample point).
- Before the process converges (or stops), the EI distribution is broad with non-zero values. At the end of the process, the EI distribution is flat with small values, with not much difference among its values, indicating less uncertainty at the end.
- Cool sentence: "EI trades off exploration and exploitation."
- I take it as a positive position on EI. For exploration, a GP can pick the next query point among high-uncertainty, unexplored points. For the exploitation part, it considers the vicinity of already-sampled points (the closer to those points, the less uncertainty), so at each step it succeeds in using the information we already have.
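The EI acquisition the figure illustrates can be written down directly (for maximisation): with GP posterior mean μ and std σ at a candidate point, and best observed value f*, EI = (μ − f*)·Φ(z) + σ·φ(z) with z = (μ − f*)/σ. The first term is the exploitation part (high mean), the second the exploration part (high uncertainty). A stdlib-only sketch; the example values are mine.

```python
# Expected Improvement for maximisation, using the standard normal pdf/cdf.
from math import sqrt, pi, exp, erf

def norm_pdf(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def expected_improvement(mu, sigma, f_best):
    if sigma == 0.0:
        return max(mu - f_best, 0.0)    # no uncertainty: improvement is deterministic
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

# Same posterior mean as the incumbent, but different uncertainties:
ei_uncertain = expected_improvement(mu=0.0, sigma=1.0, f_best=0.0)    # ~0.399
ei_certain = expected_improvement(mu=0.0, sigma=0.01, f_best=0.0)
```

With equal means, EI is proportional to σ, which is exactly the exploration bonus: points we know little about still look promising.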

- Fig. 2: I did not understand what the authors want to say with this figure.
- From the intro, MCTS stands for Monte Carlo Tree Search, a step after neural-net training.
- Other hyperparameters, like the distributed system and the mixing ratio, are not quite explained in the paper. Maybe they don't matter much? Just example parameters to illustrate the posteriors?

- Fig. 3: Comparison between the observed winning rate and the expected winning rate
- The expected winning rate has variance, but overall the two values seem correlated with each other

- Fig. 4: mixing ratio, not sure
- Fig. 5: time control, with respect to the time budget? Not sure
- Intro: UCT exploration formula
- UCT: UCB applied to trees, per this article
- UCB: upper confidence bounds (I've seen this in Bayesian optimization)
- For now I won't search too much about this.
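My understanding of the UCT exploration formula mentioned in the intro: each child of a tree node is scored with UCB1, i.e. its mean value plus an exploration bonus that grows for rarely-visited children; the constant c is a tunable trade-off (the value below is the common √2). A sketch:

```python
# UCB1 score as used by UCT for child selection in the search tree:
#   score = Q_i + c * sqrt(ln(N_parent) / n_i)
# Q_i: mean value of child i, n_i: child visit count, N_parent: parent visits.
from math import sqrt, log

def ucb1(mean_value, child_visits, parent_visits, c=1.414):
    return mean_value + c * sqrt(log(parent_visits) / child_visits)

# Two children with the same mean value; the less-visited one scores higher,
# so the search is nudged toward under-explored moves.
score_rare = ucb1(0.5, child_visits=2, parent_visits=100)
score_common = ucb1(0.5, child_visits=50, parent_visits=100)
```

This is the same explore/exploit trade-off as EI above, just for the multi-armed bandit at every tree node rather than a GP over hyperparameters.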

- "Multi-armed bandit problem": a related concept, will check later
- Policy and value networks: two important components of AlphaGo
- They also tried grid search for each hyperparameter, and it was too expensive

- Methods: the posterior is used to calculate the next query points (is this a Bayesian-specific thing, or a Bayesian-optimization thing?)
- The EI function formula: I want to understand it
- Elo gain: there's the Elo rating system (for zero-sum games)
- "Byoyomi" was in the session: a 60-second constraint for tree search + other calculations

#### change logs

- 2024-08-11: created this website, and added x3 papers
- 2024-08-19: added some notes to the attention paper, with links to the papers that introduced self-attention & the attention mechanism.
- 2024-08-30: Separated subsections to clarify what I haven't understood. Added SAM optimizer paper entry.
- 2024-08-31: Organized Transformer paper a bit, mainly read about the attention mechanism in general.
- 2024-09-16: Skimmed through a bit of the WaveNet paper, searched about generative vs. discriminative models
- 2024-10-01: Added AlphaGo and Bayesian deep learning survey paper (diff from Sept)