2024 Paper Reading Logs

This is my personal logs for papers I read.
Brain-vommiting Some interesting findings from the papers.

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

https://arxiv.org/abs/2407.12687

curiousity: interesting, but less technical details revealed in the report..
The paper:
- majorly focuses HCI aspects with newly emergent Gen AI technologies in EdTech.
- summarizes current situations of gen AI tutor for educaiton in the form of a chatbot from pedagogical perspectives with enphasis on evaluation, introducing their benchmarks tailored for education context.
The authors used SFT (supervised fine tuning) so that their models learn pedagogically good behaviors.
Future work is mentioned, that is trying to use RLHF (reinforcement learning human feedback)
As in intro, evaluations for EdTech realm has been suffered their deisgns and benchmarks.
the team has fine-tuned Gemini 1.0 and created LearnerLM-Tutor which has several better aspects than the original model.
reports some interesting (but with insignificance) findings
Interesting that there's "sycophancy" in LLMs behavior in general as in Appendix. D
- The researchers tried to overcome this and the resultant model are good at finding a learner's mistakes.
The researchers saw some statistical insignificance as if often the case in EdTech HCI evaluation.
In automatic eval using LLMs, they made multiple LLM evaluators for each pedagogical dimensions
The result of auto-evals seems promising (I mean, not only the fine-tuned model itself, but the efficacy of LLM as an auto-evals)
Fine-tuning (SFT) yielded better improvent on pedagogical aspects than prompt engineering (meaning everyone needs to learn about SFT or RLHF..?)
Interested that they did redteaming for safety eval.
There are not many references from HCI conferences, such as CHI or UbiComp in the paper. I'm sure we can find relevant literature from those venues, too. But this is a tech report, so it does not matter anyways?

I haven't understood

How did they fine-tune the model?—This technical report does not cover such details.
Welch t-test, I did not know about it. What's different from the usual t-test?
- Welch's t-test is a better version of student t-test per wikipedia.
- ref https://zenn.dev/tmizuho/books/3d511e017bfd23/viewer/62b1c7
- If normality is not garanteed, better to use Wilcoxon rank sum test per the website
I'm not familiar with effect sizes in welch's t-test
What is "token-length normalised log-probability"? They just consider the token length, like deviding probability by token-length?

Scaling Laws for Neural Language Models

https://arxiv.org/pdf/2001.08361

Curiousity: neutral, too technial to understand, but might have some implication for the future AI landscape.
Computational costs to learn LLMs seem to follow the scaling-law introduced in this paper.

Attention Is All You Need

curiousity: high, want to explore more
The paper introduced Transformer, an architecture that removed recurrent models, resulting in efficient training with paralellism. It was from 2017, but the Transformer is widely used and largely contributed to today's language models (and other multi-modal models).
"Background" Conventional ways to track the relationshpt of the two sequences had drawbacks when it comes to track global dependency, due to computational costs, etc

https://arxiv.org/abs/1706.03762

I have not understood:

The term, "Transductive" is hard to understand even doing some websearch.
- My hypothesis: "Transduction", is this mapping sequences to sequenses? I can find this.
The "Why self-attention" section assumes knowledge on attension. I need to get familiar with those with original paper or website dedicated for attention.
This section discuesses computation complexity differences among attention, LSTM, convolutional NN, etc.
Self-attentions per se are not from the paper but already there. Multiple papers are cited in the paper and not sure which one(s) is the orginal.
- Among them, maybe I can try reading a paper from Yoshua Bengio's group? https://arxiv.org/abs/1703.03130 This paper clearly says that it proposes a self-attention mechanism. This one mentions the primitive version of self-attentions with addition, not scaled dot product attention.
- The attention mechanism with addition dates back to 2024 in Yoshua Bengio's group, example:
  - Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

I'm trying to understand

Q, K, V in scaled dot product attention, the operation is unclear. Why do we need this?
The scaling factor in the scaled dot attention, why did the reearchers come up with this? I can understand that big numerical values will lead to gradient vanish, so they are making values smaller. But why specifically sqrt(d_k)?
- Page 4, footnote 4 mentions about the variance of the dot product. The scaling factor is the std, sqrt of the variance. They normalize the dot product so that they will have zero mean & unit std dot products. An explanation from stackexchange
Encoder-decoder archtecture. Input-Encoder-Feature-Decoder-Output
- A good explanation: https://kikaben.com/transformers-encoder-decoder/
Why the output of decoder is re-constructed via attention mechanism?
3.2.3 has a good explanation of different multi-head attentions in the transfo

rmer architecture

Quantitative knowledge retrieval from large language models

A work by my former colleagues in DFKI Kaiserslautern and my friend Yuichiro.
Can see effective sample sizes for beta distribution, alpha + beta. Why only beta distribution? Maybe it is sufficiently generic?

https://arxiv.org/abs/2402.07770

Sharpness-Aware Minimization for Efficiently Improving Generalization

https://arxiv.org/abs/2010.01412

Figure 1: Error reduction trends for multiple datasets. Consistently SAM reduces errors but the effect is different among dataset. I'm curious dataset charasteristic somehow in some datasets yields more improvements than other datasets?
Figure 1: Sharp mimima vs flat minima obtained by SGD and SAM. How did they visualize the high dimensional loss space? It seems the authors followed Li 2017 et. al. I need to check the paper, "visualizing the loss landscape of neural nets" (li et. al.) https://arxiv.org/abs/1712.09913
- Aside from the visualization technique per se, the visualization is known to be helpful because it is highly correlated with non-convexity.
Figure 2: I did not know what adv stands for in W_adv. But I understood the overview of the figure, calculating the term with rho and modify the final gradient a bit such that mimizes the sharpness in the loss space.
Equation 2: rho is more complex than other papers mentions SAM alrorithm
Figure 3: Hessian eiganval, Hessian spectrum, what are they? See 4.2.

Do peeks in Hessian spectra correspond to eiganvalues? Looks like so reading the eigan values and peaks in the visualization. The peaks in SAM overall are smaller than SGD.

Btw, Hessian and curvature, there is a good summary by the sam authors. It was rejected by ICLR2024 but the pdf is available in OpenReview.
- https://openreview.net/pdf?id=Gl4AsqInti
- https://openreview.net/forum?id=Gl4AsqInti
- Hmm.. at a glance, the review for ICLR is highly rigorous! Wow..
- Maybe this is the arxiv version: https://arxiv.org/abs/2401.10809

visualizing the loss landscape of neural nets

TLDR: The authors introduced a nice visualization of the loss landscape using "filter normalizaton" scheme. This is very helpful for non-convex optimization in neural nets.

It's common to have non-convex loss function optimazation with neural nets.
Figure 1: Skip connection yields flatten minima while otherwise the model training ends up with sharp minima. This fig is similar to SAM optimizer Figur 1 because seemingly the SAM paper followed the original paper.
Figure 1: What is the filter normalization?
Figure 4: ResNet-110 with no skip connection has a bit sharp visualization. DenseNet has very smooth loss landscape. But not sure what the authors want to illustate with this figure.
Figure 2-3 and other ones I did not mention: Ignored for now. Too much information. I don't know weight decay, etc either.
- weight decay (sept 16) is the same as L2 regularization per this website.
- L2 Regularization—I always forget every time I learn this, haha.
Intro: Hessian eigenval is mentioned to quantitatively measure the non-convexity of the loss function.
Section 4 has concrete descriptions for the filter normalization, but I stop reading the paper for now. I will resume the sam paper for now.

https://arxiv.org/abs/1712.09913

WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

https://arxiv.org/pdf/1609.03499

Abstract: The authors say that they can employ wavenet as a discriminative model. Wavenet is a discriminative model, or (b) is a generative-model by default but also can be modified into a discriminative model.
- 3.4 for speech recognition, wavenet can serve as discriminative model. This is interesting.
- but fundametally what are the generative models, and what are the discrimative model? Obtaining distribution is a generative model, and obtaining decision boundaries is a discriminative model? I need to check this.
Conclusion: Wavenets direcly process waveform. Seems like this was not common before wavenet.
Conclusion: causal filters, it's related to 'causality'?
Fig2: dilated causal convolutions. Interesting to know that convolutions (or CNNs) are used in tts or audio processing.
- causal convolution, the attempts to hide future inputs remind me of the masking in transformer decoder.
Fig3: why changing dilation, inreasing from input to output?
Fig4: residual block (helps deeper neural nets to converge quickly), skip connection (mitigating the vanishing gradients)

Thoughts: Audio generation seemingly highly involve with probabilistic modelings. Interesting that CNN can be used for audio modeling.

Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI

https://arxiv.org/abs/2402.00809

Here are my scribbles, to be orgaznized:
ABS: continual learning, active learning, uncertainty quantification: BDL helps
Fig1: llm can report confidence (but llms can't handle numbers well, so it might not be a failr for llm to compare bayesian uncertain quantification
Fig 2: BDL methods, MAP, laplace, variational and MCMC
hyper-priors reduces hyperparam tunins?
BDL is good for handling advasarial attacks and halcunation
bayesian experimental design, optimizatin and model selection, what are they?
probablistic nature of bayesian paradigm has regularizaton effects
small sample size, BDL works
foundation model finetuning with small data, BDL works
active learning of rlhf, BDL helps
BDL is computationally intensive
GPs, gaussian proccess remain popular
SWAG considers (estimates) curvature
SG-MCMC (MCMC in BDL) slower to converge with overhead with extra steps
generally, monte carlo is slow and finding hardware acceleration would be nice
getting high quality of posterior is also important
BDL to LLM is unexplored
LoRa (low rank adaptation) actually comes wihth bayesian low-rank optimization

Bayesian Optimization in AlphaGo

https://arxiv.org/pdf/1812.06855

AlphaGo, an RL system, has many hyperparams to be tuned. RL is known to behave differently depending on its hyperparams. AlphaGo team utilized Bayesian optimization for those hyperparam tuning, which is automatic. This contributed to AlphaGo's winning capabilities.
MCTS:?
Fig1: Gausian Processes, GPs, illustration. EI, expected improvement aquisition, is something new to me, something behind the scenes of GPs (and Bayesian optimization.)
- seems like GPs calculate maxima in the current iteration's EI, then next query point would be that point (next sample point)
- before the process converges (or stops), the distribution of EI is broad with non-zero values. at the end of the process, that EI distribution is flat with small values, with not much difference in its values—indicating less uncertainty at the end.
- cool sentence, "EI trade-off exploration and exploitationn"
- I take it as a positive position on EI. For exploration, GP can find the next query point with high uncertainty and unexplored points. For the exploitaiton part, the vecinity between already-sampled points (the closer to those points, less uncertainty) and succeeds to consider such information that we already have at each step.
Fig 2: did not understand what authors want to say with this fig.
- From intro, MCTS stands for Monte Carlo Tree Search, a step after neural nets training.
- other hyper params, distribute sys and mixing ration, they are not quite explained in the paper. maybe something don't matter? just param examples for illustration of the posteriors?
Fig 3: Comparisons between winning rate observation vs expected winnig rate
- The expected winning rate has variance, but overall seems two values are corralated to each other
Fig 4: mixing ratio, not sure
Fig 5: time control, with respect to time budget? not sure
Intro: UCT exploration formlula
- UCT: UCB applied to trees per this articleA
  - UCB: Upper confidence bounds (I've seen this in bayesian optimizaton)
    - for now I don't search too much about this.
- "Multi-armed bandit problem", related concepts, will check later
- policy and value network, two important components of alphago
- they also tried grid search for each hyper params, and it was too expensive
MEthods: posterior is used to calculate the next query points (is this bayesian specific thing, or bayesian optimization thing?)
- EI function formula, I want to understand
- Elo gain: there's Elo rating system (for zero sum games)
- "byoyomi" was in the session— 60 sec constraints for tree search + other calculations

Evolutionary Optimization of Model Merging Recipes

https://arxiv.org/pdf/2411.03028

Abstract: param space and data flow space, the latter, I'm not familiar.
Limitations: instruction fine-turning or alignment, opssible future work
fig 1: model mernging, combining models that can partially correctlly ansers, resulting in models that can answer both quesitons.
mgsm ja I have looked at and people names are all western, what if a person's name changes, does that change the perfomrance? for exaple, James to Takashi.

https://huggingface.co/datasets/juletxara/mgsm/viewer/ja/test?row=4

fig 2: mgsm ja perfomrmacne. model merge generally yeilds better peformances. PS outpeforms DFS. but PS vs (PS + DFS) are equally good, it seems. from acccuracy in the table 1, it seems PS + DFS slightly outperforms PS solely. o
fig 3: left, density vs weight, what are they? right: not sure what it is. layer indices charts. first, I need to understand DFS then come back to this fig.
Intro: DFS—infrence paths, still I don't get it.
2.1 shmithuber's falt minima is mentioend
2.2. related papers descriptions are concise and I don't have the domain knowledge so hard to understand. for now let's skip them.
2.3. authors won't change transformer blocks (they were trainied with a lot of compute already) but tweaking other blocks. maybe they are mixing those blocks but don't want to change them.
they mention wight agnostic neural nets (david work) as related work. wanna read this in the future. (having some inductive biases of specific neural nets are enough, and we don't need to train params? curious) I have read the abstract and it is highly interesing!
they also mention david's hypernetworks (a network is that determines the main network's params.) add it to my reading list.
3 method

merging the process to two distict and orthogonal config spaces, analysing their individual impacts — what does this sentence mean?

3.2 merging in the data flow space

the term, "budget", is it evolution strategy term?
changing the orders or usages (use or not) of the transformer blocks but keeping those weights intact
as expected distribution shift happens for each block, resulting performance degradation but mitigation is done by scaling an input.

Graph Agnostic Causal Bayesian Optimisation

My friend, Sumantrak's recent work. https://arxiv.org/abs/2411.03028

abstract:

causal graph, a graph that captures causal relationships of variables.
cbo, causal bayesien optimization is a thing?
- Aglietti et al. https://arxiv.org/abs/2005.11741
conclustion: with causal structures unknown settings, the proposed method yeilds sota. theoretical stuff is a future work.
- what are the hard and soft interventions? maybe causal inference specific thing?
- hard interventions: changing some variables to fixed values
- soft interventions: changing distributions or adding noises
fig1 shows reward. is the method related to rl in some ways?
- maybe causal inference often involves reward calculation?
- maybe not? lets forget it for now
fig2 no idea whats happening here
fig3 no idea, what are the rewards?

intro:

bayesian scoring, i don't know abou it

background and problem settigns

i forgot the concept, "compact," in math...
also "power set"

ignored math details for now.

method:

Markov property, the current state only relys on the prev output
they mention bayesian networks in this section. how bayesien networks relate to their method?
generally, i need to learn more about UCB.
plausible function, what is it?
causal graph, a graph that captures causal relationships of variables.
cbo, causal bayesien optimization is originally from Aglietti et al. https://arxiv.org/abs/2005.11741 Bayesian optimization utilizing causal relationship (but I don't know more than the high-level overview. If I want to know more about it better to read the original paper.)

conclustion: with causal structures unknown settings, the proposed method yeilds sota. theoretical analysis is future work. The authors tested with the hard and soft interventions. I need to check each scenario from the paper. (2.1: hard interventions, 2.2: soft interventions)

Conclusion also mentions function uncertainty, which is also explained in 3.1 and I need to read it later why picking actions only when it might improve rewards leads to reduced function uncertainty.

"Cumulative regret", I'm not familiar yet
Conclusion says MCBO had previously the SOTA. I also read MCBO section.
what are the hard and soft interventions? maybe causal inference specific thing?
- hard interventions: changing some variables to fixed values (the corresponding section is somewhat understandable, while that of soft interventions are more difficult to understand)
- soft interventions: changing distributions or adding noises

2.1 hard interventions - for now I can focus on it and ignore the soft one to save my brain capacity. But even with the part explaining SCM, structural causal models, the notations are so difficult for me with a lot of variables.

Reexamining computer ethics in light of AI systems and AI regulation

https://link.springer.com/article/10.1007/s43681-022-00229-6

Design-oriented vs policy-oriented computer ethics and regulations, a very interesting review on the current computer ethics landscape surrounding AI, written by Prof. Simon, who gave a talk in the trilateral AI conference the other day.

The recent AI regulation trends, such as The AI Act from Europe, falls within policy-oriented computer ethics mindset, which speculates risks in advance and tries to mitigate those risks using policies. This may not address future AI risks, hence, there will be policy vaccuum, when an AI system has an emergent capability.
The other approach, design-oriented computer ethics rather focuses on design decision making processes and the value of those design. The author is rather in favor of this approcah (also when I asked the question to the first author in the trilateral AI conference in Tokyo, Japan)

Bayesian Optimization of Function Networks

This is realted to Graph Agnostic CBO. I'm reading this carefully. later add my notes here.

BATINeT: Background-Aware Text to Image Synthesis and Manipulation Network

Generating foreground based on background images. Later write my thoughts if I have time.

Interactive image manipulation with complex text instructions

This improves image manipulation with texts using segmentation masks and super-resolution network. Later write my thoughts if I have time.

A Tutorial on Bayesian Optimization (Frazier 2018)

https://arxiv.org/pdf/1807.02811

Read to learn more about the overview of BO, GP regression, EI and related concepts.

Stochastic noise, maybe the "z thing" in some BO papers? Related to reparam trick in BO?
BO TLDR: “BayesOpt is designed for black-box derivative free global optimization”
- There are multiple optimization methods for derivative free and they usually use surrogate models.
- In BO, surrogate models are constructed with bayesian statistics.
GP regression is a particular use cases of GP.
The tutorial mentions alternatives other than EI (e.g., KG, entropy search), but for now I focus EI.
my thoughts: gacbo is perhaps BO with EI and noisy measurement (aka stochastic noise). the tutorial says Scot et al is a popular formulation for this.

Budget N (function evaluation), is a common phrase in BO.

GP regression yields estimated points and their credible intervals!
Curious how they calculate credible intervals as well as points estimate.
[not realted to the paper] UCB’s confidence bound, why not “credible” bounds?
misc thoughts: covariance large sigma, I learned this in college but forgot how's different from the variance in the non-multivairaite sattings
TIL MAB (multi-armed bandits) are not part of BO although the idea of exploration-vs-exploitation tradeoff also arise in both EI and MAB settings
WIP: reading the noisy BO settings, where KG works much better than EI. Read this part and figure out if this part is useful for BOFN paper (and eventually, GACBO paper)
Noisy evaluation (part of exotic BO), still difficult with maths.
GP-UCB is built upon GP regression
EI is very intuitive. Practically, seems calculating the expected value is difficult
Term: 1st vs 2nd order optiimzation.

https://stats.stackexchange.com/questions/328492/the-actual-role-of-second-order-optimization-as-oppose-to-first-order-optimizati

change logs

2024-08-11: created this website, and added x3 papers
2024-08-19: added some to attention paper, with links to papers which introduced self-attentino & attention mecahism.
2024-08-30: Separated subsections to clarify what I haven't understood. Added SAM optimizer paper entry.
2024-08-31: Organized Transformer paper a bit, mainly read about the attention mechanism in general.
2024-09-16: Skimmed thru a bit of WaveNet paper, searched about generative vs discriminative model
2024-10-01: Added AlphaGo and Bayesian deep learning survey paper (diff from Sept)
2025-01-11: Added my scrribles for BO tutorial