2024 Paper Reading Logs
- This is my personal log of papers I read.
- Brain-vomiting some interesting findings from the papers.
[1] Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach
https://arxiv.org/abs/2407.12687
- curiosity: interesting, but not many technical details are revealed in the report..
- The paper:
- mainly focuses on HCI aspects of newly emergent gen AI technologies in EdTech.
- summarizes the current situation of gen AI tutors for education (in the form of a chatbot) from pedagogical perspectives, with emphasis on evaluation, introducing their benchmarks tailored to the education context.
- The authors used SFT (supervised fine-tuning) so that their models learn pedagogically good behaviors.
- Future work is mentioned: trying RLHF (reinforcement learning from human feedback)
- As the intro notes, evaluations in the EdTech realm have suffered from their designs and benchmarks.
- The team fine-tuned Gemini 1.0 and created LearnLM-Tutor, which has several better aspects than the original model.
- reports some interesting (though statistically insignificant) findings
- Interesting that there's "sycophancy" in LLM behavior in general, as in Appendix D
- The researchers tried to overcome this, and the resultant model is good at finding a learner's mistakes.
- The researchers saw some statistical insignificance, as is often the case in EdTech HCI evaluation.
- In the automatic eval using LLMs, they made multiple LLM evaluators, one for each pedagogical dimension
- The results of the auto-evals seem promising (not only for the fine-tuned model itself, but also for the efficacy of LLMs as auto-evaluators)
- Fine-tuning (SFT) yielded better improvements on pedagogical aspects than prompt engineering (meaning everyone needs to learn about SFT or RLHF..?)
- Interesting that they did red-teaming for the safety eval.
- There are not many references from HCI conferences, such as CHI or UbiComp, in the paper. I'm sure we can find relevant literature from those venues, too. But this is a tech report, so it does not matter anyway?
I haven't understood
- How did they fine-tune the model?—This technical report does not cover such details.
- Welch's t-test, I did not know about it. What's different from the usual t-test?
- Welch's t-test is an improved version of Student's t-test that does not assume equal variances, per Wikipedia (see the sketch after this list)
- ref https://zenn.dev/tmizuho/books/3d511e017bfd23/viewer/62b1c7
- If normality is not guaranteed, better to use the Wilcoxon rank-sum test, per the website
- I'm not familiar with effect sizes in Welch's t-test
- What is "token-length normalised log-probability"? Do they just account for the token length, presumably dividing the sequence log-probability by the number of tokens?
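A quick sketch to convince myself of the Welch vs. Student difference, using scipy (the toy data is mine, not from the paper; `ttest_ind` with `equal_var=False` is Welch's version):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=40)  # e.g., control group
b = rng.normal(0.5, 2.0, size=25)  # treatment group: unequal variance & size

# Student's t-test assumes equal variances; Welch's does not.
t_s, p_s = stats.ttest_ind(a, b, equal_var=True)   # Student
t_w, p_w = stats.ttest_ind(a, b, equal_var=False)  # Welch
print(f"Student: t={t_s:.3f}, p={p_s:.3f}")
print(f"Welch:   t={t_w:.3f}, p={p_w:.3f}")

# If normality is doubtful, the Wilcoxon rank-sum test is the non-parametric alternative:
w, p_r = stats.ranksums(a, b)
print(f"Wilcoxon rank-sum: p={p_r:.3f}")
```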
[2] Scaling Laws for Neural Language Models
https://arxiv.org/pdf/2001.08361
- Curiosity: neutral, too technical to understand fully, but might have some implications for the future AI landscape.
- Computational costs to train LLMs seem to follow the scaling law introduced in this paper (see the sketch below).
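To make the law concrete: the paper's headline result is that test loss follows a power law in each resource when the other two are not bottlenecks. A sketch of the form (the exponent values are from the paper's summary as I remember them, so treat them as approximate):

```latex
% Test loss as a power law in each resource (others not bottlenecked):
L(N) = (N_c / N)^{\alpha_N}, \quad \alpha_N \approx 0.076  % N: non-embedding parameters
L(D) = (D_c / D)^{\alpha_D}, \quad \alpha_D \approx 0.095  % D: dataset size (tokens)
L(C_{\mathrm{min}}) = (C_c / C_{\mathrm{min}})^{\alpha_C}, \quad \alpha_C \approx 0.050  % C: compute
```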
[3] Attention Is All You Need
https://arxiv.org/abs/1706.03762
- curiosity: high, want to explore more
- The paper introduced the Transformer, an architecture that removed recurrence, resulting in efficient training via parallelism. It is from 2017, but the Transformer is widely used and has contributed greatly to today's language models (and other multi-modal models).
- "Background": conventional ways to track the relationship between two sequences had drawbacks when it came to tracking global dependencies, due to computational costs, etc.
I have not understood:
- The term "transductive" is hard to understand even after some web search.
- My hypothesis: "transduction" here means mapping sequences to sequences? I couldn't confirm this.
- The "Why Self-Attention" section assumes knowledge of attention. I need to get familiar with it via the original paper or a website dedicated to attention.
- This section discusses the computational-complexity differences among attention, LSTMs, convolutional NNs, etc.
- Self-attention per se is not from this paper but was already around. Multiple papers are cited in the paper and I'm not sure which one(s) is the original.
- Among them, maybe I can try reading a paper from Yoshua Bengio's group? https://arxiv.org/abs/1703.03130 This paper clearly says that it proposes a self-attention mechanism. It uses the primitive version of self-attention with addition, not scaled dot-product attention.
- The additive attention mechanism dates back to 2014, in Yoshua Bengio's group, for example:
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
I'm trying to understand
- Q, K, V in scaled dot-product attention: the operation is unclear. Why do we need this? (see the sketch after this list)
- The scaling factor in scaled dot-product attention: why did the researchers come up with this? I can understand that big values lead to vanishing gradients, so they are making the values smaller. But why specifically sqrt(d_k)?
- Page 4, footnote 4 mentions the variance of the dot product. The scaling factor is the std, the sqrt of the variance: if the components of q and k are independent with mean 0 and variance 1, then q·k = Σ q_i k_i has mean 0 and variance d_k, so dividing by sqrt(d_k) yields zero-mean & unit-std dot products. There's also an explanation on StackExchange.
- Encoder-decoder architecture: Input-Encoder-Feature-Decoder-Output
- A good explanation: https://kikaben.com/transformers-encoder-decoder/
- Why is the output of the decoder reconstructed via the attention mechanism?
- 3.2.3 has a good explanation of the different multi-head attentions in the transformer architecture
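A minimal numpy sketch of Eq. 1, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, to make the Q/K/V operation and the scaling concrete (the shapes below are my own toy choices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Eq. 1 of the paper. Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # sqrt(d_k) keeps scores ~unit variance
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # e.g., hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output row is a weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 2))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 2)
```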
[4] Quantitative knowledge retrieval from large language models
https://arxiv.org/abs/2402.07770
- A work by my former colleagues at DFKI Kaiserslautern and my friend Yuichiro.
- You can see the effective sample size for the beta distribution, alpha + beta (a Beta(alpha, beta) prior acts like alpha + beta pseudo-observations). Why only the beta distribution? Maybe it is sufficiently generic?
[5] Sharpness-Aware Minimization for Efficiently Improving Generalization
https://arxiv.org/abs/2010.01412
- Figure 1: Error-reduction trends for multiple datasets. SAM consistently reduces errors, but the effect differs among datasets. I'm curious whether some dataset characteristic yields more improvement in some datasets than in others?
- Figure 1: Sharp minima vs. flat minima obtained by SGD and SAM. How did they visualize the high-dimensional loss space? It seems the authors followed Li et al. 2017. I need to check that paper, "Visualizing the Loss Landscape of Neural Nets" (Li et al.) https://arxiv.org/abs/1712.09913
- Aside from the visualization technique per se, the visualization is known to be helpful because it is highly correlated with non-convexity.
- Figure 2: I did not know what "adv" stands for in w_adv (adversarial: the perturbed weights that maximize the loss within the rho-ball). But I understood the overview of the figure: calculate the term with rho and modify the final gradient a bit so as to minimize the sharpness of the loss landscape (see the sketch at the end of this entry).
- Equation 2: the treatment of rho here is more involved than in other papers that mention the SAM algorithm
- Figure 3: Hessian eigenvalues, Hessian spectrum, what are they? See 4.2.
- Do peaks in the Hessian spectra correspond to eigenvalues? Looks like it, reading the eigenvalues and peaks in the visualization. The peaks for SAM are overall smaller than for SGD.
- Btw, on Hessians and curvature, there is a good summary by the SAM authors. It was rejected from ICLR 2024, but the PDF is available on OpenReview.
- https://openreview.net/pdf?id=Gl4AsqInti
- https://openreview.net/forum?id=Gl4AsqInti
- Hmm.. at a glance, ICLR reviewing is highly rigorous! Wow..
- Maybe this is the arxiv version: https://arxiv.org/abs/2401.10809
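Back to the SAM update itself: a minimal sketch of what I take Figure 2 / Equation 2 to mean (a toy 1-D version, not the authors' implementation; the `rho` and `lr` values are arbitrary):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM step: move to the (first-order) worst point w_adv in the
    rho-ball, take the gradient there, and apply it to the original w."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction, length rho
    g_adv = grad_fn(w + eps)                     # gradient at w_adv = w + eps
    return w - lr * g_adv                        # update the *original* weights

# toy loss f(w) = w^4 (sharp near its minimum at 0)
grad = lambda w: 4.0 * w**3
w = np.array([1.0])
for _ in range(200):
    w = sam_step(w, grad)
print(w)  # decays toward the minimum at 0
```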
[6] Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
TLDR: The authors introduced a nice visualization of the loss landscape using a "filter normalization" scheme. This is very helpful for non-convex optimization in neural nets.
- It's common to have non-convex loss optimization with neural nets.
- Figure 1: Skip connections yield flat minima, while otherwise the model training ends up at sharp minima. This figure is similar to Figure 1 of the SAM optimizer paper, because the SAM paper seemingly followed this original paper.
- Figure 1: What is the filter normalization? (see the sketch at the end of this entry)
- Figure 4: ResNet-110 without skip connections has a somewhat sharp visualization. DenseNet has a very smooth loss landscape. But I'm not sure what the authors want to illustrate with this figure.
- Figures 2-3 and the other ones I did not mention: ignored for now. Too much information. I don't know weight decay, etc. either.
- weight decay (Sept 16) is the same as L2 regularization, per this website.
- L2 regularization: I always forget it every time I learn it, haha.
- Intro: Hessian eigenvalues are mentioned as a way to quantitatively measure the non-convexity of the loss function.
- Section 4 has concrete descriptions of the filter normalization, but I'll stop reading this paper here and resume the SAM paper for now.
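My rough understanding of the filter-normalization step, as a sketch (per Section 4 as I read it: rescale each filter of a random direction to match the norm of the corresponding filter in the trained weights; the toy shapes and loss are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def filter_normalize(d, w):
    """Rescale each 'filter' (row) of the random direction d so its norm
    matches the corresponding filter of the trained weights w."""
    out = d.copy()
    for i in range(len(out)):
        out[i] *= np.linalg.norm(w[i]) / (np.linalg.norm(out[i]) + 1e-10)
    return out

w = rng.normal(size=(8, 16))  # pretend: 8 filters of a trained layer
d = filter_normalize(rng.normal(size=w.shape), w)

# 1-D slice of a toy loss along the normalized direction: f(a) = L(w + a*d)
loss = lambda theta: np.mean(theta**2)
for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(a, loss(w + a * d))
```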
[7] WaveNet: A Generative Model for Raw Audio
https://arxiv.org/pdf/1609.03499
- Abstract: The authors say that they can employ WaveNet as a discriminative model. So is WaveNet (a) a discriminative model, or (b) a generative model by default that can also be modified into a discriminative one?
- 3.4: for speech recognition, WaveNet can serve as a discriminative model. This is interesting.
- But fundamentally, what are generative models, and what are discriminative models? Roughly: a generative model learns the data distribution p(x) (or the joint p(x, y)), while a discriminative model learns p(y|x), i.e., decision boundaries. I need to double-check this.
- Conclusion: WaveNet directly processes waveforms. Seems like this was not common before WaveNet.
- Conclusion: causal filters; related to 'causality'? Here "causal" means the output at time t depends only on inputs up to t, never on the future.
- Fig 2: dilated causal convolutions. Interesting to know that convolutions (or CNNs) are used in TTS and audio processing. (see the sketch at the end of this entry)
- causal convolution: the attempt to hide future inputs reminds me of the masking in the transformer decoder.
- Fig 3: why does the dilation change, increasing from input to output? Doubling the dilation at each layer makes the receptive field grow exponentially with depth.
- Fig 4: residual blocks (help deeper neural nets converge quickly), skip connections (mitigate vanishing gradients)
Thoughts: Audio generation seemingly involves a lot of probabilistic modeling. Interesting that CNNs can be used for audio modeling.
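A numpy sketch of a dilated causal convolution, to see the causality and why doubling the dilation grows the receptive field (kernel values and sizes are my own toy choices):

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Causal dilated 1-D convolution: the output at time t only sees
    inputs at t, t - dilation, t - 2*dilation, ... (never the future)."""
    k = len(w)
    pad = (k - 1) * dilation                   # left-pad so output stays causal
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
# stacking layers with dilations 1, 2, 4 grows the receptive field exponentially
h = causal_dilated_conv1d(x, np.array([0.5, 0.5]), dilation=1)
h = causal_dilated_conv1d(h, np.array([0.5, 0.5]), dilation=2)
h = causal_dilated_conv1d(h, np.array([0.5, 0.5]), dilation=4)
print(h)  # each output now depends on up to 8 past samples
```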
[8] Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI
https://arxiv.org/abs/2402.00809
- Here are my scribbles, to be organized:
- Abs: continual learning, active learning, uncertainty quantification: BDL helps
- Fig 1: LLMs can report confidence (but LLMs can't handle numbers well, so it might not be a fair comparison against Bayesian uncertainty quantification)
- Fig 2: BDL methods: MAP, Laplace, variational inference, and MCMC (see the Laplace sketch at the end of this entry)
- hyper-priors reduce hyperparameter tuning?
- BDL is good for handling adversarial attacks and hallucination
- Bayesian experimental design, optimization, and model selection: what are they?
- the probabilistic nature of the Bayesian paradigm has regularization effects
- with small sample sizes, BDL works
- for foundation-model fine-tuning with small data, BDL works
- for active learning in RLHF, BDL helps
- BDL is computationally intensive
- GPs, Gaussian processes, remain popular
- SWAG considers (estimates) curvature
- SG-MCMC (MCMC in BDL) is slower to converge, with overhead from extra steps
- generally, Monte Carlo is slow, and hardware acceleration for it would be nice
- getting a high-quality posterior is also important
- applying BDL to LLMs is unexplored
- LoRA (low-rank adaptation) actually comes with a Bayesian low-rank counterpart
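To ground the "Laplace" method from Fig 2, a toy 1-D Laplace approximation (my own example, not from the paper): fit a Gaussian to the posterior, centred at the MAP estimate, with variance given by the inverse curvature (Hessian) there.

```python
import numpy as np

def neg_log_post(w, x, y, prior_var=1.0):
    """1-D linear model y ~ N(w*x, 1) with a Gaussian prior on w."""
    return 0.5 * np.sum((y - w * x) ** 2) + 0.5 * w**2 / prior_var

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

# MAP by dense grid search (fine for a single parameter)
grid = np.linspace(-2, 4, 10001)
w_map = grid[np.argmin([neg_log_post(w, x, y) for w in grid])]

# Hessian of the negative log posterior: sum(x^2) + 1/prior_var
hessian = np.sum(x**2) + 1.0
print(f"posterior approx N({w_map:.3f}, {1/hessian:.4f})")
```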
[9] Bayesian Optimization in AlphaGo
https://arxiv.org/pdf/1812.06855
- AlphaGo, an RL system, has many hyperparameters to tune, and RL is known to behave differently depending on its hyperparameters. The AlphaGo team utilized Bayesian optimization for that hyperparameter tuning, which is automatic. This contributed to AlphaGo's winning capabilities.
- MCTS:?
- Fig 1: Gaussian processes (GPs) illustration. EI, the expected improvement acquisition function, is something new to me; something behind the scenes of GPs (and Bayesian optimization).
- It seems BO computes the maximum of the current iteration's EI, and that maximizer becomes the next query point (next sample point)
- Before the process converges (or stops), the EI curve is broad with non-zero values. At the end of the process, the EI curve is flat with small values, with not much difference among them, indicating less uncertainty at the end.
- Cool sentence: "EI trades off exploration and exploitation"
- I take it as a positive position on EI. For exploration, it can pick the next query point at highly uncertain, unexplored locations. For exploitation, in the vicinity of already-sampled points (the closer to those points, the less uncertainty), it succeeds in using the information we already have at each step.
- Fig 2: did not understand what the authors want to say with this fig.
- From the intro, MCTS stands for Monte Carlo Tree Search, a step after neural-net training.
- The other hyperparameters, the distributed system and the mixing ratio, are not really explained in the paper. Maybe they don't matter much? Just example parameters to illustrate the posteriors?
- Fig 3: comparison between the observed winning rate and the expected winning rate
- The expected winning rate has variance, but overall the two values seem correlated with each other
- Fig 4: mixing ratio, not sure
- Fig 5: time control, with respect to the time budget? not sure
- Intro: the UCT exploration formula
- UCT: UCB applied to trees, per this article
- UCB: upper confidence bounds (I've seen this in Bayesian optimization)
- for now I won't search too much about this.
- "Multi-armed bandit problem": a related concept, will check later
- policy and value networks, two important components of AlphaGo
- they also tried grid search for each hyperparameter, and it was too expensive
- Methods: the posterior is used to calculate the next query points (is this a Bayesian-specific thing, or a Bayesian-optimization thing?)
- The EI function formula, I want to understand (see the sketch at the end of this entry)
- Elo gain: there's the Elo rating system (for zero-sum games)
- "byoyomi" appeared: a 60-second constraint for the tree search + other calculations
[10] Evolutionary Optimization of Model Merging Recipes
https://arxiv.org/abs/2403.13187
- Abstract: parameter space and data flow space; the latter I'm not familiar with.
- Limitations: instruction fine-tuning or alignment, possible future work
- Fig 1: model merging: combining models that each answer partially correctly, resulting in a model that can answer both questions. (see the sketch at the end of this entry)
- I have looked at MGSM-JA and the people's names are all Western. What if a person's name changes, does that change the performance? For example, James to Takashi.
https://huggingface.co/datasets/juletxara/mgsm/viewer/ja/test?row=4
- Fig 2: MGSM-JA performance. Model merging generally yields better performance. PS outperforms DFS, but PS vs (PS + DFS) seem equally good. From the accuracy in Table 1, it seems PS + DFS slightly outperforms PS alone.
- Fig 3: left, density vs weight, what are they? Right: not sure what it is; layer-index charts. First I need to understand DFS, then come back to this fig.
- Intro: DFS as inference paths; I still don't get it.
- 2.1: Schmidhuber's flat minima is mentioned
- 2.2: the related-work descriptions are concise and I don't have the domain knowledge, so they're hard to understand. For now let's skip them.
- 2.3: the authors won't change the transformer blocks (they were already trained with a lot of compute) but tweak other parts. Maybe they are mixing those blocks but don't want to change them.
- they mention weight-agnostic neural nets (David Ha's work) as related work. I want to read this in the future. (Having the inductive biases of a specific neural net is enough, and we don't need to train the params? Curious.) I have read the abstract and it is highly interesting!
- they also mention David Ha's hypernetworks (a network that determines the main network's params). Added it to my reading list.
- 3 Method
- "merging the process to two distinct and orthogonal config spaces, analysing their individual impacts": what does this sentence mean?
- 3.2 merging in the data flow space
- the term "budget": is it an evolution-strategy term?
- changing the order or usage (used or not) of the transformer blocks while keeping their weights intact
- as expected, a distribution shift happens at each block, causing performance degradation, but it is mitigated by scaling the inputs.
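A toy sketch of what parameter-space (PS) merging might look like at its simplest: interpolate weights element-wise, with an evolution strategy (CMA-ES in the paper, if I recall correctly) searching over the mixing coefficients. This is my own illustration, not the paper's recipe:

```python
import numpy as np

def merge_ps(params_a, params_b, alpha=0.5):
    """Element-wise interpolation of two models' parameter dicts;
    an evolution strategy would tune alpha (per layer, say) for task score."""
    return {k: alpha * params_a[k] + (1 - alpha) * params_b[k] for k in params_a}

model_a = {"layer0": np.array([1.0, 2.0]), "layer1": np.array([3.0])}
model_b = {"layer0": np.array([2.0, 0.0]), "layer1": np.array([1.0])}
print(merge_ps(model_a, model_b, alpha=0.7))
```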
[11] Graph Agnostic Causal Bayesian Optimisation
https://arxiv.org/abs/2411.03028
- My friend Sumantrak's recent work.
abstract:
- causal graph: a graph that captures the causal relationships among variables.
- CBO, causal Bayesian optimization, is a thing?
- Aglietti et al. https://arxiv.org/abs/2005.11741
- conclusion: in settings where the causal structure is unknown, the proposed method yields SOTA. Theoretical stuff is future work.
- what are the hard and soft interventions? maybe a causal-inference-specific thing?
- hard interventions: changing some variables to fixed values
- soft interventions: changing distributions or adding noise
- Fig 1 shows reward. Is the method related to RL in some way?
- maybe causal inference often involves reward calculation?
- maybe not? let's forget it for now
- Fig 2: no idea what's happening here
- Fig 3: no idea, what are the rewards?
intro:
- Bayesian scoring, I don't know about it
background and problem settings
- I forgot the concept of "compact" in math... (roughly: closed and bounded, for subsets of Euclidean space)
- also "power set" (the set of all subsets of a given set)
ignored math details for now.
method:
- Markov property: the next state depends only on the current state, not on the full history
- they mention Bayesian networks in this section. How do Bayesian networks relate to their method?
- generally, I need to learn more about UCB.
- plausible function, what is it?
- CBO, causal Bayesian optimization, is originally from Aglietti et al. https://arxiv.org/abs/2005.11741: Bayesian optimization utilizing causal relationships (but I don't know more than the high-level overview; if I want to know more, better to read the original paper.)
- conclusion: in settings where the causal structure is unknown, the proposed method yields SOTA; theoretical analysis is future work. The authors tested with both hard and soft interventions. I need to check each scenario in the paper. (2.1: hard interventions, 2.2: soft interventions)
- the conclusion also mentions function uncertainty, which is also explained in 3.1; I need to read later why picking actions only when they might improve rewards leads to reduced function uncertainty.
- "cumulative regret", I'm not familiar with it yet
- the conclusion says MCBO previously held the SOTA. I also read the MCBO section.
- hard interventions: changing some variables to fixed values (the corresponding section is somewhat understandable, while that on soft interventions is more difficult to understand)
- soft interventions: changing distributions or adding noise
- 2.1 hard interventions: for now I'll focus on this one and ignore the soft one to save my brain capacity. But even in the part explaining SCMs, structural causal models, the notation is difficult for me, with a lot of variables. (see the toy sketch below)
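To pin down hard vs soft interventions for myself, a toy SCM sketch (a two-variable example I made up, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, x_fn):
    """Toy SCM: X -> Y with Y = 2*X + noise."""
    x = x_fn(n)
    return 2 * x + rng.normal(0, 0.1, n)

y_obs = sample(1000, lambda n: rng.normal(0, 1, n))     # observational: X ~ N(0, 1)
y_hard = sample(1000, lambda n: np.full(n, 1.5))        # hard: do(X = 1.5), clamp X
y_soft = sample(1000, lambda n: rng.normal(1.5, 1, n))  # soft: replace X's distribution

print(y_hard.mean(), y_hard.std())  # mean ~3, tiny spread (X is clamped)
print(y_soft.mean(), y_soft.std())  # mean ~3, but X's randomness remains
```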
[12] Reexamining computer ethics in light of AI systems and AI regulation
https://link.springer.com/article/10.1007/s43681-022-00229-6
Design-oriented vs policy-oriented computer ethics and regulation: a very interesting review of the current computer-ethics landscape surrounding AI, written by Prof. Simon, who gave a talk at the trilateral AI conference the other day.
- Recent AI regulation trends, such as the AI Act from Europe, fall within the policy-oriented computer-ethics mindset, which speculates about risks in advance and tries to mitigate those risks using policies. This may not address future AI risks; hence there will be a policy vacuum when an AI system shows an emergent capability.
- The other approach, design-oriented computer ethics, rather focuses on design decision-making processes and the values behind those designs. The author is rather in favor of this approach (which also came up when I asked the first author a question at the trilateral AI conference in Tokyo, Japan).
[13] Bayesian Optimization of Function Networks
This is related to Graph Agnostic CBO. I'm reading it carefully; I'll add my notes here later.
[14] BATINeT: Background-Aware Text to Image Synthesis and Manipulation Network
Generating foreground based on background images. I'll write my thoughts later if I have time.
[15] Interactive image manipulation with complex text instructions
This improves text-based image manipulation using segmentation masks and a super-resolution network. I'll write my thoughts later if I have time.
change logs
- 2024-08-11: created this website, and added x3 papers
- 2024-08-19: added some notes to the attention paper, with links to the papers which introduced self-attention & the attention mechanism.
- 2024-08-30: Separated subsections to clarify what I haven't understood. Added SAM optimizer paper entry.
- 2024-08-31: Organized Transformer paper a bit, mainly read about the attention mechanism in general.
- 2024-09-16: Skimmed through a bit of the WaveNet paper, searched about generative vs discriminative models
- 2024-10-01: Added AlphaGo and Bayesian deep learning survey paper (diff from Sept)