I probably shouldn't have eaten the entire box of spearmint tictacs in one go. But at least I got the entire screen recording of Dune Part 2 today early in the morning which is awesome. Life is finally good.
Entropic LLMs Github
Here is my Entropic_LLMs github repository where I have put my initial note on entropy based LLMs and Entropix as an EBM. More or less based on my previous Notion posts. Will follow up with more implementations soon.
vkalvakotamath/entropic_llms: Repository on entropic large language models
Statement on the 2024 Nobel Prize in Physics
Given the response of the physics community (mostly the hep-th community) to the 2024 Nobel Prize in Physics that was awarded to Hopfield and Hinton, here is my impartial statement on the fiasco. Bottom line: most of you were wrong.
Witten and Boltzmann
From my Notion on Witten and Boltzmann
It is currently 0916 hrs and I am incredibly tired at this point. I really should have slept for more time last night and right now I really should get sleep because it is impossible to function without it. Hopefully I can just go to the library in the next class and not have to worry about the next classes because I don’t have the patience for it. I have loads of work to do, and I am trying to get back into physics because I really have to get to work on the de Sitter algebras work that I was interested in. But whatever, I want to pen some thoughts on EBMs and something a lot of people are overlooking and on de Sitter I guess, with no intention of fully elaborating on either at the moment.
The first thing is that there is a nice relationship between entropy based sampling and min-$p$ directly, which results in a sort of ``sampling redundancy”. Loosely, start by noticing that $p(x)=\frac{\exp (-\beta E)}{Z}$, and the entropy
$$ S(p)=-\int p(x)\log p(x)\;. $$
Expand into
$$ S(p)=\beta ^{-1}\mathbb{E}_{p}[E(x)]+\mathcal{F}_{\beta }\;, $$
where we just substituted for $p(x)$ and identified the Helmholtz free energy. Then, the sampling algorithms are solved in terms of $\min E$: start by computing $E(\Sigma _{i})$, and then find min-$p$ as usual. However, you could instead just seek to minimize the free energy, since it implies selecting higher probability choices and tells you what the entropy is anyway. Working with $\mathcal{F}{\beta }$ rather than just $E$ is in general better because it incorporates $\beta$ and gives a more subtle relationship between the parameters and the entropy. Idk at this point I am too sleepy to continue but I guess I’ll want to do something with the loss functions to better understand Entropix as an EBM.
The other thing I am interested in is the algebra of observables in de Sitter space, and loosely trying to also find what $T\overline{T}+\Lambda _{2}$ deformations do to the OPEs and in general the operator algebras. Loosely, the problem is as follows: in AdS/CFT, we can take the operator algebra of all single-trace operators $\mathcal{O}{i}$, which is a type von Neumann algebra. In the large $N$ microcanonical ensemble, this algebra can be written as the crossed product with its modular group of automorphisms $G_{\mathcal{A}, 0}$, which would give us a type $\mathrm{II}_{\infty }$ algebra. This comes from the Takesaki theorem for the crossed product of type $\mathrm{III}_{1}$ algebras:
$$ \mathrm{III}_{1}\rtimes G_{\mathcal{A},0}=\mathrm{II}_{\infty }\;. $$
Why is this relevant? By obtaining a type $\mathrm{II}_{\infty }$ algebra and not a type $\mathrm{III}$ algebra, we can actually define entropy, traces and density matrices. One result from this is that the bulk generalized entropy $S_{\mathrm{gen}}(X)$ for the bifurcation surface of the two-sided AdS black hole is defined as the entropy $S_{E}(R)$ of the full right boundary algebra. A result that I had in coincidence with implications on Bousso, Chandrasekaran, etc. is that the entropy of appropriate boundary subspaces correspond to the generalized entropies of minimar surfaces, or more generally the apparent horizon. There’s some subtleties with why this construction works — for instance, the reason we use subspaces and not subalgebras is because including the Hamiltonian $H_{R}$ in the algebra $\mathcal{A}_{R}$ means that operator products like $\mathcal{O}_{i}\cdot H_{R}$ cannot lie in a closed subalgebra, or we would be able to reconstruct arbitrary nonlocal operators given access to only a particular entanglement wedge $\mathcal{E}_{W}(R)$. Now whether such a subspace of finite $G{N}$ operators or observables calculates the generalized entropy in de Sitter space is an open question (first remarked in Chandrasekaran, Penington and Witten’s paper on Large $N$ algebras and generalized entropy). Which is vaguely something I am interested in. But more on this later.
November 1, 2024, 1414 Polish time.
Energy Based Models EBM101 -I
Needless to say, I just came to my uni and I am bored already. However, I was late to the class so it is 0930 hours approximately now, and hopefully it is only three hours more that I have to survive (ugh). But here is a mini-course on EBMs, as introduced by Teh, Osindero and Hinton in 2003 and subsequently elaborated on in many other papers, most prominently Yann LeCun et al’s Tutorial on EBMs. I turned this into a mini-course because I think EBMs are interesting and will be composed of a number of short posts such as this one.
Here is the high-level argument for EBMs. We start with a model $M(p, \mathbf{s}, S, \Psi )$, where $p$ are the probabilities, $\mathbf{s}$ are the sources, $S$ is the entropy and $\Psi$ is the KL-divergence. There were historically two ICA approaches: the (1) information maximization approach and the (2) causative generative approach. EBMs as an ICA approach caught a lot of attention, and later led to the basic idea that the key is to work with canonical ``statistical mechanics” to obtain a reasonable machine learning model, rather than simply brute-force optimizing or modelling. In vanilla ICA, we only have the parameters above, which are good enough in a sense. However, you could see from something like IMA and the mutual information - entropy correspondence that there are some motivations to the more canonical SM style. While there may be subtleties with how they reduce to something like causal generative approach in the presence of noise (these two are equivalent when we consider a non-noisy case), these two approaches paved way for something more powerful: Entropy based models.
At the core of EBMs is the partition function, which acts also in a way as the generating function for the $n$-point correlators or moments. We will talk about this in a different post, but for now we only need to understand the intuition for the partition function. If you follow Vasu Shyam (@vasud3vshyam on Twitter), you must have seen the post identifying the denominator of the $\mathrm{softmax}$ attention function as the partition function. The intuition behind why this works is based on EBMs, and is as follows. The partition function,
$$ Z=\sum \exp \left( \mathcal{X_{i}} \right) $$
appearing in the $\mathrm{softmax}$ function is related to the probabilities as
$$ p(x)=\frac{\exp (-\beta E)}{Z_{\beta }}\;. $$
Here we momentarily restored the inverse temperature $\beta$ for reasons that will be more apparent in later posts. The idea of EBMs is to interpret the energy $E$ as a minimization parameter and get inference from $\min E$. The idea for this minimization problem is very similar to that of the negative log-likelihood optimization problem; there is no exact reason to use this sort of a deterministic approach over IMAs — for that matter, there are reasons to not use EBMs over IMAs or CGAs for no reason. Overall, this reduces to the more familiar sort of optimization algorithms with say cross-entropy loss $J_{CE}$, which is the difference of $E$ and the free energy of the ensemble $F$ which together become your familiar negative log-likelihood function. This free energy (Helmholtz free energy to be more precise) is given by $\log Z_{\beta }$, and $J_{CE}$ takes the form
\[ J_{CE}=\beta E+F_{\beta }\;, \]
where we can expand $F_{\beta }$ into
\[ F_{\beta }=\beta ^{-1}\log \left( \int \exp (-\beta E) \right)\;. \]
What is interesting about the EBM approach to loss functions is that more often than not, a given loss function may not be compatible with $\min E$, because our requirement is also that training gradients the energy for each model prediction in a way that pushes the energy of the most probable generation down, while pushing the energy for every other answer up. In Yann et al’s paper, they explicitly talk about the different kinds of loss functions and the compatibility arguments.
In the next post, we will discuss these loss functions and the gradient optimization with $\min E$.
As a side note, I will be linking all these posts on my blog along with the titles and timestamps, since I am not sure how to make a homepage with a blog-like feel on Notion. In case there are some issues with updating or access, I will update them on a status page I will make soon. Addaam reshii a zaanta!
October 29, 2024, 2136 hrs
Adaptive Entropy Sparsity
From my Notion page on Adaptive entropy sparsity
(This is a very crude first draft. There are no exact details or code to go along with this since I wanted some understanding of the abstraction first. )
I will avoid timestamping this for obvious reasons. But here is the vague idea for now just to have something to send to Wowo. You start by taking the entropy and define something like
\[ \zeta _\alpha (S)=\begin{cases} \zeta {\text{min}}\equiv 0 & \text{High entropy = softmax} \\ \mathcal{D}(S) & \mathcal{S}_{0}\leq S \leq \mathcal{S}_{\text{max}} \\ \zeta _{\text{max}} & \text{Low entropy = sparsemax} \end{cases}\;, \]
where $\mathcal{D}(S)$ is a function of entropy that ranges from $\zeta _{\text{min}}=0$ to $\zeta _{\text{max}}$. This is just being bound between the usual $\mathrm{softmax}$ and $\mathrm{sparsemax}$ functions, so basically we are saying that if we have high entropy, we want lower sparsity and if we have lower entropy, we want higher sparsity. This is essentially to employ a version of magnitude-clipping or pruning but the more stronger claim I want to make is that doing this entropy-dynamic sparsity increases the $\text{effective correlation}$ of the model, by reducing the correlations between irrelevant and relevant inputs. In most base models this is not exactly an issue, but arguably in the efficiency vs performance trade-off, maximizing effective correlation is possible only if we don’t restrict ourselves to dense attention weights, meaning that $\mathrm{softmax}$ is not very efficient in this sense.
While I have not come up with a very formal definition of $\mathcal{D}(S)$ and how exactly this model gets implemented, here is a reason why this whole entropy-based sparsity does not have to be that adaptive, and here is where $\mathrm{varentropy}$ comes in, deciding whether $\zeta (S)$ has to be a continuously adaptive parameter. In calculating the metric attentions, we find $\mathrm{entropy}$ as well as $\mathrm{varentropy}$. This decides whether the sparsity factor is dynamic or fixed; low $\mathrm{varentropy}$ bounds are exactly $\mathrm{softmax}$ and $\mathrm{sparsemax}$. Higher varentropy implies that the bounds to $\mathcal{D}(S)$ changes, although still bounded by the $\mathrm{softmax}$ and $\mathrm{sparsemax}$ functions. I am still working on how exactly to see this relation and give a mathematical formulation for it.
The next thing I will mention here is the order of complexity for the Transformer model with adaptive entropy sparsity. The usual vanilla Transformer model has an order of complexity that is quadratic in the context length, $O(n^{2})$, whereas for a sparsity model, we can obtain a much better complexity order of $O(n\sqrt{n})$. I am not exactly sure if the fixed-ness of the model affects this complexity better or worse in terms of efficiency but on an all, the complexity (at least theoretically) is not very fixed in terms of sparsity and therefore depends on $\zeta (S)$:
$$ O(\mathrm{complexity})\propto \frac{1}{\zeta (S)}\;. $$
When sparsity is higher, the complexity typically decreases since it has to tend to fewer tokens in the token pool $\mathcal{T}_{i}$. This depends on the overall varentropy metric calculations, and so we arrive at the following theoretical complexity for the AES:
$$ O(\mathrm{complexity})\propto \frac{1}{\mathrm{varentropy}}\;. $$
It would be nice to actually run a model with AES and see how it changes the complexity and the validations. For now, we still are left with the question of what $\mathcal{D}(S)$ exactly is. Remember though that this function is a continuous function taking values between the $\mathrm{softmax}$ with $\zeta (S)=0$ and $\mathrm{sparsemax}$ when $\zeta (S)=\zeta _{\mathrm{max}}$. This definition bounds the entropy sparsity parameter between either extremely dense or extremely sparse attention weights. I used this definition to agree with the $\alpha$ -entmax function, where $\alpha =1$ is $\mathrm{softmax}$ and $\alpha =2$ is the $\mathrm{sparsemax}$ function. This bounding may be changed if appropriate, since the $\alpha$ in our context is a similar learnable parameter of the model that is optimized with stochastic gradient descent. Note that this is purely theoretical; in a real-time scenario it is very possible that this sparsity does not even account for any efficiency curve benefits.
One final thing I will mention is that there is the possibility of a $\text{sampling redundancy}$ in using entropy for the sparsity model. This redundancy arises when we comb through $\mathcal{T}_{i}$ enough to make the token pool reduce TO BE COMPLETED
And now, to the main objective: a mathematical description for this effective correlation. My first thoughts were to describe the partition
$$ Z=\sum \exp \left( \mathbf{x}_{i}^{\textsf{T}} \right)\;, $$
where $\mathbf{x}_{i}^{\textsf{T}}$ is $x_{i}/T$, where $T$ is the temperature. This is just an observation from the $\mathrm{softmax}$ function and isn’t really anything special. Until you identify the $\text{free energy}$ as the negative of $\log Z$:
$$ F=-\log Z\;. $$
Then we write the $n$-point correlators or moments or whatever (sorry I’m too QFT brained rn) are the expansions like
$$ \langle x_{1}\dots x_{n}\rangle =\frac{1}{Z}\frac{\partial^{n} Z(\xi )}{\partial \xi _{n}\dots \partial \xi _{n}}\;. $$
The $1$-point correlator $\langle x_{1}\rangle$ gives the free energy, and looks like
\[ F=-\log \sum _{a} \underbrace{\exp \left( qk^{\textsf{T}}{a}+\xi x^{\textsf{T}}_{a} \right)}_{\text{Partition function}}\;. \]
This is nothing new and just straight-up comes from Vasu’s paper Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters and the Yann LeCun et al paper on Energy-based models. I think there are some interesting correlations between these two approaches wrt entropy sampling and possibly extending it to something like the Helmholtz free energy. When computing these terms in LLMs, it might be helpful to specifically work with these quantities specifically in different settings. As an example, the $\log Z$ term might be useful in making the sampling uncertainty a more precise thing, see for instance Yann LeCun et al’s paper to see what the possible relation is. I will update this with a few more thoughts in a while.
Note of caution
I have to say though, that most of this is purely theoretical at the moment and has not been tested anywhere personally. Not to mention that all of this is VERY crude and a lot of refinements will be necessary to obtain a base working model for the entropy sparsity model I mentioned. Or maybe not, or maybe there is something key that I am missing altogether. There may and mostly will be mistakes, do not assume otherwise. But I will try to find a proper formalism to the partition function thing in the next two hours hopefully. Or maybe possible that I will update the GH \doc draft and make it public.
1018 hrs local time
Update 1: The free energy is not free
So writing $Z$ a little more formally (taking $\beta =1/T$):
$$ Z=\sum \exp \left( -\beta E \right)\;, $$
we could write the free energy in terms of $Z_{\beta }$ and $S$. And then rewrite our entire model in terms of $F$. But the point is that the entropy based sampling in itself suffices for the most part; there is no need to go through all this pain of computing $Z_{\beta }$ or do anything even remotely related to EBMs. The worth of doing this free energy stuff is that we can adopt an EBM approach to sampling and obtain better optimization routines in terms of $\nabla F$ and the uncertainties in the pool can be made dynamic from this. But I have no stringent way to sell this idea yet. Emphasis on the yet.