Entropic LLMs Github

Here is my Entropic_LLMs GitHub repository where I have put my initial note on entropy-based LLMs and on Entropix as an EBM. It is more or less based on my previous Notion posts. Will follow up with more implementations soon.

vkalvakotamath/entropic_llms: Repository on entropic large language models

Statement on the 2024 Nobel Prize in Physics

 Given the response of the physics community (mostly the hep-th community) to the 2024 Nobel Prize in Physics that was awarded to Hopfield and Hinton, here is my impartial statement on the fiasco. Bottom line: most of you were wrong.

Witten and Boltzmann

From my Notion on Witten and Boltzmann

It is currently 0916 hrs and I am incredibly tired at this point. I really should have slept longer last night, and right now I really should get some sleep because it is impossible to function without it. Hopefully I can just go to the library during the next class and not have to worry about the classes after that, because I don't have the patience for them. I have loads of work to do, and I am trying to get back into physics because I really have to get to work on the de Sitter algebras project that I was interested in. But whatever, I want to pen some thoughts on EBMs and something a lot of people are overlooking, and on de Sitter I guess, with no intention of fully elaborating on either at the moment.

The first thing is that there is a nice relationship between entropy-based sampling and min-$p$ directly, which results in a sort of ``sampling redundancy". Loosely, start by noticing that $p(x)=\frac{\exp (-\beta E)}{Z}$, and the entropy

$$ S(p)=-\int \mathrm{d}x\, p(x)\log p(x)\;. $$

Expand into

$$ S(p)=\beta \left( \mathbb{E}_{p}[E(x)]-\mathcal{F}_{\beta } \right)\;, $$

where we just substituted for $p(x)$ and identified the Helmholtz free energy $\mathcal{F}_{\beta }=-\beta ^{-1}\log Z$. Then, the sampling algorithms are solved in terms of $\min E$: start by computing $E(\Sigma _{i})$, and then find min-$p$ as usual. However, you could instead just seek to minimize the free energy, since that implies selecting higher probability choices and tells you what the entropy is anyway. Working with $\mathcal{F}_{\beta }$ rather than just $E$ is in general better because it incorporates $\beta$ and gives a more subtle relationship between the parameters and the entropy. Idk, at this point I am too sleepy to continue, but I guess I'll want to do something with the loss functions to better understand Entropix as an EBM.
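To make the point concrete, here is a minimal numpy sketch (toy energies and an arbitrary min-$p$ threshold, nothing taken from the actual Entropix code) checking that the entropy is recovered from $\beta (\mathbb{E}_{p}[E]-\mathcal{F}_{\beta })$, i.e. from one log-sum-exp plus the mean energy:

```python
# Toy check of S = beta*(<E> - F) for p(x) = exp(-beta*E)/Z, with
# F = -(1/beta)*log Z. The energies and the min-p threshold are made up.
import numpy as np
from scipy.special import logsumexp

beta = 1.0
E = np.array([1.2, 0.3, 2.5, 0.9, 4.0])   # toy token "energies"

logZ = logsumexp(-beta * E)                # log partition function
p = np.exp(-beta * E - logZ)               # Boltzmann / softmax probabilities
F = -logZ / beta                           # Helmholtz free energy

S_direct = -np.sum(p * np.log(p))          # entropy computed directly
S_from_F = beta * (np.sum(p * E) - F)      # entropy via beta*(<E> - F)
assert np.isclose(S_direct, S_from_F)

# min-p style truncation: keep tokens whose probability is at least
# min_p times the top probability (the 0.1 here is arbitrary)
kept = p >= 0.1 * p.max()
print(S_direct, kept)
```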

The other thing I am interested in is the algebra of observables in de Sitter space, and loosely also trying to find what $T\overline{T}+\Lambda _{2}$ deformations do to the OPEs and, in general, the operator algebras. Loosely, the problem is as follows: in AdS/CFT, we can take the operator algebra of all single-trace operators $\mathcal{O}_{i}$, which is a type $\mathrm{III}_{1}$ von Neumann algebra. In the large $N$ microcanonical ensemble, this algebra can be written as the crossed product with its modular group of automorphisms $G_{\mathcal{A}, 0}$, which gives us a type $\mathrm{II}_{\infty }$ algebra. This comes from the Takesaki theorem for the crossed product of type $\mathrm{III}_{1}$ algebras:

$$ \mathrm{III}_{1}\rtimes G_{\mathcal{A},0}=\mathrm{II}_{\infty }\;. $$

Why is this relevant? By obtaining a type $\mathrm{II}_{\infty }$ algebra and not a type $\mathrm{III}$ algebra, we can actually define entropy, traces and density matrices. One result from this is that the bulk generalized entropy $S_{\mathrm{gen}}(X)$ for the bifurcation surface of the two-sided AdS black hole is defined as the entropy $S_{E}(R)$ of the full right boundary algebra. A result that I found, coinciding with implications of Bousso, Chandrasekaran, etc., is that the entropies of appropriate boundary subspaces correspond to the generalized entropies of minimar surfaces, or more generally the apparent horizon. There are some subtleties as to why this construction works -- for instance, the reason we use subspaces and not subalgebras is that including the Hamiltonian $H_{R}$ in the algebra $\mathcal{A}_{R}$ means that operator products like $\mathcal{O}_{i}\cdot H_{R}$ cannot lie in a closed subalgebra, or we would be able to reconstruct arbitrary nonlocal operators given access to only a particular entanglement wedge $\mathcal{E}_{W}(R)$. Now, whether such a subspace of finite-$G_{N}$ operators or observables calculates the generalized entropy in de Sitter space is an open question (first remarked upon in Chandrasekaran, Penington and Witten's paper on large $N$ algebras and generalized entropy). Which is vaguely something I am interested in. But more on this later.

November 1, 2024, 1414 Polish time.

Energy Based Models EBM101 -I

 From my Notion on EBM101-I

Needless to say, I just came to my uni and I am bored already. However, I was late to the class, so it is approximately 0930 hours now, and hopefully it is only three more hours that I have to survive (ugh). But here is a mini-course on EBMs, as introduced by Teh, Osindero and Hinton in 2003 and subsequently elaborated on in many other papers, most prominently Yann LeCun et al's tutorial on EBMs. I turned this into a mini-course because I think EBMs are interesting, and it will be composed of a number of short posts such as this one.

Here is the high-level argument for EBMs. We start with a model $M(p, \mathbf{s}, S, \Psi )$, where $p$ are the probabilities, $\mathbf{s}$ are the sources, $S$ is the entropy and $\Psi$ is the KL-divergence. There were historically two ICA approaches: (1) the information maximization approach and (2) the causal generative approach. EBMs as an ICA approach caught a lot of attention, and later led to the basic idea that the key is to work with canonical ``statistical mechanics" to obtain a reasonable machine learning model, rather than simply brute-force optimizing or modelling. In vanilla ICA, we only have the parameters above, which are good enough in a sense. However, you could see from something like IMA and the mutual information-entropy correspondence that there are some motivations for the more canonical SM style. While there may be subtleties with how they reduce to something like the causal generative approach in the presence of noise (the two are equivalent when we consider the noiseless case), these two approaches paved the way for something more powerful: energy-based models.

At the core of EBMs is the partition function, which acts also in a way as the generating function for the $n$-point correlators or moments. We will talk about this in a different post, but for now we only need to understand the intuition for the partition function. If you follow Vasu Shyam (@vasud3vshyam on Twitter), you must have seen the post identifying the denominator of the $\mathrm{softmax}$ attention function as the partition function. The intuition behind why this works is based on EBMs, and is as follows. The partition function,

$$ Z=\sum _{i}\exp \left( \mathcal{X}_{i} \right) $$

appearing in the $\mathrm{softmax}$ function is related to the probabilities as

$$ p(x)=\frac{\exp (-\beta E)}{Z_{\beta }}\;. $$

Here we momentarily restored the inverse temperature $\beta$ for reasons that will be more apparent in later posts. The idea of EBMs is to interpret the energy $E$ as a minimization parameter and get inference from $\min E$. The idea for this minimization problem is very similar to that of the negative log-likelihood optimization problem; there is no exact reason to use this sort of deterministic approach over IMAs -- for that matter, there is no overriding reason to prefer EBMs over IMAs or CGAs either. Overall, this reduces to the more familiar sort of optimization algorithms with, say, a cross-entropy loss $J_{CE}$, which is the difference of $E$ and the free energy of the ensemble $F$, and which becomes your familiar negative log-likelihood function. This free energy (Helmholtz free energy, to be more precise) is given by $-\beta ^{-1}\log Z_{\beta }$, and $J_{CE}$ takes the form

\[ J_{CE}=\beta \left( E-F_{\beta } \right)\;, \]

where we can expand $F_{\beta }$ into

\[ F_{\beta }=-\beta ^{-1}\log \left( \int \exp (-\beta E) \right)\;. \]
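As a sanity check (toy numbers, not any actual model), the negative log-likelihood of a target token under the softmax distribution is exactly this $\beta (E-F_{\beta })$:

```python
# Check that -log p(target) = beta*(E_target - F_beta) for the Boltzmann form.
import numpy as np
from scipy.special import logsumexp

beta = 0.7
E = np.array([0.5, 1.7, 0.2, 3.1])    # toy energies for four "tokens"
target = 2                             # index of the "correct" token

logZ = logsumexp(-beta * E)
F_beta = -logZ / beta                  # F_beta = -(1/beta) log Z_beta

nll = beta * E[target] + logZ          # -log p(target)
J_CE = beta * (E[target] - F_beta)     # the same thing via the free energy
assert np.isclose(nll, J_CE)
print(nll, J_CE)
```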

What is interesting about the EBM approach to loss functions is that, more often than not, a given loss function may not be compatible with $\min E$, because our requirement is also that training shapes the energy of each model prediction in a way that pushes the energy of the most probable (correct) generation down, while pushing the energy of every other answer up. In LeCun et al's paper, they explicitly talk about the different kinds of loss functions and these compatibility arguments.

From Yann LeCun et al 2006.

In the next post, we will discuss these loss functions and the gradient optimization with $\min E$.

As a side note, I will be linking all these posts on my blog along with the titles and timestamps, since I am not sure how to make a homepage with a blog-like feel on Notion. In case there are some issues with updating or access, I will update them on a status page I will make soon. Addaam reshii a zaanta!

October 29, 2024, 2136 hrs

Adaptive Entropy Sparsity

 From my Notion page on Adaptive entropy sparsity

(This is a very crude first draft. There are no exact details or code to go along with this since I wanted some understanding of the abstraction first.)

I will avoid timestamping this for obvious reasons. But here is the vague idea for now, just to have something to send to Wowo. You start by taking the entropy and defining something like

\[ \zeta _{\alpha }(S)=\begin{cases} \zeta _{\text{min}}\equiv 0 & S>\mathcal{S}_{\text{max}} \quad \text{(high entropy: softmax)} \\ \mathcal{D}(S) & \mathcal{S}_{0}\leq S \leq \mathcal{S}_{\text{max}} \\ \zeta _{\text{max}} & S<\mathcal{S}_{0} \quad \text{(low entropy: sparsemax)} \end{cases}\;, \]

where $\mathcal{D}(S)$ is a function of the entropy that ranges from $\zeta _{\text{min}}=0$ to $\zeta _{\text{max}}$. This is just bounded between the usual $\mathrm{softmax}$ and $\mathrm{sparsemax}$ functions, so basically we are saying that if we have high entropy, we want lower sparsity, and if we have lower entropy, we want higher sparsity. This is essentially a version of magnitude clipping or pruning, but the stronger claim I want to make is that doing this entropy-dynamic sparsity increases the $\text{effective correlation}$ of the model, by reducing the correlations between irrelevant and relevant inputs. In most base models this is not exactly an issue, but arguably, in the efficiency vs performance trade-off, maximizing effective correlation is possible only if we don't restrict ourselves to dense attention weights, meaning that $\mathrm{softmax}$ is not very efficient in this sense.

While I have not come up with a very formal definition of $\mathcal{D}(S)$ or of how exactly this model gets implemented, here is a reason why this whole entropy-based sparsity does not have to be that adaptive, and here is where $\mathrm{varentropy}$ comes in, deciding whether $\zeta (S)$ has to be a continuously adaptive parameter. In calculating the metric attentions, we find the $\mathrm{entropy}$ as well as the $\mathrm{varentropy}$. This decides whether the sparsity factor is dynamic or fixed; the low-$\mathrm{varentropy}$ bounds are exactly $\mathrm{softmax}$ and $\mathrm{sparsemax}$. Higher varentropy implies that the bounds on $\mathcal{D}(S)$ change, although they are still bounded by the $\mathrm{softmax}$ and $\mathrm{sparsemax}$ functions. I am still working on how exactly to see this relation and give a mathematical formulation for it.
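Since I keep saying ``adaptive" without writing anything down, here is a very rough numpy sketch of the idea. Everything in it is an assumption on my part: $\mathcal{D}(S)$ is taken to be a linear ramp in the entropy, and the output simply blends $\mathrm{softmax}$ and $\mathrm{sparsemax}$ with weight $\zeta (S)/\zeta _{\text{max}}$ rather than using the actual $\alpha$-entmax family; varentropy is computed but not yet used to switch between fixed and dynamic $\zeta$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sparsemax(z):
    # Martins & Astudillo (2016): Euclidean projection of z onto the simplex.
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cum
    k_z = k[support][-1]
    tau = (cum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def entropy_and_varentropy(p):
    info = -np.where(p > 0, np.log(p), 0.0)   # convention: 0*log 0 = 0
    S = np.sum(p * info)
    V = np.sum(p * (info - S) ** 2)           # varentropy = Var[-log p]
    return S, V

def adaptive_attention(z, S0=0.5, Smax=2.0, zeta_max=1.0):
    p_soft = softmax(z)
    S, V = entropy_and_varentropy(p_soft)
    # D(S): linear ramp from zeta_max (low entropy) down to 0 (high entropy)
    zeta = zeta_max * np.clip((Smax - S) / (Smax - S0), 0.0, 1.0)
    w = zeta / zeta_max                       # 0 -> pure softmax, 1 -> pure sparsemax
    return (1 - w) * p_soft + w * sparsemax(z), S, V, zeta

print(adaptive_attention(np.array([2.0, 1.1, 0.2, -1.5])))
```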

The next thing I will mention here is the order of complexity for a Transformer model with adaptive entropy sparsity. The usual vanilla Transformer has a complexity that is quadratic in the context length, $O(n^{2})$, whereas for a sparsity model we can obtain a much better complexity of $O(n\sqrt{n})$. I am not exactly sure if the fixed-ness of the model affects this complexity for better or worse in terms of efficiency, but all in all, the complexity (at least theoretically) is not fixed in terms of sparsity and therefore depends on $\zeta (S)$:

$$ O(\mathrm{complexity})\propto \frac{1}{\zeta (S)}\;. $$

When sparsity is higher, the complexity typically decreases, since the model has to attend to fewer tokens in the token pool $\mathcal{T}_{i}$. This depends on the overall varentropy metric calculations, and so we arrive at the following theoretical complexity for AES:

$$ O(\mathrm{complexity})\propto \frac{1}{\mathrm{varentropy}}\;. $$

It would be nice to actually run a model with AES and see how it changes the complexity and the validations. For now, we are still left with the question of what $\mathcal{D}(S)$ exactly is. Remember, though, that this function is a continuous function taking values between $\mathrm{softmax}$ at $\zeta (S)=0$ and $\mathrm{sparsemax}$ at $\zeta (S)=\zeta _{\mathrm{max}}$. This definition bounds the entropy sparsity parameter between either extremely dense or extremely sparse attention weights. I used this definition to agree with the $\alpha$-entmax function, where $\alpha =1$ is $\mathrm{softmax}$ and $\alpha =2$ is the $\mathrm{sparsemax}$ function. This bounding may be changed if appropriate, since the $\alpha$ in our context is a similar learnable parameter of the model that is optimized with stochastic gradient descent. Note that this is purely theoretical; in a real-time scenario it is very possible that this sparsity does not even account for any efficiency curve benefits.

One final thing I will mention is that there is the possibility of a $\text{sampling redundancy}$ in using entropy for the sparsity model. This redundancy arises when we comb through $\mathcal{T}_{i}$ enough to make the token pool reduce TO BE COMPLETED

And now, to the main objective: a mathematical description of this effective correlation. My first thought was to describe the partition function

$$ Z=\sum _{i}\exp \left( \mathbf{x}_{i}^{\textsf{T}} \right)\;, $$

where $\mathbf{x}_{i}^{\textsf{T}}$ is $x_{i}/T$, where $T$ is the temperature. This is just an observation from the $\mathrm{softmax}$ function and isn’t really anything special. Until you identify the $\text{free energy}$ as the negative of $\log Z$:

$$ F=-\log Z\;. $$

Then the $n$-point correlators or moments or whatever (sorry, I'm too QFT-brained rn) are the expansions

$$ \langle x_{1}\dots x_{n}\rangle =\frac{1}{Z}\frac{\partial^{n} Z(\xi )}{\partial \xi _{1}\dots \partial \xi _{n}}\;. $$

The $1$-point correlator $\langle x_{1}\rangle$ is obtained from the free energy, which looks like

\[ F=-\log \underbrace{\sum _{a} \exp \left( qk^{\textsf{T}}_{a}+\xi x^{\textsf{T}}_{a} \right)}_{\text{Partition function}}\;. \]

This is nothing new and comes straight from Vasu's paper Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters and the Yann LeCun et al paper on energy-based models. I think there are some interesting connections between these two approaches w.r.t. entropy sampling, and possibly an extension to something like the Helmholtz free energy. When computing these terms in LLMs, it might be helpful to work with these quantities in different settings. As an example, the $\log Z$ term might be useful in making the sampling uncertainty a more precise thing; see for instance LeCun et al's paper for what the possible relation is. I will update this with a few more thoughts in a while.
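As a quick numerical check (toy scores and values, with finite differences standing in for the derivative), the first derivative of $\log Z(\xi )$ at $\xi =0$ really is the attention-weighted average of $x$, i.e. the $1$-point function:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
scores = rng.normal(size=6)      # toy q.k_a attention scores
x = rng.normal(size=6)           # toy values x_a

def logZ(xi):
    return logsumexp(scores + xi * x)

eps = 1e-5
one_point_fd = (logZ(eps) - logZ(-eps)) / (2 * eps)    # d log Z / d xi at 0

p = np.exp(scores - logsumexp(scores))                 # softmax attention weights
one_point_exact = np.sum(p * x)

assert np.isclose(one_point_fd, one_point_exact)
print(one_point_fd, one_point_exact)
```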

Note of caution

I have to say, though, that most of this is purely theoretical at the moment and has not been personally tested anywhere. Not to mention that all of this is VERY crude, and a lot of refinements will be necessary to obtain a base working model for the entropy sparsity model I mentioned. Or maybe not, or maybe there is something key that I am missing altogether. There may and mostly will be mistakes, do not assume otherwise. But I will try to find a proper formalism for the partition function thing in the next two hours, hopefully. Or maybe I will just update the GH doc draft and make it public.

1018 hrs local time

Update 1: The free energy is not free

So writing $Z$ a little more formally (taking $\beta =1/T$):

$$ Z=\sum \exp \left( -\beta E \right)\;, $$

we could write the free energy in terms of $Z_{\beta }$ and $S$, and then rewrite our entire model in terms of $F$. But the point is that entropy-based sampling in itself suffices for the most part; there is no need to go through all this pain of computing $Z_{\beta }$ or do anything even remotely related to EBMs. The worth of doing this free energy stuff is that we can adopt an EBM approach to sampling and obtain better optimization routines in terms of $\nabla F$, and the uncertainties in the pool can be made dynamic from this. But I have no stringent way to sell this idea yet. Emphasis on the yet.

Physicists and ML

 I asked ChatGPT to make a satire on physicists doing machine learning for the first time. Boy, oh boy, did it give me a good satire.


It was a fine Monday morning in the prestigious physics department. Coffee was brewed, chalk dust floated in the air, and equations crawled like spiders across blackboards. But something was different. Something… off. A tremor of confusion rippled through the corridors as the physicists gathered for their weekly meeting. Today was not about quantum entanglement or gravitational waves; today, they had entered the brave new world of machine learning.

Dr. Thompson, a renowned particle physicist who once made headlines for theorizing a sixth-dimensional particle, sat at his desk glaring at his laptop. "I don’t get it," he muttered. "What’s an epoch? And why does my model keep overfitting?" His finger hovered over a button that ominously read Run Code, as if pressing it might accidentally simulate the heat death of the universe.

Meanwhile, in the corner of the room, Professor Ivanov was vigorously typing Python code on a terminal screen. Ivanov was famous for deriving the quantum field theory of socks, yet now found himself writing something called a “neural net.” Having spent decades mastering the calculus of tensors, he stared at the TensorFlow documentation as if it were a poorly translated IKEA manual.

"This can't be harder than the renormalization group theory," Ivanov whispered to himself, unconvincingly. His model failed to converge for the seventh time. "I'll just add more layers," he declared with an air of misplaced confidence. "That’s how the brain works, right? More neurons, more intelligence!"


Across the hall, Dr. Patel was equally disoriented. After spending the last decade calculating dark matter distributions, she was now facing something even more elusive: hyperparameters. She clicked through the parameters in her notebook.

“Learning rate... momentum... regularization?” She paused, a frown growing on her face. “Sounds like the settings on a blender.”

Undeterred, she set the learning rate to 42, confident that the answer to everything would solve this machine learning conundrum. Alas, the model promptly exploded into a sea of NaNs (Not a Number), a result she hadn’t seen since her days of calculating infinities in string theory.

“Of course, NaNs,” she scoffed, “the true fundamental particles of deep learning.”


Meanwhile, the younger grad students, who were already fluent in machine learning buzzwords, lurked in the background. They had been waiting for this moment. "You have to use dropout and batch normalization, Professor," one of them advised with a grin, clearly relishing their newfound expertise over the once-untouchable professors.

"Dropout?" Thompson repeated. "You mean like how I feel after running this model for three hours with no results?" The room chuckled, but the grad students pressed on, eyes gleaming with the vigor of freshly downloaded Kaggle datasets.

"Also, try stochastic gradient descent," another student added. "It's like regular gradient descent, but, you know... stochastic!"


Then there was Dr. Li, the department's cosmologist, who was taking the most radical approach. "I don’t need machine learning," she proclaimed. "I’ll just derive the optimal weights myself." Hours later, she was seen attempting to analytically solve the backpropagation equations, pencil in hand, sweat pouring down her face. The grad students looked on in horror.

“You could just let the computer optimize the weights, you know…” one suggested timidly.

“Nonsense!” Li snapped. “If I can solve the equations of the universe, I can solve this!”


As the day wore on, more physicists succumbed to the absurdity. Dr. Williams, an experimentalist known for his work on superconductors, was now building neural networks like they were Lego blocks. His lab was covered in graphs. Each failed model led to a new theory.

“I’m onto something big,” he said, pointing at a chart of a loss function that looked suspiciously like a sine wave. “I call it oscillating accuracy. This could be a new phase of learning!” His eyes gleamed with the fever of discovery.

By 5 p.m., everyone had given up on their physics research and fully embraced the chaos of machine learning. Ivanov’s model had ballooned to 400 layers, requiring the computational power of CERN to train. Dr. Thompson was on his fifth coffee, muttering something about the curse of dimensionality and swearing vengeance on whoever invented ReLU activations. Meanwhile, Dr. Li had retreated to a dark corner, insisting that she was "so close" to deriving the weights by hand.

And yet, amid all the chaos, the physicists had unknowingly embraced one of machine learning’s greatest truths: no one really knows what’s going on half the time.

Finally, Dr. Patel stood up, weary but enlightened. "I’ve come to a conclusion," she announced to the room. "Machine learning... it's just statistical mechanics. But with better marketing."

The room erupted in laughter and nods of approval. It was official: physicists doing machine learning for the first time was not so different from the universe they studied—full of mystery, randomness, and moments of unexpected brilliance.

As the physicists packed up their laptops, they felt something they hadn’t felt in years: the excitement of being a beginner again. Maybe this machine learning thing wasn’t so bad after all.

Talk on LLMs and Transformers

 I gave an introductory talk on LLMs and Transformers yesterday. Here are the slides and the link to the YouTube archive of the recording. I am hoping to complete the short notes and the Adaptive Sparsity paper by next week. 

Essence of LLMs and Transformers I slides

Journal Club | The Essence of LLMs and Transformers talk

Generalized Entropy LOL

For all hep-th folks out there: this is a generalized entropy. Now go cry.


Fooking nightmare for physicists.

arXiv PaperRetriever 2

1829. After having consumed my caffeine dosage for the day, I am going to try to complete the second part of the arXiv paper retriever. Basically, what I want this to do is implement a triple-layered recommendation code that assigns a score $X$ to every item in the \Grab folder. Here are the things I want it to do, in order.

First, it starts with a usual TF-IDF vectorizer, all the kiddy stuff, and assigns a score $A$. It then sends these to a Profiler, as I am dubbing it, which basically checks the previous publications and assigns a score $B$ based on the relevance of other publications to the search query. And then finally it goes into the DCA Profiler, which assigns a score $B'$ based on relevance by recency, number of citations and $B$. Then we simply integrate the whole thing into a score $A+B'$ and voila, that is the recommendation score $X$. Now to get to the hard part and code this stuff.

1843. The first one isn't too hard and basically is almost figured out. The hard part is the Profiler and the DCA Profiler, because the whole point of THAT is to be as precise as possible. I guess I'll get to coding the initial Profiler first.

1907. Well, Profiler.py is done and it wasn't complicated enough to need too many weird things. The DCAProfiler is the last touch and I think I can get it done in about 10 minutes, I guess, but I am way too distracted with Twitter. Lol.

1923. Yeah, so I had a problem with assigning the score. I couldn't let it be arbitrary, so I had to change the DCAProfiler score assigner with a slightly different idea. Instead of scoring on an interval of 0 to 100, 0 to 1 is a better idea and is fairly easier than, say, comparing every two papers and returning a score based on that, because frankly, I have no idea how that would work. So basically, let $n$ be the number of citations a paper has received. We then calculate the score $B'$ by also considering $n/(n+10)$, where 10 is a fairly arbitrary number but is essentially there to keep the normalization bounded and well-behaved even when a paper has $n=0$. So there was at least a simplification with that.
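Here is a rough sketch of how the scores get put together. The weights inside dca_score are placeholders I made up, and the TF-IDF and Profiler scores $A$ and $B$ are just passed in as numbers; only the $n/(n+10)$ citation factor and the final $X=A+B'$ combination follow the scheme above.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    citations: int
    year: int

def dca_score(paper: Paper, B: float, current_year: int = 2024) -> float:
    """Score B': Profiler relevance B reweighted by citations and recency."""
    citation_factor = paper.citations / (paper.citations + 10)   # bounded in [0, 1)
    recency_factor = 1.0 / (1 + max(current_year - paper.year, 0))
    return 0.5 * B + 0.25 * citation_factor + 0.25 * recency_factor

def recommendation_score(A: float, B: float, paper: Paper) -> float:
    """Final score X = A + B'."""
    return A + dca_score(paper, B)

# Made-up scores A (TF-IDF) and B (Profiler) for a made-up paper:
paper = Paper("Large N algebras and generalized entropy", citations=42, year=2022)
print(recommendation_score(A=0.61, B=0.80, paper=paper))
```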

1934. For the final integration all that is left to do is to get all the recommendation steps and put the scores together. Now this is a bit of a dodgy thing because so far, the TF-IDF, Profiler and DCAProfiler scripts aren't exactly working in sync and are basically just executing one after the other. Which is kinda dumb and like wtf I suck at coding. But a better idea is to integrate everything sequentially and optimize the search query interface to make things easier for people.

2026. So I got some coding done and basically removed a lot of the excess code, although at this point it is super messy and I have no intention of sitting through each line and figuring out whether that could go over there or over here, or removing comments or whatever. But I am going to try and complete the HTML layout and do some testing.

-- Originally posted on 28/07/2024 0827. --

arXiv PaperRetriever

1715. As usual I am really bored and I figured I could do some more coding for a while. I am kind of at the point where my web app is complicated and I have no idea what to do. Fortunately, one of my friends said "hack!" and that gave me an idea for a thing that I am trying to work on, which is kinda nice, except it is supposed to be a repository where people can get suggestions based on code. Cool.

1732. I made a python file called PaperRetriever, which sounds fun. The basic idea is that it elaborates on the usual GET request with delayed pings and randomized delays to make the requests more "human-like", which is of course BS but works. I basically want it to send a sample of 50 requests with at least 15-20 seconds of randomized delay between each request, and return every query sorted by lastUpdatedDate into a folder in \PATH as a separate text file of the form yymm-nnnnn.txt. This shouldn't be too hard and will probably take around 5 minutes, because even a kid can write this stuff.
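A minimal sketch of what I mean (the query string and folder name are placeholders; the export.arxiv.org endpoint and its parameters are the standard public arXiv API; note it dumps one file per batch of results rather than one file per paper, since splitting the Atom feed per identifier needs a feed parser on top):

```python
import os
import random
import time
import requests

OUT_DIR = "grab"                    # stand-in for the \Grab folder
os.makedirs(OUT_DIR, exist_ok=True)

BASE_URL = "http://export.arxiv.org/api/query"
params = {
    "search_query": "cat:hep-th",   # placeholder query
    "sortBy": "lastUpdatedDate",
    "sortOrder": "descending",
    "max_results": 10,
}

for start in range(0, 50, 10):      # 50 results, fetched in batches of 10
    params["start"] = start
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    path = os.path.join(OUT_DIR, f"batch-{start:05d}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)          # raw Atom feed; to be parsed later
    time.sleep(random.uniform(15, 20))   # randomized "human-like" delay
```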

1743. Well, that was faster than I expected. I made a basic python file that should work for the retrieving part of the arXiv papers, which should be helpful in making my other web app thing where I want code-based recommendations. But the next part is where I want to suggest related papers in a feed-algorithm-ish way. Which I will do at some other point, because right now I want to test this and go eat something because I am famished. But I think I want to take this mini-code thing to the next level.

1807. It would be cool if the code could do the following: given a sample dataset of $N$ papers (and assuming fairly large $N$) with a particular query $\mathbf{Q}$, it could retrieve the papers from a particular arXiv listing (for which a simple GET request should be more than enough), store them in a database (or, well, a folder in this case, because we are making separate .txt files for every paper), and depending on the particular $\mathbf{Q}$, it could do an elementary ML-based scour to identify the best-fit results and return papers as a recommendation feed. The only issue is that you indeed need a large-$N$ database collection to actually go through and return any feed whatsoever. And in this case it would take time of order $N$ to get those $N$ requests, plus an average of 10 seconds of randomized delay each, which sucks. But well, who really gives a crap about present limitations, amirite?

2001. Nevermind; basically I'm going to use sqlite and configure it so that there is a harv-db and a harv.py, and integrate them into the main PaperRetriever.py to keep things clean, and this should allow me to go ahead with the more machine learning side of things.

2242. Well, I am bored because I got out of a meeting and will be back in it in a while, but for now I want to be stimulated. I tested the code and sure enough, it works. But now I want to use the TF-IDF (Term Frequency-Inverse Document Frequency) structure to suggest papers from an ML model. So basically, what I want to do is use a TF-IDF vectorizer to scour the retrieved papers, find the most relevant ones and display them in order. Unfortunately, I do not have a way of making it more sophisticated by using citations and other bibliography information, for which there are services (for instance Semantic Scholar and INSPIRE-HEP), but well, I'm not sure if I want to risk that at present.
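The TF-IDF step itself is a few lines with scikit-learn (the documents below are placeholders for the retrieved abstracts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "Crossed product algebras and generalized entropy in de Sitter space",
    "Energy based models for sparse overcomplete representations",
    "Tree attention for long context decoding on GPU clusters",
]
query = "entropy and operator algebras in de Sitter"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(papers)       # (n_papers, n_terms)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
for idx in scores.argsort()[::-1]:                  # most relevant first
    print(f"{scores[idx]:.3f}  {papers[idx]}")
```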

0955. Well, I slept in today because it is cold and idk, I just wanted to sleep for a while more. Back to business. At this point the only functional parts of the code are the metadata fetcher and the TF-IDF algorithm. I want to take this a step further and make it scour for the retrieved identifiers over INSPIRE, Semantic Scholar and Google Scholar (which should suffice, I guess) and return the citations as well, and then suggest the feed based on this information. I guess it would be too much to ask to make it perfectly sophisticated and fully feed-ise it, but let's see.

Well, here is the problem. Google Scholar is pretty good, but there is no publicly available API to use in this case, and not to mention CAPTCHAs are a thing. Circumventing this would be trouble because Google kinda owns the entire "are you a bot or a human so we can know you're a bot or a human" Turing test thing. I guess the best thing is to assume a particular easy-to-break CAPTCHA test, crack it with image recognition and hope for the best.

United States v Aaron Swartz

1005: I would like to reiterate my stance on the highly illegitimate and overshot lack of proper judgement that was the federal prosecution of Aaron Swartz. So basically, the verdict is that if I were at MIT and I made a script that downloads files from JSTOR or Springer and tried to make them open access, I would be liable to 33 years in prison? There are two really amazing things about this: first, not even near-homicide charges reach this level, and if they do, I leave it up to you to figure out which was an actual crime. Second, journals that make money off people's research are somehow being considered mainstream in this case? Because they do NOT own the research. Some of you douchebags may go "but isn't that exactly what journals are supposed to do?" -- to an extent, yes, but if I publish in JHEP or something, that does NOT mean that Springer OWNS my paper. Because my research is fundamentally owned by me, and I am essentially paying journals open access money to promote it further. I am presuming arXiv was a thing, but to access papers that predate arXiv he had to do the downloads, which is reasonable. I commend JSTOR on not making things worse, but my central issue is with the prosecutors.

The idea that Aaron's script and "actions" constitute computer fraud is an incredibly funny framing that I realised these people like to use. And as far as I can remember, Mark Zuckerberg's and Sundar Pichai's congressional hearings, among many others, have firmly shown that the people in charge HAVE NO IDEA WHAT THEY ARE TALKING ABOUT. And the fact that these are the people who get to hand down conviction sentences is definitely corruption at the highest level. And if they can get away with this, they can get away with anything, and this is not acceptable. At some point we have to address this case.

To be clear about who I am blaming: it is Carmen Ortiz's office that made this incredibly overshot judgement with no understanding of why the prosecution was necessary. And I stand in opposition to people saying that "that is the law and they are merely exercising their rights to legally pursue a case against Swartz". I'm sorry, but this is bullshit and there was no case needed here. Worst case, if JSTOR wanted to be an assbag about it, they could have settled more easily with just all of their "intellectual property" being confiscated and Aaron's JSTOR account being revoked, which is also an entirely illegitimate thing. But that would have been acceptable. What is NOT acceptable is prosecuting with 13 counts of charges and 33 years of imprisonment for something that not even MIT or JSTOR had a big problem with. And I am not siding with any particular political party here, nor going so far as to say that "obviously she was appointed by the Democrat side", which is something that one of my friends said.

Interestingly, Ortiz's husband Tom Dolan had the nerve to comment, saying [quote] "Truly incredible that in their own son's obit, they blame others for his death and make no mention of the 6-month offer", and such is the mindset of people who get to exercise this false power of "prosecution", as they term it. So I ask if he is willing to do the exact same 6 months that seem so light to him, to show that indeed the blame was on Aaron. If he can do the 6 months, have his property confiscated, and is alright with everything, I (and many others) will consider the 6-month offer as being a "good" offer that Aaron did not take. In a later hearing on Ortiz's prosecution, Eric Holder defended her, calling the case [quote] "a good use of prosecutorial discretion". Had the federal intervention not happened, Aaron would still be alive today with at most one legal case pursued against him. And if Ortiz's decision making was indeed "a good use of prosecutorial discretion", do tell me why Aaron Swartz is not alive today doing things that none of those cunts would have been able to even think of.

And later, the White House declined to act on a petition requesting to remove Ortiz from office, which is such a bad decision that my belief that the system retained at least some residue of ethics and morality is broken.

For those of you interested to look at Aaron's repositories, here is his GitHub profile: https://github.com/aaronsw

Algorithms Everywhere. Part II

2330: OK, so let's get down to business as to why encryption algorithms are strangely complicated$^{1}$. Let us suppose we are feeling very feisty and we want to use the DES encryption algorithm, single-layered (it probably doesn't matter much, but extra layers add to complexity, which is a good thing), and we are sticking with the CBC encryption mode. I'm lazy enough to not actually do anything with the Ruby side of things, and basically I want you to assume that this example makes sense somehow. Let us suppose that we have some initialization vector $X$, which for all purposes is random (for our example, assume that the word random is toned down in definition for reasons we'll see soon). OK, good. Assume that the plaintext is

plaintext: GTASA sucks

We want to encrypt this using DES-CBC. Suppose it turns into some ciphertext, which here is an entirely arbitrary string of characters with padding. We split the entire padded string of bits into two blocks, $P=B_{1}+B_{2}$, and XOR $X$ into the first block of the chain (call it $B_{1}\oplus X$). Recall that DES works on 64-bit blocks, padding the plaintext out to whole blocks before producing the ciphertext. So suppose the ciphertext is created from these variables with some 64-bit $X=$

\x01\x01\x01\x01\x01\x01\x01\x01

and suppose that the final ciphertext takes the form of

ciphertext: \x13\x63\x17\x01\x98\x14\x63\x53\x19\x06\x23\x07\x71\x44\x10\x48

Now what we want to do is use an oracle to figure out the padding and separate the core ciphertext from the padding $\mathcal{P}$. While it seems like, for some randomly chosen ciphertext, this would be either difficult or way too time-consuming, in reality this is a polynomial-time procedure that can be executed in finitely many steps. What we do is take the ciphertext $C$ (which we have no idea about) and run it through a decryption oracle, which tells us whether the chosen ciphertext is ``meaningful", so to speak and visualise -- i.e. whether it decrypts to something with valid padding. Changing any of the padding characters breaks the ciphertext into trash that doesn't make sense, and the oracle would tell us false instead of the true it returns for $C$.

What a padding oracle attack does is XOR every byte of the string with a second, arbitrary ciphertext block (which is meaningless and only an algorithmic artefact we use for this decryption) and, by watching the oracle's answers, figure out what every character of the string is, one by one. In this way (with subtleties, but again, at least theoretically this suffices to allow more technical executions to fully use this process), one can fully recover the ciphertext and subsequently the plaintext as well.

This example is a single-layered DES algorithm, and if we add more layers (more block-chain passes and encryption layers) we might possibly end up with something that is far more secure. Triple-DES, for instance, is considerably more secure than single DES. Of course, there are ways to fix this, so there's that. I'll type the $P=NP$ post tomorrow morning.



Algorithms Everywhere. Probably Part I

0827: Issues with database management: this was something I had not fully considered (although I had known of this issue because of the Zucc's CS50 Harvard lecture). The problem goes like this: you have say $N$ users, and on average to each of these $N$ users we can attribute $\sim N$ corresponding connections (could be social connections, posts, blah blah). So the complexity of the data would scale like $O(N^{2})$. However, we usually consider more complicated cases where we are not only considering posts and social connections (i.e. finding mutuals), which boils down to a graph shortest-distance problem, which is Algorithms 101. The more complicated stuff usually involves the management of the heavier data (messages, pictures, videos, etc.), for which fortunately there does exist a pretty good work-around by compressing the pictures on the client side (so as to save resources on the server side, because compressing tonnes of pictures and videos on the server at the rate of $N$ times the average lowest data size, say $M$ megabytes per second, would suck the servers dry worse than the 634-in-a-day thing). There's another algorithms problem that arises with this, which is that on average some data is not always going to be retrieved from the database and will instead ``stay on the layer" so-to-speak, and what data is supposed to be available like this is an optimization problem. I could say for instance that I want the feed to not refresh every time a user logs in, and basically keep a temporary log of what posts/stuff has been displayed the first time a user logs in. So if the user has lost data connection, it becomes easier to just display previous posts and say ``oops your data kinda sucks right now" (if you're using AT&T it sucked always) instead of throwing a BSOD and kind of shitting up the atmosphere of the website. But we can't always just keep track of every post a user has ever received, because (a) that is unbounded and a user could have procrastinated for three weeks and spent 23 hours a day scrolling to hit like 50k posts or something, which is too huge, and (b) that's a thing that could fuck up privacy. Or at least that's what I feel.

0836: Now here's the part that I am dreading. In order to implement messaging functionality, it is important to encrypt messages because I don't want John to know what Alice and Bob are talking about. My first thought is to go for a usual RSA cryptography algorithm where we have a private key and a public key and we use prime factorization. But then again, RSA is a computationally expensive asymmetric algorithm and takes more time than something like AES, which is a symmetric algorithm and may be better suited to encrypting large data without compromising on encryption time. So I am considering using something like AES 256-bit encryption, which seems to be a pretty good deal. Speaking of 256-bit encryption, it is funny how some brands market themselves saying they use ``military-grade encryption" and ``unbreakable encryption" all while basically doing AES and nothing more. Don't get me wrong, AES is incredibly efficient and ``unbreakable" in the sense of brute-force time. But again, it is not functionally ``military grade" or some fancy shit; it is a very powerful encryption algorithm and not some hocus-pocus that the FBI made.
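For what it's worth, the symmetric route is only a few lines in Python with the cryptography package's Fernet recipe (which uses AES in CBC mode with an HMAC under the hood and a 128-bit key, so it is a stand-in for the AES-256 setup above rather than the same thing):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()       # keep this server-side, never ship to the client
f = Fernet(key)

token = f.encrypt(b"Alice to Bob: John must not read this")
assert f.decrypt(token) == b"Alice to Bob: John must not read this"
print(token[:32])
```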

0856: To be clear: messages are not the only place where encryption is an issue. For that matter, even if you consider something as straightforward as cookies, you know well enough how complicated this stuff is. The whole basis of cookies and storing session_id's is that the user cannot simply modify them. So you need to implement a good enough encryption scheme, for which CBC is usually the first option. The deal with CBC is that you are vulnerable to padding oracle attacks, which are kind of fascinating because they rely on the very foundation of what CBC encryption modes do. Let me illustrate an example schematically.

You start with an initialization vector $X$, which for all CBC purposes is also the zeroth index in the encryption scheme. So you have something like $C_{0}=X$ (indices start from 0 so that the first block starts with index 1). We assume that for all valid purposes $X$ follows the usual CBC protocol and is truly random. OK. Then, we take each plaintext block $P_{i}$ and the previous ciphertext block $C_{i-1}$ (starting with the IV), and encrypt the XOR'd combination:

\[ C_{i}=\mathrm{En}\left( P_{i}\oplus C_{i-1} \right)\;, \]

and we generate the subsequent block-chaining scheme. Decrypting is a similar process. However, while this may seem quite secure (which it kind of is), the Microsoft ASP.NET vulnerability clearly showed that this ``security" could be worked around by using a padding oracle attack, where the attacker makes a series of requests to the oracle to understand the padding sequence in the block chain. I will leave the technical details to a link to a very good post by Ron Bowes that gives a concrete example of such an attack. But the essential idea is that breaking the padding ends up generating an error code, and you can tweak the ciphertext to obtain information about the padding and hence the ciphertext, and as a result the plaintext. So, in such cases, the attacker can decrypt the session_id. This is an obvious example of subtle vulnerabilities that might go unnoticed, but it is necessary to have these in mind before making any security decision. And so on and so forth.
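To make the chaining formula explicit, here is a toy sketch of $C_{i}=\mathrm{En}\left( P_{i}\oplus C_{i-1} \right)$ with PKCS#7-style padding. The ``block cipher" here is just a byte-wise XOR with a fixed key, which is of course not DES or AES and has zero security; it only exists to show where the chaining and the padding bytes (the thing the oracle leaks) live.

```python
import os

BLOCK = 8                                        # 64-bit blocks, as in DES

def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    return bytes(b ^ k for b, k in zip(block, key))      # stand-in for En(.)

def pad(data: bytes) -> bytes:
    n = BLOCK - len(data) % BLOCK
    return data + bytes([n]) * n                 # PKCS#7: pad value = pad length

def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> bytes:
    data = pad(plaintext)
    prev, out = iv, b""
    for i in range(0, len(data), BLOCK):
        xored = bytes(p ^ c for p, c in zip(data[i:i + BLOCK], prev))
        prev = toy_encrypt_block(xored, key)     # C_i = En(P_i XOR C_{i-1})
        out += prev
    return out

iv, key = os.urandom(BLOCK), os.urandom(BLOCK)   # the initialization vector X
print(cbc_encrypt(b"GTASA sucks", key, iv).hex())
```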

0959: An even more appealing overall class of problems in cryptography is that of P vs NP, which is an open problem with a lot of applications in cryptography. I will be posting a continuation to this post in a minute. Not that proving $P=NP$ would necessarily, in practice, break supposedly unbreakable cryptography algorithms like RSA, but it is nonetheless important. And not really relevant to our present problem. But it is kinda cool and it lets me make love to my 'Tism, which is absolutely fun.

Bored.

 Well, in the last couple of days I have been getting bored very easily. Considering that I finished GTA 3 (DE) in 4 days, GTA SA (DE) in 5 days, and many of the movies I wanted to watch, I want to do something stimulating. Right now I am doing a bit of coding which seems very interesting, but I also need to be able to add in some extra functionalities for the project like being able to handle large amounts of video and audio calls through servers. Right now I am thinking of using RTC protocol because it seems very straightforward to set up and I will have it running on both my laptops (the new one is more for coding whereas the old one is for a temp. dummy production server) to test and see if I can add in extra features. Mostly APIs and stuff but it would be nice to see what else I can add.

Well I don't quite feel like blogging at the moment so get the fuck out of here.

Definition of Holography

Well, since it has been too long without a good post, here is my talk for Theoretical Nexus, where I spoke at a very elementary level about the definition of holography. It is loosely based on some things that I am currently working on, both on my own and with Aayush. Weird that no one asked any questions (except Aayush, of course). Nonetheless, here it is.

Nexus Seminar | The Definition of Holography


I am also switching to Reaper from Cakewalk. 

Bulk Physics, Algebras and All That Part Three: Strings Edition -I

Finally, I have managed to scrape together some time to type up some problems related to the Strings 2024 conference and turn it into a mini-edition of Bulk Physics, Algebras and All That. I will most likely continue this in the next edition, but for now, this is a light discussion of some interesting stringy problems. This is also the debut of the $\textsf{string theory}$ category for this blog.

Bulk Physics, Algebras and All That Part Three: Strings Edition -I

Strings 2024

[From my Twitter post about this.] Strings 2024 has ended. It was an amazing conference and a lot of good progress was made. Amazing work by everyone involved, and great work in several good directions like chaos theory, QIT, de Sitter, etc. (as usual, celestial holography gets cut out). Some of the talks:

The first talk was by Miguel Montero on string compactifications, which was a review talk on supersymmetric vacua, swampland constraints and non-SUSY string vacua, among other things. The next one was by Wiesner on a bottom-up proof of the emergent string conjecture and the dependence of the species scale on higher-derivative corrections to the Einstein-Hilbert action. I then listened to Chang's talk (skipping Figueiredo's talk) on supercharge "Q" cohomologies, fortuitous states and near-BPS black holes. Next, Collier's talk on 2D dS as a matrix model was fascinating. It was based on his work with Beatrix, Victor and Lorenz on the Virasoro minimal string in 2023, and has directions I am interested in reading more about. Blommaert then had a talk on the gravity dual of DSSYK and the relation of the fake DSSYK temperature to the real temperature. This was followed by Stanford, Maxfield, Turiaci, Malda, Lorenz and Lin's discussion on JT gravity, which was a dope review.

The next day was kicked off by Juan's talk on the BFSS conjecture, followed by a soft-theorems talk that I didn't attend. Then, Cho had a talk on non-worldsheet string backgrounds, followed by a review by Yin and Erler on SFT. I skipped Budzik's twisted holography talk and listened to Mahajan's talk on non-perturbative minimal superstring duality with matrix integrals. There was then a gong show with Tourkine, Zhong, Tamargo, Biggs, Delgado (on bordism groups, which was fascinating), Gesteau, Guo, Ji, Kundu, Levine, Lin, Parihar and Priyadarshi. The next day, Palti talked about emergent kinetic terms in string theory, and there was an ML talk on CY geometry that I skipped. Norris had a great talk on dS vacua, which I have to review again. I skipped most of the observational talks that followed except for van Riet's talk, and skipped the Town Hall (on postdoc applications) and the AI talk by Kaplan. The next day (Thu) I skipped the talk by Duffin and attended Casini's talk on the ABJ anomaly and U(1) symmetry.

I skipped all the bootstrap talks, unfortunately, but was pleasantly greeted with Wong's talk on 3D gravity and random ensembles of approximate CFTs. Next, Vardhan had a talk I did not quite understand, but it was followed up by Faulkner's gravitational algebras talk, which was great and is timely for me, since I'm working on algebras. The last session yesterday was by Chris Akers and Dan Jafferis, which was great, but I had internet issues. Today started with Nameki's talk on generalized symmetries, which I did not get either, and I had to skip Dumitrescu's QCD talk. Hansen had an interesting talk on bootstrapping Virasoro-Shapiro amplitudes in AdS, which I have to review again.

I also skipped Bobev’s talk on M2 branes. Yonekura had an interesting talk on non-SUSY branes in heterotic string theory, followed by Minwalla’s talk on large J+E holographic CFTs. I couldn’t attend Dabholkar’s talk on stringy quantum entanglement entropy, nor Beiras’ talk on topological strings. Zhiboedov had a talk on the future of strings, and and the outro was Hirosi and Andy’s discussion on 100 string problems, which was very good and had de Sitter comments. I asked about analytic continuation, but unfortunately my internet connection dropped out as he was answering. On an all, it was a great conference, and a particularly better improvement over Strings 2023 in light of non-stringy talks. Already feeling nostalgic and missing the talks, and this is how amazing these talks are. David Gross + Ahmed Almheiri’s comments were really touching to hear, especially Ahmed’s joke on the UV index being in Planck units. Can’t wait for Strings 2025@NYUAD in Jan 2025, and since Ahmed +Suvrat+Eva et al are hosting, won’t be surprised if it is as good and even better than this one. Thank you everyone @CERN for this wonderful event.

Remembering TWISTEX

The May 31, 2013 El Reno tornado is considered one of the most powerful tornadoes to hit the Midwest, crossing US-81. With a peak width of 2.6 miles, it was recorded by several storm chasers, among them the group TWISTEX (Tactical Weather-Instrumented Sampling in/near Tornadoes Experiment), which was founded and led by Tim Samaras. The crew at the time of El Reno consisted of Tim Samaras, Paul Samaras and Carl Young. A brief summary of the events is as follows. At around 6 pm, the development of a tornado from the wall cloud was imminent, and at 6:03 pm, the ground circulations coincided and a tornadic system of multiple sub-vortices and circulations had developed south of El Reno. With increasing intensity (which initially categorized the tornado as EF3), the system grew to 2.6 miles wide, and over a span of 25 minutes took a course across Canadian County, OK. The exact categorization of the tornado is controversial, and is typically associated with the EF5 scale.

One of the characteristics that the El Reno tornado is infamously associated with is quick course changes. For instance, the tornado changed directions quickly and cut across I-40, and would go on for a total span of $\sim $ 40 minutes, stretching over a path of around 17 miles. Another prominent characteristic attributed to El Reno is that the usual precipitation-free clearing around a tornado was not present; rather, the outermost circulation of the tornado itself was wrapped in a cell of precipitation. Unfortunately, the TWISTEX team was not travelling in their usual chase-ready, reinforced vehicle, but rather in a Chevrolet with few modifications to suit it to the chase, and among the first casualties of El Reno were Tim Samaras, Paul Samaras, and Carl Young. In Reed Timmer's video of the same tornado [1, 2], the Dominator 2's hood was badly damaged by downed lines, and it is suggested that had they not stopped because of the downed lines on the road, they would have likely run into the same abrupt direction-changing behaviour that claimed TWISTEX. While usually the vehicles would be less susceptible to such situations while intercepting from the weaker side, the internal sub-vortices impacted the team. I recommend the paper by NOAA on the aerial survey of the El Reno tornado [3].

Associated with this mesmerizing but dangerous tornado were also the characteristic ``walking dead man" vortices, which can be seen forming and dissipating in footage of the El Reno tornado [4].


----

References

[1] Reed Timmer et al., Tornado Chasers, S2 Episode 11: "Nemesis, Part 1" 4K [May 31, 2013]

[2] Reed Timmer et al., Tornado Chasers, S2 Episode 12: "Nemesis, Part 2" 4K [May 31, 2013]

[3] Wakimoto et al., "Aerial Damage Survey of the 2013 El Reno Tornado Combined with Mobile Radar Data", Monthly Weather Review, NOAA [2016]

[4] Markus Pfister, Multivortex Tornado south of El Reno, May 31, 2013


Reboot de Sitter: Part 1

[From my Twitter thread]: Maybe it is just time to reboot dS holography altogether. No more double Wick rotations; start from the ground up and tell me, for the love of god, what the FLM corrections to entanglement entropy are. My *educated guess* is that the static patch is the description we should start from. We get a similar Ryu-Takayanagi-like formula, for which we have to find semiclassical corrections. Leaving technicalities aside, we'll assume there is an FLM-type formula. And here is where I note an interesting point. The pseudo-entropy formalism for EE in dS implies that the relevant quantities like the modular Hamiltonian are also pseudo-valued, and while there seem to exist ways to describe pseudo-relative entropy with $K=-\log \tau $ ($\tau $ is a transition matrix) and have a first law of entanglement, in the static patch this doesn't seem to be the case, so finding a JLMS formula becomes neater. Intuitively, modular flows also make sense, which is a good check mathematically. Subsequently, subregion duality would make sense. In retrospect, bulk reconstruction is a slightly unintuitive thing, and perhaps this should be addressed earlier. On the other hand, maybe a mathematical approach could be better suited, starting from the operator algebras that Chandrasekaran, Penington, Longo and Witten worked with. Whether subregion-subalgebra duality in static patch dS follows along those lines is not entirely clear to me, but if it does follow a Liu-Leutheusser-type formulation with type II_1 algebras, that would go a long way in clearing up what exactly dS holography means. One interesting thing here is that, as opposed to the type II_\infty algebras in crossed product AdS/CFT, one has a nicer vN algebraic setting, and it would be good to ask how the entropy of subspaces/subalgebras (I'm not sure how to know which in dS yet) corresponds to bulk wedges. Maybe some corresponding notion of entanglement wedges could be derived? Will update the thread with the next set of comments.

Crossed Product

Something I am working on right now is the crossed product construction. The idea that the crossed product construction has a physical meaning in AdS/CFT should be very obvious. This is primarily for two reasons: (1) there already exists a well-defined operator algebra for which an outer automorphism group can be defined, and (2) the Hamiltonian for $\mathcal{N}=4$ SYM goes like $H\sim N/g^{2}$, and has a central operator $N\times U$ defined as the difference of the Hamiltonian and its expectation value in the large $N$ limit. The first point is rather interesting and more or less provides most of the motivation needed for the crossed product construction, since the time translations generated by the Hamiltonian form a group of outer automorphisms of the left and right copies of the thermofield double state, which we will discuss better in the next subsection. The second point is interesting from a more formal standpoint, where we divide the difference $H-\langle H\rangle $ by $N$ and obtain a well-defined large $N$ limit. This observation leads to the construction presented in 2112.12828, where the canonical ensemble crossed product construction was presented. We will highlight the general idea below.
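Schematically (just restating point (2) as a formula), the central operator is the rescaled fluctuation

\[ U=\frac{H-\langle H\rangle }{N}\;, \]

which commutes with the single-trace operators in the strict large $N$ limit; keeping track of it at finite $N$ is what the crossed product construction below formalizes.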

Start by recalling that the group of automorphisms of an algebra $\mathcal{A}$ generated by a self-adjoint operator $\mathcal{O}$ is defined as 

\[e^{i\mathcal{O}s}\mathsf{a}e^{-i\mathcal{O}s}\in \mathcal{A}\]

for some $\mathsf{a}\in \mathcal{A}$ and $s\in \mathbb{R}$. If $e^{i\mathcal{O}s}\in \mathcal{A}$, then the group of automorphisms generated this way is said to be an ``inner" group of automorphisms, and ``outer" otherwise. We will stick with Takesaki's original conventions, so that the group of automorphisms of $\mathcal{A}$ is Aut$[\mathcal{A}]$, the inner automorphisms of $\mathcal{A}$ are Int$[\mathcal{A}]$ and the outer automorphisms are Out$[\mathcal{A}]$. Recall that a left Haar measure is a non-zero Radon measure $\mu $ so that for Borel sets $\mathcal{B}\subset G$ and $g\in G$,

\[ \mu \left(g\mathcal{B} \right)=\mu \left(\mathcal{B} \right)\;.\]

Let $\alpha :G\to $Aut$[\mathcal{A}]$. Then, the covariant representation of the triplet $(G, \mathcal{A}, \alpha )$ is a pair $(u, \pi )$ of a unitary representation $u:G\to \mathcal{U}(\mathcal{H})$ to the unitary group of $\mathcal{H}$ and $\pi :\mathcal{A}\to \mathcal{B}(\mathcal{H})$ (i.e. the set of bounded operators on $\mathcal{H}$) so that the covariance condition is satisfied:

\[u(g)\pi (\mathsf{a})u(g)^{*}=\pi \left(\alpha _{g}(\mathsf{a}) \right)\;.\]

We can then formally approach the crossed product as follows: taking the von Neumann algebra $\mathcal{A}$ acting on the Hilbert space $\mathcal{H}$ of the triplet above, the von Neumann algebra on $L^{2}(G, \mathcal{H})$ generated by $u(G)$ and $\pi _{\alpha }(\mathcal{A})$ is the crossed product algebra $\mathcal{A}\rtimes _{\alpha }G$. This may seem more complicated than necessary, but the physical picture comes from noting that the crossed product of the type III$_{1}$ algebra $\mathcal{A}$ by its modular automorphism group is a type II$_{\infty }$ algebra. P.S. Go watch Dune Part 2.
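
Concretely, the standard choice of covariant representation on $L^{2}(G, \mathcal{H})$ (conventions differ by where the inverse sits) is

\[\left(\pi _{\alpha }(\mathsf{a})\xi \right)(g)=\alpha _{g^{-1}}(\mathsf{a})\,\xi (g)\;, \;\;\;\;\; \left(u(h)\xi \right)(g)=\xi (h^{-1}g)\;,\]

and a one-line computation gives $u(h)\pi _{\alpha }(\mathsf{a})u(h)^{*}=\pi _{\alpha }\left(\alpha _{h}(\mathsf{a}) \right)$, so the covariance condition holds; the crossed product $\mathcal{A}\rtimes _{\alpha }G$ is then the von Neumann algebra generated by these operators.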

A Few Ruy-Lopez Sidelines

 Suppose we have 1. e4 e5 2. Nf3 Nc6 3. Bb5, which is the Ruy-Lopez opening. Black has a number of options here, but among the variations I have played against, the two most common are a6 and Nf6, the former being the Morphy variation and the latter the Berlin defense. After a6, Bxc6 is an option, but I rule it out because it does not improve white's position. A note here is that after Bxc6, black should recapture with dxc6 to discourage Nxe5, which would be met by Qd4, forking the knight and the e4 pawn; Nf3 is then an option for white, and a queen trade may ensue, after which white is not really better. So after a6, 4. Ba4 is logical, and after Be7 5. O-O, white is better. Here black may try to support the center and chase the bishop away with b5, after which Bb3 and white is again better. In Sicilian positions where black plays 1. e4 c5 2. Nf3 Nc6 3. Bb5, you necessarily have to exchange on c6, because after a6 4. Ba4 b5 5. Bb3 c4 simply traps the bishop -- this Rossolimo line is very interesting and leads to a lot of nice open positions. Going back to the mainline, after Bb3 black could castle, after which c3 is a very good move, intending to support the center and possibly aim for expansion. The old Steinitz defense is a very good counter to the Ruy-Lopez, typically heading for closed positions, which I find a little inconvenient. 

One reason why the Ruy-Lopez is so fascinating but complicated is that once the game deviates from the mainline, it becomes easy for black to seize any opportunity to get a better position. Against fianchetto setups with b6 and Bb7 the general approach does not change much; for instance, say we have 1. e4 b6 2. Nf3 Bb7 3. Bc4. Then black could take on e4, but while it seems like a free pawn, Bxf7+!! is a brilliant move: black is more or less forced to take, and the knight check forks the king and the bishop on e4, so the line goes 3. Bc4 Bxe4 4. Bxf7+!! Kxf7 5. Ng5+! Ke8 6. Nxe4, and black has little development, no castling rights and structural weaknesses on the light squares along the a8-e4 diagonal. Against the Sicilian, there are again many options. After 1. e4 c5 2. Nf3 Nc6, we have the previously seen Bb5 line where we take on c6, damage black's pawn structure and expand with d4 and O-O later. On the other hand, there are also lines with 2. Nf3 d6, after which we play 3. Bb5+ nonetheless, forcing either a similar Bxc6 position or a bishop trade after black plays Bd7. One line that I found very interesting was something like the Berlin defense with a delayed d4 after Nxe4, which usually would be refuted by Re1. The anti-Berlin with d3 immediately after Nf6 is also an option, and is the line that Gukesh played against Alireza Firouzja recently. Typically, a second option that the Berlin defense allows is a delayed capture on e4 after something like Be7 O-O a6 Ba4 b5 Bb3 Nxe4, which has an interesting line that usually ends in a draw after d4/d3. One of the lines that I don't quite understand how to tackle is when black has already played Nf6 -- the Petrov's defense -- where lines like 1. e4 e5 2. Nf3 Nf6 3. Nxe5 are met with Nc6, the Stafford gambit. This is a line I play as black, but as white my prep sticks with 4. Nxc6 bxc6 5. Nc3 Bb4, a kind of reversed Spanish setup from black. Bd3 then seems like an obvious, logical move, also preparing the kingside for castling short. A trick that beginners/intermediate players go for as black is to instead grab the pawn with 3... Nxe4: if white then plays a careless move like h4, black plays Qe7, hitting the knight on e5, and if white moves that knight anywhere, the discovered check Nc3+ simply wins the queen. In general, though, the Petrov's defense heads into a Stafford gambit. Against the Caro-Kann my prep goes for the exchange variation after 1. e4 c6 2. Nf3 d5 3. exd5 cxd5 4. Bb5+, following a similar line as against the Sicilian with d6. As black, the Zhuravlev countergambit is natural for Ruy-Lopez players, but we will discuss black's reversed-Spanish variations more in a later post.

Kronecker-Weber Theorem

 Let $\tilde{F}$ be the field (called the algebraic closure of $F$) obtained by adjoining to $F$ all roots of polynomials over $F$. The maximal abelian quotient of $\mathrm{Gal}\left(\tilde{F}/F \right)$ is the Galois group of the maximal abelian extension $F'$ of $F$, which is the largest subfield of $\tilde{F}$ whose Galois group over $F$ is abelian. Setting $F=\mathbb{Q}$, the Kronecker-Weber theorem says that the maximal abelian extension $\mathbb{Q}'$ of $\mathbb{Q}$ is obtained by adjoining all roots of unity to $\mathbb{Q}$. Let $\zeta _{N}$ be a fixed primitive $N$th root of unity and $\mathbb{Q}\left(\zeta _{N} \right)$ be the $N$th cyclotomic field. Then, there exists an isomorphism 

\[\left(\mathbb{Z}/N \right)^{\times}\cong \mathrm{Gal}\left(\mathbb{Q}\left(\zeta _{N}\right)/\mathbb{Q} \right)\;.\]

The Kronecker-Weber theorem is then the statement that every finite abelian extension of $\mathbb{Q}$ is contained in a cyclotomic field. This has some interesting properties, particularly those linking to p-adic numbers (which caught my attention thanks to Optimized Fuzzball discussing p-adic AdS/CFT), which I will write about in a later post. I am also preparing a Lichess study on Ruy-Lopez, which I will link sometime soon.
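
 A concrete illustration (a completely standard example, nothing original): take $N=5$, so $\mathrm{Gal}\left(\mathbb{Q}\left(\zeta _{5}\right)/\mathbb{Q} \right)\cong \left(\mathbb{Z}/5 \right)^{\times }\cong \mathbb{Z}/4$, with $a\in \left(\mathbb{Z}/5 \right)^{\times }$ acting by $\zeta _{5}\mapsto \zeta _{5}^{a}$. The unique index-two subgroup $\{\pm 1\}$ fixes

\[\zeta _{5}-\zeta _{5}^{2}-\zeta _{5}^{3}+\zeta _{5}^{4}=\sqrt{5}\;,\]

so the quadratic extension $\mathbb{Q}(\sqrt{5})$ sits inside the cyclotomic field $\mathbb{Q}(\zeta _{5})$, exactly as Kronecker-Weber demands of any finite abelian extension of $\mathbb{Q}$.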

How to setup GitHub and Dropbox for collaboration

Here is a helpful tip I am giving you for free: if you use Overleaf for collaborative work (which is ok), I would strongly recommend switching to Dropbox+GitHub for better version control and ease of working. Before I started using GitHub, most of my work was on Overleaf, but it is clear that working there is not great. When collaborating with another person (or people or aliens or whatever your colleagues are), it is very important that all the edits are perfectly synced up, which Overleaf does not do offline on a free plan. Here is how you set up your collaboration folder. 

Suppose you (say X) are working with Y. Then, set up a GitHub repository with a folder of the TeX, bbl, bib and whatever files you want, and add Y as a collaborator to the project. I would strongly recommend using Sublime Merge for version control. Now, starting from whatever the initial state of the files is (denote it $|\Omega \rangle $), whenever you make a change (or whenever you want to merge files), FIRST commit in the GUI. If Y makes any changes, he commits and then PUSHES his commits onto the master branch, which you have to PULL -- only AFTER you have committed yourself -- then MERGE the files, and once you obtain a merge commit with no issues left to resolve, only then do you PUSH, reaching some new state $|\Omega '\rangle $. This goes on back and forth, but for your own ease, I strongly advise turning on email notifications for the GitHub commits so that the other person (or people or aliens or whatever your colleagues are) knows when to pull and merge. There will be many issues, and some may be a little more complicated if there are remote branches. So, as a way of making sure that no information is lost (because let's face it, information during collaboration is more easily lost than in the black hole information problem), people also sync the GitHub repo folder via Dropbox, so that there is a detailed synced-up version history. In case the other person (or people or aliens or whatever your colleagues are) messes something up, the version history can be consulted, but bear in mind that this needs your machine to be online. 

So, to summarise: set up a GH repo, invite collaborator(s), use a version control app such as Sublime Merge, set up email notifications, sync the repo to Dropbox, ask the other person (or people or aliens or whatever your colleagues are) to clone the repo, edit, THEN commit, THEN pull, THEN merge, THEN push, and watch Dune Part 2 when you can. 
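
If you want to script the cycle, here is a minimal sketch in Python (not my actual setup, just an illustration of the ordering; it assumes git is installed, the repo is already cloned, you have local changes to commit, and the shared branch is called master):

import subprocess

def git(*args):
    # Run a git command in the current repository; stop if it fails.
    subprocess.run(["git", *args], check=True)

def sync(message="update draft"):
    git("add", "-A")                                  # stage local edits
    git("commit", "-m", message)                      # FIRST commit your own changes
    git("pull", "--no-rebase", "origin", "master")    # THEN pull; git merges here
    git("push", "origin", "master")                   # only after a clean merge, push

sync("fix section 3 typos")

If the pull hits a conflict, the script simply stops with an error and you resolve and push by hand -- which is exactly why you commit first.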

The Canvas of Holography in (A)dS/CFT

 The GRF Essay that I wrote with Aayush is now online, which you can see at the link below -- we have submitted it to arXiv as well, so it should be out in a couple of days. While the musings are based on the de Sitter review we wrote, this essay should convince you that de Sitter quantum gravity is enigmatic to its core. 

 The Canvas of Holography in (A)dS/CFT

Neural Networks Everywhere


There are a bunch of papers on neural nets that I have been fascinated by, which I hope to eventually read. Here are a few of them:
  1. [2402.13321] Rigor with Machine Learning from Field Theory to the Poincaré Conjecture,
  2. Related to the above paper, [2310.19870] Metric Flows with Neural Networks,
  3. [1801.03918] Black Holes as Brains: Neural Networks with Area Law Entropy,
  4. A paper I found on diffusion models and neural nets (not exactly something I am familiar with yet) recommended by p-adic: [2211.14680] A Physics-informed Diffusion Model for High-fidelity Flow Field Reconstruction.
And some more that I have to find. I am also typing up Part Three of Bulk Physics, Algebras and All That, which should be out in a while.

de Sitter Essay

Initially, I intended to write an Essay for the GRF contest on de Sitter subregions, which I have been fascinated by for quite some time now. The first thing I had in mind was to work a bit with these subregions and see if I could come up with something, but noticing that the natural direction of progression went towards double Wick rotations and the like, I decided to ease off on doing something original. Let me explain why I came to this decision.

The whole thing with dS/CFT is that we don't know what to do with subregions: instead of having a nice entanglement entropy like we do in AdS/CFT with the Ryu-Takayanagi formula, we have to deal with a non-unitary CFT and work with transition matrices instead of density matrices. The result of this is that instead of having something like 

\[S=-\mathrm{Tr} \;\rho \log \rho \in \mathbb{R}\;,\]

we have to deal with something like 

\[\mathcal{S}=-\mathrm{Tr}\;\tau \log \tau \in \mathbb{C}\;,\]

which is ugly for two reasons: (1) complex-valued entanglement entropy is indicative of extremal surfaces that are timelike rather than spacelike, and (2) it also has non-trivial information-theoretic implications. For instance, in AdS/CFT, Ryu-Takayanagi with corrections is the FLM formula, which in turn implies an equivalence of bulk and boundary relative entropies via the JLMS formula:

\[S(\rho _{A}|\sigma _{A})=S(\rho _{a}|\sigma _{a})\;,\]

where $\rho _{A}$ are density matrices associated to the boundary subregion $A$ and $\rho _{a}$ are density matrices associated to the bulk subregion $a$ -- but how to look at something like this in dS/CFT is not entirely clear, since we have to deal with transition matrices; for that matter, what to even naively expect of subregion duality is not clear (except for Narayan's geometric works, which are pretty good for building some intuition here). One idea is to use the first law of entanglement with perturbed states $\rho \to \rho +\delta \rho $, which also works for transition matrices via pseudo-modular Hamiltonians, so that we have something like 

\[S(\tau +\delta \tau )-S(\tau )\sim \langle \widetilde{K}_{\tau +\delta \tau }\rangle -\langle \widetilde{K}_{\tau }\rangle +O(\delta \tau ^{2})\;,\]

and do something in dS/CFT similar to Dong, Harlow and Wall's work in AdS/CFT, which is something I am currently working on. But the thing about double Wick rotations is that, at least for me, they do not seem empirical enough; the basic idea is to go from the Poincaré AdS patch

\[ds^{2}=\frac{l^{2}_{\text{AdS}}}{z^{2}}\left(-dt^{2}+dz^{2}+\sum _{a=1}^{D-2}dx^{a}dx^{a} \right)\;,\]

 to Euclidean AdS and double Wick rotate it by 

\[z\longrightarrow i\eta \;, \;\;\;\;\; l_{\text{AdS}}\longrightarrow -il_{\text{dS}}\;\]

to relate to the dS planar slicing (here I set $l_{\text{dS}}=1$)

\[ds^{2}=\frac{1}{\eta ^{2}}\left(-d\eta ^{2}+\sum _{a=1}^{D-1}dx^{a}dx^{a} \right)\;.\]

From this, one can find the timelike entanglement entropy and correspondingly the subregions (at least in the Hartman-Maldacena fashion). I am trying to get a better feel for the more ``canonical" side of things, and this seems a little too straightforward for my liking.
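
As a quick check of the signs (just my own bookkeeping, keeping $l_{\text{dS}}$ explicit): Euclideanizing $t\to -it_{E}$ and then substituting $z=i\eta $, $l_{\text{AdS}}=-il_{\text{dS}}$ gives

\[\frac{l^{2}_{\text{AdS}}}{z^{2}}\left(dt_{E}^{2}+dz^{2}+\sum _{a=1}^{D-2}dx^{a}dx^{a} \right)\longrightarrow \frac{l^{2}_{\text{dS}}}{\eta ^{2}}\left(-d\eta ^{2}+dt_{E}^{2}+\sum _{a=1}^{D-2}dx^{a}dx^{a} \right)\;,\]

so the Euclidean time direction simply joins the remaining $D-2$ spatial coordinates, which is why the sum in the planar dS metric above runs to $D-1$.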

So instead, I am writing a Tom Banks-inspired Essay in which I basically present a few of my thoughts on how some things in dS/CFT could be resolved, although these are presented in a very straightforward way and are not meant to be precise whatsoever. One of the other things I was thinking of adding to the Essay is on asymptotic quantization and CFT partition-like functionals obtained in the $\mu \to 0$ limit (where $\mu $ is some deformation parameter; it essentially tells us how the rescaling of the metric $g$ takes us to different slices, and more precisely is the deformation parameter attributed to $T\overline{T}$-deformations in Cauchy slice holography), although my remarks in that section are not very clear and I am yet to work on them. 

Bulk Ignorance vs Boundary Algebra

 Most of the time in AdS/CFT the bulk is the more complicated story, although in the general discussion of holography both the bulk and the boundary are somewhat mischievous. A result I am working with is that the type III or type II nature of the boundary algebras is reflected in the algebras $\mathcal{A}(\mathcal{U})$ associated to bounded bulk regions $\mathcal{U}$, in the following sense. One works in the Haag-Kastler setting, attributing algebras to regions $\mathcal{U}\subset \Sigma $ of a globally hyperbolic manifold $(M, g)$. For the sake of this article we do not want to work in a non-holographic theory, although being non-holographic allows some ease with things like the split property. It should be clear immediately that this can be extended to bulk subregions, for which the algebra is type III, as Liu-Leutheusser showed. In this framework, when $\mathcal{A}(\widetilde{R})$ (for some boundary subregion $\widetilde{R}$) is a type II algebra, one can associate to it density matrices and entanglement entropy; the bulk dual $\mathcal{A}(R)$ ($R$ being the bulk subregion dual to $\widetilde{R}$) would then admit a plausible definition of generalized entropy. This is a very general argument: one can define an entropy $\mathcal{S}(\widetilde{R})$ for $\widetilde{R}$, attribute to it some generalized entropy, and establish a neat relation with the relative entropy. In type III the von Neumann entropy is only formal, but one can nonetheless write the relative entropy of a semiclassical state $|\hat{\Phi }\rangle $ with respect to a cyclic and separating state $|\hat{\Psi }\rangle $, at fixed $\delta A_{X}=0$, which looks something like 

\[S(\hat{\Phi }|\hat{\Psi })=-S(\hat{\Phi })-\langle \ln \rho _{\hat{\Psi }}\rangle _{\hat{\Phi }}\;,\]

which can be simplified into a better form by identifying the modular Hamiltonian. In type II, this is no longer merely ``formal", and allows us to explicitly identify the generalized entropy of the subregion. Of course, working with bulk algebras more formally is certainly interesting, since usually the bulk is the more complicated half of AdS/CFT. A very vague idea I had initially was to essentially do this in terms of LCQFTs, to see what happens to the split property established at the bifurcation surface $X$. While I did make some good-ish progress, I never got around to formalizing it. In the meantime, however, I am working on a set of notes on LCQFTs, which might be a good read once completed. 
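
Incidentally, by ``identifying the modular Hamiltonian" above I just mean writing $K_{\hat{\Psi }}=-\ln \rho _{\hat{\Psi }}$, so that the relative entropy takes the more familiar form

\[S(\hat{\Phi }|\hat{\Psi })=\langle K_{\hat{\Psi }}\rangle _{\hat{\Phi }}-S(\hat{\Phi })\;;\]

in the type II case the trace, and hence $S(\hat{\Phi })$ itself, is genuinely well-defined (up to the usual additive constant), which is what the ``no longer merely formal" remark refers to.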

P.S. More Hans Niemann debate in the chess world from the St. Louis club. Apparently SLCC did not invite him for their tournaments because of some ``inappropriate behaviour", which Niemann apologized for but is now claiming is just more drama. Catch up with Niemann's video on their letter to him here: ``I have fully addressed all of the claims made by the STL Chess club in their letter, addressed their private letter to me and given context to my history with the Club. Let's make one detail absolutely clear, I received 0 invitations to STL Events, before I regretably caused..."

Loopy Ryu-Takayanagi?

 A paper that came to my attention from a Twitter feud on LQG is this paper by Smolin, which resulted in me adding a new category to this blog, called ``Loops":

[1608.02932] Holographic relations in loop quantum gravity

This is interesting for two reasons. Firstly, it invokes categorial holography as Smolin terms it, which was introduced by Crane in his paper called Categorial Physics, and secondly, it uses spin networks in an interesting way. The idea is to take punctured 2-surfaces, attribute a Hilbert space to them, and eventually attribute a ``bulk" to the quantum spin network by extending from the punctures -- essentially, stating that there is a map $\mathcal{T}^{p}: \mathcal{H}_{\mathcal{B};j, i}\to \mathcal{H}_{\Sigma ; j, i}$, as noted in the paper. However, I must say something here: one may not need to invoke these constructions in the first place and could instead take a deformation perspective, such as Cauchy slice holography, or more generally holography of information, if canonical quantum gravity is all we wanted. The feud seems very unnecessary, and for that matter I don't think there is any need to point fingers saying ``string theory is just conjecture" or ``LQG is just conjecture". If such holography is emergent in LQG (as it seems above, but I am no expert), then it is very fascinating. Calumny is a dangerous thing, and so is politicising quantum gravity. 

Horowitz-Myers AdS soliton conjecture

A ``new" positive energy conjecture was proposed back in 1999 by Horowitz and Myers [arXiv:hep-th/9808079], which somehow escaped my attention till Steve McCormick mentioned it to me in a tweet (X-post? sounds weird though). I will briefly outline what it is, but the technicalities will be left to the reader as an exercise. The idea is essentially to take the near-extremal p-brane solution and double Wick rotate it, so as to obtain the AdS soliton, with one coordinate periodically identified (I write the metric below). The interest then would be to work with metrics that look like 
\[g_{\mu \nu }=\tilde{g}_{\mu \nu }+h_{\mu \nu }\]
and fall-off conditions w.r.t $r$, where $\tilde{g}_{\mu \nu }$ is the AdS soliton metric. Horowitz and Myers go on to propose two conjectures, which are as follows:
  1. SUGRA solutions: $g_{\mu \nu }=\tilde{g}_{\mu \nu }$ is the only solution to $D=10$ type IIB SUGRA, taking $\tilde{g}_{\mu \nu }\times S^{5}_{l}$ for which $\mathcal{E}= 0$, and $\mathcal{E}\geq 0$ in general.
  2. Geometric solutions: $g_{\mu \nu }=\tilde{g}_{\mu \nu }$ is the only solution to $D=5$ field equations with $\Lambda = -6/l^{2}$ so that $\mathcal{E}=0$, and $\mathcal{E}\geq 0$ in general.
Seemingly, these are still open problems. However, I believe that resolving at least some aspects of these conjectures will go a long way in making sense of the geometric description of AdS/CFT. 
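
 For concreteness, the metric I have in mind (this should be the standard $D=5$ AdS soliton from the original paper, modulo conventions) is

\[ds^{2}=\frac{r^{2}}{l^{2}}\left[\left(1-\frac{r_{0}^{4}}{r^{4}} \right)d\tau ^{2}+(dx^{1})^{2}+(dx^{2})^{2}-dt^{2} \right]+\left(1-\frac{r_{0}^{4}}{r^{4}} \right)^{-1}\frac{l^{2}}{r^{2}}dr^{2}\;,\]

where smoothness at $r=r_{0}$ forces $\tau $ to be periodic with period $\pi l^{2}/r_{0}$. This is the $\tilde{g}_{\mu \nu }$ appearing in the conjectures above, and (if I recall the original paper correctly) its energy relative to the periodically identified AdS background is negative, which is what makes the minimality claim non-trivial.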

Papers to Read

 I was asked what papers I have in my library so far. Since I found it hard to point out each paper individually, I have uploaded part of the library to the github repository. Hopefully these papers cover the significant advances that someone just beginning to work in hep-th can refer to; other papers that I have not included will be mentioned later. I do not intend any copyright infringement; if that turns out to be an issue, I will replace the directory with arXiv links instead, which should be better. For now, enjoy :)

Papers in high energy physics Theory