Entropic LLMs Github

 Here is my Entropic_LLMs GitHub repository, where I have put my initial note on entropy-based LLMs and on Entropix as an EBM. More or less based on my previous Notion posts. Will follow up with more implementations soon.

vkalvakotamath/entropic_llms: Repository on entropic large language models

Statement on the 2024 Nobel Prize in Physics

 Given the response of the physics community (mostly the hep-th community) to the 2024 Nobel Prize in Physics that was awarded to Hopfield and Hinton, here is my impartial statement on the fiasco. Bottom line: most of you were wrong.

Witten and Boltzmann

From my Notion on Witten and Boltzmann

 It is currently 0916 hrs and I am incredibly tired at this point. I really should have slept for more time last night and right now I really should get sleep because it is impossible to function without it. Hopefully I can just go to the library in the next class and not have to worry about the next classes because I don’t have the patience for it. I have loads of work to do, and I am trying to get back into physics because I really have to get to work on the de Sitter algebras work that I was interested in. But whatever, I want to pen some thoughts on EBMs and something a lot of people are overlooking and on de Sitter I guess, with no intention of fully elaborating on either at the moment.

The first thing is that there is a nice direct relationship between entropy-based sampling and min-$p$, which results in a sort of “sampling redundancy”. Loosely, start by noticing that $p(x)=\frac{\exp (-\beta E)}{Z}$, and the entropy

$$ S(p)=-\int p(x)\log p(x)\,\mathrm{d}x\;. $$

Expand into

$$ S(p)=\beta \left( \mathbb{E}_{p}[E(x)]-\mathcal{F}_{\beta } \right)\;, $$

where we just substituted for $p(x)$ and identified the Helmholtz free energy $\mathcal{F}_{\beta }=-\beta ^{-1}\log Z$. Then, the sampling algorithms are solved in terms of $\min E$: start by computing $E(\Sigma _{i})$, and then find min-$p$ as usual. However, you could instead just seek to minimize the free energy, since that amounts to selecting higher-probability choices and tells you what the entropy is anyway. Working with $\mathcal{F}_{\beta }$ rather than just $E$ is in general better because it incorporates $\beta$ and gives a more subtle relationship between the parameters and the entropy. Idk, at this point I am too sleepy to continue, but I guess I'll want to do something with the loss functions to better understand Entropix as an EBM.
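To make this concrete, here is a tiny numerical sketch with made-up logits and $\beta =1$, using the conventions above ($p\propto e^{-\beta E}$, $\mathcal{F}_{\beta }=-\beta ^{-1}\log Z$) and taking min-$p$ in its usual relative-to-max-probability sense:

```python
# Numerical sketch: entropy, free energy and min-p from the same energies.
# Toy logits; conventions: p ~ exp(-beta*E), F = -log(Z)/beta.
import numpy as np
from scipy.special import logsumexp, softmax

beta = 1.0
logits = np.array([3.2, 2.9, 1.0, -0.5, -2.0])   # hypothetical next-token logits
E = -logits / beta                                # energies, so that -beta*E = logits

F = -logsumexp(-beta * E) / beta                  # Helmholtz free energy
p = softmax(-beta * E)
S = -(p * np.log(p)).sum()

assert np.isclose(S, beta * ((p * E).sum() - F))  # S = beta * (<E> - F)

p_rel = 0.1                                       # min-p threshold, relative to max prob
keep = p >= p_rel * p.max()
print(p.round(3), keep)
```

The point of the sketch is only that the min-$p$ cut and the entropy/free-energy bookkeeping run off the same quantities, which is the redundancy mentioned above.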

The other thing I am interested in is the algebra of observables in de Sitter space, and loosely also trying to find what $T\overline{T}+\Lambda _{2}$ deformations do to the OPEs and, in general, to the operator algebras. Loosely, the problem is as follows: in AdS/CFT, we can take the operator algebra of all single-trace operators $\mathcal{O}_{i}$, which is a type $\mathrm{III}_{1}$ von Neumann algebra. In the large $N$ microcanonical ensemble, this algebra can be written as the crossed product with its modular group of automorphisms $G_{\mathcal{A}, 0}$, which gives us a type $\mathrm{II}_{\infty }$ algebra. This comes from the Takesaki theorem for the crossed product of type $\mathrm{III}_{1}$ algebras:

$$ \mathrm{III}_{1}\rtimes G_{\mathcal{A},0}=\mathrm{II}_{\infty }\;. $$

Why is this relevant? By obtaining a type $\mathrm{II}_{\infty }$ algebra and not a type $\mathrm{III}$ algebra, we can actually define entropy, traces and density matrices. One result from this is that the bulk generalized entropy $S_{\mathrm{gen}}(X)$ for the bifurcation surface of the two-sided AdS black hole is given by the entropy $S_{E}(R)$ of the full right boundary algebra. A result that I had, coinciding with implications in Bousso, Chandrasekaran, etc., is that the entropies of appropriate boundary subspaces correspond to the generalized entropies of minimar surfaces, or more generally of the apparent horizon. There are some subtleties with why this construction works; for instance, the reason we use subspaces and not subalgebras is that including the Hamiltonian $H_{R}$ in the algebra $\mathcal{A}_{R}$ means that operator products like $\mathcal{O}_{i}\cdot H_{R}$ cannot lie in a closed subalgebra, or we would be able to reconstruct arbitrary nonlocal operators given access to only a particular entanglement wedge $\mathcal{E}_{W}(R)$. Now, whether such a subspace of finite-$G_{N}$ operators or observables calculates the generalized entropy in de Sitter space is an open question (first remarked on in Chandrasekaran, Penington and Witten's paper on large $N$ algebras and generalized entropy). Which is vaguely something I am interested in. But more on this later.

November 1, 2024, 1414 Polish time.

Energy Based Models: EBM101 - I

 From my Notion on EBM101-I

Needless to say, I just came to my uni and I am bored already. However, I was late to the class, so it is approximately 0930 hours now, and hopefully it is only three more hours that I have to survive (ugh). But here is a mini-course on EBMs, as introduced by Teh, Welling, Osindero and Hinton in 2003 and subsequently elaborated on in many other papers, most prominently Yann LeCun et al's Tutorial on Energy-Based Learning. I turned this into a mini-course because I think EBMs are interesting; it will be composed of a number of short posts such as this one.

Here is the high-level argument for EBMs. We start with a model $M(p, \mathbf{s}, S, \Psi )$, where $p$ are the probabilities, $\mathbf{s}$ are the sources, $S$ is the entropy and $\Psi$ is the KL divergence. Historically there were two ICA approaches: (1) the information maximization approach and (2) the causal generative approach. EBMs as an ICA approach caught a lot of attention, and later led to the basic idea that the key is to work with canonical “statistical mechanics” to obtain a reasonable machine learning model, rather than simply brute-force optimizing or modelling. In vanilla ICA, we only have the parameters above, which are good enough in a sense. However, you could see from something like IMA and the mutual information-entropy correspondence that there are some motivations for the more canonical SM style. While there may be subtleties with how they reduce to something like the causal generative approach in the presence of noise (the two are equivalent in the noiseless case), these two approaches paved the way for something more powerful: energy-based models.

At the core of EBMs is the partition function, which also acts, in a way, as the generating function for the $n$-point correlators or moments. We will talk about this in a different post, but for now we only need the intuition for the partition function. If you follow Vasu Shyam (@vasud3vshyam on Twitter), you may have seen the post identifying the denominator of the $\mathrm{softmax}$ attention function as the partition function. The intuition behind why this works is based on EBMs, and is as follows. The partition function,

$$ Z=\sum _{i}\exp \left( \mathcal{X}_{i} \right) $$

appearing in the $\mathrm{softmax}$ function is related to the probabilities as

$$ p(x)=\frac{\exp (-\beta E)}{Z_{\beta }}\;. $$

Here we momentarily restored the inverse temperature $\beta$ for reasons that will be more apparent in later posts. The idea of EBMs is to interpret the energy $E$ as a minimization parameter and get inference from $\min E$. This minimization problem is very similar to the negative log-likelihood optimization problem; there is no exact reason to prefer this sort of deterministic approach over IMAs, and for that matter there is no compelling reason to prefer EBMs over IMAs or CGAs in general. Overall, this reduces to the more familiar sort of optimization algorithm with, say, the cross-entropy loss $J_{CE}$, which is (up to a factor of $\beta$) the difference of $E$ and the free energy of the ensemble $F$, and which is just your familiar negative log-likelihood. This free energy (the Helmholtz free energy, to be more precise) is given by $-\beta ^{-1}\log Z_{\beta }$, and $J_{CE}$ takes the form

\[ J_{CE}=\beta \left( E-F_{\beta } \right)\;, \]

where we can expand $F_{\beta }$ into

\[ F_{\beta }=-\beta ^{-1}\log \left( \int \exp (-\beta E) \right)\;. \]
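As a quick sanity check of the last two formulas (a sketch with made-up energies and $\beta =1$), $\beta (E-F_{\beta })$ for the target token matches the usual negative log-likelihood:

```python
# Cross-entropy as beta*(E - F_beta), checked against the usual -log p.
import numpy as np
from scipy.special import logsumexp

beta = 1.0
E = np.array([0.3, 1.7, 2.5, 0.9])        # hypothetical energies per candidate token
target = 0

F = -logsumexp(-beta * E) / beta           # F_beta = -beta^{-1} log sum exp(-beta E)
J_ce = beta * (E[target] - F)              # cross-entropy / NLL of the target

p = np.exp(-beta * E) / np.exp(-beta * E).sum()
assert np.isclose(J_ce, -np.log(p[target]))
```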

What is interesting about the EBM approach to loss functions is that, more often than not, a given loss function may not be compatible with $\min E$, because our requirement is also that training shapes the energy for each model prediction in a way that pushes the energy of the correct (most probable) generation down, while pushing the energy of every other answer up. In LeCun et al's paper, they explicitly talk about the different kinds of loss functions and these compatibility arguments.

From Yann LeCun et al 2006.

In the next post, we will discuss these loss functions and the gradient optimization with $\min E$.

As a side note, I will be linking all these posts on my blog along with the titles and timestamps, since I am not sure how to make a homepage with a blog-like feel on Notion. In case there are some issues with updating or access, I will update them on a status page I will make soon. Addaam reshii a zaanta!

October 29, 2024, 2136 hrs

Adaptive Entropy Sparsity

 From my Notion page on Adaptive entropy sparsity

(This is a very crude first draft. There are no exact details or code to go along with this since I wanted some understanding of the abstraction first.)

I will avoid timestamping this for obvious reasons. But here is the vague idea for now just to have something to send to Wowo. You start by taking the entropy and define something like

\[ \zeta _\alpha (S)=\begin{cases} \zeta _{\text{min}}\equiv 0 & S > \mathcal{S}_{\text{max}} \;\text{(high entropy: softmax)} \\ \mathcal{D}(S) & \mathcal{S}_{0}\leq S \leq \mathcal{S}_{\text{max}} \\ \zeta _{\text{max}} & S < \mathcal{S}_{0} \;\text{(low entropy: sparsemax)} \end{cases}\;, \]

where $\mathcal{D}(S)$ is a function of entropy that ranges from $\zeta _{\text{min}}=0$ to $\zeta _{\text{max}}$. This is just bounded between the usual $\mathrm{softmax}$ and $\mathrm{sparsemax}$ functions, so basically we are saying that if we have high entropy, we want lower sparsity, and if we have lower entropy, we want higher sparsity. This essentially employs a version of magnitude clipping or pruning, but the stronger claim I want to make is that this entropy-dynamic sparsity increases the $\text{effective correlation}$ of the model by reducing the correlations between irrelevant and relevant inputs. In most base models this is not exactly an issue, but arguably, in the efficiency vs performance trade-off, maximizing effective correlation is possible only if we don't restrict ourselves to dense attention weights, meaning that $\mathrm{softmax}$ is not very efficient in this sense.
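Here is a minimal numerical sketch of the idea; the thresholds $\mathcal{S}_{0}$, $\mathcal{S}_{\text{max}}$, the choice $\zeta _{\text{max}}=1$, the linear $\mathcal{D}(S)$ and the crude output blend (standing in for the $\alpha$-entmax family) are all placeholder assumptions:

```python
# Sketch of adaptive entropy sparsity: a toy D(S) interpolating between
# softmax (high entropy) and sparsemax (low entropy).
import numpy as np
from scipy.special import softmax

def sparsemax(z):
    """Sparsemax of Martins & Astudillo: Euclidean projection onto the simplex."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    css = np.cumsum(zs)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * zs > css
    k_z = k[support].max()
    tau = (css[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def zeta(S, S0=0.5, Smax=2.0, zeta_max=1.0):
    """Hypothetical D(S): zeta_max below S0, 0 above Smax, linear ramp in between."""
    if S >= Smax:
        return 0.0                      # high entropy -> softmax regime
    if S <= S0:
        return zeta_max                 # low entropy -> sparsemax regime
    return zeta_max * (Smax - S) / (Smax - S0)

def aes_weights(scores):
    p_soft = softmax(scores)
    S = -(p_soft * np.log(p_soft)).sum()
    z = zeta(S)
    # crude blend of the two regimes; a real version would feed z into alpha-entmax
    return (1 - z) * p_soft + z * sparsemax(scores)

print(aes_weights(np.array([4.0, 1.0, 0.5, 0.1])))   # peaked scores -> sparser weights
print(aes_weights(np.array([1.0, 1.0, 0.9, 0.8])))   # flatter scores -> denser weights
```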

While I have not come up with a very formal definition of $\mathcal{D}(S)$ or of how exactly this model gets implemented, here is a reason why this whole entropy-based sparsity does not have to be that adaptive, and here is where $\mathrm{varentropy}$ comes in, deciding whether $\zeta (S)$ has to be a continuously adaptive parameter. In calculating the attention metrics, we compute the $\mathrm{entropy}$ as well as the $\mathrm{varentropy}$. This decides whether the sparsity factor is dynamic or fixed; in the low-$\mathrm{varentropy}$ regime the bounds are exactly $\mathrm{softmax}$ and $\mathrm{sparsemax}$. Higher varentropy implies that the bounds on $\mathcal{D}(S)$ change, although they are still bounded by the $\mathrm{softmax}$ and $\mathrm{sparsemax}$ functions. I am still working on how exactly to see this relation and give a mathematical formulation for it.
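For concreteness, the metric computation I mean is something like this (varentropy taken in its usual sense, the variance of the surprisal $-\log p$ under $p$):

```python
# Entropy and varentropy of the next-token distribution from raw logits.
import numpy as np
from scipy.special import logsumexp

def entropy_varentropy(logits):
    logp = logits - logsumexp(logits)         # log-softmax
    p = np.exp(logp)
    H = -(p * logp).sum()                     # entropy
    V = (p * (logp + H) ** 2).sum()           # varentropy = Var_p[-log p]
    return H, V

print(entropy_varentropy(np.array([3.2, 2.9, 1.0, -0.5])))   # flatter distribution
print(entropy_varentropy(np.array([5.0, 0.0, 0.0, 0.0])))    # more peaked distribution
```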

The next thing I will mention here is the order of complexity for the Transformer model with adaptive entropy sparsity. The usual vanilla Transformer has a complexity that is quadratic in the context length, $O(n^{2})$, whereas for a sparse model we can obtain a much better complexity of order $O(n\sqrt{n})$. I am not exactly sure whether the fixed-ness of the model makes this complexity better or worse in terms of efficiency, but all in all, the complexity (at least theoretically) is not fixed with respect to sparsity and therefore depends on $\zeta (S)$:

$$ O(\mathrm{complexity})\propto \frac{1}{\zeta (S)}\;. $$

When sparsity is higher, the complexity typically decreases since the model has to attend to fewer tokens in the token pool $\mathcal{T}_{i}$. This depends on the overall varentropy metric calculations, and so we arrive at the following theoretical complexity for the AES:

$$ O(\mathrm{complexity})\propto \frac{1}{\mathrm{varentropy}}\;. $$

It would be nice to actually run a model with AES and see how it changes the complexity and the validations. For now, we are still left with the question of what $\mathcal{D}(S)$ exactly is. Remember, though, that this is a continuous function taking values between $\mathrm{softmax}$ at $\zeta (S)=0$ and $\mathrm{sparsemax}$ at $\zeta (S)=\zeta _{\mathrm{max}}$. This definition bounds the entropy sparsity parameter between either extremely dense or extremely sparse attention weights. I used this definition to agree with the $\alpha$-entmax function, where $\alpha =1$ is $\mathrm{softmax}$ and $\alpha =2$ is the $\mathrm{sparsemax}$ function. This bounding may be changed if appropriate, since the $\alpha$ in our context is a similar learnable parameter of the model, optimized with stochastic gradient descent. Note that this is purely theoretical; in a real-time scenario it is very possible that this sparsity does not yield any efficiency-curve benefits at all.

One final thing I will mention is that there is the possibility of a $\text{sampling redundancy}$ in using entropy for the sparsity model. This redundancy arises when we comb through $\mathcal{T}_{i}$ enough to make the token pool reduce TO BE COMPLETED

And now, to the main objective: a mathematical description of this effective correlation. My first thoughts were to describe the partition function

$$ Z=\sum _{i}\exp \left( \mathbf{x}_{i}^{\textsf{T}} \right)\;, $$

where $\mathbf{x}_{i}^{\textsf{T}}$ is $x_{i}/T$, where $T$ is the temperature. This is just an observation from the $\mathrm{softmax}$ function and isn’t really anything special. Until you identify the $\text{free energy}$ as the negative of $\log Z$:

$$ F=-\log Z\;. $$

Then the $n$-point correlators or moments or whatever (sorry, I'm too QFT-brained rn) are given by expansions like

$$ \langle x_{1}\dots x_{n}\rangle =\frac{1}{Z}\frac{\partial^{n} Z(\xi )}{\partial \xi _{1}\dots \partial \xi _{n}}\;. $$

The $1$-point correlator $\langle x_{1}\rangle$ comes from differentiating the free energy with respect to the source, where the free energy looks like

\[ F=-\log \underbrace{\sum _{a} \exp \left( qk^{\textsf{T}}_{a}+\xi x^{\textsf{T}}_{a} \right)}_{\text{partition function}}\;. \]
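As a quick numerical check of this statement (hypothetical $q$, keys and values; note the sign, $\langle x\rangle = \partial _{\xi }\log Z = -\partial _{\xi }F$):

```python
# The 1-point correlator as the source-derivative of log Z (= -dF/dxi).
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))                   # hypothetical keys
v = rng.normal(size=6)                        # hypothetical (scalar) values
scores = K @ q

def logZ(xi):
    return logsumexp(scores + xi * v)         # Z(xi) = sum_a exp(q.k_a + xi * x_a)

eps = 1e-6
one_point = (logZ(eps) - logZ(-eps)) / (2 * eps)   # d logZ / dxi at xi = 0
assert np.isclose(one_point, softmax(scores) @ v)  # = softmax-weighted average (attention)
```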

This is nothing new and just straight-up comes from Vasu's paper Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters and the Yann LeCun et al paper on energy-based models. I think there are some interesting connections between these two approaches wrt entropy sampling, and possibly extensions to something like the Helmholtz free energy. When computing these terms in LLMs, it might be helpful to work with these quantities specifically in different settings. As an example, the $\log Z$ term might be useful in making the sampling uncertainty a more precise notion; see for instance Yann LeCun et al's paper for what the possible relation is. I will update this with a few more thoughts in a while.

Note of caution

I have to say, though, that most of this is purely theoretical at the moment and has not been personally tested anywhere. Not to mention that all of this is VERY crude and a lot of refinement will be necessary to obtain a basic working version of the entropy sparsity model I mentioned. Or maybe not, or maybe there is something key that I am missing altogether. There may, and most likely will, be mistakes; do not assume otherwise. But I will try to find a proper formalism for the partition function thing in the next two hours, hopefully. Or maybe I will update the GH \doc draft and make it public.

1018 hrs local time

Update 1: The free energy is not free

So writing $Z$ a little more formally (taking $\beta =1/T$):

$$ Z=\sum \exp \left( -\beta E \right)\;, $$

we could write the free energy in terms of $Z_{\beta }$ and $S$, and then rewrite our entire model in terms of $F$. But the point is that entropy-based sampling in itself suffices for the most part; there is no need to go through all the pain of computing $Z_{\beta }$ or to do anything even remotely related to EBMs. The worth of doing this free energy stuff is that we can adopt an EBM approach to sampling and obtain better optimization routines in terms of $\nabla F$, and the uncertainties in the pool can be made dynamic from this. But I have no stringent way to sell this idea yet. Emphasis on the yet.

Physicists and ML

 I asked ChatGPT to make a satire on physicists doing machine learning for the first time. Boy, oh boy, did it give me a good satire.


It was a fine Monday morning in the prestigious physics department. Coffee was brewed, chalk dust floated in the air, and equations crawled like spiders across blackboards. But something was different. Something… off. A tremor of confusion rippled through the corridors as the physicists gathered for their weekly meeting. Today was not about quantum entanglement or gravitational waves; today, they had entered the brave new world of machine learning.

Dr. Thompson, a renowned particle physicist who once made headlines for theorizing a sixth-dimensional particle, sat at his desk glaring at his laptop. "I don’t get it," he muttered. "What’s an epoch? And why does my model keep overfitting?" His finger hovered over a button that ominously read Run Code, as if pressing it might accidentally simulate the heat death of the universe.

Meanwhile, in the corner of the room, Professor Ivanov was vigorously typing Python code on a terminal screen. Ivanov was famous for deriving the quantum field theory of socks, yet now found himself writing something called a “neural net.” Having spent decades mastering the calculus of tensors, he stared at the TensorFlow documentation as if it were a poorly translated IKEA manual.

"This can't be harder than the renormalization group theory," Ivanov whispered to himself, unconvincingly. His model failed to converge for the seventh time. "I'll just add more layers," he declared with an air of misplaced confidence. "That’s how the brain works, right? More neurons, more intelligence!"


Across the hall, Dr. Patel was equally disoriented. After spending the last decade calculating dark matter distributions, she was now facing something even more elusive: hyperparameters. She clicked through the parameters in her notebook.

“Learning rate... momentum... regularization?” She paused, a frown growing on her face. “Sounds like the settings on a blender.”

Undeterred, she set the learning rate to 42, confident that the answer to everything would solve this machine learning conundrum. Alas, the model promptly exploded into a sea of NaNs (Not a Number), a result she hadn’t seen since her days of calculating infinities in string theory.

“Of course, NaNs,” she scoffed, “the true fundamental particles of deep learning.”


Meanwhile, the younger grad students, who were already fluent in machine learning buzzwords, lurked in the background. They had been waiting for this moment. "You have to use dropout and batch normalization, Professor," one of them advised with a grin, clearly relishing their newfound expertise over the once-untouchable professors.

"Dropout?" Thompson repeated. "You mean like how I feel after running this model for three hours with no results?" The room chuckled, but the grad students pressed on, eyes gleaming with the vigor of freshly downloaded Kaggle datasets.

"Also, try stochastic gradient descent," another student added. "It's like regular gradient descent, but, you know... stochastic!"


Then there was Dr. Li, the department's cosmologist, who was taking the most radical approach. "I don’t need machine learning," she proclaimed. "I’ll just derive the optimal weights myself." Hours later, she was seen attempting to analytically solve the backpropagation equations, pencil in hand, sweat pouring down her face. The grad students looked on in horror.

“You could just let the computer optimize the weights, you know…” one suggested timidly.

“Nonsense!” Li snapped. “If I can solve the equations of the universe, I can solve this!”


As the day wore on, more physicists succumbed to the absurdity. Dr. Williams, an experimentalist known for his work on superconductors, was now building neural networks like they were Lego blocks. His lab was covered in graphs. Each failed model led to a new theory.

“I’m onto something big,” he said, pointing at a chart of a loss function that looked suspiciously like a sine wave. “I call it oscillating accuracy. This could be a new phase of learning!” His eyes gleamed with the fever of discovery.

By 5 p.m., everyone had given up on their physics research and fully embraced the chaos of machine learning. Ivanov’s model had ballooned to 400 layers, requiring the computational power of CERN to train. Dr. Thompson was on his fifth coffee, muttering something about the curse of dimensionality and swearing vengeance on whoever invented ReLU activations. Meanwhile, Dr. Li had retreated to a dark corner, insisting that she was "so close" to deriving the weights by hand.

And yet, amid all the chaos, the physicists had unknowingly embraced one of machine learning’s greatest truths: no one really knows what’s going on half the time.

Finally, Dr. Patel stood up, weary but enlightened. "I’ve come to a conclusion," she announced to the room. "Machine learning... it's just statistical mechanics. But with better marketing."

The room erupted in laughter and nods of approval. It was official: physicists doing machine learning for the first time was not so different from the universe they studied—full of mystery, randomness, and moments of unexpected brilliance.

As the physicists packed up their laptops, they felt something they hadn’t felt in years: the excitement of being a beginner again. Maybe this machine learning thing wasn’t so bad after all.

Talk on LLMs and Transformers

 I gave an introductory talk on LLMs and Transformers yesterday. Here are the slides and the link to the YouTube archive of the recording. I am hoping to complete the short notes and the Adaptive Sparsity paper by next week. 

Essence of LLMs and Transformers I slides

Journal Club | The Essence of LLMs and Transformers talk

Generalized Entropy LOL

For all hep-th folks out there: this is a generalized entropy. Now go cry.


Fooking nightmare for physicists.

arXiv PaperRetriever 2

1829. After having consumed my caffeine dosage for the day, I am going to try to complete the second part of the arXiv paper retriever. What I want this to do is implement a triple-layered recommendation code that assigns a score $X$ to every item in the \Grab folder. Basically, here are the things I want it to do, in order.

First, it starts with a usual TF-IDF vectorizer, all the kiddy stuff, and assigns a score $A$. It then sends these to a Profiler, as I am dubbing it, which basically checks the previous publications and assigns a score $B$ based on the relevance of those publications to the search query. And then finally it goes into the DCA Profiler, which basically assigns a score $B'$ based on recency, number of citations and $B$. Then we simply integrate the whole thing into a score $A+B'$ and voila, that is the recommendation score $X$. Now to get to the hard part and code this stuff.
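Schematically, and only schematically (the three scorers below are placeholder stubs, not the actual TF-IDF, Profiler or DCAProfiler logic), the integration I mean is:

```python
# Structural sketch of the three-layer scoring; the scorers are placeholder stubs.
def tfidf_score(paper, query):           # layer 1 -> A
    return 0.7

def profiler_score(paper, my_pubs):      # layer 2 -> B
    return 0.4

def dca_profiler_score(paper, B):        # layer 3 -> B' (recency + citations, folded with B)
    return 0.6 * B

def recommendation_score(paper, query, my_pubs):
    A = tfidf_score(paper, query)
    B = profiler_score(paper, my_pubs)
    Bp = dca_profiler_score(paper, B)
    return A + Bp                        # X = A + B'

print(recommendation_score({"title": "some paper"}, "de Sitter algebras", []))
```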

1843. The first one isn't too hard and basically is almost figured out. The hard part is the Profiler and the DCA Profiler, because the whole point of THAT is to be as precise as possible. I guess I'll get to coding the initial Profiler first.

1907. Well, Profiler.py is done, and it wasn't complicated enough to need too many weird things. The DCAProfiler is the last touch and I think I can get it done in about 10 minutes, I guess, but I am way too distracted with Twitter. Lol.

1923. Yeah, so I had a problem with assigning the score. I couldn't let it be arbitrary, so I had to change the DCAProfiler score assigner with a slightly different idea. Instead of scoring by comparison on the interval 0 to 100, 0 to 1 is a better idea and is fairly easier than, say, comparing every two papers and returning a score based on that, because frankly I have no idea how that would work. So basically, let $n$ be the number of citations a paper has received. We then calculate the score $B'$ by also considering $n/(n+10)$, where 10 is a fairly arbitrary number but is essentially there as a regulator so that a paper with $n=0$ doesn't break the score or blow it up to $B'=\infty$. So there was at least a simplification with that.
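Something like this is what I mean; only the $n/(n+10)$ regulator is the point here, the recency factor and the equal weighting are placeholders:

```python
# Hypothetical DCAProfiler score B' in [0, 1]: citations + recency, folded with B.
from datetime import date

def dca_score(B: float, n_citations: int, published: date) -> float:
    cite = n_citations / (n_citations + 10)        # the +10 regulator from above
    age_years = (date.today() - published).days / 365.25
    recency = 1.0 / (1.0 + age_years / 5)          # made-up 5-year decay scale
    return B * 0.5 * (cite + recency)              # stays in [0, 1] if B does

print(dca_score(B=0.8, n_citations=12, published=date(2023, 5, 1)))
```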

1934. For the final integration, all that is left to do is to take all the recommendation steps and put the scores together. Now this is a bit of a dodgy thing because, so far, the TF-IDF, Profiler and DCAProfiler scripts aren't exactly working in sync and are basically just executing one after the other. Which is kinda dumb, and like, wtf, I suck at coding. But a better idea is to integrate everything sequentially and optimize the search query interface to make things easier for people.

2026. So I got some coding done and basically removed a lot of the excess code, although at this point it is super messy and I have no intention to sit through each line and figure out whether something could go over there or over here, or remove comments or whatever. But I am going to try and complete the HTML layout and do some testing.

-- Originally posted on 28/07/2024 0827. --

arXiv PaperRetriever

1715. As usual I am really bored and figured I could do some more coding for a while. I am kind of at the point where my web app is complicated and I have no idea what to do. Fortunately, one of my friends said "hack!" and that gave me an idea for a thing I am trying to work on, which is kinda nice, except it is supposed to be a repository where people can get suggestions based on code. Cool.

1732. I made a python file called PaperRetriever, which sounds fun. The basic idea is that it elaborates on the usual GET request with delayed pings and randomized delays to make the requests more "human-like", which is of course BS but works. I basically want it to send a sample of 50 requests with at least 15-20 seconds of randomized delay between each request, and to save every query result, sorted by lastUpdatedDate, into a folder in \PATH as a separate text file of the form yymm-nnnnn.txt. This shouldn't be too hard and probably will take around 5 minutes, because even a kid can write this stuff.
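A rough sketch of that retriever (assuming the feedparser package for the Atom feed; the query string and the exact filename scheme are my guesses at the above):

```python
# Fetch 50 recent arXiv entries one at a time, with randomized 15-20 s delays,
# and dump each as yymm-nnnnn.txt. Assumes the feedparser package.
import pathlib
import random
import time

import feedparser

API = "http://export.arxiv.org/api/query"
OUT = pathlib.Path("Grab")          # stand-in for the \PATH folder
OUT.mkdir(exist_ok=True)

def fetch(query="cat:hep-th", n=50):
    for start in range(n):
        url = (f"{API}?search_query={query}&start={start}&max_results=1"
               f"&sortBy=lastUpdatedDate&sortOrder=descending")
        entry = feedparser.parse(url).entries[0]
        arxiv_id = entry.id.split("/abs/")[-1]                  # e.g. 2410.12345v1
        fname = arxiv_id.split("v")[0].replace(".", "-") + ".txt"
        (OUT / fname).write_text(entry.title + "\n\n" + entry.summary)
        time.sleep(random.uniform(15, 20))                      # "human-like" delay

# fetch()   # ~15 minutes for 50 requests at this pace
```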

1743. Well, that was faster than I expected. I made a basic python file that should work for the retrieving part of the arXiv papers, which should be helpful in making my other web app thing where I want code-based recommendations. The remaining part is where I want to suggest related papers in a feed-algorithm-ish way. Which I will do at some other point, because right now I want to test this and go eat something because I am famished. But I think I want to take this mini-code thing to the next level.

1807. It would be cool if the code could do the following: given a sample dataset of size $N$ (assuming fairly large $N$) and a particular query $\mathbf{Q}$, it could retrieve the papers from a particular arXiv listing (for which a simple GET request should be more than enough), store them in a database (or, well, a folder in this case, because we are making separate .txt files for every paper), and, depending on the particular $\mathbf{Q}$, do an elementary ML-based scour to identify the best-fit results and return papers as a recommendation feed. The only issue being that you indeed need a large-$N$ database collection to actually go through and return any feed whatsoever. And in this case it would take time of order $N$ to get those $N$ requests, plus an average of 10 seconds of randomized delay each, which sucks. But well, who really gives a crap about present limitations, amirite?

2001. Nevermind, so basically I'm going to use sqlite and configure it so that there is a harv-db and a harv.py, and integrate it into the main PaperRetriever.py to keep things clean; this should allow me to go ahead with the more machine-learning side of things.
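Minimal version of what I mean by harv.py (the table layout is a guess):

```python
# harv.py sketch: a tiny sqlite store for the retrieved metadata.
import sqlite3

con = sqlite3.connect("harv.db")
con.execute("""CREATE TABLE IF NOT EXISTS papers (
                   arxiv_id TEXT PRIMARY KEY,
                   title    TEXT,
                   abstract TEXT,
                   updated  TEXT)""")

def store(arxiv_id: str, title: str, abstract: str, updated: str) -> None:
    con.execute("INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?)",
                (arxiv_id, title, abstract, updated))
    con.commit()

store("2410.12345", "Some paper", "Some abstract.", "2024-10-29")
```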

2242. Well, I am bored because I got out of a meeting and will be back in it in a while, but for now I want to be stimulated. I tested the code and, sure enough, it works. But now I want to use the TF-IDF (term frequency-inverse document frequency) structure to suggest papers from an ML model. So basically what I want to do is use a TF-IDF vectorizer to scour the retrieved papers, find the most relevant ones and display them in order. Unfortunately, I do not have a way of making it more sophisticated by using citations and other bibliography information, for which there are services (for instance Semantic Scholar and INSPIRE-HEP), but well, I'm not sure if I want to risk this presently.
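The TF-IDF part is basically this (scikit-learn; the toy abstracts are placeholders):

```python
# Rank retrieved abstracts against a query with TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_papers(query, abstracts):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(abstracts + [query])        # last row is the query
    sims = cosine_similarity(X[-1], X[:-1]).ravel()
    return sims.argsort()[::-1]                        # indices, most relevant first

abstracts = ["de Sitter algebras and generalized entropy",
             "energy based models for sampling",
             "padding oracle attacks on CBC"]
print(rank_papers("entropy based sampling in LLMs", abstracts))
```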

0955. Well, I slept in today because it is cold and idk, I just wanted to sleep a while more. Back to business. At this point the only functional parts of the code are the metadata fetcher and the TF-IDF algorithm. I want to take this a step further and make it look up the retrieved identifiers on INSPIRE, Semantic Scholar and Google Scholar (which should suffice, I guess), return the citations as well, and then suggest the feed based on this information. I guess it would be too much to ask to make it perfectly sophisticated and fully feed-ise it, but let's see.

Well, here is the problem. Google Scholar is pretty good, but there is no direct publicly available API to use in this case, and not to mention CAPTCHAs are a thing. Circumventing this would be trouble because Google kinda owns the entire "are you a bot or a human so we can know you're a bot or a human" Turing test thing. I guess the best thing is to assume a particular easy-to-break CAPTCHA and break it with image recognition and hope for the best.

United States v Aaron Swartz

1005: I would like to reiterate my stance on the highly illegitimate and overshot lack of proper judgement that was the federal prosecution of Aaron Swartz. So basically, the verdict is that if I was at MIT, and I made a script that downloads files from JSTOR or Springer and tried to make them open access, I am liable to 33 years in prison? There are two really amazing things about this: first, not even near-homicide charges reach this level, and if they do, I leave it up to you to figure out which was an actual crime. Second, journals that make money off people's research are somehow being considered the wronged party in this case? Because they do NOT own the research. Some of you douchebags may go "but isn't that exactly what journals are supposed to do?" -- to an extent, yes, but if I publish in JHEP or something, that does NOT mean that Springer OWNS my paper. Because my research is fundamentally owned by me, and I am essentially paying journals open-access money to promote it further. I am presuming arXiv was a thing then, but to access papers older than arXiv he had to do the downloads, which is reasonable. I commend JSTOR on not making things worse, but my central issue is with the prosecutors.

Calling Aaron's script and "actions" computer fraud uses an incredibly funny term that I realised these people like to use. And as far as I can remember, Mark Zuckerberg's hearings, Sundar Pichai's hearings and many others have firmly shown that the people in charge HAVE NO IDEA WHAT THEY ARE TALKING ABOUT. And the fact that they are the people who hand down the conviction sentences is definitely corruption at the highest level. And if they can get away with this, they can get away with anything, and this is not acceptable. At some point we have to address this case.

To be clear about who I am blaming: it is Carmen Ortiz's office that made this incredibly overshot judgement with no understanding of why the prosecution was necessary. And I stand in opposition to people saying that "that is the law and they are merely exercising their rights to legally pursue a case against Swartz". I'm sorry, but this is bullshit and there was no case needed here. Worst case, if JSTOR wanted to be an assbag about it, they could have settled more easily with just their "intellectual property" being confiscated and Aaron's JSTOR account being revoked, which is also an entirely illegitimate thing, but acceptable. What is NOT acceptable is prosecuting with 13 counts and 33 years of imprisonment for something that not even MIT or JSTOR had a big problem with. And I am not siding with any particular political party so much as to say that "obviously she was appointed by the Democrat side", which is something that one of my friends said.

Interestingly, Ortiz's husband Tom Dolan had the nerve to comment, saying [quote] "Truly incredible that in their own son's obit, they blame others for his death and make no mention of the 6-month offer", and such is the mindset of people who get to exercise this false power of "prosecution", as they term it. So I ask whether he would be willing to take the exact same 6-month offer, which seems so light to him, to show that the blame was indeed on Aaron. If he can do the 6 months, have his property confiscated and be alright with everything, I (and many others) will consider the 6-month offer a "good" offer that Aaron did not take. In a later hearing on Ortiz's prosecution, Eric Holder defended her, calling the case [quote] "a good use of prosecutorial discretion". Had the federal intervention not happened, Aaron would still be alive today with at most one legal case pursued. And if Ortiz's decision making was indeed "a good use of prosecutorial discretion", do tell me why Aaron Swartz is not alive today doing things that none of those cunts would have been able to even think of.

And later, the White House declined to act on a petition requesting to remove Ortiz from office, which is such a bad decision that whatever belief I had in the system retaining at least some residue of ethics and morality is broken.

For those of you interested to look at Aaron's repositories, here is his GitHub profile: https://github.com/aaronsw

Algorithms Everywhere. Part II

2330: Ok, so let's get down to business as to why encryption algorithms are strangely complicated$^{1}$. Let us suppose we are feeling very feisty and want to use the DES encryption algorithm, single-layered (probably doesn't matter much, but extra layers add to complexity, which is a good thing), and we are sticking with the CBC encryption mode. I'm lazy enough to not actually do anything with the Ruby side of things, so basically I want you to assume that this example makes sense somehow. Let us suppose that we have some initialization vector $X$, which for all purposes is random (for our example, assume that the word random is toned down in definition for reasons we'll see soon). OK, good. Assume that the plaintext is

plaintext: GTASA sucks

We want to encrypt this using DES-CBC. Suppose it turns into some ciphertext, which here is an entirely arbitrary string of characters with padding. We split the padded plaintext into two blocks $P=B_{1}+B_{2}$; CBC applies $X$ to the first block and then chains the first ciphertext block into the second. Recall that DES is a 64-bit block cipher, so padding fills the plaintext out to a multiple of 8 bytes before encryption. So suppose the ciphertext is created from these variables with some 64-bit $X=$

\x01\x01\x01\x01\x01\x01\x01\x01

and suppose that the final ciphertext takes the form of

ciphertext: \x13\x63\x17\x01\x98\x14\x63\x53\x19\x06\x23\x07\x71\x44\x10\x48

Now what we want to do is use an oracle to figure out the padding and separate the core ciphertext from the padding $\mathcal{P}$. While it seems like, for some randomly chosen ciphertext, this would be either difficult or way too time-consuming, in reality this is a polynomial-time algorithm that can be executed in finitely many steps. What we do is take the ciphertext $C$ (which we have no idea about) and run it through a decryption oracle, which tells us whether the chosen ciphertext decrypts to something "meaningful" (i.e., validly padded), so to speak. Changing any of the padding characters breaks the ciphertext into trash that doesn't make sense, and the oracle would tell us false instead of the true we get for $C$.

What a padding oracle attack does is take a second, arbitrary ciphertext block (which is meaningless and only an algorithmic artefact we use for this decryption), XOR it against the target byte by byte, and figure out what every character of the string is, one by one. In this way (with subtleties, but at least theoretically this suffices to allow more technical executions to fully use this process), one can fully recover the intermediate decryption state and subsequently the plaintext as well.
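Purely to make the mechanics concrete, here is a self-contained toy sketch (assuming pycryptodome is installed) that wraps DES-CBC with PKCS#7 padding in a local true/false oracle and recovers the last plaintext block byte by byte; the well-known false positive for the final padding byte is only noted in a comment, not handled:

```python
# Toy padding-oracle demo against DES-CBC (PKCS#7). Assumes pycryptodome.
import os
from Crypto.Cipher import DES
from Crypto.Util.Padding import pad, unpad

KEY = os.urandom(8)

def encrypt(plaintext: bytes) -> bytes:
    iv = os.urandom(8)
    return iv + DES.new(KEY, DES.MODE_CBC, iv).encrypt(pad(plaintext, 8))

def oracle(iv_and_ct: bytes) -> bool:
    """The true/false oracle: does the ciphertext decrypt to valid padding?"""
    iv, ct = iv_and_ct[:8], iv_and_ct[8:]
    try:
        unpad(DES.new(KEY, DES.MODE_CBC, iv).decrypt(ct), 8)
        return True
    except ValueError:
        return False

def recover_last_block(prev_block: bytes, target_block: bytes) -> bytes:
    """Recover P = D(target) XOR prev by forging the 'previous' block byte by byte."""
    inter = bytearray(8)                      # intermediate state D(target_block)
    for padval in range(1, 9):
        pos = 8 - padval
        for guess in range(256):
            fake = bytearray(8)
            for j in range(pos + 1, 8):
                fake[j] = inter[j] ^ padval   # force already-known bytes to padval
            fake[pos] = guess
            if oracle(bytes(fake) + target_block):
                # (For padval == 1 a rare false positive is possible; a full
                # attack re-checks by tweaking fake[pos - 1].)
                inter[pos] = guess ^ padval
                break
    return bytes(i ^ p for i, p in zip(inter, prev_block))

ct = encrypt(b"GTASA sucks")                  # IV + two 8-byte blocks
blocks = [ct[i:i + 8] for i in range(0, len(ct), 8)]
print(recover_last_block(blocks[-2], blocks[-1]))   # padded tail of the plaintext
```

Running it on the two-block "GTASA sucks" example prints the padded tail of the plaintext, which is the whole point: the oracle never hands you the plaintext, you reconstruct it from valid/invalid padding answers alone.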

This example is a single-layered DES algorithm, and if we add more layers (more block chain reversals and encryption layers) we might possibly end up with something that is far more secure. Triple-DES for instance is a pretty secure algorithm. Of course, there are ways to fix this, so there's that. I'll type the $P=NP$ post tomorrow morning.



Algorithms Everywhere. Probably Part I

0827: Issues with database management: this was something I had not fully considered (although I had known of this issue because of the Zucc's CS50 Harvard lecture). The problem goes like this: you have, say, $N$ users, and on average to each of these $N$ users we can attribute $\sim N$ corresponding connections (could be social connections, posts, blah blah). So the complexity of the data would scale like $O(N^{2})$. However, we usually consider more complicated cases where we are not only considering posts and social connections (i.e. finding mutuals), which boils down to a graph shortest-distance problem, which is Algorithms 101. The more complicated stuff usually involves the management of the heavier data (messages, pictures, videos, etc.), for which fortunately there does exist a pretty good work-around: compressing the pictures on the client side (so as to save resources on the server side, because compressing tonnes of pictures and videos on the server at the rate of $N$ times the average lowest data size, say $M$ megabytes per second, would suck the servers dry worse than the 634-in-a-day thing). There's another algorithms problem that arises with this, which is that some data is not always going to be retrieved from the database and will instead “stay on the layer” so to speak, and what data is supposed to be available like this is an optimization problem. I could say, for instance, that I want the feed to not refresh every time a user logs in, and basically keep a temporary log of what posts/stuff was displayed the first time the user logged in. So if the user has lost data connection, it becomes easier to just display previous posts and say “oops, your data kinda sucks right now” (if you're using AT&T it sucked always) instead of throwing a BSOD and kind of shitting the atmosphere of the website. But we can't always just keep track of every post a user has ever received, because (a) that is unbounded, and a user could have procrastinated three weeks and spent 23 hours a day scrolling to hit like 50k posts or something, which is too huge, and (b) that's a thing that could fuck up privacy. Or at least that's what I feel.

0836: Now here's the part that I am dreading. In order to implement messaging functionality, it is important to encrypt messages, because I don't want John to know what Alice and Bob are talking about. My first thought is to go for the usual RSA cryptography algorithm, where we have a private key and a public key and we use prime factorization. But then again, the thing is that RSA is a computationally expensive asymmetric algorithm and takes more time than something like AES, which is a symmetric algorithm and may be more suited for encrypting large data without compromising on encryption time. So I am considering using something like AES 256-bit encryption, which seems to be a pretty good deal. Speaking of 256-bit encryption, it is funny how some brands market themselves saying they use “military-grade encryption” and “unbreakable encryption” all while basically doing AES and nothing more. Don't get me wrong, AES is incredibly efficient and “unbreakable” in the sense of brute-force time. But again, it is not functionally “military grade” or some fancy shit; it is a very powerful encryption algorithm and not some hocus-pocus that the FBI made.
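For what I actually have in mind, something like the following is probably the shape of it (a sketch assuming pycryptodome; I am using AES-256 in GCM, an authenticated mode, rather than a bare block mode, since for messages you also want tampering to be detected):

```python
# Minimal AES-256 message encryption sketch (pycryptodome, GCM mode).
import os
from Crypto.Cipher import AES

key = os.urandom(32)                       # AES-256 key, shared between Alice and Bob

def encrypt(msg: bytes):
    cipher = AES.new(key, AES.MODE_GCM)    # authenticated: tampering is detected
    ct, tag = cipher.encrypt_and_digest(msg)
    return cipher.nonce, ct, tag

def decrypt(nonce, ct, tag):
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    return cipher.decrypt_and_verify(ct, tag)

nonce, ct, tag = encrypt(b"hi Bob, it's Alice")
assert decrypt(nonce, ct, tag) == b"hi Bob, it's Alice"
```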

0856: To be clear: messages are not the only place where encryption is an issue. For that matter, even if you consider something as straightforward as cookies, you know well enough how complicated this stuff is. The whole point of cookies and storing session_id's is that the user cannot simply modify them. So you need to implement a good enough encryption scheme, for which CBC is usually the first option. The deal with CBC is that you are vulnerable to padding oracle attacks, which are kind of fascinating because they rely on the very foundation of what CBC encryption modes do. Let me illustrate an example schematically.

You start with an initialization vector $X$, which for all CBC purposes is also the zeroth entry in the encryption scheme. So you have something like $C_{0}=X$ (indices start from 0 so that the first block starts with index 1). We assume that, for all valid purposes, $X$ follows the usual CBC protocol and is truly random. OK. Then, we take each plaintext block $P_{i}$ and the previous ciphertext block $C_{i-1}$ (with $C_{0}$ being the IV), and encrypt their XOR:

\[ C_{i}=\mathrm{En}\left( P_{i}\oplus C_{i-1} \right)\;, \]

and we generate the subsequent block-chaining scheme. Decrypting is a similar process. However, while this may seem quite secure (which it kind of is), the Microsoft ASP.NET vulnerability clearly showed that this “security” could be worked around by using a padding oracle attack, where the attacker makes a series of requests to the oracle to understand the padding sequence in the block chain. I will leave the technical details to a link to a very good post by Ron Bowes that gives a concrete example of such an attack. But the essential idea is that breaking the padding ends up generating an error code, and you can tweak things to obtain information about the padding and the ciphertext, and as a result the plaintext. So, in such cases, the user can decrypt the session_id. This is an obvious example of subtle vulnerabilities that might go unnoticed, but it is necessary to keep these in mind before making any security decision. And so on and so forth.
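Just to make the chaining formula above concrete, here is a small sketch (again assuming pycryptodome) that builds CBC by hand from a raw single-block cipher and checks it against the library's CBC mode; AES-128 here purely for illustration, and obviously not how you would encrypt anything in production:

```python
# CBC chaining built by hand from a raw block cipher, checked against MODE_CBC.
import os
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt_by_hand(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    E = AES.new(key, AES.MODE_ECB)                      # single-block E(.)
    padded = pad(plaintext, 16)
    prev, out = iv, b""                                 # C_0 = X (the IV)
    for i in range(0, len(padded), 16):
        c = E.encrypt(xor(padded[i:i + 16], prev))      # C_i = E(P_i XOR C_{i-1})
        out += c
        prev = c
    return out

key, iv = os.urandom(16), os.urandom(16)
msg = b"session_id=deadbeef"
assert cbc_encrypt_by_hand(key, iv, msg) == AES.new(key, AES.MODE_CBC, iv).encrypt(pad(msg, 16))
```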

0959: An even more appealing overall class of problems in cryptography is that of P vs NP, which is an open problem with a lot of implications for cryptography. I will be posting a continuation to this post in a minute. Not that proving $P=NP$ would necessarily break, in practice, "unbreakable" cryptography algorithms like RSA whose security rests on the assumed lack of polynomial-time attacks, but it is nonetheless important. And not really relevant to our present problem. But it is kinda cool and it lets me make love to my 'Tism, which is absolutely fun.

Bored.

 Well, in the last couple of days I have been getting bored very easily. Considering that I finished GTA 3 (DE) in 4 days, GTA SA (DE) in 5 days, and many of the movies I wanted to watch, I want to do something stimulating. Right now I am doing a bit of coding which seems very interesting, but I also need to be able to add in some extra functionalities for the project like being able to handle large amounts of video and audio calls through servers. Right now I am thinking of using RTC protocol because it seems very straightforward to set up and I will have it running on both my laptops (the new one is more for coding whereas the old one is for a temp. dummy production server) to test and see if I can add in extra features. Mostly APIs and stuff but it would be nice to see what else I can add.

Well I don't quite feel like blogging at the moment so get the fuck out of here.

Definition of Holography

 Well, since it has been too long without a good post, here is my talk for Theoretical Nexus, where I talked at a very elementary level about the definition of holography. Loosely based on some things that I am currently working on, both on my own and with Aayush. Weird that no one asked any questions (except Aayush, of course). Nonetheless, here it is.

Nexus Seminar | The Definition of Holography


I am also switching to Reaper from Cakewalk.