arXiv PaperRetriever

1715. As usual I am really bored and I figured I could do some more coding for a while. I am kind of at the point where my web app is complicated and I have no idea what to do. Fortunately, one of my friends said "hack!" and that gave me an idea for a thing that I am trying to work on, which is kinda nice except it is supposed to be a repository where people can have suggestions based on codes. Cool.

1732. I made a python file called PaperRetriever which sounds fun. But the basic idea is that it elaborates on the usual GET request with delayed pings and randomized delays to make the requests more "human-like" which is of course BS but works. I basically want it so that it sends a sample of 50 requests with at least 15-20 seconds of delay with randomization between each request and return every query by lastUpdateDate into a folder in \PATH as a separate file text file of the form yymm-nnnnn.txt. This shouldn't be too hard and probably will take around 5 minutes because even a kid can write this stuff.

1743. Well that was faster than I expected. I made a basic python file that should work, for the retrieving part of the arXiv papers, which should be helpful in making my other web app thing where I want code-based recommendations. But the part is where I want to suggest related papers in a feed-algorithm-ish way. Which I will do at some other point, because right now I want to test this and go eat something because I am famished. But I think I want to take this mini-code thing to the next level.

1807. It would be cool if what the code could do is as follows: given a sample dataset $N$ (and assuming fairly large $N$) with particular query $\mathbf{Q}$, it could retrieve the papers from a particular arXiv (for which a simple GET request should be more than enough), store them in a database (or well a folder in this case because we are making separate .txt files for every paper), and depending on the particular $\mathbf{Q}$, it could do like an elementary ML-based scour for identifying the best fit results and return papers in that recommendation system. The only issue being that you indeed need a large $N$ database collection to actually go through and return any feed whatsoever. And in this case it would take a large $N$ time in order to get those $N$ requests, plus an average of 10 seconds of randomized delay, which sucks. But well, who really gives a crap about present-limitations, amirite?

2001 Nevermind, so basically I'm going to use sqlite and configure it so that there is a harv-db and harv.py and integrate it into the main PaperRetriever.py to keep things clean, and this should allow me to go ahead with the more machine learning side of things.

2242. Well I am bored because I got out of a meeting and will be back in it in a while, but for now I want to be stimulated. I tested the code and sure enough, it works. But now I want to use the TF-IDF (Term-Frequency Inverse-Doc-Frequency) structure to suggest papers from an ML model. So basically what I want to do is to use a TF-IDF vectorizer to scour the retrieved papers, find the most relevant ones and display them in order. Unfortunately, I do not have a way of making it more sophisticated by using citations and other bibliography information, for which there are services (for instance semiantic scholar and inspirehep) but well, I'm not sure if I want to presently risk this.

0955. Well I slept in today because it is cold and idk I just wanted to sleep for a while more. Back to business. At this point the only functional parts of the code are the metadata fetcher and the TF-IDF algorithm. I want to take this a step further and make it scour for the retrieved identifiers over inspire, semanbtic scholar and google scholar (which should suffice I guess) and return the citations as well, and then suggest the feed based on this information. I guess it would be too much to ask to make it perfectly sophisticated and fully feed-ise it, but let's see.

Well here is the problem. Google Scholar is pretty good but there is no direct publicly available API to use in this case, and not to mention CAPTCHAs are a thing. Circumnavigating this would be trouble because Google kinda owns the entire "are you a bot or a human so we can know you're a bot or a human" Turing test thing. I guess the best thing is to assume a particular easy-to-screw-CAPTCHA test and break it with image recognition and hope for the best.

United States v Aaron Swartz

1005: I would like to reiterate my stance on the highly illegitimate and overshot lack of proper judgement that was the sentence of the federal prosecution of Aaron Swartz. So basically, the verdict is that if I was at MIT, and I made a script that downloads files from JSTOR or Springer and tried to make them open access, I am liable ot 33 years in prison? There's two really amazing things about this: first, not even near-homicide statuses reach this level and if they do, I leave it up to you to figure out which was an actual crime. Second, journals that make money off people's research are somehow being considered mainstream in this case? Because they do NOT own the research. Some of you douchebags may go "but isn't that exactly what journals are supposed to do?" -- to an extent, yes, but if I publish in JHEP or something, that does NOT mean that Springer OWNS my paper. Because my research is fundamentally owned by me and I am essentially paying journals open access money to promote it further. I am presuming arXiv was a thing but to access papers earlier than arXiv he had to do the downloads, which is reasonable. I commend JSTOR on not making things worse but my central issue is with the prosecutors.

The idea that Aaron's script and "actions" are computer fraud is an incredibly funny term that I realised these people like to use. And as far as I can remember, Mark Zuckerberg's trial, Sundar Pichai's trial and many others have firmly shown that the people in charge HAVE NO IDEA WHAT THEY ARE TALKING ABOUT. And the fact that they are the people who are giving the conviction sentences is definitely corruption at the highest level. And if they can get away with this, they can get away with anything and this is not acceptable. At some point we have to address this case.

To be clear at who I am blaming, it is Carmen Ortiz's office that had this incredibly overshot judgement with no understanding of why the prosecution was necessary. And I stand in opposition to people saying that "that is the law and they are merely exercising their rights to legally pursue a case against Swartz". I'm sorry, but this is bullshit and there is no case needed here. Worst case if JSTOR wanted to be an assbag about it they could have settled easier with just all of their "intellectual property" being confiscated and revoking Aaron's JSTOR account, which is also an entirely illegitimate thing. But this is acceptable. What is NOT acceptable is prosecuting with 13 counts of charges and 33 years of imprisonment for something that not even MIT or JSTOR had a big problem with. And I have no siding with any particular political prty so much as to say that "obviously she was appointed by the Democrat side" which is something that one of my friends said.

Interestingly, Ortiz's husband Tom Dolan had the nerve to comment saying that [quote] "Truly incredible that in their own son's obit, they blame others for his death and make no mention of the 6-month offer", and such is the mindset of people who get to exercise this false power of "prosecution" as they term it. So I ask if he is willing to do the exact same 6-month offer which seems so light to him, to show that indeed the blame was on Aaron. If he can do the 6-months and have his property confiscated and is alright with everything, I (and many others) will consider the 6-month offer as being a "good" offer that Aaron did not take. In a later hearing on Ortiz's prosecution, Eric Holder defended her with her case being [quote] "a good use of prosecutorial discretion". Had the federal intervention not happened, Aaron would still be alive today with at most one legal case pursued. And if Ortiz's decision making was "a good use of prosecutorial discretion" indeed, do tell me why Aaron Swartz is not alive today doing things that none of those cunts would have been able to even think of.

And later, the White House declined to act on a petition requesting to removve Ortiz from the office, which is such a bad decision that my belief of the system being at least residued with some hope of ethics and morality is broken.

For those of you interested to look at Aaron's repositories, here is his GitHub profile: https://github.com/aaronsw

Algorithms Everywhere. Part II

2330: Ok, so let's get down to business as to why encryption algorithms are strangely complicated$^{1}$. Let us suppose we are feeling very feisty and we want to use the DES encryption algorithm, single-layered (probably doesn't matter much but extra layers add to complexity which is a good thing) and we are sticking with the CBC encryption mode. I'm lazy enough to not actually do anything with the Ruby side of things and basically I want you to assume that this example makes sense somehow. Let us suppose that we have some initialization vector $X$, which for all purposes is random (for our example assume that the word random is toned down in definition for reasons we'll see soon). OK, good. Assume that the plaintext is

plaintext: GTASA sucks

We want to encrypt this using DES-CBC. Suppose it turns into some ciphertext, which here is an entirely arbitrary string of characters with padding. We split the entire string of bits into two blocks $P=B_{1}+B_{2}$, and apply $X$ to the second block, call it $B_{2}X$. Recall that DES encryption takes paddings to create 32 bits of strings and the 64-bit ciphertext. So suppose the ciphertext created from these variables with some 32 bit $X=$

\so1\so1\so1\so1\so1\so1\so1\so1

and suppose that the final ciphertext takes the form of

ciphertext: \s13\s63\s17\s01\s98\s14\s63\s53\s19\s06\s23\s07\s71\s44\s10\s48

Now what we want to do is to use an oracle to figure out the paddings and find out the core ciphertext vs the padding $\mathcal{P}$. While this seems like for some randomly chosen ciphertext it would be either difficilt or way too time-taking, in reality this is a polynomial time algorithm that can be executed in finitely many steps. What we do is we take the ciphertext $C$ (which we have no idea about), and run it into a decryption oracle, which tells us if the chosen siphertext is ``meaningful", so to speak and visualise. Changing any of the padding characters breaks the ciphertext into trash that doesn't make sense, and the oracle would tell us false instead of true had we put in $C$.

What a padded oracle attack does it to find out whether every bit of the string XOR'd with a second arbitrary ciphertext (which is meaningless and only an algorithmic artefact we use for this decryption), and figure out what every character of the string is one-by-one. In this way (with subtleties but again, at least thoeretically this suffices to allow more technical executions to fully use this process), one can fully find the cuphertext and subsequently the plaintext as well.

This example is a single-layered DES algorithm, and if we add more layers (more block chain reversals and encryption layers) we might possibly end up with something that is far more secure. Triple-DES for instance is a pretty secure algorithm. Of course, there are ways to fix this, so there's that. I'll type the $P=NP$ post tomorrow morning.



Algorithms Everywhere. Probably Part I

0827: Issues with the database management: this was something I had not fully considered (although I had known of this issue because of the Zucc's CS50 Harvard lecture). The problem goes like this: You have say $N$ users, and on average to each of these $N$ users we can attribute $\sim N$ corresponding connections (could be social connections, posts, blah blah). So the complexity of the data would scale like $O(N^{2})$. However, we usually consider more complicated cases where we are not only considering posts, social connections (i.e. finding mutuals), which at that boils into a graph-short distance problem, which is Algorithms 101. The more complicated stuff usually involves the management of the heavier data (messages, pictures, videos, etc), for which fortunately there does exist a pretty good work-around by compressing the pictures on the client side (so to save resources on the server side because compressing tonnes of pictures and videos on the server at the rate of $N$ times average lowest data size say $M$ megabytes per second would suck the servers dry worse than the 634-in-a-day thing). There's another algorithms problem that arises with this, which is that on average some data is not always going to be retrieved from the database and will instead ``stay on the layer" so-to-speak, and what data is supposed to be available like this is an optimization problem. I could say for instance I want the feed to not refresh everytime a user logs in and basically keep a temporary log of what posts/stuff has been displayed the first time a user logs in. So if the user has lost data connection, it becomes easier to just display previous posts and say ``oops your data kinda sucks right now" (if you're using AT&T it sucked always) instead of throwing a BSOD and kind of shitting the atmosphere of the website. But we can't always just keep a track of every post a user has ever received, because (a) that is unbounded and a user could have procrastinated three weeks and spent 23 hours a day scrolling to hit like 50k posts or something which is too huge and (b) that's a thing that could fuck up with privacy. Or at least that's what I feel.

0836: Now here's the part that I am dreading. In order to implement messaging functionality, it is important to encrypt messages because I don't want John to know what Alice and Bob are talking about. My first thought is to go for a usual RSA cryptography algorithm where we have a primary key and a public key and we use prime factorization. But then again, but the thing is that RSA is a computationally expensive asymmetric algorithm and takes more time than something like AES, which is a symmetric algorithm and may be more suited for encryption of large data without compromising on encryption time. So I am considering using something like AES 256-bit encryption which seems to be a pretty good deal. Speaking of 256-bit encryption, it is funny how some brands market themselves saying they use ``military-grade encryption" and ``unbreakable encryption" all while basically doing AES and nothing more. Don't get me wrong, AES is incredibly efficient and ``unbreakable" in the sense of factorization time. But again, not functionally ``military grade" or some fancy shit, it is a very powerful encryption algorithm and not some hocus-pocus that the FBI made.

0856: To be clear: messages are not the only place where encryption is an issue. For that matter, even if you consider something as straightforward as cookies, you know well-enough how complicated this stuff is. The whole basis of cookies and storing session_id's is on the basis that the user cannot simply modify them. So you need to implement a good enough encryption algorithm for which CBC is usually the first option. Like the deal with CBC is that you are vulnerable to padded Oracle attacks, which are kind of fascinating because they rely on the very foundation of what CBC encryption modes do. Let me illustrate an example schematically.

You start with an initialization vector $X$, which for all CBC purposes is also the first index in the encryption schema. So you have something like $C_{0}=X$ (indices start from 0 so that the first block starts with index 1). We assume that for all valid purposes $X$ follows the usual CBC protocol and is truly random and contained. OK. Then, we start by taking the first plaintext $P_{i}$ and the first ciphertext (which is the initial encryption), and encrypt the XOR'd collection:

\[ C_{i}=\mathrm{En}\left( P_{i}\oplus C_{i-1} \right)\;, \]

and we generate the subsequent block chaining scheme. Decrypting is a similar process. However, while this may seem quite secure (which it kind of is), the Microsoft ASP.NET vulnerability clearly showed that this ``security" could be worked-around by using a padded oracle attack, where the attacker makes a series of requests to the oracle to understand the padding sequence in the block chain. I will leave the technical details with a link to a very good post by Ron Bowes that gives a concrete example of such an attack. But the essential idea is that breaking the padding ends up generating an error code, and you can tweak it to obtain information about the padding and the ciphertext, and as a result the plaintext. So, in such cases, the user can decrypt the session_id. This is an obvious example of subtle vulnerabilities that might go unnoticed, but it is necessary to have these in mind before making any security decision. And so on and so forth.

0959: An even more appealing overall class of problems in cryptography is that of P vs NP, which is an open problem with a lot of applications in cryptography. I will be posting in continuation to this post in a minute. Not that in general saying $P=NP$ is necessarily going to practically break unbreakable polynomial time based cryptography algorithms like RSA, but is nonetheless important. And not really relevant to our present problem. But that is kinda cool and it lets me make love to my 'Tism which is absolutely fun.

Definition of Holography

 Well, since it has been too long without a good post, here is my talk for Theoretical Nexus where I talked in a very elementary level on the definition of holography. Loosely based on some things that I am currently working on, solely as well as with Aayush. Weird that no one asked any question (except Aayush of course). Nonetheless, here it is.

Nexus Seminar | The Definition of Holography


I am also switching to Reaper from Cakewalk.