arXiv PaperRetriever

1715. As usual I am really bored and I figured I could do some more coding for a while. I am kind of at the point where my web app is complicated and I have no idea what to do. Fortunately, one of my friends said "hack!" and that gave me an idea for the thing I am trying to work on, which is supposed to be a repository where people can get suggestions based on code. Cool.

1732. I made a Python file called PaperRetriever, which sounds fun. The basic idea is that it elaborates on the usual GET request with delayed pings and randomized delays to make the requests more "human-like", which is of course BS but works. I basically want it to send a sample of 50 requests, with at least 15-20 seconds of randomized delay between each request, and write every result, sorted by lastUpdatedDate, into a folder in \PATH as a separate text file of the form yymm-nnnnn.txt. This shouldn't be too hard and will probably take around 5 minutes, because even a kid can write this stuff.
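A minimal sketch of what that loop could look like, not the actual PaperRetriever code. The names `fetch_url` and `retrieve_papers` are hypothetical, and so is fetching one identifier per request through the arXiv API's `id_list` parameter. It also assumes new-style identifiers of the form yymm.nnnnn. The fetch and sleep functions are passed in so the loop can be tried out without hitting the network or actually waiting.

```python
import random
import time
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=30):
    """Plain GET request; returns the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def retrieve_papers(arxiv_ids, out_dir, fetch=fetch_url, sleep=time.sleep,
                    min_delay=15.0, max_delay=20.0):
    """Fetch each identifier with a randomized pause and save one file per paper."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, arxiv_id in enumerate(arxiv_ids):
        if index > 0:
            # Randomized pause, only between consecutive requests.
            sleep(random.uniform(min_delay, max_delay))
        url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
        # yymm.nnnnn becomes yymm-nnnnn.txt, matching the file naming above.
        target = out_path / (arxiv_id.replace(".", "-") + ".txt")
        target.write_text(fetch(url), encoding="utf-8")
        saved.append(target)
    return saved
```

Calling `retrieve_papers(["2401.00001"], "papers")` with the defaults would do a real request; the identifier here is just a placeholder.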

1743. Well, that was faster than I expected. I made a basic Python file that should work for the retrieving part of the arXiv papers, which should be helpful for my other web app thing where I want code-based recommendations. The part that is still missing is suggesting related papers in a feed-algorithm-ish way, which I will do at some other point, because right now I want to test this and go eat something because I am famished. But I think I want to take this mini-code thing to the next level.

1807. It would be cool if the code could do the following: given a sample dataset of $N$ papers (assuming fairly large $N$) and a particular query $\mathbf{Q}$, it could retrieve the papers from a particular arXiv (for which a simple GET request should be more than enough), store them in a database (or, well, a folder in this case, because we are making separate .txt files for every paper), and, depending on the particular $\mathbf{Q}$, do an elementary ML-based scour to identify the best-fit results and return those papers as recommendations. The only issue is that you need a large-$N$ collection to actually go through before you can return any feed whatsoever. And in this case the retrieval time grows with $N$: you need $N$ requests, plus an average of 10 seconds of randomized delay on each one, which sucks. But well, who really gives a crap about present limitations, amirite?
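As a rough back-of-the-envelope estimate of that cost (the symbols $t_{\mathrm{req}}$ for the time of one request and $\bar{d}$ for the mean delay, and the value $N = 1000$, are assumptions for illustration, not measurements):

$$T \approx N\,(t_{\mathrm{req}} + \bar{d})$$

With $\bar{d} = 10$ s and $N = 1000$, the delays alone come to $10^4$ s, or roughly 2.8 hours, before any request time is counted.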

2001. Never mind, so basically I'm going to use SQLite and configure it so that there is a harv-db and a harv.py, and integrate them into the main PaperRetriever.py to keep things clean. This should allow me to go ahead with the more machine-learning side of things.
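Roughly what the storage side could look like with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `papers` table, its columns, and the function names `open_db`, `save_paper` and `load_abstracts` are all made up for illustration and are not the actual harv.py.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the papers table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS papers (
               arxiv_id TEXT PRIMARY KEY,
               title    TEXT NOT NULL,
               abstract TEXT NOT NULL,
               updated  TEXT
           )"""
    )
    return conn


def save_paper(conn, arxiv_id, title, abstract, updated=None):
    """Insert a paper, or overwrite the stored row if the identifier already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO papers (arxiv_id, title, abstract, updated) "
            "VALUES (?, ?, ?, ?)",
            (arxiv_id, title, abstract, updated),
        )


def load_abstracts(conn):
    """Return (arxiv_id, abstract) pairs ordered by identifier."""
    return conn.execute(
        "SELECT arxiv_id, abstract FROM papers ORDER BY arxiv_id"
    ).fetchall()
```

Using the identifier as the primary key means re-fetching an updated paper replaces the old row instead of piling up duplicates.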

2242. Well, I am bored because I got out of a meeting and will be back in it in a while, but for now I want to be stimulated. I tested the code and sure enough, it works. But now I want to use TF-IDF (Term Frequency-Inverse Document Frequency) to suggest papers from an ML model. So basically what I want to do is use a TF-IDF vectorizer to scour the retrieved papers, find the most relevant ones and display them in order. Unfortunately, I do not have a way of making it more sophisticated by using citations and other bibliographic information, for which there are services (for instance Semantic Scholar and INSPIRE-HEP), but well, I'm not sure if I want to risk this at present.
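To make the TF-IDF idea concrete, here is a small hand-rolled sketch using only the standard library. It is a stand-in for a library vectorizer, not the code actually used here. The function names, the smoothed IDF formula and the cosine-similarity ranking are assumptions chosen for illustration, and the identifiers and abstracts in the usage note are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weigh(tokens, idf):
    """L2-normalized TF-IDF weights for one token list; unknown terms are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def tfidf_vectors(documents):
    """Return (vectors, idf) with one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: a term that appears in every document still gets weight 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [weigh(tokens, idf) for tokens in tokenized], idf


def rank(query, papers):
    """Rank {arxiv_id: abstract} by cosine similarity to the query, best first."""
    ids = list(papers)
    vectors, idf = tfidf_vectors([papers[i] for i in ids])
    q = weigh(tokenize(query), idf)
    # Both sides are unit vectors, so the dot product is the cosine similarity.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    return sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
```

For example, `rank("black hole entropy", {"2401.00001": "Black hole entropy and holography", "2401.00002": "Neural networks for image classification"})` puts the first paper on top and gives the second a score of zero, since it shares no terms with the query.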

0955. Well, I slept in today because it is cold and idk, I just wanted to sleep for a while more. Back to business. At this point the only functional parts of the code are the metadata fetcher and the TF-IDF algorithm. I want to take this a step further and make it look up the retrieved identifiers on INSPIRE, Semantic Scholar and Google Scholar (which should suffice, I guess), return the citations as well, and then suggest the feed based on this information. I guess it would be too much to ask to make it perfectly sophisticated and fully feed-ise it, but let's see.
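One way the citation information could feed into the ranking, once it is available, is to blend it with the text-similarity score. This is purely a sketch: the function name `blend_scores`, the log damping of citation counts and the default weight of 0.7 are all assumptions, and the citation counts are taken as given rather than fetched from any service.

```python
import math


def blend_scores(text_scores, citation_counts, weight=0.7):
    """Combine text similarity with a log-damped, normalized citation count.

    text_scores: {arxiv_id: similarity in [0, 1]}
    citation_counts: {arxiv_id: non-negative int}; missing ids count as 0
    weight: share given to text similarity (hypothetical default)
    """
    # log1p keeps a handful of very highly cited papers from dominating.
    damped = {i: math.log1p(citation_counts.get(i, 0)) for i in text_scores}
    top = max(damped.values(), default=0.0)
    blended = {}
    for arxiv_id, text_score in text_scores.items():
        citation_score = damped[arxiv_id] / top if top > 0 else 0.0
        blended[arxiv_id] = weight * text_score + (1 - weight) * citation_score
    return sorted(blended.items(), key=lambda pair: pair[1], reverse=True)
```

With equal text scores, the more cited paper comes first; a paper with no citation data simply falls back to its text score scaled by the weight.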

Well, here is the problem. Google Scholar is pretty good, but there is no direct publicly available API to use in this case, not to mention that CAPTCHAs are a thing. Circumventing this would be trouble, because Google kinda owns the entire "are you a bot or a human so we can know you're a bot or a human" Turing test thing. I guess the best thing is to assume a particular easy-to-screw CAPTCHA test, break it with image recognition and hope for the best.
