arXiv PaperRetriever 2

1829. After having consumed my caffeine dosage for the day, I am going to try to complete the second part of the arXiv paper retriever. What I want this to do is implement a triple-layered recommendation code that assigns a score $X$ to every item in the \Grab folder. Here are the things I want it to do, in order.

First, it starts with the usual TF-IDF vectorizer, all the kiddy stuff, and assigns a score $A$. It then sends these to a Profiler, as I am dubbing it, which checks the previous publications and assigns a score $B$ based on the relevance of those other publications to the search query. And then finally it goes into the DCA Profiler, which assigns a score $B'$ based on recent-ness, number of citations and $B$. Then we simply combine the whole thing into a score $X = A+B'$ and voila, that is the recommendation score. Now to get to the hard part and code this stuff.
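To make the first layer concrete, here is a minimal sketch of how a TF-IDF score $A$ could be computed using only the standard library. The function names (`tokenize`, `tfidf_scores`) and the smoothed IDF formula are my own assumptions for illustration, not what the actual retriever does.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(query, docs):
    """Return the cosine similarity (score A) between the query and each doc.

    Hypothetical helper: IDF is computed over `docs` only, with add-one
    smoothing so unseen query terms do not cause a division by zero.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    def idf(term):
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    def vectorize(tokens):
        return {t: count * idf(t) for t, count in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = vectorize(tokenize(query))
    return [cosine(q_vec, vectorize(tokens)) for tokens in tokenized]


abstracts = [
    "Quantum gravity and black holes",
    "Deep learning for image recognition",
]
scores = tfidf_scores("black holes", abstracts)
```

With these two toy abstracts, the first one shares both query terms and gets a positive score, while the second shares none and gets exactly 0.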

1843. The first one isn't too hard and basically is almost figured out. The hard part is the Profiler and the DCA Profiler, because the whole point of THAT is to be as precise as possible. I guess I'll get to coding the initial Profiler first.

1907. Well, Profiler.py is done, and it wasn't so complicated that I had to make too many weird things. The DCAProfiler is the last touch and I think I can get it done in about 10 minutes, but I am way too distracted with Twitter. Lol.

1923. Yeah, so I had a problem with assigning the score. I couldn't let it be arbitrary, so I had to change the DCAProfiler score assigner to a slightly different idea. Instead of scoring on the interval 0 to 100, 0 to 1 is a better idea and is a fair bit easier than, say, comparing every two papers and returning a score based on that, because frankly I have no idea how that would work. So let $n$ be the number of citations a paper has received. We then calculate the score $B'$ by also considering $n/(n+10)$, where 10 is a fairly arbitrary number but is essentially there to ensure that a paper with $n=0$ just contributes 0 instead of blowing up $B'$, and that the term always stays between 0 and 1. So there was at least a simplification with that.
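As a rough sketch of the citation term: $n/(n+10)$ is 0 for an uncited paper, 0.5 at ten citations, and creeps towards 1 as $n$ grows. The post does not spell out how this gets mixed with recent-ness and $B$, so `dca_score` below is purely a hypothetical combination (a plain average, with an assumed exponential recency decay and a made-up `half_life` parameter).

```python
def citation_factor(n, k=10):
    """Saturating citation term n / (n + k); k = 10 is the arbitrary constant."""
    if n < 0:
        raise ValueError("citation count cannot be negative")
    return n / (n + k)


def recency_factor(age_years, half_life=5.0):
    """Assumed recency term: 1 for a brand-new paper, halving every `half_life` years."""
    return 0.5 ** (age_years / half_life)


def dca_score(b, n, age_years):
    """Hypothetical B': average of B, the citation term and the recency term.

    If B is in [0, 1], the result stays in [0, 1] as well.
    """
    return (b + citation_factor(n) + recency_factor(age_years)) / 3.0


uncited = citation_factor(0)      # 0.0
ten_cites = citation_factor(10)   # 0.5
fresh_paper = dca_score(b=1.0, n=10, age_years=0.0)
```

The nice part of the saturating form is that the difference between 0 and 10 citations matters a lot, while the difference between 1000 and 1010 barely registers.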

1934. For the final integration all that is left to do is to get all the recommendation steps and put the scores together. Now this is a bit of a dodgy thing because so far, the TF-IDF, Profiler and DCAProfiler scripts aren't exactly working in sync and are basically just executing one after the other. Which is kinda dumb and like wtf I suck at coding. But a better idea is to integrate everything sequentially and optimize the search query interface to make things easier for people.
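One way the three stages could be chained in a single pass, instead of running as separate scripts, is sketched here. Everything in it is hypothetical: the `Paper` dataclass, the `recommend` function and the stage callables are stand-ins for the real TF-IDF, Profiler and DCAProfiler code, and the only thing taken from the plan above is $X = A + B'$.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    a: float = 0.0        # TF-IDF score A
    b: float = 0.0        # Profiler score B
    b_prime: float = 0.0  # DCA Profiler score B'

    @property
    def x(self):
        """Final recommendation score X = A + B'."""
        return self.a + self.b_prime


def recommend(papers, query, tfidf_stage, profiler_stage, dca_stage):
    """Run the three stages in order on each paper and rank by X (highest first)."""
    for paper in papers:
        paper.a = tfidf_stage(query, paper)
        paper.b = profiler_stage(query, paper)
        paper.b_prime = dca_stage(paper)  # may read paper.b set just above
    return sorted(papers, key=lambda p: p.x, reverse=True)


# Stub stages, only to show the wiring.
papers = [Paper("Old survey"), Paper("Black hole entropy")]
ranked = recommend(
    papers,
    query="black hole",
    tfidf_stage=lambda q, p: 1.0 if q in p.title.lower() else 0.0,
    profiler_stage=lambda q, p: 0.5,
    dca_stage=lambda p: p.b * 0.8,
)
```

Passing the stages in as functions keeps each script testable on its own while still giving one entry point for the search interface.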

2026. So I got some coding done and removed a lot of the excess code, although at this point it is super messy and I have no intention of sitting through each line to figure out whether that could go over there or over here, or remove comments or whatever. But I am going to try to complete the HTML layout and do some testing.

-- Originally posted on 28/07/2024 0827. --
