Between Big and Small Data

Posted by Jarrod Valentine on June 28, 2017

I used the Dandelion Entity Extraction API on the text of each article and on the text accompanying each video to get ~150k entities. I then used a binarizer to encode each post (article or video) as a row of zeroes and ones, with one column for each category that the entities related to the post belong to. All the posts together made a matrix of about 6,500 posts by ~150k encoded columns, which is to say that I have a large and very sparse matrix.
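The post doesn't name the exact binarizer, but a minimal sketch of this kind of encoding, assuming scikit-learn's MultiLabelBinarizer and a hypothetical post_categories variable holding the category labels for each post, would look like this:

```python
# A minimal sketch of the encoding step. The variable names and sample
# data are hypothetical; the real input is one list of entity categories
# per article or video.
from sklearn.preprocessing import MultiLabelBinarizer

post_categories = [
    ["music", "festival", "travel"],
    ["music", "technology"],
    ["travel", "food"],
]

# sparse_output=True keeps the ~6,500 x ~150k matrix of zeroes and ones
# in a compressed sparse format instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(post_categories)

print(X.shape)        # (number of posts, number of distinct categories)
print(mlb.classes_)   # column order of the binary encoding
```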

Not Quite Big, But Not Quite Small Data

I tried to do some Principal Component Analysis to look at the data in a reduced-dimensional space, but a Pandas data frame of nearly 1 billion entries was too much to process. I tried using a compressed matrix format to avoid memory errors, but the computation was too intensive for my laptop. Rather than use the categories the entities belong to, I tried using the entities themselves, which reduced the encoding to ~55k columns. Unfortunately, PCA on this reduced matrix didn't give me any important insights.
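For sparse input, the usual stand-in for PCA is truncated SVD, since PCA requires centering the data (and therefore densifying the matrix). A sketch of that kind of attempt, assuming X is the sparse post-by-encoding matrix from above and not necessarily what I actually ran:

```python
# Dimensionality reduction on the sparse binary matrix. TruncatedSVD works
# directly on scipy sparse input, unlike PCA, which would need a dense,
# mean-centered array.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X)   # shape: (n_posts, 50)

# How much of the total variance the reduced space retains.
print(svd.explained_variance_ratio_.sum())
```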

This amount of data is not quite big enough to warrant some sort of MapReduce strategy, but not quite small enough to do more complex in-memory calculations on my local laptop. Luckily, the pairwise cosine similarity module in scikit-learn can operate on the sparse matrix directly, without expanding it into a dense array in memory. I found the cosine similarity between each pair of posts by taking the cosine similarity of the entire matrix with itself, so each entry of the result is the similarity score between two posts. One step closer to a recommender!
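A minimal sketch of that similarity step, again assuming X is the sparse binary matrix from the encoding stage:

```python
# scikit-learn's cosine_similarity accepts a scipy sparse matrix, so the
# ~6,500 x ~150k encoding never has to be densified; only the resulting
# ~6,500 x ~6,500 similarity matrix is dense.
from sklearn.metrics.pairwise import cosine_similarity

# Entry (i, j) is the cosine similarity between post i and post j.
similarities = cosine_similarity(X)

print(similarities.shape)   # (n_posts, n_posts)
print(similarities[0, 1])   # similarity score between post 0 and post 1
```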