Word Embedding
These notes are drawn from the paper itself and from some Medium and towardsdatascience write-ups. I am onto this; hope I don't give up in the early stage!
Word Embedding carries this concept: ‘If two words are similar, they must have similar representations in the vector space’. The analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman.
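As a quick illustration, this kind of vector arithmetic is exactly what gensim's most_similar call exposes. A minimal sketch, assuming you have some pretrained word vectors on disk (the file name below is just a placeholder, not something from these notes):

```python
from gensim.models import KeyedVectors

# Placeholder path to pretrained vectors (e.g. the 300-d Google News vectors).
vectors = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# king - man + woman should land closest to "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```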
Word2Vec
Say our vocabulary contains 10,000 unique words, so the input vector has 10,000 neurons. The hidden layer has 300 neurons (the size Google used) and the output layer has 10,000. The output is softmax activated and the hidden layer is linear (no activation). So the hidden-layer weight matrix is 10,000 x 300, i.e. each of the 10,000 words gets a 300-dimensional vector. This is what the neural network learns!
So the end goal of all of this is really just to learn this hidden-layer weight matrix. Remember, the weights from hidden to output belong to the output neurons, not the hidden neurons! Once we have the hidden-layer matrix, we're done!!
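Here is a minimal numpy sketch of that architecture, just to pin down the shapes (the matrices are randomly initialized and the word index 927 is borrowed from the example below):

```python
import numpy as np

vocab_size, hidden_size = 10_000, 300

# Hidden-layer weights: one 300-d vector per word. This is what we want to learn.
W_hidden = np.random.randn(vocab_size, hidden_size) * 0.01
# Output-layer weights: 300 x 10,000, i.e. another 3 million weights.
W_output = np.random.randn(hidden_size, vocab_size) * 0.01

# One-hot input vector for the word at index 927.
x = np.zeros(vocab_size)
x[927] = 1.0

h = x @ W_hidden                        # linear hidden layer, shape (300,)
scores = h @ W_output                   # shape (10000,)
probs = np.exp(scores - scores.max())   # softmax over the whole vocabulary
probs /= probs.sum()
```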
If two different words have very similar “contexts” (that is, what words are likely to appear around them), then our model needs to output very similar results for these two words. And what does it mean for two words to have similar contexts? I think we could expect that synonyms like “intelligent” and “smart” would have very similar contexts.
But training with 10,000-dimensional one-hot inputs and a 10,000-way softmax output is too much of a burden: both of these layers would have a weight matrix with 300 x 10,000 = 3 million weights each!
So we go for tweaks. Also, our input is one-hot, right? What is the point of having such a giant 10,000 x 300 weight matrix when a single training example only affects one single row of it? I mean, think of it: say our input vector has a 1 in the 927th position and 0 everywhere else. Only that one row of the randomly initialized hidden weight matrix receives any update; the other rows stay untouched.
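A tiny check of that claim: multiplying a one-hot vector by the hidden weight matrix is nothing but picking out one row (a standalone sketch with the same shapes as above):

```python
import numpy as np

W_hidden = np.random.randn(10_000, 300)

x = np.zeros(10_000)
x[927] = 1.0          # one-hot input: a 1 in the 927th position

# The matrix product is exactly the 927th row of W_hidden.
assert np.allclose(x @ W_hidden, W_hidden[927])
```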
But we may run into a problem. Suppose our corpus has a lot of very frequent words that add no real meaning to the context whatsoever. For example: ‘The quick brown fox jumps over the lazy dog’. In this sentence, the word ‘the’ carries no significant importance.
Subsampling
Word2Vec uses a subsampling method to address the problem described above. Subsampling dilutes very frequent words, akin to removing stop-words. Subsampling can indeed improve performance for some tasks while decreasing it for others; there is no clear-cut rule for when subsampling helps.
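A sketch of how this could look in code, using the discard rule from the paper: an occurrence of word w is dropped with probability 1 - sqrt(t / f(w)), where f(w) is the word's fraction of all tokens and t is a small threshold (around 1e-5). The frequencies below are invented for illustration:

```python
import math
import random

def keep_prob(frequency, threshold=1e-5):
    """Probability of keeping one occurrence of a word whose corpus
    frequency (fraction of all tokens) is `frequency`. Frequent words
    are kept with low probability; rare words are almost always kept."""
    return min(1.0, math.sqrt(threshold / frequency))

# Invented frequency: pretend 'the' makes up 5% of all tokens in the corpus.
freq = {"the": 0.05}

sentence = "the quick brown fox jumps over the lazy dog".split()
filtered = [w for w in sentence if random.random() < keep_prob(freq.get(w, 1e-5))]
print(filtered)   # 'the' is usually subsampled away; the rarer words survive
```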
Negative Sampling
All of our weights (10,000 x 300) would be nudged very slightly by every one of our millions and millions of training samples. Negative sampling addresses this by having each training sample modify only a small percentage of the weights, rather than all of them.
Working mechanism:
Take the training pair (“fox”, “quick”): the input is the one-hot vector for “fox” and the desired output is the one-hot vector for “quick”. With negative sampling, we instead randomly select just a small number of “negative” words (let's say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0.) That is, we also update the weights of a few words that do not correspond to the desired output. Okay, let me explain further. We have “quick” as the output, right? Say ‘quick’ lies at the 92nd index, so it owns 300 output weights at that position. In negative sampling, we update the output weights of 5 randomly chosen words besides ‘quick’, e.g. the weights of ‘brown’, ‘lazy’, etc. We will also still update the weights for our “positive” word (which is the word “quick” in our current example).
The research paper says that “selecting 5–20 words works well for smaller datasets, and you can get away with only 2–5 words for large datasets.”
The output layer has a weight matrix of 300 x 10,000. We will only be updating the weights for our positive word (“quick”), plus the weights for the 5 other words that we want to output 0. That is a total of 6 output neurons, so we have 6 x 300 = 1,800 weights to update instead of 300 x 10,000 = 3,000,000.
In the hidden layer, only the weights for the input word are updated (this is true whether you're using negative sampling or not). HOW?? Shouldn't backpropagation flow from the output back to the hidden layer and update everything? No. The gradient of the loss with respect to a hidden-layer weight is dJ/dz · x, where x is the corresponding component of the input; since x is a one-hot vector, that component is 0 for every word except the input word, so only the input word's row needs updating.
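Putting the last few paragraphs together, here is a sketch of a single negative-sampling update in numpy (the word indices, learning rate, and initialization are all made up; the point is that only one row of the hidden weights and six columns of the output weights get touched):

```python
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.01, size=(10_000, 300))   # input -> hidden, 10,000 x 300
W_output = rng.normal(scale=0.01, size=(300, 10_000))   # hidden -> output, 300 x 10,000

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.025                                   # arbitrary learning rate
input_idx = 4211                             # made-up index of the input word "fox"
positive_idx = 92                            # index of the positive word "quick"
negative_idx = [17, 503, 2048, 7777, 9100]   # 5 randomly chosen negative words

v_in = W_hidden[input_idx]                   # the only hidden-layer row that changes
out_idx = [positive_idx] + negative_idx      # 6 output words in total
labels = np.array([1.0] + [0.0] * 5)         # want 1 for "quick", 0 for the negatives

u = W_output[:, out_idx]                     # (300, 6) output vectors of those 6 words
err = sigmoid(v_in @ u) - labels             # prediction error for each of the 6 words

W_output[:, out_idx] -= lr * np.outer(v_in, err)   # update 6 x 300 = 1,800 output weights
W_hidden[input_idx] -= lr * (u @ err)              # update only the input word's 300 weights
```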
Selecting Negative Samples
When picking negative samples, more frequent words are more likely to be selected. For example, the probability of picking the word ‘rooney’ depends on how often it occurs in the corpus; we could use count(rooney)/count(allWords). But the authors found that raising each count to the power of 3/4 gives more accurate results.
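A sketch of building that sampling distribution, P(w) = count(w)^(3/4) / sum over the vocabulary of count^(3/4), with invented counts, and then drawing negatives from it:

```python
import numpy as np

# Invented corpus counts; 'rooney' plays the frequent word from the example above.
counts = {"rooney": 500, "quick": 120, "fox": 80, "lazy": 40, "brown": 30, "jumps": 10}
words = list(counts)

weights = np.array([counts[w] for w in words], dtype=float) ** 0.75
probs = weights / weights.sum()      # P(w) with counts raised to the 3/4 power

# Draw 5 negative samples; more frequent words come up more often.
negatives = np.random.choice(words, size=5, p=probs)
print(list(zip(words, probs.round(3))))
print(negatives)
```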
Sample 5 words (in our case) from this distribution P(w) as the negatives. Well, easy Word2Vec, isn't it? The implementation is what will bug me, but I will check it out once I am done with GloVe.