TF-IDF — Small Intro
TF-IDF is a simple idea in NLP. What defines the importance of a word in a context? Its occurrence. If a word appears often, it is probably contributing a lot to the context.
Say we are looking at tweets from a football fan. His tweets read like ‘I love a club named Man Utd, Man Utd is best, Man Utd is love’. As we can see, Man Utd repeats more than once, so analyzing his tweets we might say he loves Man Utd. If we calculate the frequency of ‘Man Utd’, it is 3. What is its occurrence rate? Treating ‘Man Utd’ as a single token, the tweet has 12 tokens in total, so 3/12 = 0.25. That is TF (Term Frequency).
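Here is a minimal sketch of that TF calculation in Python (my own helper, not from any particular library). Joining ‘Man Utd’ into one token before splitting is just an assumption for this example:

```python
from collections import Counter

def term_frequency(text):
    """Return TF (count / total tokens) for every token in a text."""
    # Assumption: join the multi-word name "Man Utd" into one token first.
    tokens = text.lower().replace("man utd", "man_utd").replace(",", "").split()
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}

tweet = "I love a club named Man Utd, Man Utd is best, Man Utd is love"
print(term_frequency(tweet)["man_utd"])  # 3 of 12 tokens -> 0.25
```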
What about IDF?
Let me explain further. What if we have a corpus like this?
‘A game today, so there is a chance for me to see a match today?’ I couldn’t come up with a better example while I was typing, so bear with me. The word ‘a’ here (I know I could have said the letter ‘a’, but let me call it a word) has the highest frequency in that sentence, but it is a stop word. A stop word is a very common word, like ‘a’, ‘the’, or ‘is’, that carries little meaning on its own.
So, it is not contributing much, but its TF score is still high. That is why we use IDF. IDF, short for Inverse Document Frequency, does this: it checks whether the word also repeats in other documents, and if it does, it is probably a stop word. You see where we are going? If your current word keeps repeating across other documents too, it is a stop word, so we try to down-weight it. How do we calculate IDF in our case?
IDF(word) = log(total number of documents we check / (number of documents containing that word + 1)). The +1 is Laplace smoothing; it prevents division by zero for a word that appears in no document.
So basically, what we are doing is checking whether that word repeats across documents, so that a word repeating in every document (a stop word) ends up with a low score. In our example, the IDF for ‘a’ would be log(2/2) = 0. I am not using Laplace smoothing here; you can use it if you want.
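Here is a matching sketch of that IDF in Python (the function name and the optional smoothing flag are my own):

```python
import math

def inverse_document_frequency(word, documents, smooth=False):
    """log10(N / number of documents containing the word), optionally smoothed."""
    doc_count = sum(1 for doc in documents if word in doc.lower().split())
    return math.log10(len(documents) / (doc_count + 1 if smooth else doc_count))

docs = [
    "i love a club named man_utd man_utd is best man_utd is love",  # "Man Utd" pre-joined
    "a game today so there is a chance for me to see a match today",
]
print(inverse_document_frequency("a", docs))        # log10(2/2) = 0.0
print(inverse_document_frequency("man_utd", docs))  # log10(2/1) ≈ 0.301
```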
So, TF-IDF for ‘a’ is (3/15) × log(2/2) = 0.2 × 0 = 0.
TF-IDF for ‘Man Utd’ is (3/12) × log(2/1) ≈ 0.25 × 0.301 ≈ 0.075 (using log base 10).
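Putting the two together, here is a compact, self-contained sketch that reproduces both numbers (same assumed tokenization as above):

```python
import math

def tf_idf(word, document, documents):
    """TF-IDF = (count in document / document length) * log10(N / docs with word)."""
    tokens = document.lower().split()
    tf = tokens.count(word) / len(tokens)
    doc_count = sum(1 for doc in documents if word in doc.lower().split())
    return tf * math.log10(len(documents) / doc_count)

docs = [
    "i love a club named man_utd man_utd is best man_utd is love",
    "a game today so there is a chance for me to see a match today",
]
print(tf_idf("a", docs[1], docs))        # (3/15) * 0 = 0.0
print(tf_idf("man_utd", docs[0], docs))  # (3/12) * 0.301 ≈ 0.075
```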
Drawback of TF-IDF (explained in another Medium post), which it shares with Bag of Words:
- It doesn’t care about synonyms, so it doesn’t understand context. For example, ‘The hotel is closed today due to lockdown’ and ‘Its holiday for staffs in restaurants’ are very similar in meaning. But since no words are repeated across them, TF-IDF finds no similarity at all; see the sketch below.
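To see this drawback concretely, here is a sketch using scikit-learn (my choice of library, not something the post prescribes): the two sentences share no tokens, so their TF-IDF vectors are orthogonal and the cosine similarity is exactly zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The hotel is closed today due to lockdown",
    "Its holiday for staffs in restaurants",
]
vectors = TfidfVectorizer().fit_transform(sentences)
# No shared words -> no shared dimensions -> similarity is 0,
# even though both sentences describe the same situation.
print(cosine_similarity(vectors[0], vectors[1]))  # [[0.]]
```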