Many current contextual ad-targeting approaches claim to use semantics as their underlying processing method, but applied semantics comes in many varieties. One of the most common is the keyword-and-statistics approach used by Autonomy; another is the Natural Language Processing (NLP) approach used by ADmantX. This document compares the two vendors' approaches, which typify the differences, and argues that the incremental benefits of an NLP approach outweigh any additional costs.
Keyword-based systems combined with statistics range from stemming (stripping the ending of a word in order to work with its base form) to simple linguistic operations (finding terms that follow standard patterns of inflection). A thesaurus (an archive of keyword relations) may also be used. Statistical methods then aim to capture keyword relations using probabilistic algorithms, Bayes' theorem and hidden Markov models being the most prominent; because of their objectives, these are sometimes inappropriately labeled as semantic technology or natural language technology.
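The stemming step described above can be sketched as a crude suffix-stripper. This is an illustrative toy, not any vendor's implementation; the suffix list and length guard are assumptions chosen for the example.

```python
# Minimal suffix-stripping stemmer, illustrating keyword normalization
# to a base form. The suffix list is a hypothetical example, not a
# production stemming algorithm.
SUFFIXES = ("ing", "edly", "ed", "es", "s")

def stem(word: str) -> str:
    """Strip the first matching suffix to reach a crude base form."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when a reasonable stem remains (crude length guard).
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("targeted"))  # "target"
print(stem("running"))   # "runn" -- crude stems are acceptable, since
                         # all inflected forms still map to one key
```

Note that the output need not be a dictionary word; it only has to be a consistent key so that inflected variants collapse to the same term.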
Classification is done through Bayes' functions: values are created during the training phase, and at run time the system calculates which training group a document's keywords are probabilistically "closest" to, relative to the others. Entity extraction is done through regular expressions (logical operations on characters), sometimes assisted by Markov chains. The approach is limited in constructing the fullest expression of language logic, often referred to as a semantic triple (subject-predicate-object), which is used in establishing sentiment, direction of action, motivation, intention, and so on. The limitation stems from the variety of human expression, which current probabilistic models cannot capture.
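The two mechanisms above can be sketched in a few lines. This is a toy illustration under assumed data (the two training categories and the date pattern are hypothetical), not a vendor implementation: a naive Bayes classifier that picks the training group "closest" to a document's keywords, and a regular expression used for entity extraction.

```python
import math
import re
from collections import Counter

# Hypothetical training data: keyword samples per category.
training = {
    "sports": "goal match team score team win",
    "finance": "stock market price share market profit",
}

# Training phase: per-category keyword frequencies (the "values").
counts = {cat: Counter(text.split()) for cat, text in training.items()}
totals = {cat: sum(c.values()) for cat, c in counts.items()}
vocab = {w for c in counts.values() for w in c}

def classify(doc: str) -> str:
    """Return the category whose training keywords the document is closest to."""
    scores = {}
    for cat in counts:
        # Uniform prior assumed; sum log P(word|cat) with add-one smoothing.
        score = 0.0
        for word in doc.split():
            p = (counts[cat][word] + 1) / (totals[cat] + len(vocab))
            score += math.log(p)
        scores[cat] = score
    return max(scores, key=scores.get)

print(classify("the team can win the match"))  # sports

# Entity extraction via a regular expression (here, US-style dates).
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
print(DATE.findall("Launched on 4/15/2024."))  # ['4/15/2024']
```

The key point for the argument that follows: nothing in either mechanism represents who did what to whom, so a subject-predicate-object triple is simply out of reach for this machinery.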
Probability-based methods suffer from a zero-sum trade-off in performance: when an increase in precision is needed, it always comes at the expense of recall, and vice versa. Probabilistic systems draw "boundaries" between relevant documents and retrieved documents. These boundaries are rigid in the sense that they cannot be adjusted for a single document, a small group of documents, or other special conditions; any change shifts the probabilities for all documents.
Diagram 1: The diagram above represents both precision and recall. The relevant documents are to the left of the vertical line; the documents inside the oval are the retrieved documents, i.e. the documents the system thinks are correct. Precision is the proportion of retrieved documents that are actually relevant, marked as arrow 1. Recall is the proportion of relevant documents that are actually retrieved, marked as arrow 2.
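The two ratios in Diagram 1 can be computed directly from document sets. The document IDs below are hypothetical placeholders for the regions of the diagram.

```python
# Precision and recall computed from document sets, mirroring Diagram 1.
# The document IDs are hypothetical.
relevant = {1, 2, 3, 4, 5}    # left of the vertical line
retrieved = {3, 4, 5, 6, 7}   # inside the oval

true_positives = relevant & retrieved             # correctly retrieved
precision = len(true_positives) / len(retrieved)  # ratio 1
recall = len(true_positives) / len(relevant)      # ratio 2
print(precision, recall)  # 0.6 0.6
```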
In a probabilistic system, when work is done to increase precision by tightening the retrieval boundary so that a larger share of the retrieved documents is relevant, recall by definition worsens, as seen in Diagram 2 below.
Diagram 2: Changes that improve precision worsen recall, e.g. ratio 1 improves but ratio 2 worsens.
Conversely, in a probabilistic system, if work is done to improve recall by expanding the boundary to retrieve more documents, precision suffers, as shown in Diagram 3 below.
Diagram 3: Changes that improve recall worsen precision, e.g. ratio 2 improves but ratio 1 worsens.