Never Trust Sentiment Accuracy Claims - Social Media Explorer
Never Trust Sentiment Accuracy Claims

Sentiment analysis plays a key role in social intelligence (a generalization of social-media analytics) and in customer-experience programs, but the disparity in tool performance is wide. It’s natural that users will look for accuracy figures, and that solution providers, particularly the ones that claim superior performance, will use accuracy as a differentiator. The competition is suspect, for reasons I outlined in Social Media Sentiment: Competing on Accuracy. As that article explains, there is no standard yardstick for measuring sentiment-analysis accuracy. But that’s a technical point. Three issues are worth exploring further:

  • Providers, using human raters as a yardstick, don’t play by the same rules.
  • It’s fallacious that humans are the ultimate accuracy arbiters anyway. Can a machine in no way judge better (as opposed to faster or more exhaustively) than a person?
  • This focus on accuracy distracts users from the real goal, which is not 95% analysis accuracy but rather support for the most effective possible business decision making.

To explore —

Human benchmarks

We benchmark machine performance, on purely quantitative tasks, against natural measures: luminous intensity against a model of the sensitivity of the human eye (the candela), and mechanical-engine output against the power of draft horses (horsepower). But just as a spectrometer measures light at wavelengths humans cannot see, and quantifies visible-wavelength measurements in a way humans never could, and a Saturn V rocket will (or could) take you places an animal could never go unassisted, I believe that sentiment and other human-language analysis technologies, when carefully applied, can deliver super-human accuracy. I believe it is no longer true that “The right goal is for the technology to be as good as people,” as Philip Resnik, a University of Maryland linguistics professor and lead scientist at social-media agency Converseon, puts it.

As Professors Claire Cardie and John Wilkerson explain, “The gold standard of text annotation research is usually work performed by human coders… In other words, the assessment is not whether the system accurately classifies events, but the extent to which the system agrees with humans where those classifications are concerned.”

“Agrees with humans”

Note the statement, “the assessment is not whether the system accurately classifies events, but the extent to which the system agrees with humans where those classifications are concerned.”

And consider a company, Metavana, that competes on accuracy, with claims of 95-96% performance on combined topic extraction and sentiment analysis. Metavana President Michael Tupanjanin says the company measures accuracy “the old fashioned way.” According to Tupanjanin, “We literally will take — we recently did about 3,000 quotes that we actually rated, and we sat down with a bunch of high school kids and actually had them go through sentence by sentence by sentence and see, how would you score this sentence?” I praise Metavana’s openness, but this approach is backwards, as we shall see. It assesses whether humans agree with the machine, not whether the machine agrees with humans, per established methods.
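
The established protocol can be made concrete. Here is a minimal sketch, in Python with invented labels, of scoring machine output against a human gold standard: raw agreement, plus Cohen’s kappa, the standard chance-corrected agreement statistic used in annotation research (the function names and example labels are mine, not from any vendor’s methodology):

```python
from collections import Counter

def agreement(machine, human):
    """Raw agreement rate between machine labels and a human gold standard."""
    assert len(machine) == len(human)
    return sum(m == h for m, h in zip(machine, human)) / len(machine)

def cohens_kappa(machine, human):
    """Chance-corrected agreement (Cohen's kappa)."""
    n = len(machine)
    po = agreement(machine, human)                       # observed agreement
    mc, hc = Counter(machine), Counter(human)
    labels = set(mc) | set(hc)
    pe = sum((mc[l] / n) * (hc[l] / n) for l in labels)  # agreement expected by chance
    return (po - pe) / (1 - pe)
```

The direction of comparison matters: the humans label independently first, and the machine is scored against them, not the reverse.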

According to Erick Watson, the company’s director of product management, the software identifies entities and topics and then further mines sources for sentiment expressions. In the automotive sector, says Watson, the engine identifies expressions “such as ‘fuel efficient’ or ‘poor service quality’ and automatically determines which of these sentiment expressions is associated with [a] brand.” Sounds reasonable, but then Watson wrote me, “Expressions that contain no sentiment-bearing keywords are classified as neutral (e.g. ‘I purchased a Honda yesterday.’)”

I ran a Twitter poll on Watson’s ‘I purchased a Honda yesterday.’ Of 22 respondents, 45% rated it neutral and 55% rated it positive. Humans may see sentiment in an expression that contains no sentiment-bearing keywords! Metavana’s summary dismissal of such expressions, coupled with an accuracy-measurement method that restricts evaluation to machine-tagged expressions (the ones the company doesn’t dismiss), inflates the company’s accuracy results.
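
A back-of-the-envelope calculation shows how large the inflation can be. All the numbers below are invented for illustration (they are not Metavana’s figures); the neutral-agreement rate echoes the 45/55 poll split:

```python
# Illustrative numbers only, not any vendor's actual figures.
tagged = 600            # expressions the engine assigns sentiment to
tagged_correct = 0.95   # claimed accuracy on those
skipped = 400           # keyword-free expressions defaulted to neutral
skipped_correct = 0.45  # share of those that humans also call neutral

reported = tagged_correct  # accuracy when evaluation covers tagged expressions only
overall = (tagged * tagged_correct + skipped * skipped_correct) / (tagged + skipped)
print(f"reported: {reported:.0%}, overall: {overall:.0%}")
```

Under these assumptions, a 95% figure measured only on machine-tagged expressions corresponds to 75% accuracy over the full data set.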

There’s more to the accuracy appraisal.

Beyond humans

I believe that sentiment and other human-language analysis technologies, when carefully applied, can deliver super-human accuracy. True, we’re years from autonomous agents that can navigate a world of sensory (data) inputs and uncertain information in order to flexibly carry out arbitrary tasks, which is what humans do. But arguably, we can design a system that can, or soon will be able to, conduct any given task, whether driving a car or competing at Jeopardy, better than a human ever could.

A first attempt at automating a process typically involves mimicking human methods, but an intelligent system may reason in ways humans don’t. In analyzing language, in particular, machines look for nuance that may emerge only when statistical analyses are applied to very large data sets. That’s the Unreasonable Effectiveness of Data when, per Google’s Peter Norvig, “the hopeless suddenly becomes effective, and computer models sometimes meet or exceed human performance.” That’s not to say that machines won’t fail, badly, in certain circumstances. It is to say that overall, in the (large) aggregate, computers can and will outperform humans both on routine tasks and by making connections — finding patterns and discovering information — that a human never would.

Think of this insight as an extension of the Mythical Man-Month corollary, that “Nine women can’t make a baby in one month.” A machine can’t make a baby at all, but one can accelerate protons to near light speed to create sufficient mass for collisions to result in the generation of unseeable, but inferable, particles, namely Higgs bosons. Machines can already throw together (fuse) text-extracted and otherwise-collected information to establish links and associations that a human (or nine hundred) would never perceive.

Philip Resnik’s attitude, the established attitude, that “the right goal is for the technology to be as good as people,” is only a starting point. We seek to create machines that are better than humans, and we should measure their performance accordingly.

The accuracy distraction

My final (but central) point is this: The accuracy quest-for-the-best is a distraction.

Social intelligence providers often claim accuracy that beats the competition’s. (Lexalytics and OpenAmplify should be pleased that they’re the benchmarks new entrant Group of Men chose to compare itself to.) Providers boast of filtering the firehose. They claim to enable customers to transform into social enterprises, as if presenting or plugging into a widget-filled social-analytics dashboard, with simplistic +/- sentiment ratings, were the key to better business operations and decision making. Plainly stated —

The market seeks ability to improve business processes, to facilitate business tasks. Accuracy should be good enough to matter, but more important, analytical outputs should be useful and usable, aligned to business goals (positive/negative sentiment ratings often aren’t) and consumable within line-of-business applications.

I’m interested in how your technology and solutions made money for your customers, or helped them operate more efficiently and effectively, or, for that matter, saved lives or improved government services. The number that counts is demonstrated ROI.

Disclosure: Earlier this year, Converseon engaged me for a small amount of paid consulting and was a paying sponsor of my November 2011 Sentiment Analysis Symposium.

And a plug: Check out the upcoming Sentiment Analysis Symposium, slated for October 30, 2012 in San Francisco, preceded by a half-day Practical Sentiment Analysis tutorial, to be taught by Diana Maynard of the University of Sheffield, UK.


About the Author

Seth Grimes
Seth Grimes is the leading industry analyst covering natural language processing (NLP), text analytics, and sentiment analysis technologies and their business applications. He consults via Alta Plana Corporation and organizes the Sentiment Analysis Symposium and LT-Accelerate conferences.

  • It makes perfect sense to measure the quality of text analytics by the time/effort gains it helped to achieve in various business applications. But such applications are all unique, there are all sorts of other factors involved, so if one wanted to compare the accuracy of one system/algorithm to another, abstracted from their specific uses, this evaluation method would not work. One would need to look at benchmark datasets comparing automatic analysis to human analysis.

    And surely, such benchmarks should reflect the most common business uses of sentiment analysis –  for example, identifying sentiment associated with individual brand mentions. I guess if vendors, who have been doing the evaluation work, made their benchmark datasets public, that would make it possible to determine the best industry standards. So I’d very much welcome public releases of these benchmark data.

  • jordanfrank

    I’m playing devil’s advocate here a bit, but couldn’t the argument be made that accuracy numbers can be trusted and play an important role in evaluating a product? What they indicate is how well one would expect the product to perform if one were to use it in a manner similar to the experiment that generated the numbers (assuming the experimental methodology was sound).

    If I look at a 95% accuracy rating, what that means to me is that if I do the same thing they did, I would expect around 95% accuracy on my task. The key is understanding whether your task is similar enough to the one on which the accuracy number was generated.

    I agree with your statement that we should look to exceed human performance. In your example, ‘I bought a Honda’, I have to imagine that for any marketer, someone posting to all of their friends that they purchased your product is a pretty positive thing. So if a product has a 95% accuracy number, then I would expect that if I trained it, labeling this statement as positive, then I can be quite confident that similar statements ‘I bought a Civic’, for instance, would be classified the way I want them to be. 

    I do agree that if accuracy numbers are taken to mean “If I buy this product and just plug it in and press ‘go’, it will perform whatever task I’m interested in with 95% accuracy right out of the gate”, then sure, the numbers are being misinterpreted (but not mistrusted, really). 

    Anyway, I’m interested in what you think about this. I know that article titles have to be somewhat hyperbolic to catch people’s attention, but isn’t your title completely bunk? What does “trust” have to do with any of this, unless you are suggesting that these companies are lying about the numbers?

  • Meta Brown

    It’s so easy to get wrapped up in the search for sentiment accuracy and forget to ask what we’re going to do with sentiment analysis and why. Let’s say we could know, perfectly, whether the intended sentiment associated with each social media post was positive, neutral or negative. So what? What could we do with that knowledge?

    As a corporate kinda gal, I’d want to use that information to try and sell something. So I’d use the sentiment as a factor when deciding whether to make an offer, what offer to make, and what copy, images and layout to use in ads. I’d combine that information with anything else I knew about the individual – demographics, buying history and so on – and then test, test, test. Others might be interested in actions other than purchasing – perhaps seeking votes, donations or other actions – but the process would be similar.

    Although we don’t have perfect sentiment analysis, and we never will, nothing prevents us from using sentiment analysis and other types of text analytics in that kind of application. We’d have objective ways of measuring results and value, and we wouldn’t need perfection, just sufficient improvement on our current process to outweigh the investment.

    Meta Brown

  • Hi Seth,

    First off – What a powerful piece you’ve shared. The annotation that was made from
    your views is quite undeniable – Even logic when thinking about the facts, opinions
    and considering different circumstances that may play a part. Though I tend to
    agree to the fullest with you. 

  • Eugene Borisenko

    I generally tend to agree. NLP providers go to lengths such as calling sentiment “intent to purchase”, when any consumer insights pro knows the extent of research exercises needed to assess intent. I think positive vs. negative rating based on contextual analysis is a directional indicator of emotion. What marketers should be looking for is a disproportionate number of positives vs. negatives, or the other way around. That may indicate a crisis risk or success of a campaign.

  • Rob Key

    Thanks Seth for putting attention on a much needed issue — the need for verifiable performance standards.   We, at Converseon, have suggested one approach.  

    I think it’s important for brands though to not simply dismiss sentiment claims, but to truly scrutinize them and demand better from the industry. We hope this helps spur users to dig a little deeper — is sentiment on a record or mention, how does it compare to the human standard, and how reliable is that human standard (is it using intercoder reliability, for example)? The power of social data is profound, and the ability to truly understand sentiment, emotion and more is powerful. Unfortunately, to date many claims have not been verifiable, subject to hyperbole, not standardized, and certainly not testable. We hope this helps spur greater discussion that lifts the industry up to higher standards and truly begins to demystify sentiment. Transparency is critically important if we want organizations to truly embrace this intelligence in a meaningful way.

  • 1. Definitely agree that providers rig the numbers in their favor using various techniques. I can build a sentiment analysis system that only looks for instances of “I love X [product]” or “I hate [product]” and score those as positive and negative respectively, leave everything else as neutral, and achieve 95% accuracy. But that doesn’t mean I’ve solved a problem.

    2. Agree (in most cases) that accuracy is a distraction. It’s still needed, however, if I’m basing a decision on the analysis of the data — I’d prefer a system that is 80% accurate over one that is 50% (or worse).

    3. Completely disagree on your belief that automated sentiment analysis can become “super-human.” You’re confusing things that a machine can do or determine that are generally objective — driving a car, answering questions on Jeopardy, and even looking for patterns in text that expose relationships — with something that is subjective. The difference is there is no “correct” answer here that is evident from the text itself (which is the only data point in almost all cases). If I type, “boy I reeealllyy loved the Avengers,” is that a positive statement or a (sarcastic) negative one? You don’t know unless you ask me or can read my mind, and there is nothing you (or a machine) can do with that text to determine the “truth.” The best you can do is try to interpret as well as a human. In fact it could be argued that the whole point here is to determine whether readers of these statements interpret them as positive or negative (social media swaying opinion), not whether they actually “are.”

  • Seth:

    Nice blog post.

    Claims on absolute accuracy are mostly BS.  More accurate is better, but the need for perfect accuracy is highly dependent on the use case.  The technology will continue to move forward (one month at a time . . .) and improve, but use case matters.

    If your use case is to find any negative comment about your brand, then missing anything is a problem and the holy grail of perfect automated sentiment matters deeply.

    If your use case is to reveal patterns in massive piles of SM data over time, across competitors, segments, topics and issues, then perfect accuracy is not necessary.

    What is necessary is a robust  automated methodology that highlights the inflection points – when a trendline shifts sharply upward, or when a competitor’s sentiment scores change dramatically.  In this use case, we use these inflection points to identify the areas (piles of data) that warrant further analysis so we can learn something about market/brand/segment dynamics.

    Finding these patterns in massive piles of data (remember, only 5% of all food conversation mentions any brand) is where marketers can find the opportunities for innovation, leverage and differentiation – and I think sentiment is a pointer – not an absolute metric.

    My $0.02

    Tom O’Brien
    NM Incite

