Be wary of data manipulation

Karen Ahn


When data analysts want to show their findings to their teams, or to broader audiences on the web, they need to be efficient and effective in their presentations.

A number of third-party tools, such as Tableau, are making it easier to turn raw data into easily digestible graphs and charts.

As students, we receive our fair share of learning about chart-building, but as we read more graphs, and as we interpret and draw conclusions from them, we need to understand their nuances and, in some cases, their hidden messages.

In the lead-up to the U.S. election, it’s worth examining a few fictional graphs that show how candidates’ performances since they announced their bids can be misrepresented.

As you will see in the example below, one party might be inclined to portray the rating of its candidate in a certain way to gain supporters. It’s important to pay attention to who is publishing data and how the information is being presented to the public.

The above graph includes ratings for Candidate A and Candidate B since September, 2015. This simple line graph illustrates that the monthly ratings have been quite close between the two candidates, making it difficult to predict what might happen leading into an election.

A voter could conclude that Candidate A and Candidate B started with the same ratings, that neither has seen a dramatic increase or decrease in the past six months, and that the race is too close to call as of February. The campaign managers could begin to analyze the efforts put forth each month to see what worked, what didn’t and how they should strategize moving forward.

The same data is in the next graph, below, but do you see the difference? It includes another candidate, Candidate C, who had the highest rating from February, 2015 until August of that year. However, Candidate C quit in August, despite the high rating, and since then, Candidate A and Candidate B have been the only two contenders.

Not only that, Candidate B’s rating was much higher before Candidate C dropped out of the race, and since then, Candidate A and Candidate B have been neck-and-neck, despite Candidate B’s relatively strong start in February.

Armed with this new information, a voter should question the motivation of the party that published the first graph. Why did it include only data from September, 2015? Would publishing the second graph have hurt either Candidate A or Candidate B?
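One way to see how much the starting month matters is to compute each candidate’s net change over both windows. The following is a minimal Python sketch, not the method behind the graphs in this post. The monthly ratings are invented for illustration (only Candidate B’s drop from 6.15% to 5% comes from the example), and the names `MONTHS`, `RATINGS` and `net_change` are hypothetical.

```python
# Hypothetical monthly ratings (percent) for a fictional race.
# All values are invented for illustration; only Candidate B's
# drop from 6.15 to 5 is taken from the example in the post.
MONTHS = [
    "2015-02", "2015-03", "2015-04", "2015-05", "2015-06", "2015-07",
    "2015-08", "2015-09", "2015-10", "2015-11", "2015-12", "2016-01",
    "2016-02",
]

RATINGS = {
    "A": [4.0, 4.0, 4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.0, 5.1, 5.1, 5.2, 5.2],
    "B": [6.5, 6.4, 6.3, 6.3, 6.2, 6.15, 5.0, 5.0, 5.1, 5.0, 5.1, 5.1, 5.1],
}


def net_change(candidate, start_month):
    """Change in rating, in percentage points, from start_month to the latest month."""
    start = MONTHS.index(start_month)
    series = RATINGS[candidate]
    return round(series[-1] - series[start], 2)


# Compare the short window (first graph) with the full window (second graph).
for window_start in ("2015-09", "2015-02"):
    for candidate in ("A", "B"):
        change = net_change(candidate, window_start)
        print(f"since {window_start}: Candidate {candidate} changed by {change:+.2f} points")
```

With these made-up numbers, both candidates look nearly flat when the window starts in September, while the full window shows Candidate A gaining and Candidate B losing ground. The data is identical in both cases; only the choice of starting month changes the impression.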

After viewing the second graph, you might also ask the following questions:

  • What happened to Candidate B in August to cause such a large drop, from 6.15% to 5%? Is this drop related to Candidate C’s exit?
  • What is causing Candidate A’s steady growth since April?
  • Where have the supporters of Candidate C gone? Neither Candidate A nor Candidate B was able to increase their ratings despite Candidate C’s departure.
  • Could it mean general interest in this fictional election has decreased since Candidate C’s departure, and if so, why?

After reading the first graph, a simple conclusion might be that if the election were to take place in March, it would be too close to call. However, when presented with the second graph, the pattern becomes more straightforward, with Candidate B’s downward trend and Candidate A’s slow and steady growth.

When impressive data findings are published, be mindful that they are intended to support the story someone is trying to tell. Before basing a conclusion on a single data set or graph, it helps to seek a second or third source.

When data looks too simple and too good to be true, it probably is.

Karen Ahn is the data analyst for The Globe Edge content studio. She can be reached at @KarenAhn.