Robert Kosara

Why the Obsession with Tables?

May 2, 2013
 

Lots of data are still presented and released as tables. But why, when we know that visual representations are so much easier to read and understand? Eric Newburger from the U.S. Census Bureau has an interesting theory.

In a short talk on visualization at the Census Bureau, he describes how in the 1880s, the Census published maps and charts. Many of those are actually amazingly well done, even by today’s standards. But starting with the 1890 census, they were replaced with tables.

This, according to Newburger, was due to an important innovation: the Hollerith Tabulating Machine. The new machines were much faster and could slice and dice the data in a lot of new ways, but their output ended up in tables. Throughout the 20th century, the Census created an enormous number of tables, with only a small fraction of the data shown as maps or charts.

Newburger argues that people don’t bother trying to read tables, whereas visualizations are much more likely to catch their attention and get them interested in the underlying data. We clearly have the means to create any visualization we want today, and there is plenty of data available, so why keep publishing tables? It’s a matter of the attitudes towards data, and these can be hard to change after more than 100 years:

We were producing analysts who knew how to make tables. Really really good tables. But what we’re doing is making tables.

There are three short talks in this recorded webinar, which also go into some detail on the visualization efforts inside the Census, their visualization gallery, etc. It’s an interesting insight into the way the Census Bureau works and how a small group of people is trying to change the way the Census communicates information to the public.

Continuous Values and Baselines

Apr 29, 2013
 

One of the most common mistakes people make when creating charts is to cut off the vertical axis. But why is that a problem? And what can you do when you need to show data where the amount of change is small compared to the absolute values?

When we think of continuous data, we almost always think of values that have a meaningful zero. There is no question what an amount of money is measured from: we understand the meaning of zero money. The same is true for most other things: length, weight, volume, etc. all have an obvious zero. It doesn’t matter what unit you use: zero meters is zero feet is zero furlongs is zero light-years.

As a consequence, we can think in terms of multiples, without even caring about units. Something being twice as heavy as something else is meaningful independently of whether you weigh using pounds or kilograms, and something is twice as expensive whether you pay in Euros or Dollars or Yen.

Bars: Length Is Just Another Unit

When data gets mapped to visual variables for visualization, we tend to make the same assumptions. A bar that is twice as long represents a value that’s twice as big. But that is only true if that bar starts from zero. If it was cut off, that is no longer true.

The following image shows the monthly sales of a fictitious coffee chain over a few months. The left bar chart starts at zero, the right one at $29K. Notice the difference?

Bars, with baseline 0 and baseline $29K

In the right-hand chart, the bar for February appears to be roughly twice as high as the one for January. Twice the bar size means twice the value, right? But looking at the chart on the left, it’s obvious that the change is rather small.
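The distortion can be put into numbers. Here is a minimal sketch in Python. The sales figures are made up for illustration (the values behind the chart are not given in the text) and were picked so that a $29K baseline doubles the apparent height of the February bar; the function name is also just an assumption.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of bar lengths for values a and b when bars start at `baseline`."""
    if a <= baseline:
        raise ValueError("first value must lie above the baseline")
    return (b - baseline) / (a - baseline)

# Hypothetical monthly sales in dollars, not the data behind the chart.
january, february = 30_000, 31_000

true_ratio = apparent_ratio(january, february)         # bars start at 0
cut_ratio = apparent_ratio(january, february, 29_000)  # bars start at $29K

print(f"baseline 0:    {true_ratio:.2f}x")
print(f"baseline $29K: {cut_ratio:.2f}x")
```

With these numbers, the real increase is about 3%, while the cut-off bars suggest a doubling.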

The first thing to do when looking at a chart, therefore, is to make sure you understand the vertical axis. If it starts at 0, it is much easier to read the chart without being misled.

Lines Don’t Need Baselines?

Some people suggest that in contrast to bar charts, line charts are not sensitive to the baseline problem. However, I disagree. Look at the same data as before, this time shown as a line chart.

Lines, with baseline 0 and baseline $29K

Is the change not much more dramatic in the right-hand part of this image? The line chart maps the value to vertical position rather than length, which is less obviously connected to the axis. But when the points are connected, we tend to think in terms of the distance from the axis, not in terms of a few points floating in space.

Line charts with a non-zero baseline are very common. They are still problematic, however, because the apparent change can be deceiving. Having to look at the numbers on the axis to figure out the amount of change requires a lot more mental work and partly defeats the point of the chart.

Mapping Change

So what alternative do we have when we want to create a chart that makes the change visible, but the amount of change is small compared to the absolute values? One way is to plot the change separately. This could be done as percent or absolute difference; here it is absolute difference (same values shown as lines and bars).

Change, shown using lines or bars

Now the scale for the amount is independent of the scale for the change. This also makes it easy to see whether the change is positive or negative, because the relation with the zero line is very visually salient (especially when using bars). Also, the rate of change is much more obvious. While that can be seen in the bar and line charts, it is much harder to get a good sense of it.
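Computing the change series is simple. The following sketch derives both the absolute and the percent difference between consecutive months; the sales numbers and the function name are invented for the example.

```python
def changes(values):
    """Return (absolute, percent) change between consecutive values."""
    pairs = list(zip(values, values[1:]))
    absolute = [b - a for a, b in pairs]
    percent = [100.0 * (b - a) / a for a, b in pairs]
    return absolute, percent

# Hypothetical monthly sales in dollars.
sales = [30_000, 31_000, 30_500, 32_000]
absolute, percent = changes(sales)

print(absolute)                        # differences in dollars
print([round(p, 1) for p in percent])  # differences in percent
```

Plotting either list against a zero line gives the kind of chart described above: the sign of each value shows the direction of the change, and its size shows the rate.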

Showing small changes in large values is a challenge, but it helps to ask, what do we care about here? What do we need to know? That should guide the way the data is shown.

Apr 22, 2013
 

Are you looking for inspiration while writing a paper or grant? Do you feel that there is a lack of information visualization content on Twitter? Is your timeline too empty and slow? Follow @InfoVis_Ebooks, a Twitter account that posts random pieces of text from infovis papers.

Related Work

Accounts that tweet more or less random snippets of text have become a genre in themselves. If you’ve spent any time on Twitter, you’ve probably seen the one that started it all: Horse ebooks. Despite being a spam account, it has almost 170,000 followers who presumably enjoy its random and often nonsensical tweets. Following in its footsteps are more or less serious accounts, like Bogost ebooks, which tweets pieces of Ian Bogost’s writing.

Materials and Method

InfoVis Ebooks takes a random piece of text from a random paper in its repository and tweets it. It has read all of last year’s InfoVis papers, and is now getting started with the VAST proceedings. After that, it will start reading infovis papers published in last year’s EuroVis and CHI conferences, and then work its way back to previous years.

Each tweet contains a reference to the paper the snippet is from. For InfoVis, VAST, and CHI, these are DOIs rather than links. Links get long and distracting, whereas DOIs are much easier to tune out in a tweet. If you want to see the paper, google the DOI string (keep the “doi:” part). You can also take everything but the “doi:” and append it to http://dx.doi.org/ to be redirected to the paper page. For other sources, I will probably have to use links.
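The second route is easy to automate. A small sketch, assuming the reference arrives as a string that starts with “doi:”; the function name is made up, and the DOI in the example is a placeholder rather than a real paper.

```python
RESOLVER = "http://dx.doi.org/"  # resolver address mentioned above

def doi_to_url(reference):
    """Turn a 'doi:...' string from a tweet into a resolver URL."""
    reference = reference.strip()
    if reference.lower().startswith("doi:"):
        reference = reference[len("doi:"):]
    return RESOLVER + reference

# Placeholder DOI for illustration only.
print(doi_to_url("doi:10.1234/example.5678"))
# prints http://dx.doi.org/10.1234/example.5678
```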

As the name suggests, InfoVis Ebooks is about infovis papers. If you want to do the same for SciVis, HCI, or anything else, the code is available on github.

Results

InfoVis Ebooks currently tweets roughly once every two hours. The time is randomized, and there can be much more (and less) than two hours between tweets; it all depends on how chatty the bot is feeling.
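One way to get gaps that average two hours but vary a lot is to draw each waiting time from an exponential distribution. The sketch below only illustrates that idea; it is an assumption, not necessarily how InfoVis Ebooks schedules its tweets, and the names and the seed are made up.

```python
import random

MEAN_GAP_HOURS = 2.0  # "roughly once every two hours"

def next_gap_hours(rng):
    """Draw a waiting time in hours with a mean of MEAN_GAP_HOURS."""
    # expovariate takes the rate (1 / mean) as its argument.
    return rng.expovariate(1.0 / MEAN_GAP_HOURS)

rng = random.Random(2013)  # fixed seed so the run is repeatable
gaps = [next_gap_hours(rng) for _ in range(10_000)]
average_gap = sum(gaps) / len(gaps)

print(f"average: {average_gap:.2f} h")
print(f"shortest: {min(gaps):.2f} h, longest: {max(gaps):.2f} h")
```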

The results are sometimes nonsensical, sometimes funny, and sometimes pieces of code or formulas. Despite the limited set of papers right now, there is a lot of variety in the tweets.

Conclusions and Future Work

This is clearly only the start, and further research is needed. The number of sources needs to be expanded, which is a slow, manual process. The goal is to eventually not only include papers (and maybe posters), but also have the bot follow visualization blogs.

In addition to the text, the document database knows the venue and year a paper was published. The idea is to be able to focus the tweets on papers from a particular venue (e.g., during a conference, only tweet from papers that were published in earlier years at that same place), or restrict to a time period (vintage papers from the early ’90s?).

The bot will continue to be tweaked to create more interesting and entertaining tweets. It is currently based on some very simple heuristics and rules for what makes a snippet acceptable, but I plan on refining those over time. Also, a user study.

Data: Continuous vs. Categorical

Apr 18, 2013
 

Data comes in a number of different types, which determine what kinds of mapping can be used for them. The most basic distinction is that between continuous (or quantitative) and categorical data, which has a profound impact on the types of visualizations that can be used.

The main distinction is quite simple, but it has a lot of important consequences. Quantitative data is data where the values can change continuously, and you cannot count the number of different values. Examples include weight, price, profits, counts, etc. Basically, anything you can measure or count is quantitative.

Categorical data, in contrast, is for those aspects of your data where you make a distinction between different groups, and where you typically can list a small number of categories. This includes product type, gender, age group, etc.

Both quantitative and categorical data have some finer distinctions, but I will ignore those for this posting. What is more important is: why do those make a difference for visualization?

Quantitative Data: Values

Most data sets contain both types of data. It’s actually quite difficult to visualize data that is purely quantitative or purely categorical (parallel coordinates are a good way to show the former, parallel sets for the latter).

Let’s take the example of a hypothetical coffee chain and look at their profits. A simple bar chart can show this data broken down by product type.

Simple bars

As simple as this chart is, some decisions had to be made about how to show the data. The quantitative Profit variable is shown well by position or length. The categorical Product Type naturally divides the data into individual items, hence the bars.

What if we picked a different variable for the second axis, one that is continuous? This changes the type of chart we want to a line chart.

Single line

Profit is now on the vertical axis, but it is still a continuous variable. We might treat time as categorical, which would give us another bar chart, perhaps with one bar per month (or whatever granularity we want). But I decided to treat time as continuous here, which results in a line chart. Time is a special case that can be either type, depending on the way you want to look at the data. To focus on individual months, treat time as discrete and use bars. To look at trends and the rate of change (and thus, the space in between the data points), use continuous time.

Line and bar charts can appear to be interchangeable, but they are usually not. The encoding is subtly different (length for the bars, position for the line), and there is a clear implication in the line that there is a continuum between the points. Using a line chart for the product type chart above would not make sense, since there is nothing in between Espresso and Herbal Tea. Even if we only have one data point for each month, though, time is still continuous, so we can treat it as such if we want.
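The rule of thumb for these two charts fits in a few lines of code. This is only a sketch of the reasoning in this section, with made-up names, and not a general chart chooser.

```python
def chart_type(x_kind):
    """Suggest a basic chart for a quantitative value plotted against x."""
    if x_kind == "categorical":
        return "bar"   # separate items, nothing in between them
    if x_kind == "continuous":
        return "line"  # the line implies a continuum between the points
    raise ValueError(f"unknown data type: {x_kind!r}")

# Time can be treated either way, depending on the question.
print(chart_type("categorical"))  # product type, or time as individual months
print(chart_type("continuous"))   # time as a continuum, to show trends
```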

Categorical Data: Breaking Things Down

We often want to see more than two data attributes at the same time. Categorical axes can be used to break data down further. Each category is subdivided by the categories of the additional dimensions. Adding two categorical dimensions, Market and Year, to the initial chart gives us a lot more bars.

More bars

Here, time is now categorical, which means we get separate bars for each year. We’ve also broken out the different regions to get individual bars for every combination of market, product type, and year. There are other ways to show the same data: we could stack the bars for the different product groups, for example. Which dimensions are nested, and in what order, is also important. We could decide that we want to see each product type broken down by market instead, rather than the other way around, or maybe break each year down into markets, and look at the products across those combinations.

Which is the right configuration depends on the question you want to ask. But the type of visualization has not changed, we are still looking at bars. Adding categorical dimensions to a visualization usually divides the visualization up rather than changing the type.
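In data terms, breaking down by categorical dimensions is a grouping operation: every combination of categories gets its own total, and thus its own bar. A minimal sketch with invented records; the field names and numbers are assumptions, not the data behind the charts.

```python
from collections import defaultdict

# Made-up records: (market, product type, year, profit).
records = [
    ("East", "Coffee", 2011, 120),
    ("East", "Tea", 2011, 80),
    ("West", "Coffee", 2011, 150),
    ("West", "Coffee", 2012, 170),
    ("East", "Coffee", 2012, 130),
    ("West", "Coffee", 2012, 30),
]

FIELDS = {"market": 0, "product": 1, "year": 2}

def break_down(rows, dimensions):
    """Sum profit for every combination of the given categorical dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[FIELDS[d]] for d in dimensions)
        totals[key] += row[3]
    return dict(totals)

by_product = break_down(records, ["product"])
by_all = break_down(records, ["market", "product", "year"])

print(len(by_product), "bars by product type")
print(len(by_all), "bars by market, product type, and year")
```

Changing the order of the dimensions changes the nesting of the bars, but the result is still a set of bars.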

The same thing can be done for our line chart. Let’s break that one down by product type.

Multiple lines

The axis mappings have not changed; they are still (continuous) time and profit. But adding the product type subdivides the total into four separate lines. We can now see how each of them has done over time, which ones are flat, which increasing, etc.

Adding color is not strictly necessary here, but it makes following the lines and identifying them much easier. Color works great for categories, at least as long as the number is reasonably small.

More Encodings

These examples are very straightforward. Simple charts tend to work well for a small number of data dimensions. More unusual encodings should only be used when more variables are needed. As an example, let’s look at sales compared to profits in a scatterplot.

Scatterplot

The scatterplot shows two numerical values using position along each axis. I’ve added two categorical ones: color and shape. This shows me that the West market had the highest sales in all but the Coffee category (look at the locations of the X marks compared to the other shapes of the same color), though not always the highest profits.

Like color, shape works well for a small number of categories, because we can really only tell a very limited number of them apart (10 is roughly the maximum for both).

If we wanted to add another quantitative dimension, we might use size, though that would start to overload the chart. It is usually a better idea to keep the number of visual variables (like color, shape, size, orientation, etc.) small, as they interact and become difficult to read. It is often more effective to create several different charts or rethink the question to make sure all these dimensions are really needed at the same time.

Data types play an important role in visualization because they determine what visualization types can or should be used. That doesn’t mean that there is only one chart for any combination of data types, but it does narrow down the possibilities.

Apr 8, 2013
 

With Google Reader shutting down July 1st, now is the time to find alternative ways to follow your favorite blogs. For this one, you can now get new postings on Facebook and through a dedicated Twitter feed, in addition to the RSS feed. See below for some RSS aggregator/reader alternatives to Google Reader.

Facebook and Twitter

I don’t use Facebook and Twitter to follow feeds; that’s what I have my RSS reader for. But for people who like doing that, perhaps using Flipboard or similar, I have now created pure feed accounts. No talking, just links to new postings.

There’s still my personal Twitter account, of course, where I will also retweet the new posting tweets.

Feedburner

Luckily, I never trusted feedburner, so almost 90% of you are subscribed to the feed URL on my website directly. This used to point to feedburner (via some redirect magic), but I switched that a few weeks ago, when Google announced the end of Reader. Feedburner is not on the chopping block yet, but it can’t be far behind.

Some time this or next week, I will completely phase out feedburner. If you’re following this site using the feedburner feed, you will see a posting appear that will tell you where to subscribe to the original feed. As the posting will point out, you will not see any further updates in that feed (other than maybe a few nag postings to remind you to change your subscription) after that.

If you’re among my 40 or so email subscribers, you will soon get an email asking you to resubscribe to the site with the new mechanism powered directly by WordPress. If you don’t want to wait, you can do this now using the subscribe field at the bottom of the navigation bar on the right.

Google Reader Alternatives

It’s important to understand that Google Reader is not just a website for reading feeds, but also the service that virtually all RSS readers currently rely on to get feed items and synchronize status between devices. So even if you’re using a client and don’t remember ever having seen Reader, you’re almost certainly using it and will lose access to your feeds come July 1.

A number of alternatives have sprung up in the last few weeks. The service that has gotten the most attention from the people I talk to is feedly. They have very nice apps for iOS and Android, as well as plugins for Chrome and Safari. At this point, they are still talking directly to Google Reader and keep it in sync when you make changes (or mark things as read). That means you can easily try feedly and if you don’t like it, you can go back to your previous reader and not have to worry about having to wade through hundreds of items you’ve already seen.

Eventually, feedly will let you disconnect from Reader and then host their own feed aggregation service. They also seem to have some ideas that go beyond simple feed aggregation, which is good. It’s not clear whether their aggregation service will be free (like their apps currently are), but I hope that they will charge. That’s the only way they will be around for the long haul.

Feedbin looks like a good service if you’re prepared to pay money ($2/month, $20/year) from the start. Its web interface is quite nice and it will be one of the backends Reeder will talk to at some point in the future (Reeder is a beautiful RSS reader app for the Mac and iOS that acts as a frontend for Reader).

Other services worth mentioning are The Old Reader and Newsblur. I haven’t tried The Old Reader, and I’ve only played with Newsblur briefly, so I don’t have anything intelligent to say about them.

Google Reader is Dead, Long Live RSS!

There is a lot of value in simple feeds and Really Simple Syndication (which is what RSS stands for). I love Twitter, it’s incredibly useful and fun, and I spend way too much time there. But it doesn’t do what RSS does.

The end of Google Reader is unfortunate, but I think once we’ve all figured out what alternatives to use, it will be a good thing. Google pushed all the other RSS aggregators out of existence, and then mostly just sat there doing nothing. There is a good chance that the new crop of feed aggregators that is sprouting now will lead to some real innovation in this area.

The Revolution Will Be Visualized

Apr 4, 2013
 

In the 1970s, it was the protest songs. In the 1980s, it was the anti-war movies. Today, the protest is no longer happening in songs or movies. Today, it’s online, based on data, and using visualization.

U.S. Gun Deaths

Gun Deaths

It’s a very abstract and yet very clear image: something moves along a trajectory, is suddenly stopped, and drops to the ground. A gun has been fired, somebody has been killed. Periscopic’s U.S. Gun Deaths visualization is visceral and it doesn’t just show data: it makes an argument. People are being robbed of their lives. Hundreds of years are lost every day.

In the deleted slides from his Tapestry talk, Jonathan Corum criticizes the visualization because there are elements that don’t mean anything. The filtered views also don’t work nearly as well as the initial animation. But the point is made there. It’s the impact, the punch in the guts that makes this work.

Out of Sight, Out of Mind

Drone Strikes

Pitch Interactive’s Out of Sight, Out of Mind shows U.S. drone strikes in Pakistan. It breaks down the victims into high-profile targets, alleged combatants, civilians, and children. It’s essentially a stacked bar chart.

But the animation of the dropping bombs gives the strikes much more of a reality than a mere monthly number would. And the number of people killed is staggering when you see it as bars like that. These aren’t just bars, but they have segments, one for each person.

Switch to the Victims view and it gets even more personal. A small figure is drawn for every person killed. Continuous bars don’t give you a sense of individuals, but little figures do.

Mapping the Dead: Gun Deaths Since Sandy Hook

Guns Again

The Huffington Post’s Mapping the Dead: Gun Deaths Since Sandy Hook shows gun deaths since the elementary school shooting that got so much attention last December. It’s a simple map, but with a twist: it zooms out from Newtown, CT, to reveal the entire U.S. and all the gun deaths over the last few months. It’s breathtaking.

Hovering over the bars also gives you something else: names. These are not just numbers, they were real people. Listing them, similar to the figures for the drone strikes, makes them much more tangible and real.

The New Language of Protest

How do you make people notice an issue? How do you get them to care? What if we’re no longer moved by songs (and the artists too comfy and reluctant to take sides) and no longer want to see movies about real issues (and Hollywood won’t take the risk of offending anybody)?

What if the new way to get us to care is with a visceral, raw display of data?

Mar 24, 2013
 

“Arguments in data visualization are so fierce because the stakes are so low” is a great zinger that I’ve heard a few times recently. But it’s not always true. Data visualization influences important decisions every day. The Congressional Budget Office’s new snapshots are but one example.

The role of the Congressional Budget Office (CBO) is to provide information to members of the U.S. Congress so they can make better decisions. The usual way of doing this is through reports that are prepared on a variety of topics.

Snapshots are like tweets: they contain a small amount of information, but are crisp, to the point, easy to consume, and link to more in-depth information to be found elsewhere.

Snapshot of Unemployment Benefits

The CBO does not make policy recommendations, which makes creating charts with a purpose and message a much bigger challenge. You won’t find any monsters here, but the points are still clear and easy to follow.

Snapshot of Child Nutrition Programs

Rather than overwhelm the reader with numbers, snapshots are constrained on purpose, with exact numbers largely missing. That’s what the reports are for, after all, and those can be found at the URLs at the bottom right of each snapshot.

Snapshot of the Highway Trust Fund

Eventually, the idea is to print these onto 4"x6" index cards, which seems to me to be a crucial component of the campaign. Having them pop up on the CBO blog is nice and all, but they will have much more impact when they are clipped to congresspeople’s and senators’ memos, shoved into pockets, and just lying around on tables and desks to be picked up randomly.

Snapshot of Guarantees of New Residential Mortgages

It may well be true that many arguments in visualization are pointless and petty. But in some cases, the stakes are high.


I’m honored to have been asked to provide input during the design phase of this effort. Like with all my secret government work, the CBO will neither confirm nor deny my involvement.

Study on Creative Data Visualization

Mar 22, 2013
 

To explore how we can make it easier to create new visualization designs, we are running a study based on a new approach, called visualization primitives. It lets you map data to the properties of objects like rectangles and ellipses. Build something with data, have fun, and help us figure out if it works!

This being a study, it asks you a few questions before you start, but that takes less than a minute. Then there’s a brief tutorial, after which you’re free to play. Build interesting things with the data (we’re using the OECD Better Life Index data) and submit the ones you like. This is all anonymous, obviously. My student Drew Skau, who is running this study, will analyze the data to see what kinds of things you and others are building and how much of the design space you explore.

Don’t forget to hit Done when you don’t want to play any more. You then get asked a few more questions that also won’t take much time, but that are crucial for the study. Once you’re done with those, you can continue building or just close the window.

Here’s the link to the study: Visualization Creativity Study

A Better Definition of Chart Junk

Mar 18, 2013
 

Maximizing the data-ink ratio sounds like a good idea, but when actually followed to the letter, it produces terrible and nonsensical results. Here is a more reasonable definition of chart junk that does away with the pretense of a mathematical formula and puts some common sense back into the question of good chart design.

Much has been made of Tufte’s famous data-ink ratio, and many people like to rail, privately and online, against chart junk. In short, the data-ink ratio defines the amount of information your chart elements (“ink”) are providing, with the goal of maximizing that ratio. Since we can assume that the information is constant, this means we need to minimize the amount of ink. Any ink on your chart that does not convey data is considered junk.
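For reference, the ratio is usually written like this (a transcription of Tufte’s definition, not a formula of my own):

```latex
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
```

The ratio can be at most 1, which is reached when every bit of ink on the chart encodes data.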

While this extremely reduced definition makes for great flame war fuel, it places the emphasis on the wrong question, and when properly followed, leads to largely nonsensical charts (this example is from Stephen Few’s recreation of Tufte’s argument).

Minimal Bars

The first issue is the whole notion of ink. What does that even mean? If you live in a world of black ink on white paper, that may be a reasonable criterion. But add color and the whole thing breaks down. Color can be used well and can be terrible. Reducing ink does not tell us anything about that. The same is true for interactions like mouse-overs, sorting, and other conveniences our modern visualization machines afford us.

There is a parallel here with writing. While you might argue that using fewer and simpler words is generally preferable, nobody would argue that writing is merely a question of maximizing the information-to-letters ratio. Good writing needs clarity and simplicity just as it needs variation, voice, explanation, and many other things.

Which brings me to my alternative definition of chart junk:

Chart junk is any element of a chart that does not contribute to clarifying the intended message.

Do you have more bars than necessary? Get rid of them! Are you missing context that would help people understand the values better? Add it in! Is your use of color distracting from the message? Change it! Are people not able to figure out what you are telling them? Use highlights!

Do you see the difference? Instead of minimization at all cost, we are now asking questions about the purpose of this thing you are creating. We are no longer pretending that visualization design is a mathematical optimization, and instead thinking about what we want to achieve.

Chart junk is still chart junk. Don’t add meaningless nonsense to your charts! Don’t clutter them up! Reduce the impact of grid lines, etc. But also think about how you can clarify the message, how you want people to read your data, and what you want them to take away. Perhaps adding things will actually help. What was considered chart junk before might turn out to be useful.

Tableau Desktop Now Free For University Students

Mar 14, 2013
 

If you are a student at a university, you can now get a free license for the full version of Tableau Desktop. No matter if you use it in class or for research, this is the full version that does not restrict the amount of data or the kind of connectivity (like Tableau Public does). The license is good for one year and can be renewed as long as you are enrolled at university.

This has been in the works for a while, and I’m very happy to finally see it happen. Tableau’s roots are in academia, so it only makes sense to make it available to students and faculty. There was a poorly advertised way of getting a discounted version of Tableau before, but that still cost a bit of money and required you to jump through some hoops. The new program asks you for a minimum of information for verification, and is automated, so in most cases you will get your license within minutes. This is not restricted to the U.S. either, though verification can take a bit longer in that case.

Tableau for Students is different from Tableau for Teaching (TfT). The latter is specific to a particular course where Tableau is used as part of the teaching (I’ve used Tableau to teach Visual Analytics several times, for example). The licenses for TfT are limited to one semester, and you were not supposed to use them for research. That restriction no longer applies, and in fact we hope that you will find Tableau useful for your data analysis!

Tableau for Teaching is not going away, though. If you are an instructor and are thinking about using Tableau for a course, get in touch! There are real, enthusiastic, awesome people who handle these emails, and they’re happy to give you licenses, answer questions, and help in any way they can. You can also contact me, if you want to talk to a grumpy ex-professor.

So if you’ve ever wanted to try Tableau, and you’re a student, this is your chance now.