Apr 01, 2013
 


Disclaimer: Everyone's graduate school experience is going to be different. Mine wasn't a typical one, mainly because I spent so much time away from campus (in a different state), but hey, most of your PhD experience is independent learning anyways. That's the best part.

Before you begin (or apply)

You should really like the field you're thinking about pursuing a PhD in. You don't have to have this, but you kind of do. A doctorate is a commitment of several years (for me it was 7), and if you're not fascinated by your work, it's going to feel like an impossible chore. There are going to be a lot of things that will be actual chores (administration, research results that go against your expectations, challenging collaborations, etc.), and the interest in your work is what will pull you through.

I don't know anyone who finished their PhD who wasn't excited about the field in some way.

On that note, do your research before you apply to programs, and try to find faculty whose interests align with yours. Of course this is easier said than done. I entered graduate school with statistics education in mind and came out the other end with a focus in visualization. The size of my department probably allowed for some of that flexibility. Luck was also involved.

So what I actually did was apply to more than one program and then wait to hear if I got in or not. If I only got into one place (or none), then the decision was easy. In the end, I compared department interests, and then went with the one I thought sounded better.

If it's hard to find faculty information because there's little to nothing online, that should be a red flag. There's really no excuse these days not to have updated faculty pages.

Absorb information


Okay, you're in graduate school now. The undergrads suddenly look really young and all of them expect that you know everything there is to know about statistics (or whatever field you're in). This becomes especially obvious if you're a teaching assistant, which can feel weird at first because you're not that far out of undergrad yourself. Use the opportunity to brush up on your core statistics knowledge though.

You will also be taking classes yourself. (I had coursework for the first two years, but I'm sure it varies by department.) Don't freak out if the lectures are confusing and everyone seems to be asking smart questions that you don't understand. In reality, it's probably only a handful of people who dominate the discussion, and well, there are always some people who are ahead of the curve. Maybe you're one of them.

If you're confused, don't worry about it. Tough early going has a lot to do with learning the language of statistics. There's jargon that makes it easier to describe concepts (once you know them already), and there's a flow of logic that you pick up over time.

Don't hesitate to ask questions and make use of office hours (but don't be the person who waits until the week before an exam or project to get advice, because that's just so undergrad). Once you finish your coursework, it's going to be a lot of independent learning, so take advantage of the strong guidance while you can. (There's usually a qualifying exam after the first two years to make sure you learned the material in class.)

The key here is to absorb as much information as you can and try to find the area of statistics that excites you the most. Pursue and dig deeper when you do find that thing.

I remember the day I discovered visualization. It was a guest lecture by someone who became my adviser. He talked about it, I grew really interested, and then went home and googled away.

Oh, and read a lot of papers. I didn't do nearly enough of this early on, and you need a proper literature review for your dissertation.

Find an adviser


I kind of had an adviser from the start of graduate school, because I was lucky to get a research assistant position that had to do with statistics education. However, as my interests changed, I switched my adviser around the two-year mark. (Actually, I don't think I ever officially asked my adviser to be my adviser. It was just assumed when I became a student researcher in his group.)

This is important, and goes back to the application process. After a couple years, you should have a sense of what the faculty in your department work on and their teaching styles, and you should go for the best match.

I think a lot of people expect an adviser to have all the answers and give you specific directions during each meeting. That's kind of what it's like in the early going, but it eventually develops into a partnership. It's not your adviser's job to teach you everything. A good adviser points you in the right direction when you're lost.

Jump at opportunities


Statistics is an innately collaborative field, and there are a lot of opportunities to work with others within the department and outside of it. A lot of companies are often in search of interns, so they might send fliers and listings that end up posting to the grad email list. Jump at these opportunities if you can.

Opportunities within the department or university should be of extra interest, because it usually means that your tuition could be reduced by quite a bit. Graduate school doesn't have to be expensive.

If something sounded interesting, I'd respond to it right away, and it usually resulted in something good. A lot of people pass up opportunities, because they see the requirements of an ideal candidate and feel like they're not qualified. Instead, apply and let someone else decide if you're qualified. There's usually a lot of learning on the job, and it's usually more important that you'll be able to pick up the necessary skills.

At the very least, you'll pick up interview experience, which comes in handy later on if you want one of those job things after you graduate.

Learn to say no


As you progress in your academic career, you're going to look more and more like a PhD (hopefully). You'll have more skills, more knowledge, and more experience, which means you'll become more of an asset to potential collaborators, researchers, and departments. A lot of my best experiences come from working with others, but eventually, you have to focus on your own work so that you can write your dissertation. Hopefully, you'll have a lot of writing routes to take after you've jumped at all the opportunities that crossed your desk.

So it's a whole lot of yes in the beginning, but you have to be more stingy with your time as you progress.

There are probably going to be potential employers knocking at your door at some point, too. If you really want to finish your PhD, you must make them wait. I know this is much easier said than done, but once you start a full-time job, it's going to be hard to muster the energy at the end of a day to work on a dissertation.

All the times I wanted to quit, I justified it by telling myself that I would probably have the same job with or without a doctorate. I also know a lot of people who quit and are plenty successful, so finding a job didn't work for me as a motivator. But it might be different for you, depending on what work you're interested in.

Solitude


This might've been the toughest part for me. During my first two years in school, I hung out with my classmates a lot and we'd discuss our work or just grab some drinks, but I had to study from a distance from my third year and on. I've always been an independent learner anyways, so I thought I'd be okay, but my first year away, it was hard to focus, and it got kind of lonely in the apartment by myself. I didn't want to do much of anything. I eventually made friends, and pets provided nice company during the day. It's important to have a life outside of dissertation work. Give your brain a rest.

It wasn't all bad though. FlowingData came out of my moving away, and my dissertation topic came out of a personal project.

Anyways, my situation is kind of specific, but it's good to have a support system rather than go at it alone. I mean, you still have to do all the work, but there will be times of frustration when you need to vent or talk your way through a problem. (I found Twitter useful to connect with other work-at-homers, and PhD Comics proved to be a great resource for feeling less isolated.)

Write the dissertation and defend


Despite what you might've heard, a dissertation does not write itself. Believe me. I've tried. Many times. And it never ever writes itself.

I even (shamefully) bought a book that's lying around somewhere on how to write your dissertation efficiently. That's gotta be up there on my list of bad Amazon impulse buys. The book arrived, I started reading, and then realized that it'd be a lot more efficient to be writing instead of reading about how to write. That'd be more efficient.

The hardest part for me was getting started. Just deal with the fact that the writing is going to be bad at first. You're going to come back and revise anyways. I've heard this advice a lot, but you really do just have to sit down and write (assuming you've worked on enough things by now that you can write about).

Don't worry about proper citing, what pronouns to use, and the tone of your writing. This stuff is easy to fix later. Focus on the framework and outline first. (If you already have articles on hand, it doesn't hurt to take notes so that it's easier to clean up citing towards the end.)

By the time you're done writing, you'll know about your specific topic better than most people, which will make your defense less painful. There's a lot of online advice on a successful defense already (just google "successful PhD defense"), but the two main points are (1) your committee wants you to succeed; and (2) think of it as an opportunity to talk about your work. In my experience and from what I've heard, these are totally true. That didn't stop me from being really nervous though.

The best thing to do is prepare. Rehearse your talk until you can deliver it in your sleep. Your preparation depends on your style. Some like to write their talks out. I like to keep it more natural so it's not like reading a script. Go with what you're comfortable with. (I like this video by Ze Frank on public speaking.)

It'll all be fine and not nearly as horrible as you imagine it will be.

Wrapping up

So there you go. A PhD at a glance. Work hard, try to relax, and embrace the uniqueness of graduate school. It can be fun if you let it.

Any graduate students have more advice? Leave it in the comments.

Jan 18, 2013
 

How to Animate Transitions Between Multiple Charts

Sometimes one chart just isn’t good enough. Sometimes you need more.

Perhaps the story you are telling with your visualization needs to be told from different perspectives, and different charts tease out these different angles nicely. Maybe you need to support different types of users, and different plots appeal to these separate sets. Or maybe you just want to add a bit of flair to your visualization with a chart toggle.

In any case, done right, transitioning between multiple chart types in the same visualization can add more insights and depth to the experience. Smooth, animated transitions make it easier for the user to follow what is changing and how the data presented in different formats relates to one another.

This tutorial will use D3.js and its built-in transitioning capabilities to nicely contort our data into a variety of graph types.

We will focus on visualizing time series data and will allow for transitioning between 3 types of charts: Area Chart, Stacked Area Chart, and Streamgraph. The data being visualized will be New York City 311 request calls around the time Hurricane Sandy hit the area. (If you're a FlowingData member, you might be familiar with creating these chart types in R.)

Check out the demo to see what we will be creating, then download the source code and follow along!

The Setup

Before we dive in, let’s take a look at the ingredients that will go into making this visualization.

D3 v3

Recently, the third major version of D3.js was released: d3.v3.js. Updates include tweaks and improvements to how transitions work, making them easier overall to use.

If you have used D3.js previously, one significant change you will want to know about is that the signature of the callback function for loading data has changed. Specifically, when you load data, you now get passed any errors that have occurred during the data request first, and then the actual data array. So this:

d3.json('data', (data) -> console.log(data.length))

Becomes this:

d3.json('data', (error, data) -> console.log(data.length))

Not a huge deal, as the old API is still supported (though deprecated), but one that gives you enough advantages that you should start using it.
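One of those advantages is being able to react when a load fails. Here is a minimal sketch of checking the error argument first, written in plain JavaScript so it runs without D3 or a compile step. Note that fakeJson and describeLoad are hypothetical names made up for this illustration; only the (error, data) argument order comes from the change described above.

```javascript
// Hypothetical stand-in for d3.json: it calls back with (error, data),
// the argument order described above. It is NOT part of D3.
function fakeJson(url, callback) {
  if (url === "data/requests.json") {
    callback(null, [{ key: "Heating" }, { key: "Noise" }]);
  } else {
    callback(new Error("could not load " + url));
  }
}

// Check the error first, and only touch the data when the load succeeded.
function describeLoad(error, data) {
  if (error) {
    return "load failed: " + error.message;
  }
  return "loaded " + data.length + " request types";
}

const results = [];
fakeJson("data/requests.json", (error, data) => results.push(describeLoad(error, data)));
fakeJson("data/missing.json", (error, data) => results.push(describeLoad(error, data)));
console.log(results.join("\n"));
```

With the old single-argument signature, the second call would have no way to tell the callback that something went wrong.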

For more on D3 v3, check out the 3.0 upgrading guide on the D3.js wiki.

A Big Cup of CoffeeScript

As in my previous tutorial on interactive networks, I’ll be writing the code in CoffeeScript. (And again, please feel free to compile to JavaScript if that will make you happy. But before you do, give CoffeeScript 5 minutes of your time. Who knows, you just might fall in love.) I recommend going back to that tutorial if you aren’t familiar with CoffeeScript to get some notes on its syntax. But just in case you don’t want to click that link, here’s the 3 second version:

functions look like this:

functionName = (input1, input2) ->
  console.log("hey! I'm a function")

We see that white space matters – the indentation indicates the lines of code inside a function, loop, or conditional statement. Also, semicolons are left off, and parentheses are sometimes optional, though I usually leave them in.

A Little Python Web Server

Because of how D3.js loads data, we need to run it from a web server, even when developing on our own local machine. There are lots of web servers out there, but probably the easiest to use for our development purposes is Python’s built-in simple HTTP server. (The README in the source code also has instructions for using Ruby.)

From the Terminal, first navigate to the source code directory for this tutorial. Then check which version of Python you have installed using:

python --version

If it is Python 3.x then use this line to start the server:

python -m http.server

If it is Python 2.x then use

python -m SimpleHTTPServer

In either case, you should have a basic web server that can serve up any file from the directory you are in to your web browser, just by navigating to http://0.0.0.0:8000 (or http://localhost:8000 if your browser refuses the 0.0.0.0 address).

The simple Python web server running on my machine

But What About Windows

On Linux or Mac systems, you will already have python installed. However, it takes some moxie to get it working on a Windows machine. I would suggest looking over this blog post, to make sure you don't overlook something.

If you aren't in the mood for some python wrangling, you might take the advice of this getting started with D3 guide and try out EasyPHP. With this installed and running, you can host your D3 projects out of the www/ directory in the root of the installation location.

A Dash of Bootstrap

While we will use D3.js for the actual visualization implementation, we will take advantage of Twitter’s Bootstrap framework to make our vis just a bit more attractive.

Mostly, it will be used to make a nice toggle button that will be used to transition between charts. This might not be the most efficient method for getting a decent looking toggle on a site, but it is very easy to implement and will give you a chance to check out Bootstrap, if you haven’t already. It is quite lovely.

Transitions

Before we start using them, let’s talk a bit about what D3 transitions are and how they work.

Think of a transition as an animation. The starting point of this animation is the current state of whatever you are transitioning: its position, color, etc. When creating a new transition, we tell it what the elements should end up looking like. D3 fills in the gap from the current state to the final one.

D3 is built around working with selections. Selections are arrays of elements that you work with as a group. For example, this code selects all the circle elements in the SVG and colors them red:

svg.selectAll("circle")
  .attr("fill", "red")

It might then come as little surprise that transitions in D3 are a special kind of selection, meaning you can affect a group of multiple elements on a page concisely within a single transition. This is great because if you are already familiar with selections, then you already know how to create and work with transitions.

The main difference between regular selections and transitions is that selections modify the appearance of the elements they contain immediately. As soon as the .attr("fill", "red") code is executed, those circles become red. Transitions, on the other hand, smoothly modify the appearance over time. (There are a few more differences between selections and transitions, mainly due to the fact that some element attributes cannot be animated.)

Here is an example of a transition that changes the position and color of the circles in an SVG:

# First we set an initial position and color for these circles.
# This is NOT a transition
svg.selectAll("circle")
  .attr("fill", "red")
  .attr("cx", 40)
  .attr("cy", height / 2) 

# Here is the transition that changes the circles
# position and color.
svg.selectAll("circle")
  .transition()
  .delay(500)
  .duration(750)
  .attr("fill", "green")
  .attr("cx", 500)
  .attr("cy", (d, i) -> 100 * (i + 1))

I’ve coded up a live version of this demo (in JavaScript), so you can get a better feel for what is going on.

The functions called on the transition can be separated into 2 groups: those modifying the transition itself, and those indicating what the appearance of the selected elements should be when the transition completes.

The delay() and duration() functions are in the former category. They indicate how long to wait to start the transition, and how long the transition will take.

The attr() calls on the transition are in the latter category. They indicate that once the animation is done, the circles should be green, and they should be in new positions. As you can see from the live example, D3 does the hard work of interpolating between starting and ending appearance in the duration you’ve provided.
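To make the idea of interpolating concrete, here is a rough sketch in plain JavaScript of what filling in the gap between a start and an end value can look like. It is not D3's implementation: the helper names are made up for this illustration, and it assumes a plain linear blend, whereas D3 applies an easing function by default, so the real in-between values will differ.

```javascript
// Linear interpolation between two numbers; t runs from 0 (start) to 1 (end).
function interpolateNumber(a, b) {
  return (t) => a + (b - a) * t;
}

// Interpolate each RGB channel separately and round to whole values.
function interpolateRgb(from, to) {
  const channels = from.map((c, i) => interpolateNumber(c, to[i]));
  return (t) => channels.map((f) => Math.round(f(t)));
}

// Mirror the example above: cx goes from 40 to 500, fill from red to green.
// CSS "red" is rgb(255, 0, 0) and CSS "green" is rgb(0, 128, 0).
const cx = interpolateNumber(40, 500);
const fill = interpolateRgb([255, 0, 0], [0, 128, 0]);

// Sample the animation at the start, the midpoint, and the end.
const frames = [0, 0.5, 1].map((t) => ({ t: t, cx: cx(t), fill: fill(t) }));
console.log(frames);
```

A transition repeats this kind of calculation on every animation frame, with t derived from how much of the duration has elapsed.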

There are lots of interesting details you can learn about transitions. For a more thorough introduction, I’d recommend Jerome Cukier’s introduction on visual.ly.

To really rip off the covers, check out Mike Bostock’s Transitions Guide, which exposes more of the nitty-gritty details of transitions and is required reading once you start needing their more advanced capabilities. Custom interpolation, start and end triggers, transition life cycles, and more await you in this great guide!

For now, let’s stop with the prep work and get going on more of the specifics of how this visualization works.

A Peek at the Data

When I discovered the NYC OpenData site provided access to raw 311 service request data, I had visions of recreating the classic 311 streamgraph from Wired Magazine originally created by Pitch Interactive.

Alas, my dreams were dashed upon the realization that the times reported for all the requests were set to midnight! I assume some sort of bug in the export process is currently preventing the time from being encoded correctly.

Not wanting to give up on this interesting dataset, I decided to switch gears and instead look at daily aggregation of requests during an interesting period of recent New York history: Hurricane Sandy. This tells, I think, an interesting, if not surprising, story. Priorities change when a natural disaster strikes.

Here is what the data looks like:

[
  {
    "key": "Heating",
    "values": [
      {
        "date": "10/14/12",
        "count": 428
      },
      {
        "date": "10/15/12",
        "count": 298
      },
      // ...
    ]
  },
  {
    "key": "Damaged tree",
    "values": [
      // ...
    ]
  },
  // ...
]

In words, our array of data is organized by 311 request type. Each request object has a key string and then an array called values. Values has an entry for each day in the visualization. Each day object has a string representation of the date as well as the number of this type of request for that day, stored in count.

This format was chosen to match up with how the visualization will be built. As we will see, the root-level request objects will be represented as SVG groups. Inside each group, the values array will be converted into line and area paths. (You could use d3.nest to convert a simple table into a similar array of objects, but that is a tutorial for another day.)
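For a sense of what that table-to-nested conversion involves, here is a small sketch in plain JavaScript that groups flat rows into the key/values shape shown above. It does not use d3.nest. The nestByKey helper, the column names, and the "Damaged tree" count are all made up for this illustration.

```javascript
// Hypothetical flat table: one row per request type per day.
// The Heating counts come from the sample above; the Damaged tree count is made up.
const rows = [
  { type: "Heating", date: "10/14/12", count: 428 },
  { type: "Heating", date: "10/15/12", count: 298 },
  { type: "Damaged tree", date: "10/14/12", count: 12 },
];

// Group rows by request type into [{ key, values: [{ date, count }] }].
// A Map keeps the groups in first-seen order.
function nestByKey(rows) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row.type)) {
      groups.set(row.type, { key: row.type, values: [] });
    }
    groups.get(row.type).values.push({ date: row.date, count: row.count });
  }
  return Array.from(groups.values());
}

const nested = nestByKey(rows);
console.log(JSON.stringify(nested, null, 2));
```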

A Static Starting Point

To create movement, one must begin with stillness. How’s that for sage advice? Not great? Well, it will work well enough for us in this tutorial.

Transitions don’t deal with the creation of new elements. An element needs to exist already in order to be animated. So to begin our visualization, we will create a starting point from which the visualization can transition.

Layouts and Generators

First let’s set up the generators and layout we will use to create the visualization. We will be using an area generator to create the areas of each chart, a line generator for the detail on the regular area chart, and the stack layout for the streamgraph and stacked area chart, as well as some scales for x, y, and color.

Here is what the initialization code looks like:

x = d3.time.scale()
  .range([0, width])

y = d3.scale.linear()
  .range([height, 0])

color = d3.scale.category10()

# area generator to create the
# polygons that make up the
# charts
area = d3.svg.area()
    .interpolate("basis")
    .x((d) -> x(d.date))

# line generator to be used
# for the Area Chart edges
line = d3.svg.line()
    .interpolate("basis")
    .x((d) -> x(d.date))

# stack layout for streamgraph
# and stacked area chart
stack = d3.layout.stack()
  .values((d) -> d.values)
  .x((d) -> d.date)
  .y((d) -> d.count)
  .out((d,y0,y) -> d.count0 = y0)
  .order("reverse")

The stack layout could use a bit more explanation.

Unlike what its name might imply, this layout doesn’t actually move any elements itself – that would be very un-D3 like. Instead, its main purpose in this visualization is to calculate the location of the baseline – which is to say the bottom – of the area paths. It computes the baseline for all the elements in the values array based on the stack’s offset() algorithm.

The out() function allows us to see this calculated baseline value and capture it in an attribute of our value objects. In the code above, we assign count0 to this baseline value. After the stack is executed on a set of data, we will be able to use count0 along with the area generator to create areas in the right location.
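To see what a baseline calculation looks like, here is a simplified sketch in plain JavaScript of the simplest case, where every layer sits directly on top of the ones before it (the zero offset). It is an illustration only, not the stack layout's code: stackZero is a made-up helper, the counts are invented, and it ignores both the "reverse" ordering and the streamgraph's wiggle offset.

```javascript
// Each layer mirrors the tutorial's shape: { key, values: [{ count }] }.
// The keys and counts are made up for illustration.
const layers = [
  { key: "A", values: [{ count: 3 }, { count: 5 }] },
  { key: "B", values: [{ count: 2 }, { count: 1 }] },
  { key: "C", values: [{ count: 4 }, { count: 4 }] },
];

// Zero-offset stacking: at each index, a layer's baseline (count0) is the
// running total of the counts of all layers stacked before it.
function stackZero(layers) {
  const n = layers[0].values.length;
  const running = new Array(n).fill(0);
  for (const layer of layers) {
    layer.values.forEach((d, i) => {
      d.count0 = running[i];
      running[i] += d.count;
    });
  }
  return layers;
}

stackZero(layers);
const baselines = layers.map((l) => l.values.map((d) => d.count0));
console.log(baselines);
```

Each area can then be drawn from count0 up to count0 + count, which is how the area generator uses these values later on.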

Loading the Data

Ok, we need to load the JSON file that contains all our data.

This is done in D3 by using d3.json:

$ ->
  d3.json("data/requests.json", display)

Load the requests.json file, then call the display function with the results.

Here is display:

display = (error, rawData) ->
  # a quick way to manually select which calls to display. 
  # feel free to pick other keys and explore the less frequent call types.
  filterer = {"Heating": 1, "Damaged tree": 1, "Noise": 1, "Traffic signal condition": 1, "General construction":1, "Street light condition":1}

  data = rawData.filter((d) -> filterer[d.key] == 1)

  # a parser to convert our date string into a JS time object.
  parseTime = d3.time.format.utc("%x").parse

  # go through each data entry and set its
  # date and count property
  data.forEach (s) ->
    s.values.forEach (d) ->
      d.date = parseTime(d.date)
      d.count = parseFloat(d.count)

    # precompute the largest count value for each request type
    s.maxCount = d3.max(s.values, (d) -> d.count)

  data.sort((a,b) -> b.maxCount - a.maxCount)

  start()

The requests.json file has data for every request type, which would overload our visualization. Here we perform a basic filter to cherry pick some interesting types.

d3.time.format and the other time formatting capabilities of D3.js are great for converting strings into JavaScript Date objects. Here, our parser is expecting a date string in the %m/%d/%y format (which is what %x is shorthand for). We use this formatter when we iterate through the raw data to convert each string into a date and save it back in the object.
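If you are curious what such a parser has to do, here is a hand-rolled sketch in plain JavaScript for the month/day/two-digit-year strings in this dataset. It is not how D3 implements its parser. The parseUtcDate name is made up, and the sketch assumes every two-digit year belongs to the 2000s, which holds for this 2012 data but is not a general rule.

```javascript
// Parse "MM/DD/YY" into a Date at midnight UTC.
// Assumption: two-digit years are 20YY. Returns null for anything else.
function parseUtcDate(text) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{2})$/.exec(text);
  if (!match) {
    return null;
  }
  const month = Number(match[1]);
  const day = Number(match[2]);
  const year = 2000 + Number(match[3]);
  // Date.UTC expects a zero-based month.
  return new Date(Date.UTC(year, month - 1, day));
}

const parsed = parseUtcDate("10/14/12");
console.log(parsed.toISOString());
```

Parsing in UTC, as the tutorial's parser also does, keeps each day at the same instant regardless of the viewer's time zone.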

Then we call start() to get the display ball rolling.

The Start of the Visualization

Finally, we are ready to create the elements needed to get our charts going. Here is the start() function which sets up these elements:

start = () ->
  # x domain setup
  minDate = d3.min(data, (d) -> d.values[0].date)
  maxDate = d3.max(data, (d) -> d.values[d.values.length - 1].date)
  x.domain([minDate, maxDate])

  # I want the starting chart to emanate from the
  # middle of the display. 
  area.y0(height / 2)
    .y1(height / 2)

  # now we bind our data to create
  # a new group for each request type
  g = svg.selectAll(".request")
    .data(data)
    .enter()

  requests = g.append("g")
    .attr("class", "request")

  # add some paths that will
  # be used to display the lines and
  # areas that make up the charts
  requests.append("path")
    .attr("class", "area")
    .style("fill", (d) -> color(d.key))
    .attr("d", (d) -> area(d.values))

  requests.append("path")
    .attr("class", "line")
    .style("stroke-opacity", 1e-6)

  # default to streamgraph display
  streamgraph()

We still haven’t drawn anything, but we are getting close.

The data array is bound to the empty .request selection. Then, as mentioned in the data section above, a g element is created for each request type.

Finally, two path elements are appended to the group. One of which is for drawing the areas of the three charts. The other, with the class .line, will be used to draw lines in the regular area chart.

As a little detail, I’ve started the .area paths in the center of the display, so the first transition to the first chart will grow out from the center. Without this, the first transition will just cause the areas to appear immediately.

A Movement in Three Parts

Now that we have the basic visualization framework, we can focus on developing the code for each chart.

We want the user to be able to switch back and forth between all the graph styles, in a non-linear manner. To accomplish this, the functions implementing each chart need to accomplish 3 things:

  1. Recompute values that might get changed by switching to the other charts.
  2. Reset shared layouts and scales to handle the selected chart.
  3. Create a new transition on the elements making up each chart.

With this consistent structure in mind, let’s start coding up some charts.

Streamgraph

The initial streamgraph display

We will start with the streamgraph – because of my original dreams to emulate Wired, and because it is pretty easy to create with the stack layout.

streamgraph = () ->
  # 'wiggle' is streamgraph offset
  stack.offset("wiggle")
  stack(data)

  # reset our y domain and range so that it 
  # accommodates the highest value + offset
  y.domain([0, d3.max(data[0].values.map((d) -> d.count0 + d.count))])
    .range([height, 0])

  # setup the area generator to utilize
  # the count0 values created from the layout
  area.y0((d) -> y(d.count0))
    .y1((d) -> y(d.count0 + d.count))

  # here we create the transition
  t = svg.selectAll(".request")
    .transition()
    .duration(duration)
 
  # D3 will take care of the details of transitioning
  t.select("path.area")
    .style("fill-opacity", 1.0)
    .attr("d", (d) -> area(d.values))

It’s all a bit anticlimactic, right? Look at that. We didn’t even have to get our hands dirty with creating SVG paths. The area generator did it all for us. (The shape of the path is defined by the attribute d. See the MDN tutorial if you aren’t familiar with SVG paths.) Nor did we have to deal with any of the animation from current state to final streamgraph. The transition helped us out there. So what did we do?

The initial call to stack(data) causes the stack layout to run on our data. It’s set up to use wiggle as the offset, which is the offset to use for streamgraphs.

The y scale needs to be updated to ensure the tallest ‘stream’ is accounted for in its calculation.
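As a rough illustration of that update, here is a bare-bones linear scale in plain JavaScript whose domain is reset to the tallest baseline-plus-count value. It is a sketch, not D3's scale: linearScale is a made-up helper, and the layer values and the 400 pixel height are invented for the example.

```javascript
// A bare-bones linear scale: maps the interval [d0, d1] onto [r0, r1].
function linearScale(domain, range) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Made-up values for the top layer: baseline (count0) plus count.
const topLayer = [
  { count0: 10, count: 30 },
  { count0: 25, count: 75 },
  { count0: 5, count: 15 },
];
const height = 400; // hypothetical chart height in pixels

// The domain must reach the highest point any stream touches.
const tallest = Math.max(...topLayer.map((d) => d.count0 + d.count));

// The range is flipped because SVG's y axis grows downward.
const y = linearScale([0, tallest], [height, 0]);
console.log(tallest, y(0), y(tallest), y(50));
```

If the domain were left at its old maximum, any stream taller than that would be drawn off the top of the display.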

The last section of the streamgraph function is the transition. We create a new transition selection on the .request groups. Then we select the .area paths inside each group and set the path and opacity they should end up with, using the attr() and style() calls. (Again, check out that Transitions Guide for more clarity on how this works.)

D3 will interpolate the path’s values smoothly over the duration of the transition to end up at a nice looking streamgraph for our data. The great thing is that this same code will work for transitioning from the initial blank display as well as from the other chart types!

Stacked Area Chart

The stacked area chart provides a new view with little code.

I’m not going to go over the code for the stacked area chart, as it is nearly identical to the streamgraph.

The only real difference is that the offset used for the stack layout calculations is switched from wiggle to zero. This modifies the count0 values when the stack is executed on the data, which then adjusts the area paths to be stacked instead of streamed.

Area Chart

With the overlapping area chart, we reduce opacity to avoid obscuring ‘short’ areas.

Our last chart is a basic overlapping area chart. This one is a little different, as we won’t need to use the stack layout for area positioning. Also, we will finally get to use that .line path we created during the setup.

Here is the relevant code for this chart:

areas = () ->
  g = svg.selectAll(".request")

  # as there is no stacking in this chart, the maximum
  # value of the input domain is simply the maximum count value,
  # which we precomputed in the display function 
  y.domain([0, d3.max(data.map((d) -> d.maxCount))])
    .range([height, 0])

  # the baseline of this chart will always
  # be at the bottom of the display, so we
  # can set y0 to a constant.
  area.y0(height)
    .y1((d) -> y(d.count))

  line.y((d) -> y(d.count))

  t = g.transition()
    .duration(duration)

  # partially transparent areas
  t.select("path.area")
    .style("fill-opacity", 0.5)
    .attr("d", (d) -> area(d.values))

  # show the line
  t.select("path.line")
    .style("stroke-opacity", 1)
    .attr("d", (d) -> line(d.values))

The main difference between this chart and the previous two is that we are not using the count0 values in any of the area layouts. Instead, the bottom line of the areas is set to the height of the visualization, so it will always stay at the bottom of the display.

In the transition, we set the opacity of the area paths to be 0.5 so that all the areas are still visible. Then we do another selection to set the .line path so that it appears as the top outline of our areas. (The .line is adjusted in the other charts too, just not shown in these snippets. There it is always set to be invisible in the transition.)

Switching Back and Forth

As each of these charts is contained in its own function, transitioning between charts becomes as easy as just executing the right function.

Here is the code that does just that when the toggle button is pushed:

transitionTo = (name) ->
  if name == "stream"
    streamgraph()
  if name == "stack"
    stackedAreas()
  if name == "area"
    areas()

Each of these functions creates and starts a new transition, meaning switching to a new chart will halt any transition currently running, and then immediately start the new transition from the current element locations.
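As a possible variation, the chain of if statements could be swapped for a lookup table. Here is a sketch of that in plain JavaScript. The chart functions are stand-ins that only record their names, since the real ones need the SVG and data from the tutorial, and the boolean return value is an addition for this illustration.

```javascript
// Stand-ins for the real chart functions: they only record which one ran.
const calls = [];
const charts = {
  stream: () => calls.push("streamgraph"),
  stack: () => calls.push("stackedAreas"),
  area: () => calls.push("areas"),
};

// Look up the chart by name. Unknown names are ignored rather than throwing,
// matching how the if-chain above silently does nothing for them.
function transitionTo(name) {
  const known = Object.prototype.hasOwnProperty.call(charts, name);
  if (!known) {
    return false;
  }
  charts[name]();
  return true;
}

transitionTo("stack");
transitionTo("area");
const handled = transitionTo("pie");
console.log(calls, handled);
```

Adding a fourth chart type would then mean adding one entry to the table instead of another if statement.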

The Little Details

There are some finishing touches that I’ve made to the visualization that I won’t go into too much depth on. D3’s axis component was used to create the background lines marking every other day.

A little legend, inspired by the legend in the Manifest Destiny visualization. It is also an SVG element and the mouseover event causes a transition that shifts the key into view. The details are in the code.

Shameless plug: Check out my tutorial on small multiples if you want to take a deeper look into the implementation of this great piece.

Finally, as I mentioned above, the toggle button to switch between charts was created using Bootstrap. Check out the button documentation for the details.

Wrapping Up

Well hopefully now you have a better grasp on using transitions to switch between different displays for your data. We can really see the power of D3 in how little code it takes to create these different charts and interactively move between them.

Thanks again to Mike Bostock, the creator of D3. His presentation on flexible transitions served as the main inspiration for this tutorial.

Now get out there and start transitioning! Let me know when you create your own face-melting (and functional) animations.

Aug 022012
 

How to Make an Interactive Network Visualization

Networks! They are all around us. The universe is filled with systems and structures that can be organized as networks. Recently, we have seen them used to convict criminals, visualize friendships, and even to describe cereal ingredient combinations. We can understand their power to describe our complex world from Manuel Lima's wonderful talk on organized complexity. Now let's learn how to create our own.

In this tutorial, we will focus on creating an interactive network visualization that will allow us to get details about the nodes in the network, rearrange the network into different layouts, and sort, filter, and search through our data.

In this example, each node is a song. The nodes are sized based on popularity, and colored by artist. Links indicate two songs are similar to one another.

Try out the visualization on different songs to see how the different layouts and filters look with the different graphs.

Technology

This visualization is a JavaScript based web application written using the powerful D3 visualization library. jQuery is also used for some DOM element manipulation. Both frameworks are included in the js/libs directory of the source code.

The code itself is actually written in CoffeeScript, a little language that is easy to learn, and compiles down to regular JavaScript. Why use CoffeeScript? I find that the reduced syntax makes it easier to read and understand what the code is doing. While it may seem a bit intimidating to learn a whole new 'language', there are just a few things you need to know about CoffeeScript to be a pro.

If you hate CoffeeScript, you can always compile the code to JavaScript and start there.

Quick CoffeeScript Notes

Functions

First and foremost, this is what a function looks like:

functionName = (input) ->
  results = input * 2
  results

So the input parameters are inside the parentheses. The -> indicates the start of the implementation. If a function's implementation is super simple, this can all go on one line.

cube = (x) -> x * x * x

A function returns the last thing executed, so you typically don't need a return statement, but you can use one if you like.
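For readers more at home in JavaScript, here is roughly what the two functions above turn into. This is a hand-written sketch rather than actual compiler output, but it shows the implicit return becoming an explicit one.

```javascript
// Hand-written JavaScript equivalents of the CoffeeScript functions above
// (a sketch, not actual compiler output).
var functionName = function (input) {
  var results = input * 2;
  // CoffeeScript returns the last expression implicitly; JavaScript has to say so.
  return results;
};

var cube = function (x) {
  return x * x * x;
};

console.log(functionName(4)); // 8
console.log(cube(3));         // 27
```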

Indentation matters

The other main syntactical surprise is that, similar to Python, indentation is significant and used to denote code hierarchy and scope. We can see an example of this in the function above: The implementation is indented.

In practice, this isn't too big of an issue: just hit the Tab key instead of using curly braces – {, } – and you are set.

Semicolons and Parentheses

Taking a page from Ruby, semicolons are not needed and should be avoided in CoffeeScript.

Also, parentheses are optional in many places. While this can get confusing, I typically use parentheses unless they are around a multi-line function that is an input argument into another function. If that doesn't make sense, don't worry – the code below should still be easy to follow.

For other interesting details, spend a few minutes with the CoffeeScript documentation. You won't be disappointed.

In-browser CoffeeScript

Using CoffeeScript is made simpler by the fact that our CoffeeScript code can be compiled into JavaScript right in the browser. Our index.html includes the CoffeeScript compiler, which in turn compiles to JavaScript any script listed as text/coffeescript:

<script src="js/libs/coffee-script.js"></script>
<script type="text/coffeescript" src="coffee/vis.coffee"></script>

It's just that simple. When vis.coffee loads, it will be parsed and compiled into JavaScript before running. For production, we would want to do this compilation beforehand, but this lets us get started right away.

Setting up the Network

Ok, let's get started with coding this visualization. We are going to be using a simplified version of Mike Bostock's (the creator of D3) reusable chart recommendations to package our implementation. What this means for us is that our main network code will be encapsulated in a function, with getters and setters to allow interaction with the code from outside. You don't have to organize your code like this if it seems like too much work, but the encapsulation makes it easier to change your input data later. Here is the general framework:

Network = () ->
  width = 960
  height = 800
  # ...
  network = (selection, data) ->
    # main implementation

  update = () ->
    # private function

  network.toggleLayout = (newLayout) ->
    # public function

  return network

So the Network function defines a closure which scopes all the variables used in the visualization, like width and height. The network function is where the main body of the code goes, and is returned by Network at the end of the implementation.

Functions defined on network, like network.toggleLayout(), can be called externally, while functions like update are 'private' functions and can only be called by other functions inside Network. You can think of it as the same abstraction that classes provide us in object-oriented programming. Here is how we create a new network:

$ ->
  myNetwork = Network()
  # ...

  d3.json "data/songs.json", (json) ->
    myNetwork("#vis", json)

So here, myNetwork is the value Network() returns – namely the function called network. Then we call this network function, passing in the id of the div where the visualization will live, and the data to visualize.
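If the closure idea is easier to follow in plain JavaScript, here is a minimal sketch of the same pattern with no D3 involved. The update counter and the network.layout() getter are hypothetical additions, there only so the private state can be inspected from the outside.

```javascript
// Sketch of the closure pattern in plain JavaScript.
// The body of update() and the layout() getter are placeholders for illustration.
function Network() {
  // private state: only visible to functions defined inside Network
  var layout = "force";
  var updates = 0;

  // main entry point, returned to the caller
  function network(selection, data) {
    update();
    return network;
  }

  // private helper: cannot be reached from outside the closure
  function update() {
    updates += 1;
  }

  // public function attached to the returned function
  network.toggleLayout = function (newLayout) {
    layout = newLayout;
    update();
    return network;
  };

  // hypothetical getter, added only so the example can be inspected
  network.layout = function () {
    return layout;
  };

  return network;
}

var myNetwork = Network();
myNetwork("#vis", { nodes: [], links: [] });
myNetwork.toggleLayout("radial");
console.log(myNetwork.layout());      // "radial"
console.log(typeof myNetwork.update); // "undefined"
```

Each call to Network() gets its own copy of the private variables, so two networks on the same page would not step on each other.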

Network Data Format

From above, we see we are passing in the data from songs.json to be visualized. How is this data organized?

Since a network defines connections between nodes as well as the data contained in the nodes themselves, it would be difficult to define as a simple 'spreadsheet' or table of values. Instead, we use JSON to capture this structure with as little overhead as possible.

The input data for this visualization is expected to follow this basic structure:

{
  "nodes": [
    {
      "name": "node 1",
      "artist": "artist name",
      "id": "unique_id_1",
      "playcount": 123
    },
    {
      "name": "node 2",
      # ...
    }
  ],
  "links": [
    {
      "source": "unique_id_1",
      "target": "unique_id_2"
    },
    {
      # ...
    }
  ]
}

This is a JSON object (just like a JavaScript object). If you haven't looked at JSON before, then let's take a look now! The format is pretty straightforward (as it's just JavaScript).

This object requires two names, nodes and links. Both of these store arrays of other objects. nodes is an array of nodes. Each node object needs some fields used in the visualization as well as an id which uniquely identifies that node.

The objects in the links array just need source and target. Both of these point to the id's of nodes.

The traditional/default method in D3 of defining a link's source and target is to use their position in the nodes array as the value. Since we are going to be filtering and rearranging these nodes, I thought it would be best to use values independent of where the nodes are stored. We will see how to use these id's in a bit.

Moved by the Force

We will start with the default force-directed layout that is built into D3. Force-based network layouts are essentially little physics simulations. Each node has a force associated with it (hence the name), which can repel (or attract) other nodes. Links between nodes act like springs to draw them back together. These pushing and pulling forces work on the network over a number of iterations, and eventually the system finds an equilibrium.

Force-directed layouts usually result in pretty good looking network visualizations, which is why they are so popular. D3's implementation does a lot of work to make the physics simulation efficient, so it stays fast in the browser.

Force Directed Layout: example of force-directed layout from our song network demo

To start, as is typical with most D3 visualizations, we need to create an svg element in our page to render to. Let's walk through how the network() function performs this action.

First we declare a bunch of variables that will be available to us inside the Network closure. They are 'global' and available anywhere in Network. Note that our D3 force directed layout is one such global variable called force.

Inside our network function, we start by tweaking the input data. Then we use D3 to append an svg element to the input selection element. linksG and nodesG are group elements that will contain the individual lines and circles used to create the links and nodes. Grouping related elements is a pretty common strategy when using D3. Here, we create the linksG before the nodesG because we want the nodes to sit on top of the links.

The update function is where most of the action happens, so let's look at it now.

  # The update() function performs the bulk of the
  # work to setup our visualization based on the
  # current layout/sort/filter.
  #
  # update() is called everytime a parameter changes
  # and the network needs to be reset.
  update = () ->
    # filter data to show based on current filter settings.
    curNodesData = filterNodes(allData.nodes)
    curLinksData = filterLinks(allData.links, curNodesData)

    # sort nodes based on current sort and update centers for
    # radial layout
    if layout == "radial"
      artists = sortedArtists(curNodesData, curLinksData)
      updateCenters(artists)

    # reset nodes in force layout
    force.nodes(curNodesData)

    # enter / exit for nodes
    updateNodes()

    # always show links in force layout
    if layout == "force"
      force.links(curLinksData)
      updateLinks()
    else
      # reset links so they do not interfere with
      # other layouts. updateLinks() will be called when
      # force is done animating.
      force.links([])
      # if present, remove them from svg
      if link
        link.data([]).exit().remove()
        link = null

    # start me up!
    force.start()

The final version of this visualization will have filtering and sorting capabilities, so update starts by filtering the nodes and links of the total dataset, and then sorts them if necessary. We will come back and hit these functions later. For the basic force-directed layout without all these bells and whistles to come, all we really care about is:

force.nodes(curNodesData)
updateNodes()

force.links(curLinksData)
updateLinks()

The force's nodes array is set to our currently displayed nodes, erasing any previous nodes in the simulation. Then we update the visual display of the nodes in the visualization. This same pattern is then followed for the links.

It is important to realize that in D3, the nodes and links in the force layout don't automatically get visualized in any way. This is to say that there is a separation between the force-directed physics simulation and any visual mapped to that simulation.

Remember: the force layout doesn't add circles and lines for you. It just tells you where to put them.

To create this visual representation associated with this force-directed simulation, we will need to bind to the same data being used, which we do in updateNodes and updateLinks:

  # enter/exit display for nodes
  updateNodes = () ->
    node = nodesG.selectAll("circle.node")
      .data(curNodesData, (d) -> d.id)

    node.enter().append("circle")
      .attr("class", "node")
      .attr("cx", (d) -> d.x)
      .attr("cy", (d) -> d.y)
      .attr("r", (d) -> d.radius)
      .style("fill", (d) -> nodeColors(d.artist))
      .style("stroke", (d) -> strokeFor(d))
      .style("stroke-width", 1.0)

    node.on("mouseover", showDetails)
      .on("mouseout", hideDetails)

    node.exit().remove()

  # enter/exit display for links
  updateLinks = () ->
    link = linksG.selectAll("line.link")
      .data(curLinksData, (d) -> "#{d.source.id}_#{d.target.id}")
    link.enter().append("line")
      .attr("class", "link")
      .attr("stroke", "#ddd")
      .attr("stroke-opacity", 0.8)
      .attr("x1", (d) -> d.source.x)
      .attr("y1", (d) -> d.source.y)
      .attr("x2", (d) -> d.target.x)
      .attr("y2", (d) -> d.target.y)

    link.exit().remove()

Finally, some D3 visualization code! Looking at updateNodes, we select all circle.node elements in our nodesG group (which, at the very start of the execution of this code, will be empty). Then we bind our filtered node data to this selection, using the data function and indicating that data should be identified by its id value.

The enter() function provides an access point to every element in our data array that does not have a circle associated with it. When append is called on this selection, it creates a new circle element for each of these representation-less data points. The attr and style functions set values for each one of these newly formed circles. When a function is used as the second parameter, like:

.attr("r", (d) -> d.radius)

The d is the data associated with the visual element, which is passed in automatically by D3. So with just a few lines of code we create and style all the circles we need.

Because we will be filtering our data to add and remove nodes, there will be times where there is a circle element that exists on screen, but there is no data behind it. This is where the exit() function comes into play. exit() provides a selection of elements which are no longer associated with data. Here we simply remove them using the remove function.

If the concepts of enter() and exit() are still not clicking, check out the Thinking With Joins and three little circles tutorials. These selections are a big part of D3, so it is worth having a feel for what they do.
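To make the bookkeeping concrete, here is a small hand-rolled sketch of a keyed join. It is not D3's implementation, and the join function and sample ids are made up for illustration. It only shows which data would land in enter, which elements would land in exit, and which would be left alone.

```javascript
// Hand-rolled sketch of a keyed data join (not D3's implementation).
// `existing` stands in for the keys of circles already on screen,
// `data` for the filtered nodes we now want to show.
function join(existing, data, key) {
  var onScreen = new Set(existing);
  var wanted = new Set(data.map(key));
  return {
    // data with no element yet: these would get a new circle
    enter: data.filter(function (d) { return !onScreen.has(key(d)); }).map(key),
    // data that already has an element: left in place
    update: data.filter(function (d) { return onScreen.has(key(d)); }).map(key),
    // elements whose data is gone: these would be removed
    exit: existing.filter(function (k) { return !wanted.has(k); })
  };
}

var result = join(
  ["a", "b", "c"],
  [{ id: "b" }, { id: "c" }, { id: "d" }],
  function (d) { return d.id; }
);
console.log(result); // enter: ["d"], update: ["b", "c"], exit: ["a"]
```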

Configuring the Force

There has to be a ton of Star Wars jokes I should be making... but I can't think of any.

In order to get the force-directed graph working the way we want, we need to configure the force layout a bit more. This will occur in the setLayout function. For the force-directed layout, our force configuration is pretty simple:

force.on("tick", forceTick)
  .charge(-200)
  .linkDistance(50)

Here, charge is the repulsion value for nodes pushing away from one another and linkDistance is the target length of each link, which the layout treats as a soft constraint rather than a hard maximum. These values allow the nodes to spread out a bit.

The forceTick function will be called each iteration (aka 'tick') of the simulation. This is where we need to move our visual representations of the nodes and links of the network to where they are in the simulation after this tick. Here is forceTick:

  # tick function for force directed layout
  forceTick = (e) ->
    node
      .attr("cx", (d) -> d.x)
      .attr("cy", (d) -> d.y)

    link
      .attr("x1", (d) -> d.source.x)
      .attr("y1", (d) -> d.source.y)
      .attr("x2", (d) -> d.target.x)
      .attr("y2", (d) -> d.target.y)

Pretty straightforward. The D3 simulation is modifying the x and y values of each node during the simulation. Thus, for each tick, we simply need to move the circles representing our nodes to where x and y are. The links can be moved based on where their source and target nodes are.

Setting Up Data

Speaking of source and target, we need to go back and see how to deal with our initial data where we were using the id of a node in place of the node's index in the nodes array. Here is setupData which is the very first thing executed in our network code:

  # called once to clean up raw data and switch links to
  # point to node instances
  # Returns modified data
  setupData = (data) ->
    # initialize circle radius scale
    countExtent = d3.extent(data.nodes, (d) -> d.playcount)
    circleRadius = d3.scale.sqrt().range([3, 12]).domain(countExtent)

    data.nodes.forEach (n) ->
      # set initial x/y to values within the width/height
      # of the visualization
      n.x = Math.floor(Math.random() * width)
      n.y = Math.floor(Math.random() * height)
      # add radius to the node so we can use it later
      n.radius = circleRadius(n.playcount)

    # id's -> node objects
    nodesMap  = mapNodes(data.nodes)

    # switch links to point to node objects instead of id's
    data.links.forEach (l) ->
      l.source = nodesMap.get(l.source)
      l.target = nodesMap.get(l.target)

      # linkedByIndex is used for link sorting
      linkedByIndex["#{l.source.id},#{l.target.id}"] = 1

    data

setupData is doing a few things for us, so let's go through it all. First, we are using a d3.scale to specify the possible values that the circle radii can take, based on the extent of the playcount values. Then we iterate through all the nodes, setting their radius values, as well as setting their x and y values to be within the current visualization size. Importantly, nodes are not automatically sized by any particular data value they contain. We are just adding radius to our data so we can pull it out in updateNodes. The x and y initialization is just to reduce the time it takes for the force-directed layout to settle down.
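To see what the square root scale does to the numbers, here is a hand-rolled stand-in for it in plain JavaScript. The sqrtScale helper and the play counts are made up for illustration; only the [3, 12] range comes from the code above.

```javascript
// Hand-rolled stand-in for a square root scale with a two-value domain and range.
// The play counts are made up; only the [3, 12] range comes from the tutorial.
function sqrtScale(domain, range) {
  var d0 = Math.sqrt(domain[0]);
  var d1 = Math.sqrt(domain[1]);
  return function (value) {
    // position of value between the domain ends, measured in square-root space
    var t = (Math.sqrt(value) - d0) / (d1 - d0);
    return range[0] + t * (range[1] - range[0]);
  };
}

var circleRadius = sqrtScale([0, 10000], [3, 12]);
console.log(circleRadius(0));     // 3
console.log(circleRadius(2500));  // 7.5
console.log(circleRadius(10000)); // 12
```

A song with a quarter of the maximum play count lands halfway through the radius range, which keeps the most popular songs from dwarfing everything else.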

Finally, we map node id's to node objects and then replace the source and target in our links with the node objects themselves, instead of the id's that were in the raw data. This allows D3's force layout to work correctly, and makes it possible to add/remove nodes without worrying about getting our nodes array and links array out of order.
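The body of mapNodes isn't shown here, so the following is only a guess at its shape, using a plain JavaScript Map in place of d3.map(). The sample data is made up, but the link rewriting mirrors what setupData does.

```javascript
// Sketch of the id -> node lookup, using a plain Map in place of d3.map().
// This mapNodes is an assumption about what the helper does; the data is made up.
function mapNodes(nodes) {
  var nodesMap = new Map();
  nodes.forEach(function (n) { nodesMap.set(n.id, n); });
  return nodesMap;
}

var data = {
  nodes: [
    { id: "unique_id_1", name: "node 1" },
    { id: "unique_id_2", name: "node 2" }
  ],
  links: [{ source: "unique_id_1", target: "unique_id_2" }]
};

var nodesMap = mapNodes(data.nodes);
data.links.forEach(function (l) {
  // replace each id string with the node object it refers to
  l.source = nodesMap.get(l.source);
  l.target = nodesMap.get(l.target);
});

console.log(data.links[0].source === data.nodes[0]); // true
console.log(data.links[0].target.name);              // "node 2"
```

Because each link now holds references to the node objects themselves, filtering or reordering the nodes array doesn't break the links.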

Radializing the Force

Now we have everything needed to display our network in a nice looking, fast, force-directed layout. It might have been a lot of explanation, but we did it in a pretty small amount of code.

The force-directed layout is a great start, but also a bit limiting. Sometimes you want to see your network in a different layout – to find patterns or trends that aren't readily apparent in a force-directed one. In fact, it would be really cool if we could toggle between different layouts easily – and allow our users to see the data in a number of different formats. So, let's do that!

Here's a basic idea: we are going to hijack D3's force-directed layout and tell it where we want the nodes to end up. This way, D3 will still take care of all the physics and animations behind the scenes to make the transitions between layouts look good without too much work. But we will get to influence where the nodes go, so their movements will no longer be purely based on the underlying simulation.

Radial Network Layout: radial layout example, grouping song nodes by artist

To aid in our force-jacking, I've created a separate entity to help position our nodes in a circular fashion called RadialPlacement. It's not really a full-on layout, but just tries to encapsulate the complexity of placing groups of nodes. Essentially, we will provide it with an array of keys. It will calculate radial locations for each of these keys. Then we can use these locations to position our nodes in a circular fashion (assuming we can match up our nodes with one of the input keys).

Always force-jack with extreme caution.

RadialPlacement is a little clumsy looking, but gets the job done. The bulk of the work occurs in setKeys and radialLocation :

# Help with the placement of nodes
RadialPlacement = () ->
  # stores the key -> location values
  values = d3.map()
  # how much to separate each location by
  increment = 20
  # how large to make the layout
  radius = 200
  # where the center of the layout should be
  center = {"x":0, "y":0}
  # what angle to start at
  start = -120
  current = start

  # Given a set of keys, perform some
  # magic to create a two ringed radial layout.
  # Expects radius, increment, and center to be set.
  # If there are a small number of keys, just make
  # one circle.
  setKeys = (keys) ->
    # start with an empty values
    values = d3.map()

    # number of keys to go in first circle
    firstCircleCount = 360 / increment

    # if we don't have enough keys, modify increment
    # so that they all fit in one circle
    if keys.length < firstCircleCount
      increment = 360 / keys.length

    # set locations for inner circle
    firstCircleKeys = keys.slice(0,firstCircleCount)
    firstCircleKeys.forEach (k) -> place(k)

    # set locations for outer circle
    secondCircleKeys = keys.slice(firstCircleCount)

    # setup outer circle
    radius = radius + radius / 1.8
    increment = 360 / secondCircleKeys.length

    secondCircleKeys.forEach (k) -> place(k)

  # Gets a new location for input key
  place = (key) ->
    value = radialLocation(center, current, radius)
    values.set(key,value)
    current += increment
    value

  # Given an center point, angle, and radius length,
  # return a radial position for that angle
  radialLocation = (center, angle, radius) ->
    x = (center.x + radius * Math.cos(angle * Math.PI / 180))
    y = (center.y + radius * Math.sin(angle * Math.PI / 180))
    {"x":x,"y":y}

Hopefully the comments help walk you through the code. In setKeys our goal is to break up the total set of keys into an inner circle and an outer circle. We use slice to pull apart the array, after we figure out how many locations can fit in the inner circle.

radialLocation does the actual polar coordinate conversion to get a radial location. It is called from place, which is in turn called from setKeys.
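Here is the same polar conversion in plain JavaScript, tried on a few easy angles. The keys and the 90 degree spacing are made up for illustration. The results are rounded for printing because cosine and sine leave tiny floating point leftovers.

```javascript
// JavaScript version of radialLocation, with angles in degrees.
function radialLocation(center, angle, radius) {
  var radians = angle * Math.PI / 180;
  return {
    x: center.x + radius * Math.cos(radians),
    y: center.y + radius * Math.sin(radians)
  };
}

// Place three made-up keys 90 degrees apart around the origin.
var center = { x: 0, y: 0 };
["a", "b", "c"].forEach(function (key, i) {
  var p = radialLocation(center, i * 90, 200);
  console.log(key, Math.round(p.x), Math.round(p.y));
});
// a 200 0
// b 0 200
// c -200 0
```

Keep in mind that y grows downward in SVG, so increasing angles move clockwise on screen.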

Toggling Between Layouts

Toggle Layout: lets the user explore different layouts interactively

With RadialPlacement in tow, we can now create a toggle between our force-directed layout and a new radial layout. The radial layout will use the song's artist field as keys so the nodes will be grouped by artist.

In the update function described above, we saw a mention of the radial layout:

    if layout == "radial"
      artists = sortedArtists(curNodesData, curLinksData)
      updateCenters(artists)

Here, sortedArtists provides an array of artist values sorted by either the number of songs each artist has, or the number of links. Let's focus on updateCenters, which deals with our radial layout:

  updateCenters = (artists) ->
    if layout == "radial"
      groupCenters = RadialPlacement().center({"x":width/2, "y":height / 2 - 100})
        .radius(300).increment(18).keys(artists)

We can see that we just pass our artists array to the RadialPlacement function. It calculates locations for all keys and stores them until we want to position our nodes.

Now we just need to work on this node positioning and move them towards their artist's location. To do this, we change the tick function for the D3 force instance to use radialTick when our radial layout is selected:

  # tick function for radial layout
  radialTick = (e) ->
    node.each(moveToRadialLayout(e.alpha))

    node
      .attr("cx", (d) -> d.x)
      .attr("cy", (d) -> d.y)

    if e.alpha < 0.03
      force.stop()
      updateLinks()

  # Adjusts x/y for each node to
  # push them towards appropriate location.
  # Uses alpha to dampen effect over time.
  moveToRadialLayout = (alpha) ->
    k = alpha * 0.1
    (d) ->
      centerNode = groupCenters(d.artist)
      d.x += (centerNode.x - d.x) * k
      d.y += (centerNode.y - d.y) * k

We can see that radialTick calls moveToRadialLayout, which simply looks up the location for the node's artist from the previously computed groupCenters. It then moves the node towards this center.

This movement is dampened by the alpha parameter of the force layout. alpha represents the cooling of the physics simulation as it reaches equilibrium, so it gets smaller as the animation continues. This dampening allows the nodes' repelling forces to affect their positions as the simulation nears stopping, which means the nodes will be allowed to push away from each other and cause a nice looking clustering effect without node overlap.

We also use the alpha value inside radialTick to stop the simulation after it has cooled enough and to have an easy opportunity to redisplay the links.
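To get a feel for the dampening, here is the same nudge applied outside of D3. The moveToward helper, the positions, and the alpha of 0.1 are made up for illustration; the 0.1 factor and the 0.03 cutoff come from the code above.

```javascript
// One tick of the "pull toward the group center" step, outside of D3.
// The 0.1 factor matches the tutorial; positions and alpha values are illustrative.
function moveToward(center, alpha) {
  var k = alpha * 0.1;
  return function (d) {
    d.x += (center.x - d.x) * k;
    d.y += (center.y - d.y) * k;
  };
}

var center = { x: 100, y: 50 };
var early = { x: 0, y: 0 };
var late = { x: 0, y: 0 };

moveToward(center, 0.1)(early);  // warmer simulation: k is about 0.01
moveToward(center, 0.03)(late);  // at the stop threshold: k is about 0.003

console.log(early); // moves about 1% of the way to the center
console.log(late);  // moves about 0.3% of the way
```

The pull gets weaker on every tick, so by the time the simulation stops, the charge between nodes has as much say in the final positions as the group centers do.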

Because the nodes are different sizes, we want them to have different levels of repelling force to push on each other with. Luckily, the force's charge function can itself take a function which will get the current node's data to calculate the charge. This means we can base the charge off of the node's radius, as we've stored it in the data:

charge = (node) -> -Math.pow(node.radius, 2.0) / 2

The specific ratio is just based on experimentation and tweaking. You are welcome to play around with what other effects you can come up with for charge.
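As a quick check on the range of values this gives, here is the same function in JavaScript, applied to the smallest and largest radii that the [3, 12] radius scale can produce.

```javascript
// The charge function from above, tried on the extremes of the radius scale.
var charge = function (node) {
  return -Math.pow(node.radius, 2.0) / 2;
};

console.log(charge({ radius: 3 }));  // -4.5
console.log(charge({ radius: 12 })); // -72
```

So the biggest circles push sixteen times harder than the smallest ones, which gives them the extra elbow room they need.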

Filter and Sort

Filtering Example: force-directed and radial layouts after filtering for obscure songs

The filter and sort functionality works how you would expect: we check the network's current settings and perform operations on the nodes and links based on these settings. Let's look at the filter functionality that deals with popular and obscure songs, as it uses a bit of D3 array functionality:

  # Removes nodes from input array
  # based on current filter setting.
  # Returns array of nodes
  filterNodes = (allNodes) ->
    filteredNodes = allNodes
    if filter == "popular" or filter == "obscure"
      playcounts = allNodes.map((d) -> d.playcount).sort(d3.ascending)
      # get median value
      cutoff = d3.quantile(playcounts, 0.5)

      filteredNodes = allNodes.filter (n) ->
        if filter == "popular"
          n.playcount > cutoff
        else if filter == "obscure"
          n.playcount <= cutoff
    filteredNodes

filterNodes defaults to returning the entire node array. If popular or obscure is selected, it uses D3's quantile function to get the median value. Then it filters the node array based on this cutoff. I'm not sure if the median value of playcounts is a good indicator of the difference between 'popular' and 'obscure', but it gives us an excuse to use some of the nice data wrangling built into D3.
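If you want to poke at the median split without loading D3, here is a standalone sketch. The quantile helper interpolates linearly between neighboring values, which I believe matches what d3.quantile does. The play counts are made up, and filter is passed in as an argument here instead of living in the closure.

```javascript
// Median split without D3. quantile() interpolates linearly between neighbors.
// The play counts are made up for illustration.
function quantile(sorted, p) {
  var h = (sorted.length - 1) * p;
  var lo = Math.floor(h);
  var hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (h - lo);
}

function filterNodes(allNodes, filter) {
  if (filter !== "popular" && filter !== "obscure") {
    return allNodes;
  }
  var playcounts = allNodes
    .map(function (d) { return d.playcount; })
    .sort(function (a, b) { return a - b; });
  // get median value
  var cutoff = quantile(playcounts, 0.5);
  return allNodes.filter(function (n) {
    return filter === "popular" ? n.playcount > cutoff : n.playcount <= cutoff;
  });
}

var getCount = function (n) { return n.playcount; };
var nodes = [10, 40, 20, 30].map(function (c) { return { playcount: c }; });

console.log(filterNodes(nodes, "popular").map(getCount)); // 40 and 30
console.log(filterNodes(nodes, "obscure").map(getCount)); // 10 and 20
```

With four songs the median falls between 20 and 30, so each filter keeps half of the nodes.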

Bonus: Search

Search: search bar, simple searching made easy

Search is a feature that is often needed in networks and other visualizations, but often lacking. Given a search term, one way to make a basic search that highlights the matched nodes would be:

  # Public function to update highlighted nodes
  # from search
  network.updateSearch = (searchTerm) ->
    searchRegEx = new RegExp(searchTerm.toLowerCase())
    node.each (d) ->
      element = d3.select(this)
      match = d.name.toLowerCase().search(searchRegEx)
      if searchTerm.length > 0 and match >= 0
        element.style("fill", "#F38630")
          .style("stroke-width", 2.0)
          .style("stroke", "#555")
        d.searched = true
      else
        d.searched = false
        element.style("fill", (d) -> nodeColors(d.artist))
          .style("stroke-width", 1.0)

We just create a regular expression out of the search, then compare it to the value in the nodes that we want to search on. If there is a match, we highlight the node. Nothing spectacular, but it's a start to a must-have feature in network visualizations.
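One thing to watch for: a search term goes straight into RegExp, so typing something like an open parenthesis will throw an error instead of matching. Here is a sketch of one way around that, escaping the term first so it is matched literally. The escapeRegExp and matches helpers and the song name are made up for illustration, not part of the original code.

```javascript
// Escape regex metacharacters so the search term is matched literally.
// escapeRegExp and matches are hypothetical helpers; the song name is made up.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function matches(name, searchTerm) {
  if (searchTerm.length === 0) {
    return false;
  }
  var searchRegEx = new RegExp(escapeRegExp(searchTerm.toLowerCase()));
  return name.toLowerCase().search(searchRegEx) >= 0;
}

console.log(matches("Yesterday (Live)", "(live")); // true
console.log(matches("Yesterday (Live)", "help"));  // false
console.log(matches("Yesterday (Live)", ""));      // false
```

If you would rather let people type real regular expressions, you could skip the escaping and wrap the RegExp construction in a try/catch instead.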

We can see that updateSearch is a public function, so how do we connect it to the UI on our network visualization code?

Wiring it Up

There are a lot of ways we could connect our buttons and other UI to the network functionality. I've tried to keep things simple here and just have a separate section for each button group. Here is the layout toggling code (the other button groups use very similar code):

  d3.selectAll("#layouts a").on "click", (d) ->
    newLayout = d3.select(this).attr("id")
    activate("layouts", newLayout)
    myNetwork.toggleLayout(newLayout)

So we simply activate the clicked button and then call into the network closure to switch layouts. The activate function just adds the active class to the right button.

Our search is pretty similar:

  $("#search").keyup () ->
    searchTerm = $(this).val()
    myNetwork.updateSearch(searchTerm)

It just uses jQuery to watch for a key-up event, then re-runs the updateSearch function.

Thanks and Goodnight

Hopefully we have hit all the highlights of this visualization. Interactive networks are a powerful visual, and this code should serve as a jumping off point for your own amazing network visualizations.

Other placement functions could be easily developed for more interesting layouts, like spirals, sunflowers, or grids. Filtering and sorting could be extended in any number of ways to get more insight into your particular dataset. Finally, labelling could be added to each node to see what is present without mousing over.

I hope you enjoyed this walk-through, and I can't wait to see your own networks!

May 162012
 

How to visualize distributions

There are a lot of ways to show distributions, but for the purposes of this tutorial, I'm only going to cover the more traditional plot types like histograms and box plots. Otherwise, we could be here all night. Plus the basic distribution plots aren't exactly well-used as it is.

Before you get into plotting in R though, you should know what I mean by distribution. It's basically the spread of a dataset. For example, the median of a dataset is the half-way point. Half of the values are less than the median, and the other half are greater than. That's only part of the picture.

What happens in between the maximum value and median? Do the values cluster towards the median and quickly increase? Are there a lot of values clustered towards the maximums and minimums with nothing in between? Sometimes the variation in a dataset is a lot more interesting than just mean or median. Distribution plots help you see what's going on.

Want more? Google and Wikipedia are your friend.

Anyways, that's enough talking. Let's make some charts.

If you don't have R installed yet, do that now.

Box-and-Whisker Plot

This old standby was created by statistician John Tukey in the age of graphing with pencil and paper. I wrote a short guide on how to read them a while back, but you basically have the median in the middle, upper and lower quartiles, and upper and lower fences. If there are outliers more than 1.5 times the interquartile range above the upper quartile or below the lower quartile, they are shown with dots.

The method might be old, but they still work for showing basic distribution. Obviously, because only a handful of values are shown to represent a dataset, you do lose the variation in between the points.

To get started, load the data in R. You'll use state-level crime data from the Chernoff faces tutorial.

# Load crime data
crime <- read.csv("http://datasets.flowingdata.com/crimeRatesByState-formatted.csv")

Remove the District of Columbia from the loaded data. Its city-like makeup tends to throw everything off.

# Remove Washington, D.C.
crime.new <- crime[crime$state != "District of Columbia",]

Oh, and you don't need the national averages for this tutorial either.

# Remove national averages
crime.new <- crime.new[crime.new$state != "United States ",]

Now all you have to do to make a box plot for say, robbery rates, is plug the data into boxplot().

# Box plot
boxplot(crime.new$robbery, horizontal=TRUE, main="Robbery Rates in US")

Want to make box plots for every column, excluding the first (since it's non-numeric state names)? That's easy, too. Same function, different argument.

# Box plots for all crime rates
boxplot(crime.new[,-1], horizontal=TRUE, main="Crime Rates in US")

Multiple box plots for comparison.

Histogram

Like I said though, the box plot hides variation in between the values that it does show. A histogram can provide more details. Histograms look like bar charts, but they are not the same. The horizontal axis on a histogram is continuous, whereas bar charts can have space in between categories.

Just like boxplot(), you can plug the data right into the hist() function. The breaks argument suggests how many breaks to use along the horizontal axis. R treats a single number as a hint and picks round break points close to it.

# Histogram
hist(crime.new$robbery, breaks=10)

Look, ma! It's not a bar chart.
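If you want to see what the breaks argument actually does, here is a small sketch. It uses R's built-in faithful dataset as a stand-in for the crime data (an assumption made so the example runs without a download), and plot=FALSE so that hist() only returns the bins instead of drawing them.

```r
# Built-in Old Faithful waiting times, standing in for the crime data
x <- faithful$waiting

# A single number is only a suggestion; R picks round break points
h <- hist(x, breaks=10, plot=FALSE)
print(h$breaks)
print(h$counts)

# A vector of break points is used exactly as given
exact <- hist(x, breaks=seq(40, 100, 5), plot=FALSE)
print(exact$breaks)
```

Passing a vector is the safer choice when you need the same bins across several charts, as long as the breaks span the full range of the data.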

Using the hist() function, you have to do a tiny bit more if you want to make multiple histograms in one view. Iterate through each column of the dataframe with a for loop. Call hist() on each iteration.

# Multiple histograms
par(mfrow=c(3, 3))
colnames <- dimnames(crime.new)[[2]]
for (i in 2:8) {
	hist(crime.new[,i], xlim=c(0, 3500), breaks=seq(0, 3500, 100), main=colnames[i], probability=TRUE, col="gray", border="white")
}

Using the same scale for each makes it easy to compare distributions.
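Hardcoding the upper limit works when you already know the data. As a rough sketch, you could also derive the shared breaks from the data itself. The rates data frame below is made up for illustration, and bin.width, top, and shared.breaks are placeholder names.

```r
# Hypothetical rates, made up for illustration
rates <- data.frame(
	state=c("A", "B", "C", "D"),
	robbery=c(80, 120, 150, 95),
	burglary=c(600, 910, 1240, 705)
)

# Round the overall maximum up to the next multiple of the bin width
bin.width <- 100
values <- unlist(rates[,-1])
top <- ceiling(max(values) / bin.width) * bin.width
shared.breaks <- seq(0, top, bin.width)

# Same limits and breaks for every column
par(mfrow=c(1, 2))
for (name in names(rates)[-1]) {
	hist(rates[[name]], xlim=c(0, top), breaks=shared.breaks, main=name, col="gray", border="white")
}
```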

Density Plot

For smoother distributions, you can use the density plot. You should have a healthy amount of data to use these or you could end up with a lot of unwanted noise.

To use them in R, it's basically the same as using the hist() function. Iterate through each column, but instead of a histogram, calculate density, create a blank plot, and then draw the shape.

# Density plot
par(mfrow=c(3, 3))
colnames <- dimnames(crime.new)[[2]]
for (i in 2:8) {
	d <- density(crime.new[,i])
	plot(d, type="n", main=colnames[i])
	polygon(d, col="red", border="gray")
}

Multiple filled density plots.
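How smooth or noisy the curve looks depends on the bandwidth that density() picks. The sketch below again uses the built-in faithful dataset as a stand-in (an assumption, so it runs on its own) and compares the default against a narrower and a wider bandwidth through the adjust argument.

```r
# Built-in Old Faithful waiting times, standing in for the crime data
x <- faithful$waiting

# Default bandwidth, then a narrower and a wider one
d.default <- density(x)
d.narrow <- density(x, adjust=0.5)
d.wide <- density(x, adjust=2)

# The area under a density curve should come out close to 1
step <- diff(d.default$x)[1]
area <- sum(d.default$y) * step
print(area)

# Overlay the three curves to compare
plot(d.narrow, main="waiting", col="gray")
lines(d.default, col="red")
lines(d.wide, col="black")
```

A narrower bandwidth follows the data more closely and shows more bumps, while a wider one smooths them away.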

You can also use histograms and density lines together. Instead of plot(), use hist(), and instead of drawing a filled polygon(), just draw a line.

# Histograms and density lines
par(mfrow=c(3, 3))
colnames <- dimnames(crime.new)[[2]]
for (i in 2:8) {
	hist(crime.new[,i], xlim=c(0, 3500), breaks=seq(0, 3500, 100), main=colnames[i], probability=TRUE, col="gray", border="white")
	d <- density(crime.new[,i])
	lines(d, col="red")
}

Histogram and density, reunited, and it feels so good.

Rug

The rug, which simply draws ticks for each value, is another way to show distributions. It usually accompanies another plot though, rather than serving as a standalone. Simply make a plot like you usually would, and then use rug() to draw said rug.

# Density and rug
d <- density(crime.new$robbery)
plot(d, type="n", main="robbery")
polygon(d, col="lightgray", border="gray")
rug(crime.new$robbery, col="red")

Using a rug under a density plot.

Violin Plot

The violin plot is like the lovechild between a density plot and a box-and-whisker plot. There's a box-and-whisker in the center, and it's surrounded by a centered density, which lets you see some of the variation.

# Violin plot (run install.packages("vioplot") first if the package is missing)
library(vioplot)
vioplot(crime.new$robbery, horizontal=TRUE, col="gray")

I bet this violin sounds horrible.

Bean Plot

The bean plot takes it a bit further than the violin plot. It's something of a combination of a box plot, density plot, and a rug in the middle. I've never actually used this one, and I probably never will, but there you go.

# Bean plot (run install.packages("beanplot") first if the package is missing)
library(beanplot)
beanplot(crime.new[,-1])

A little too busy for me, but here you go.

Wrapping Up

If you take away anything from this, it should be that variation within a dataset is worth investigating. Picking out single datapoints or only using medians is the easy thing to do, but it's usually not the most interesting.
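Here is a tiny example of why the median alone can mislead. Both vectors are made up for illustration. They share the same median, but the spread is completely different.

```r
# Two hypothetical datasets, made up for illustration
tight <- c(48, 49, 50, 51, 52)
wide <- c(0, 25, 50, 75, 100)

# Same middle value
print(median(tight))
print(median(wide))

# Very different spread
print(sd(tight))
print(sd(wide))

# Side by side, the difference is obvious
boxplot(list(tight=tight, wide=wide), horizontal=TRUE)
```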


iA Writer for Mac

May 28, 2011
 

A better tool doesn’t make a better craftsman, but a good tool makes working a pleasure. iA Writer for Mac is a digital writing tool that makes sure that all your thoughts go into the text instead of the program. iA Writer has no preferences. It is how it is. It works like it works. Love it or hate it. Its unique FocusMode allows me to think, spell and write one sentence at a time. iA Writer is fast; it works without a mouse. It automatically formats semantic entities such as headlines, lists, bold, strong, and block quotes written in Markdown.

You can get iA Writer for Mac at the App store. (10% off during the first few days!)


NOTE: Currently ONLY LATIN ALPHABETICAL LANGUAGES AND RUSSIAN are supported (NO Japanese, Korean, Chinese, Thai, Hebrew, Arabic…).

1. Character: No Preferences

One of our goals was to create a writing app without settings. When opening Writer, all you can do is write. The only options you have are full screen and FocusMode.

2. Signal vs Noise: Focus Mode (patent pending)

In Focus Mode you write one sentence at a time. Why? It’s a common pattern that, instead of following the voice and fleshing out the text in one go, people start editing before the text is done.

We’re more easily distracted by signals similar to those we produce (text), than by signals that are different (the browser icon).

How to Use It: Writing one sentence at a time goes hand in hand with this rule of thumb for good writing: One thought per sentence. You might not like it because it’s not your thing. Fair enough. But if you ever get caught in one of those moments, where the big white empty window scares you or when you get stuck in the middle of a text: try Focus Mode.

3. Speed: No Mouse

Auto Markdown will help you to format texts without taking your hands off the keyboard. It’s easy. You can learn it in 30 seconds. Auto Markdown automatically formats the Markdown language. The advantage is that you don’t need to use your mouse to create semantic structure.

To increase the pleasure of writing is exactly what we intended when creating Writer. A better tool doesn’t make a better craftsman, but a good tool makes working a pleasure.


May 5, 2011
 

Where America flies

Ever since seeing the Facebook friendship map and later, the map of scientific collaboration, I've been looking for an excuse to play with great circles. So I thought, why not come back to Aaron Koblin's classic Flight Patterns? But instead of just looking at all flights (above), I broke it down by airline to see where each one flies.

I grabbed the most recent flight data from the Bureau of Transportation Statistics, aggregated by airline, and counted arriving and departing flights between airport pairs. What follows are non-stop domestic flights by major air carriers during February 2011.

Brighter lines represent more arriving and departing flights between the two endpoints, and blue lines are the flights with heaviest traffic. Coloring is relative to within the airline as opposed to overall flight count.
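This is not the code behind the maps, just a minimal sketch of the counting step under some assumptions: the flights data frame is made up for illustration, its column names are placeholders, and a flight from A to B is treated as the same airport pair as a flight from B to A. The last step scales counts within each carrier, which is what relative coloring needs.

```r
# Hypothetical flight records, made up for illustration
flights <- data.frame(
	carrier=c("WN", "WN", "WN", "WN", "DL", "DL", "DL"),
	origin=c("LAS", "PHX", "LAS", "OAK", "ATL", "MCO", "ATL"),
	dest=c("PHX", "LAS", "OAK", "LAS", "MCO", "ATL", "JFK"),
	stringsAsFactors=FALSE
)

# Treat A-to-B and B-to-A as the same airport pair
flights$pair <- ifelse(
	flights$origin < flights$dest,
	paste(flights$origin, flights$dest, sep="-"),
	paste(flights$dest, flights$origin, sep="-")
)

# Count flights per carrier and pair
flights$n <- 1
counts <- aggregate(n ~ carrier + pair, data=flights, FUN=sum)

# Scale within each carrier, so 1 is that airline's busiest pair
counts$relative <- counts$n / ave(counts$n, counts$carrier, FUN=max)
print(counts)
```

Scaling within each carrier means a small airline's busiest route gets the same brightness as a large airline's busiest route, even though the raw counts differ.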

At a quick glance you can spot where the hubs of each carrier are and which flights are flown most often. We start off with Southwest Airlines, which flies across the country. There's a focus obviously in the southwest.

Delta Air Lines flies just about everywhere, too, but also includes flights to Alaska and Hawaii. Their largest hub is in Atlanta, which explains the focus at Hartsfield–Jackson Atlanta International Airport.

United Airlines, on the other hand, has hubs farther north and on the west coast, including O'Hare International in Chicago and San Francisco International Airport. It appears they also have flights to all major Hawaiian islands.

Lots of American Airlines traffic in and out of Dallas/Fort Worth International Airport and JFK.

Continental Airlines looks similar to American Airlines, except Continental's headquarters are in Houston, Texas.

Pretty obvious where JetBlue goes. Despite some delays the past couple of times I've flown with them, they're still my favorite. One time Bill Murray was on the flight. If it's good enough for him, it must be good enough for me.

Mesa is a smaller airline that also operates United Express and US Airways Express.

US Airways' largest hub is at Charlotte/Douglas International Airport.

The Alaska Airlines connections look really interesting, streaming out of the northwest. Most flights go through Seattle-Tacoma International, but there are also flights to and from Portland International. Oh, and of course to and from Alaska.

Atlantic Southeast lives up to its name.

As does Frontier Airlines.

Hawaiian Airlines looks exactly like you'd expect. They exclusively fly to Raleigh, North Carolina. Ah, I kid. I don't know where they fly.

Apr 27, 2011
 

Paying More for Stuff

After seeing this article and graphic on the rising cost of food in The New York Times week in review, I was curious about how prices for other stuff have changed in the past year. The Bureau of Labor Statistics provides this data monthly via the Consumer Price Index (CPI).

Why they have to provide it in an equal-spaced text file over a plain CSV or a spreadsheet still confuses me, but at least they publish the data regularly, which is more than I can say for other government departments. I originally wanted to see how obesity rates have changed over the past year, but the Centers for Disease Control and Prevention (CDC) hasn't updated obesity trend data since 2009.

In any case, price changes alone are still interesting. Relative to overall inflation (2.7 percent), transportation prices increased the most by far, at plus 9.8 percent. The cost of gasoline alone went up 27.5 percent.

Education also took a hit, going up 4.0 percent. Meanwhile, the cost of apparel and technology went down this past year by 0.6 and 1.4 percent, respectively. Having a baby? Now might be a good time to buy some onesies.
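For reference, a year-over-year change like the ones above is just the difference between two index values divided by the earlier one. The numbers in this sketch are made up for illustration and are not BLS figures. The column names are placeholders too.

```r
# Hypothetical index values, made up for illustration (not BLS figures)
cpi <- data.frame(
	category=c("Item A", "Item B"),
	last.year=c(200, 125),
	this.year=c(210, 120)
)

# Percent change from one year to the next
cpi$pct.change <- (cpi$this.year - cpi$last.year) / cpi$last.year * 100
print(cpi)
```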

Apr 26, 2011
 

I’m not a nuclear expert. I am a 40-year-old Swiss Web designer, with a degree in philosophy, living in Tokyo. And I’m a father of a two-year-old boy. Until now I was kind of nonchalant about nuclear energy, but not anymore. For obvious reasons. I’ve read a lot recently; it’s hard to understand the discussion. I’m not talking about technicalities. One can learn the basics pretty quickly. I’m more confused about the overall logic of the debate. The debate about our future. What I’d like to know: Is more technology really the right solution? Don’t we have enough technology? What is it that we are really lacking?

As far as I can see, some claim that the next generation of safe nuclear power plants will solve all problems; other people believe that only clean energy can save us from doom. Both parties operate with somewhat ironic notions (clean energy, safe nuclear power). And ironically, everybody agrees that humanity’s problems are due to bad technology that should be replaced with good technology. That in particular I find curious because I don’t think that nuclear power plants were bad technology. As far as I know the technology as such was quite impressive. I still find it stunning how much raw power these nuclear power plants are able to produce with such a tiny amount of fuel. But as I said, I can’t really judge that. As a Web designer, all I can say for sure is that:

  1. All the power plants I’ve seen so far look quite ugly.
  2. In my world engineering is a matter of compromise, not perfection.
  3. It’s usually not bad technology but bad practice that causes trouble.

The Good, the Bad and the Ugly

The Good

The Web is based on pretty good technology. Even though it’s considerably younger, I’d say that it’s probably as advanced, as secure and as understood as nuclear technology. Maybe it’s even a tick ahead. After all, billions of man hours of development and testing went into it.

I’m not saying that the Web is an alternative to nuclear energy. I’m not that confused. What I’m saying is: The infrastructure and the top layer of the Web are so solid that whenever things go wrong on the Web (which of course happens quite often), it’s usually too much ambition and a lack of thought or experience in the way technology is used that causes problems. And good technology is technology that doesn’t rely on perfection.

The Bad

Of course, I am not saying either that there is no “bad technology.” There is bad technology; bad technology is technology not fit for its purpose, technology that lacks thought and consideration.

The lack of thought is a problem as old as humanity. Thinking hurts. And as with every form of pain, it’s a great business. The whole design business (putting thought into things) only exists because there is a lack of thought in the way things are constructed. As you might guess, Web design is not an easy business, since most of our work is thinking. And that hurts.

The Ugly

Of course, there is an easy way to make money from that same lack of thought. In our business (use your buzzword radar to find them) as much as in other businesses. Throughout human history con men of all shades have used the lack of thought shamelessly in their favor. You can be sure that whoever promises cheap solutions (for example: social media marketing) to complicated problems (for example: the Internet) is trying to profit from your physical or mental laziness.

Politicians and corporations are particularly good at this form of charlatanry. In Switzerland for instance, the populist party now blames foreigners for the need for power plants. “If we didn’t have foreigners, we wouldn’t need nuclear power plants” is their new slogan. That’s not just bad thought, it’s evil: Blaming foreigners for nuclear energy is as honest an argument as blaming blond people for power plants. But I am digressing.

The Use of Thought

Thinking is hard work. Thinking clearly and consistently for hours and hours is something that only a few trained people can do. As the Greeks have proven at length, thinking for hours and hours over dozens of years can lead to incredible results.

Sure. You can’t move mountains or even cut a tree by just thinking. So what’s the use of thinking if it doesn’t change anything? In short: Thinking helps the economy of action. If you want to cut a mountain of wood, you’d better take the time to think and sharpen your axe before hacking away. I am convinced that if we’d all think with concentration for five minutes every day our lives would be much easier. Unfortunately, we would rather believe that we have no time to think than take the time for it.

Applied Science and Truth

In the eyes of the Ancient Greeks, applied science (speculate, test and see) would have been regarded as pure barbarism. In Greek eyes, only idiots test their thoughts in the world. Sophisticated people test their thoughts with other people.

To the Greek mind there is truth and truth is apparent, it doesn’t need to be forced out of a foxhole with glowing irons. Truth is something that is uncovered, it’s obvious, naked. If it’s not naked, it’s not the truth. Truth doesn’t need to be hunted down. It reveals itself. It’s not always nice to you but it’s clear. I don’t know if that concept of truth still makes sense today, but it’s certainly still a beautiful concept. (The skeptic position that there is no truth but only probability and everything is relative is more popular, and many believe that this position is extraordinarily wise, forgetting that where there is no truth and everything is relative, wisdom has very bad cards).

Science and Wisdom

As unpopular as the notion of truth is, in our time we can hardly afford aesthetic truth concepts. We need to test the results of our science because one of our main goals in developing science is to use it to build more technology. We don’t want to know in order to know, we want to know in order to use.

It’s hard to digest, but the goal of classic philosophy was not applicable knowledge or financial profit, it was done for the sake of pure reason, or, to put it in sweeter terms: out of love of wisdom. That is a massive difference. Not that the classic philosophers were completely against the use of knowledge; that’s impossible; but the practical usefulness of knowledge was not the primary goal of research. It was the desire to become wiser.

Wisdom is something we don’t seem to believe in anymore. How do we know if it even exists? It’s neither measurable, nor weighable, nor countable.

Wisdom and Achievements

You can frown upon the unpractical ways of the Greeks — “why think when you can test and know right away?” — their desire for such unmeasurable, unweighable, and uncountable baloney as “wisdom,” but the very unpractical philosophical method discovered and defined logic, democracy, art, philosophy, literature, and any form of natural science (including, among others, the theory of atoms).

The Greeks out-innovated us not by A/B testing but mainly by thinking. And they thought all that up in what is, historically speaking, a very short time span, with very few people.

Technology and Hubris

Techne, Technique and Technology

Technology is a Greek word. Techne meant the way to do things (we now use the word technique for that). Technology is not the way but the means with which we do things. It’s more of a pragmatic, neutral notion than technique. We put a lot of thought into technology so we need less thought, effort and technique to achieve our goals.

Our civilization prides itself on its technological achievements. We are proud to achieve more with less thought, effort, and technique. We are so proud of our machines that only a few people realize that other civilizations had invented them way before our civilization had even formed. Here’s the thing: The ancient Greeks, for instance, already had steam engines. However, they were not used for practical purposes.

Why didn’t they build railways, cars, and rockets? They didn’t dare. Using automata for pragmatic tasks seemed just too much, over the top, inhuman. What held them back? Being as smart and inventive as they were, they definitely could have come up with a concept as obvious as wheels on rails. It was not the lack of steel or the missing pistons but the fear of hubris that prevented them from using the steam engine for more practical tasks.

Hubris? WTF?

While we all still have a basic understanding of what defines “hubris,” the fear of giving in to hubris is not one of our first concerns anymore. On the contrary. We now call the cars that are supposed to save the planet: Hybrid cars. If you look at what hubris (or hybris) originally meant, that’s quite ironic:

Greek for “insolence,” excessive pride that constitutes the protagonist’s tragic flaw and leads to a downfall.

Now, whether you believe in the Greek Gods or not, avoiding hubris still is a pretty reasonable approach. You don’t need to believe in Divine Intervention to understand why excessive pride is dangerous. Pride is the blindness that comes with power. And power and pride are as dramatic a duo as nitro and glycerin:

Hubris often indicates a loss of touch with reality and overestimating one’s own competence or capabilities, especially for people in positions of power.

Power and Blindness

The reason why we are so proud of the machines is the power they give us. Sure. It was not easy to achieve all this power. It took us a couple of hundred years to develop those Aeolipiles into piston driven steam engines, coal engines, fuel engines, jet engines and atomic power plants. And they really are massively powerful. And the massive power of our technology is why we are so incredibly proud and incredibly blind today. For half a century we have actually been able to completely eradicate ourselves and our living conditions. We are way more powerful than any other civilization before us. What we sacrificed is what we now need most: wisdom.

“We need wisdom the most when we believe in it the least.” (Hans Jonas)

The method that made us build, operate and manage our awesome power goes back to that very same barbaric thinking that Greeks would have frowned upon: Try and see. Which brings me back to my profession (which involves a lot of hybrid try and see).

A Designer’s Perspective

Forget that Old Shit

You may still feel completely comfortable and laugh about the Old Greeks and their superstitious fear of hubris and their preference for the obvious over the experiment in researching the truth. You may still believe that the next generation atomic power plants are completely safe. And maybe you are right.

What do I know? I am just a Web designer. Maybe really knowledgeable people do honestly think that next generation power plants cannot break. Maybe there is no such thing as hubris.

Engineering and Perfection

However I look at it, where I come from (Web and application design), engineering is not a matter of perfection. It’s a matter of compromise. For websites and applications it just doesn’t make sense to assume that they never break. In my world there is no absolute predictability; in my world security is not just a technological problem. In my world, the weak spot of the technology I deal with is us: the humans. The humans that build, operate and manage technology.

I can’t say for sure whether or not perfect machines are possible, but I know for sure that there are no perfect humans to build them, manage them, and use them. Unless human nature becomes divine through divine intervention, any installations we build, manage and operate cannot be fully trusted. I might be wrong, but it seems to me that any technology that threatens the existence of humanity (or a substantial part of it) should not be built, managed or operated by humans.

Perfect People

And it also seems to me that putting technology in the hands of people who think they know everything, people who don’t know that they don’t know, is where things start to get really dangerous.

Whether you look at Chernobyl, Three Mile Island or Fukushima, the operators all knew a lot about nuclear science, but they just didn’t feel the hubris that made them do what they did.

Conclusion

Experts

If you, just like me, still want to replace bad technology with better technology (to my ignorant hybrid mind, clean energy still sounds like a pretty sweet thing, after all), that’s completely cool. But, whatever we do next: Let us not just ask the experts. It’s not expertise that was missing; what was missing was reason, modesty, wisdom:

When the Kanazawa branch of the Nagoya High Court handed down a ruling in January 2003 nullifying permission that had previously been given for the construction of the prototype Monju fast breeder reactor (FBR), electrical power companies and researchers involved in the power industry were up in arms. At a debate about the court ruling, a university professor who was a proponent of nuclear energy employed his knowledge of specialized terminology to talk down an opposition-party Diet member. Later on, I witnessed the professor and some cronies smirk in the corner of the room as they muttered, “Take that, you amateurs.”

Many of them might be perfectly reasonable and honest, some might even be wise, but it can’t hurt to ask many different reasonable people from many different disciplines.

So, dear experts, you might disagree with my ignorant text, but don’t ever tell us again that we don’t understand enough about nuclear energy to form a relevant opinion. Don’t tell anyone they cannot have a relevant opinion on nuclear energy on the basis that they will never understand the ultimate scientific particularities of it like you do.

Citizens

And, to you, dear citizen, if you have doubts about your own knowledge: Don’t let anyone tell you that you don’t have the brain or time or will to learn and think and understand whatever the problem is. That’s what the political and corporate charlatans want you to believe: They want you to be as ignorant as possible. So they can continue to insult our intelligence by claiming that only they can understand what they understand and treat us like fools promising that nuclear energy will save humanity, free of charge. That’s too good to be true. It is your civil duty to know what is what.

In reality, most people will understand enough about nuclear energy if they spend 15 minutes on Wikipedia. Let anyone who is reasonable enough to care about the problem have a voice in this, whether they’re shoemakers, nurses, gymnastics teachers, grandmothers or hairdressers. There is not just a scientific, political and managerial perspective on nuclear energy. There are many different very reasonable perspectives, and they all count, as long as they are honest and thought through.

Maybe the machines we need are not those that help us lift heavier weights than our arms can carry, run faster than our legs can move, see further than our eyes can see, or hear more than our ears can hear. Maybe what we need is support to be able to think clearly. A thinking that allows us to rely on less technology. Wishful thinking? Pretty words? Lame dreaming? Just look at the insane amount of energy all these Japanese electric air conditioners pump into the atmosphere because of bad insulation…

I’m curious to hear what you think.