May 20, 2013
 

After seeing a Reddit post on the convergence of Miss Korea faces, supposedly due to high rates of plastic surgery, graduate student Jia-Bin Huang analyzed the faces of 20 contestants. Below is a short video of each face slowly transitioning to the other.

From the video and pictures it's pretty clear that the photos look similar, but Huang took it a step further with a handful of computer vision techniques to quantify the likeness between faces. And again, the analysis shows similarity between the photos, so the gut reaction is that the contestants are nearly identical.

However, you have to assume that the pictures are accurate representations of the contestants, which doesn't seem to pan out at all. It's amazing what some makeup, hair, and Photoshop can do.

You gotta consider your data source before you make assumptions about what that data represents.

May 09, 2013
 

Average dissertation

On R is My Friend, as a way to procrastinate on his own dissertation, beckmw took a look at dissertation length via the digital archives at the University of Minnesota.

I've selected the top fifty majors with the highest number of dissertations and created boxplots to show relative distributions. Not many differences are observed among the majors, although some exceptions are apparent. Economics, mathematics, and biostatistics had the lowest median page lengths, whereas anthropology, history, and political science had the highest median page lengths. This distinction makes sense given the nature of the disciplines.

I was on the long end of the statistics distribution, around 180 pages. Probably because I had a lot of pictures.

As I was working on my dissertation, people often asked me how many pages I had written and how many pages I had left to write. I never had a good answer, because there's no page limit or required page count. It's just whenever you (and your adviser) feel like there's enough to get a point across. Sometimes that takes 50 pages. Other times it takes 200.

So for those who get that dreaded page-count question, you can wave your finger at this chart and tell people you're somewhere in the distribution.

Data Points: Visualization That Means Something is available now. Order your copy.

Apr 29, 2013
 

Jake Porway, the founder of DataKind, has a new show on the National Geographic channel called The Numbers Game. I unfortunately don't have the channel, so the clips on the site will have to suffice for now.

Keep in mind this show is for a wide audience though. Jake notes:

Now for those of you who have been writing to me excited that Big Data is finally getting its own TV show, I should point out that this show is a lot more like a science show than a show about data. You won’t find discussions about Hadoop, machine learning, or even the basics of correlation vs. causation here. Instead, the show tries to make the latest statistics accessible to a wide audience of people who may just be dipping their toes in to this new world of data. It’s more Guy Fieri than Carl Sagan, but it’s a blast.

The first of three episodes aired last week, and the second is on tonight. You should watch it.


Apr 18, 2013
 

In my previous post, I noted the low median income in the census tract surrounding the Sedgwick stop on the Brown and Purple Lines. That the median income in that part of the city would be less than $20k went against my own experience of Chicago—as well as my prejudices about the North Side. How could a census tract a (light) stone’s throw (by a strong arm) from Lincoln Park, a few blocks away from the Container Store(s) and sundry other gateways to yuppie distinction in Wicker Park and Bucktown have such a low median income? Was there a mistake?

In the post, I suggested that the culprit was the ghost of Cabrini-Green. After all, as Whet reminds us, the ACS is a rolling survey, meaning that the 2011 version still has data from 2007, when Cabrini-Green still existed. But I bristled a bit at that explanation: the Green high-rises, at least, weren’t in that tract. Nor are the original Frances Cabrini Homes. The high-rises are in the tract just to the west, and the row houses are a bit to the south.

And even so, look at the coloring in the tracts: the tract where the high-rises once stood is relatively better off than the Sedgwick tract, and the tract with the Cabrini Homes is even better off in terms of median income, in the same league as the handsome tract just east of Sedgwick, where I’ve had a devil of a time finding parking on the way to Oak St. Beach because of all the fancy cars already gobbling up the street parking.

So unless I’m making some kind of mistake, this answer is insufficient. Another idea is that the massive depopulation of the Green high-rises has spilled into the Sedgwick tract. There might be something to this. After all, in 2000, what is now the tract that includes the high-rises was three separate tracts, suggesting serious depopulation (NB: the blues are still for the 2010 tracts):


2000 census tracts in pink on background of 2010 tracts.

But that only provides some of the answer: the high-rises area is economically well off now since there are no longer 10,000 people involved in the CHA living there. It still tells us nothing about the Sedgwick tract.

Yet maybe there’s an answer in what I’ve already mentioned above: Oak St. beach, parking, Container Store, Lincoln Park… that is, a certain amount of upper middle-class white privilege. Let’s break down these tracts by median income (with margins of error) and then by estimated median incomes of white and African-American households:

Median income w/ MoE (top), white median income (center), African-American median income (bottom)


Keeping in mind the sometimes outlandish margins of error, this area begins to tell a rather different story depending on who’s telling it. These tracts are all comfortably upper middle-class for their white inhabitants, while the story for the African-American inhabitants is rather more all over the map (further indicated by the margins of error), but decidedly distant from the lofty heights of six-digit annual incomes. So for a non-expert on this neighborhood like me, it’s the white income that tells the story I expect given the retail, entertainment, and housing options nearby.

Is the white story the majority story here, though? Here’s the last picture of the area that puts my assumptions into check:

It’s tempting to say that these numbers speak for themselves and leave it at that, but two quick caveats: the margins of error tip past ±10% for the two tracts just south of the Sedgwick tract. Elsewhere they’re all within 10%.

In geography school, we quickly learn Tobler’s First Law of Geography, where everything influences everything, but near things influence things more than far things. And in that case, we do see the effect of the Cabrini-Green public housing efforts on the racial and income makeup of the tracts surrounding it. So the answer to “what’s the matter with Sedgwick?” is simply “nothing at all.” It’s not an aberration. It and its nearby tracts reflect the brutal racial and economic segregation of Chicago that continues to this day. And my surprise at that is just a function of my own time spent living within that segregation.


Flexible data

Apr 17, 2013
 

Data is an abstraction of something that happened in the real world. How people move. How they spend money. How a computer works. The tendency is to approach data and, by default, visualization as rigid facts stripped of joy, humor, conflict, and sadness, because that makes analysis easier. Visualization is easier when you can strip the data down to unwavering fact and then reduce the process to a set of unwavering rules.

The world is complex though. There are exceptions, limitations, and interactions that aren't expressed explicitly through data. So we make inferences with uncertainty attached. We make an educated guess and then compare to the actual thing or stuff that was measured to see if the data and our findings make sense.

Data isn't rigid, so neither is visualization.

Are there rules? There are, just like there are in statistics. And you should learn them.

However, in statistics, you eventually learn that there's more to analysis than hypothesis tests and normal distributions, and in visualization you eventually learn that there's more to the process than efficient graphical perception and avoidance of all things round. Design matters, no doubt, but your understanding of the data matters much more.

Apr 11, 2013
 

In December 2011 I published a post on genre preferences among UK cinema audiences, applying correspondence analysis to data from the BFI’s Opening Our Eyes report. You can read the article that was subsequently published in Participations last year here.

At the time I meant to write a follow-up piece on genre preferences for UK television audiences using data from the same source, but I never quite got round to it. I have now finished this analysis, and the draft article can be found in the PDF file attached to this post. I also look at how age and gender affect audiences’ perceptions of television as a medium.

The PDF file can be accessed here: Nick Redfern – Age, Gender, and Television

We apply correspondence analysis to data produced for the BFI’s Opening Our Eyes report published in 2011 to discover how age and gender shape the experience of television for audiences in the UK. Age is an important factor in shaping how audiences perceive television, with older viewers describing the medium as ‘informative,’ ‘thought provoking,’ ‘artistic,’ ‘good for people’s self-development,’ and ‘escapist,’ while younger viewers are more likely to describe television as ‘exciting,’ ‘fashionable,’ and ‘sociable.’ Younger respondents are also more likely to describe the effect of television on people/society as negative. Variation in programme choice is highly structured in terms of age and gender, though the extent to which each of these factors determines audience choice varies greatly. Gender is the dominant factor in explaining preferences for some programme types with age a secondary factor in several cases, while age is the explanatory factor for other genres for which gender seemingly has little influence. Male audiences prefer sports, factual entertainment, and culture programmes and female audiences reality TV/talent shows, game/quiz/panel shows, chat shows, and soap operas. Older audiences prefer news, documentaries, and wildlife/nature programmes, while music shows/concerts and comedy/sitcoms are more popular with younger viewers.

The BFI report and the raw data can be accessed here.


Mar 27, 2013
 

Deputy editor at Ars Technica Nate Anderson was curious if he could learn to crack passwords in a day. Although there's definitely a difference between advanced and beginner crackers, openly available software and resources make it easy to get started and do some damage.

After my day-long experiment, I remain unsettled. Password cracking is simply too easy, the tools too sophisticated, the CPUs and GPUs too powerful for me to believe that my own basic attempts at beefing up my passwords are a long-term solution. I've resisted password managers in the past over concerns about storing data in the cloud or about the hassle of syncing with other computers or about accessing passwords from a mobile device or because dropping $50 bucks never felt quite worth it—hacks only happen to other people, right?

But until other forms of authentication take root, the humble password will form a primary defense of our personal information. The time has come for me to find a better solution to generating, storing, and handling them.

I use 1Password.

Mar 23, 2013
 

Math professor Jeff Bergen explains the odds of picking a perfect bracket.

The first probability is based on a 50/50 split of correct picks, which is like using fair coin flips to pick winners. Bergen doesn't really go into how he calculated the second probability, but that smaller number comes up by bumping up the probability of picking the right team for each game. I think he's using an average probability of slightly less than 70% (based on simulation results from this old Wall Street Journal column).
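
The arithmetic behind both numbers is short enough to sketch. The snippet below is only an illustration, not Bergen's actual calculation: it assumes a 63-game bracket (a 64-team single-elimination field), treats every pick as independent with the same chance of being right, and the function names and the sample probabilities other than 0.5 are made up for the example.

```python
# Sketch: odds of a perfect bracket if every game is picked independently.
# Assumption: 63 games, i.e. a 64-team single-elimination field.
GAMES = 63


def perfect_bracket_probability(p_correct: float, games: int = GAMES) -> float:
    """Chance of getting every game right when each pick is correct
    with the same probability p_correct (a simplifying assumption)."""
    return p_correct ** games


def one_in(probability: float) -> float:
    """Express a probability as the N in '1 in N'."""
    return 1.0 / probability


if __name__ == "__main__":
    # 0.5 is the coin-flip case; the other values are illustrative guesses
    # at how often an informed picker might choose the winner.
    for p in (0.5, 0.6, 0.67, 0.7):
        odds = one_in(perfect_bracket_probability(p))
        print(f"p = {p:.2f}: about 1 in {odds:,.0f}")
```

With p = 0.5 the odds come out to exactly 1 in 2^63, a bit over 9 quintillion. Nudging p into the high 60s brings that down to something on the order of one in a hundred billion, which is a huge improvement and still hopeless for any single entry.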

That's why businesses can offer up million dollar prizes. In all likelihood, no one is going to win, which turns out to be a great business model for insurance companies who back these contests:

If millions of people enter a particular contest, it might seem like the chance of someone winning is suddenly in the realm of possibility. But there's a catch: This scenario assumes everyone maximized their chances by picking mostly favorites, so those with the best shot at winning are likely to have identical entries. These contests generally protect themselves from big losses by stating they'll divvy up the loot if there are multiple perfect brackets.

These favorable conditions make insuring these prize offers a good business, as the Dallas company SCA Promotions has discovered. SCA, founded by 11-time world bridge champion Robert D. Hamman, has taken on the insurance risk for roughly 50 perfect-bracket prizes -- including a Sporting News offer of $1 million in 2001, according to vice president Chris Hamman, the founder's son. In the 12 years it has been doing so, SCA has never had to pay out a claim.

Mar 22, 2013
 

And so after a long (and much enjoyed) break I return to the blogosphere with the first draft of a paper on film style and narration in Rashomon. This paper is different to other statistical analyses of film style I have published on this site and to all other studies of film style and narration because it uses multivariate analysis to look at several different aspects of film style together. The method used is multiple correspondence analysis, and you can find a good introductory chapter on MCA here. The software I used is FactoMineR for R, and the website explaining how to do the analysis can be found here.

Multivariate analysis has been used in the quantitative study of literature for some time (see the links below the abstract), but this is the first time multivariate analysis has been applied to film style and it appears to work very well. I am currently looking at some other applications, particularly in distinguishing between the different parts of portmanteau horror films (which is a proper scholarly endeavour and not simply an excuse to watch lots of portmanteau horror films).

The pdf file can be accessed here: Nick Redfern – Film style and narration in Rashomon

An Excel file containing the data used in the analysis can be accessed here: Nick Redfern – Rashomon. This file contains two worksheets: the first is the shot length data for the film, and the second is the data used in the multiple correspondence analysis.

Abstract

This article analyses the use of film style in Rashomon (1950) to determine if the different accounts of the rape and murder provided by the bandit, the wife, the husband, and the woodcutter are formally distinct by comparing shot length data and using multiple correspondence analysis to look for relationships between shot scale, camera movement, camera angle, and the use of point-of-view shots, reverse-angle cuts, and axial cuts. The results show that the four accounts of the rape and the murder in Rashomon differ not only in their content but also in the way they are narrated. The editing pace varies so that although the action of the film is repeated the presentation of events to the viewer is different each time. There is a distinction between presentational (shot scale and camera movement) and perspectival (shot types) aspects of style depending on their function within the film, while other elements (camera angle) fulfil both these functions. Different types of shot are used to create the narrative perspectives of the bandit, the wife, and the husband that marks them out as either active or passive narrators reflecting their level of narrative agency within the film, while the woodcutter’s account exhibits both active and passive aspects to create an ambiguous mode of narration. Rashomon is a deliberately and precisely constructed artwork in which form and content work together to create an epistemological puzzle for the viewer.

On the multivariate analysis of literature see the following:

Hoover DL 2003 Multivariate analysis and the study of style variation, Literary and Linguistic Computing 18 (4): 341-360.

Stewart LL 2003 Charles Brockden Brown: quantitative analysis and literary style, Literary and Linguistic Computing 18 (2): 129-138.

Tabata T 1995 Narrative style and the frequencies of very common words: a corpus-based approach to Dickens’s first person and third person narratives, English Corpus Studies 2: 91-109.


Mar 21, 2013
 

Songwriters by age

Do singer-songwriters age well like a fine wine, or does quality decline with age? Kyle Biehle analyzed fan ratings by age.

I understand all of the reasons for not comparing artists in this way. Despite twenty-one Academy Award nominations, Woody Allen never attends the Oscars. His reason is that art isn't competition — judging art is so subjective who's to say who or what is best? After all one man's Poison is another man's Cream. Similarly, Elvis Costello (featured in the viz) is famously credited with saying: "Writing about music is like dancing about architecture - It's a really stupid thing to want to do." I agree that using ratings - whether from fans or critics — to judge artistic merit is at best flawed and at worst a fool's exercise.

But I wanted to do it anyway.

Most peak in their 20s and either stabilize later on or continue to decline. Occasionally, as in the case with Bob Dylan, there's some see-sawing. Take a look at the Tableau interactive for a closer look. [via Waxy]