I’ve come to believe that, for high-dimensional data, visualizations (aka graphs), and particularly interactive graphs, can be more important than precise statistical inference.
We first need to be able to view and explore the data, and when it is unusually abundant, that is especially hard. This was a primary contributor to my recent embarrassments, in which clear problems in the data were not discovered when they should have been.
I gave a talk on interactive graphs (with the title above) at Johns Hopkins last fall, and then a related talk at ENAR earlier this week, and I have a few thoughts to add here.
A brief digression
I’m giving a talk at a plant breeding symposium at Kansas State in a couple of weeks, and I’ve been pondering what to talk about. A principal problem is that I don’t really work on plant breeding. My most relevant talks are a bit too technical, and my more interesting talks are not relevant.
Then I had the idea to talk about some of my recent work with my graduate student, Il-youp Kwak, on the genetic analysis of phenotypes measured over time.
I realized that I could incorporate some interactive graphs into the talk. Initially I was just thinking that the interactive graphs would make the talk more interesting and would allow me to talk about things that weren’t necessarily relevant but were interesting to me.
But then I realized that this work really cries out for interactive graphs. And as I began to construct one of them, I thought of a whole bunch more I might create. More importantly, I realized that these interactive graphs are extremely useful teaching tools.
More D3 examples
Here’s an image of the first graph I created for the talk; click on it to jump to the interactive version.
Statisticians are often confronted with a large set of curves. We’d like to show the individual curves, but there are too many. The resulting spaghetti plot is a total mess. An image plot (like the lasagna plot) allows us to see all of the curves, but it can be hard to get a sense of what the actual curves look like. The interactive version solves the problem.
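To make the idea concrete, here is a minimal Python sketch of the linking logic behind that kind of graph: each row of the image plot remembers which individual it came from, so pointing at a row can pull up that one curve. This is not the code behind the interactive version (which is D3); the function names, the choice to order rows by average value, and the data are all made up for illustration.

```python
from statistics import mean


def build_image_rows(curves):
    """Order curves by their average value to give the image plot a stable row order.

    Returns a list of (original_index, values) pairs; row i of the image
    corresponds to the i-th pair. (Ordering by mean is an assumption here;
    any ordering would work.)
    """
    order = sorted(range(len(curves)), key=lambda i: mean(curves[i]))
    return [(i, curves[i]) for i in order]


def curve_under_pointer(rows, row_index):
    """Return the underlying curve for the image row the pointer is over."""
    original_index, values = rows[row_index]
    return {"individual": original_index, "values": list(values)}


# Made-up data: four individuals, each measured at five time points.
curves = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.5, 0.6, 0.8, 0.9, 1.0],
    [2.0, 2.5, 3.5, 5.0, 7.0],
    [1.0, 1.0, 1.5, 2.0, 2.0],
]

rows = build_image_rows(curves)          # what the image plot would draw
detail = curve_under_pointer(rows, 0)    # what a hover on the first row would show
```

In a real interactive graph, `curve_under_pointer` would be called from a mouse-over handler and its result drawn in a second panel. The point is only that the summary and the details share an index, so moving from one to the other is immediate.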
Here’s a second example; again click on the image to jump to the interactive version. (I’ve shown this before, but I want to use it to make another point.)
Typically, in a lecture on complex trait analysis, I’d show one LOD curve (like the top panel in the image below) and a few different plots of phenotype vs genotype (the lower-right panel in the image). I think the exploratory tool will be much more effective, in a lecture, for explaining what it all means.
Statisticians need to be doing this routinely
In constructing a graph, one must make some difficult choices. For high-dimensional data, one must greatly compress the available information. The resulting summaries, while potentially informative, take one far away from the original data.
Interactive graphs provide a means through which one may view the overall summary but have immediate access to the underlying details.