3 min read

I am a data scientist

Three years ago this week, I wrote a blog post, “Data science is statistics”. I was fiercely against the term at that time, as I felt that we already had a data science, and it was called Statistics.

It was a short post, so I might as well quote the whole thing:

When physicists do mathematics, they don't say they're doing "number science". They're doing math. If you're analyzing data, you're doing statistics. You can call it data science or informatics or analytics or whatever, but it's still statistics. If you say that one kind of data analysis is statistics and another kind is not, you're not allowing innovation. We need to define the field broadly. You may not like what some statisticians do. You may feel they don't share your values. They may embarrass you. But that shouldn't lead us to abandon the term "statistics".

I still sort of feel that way, but I must admit that my definition of “statistics” is rather different than most others’ definition. In my view, a good statistician will consider all aspects of the data analysis process:

  • the broader context of a scientific question

  • study design

  • data handling, organization, and integration

  • data cleaning

  • data visualization

  • exploratory data analysis

  • formal inference methods

  • clear communication of results

  • development of useful and trustworthy software tools

  • actually answering real questions

I’m sure I missed some things there, but my main point is that most academic statisticians focus solely on developing “sophisticated” methods for formal inference, and while I agree that that is an important piece, in my experience as an applied statistician, the other aspects are often of vastly greater importance. In many cases, we don’t need to develop sophisticated new methods, and most of my effort is devoted to the other aspects, and these are generally treated as being unworthy of consideration by academic statisticians.

As I wrote in a later post, “Reform academic statistics”, we as a field appear satisfied with

  • Papers that report new methods with no usable software

  • Applications that focus on toy problems

  • Talks that skip the details of the scientific context of a problem

  • Data visualizations that are both ugly and ineffective

Discussions of Data Science generally recognize the full range of activities that are required for the analysis of data, and place greater value on such things as data visualization and software tools which are obviously important but not viewed so by many statisticians.

And so I’ve come to embrace the term Data Science.

Data Science is also a much more straightforward and understandable label for what I do. I don’t think we should need a new term, and I think we should argue against misunderstandings of Statistics rather than slink off to a new “brand”. But in general, when I talk about Data Science, I feel I can better trust that folks will understand that I am talking about the broad set of activities required in good data analysis.

If people ask me what I do, I’ll continue to say that I’m a Statistician, even though I do tend to stumble over the word. But I am also a Data Scientist.

One last thing: I’ve also come to realize that computer science folks working in computational biology are really just like me. They have expertise in a somewhat different set of tools, but then that’s true for pretty much every statistician, too: they’re much like me but they have expertise in a somewhat different set of tools. And it’s nice to be able to say that we’re all data scientists.

It should be recognized, too, that academic computer science suffers from many of the same problems that academic statistics has suffered: an overemphasis on novelty, sophistication, and toy applications, and an under-appreciation for solving real problems, for data visualization, and for useful software tools.