Cosma Shalizi reminds me of the internet "data scientists are (good and empirically oriented) statisticians" discussion of 2011-12.
Let me say three things:
You should never use Excel to handle your data.
I don't know whether it is depressing or exhilarating to recognize how often, for me as for Cosma, my reaction these days is: "I already wrote something incisive and very much worth reading about that; now to find it in my weblog archives..."
Increasingly, data management, analysis, and presentation are things that many more people need for their jobs than statistics departments can reasonably expect to funnel through their major programs. It's like in the Middle Ages: the number of people who needed to have a good, clear, legible-penmanship chancery hand vastly exceeded the number of professional calligraphers and illustrators. Data management, analysis, and presentation skills are, increasingly, the legible-penmanship chancery hand of the twenty-first century.
Cathy O'Neil (2011): Why and how to hire a data scientist for your business: "When do you need a data scientist?... https://mathbabe.org/2011/09/25/why-and-how-to-hire-a-data-scientist-for-your-business/
...When you have too much data for Excel to handle: data scientists know how to deal with large data sets. When your data visualization skills are being stretched... data scientists are skilled (or should be) at data visualization.... When you aren’t sure if something is noise or information: this is a big one, and we will come back to it. When you don’t know what a confidence interval is.... Every number you see... is actually an estimate of something, and the question you constantly face is, how trustworthy is that estimate?...
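To make "how trustworthy is that estimate?" concrete, here is a minimal sketch of one way to attach a confidence interval to a number: a percentile bootstrap, in plain Python. The function name `bootstrap_ci`, its parameters, and the sample values are made up for illustration; this is not O'Neil's method, and other interval constructions would serve as well.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Hypothetical helper for illustration; all names are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = estimates[int(alpha * n_resamples)]
    hi = estimates[int((1.0 - alpha) * n_resamples) - 1]
    return stat(data), lo, hi


sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]  # made-up numbers
point, lo, hi = bootstrap_ci(sample)
print(f"mean = {point:.2f}, 95% interval roughly [{lo:.2f}, {hi:.2f}]")
```

The point estimate alone says nothing about its own reliability; the width of the interval is what tells you whether ten observations are enough to act on.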
Cosma Shalizi (2011): New "data scientist" is but old "statistician" writ large: "Cathy O'Neil has an interesting post.... The skills she's describing a good "data scientist" as having are a subset of the skills of a good statistician... https://bactra.org/weblog/805.html
...At most, they are a subset of the skills of a good computationally competent statistician. These are even, at least here, undergraduate-level skills... regression and advanced data analysis... graphics and visualization, data mining, and/or statistical computing. (IMHO, graphics and computing ought to be mandatory courses, but that's another story for another audience.) While I modestly admit to the unrivaled greatness of our undergrad program, I draw two conclusions:
- Other people re-inventing the job of statisticians under a new name is a sign that we really need to do better at spreading the word about what we know and what we can do.
- If you want a data scientist, get a CMU statistics major.
Cathy O'Neil (2011): Data science: tools vs. craft: "One consistent reaction is that I’m just saying that a data scientist needs to know undergraduate level statistics... https://mathbabe.org/2011/10/04/data-science-tools-vs-craft/
...On some level this is true: undergrad statistics majors can learn everything they need to know to become data scientists, especially if they also take some computer science classes. But... to set up an analogy: I’m not a chef because I know about casserole dishes.... Once we boil something down to a question in statistics it’s kind of a breeze. Even so, nothing is ever as standard as you would actually find in a stats class....
My advice... is to get someone who is really freaking smart and who has also demonstrated the ability to work independently and creatively, and who is very good at communicating. And now that I’ve written the above issues down, I realize that another crucial aspect to the job of the data scientist is the ability to create methodology on the spot and argue persuasively that it is kosher... have broad knowledge of the standard methods... be able to hack together a bit of the relevant part of each... understand it sufficiently to implement it in code.... sell it to everyone else....
I would argue that an undergrad education probably doesn’t give enough perspective to do all of this, even though the basic mathematical tools are there. You need to be comfortable building things from scratch and dealing with people in intense situations. I’m not sure how to train someone for the latter...
Cosma Shalizi (2012): No, Really, Some of My Best Friends Are Data Scientists: "We could demand more programming, but as Cathy says very well... https://bactra.org/weblog/925.html#b4
...Don't confuse a data scientist with a software engineer!... Data scientists know how to program in the sense that they typically know how to use a scripting language like python to manipulate the data into a form where they can do analytics on it. They sometimes even know a bit of java or C, but they aren't software engineers, and asking them to be is missing the point of their value to your business.
However, we are right to demand some programming. It is all very well to be able to use someone else's software, but (again, to repeat myself) "someone who just knows how to run canned routines is not a data analyst but a drone attached to a machine they do not understand". I am classist enough to look down on someone who chooses, when they have an alternative, to be a mere fleshly interface to a dumb tool, to be enslaved by one of our own creations. Too often, of course, people have no choice in this.
This is why I insist on programming in all my classes, and why I encourage my advisees to take real computer science classes...
Cathy O'Neil (2012): Statisticians aren’t the problem for data science. The real problem is too many posers: "I recently was hugely flattered by my friend Cosma Shalizi’s articulate argument against my position... https://mathbabe.org/2012/07/31/statisticians-arent-the-problem-for-data-science-the-real-problem-is-too-many-posers/
...That’s not to say I agree with absolutely everything.... There’s a difference between being a master at visualizations for the statistics audience and... data scientists... dumb[ing] stuff down without letting it become vapid... [while] reading other people’s minds in advance to see what they find sexy.... And communications skills are a funny thing.... An earnest, well-trained and careful statistician in a data scientist role would adapt very quickly to it and flourish as well, if he or she could learn to stomach the business-speak and hype.... As long as academic statisticians are willing to admit they don’t typically spend just as much time (which isn’t to say they never spend as much time) worrying about how long it will take to train a model as they do wondering about the exact conditions under which a paper will be published, and as long as data scientists admit that they mostly just redo linear regression in weirder and weirder ways, then there’s no need for a heated debate at all....
What I really want to rant about today... is... posers... in the land of data scientists.... It’s not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust.... Efficient algorithms... memorized by their users are basically black-box algorithms.... But, contrary to the message sent by much of Andrew Ng’s class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist.... If your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with....
One thing that will be helpful in this direction is Rachel Schutt’s Data Science class at Columbia next semester, which is going to be a much-needed bullshit free zone. Note there’s been a time change that hasn’t been reflected on the announcement yet, namely it’s going to be once a week, Wednesdays for three hours starting at 6:15pm. I’m looking forward to blogging on the contents of these lectures...
Anil Dash (2011): If your website's full of assholes, it's your fault: "If you run a website, you need to follow these steps. If you don't, you're making the web, and the world, a worse place. And it's your fault..." https://anildash.com/2011/07/if-your-websites-full-of-assholes-its-your-fault.html
Cosma Shalizi (2010): The Bootstrap: "Statisticians can reuse their data to quantify the uncertainty of complex models..." https://web.archive.org/web/20101123034535/https://www.americanscientist.org/issues/pub/2010/3/the-bootstrap/1
Cosma Shalizi (2008): Minimal Advice to Undergrads on Programming: "Don't give up; complain!... When complaining, tell me what you tried, what you expected it to do, and what actually happened. The more specific you can make this, the better. If possible, attach the relevant R session log and workspace to your e-mail. Of course, this presumes that you start the homework earlier than the night before it's due..." https://bactra.org/weblog/593.html
Derek Jones (2008): Recommendations for teachers of programming: "Students pick up lots of folklore that gets them through the course; work with this not against it. You are the chief wizard, give them a few powerful spells that can be applied to solve the problems they are likely to encounter. Understanding takes more time than is available, accept this fact..." https://shape-of-code.coding-guidelines.com/2008/12/29/recommendations-for-teachers-of-programming/
Cosma Shalizi (2010): The Neutral Model of Inquiry (or, What Is the Scientific Literature, Chopped Liver?): "900 words of wondering what the scientific literature would look like if it were entirely a product of publication bias..." https://bactra.org/weblog/698.html
Herbert Simon: The Sciences of the Artificial https://amzn.to/2vQ4MFN