In this interview, Kirk Borne, principal data scientist at Booz Allen Hamilton and former professor of astrophysics at George Mason University, discusses Bayesian networks, ensemble modeling, and novel trading strategies.

What are some common mistakes that data scientists make?

It is easy for all of us to become infatuated with the nuances and power of algorithms, especially those that we learn for the first time. We like to try them out on almost every problem, whether or not it is a wise choice. The biggest danger in this is not the choice to try them — you should try them, experiment with them, see what works and doesn’t — that’s how we learn and that’s the “science” component of data science. No, the real danger is in trying to sell this shiny new algorithm to your client (boss or customer) when in fact a simpler and more transparent model might be good enough. A small improvement in accuracy doesn’t offset large increases in confusion or in brittleness (i.e., a model that must be fine-tuned and handcrafted to produce a minor improvement in accuracy in special circumstances). So, I strongly encourage experimentation, but I also encourage wisdom and restraint with models that must be delivered and put into production. As Einstein said, “all things should be kept as simple as possible, but no simpler.” Also, the famous statistician George Box said: “All models are wrong, but some are useful.”

Could you give an example of an interesting machine-learning problem which Booz Allen has worked on, and how data-mining techniques were used to solve it?

Yes, there are many fascinating examples. But I cannot talk about them. Nevertheless, we have many fine examples described in our freely available Field Guide to Data Science. One of those was the application of Bayesian Networks to root cause analysis of air traffic safety incident reports. Applying a graph model with conditional probabilities along each edge that connected different causal factors and conditions to outcomes (safety incidents) was very illuminating — which further proved a point that I always emphasize: the natural data structure of the world is a graph (a network). A graph model (such as a Bayesian Network) explicitly reveals all sorts of interesting and useful connections, linkages, causal trails, and relationships between events and entities.
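
That kind of graph model is easy to demonstrate in miniature. The sketch below builds a toy Bayesian network with the pgmpy library (assuming a version where the BayesianNetwork class is available): two invented causal factors point at an incident node, and an inference query asks how beliefs about those factors shift once an incident is observed. The variables and probabilities are placeholders for illustration only, not details from the actual air traffic safety project.

```python
# Toy Bayesian network for incident root-cause analysis (illustrative only).
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Causal factors (Weather, CrewFatigue) point at the outcome (Incident).
model = BayesianNetwork([("Weather", "Incident"), ("CrewFatigue", "Incident")])

# Prior probabilities for the causal factors (0 = benign, 1 = adverse).
cpd_weather = TabularCPD("Weather", 2, [[0.7], [0.3]])
cpd_fatigue = TabularCPD("CrewFatigue", 2, [[0.8], [0.2]])

# Conditional probability of an incident for each combination of factors.
cpd_incident = TabularCPD(
    "Incident", 2,
    [[0.99, 0.95, 0.90, 0.70],   # P(Incident = 0 | Weather, CrewFatigue)
     [0.01, 0.05, 0.10, 0.30]],  # P(Incident = 1 | Weather, CrewFatigue)
    evidence=["Weather", "CrewFatigue"],
    evidence_card=[2, 2],
)

model.add_cpds(cpd_weather, cpd_fatigue, cpd_incident)
assert model.check_model()

# Root-cause style query: given that an incident occurred,
# how do beliefs about each causal factor shift?
infer = VariableElimination(model)
print(infer.query(["Weather"], evidence={"Incident": 1}))
print(infer.query(["CrewFatigue"], evidence={"Incident": 1}))
```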

Which models/algorithms do you make use of most often when solving data science problems?

I am less involved with model-building these days, and more focused on model reviews and recommendations for our data science teams. They do the heavy lifting. I drop in to add my suggestions and lessons learned. In many of those cases, we are definitely looking at ensembles, since real-world problems are complex and have nontrivial interdependencies in the data features. Exploring the high-dimensional parameter space for patterns, trends, associations, and other deep relationships requires a multi-model approach. Ensembles have proven to be exceptionally good at modeling (exploring and exploiting) such complexities.
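
As a rough illustration of that multi-model idea, here is a minimal soft-voting ensemble built with scikit-learn on synthetic data. The particular base learners and dataset are placeholders, not the ensembles any real team deploys.

```python
# Minimal ensemble sketch: combine diverse base learners by averaging
# their predicted probabilities (soft voting). Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a high-dimensional, interdependent feature space.
X, y = make_classification(n_samples=2000, n_features=30,
                           n_informative=12, random_state=0)

# Different model families capture different kinds of structure in the data.
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities rather than hard votes
)

print("Cross-validated accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```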

What advice would you give to somebody starting out in data science?

I always say to newcomers in data science that you should follow your passion first. No matter what it is (science, engineering, sports, healthcare, computer technologies, art, music, finance, marketing, manufacturing, retail, customer service, ...), there is now a major digital component (with lots of data) in every discipline, domain, and direction you explore. So do both — follow your passion and also become passionate about learning data science. You won’t regret it. In fact, every day you will love what you do because you will be doing what you love.

In your opinion, what can and can’t be predicted? E.g., do you think it's possible to predict who will win the World Series, or Apple's stock price a month from now?

Yes, no, maybe! Predictions about very specific outcomes (winners of sports games) are highly uncertain in probabilistic terms. But predictions about general outcomes, or a range of outcomes, are statistically more robust. Think of the latter as a cumulative probability, and of a specific outcome prediction as one point in that cumulative distribution. It is easier to predict accurately that something will fall into a range of outcomes, and not so easy to predict that a specific outcome will definitely occur.
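
To make that contrast concrete, the toy sketch below compares the probability of a range forecast with the probability of a near-exact forecast under an assumed distribution; the distribution and numbers are invented purely for illustration.

```python
# Range forecast vs. point forecast under an assumed price distribution.
# Everything here is hypothetical and for illustration only.
from scipy.stats import norm

# Suppose a model says a stock price one month out is roughly Normal(150, 10).
forecast = norm(loc=150, scale=10)

# Probability the price lands anywhere in a broad range (a robust claim).
p_range = forecast.cdf(165) - forecast.cdf(135)

# Probability it lands within $0.50 of one exact value (a fragile claim).
p_point = forecast.cdf(150.5) - forecast.cdf(149.5)

print(f"P(135 <= price <= 165)     = {p_range:.2f}")   # about 0.87
print(f"P(149.5 <= price <= 150.5) = {p_point:.3f}")   # about 0.04
```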

What are the biggest challenges you face as a data scientist?

I sometimes say that there are “two biggest challenges” in data science and analytics: (1) integrating diverse, disparate, distributed (often siloed), and (often) dirty data; and (2) deriving actionable insights from all those data, algorithms, and models. But it is definitely a challenge also to develop a clear statement of the problem, fix on a manageable set of desired outcomes, and specify the metrics that will validate success!

Could you give an example of a novel trading strategy which you think would be profitable in today’s world?

I wish I knew that. If I did know it, I would sell it. But, here’s my novel excuse to give you a nontrivial answer to your question: I would do a market basket analysis (association rule mining) that includes many different stocks, categories of stocks, and other factors (those things that experts call exogenous variables, such as the prices of commodities, or interest rates, or bond rates, or the top trending emerging technologies that startups and venture capitalists are investing in), then insert time delays in some of those factors’ values in the model, and then finally search for interesting rules (with good Lift) that associate rises or falls in specific stocks within X amount of time of those other factors occurring. Good luck with that. If you do make any money on it, I hereby claim 10% commission on those earnings in front of everyone here to see. 😁
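
For readers who want to see the mechanics, here is a hedged sketch of that idea using the Apriori implementation in the mlxtend library: each trading day becomes a “basket” of boolean events, the exogenous factors are lagged by a few days, and the mined rules are ranked by lift. The tickers, factors, and random data are placeholders, not a real or tested strategy.

```python
# Association-rule mining over lagged market "events" (illustrative only).
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

rng = np.random.default_rng(0)
days = 500

# Boolean "items" per trading day: did each event happen? (random noise here)
events = pd.DataFrame({
    "OIL_up":   rng.random(days) < 0.4,
    "RATES_up": rng.random(days) < 0.3,
    "AAPL_up":  rng.random(days) < 0.5,
    "XOM_up":   rng.random(days) < 0.5,
})

# Insert time delays: pair today's stock moves with factors from `lag` days ago.
lag = 3
baskets = pd.concat(
    [events[["OIL_up", "RATES_up"]].shift(lag).add_suffix(f"_lag{lag}"),
     events[["AAPL_up", "XOM_up"]]],
    axis=1,
).dropna().astype(bool)

# Mine frequent itemsets, then rank the resulting rules by lift.
itemsets = apriori(baskets, min_support=0.05, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.0)
cols = ["antecedents", "consequents", "support", "confidence", "lift"]
print(rules[cols].sort_values("lift", ascending=False).head())
```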