Black Swans



Hi, I'm Jack - I created Black Swans. 

I love all sorts of statistics, but am particularly interested in time series modelling. Last summer, I did a 6-week internship at St Andrews developing an ARMA-GARCH Python 🐍 package to model stock prices. I hope to develop it further next semester.

If anyone has any problems with the site, please let me know and I'll fix them as soon as possible.


Hey elc31,

I read this one a while ago which I found interesting:

Let me know if you have any questions about it!

Journal Club (Week 2)


This week the paper is: A general reinforcement learning algorithm that masters chess, shogi and Go through self-play.

Here's the link:

I'm looking forward to reading it! 😜


Hmm... I think I'd like to do the first one in London, simply because I live in the UK, and would have to sort out a venue, etc. If it went well, it'd be very cool to go to US or Canada for the next one.

Perhaps you are right. We'll see what others think. I'd like to record the presentations for the benefit of users who cannot attend, e.g., because they live too far away. I also think it would be cool to grow a YouTube channel full of short, inspiring, data-science presentations!


To understand Doc2Vec, you first have to understand Word2Vec. So if anybody is having trouble, I suggest watching this brilliant Stanford lecture: The slides in the presentation can be downloaded here:

Then, if people are curious which is better, PV-DM or PV-DBOW, or what hyperparameters are best for different tasks, I suggest reading this paper: The authors also found that Doc2Vec models can be improved using pre-trained word embeddings. In addition, they created some off-the-shelf Doc2Vec models (trained on Wikipedia articles) which can immediately create embeddings for any documents you want. They can be downloaded here:

Hope these links help!

Black Swans Meet-Ups 🚀


I think it'd be really cool to organise a Black Swans Meet-Up in the next couple of months!

I haven't thought about it much, but I imagine an evening where:

  • 1 or 2 data-scientists from well-known companies deliver presentations.
  • A few of us give inspiring presentations on projects we've been working on.
  • We get to chat to others on the site 🍻.

I'd also like to film the presentations and put them on YouTube for the world to see 📽.

If you like this idea, or think it could be improved somehow, please comment below!


Thanks eclipto!

Why are there so many great papers out there?! Would definitely like to read this one too. 🤣

I'd love to see an overview of your project. Am super busy atm, but will read it sometime in the next week! I know nothing about how reinforcement learning might improve LSTMs by applying a value function, but would definitely like to learn about it! You can find my email on my profile page. Thanks Claus!

Welcome Claus! Your work sounds very interesting. I'm working on predicting stock prices using news and technical data myself, so would be interested to hear more about your project on optimising trading systems. Out of interest, what university do you attend?

Journal Club (Week 1)


The paper for this week is: Distributed Representations of Sentences and Documents.

Please share below any questions you have about the paper or useful insights, resources, summaries, criticisms, etc.

If you aren't using doc embeddings in any of your current projects, and think it might be a waste of time to read through this paper, I urge you to think again!

Word and doc vectors are fundamental to Natural Language Processing and text classification problems, and, sooner or later, you will have to make use of them!

Have fun, and don't be afraid to ask stupid questions! 😎


Hi ekowduker!

Thanks for joining the community. It sounds like you have a wealth of experience!

Out of curiosity, what are the titles of your novels? Could we find them on amazon?

Hey memahesh! I'm not sure about ii, maybe someone else can help with that. But for i, I would first build a vocabulary from a large training dataset of reviews, i.e., a dictionary (if in python) mapping numbers to a subset of the words in the training dataset, e.g., {0: "brilliant", 1: "hate", 2: "love", ...}. I'd then add the token "UNK" to this vocabulary to represent unknown words. If a word appears in a new review which isn't in this vocabulary, you simply represent it by the "UNK" token. Let me know if I've misinterpreted the question, or if my answer is confusing!
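To make that concrete, here's a minimal Python sketch of the idea (the reviews are hypothetical, and I've built the inverse word → id mapping, since that's the direction you need when encoding; a real pipeline would also lowercase, strip punctuation, and cap the vocabulary size):

```python
from collections import Counter

# Hypothetical training reviews
train_reviews = [
    "brilliant film love it",
    "hate this film",
    "love love love",
]

# Build a vocabulary mapping each training word to an integer id
counts = Counter(word for review in train_reviews for word in review.split())
vocab = {word: i for i, (word, _) in enumerate(counts.most_common())}
vocab["UNK"] = len(vocab)  # reserve one id for unknown words

def encode(review):
    """Map each word to its id, falling back to UNK for unseen words."""
    return [vocab.get(word, vocab["UNK"]) for word in review.split()]

print(encode("love this dreadful film"))  # "dreadful" gets the UNK id
```

Any word in a new review that never appeared in training ("dreadful" above) simply maps to the UNK id.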

Hey Fernan - welcome to the site! I'm also very interested in NLP. At the moment I'm reading a lot about doc2vec and similar tools. Let me know if you have any problems, or ideas to make the site better.

Exactly lisa. It's a very cool area!

Great suggestion! Looks very interesting.

Probably Monday. If you have any ideas of papers we could read, please post them here!

Hey utkarshver! Great to have you on the site. Let me know how you find it!

Hey! I designed it from scratch using Django.

Title: Distributed Representations of Sentences and Documents

Description: The paper introducing Paragraph Vector, a way to create vector embeddings for documents.


Journal Club Paper Suggestions


Please suggest papers (1 per comment) you think we should read and discuss here. Include a 1 sentence description of why it's a cool paper, and provide a link if it's available.

We'll choose the papers with the most upvotes!

Thanks 🚀

Hi d! Welcome to the site. I'm also interested in how to apply AI to poker. Maybe we could chat (or play!) sometime!

Thanks srallaba - great to have you on the site!

Hi skaterska -- I'll email people who comment on this post asking for suggestions for the first paper to go through.

Deep Learning Journal Club

We're launching a journal club! 🍷🍻

Being part of a journal club is a fast, efficient way to stay up to date with current research, learn about exciting new applications, and have your technical questions answered.

Meeting new data-scientists is also a great way to learn different approaches to problems and get new ideas for projects!

We plan to read and discuss one paper each week, chosen by the club. We'll host the discussion on Black Swans, but would like to also hold video calls to get to know each other better.

Comment below if you'd like to join.

Thanks RafayAK. Unfortunately Goodfellow hasn't got back to me yet - I've just sent an email chasing him up. Will try to fix the LaTeX issue shortly. I created the points system so that good answers/replies would "rise" and poor ones would "fall". I haven't got plans for it beyond that! Have you any thoughts about the points system?

Merry Christmas 🎄


Just wanted to wish all those using the site a merry Christmas 🎅 and a happy new year!

I've had exams at uni so haven't had much time to work on the site -- but I'm done now so will try to do some more interviews, etc.

If anyone has any cool ideas for the site, I'd be very grateful if you'd share them!

Cheers 🍻


Lessons Learnt by Booz Allen Data Scientist Kirk Borne

In this interview, Kirk Borne, principal data scientist at Booz Allen Hamilton and former professor of astrophysics at George Mason University, discusses Bayesian networks, ensemble modelling and novel trading strategies.

What are some common mistakes that data scientists make?

It is easy for all of us to become infatuated with the nuances and power of algorithms, especially those that we learn for the first time. We like to try them out on almost every problem, whether or not it is a wise choice. The biggest danger in this is not the choice to try them — you should try them, experiment with them, see what works and doesn’t — that’s how we learn and that’s the “science” component of data science. No, the real danger is in trying to sell this shiny new algorithm to your client (boss or customer) when in fact a simpler and more transparent model might be good enough. A small improvement in accuracy doesn’t offset large increases in confusion or in brittleness (i.e., a model that must be fine-tuned and handcrafted to produce a minor improvement in accuracy in special circumstances). So, I strongly encourage experimentation, but also exercise wisdom and restraint in models that must be delivered and put into production. As Einstein said, “all things should be kept as simple as possible, but no simpler.” Also, famous statistician George Box said: “All models are wrong. Some are useful.”

Could you give an example of an interesting machine-learning problem which Booz Allen has worked on, and how data-mining techniques were used to solve it?

Yes, there are many fascinating examples. But I cannot talk about them. Nevertheless, we have many fine examples described in our freely available Field Guide to Data Science. One of those was the application of Bayesian Networks to root cause analysis of air traffic safety incident reports. Applying a graph model with conditional probabilities along each edge that connected different causal factors and conditions to outcomes (safety incidents) was very illuminating — which further proved a point that I always emphasize: the natural data structure of the world is a graph (a network). A graph model (such as a Bayesian Network) explicitly reveals all sorts of interesting and useful connections, linkages, causal trails, and relationships between events and entities.

Which models/ algorithms do you make use of most often when solving data science problems?

I am less involved with model-building these days, and more focused on model reviews and recommendations for our data science teams. They do the heavy lifting. I drop in to add my suggestions and lessons learned. In many of those cases, we are definitely looking at ensembles since real world problems are complex and have nontrivial interdependencies in the data features. Exploring the high-dimensional parameter space for patterns, trends, associations, and other deep relationships requires a multi-model approach. Ensembles have proven to be exceptionally good at modeling (exploring and exploiting) such complexities.

What advice would you give to somebody starting out in data science?

I always say to newcomers in data science that you should follow your passion first. No matter what it is (science, engineering, sports, healthcare, computer technologies, art, music, finance, marketing, manufacturing, retail, customer service, ...), there is now a major digital component (with lots of data) in every discipline, domain, and direction you explore. So do both — follow your passion and also become passionate about learning the data science. You won’t regret it. In fact, every day you will love what you do because you will be doing what you love.

In your opinion, what can and can’t be predicted? E.g., do you think it's possible to predict who will win the World Series, or Apple's stock price a month from now?

Yes, no, maybe! Predictions about very specific outcomes (winners of sports games) are strongly uncertain probabilistically. But predictions about general outcomes or a range of outcomes are statistically more robust. Think of the latter as a cumulative probability versus think of a specific outcome prediction as one point in that cumulative distribution. It is easier to predict accurately that something will fall into a range of outcomes, and not so easy to predict that a specific outcome will definitely occur.

What are the biggest challenges you face as a data scientist?

I sometimes say that there are “two biggest challenges” in data science and analytics: (1) integrating diverse, disparate, distributed (often siloed), and (often) dirty data; and (2) deriving actionable insights from all those data, algorithms, and models. But it is definitely a challenge also to develop a clear statement of the problem, fix on a manageable set of desired outcomes, and specify the metrics that will validate success!

Could you give an example of a novel trading strategy which you think would be profitable in today’s world?

I wish I knew that. If I did know it, I would sell it. But, here’s my novel excuse to give you a nontrivial answer to your question: I would do a market basket analysis (association rule mining) that includes many different stocks, categories of stocks, and other factors (those things that experts call exogenous variables, such as the prices of commodities, or interest rates, or bond rates, or the top trending emerging technologies that startups and venture capitalists are investing in), then insert time delays in some of those factors’ values in the model, and then finally search for interesting rules (with good Lift) that associate rises or falls in specific stocks within X amount of time of those other factors occurring. Good luck with that. If you do make any money on it, I hereby claim 10% commission on those earnings in front of everyone here to see. 😁

Study Group for Bishop's "Pattern Recognition and Machine Learning" (Pages 44 - 58)

Here are your tasks for the week commencing 29/10/2018:

  • Read pages 44 - 58 of the book.
  • Leave at least one comment below, e.g., a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 44 - 58.
  • Optional end-of-chapter exercises will be posted here shortly. Just got to finalize them on Slack.

Enjoy 😜

Study Group for Goodfellow's "Deep Learning" (Pages 44 - 59)

Here are your tasks for the week commencing 29/10/2018:

  • Read pages 44 - 59 of Deep Learning. This finishes the chapter on linear algebra (the last couple of pages look pretty heavy, but hang in there 😲) and gives a gentle introduction to probability.
  • Leave at least one comment below, e.g., a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 44 - 59.
  • If you have time, watch the first video of fast.ai's fantastic course Computational Linear Algebra for Coders.

Enjoy 🍻

Agreed - working through the tutorial on Google Colab was brilliant!

I don't unfortunately, but would also like to know.

It's a little unclear, but surely he means that if vectors $\mathbf{x}$ and $\mathbf{y}$ are orthogonal and both have nonzero norm, then they are at $90^{\circ}$ to each other.
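A quick numerical check with a pair of made-up orthogonal vectors, using only the standard library:

```python
import math

x = [3.0, 4.0]
y = [-4.0, 3.0]  # chosen so that x . y = 0

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

dot = sum(a * b for a, b in zip(x, y))
angle = math.degrees(math.acos(dot / (norm(x) * norm(y))))
print(dot, angle)  # dot product 0.0, angle 90.0 degrees
```

Both norms are nonzero here, which is exactly the caveat in the book: the zero vector is "orthogonal" to everything by the dot-product definition, but has no angle with anything.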

Study Group for Bishop's "Pattern Recognition and Machine Learning" (Pages 29 - 44)

There was some good discussion last week - I hope people found it helpful. Here are your tasks for the week commencing 22/10/2018:

  • Read pages 29 - 44 of the book.
  • Leave at least one comment below, e.g., a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 29 - 44.

We will set exercises next week!

Enjoy 🍻

Which exercises would you like to discuss? We'll make sure to work through them when the time comes!

TensorFlow's "Train Your First Neural Network" Tutorial


If you have any questions about TensorFlow's Train Your First Neural Network Tutorial or your code isn't working 🤬, comment below and somebody will help you out!

Feel free to also share improvements which could be made to the model created in the tutorial!

Cheers 🍻

Study Group for Goodfellow's "Deep Learning" (Pages 29 - 44)


There was some really great discussion on the first 28 pages. Here are your tasks for the week commencing 22/10/2018 ...

  • Read pages 29 - 44 of Deep Learning about linear algebra.
  • Leave at least one comment below, e.g., a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 29 - 44.
  • Optionally work through TensorFlow's tutorial Train Your First Neural Network. If you have any questions about it, or your code isn't working, please post them on that thread rather than on this post.

Not the most exciting section to read through, but it's important to have a good grasp of the basics!

Enjoy 🍻

If you would like to help manage this study group, please join our Slack group!

$\mu_{\text{ML}}$ and $\sigma^2_{\text{ML}}$ do not represent the mean and variance parameters of the normal distribution in the example. They are estimators of these parameters and are functions of the data, e.g., $\mu_{\text{ML}} = \frac{1}{N} \sum_{i=1}^{N} X_i$. Since these estimators are functions of random variables (i.e., the data), they themselves are random variables and thus have a mean and variance.
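A small simulation makes this concrete (with hypothetical parameters $\mu = 5$, $\sigma = 2$, $N = 50$): each simulated dataset yields a different value of $\mu_{\text{ML}}$, and across many datasets the estimator's variance comes out at roughly $\sigma^2/N$.

```python
import random
import statistics

rng = random.Random(0)
mu, sigma, N = 5.0, 2.0, 50

# Each dataset gives a different value of the estimator mu_ML (the sample mean)
estimates = [
    statistics.fmean(rng.gauss(mu, sigma) for _ in range(N))
    for _ in range(10_000)
]

print(round(statistics.fmean(estimates), 3))     # close to mu = 5
print(round(statistics.variance(estimates), 4))  # close to sigma**2 / N = 0.08
```

So the estimator is itself a random variable: it has its own mean (here $\mu$, i.e., it's unbiased) and its own variance (here $\sigma^2/N$), which is exactly what the book's bias/variance discussion is about.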

How Many Flips?

Alice tosses a coin until she sees a head followed by a tail. Bob tosses a coin until he sees two heads in a row. On average, how many flips will Alice and Bob require?
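If you'd like to check your answer numerically before reading any solutions, here's a quick Monte Carlo sketch (fair warning: it prints estimates, so it will give the answers away):

```python
import random

def flips_until(pattern, rng):
    """Toss a fair coin until `pattern` (e.g. 'HT') appears; return the toss count."""
    seq = ""
    while pattern not in seq:
        seq += rng.choice("HT")
    return len(seq)

rng = random.Random(0)
trials = 100_000
alice = sum(flips_until("HT", rng) for _ in range(trials)) / trials
bob = sum(flips_until("HH", rng) for _ in range(trials)) / trials
print(round(alice, 2), round(bob, 2))
```

Interestingly, the two averages are not the same, even though both patterns are two flips long. Working out why analytically is the fun part!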

I’ll add that soon!

Never knew that Microsoft had open-sourced - I'm excited to see what it can do!

Hi @sharmayush - welcome to the site!

I highly recommend checking out Applied Machine Learning in Python on Coursera. It was created by the University of Michigan and is free to audit.

Hi @Khaledbayoudh - welcome to the site.

Kenneth Cukier, author of Big Data, talked about this in my interview with him.

Hope that helps!

Study Group for Goodfellow's "Deep Learning" (Pages 1 - 28)


It's great that so many of you wanted to join this study group. Here are your tasks for the week commencing 15/10/2018 ...

  • Read pages 1 - 28 of Deep Learning. I know this is more than 15 pages (as was agreed) but it's just an introduction, so won't take long!
  • Leave at least one comment below, e.g., a question about something you don't understand or a link to a brilliant resource. It must somehow relate to the first 28 pages.
  • Reply to at least one other person's comment.

I will lastly mention that the more you put into this study group, the more you will get out!

Cheers 🍻

If you would like to help manage this study group, please join our Slack group!

Hi Hari,

Fitting the model $\log{Y} = \beta X + \epsilon$ doesn't have a disadvantage per se, but of course the way you interpret $\beta$ changes. In this case, a one unit increase in $X$ leads to a $(e^{\beta}-1) \times 100 \%$ increase/decrease in $Y$ rather than a $\beta$ change in $Y$.
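For instance, with a hypothetical fitted coefficient $\beta = 0.1$:

```python
import math

beta = 0.1  # hypothetical fitted coefficient from the log-linear model
pct_change = (math.exp(beta) - 1) * 100
print(round(pct_change, 2))  # a one unit increase in X -> ~10.52% increase in Y
```

For small $\beta$ the naive reading ($\beta \times 100\% = 10\%$ here) is close to the exact value, but the two diverge as $\beta$ grows.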

Unfortunately I don't know enough about weighted least squares to help with your last question. Hopefully somebody else can!

Great to have you on board Hossein!

Thanks for posting this. It's a really great article!

Great question @cirpis! Just to clarify, are you asking for the probability that at least one pair in a group of $N$ end up picking each other?

Lessons Learnt by Facebook Data Scientist Brandon Rohrer

Brandon Rohrer, an MIT graduate now working as a data scientist at Facebook, gives advice to those starting out in data science and discusses the intricacies of model selection.

What advice would you give to somebody starting out in data science?

When I think about the advice that I would go back in time and give to myself as a beginner data scientist, several things pop to mind. The first and biggest is to spend most of your time building things. Projects, analyses, tutorials, visualizations, code. The more applied my work has been, the more I have learned from it. There is definitely a place for studying theory, derivations, and philosophical deep dives. But these for me have just been the mortar. The bulk strength of my data science foundation, the granite blocks, has come from practice.

The next piece of advice is to not be afraid of digging into your tools and asking how they work. Being able to use the XGBoost library in practice is a fantastic skill. Having an intuition for how it works takes you to the next level. It lets you accomplish things with it that you wouldn’t otherwise be able to do. It lets you know where you should and where you shouldn’t use it. While we rightly respect our models and methods, there is nothing sacred about them. They each have their limitations and quirks. Understanding these will help you progress from proficiency to mastery.

The third piece of advice I would give myself is to actively resist imposter syndrome. There are more facets to data science and more analysis tools than any one individual can possibly become an expert in. There will always be things you don’t know. You’ll hear names of algorithms tossed out casually on Twitter that you have never heard of before. This is OK. Not only is it normal, it is universal.

Which models/ algorithms do you make use of most often when solving data-science problems? How do you decide between competing models?

I’ve found model selection to be a very personal exercise. The way you go about it reveals something about you, just as you can get insights into how someone thinks by watching them play chess. One way to think about the selection process as having different levels of maturity.

At the beginning, we choose a model because we know it, or because we are familiar with it. It may not be the best, it may not even be appropriate, but it is the one that we used in our last project or that we published a bunch of conference papers about, and we are interested in it or committed to it.

The next step in the progression is a performance-driven selection. An advanced modeler will be aware that some models handle small data sets better than others, some are much more forgiving when their underlying assumptions are violated. An experienced modeler will have a solid understanding of these trade-offs, and will choose the option that best suits the problem they’re trying to solve.

After someone has seen their model implemented, and used by others over the course of a few years, they start to gain an even broader appreciation for the trade-offs involved. At this level, a modeler selects models not just based on their technical performance, but also on their long-term costs and benefits. Some models are very easy to compute. Some occupy very little space in memory or on disk. Some models are very sensitive to changes in the phenomenon being modeled and some are not. Some require much more maintenance than others, or require that someone who is an expert in the method be on hand to make adjustments. Someone selecting a model with this system level awareness will take all of these things into account.

How did you get started in data-science?

I didn’t know I was getting started in data science. I was just trying to answer questions that happened to involve a lot of data. My educational background is in mechanical engineering and biomedical applications of robotics. I was trying to do things like, given a very irregular, noisy, and intermittent recording of blood pressure, infer someone’s heart rate. Given a series of images taken from the robot's camera, infer the robot's position. Given the arm movements of several dozen stroke patients, determine whether and how much those movements have gotten smoother. These problems required a lot of data interpretation, data cleaning, modeling, and statistical assessment. Finally, in 2013 when switching employers I realized that my toolbox was being referred to as data science, and I applied for data scientist positions. I have been wearing that label ever since.

What are the greatest risks presented by Big Data and AI?

It is difficult to predict what changes a proliferation of data and machine learning methods will bring. Historically, we humans are really bad at predicting the future. So I won’t try. However I do find it instructive to look at previous examples of technological changes. Every big development, whether it is the automobile, space flight, or vacuum cleaners, has been painted as a threat to humanity as we know it. So far, this has not proven to be the case. My favorite example of this is the alarm raised by those who feared that reading books would so fully occupy the minds of young people that they would be socially stunted and degrade society. This is a reliable human response to any significant change, real or perceived. However, all of these have in fact resulted in big changes. In some cases, these changes have upended people’s lives. Some people have lost money; some people have made money. Some people have lost jobs, some people have found new areas of employment. Some regions of the country and the world have benefited more than others. If history is any guide, we can expect that the changes we’ve seen will continue to unfold in ways we haven’t quite predicted. It probably won’t be the end of the world, but it is surely worth our time to keep an eye on what is changing and in which direction.

If you're keen to learn more about programming, machine learning and AI, we thoroughly recommend checking out Brandon's brilliant YouTube channel.

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

This is a very helpful comment!

To strike-through text, use <del>some words</del>. I'll work on allowing double tildes to be used soon!

Nice explanation @Tiago!

Secret Santa 🎅

This shouldn't take too long to solve...

A group of $N$ people place their names into a hat. Each person draws a name at random to decide who they will give a gift to. What is the probability that nobody will draw their own name?

Post solutions below!

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.


According to this article, the Shapiro-Wilk test is the most powerful normality test.

Wow - brilliant solution! 👏👏👏

Spin to Win 💰

Here's a brilliant puzzle we stumbled across...

You spin a wheel and it randomly lands on $£1$, $£2$ or END. If you land on $£1$ or $£2$, you bank the money and spin the wheel again. You keep spinning until you land on END, at which point you win the money you have banked. On average, how much do you expect to win?

For an extra challenge...

What if the wheel now has values $£1$, $£2$, ..., $£n$ as well as END, and you play the same game? How much do you expect to win now?

Post solutions below!
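If you want to sanity-check your answer numerically, here's a quick simulation sketch (the wheel layout is my assumption: $n + 1$ equally likely outcomes, with 0 standing in for END):

```python
import random

def play(n, rng):
    """Spin a wheel with values 1..n plus END (coded as 0) until END comes up."""
    banked = 0
    while True:
        spin = rng.randint(0, n)  # each of the n + 1 outcomes equally likely
        if spin == 0:
            return banked
        banked += spin

rng = random.Random(0)
trials = 200_000
avg = sum(play(2, rng) for _ in range(trials)) / trials
print(round(avg, 2))  # estimated expected winnings for the £1/£2 wheel
```

Swap `play(2, rng)` for `play(n, rng)` to check your formula for the extra challenge too.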

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

Hi. Sorry for the delay. We had lots of submissions to get through. We've just posted the solution here.

Check out my ARMA-GARCH poster!


Over the summer I did a 6-week research internship at St Andrews where I studied ARMA-GARCH models for financial time series.

As part of the internship, I had to create a simple poster outlining how ARMA-GARCH models are fit and why they are more suitable for financial returns than pure ARMA models, and I thought I'd share it here for those who are interested.

Let me know what you think!

Glad you found it helpful! 😁

Kenneth Cukier on the Rise of Big Data 🤖

In this interview, Kenneth Cukier, senior editor at The Economist and co-author of the brilliant book Big Data, details how companies are making use of Big Data, what it actually refers to and the risks it presents.

Could you give some interesting examples of how companies are making use of Big Data?

There are zillions of examples. A classic one is from Charles Duhigg’s excellent book The Power of Habit, in which the US retailer Target could tell if a customer was pregnant based on their shopping patterns. It’s basic data-mining. You let shoppers with a loyalty card sign up for a baby-shower registry. You now have an idea of some people who are pregnant, and when they’re due. Then, you watch how their shopping changes over time.

In the first trimester, they’re buying things like vitamins. In the second trimester, as the body changes, they’re buying lotions (typically unscented). In the third trimester, they’re buying clothes and things for the new arrival. When you start to see a similar pattern form among other customers, you can make a prediction if they’re pregnant. This matters because people’s spending is higher and new brand loyalties form as people become parents.

Lastly, I should add that this example gets totally misunderstood: it’s NOT that the company knows that a person is pregnant before the woman herself knows. It’s that without any explicit disclosure, a company can infer a woman’s status. Strikingly, this technique can be used for lots of things -- notably, predicting medical conditions from search queries before people know they have a disease, which has been shown by Eric Horvitz of Microsoft Research for pancreatic cancer.

In your opinion, what is the most significant problem that has been solved using Big Data?

Well, considering that I treat big data as simply a user-friendly way to refer to machine learning, and a branch of machine learning is deep learning (which of course works far better now than in the past in large part because of more data), then I can claim all of the modern AI revival as a win for big data: from Alexa voice recognition to self-driving cars, to Google image search, to the recent advances in computational pathology. From this, how could I possibly choose the most significant?

But if I had to choose, I’d say using image-recognition to diagnose disease is the most important. It will change healthcare for the better and lead to better lives for many people.

What are the greatest risks presented by Big Data and AI?

The biggest danger is the loss of human freedom from a surveillance society. I don’t fear that AI will take over the world -- at least, we’re a very long way from this being probable. But it’s far more likely that people will use AI to destructive ends that harm others. I am very worried about that -- and looking at the governance we have in established democracies in the West, the fears are real. As for developing countries and China, it’s clear that the fears of this happening are even closer there. Yes, companies are powerful. But we’d rather have a tyranny of Facebook likes than a real tyranny.

What sorts of problems can Big Data not solve?

It’s a method like any other, and has its uses and non-uses. The question is like asking where stats or empirical evidence do not apply. The answer is: a whole host of areas. Matters of the heart, of risk-taking and judgement and instinct and emotion. Or, areas where just a small amount of data will do, say, grading a student exam -- where all you care about is how individuals performed and their distribution as a group. But we’ll find more and more uses for this data.

For example, if all the student exam records were digital, nationwide, and went back a decade, we might find relationships between exam performance and weather, breakfast, distance from school, etc, that we never knew before. I should add that big data can help run simulations, to predict future possibilities, but all simulations are prone to error because of the models not the data. So you could use big data to predict the potential consequences of Brexit -- but it might not be right.

Could you suggest a novel trading strategy which makes use of Big Data?

There is so much incentive for people to get a leg up on the markets that, almost axiomatically, I won’t come up with anything original. That said, one area of interesting study, as algorithmic trading increases its share of transactions, would be the degree to which markets respond to “human events” or “model-fed events” -- that is to say, a local sports team win (the French market if France wins the World Cup; the NY Yankees win the World Series, etc) versus changes where it’s data fed into an algo. Then, if we can find triggers to human emotions that might affect traders, we’d have an insight on how markets may shift. But of course I hate this very idea, since I’d rather that markets reflect fundamentals rather than the vagaries of human or algorithmic behaviour!

What are the differences between Big Data, Machine Learning and AI?

Everyone will have a different answer to this, and it’s a tiresome matter of semantics. In my mind, the term “big data” is basically a popularisation of machine learning. AI refers to the broad field, and machine learning is just one general strand of it: using data and stats to infer answers that aren’t programmed in at the outset.

There’s a lot more to these terms, but this simple breakdown should do the trick. I’d point out, though, that big data need not just mean ML. Big data is really a catch-all for the new era we’re in, where we have more data than ever, and we can do more with it than ever before. For example, we never had a way to easily measure the number and location of people on public transport in real-time, and now we can. There’s no ML involved with this -- it’s just “counting” -- but we can now do these things, and derive new insights from it.

Who benefits and who loses from Big Data?

Society benefits. No one loses per se. But there are risks and dangers. Let’s first look at the optimistic case. Saying “big data” today sounds novel and to some people, perhaps a bit frightening. But a good way to think about it is to transport ourselves in our imagination to the mid 1500s in Europe, as algebra was undergoing a revival among scholars. Someone might ask: “who benefits and who loses”…?!

How could we answer that? Society benefits: we’ll build taller buildings, sturdier ships, unlock mysteries of science to improve health. Who loses? We’ll mathematise aspects of society and overlook more spiritual or humanist dimensions of life; only a handful of people will excel at harnessing the methods; we’ll use it to build weapons, etc. Big data is similar: it’s a huge advancement in how humankind interacts with the information around us, and will improve our lives. No one “loses” even if there are concerns about how it’s used.

Can causal relationships, as opposed to correlations, be identified using machine learning techniques? Is it a problem if they can’t?

Perhaps. On a purely statistical level, no -- you only have a correlation. But if you establish a method to find causality, there’s no inherent reason why you couldn’t find causal relationships just because you’re using more data. But then you’re doing a causal experiment, and sort of ruining the benefit of relying on big data -- tapping the vast amount of data in the wild to give you a rough-and-ready answer.

For lots of things, you don’t care about the causal relationship as long as you have the correct answer. For example, why do people who buy Murakami novels like Hemingway but not Margaret Atwood? We could do interviews and psychological studies to try to learn this (an invented example). But why bother, if just by knowing it you can make better book recommendations and increase sales?

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

Black Swans Workshop: Deep Learning Models for Predicting Stock Prices


We're organising a free online workshop called "Deep Learning Models for Predicting Stock Prices" 🤖 for January 2019.

If you'd like to attend, please comment below so we can gauge interest.


Crooked Coin

We came across this brilliant puzzle on Plus Magazine, and they've kindly allowed us to share it.

Two people have asked you out for a date and you decide who to go with by flipping a coin. The only coin you have, however, lands heads with a probability of $0.75$ and tails with a probability of $0.25$. How can you use a series of flips to decide who to go with so that both people have the same chance of being picked?

Post solutions below!
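
No spoilers here, but if you want to sanity-check whatever procedure you come up with, here's a quick Monte Carlo harness (the function names are my own, and the "naive" baseline is deliberately unfair so you can see the harness working):

```python
import random

def crooked_flip(rng):
    """One flip of the crooked coin: heads with probability 0.75."""
    return 'H' if rng.random() < 0.75 else 'T'

def naive_procedure(rng):
    """A deliberately unfair baseline: a single flip, heads -> person A."""
    return 'A' if crooked_flip(rng) == 'H' else 'B'

def estimate_share(procedure, trials=100_000, seed=0):
    """Estimate how often a procedure picks person A.
    A fair procedure should come out close to 0.5."""
    rng = random.Random(seed)
    return sum(procedure(rng) == 'A' for _ in range(trials)) / trials

print(estimate_share(naive_procedure))  # far from 0.5 -- not fair!
```

Swap in your own `procedure` (any function that takes the random generator and returns `'A'` or `'B'`) and check the estimate lands near $0.5$.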

Well, you could also convert latitude and longitude to spherical coordinates. Perhaps this article will help. Best of luck with the competition - let us know how you get on!
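
In case it saves anyone a search, here's a minimal sketch of that conversion (the function name and the mean Earth radius of 6371 km are my own choices):

```python
import math

def latlon_to_cartesian(lat_deg, lon_deg, radius=6371.0):
    """Convert latitude/longitude (degrees) to 3-D Cartesian
    coordinates on a sphere of the given radius (km, Earth mean)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return x, y, z

# e.g., the North Pole maps to approximately (0, 0, 6371)
print(latlon_to_cartesian(90.0, 0.0))
```

One nice side effect for machine-learning features: unlike raw longitude, the $(x, y, z)$ representation has no discontinuity at the $\pm 180°$ meridian.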

Sign Up for an RSS e-Student Account to get Free Access to Significance Magazine!


I only just found out about this, but if you're a student you can get free access to Significance magazine by signing up as an RSS e-Student.

Here's the link to sign up.


Tim Harford on Misleading Statistics, Naive Predictions and Sticky Research

In this interview, Tim Harford, author of The Undercover Economist and senior columnist at the Financial Times, explains why Vote Leave’s bogus £350m NHS claim was so effective, why it’s so difficult to predict stock prices and why early researchers often get too much credit.

How was Brexit affected by misleading statistics?

I think the debate was derailed by endless discussion of the lie on the bus – the demonstrably untrue assertion that the UK sent £350m a week to the EU. Andrew Lilico, one of the small number of economists campaigning for Leave, told me during the campaign that he would have preferred to use a smaller number. When we discussed it after the vote, though, he reflected that the larger number had been politically effective, simply because the Remain campaign had devoted so much energy to rebutting it.

It’s something to reflect on for anyone interested in the practice of fact-checking: simply rebutting untrue claims may be unhelpful or even counter-productive. We need to be thoughtful in the way we do it.

Do you believe that enough statistical research can turn hypotheses into facts? Or will there always be doubt?

There will always be doubt but we need to bear in mind that doubt has been systematically used as a weapon against expertise. Arguably this started with the cigarette companies, who employed Darrell Huff – author of the famous How To Lie With Statistics – to chip away at the idea that the epidemiologists knew what they were doing. Then we saw the same thing with climate change, and now populist politicians.

More research is always needed, the world is always complicated, and experts often get things wrong – but we need to be careful not to let healthy scepticism turn into corrosive cynicism.

In your opinion, what can and can’t be predicted? E.g., Do you think it is possible to predict the weather? Or the consequences of Brexit? Or Apple’s stock price a month from now?

These are very different projects. The weather is a complex system, but with more data and more powerful computers you can have a go at forecasting a few days out, with some demonstrable success. (It helps that weather forecasters get a lot of quick feedback.)

Brexit is simply complicated: we’re predicting the impact of an as-yet-unknown political agreement on an already-complex economy. I think we can make some reasonable conditional forecasts – the economic equivalent of “if you eat a lot of doughnuts you will probably put on weight” – but that’s about it. And if someone eats doughnuts despite our advice, there’s always room for them to deny that the doughnuts are making any difference.

As for Apple’s stock price, a stock price is a prediction of the discounted value of future cash flows. So you’re asking for a prediction of a prediction; that’s intrinsically very hard, because the prediction influences itself. Storms don’t come because we predict storms, but shares can soar or fall because we think they will.

What are some pitfalls that statisticians should be aware of when doing research?

One difficult balancing act is to reflect the history of a claim.

On one hand, we sometimes give the early researchers too much credit. There’s a lot of social psychology at the moment based on fairly small samples with noisy measures that is proving hard to replicate, yet remains quite sticky. We seem reluctant to dismiss a small, noisy study on the basis of a larger, later study, unless the evidence is overwhelming. If the small noisy study had come later, we’d ignore it.

And physicists have found that errors in estimating physical constants – such as the charge on an electron – tend to persist. A more accurate experiment comes along, but people are reluctant to completely dismiss the earlier work.

On the other hand, if there’s a huge body of work out there and some crazy new study grabs all the headlines, the simple heuristic should be that if it’s astonishing, it’s probably wrong. This is not an easy balance to strike but we need to try.

Do you think statistics should be taught differently to how it is today?

I think we should think more about the way statistics are communicated. This is an old argument – Florence Nightingale and William Farr debated how her famous Coxcomb diagrams should be presented. Farr said that “statistics should be the dryest of all reading”; Nightingale believed that statistics needed to be presented with impact if they were to make a positive difference. She was right – but of course statistics can be presented with misleading impact, too.

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

I can't think of any others myself, but I'll share your post with others and see what they think. Out of interest, what are you using the GPS coordinates for?

Nice solution @KvanteKat! 👏👏👏

Looks good kl2792! I'm currently adding markdown support for posts and comments to make code look nicer.

Hi Prabhat - great to have you on the site, and congratulations for getting your AI internship with Samsung 🎉🍾!

Out of interest, what fields in AI and machine learning most interest you?

Hi langtudeplao - welcome to the site.

You don't need to know vast amounts of maths to solve problems with machine learning techniques. As you mention, it would be helpful if you learnt Linear Algebra, since that would help you to understand some of the algorithms, as well as some basic statistics.

The best book I've come across to learn about machine learning from scratch is Introduction to Statistical Learning. You can get it for free here.

Out of interest, how do you call R from Python? Is the rpy2 package your best option?

Hey James - I think your interpretation of fair is most appropriate.

Hi Paula - welcome to the site. Your blog is amazing! I just read your post Deadly Ice Cream, which brilliantly explains the difference between correlation and causation, and encourage everybody else to do so as well!

The Last Banana 🍌

Here's a counter-intuitive puzzle:

Tom and Harry are trapped on a desert island, playing a game of dice for the last banana. They roll two dice, and agree that if the biggest number rolled is a one, two, three, or four, Tom wins, but if the biggest number rolled is a five or six, Harry wins. Who has the better chance of winning the banana?

Post solutions below!
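
Once you've committed to a guess, a brute-force enumeration over the $36$ equally likely rolls will confirm it (so running this spoils the answer -- guess first!):

```python
from itertools import product

# Count, over all 36 equally likely rolls of two dice, how often the
# larger value favours Tom (max of 1-4) versus Harry (max of 5 or 6).
tom = sum(1 for a, b in product(range(1, 7), repeat=2) if max(a, b) <= 4)
print(f"Tom wins {tom}/36 rolls, Harry wins {36 - tom}/36")
```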

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

Ahh I see - very helpful points. Which areas of statistics are you most interested in?

Thanks for the advice Fich! I'm curious to learn what you gained from your masters which you wouldn't if you instead just read through statistics textbooks? I guess you must meet a lot of like-minded people which would be rewarding.

True - I think this one is a little easier though.

Can You Solve the Puzzle Which Inspired Probability Theory?

The puzzle below, known as The Problem of Points, is often said to have inspired probability theory.

Fermat and Pascal play a game by repeatedly flipping coins - if a coin lands heads, Fermat gains a point, and if it lands tails, Pascal gains a point. It is agreed that the first to $10$ points wins a prize of $100$ francs. Suppose the game abruptly stops when the score is $8$-$7$ to Fermat. How should the prize be fairly divided between the two players?

Post solutions below!
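
If you'd like to check your pen-and-paper answer, a simple approach is to play the interrupted game out many times and see how often Fermat wins (everything below is my own sketch; it only gives an estimate, not the exact fraction):

```python
import random

def simulate_division(fermat=8, pascal=7, target=10,
                      trials=200_000, seed=1):
    """Estimate Fermat's fair share of the prize by finishing the
    interrupted game many times with a fair coin."""
    rng = random.Random(seed)
    fermat_wins = 0
    for _ in range(trials):
        f, p = fermat, pascal
        while f < target and p < target:
            if rng.random() < 0.5:
                f += 1
            else:
                p += 1
        fermat_wins += f == target
    return fermat_wins / trials

# Fermat's fair share is roughly 100 francs times this probability.
print(simulate_division())
```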

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

What is your Statistic of the Year?

The Royal Statistical Society (RSS) are accepting nominations for Statistic of the Year.

If you know of any mind-blowing 🤯 statistics, I encourage you to post them below and submit them to the RSS!

Nice answer simon!

Foxes and Hounds

Here's another puzzle, kindly given to us by Adrian Torchiana, creator of the brilliant app Probability Maths Puzzles.

Five foxes and seven hounds run into a foxhole. While they're inside, they get all jumbled up, so that all orderings are equally likely. The foxes and the hounds run out of the hole in a neat line. On average, how many foxes are immediately followed by a hound?

Post solutions below!
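
As ever, a quick simulation is a good way to check your answer before posting (this is just my own sketch; the exact value drops out of a neat pen-and-paper argument, so try that first):

```python
import random

def average_fox_hound_pairs(foxes=5, hounds=7, trials=100_000, seed=2):
    """Shuffle the line uniformly at random and count how many foxes
    are immediately followed by a hound, averaged over many trials."""
    rng = random.Random(seed)
    line = ['F'] * foxes + ['H'] * hounds
    total = 0
    for _ in range(trials):
        rng.shuffle(line)
        total += sum(a == 'F' and b == 'H' for a, b in zip(line, line[1:]))
    return total / trials

print(average_fox_hound_pairs())
```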

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

Funny you should post this - I'm in the same boat! Would be interested to hear peoples' thoughts.

Just added an option to delete your account ✅

Thanks for your feedback BVO999. I'm working on the basic functionality now - should be done by tomorrow 😁. If there's anything else you think I could work on, just let me know!

Hi! Thank you for your comments. I will add an option to delete your account ASAP, but am not sure where I stand on threads. I agree that they encourage discourse, but am not sure how nice they look! If lots of people want them I'll of course build them in.

Hi James - welcome to the site! I used to prefer pure maths too, but then I stumbled upon quant trading 😍 which heavily involves stats. Best of luck 🍀 with your A-Levels and university applications. If you ever want any help with them, feel free to write a post and I'm sure people would help you out.

How a Firefox Engineer is Using Distributed Computing to Replicate AlphaZero 🔥🔥

Gian-Carlo Pascutto is the creator of LeelaZero, an Open Source Go Engine based on DeepMind’s AlphaZero. In this interview he explains why he built it, how it works and how he plans to improve it in the future.

What motivated you to build LeelaZero originally?

Leela Zero was based on Leela, so we have to start from there. I originally programmed computer chess, but got fed up with the plagiarism in that domain, so when computer Go sort of got "reset" with the introduction of MCTS (Monte Carlo Tree Search) in 2006, I started exploring that as well. I had some fun and achieved some good results, but, at that time, computers weren’t really strong enough to be useful.

In 2016 my interest in computer Go was rekindled a bit when AlphaGo beat Lee Sedol, so I updated Leela with neural network support, which got a ton of very positive feedback from the Go community. When the AlphaZero paper came out, it was obvious to me that reconstructing it starting from Leela would not be too hard, and that hardware would be the limiting factor, but I also figured that the largest part of the computation was amenable to being distributed, so I open-sourced the result as LeelaZero.

How does LeelaZero actually work?

LeelaZero probes a deep residual convolutional neural network (DCNN) to assess which player is in a better position and to estimate, for each candidate move, the probability that it is the best one available. Then, based on this information, it constructs a search tree and re-evaluates the network after every move. This allows it to search for moves that initially look good and then shift its focus (after enlarging the search tree) to moves that end up leading to good positions.

The reinforcement learning procedure that allows the program to self-learn basically just runs the tree search to select moves and plays an entire game, and then feeds back the output from the search to the neural network, so a single evaluation of the network can learn to "predict" what a larger search evaluating hundreds of nodes would eventually have found. The outcome of the game corrects the assessment of who was better in each position.
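
The loop Gian-Carlo describes can be caricatured in a few lines. Everything below is a toy of my own invention -- a random-walk "game" and a lookup-table "network", nothing from the Leela Zero codebase -- but the shape is the same: search, self-play, label each position with the final outcome, then train on those labels.

```python
import random

def fake_search(value_table, state, rng):
    """Toy stand-in for MCTS: take the raw 'network' value and refine
    it with a noisy deeper look, clamped to [-1, 1]."""
    deeper = value_table.get(state, 0.0) + rng.gauss(0, 0.1)
    return max(-1.0, min(1.0, deeper))

def self_play_game(value_table, rng, length=10):
    """Play one 'game' (a random walk), recording the search value at
    each position, then label every position with the final outcome."""
    history, state = [], 0
    for _ in range(length):
        history.append((state, fake_search(value_table, state, rng)))
        state += rng.choice([-1, 1])
    outcome = 1.0 if state > 0 else -1.0
    return [(s, v, outcome) for s, v in history]

def train_step(value_table, records, lr=0.1):
    """Nudge the table toward the outcome at every visited position --
    the analogue of fitting the network to self-play results."""
    for state, _search_value, outcome in records:
        old = value_table.get(state, 0.0)
        value_table[state] = old + lr * (outcome - old)

rng = random.Random(42)
value_table = {}
for _ in range(200):
    train_step(value_table, self_play_game(value_table, rng))
```

The real system replaces the lookup table with a deep residual network, the noisy estimate with a genuine tree search over hundreds of nodes, and the random walk with Go -- but the feedback cycle is the same.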

How does LeelaZero differ from AlphaZero?

We followed AlphaZero’s published papers as closely as possible, though we can't be 100% sure because some parts are a bit ambiguous. There have been some small tweaks for efficiency, such as starting the self-play with a smaller network (you don't need a large one if the moves are still mostly random), a better initialization for not-yet-evaluated nodes, etc. The problem with such improvements is that we can't restart the entire learning procedure to find out what their "eventual" effect is (unless we want to wait many months!), so we have to be rather conservative.

How do you leverage distributed computing in training LeelaZero?

About 500 people on average are running a client that downloads LeelaZero’s “best” network and lets the program play against itself. This generates training data that is collected on a central server. After a number of training steps, a new candidate network is uploaded and clients start playing matches against the old network to determine whether it is better (and thus the new "best").

In Season 12 of the “Top Chess Engine Championship”, LeelaChessZero (a chess engine adapted from LeelaZero) only won one of its 28 games, but in Season 13, it won its division. How did it improve so much over such a short period of time?

Various reasons. A lot of bugs were fixed, it had been trained longer, an optimized client was written to work faster, etc. Barring any bugs (very important), a reinforcement learning approach has a very fast initial phase until the gains start to flatten out, so throwing hardware at it leads to rapid improvement. E.g., Google famously claimed it took "4 hours" for AlphaZero to master chess.

How do you plan to make Leela Zero better in the future?

For pure strength, the size of the network can be enlarged. Of course there is a point of diminishing returns eventually, but we haven’t reached it yet. I strongly suspect the tree search itself can also be improved further to get some strength gain - this is a good area for future research.

What is your goal for Leela Zero?

The goal was to demonstrate that one could replicate the DeepMind result with a distributed effort and make the results and data publicly available. Of course it can always be made a bit better or stronger but my personal goal for the project has been achieved. The program is much, much stronger than my previous closed source engine and it's beyond human level too.

Will Leela learn to play games other than Go and Chess in the future?

I don't plan to work on any other games myself, but obviously the Leela Zero code is open source so it does not only depend on me.

To learn more about Leela Zero, go to

If you would like to be sent a list of our interviews and puzzles each week, sign up to Black Swans here.

Good point Tiago - the answer should not be provided conditional on the fact that one of the players wins.

Ah I see. I thought you might be targeting it at investors interested in healthcare stocks. If you ever wanted any help or advice, don't hesitate to write a post, I'm sure lots of people would be happy to help you out 😁.

Hi Caitlin - welcome to the site.

I would love to hear more about your SaaS. It sounds really interesting, and is right up my street 😎. What do you think people will primarily use the service for?

I'm always impressed by people who can surf 😂, I've only ever managed to get up on the board 🏄‍♂️ for a couple of seconds.

In my opinion, time series analysis is a really interesting topic to learn about. It's exciting to build up tools which can predict Apple's stock price 💵 tomorrow or the support for politicians in the run up to elections 🗳.

Who's Doing the Dishes?

We're giving £100 🤑 to whoever submits the best (i.e., shortest, cleverest, etc.) solution to the puzzle below, kindly donated to us by Professor Peter Winkler, author of Mathematical Mind-Benders and Mathematical Puzzles: A Connoisseur's Collection:

You have begun a series of rock-paper-scissors games with your spouse, the loser to wash the dishes. Bad news: your spouse being the better player, your probability $p$ of winning any given game is $< 0.5$. Worse, you've already lost the first game. But there's good news too: your spouse has generously agreed to let you pick, right now, the number $n$ of games that must be won to win the series. Thus, for example, if you pick $n=3$, you must win $3$ games before your spouse does (but remember, your spouse has already won a game). What is your best choice of $n$, as a function of $p$?

Send scanned solutions to [email protected] before 21st September and we'll announce the winner 🍾 shortly afterwards. If you have any questions, post them below.
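
If you'd like to explore the problem numerically before (or after!) working it out by hand, here's a small sketch of my own: the series-win probability satisfies a simple recursion, which we can memoise and then scan over $n$. Fair warning -- `best_n` computes the answer for any given $p$, so pen-and-paper purists should look away.

```python
from functools import lru_cache

def win_prob(n, p):
    """Probability you win the series if you must take n games before
    your spouse takes n - 1 (they already have one win), winning each
    game independently with probability p."""
    @lru_cache(maxsize=None)
    def f(you, them):
        # f(a, b): probability you win when you need a more games
        # and your spouse needs b more.
        if you == 0:
            return 1.0
        if them == 0:
            return 0.0
        return p * f(you - 1, them) + (1 - p) * f(you, them - 1)
    return f(n, n - 1)

def best_n(p, n_max=60):
    """Search a range of n for the choice that maximises your chances."""
    return max(range(1, n_max + 1), key=lambda n: win_prob(n, p))
```

Sanity checks: $n=1$ loses immediately (your spouse already has their one win), and $n=2$ requires winning the next two games outright, so `win_prob(2, p)` should equal $p^2$.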

If you would like to be sent a list of our statistics puzzles each week, sign up to Black Swans here.

Random Connect 4

We're giving a year's subscription to Spotify Premium 🥳 to whoever submits the best (i.e., shortest, cleverest, etc.) solution to the puzzle below:

Jack and Will play a game of Connect 4. If Jack goes first, and both players randomly choose which columns to drop their discs down, what is the probability that Jack wins the game?

Send scanned solutions to [email protected] before 21st September and we'll announce the winner 🍾 shortly afterwards. If you have any questions, post them below.
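
The exact number is the prize-worthy part, but a quick Monte Carlo gives you something to aim for. Here's a sketch of my own (I'm assuming the standard $7 \times 6$ board; after each move we only check lines through the disc just placed, which keeps the simulation fast):

```python
import random

ROWS, COLS = 6, 7

def makes_four(board, r, c, player):
    """Check whether the disc just placed at (r, c) completes a line
    of four in any of the four directions."""
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            rr, cc = r + sign * dr, c + sign * dc
            while 0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == player:
                count += 1
                rr += sign * dr
                cc += sign * dc
        if count >= 4:
            return True
    return False

def random_game(rng):
    """Both players drop discs into uniformly random non-full columns.
    Returns 1 (Jack, who moves first), 2 (Will) or 0 (draw)."""
    board = [[0] * COLS for _ in range(ROWS)]
    heights = [0] * COLS
    player = 1
    while True:
        open_cols = [c for c in range(COLS) if heights[c] < ROWS]
        if not open_cols:
            return 0  # board full: draw
        col = rng.choice(open_cols)
        row = heights[col]
        board[row][col] = player
        heights[col] += 1
        if makes_four(board, row, col, player):
            return player
        player = 3 - player

rng = random.Random(3)
games = [random_game(rng) for _ in range(10_000)]
print("Jack:", games.count(1) / len(games),
      "Will:", games.count(2) / len(games),
      "draws:", games.count(0) / len(games))
```

For the prize you'll want an exact argument (or at least an exhaustive computation), but the estimate above is a useful check on any answer you derive.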

If you would like to be sent a list of our statistics puzzles each week, sign up to Black Swans here.

These might change, but at the moment I've put down:

  • Computing in Statistics
  • Bayesian Inference
  • Joint Project (on ARIMA-GARCH models)
  • Statistical Modelling

Hi Brooke - great to have you on the site! Hopefully I'll get a chance to meet you in St Andrews!

Thanks Jordan. Glad you like the site. Let me know if there's anything I could do to make it better.

Hi, I'm Jack.

For the past month I've been creating Black Swans 😎, but before that I did a 6-week research internship (supervised by Valentin Popov) at St Andrews ⛳ where I built an ARMA-GARCH Python package called tsmodels. Currently it’s much slower than similar R packages (e.g., rugarch), but I plan to speed it up using Cython 🔥.

Earlier this year, I created another package called algo-trading which backtests trading strategies. Although it's quite basic at the moment, I’d love to develop it into a fully-fledged trading platform 💵 that would let you create and manage strategies and execute trades.

Besides stats, I really enjoy playing and watching football, golf and tennis. For those that are interested, I’m a huge fan of Man Utd, Roger Federer 🐐 and (despite his private life 😮) Tiger Woods.


Introduce Yourself!

Here's what you should do if you're new to the community:

  1. Reply to somebody else's introduction, just to say hi or to mention that you are working on similar projects or have common interests.
  2. Leave your own comment introducing yourself. Let people know what statistics topics you're interested in, what projects you're currently working on and what you enjoy doing in your free time (apart from stats 🤓).

I look forward to getting to know you all!

Plans for the Future

Here are some features I’d like to build into the site.

  1. Some sort of functionality which allowed users to independently organise statistics meetups. Nothing beats meeting people face-to-face. Each meetup would start with one or two speakers presenting exciting projects they’ve been working on, and then there would be an opportunity to chat with others in a very informal setting 🍻.
  2. Allow users to “follow” others, and then on the Community page there would be a button to see just the posts from people you follow.
  3. Allow users to “favourite” posts, which are then stored in their profile.
  4. After signing up, users are presented with a bunch of stats topics and asked to select the ones that interest them. Then, if the user selected at least one topic, a new tab appears saying something like “Recommended”.

Let me know your thoughts...

Welcome to Black Swans!

Hi, I’m Jack. I created Black Swans – a tight-knit group of students and lecturers from UK universities who help each other with statistics problems and collaborate on projects.

How to Get Started

To get started, all you have to do is create an account, although I encourage you to also upload a profile picture and add a short bio. It would also be nice if you introduced yourself to the community.

What Can You Post About?

Anything that somehow relates to statistics and isn’t an advert. Perhaps you’ve been stuck on a tutorial question for the past hour and want someone to shed some light 🔦 on it. Or maybe you’re building a neural network to predict football scores and want to find others who might be interested in collaborating with you. It could even be that you'd like to share some mind-blowing 🤯 results from your research.

How to Tag Posts

A tag is a label (e.g., machine-learning) which categorises a post. Adding one or more (max 3) tags to your post makes it easier for people who will be interested in your post to find it, and so it’ll get more comments 😁. Most of the tags which I’ve created are self-explanatory, except “management” perhaps, which should be used for posts relating to the management of Black Swans. If you think a new tag should be created, please send me a message.

Upvoting Posts and Comments

We’re a tight-knit community. We want to help each other. We want to inspire each other. We want to make each other laugh 🤣. So if you stumble across a post or comment which you find insightful, thought-provoking or entertaining, please upvote it, because then more people will see it 🙏. It also shows appreciation to the person who created the post or comment.

Want to Support the Site?

It costs me roughly £200 each year to run Black Swans, and also takes up lots of my time and energy. If anyone could chip in anything 💵 at all to help out, please get in touch.

I’d also really appreciate it if you followed the account on Instagram, which I use to showcase interesting and inspiring projects that people are working on, and also to get feedback on the site.