The No Free Lunch Theorems

This article summarizes several key points about the no free lunch (NFL) theorems. These are a set of mathematical results about the efficacy of methods in search, optimization, and supervised learning, especially metaheuristics like evolutionary or nature-inspired swarm algorithms. Do they really “work”? What does it even mean to say that they “work”?

The no free lunch theorems are frequently brought up in online discussions, but often without any depth or mention of their nuances, which is part of my motivation for writing this article. Another key reason for the existence of this article is that the authors themselves later singled out the existence of free lunches in coevolutionary algorithms.

Before we dig into the material, I would like to point out that the authors of these theorems have a website dedicated to this topic. However, do note that some of its links were broken at the time of writing.

The Papers and What They Say

Online discussions tend to cite the no free lunch theorems as saying “there is no best algorithm” whenever someone asks if one algorithm is better than another. This is, of course, too simplistic a description of the no free lunch theorems. What I hope to do here is go one step further: summarize the key points of each paper, and list situations that can in fact have a “best” algorithm.

Supervised Learning

The very first paper on the NFL theorems was for supervised learning:

  • David Wolpert 1996 - “The Lack of A Priori Distinctions Between Learning Algorithms”.

This paper is infamous for being “often cited but rarely read”: it runs to 52 pages and appears to be a rather difficult read. I must admit that I have not fully read it either. Fortunately, the author later wrote a version that is a lot easier to read.

The author introduced a new “Extended Bayesian Formalism” framework and analyzed the expected error (“cost”), $ \mathbb{E}(C | d) $ , of a supervised learning algorithm, given training set $ d $ .

Key Points

  • The datasets used for calculating the expected error must consist of “off-training-set” data points, i.e., data points that do not appear in the training dataset. This strict assumption prevents algorithms that memorize the training data from gaining an advantage.

  • Let $ d $ be the training dataset. Then an algorithm is equivalent to $ P(h \mid d) $ , where $ h $ is the algorithm’s estimate of the true objective function $ f $ . The $ h $ here stands for “hypothesis”.

  • The author uses $ P(f \mid d) $ to express our uncertainty about the true objective function $ f $ that the algorithm is trying to learn.

  • Finally, the author showed that $ \mathbb{E}(C | d) $ is the same for any two algorithms when uniformly averaged over all objective functions $ f $! The same is true when uniformly averaged over all priors $ P(f) $, and the analogous results hold when conditioning only on the training set size $ m $ instead of the full training set $ d $.

In English, this means that if we do not know anything about the objective function $ f $ that we are trying to learn, then every function is equally possible, and any two algorithms will have the same expected error on unseen data points.
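To make this concrete, here is a minimal brute-force sketch (my own illustration, not from the paper) on tiny finite sets. It enumerates every objective function $ f $ on a three-point input space, trains two toy learners, a “majority vote” algorithm and its deliberately perverse opposite, and averages their off-training-set errors. The names `algo_majority` and `algo_anti` are hypothetical, chosen just for this demonstration.

```python
from itertools import product

X = [0, 1, 2]        # input space
Y = [0, 1]           # output space
train_x = [0, 1]     # training inputs
test_x = [2]         # the off-training-set point

def algo_majority(d):
    """Predict the majority label seen in the training data (ties -> 0)."""
    ones = sum(y for _, y in d)
    return 1 if ones > len(d) - ones else 0

def algo_anti(d):
    """Deliberately predict the opposite of the majority label."""
    return 1 - algo_majority(d)

algos = {"majority": algo_majority, "anti-majority": algo_anti}
errors = {name: 0 for name in algos}

# Uniformly average the off-training-set error over all |Y|^|X| = 8 functions.
for f_vals in product(Y, repeat=len(X)):
    f = dict(zip(X, f_vals))
    d = [(x, f[x]) for x in train_x]   # training set labelled by the true f
    for name, algo in algos.items():
        errors[name] += sum(algo(d) != f[x] for x in test_x)

for name in algos:
    print(name, errors[name] / 2 ** len(X))   # both print 0.5
```

Whatever the majority learner gains on the functions where the unseen label happens to agree with the training data, it loses on the adversarial ones, and the average washes out to 0.5 for both learners.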

Search and Optimization

Two more papers by the same authors followed, proving no free lunch theorems for search (a 1995 technical report) and for optimization (published in 1997).

I have not read the one for search algorithms, but the reasoning should be similar to that for optimization, since optimization can be seen as a search for the maximum or minimum of an objective function.

Key Points

  • Let $ f : X \rightarrow Y $ be an objective function on finite sets $ X $ and $ Y $ . The paper considers the space of all possible objective functions on these two sets. This space has size $ |Y|^{|X|} $.

  • Let $ d^y_m = \{ d^y_m(1), d^y_m(2), \ldots, d^y_m(m) \} $ be a time-ordered sequence of $ m $ $ Y $-values. Note that $ d^y_m(i) = f(d^x_m(i)) $ , where $ d^x_m(i) $ is the $ X $-value “visited” by the algorithm at time $ i $ .

  • For any pair of algorithms $ a_1 $ and $ a_2 $ , this version of the NFL theorem says that $ \sum_f P(d^y_m \mid f, m, a_1) = \sum_f P(d^y_m \mid f, m, a_2) $.

In English, this means that any two algorithms have the same probability of generating any particular sequence of observed values, when averaged over all possible objective functions!
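Again, this can be checked by brute force on tiny sets. The sketch below (my own illustration, with hypothetical algorithm names) compares a fixed left-to-right scan with an adaptive search whose next query depends on the values seen so far. Enumerating all $ |Y|^{|X|} = 16 $ objective functions shows that both algorithms produce every possible sequence of observed values equally often.

```python
from itertools import product
from collections import Counter

X = [0, 1, 2, 3]   # search space
Y = [0, 1]         # objective values
m = 3              # number of distinct points each algorithm visits

def fixed_scan(f):
    """Visit points 0, 1, 2 in a fixed order."""
    return tuple(f[x] for x in X[:m])

def adaptive_scan(f):
    """Start at 0; jump high after seeing a 1, low after seeing a 0."""
    visited, ys, x = [], [], 0
    for _ in range(m):
        visited.append(x)
        ys.append(f[x])
        unvisited = [u for u in X if u not in visited]
        x = max(unvisited) if ys[-1] == 1 else min(unvisited)
    return tuple(ys)

hist_fixed, hist_adaptive = Counter(), Counter()
for f_vals in product(Y, repeat=len(X)):   # all 16 objective functions
    f = dict(zip(X, f_vals))
    hist_fixed[fixed_scan(f)] += 1
    hist_adaptive[adaptive_scan(f)] += 1

print(hist_fixed == hist_adaptive)   # True: identical outcome distributions
```

Each of the eight possible value sequences is produced by exactly two of the sixteen functions for either algorithm, which is the theorem in miniature.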

Coevolutionary Free Lunch

Finally, the authors wrote a paper much later showing the existence of free lunches for coevolutionary algorithms. I was very excited to read this at first, but it turns out that this result comes with a massive caveat and is not the discovery of an actual free lunch.

Key Points

  • There are two players, the agent and an opponent. Let $ \underline{x} $ be the agent’s moves (choice of input value $ x $ ) and $ \bar{x} $ be the opponent’s responses to the agent’s moves.

  • The “play” is sequential. The agent selects a move, then the opponent selects a response.

  • Let $ f: X \times X \rightarrow Y $ be the fitness/objective function of the agent, whose input elements are pairs $ (\underline{x}, \bar{x}) \in X \times X $. Here, $ f $ is a payoff function that the agent wants to maximize.

  • The conclusion section of the paper explains succinctly why free lunches exist here: summing over all payoff functions $ f $ is not the same as summing over all functions $ \min_{\bar{x}} f( \cdot , \bar{x}) $ .

In English, this means that for algorithms whose payoff depends on the opponent’s response, some algorithms can do better than others. Trivially, an algorithm that chooses to ignore the opponent’s moves will perform worse than one that tries to extract information from them. This Stack Exchange discussion has a very nice phrase that summarizes why free lunches exist in this coevolutionary setting:

“Lunches are free because the opponent is buying.”
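The asymmetry in the conclusion above is easy to see numerically. Here is a minimal sketch (my own, not from the paper) that enumerates every payoff function $ f $ on a two-move game and tabulates the induced worst-case value $ g(x) = \min_{\bar{x}} f(x, \bar{x}) $ of each agent move. Even though $ f $ ranges uniformly over all payoff functions, $ g $ does not range uniformly over all functions, so the uniform-averaging argument behind the original NFL theorems breaks down.

```python
from itertools import product
from collections import Counter

X = [0, 1]   # moves available to the agent and the opponent
Y = [0, 1]   # payoff values

counts = Counter()
for f_vals in product(Y, repeat=len(X) * len(X)):   # all 16 payoff functions
    f = dict(zip(product(X, X), f_vals))
    # Worst-case payoff of each agent move against a minimizing opponent.
    g = tuple(min(f[(x, xbar)] for xbar in X) for x in X)
    counts[g] += 1

# The induced functions g are heavily skewed toward low payoffs:
# (0, 0) occurs 9 times, (0, 1) and (1, 0) occur 3 times each, (1, 1) once.
print(counts)
```

Because the distribution over the induced worst-case functions is not uniform, averaging over all payoff functions no longer forces all algorithms to tie.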

When Can One Algorithm Outperform Another?

The NFL theorems say that the performance of all algorithms is the same when averaged over all possible objective functions. However, it is still possible for one algorithm to beat another in several ways.

  • The NFL theorems do not compare computational time or resource usage. An algorithm can outperform another by running much faster or using far fewer resources.

  • If we have an inductive bias about the structure of our dataset or objective function, it is possible for some algorithms to exploit this structure and outperform others.

  • The NFL theorems are not known to hold for “higher order” algorithms that use information such as the function’s gradient (1st order) or Hessian (2nd order). This Stack Exchange discussion has a great explanation of this; the sketch below illustrates the idea.
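As a quick illustration of the last two points (my own sketch, not from any of the papers), consider a smooth convex objective. An algorithm that reads the gradient exploits exactly the structure the NFL average washes out, and it handily beats structure-blind random search:

```python
import random

def f(x):
    """A smooth convex objective -- exactly the structure NFL averages away."""
    return (x - 3.0) ** 2

def grad_f(x):
    return 2.0 * (x - 3.0)

# Gradient descent: exploits first-order structure.
x = 0.0
for _ in range(50):
    x -= 0.1 * grad_f(x)
print("gradient descent:", f(x))   # ~1e-9 after 50 steps

# Random search: ignores all structure; 50 evaluations for fairness.
random.seed(0)
best = min(f(random.uniform(-10.0, 10.0)) for _ in range(50))
print("random search:", best)      # typically around 1e-2 to 1e-1
```

This does not contradict the NFL theorems: they average over all functions, almost none of which have exploitable smoothness, and they are not known to cover algorithms that query gradients at all.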

Follow-Up Research

There are also numerous directions for future research.

  • For what kinds of datasets and objective functions do the NFL theorems fail to hold? For example, Schumacher, Vose and Whitley 2001 showed that the NFL theorems hold if and only if the set of objective functions being averaged over is closed under permutation (see the sketch after this list).

  • Which algorithms perform better on these datasets and objective functions? Why do they outperform?

  • When do metaheuristics have an advantage? For example, instead of dismissing all genetic algorithms as “useless” because of the NFL theorems, we could research the conditions under which genetic algorithms have an advantage.
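The “closed under permutation” condition from the first bullet is easy to state in code. The following sketch (the helper `closed_under_permutation` is my own, hypothetical) represents each function on $ X = \{0, 1, 2\} $ as a tuple of its values and checks whether relabeling the inputs ever leaves the set:

```python
from itertools import permutations

X = [0, 1, 2]

def closed_under_permutation(F):
    """F is a set of functions, each written as a value tuple (f(0), f(1), f(2))."""
    return all(tuple(f[s[i]] for i in range(len(X))) in F
               for f in F for s in permutations(range(len(X))))

# All placements of a single 1 ("needle in a haystack, position unknown"):
F1 = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
print(closed_under_permutation(F1))   # True  -> NFL holds over F1

# A needle whose position is known to be 0:
F2 = {(1, 0, 0)}
print(closed_under_permutation(F2))   # False -> a best algorithm can exist
```

Over F2, the algorithm that queries position 0 first wins outright, and the Schumacher, Vose and Whitley result tells us this is possible precisely because F2 is not closed under permutation.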

Finally, here is a review paper, not written by the original authors, that I find very useful for keeping up with the various developments in this area.