Statistical Learning is not “Learning”

New techniques are named to provide an intuition of their process.  All techniques built upon them are colored by the original terminology.  In time, those techniques come to color the terminology itself, slowly supplanting the original meaning with the new one.


Sometimes a discussion about word usage is an irrelevant exercise in ego.  Other times, word usage becomes very important: the word choice can (perhaps deliberately) mislead people, and make progress very difficult in any topic space where those words are common.

This is one of those times.  Whether by wishful thinking, or by a genuine desire to cast work in the light of “learning” for grant proposals, (statistical) frequency updating has come to be thought of as “learning”, and over time has come to influence the way people think of and interpret learning itself.

Unfortunately, the difference between statistical learning and real, commonly accepted learning is the difference between night and day. The terminology “learning” contains a set of features which have been unknowingly overridden and replaced by a conflicting set of features. Without explicit knowledge of the conflict, maintaining awareness of the difference, and of the applicability of the resultant models, becomes formidably difficult.

This paper deals with the feature set incompatibilities of key terminology related to learning/causal model building (“learning”, “generalization”, and “hypothesis” (LG&H)), not necessarily with mathematical differences.  Exposing the mathematical differences entails a self-consistent set of definitions, which does not yet exist.


In research & academics, many words take on a “technical” definition additional to their commonly understood feature sets.  In situations of conflicting usage, one might be inclined to think of the common definition as “unsophisticated” or “naive”, rather than to resolve the implicit conflict.

To be a compatible and “sophisticated” definition, the technical definition must resolve to the non-technical definition when used in the commonly understood context.  That is, the application of technical parameters to some terminology T() must be commutative with their removal to be applicably compatible: removal either before or after application should yield the same result (same meaning).

Example: Consider Johnny running: running(Johnny())

The function “running()” has a primary feature set, derived from the word ‘run’: “moving continuously” and “legs pumping” (in non-harmonic energy transfer, i.e. not jogging, bouncing, loping, etc.).  running() has several secondary, overridable features: “high energy output”, “forward”, “high speed”.

Over time, we may find it insufficient to simply describe Johnny as running; we want to know his exact position at time t.  So we give running the technical definition runningposition(t): x = 5×t.  There are many other ways to run, but let’s just assume this one for now.  This definition of running is applicably compatible with our common definition of “running()” because the primary feature of “moving continuously” is satisfied by x = 5×t.  At every point in time, Johnny is at a different position than the last (continuous in position).

But suppose we define runningposition as a step function: x = 5×step(t).

While Johnny may reach the end at the same time (t) with our step function, this would be an incorrect definition, because he is not actually moving continuously, and thus not actually “running”.  He is merely warping statically between positions.  The technical function becomes incompatible with our common definition and cannot be applied at any time (t) between the warp times.
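The compatibility test above can be sketched in code.  This is a minimal illustration, assuming the constant speed of 5 and a step interval of 1 from the example; the function names are invented for this sketch.

```python
import math

def running_continuous(t):
    """Continuous technical definition from the text: x = 5 * t."""
    return 5 * t

def running_step(t, interval=1.0):
    """Step-function definition: x = 5 * step(t).  Position changes
    only at interval boundaries -- Johnny "warps" between positions."""
    return 5 * (math.floor(t / interval) * interval)

# At a warp boundary the two definitions agree...
assert running_continuous(2.0) == running_step(2.0) == 10.0
# ...but between warps the step definition violates "moving continuously",
# so it is incompatible with the common definition of running:
assert running_continuous(2.5) == 12.5
assert running_step(2.5) == 10.0  # Johnny has not moved since t = 2.0
```

The step function matches the continuous one only at the warp times; at every t in between, the primary feature of the common definition fails to hold.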

Thus, for a technical definition and feature set to be self-consistent, it needs to inherit (and perhaps extend) the well understood and primary definition, but not override it.


To understand the primary component of the word “learning”, it is not necessary to have a model for how learning actually works — only what learning produces.  Though it may be possible to produce identical output via a non-learning process, failure to produce the correct output is a clear indication of an incompatibility.

The primary and defining component of the word “learning”, and thus its output, is that “the process of learning” produces 100% correctness.  All internal process models aside, the end result of C having been “learned” is that C is applied without mistake each and every following time.

One might be inclined to think of a process that “improves” as “learning”, but correctness in the limit is not correctness.  For “learning” to occur, there must be a state where the process of “learning” is changed to “learned” (in the past). A steadily improving process that is correct in the limit has no such time where the state changes from “have been learning” to “has been learned”.

When misused, the secondary features of “currently occurring”, “acquisition”, and “improvement” are often relied upon.  These bear some relation to learning, though only ancillarily.  For instance, one might assume that if someone is learning, they are steadily improving — though much of learning occurs through error.  Likewise, for something to be learned it must be internalized, so many uses focus on acquisition as learning.  These words and many others maintain their own well-defined lexicons that bear no similarity to “learning” beyond the incidental overlapping features. Otherwise, one might mistakenly use the secondary features in a primary context:

  1. The boy learned an apple. [acquired]
  2. The temperature learned. [improved]

The difference is that the process that results in production of the gold standard has been internalized. Failure to produce the gold standard, even in the state of improvement, is failure to have learned.


In the study of learning, “generalization” is the quintessential test.  It asks the question of the learner, “Which part of the ‘gold standard’ is the salient part?”. In so doing, it achieves the function of separating the two processes (replication or learning) that could have both produced the gold standard.

As a result, “generalization” requires the 100% correct application of the principal component into a seemingly new domain — in the absence of evidence.  The principal component is either correctly identified or not; associating a probability distribution of correctness is contrary to generalization’s primary definition.

Nonetheless, over time a process better thought of as the “average difference” (or average performance ratio) has come to be known as “generalization”. Simply stated, this “generalization” and its generalization error form a transfer function, representing the discrete distributional difference between the gold standard and actual production in a new domain.

This technical definition of generalization differs starkly from the widely accepted meaning of “generalization”: it highlights an accumulated difference, rather than identifying the process by which a concept is correctly applied to a new, seemingly different domain.


While traditionally associated with the process of science, “hypothesizing” is the formal process by which all learning [the formation of the singular causal link [cite: how humans learn]] takes place. Hypothesis has a functional definition that derives from its purpose: to expose a single variable and its true-or-false causality in a given process.

This means that an “Expectation about the outcome” (as in Probability of Hypothesis given data) is not sufficient to achieve the goals of a hypothesis.  There can be any number of expected, but irrelevant outcomes of an experiment, but a process is only a “hypothesis” if it leads to single variable exposure.  This also reveals that a hypothesis must be exclusive — only one may be applied to any variable-outcome combination.

Therefore, a distribution of hypotheses is contrary to definition — enumerating all possible outcomes along a distribution does not expose either the single variable or the single causal link.  Without exclusion, there is no hypothesis — a hypothesis distribution is merely a distribution of outcomes.

The result is that failure of a hypothesis yields true/false knowledge about a specific outcome — not an adjustment of distribution.  Relative to learning, hypothesis failure forces induction of the gold standard.

  1. The boy hypothesized that the water might fall. (incorrect).
  2. The boy hypothesized that the water would fall. (correct).

In the first statement, the outcome of the boy’s hypothesis is untestable; it can be neither confirmed nor denied.  But in the second, he may be proved incorrect, and may then ask the question: why?

Statistical Learning

Statistics, statistical algorithms, and thus statistical learning all derive from the law of averages, which is iteratively updated under frequency.

They are all used to identify and choose the most likely outcome, observation, or hypothesis.  But the decision to choose the “most likely” yields a reductive function, one that is necessarily under-representative of the data.

When we choose the most probable model, we are essentially choosing the optimal compressor function for the current data set.  But a compression function is not a model, and cannot reproduce the data set or the gold standard exactly; it merely “accounts” for the majority of observations. In terms of generalization, this means that, excepting the anomalous single-outcome, single-observation example, X < Y, and the generalization performance |X−Y|/Y will always be less than 1.

The generalization error, or the “average difference” is the transfer function of the compressor to the data. It is a measure of the matched “correctness” of the compressor’s distribution to the data’s.
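The “average difference” reading of generalization error can be sketched as follows.  This is an illustrative implementation only; the function name and the sample values are invented here, and the measure shown is the mean of |X−Y|/Y over paired gold-standard (Y) and produced (X) values, per the formula in the text.

```python
def average_difference(gold, produced):
    """Hypothetical 'average difference' measure: mean of |x - y| / y
    over paired gold-standard (y) and produced (x) values."""
    return sum(abs(x - y) / y for x, y in zip(produced, gold)) / len(gold)

gold   = [10.0, 20.0, 30.0]
exact  = [10.0, 20.0, 30.0]  # a learned process reproduces the gold standard
approx = [9.0, 22.0, 27.0]   # a compressor only "accounts" for the data

assert average_difference(gold, exact) == 0.0   # learned: zero difference
assert average_difference(gold, approx) > 0.0   # compressor: residual error
```

Only exact reproduction of the gold standard drives the measure to zero; a compressor’s output leaves a residual that this “generalization error” merely quantifies rather than eliminates.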

In terms of “learning” (improvement), excepting the single outcome/single observation anomaly, the conditioned probability of any outcome of an N > 1 experiment only approaches 1.0 in the limit.

Consider an experiment with 2 possible outcomes {A,B}. With 1 trial, turning out accidentally as {A}, P(A) = Σ{A} / Σ{A,B} = 1/1 = 1.0.  As we add more trials, all of which turn out as {B}, P(B) continually grows with each additional observation (o):

P(B) = 1/2, 2/3, 3/4, …, 99/100, 9999/10000, 999999/1000000

P(B) approaches 1.0 in the limit.
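The frequency-updating sequence above can be simulated directly.  A minimal sketch, assuming the scenario from the text (one accidental {A}, then {B} on every subsequent trial); the function name is invented for this sketch.

```python
def p_b(n_trials):
    """P(B) after one accidental A followed by (n_trials - 1) B
    observations: the simple relative frequency b_count / n_trials."""
    b_count = n_trials - 1
    return b_count / n_trials

assert p_b(1) == 0.0               # the single accidental {A}: P(B) = 0
assert p_b(2) == 1/2
assert p_b(100) == 99/100
assert p_b(10_000) == 9_999/10_000
# No finite number of trials ever yields 1.0:
assert all(p_b(n) < 1.0 for n in range(1, 1000))
```

However many confirming observations accumulate, the conditioned probability only approaches 1.0 asymptotically — which is the distinction the text draws between improvement in the limit and having learned.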


Any process or technique that is not applied with 100% correctness is not learned. Excepting N = 1, statistical learning can never result in 100% correctness, perhaps because there is no internalization of the process that leads to the gold standard; at best the algorithms improve asymptotically, never reaching 100% correctness.

Learning, Generalization, and Hypothesis (LG&H) all achieve “why discovery”; probabilisms and distributions (the superset of statistics) are achieved by assuming that “why” is unknowable [cite: fail of stats], and are thus incompatible with the definitions of LG&H.

Through “why discovery”, the human model for an event or process includes the ability to internalize and produce the gold standard. The best statistical prediction of outcome is achieved through an evaluation of the expected value across the distribution.  For this reason, the models produced by statistical learning are not applicably compatible with those for human learning, and cannot be internalized directly.

It is dangerous to interchange the usage classes identified by “statistics” and “learning”. The terms establish an implicit framework for how the techniques are to be used by humans and how conclusions are to be drawn. If users believe to any degree that actual learning is taking place when statistical techniques are applied, they will attempt to apply the produced (but incompatible) models directly into their own human map.

Finally, it is important that LG&H terms not be used interchangeably with statistical processes, and that they retain compatible technical definitions, because terms like “statistical learning” eventually come to affect how researchers believe actual learning takes place. Research in the field of real learning becomes paralyzed by the loss of its primary defining “why discovery” component.