Failures of Statistics
“There are three kinds of lies: lies, damned lies, and statistics.”
The assumptions underlying statistical processes, and statistically based learning algorithms in particular, inherently prevent the induction of correct causal models that explain the physical production of data in any given experiment. Nonetheless, these processes are used for this purpose, and as a result they explain data in a manner that is both limited and incorrect. Researchers tend to be unaware of the difference, and draw incorrect conclusions unknowingly.
The problems with statistically based learning algorithms, which are more correctly labelled “frequency updating”, are more easily understood if one assumes that there is a problem, rather than that they are correct. Assuming the algorithms are correct means that one also inherently subscribes to the idea that the terminology is self-consistent — that the mathematical representation of terminology such as “learning”, “generalization”, etc. carries a meaning identical to that of its informal, but more completely understood, verbal counterpart.
In [ statistical learning is not learning ], it was shown that this is not the case — the strictly (mathematically) defined terminology not only lacks the salient and defining components of the generally understood word from which it derives, but at times opposes it, rendering the “understanding” of particularly named techniques incorrect.
Additionally, the pair of “human+statistical algorithm” has provided a complete modelling and deductive mechanism for observations in smaller experiments, but larger experiments exceed human observational and modelling capacity, leading to a reduced “statistics only” analysis construct — conclusions are drawn only from statistical reductions, not from the observations themselves.
Many problems here are highly interrelated.
All statistical models inherit the same problems, as they are all “flavors” of probabilism, suffering from all of its nefarious underlying assumptions — assumptions with some very undesirable and sometimes poorly understood consequences, which generally stem from n → ∞ and an inherent lack of discrimination.
The following points illustrate how probabilism is ultimately an observation-centric analysis tool, blind to the events and relations that are responsible for variegating the surfacing observations.
Consider the coin example. When we flip a coin, we say that the probability of Heads or Tails is 50%. That is not true. If we know the starting side, air resistance, trajectory angle, and rotational speed, the p of heads or tails is no longer 50%. It is only 50% in the absence of any and all other information. Any experiment run and analyzed with p flattens the results to one non-causal dimension, assuming that any external causal relationships are unknowable.
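To make the point concrete, here is a toy deterministic coin in Python. The half-rotation rule and the parameter values are illustrative assumptions, not a physics model:

```python
def coin_outcome(start_side, spin_rate_hz, flight_time_s):
    # Toy rule: the landing face is fixed by the number of half-rotations
    # completed in flight (air resistance and trajectory are ignored here).
    half_turns = int(2 * spin_rate_hz * flight_time_s)
    if half_turns % 2 == 0:
        return start_side
    return "T" if start_side == "H" else "H"

# Once the physical parameters are known, the outcome is determined, not 50/50:
print(coin_outcome("H", spin_rate_hz=20.0, flight_time_s=0.5))  # H (20 half-turns)
print(coin_outcome("H", spin_rate_hz=21.0, flight_time_s=0.5))  # T (21 half-turns)
```

p = 0.5 is a statement about our ignorance of these parameters, not about the coin.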
In assuming that there is an absence of knowledge about causal factors, we necessarily assume that all possible causal factors, and thus all resulting data points, must be treated equally. With a growing number of observations, each data point becomes increasingly indistinguishable from all others (cf. the law of large numbers) — meaning that each additional observation is of diminishing value. An exceptive observation, which under a normal (human) analysis yields a change in model, becomes a trivial observation, weighted and normalized by all the observations that came before and after. The repetitive observations become overweighted and drive the analysis.
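A minimal sketch of this dilution, with hypothetical numbers: the same anomalous observation moves the running mean less and less as the count of ordinary observations grows.

```python
def mean_shift(n_ordinary, ordinary=0.5, anomaly=1000.0):
    # How far one anomalous observation moves the mean after n_ordinary
    # ordinary observations: its weight shrinks roughly as 1/n.
    before = ordinary
    after = (n_ordinary * ordinary + anomaly) / (n_ordinary + 1)
    return after - before

for n in (10, 100, 1000, 10000):
    print(n, round(mean_shift(n), 3))
# The shift is (anomaly - ordinary) / (n + 1): the exception is normalized away.
```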
There is in fact no reason to believe that observations should be treated equally — and many reasons for them not to be. (Equal weighting throws away information content.)
Consider a coin that lands on its edge. Yes, this is an exception, but it represents exactly the kind of different and noisy outcome that becomes indistinguishable if the human does not already know to look for it.
p is frequency accumulative — neither subtractive nor discriminative. It is impossible to represent anything but direct, non-inhibitive relationships with it. Any analysis of an experiment that contains either indirect or inhibitory relations leads to weightings that are necessarily wrong.
p is lossy — much of the information is thrown away to bet on the majority, frequently occurring events. The inherent redundancy assumption is not the same as “reproducible”. Reproducible means that the same event will cause the same observation to surface. Redundancy, however, means that the same observation is seen, without necessarily the same event. Reproducible means that the minimum number of trials is necessary to expose the causal law, whereas redundant trials are never enough, only approaching what one hopes is the rule in the limit.
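The distinction can be sketched in code (the ball and the fair-coin estimator below are invented stand-ins, not anything from a real experiment):

```python
import random

# Reproducible: the same event yields the same observation, so the minimum
# number of trials exposes the law.
def drop_ball():
    return "falls"

assert drop_ball() == drop_ball()  # one repetition already confirms the rule

# Redundant: the same observation recurs without necessarily the same event;
# the frequency estimate only approaches the hoped-for rule in the limit.
def estimate_p(n, rng):
    return sum(rng.random() < 0.5 for _ in range(n)) / n

print(estimate_p(100, random.Random(0)))     # a rough estimate
print(estimate_p(100000, random.Random(0)))  # closer, but never exact
```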
[ It is better suited to compression and transmission than to causal analysis. ]
Deduction is implicitly discarded. It is impossible to recover models which have been implicitly discarded in the first step of analysis.
All frequency-based analyses are inherently reordering operations — the linear order of an experiment’s data set is transformed into a set of unordered frequency counts. Because the experiment is essentially a black box, the sequential set of observations indicates the state that the black box is in before it changes under a new causal event. The black box may have no internal model, being merely an output connected to a random-variable generator, but in practice it most likely is not. Everything in the world has an internal state, and events change that state — it is a terrible assumption that the black box/experiment operates any differently.
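A two-line sketch of the reordering loss: two invented runs whose sequences differ (say, the box changes state halfway through the second run) reduce to identical frequency counts.

```python
from collections import Counter

run_a = ["H", "T", "H", "T", "H", "T", "H", "T"]  # alternating: strongly ordered
run_b = ["H", "H", "H", "H", "T", "T", "T", "T"]  # a state change after trial 4

# The frequency reduction cannot tell the two experiments apart:
print(Counter(run_a) == Counter(run_b))  # True: the sequence (state) information is gone
```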
This is a second-degree manifestation of the P(X) problem, which introduces its own problems. People assume P(X) is the correct way to represent their data, and as a result make the following assumptions:
First, that observations are independent. Unfortunately, with real data, the state of the world has necessarily changed from observation to observation. The only experiment where the independence assumption is true is the degenerate form of the single-variable, single-outcome, direct, fully transparent test subject (i.e. where the experiment is unnecessary). In all other experiments, where the subject (in NLP, the corpus of human text) is a “black box”, the state of the world is unknown, necessarily multivariate, and changing. Any analysis making the independence assumption is simply incorrect.
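A sketch of how badly independence fails for a state-holding black box. The numbers are invented: a two-state process that repeats its last output 90% of the time has a marginal P(X=1) near 0.5, but the conditional frequency after a 1 is near 0.9.

```python
import random

def run_box(n, stay=0.9, rng=None):
    # A black box whose internal state persists between observations.
    rng = rng or random.Random(42)
    out, state = [], 0
    for _ in range(n):
        if rng.random() >= stay:
            state = 1 - state
        out.append(state)
    return out

obs = run_box(100000)
p1 = sum(obs) / len(obs)
p1_after_1 = sum(b for a, b in zip(obs, obs[1:]) if a == 1) / sum(obs[:-1])
print(round(p1, 2), round(p1_after_1, 2))  # marginal ~0.5, conditional ~0.9
```

Counting P(X=1) alone silently averages over the hidden state; the sequence, which carries the state, is exactly what the frequency view discards.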
Second, that the model they choose (or settle on) is fully enumerable through the data (i.e. P(H|Dn)). This is simply not true. All the clustering, SVM, etc. machinery is designed to overcome noise, which is not just an external uncontrollable event, but a set of unknown ones which have been assumed to be unmodellable.
Example: Consider a simple experiment with gravity. On three trials, you let go of a ball and it falls to the ground. On the fourth trial, you release the ball and it stays where it is. Do you conclude that gravity only operates 75% of the time? That’s silly. We know that something has interfered.
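In code, with the four trials encoded as invented 0/1 outcomes, the two readings of the same data look like this:

```python
# 1 = the ball fell, 0 = the ball stayed put (something interfered).
trials = [1, 1, 1, 0]

# The frequency reading: "gravity operates 75% of the time".
p_falls = sum(trials) / len(trials)
print(p_falls)  # 0.75

# The causal reading: the exception is evidence of an unmodelled cause,
# to be explained rather than averaged in.
exceptions = [i for i, outcome in enumerate(trials) if outcome == 0]
print(exceptions)  # [3]: trial 3 demands a new model, not a reweighting
```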
Probabilistic models can be arbitrarily flattened to a distribution of pure conditional probabilities — meaning that such a model is not in fact a model at all, but a simplistic overlay.
As with many things, the choice of words used to label and describe has far-reaching consequences. P(M|D) (and the resulting “causal”, not really causal, network) as achieved from frequency accumulations will necessarily be incorrect, as it inherits the inability to form non-direct, non-inhibiting relationships.
Bayes does not yield a true probability: it is not a localized event distribution, it’s a sample distribution. Meaning: technical “generalization” is really the impedance match between the sample’s distribution and a new sample’s distribution.
[ leave out asides of learning not really being learning]
N’s unknowability (Black Box)
Every experiment may be considered as adjusting inputs to a black box and recording the output. “All things equal” is essentially what causes the observer to (mistakenly) ignore the *internal state* of the black box. In a multivariate black box, the internal state can be represented (minimally) as a state machine with *at least* 2^n possible states in the discrete case (n = the number of internal variables), and f^(n-1) combined possible mathematical transforms in the nondiscrete case (f being a possible function, chain-ruled). To observe a state-dependent pattern from the top down requires *at least* 2·2^n observations (the Nyquist doubling). This also means that even if the relationship the researcher is looking for is 1-1 (X–Y), direct input to output, there are 2^n − 1 other possible states that the black box could be in to generate Y. A correlation between X and Y does not mean that the black box actually contains that X→Y model, just that a majority of states result in a Y output — that might seem the same, but it isn’t. The “Y” is a different Y: a Y produced by a black box in some state s. When you accumulate, you throw away all of the state information contained in the linear experiment, meaning that a new experiment is not comparable, and a new input variable does not mean anything, because the states are different and unknown. The new variable Z could put our box into the same Y-producing state, and it might not. That’s the discrete case — in the nondiscrete case, there are f^n possible outputs, and the correlation of one direct observation is insignificant in the sum of all other observed states: 1 out of f^n.
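A small discrete illustration with n = 3 invented hidden bits (8 states): the box below never consults the input X at all, yet with X held constant it still emits Y in half of its states, so an observed X–Y co-occurrence says nothing about an internal X→Y model.

```python
import itertools

def black_box(x, hidden_state):
    # The output depends only on the hidden state (majority of bits set);
    # the input x is ignored entirely.
    return "Y" if sum(hidden_state) >= 2 else "N"

observations = []
for state in itertools.product([0, 1], repeat=3):  # all 2**3 = 8 states
    observations.append((1, black_box(1, state)))  # researcher always applies X

y_rate = sum(1 for _, out in observations if out == "Y") / len(observations)
print(y_rate)  # 0.5: half the hidden states emit Y with the input held fixed
```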
Coming back to the top-down Nyquist requirement: the frequency accumulations simply throw away enough observations that model discrimination via Nyquist is impossible.
Compared to signal-frequency (e.g. RF) transforms (which is really what all probabilism is): within a particular sampling window, thinning the samples by 2 cuts the maximum observable frequency in half. Thinning indiscriminately results in unpredictable loss. Reordering results in state loss AND signal loss. It’s the same with probability accumulation — which similarly results in state and model loss.
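The halving can be checked exactly. After thinning to a 4 Hz sampling rate (Nyquist limit 2 Hz), the samples of a 3 Hz component coincide with those of a sign-flipped 1 Hz component, so the two models can no longer be discriminated:

```python
import math

fs = 4.0  # sampling rate after thinning; Nyquist limit is fs / 2 = 2 Hz
for k in range(16):
    s3 = math.sin(2 * math.pi * 3 * k / fs)   # samples of a 3 Hz signal
    s1 = -math.sin(2 * math.pi * 1 * k / fs)  # samples of a sign-flipped 1 Hz signal
    assert math.isclose(s3, s1, abs_tol=1e-9)  # identical at every sample point
print("3 Hz aliases onto 1 Hz at fs = 4 Hz")
```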
The real kicker is that n is unknown (leading to 2^n possible states in our black box), meaning that it is impossible to do a top-down analysis of observations. The only remaining method is an incremental observation-by-observation comparison, where n is known to be at most 1.
Thus, guessing at n (which is what we do) cannot succeed (from all the prior points). And n is more likely to grow as observations grow: by the law of large numbers, an external event becomes increasingly likely, and an unknown external event effectively doubles the number of observations required.
n is always at least two degrees away from knowability. Besides being unknown itself: if the sample size is too small (which is not known), then n is unknowable; if the sample size is too large (also not known), then n is changing (growing, by the law of large numbers), which also makes it unknowable. No clustering or local optimum can overcome that, since it is potentially different for every experiment.