Statistics is the branch of mathematics that deals with data. As a fluid dynamics professor of mine put it years ago: when reality is too complicated and we can't track every individual event, we resort to statistics. In theory we should be able to model the motion of every molecule in the ocean and atmosphere from the principles of energy and momentum, but there are far too many molecules for any computer to handle. Whenever we can't assess the full population of events, we take samples and apply statistics.
The science falls into two categories: descriptive statistics, which summarizes data, and inferential statistics, which draws conclusions from it. The first requirement for statistical methods to have value is that the data sample be as random as possible. Bias is a concept endemic to science: any influence on a data set that comes from the sampling and processing methods rather than from the phenomenon itself. It can appear at any stage of gathering information, from designing a poll and wording its questions, to the design and setup of instruments for use in the field, to the processing of the results. Human error is very much a potential cause. Bias cannot be eliminated, but it can be reduced: as far as possible, the methods of gathering and processing the data must not affect the data itself.
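As a small sketch of the random-sampling idea, here is a hypothetical example using Python's standard library: the population, its size, and the measurement values are all invented for illustration, but the point is that a simple random sample gives every member an equal chance of selection, and its average tends to land close to the population's.

```python
import random

# Hypothetical population: 10,000 simulated measurements
# (values and parameters chosen purely for illustration).
random.seed(42)
population = [random.gauss(20.0, 3.0) for _ in range(10_000)]

# random.sample draws without replacement; every member of the
# population is equally likely to be chosen, reducing sampling bias.
sample = random.sample(population, 100)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a sample of 100 from this population, the sample mean typically falls within a few tenths of the population mean, which is the descriptive half of the science doing its job.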
Related to bias is the control of confounding variables, narrowing the quantities under observation down to the one actually being studied. When conducting a random poll, selecting the test population is paramount, so the data is not skewed toward one demographic group. When testing a drug's efficacy, a control group that did not take the drug is critical for properly evaluating the drug against no treatment at all. (And the "double blind" design, in which neither the patients nor the treating doctors know who received the drug and who received the placebo, prevents the doctors' quality of care, adjusted in light of that knowledge, from becoming a confounding variable itself.) Bias also exists in physical measurements taken in the field, such as temperature, depth or humidity. Instruments are carefully calibrated to reduce their inaccuracy as far as possible.
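The random assignment behind a controlled trial can be sketched in a few lines. This is an illustrative toy, not a real trial protocol: the subject names and group sizes are made up, and the point is only that shuffling before splitting balances unknown confounders (age, severity, and so on) between the groups on average.

```python
import random

# Hypothetical roster of 200 trial subjects.
random.seed(7)
subjects = [f"subject_{i:03d}" for i in range(200)]

# Shuffle a copy of the roster, then split it in half.
# Randomizing the assignment spreads confounding factors
# evenly across both groups on average.
shuffled = subjects[:]
random.shuffle(shuffled)
treatment = shuffled[:100]   # receives the drug
control = shuffled[100:]     # receives the placebo

# In a double-blind design, this assignment table would be held by
# a third party; neither patients nor doctors would see it.
print(len(treatment), len(control))
```

Note that the assignment is decided by the shuffle alone, never by anyone who knows the subjects, which is exactly the confounder the double-blind design exists to remove.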
The central limit theorem states that when we take many sufficiently large samples, their averages tend to coalesce around the true population mean, with a symmetrical spread to both greater and lesser values, and this holds even when the underlying data itself is not symmetrical. The normal distribution, or "bell curve", is the mathematical model for that spread: an estimate of how frequently the average value, and the values around it, will occur. When data conforms to it, the normal distribution can serve as the basis for a number of statistical tests.
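The theorem is easy to see in a simulation. The sketch below (sample size, number of samples, and the uniform source are all arbitrary choices for illustration) draws many samples from a flat, decidedly non-bell-shaped distribution, yet the sample means still cluster symmetrically around the true mean of 0.5.

```python
import random
import statistics

# Draw many samples from a non-normal source: the uniform
# distribution on [0, 1], whose true mean is 0.5.
random.seed(0)

def sample_mean(n):
    """Average of one sample of n uniform draws."""
    return sum(random.random() for _ in range(n)) / n

# 2,000 samples of size 30 each.
means = [sample_mean(30) for _ in range(2_000)]

# The means coalesce around 0.5 with a symmetrical spread;
# theory predicts a standard deviation near sqrt(1/12) / sqrt(30).
print(f"mean of sample means: {statistics.mean(means):.3f}")
print(f"spread (stdev):       {statistics.stdev(means):.3f}")
```

Plotting a histogram of `means` would show the bell shape directly, even though no single draw from the uniform source looks anything like a bell.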
Tomorrow: the bell curve.