summaryrefslogtreecommitdiffstats
path: root/Doc
diff options
context:
space:
mode:
authorRaymond Hettinger <rhettinger@users.noreply.github.com>2019-09-05 07:18:47 (GMT)
committerGitHub <noreply@github.com>2019-09-05 07:18:47 (GMT)
commite4810b2a6c1d0db1a27ad046831b8fa3b57967a4 (patch)
tree18c2c6412afb287e8e60f44bfd11ca46ff388638 /Doc
parent8f9cc8771ffb8d0e21be287eaed42ae06087acca (diff)
downloadcpython-e4810b2a6c1d0db1a27ad046831b8fa3b57967a4.zip
cpython-e4810b2a6c1d0db1a27ad046831b8fa3b57967a4.tar.gz
cpython-e4810b2a6c1d0db1a27ad046831b8fa3b57967a4.tar.bz2
bpo-36324: Apply review comments from Allen Downey (GH-15693)
Diffstat (limited to 'Doc')
-rw-r--r--Doc/library/statistics.rst129
1 files changed, 65 insertions, 64 deletions
diff --git a/Doc/library/statistics.rst b/Doc/library/statistics.rst
index 04b731d..62d856b 100644
--- a/Doc/library/statistics.rst
+++ b/Doc/library/statistics.rst
@@ -26,10 +26,10 @@ numeric (:class:`Real`-valued) data.
Unless explicitly noted otherwise, these functions support :class:`int`,
:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
- currently unsupported. Mixed types are also undefined and
- implementation-dependent. If your input data consists of mixed types,
- you may be able to use :func:`map` to ensure a consistent result, e.g.
- ``map(float, input_data)``.
+ currently unsupported. Collections with a mix of types are also undefined
+ and implementation-dependent. If your input data consists of mixed types,
+ you may be able to use :func:`map` to ensure a consistent result, for
+ example: ``map(float, input_data)``.
Averages and measures of central location
-----------------------------------------
@@ -102,11 +102,9 @@ However, for reading convenience, most of the examples show sorted sequences.
.. note::
The mean is strongly affected by outliers and is not a robust estimator
- for central location: the mean is not necessarily a typical example of the
- data points. For more robust, although less efficient, measures of
- central location, see :func:`median` and :func:`mode`. (In this case,
- "efficient" refers to statistical efficiency rather than computational
- efficiency.)
+ for central location: the mean is not necessarily a typical example of
+ the data points. For more robust measures of central location, see
+ :func:`median` and :func:`mode`.
The sample mean gives an unbiased estimate of the true population mean,
which means that, taken on average over all the possible samples,
@@ -120,9 +118,8 @@ However, for reading convenience, most of the examples show sorted sequences.
Convert *data* to floats and compute the arithmetic mean.
This runs faster than the :func:`mean` function and it always returns a
- :class:`float`. The result is highly accurate but not as perfect as
- :func:`mean`. If the input dataset is empty, raises a
- :exc:`StatisticsError`.
+ :class:`float`. The *data* may be a sequence or iterator. If the input
+ dataset is empty, raises a :exc:`StatisticsError`.
.. doctest::
@@ -136,15 +133,20 @@ However, for reading convenience, most of the examples show sorted sequences.
Convert *data* to floats and compute the geometric mean.
+ The geometric mean indicates the central tendency or typical value of the
+ *data* using the product of the values (as opposed to the arithmetic mean
+ which uses their sum).
+
Raises a :exc:`StatisticsError` if the input dataset is empty,
if it contains a zero, or if it contains a negative value.
+ The *data* may be a sequence or iterator.
No special efforts are made to achieve exact results.
(However, this may change in the future.)
.. doctest::
- >>> round(geometric_mean([54, 24, 36]), 9)
+ >>> round(geometric_mean([54, 24, 36]), 1)
36.0
.. versionadded:: 3.8
@@ -174,7 +176,7 @@ However, for reading convenience, most of the examples show sorted sequences.
3.6
Using the arithmetic mean would give an average of about 5.167, which
- is too high.
+ is well over the aggregate P/E ratio.
:exc:`StatisticsError` is raised if *data* is empty, or any element
is less than zero.
@@ -312,10 +314,10 @@ However, for reading convenience, most of the examples show sorted sequences.
The mode (when it exists) is the most typical value and serves as a
measure of central location.
- If there are multiple modes, returns the first one encountered in the *data*.
- If the smallest or largest of multiple modes is desired instead, use
- ``min(multimode(data))`` or ``max(multimode(data))``. If the input *data* is
- empty, :exc:`StatisticsError` is raised.
+ If there are multiple modes with the same frequency, returns the first one
+ encountered in the *data*. If the smallest or largest of those is
+ desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
+ If the input *data* is empty, :exc:`StatisticsError` is raised.
``mode`` assumes discrete data, and returns a single value. This is the
standard treatment of the mode as commonly taught in schools:
@@ -325,8 +327,8 @@ However, for reading convenience, most of the examples show sorted sequences.
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3
- The mode is unique in that it is the only statistic which also applies
- to nominal (non-numeric) data:
+ The mode is unique in that it is the only statistic in this package that
+ also applies to nominal (non-numeric) data:
.. doctest::
@@ -368,15 +370,16 @@ However, for reading convenience, most of the examples show sorted sequences.
.. function:: pvariance(data, mu=None)
- Return the population variance of *data*, a non-empty iterable of real-valued
- numbers. Variance, or second moment about the mean, is a measure of the
- variability (spread or dispersion) of data. A large variance indicates that
- the data is spread out; a small variance indicates it is clustered closely
- around the mean.
+ Return the population variance of *data*, a non-empty sequence or iterator
+ of real-valued numbers. Variance, or second moment about the mean, is a
+ measure of the variability (spread or dispersion) of data. A large
+ variance indicates that the data is spread out; a small variance indicates
+ it is clustered closely around the mean.
- If the optional second argument *mu* is given, it should be the mean of
- *data*. If it is missing or ``None`` (the default), the mean is
- automatically calculated.
+ If the optional second argument *mu* is given, it is typically the mean of
+ the *data*. It can also be used to compute the second moment around a
+ point that is not the mean. If it is missing or ``None`` (the default),
+ the arithmetic mean is automatically calculated.
Use this function to calculate the variance from the entire population. To
estimate the variance from a sample, the :func:`variance` function is usually
@@ -401,10 +404,6 @@ However, for reading convenience, most of the examples show sorted sequences.
>>> pvariance(data, mu)
1.25
- This function does not attempt to verify that you have passed the actual mean
- as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
- results.
-
Decimals and Fractions are supported:
.. doctest::
@@ -423,11 +422,11 @@ However, for reading convenience, most of the examples show sorted sequences.
σ². When called on a sample instead, this is the biased sample variance
s², also known as variance with N degrees of freedom.
- If you somehow know the true population mean μ, you may use this function
- to calculate the variance of a sample, giving the known population mean as
- the second argument. Provided the data points are representative
- (e.g. independent and identically distributed), the result will be an
- unbiased estimate of the population variance.
+ If you somehow know the true population mean μ, you may use this
+ function to calculate the variance of a sample, giving the known
+ population mean as the second argument. Provided the data points are a
+ random sample of the population, the result will be an unbiased estimate
+ of the population variance.
.. function:: stdev(data, xbar=None)
@@ -502,19 +501,19 @@ However, for reading convenience, most of the examples show sorted sequences.
:func:`pvariance` function as the *mu* parameter to get the variance of a
sample.
-.. function:: quantiles(dist, *, n=4, method='exclusive')
+.. function:: quantiles(data, *, n=4, method='exclusive')
- Divide *dist* into *n* continuous intervals with equal probability.
+ Divide *data* into *n* continuous intervals with equal probability.
Returns a list of ``n - 1`` cut points separating the intervals.
Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
*n* to 100 for percentiles which gives the 99 cuts points that separate
- *dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
+ *data* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
is not least 1.
- The *dist* can be any iterable containing sample data or it can be an
+ The *data* can be any iterable containing sample data or it can be an
instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
- results, the number of data points in *dist* should be larger than *n*.
+ results, the number of data points in *data* should be larger than *n*.
Raises :exc:`StatisticsError` if there are not at least two data points.
For sample data, the cut points are linearly interpolated from the
@@ -523,7 +522,7 @@ However, for reading convenience, most of the examples show sorted sequences.
cut-point will evaluate to ``104``.
The *method* for computing quantiles can be varied depending on
- whether the data in *dist* includes or excludes the lowest and
+ whether the data in *data* includes or excludes the lowest and
highest possible values from the population.
The default *method* is "exclusive" and is used for data sampled from
@@ -535,14 +534,14 @@ However, for reading convenience, most of the examples show sorted sequences.
Setting the *method* to "inclusive" is used for describing population
data or for samples that are known to include the most extreme values
- from the population. The minimum value in *dist* is treated as the 0th
+ from the population. The minimum value in *data* is treated as the 0th
percentile and the maximum value is treated as the 100th percentile.
The portion of the population falling below the *i-th* of *m* sorted
data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
values, the method sorts them and assigns the following percentiles:
0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
- If *dist* is an instance of a class that defines an
+ If *data* is an instance of a class that defines an
:meth:`~inv_cdf` method, setting *method* has no effect.
.. doctest::
@@ -580,7 +579,7 @@ A single exception is defined:
:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
-composite class that treats the mean and standard deviation of data
+class that treats the mean and standard deviation of data
measurements as a single entity.
Normal distributions arise from the `Central Limit Theorem
@@ -616,13 +615,14 @@ of applications in statistics.
.. classmethod:: NormalDist.from_samples(data)
- Makes a normal distribution instance computed from sample data. The
- *data* can be any :term:`iterable` and should consist of values that
- can be converted to type :class:`float`.
+ Makes a normal distribution instance with *mu* and *sigma* parameters
+ estimated from the *data* using :func:`fmean` and :func:`stdev`.
- If *data* does not contain at least two elements, raises
- :exc:`StatisticsError` because it takes at least one point to estimate
- a central value and at least two points to estimate dispersion.
+ The *data* can be any :term:`iterable` and should consist of values
+ that can be converted to type :class:`float`. If *data* does not
+ contain at least two elements, raises :exc:`StatisticsError` because it
+ takes at least one point to estimate a central value and at least two
+ points to estimate dispersion.
.. method:: NormalDist.samples(n, *, seed=None)
@@ -636,10 +636,10 @@ of applications in statistics.
.. method:: NormalDist.pdf(x)
Using a `probability density function (pdf)
- <https://en.wikipedia.org/wiki/Probability_density_function>`_,
- compute the relative likelihood that a random variable *X* will be near
- the given value *x*. Mathematically, it is the ratio ``P(x <= X <
- x+dx) / dx``.
+ <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
+ the relative likelihood that a random variable *X* will be near the
+ given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
+ X < x+dx) / dx`` as *dx* approaches zero.
The relative likelihood is computed as the probability of a sample
occurring in a narrow range divided by the width of the range (hence
@@ -667,8 +667,10 @@ of applications in statistics.
.. method:: NormalDist.overlap(other)
- Returns a value between 0.0 and 1.0 giving the overlapping area for
- the two probability density functions.
+ Measures the agreement between two normal probability distributions.
+ Returns a value between 0.0 and 1.0 giving `the overlapping area for
+ the two probability density functions
+ <https://www.rasch.org/rmt/rmt101r.htm>`_.
Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
@@ -740,12 +742,11 @@ Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
... return (3*x + 7*x*y - 5*y) / (11 * z)
...
>>> n = 100_000
- >>> seed = 86753099035768
- >>> X = NormalDist(10, 2.5).samples(n, seed=seed)
- >>> Y = NormalDist(15, 1.75).samples(n, seed=seed)
- >>> Z = NormalDist(50, 1.25).samples(n, seed=seed)
- >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
- NormalDist(mu=1.8661894803304777, sigma=0.65238717376862)
+ >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
+ >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
+ >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
+ >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
+ [1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Normal distributions commonly arise in machine learning problems.