author | Raymond Hettinger <rhettinger@users.noreply.github.com> | 2019-09-05 07:18:47 (GMT) |
---|---|---|
committer | GitHub <noreply@github.com> | 2019-09-05 07:18:47 (GMT) |
commit | e4810b2a6c1d0db1a27ad046831b8fa3b57967a4 (patch) | |
tree | 18c2c6412afb287e8e60f44bfd11ca46ff388638 /Doc | |
parent | 8f9cc8771ffb8d0e21be287eaed42ae06087acca (diff) | |
bpo-36324: Apply review comments from Allen Downey (GH-15693)
Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/library/statistics.rst | 129 |
1 files changed, 65 insertions, 64 deletions
diff --git a/Doc/library/statistics.rst b/Doc/library/statistics.rst index 04b731d..62d856b 100644 --- a/Doc/library/statistics.rst +++ b/Doc/library/statistics.rst @@ -26,10 +26,10 @@ numeric (:class:`Real`-valued) data. Unless explicitly noted otherwise, these functions support :class:`int`, :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`. Behaviour with other types (whether in the numeric tower or not) is - currently unsupported. Mixed types are also undefined and - implementation-dependent. If your input data consists of mixed types, - you may be able to use :func:`map` to ensure a consistent result, e.g. - ``map(float, input_data)``. + currently unsupported. Collections with a mix of types are also undefined + and implementation-dependent. If your input data consists of mixed types, + you may be able to use :func:`map` to ensure a consistent result, for + example: ``map(float, input_data)``. Averages and measures of central location ----------------------------------------- @@ -102,11 +102,9 @@ However, for reading convenience, most of the examples show sorted sequences. .. note:: The mean is strongly affected by outliers and is not a robust estimator - for central location: the mean is not necessarily a typical example of the - data points. For more robust, although less efficient, measures of - central location, see :func:`median` and :func:`mode`. (In this case, - "efficient" refers to statistical efficiency rather than computational - efficiency.) + for central location: the mean is not necessarily a typical example of + the data points. For more robust measures of central location, see + :func:`median` and :func:`mode`. The sample mean gives an unbiased estimate of the true population mean, which means that, taken on average over all the possible samples, @@ -120,9 +118,8 @@ However, for reading convenience, most of the examples show sorted sequences. Convert *data* to floats and compute the arithmetic mean. 
This runs faster than the :func:`mean` function and it always returns a - :class:`float`. The result is highly accurate but not as perfect as - :func:`mean`. If the input dataset is empty, raises a - :exc:`StatisticsError`. + :class:`float`. The *data* may be a sequence or iterator. If the input + dataset is empty, raises a :exc:`StatisticsError`. .. doctest:: @@ -136,15 +133,20 @@ However, for reading convenience, most of the examples show sorted sequences. Convert *data* to floats and compute the geometric mean. + The geometric mean indicates the central tendency or typical value of the + *data* using the product of the values (as opposed to the arithmetic mean + which uses their sum). + Raises a :exc:`StatisticsError` if the input dataset is empty, if it contains a zero, or if it contains a negative value. + The *data* may be a sequence or iterator. No special efforts are made to achieve exact results. (However, this may change in the future.) .. doctest:: - >>> round(geometric_mean([54, 24, 36]), 9) + >>> round(geometric_mean([54, 24, 36]), 1) 36.0 .. versionadded:: 3.8 @@ -174,7 +176,7 @@ However, for reading convenience, most of the examples show sorted sequences. 3.6 Using the arithmetic mean would give an average of about 5.167, which - is too high. + is well over the aggregate P/E ratio. :exc:`StatisticsError` is raised if *data* is empty, or any element is less than zero. @@ -312,10 +314,10 @@ However, for reading convenience, most of the examples show sorted sequences. The mode (when it exists) is the most typical value and serves as a measure of central location. - If there are multiple modes, returns the first one encountered in the *data*. - If the smallest or largest of multiple modes is desired instead, use - ``min(multimode(data))`` or ``max(multimode(data))``. If the input *data* is - empty, :exc:`StatisticsError` is raised. + If there are multiple modes with the same frequency, returns the first one + encountered in the *data*. 
If the smallest or largest of those is + desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``. + If the input *data* is empty, :exc:`StatisticsError` is raised. ``mode`` assumes discrete data, and returns a single value. This is the standard treatment of the mode as commonly taught in schools: @@ -325,8 +327,8 @@ However, for reading convenience, most of the examples show sorted sequences. >>> mode([1, 1, 2, 3, 3, 3, 3, 4]) 3 - The mode is unique in that it is the only statistic which also applies - to nominal (non-numeric) data: + The mode is unique in that it is the only statistic in this package that + also applies to nominal (non-numeric) data: .. doctest:: @@ -368,15 +370,16 @@ However, for reading convenience, most of the examples show sorted sequences. .. function:: pvariance(data, mu=None) - Return the population variance of *data*, a non-empty iterable of real-valued - numbers. Variance, or second moment about the mean, is a measure of the - variability (spread or dispersion) of data. A large variance indicates that - the data is spread out; a small variance indicates it is clustered closely - around the mean. + Return the population variance of *data*, a non-empty sequence or iterator + of real-valued numbers. Variance, or second moment about the mean, is a + measure of the variability (spread or dispersion) of data. A large + variance indicates that the data is spread out; a small variance indicates + it is clustered closely around the mean. - If the optional second argument *mu* is given, it should be the mean of - *data*. If it is missing or ``None`` (the default), the mean is - automatically calculated. + If the optional second argument *mu* is given, it is typically the mean of + the *data*. It can also be used to compute the second moment around a + point that is not the mean. If it is missing or ``None`` (the default), + the arithmetic mean is automatically calculated. 
Use this function to calculate the variance from the entire population. To estimate the variance from a sample, the :func:`variance` function is usually @@ -401,10 +404,6 @@ However, for reading convenience, most of the examples show sorted sequences. >>> pvariance(data, mu) 1.25 - This function does not attempt to verify that you have passed the actual mean - as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible - results. - Decimals and Fractions are supported: .. doctest:: @@ -423,11 +422,11 @@ However, for reading convenience, most of the examples show sorted sequences. σ². When called on a sample instead, this is the biased sample variance s², also known as variance with N degrees of freedom. - If you somehow know the true population mean μ, you may use this function - to calculate the variance of a sample, giving the known population mean as - the second argument. Provided the data points are representative - (e.g. independent and identically distributed), the result will be an - unbiased estimate of the population variance. + If you somehow know the true population mean μ, you may use this + function to calculate the variance of a sample, giving the known + population mean as the second argument. Provided the data points are a + random sample of the population, the result will be an unbiased estimate + of the population variance. .. function:: stdev(data, xbar=None) @@ -502,19 +501,19 @@ However, for reading convenience, most of the examples show sorted sequences. :func:`pvariance` function as the *mu* parameter to get the variance of a sample. -.. function:: quantiles(dist, *, n=4, method='exclusive') +.. function:: quantiles(data, *, n=4, method='exclusive') - Divide *dist* into *n* continuous intervals with equal probability. + Divide *data* into *n* continuous intervals with equal probability. Returns a list of ``n - 1`` cut points separating the intervals. Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. 
Set *n* to 100 for percentiles which gives the 99 cuts points that separate - *dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n* + *data* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n* is not least 1. - The *dist* can be any iterable containing sample data or it can be an + The *data* can be any iterable containing sample data or it can be an instance of a class that defines an :meth:`~inv_cdf` method. For meaningful - results, the number of data points in *dist* should be larger than *n*. + results, the number of data points in *data* should be larger than *n*. Raises :exc:`StatisticsError` if there are not at least two data points. For sample data, the cut points are linearly interpolated from the @@ -523,7 +522,7 @@ However, for reading convenience, most of the examples show sorted sequences. cut-point will evaluate to ``104``. The *method* for computing quantiles can be varied depending on - whether the data in *dist* includes or excludes the lowest and + whether the data in *data* includes or excludes the lowest and highest possible values from the population. The default *method* is "exclusive" and is used for data sampled from @@ -535,14 +534,14 @@ However, for reading convenience, most of the examples show sorted sequences. Setting the *method* to "inclusive" is used for describing population data or for samples that are known to include the most extreme values - from the population. The minimum value in *dist* is treated as the 0th + from the population. The minimum value in *data* is treated as the 0th percentile and the maximum value is treated as the 100th percentile. The portion of the population falling below the *i-th* of *m* sorted data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample values, the method sorts them and assigns the following percentiles: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%. 
- If *dist* is an instance of a class that defines an + If *data* is an instance of a class that defines an :meth:`~inv_cdf` method, setting *method* has no effect. .. doctest:: @@ -580,7 +579,7 @@ A single exception is defined: :class:`NormalDist` is a tool for creating and manipulating normal distributions of a `random variable <http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a -composite class that treats the mean and standard deviation of data +class that treats the mean and standard deviation of data measurements as a single entity. Normal distributions arise from the `Central Limit Theorem @@ -616,13 +615,14 @@ of applications in statistics. .. classmethod:: NormalDist.from_samples(data) - Makes a normal distribution instance computed from sample data. The - *data* can be any :term:`iterable` and should consist of values that - can be converted to type :class:`float`. + Makes a normal distribution instance with *mu* and *sigma* parameters + estimated from the *data* using :func:`fmean` and :func:`stdev`. - If *data* does not contain at least two elements, raises - :exc:`StatisticsError` because it takes at least one point to estimate - a central value and at least two points to estimate dispersion. + The *data* can be any :term:`iterable` and should consist of values + that can be converted to type :class:`float`. If *data* does not + contain at least two elements, raises :exc:`StatisticsError` because it + takes at least one point to estimate a central value and at least two + points to estimate dispersion. .. method:: NormalDist.samples(n, *, seed=None) @@ -636,10 +636,10 @@ of applications in statistics. .. method:: NormalDist.pdf(x) Using a `probability density function (pdf) - <https://en.wikipedia.org/wiki/Probability_density_function>`_, - compute the relative likelihood that a random variable *X* will be near - the given value *x*. Mathematically, it is the ratio ``P(x <= X < - x+dx) / dx``. 
+ <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute + the relative likelihood that a random variable *X* will be near the + given value *x*. Mathematically, it is the limit of the ratio ``P(x <= + X < x+dx) / dx`` as *dx* approaches zero. The relative likelihood is computed as the probability of a sample occurring in a narrow range divided by the width of the range (hence @@ -667,8 +667,10 @@ of applications in statistics. .. method:: NormalDist.overlap(other) - Returns a value between 0.0 and 1.0 giving the overlapping area for - the two probability density functions. + Measures the agreement between two normal probability distributions. + Returns a value between 0.0 and 1.0 giving `the overlapping area for + the two probability density functions + <https://www.rasch.org/rmt/rmt101r.htm>`_. Instances of :class:`NormalDist` support addition, subtraction, multiplication and division by a constant. These operations @@ -740,12 +742,11 @@ Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_: ... return (3*x + 7*x*y - 5*y) / (11 * z) ... >>> n = 100_000 - >>> seed = 86753099035768 - >>> X = NormalDist(10, 2.5).samples(n, seed=seed) - >>> Y = NormalDist(15, 1.75).samples(n, seed=seed) - >>> Z = NormalDist(50, 1.25).samples(n, seed=seed) - >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP - NormalDist(mu=1.8661894803304777, sigma=0.65238717376862) + >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728) + >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471) + >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453) + >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP + [1.4591308524824727, 1.8035946855390597, 2.175091447274739] Normal distributions commonly arise in machine learning problems. |
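For readers who want to try the behaviour the revised wording describes, the following is a minimal sketch, not part of the commit. It assumes Python 3.8 or later, and the sample values (``tied``, ``samples``) are made up for illustration. It exercises :func:`fmean` with an iterator, the rounded :func:`geometric_mean` example, the tie-breaking rule for :func:`mode`, the cut points returned by :func:`quantiles`, and the estimates used by ``NormalDist.from_samples``.

```python
import math
from statistics import (
    NormalDist, fmean, geometric_mean, mode, multimode, quantiles, stdev,
)

# fmean() accepts an iterator as well as a sequence and returns a float.
avg = fmean(x for x in [3.5, 4.0, 5.25])

# geometric_mean() makes no promise of exact results, so the doctest rounds.
gm = round(geometric_mean([54, 24, 36]), 1)

# With several modes of the same frequency, mode() returns the first one
# encountered; multimode() lists all of them so min() or max() can choose.
tied = [3, 3, 1, 1, 2]  # hypothetical data: 3 and 1 both occur twice
first_mode = mode(tied)
smallest_mode = min(multimode(tied))

# quantiles() returns n - 1 cut points; "inclusive" treats the minimum and
# maximum of the data as the 0th and 100th percentiles.
cuts = quantiles([1, 2, 3, 4, 5], n=4, method='inclusive')

# from_samples() estimates mu and sigma from the data. The comparison uses
# a tolerance because the internal arithmetic may differ in the last bits.
samples = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]  # hypothetical measurements
dist = NormalDist.from_samples(samples)
same_mu = math.isclose(dist.mean, fmean(samples))
same_sigma = math.isclose(dist.stdev, stdev(samples))
```

The tolerance check is a deliberate choice: the documentation states which estimators are used, not that the results are bit-for-bit identical to calling them separately.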