author    | Larry Hastings <larry@hastings.org> | 2013-10-19 18:50:09 (GMT)
committer | Larry Hastings <larry@hastings.org> | 2013-10-19 18:50:09 (GMT)
commit    | f5e987bbe64156ebeae5eea730962c209fbb9d74 (patch)
tree      | 760ff3a67867e00e33b038da9bd4e497706cfed0
parent    | aa2b22abf342dd000d243512c620d5b5022381cf (diff)
Issue #18606: Add the new "statistics" module (PEP 450). Contributed
by Steven D'Aprano.
-rw-r--r-- | Doc/library/numeric.rst     |    1
-rw-r--r-- | Doc/library/statistics.rst  |  463
-rw-r--r-- | Lib/statistics.py           |  612
-rw-r--r-- | Lib/test/test_statistics.py | 1534
-rw-r--r-- | Misc/NEWS                   |    3
5 files changed, 2613 insertions, 0 deletions
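
For orientation before the patch body, here is a minimal usage sketch of the new module; the values are taken from the doctests in the documentation and docstrings added below, and it assumes Python 3.4 or later, where the module first ships.

```python
# Minimal sketch of the new statistics module (PEP 450), using values
# from the doctests added in this commit; requires Python 3.4+.
from statistics import mean, median, mode, variance, pstdev

print(mean([1, 2, 3, 4, 4]))                   # 2.8
print(median([1, 3, 5, 7]))                    # 4.0 ("mean of middle two")
print(mode([1, 1, 2, 3, 3, 3, 3, 4]))          # 3
print(variance([2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]))  # 1.3720238095238095
print(pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]))           # 0.986893273527251
```
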
diff --git a/Doc/library/numeric.rst b/Doc/library/numeric.rst index 2732a84..7c76a47 100644 --- a/Doc/library/numeric.rst +++ b/Doc/library/numeric.rst @@ -23,3 +23,4 @@ The following modules are documented in this chapter: decimal.rst fractions.rst random.rst + statistics.rst diff --git a/Doc/library/statistics.rst b/Doc/library/statistics.rst new file mode 100644 index 0000000..4f5b759 --- /dev/null +++ b/Doc/library/statistics.rst @@ -0,0 +1,463 @@ +:mod:`statistics` --- Mathematical statistics functions +======================================================= + +.. module:: statistics + :synopsis: mathematical statistics functions +.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info> +.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info> + +.. versionadded:: 3.4 + +.. testsetup:: * + + from statistics import * + __name__ = '<doctest>' + +**Source code:** :source:`Lib/statistics.py` + +-------------- + +This module provides functions for calculating mathematical statistics of +numeric (:class:`Real`-valued) data. + +Averages and measures of central location +----------------------------------------- + +These functions calculate an average or typical value from a population +or sample. + +======================= ============================================= +:func:`mean` Arithmetic mean ("average") of data. +:func:`median` Median (middle value) of data. +:func:`median_low` Low median of data. +:func:`median_high` High median of data. +:func:`median_grouped` Median, or 50th percentile, of grouped data. +:func:`mode` Mode (most common value) of discrete data. +======================= ============================================= + +:func:`mean` +~~~~~~~~~~~~ + +The :func:`mean` function calculates the arithmetic mean, commonly known +as the average, of its iterable argument: + +.. function:: mean(data) + + Return the sample arithmetic mean of *data*, a sequence or iterator + of real-valued numbers. + + The arithmetic mean is the sum of the data divided by the number of + data points. It is commonly called "the average", although it is only + one of many different mathematical averages. It is a measure of the + central location of the data. + + Some examples of use: + + .. doctest:: + + >>> mean([1, 2, 3, 4, 4]) + 2.8 + >>> mean([-1.0, 2.5, 3.25, 5.75]) + 2.625 + + >>> from fractions import Fraction as F + >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)]) + Fraction(13, 21) + + >>> from decimal import Decimal as D + >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")]) + Decimal('0.5625') + + .. note:: + + The mean is strongly effected by outliers and is not a robust + estimator for central location: the mean is not necessarily a + typical example of the data points. For more robust, although less + efficient, measures of central location, see :func:`median` and + :func:`mode`. (In this case, "efficient" refers to statistical + efficiency rather than computational efficiency.) + + The sample mean gives an unbiased estimate of the true population + mean, which means that, taken on average over all the possible + samples, ``mean(sample)`` converges on the true mean of the entire + population. If *data* represents the entire population rather than + a sample, then ``mean(data)`` is equivalent to calculating the true + population mean μ. + + If ``data`` is empty, :exc:`StatisticsError` will be raised. + +:func:`median` +~~~~~~~~~~~~~~ + +The :func:`median` function calculates the median, or middle, data point, +using the common "mean of middle two" method. + + .. 
seealso:: + + :func:`median_low` + + :func:`median_high` + + :func:`median_grouped` + +.. function:: median(data) + + Return the median (middle value) of numeric data. + + The median is a robust measure of central location, and is less affected + by the presence of outliers in your data. When the number of data points + is odd, the middle data point is returned: + + .. doctest:: + + >>> median([1, 3, 5]) + 3 + + When the number of data points is even, the median is interpolated by + taking the average of the two middle values: + + .. doctest:: + + >>> median([1, 3, 5, 7]) + 4.0 + + This is suited for when your data is discrete, and you don't mind that + the median may not be an actual data point. + + If data is empty, :exc:`StatisticsError` is raised. + +:func:`median_low` +~~~~~~~~~~~~~~~~~~ + +The :func:`median_low` function calculates the low median without +interpolation. + +.. function:: median_low(data) + + Return the low median of numeric data. + + The low median is always a member of the data set. When the number + of data points is odd, the middle value is returned. When it is + even, the smaller of the two middle values is returned. + + .. doctest:: + + >>> median_low([1, 3, 5]) + 3 + >>> median_low([1, 3, 5, 7]) + 3 + + Use the low median when your data are discrete and you prefer the median + to be an actual data point rather than interpolated. + + If data is empty, :exc:`StatisticsError` is raised. + +:func:`median_high` +~~~~~~~~~~~~~~~~~~~ + +The :func:`median_high` function calculates the high median without +interpolation. + +.. function:: median_high(data) + + Return the high median of data. + + The high median is always a member of the data set. When the number of + data points is odd, the middle value is returned. When it is even, the + larger of the two middle values is returned. + + .. doctest:: + + >>> median_high([1, 3, 5]) + 3 + >>> median_high([1, 3, 5, 7]) + 5 + + Use the high median when your data are discrete and you prefer the median + to be an actual data point rather than interpolated. + + If data is empty, :exc:`StatisticsError` is raised. + +:func:`median_grouped` +~~~~~~~~~~~~~~~~~~~~~~ + +The :func:`median_grouped` function calculates the median of grouped data +as the 50th percentile, using interpolation. + +.. function:: median_grouped(data [, interval]) + + Return the median of grouped continuous data, calculated as the + 50th percentile. + + .. doctest:: + + >>> median_grouped([52, 52, 53, 54]) + 52.5 + + In the following example, the data are rounded, so that each value + represents the midpoint of data classes, e.g. 1 is the midpoint of the + class 0.5-1.5, 2 is the midpoint of 1.5-2.5, 3 is the midpoint of + 2.5-3.5, etc. With the data given, the middle value falls somewhere in + the class 3.5-4.5, and interpolation is used to estimate it: + + .. doctest:: + + >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]) + 3.7 + + Optional argument ``interval`` represents the class interval, and + defaults to 1. Changing the class interval naturally will change the + interpolation: + + .. doctest:: + + >>> median_grouped([1, 3, 3, 5, 7], interval=1) + 3.25 + >>> median_grouped([1, 3, 3, 5, 7], interval=2) + 3.5 + + This function does not check whether the data points are at least + ``interval`` apart. + + .. impl-detail:: + + Under some circumstances, :func:`median_grouped` may coerce data + points to floats. This behaviour is likely to change in the future. + + .. 
seealso:: + + * "Statistics for the Behavioral Sciences", Frederick J Gravetter + and Larry B Wallnau (8th Edition). + + * Calculating the `median <http://www.ualberta.ca/~opscan/median.html>`_. + + * The `SSMEDIAN <https://projects.gnome.org/gnumeric/doc/gnumeric-function-SSMEDIAN.shtml>`_ + function in the Gnome Gnumeric spreadsheet, including + `this discussion <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_. + + If data is empty, :exc:`StatisticsError` is raised. + +:func:`mode` +~~~~~~~~~~~~ + +The :func:`mode` function calculates the mode, or most common element, of +discrete or nominal data. The mode (when it exists) is the most typical +value, and is a robust measure of central location. + +.. function:: mode(data) + + Return the most common data point from discrete or nominal data. + + ``mode`` assumes discrete data, and returns a single value. This is the + standard treatment of the mode as commonly taught in schools: + + .. doctest:: + + >>> mode([1, 1, 2, 3, 3, 3, 3, 4]) + 3 + + The mode is unique in that it is the only statistic which also applies + to nominal (non-numeric) data: + + .. doctest:: + + >>> mode(["red", "blue", "blue", "red", "green", "red", "red"]) + 'red' + + If data is empty, or if there is not exactly one most common value, + :exc:`StatisticsError` is raised. + +Measures of spread +------------------ + +These functions calculate a measure of how much the population or sample +tends to deviate from the typical or average values. + +======================= ============================================= +:func:`pstdev` Population standard deviation of data. +:func:`pvariance` Population variance of data. +:func:`stdev` Sample standard deviation of data. +:func:`variance` Sample variance of data. +======================= ============================================= + +:func:`pstdev` +~~~~~~~~~~~~~~ + +The :func:`pstdev` function calculates the standard deviation of a +population. The standard deviation is equivalent to the square root of +the variance. + +.. function:: pstdev(data [, mu]) + + Return the square root of the population variance. See :func:`pvariance` + for arguments and other details. + + .. doctest:: + + >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) + 0.986893273527251 + +:func:`pvariance` +~~~~~~~~~~~~~~~~~ + +The :func:`pvariance` function calculates the variance of a population. +Variance, or second moment about the mean, is a measure of the variability +(spread or dispersion) of data. A large variance indicates that the data is +spread out; a small variance indicates it is clustered closely around the +mean. + +.. function:: pvariance(data [, mu]) + + Return the population variance of *data*, a non-empty iterable of + real-valued numbers. + + If the optional second argument *mu* is given, it should be the mean + of *data*. If it is missing or None (the default), the mean is + automatically caclulated. + + Use this function to calculate the variance from the entire population. + To estimate the variance from a sample, the :func:`variance` function is + usually a better choice. + + Examples: + + .. doctest:: + + >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25] + >>> pvariance(data) + 1.25 + + If you have already calculated the mean of your data, you can pass + it as the optional second argument *mu* to avoid recalculation: + + .. doctest:: + + >>> mu = mean(data) + >>> pvariance(data, mu) + 1.25 + + This function does not attempt to verify that you have passed the actual + mean as *mu*. 
Using arbitrary values for *mu* may lead to invalid or + impossible results. + + Decimals and Fractions are supported: + + .. doctest:: + + >>> from decimal import Decimal as D + >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) + Decimal('24.815') + + >>> from fractions import Fraction as F + >>> pvariance([F(1, 4), F(5, 4), F(1, 2)]) + Fraction(13, 72) + + .. note:: + + When called with the entire population, this gives the population + variance σ². When called on a sample instead, this is the biased + sample variance s², also known as variance with N degrees of freedom. + + If you somehow know the true population mean μ, you may use this + function to calculate the variance of a sample, giving the known + population mean as the second argument. Provided the data points are + representative (e.g. independent and identically distributed), the + result will be an unbiased estimate of the population variance. + + Raises :exc:`StatisticsError` if *data* is empty. + +:func:`stdev` +~~~~~~~~~~~~~~ + +The :func:`stdev` function calculates the standard deviation of a sample. +The standard deviation is equivalent to the square root of the variance. + +.. function:: stdev(data [, xbar]) + + Return the square root of the sample variance. See :func:`variance` for + arguments and other details. + + .. doctest:: + + >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) + 1.0810874155219827 + +:func:`variance` +~~~~~~~~~~~~~~~~~ + +The :func:`variance` function calculates the variance of a sample. Variance, +or second moment about the mean, is a measure of the variability (spread or +dispersion) of data. A large variance indicates that the data is spread out; +a small variance indicates it is clustered closely around the mean. + +.. function:: variance(data [, xbar]) + + Return the sample variance of *data*, an iterable of at least two + real-valued numbers. + + If the optional second argument *xbar* is given, it should be the mean + of *data*. If it is missing or None (the default), the mean is + automatically caclulated. + + Use this function when your data is a sample from a population. To + calculate the variance from the entire population, see :func:`pvariance`. + + Examples: + + .. doctest:: + + >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5] + >>> variance(data) + 1.3720238095238095 + + If you have already calculated the mean of your data, you can pass + it as the optional second argument *xbar* to avoid recalculation: + + .. doctest:: + + >>> m = mean(data) + >>> variance(data, m) + 1.3720238095238095 + + This function does not attempt to verify that you have passed the actual + mean as *xbar*. Using arbitrary values for *xbar* can lead to invalid or + impossible results. + + Decimal and Fraction values are supported: + + .. doctest:: + + >>> from decimal import Decimal as D + >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) + Decimal('31.01875') + + >>> from fractions import Fraction as F + >>> variance([F(1, 6), F(1, 2), F(5, 3)]) + Fraction(67, 108) + + .. note:: + + This is the sample variance s² with Bessel's correction, also known + as variance with N-1 degrees of freedom. Provided that the data + points are representative (e.g. independent and identically + distributed), the result should be an unbiased estimate of the true + population variance. + + If you somehow know the actual population mean μ you should pass it + to the :func:`pvariance` function as the *mu* parameter to get + the variance of a sample. 
+ + Raises :exc:`StatisticsError` if *data* has fewer than two values. + +Exceptions +---------- + +A single exception is defined: + +:exc:`StatisticsError` + +Subclass of :exc:`ValueError` for statistics-related exceptions. + +.. + # This modelines must appear within the last ten lines of the file. + kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8; diff --git a/Lib/statistics.py b/Lib/statistics.py new file mode 100644 index 0000000..a18241b --- /dev/null +++ b/Lib/statistics.py @@ -0,0 +1,612 @@ +## Module statistics.py +## +## Copyright (c) 2013 Steven D'Aprano <steve+python@pearwood.info>. +## +## Licensed under the Apache License, Version 2.0 (the "License"); +## you may not use this file except in compliance with the License. +## You may obtain a copy of the License at +## +## http://www.apache.org/licenses/LICENSE-2.0 +## +## Unless required by applicable law or agreed to in writing, software +## distributed under the License is distributed on an "AS IS" BASIS, +## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +## See the License for the specific language governing permissions and +## limitations under the License. + + +""" +Basic statistics module. + +This module provides functions for calculating statistics of data, including +averages, variance, and standard deviation. + +Calculating averages +-------------------- + +================== ============================================= +Function Description +================== ============================================= +mean Arithmetic mean (average) of data. +median Median (middle value) of data. +median_low Low median of data. +median_high High median of data. +median_grouped Median, or 50th percentile, of grouped data. +mode Mode (most common value) of data. +================== ============================================= + +Calculate the arithmetic mean ("the average") of data: + +>>> mean([-1.0, 2.5, 3.25, 5.75]) +2.625 + + +Calculate the standard median of discrete data: + +>>> median([2, 3, 4, 5]) +3.5 + + +Calculate the median, or 50th percentile, of data grouped into class intervals +centred on the data values provided. E.g. if your data points are rounded to +the nearest whole number: + +>>> median_grouped([2, 2, 3, 3, 3, 4]) #doctest: +ELLIPSIS +2.8333333333... + +This should be interpreted in this way: you have two data points in the class +interval 1.5-2.5, three data points in the class interval 2.5-3.5, and one in +the class interval 3.5-4.5. The median of these data points is 2.8333... + + +Calculating variability or spread +--------------------------------- + +================== ============================================= +Function Description +================== ============================================= +pvariance Population variance of data. +variance Sample variance of data. +pstdev Population standard deviation of data. +stdev Sample standard deviation of data. +================== ============================================= + +Calculate the standard deviation of sample data: + +>>> stdev([2.5, 3.25, 5.5, 11.25, 11.75]) #doctest: +ELLIPSIS +4.38961843444... + +If you have previously calculated the mean, you can pass it as the optional +second argument to the four "spread" functions to avoid recalculating it: + +>>> data = [1, 2, 2, 4, 4, 4, 5, 6] +>>> mu = mean(data) +>>> pvariance(data, mu) +2.5 + + +Exceptions +---------- + +A single exception is defined: StatisticsError is a subclass of ValueError. 
+ +""" + +__all__ = [ 'StatisticsError', + 'pstdev', 'pvariance', 'stdev', 'variance', + 'median', 'median_low', 'median_high', 'median_grouped', + 'mean', 'mode', + ] + + +import collections +import math +import numbers +import operator + +from fractions import Fraction +from decimal import Decimal + + +# === Exceptions === + +class StatisticsError(ValueError): + pass + + +# === Private utilities === + +def _sum(data, start=0): + """_sum(data [, start]) -> value + + Return a high-precision sum of the given numeric data. If optional + argument ``start`` is given, it is added to the total. If ``data`` is + empty, ``start`` (defaulting to 0) is returned. + + + Examples + -------- + + >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75) + 11.0 + + Some sources of round-off error will be avoided: + + >>> _sum([1e50, 1, -1e50] * 1000) # Built-in sum returns zero. + 1000.0 + + Fractions and Decimals are also supported: + + >>> from fractions import Fraction as F + >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)]) + Fraction(63, 20) + + >>> from decimal import Decimal as D + >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")] + >>> _sum(data) + Decimal('0.6963') + + """ + n, d = _exact_ratio(start) + T = type(start) + partials = {d: n} # map {denominator: sum of numerators} + # Micro-optimizations. + coerce_types = _coerce_types + exact_ratio = _exact_ratio + partials_get = partials.get + # Add numerators for each denominator, and track the "current" type. + for x in data: + T = _coerce_types(T, type(x)) + n, d = exact_ratio(x) + partials[d] = partials_get(d, 0) + n + if None in partials: + assert issubclass(T, (float, Decimal)) + assert not math.isfinite(partials[None]) + return T(partials[None]) + total = Fraction() + for d, n in sorted(partials.items()): + total += Fraction(n, d) + if issubclass(T, int): + assert total.denominator == 1 + return T(total.numerator) + if issubclass(T, Decimal): + return T(total.numerator)/total.denominator + return T(total) + + +def _exact_ratio(x): + """Convert Real number x exactly to (numerator, denominator) pair. + + >>> _exact_ratio(0.25) + (1, 4) + + x is expected to be an int, Fraction, Decimal or float. + """ + try: + try: + # int, Fraction + return (x.numerator, x.denominator) + except AttributeError: + # float + try: + return x.as_integer_ratio() + except AttributeError: + # Decimal + try: + return _decimal_to_ratio(x) + except AttributeError: + msg = "can't convert type '{}' to numerator/denominator" + raise TypeError(msg.format(type(x).__name__)) from None + except (OverflowError, ValueError): + # INF or NAN + if __debug__: + # Decimal signalling NANs cannot be converted to float :-( + if isinstance(x, Decimal): + assert not x.is_finite() + else: + assert not math.isfinite(x) + return (x, None) + + +# FIXME This is faster than Fraction.from_decimal, but still too slow. +def _decimal_to_ratio(d): + """Convert Decimal d to exact integer ratio (numerator, denominator). + + >>> from decimal import Decimal + >>> _decimal_to_ratio(Decimal("2.6")) + (26, 10) + + """ + sign, digits, exp = d.as_tuple() + if exp in ('F', 'n', 'N'): # INF, NAN, sNAN + assert not d.is_finite() + raise ValueError + num = 0 + for digit in digits: + num = num*10 + digit + if sign: + num = -num + den = 10**-exp + return (num, den) + + +def _coerce_types(T1, T2): + """Coerce types T1 and T2 to a common type. + + >>> _coerce_types(int, float) + <class 'float'> + + Coercion is performed according to this table, where "N/A" means + that a TypeError exception is raised. 
+ + +----------+-----------+-----------+-----------+----------+ + | | int | Fraction | Decimal | float | + +----------+-----------+-----------+-----------+----------+ + | int | int | Fraction | Decimal | float | + | Fraction | Fraction | Fraction | N/A | float | + | Decimal | Decimal | N/A | Decimal | float | + | float | float | float | float | float | + +----------+-----------+-----------+-----------+----------+ + + Subclasses trump their parent class; two subclasses of the same + base class will be coerced to the second of the two. + + """ + # Get the common/fast cases out of the way first. + if T1 is T2: return T1 + if T1 is int: return T2 + if T2 is int: return T1 + # Subclasses trump their parent class. + if issubclass(T2, T1): return T2 + if issubclass(T1, T2): return T1 + # Floats trump everything else. + if issubclass(T2, float): return T2 + if issubclass(T1, float): return T1 + # Subclasses of the same base class give priority to the second. + if T1.__base__ is T2.__base__: return T2 + # Otherwise, just give up. + raise TypeError('cannot coerce types %r and %r' % (T1, T2)) + + +def _counts(data): + # Generate a table of sorted (value, frequency) pairs. + if data is None: + raise TypeError('None is not iterable') + table = collections.Counter(data).most_common() + if not table: + return table + # Extract the values with the highest frequency. + maxfreq = table[0][1] + for i in range(1, len(table)): + if table[i][1] != maxfreq: + table = table[:i] + break + return table + + +# === Measures of central tendency (averages) === + +def mean(data): + """Return the sample arithmetic mean of data. + + >>> mean([1, 2, 3, 4, 4]) + 2.8 + + >>> from fractions import Fraction as F + >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)]) + Fraction(13, 21) + + >>> from decimal import Decimal as D + >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")]) + Decimal('0.5625') + + If ``data`` is empty, StatisticsError will be raised. + """ + if iter(data) is data: + data = list(data) + n = len(data) + if n < 1: + raise StatisticsError('mean requires at least one data point') + return _sum(data)/n + + +# FIXME: investigate ways to calculate medians without sorting? Quickselect? +def median(data): + """Return the median (middle value) of numeric data. + + When the number of data points is odd, return the middle data point. + When the number of data points is even, the median is interpolated by + taking the average of the two middle values: + + >>> median([1, 3, 5]) + 3 + >>> median([1, 3, 5, 7]) + 4.0 + + """ + data = sorted(data) + n = len(data) + if n == 0: + raise StatisticsError("no median for empty data") + if n%2 == 1: + return data[n//2] + else: + i = n//2 + return (data[i - 1] + data[i])/2 + + +def median_low(data): + """Return the low median of numeric data. + + When the number of data points is odd, the middle value is returned. + When it is even, the smaller of the two middle values is returned. + + >>> median_low([1, 3, 5]) + 3 + >>> median_low([1, 3, 5, 7]) + 3 + + """ + data = sorted(data) + n = len(data) + if n == 0: + raise StatisticsError("no median for empty data") + if n%2 == 1: + return data[n//2] + else: + return data[n//2 - 1] + + +def median_high(data): + """Return the high median of data. + + When the number of data points is odd, the middle value is returned. + When it is even, the larger of the two middle values is returned. 
+ + >>> median_high([1, 3, 5]) + 3 + >>> median_high([1, 3, 5, 7]) + 5 + + """ + data = sorted(data) + n = len(data) + if n == 0: + raise StatisticsError("no median for empty data") + return data[n//2] + + +def median_grouped(data, interval=1): + """"Return the 50th percentile (median) of grouped continuous data. + + >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]) + 3.7 + >>> median_grouped([52, 52, 53, 54]) + 52.5 + + This calculates the median as the 50th percentile, and should be + used when your data is continuous and grouped. In the above example, + the values 1, 2, 3, etc. actually represent the midpoint of classes + 0.5-1.5, 1.5-2.5, 2.5-3.5, etc. The middle value falls somewhere in + class 3.5-4.5, and interpolation is used to estimate it. + + Optional argument ``interval`` represents the class interval, and + defaults to 1. Changing the class interval naturally will change the + interpolated 50th percentile value: + + >>> median_grouped([1, 3, 3, 5, 7], interval=1) + 3.25 + >>> median_grouped([1, 3, 3, 5, 7], interval=2) + 3.5 + + This function does not check whether the data points are at least + ``interval`` apart. + """ + data = sorted(data) + n = len(data) + if n == 0: + raise StatisticsError("no median for empty data") + elif n == 1: + return data[0] + # Find the value at the midpoint. Remember this corresponds to the + # centre of the class interval. + x = data[n//2] + for obj in (x, interval): + if isinstance(obj, (str, bytes)): + raise TypeError('expected number but got %r' % obj) + try: + L = x - interval/2 # The lower limit of the median interval. + except TypeError: + # Mixed type. For now we just coerce to float. + L = float(x) - float(interval)/2 + cf = data.index(x) # Number of values below the median interval. + # FIXME The following line could be more efficient for big lists. + f = data.count(x) # Number of data points in the median interval. + return L + interval*(n/2 - cf)/f + + +def mode(data): + """Return the most common data point from discrete or nominal data. + + ``mode`` assumes discrete data, and returns a single value. This is the + standard treatment of the mode as commonly taught in schools: + + >>> mode([1, 1, 2, 3, 3, 3, 3, 4]) + 3 + + This also works with nominal (non-numeric) data: + + >>> mode(["red", "blue", "blue", "red", "green", "red", "red"]) + 'red' + + If there is not exactly one most common value, ``mode`` will raise + StatisticsError. + """ + # Generate a table of sorted (value, frequency) pairs. + table = _counts(data) + if len(table) == 1: + return table[0][0] + elif table: + raise StatisticsError( + 'no unique mode; found %d equally common values' % len(table) + ) + else: + raise StatisticsError('no mode for empty data') + + +# === Measures of spread === + +# See http://mathworld.wolfram.com/Variance.html +# http://mathworld.wolfram.com/SampleVariance.html +# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance +# +# Under no circumstances use the so-called "computational formula for +# variance", as that is only suitable for hand calculations with a small +# amount of low-precision data. It has terrible numeric properties. +# +# See a comparison of three computational methods here: +# http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/ + +def _ss(data, c=None): + """Return sum of square deviations of sequence data. + + If ``c`` is None, the mean is calculated in one pass, and the deviations + from the mean are calculated in a second pass. 
Otherwise, deviations are + calculated from ``c`` as given. Use the second case with care, as it can + lead to garbage results. + """ + if c is None: + c = mean(data) + ss = _sum((x-c)**2 for x in data) + # The following sum should mathematically equal zero, but due to rounding + # error may not. + ss -= _sum((x-c) for x in data)**2/len(data) + assert not ss < 0, 'negative sum of square deviations: %f' % ss + return ss + + +def variance(data, xbar=None): + """Return the sample variance of data. + + data should be an iterable of Real-valued numbers, with at least two + values. The optional argument xbar, if given, should be the mean of + the data. If it is missing or None, the mean is automatically calculated. + + Use this function when your data is a sample from a population. To + calculate the variance from the entire population, see ``pvariance``. + + Examples: + + >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5] + >>> variance(data) + 1.3720238095238095 + + If you have already calculated the mean of your data, you can pass it as + the optional second argument ``xbar`` to avoid recalculating it: + + >>> m = mean(data) + >>> variance(data, m) + 1.3720238095238095 + + This function does not check that ``xbar`` is actually the mean of + ``data``. Giving arbitrary values for ``xbar`` may lead to invalid or + impossible results. + + Decimals and Fractions are supported: + + >>> from decimal import Decimal as D + >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) + Decimal('31.01875') + + >>> from fractions import Fraction as F + >>> variance([F(1, 6), F(1, 2), F(5, 3)]) + Fraction(67, 108) + + """ + if iter(data) is data: + data = list(data) + n = len(data) + if n < 2: + raise StatisticsError('variance requires at least two data points') + ss = _ss(data, xbar) + return ss/(n-1) + + +def pvariance(data, mu=None): + """Return the population variance of ``data``. + + data should be an iterable of Real-valued numbers, with at least one + value. The optional argument mu, if given, should be the mean of + the data. If it is missing or None, the mean is automatically calculated. + + Use this function to calculate the variance from the entire population. + To estimate the variance from a sample, the ``variance`` function is + usually a better choice. + + Examples: + + >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25] + >>> pvariance(data) + 1.25 + + If you have already calculated the mean of the data, you can pass it as + the optional second argument to avoid recalculating it: + + >>> mu = mean(data) + >>> pvariance(data, mu) + 1.25 + + This function does not check that ``mu`` is actually the mean of ``data``. + Giving arbitrary values for ``mu`` may lead to invalid or impossible + results. + + Decimals and Fractions are supported: + + >>> from decimal import Decimal as D + >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) + Decimal('24.815') + + >>> from fractions import Fraction as F + >>> pvariance([F(1, 4), F(5, 4), F(1, 2)]) + Fraction(13, 72) + + """ + if iter(data) is data: + data = list(data) + n = len(data) + if n < 1: + raise StatisticsError('pvariance requires at least one data point') + ss = _ss(data, mu) + return ss/n + + +def stdev(data, xbar=None): + """Return the square root of the sample variance. + + See ``variance`` for arguments and other details. 
+ + >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) + 1.0810874155219827 + + """ + var = variance(data, xbar) + try: + return var.sqrt() + except AttributeError: + return math.sqrt(var) + + +def pstdev(data, mu=None): + """Return the square root of the population variance. + + See ``pvariance`` for arguments and other details. + + >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) + 0.986893273527251 + + """ + var = pvariance(data, mu) + try: + return var.sqrt() + except AttributeError: + return math.sqrt(var) diff --git a/Lib/test/test_statistics.py b/Lib/test/test_statistics.py new file mode 100644 index 0000000..71b3654 --- /dev/null +++ b/Lib/test/test_statistics.py @@ -0,0 +1,1534 @@ +"""Test suite for statistics module, including helper NumericTestCase and +approx_equal function. + +""" + +import collections +import decimal +import doctest +import math +import random +import types +import unittest + +from decimal import Decimal +from fractions import Fraction + + +# Module to be tested. +import statistics + + +# === Helper functions and class === + +def _calc_errors(actual, expected): + """Return the absolute and relative errors between two numbers. + + >>> _calc_errors(100, 75) + (25, 0.25) + >>> _calc_errors(100, 100) + (0, 0.0) + + Returns the (absolute error, relative error) between the two arguments. + """ + base = max(abs(actual), abs(expected)) + abs_err = abs(actual - expected) + rel_err = abs_err/base if base else float('inf') + return (abs_err, rel_err) + + +def approx_equal(x, y, tol=1e-12, rel=1e-7): + """approx_equal(x, y [, tol [, rel]]) => True|False + + Return True if numbers x and y are approximately equal, to within some + margin of error, otherwise return False. Numbers which compare equal + will also compare approximately equal. + + x is approximately equal to y if the difference between them is less than + an absolute error tol or a relative error rel, whichever is bigger. + + If given, both tol and rel must be finite, non-negative numbers. If not + given, default values are tol=1e-12 and rel=1e-7. + + >>> approx_equal(1.2589, 1.2587, tol=0.0003, rel=0) + True + >>> approx_equal(1.2589, 1.2587, tol=0.0001, rel=0) + False + + Absolute error is defined as abs(x-y); if that is less than or equal to + tol, x and y are considered approximately equal. + + Relative error is defined as abs((x-y)/x) or abs((x-y)/y), whichever is + smaller, provided x or y are not zero. If that figure is less than or + equal to rel, x and y are considered approximately equal. + + Complex numbers are not directly supported. If you wish to compare to + complex numbers, extract their real and imaginary parts and compare them + individually. + + NANs always compare unequal, even with themselves. Infinities compare + approximately equal if they have the same sign (both positive or both + negative). Infinities with different signs compare unequal; so do + comparisons of infinities with finite numbers. + """ + if tol < 0 or rel < 0: + raise ValueError('error tolerances must be non-negative') + # NANs are never equal to anything, approximately or otherwise. + if math.isnan(x) or math.isnan(y): + return False + # Numbers which compare equal also compare approximately equal. + if x == y: + # This includes the case of two infinities with the same sign. + return True + if math.isinf(x) or math.isinf(y): + # This includes the case of two infinities of opposite sign, or + # one infinity and one finite number. + return False + # Two finite numbers. 
+ actual_error = abs(x - y) + allowed_error = max(tol, rel*max(abs(x), abs(y))) + return actual_error <= allowed_error + + +# This class exists only as somewhere to stick a docstring containing +# doctests. The following docstring and tests were originally in a separate +# module. Now that it has been merged in here, I need somewhere to hang the. +# docstring. Ultimately, this class will die, and the information below will +# either become redundant, or be moved into more appropriate places. +class _DoNothing: + """ + When doing numeric work, especially with floats, exact equality is often + not what you want. Due to round-off error, it is often a bad idea to try + to compare floats with equality. Instead the usual procedure is to test + them with some (hopefully small!) allowance for error. + + The ``approx_equal`` function allows you to specify either an absolute + error tolerance, or a relative error, or both. + + Absolute error tolerances are simple, but you need to know the magnitude + of the quantities being compared: + + >>> approx_equal(12.345, 12.346, tol=1e-3) + True + >>> approx_equal(12.345e6, 12.346e6, tol=1e-3) # tol is too small. + False + + Relative errors are more suitable when the values you are comparing can + vary in magnitude: + + >>> approx_equal(12.345, 12.346, rel=1e-4) + True + >>> approx_equal(12.345e6, 12.346e6, rel=1e-4) + True + + but a naive implementation of relative error testing can run into trouble + around zero. + + If you supply both an absolute tolerance and a relative error, the + comparison succeeds if either individual test succeeds: + + >>> approx_equal(12.345e6, 12.346e6, tol=1e-3, rel=1e-4) + True + + """ + pass + + + +# We prefer this for testing numeric values that may not be exactly equal, +# and avoid using TestCase.assertAlmostEqual, because it sucks :-) + +class NumericTestCase(unittest.TestCase): + """Unit test class for numeric work. + + This subclasses TestCase. In addition to the standard method + ``TestCase.assertAlmostEqual``, ``assertApproxEqual`` is provided. + """ + # By default, we expect exact equality, unless overridden. + tol = rel = 0 + + def assertApproxEqual( + self, first, second, tol=None, rel=None, msg=None + ): + """Test passes if ``first`` and ``second`` are approximately equal. + + This test passes if ``first`` and ``second`` are equal to + within ``tol``, an absolute error, or ``rel``, a relative error. + + If either ``tol`` or ``rel`` are None or not given, they default to + test attributes of the same name (by default, 0). + + The objects may be either numbers, or sequences of numbers. Sequences + are tested element-by-element. + + >>> class MyTest(NumericTestCase): + ... def test_number(self): + ... x = 1.0/6 + ... y = sum([x]*6) + ... self.assertApproxEqual(y, 1.0, tol=1e-15) + ... def test_sequence(self): + ... a = [1.001, 1.001e-10, 1.001e10] + ... b = [1.0, 1e-10, 1e10] + ... self.assertApproxEqual(a, b, rel=1e-3) + ... + >>> import unittest + >>> from io import StringIO # Suppress test runner output. 
+ >>> suite = unittest.TestLoader().loadTestsFromTestCase(MyTest) + >>> unittest.TextTestRunner(stream=StringIO()).run(suite) + <unittest.runner.TextTestResult run=2 errors=0 failures=0> + + """ + if tol is None: + tol = self.tol + if rel is None: + rel = self.rel + if ( + isinstance(first, collections.Sequence) and + isinstance(second, collections.Sequence) + ): + check = self._check_approx_seq + else: + check = self._check_approx_num + check(first, second, tol, rel, msg) + + def _check_approx_seq(self, first, second, tol, rel, msg): + if len(first) != len(second): + standardMsg = ( + "sequences differ in length: %d items != %d items" + % (len(first), len(second)) + ) + msg = self._formatMessage(msg, standardMsg) + raise self.failureException(msg) + for i, (a,e) in enumerate(zip(first, second)): + self._check_approx_num(a, e, tol, rel, msg, i) + + def _check_approx_num(self, first, second, tol, rel, msg, idx=None): + if approx_equal(first, second, tol, rel): + # Test passes. Return early, we are done. + return None + # Otherwise we failed. + standardMsg = self._make_std_err_msg(first, second, tol, rel, idx) + msg = self._formatMessage(msg, standardMsg) + raise self.failureException(msg) + + @staticmethod + def _make_std_err_msg(first, second, tol, rel, idx): + # Create the standard error message for approx_equal failures. + assert first != second + template = ( + ' %r != %r\n' + ' values differ by more than tol=%r and rel=%r\n' + ' -> absolute error = %r\n' + ' -> relative error = %r' + ) + if idx is not None: + header = 'numeric sequences first differ at index %d.\n' % idx + template = header + template + # Calculate actual errors: + abs_err, rel_err = _calc_errors(first, second) + return template % (first, second, tol, rel, abs_err, rel_err) + + +# ======================== +# === Test the helpers === +# ======================== + + +# --- Tests for approx_equal --- + +class ApproxEqualSymmetryTest(unittest.TestCase): + # Test symmetry of approx_equal. + + def test_relative_symmetry(self): + # Check that approx_equal treats relative error symmetrically. + # (a-b)/a is usually not equal to (a-b)/b. Ensure that this + # doesn't matter. + # + # Note: the reason for this test is that an early version + # of approx_equal was not symmetric. A relative error test + # would pass, or fail, depending on which value was passed + # as the first argument. + # + args1 = [2456, 37.8, -12.45, Decimal('2.54'), Fraction(17, 54)] + args2 = [2459, 37.2, -12.41, Decimal('2.59'), Fraction(15, 54)] + assert len(args1) == len(args2) + for a, b in zip(args1, args2): + self.do_relative_symmetry(a, b) + + def do_relative_symmetry(self, a, b): + a, b = min(a, b), max(a, b) + assert a < b + delta = b - a # The absolute difference between the values. + rel_err1, rel_err2 = abs(delta/a), abs(delta/b) + # Choose an error margin halfway between the two. + rel = (rel_err1 + rel_err2)/2 + # Now see that values a and b compare approx equal regardless of + # which is given first. 
+ self.assertTrue(approx_equal(a, b, tol=0, rel=rel)) + self.assertTrue(approx_equal(b, a, tol=0, rel=rel)) + + def test_symmetry(self): + # Test that approx_equal(a, b) == approx_equal(b, a) + args = [-23, -2, 5, 107, 93568] + delta = 2 + for x in args: + for type_ in (int, float, Decimal, Fraction): + x = type_(x)*100 + y = x + delta + r = abs(delta/max(x, y)) + # There are five cases to check: + # 1) actual error <= tol, <= rel + self.do_symmetry_test(x, y, tol=delta, rel=r) + self.do_symmetry_test(x, y, tol=delta+1, rel=2*r) + # 2) actual error > tol, > rel + self.do_symmetry_test(x, y, tol=delta-1, rel=r/2) + # 3) actual error <= tol, > rel + self.do_symmetry_test(x, y, tol=delta, rel=r/2) + # 4) actual error > tol, <= rel + self.do_symmetry_test(x, y, tol=delta-1, rel=r) + self.do_symmetry_test(x, y, tol=delta-1, rel=2*r) + # 5) exact equality test + self.do_symmetry_test(x, x, tol=0, rel=0) + self.do_symmetry_test(x, y, tol=0, rel=0) + + def do_symmetry_test(self, a, b, tol, rel): + template = "approx_equal comparisons don't match for %r" + flag1 = approx_equal(a, b, tol, rel) + flag2 = approx_equal(b, a, tol, rel) + self.assertEqual(flag1, flag2, template.format((a, b, tol, rel))) + + +class ApproxEqualExactTest(unittest.TestCase): + # Test the approx_equal function with exactly equal values. + # Equal values should compare as approximately equal. + # Test cases for exactly equal values, which should compare approx + # equal regardless of the error tolerances given. + + def do_exactly_equal_test(self, x, tol, rel): + result = approx_equal(x, x, tol=tol, rel=rel) + self.assertTrue(result, 'equality failure for x=%r' % x) + result = approx_equal(-x, -x, tol=tol, rel=rel) + self.assertTrue(result, 'equality failure for x=%r' % -x) + + def test_exactly_equal_ints(self): + # Test that equal int values are exactly equal. + for n in [42, 19740, 14974, 230, 1795, 700245, 36587]: + self.do_exactly_equal_test(n, 0, 0) + + def test_exactly_equal_floats(self): + # Test that equal float values are exactly equal. + for x in [0.42, 1.9740, 1497.4, 23.0, 179.5, 70.0245, 36.587]: + self.do_exactly_equal_test(x, 0, 0) + + def test_exactly_equal_fractions(self): + # Test that equal Fraction values are exactly equal. + F = Fraction + for f in [F(1, 2), F(0), F(5, 3), F(9, 7), F(35, 36), F(3, 7)]: + self.do_exactly_equal_test(f, 0, 0) + + def test_exactly_equal_decimals(self): + # Test that equal Decimal values are exactly equal. + D = Decimal + for d in map(D, "8.2 31.274 912.04 16.745 1.2047".split()): + self.do_exactly_equal_test(d, 0, 0) + + def test_exactly_equal_absolute(self): + # Test that equal values are exactly equal with an absolute error. + for n in [16, 1013, 1372, 1198, 971, 4]: + # Test as ints. + self.do_exactly_equal_test(n, 0.01, 0) + # Test as floats. + self.do_exactly_equal_test(n/10, 0.01, 0) + # Test as Fractions. + f = Fraction(n, 1234) + self.do_exactly_equal_test(f, 0.01, 0) + + def test_exactly_equal_absolute_decimals(self): + # Test equal Decimal values are exactly equal with an absolute error. + self.do_exactly_equal_test(Decimal("3.571"), Decimal("0.01"), 0) + self.do_exactly_equal_test(-Decimal("81.3971"), Decimal("0.01"), 0) + + def test_exactly_equal_relative(self): + # Test that equal values are exactly equal with a relative error. 
+ for x in [8347, 101.3, -7910.28, Fraction(5, 21)]: + self.do_exactly_equal_test(x, 0, 0.01) + self.do_exactly_equal_test(Decimal("11.68"), 0, Decimal("0.01")) + + def test_exactly_equal_both(self): + # Test that equal values are equal when both tol and rel are given. + for x in [41017, 16.742, -813.02, Fraction(3, 8)]: + self.do_exactly_equal_test(x, 0.1, 0.01) + D = Decimal + self.do_exactly_equal_test(D("7.2"), D("0.1"), D("0.01")) + + +class ApproxEqualUnequalTest(unittest.TestCase): + # Unequal values should compare unequal with zero error tolerances. + # Test cases for unequal values, with exact equality test. + + def do_exactly_unequal_test(self, x): + for a in (x, -x): + result = approx_equal(a, a+1, tol=0, rel=0) + self.assertFalse(result, 'inequality failure for x=%r' % a) + + def test_exactly_unequal_ints(self): + # Test unequal int values are unequal with zero error tolerance. + for n in [951, 572305, 478, 917, 17240]: + self.do_exactly_unequal_test(n) + + def test_exactly_unequal_floats(self): + # Test unequal float values are unequal with zero error tolerance. + for x in [9.51, 5723.05, 47.8, 9.17, 17.24]: + self.do_exactly_unequal_test(x) + + def test_exactly_unequal_fractions(self): + # Test that unequal Fractions are unequal with zero error tolerance. + F = Fraction + for f in [F(1, 5), F(7, 9), F(12, 11), F(101, 99023)]: + self.do_exactly_unequal_test(f) + + def test_exactly_unequal_decimals(self): + # Test that unequal Decimals are unequal with zero error tolerance. + for d in map(Decimal, "3.1415 298.12 3.47 18.996 0.00245".split()): + self.do_exactly_unequal_test(d) + + +class ApproxEqualInexactTest(unittest.TestCase): + # Inexact test cases for approx_error. + # Test cases when comparing two values that are not exactly equal. + + # === Absolute error tests === + + def do_approx_equal_abs_test(self, x, delta): + template = "Test failure for x={!r}, y={!r}" + for y in (x + delta, x - delta): + msg = template.format(x, y) + self.assertTrue(approx_equal(x, y, tol=2*delta, rel=0), msg) + self.assertFalse(approx_equal(x, y, tol=delta/2, rel=0), msg) + + def test_approx_equal_absolute_ints(self): + # Test approximate equality of ints with an absolute error. + for n in [-10737, -1975, -7, -2, 0, 1, 9, 37, 423, 9874, 23789110]: + self.do_approx_equal_abs_test(n, 10) + self.do_approx_equal_abs_test(n, 2) + + def test_approx_equal_absolute_floats(self): + # Test approximate equality of floats with an absolute error. + for x in [-284.126, -97.1, -3.4, -2.15, 0.5, 1.0, 7.8, 4.23, 3817.4]: + self.do_approx_equal_abs_test(x, 1.5) + self.do_approx_equal_abs_test(x, 0.01) + self.do_approx_equal_abs_test(x, 0.0001) + + def test_approx_equal_absolute_fractions(self): + # Test approximate equality of Fractions with an absolute error. + delta = Fraction(1, 29) + numerators = [-84, -15, -2, -1, 0, 1, 5, 17, 23, 34, 71] + for f in (Fraction(n, 29) for n in numerators): + self.do_approx_equal_abs_test(f, delta) + self.do_approx_equal_abs_test(f, float(delta)) + + def test_approx_equal_absolute_decimals(self): + # Test approximate equality of Decimals with an absolute error. + delta = Decimal("0.01") + for d in map(Decimal, "1.0 3.5 36.08 61.79 7912.3648".split()): + self.do_approx_equal_abs_test(d, delta) + self.do_approx_equal_abs_test(-d, delta) + + def test_cross_zero(self): + # Test for the case of the two values having opposite signs. 
+ self.assertTrue(approx_equal(1e-5, -1e-5, tol=1e-4, rel=0)) + + # === Relative error tests === + + def do_approx_equal_rel_test(self, x, delta): + template = "Test failure for x={!r}, y={!r}" + for y in (x*(1+delta), x*(1-delta)): + msg = template.format(x, y) + self.assertTrue(approx_equal(x, y, tol=0, rel=2*delta), msg) + self.assertFalse(approx_equal(x, y, tol=0, rel=delta/2), msg) + + def test_approx_equal_relative_ints(self): + # Test approximate equality of ints with a relative error. + self.assertTrue(approx_equal(64, 47, tol=0, rel=0.36)) + self.assertTrue(approx_equal(64, 47, tol=0, rel=0.37)) + # --- + self.assertTrue(approx_equal(449, 512, tol=0, rel=0.125)) + self.assertTrue(approx_equal(448, 512, tol=0, rel=0.125)) + self.assertFalse(approx_equal(447, 512, tol=0, rel=0.125)) + + def test_approx_equal_relative_floats(self): + # Test approximate equality of floats with a relative error. + for x in [-178.34, -0.1, 0.1, 1.0, 36.97, 2847.136, 9145.074]: + self.do_approx_equal_rel_test(x, 0.02) + self.do_approx_equal_rel_test(x, 0.0001) + + def test_approx_equal_relative_fractions(self): + # Test approximate equality of Fractions with a relative error. + F = Fraction + delta = Fraction(3, 8) + for f in [F(3, 84), F(17, 30), F(49, 50), F(92, 85)]: + for d in (delta, float(delta)): + self.do_approx_equal_rel_test(f, d) + self.do_approx_equal_rel_test(-f, d) + + def test_approx_equal_relative_decimals(self): + # Test approximate equality of Decimals with a relative error. + for d in map(Decimal, "0.02 1.0 5.7 13.67 94.138 91027.9321".split()): + self.do_approx_equal_rel_test(d, Decimal("0.001")) + self.do_approx_equal_rel_test(-d, Decimal("0.05")) + + # === Both absolute and relative error tests === + + # There are four cases to consider: + # 1) actual error <= both absolute and relative error + # 2) actual error <= absolute error but > relative error + # 3) actual error <= relative error but > absolute error + # 4) actual error > both absolute and relative error + + def do_check_both(self, a, b, tol, rel, tol_flag, rel_flag): + check = self.assertTrue if tol_flag else self.assertFalse + check(approx_equal(a, b, tol=tol, rel=0)) + check = self.assertTrue if rel_flag else self.assertFalse + check(approx_equal(a, b, tol=0, rel=rel)) + check = self.assertTrue if (tol_flag or rel_flag) else self.assertFalse + check(approx_equal(a, b, tol=tol, rel=rel)) + + def test_approx_equal_both1(self): + # Test actual error <= both absolute and relative error. + self.do_check_both(7.955, 7.952, 0.004, 3.8e-4, True, True) + self.do_check_both(-7.387, -7.386, 0.002, 0.0002, True, True) + + def test_approx_equal_both2(self): + # Test actual error <= absolute error but > relative error. + self.do_check_both(7.955, 7.952, 0.004, 3.7e-4, True, False) + + def test_approx_equal_both3(self): + # Test actual error <= relative error but > absolute error. + self.do_check_both(7.955, 7.952, 0.001, 3.8e-4, False, True) + + def test_approx_equal_both4(self): + # Test actual error > both absolute and relative error. + self.do_check_both(2.78, 2.75, 0.01, 0.001, False, False) + self.do_check_both(971.44, 971.47, 0.02, 3e-5, False, False) + + +class ApproxEqualSpecialsTest(unittest.TestCase): + # Test approx_equal with NANs and INFs and zeroes. 
+ + def test_inf(self): + for type_ in (float, Decimal): + inf = type_('inf') + self.assertTrue(approx_equal(inf, inf)) + self.assertTrue(approx_equal(inf, inf, 0, 0)) + self.assertTrue(approx_equal(inf, inf, 1, 0.01)) + self.assertTrue(approx_equal(-inf, -inf)) + self.assertFalse(approx_equal(inf, -inf)) + self.assertFalse(approx_equal(inf, 1000)) + + def test_nan(self): + for type_ in (float, Decimal): + nan = type_('nan') + for other in (nan, type_('inf'), 1000): + self.assertFalse(approx_equal(nan, other)) + + def test_float_zeroes(self): + nzero = math.copysign(0.0, -1) + self.assertTrue(approx_equal(nzero, 0.0, tol=0.1, rel=0.1)) + + def test_decimal_zeroes(self): + nzero = Decimal("-0.0") + self.assertTrue(approx_equal(nzero, Decimal(0), tol=0.1, rel=0.1)) + + +class TestApproxEqualErrors(unittest.TestCase): + # Test error conditions of approx_equal. + + def test_bad_tol(self): + # Test negative tol raises. + self.assertRaises(ValueError, approx_equal, 100, 100, -1, 0.1) + + def test_bad_rel(self): + # Test negative rel raises. + self.assertRaises(ValueError, approx_equal, 100, 100, 1, -0.1) + + +# --- Tests for NumericTestCase --- + +# The formatting routine that generates the error messages is complex enough +# that it too needs testing. + +class TestNumericTestCase(unittest.TestCase): + # The exact wording of NumericTestCase error messages is *not* guaranteed, + # but we need to give them some sort of test to ensure that they are + # generated correctly. As a compromise, we look for specific substrings + # that are expected to be found even if the overall error message changes. + + def do_test(self, args): + actual_msg = NumericTestCase._make_std_err_msg(*args) + expected = self.generate_substrings(*args) + for substring in expected: + self.assertIn(substring, actual_msg) + + def test_numerictestcase_is_testcase(self): + # Ensure that NumericTestCase actually is a TestCase. + self.assertTrue(issubclass(NumericTestCase, unittest.TestCase)) + + def test_error_msg_numeric(self): + # Test the error message generated for numeric comparisons. + args = (2.5, 4.0, 0.5, 0.25, None) + self.do_test(args) + + def test_error_msg_sequence(self): + # Test the error message generated for sequence comparisons. + args = (3.75, 8.25, 1.25, 0.5, 7) + self.do_test(args) + + def generate_substrings(self, first, second, tol, rel, idx): + """Return substrings we expect to see in error messages.""" + abs_err, rel_err = _calc_errors(first, second) + substrings = [ + 'tol=%r' % tol, + 'rel=%r' % rel, + 'absolute error = %r' % abs_err, + 'relative error = %r' % rel_err, + ] + if idx is not None: + substrings.append('differ at index %d' % idx) + return substrings + + +# ======================================= +# === Tests for the statistics module === +# ======================================= + + +class GlobalsTest(unittest.TestCase): + module = statistics + expected_metadata = ["__doc__", "__all__"] + + def test_meta(self): + # Test for the existence of metadata. + for meta in self.expected_metadata: + self.assertTrue(hasattr(self.module, meta), + "%s not present" % meta) + + def test_check_all(self): + # Check everything in __all__ exists and is public. 
+ module = self.module + for name in module.__all__: + # No private names in __all__: + self.assertFalse(name.startswith("_"), + 'private name "%s" in __all__' % name) + # And anything in __all__ must exist: + self.assertTrue(hasattr(module, name), + 'missing name "%s" in __all__' % name) + + +class DocTests(unittest.TestCase): + def test_doc_tests(self): + failed, tried = doctest.testmod(statistics) + self.assertGreater(tried, 0) + self.assertEqual(failed, 0) + +class StatisticsErrorTest(unittest.TestCase): + def test_has_exception(self): + errmsg = ( + "Expected StatisticsError to be a ValueError, but got a" + " subclass of %r instead." + ) + self.assertTrue(hasattr(statistics, 'StatisticsError')) + self.assertTrue( + issubclass(statistics.StatisticsError, ValueError), + errmsg % statistics.StatisticsError.__base__ + ) + + +# === Tests for private utility functions === + +class ExactRatioTest(unittest.TestCase): + # Test _exact_ratio utility. + + def test_int(self): + for i in (-20, -3, 0, 5, 99, 10**20): + self.assertEqual(statistics._exact_ratio(i), (i, 1)) + + def test_fraction(self): + numerators = (-5, 1, 12, 38) + for n in numerators: + f = Fraction(n, 37) + self.assertEqual(statistics._exact_ratio(f), (n, 37)) + + def test_float(self): + self.assertEqual(statistics._exact_ratio(0.125), (1, 8)) + self.assertEqual(statistics._exact_ratio(1.125), (9, 8)) + data = [random.uniform(-100, 100) for _ in range(100)] + for x in data: + num, den = statistics._exact_ratio(x) + self.assertEqual(x, num/den) + + def test_decimal(self): + D = Decimal + _exact_ratio = statistics._exact_ratio + self.assertEqual(_exact_ratio(D("0.125")), (125, 1000)) + self.assertEqual(_exact_ratio(D("12.345")), (12345, 1000)) + self.assertEqual(_exact_ratio(D("-1.98")), (-198, 100)) + + +class DecimalToRatioTest(unittest.TestCase): + # Test _decimal_to_ratio private function. + + def testSpecialsRaise(self): + # Test that NANs and INFs raise ValueError. + # Non-special values are covered by _exact_ratio above. + for d in (Decimal('NAN'), Decimal('sNAN'), Decimal('INF')): + self.assertRaises(ValueError, statistics._decimal_to_ratio, d) + + + +# === Tests for public functions === + +class UnivariateCommonMixin: + # Common tests for most univariate functions that take a data argument. + + def test_no_args(self): + # Fail if given no arguments. + self.assertRaises(TypeError, self.func) + + def test_empty_data(self): + # Fail when the data argument (first argument) is empty. + for empty in ([], (), iter([])): + self.assertRaises(statistics.StatisticsError, self.func, empty) + + def prepare_data(self): + """Return int data for various tests.""" + data = list(range(10)) + while data == sorted(data): + random.shuffle(data) + return data + + def test_no_inplace_modifications(self): + # Test that the function does not modify its input data. + data = self.prepare_data() + assert len(data) != 1 # Necessary to avoid infinite loop. + assert data != sorted(data) + saved = data[:] + assert data is not saved + _ = self.func(data) + self.assertListEqual(data, saved, "data has been modified") + + def test_order_doesnt_matter(self): + # Test that the order of data points doesn't change the result. + + # CAUTION: due to floating point rounding errors, the result actually + # may depend on the order. Consider this test representing an ideal. + # To avoid this test failing, only test with exact values such as ints + # or Fractions. 
+        data = [1, 2, 3, 3, 3, 4, 5, 6]*100
+        expected = self.func(data)
+        random.shuffle(data)
+        actual = self.func(data)
+        self.assertEqual(expected, actual)
+
+    def test_type_of_data_collection(self):
+        # Test that the type of iterable data doesn't affect the result.
+        class MyList(list):
+            pass
+        class MyTuple(tuple):
+            pass
+        def generator(data):
+            return (obj for obj in data)
+        data = self.prepare_data()
+        expected = self.func(data)
+        for kind in (list, tuple, iter, MyList, MyTuple, generator):
+            result = self.func(kind(data))
+            self.assertEqual(result, expected)
+
+    def test_range_data(self):
+        # Test that functions work with range objects.
+        data = range(20, 50, 3)
+        expected = self.func(list(data))
+        self.assertEqual(self.func(data), expected)
+
+    def test_bad_arg_types(self):
+        # Test that function raises when given data of the wrong type.
+
+        # Don't roll the following into a loop like this:
+        #     for bad in list_of_bad:
+        #         self.check_for_type_error(bad)
+        #
+        # Since assertRaises doesn't show the arguments that caused the test
+        # failure, it is very difficult to debug these test failures when the
+        # following are in a loop.
+        self.check_for_type_error(None)
+        self.check_for_type_error(23)
+        self.check_for_type_error(42.0)
+        self.check_for_type_error(object())
+
+    def check_for_type_error(self, *args):
+        self.assertRaises(TypeError, self.func, *args)
+
+    def test_type_of_data_element(self):
+        # Check the type of data elements doesn't affect the numeric result.
+        # This is a weaker test than UnivariateTypeMixin.test_types_conserved,
+        # because it checks the numeric result by equality, but not by type.
+        class MyFloat(float):
+            def __truediv__(self, other):
+                return type(self)(super().__truediv__(other))
+            def __add__(self, other):
+                return type(self)(super().__add__(other))
+            __radd__ = __add__
+
+        raw = self.prepare_data()
+        expected = self.func(raw)
+        for kind in (float, MyFloat, Decimal, Fraction):
+            data = [kind(x) for x in raw]
+            result = type(expected)(self.func(data))
+            self.assertEqual(result, expected)
+
+
+class UnivariateTypeMixin:
+    """Mixin class for type-conserving functions.
+
+    This mixin class holds test(s) for functions which conserve the type of
+    individual data points. E.g. the mean of a list of Fractions should itself
+    be a Fraction.
+
+    Not all tests to do with types need go in this class. Only those that
+    rely on the function returning the same type as its input data.
+    """
+    def test_types_conserved(self):
+        # Test that functions keep the same type as their data points.
+        # (Excludes mixed data types.) This only tests the type of the return
+        # result, not the value.
+        class MyFloat(float):
+            def __truediv__(self, other):
+                return type(self)(super().__truediv__(other))
+            def __sub__(self, other):
+                return type(self)(super().__sub__(other))
+            def __rsub__(self, other):
+                return type(self)(super().__rsub__(other))
+            def __pow__(self, other):
+                return type(self)(super().__pow__(other))
+            def __add__(self, other):
+                return type(self)(super().__add__(other))
+            __radd__ = __add__
+
+        data = self.prepare_data()
+        for kind in (float, Decimal, Fraction, MyFloat):
+            d = [kind(x) for x in data]
+            result = self.func(d)
+            self.assertIs(type(result), kind)
+
+
+class TestSum(NumericTestCase, UnivariateCommonMixin, UnivariateTypeMixin):
+    # Test cases for statistics._sum() function.
+
+    def setUp(self):
+        self.func = statistics._sum
+
+    def test_empty_data(self):
+        # Override test for empty data.
+ for data in ([], (), iter([])): + self.assertEqual(self.func(data), 0) + self.assertEqual(self.func(data, 23), 23) + self.assertEqual(self.func(data, 2.3), 2.3) + + def test_ints(self): + self.assertEqual(self.func([1, 5, 3, -4, -8, 20, 42, 1]), 60) + self.assertEqual(self.func([4, 2, 3, -8, 7], 1000), 1008) + + def test_floats(self): + self.assertEqual(self.func([0.25]*20), 5.0) + self.assertEqual(self.func([0.125, 0.25, 0.5, 0.75], 1.5), 3.125) + + def test_fractions(self): + F = Fraction + self.assertEqual(self.func([Fraction(1, 1000)]*500), Fraction(1, 2)) + + def test_decimals(self): + D = Decimal + data = [D("0.001"), D("5.246"), D("1.702"), D("-0.025"), + D("3.974"), D("2.328"), D("4.617"), D("2.843"), + ] + self.assertEqual(self.func(data), Decimal("20.686")) + + def test_compare_with_math_fsum(self): + # Compare with the math.fsum function. + # Ideally we ought to get the exact same result, but sometimes + # we differ by a very slight amount :-( + data = [random.uniform(-100, 1000) for _ in range(1000)] + self.assertApproxEqual(self.func(data), math.fsum(data), rel=2e-16) + + def test_start_argument(self): + # Test that the optional start argument works correctly. + data = [random.uniform(1, 1000) for _ in range(100)] + t = self.func(data) + self.assertEqual(t+42, self.func(data, 42)) + self.assertEqual(t-23, self.func(data, -23)) + self.assertEqual(t+1e20, self.func(data, 1e20)) + + def test_strings_fail(self): + # Sum of strings should fail. + self.assertRaises(TypeError, self.func, [1, 2, 3], '999') + self.assertRaises(TypeError, self.func, [1, 2, 3, '999']) + + def test_bytes_fail(self): + # Sum of bytes should fail. + self.assertRaises(TypeError, self.func, [1, 2, 3], b'999') + self.assertRaises(TypeError, self.func, [1, 2, 3, b'999']) + + def test_mixed_sum(self): + # Mixed sums are allowed. + + # Careful here: order matters. Can't mix Fraction and Decimal directly, + # only after they're converted to float. + data = [1, 2, Fraction(1, 2), 3.0, Decimal("0.25")] + self.assertEqual(self.func(data), 6.75) + + +class SumInternalsTest(NumericTestCase): + # Test internals of the sum function. + + def test_ignore_instance_float_method(self): + # Test that __float__ methods on data instances are ignored. + + # Python typically calls __dunder__ methods on the class, not the + # instance. The ``sum`` implementation calls __float__ directly. To + # better match the behaviour of Python, we call it only on the class, + # not the instance. This test will fail if somebody "fixes" that code. + + # Create a fake __float__ method. + def __float__(self): + raise AssertionError('test fails') + + # Inject it into an instance. + class MyNumber(Fraction): + pass + x = MyNumber(3) + x.__float__ = types.MethodType(__float__, x) + + # Check it works as expected. + self.assertRaises(AssertionError, x.__float__) + self.assertEqual(float(x), 3.0) + # And now test the function. + self.assertEqual(statistics._sum([1.0, 2.0, x, 4.0]), 10.0) + + +class SumTortureTest(NumericTestCase): + def test_torture(self): + # Tim Peters' torture test for sum, and variants of same. + self.assertEqual(statistics._sum([1, 1e100, 1, -1e100]*10000), 20000.0) + self.assertEqual(statistics._sum([1e100, 1, 1, -1e100]*10000), 20000.0) + self.assertApproxEqual( + statistics._sum([1e-100, 1, 1e-100, -1]*10000), 2.0e-96, rel=5e-16 + ) + + +class SumSpecialValues(NumericTestCase): + # Test that sum works correctly with IEEE-754 special values. 
+
+    def test_nan(self):
+        for type_ in (float, Decimal):
+            nan = type_('nan')
+            result = statistics._sum([1, nan, 2])
+            self.assertIs(type(result), type_)
+            self.assertTrue(math.isnan(result))
+
+    def check_infinity(self, x, inf):
+        """Check x is an infinity of the same type and sign as inf."""
+        self.assertTrue(math.isinf(x))
+        self.assertIs(type(x), type(inf))
+        self.assertEqual(x > 0, inf > 0)
+        assert x == inf
+
+    def do_test_inf(self, inf):
+        # Adding a single infinity gives infinity.
+        result = statistics._sum([1, 2, inf, 3])
+        self.check_infinity(result, inf)
+        # Adding two infinities of the same sign also gives infinity.
+        result = statistics._sum([1, 2, inf, 3, inf, 4])
+        self.check_infinity(result, inf)
+
+    def test_float_inf(self):
+        inf = float('inf')
+        for sign in (+1, -1):
+            self.do_test_inf(sign*inf)
+
+    def test_decimal_inf(self):
+        inf = Decimal('inf')
+        for sign in (+1, -1):
+            self.do_test_inf(sign*inf)
+
+    def test_float_mismatched_infs(self):
+        # Test that adding two infinities of opposite sign gives a NAN.
+        inf = float('inf')
+        result = statistics._sum([1, 2, inf, 3, -inf, 4])
+        self.assertTrue(math.isnan(result))
+
+    def test_decimal_mismatched_infs_to_nan(self):
+        # Test adding Decimal INFs with opposite sign returns NAN.
+        inf = Decimal('inf')
+        data = [1, 2, inf, 3, -inf, 4]
+        with decimal.localcontext(decimal.ExtendedContext):
+            self.assertTrue(math.isnan(statistics._sum(data)))
+
+    def test_decimal_mismatched_infs_raises_invalidoperation(self):
+        # Test adding Decimal INFs with opposite sign raises InvalidOperation.
+        inf = Decimal('inf')
+        data = [1, 2, inf, 3, -inf, 4]
+        with decimal.localcontext(decimal.BasicContext):
+            self.assertRaises(decimal.InvalidOperation, statistics._sum, data)
+
+    def test_decimal_snan_raises(self):
+        # Adding sNAN should raise InvalidOperation.
+        sNAN = Decimal('sNAN')
+        data = [1, sNAN, 2]
+        self.assertRaises(decimal.InvalidOperation, statistics._sum, data)
+
+
+# === Tests for averages ===
+
+class AverageMixin(UnivariateCommonMixin):
+    # Mixin class holding common tests for averages.
+
+    def test_single_value(self):
+        # Average of a single value is the value itself.
+        for x in (23, 42.5, 1.3e15, Fraction(15, 19), Decimal('0.28')):
+            self.assertEqual(self.func([x]), x)
+
+    def test_repeated_single_value(self):
+        # The average of a single repeated value is the value itself.
+        for x in (3.5, 17, 2.5e15, Fraction(61, 67), Decimal('4.9712')):
+            for count in (2, 5, 10, 20):
+                data = [x]*count
+                self.assertEqual(self.func(data), x)
+
+
+class TestMean(NumericTestCase, AverageMixin, UnivariateTypeMixin):
+    def setUp(self):
+        self.func = statistics.mean
+
+    def test_torture_pep(self):
+        # "Torture Test" from PEP-450.
+        self.assertEqual(self.func([1e100, 1, 3, -1e100]), 1)
+
+    def test_ints(self):
+        # Test mean with ints.
+        data = [0, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 7, 8, 9]
+        random.shuffle(data)
+        self.assertEqual(self.func(data), 4.8125)
+
+    def test_floats(self):
+        # Test mean with floats.
+        data = [17.25, 19.75, 20.0, 21.5, 21.75, 23.25, 25.125, 27.5]
+        random.shuffle(data)
+        self.assertEqual(self.func(data), 22.015625)
+
+    def test_decimals(self):
+        # Test mean with Decimals.
+        D = Decimal
+        data = [D("1.634"), D("2.517"), D("3.912"), D("4.072"), D("5.813")]
+        random.shuffle(data)
+        self.assertEqual(self.func(data), D("3.5896"))
+
+    def test_fractions(self):
+        # Test mean with Fractions.
+ F = Fraction + data = [F(1, 2), F(2, 3), F(3, 4), F(4, 5), F(5, 6), F(6, 7), F(7, 8)] + random.shuffle(data) + self.assertEqual(self.func(data), F(1479, 1960)) + + def test_inf(self): + # Test mean with infinities. + raw = [1, 3, 5, 7, 9] # Use only ints, to avoid TypeError later. + for kind in (float, Decimal): + for sign in (1, -1): + inf = kind("inf")*sign + data = raw + [inf] + result = self.func(data) + self.assertTrue(math.isinf(result)) + self.assertEqual(result, inf) + + def test_mismatched_infs(self): + # Test mean with infinities of opposite sign. + data = [2, 4, 6, float('inf'), 1, 3, 5, float('-inf')] + result = self.func(data) + self.assertTrue(math.isnan(result)) + + def test_nan(self): + # Test mean with NANs. + raw = [1, 3, 5, 7, 9] # Use only ints, to avoid TypeError later. + for kind in (float, Decimal): + inf = kind("nan") + data = raw + [inf] + result = self.func(data) + self.assertTrue(math.isnan(result)) + + def test_big_data(self): + # Test adding a large constant to every data point. + c = 1e9 + data = [3.4, 4.5, 4.9, 6.7, 6.8, 7.2, 8.0, 8.1, 9.4] + expected = self.func(data) + c + assert expected != c + result = self.func([x+c for x in data]) + self.assertEqual(result, expected) + + def test_doubled_data(self): + # Mean of [a,b,c...z] should be same as for [a,a,b,b,c,c...z,z]. + data = [random.uniform(-3, 5) for _ in range(1000)] + expected = self.func(data) + actual = self.func(data*2) + self.assertApproxEqual(actual, expected) + + +class TestMedian(NumericTestCase, AverageMixin): + # Common tests for median and all median.* functions. + def setUp(self): + self.func = statistics.median + + def prepare_data(self): + """Overload method from UnivariateCommonMixin.""" + data = super().prepare_data() + if len(data)%2 != 1: + data.append(2) + return data + + def test_even_ints(self): + # Test median with an even number of int data points. + data = [1, 2, 3, 4, 5, 6] + assert len(data)%2 == 0 + self.assertEqual(self.func(data), 3.5) + + def test_odd_ints(self): + # Test median with an odd number of int data points. + data = [1, 2, 3, 4, 5, 6, 9] + assert len(data)%2 == 1 + self.assertEqual(self.func(data), 4) + + def test_odd_fractions(self): + # Test median works with an odd number of Fractions. + F = Fraction + data = [F(1, 7), F(2, 7), F(3, 7), F(4, 7), F(5, 7)] + assert len(data)%2 == 1 + random.shuffle(data) + self.assertEqual(self.func(data), F(3, 7)) + + def test_even_fractions(self): + # Test median works with an even number of Fractions. + F = Fraction + data = [F(1, 7), F(2, 7), F(3, 7), F(4, 7), F(5, 7), F(6, 7)] + assert len(data)%2 == 0 + random.shuffle(data) + self.assertEqual(self.func(data), F(1, 2)) + + def test_odd_decimals(self): + # Test median works with an odd number of Decimals. + D = Decimal + data = [D('2.5'), D('3.1'), D('4.2'), D('5.7'), D('5.8')] + assert len(data)%2 == 1 + random.shuffle(data) + self.assertEqual(self.func(data), D('4.2')) + + def test_even_decimals(self): + # Test median works with an even number of Decimals. + D = Decimal + data = [D('1.2'), D('2.5'), D('3.1'), D('4.2'), D('5.7'), D('5.8')] + assert len(data)%2 == 0 + random.shuffle(data) + self.assertEqual(self.func(data), D('3.65')) + + +class TestMedianDataType(NumericTestCase, UnivariateTypeMixin): + # Test conservation of data element type for median. 
+    def setUp(self):
+        self.func = statistics.median
+
+    def prepare_data(self):
+        data = list(range(15))
+        assert len(data)%2 == 1
+        while data == sorted(data):
+            random.shuffle(data)
+        return data
+
+
+class TestMedianLow(TestMedian, UnivariateTypeMixin):
+    def setUp(self):
+        self.func = statistics.median_low
+
+    def test_even_ints(self):
+        # Test median_low with an even number of ints.
+        data = [1, 2, 3, 4, 5, 6]
+        assert len(data)%2 == 0
+        self.assertEqual(self.func(data), 3)
+
+    def test_even_fractions(self):
+        # Test median_low works with an even number of Fractions.
+        F = Fraction
+        data = [F(1, 7), F(2, 7), F(3, 7), F(4, 7), F(5, 7), F(6, 7)]
+        assert len(data)%2 == 0
+        random.shuffle(data)
+        self.assertEqual(self.func(data), F(3, 7))
+
+    def test_even_decimals(self):
+        # Test median_low works with an even number of Decimals.
+        D = Decimal
+        data = [D('1.1'), D('2.2'), D('3.3'), D('4.4'), D('5.5'), D('6.6')]
+        assert len(data)%2 == 0
+        random.shuffle(data)
+        self.assertEqual(self.func(data), D('3.3'))
+
+
+class TestMedianHigh(TestMedian, UnivariateTypeMixin):
+    def setUp(self):
+        self.func = statistics.median_high
+
+    def test_even_ints(self):
+        # Test median_high with an even number of ints.
+        data = [1, 2, 3, 4, 5, 6]
+        assert len(data)%2 == 0
+        self.assertEqual(self.func(data), 4)
+
+    def test_even_fractions(self):
+        # Test median_high works with an even number of Fractions.
+        F = Fraction
+        data = [F(1, 7), F(2, 7), F(3, 7), F(4, 7), F(5, 7), F(6, 7)]
+        assert len(data)%2 == 0
+        random.shuffle(data)
+        self.assertEqual(self.func(data), F(4, 7))
+
+    def test_even_decimals(self):
+        # Test median_high works with an even number of Decimals.
+        D = Decimal
+        data = [D('1.1'), D('2.2'), D('3.3'), D('4.4'), D('5.5'), D('6.6')]
+        assert len(data)%2 == 0
+        random.shuffle(data)
+        self.assertEqual(self.func(data), D('4.4'))
+
+
+class TestMedianGrouped(TestMedian):
+    # Test median_grouped.
+    # Doesn't conserve data element types, so don't use UnivariateTypeMixin.
+    def setUp(self):
+        self.func = statistics.median_grouped
+
+    def test_odd_number_repeated(self):
+        # Test median_grouped with repeated median values.
+        data = [12, 13, 14, 14, 14, 15, 15]
+        assert len(data)%2 == 1
+        self.assertEqual(self.func(data), 14)
+        #---
+        data = [12, 13, 14, 14, 14, 14, 15]
+        assert len(data)%2 == 1
+        self.assertEqual(self.func(data), 13.875)
+        #---
+        data = [5, 10, 10, 15, 20, 20, 20, 20, 25, 25, 30]
+        assert len(data)%2 == 1
+        self.assertEqual(self.func(data, 5), 19.375)
+        #---
+        data = [16, 18, 18, 18, 18, 20, 20, 20, 22, 22, 22, 24, 24, 26, 28]
+        assert len(data)%2 == 1
+        self.assertApproxEqual(self.func(data, 2), 20.66666667, tol=1e-8)
+
+    def test_even_number_repeated(self):
+        # Test median_grouped with repeated median values.
+        data = [5, 10, 10, 15, 20, 20, 20, 25, 25, 30]
+        assert len(data)%2 == 0
+        self.assertApproxEqual(self.func(data, 5), 19.16666667, tol=1e-8)
+        #---
+        data = [2, 3, 4, 4, 4, 5]
+        assert len(data)%2 == 0
+        self.assertApproxEqual(self.func(data), 3.83333333, tol=1e-8)
+        #---
+        data = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6]
+        assert len(data)%2 == 0
+        self.assertEqual(self.func(data), 4.5)
+        #---
+        data = [3, 4, 4, 4, 5, 5, 5, 5, 6, 6]
+        assert len(data)%2 == 0
+        self.assertEqual(self.func(data), 4.75)
+
+    def test_repeated_single_value(self):
+        # Override method from AverageMixin.
+ # Yet again, failure of median_grouped to conserve the data type + # causes me headaches :-( + for x in (5.3, 68, 4.3e17, Fraction(29, 101), Decimal('32.9714')): + for count in (2, 5, 10, 20): + data = [x]*count + self.assertEqual(self.func(data), float(x)) + + def test_odd_fractions(self): + # Test median_grouped works with an odd number of Fractions. + F = Fraction + data = [F(5, 4), F(9, 4), F(13, 4), F(13, 4), F(17, 4)] + assert len(data)%2 == 1 + random.shuffle(data) + self.assertEqual(self.func(data), 3.0) + + def test_even_fractions(self): + # Test median_grouped works with an even number of Fractions. + F = Fraction + data = [F(5, 4), F(9, 4), F(13, 4), F(13, 4), F(17, 4), F(17, 4)] + assert len(data)%2 == 0 + random.shuffle(data) + self.assertEqual(self.func(data), 3.25) + + def test_odd_decimals(self): + # Test median_grouped works with an odd number of Decimals. + D = Decimal + data = [D('5.5'), D('6.5'), D('6.5'), D('7.5'), D('8.5')] + assert len(data)%2 == 1 + random.shuffle(data) + self.assertEqual(self.func(data), 6.75) + + def test_even_decimals(self): + # Test median_grouped works with an even number of Decimals. + D = Decimal + data = [D('5.5'), D('5.5'), D('6.5'), D('6.5'), D('7.5'), D('8.5')] + assert len(data)%2 == 0 + random.shuffle(data) + self.assertEqual(self.func(data), 6.5) + #--- + data = [D('5.5'), D('5.5'), D('6.5'), D('7.5'), D('7.5'), D('8.5')] + assert len(data)%2 == 0 + random.shuffle(data) + self.assertEqual(self.func(data), 7.0) + + def test_interval(self): + # Test median_grouped with interval argument. + data = [2.25, 2.5, 2.5, 2.75, 2.75, 3.0, 3.0, 3.25, 3.5, 3.75] + self.assertEqual(self.func(data, 0.25), 2.875) + data = [2.25, 2.5, 2.5, 2.75, 2.75, 2.75, 3.0, 3.0, 3.25, 3.5, 3.75] + self.assertApproxEqual(self.func(data, 0.25), 2.83333333, tol=1e-8) + data = [220, 220, 240, 260, 260, 260, 260, 280, 280, 300, 320, 340] + self.assertEqual(self.func(data, 20), 265.0) + + +class TestMode(NumericTestCase, AverageMixin, UnivariateTypeMixin): + # Test cases for the discrete version of mode. + def setUp(self): + self.func = statistics.mode + + def prepare_data(self): + """Overload method from UnivariateCommonMixin.""" + # Make sure test data has exactly one mode. + return [1, 1, 1, 1, 3, 4, 7, 9, 0, 8, 2] + + def test_range_data(self): + # Override test from UnivariateCommonMixin. + data = range(20, 50, 3) + self.assertRaises(statistics.StatisticsError, self.func, data) + + def test_nominal_data(self): + # Test mode with nominal data. + data = 'abcbdb' + self.assertEqual(self.func(data), 'b') + data = 'fe fi fo fum fi fi'.split() + self.assertEqual(self.func(data), 'fi') + + def test_discrete_data(self): + # Test mode with discrete numeric data. + data = list(range(10)) + for i in range(10): + d = data + [i] + random.shuffle(d) + self.assertEqual(self.func(d), i) + + def test_bimodal_data(self): + # Test mode with bimodal data. + data = [1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8, 9, 9] + assert data.count(2) == data.count(6) == 4 + # Check for an exception. + self.assertRaises(statistics.StatisticsError, self.func, data) + + def test_unique_data_failure(self): + # Test mode exception when data points are all unique. + data = list(range(10)) + self.assertRaises(statistics.StatisticsError, self.func, data) + + def test_none_data(self): + # Test that mode raises TypeError if given None as data. + + # This test is necessary because the implementation of mode uses + # collections.Counter, which accepts None and returns an empty dict. 
+        self.assertRaises(TypeError, self.func, None)
+
+
+# === Tests for variances and standard deviations ===
+
+class VarianceStdevMixin(UnivariateCommonMixin):
+    # Mixin class holding common tests for variance and std dev.
+
+    # Subclasses should inherit from this before NumericTestCase, in order
+    # to see the rel attribute below. See test_shift_data for an explanation.
+
+    rel = 1e-12
+
+    def test_single_value(self):
+        # Deviation of a single value is zero.
+        for x in (11, 19.8, 4.6e14, Fraction(21, 34), Decimal('8.392')):
+            self.assertEqual(self.func([x]), 0)
+
+    def test_repeated_single_value(self):
+        # The deviation of a single repeated value is zero.
+        for x in (7.2, 49, 8.1e15, Fraction(3, 7), Decimal('62.4802')):
+            for count in (2, 3, 5, 15):
+                data = [x]*count
+                self.assertEqual(self.func(data), 0)
+
+    def test_domain_error_regression(self):
+        # Regression test for a domain error exception.
+        # (Thanks to Geremy Condra.)
+        data = [0.123456789012345]*10000
+        # All the items are identical, so variance should be exactly zero.
+        # We allow some small round-off error, but not much.
+        result = self.func(data)
+        self.assertApproxEqual(result, 0.0, tol=5e-17)
+        self.assertGreaterEqual(result, 0)  # A negative result must fail.
+
+    def test_shift_data(self):
+        # Test that shifting the data by a constant amount does not affect
+        # the variance or stdev. Or at least not much.
+
+        # Due to rounding, this test should be considered an ideal. We allow
+        # some tolerance away from "no change at all" by setting tol and/or rel
+        # attributes. Subclasses may set tighter or looser error tolerances.
+        raw = [1.03, 1.27, 1.94, 2.04, 2.58, 3.14, 4.75, 4.98, 5.42, 6.78]
+        expected = self.func(raw)
+        # Don't set shift too high: the bigger it is, the more rounding error.
+        shift = 1e5
+        data = [x + shift for x in raw]
+        self.assertApproxEqual(self.func(data), expected)
+
+    def test_shift_data_exact(self):
+        # Like test_shift_data, but result is always exact.
+        raw = [1, 3, 3, 4, 5, 7, 9, 10, 11, 16]
+        assert all(x==int(x) for x in raw)
+        expected = self.func(raw)
+        shift = 10**9
+        data = [x + shift for x in raw]
+        self.assertEqual(self.func(data), expected)
+
+    def test_iter_list_same(self):
+        # Test that iter data and list data give the same result.
+
+        # This is an explicit test that iterators and lists are treated the
+        # same; justification for this test over and above the similar test
+        # in UnivariateCommonMixin is that an earlier design had variance and
+        # friends swap between one- and two-pass algorithms, which would
+        # sometimes give different results.
+        data = [random.uniform(-3, 8) for _ in range(1000)]
+        expected = self.func(data)
+        self.assertEqual(self.func(iter(data)), expected)
+
+
+class TestPVariance(VarianceStdevMixin, NumericTestCase, UnivariateTypeMixin):
+    # Tests for population variance.
+    def setUp(self):
+        self.func = statistics.pvariance
+
+    def test_exact_uniform(self):
+        # Test the variance against an exact result for uniform data.
+        data = list(range(10000))
+        random.shuffle(data)
+        expected = (10000**2 - 1)/12  # Exact value.
+        self.assertEqual(self.func(data), expected)
+
+    def test_ints(self):
+        # Test population variance with int data.
+        data = [4, 7, 13, 16]
+        exact = 22.5
+        self.assertEqual(self.func(data), exact)
+
+    def test_fractions(self):
+        # Test population variance with Fraction data.
+ F = Fraction + data = [F(1, 4), F(1, 4), F(3, 4), F(7, 4)] + exact = F(3, 8) + result = self.func(data) + self.assertEqual(result, exact) + self.assertIsInstance(result, Fraction) + + def test_decimals(self): + # Test population variance with Decimal data. + D = Decimal + data = [D("12.1"), D("12.2"), D("12.5"), D("12.9")] + exact = D('0.096875') + result = self.func(data) + self.assertEqual(result, exact) + self.assertIsInstance(result, Decimal) + + +class TestVariance(VarianceStdevMixin, NumericTestCase, UnivariateTypeMixin): + # Tests for sample variance. + def setUp(self): + self.func = statistics.variance + + def test_single_value(self): + # Override method from VarianceStdevMixin. + for x in (35, 24.7, 8.2e15, Fraction(19, 30), Decimal('4.2084')): + self.assertRaises(statistics.StatisticsError, self.func, [x]) + + def test_ints(self): + # Test sample variance with int data. + data = [4, 7, 13, 16] + exact = 30 + self.assertEqual(self.func(data), exact) + + def test_fractions(self): + # Test sample variance with Fraction data. + F = Fraction + data = [F(1, 4), F(1, 4), F(3, 4), F(7, 4)] + exact = F(1, 2) + result = self.func(data) + self.assertEqual(result, exact) + self.assertIsInstance(result, Fraction) + + def test_decimals(self): + # Test sample variance with Decimal data. + D = Decimal + data = [D(2), D(2), D(7), D(9)] + exact = 4*D('9.5')/D(3) + result = self.func(data) + self.assertEqual(result, exact) + self.assertIsInstance(result, Decimal) + + +class TestPStdev(VarianceStdevMixin, NumericTestCase): + # Tests for population standard deviation. + def setUp(self): + self.func = statistics.pstdev + + def test_compare_to_variance(self): + # Test that stdev is, in fact, the square root of variance. + data = [random.uniform(-17, 24) for _ in range(1000)] + expected = math.sqrt(statistics.pvariance(data)) + self.assertEqual(self.func(data), expected) + + +class TestStdev(VarianceStdevMixin, NumericTestCase): + # Tests for sample standard deviation. + def setUp(self): + self.func = statistics.stdev + + def test_single_value(self): + # Override method from VarianceStdevMixin. + for x in (81, 203.74, 3.9e14, Fraction(5, 21), Decimal('35.719')): + self.assertRaises(statistics.StatisticsError, self.func, [x]) + + def test_compare_to_variance(self): + # Test that stdev is, in fact, the square root of variance. + data = [random.uniform(-2, 9) for _ in range(1000)] + expected = math.sqrt(statistics.variance(data)) + self.assertEqual(self.func(data), expected) + + +# === Run tests === + +def load_tests(loader, tests, ignore): + """Used for doctest/unittest integration.""" + tests.addTests(doctest.DocTestSuite()) + return tests + + +if __name__ == "__main__": + unittest.main() @@ -59,6 +59,9 @@ Core and Builtins Library ------- +- Issue #18606: Add the new "statistics" module (PEP 450). Contributed + by Steven D'Aprano. + - Issue #12866: The audioop module now supports 24-bit samples. - Issue #19254: Provide an optimized Python implementation of pbkdf2_hmac. |
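
The torture tests for the internal _sum function come down to one property: intermediate rounding must not lose small terms that sit next to huge ones. A quick way to see why SumTortureTest expects 20000.0 is to redo the sum in exact rational arithmetic. The snippet below is only a sketch of the property being tested, not how statistics._sum is implemented:

    from fractions import Fraction

    data = [1, 1e100, 1, -1e100] * 10000
    print(sum(data))                        # 0.0 -- the naive float sum drops every 1
    print(float(sum(map(Fraction, data))))  # 20000.0, the value the torture test asserts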
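The expected values in TestMedianGrouped follow the usual grouped-frequency interpolation L + interval*(n/2 - cf)/f, where L is the lower limit of the class containing the middle value, cf the number of points below that class, and f the number inside it. A minimal sketch of that formula (an illustration, not the module's implementation) reproduces two of the test cases:

    def grouped_median_sketch(data, interval=1):
        # Interpolate within the class that holds the middle data point.
        data = sorted(data)
        x = data[len(data) // 2]       # representative of the median class
        L = x - interval / 2           # lower class limit
        cf = data.index(x)             # points below the median class
        f = data.count(x)              # points inside the median class
        return L + interval * (len(data) / 2 - cf) / f

    grouped_median_sketch([2, 3, 4, 4, 4, 5])                              # 3.8333...
    grouped_median_sketch([5, 10, 10, 15, 20, 20, 20, 20, 25, 25, 30], 5)  # 19.375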
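UnivariateTypeMixin.test_types_conserved checks a simple user-visible property: the averages hand back the same numeric type they are fed. Assuming the new module is importable, a couple of interactive spot checks in the spirit of those tests:

    from decimal import Decimal
    from fractions import Fraction
    import statistics

    # The mean of Fractions stays a Fraction; the median of Decimals stays a Decimal.
    assert type(statistics.mean([Fraction(1, 4), Fraction(3, 4)])) is Fraction
    assert type(statistics.median([Decimal('1.1'), Decimal('2.2'), Decimal('3.3')])) is Decimal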