author     Raymond Hettinger <rhettinger@users.noreply.github.com>   2019-02-23 22:44:07 (GMT)
committer  GitHub <noreply@github.com>                               2019-02-23 22:44:07 (GMT)
commit     11c79531655a4aa3f82c20ff562ac571f40040cc (patch)
tree       6af6cf3108204156c7b66022044514d75fca134e /Doc
parent     64d6cc826dacebc2493b1bb5e8cb97828eb76f81 (diff)
bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
Diffstat (limited to 'Doc')
-rw-r--r--   Doc/library/statistics.rst   195
-rw-r--r--   Doc/whatsnew/3.8.rst          26
2 files changed, 221 insertions, 0 deletions
diff --git a/Doc/library/statistics.rst b/Doc/library/statistics.rst
index 20a2c1c..c1be295 100644
--- a/Doc/library/statistics.rst
+++ b/Doc/library/statistics.rst
@@ -467,6 +467,201 @@ A single exception is defined:
 
    Subclass of :exc:`ValueError` for statistics-related exceptions.
 
+
+:class:`NormalDist` objects
+===========================
+
+A :class:`NormalDist` is a composite class that treats the mean and standard
+deviation of data measurements as a single entity. It is a tool for creating
+and manipulating normal distributions of a random variable.
+
+Normal distributions arise from the `Central Limit Theorem
+<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
+of applications in statistics, including simulations and hypothesis testing.
+
+.. class:: NormalDist(mu=0.0, sigma=1.0)
+
+   Returns a new *NormalDist* object where *mu* represents the `arithmetic
+   mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of data and *sigma*
+   represents the `standard deviation
+   <https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.
+
+   If *sigma* is negative, raises :exc:`StatisticsError`.
+
+   .. attribute:: mu
+
+      The mean of a normal distribution.
+
+   .. attribute:: sigma
+
+      The standard deviation of a normal distribution.
+
+   .. attribute:: variance
+
+      A read-only property representing the `variance
+      <https://en.wikipedia.org/wiki/Variance>`_ of a normal
+      distribution. Equal to the square of the standard deviation.
+
+   .. classmethod:: NormalDist.from_samples(data)
+
+      Class method that makes a normal distribution instance
+      from sample data. The *data* can be any :term:`iterable`
+      and should consist of values that can be converted to type
+      :class:`float`.
+
+      If *data* does not contain at least two elements, raises
+      :exc:`StatisticsError` because it takes at least one point to estimate
+      a central value and at least two points to estimate dispersion.
+
+   .. method:: NormalDist.samples(n, seed=None)
+
+      Generates *n* random samples for a given mean and standard deviation.
+      Returns a :class:`list` of :class:`float` values.
+
+      If *seed* is given, creates a new instance of the underlying random
+      number generator. This is useful for creating reproducible results,
+      even in a multi-threading context.
+
+   .. method:: NormalDist.pdf(x)
+
+      Using a `probability density function (pdf)
+      <https://en.wikipedia.org/wiki/Probability_density_function>`_,
+      compute the relative likelihood that a random sample *X* will be near
+      the given value *x*. Mathematically, it is the ratio ``P(x <= X <
+      x+dx) / dx``.
+
+      Note that the relative likelihood of *x* can be greater than `1.0`. The
+      probability for a specific point on a continuous distribution is `0.0`,
+      so the :func:`pdf` is used instead. It gives the probability of a
+      sample occurring in a narrow range around *x*, divided by the width
+      of that range (hence the word "density").
+
+   .. method:: NormalDist.cdf(x)
+
+      Using a `cumulative distribution function (cdf)
+      <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
+      compute the probability that a random sample *X* will be less than or
+      equal to *x*. Mathematically, it is written ``P(X <= x)``.
+
+   Instances of :class:`NormalDist` support addition, subtraction,
+   multiplication and division by a constant. These operations
+   are used for translation and scaling. For example:
+
+   .. doctest::
+
+      >>> temperature_february = NormalDist(5, 2.5)      # Celsius
+      >>> temperature_february * (9/5) + 32              # Fahrenheit
+      NormalDist(mu=41.0, sigma=4.5)
+
+   Dividing a constant by an instance of :class:`NormalDist` is not supported.
+
+   Since normal distributions arise from additive effects of independent
+   variables, it is possible to `add and subtract two normally distributed
+   random variables
+   <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
+   represented as instances of :class:`NormalDist`. For example:
+
+   .. doctest::
+
+      >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
+      >>> drug_effects = NormalDist(0.4, 0.15)
+      >>> combined = birth_weights + drug_effects
+      >>> f'mu={combined.mu :.1f} sigma={combined.sigma :.1f}'
+      'mu=3.1 sigma=0.5'
+
+   .. versionadded:: 3.8
+
+
+:class:`NormalDist` Examples and Recipes
+----------------------------------------
+
+A :class:`NormalDist` readily solves classic probability problems.
+
+For example, given `historical data for SAT exams
+<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
+are normally distributed with a mean of 1060 and a standard deviation of 195,
+determine the percentage of students with scores between 1100 and 1200:
+
+.. doctest::
+
+   >>> sat = NormalDist(1060, 195)
+   >>> fraction = sat.cdf(1200) - sat.cdf(1100)
+   >>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
+   '18.2% score between 1100 and 1200'
+
+To estimate the distribution for a model that isn't easy to solve
+analytically, :class:`NormalDist` can generate input samples for a `Monte
+Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
+model:
+
+.. doctest::
+
+   >>> n = 100_000
+   >>> X = NormalDist(350, 15).samples(n)
+   >>> Y = NormalDist(47, 17).samples(n)
+   >>> Z = NormalDist(62, 6).samples(n)
+   >>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
+   >>> NormalDist.from_samples(model_simulation)        # doctest: +SKIP
+   NormalDist(mu=267.6516398754636, sigma=101.357284306067)
+
+Normal distributions commonly arise in machine learning problems.
+
+Wikipedia has a `nice example with a Naive Bayesian Classifier
+<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
+is to guess a person's gender from measurements of normally distributed
+features including height, weight, and foot size.
+
+The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
+being male or female is 50%:
+
+.. doctest::
+
+   >>> prior_male = 0.5
+   >>> prior_female = 0.5
+
+We also have a training dataset with measurements for eight people. These
+measurements are assumed to be normally distributed, so we summarize the data
+with :class:`NormalDist`:
+
+.. doctest::
+
+   >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
+   >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
+   >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
+   >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
+   >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
+   >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
+
+We observe a new person whose feature measurements are known but whose gender
+is unknown:
+
+.. doctest::
+
+   >>> ht = 6.0   # height
+   >>> wt = 130   # weight
+   >>> fs = 8     # foot size
+
+The posterior is the product of the prior times each likelihood of a
+feature measurement given the gender:
+
+.. doctest::
+
+   >>> posterior_male = (prior_male * height_male.pdf(ht) *
+   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))
+
+   >>> posterior_female = (prior_female * height_female.pdf(ht) *
+   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))
+
+The final prediction is awarded to the largest posterior -- this is known as
+the `maximum a posteriori
+<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
+
+.. doctest::
+
+   >>> 'male' if posterior_male > posterior_female else 'female'
+   'female'
+
+
 ..
    # This modelines must appear within the last ten lines of the file.
    kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
diff --git a/Doc/whatsnew/3.8.rst b/Doc/whatsnew/3.8.rst
index f21175a..68a4457 100644
--- a/Doc/whatsnew/3.8.rst
+++ b/Doc/whatsnew/3.8.rst
@@ -278,6 +278,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of
 :func:`statistics.mean()`. (Contributed by Raymond Hettinger and
 Steven D'Aprano in :issue:`35904`.)
 
+Added :class:`statistics.NormalDist`, a tool for creating
+and manipulating normal distributions of a random variable.
+(Contributed by Raymond Hettinger in :issue:`36018`.)
+
+::
+
+    >>> temperature_feb = NormalDist.from_samples([4, 12, -3, 2, 7, 14])
+    >>> temperature_feb
+    NormalDist(mu=6.0, sigma=6.356099432828281)
+
+    >>> temperature_feb.cdf(3)           # Chance of being under 3 degrees
+    0.3184678262814532
+    >>> # Relative chance of being 7 degrees versus 10 degrees
+    >>> temperature_feb.pdf(7) / temperature_feb.pdf(10)
+    1.2039930378537762
+
+    >>> el_nino = NormalDist(4, 2.5)
+    >>> temperature_feb += el_nino       # Add in a climate effect
+    >>> temperature_feb
+    NormalDist(mu=10.0, sigma=6.830080526611674)
+
+    >>> temperature_feb * (9/5) + 32     # Convert to Fahrenheit
+    NormalDist(mu=50.0, sigma=12.294144947901014)
+    >>> temperature_feb.samples(3)       # Generate random samples
+    [7.672102882379219, 12.000027119750287, 4.647488369766392]
+
 tokenize
 --------
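The pdf() and cdf() methods documented in the diff above correspond to the standard closed
forms for a normal distribution. A minimal sketch using only the math module (the helper
names normal_pdf and normal_cdf are illustrative, not part of the statistics module)::

   from math import erf, exp, pi, sqrt

   def normal_pdf(x, mu=0.0, sigma=1.0):
       # Density: exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
       return exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * sqrt(2.0 * pi))

   def normal_cdf(x, mu=0.0, sigma=1.0):
       # Cumulative probability P(X <= x), expressed with the error function
       return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

   # Cross-check against the SAT recipe above: roughly 18.2% of scores
   # fall between 1100 and 1200 when mu=1060 and sigma=195.
   print(normal_cdf(1200, 1060, 195) - normal_cdf(1100, 1060, 195))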
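The behaviour described for NormalDist.samples(n, seed=None) can be approximated with the
standard random module. This is a sketch under the assumption that reproducibility comes
from a dedicated, seeded generator; normal_samples is a hypothetical helper, not the
module's actual code::

   from random import Random

   def normal_samples(n, mu, sigma, seed=None):
       # A dedicated Random instance keeps results reproducible even when
       # other threads are drawing from the global generator.
       rng = Random(seed)
       return [rng.gauss(mu, sigma) for _ in range(n)]

   # The same seed yields the same draws.
   assert normal_samples(3, 10.0, 6.83, seed=8675309) == \
          normal_samples(3, 10.0, 6.83, seed=8675309)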
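The translation, scaling, and addition examples rest on the usual rules for independent
normal variables: means combine linearly and variances add. A quick arithmetic check of
the figures quoted above, using plain math rather than NormalDist::

   from math import isclose, sqrt

   # Scaling and shifting: a*X + b has mean a*mu + b and sigma |a|*sigma.
   mu, sigma = 5.0, 2.5                        # temperature_february, Celsius
   assert isclose(mu * (9 / 5) + 32, 41.0)     # matches NormalDist(mu=41.0, sigma=4.5)
   assert isclose(sigma * (9 / 5), 4.5)

   # Adding independent normals: means add, variances add.
   mu1, sigma1 = 6.0, 6.356099432828281        # temperature_feb in the whatsnew entry
   mu2, sigma2 = 4.0, 2.5                      # el_nino
   assert isclose(mu1 + mu2, 10.0)
   assert isclose(sqrt(sigma1 ** 2 + sigma2 ** 2), 6.830080526611674)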
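The Naive Bayes recipe at the end of the statistics.rst addition can also be packaged as a
small reusable function once NormalDist is available (Python 3.8+). The map_label helper
below is purely illustrative, not part of the module::

   from statistics import NormalDist

   def map_label(priors, feature_dists, observation):
       # Return the label with the largest posterior (maximum a posteriori).
       # priors: label -> prior probability
       # feature_dists: label -> list of NormalDist, one per feature
       # observation: measured feature values, in the same feature order
       posteriors = {}
       for label, prior in priors.items():
           likelihood = prior
           for dist, value in zip(feature_dists[label], observation):
               likelihood *= dist.pdf(value)
           posteriors[label] = likelihood
       return max(posteriors, key=posteriors.get)

   priors = {'male': 0.5, 'female': 0.5}
   dists = {                                   # height, weight, foot size
       'male': [NormalDist.from_samples([6, 5.92, 5.58, 5.92]),
                NormalDist.from_samples([180, 190, 170, 165]),
                NormalDist.from_samples([12, 11, 12, 10])],
       'female': [NormalDist.from_samples([5, 5.5, 5.42, 5.75]),
                  NormalDist.from_samples([100, 150, 130, 150]),
                  NormalDist.from_samples([6, 8, 7, 9])],
   }
   print(map_label(priors, dists, [6.0, 130, 8]))   # -> 'female'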