author Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> 2022-07-10 17:36:01 (GMT)
committer GitHub <noreply@github.com> 2022-07-10 17:36:01 (GMT)
commit 30015de7235e5b4033298b85a164fcd8d96046b3
tree 6592419fb4f0c5570f341ab97a85303d96302f85
parent d4796c2231eef9355814572650f00c3c4662f197
GH-77265: Document NaN handling in statistics functions that sort or count (GH-94676) (#94725)
-rw-r--r-- Doc/library/statistics.rst | 29
1 files changed, 29 insertions, 0 deletions
diff --git a/Doc/library/statistics.rst b/Doc/library/statistics.rst
index 1f55ae8..6484e74 100644
--- a/Doc/library/statistics.rst
+++ b/Doc/library/statistics.rst
@@ -35,6 +35,35 @@ and implementation-dependent. If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.
+Some datasets use ``NaN`` (not a number) values to represent missing data.
+Since NaNs have unusual comparison semantics, they cause surprising or
+undefined behaviors in the statistics functions that sort data or that count
+occurrences. The functions affected are ``median()``, ``median_low()``,
+``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
+``quantiles()``. The ``NaN`` values should be stripped before calling these
+functions::
+
+ >>> from statistics import median
+ >>> from math import isnan
+ >>> from itertools import filterfalse
+
+ >>> data = [20.7, float('NaN'),19.2, 18.3, float('NaN'), 14.4]
+ >>> sorted(data) # This has surprising behavior
+ [20.7, nan, 14.4, 18.3, 19.2, nan]
+ >>> median(data) # This result is unexpected
+ 16.35
+
+ >>> sum(map(isnan, data)) # Number of missing values
+ 2
+ >>> clean = list(filterfalse(isnan, data)) # Strip NaN values
+ >>> clean
+ [20.7, 19.2, 18.3, 14.4]
+ >>> sorted(clean) # Sorting now works as expected
+ [14.4, 18.3, 19.2, 20.7]
+ >>> median(clean) # This result is now well defined
+ 18.75
+
+
Averages and measures of central location
-----------------------------------------
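
The NaN-stripping step documented in the added text could be wrapped in a small helper. The sketch below is only an illustration: `strip_nans` is a hypothetical name, not part of the `statistics` module, and it assumes the input holds real numbers that `math.isnan` accepts.

```python
from itertools import filterfalse
from math import isnan, nan
from statistics import median


def strip_nans(values):
    """Hypothetical helper: return a list of the values that are not NaN."""
    return list(filterfalse(isnan, values))


# Every ordering comparison involving NaN is False, and NaN is not equal
# to itself, which is why sorting and counting behave unexpectedly.
assert not (nan < 1.0)
assert not (nan > 1.0)
assert nan != nan

data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
clean = strip_nans(data)           # same approach as the documented example
missing = len(data) - len(clean)   # number of NaN values that were dropped
result = median(clean)             # median of the cleaned data
```

Reporting `missing` alongside the statistic keeps the amount of discarded data visible instead of dropping it silently.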