bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620)

* bpo-39040: Fix parsing of email headers with encoded-words inside a quoted string.

It is fairly common to find malformed MIME headers (especially Content-Disposition headers) where the parameter values, instead of being encoded to RFC standards, are "encoded" by applying RFC 2047 "encoded word" encoding and then enclosing the whole thing in quotes.

The processing of these malformed headers incorrectly left the spaces between encoded words in the decoded text. Whitespace between adjacent encoded words is supposed to be stripped on decoding.

This changeset fixes the encoded word processing inside quoted strings (bare-quoted-string) so that it performs correct RFC 2047 decoding by stripping that whitespace.
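To illustrate the case described above, here is a minimal sketch using the public `email` API with `policy.default`. The header value and the file name `foobar.txt` are made up for illustration; the output noted in the comment assumes an interpreter that already includes this change.

```python
from email import message_from_string
from email.policy import default

# Hypothetical malformed header: the filename is split into two RFC 2047
# encoded words, separated by a space, and the whole value is wrapped in quotes.
raw = (
    'Content-Disposition: attachment; '
    'filename="=?utf-8?q?foo?= =?utf-8?q?bar.txt?="\n'
    '\n'
    'body\n'
)

msg = message_from_string(raw, policy=default)
header = msg['Content-Disposition']

# With the fix, the space between the two encoded words is dropped,
# so this is expected to print "foobar.txt" rather than "foo bar.txt".
filename = header.params['filename']
print(filename)

# The header is still malformed, so the parser records defects for it.
defect_names = [type(d).__name__ for d in header.defects]
print(defect_names)
```

The decoded value is usable, but the recorded defects still let a caller detect that the sender produced a non-conforming header.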
author: Abhilash Raj <maxking@users.noreply.github.com> 2020-05-29 00:04:59 (GMT)
committer: GitHub <noreply@github.com> 2020-05-29 00:04:59 (GMT)
commit: 21017ed904f734be9f195ae1274eb81426a9e776 (patch)
tree: 407b8d0170e82598d769d6b8dcb439323cb9b930 /Lib/email
parent: c610d970f5373b143bf5f5900d4645e6a90fb460 (diff)
1 files changed, 9 insertions, 0 deletions
diff --git a/Lib/email/_header_value_parser.py b/Lib/email/_header_value_parser.py
index 9c55ef7..51d355f 100644
--- a/Lib/email/_header_value_parser.py
+++ b/Lib/email/_header_value_parser.py
@@ -1218,12 +1218,21 @@ def get_bare_quoted_string(value):
         if value[0] in WSP:
             token, value = get_fws(value)
         elif value[:2] == '=?':
+            valid_ew = False
             try:
                 token, value = get_encoded_word(value)
                 bare_quoted_string.defects.append(errors.InvalidHeaderDefect(
                     "encoded word inside quoted string"))
+                valid_ew = True
             except errors.HeaderParseError:
                 token, value = get_qcontent(value)
+            # Collapse the whitespace between two encoded words that occur in a
+            # bare-quoted-string.
+            if valid_ew and len(bare_quoted_string) > 1:
+                if (bare_quoted_string[-1].token_type == 'fws' and
+                        bare_quoted_string[-2].token_type == 'encoded-word'):
+                    bare_quoted_string[-1] = EWWhiteSpaceTerminal(
+                        bare_quoted_string[-1], 'fws')
         else:
             token, value = get_qcontent(value)
         bare_quoted_string.append(token)
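The effect of the hunk above can be observed by calling the patched function directly. This is only a sketch: `email._header_value_parser` is a private module whose interface may change, the input string is invented for illustration, and the results noted in the comments assume an interpreter that includes this change.

```python
from email import _header_value_parser as parser  # private module, may change

# Two encoded words separated by one space, inside a quoted string.
bqs, rest = parser.get_bare_quoted_string('"=?utf-8?q?foo?= =?utf-8?q?bar?="')

# The whitespace token between the two encoded words keeps its 'fws'
# token_type but is replaced by an EWWhiteSpaceTerminal, which renders
# as an empty string when the value is assembled.
token_kinds = [(type(t).__name__, t.token_type) for t in bqs]
for kind in token_kinds:
    print(kind)

# Expected to be "foobar" with the fix applied.
decoded = bqs.value
print(decoded)

# The "encoded word inside quoted string" defect is still recorded.
defect_messages = [str(d) for d in bqs.all_defects]
print(defect_messages)
```

Because the replacement only happens when the previous two tokens are an encoded word followed by whitespace, ordinary spaces between plain words inside the quoted string are left intact.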