author | Greg Ward <greg@gerg.ca> | 2015-04-21 00:21:21 (GMT) |
---|---|---|
committer | Greg Ward <greg@gerg.ca> | 2015-04-21 00:21:21 (GMT) |
commit | 4d9d2563f51edad448a960d9490a6f56ac733735 (patch) | |
tree | 732ccd1c3bded6c2b25b942c7f55c5cc685dab1e /Doc/library/difflib.rst | |
parent | d19458ac51633cac979b7c7b9439ea89f179c8c8 (diff) | |
#17445: difflib: add diff_bytes(), to compare bytes rather than str
Some applications (e.g. traditional Unix diff, version control
systems) neither know nor care about the encodings of the files they
are comparing. They are textual, but to the diff utility they are just
bytes. This worked fine under Python 2, because all of the hardcoded
strings in difflib.py are ASCII, so they could safely be combined with
old-style u'' strings. But it stopped working in 3.x.
The solution is to use surrogate escapes for a lossless
bytes->str->bytes roundtrip. That means {unified,context}_diff() can
continue to just handle strings without worrying about bytes. Callers
who have to deal with bytes will need to change to using diff_bytes().
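
The roundtrip described above can be sketched as follows. This is only an
illustration, assuming the ASCII codec is paired with the "surrogateescape"
error handler (the message does not name the codec); the sample bytes are
made up.

```python
# Hypothetical input: mixes a Latin-1 byte (0xe9) with a byte that is
# invalid in both ASCII and UTF-8 (0xff).
raw = b"caf\xe9 \xff\n"

# Decoding with surrogateescape never fails: each undecodable byte 0xNN
# is mapped to the lone surrogate code point U+DCNN.
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and error handler maps the surrogates
# back to the original bytes, so nothing is lost.
restored = text.encode("ascii", "surrogateescape")

print(restored == raw)
```

Because the intermediate str holds one code point per input byte, line-based
comparison on the str values behaves the same as comparison on the bytes.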
Use case: Mercurial's test runner uses difflib to compare current hg
output with known good output. But Mercurial's output is just bytes,
since it can contain:
* file contents (arbitrary unknown encoding)
* filenames (arbitrary unknown encoding)
* usernames and commit messages (usually UTF-8, but not guaranteed
  because old versions of Mercurial did not enforce it)
* user messages (locale encoding)
Since the output of any given hg command can include text in multiple
encodings, it is hopeless to try to treat it as decodable Unicode
text. It's just bytes, all the way down.
This is an elaboration of a patch by Terry Reedy.
Diffstat (limited to 'Doc/library/difflib.rst')
-rw-r--r-- | Doc/library/difflib.rst | 15 |
1 files changed, 15 insertions, 0 deletions
diff --git a/Doc/library/difflib.rst b/Doc/library/difflib.rst
index 4427065..efaac7a 100644
--- a/Doc/library/difflib.rst
+++ b/Doc/library/difflib.rst
@@ -315,6 +315,21 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module.
    See :ref:`difflib-interface` for a more detailed example.
 
+.. function:: diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\\n')
+
+   Compare *a* and *b* (lists of bytes objects) using *dfunc*; yield a
+   sequence of delta lines (also bytes) in the format returned by *dfunc*.
+   *dfunc* must be a callable, typically either :func:`unified_diff` or
+   :func:`context_diff`.
+
+   Allows you to compare data with unknown or inconsistent encoding. All
+   inputs except *n* must be bytes objects, not str. Works by losslessly
+   converting all inputs (except *n*) to str, and calling ``dfunc(a, b,
+   fromfile, tofile, fromfiledate, tofiledate, n, lineterm)``. The output of
+   *dfunc* is then converted back to bytes, so the delta lines that you
+   receive have the same unknown/inconsistent encodings as *a* and *b*.
+
+   .. versionadded:: 3.5
 
 .. function:: IS_LINE_JUNK(line)
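
A short usage sketch of the function documented in this patch, for Python 3.5
or later. The file names and line contents are invented for illustration: the
two inputs deliberately use different encodings for the same word, which is
the situation diff_bytes() is meant to handle.

```python
import difflib

# Hypothetical inputs: the first line is Latin-1 in `old` and UTF-8 in `new`.
old = [b"caf\xe9\n", b"same\n"]
new = [b"caf\xc3\xa9\n", b"same\n"]

# All arguments except n must be bytes, including the file names.
delta = list(
    difflib.diff_bytes(
        difflib.unified_diff,
        old,
        new,
        fromfile=b"old.txt",
        tofile=b"new.txt",
    )
)

# Every delta line is bytes, carrying the original encodings unchanged.
for line in delta:
    print(line)
```

Passing str for any of these arguments (other than *n*) is rejected, so callers
that already hold decoded text should keep calling unified_diff() or
context_diff() directly.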