path: root/generic/tclUtf.c
Commit message [author, date, files, lines]
* Fix compiled "string is <class>" for TCL_UTF_MAX=4 build, for characters > U+FFFF. [jan.nijtmans, 2020-05-25, 1 file, -0/+16]
* Tiny fix for TCL_UTF_MAX=4 build only: Since Tcl_UtfNext() verifies 4 bytes for lead bytes F0-F5, Tcl_UtfCharComplete() should guarantee that those 4 bytes are available, not 3. [jan.nijtmans, 2020-05-18, 1 file, -1/+1]
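The completeness rule described in the entry above can be sketched as follows. This is a minimal illustration, not the code in tclUtf.c: the names `sketch_bytes_needed` and `sketch_char_complete` are hypothetical, the F0-F5 range is taken from the commit message, and the real implementation uses a lookup table whose details depend on TCL_UTF_MAX.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many bytes a decoder would inspect for a given
 * lead byte. The F0-F5 range follows the commit message; everything else
 * is an assumption made for this sketch. */
static size_t sketch_bytes_needed(unsigned char lead)
{
    if (lead < 0xC0) return 1;   /* ASCII, or a stray trail byte taken alone */
    if (lead < 0xE0) return 2;   /* C0-DF: two-byte sequence */
    if (lead < 0xF0) return 3;   /* E0-EF: three-byte sequence */
    if (lead <= 0xF5) return 4;  /* F0-F5: the decoder verifies 4 bytes */
    return 1;                    /* F6-FF: never valid, taken alone */
}

/* Returns 1 only if every byte the decoder would inspect is available. */
static int sketch_char_complete(const char *src, size_t length)
{
    assert(src != NULL);
    if (length == 0) {
        return 0;
    }
    return length >= sketch_bytes_needed((unsigned char) src[0]);
}
```

With this rule, a buffer holding only 3 bytes that starts with F0 is reported as incomplete, so a caller would wait for more input instead of letting the decoder read past the end.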
* Fix [ed29806baf]: Tcl_UtfToUniChar reads more than TCL_UTF_MAX bytes [jan.nijtmans, 2020-05-13, 1 file, -25/+11]
|\
| * Merge testcase cleanup. Make Tcl_UtfPrev() behave the same for any TCL_UTF_MAX value, since we didn't figure out yet how it should behave for TCL_UTF_MAX>3. [jan.nijtmans, 2020-05-12, 1 file, -4/+4]
| |\
| * | Fix "knownBug" utf-4.11. Turns out a few other testcases were still not correct, now they are. Make next/prev behavior the same for all TCL_UTF_MAX values, since the exact behavior for TCL_UTF_MAX>3 should be worked out further for Tcl 8.7 first, then everything agreed upon can be backported. [jan.nijtmans, 2020-05-12, 1 file, -23/+5]
| * | Merge 8.6. Mark testcase utf-4.11 as "knownBug": this one still doesn't give the right answer. Add testcase 4.14 with similar corner-case, this one is OK. [jan.nijtmans, 2020-05-11, 1 file, -36/+48]
| |\ \
| | |/
* | | Revert implementation of Tcl_UniCharAtIndex() change done in this commit: [6596c4af31e29b5d]. Just look at the Tcl_UtfAtIndex() implementation for TCL_UTF_MAX=4: It's not the same. There are no test-cases for Tcl_UniCharAtIndex(), see [f45d0dc1a7], not really worth writing one, since the implementation of this function didn't change in 20 years. [jan.nijtmans, 2020-05-12, 1 file, -1/+3]
| |/
|/|
* | Tweak the Tcl_UtfPrev() implementation for TCL_UTF_MAX=4. This fixes 10 testcases in 4 groups (utf-7.10, utf-7.15, utf-7.40 and utf-7.48), where Tcl_UtfPrev() didn't jump to the beginning of the UTF-8 character, even though there was no limitation which prevented that. So, this is actually a bug-fix for the TIP #389 implementation. [jan.nijtmans, 2020-05-11, 1 file, -1/+1]
* | occurance -> occurrence. [jan.nijtmans, 2020-05-11, 1 file, -3/+3]
| |
* | Tweak Invalid() function: No need for "return 0" twice in the function. For start bytes F0-F4, case TCL_UTF_MAX=4, Tcl_UtfToUniChar() reads 3 bytes but only advances 1 byte. So Tcl_UtfCharComplete() must make sure 3 bytes are available, not 1. Adapt Tcl_UtfCharComplete() accordingly. No change for TCL_UTF_MAX=[3|6]. [jan.nijtmans, 2020-05-10, 1 file, -8/+31]
* | Rebase to latest core-8-6-branch. [jan.nijtmans, 2020-05-08, 1 file, -26/+12]
|\ \
| * \ Merge changes from parent branch [dgp, 2020-05-07, 1 file, -14/+16]
| |\ \
| | |/
| | * Merge 8.6 [jan.nijtmans, 2020-05-07, 1 file, -19/+21]
| | |\
| |_|/
|/| |
| * | New approach to fixing the regression reported in [31aa44375d] builds on recent reforms. Older efforts aborted. [dgp, 2020-05-07, 1 file, -17/+3]
| |/
| * Merge 8.6. Some more tweaks to Tcl_UtfPrev(), so it cannot jump back 4 bytes in "utf16" build any more. [jan.nijtmans, 2020-05-07, 1 file, -63/+92]
| |\
| * | Add 4 test-cases that could fool Tcl_UtfPrev (but ... actually they don't). Make sure that Tcl_UtfPrev() never reads more than 3 trail bytes (or 4 when TCL_UTF_MAX > 4). Those are the same limits as for Tcl_UtfNext() and Tcl_UtfToUniChar(). [jan.nijtmans, 2020-05-05, 1 file, -1/+1]
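The bounded backward scan mentioned in the entry above might look roughly like this. It is a simplified sketch under stated assumptions, not Tcl_UtfPrev() itself: `sketch_utf_prev`, `sketch_seq_len` and `SKETCH_MAX_TRAIL` are hypothetical names, the sketch only accepts a lead byte whose sequence length matches the distance exactly, and it ignores the TCL_UTF_MAX-specific cases the real function handles.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_TRAIL 3  /* assumed limit: at most 3 trail bytes are stepped over */

/* Hypothetical helper: sequence length implied by a non-trail byte. */
static size_t sketch_seq_len(unsigned char lead)
{
    if (lead < 0xC0) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead <= 0xF4) return 4;
    return 1;  /* F5-FF: treated as a single byte in this sketch */
}

/* Returns the start of the character that ends just before src, never
 * looking before start and never stepping over more than
 * SKETCH_MAX_TRAIL trail bytes. Falls back to src - 1 otherwise. */
static const char *sketch_utf_prev(const char *src, const char *start)
{
    const char *fallback, *look;
    int trail = 0;

    assert(src != NULL && start != NULL);
    if (src <= start) {
        return start;
    }
    fallback = src - 1;
    look = fallback;
    while (((unsigned char) *look & 0xC0) == 0x80) {
        /* A trail byte: stop if we hit the buffer start or the limit. */
        if (look == start || ++trail > SKETCH_MAX_TRAIL) {
            return fallback;
        }
        look--;
    }
    /* look now points at a non-trail byte; accept it only if its
     * sequence length covers exactly the bytes up to src. */
    return ((size_t) (src - look) == sketch_seq_len((unsigned char) *look))
            ? look : fallback;
}
```

Capping the scan keeps the backward step consistent with the forward decoder: a long run of stray trail bytes is consumed one byte at a time rather than being searched indefinitely for a lead byte.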
| * | Merge 8.6 [jan.nijtmans, 2020-05-05, 1 file, -1/+1]
| |\ \
| * | | More progress/simplification [jan.nijtmans, 2020-05-04, 1 file, -21/+2]
| | | |
| * | | Merge 8.6 [jan.nijtmans, 2020-05-04, 1 file, -6/+7]
| |\ \ \
| * \ \ \ Merge 8.6 [jan.nijtmans, 2020-05-04, 1 file, -6/+8]
| |\ \ \ \
| * \ \ \ \ Merge 8.6 [jan.nijtmans, 2020-05-03, 1 file, -13/+55]
| |\ \ \ \ \
| * | | | | | Seems almost correct. Still problem with "string index" for TCL_UTF_MAX>3 [jan.nijtmans, 2020-05-02, 1 file, -15/+10]
| | | | | | |
| * | | | | | More fixes for [ed29806baf]. Not working yet. WIP [jan.nijtmans, 2020-05-02, 1 file, -17/+41]
| | | | | | |
* | | | | | | merge 8.5 [dgp, 2020-05-07, 1 file, -5/+6]
|\ \ \ \ \ \ \
| * | | | | | | Same trouble with Tcl_UtfToUniCharDstring. Test and fix. [dgp, 2020-05-07, 1 file, -4/+5]
| | | | | | | |
* | | | | | | | merge 8.5 [dgp, 2020-05-07, 1 file, -14/+15]
|\ \ \ \ \ \ \ \
| |/ / / / / / /
| * | | | | | | Fix. Note that just because we get one positive detection of an incomplete character, we cannot conclude that the next byte also will be, or can be taken as a single byte. At least we cannot when TCL_UTF_MAX > 3 so that we have room for valid two-byte sequences after incomplete sequence detection. No need for conditional code, just use an algorithm that always works. [dgp, 2020-05-07, 1 file, -9/+10]
* | | | | | | | For TCL_UTF_MAX==4: Make sure that Tcl_UtfNext()/Tcl_UtfPrev() never move more than 3 bytes. This is more consistent with what Tcl 8.7 does too. For TCL_UTF_MAX==6: Make sure that Tcl_UtfNext()/Tcl_UtfPrev() never move more than 4 bytes. For TCL_UTF_MAX==3: No change. Introduce ucs2_utf16 test constraint, since many test results now become the same for ucs2 and utf16. [jan.nijtmans, 2020-05-07, 1 file, -6/+6]
| |_|_|_|_|_|/
|/| | | | | |
* | | | | | | Optimize Tcl_UtfToUniCharDString() [jan.nijtmans, 2020-05-07, 1 file, -23/+26]
|\ \ \ \ \ \ \
| |/ / / / / /
| * | | | | | Tighten optimization in Tcl_UtfToUniCharDString(), just as in Tcl_NumUtfChars(). Don't use "-1" in the Tcl_NumUtfChars() calculation, since that raises more questions than it solves, but that's easily remedied as well: Just use >= instead of > in the comparison. Great idea, Don! Backport more code formatting from Tcl 8.6 (e.g. use of CONST, which makes no sense any more in c-files). [jan.nijtmans, 2020-05-07, 1 file, -81/+79]
* | | | | | | merge 8.5 [dgp, 2020-05-06, 1 file, -14/+24]
|\ \ \ \ \ \ \
| |/ / / / / /
| * | | | | | Tighten optimization in Tcl_NumUtfChars. Explain in comments. [dgp, 2020-05-06, 1 file, -14/+25]
| | | | | | |
* | | | | | | merge 8.5 [dgp, 2020-05-06, 1 file, -11/+26]
|\ \ \ \ \ \ \
| |/ / / / / /
| * | | | | | Restore safe calls of Invalid(). [dgp, 2020-05-06, 1 file, -3/+10]
| | | | | | |
| * | | | | | The routine Invalid() has been revised to do something different. Update the comments to describe what it does now, and the cautions that callers should take into account. [dgp, 2020-05-06, 1 file, -9/+17]
* | | | | | | Merge 8.5. More usage of UCHAR() macro. [jan.nijtmans, 2020-05-06, 1 file, -11/+13]
|\ \ \ \ \ \ \
| |/ / / / / /
| * | | | | | Change Invalid() parameter type to "const char *". Also call Invalid() first in Tcl_UtfNext(), so if src[1] is invalid src[2] doesn't need to be checked any more. Note: This order change, calling Invalid() first, was wrong, and is corrected in later commits. Thanks, Don, for noticing this! [jan.nijtmans, 2020-05-06, 1 file, -11/+13]
* | | | | | | More usage of TclUtfToUCS4(), so we can use the whole Unicode range better in TCL_UTF_MAX>3 builds. [jan.nijtmans, 2020-05-05, 1 file, -9/+8]
| |_|_|_|_|/
|/| | | | |
* | | | | | Merge 8.5 [jan.nijtmans, 2020-05-05, 1 file, -2/+2]
|\ \ \ \ \ \
| |/ / / / /
| | | | | /
| |_|_|_|/
|/| | | |
| * | | | Properly protect "Invalid" function against lead bytes 0x80-0xBF. This fixes "knownBug" testcase utf-6.93.1. Rename tip389 selector to utf16, since that's what it actually is, in contrast to ucs2 and ucs4. [jan.nijtmans, 2020-05-05, 1 file, -2/+2]
* | | | | New internal function TclGetUCS4() only available when TCL_UTF_MAX=4. This fixes all "knownBug" testcases related to tip389. [jan.nijtmans, 2020-05-04, 1 file, -1/+5]
| |_|_|/
|/| | |
* | | | (partial) fix for [9d0cb35bb2]: Various issues with core-8-6-branch, TCL_UTF_MAX=4. (even though TCL_UTF_MAX=4 is unsupported, it would be nice to make it work) Marked various test-cases as "knownBug", those work correctly in core-8-branch (8.7). The fix there could be backported. Low prio. [jan.nijtmans, 2020-05-04, 1 file, -16/+18]
| |_|/
|/| |
* | | Re-join utf-6.93.0 and utf-6.93.1 (please disregard comment in previous commit, it was not correct). Perfect TclUtfToUCS4()/TclUCS4Complete() and new (internal) function TclUCS4ToUtf(). They can help prevent bugs regarding splitting/joining surrogates. Used them in a few more places. [jan.nijtmans, 2020-05-03, 1 file, -12/+54]
* | | Join test-cases utf-6.93.0 and utf-6.93.1, which MUST give the same answer always for whatever testConstraints. Fix one invalid use of TclUCS4Complete(), and let TclUtfToUCS4() handle (invalid) 4-byte sequences. Test-case cleanup (removal of unnecessary quoting). [jan.nijtmans, 2020-05-02, 1 file, -1/+1]
| |/
|/|
* | Fix first part of [ed29806baf]: Tcl_UtfToUniChar reads more than TCL_UTF_MAX bytes. Tcl_UtfToUniChar() now never reads more than TCL_UTF_MAX bytes any more. The UtfToUtf encoder/decoder is adapted to do additional checks (more tricky than in Tcl 8.7, since we want compatibility with earlier 8.6 releases). Other callers of Tcl_UtfToUniChar() need to be revised for the same problem. Most callers will need to change Tcl_UtfToUniChar() -> TclUtfToUCS4() and Tcl_UtfCharComplete() -> TclUCS4Complete(), but that's not done yet. [jan.nijtmans, 2020-05-01, 1 file, -10/+10]
|\ \
| * | First, prove that bug [ed29806baf] is present in 8.7 too. Let's see what test-cases fail when we no longer check the validity of the third trail byte. [jan.nijtmans, 2020-04-30, 1 file, -3/+3]
| * | Let's not get out the src[3] check yet. [jan.nijtmans, 2020-04-30, 1 file, -1/+1]
| | |
| * | Merge 8.6 [jan.nijtmans, 2020-04-30, 1 file, -9/+9]
| |\ \
| * \ \ Merge-mark 8.6 (Use of UNICODE_OUT_OF_RANGE() macro already was in 8.7). Quick exit from Tcl_UtfToChar16()/Tcl_UtfToUniChar() when lead-byte is 0xF5 - 0xF7. [jan.nijtmans, 2020-04-29, 1 file, -11/+6]
| |\ \ \
| * \ \ \ merge 8.6 [dgp, 2020-04-27, 1 file, -45/+16]
| |\ \ \ \