Jim Tcl
Check-in [18ba56fba7]

Overview
Comment: docs: Bring README.utf-8 up-to-date

Signed-off-by: Steve Bennett <steveb@workware.net.au>

SHA1: 18ba56fba7c8d571851f64909c522d67056326df
User & Date: steveb@workware.net.au 2017-11-07 21:48:02
Context
2017-11-24 22:59
build: During install, make sure pkgconfig dir exists

Signed-off-by: Steve Bennett <steveb@workware.net.au>
check-in: b87dad7d92 user: steveb@workware.net.au tags: trunk

2017-11-07 21:48
docs: Bring README.utf-8 up-to-date

Signed-off-by: Steve Bennett <steveb@workware.net.au>
check-in: 18ba56fba7 user: steveb@workware.net.au tags: trunk

2017-11-07 21:47
tclcompat.tcl: minor comment updates

Signed-off-by: Steve Bennett <steveb@workware.net.au>
check-in: 92e4698aad user: steveb@workware.net.au tags: trunk

Changes

Changes to README.utf-8.

Old version:

=========================

Author: Steve Bennett <steveb@workware.net.au>
Date: 2 Nov 2010 10:55:52 EST

OVERVIEW
--------
Traditionally Jim Tcl has support strings, including binary strings containing
nulls, however it has had no support for multi-byte character encodings.

In some fields, such as when dealing with the web, or other user-generated content,
support for multi-byte character encodings is necessary.
In these cases it would be very useful for Jim Tcl to be able to process strings
as multi-byte character strings rather than simply binary bytes.

Supporting multiple character encodings and translation between those encodings
is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
for UTF-8, as probably the most popular general purpose multi-byte encoding.

UTF-8 support is optional. It can be enabled at compile time with:

  ./configure --enable-utf8

The Jim Tcl documentation fully documents the UTF-8 support. This README includes
additional background information.
................................................................................

Unicode vs UTF-8
----------------
It is important to understand that Unicode is an abstract representation
of the concept of a "character", while UTF-8 is an encoding of
Unicode into bytes.  Thus the Unicode codepoint U+00B5 is encoded
in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
ASCII which the same name is used interchangeably between a character
set and an encoding.

Unicode Escapes
---------------
Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters
in strings. This can be done with the \uNNNN Unicode escape. This syntax
is compatible with Tcl and is enabled even if UTF-8 is disabled.

Like Tcl, currently only 16-bit Unicode characters can be encoded.




UTF-8 Properties
----------------
Due to the design of the UTF-8 encoding, many (most) commands continue
to work with UTF-8 strings. This is due to the following properties of UTF-8:

* ASCII characters in strings have the same representation in UTF-8
................................................................................
Case folding tables are automatically generated from the official
unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt

Working with Binary Data and non-UTF-8 encodings
------------------------------------------------
Almost all Jim commands will work identically with binary data and
UTF-8 encoded data, including read, gets, puts and 'string eq'.  It
is only certain string manipulation commands which will operated
differently.  For example, 'string index' will return UTF-8 characters,
not bytes.

If it is necessary to manipulate strings containing binary, non-ASCII
data (bytes >= 0x80), there are two options.

1. Build Jim without UTF-8 support
2. Arrange to encode and decode binary data or data in other encodings
   to UTF-8 before manipulation.

Internal Details
----------------
Jim_Utf8Length() will calculate the character length of the string and cache
it for later access. It uses utf8_strlen() which relies on the string to be null
terminated (which it always will be).

It is possible to tell if a string is ascii-only because length == bytelength

It is possible to provide optimised versions of various routines for
the ascii-only case. Currently this is done only for 'string index' and 'string range'.

New version:

=========================

Author: Steve Bennett <steveb@workware.net.au>
Date: 2 Nov 2010 10:55:52 EST

OVERVIEW
--------
Early versions of Jim Tcl supported strings, including binary strings containing
nulls, but had no support for multi-byte character encodings.

In some fields, such as when dealing with the web, or other user-generated content,
support for multi-byte character encodings is necessary.
In these cases it would be very useful for Jim Tcl to be able to process strings
as multi-byte character strings rather than simply binary bytes.

Supporting multiple character encodings and translation between those encodings
is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
for UTF-8, as the most popular general purpose multi-byte encoding.

UTF-8 support is optional. It can be enabled at compile time with:

  ./configure --enable-utf8

The Jim Tcl documentation fully documents the UTF-8 support. This README includes
additional background information.
................................................................................

Unicode vs UTF-8
----------------
It is important to understand that Unicode is an abstract representation
of the concept of a "character", while UTF-8 is an encoding of
Unicode into bytes.  Thus the Unicode codepoint U+00B5 is encoded
in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
ASCII, where the same name is used interchangeably for a character value
and its encoding.
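
The mapping from a codepoint to its UTF-8 bytes can be illustrated with a
short sketch. This is Python, not Jim Tcl code, and the function name
utf8_encode is made up for this example. It only shows the bit layout of the
encoding and does not reject surrogate values.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 bytes (illustrative sketch)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint out of range")
    if codepoint < 0x80:
        # ASCII: a single byte, unchanged
        return bytes([codepoint])
    if codepoint < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

With this sketch, utf8_encode(0xB5) yields the two bytes 0xc2, 0xb5 quoted
above, while an ASCII codepoint such as 0x41 stays a single byte.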

Unicode Escapes
---------------
Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters
in strings. This can be done with the \uNNNN Unicode escape. This syntax
is compatible with Tcl and is enabled even if UTF-8 is disabled.

Unlike Tcl, Jim Tcl supports Unicode characters up to 21 bits.
In addition to \uNNNN, Jim Tcl also supports variable length Unicode
character specifications with \u{NNNNNN}, where there may be anywhere between
1 and 6 hex digits within the braces, e.g. \u{24B62}
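
As a rough illustration of the two escape forms, the following Python sketch
expands them in a piece of text. It is not how Jim Tcl parses escapes. The
name expand_unicode_escapes is hypothetical, the unbraced form is simplified
to exactly four hex digits, and values above U+10FFFF are left untouched
purely as a choice made for this sketch.

```python
import re

# Braced form: 1 to 6 hex digits. Unbraced form: exactly four hex digits
# (a simplification made for this sketch).
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}|\\u([0-9A-Fa-f]{4})")


def expand_unicode_escapes(text: str) -> str:
    """Replace \\uNNNN and \\u{N...} escapes with the characters they name."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        value = int(digits, 16)
        if value > 0x10FFFF:
            # Outside the Unicode range: leave the escape text as it is
            return match.group(0)
        return chr(value)

    return _ESCAPE.sub(replace, text)
```

Here the text \u{24B62} expands to a single character whose UTF-8 encoding
is the four bytes 0xf0, 0xa4, 0xad, 0xa2.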

UTF-8 Properties
----------------
Due to the design of the UTF-8 encoding, many (most) commands continue
to work with UTF-8 strings. This is due to the following properties of UTF-8:

* ASCII characters in strings have the same representation in UTF-8
................................................................................
Case folding tables are automatically generated from the official
unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt

Working with Binary Data and non-UTF-8 encodings
------------------------------------------------
Almost all Jim commands will work identically with binary data and
UTF-8 encoded data, including read, gets, puts and 'string eq'.  It
is only certain string manipulation commands that behave differently.
For example, 'string index' will return UTF-8 characters, not bytes.


If it is necessary to manipulate strings containing binary, non-ASCII
data (bytes >= 0x80), there are two options.

1. Build Jim without UTF-8 support
2. Use 'string byterange', 'string bytelength', 'pack', 'unpack' and
   'binary' to operate on strings as bytes rather than characters.
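
The difference between the character view and the byte view can be seen in
a small Python sketch, where str stands in for character-wise commands and
bytes for byte-wise ones. This is an analogy only, not Jim Tcl code, and the
variable names are made up for this example.

```python
text = "a\u00b5b"                  # 'a', U+00B5, 'b'
data = text.encode("utf-8")        # the same string as raw bytes

char_length = len(text)            # length in characters
byte_length = len(data)            # length in bytes

second_char = text[1]              # indexing by character
bytes_1_to_2 = data[1:3]           # the two bytes that encode U+00B5
half_a_char = data[1:2]            # a byte range can split a character
```

The string has 3 characters but 4 bytes, and a byte range is free to cut
through the middle of a multi-byte character, which is why the two kinds of
commands are kept separate.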

Internal Details
----------------
Jim_Utf8Length() will calculate the character length of the string and cache
it for later access. It uses utf8_strlen(), which relies on the string being
null terminated (which it always will be).
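
The idea of counting characters once and caching the result can be sketched
as follows. This is a Python illustration under stated assumptions, not the
C implementation: the input is assumed to be valid UTF-8, characters are
counted by skipping continuation bytes (10xxxxxx), and the CachedString
class is hypothetical.

```python
def utf8_strlen(data: bytes) -> int:
    """Count characters in valid UTF-8 by counting non-continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


class CachedString:
    """Hypothetical string object that remembers its character length."""

    def __init__(self, data: bytes):
        self.data = data
        self._char_length = None   # not yet computed

    def char_length(self) -> int:
        # Compute on first use, then reuse the cached value
        if self._char_length is None:
            self._char_length = utf8_strlen(self.data)
        return self._char_length
```

Repeated calls to char_length() on the same object scan the bytes only once.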

It is possible to tell if a string is ascii-only because in that case
length == bytelength.

It is possible to provide optimised versions of various routines for
the ascii-only case. Both 'string index' and 'string range' currently
perform such optimisation.
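
A minimal sketch of such an optimisation, written in Python rather than C
and assuming valid UTF-8 input, might look like this. The names seq_len and
char_at are invented for the example, and the real routines may be
structured quite differently.

```python
def seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def char_at(data: bytes, index: int) -> str:
    """Return the character at a character index (valid UTF-8 assumed)."""
    char_length = sum(1 for b in data if (b & 0xC0) != 0x80)
    if not 0 <= index < char_length:
        raise IndexError("character index out of range")
    if char_length == len(data):
        # ascii-only: character index and byte index coincide
        return chr(data[index])
    # General case: walk forward one character at a time
    start = 0
    for _ in range(index):
        start += seq_len(data[start])
    return data[start:start + seq_len(data[start])].decode("utf-8")
```

In the ascii-only branch the lookup is a direct byte access, while the
general branch has to step over every preceding character.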