Rocksolid Light - comp.lang.c - Re: Unicode test suite

Re: Unicode test suite

<7ruxM.289871$AsA.241687@fx18.iad>

https://news.novabbs.org/devel/article-flat.php?id=47871&group=comp.lang.c#47871

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx18.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Unicode test suite
Newsgroups: comp.lang.c
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me>
From: Richard@Damon-Family.org (Richard Damon)
Content-Language: en-US
In-Reply-To: <ua5gf0$2u9ha$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 62
Message-ID: <7ruxM.289871$AsA.241687@fx18.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 30 Jul 2023 10:27:15 -0400
X-Received-Bytes: 4228

by: Richard Damon - Sun, 30 Jul 2023 14:27 UTC

On 7/30/23 7:10 AM, Bart wrote:
> On 30/07/2023 04:52, Richard Damon wrote:
> > On 7/29/23 9:00 PM, Keith Thompson wrote:
>
> >> I haven't actually used UTF-8 string literals. As of C11/N1570, the
> >> standard is a bit vague about the difference.
> >>
> >> N1570 6.4.5p3:
> >>
> >>      A *character string literal* is a sequence of zero or more
> >>      multibyte characters enclosed in double-quotes, as in "xyz".
> >>      A *UTF-8 string literal* is the same, except prefixed by u8.
> >>
> >> p6:
> >>
> >>      For character string literals, the array elements have type char,
> >>      and are initialized with the individual bytes of the multibyte
> >>      character sequence. For UTF−8 string literals, the array elements
> >>      have type char, and are initialized with the characters of the
> >>      multibyte character sequence, as encoded in UTF−8.
> >>
> >> I would guess that something like u8"\xff", which specifies an invalid
> >> UTF-8 sequence, would be invalid, but it doesn't seem to violate any
> >> constraint or syntax rule, and neither gcc nor clang complains
> >> about it.
> >>
> >> I guess that for a compiler that uses EBCDIC for source code, "x" would
> >> be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x" would be
> >> equivalent to "\x78".
> >>
> >> Either the standard is insufficiently clear, or I'm missing something.
> >> I definitely wouldn't bet against the latter.
> >>
> >
> > I think the key is that numerical constant specified characters are
> > specified in the "execution character encoding", which is implementation
> > defined, and doesn't need to be unicode.
> >
> > Thus u8"\0xFF" might generate the byte sequence 0xC3 0xDF 0x00 to
> > represent the Unicode character U+00FF
>
> (Do you mean \0xFF or \xFF ?)
>
> I thought these were limited to 2 hex digits, but apparently you can
> have as many as you like. However that looks to be ambiguous:
>
>     "\x20AC"
>
> Is that a 4-digit hex code, or is it a 2-digit one following by normal
> characters A and C?

Yes, forgot that the 0 wasn't needed there.

As for the ambiguity, escape sequences are scanned by "maximal munch", so
"\x20AC" is a single hex escape whose value takes at least 14 bits to
represent (so it might not be a valid character). If you want the latter
reading, you need to write:

"\x20""AC"

so that the escape is terminated by the end of the first literal, and the
two string literals are then concatenated in a later translation phase.
(This is a trap for machine-generated code that isn't smart enough to
handle the case.)

Re: Unicode test suite

<ua5u7h$2vbh4$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=47872&group=comp.lang.c#47872

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 16:05:23 +0100
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <ua5u7h$2vbh4$1@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me>
<7ruxM.289871$AsA.241687@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 30 Jul 2023 15:05:21 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ceb74387feb450401c92448496a7cc87";
logging-data="3124772"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18hY01b/ccjmxHPDnuZgbhUK3MrowiTCUU="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:d+xuLGOnfLq7BBB25meLdeuVOWc=
In-Reply-To: <7ruxM.289871$AsA.241687@fx18.iad>

by: Bart - Sun, 30 Jul 2023 15:05 UTC

On 30/07/2023 15:27, Richard Damon wrote:
> On 7/30/23 7:10 AM, Bart wrote:
>> On 30/07/2023 04:52, Richard Damon wrote:
>> > On 7/29/23 9:00 PM, Keith Thompson wrote:
>>
>> >> I haven't actually used UTF-8 string literals. As of C11/N1570, the
>> >> standard is a bit vague about the difference.
>> >>
>> >> N1570 6.4.5p3:
>> >>
>> >>      A *character string literal* is a sequence of zero or more
>> >>      multibyte characters enclosed in double-quotes, as in "xyz".
>> >>      A *UTF-8 string literal* is the same, except prefixed by u8.
>> >>
>> >> p6:
>> >>
>> >>      For character string literals, the array elements have type
>> >>      char, and are initialized with the individual bytes of the
>> >>      multibyte character sequence. For UTF-8 string literals, the
>> >>      array elements have type char, and are initialized with the
>> >>      characters of the multibyte character sequence, as encoded
>> >>      in UTF-8.
>> >>
>> >> I would guess that something like u8"\xff", which specifies an
>> >> invalid UTF-8 sequence, would be invalid, but it doesn't seem to
>> >> violate any constraint or syntax rule, and neither gcc nor clang
>> >> complains about it.
>> >>
>> >> I guess that for a compiler that uses EBCDIC for source code, "x"
>> >> would be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x"
>> >> would be equivalent to "\x78".
>> >>
>> >> Either the standard is insufficiently clear, or I'm missing
>> >> something. I definitely wouldn't bet against the latter.
>> >>
>> >
>> > I think the key is that numerical constant specified characters are
>> > specified in the "execution character encoding", which is
>> > implementation defined, and doesn't need to be unicode.
>> >
>> > Thus u8"\0xFF" might generate the byte sequence 0xC3 0xDF 0x00 to
>> > represent the Unicode character U+00FF
>>
>> (Do you mean \0xFF or \xFF ?)
>>
>> I thought these were limited to 2 hex digits, but apparently you can
>> have as many as you like. However that looks to be ambiguous:
>>
>> "\x20AC"
>>
>> Is that a 4-digit hex code, or is it a 2-digit one following by normal
>> characters A and C?
>
> Yes, forgot that the 0 wasn't needed there.
>
> As for the ambiguity, the grammar is defined by "maximum munching" so
> "\x20AC" has a single "character" which takes at least 14 bits to
> represent (so might not be a valid character). If you want the latter,
> you need to write:

20AC should be the Unicode code point for €. I assumed your example took
Unicode character codes, which need up to 21 bits, and turned them into
UTF-8 sequences.

However when I tried \x20AC with gcc 10.x, it said the hex escape
sequence was out of range, whether I used 'u8' or not.

> "\x20""AC" so that the constant is terminated and then in a later
> phase the strings are concatenated. (a trap for machine generated code
> that isn't smart enough to handle the case.)

Then it looks like "" can always be used to terminate a `\x` escape
sequence; it isn't needed when another \ follows anyway.

Re: Unicode test suite

<1c22f72a-98cf-4a26-be5f-b8d633edb855n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=47873&group=comp.lang.c#47873

X-Received: by 2002:a05:622a:1ba7:b0:403:e8a7:bd9b with SMTP id bp39-20020a05622a1ba700b00403e8a7bd9bmr30055qtb.11.1690729771976;
Sun, 30 Jul 2023 08:09:31 -0700 (PDT)
X-Received: by 2002:a05:622a:1105:b0:400:a783:f746 with SMTP id
e5-20020a05622a110500b00400a783f746mr26167qty.0.1690729771744; Sun, 30 Jul
2023 08:09:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 08:09:31 -0700 (PDT)
In-Reply-To: <7ruxM.289871$AsA.241687@fx18.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=81.143.231.9; posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 81.143.231.9
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <ua42v3$2nihd$1@dont-email.me>
<87wmyimcwr.fsf@nosuchdomain.example.com> <W7lxM.161741$qnnb.72702@fx11.iad>
<ua5gf0$2u9ha$1@dont-email.me> <7ruxM.289871$AsA.241687@fx18.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1c22f72a-98cf-4a26-be5f-b8d633edb855n@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Sun, 30 Jul 2023 15:09:31 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5141

by: Malcolm McLean - Sun, 30 Jul 2023 15:09 UTC

On Sunday, 30 July 2023 at 15:27:32 UTC+1, Richard Damon wrote:
> On 7/30/23 7:10 AM, Bart wrote:
> > On 30/07/2023 04:52, Richard Damon wrote:
> > > On 7/29/23 9:00 PM, Keith Thompson wrote:
> >
> > >> I haven't actually used UTF-8 string literals. As of C11/N1570, the
> > >> standard is a bit vague about the difference.
> > >>
> > >> N1570 6.4.5p3:
> > >>
> > >> A *character string literal* is a sequence of zero or more
> > >> multibyte characters enclosed in double-quotes, as in "xyz".
> > >> A *UTF-8 string literal* is the same, except prefixed by u8.
> > >>
> > >> p6:
> > >>
> > >> For character string literals, the array elements have type char,
> > >> and are initialized with the individual bytes of the multibyte
> > >> character sequence. For UTF−8 string literals, the array elements
> > >> have type char, and are initialized with the characters of the
> > >> multibyte character sequence, as encoded in UTF−8.
> > >>
> > >> I would guess that something like u8"\xff", which specifies an invalid
> > >> UTF-8 sequence, would be invalid, but it doesn't seem to violate any
> > >> constraint or syntax rule, and neither gcc nor clang complains
> > >> about it.
> > >>
> > >> I guess that for a compiler that uses EBCDIC for source code, "x" would
> > >> be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x" would be
> > >> equivalent to "\x78".
> > >>
> > >> Either the standard is insufficiently clear, or I'm missing something.
> > >> I definitely wouldn't bet against the latter.
> > >>
> > >
> > > I think the key is that numerical constant specified characters are
> > > specified in the "execution character encoding", which is implementation
> > > defined, and doesn't need to be unicode.
> > >
> > > Thus u8"\0xFF" might generate the byte sequence 0xC3 0xDF 0x00 to
> > > represent the Unicode character U+00FF
> >
> > (Do you mean \0xFF or \xFF ?)
> >
> > I thought these were limited to 2 hex digits, but apparently you can
> > have as many as you like. However that looks to be ambiguous:
> >
> > "\x20AC"
> >
> > Is that a 4-digit hex code, or is it a 2-digit one following by normal
> > characters A and C?
> Yes, forgot that the 0 wasn't needed there.
>
> As for the ambiguity, the grammar is defined by "maximum munching" so
> "\x20AC" has a single "character" which takes at least 14 bits to
> represent (so might not be a valid character). If you want the latter,
> you need to write:
>
> "\x20""AC" so that the constant is terminated and then in a later
> phase the strings are concatenated. (a trap for machine generated code
> that isn't smart enough to handle the case.)
>
That's a weakness in the Baby X resource compiler. It allows the user to enter
a C-style escaped string like "hello world\n", but it won't handle hex escapes.

Re: Unicode test suite

<60wxM.144679$xMqa.81254@fx12.iad>

https://news.novabbs.org/devel/article-flat.php?id=47875&group=comp.lang.c#47875

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx12.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Unicode test suite
Content-Language: en-US
Newsgroups: comp.lang.c
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me>
<7ruxM.289871$AsA.241687@fx18.iad> <ua5u7h$2vbh4$1@dont-email.me>
From: Richard@Damon-Family.org (Richard Damon)
In-Reply-To: <ua5u7h$2vbh4$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 112
Message-ID: <60wxM.144679$xMqa.81254@fx12.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 30 Jul 2023 12:14:58 -0400
X-Received-Bytes: 6440

by: Richard Damon - Sun, 30 Jul 2023 16:14 UTC

On 7/30/23 11:05 AM, Bart wrote:
> On 30/07/2023 15:27, Richard Damon wrote:
> > On 7/30/23 7:10 AM, Bart wrote:
> >> On 30/07/2023 04:52, Richard Damon wrote:
> >> > On 7/29/23 9:00 PM, Keith Thompson wrote:
> >>
> >> >> I haven't actually used UTF-8 string literals. As of C11/N1570,
> >> >> the standard is a bit vague about the difference.
> >> >>
> >> >> N1570 6.4.5p3:
> >> >>
> >> >>      A *character string literal* is a sequence of zero or more
> >> >>      multibyte characters enclosed in double-quotes, as in "xyz".
> >> >>      A *UTF-8 string literal* is the same, except prefixed by u8.
> >> >>
> >> >> p6:
> >> >>
> >> >>      For character string literals, the array elements have type
> >> >>      char, and are initialized with the individual bytes of the
> >> >>      multibyte character sequence. For UTF-8 string literals, the
> >> >>      array elements have type char, and are initialized with the
> >> >>      characters of the multibyte character sequence, as encoded
> >> >>      in UTF-8.
> >> >>
> >> >> I would guess that something like u8"\xff", which specifies an
> >> >> invalid UTF-8 sequence, would be invalid, but it doesn't seem to
> >> >> violate any constraint or syntax rule, and neither gcc nor clang
> >> >> complains about it.
> >> >>
> >> >> I guess that for a compiler that uses EBCDIC for source code, "x"
> >> >> would be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x"
> >> >> would be equivalent to "\x78".
> >> >>
> >> >> Either the standard is insufficiently clear, or I'm missing
> >> >> something. I definitely wouldn't bet against the latter.
> >> >>
> >> >
> >> > I think the key is that numerical constant specified characters are
> >> > specified in the "execution character encoding", which is
> >> > implementation defined, and doesn't need to be unicode.
> >> >
> >> > Thus u8"\0xFF" might generate the byte sequence 0xC3 0xDF 0x00 to
> >> > represent the Unicode character U+00FF
> >>
> >> (Do you mean \0xFF or \xFF ?)
> >>
> >> I thought these were limited to 2 hex digits, but apparently you can
> >> have as many as you like. However that looks to be ambiguous:
> >>
> >>      "\x20AC"
> >>
> >> Is that a 4-digit hex code, or is it a 2-digit one following by normal
> >> characters A and C?
> >
> > Yes, forgot that the 0 wasn't needed there.
> >
> > As for the ambiguity, the grammar is defined by "maximum munching" so
> > "\x20AC" has a single "character" which takes at least 14 bits to
> > represent (so might not be a valid character). If you want the latter,
> > you need to write:
>
> 20AC should be Unicode for €. I assumed your example took Unicode
> character codes, which need up to 21 bits, and turned them into UTF8
> sequences.
>

Someone earlier (not me) was asking about €. Note that that example
assumed € was a character defined in both the source and execution
character sets (which don't need to be Unicode).

> However when I tried \x20AC with gcc 10.x, it said the hex escape
> sequence was out of range, whether I used 'u8' or not.

20AC is € only if the "Execution Character Set" is defined to be UNICODE.

I don't think GCC defines its "Execution Character Set" as UNICODE, at
least not for "narrow" characters, so that value isn't valid.

If you want to specify a Unicode character, you would use "\u20AC".
Note that if the execution character set can't represent that character,
you might want u8"\u20AC" to get a UTF-8 string that represents it.

This goes back to the fact that C vastly predates Unicode being a common
format, and C wants to allow implementations to use whatever is
actually natural for that system. It ALLOWS, but doesn't REQUIRE, plain
strings to be a representation of Unicode. Later versions have made
understanding Unicode (as a possible additional representation) pretty
much a requirement for full conformance, but don't make plain char need
to be part of it.
>
>
> > "\x20""AC" so that the constant is terminated and then in a later
> > phase the strings are concatenated. (a trap for machine generated code
> > that isn't smart enough to handle the case.)
>
> Then it looks like "" can always be used to terminate a `\dd` sequence,
> unless another \ follows anyway.

Yes, because the first one ends the string that contained the \x escape
sequence, and then the second one starts another, and consecutive string
literals are automatically concatenated in a later phase.

Re: Unicode test suite

<ua6h03$30rf9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=47883&group=comp.lang.c#47883

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jameskuyper@alumni.caltech.edu (James Kuyper)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 16:25:39 -0400
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <ua6h03$30rf9$1@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 30 Jul 2023 20:25:39 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="be56355ec55916036e9c6f9e7720a028";
logging-data="3173865"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+jcKI527lPaJiEqmr7trza0nz8BjwAm1s="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:27VOcBncCKuDcz2BHzcPfbb6jV4=
Content-Language: en-US
In-Reply-To: <W7lxM.161741$qnnb.72702@fx11.iad>

by: James Kuyper - Sun, 30 Jul 2023 20:25 UTC

On 7/29/23 23:52, Richard Damon wrote:
> Thus u8"\0xFF" might generate the byte sequence 0xC3 0xDF 0x00 to
> represent the Unicode character U+00FF

The "\0" is an octal escape sequence representing the null character.
That makes the 'x' and the two 'F's into ordinary characters, not part
of an escape sequence. I'm sure that's not what you intended.

Re: Unicode test suite

<87sf95m6ws.fsf@nosuchdomain.example.com>

https://news.novabbs.org/devel/article-flat.php?id=47884&group=comp.lang.c#47884

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S.Thompson+u@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 14:22:11 -0700
Organization: None to speak of
Lines: 25
Message-ID: <87sf95m6ws.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me>
<87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me>
<87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="0f6fc3f0df3802232e8b4af3a2646543";
logging-data="3179515"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1812k6oDzGaWDFxSfscU+8i"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:6yVvPHH1Pdt9kgrgsoV28UGkbDk=
sha1:vA68Bmygg9I44OGDLkno1WqhezQ=

by: Keith Thompson - Sun, 30 Jul 2023 21:22 UTC

Bart <bc@freeuk.com> writes:
> On 30/07/2023 04:52, Richard Damon wrote:
[...]
>> Thus u8"\0xFF" might generate the byte sequence 0xC3 0xDF 0x00 to
>> represent the Unicode character U+00FF
>
> (Do you mean \0xFF or \xFF ?)

\xFF is correct.

> I thought these were limited to 2 hex digits, but apparently you can
> have as many as you like. However that looks to be ambiguous:
>
> "\x20AC"
>
> Is that a 4-digit hex code, or is it a 2-digit one following by normal
> characters A and C?

N1570 6.4.4.4 (or the corresponding section on character constants in
whatever standard or draft you have).

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<87o7jtm6ox.fsf@nosuchdomain.example.com>

https://news.novabbs.org/devel/article-flat.php?id=47885&group=comp.lang.c#47885

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S.Thompson+u@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 14:26:54 -0700
Organization: None to speak of
Lines: 16
Message-ID: <87o7jtm6ox.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me>
<87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me>
<87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me>
<7ruxM.289871$AsA.241687@fx18.iad> <ua5u7h$2vbh4$1@dont-email.me>
<60wxM.144679$xMqa.81254@fx12.iad>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="0f6fc3f0df3802232e8b4af3a2646543";
logging-data="3179515"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+epW6voYNd6O1BS2LI1vWx"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:0aOyXji0yoJG3sL46Eo1Nk9wrHo=
sha1:TbJH0n0jyE3JseFmzXXAY5y5P5o=

by: Keith Thompson - Sun, 30 Jul 2023 21:26 UTC

Richard Damon <Richard@Damon-Family.org> writes:
[...]
> I don't think GCC defines its "Execution Character Set" as UNICODE, at
> least not for "narrow" characters, so that value isn't valid.

It does. From the gcc manual (version 11.3.0):

'-fexec-charset=CHARSET'
Set the execution character set, used for string and character
constants. The default is UTF-8. CHARSET can be any encoding
supported by the system's 'iconv' library routine.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<86il9vgcum.fsf@linuxsc.com>

https://news.novabbs.org/devel/article-flat.php?id=48029&group=comp.lang.c#48029

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 04 Aug 2023 06:24:01 -0700
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <86il9vgcum.fsf@linuxsc.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com> <87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com> <u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com> <456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me> <432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com> <a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com> <ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com> <W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me> <7ruxM.289871$AsA.241687@fx18.iad> <ua5u7h$2vbh4$1@dont-email.me> <60wxM.144679$xMqa.81254@fx12.iad> <87o7jtm6ox.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="7a95792a6e880c00df8e418e5a126a38";
logging-data="1368544"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19P2LvCvaqNNCXrEeEg4TSH1mV0mtnUnPg="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:birgkj9msVhbLnXa9s3go0ms/aE=
sha1:FYH2dzLu/RfvhDYvk54ELbijIFk=

by: Tim Rentsch - Fri, 4 Aug 2023 13:24 UTC

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Richard Damon <Richard@Damon-Family.org> writes:
> [...]
>
>> I don't think GCC defines its "Execution Character Set" as UNICODE, at
>> least not for "narrow" characters, so that value isn't valid.
>
> It does. From the gcc manual (version 11.3.0):
>
> '-fexec-charset=CHARSET'
> Set the execution character set, used for string and character
> constants. The default is UTF-8. CHARSET can be any encoding
> supported by the system's 'iconv' library routine.

The name of the option is something of a misnomer. Unicode is a
character set. UTF-8 is a particular encoding of Unicode.[*] The
description even says as much: "CHARSET can be any encoding ...".
A C implementation could choose Unicode for its execution character
set, but choose a different encoding to represent it.

In any case, thank you, it is nice to know about the option.

[*] I suppose UTF-8 could be viewed as an encoding of any set of
integer values taken from the domain of Unicode code points, but
undoubtedly the most common is to represent Unicode characters.

Re: Unicode test suite

<86edkihiyp.fsf@linuxsc.com>

https://news.novabbs.org/devel/article-flat.php?id=48050&group=comp.lang.c#48050

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 04 Aug 2023 09:26:38 -0700
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <86edkihiyp.fsf@linuxsc.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com> <u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com> <87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com> <u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com> <456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me> <432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com> <a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com> <ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com> <W7lxM.161741$qnnb.72702@fx11.iad> <ua5gf0$2u9ha$1@dont-email.me> <7ruxM.289871$AsA.241687@fx18.iad> <ua5u7h$2vbh4$1@dont-email.me> <60wxM.144679$xMqa.81254@fx12.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="7a95792a6e880c00df8e418e5a126a38";
logging-data="1421845"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19kcajhjcJHVdfo1iOdCe2R97ucxHkoroo="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:igKYtUQBykAYuR+yEOr4NmlIa9o=
sha1:GSW5knFPBRoxquGcXLBrSPYEetU=

by: Tim Rentsch - Fri, 4 Aug 2023 16:26 UTC

Richard Damon <Richard@Damon-Family.org> writes:

> On 7/30/23 11:05 AM, Bart wrote:

[concerning a UTF-8 string literal u8"\x20AC"]

>> However when I tried \x20AC with gcc 10.x, it said the hex escape
>> sequence was out of range, whether I used 'u8' or not.
>
> [u8"\x20AC" is (something)] only if the "Execution Character Set"
> is defined to be UNICODE.

I believe this case is simply an oversight in the C standard.
Starting near the end of 2019, the C2x draft clearly establishes
that octal and hexadecimal escape sequences for u8 string literals
(and also u8 character constants) must have a value in the range of
unsigned char. The string literal u8"\x20AC" is meant to be a
constraint violation (and is one in C2x drafts)[*]. That result
doesn't depend on what the execution character set is.

[*] Assuming CHAR_BIT == 8, of course.
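
As a hedged illustration of the rule described above, the following C sketch shows two spellings of U+20AC in a u8 string literal that stay within the rule: a universal character name, and three hex escapes that each fit in an unsigned char. The array and function names are made up for this example. The arrays use element type unsigned char so that the initializers work whether the literal's own element type is char (C11) or char8_t (C23).

```c
#include <assert.h>
#include <string.h>

/* U+20AC written as a universal character name; the compiler
   produces the UTF-8 bytes. */
static const unsigned char by_ucn[] = u8"\u20AC";

/* The same three UTF-8 bytes written out one hex escape at a time.
   Each escape value fits in an unsigned char. */
static const unsigned char by_hex[] = u8"\xE2\x82\xAC";

/* The spelling discussed in the thread is left commented out, since a
   single hex escape with value 0x20AC does not fit in an unsigned char
   (assuming CHAR_BIT == 8):

   static const unsigned char bad[] = u8"\x20AC";
*/

/* Three UTF-8 bytes plus the terminating null. */
static_assert(sizeof by_ucn == 4, "expected 3 bytes plus terminator");

/* Returns 1 if both spellings produced identical arrays. */
int euro_spellings_agree(void)
{
    return sizeof by_ucn == sizeof by_hex
        && memcmp(by_ucn, by_hex, sizeof by_ucn) == 0;
}
```

A hex escape names one code unit, not one code point, which is why a character outside ASCII needs either a \u escape or one \x escape per UTF-8 byte.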

Re: Unicode test suite

<86a5v6h8k2.fsf@linuxsc.com>

https://news.novabbs.org/devel/article-flat.php?id=48071&group=comp.lang.c#48071

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 04 Aug 2023 13:11:25 -0700
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <86a5v6h8k2.fsf@linuxsc.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com> <sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com> <u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com> <87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com> <u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com> <456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me> <432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com> <a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com> <ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="7a95792a6e880c00df8e418e5a126a38";
logging-data="1487723"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+21nM0KCkK4/FE2ik0QVhOo29k8QIlh+k="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:4vWKXwSq9DCVPwOdb5t0RG7IH3Y=
sha1:OKkwcEXhEabkexyTRMn7UiHXH8s=

by: Tim Rentsch - Fri, 4 Aug 2023 20:11 UTC

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

[.. utf-8 string literals ..]

> I haven't actually used UTF-8 string literals. As of C11/N1570,
> the standard is a bit vague about the difference.
>
> N1570 6.4.5p3:
>
> A *character string literal* is a sequence of zero or more
> multibyte characters enclosed in double-quotes, as in "xyz".
> A *UTF-8 string literal* is the same, except prefixed by u8.
>
> p6:
>
> For character string literals, the array elements have type
> char, and are initialized with the individual bytes of the
> multibyte character sequence. For UTF-8 string literals, the
> array elements have type char, and are initialized with the
> characters of the multibyte character sequence, as encoded in
> UTF-8.
>
> I would guess that something like u8"\xff", which specifies an invalid
> UTF-8 sequence, would be invalid, but it doesn't seem to violate any
> constraint or syntax rule, and neither gcc nor clang complains about it.

In C11, the source string is converted to the execution character
set, which is then represented using a UTF-8 encoding. 5.1.1.2 p5
in N1570 says:

    Each source character set member and escape sequence in
    character constants and string literals is converted to the
    corresponding member of the execution character set; if
    there is no corresponding member, it is converted to an
    implementation-defined member other than the null (wide)
    character.

Note: "Each source character set member and escape sequence" is
converted to the corresponding member of the execution character
set. I don't see anything that changes that rule for UTF-8
string literals.

I think this behavior is changing in C23, but I don't know
what the new wording means.
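
For readers who want to check the u8"\xff" case from the quoted text for themselves, here is a minimal sketch of a UTF-8 well-formedness test in C. The function name utf8_is_well_formed and the array name suspect are hypothetical. The sketch assumes the compiler stores the escape as the single byte 0xFF, which is what the quoted observation about gcc and clang suggests but which the code itself does not verify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The literal from the quoted text. The element type is unsigned char
   so the initializer works in both C11 and C23. */
const unsigned char suspect[] = u8"\xff";

/* Returns 1 if the n bytes at s form well-formed UTF-8, else 0.
   Rejects stray continuation bytes, truncated sequences, overlong
   forms, surrogates, and values above U+10FFFF. */
int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        uint32_t cp, min;

        if (b < 0x80) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1Fu; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0Fu; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07u; min = 0x10000;
        } else {
            return 0;                   /* 0x80..0xBF or 0xF8..0xFF as lead byte */
        }
        if (n - i < len)
            return 0;                   /* sequence cut short */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3Fu);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                   /* overlong, out of range, or surrogate */
        i += len;
    }
    return 1;
}
```

Calling utf8_is_well_formed(suspect, sizeof suspect - 1) reports whether the array the compiler actually produced is well-formed UTF-8. The byte 0xFF can never begin or continue a UTF-8 sequence, so an array holding that byte fails the check even though the compiler accepted the literal.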

Re: Unicode test suite

<865y5uh1eg.fsf@linuxsc.com>

https://news.novabbs.org/devel/article-flat.php?id=48079&group=comp.lang.c#48079

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 04 Aug 2023 15:45:59 -0700
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <865y5uh1eg.fsf@linuxsc.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com> <sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com> <u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com> <87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com> <u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com> <456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me> <432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com> <a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com> <AZgxM.198095$U3w1.28759@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="943f511331d2b471aa1b6ebff38979e9";
logging-data="1528864"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1947bfexqnCFoHk7rzLGr7+pte+YlsvtN0="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:He5lTbnKvtbZrGPCsVVLln3EOwA=
sha1:bYQ4b/yFWxwgoIvkx2RWspL3rRE=

by: Tim Rentsch - Fri, 4 Aug 2023 22:45 UTC

Richard Damon <Richard@Damon-Family.org> writes:

> As I understand it, a u8"string" will convert the string from its
> source character set (which might be set to Latin-1) into UTF-8,
> while a plain "string" will convert it to what is defined as the
> "execution" character set, which might be defined as the same as
> the input set of Latin-1.

For both u8-prefixed string literals and unprefixed string
literals, the source characters are first converted to the
corresponding members of the execution character set (or to
non-null implementation-defined members of the ECS in cases
where there is no corresponding member).

For u8-prefixed string literals, the sequence of execution
character set characters is then turned into a sequence of
bytes using a UTF-8 representation.

The execution character set is just a set. How characters
are represented as a sequence of multibyte characters is
independent of what characters are in the set.

Notice that source character set characters are converted into
their corresponding execution set members in translation phase 5,
per 5.1.1.2 paragraph 5, and the array value for the string
literal with a UTF-8 encoding is produced in translation phase 7,
per 6.4.5 paragraph 6 (references from N1570). Conversion to the
execution character set always comes first, for both u8-prefixed
string literals and unprefixed string literals.
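
One way to observe the difference between the two kinds of literal is to dump the bytes of each. The C sketch below is an illustration under assumptions, not a statement of what any particular compiler does: hex_dump and the two array names are invented for this example. The bytes of the unprefixed literal depend on the implementation's encoding for ordinary string literals (with GCC this can be changed through the -fexec-charset option).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* U+00E9 (e with acute accent), once unprefixed and once u8-prefixed.
   The element type is unsigned char so both initializers are valid in
   C11 and C23. */
const unsigned char plain_e_acute[] = "\u00E9";   /* bytes are implementation-dependent */
const unsigned char utf8_e_acute[]  = u8"\u00E9"; /* UTF-8: C3 A9 */

/* Writes the n bytes at bytes into out as uppercase hex pairs separated
   by single spaces, e.g. "C3 A9". Stops early if out is too small.
   Returns the number of characters written, excluding the terminator. */
size_t hex_dump(const unsigned char *bytes, size_t n,
                char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, out_size - used, "%s%02X",
                         i ? " " : "", (unsigned)bytes[i]);
        if (w < 0 || (size_t)w >= out_size - used) {
            out[used] = '\0';   /* drop the partial item */
            break;
        }
        used += (size_t)w;
    }
    return used;
}
```

Dumping utf8_e_acute (without its terminator) gives "C3 A9". Dumping plain_e_acute gives the same on an implementation whose ordinary literals are UTF-8 encoded, and would be expected to give "E9" on one using Latin-1.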

Subject	Author
Unicode test suite	Malcolm McLean
Unicode test suite	Spiros Bousbouras
Unicode test suite	Malcolm McLean
Unicode test suite	Bart
Unicode test suite	Scott Lurndal
Unicode test suite	Malcolm McLean
Unicode test suite	Keith Thompson
Unicode test suite	Malcolm McLean
Unicode test suite	Kaz Kylheku
Unicode test suite	Keith Thompson
Unicode test suite	jak
Unicode test suite	Keith Thompson
Unicode test suite	Malcolm McLean
Unicode test suite	Bart
Unicode test suite	Malcolm McLean
Unicode test suite	Kaz Kylheku
Unicode test suite	Keith Thompson
Unicode test suite	Malcolm McLean
Unicode test suite	Bart
Unicode test suite	Ben Bacarisse
Unicode test suite	Keith Thompson
Unicode test suite	Bart
Unicode test suite	Kaz Kylheku
Unicode test suite	Keith Thompson
Unicode test suite	Kaz Kylheku
Unicode test suite	Richard Damon
Unicode test suite	Bart
Unicode test suite	Richard Damon
Unicode test suite	Bart
Unicode test suite	Richard Damon
Unicode test suite	Keith Thompson
Unicode test suite	Tim Rentsch
Unicode test suite	Tim Rentsch
Unicode test suite	Malcolm McLean
Unicode test suite	Keith Thompson
Unicode test suite	James Kuyper
Unicode test suite	Tim Rentsch
Unicode test suite	Malcolm McLean
Unicode test suite	Kaz Kylheku
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	fir
Unicode test suite	Richard Damon
Unicode test suite	Tim Rentsch
Unicode test suite	Kaz Kylheku
Unicode test suite	Malcolm McLean
Unicode test suite	Kaz Kylheku
Unicode test suite	Malcolm McLean
Unicode test suite	Kaz Kylheku
Unicode test suite	fir
Unicode test suite	Malcolm McLean
Unicode test suite	fir
Unicode test suite	Malcolm McLean
Unicode test suite	fir
Unicode test suite	Ben Bacarisse
Unicode test suite	Malcolm McLean