"It runs like _x, where _x is something unsavory" -- Prof. Romas Aleliunas, CS 435

devel / comp.lang.c / Re: C vs Haskell for XML parsing

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>On Sat, 19 Aug 2023 14:48:08 +0000, Scott Lurndal wrote:
>
>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>>On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
>>>>
>>>> It turns out that if you want to be 100% conforming you need to be able
>>>> to detect both UCS-4 and (eye roll) EBCDIC.
>>>>
>>>I had a go at ECBDIC.
>>>
>>>If anyone has an EBCDIC XML file they'd like to test, please post a link.
>>
>> Here's one:
>[snip]
>
>And that's an excellent illustration of my point about some EBCDIC
>charactersets lacking the necessary characters to properly express XML.
>
>Here are the first four lines of the ASCII equivalent of that message,
>as generated by
> dd if=ebcdic.msg of=ascii.msg conv=ascii
>where
> conv=ascii
>will convert "from EBCDIC to ASCII" (dd(1) manpage)
>
>Note the (translated) format of the DOCTYPE entities
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet href="one_register.xsl" type="text/xsl" ?>
> <|DOCTYPE registers SYSTEM "registers.dtd">
> <|-- Copyright (c) 2010-2014 ARM Limited. All rights reserved. -->
>
>Apparently, you used a variant of EBCDIC that includes an exclamation mark
>at codepoint 0x4f; dd uses EBCDIC-US which, at codepoint 0x4f encodes
>a "VERTICAL LINE"
>

Actually, I used 'dd' on an old Fedora Core install.

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>scott@slp53.sl.home (Scott Lurndal) writes:
>
>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>>On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
>>>>
>>>> It turns out that if you want to be 100% conforming you need to be able
>>>> to detect both UCS-4 and (eye roll) EBCDIC.
>>>>
>>>I had a go at ECBDIC.
>>>
>>>If anyone has an EBCDIC XML file they'd like to test, please post a link.
>>
>> Here's one:
>>
>> Lo...
>
><EBCDIC-encoded XML deleted>
>
>Is that legal? I thought an EBCDIC XML file must give the correct
>encoding in the XML declaration. xmllint rejects it unless I edit the
>declaration.

As Lew pointed out, it was not properly specified: I had cheated and
encoded (using dd) an existing XML file (from the public ARM AArch64
SysReg XML).
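
For readers following the detection point quoted above, here is a minimal sketch of guessing the encoding family from the first four bytes of an XML entity, along the lines of the non-normative Appendix F of the XML 1.0 specification. The names `xml_family` and `guess_xml_family` are invented for illustration and are not taken from any parser discussed in this thread. Only the most common byte patterns are covered, and the result is only a family: as the 0x4f discrepancy shows, EBCDIC variants differ, so the encoding declaration still has to be read afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encoding families that can be told apart from the leading bytes. */
enum xml_family {
    XML_FAMILY_UNKNOWN,
    XML_FAMILY_UCS4,
    XML_FAMILY_UTF16,
    XML_FAMILY_UTF8_OR_ASCII,
    XML_FAMILY_EBCDIC
};

/* Hypothetical helper: guess the family from the first four bytes. */
static enum xml_family guess_xml_family(const unsigned char *buf, size_t len)
{
    static const struct {
        unsigned char bytes[4];
        enum xml_family family;
    } patterns[] = {
        {{0x00, 0x00, 0xFE, 0xFF}, XML_FAMILY_UCS4},          /* BOM, big-endian */
        {{0xFF, 0xFE, 0x00, 0x00}, XML_FAMILY_UCS4},          /* BOM, little-endian */
        {{0x00, 0x00, 0x00, 0x3C}, XML_FAMILY_UCS4},          /* '<', big-endian */
        {{0x3C, 0x00, 0x00, 0x00}, XML_FAMILY_UCS4},          /* '<', little-endian */
        {{0x00, 0x3C, 0x00, 0x3F}, XML_FAMILY_UTF16},         /* "<?", big-endian */
        {{0x3C, 0x00, 0x3F, 0x00}, XML_FAMILY_UTF16},         /* "<?", little-endian */
        {{0x3C, 0x3F, 0x78, 0x6D}, XML_FAMILY_UTF8_OR_ASCII}, /* "<?xm" */
        {{0x4C, 0x6F, 0xA7, 0x94}, XML_FAMILY_EBCDIC}         /* "<?xm" in EBCDIC */
    };
    size_t i;

    assert(buf != NULL);
    if (len < 4)
        return XML_FAMILY_UNKNOWN;

    for (i = 0; i < sizeof patterns / sizeof patterns[0]; i++)
        if (memcmp(buf, patterns[i].bytes, 4) == 0)
            return patterns[i].family;

    /* Shorter BOMs, checked only after the four-byte UCS-4 BOMs above. */
    if ((buf[0] == 0xFE && buf[1] == 0xFF) || (buf[0] == 0xFF && buf[1] == 0xFE))
        return XML_FAMILY_UTF16;
    if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return XML_FAMILY_UTF8_OR_ASCII;

    return XML_FAMILY_UNKNOWN;
}
```

The table uses numeric byte values throughout, so the function behaves the same whatever the execution character set of the compiler is.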

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
>>
>> It turns out that if you want to be 100% conforming you need to be able
>> to detect both UCS-4 and (eye roll) EBCDIC.
>>
> I had a go at ECBDIC.
>
> If anyone has an EBCDIC XML file they'd like to test, please post a
> link.

You can make your own by (a) setting the encoding="..." attribute in the
declaration (EBCDIC-INT is a good one) and then (b) running iconv.

> Of course the next challenge is to support ECBDIC as the execution
> character set. This means all the if (ch == '<') statements have to
> come out and be replaced by if (ch == ASCII_LESSTHEN). And the strings
> have to be replaced with hex codes.

Do you have a user who wants to compile your program on a system that
does not support ASCII C source?

> Here's where the Baby X resource compiler shows its power. Simply set
> up the input
> <BabyXRC>
> <utf8 name="cdata"><CDATA</utf8>
> </BabyXRC>

You've lost me. That does not parse.

> And so on, and you get all the strings in hex-encoded UTF-8, ready to
> cut and paste.

What strings? And why hex -- nothing in the XML suggests hex? I
usually want UTF-8 strings as UTF-8 strings in the source, but I
understand your user base does not include me.

--
Ben.

On Saturday, 19 August 2023 at 22:31:28 UTC+1, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> > On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
> >>
> >> It turns out that if you want to be 100% conforming you need to be able
> >> to detect both UCS-4 and (eye roll) EBCDIC.
> >>
> > I had a go at ECBDIC.
> >
> > If anyone has an EBCDIC XML file they'd like to test, please post a
> > link.
> You can make your own by (a) setting the encoding="..." attribute in the
> declaration (EBCDIC-INT is a good one) and then running iconv.
> > Of course the next challenge is to support ECBDIC as the execution
> > character set. This means all the if (ch == '<') statements have to
> > come out and be replaced by if (ch == ASCII_LESSTHEN). And the strings
> > have to be replaced with hex codes.
> Do you have a user who wants to compile your program on a system that
> does not support ASCII C source?
>
Who knows. The code is publicly available to whoever wants it.
The problem with this model is that, unless the user chooses to get back to
you, you've no idea who he is, or how he is using the code, or if he has
any problems with it. Unlike paying customers who usually leave their
details, and are likely to complain if they don't get what they wanted.

But if the XML parser is to support EBCDIC input, then I'd expect that
an EBCDIC-interested user might also want to compile under a system
where the execution character set is EBCDIC. However he'll get UTF-8
output, which is probably not what he wants.

I'd need an EBCDIC C compiler to test it.
>
> > Here's where the Baby X resource compiler shows its power. Simply set
> > up the input
> > <BabyXRC>
> > <utf8 name="cdata"><CDATA</utf8>
> > </BabyXRC>
> You've lost me. That does not parse.
> > And so on, and you get all the strings in hex-encoded UTF-8, ready to
> > cut and paste.
> What strings? And why hex -- nothing in the XML suggests hex? I
> usually want UTF-8 strings as UTF-8 strings in the source, but I
> understand your user base does not include me.
>
XML documents contain a tag called "CDATA". So the natural thing is
to write
if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */

This will work on a program which accepts data in the execution character
set and only in the execution character set. However the XML parser
accepts data in ASCII, UTF-8, UTF-16 (two flavours) and, now, EBCDIC.
It does this by converting to a common format via a conversion function
passed to the lexer, and the common format is UTF-8.

So "tag" will be in UTF-8. If the execution character set is ASCII, then
the comparison will still work, and that is what I have done. But if it is
EBCDIC, it will fail.

Instead we need to write

/* CDATA in UTF-8 */
char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:
if (!strcmp(tag, cdata)) /* check for CDATA and process it */

This is where the Baby X resource compiler comes to our rescue. It will
convert ASCII to that form, with the utf-8 tag.
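
A minimal sketch of the single-character half of this idea, assuming the lexer hands over UTF-8 bytes as described above. The constant and function names are invented for illustration (the post's ASCII_LESSTHEN plays the same role) and are not taken from the actual parser.

```c
#include <assert.h>

/* Hypothetical names for the code points the lexer cares about.
   The values are those of ASCII and UTF-8, not of the execution
   character set of whatever compiler builds this file. */
enum {
    UTF8_LESS_THAN    = 0x3C,
    UTF8_GREATER_THAN = 0x3E,
    UTF8_AMPERSAND    = 0x26
};

/* Returns nonzero if ch, one byte of the lexer's UTF-8 output, opens markup. */
static int opens_markup(unsigned char ch)
{
    return ch == UTF8_LESS_THAN;
}

/* Small self-check: 0x4C is '<' in EBCDIC and must not match. */
static void opens_markup_self_check(void)
{
    assert(opens_markup(0x3C));
    assert(!opens_markup(0x4C));
}
```

With an EBCDIC execution character set the character constant '<' has the value 0x4C, so a comparison written as ch == '<' would never match the 0x3C byte that arrives from the UTF-8 stream. The numeric constant avoids that.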

On 8/20/23 1:04 AM, Malcolm McLean wrote:
> On Saturday, 19 August 2023 at 22:31:28 UTC+1, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>>> On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
>>>>
>>>> It turns out that if you want to be 100% conforming you need to be able
>>>> to detect both UCS-4 and (eye roll) EBCDIC.
>>>>
>>> I had a go at ECBDIC.
>>>
>>> If anyone has an EBCDIC XML file they'd like to test, please post a
>>> link.
>> You can make your own by (a) setting the encoding="..." attribute in the
>> declaration (EBCDIC-INT is a good one) and then running iconv.
>>> Of course the next challenge is to support ECBDIC as the execution
>>> character set. This means all the if (ch == '<') statements have to
>>> come out and be replaced by if (ch == ASCII_LESSTHEN). And the strings
>>> have to be replaced with hex codes.
>> Do you have a user who wants to compile your program on a system that
>> does not support ASCII C source?
>>
> Who knows. The code is publicly available to whoever wants it.
> The problem with this model is that, unless the user chooses to get back to
> you, you've no idea who he is, or how he is using the code, or if he has
> any problems with it. Unlike paying customers who usually leave their
> details, and are likely to complain if they don't get what they wanted.
>
> But if the XML parser is to support EBCDIC input, then I'd expect that
> an EBCDIC-interested user might also want to compile under a system
> where the execution character set is EBCDIC. However he'll get UTF-8
> output, which is probably not what he wants.
>
> I'd need a EBCDIC C compiler to test it.
>>
>>> Here's where the Baby X resource compiler shows its power. Simply set
>>> up the input
>>> <BabyXRC>
>>> <utf8 name="cdata"><CDATA</utf8>
>>> </BabyXRC>
>> You've lost me. That does not parse.
>>> And so on, and you get all the strings in hex-encoded UTF-8, ready to
>>> cut and paste.
>> What strings? And why hex -- nothing in the XML suggests hex? I
>> usually want UTF-8 strings as UTF-8 strings in the source, but I
>> understand your user base does not include me.
>>
> XML documents contain a tag called "CDATA". So the natural thing is
> to write
> if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */
>
> This will work on a program which accepts data in the execution character
> set and only in the execution character set. However the XML parser
> accepts data in ASCII, UTF-8, UTF-16 (two flavours) and, now, EBCDIC.
> It does this by converting to a common format via a conversion function
> passed to the lexer, and the common format is UTF-8.
>
> So "tag" will be in UTF-8. If the execution character set is ASCII, then
> the comparison will still work, and that is what I have done. But if it is
> EBCDIC, it will fail.
>
> Instead we need to write
>
> /* CDATA in UTF-8 */
> char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:
>
> if (!strcmp(tag, cdata)) /* check for CDATA and process it */
>
> This is where the Baby X resource compiler comes to our rescue. It will
> convert ASCII to that form, with the utf-8 tag.

Why not just write u8"CDATA" instead?

u8 strings are always UTF-8 encoded, no matter what the execution
character set is.
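
A minimal sketch of that suggestion, assuming a C11 or later compiler. The helper name `is_cdata_name` is invented for illustration. The cast is there only so that the same line compiles both where u8 literals have type char[] (C11, C17) and where they have type char8_t[] (C23).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: tag is assumed to be a NUL-terminated UTF-8 string
   produced by the lexer. */
static int is_cdata_name(const char *tag)
{
    assert(tag != NULL);
    /* u8"CDATA" is stored as the bytes 43 44 41 54 41 00 regardless of the
       execution character set. */
    return strcmp(tag, (const char *)u8"CDATA") == 0;
}
```

The comparison is for equality only, so the sign of char does not matter here.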

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Saturday, 19 August 2023 at 22:31:28 UTC+1, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>> > On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
>> >>
>> >> It turns out that if you want to be 100% conforming you need to be able
>> >> to detect both UCS-4 and (eye roll) EBCDIC.
>> >>
>> > I had a go at ECBDIC.
>> >
>> > If anyone has an EBCDIC XML file they'd like to test, please post a
>> > link.
>> You can make your own by (a) setting the encoding="..." attribute in the
>> declaration (EBCDIC-INT is a good one) and then running iconv.
>> > Of course the next challenge is to support ECBDIC as the execution
>> > character set. This means all the if (ch == '<') statements have to
>> > come out and be replaced by if (ch == ASCII_LESSTHEN). And the strings
>> > have to be replaced with hex codes.
>> Do you have a user who wants to compile your program on a system that
>> does not support ASCII C source?
>>
> Who knows. The code is publicly available to whoever wants it.

It was a somewhat rhetorical question. EBCDIC data is, I would venture,
far more common than non-ASCII C compilers.

>> > Here's where the Baby X resource compiler shows its power. Simply set
>> > up the input
>> > <BabyXRC>
>> > <utf8 name="cdata"><CDATA</utf8>
>> > </BabyXRC>
>> You've lost me. That does not parse.

Without a parse for that supposed document, I can't work out what you
are saying. You refer to XML CDATA sections below, but <CDATA is not
such a section.

>> > And so on, and you get all the strings in hex-encoded UTF-8, ready to
>> > cut and paste.
>> What strings? And why hex -- nothing in the XML suggests hex? I
>> usually want UTF-8 strings as UTF-8 strings in the source, but I
>> understand your user base does not include me.
>>
> XML documents contain a tag called "CDATA".

No, CDATA sections are not tags, nor is the syntax, <![CDATA[, that of a
tag.

> So the natural thing is
> to write
> if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */

I can't stop you parsing <![CDATA[ and putting "CDATA" in a variable tag,
but it's not a good name.

> This will work on a program which accepts data in the execution character
> set and only in the execution character set. However the XML parser
> accepts data in ASCII, UTF-8, UTF-16 (two flavours) and, now, EBCDIC.
> It does this by converting to a common format via a conversion function
> passed to the lexer, and the common format is UTF-8.

Er... yes. I don't see how this is going to explain the bit that had me
perplexed but I'll keep reading.

> So "tag" will be in UTF-8. If the execution character set is ASCII, then
> the comparison will still work, and that is what I have done. But if it is
> EBCDIC, it will fail.

Use u8"CDATA"?

> Instead we need to write
>
> /* CDATA in UTF-8 */
> char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:

I think you mean {0x43, 0x44, 0x41, 0x54, 0x41, 0x00};

> if (!strcmp(tag, cdata)) /* check for CDATA and process it */

You don't /need/ to, but it's one way.

> This is where the Baby X resource compiler comes to our rescue. It will
> convert ASCII to that form, with the utf-8 tag.

It comes into its own for people using EBCDIC for C source code?
That's a tiny user base. I am now completely lost. On other machines,
converting ASCII to UTF-8 is a no-op.

--
Ben.

Re: C vs Haskell for XML parsing

<610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=30098&group=comp.lang.c#30098

Newsgroups: comp.lang.c
Date: Sun, 20 Aug 2023 11:20:18 -0700 (PDT)
In-Reply-To: <87jztpu2iu.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:c4f4:8fc2:f241:72e0;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:c4f4:8fc2:f241:72e0
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk> <37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk> <250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk> <cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk> <7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
Subject: Re: C vs Haskell for XML parsing
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Sun, 20 Aug 2023 18:20:20 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 6206

by: Malcolm McLean - Sun, 20 Aug 2023 18:20 UTC

On Sunday, 20 August 2023 at 17:01:14 UTC+1, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> > On Saturday, 19 August 2023 at 22:31:28 UTC+1, Ben Bacarisse wrote:
> >> Malcolm McLean <malcolm.ar...@gmail.com> writes:
> >>
> >> > On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
> >> >>
> >> >> It turns out that if you want to be 100% conforming you need to be able
> >> >> to detect both UCS-4 and (eye roll) EBCDIC.
> >> >>
> >> > I had a go at ECBDIC.
> >> >
> >> > If anyone has an EBCDIC XML file they'd like to test, please post a
> >> > link.
> >> You can make your own by (a) setting the encoding="..." attribute in the
> >> declaration (EBCDIC-INT is a good one) and then running iconv.
> >> > Of course the next challenge is to support ECBDIC as the execution
> >> > character set. This means all the if (ch == '<') statements have to
> >> > come out and be replaced by if (ch == ASCII_LESSTHEN). And the strings
> >> > have to be replaced with hex codes.
> >> Do you have a user who wants to compile your program on a system that
> >> does not support ASCII C source?
> >>
> > Who knows. The code is publicly available to whoever wants it.
> It was a somewhat rhetorical question. EBCDIC data is, I would venture,
> far more common that non-ASCII C compilers.
> >> > Here's where the Baby X resource compiler shows its power. Simply set
> >> > up the input
> >> > <BabyXRC>
> >> > <utf8 name="cdata"><CDATA</utf8>
> >> > </BabyXRC>
> >> You've lost me. That does not parse.
> Without a parse for that supposed document, I can't work out what you
> are saying. You refer to XML CDATA sections below, but <CDATA is not
> such a section.
> >> > And so on, and you get all the strings in hex-encoded UTF-8, ready to
> >> > cut and paste.
> >> What strings? And why hex -- nothing in the XML suggests hex? I
> >> usually want UTF-8 strings as UTF-8 strings in the source, but I
> >> understand your user base does not include me.
> >>
> > XML documents contain a tag called "CDATA".
> No, CDATA sections are not tags, not is the syntax, <[CDATA[, that or a
> tag.
> > So the natural thing is
> > to write
> > if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */
> I can't stop you parsing <[CDATA[ and putting "CDATA" in a variable tag,
> but it's not a good name.
>
Oh, there's a typo. A stray '<'. So of course the load would fail. That's why I'm writing
another XML parser. The main motive is to get better error reports.
>
> > This will work on a program which accepts data in the execution character
> > set and only in the execution character set. However the XML parser
> > accepts data in ASCII, UTF-8, UTF-16 (two flavours) and, now, EBCDIC.
> > It does this by converting to a common format via a conversion function
> > passed to the lexer, and the common format is UTF-8.
> Er... yes. I don't see how this us going to explain the bit that had me
> perplexed but I'll keep reading.
> > So "tag" will be in UTF-8. If the execution character set is ASCII, then
> > the comparison will still work, and that is what I have done. But if it is
> > EBCDIC, it will fail.
> Use u8"CDATA"?
>
Apparently it opens a can of worms because it makes the string a char8_t * instead
of a char *.
> > Instead we need to write
> >
> > /* CDATA in UTF-8 */
> > char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:
> I think you mean {0x43, 0x44, 0x41, 0x54, 0x41, 0x00};
> > if (!strcmp(tag, cdata)) /* check for CDATA and process it */
> You don't /need/ to, but it's one way.
> > This is where the Baby X resource compiler comes to our rescue. It will
> > convert ASCII to that form, with the utf-8 tag.
> It comes into it's own for people using EBCDIC for C source code?
> That's a tiny user base. I am now completely lost. On other machines,
> converting ASCII to UTF-8 is a no-op.
>
Yes. Instead of converting strings to UTF-8 by hand, which is error prone, the
Baby X resource compiler will do it for you automatically. The "utf8" tag says
"output this string as a hex dump in UTF-8 format".
As you say, on ASCII machines it tends not to be much of an issue if the UTF-8
string is in the common subset of ASCII and UTF-8, because the encoding is
also the same. It's only important if you need the extended UTF-8 characters
but your source character set is strictly ASCII only.
The number of people with EBCDIC C compilers is very small. But they tend to be
dealing with machines worth many millions of pounds and data of incalculably
high value.

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
[...]
> The number of people with EBCDIC C compilers is very small. But they tend to be
> dealing with machines worth many millions of pounds and data of incalculably
> high value.

And, I suspect, they have a lot of experience converting data to and
from EBCDIC -- or they do all their work on EBCDIC-based systems and
don't need to convert anything.

I'm only guessing, but I suspect the intersection of people who could
use BabyX and people who use EBCDIC is small, possibly empty.

The XML specification <https://www.w3.org/TR/xml/> does discuss EBCDIC,
but if BabyX's XML processor didn't handle EBCDIC, I'd be surprised if
anyone were inconvenienced.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Sunday, 20 August 2023 at 17:01:14 UTC+1, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>> > On Saturday, 19 August 2023 at 22:31:28 UTC+1, Ben Bacarisse wrote:
>> >> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>> >>
>> >> > On Saturday, 19 August 2023 at 00:15:25 UTC+1, Ben Bacarisse wrote:
>> >> >>
>> >> >> It turns out that if you want to be 100% conforming you need to be able
>> >> >> to detect both UCS-4 and (eye roll) EBCDIC.
>> >> >>
>> >> > I had a go at ECBDIC.
>> >> >
>> >> > If anyone has an EBCDIC XML file they'd like to test, please post a
>> >> > link.
>> >> You can make your own by (a) setting the encoding="..." attribute in the
>> >> declaration (EBCDIC-INT is a good one) and then running iconv.
>> >> > Of course the next challenge is to support ECBDIC as the execution
>> >> > character set. This means all the if (ch == '<') statements have to
>> >> > come out and be replaced by if (ch == ASCII_LESSTHEN). And the strings
>> >> > have to be replaced with hex codes.
>> >> Do you have a user who wants to compile your program on a system that
>> >> does not support ASCII C source?
>> >>
>> > Who knows. The code is publicly available to whoever wants it.
>> It was a somewhat rhetorical question. EBCDIC data is, I would venture,
>> far more common that non-ASCII C compilers.
>> >> > Here's where the Baby X resource compiler shows its power. Simply set
>> >> > up the input
>> >> > <BabyXRC>
>> >> > <utf8 name="cdata"><CDATA</utf8>
>> >> > </BabyXRC>
>> >> You've lost me. That does not parse.
>> Without a parse for that supposed document, I can't work out what you
>> are saying. You refer to XML CDATA sections below, but <CDATA is not
>> such a section.
>> >> > And so on, and you get all the strings in hex-encoded UTF-8, ready to
>> >> > cut and paste.
>> >> What strings? And why hex -- nothing in the XML suggests hex? I
>> >> usually want UTF-8 strings as UTF-8 strings in the source, but I
>> >> understand your user base does not include me.
>> >>
>> > XML documents contain a tag called "CDATA".
>> No, CDATA sections are not tags, not is the syntax, <[CDATA[, that or a
>> tag.
>> > So the natural thing is
>> > to write
>> > if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */
>> I can't stop you parsing <[CDATA[ and putting "CDATA" in a variable tag,
>> but it's not a good name.
>>
> Oh, there's a typo. A stray '<'. So of course the load would
> fail. That's why I'm writing another XML parser. The main motive is to
> get better error reports.

So you intended to write

<utf8 name="cdata">CDATA</utf8>

and CDATA had nothing to do with XML CDATA sections. What's the
reference to "a tag called 'CDATA'" then?

>> > This will work on a program which accepts data in the execution character
>> > set and only in the execution character set. However the XML parser
>> > accepts data in ASCII, UTF-8, UTF-16 (two flavours) and, now, EBCDIC.
>> > It does this by converting to a common format via a conversion function
>> > passed to the lexer, and the common format is UTF-8.
>> Er... yes. I don't see how this us going to explain the bit that had me
>> perplexed but I'll keep reading.
>> > So "tag" will be in UTF-8. If the execution character set is ASCII, then
>> > the comparison will still work, and that is what I have done. But if it is
>> > EBCDIC, it will fail.
>> Use u8"CDATA"?
>>
> Apparently it opens a can of worms because it makes the string a
> char8_t * instead of a char *.

Is the can of worms really there? The "apparently" makes me worry it's
hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
or unsigned char *. Where are the worms?

But if I were writing such a program, I'd put effort into allowing
different C standard outputs rather than dealing with EBCDIC source
code. Someone who uses C23 will prefer

char8_t *cdata = u8"CDATA";

>> > Instead we need to write
>> >
>> > /* CDATA in UTF-8 */
>> > char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:
>> I think you mean {0x43, 0x44, 0x41, 0x54, 0x41, 0x00};
>> > if (!strcmp(tag, cdata)) /* check for CDATA and process it */
>> You don't /need/ to, but it's one way.
>> > This is where the Baby X resource compiler comes to our rescue. It will
>> > convert ASCII to that form, with the utf-8 tag.
>> It comes into it's own for people using EBCDIC for C source code?
>> That's a tiny user base. I am now completely lost. On other machines,
>> converting ASCII to UTF-8 is a no-op.
>>
> Yes. Instead of converting strings to UTF-8 by hand, which is error prone, the
> Baby X resource compiler will do it for you automatically.

It performs the no-op automatically and comes into its own only on
EBCDIC compilers, or was the "yes" a typo?

> The "utf8" tag says
> "output this string as a hex dump in UTF-8 format".
> As you say, on ASCII machines it tends not to be much of an issue if the UTF-8
> string is in the common subset of ASCII and UTF-8, because the encoding is
> also the same. It's only important if you need the extended UTF-8 characters
> but your source character set is strictly ASCII only.

So it does not convert ASCII to UTF-8. In fact, it usually does the
opposite: it converts UTF-8 to ASCII -- specifically the ASCII C source
to represent the string using hex integer constants. That makes much
more sense.
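
A minimal sketch of that kind of conversion, taking a UTF-8 string and producing ASCII-only C source text made of hex integer constants. The function name and the exact output layout are assumptions for illustration. This is not Baby X's code and its real output format may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of a NUL-terminated UTF-8 string
   into out as a C initializer such as {0x43, 0x44, 0x00}.
   Returns the number of characters written (not counting the NUL),
   or -1 if out is too small. */
static int utf8_to_hex_initializer(const char *utf8, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t used;
    int n;

    assert(utf8 != NULL && out != NULL);

    n = snprintf(out, outsize, "{");
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    used = (size_t)n;

    for (; *p != 0; p++) {
        /* One hex constant per byte, so multi-byte sequences stay intact. */
        n = snprintf(out + used, outsize - used, "0x%02X, ", (unsigned)*p);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;
    }

    /* Terminating NUL of the generated string, then the closing brace. */
    n = snprintf(out + used, outsize - used, "0x00}");
    if (n < 0 || (size_t)n >= outsize - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

The generated text is meant to initialize an array, for example char cdata[] = {...};, since a brace-enclosed list of several values is not a valid initializer for a char * object.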

> The number of people with EBCDIC C compilers is very small. But they
> tend to be dealing with machines worth many millions of pounds and
> data of incalculably high value.

The value of the machines and data doesn't have much to do with whether
it's worth your while supporting EBCDIC. Will there be even one such
user of the system? Will that user really not know how to pipe their
data through, say, xmllint first?

--
Ben.

Re: C vs Haskell for XML parsing

<ec66ec3c-67b6-49fc-a4bd-5acc85ff3335n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=31714&group=comp.lang.c#31714

Newsgroups: comp.lang.c
Date: Sun, 20 Aug 2023 19:45:19 -0700 (PDT)
In-Reply-To: <87350dtive.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:c4f4:8fc2:f241:72e0;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:c4f4:8fc2:f241:72e0
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk> <37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk> <250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk> <cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk> <7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk> <610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ec66ec3c-67b6-49fc-a4bd-5acc85ff3335n@googlegroups.com>
Subject: Re: C vs Haskell for XML parsing
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Mon, 21 Aug 2023 02:45:21 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 121

by: Malcolm McLean - Mon, 21 Aug 2023 02:45 UTC

On Monday, 21 August 2023 at 00:05:42 UTC+1, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> >> > if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */
> >> I can't stop you parsing <[CDATA[ and putting "CDATA" in a variable tag,
> >> but it's not a good name.
> >>
> > Oh, there's a typo. A stray '<'. So of course the load would
> > fail. That's why I'm writing another XML parser. The main motive is to
> > get better error reports.
> So you intended to write
> <utf8 name="cdata">CDATA</utf8>
> and CDATA had nothing to do with XML CDATA sections. What's the
> reference to "a tag called 'CDATA'" then?
>
The Baby X resource compiler accepts a script file written in XML as its
main input. (The script file usually contains paths to other input files). At
the moment, it contains a simple XML parser which will adequately parse
the subset of XML used for the script files, but isn't good enough to be a
general purpose XML parser. Also, it doesn't have good error reporting. Which
is a practical problem with XML scripts written by hand.
So I need a new XML parser, which I'm writing at the moment. However the
Baby X resource compiler, in its current state, can be used to assist the
writing of the next generation of XML parser.
>
> > Apparently it opens a can of worms because it makes the string a
> > char8_t * instead of a char *.
> Is the can of worms really there? The "apparently" makes me worry it's
> hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
> or unsigned char *. Where are the worms?
>
A major advantage of UTF-8 is that it is transparent to UTF-8 naive programs,
as long as they accept strings in ASCII and don't try to play with the top
bit. If you use a different type for char and UTF-8 characters, you lose that
interoperability. In C++
std::cout << " <Greek UTF-8> "
and
std::cout << u8" <Greek UTF-8> "

can do different things, replacing <Greek UTF-8> with extended UTF-8 characters.

In C, it's more subtle.
>
> But if I were writing such a program, I'd put effort into allowing
> different C standard outputs rather than dealing with EBCDIC source
> code. Someone who uses C23 will prefer
>
I'm writing the XML parser component of the program at the moment. The XML
parser always produces output in UTF-8. The current version only accepts XML
input in UTF-8. But the instructions say to accept UTF-16, and with the current
design, that's not too hard to do.
However there could be an attribute on the utf8 tag in the script files to say "output
UTF-8 in human-readable form using the u8 prefix". That would be useful.
>
> char8_t *cdata = u8"CDATA";
> >> > Instead we need to write
> >> >
> >> > /* CDATA in UTF-8 */
> >> > char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:
> >> I think you mean {0x43, 0x44, 0x41, 0x54, 0x41, 0x00};
> >> > if (!strcmp(tag, cdata)) /* check for CDATA and process it */
> >> You don't /need/ to, but it's one way.
> >> > This is where the Baby X resource compiler comes to our rescue. It will
> >> > convert ASCII to that form, with the utf-8 tag.
> >> It comes into it's own for people using EBCDIC for C source code?
> >> That's a tiny user base. I am now completely lost. On other machines,
> >> converting ASCII to UTF-8 is a no-op.
> >>
> > Yes. Instead of converting strings to UTF-8 by hand, which is error prone, the
> > Baby X resource compiler will do it for you automatically.
> It performs the no-op automatically and comes into its own only on
> EBCDIC compilers, or was the "yes" a typo?
>
You can freely mix ASCII and UTF-8 (as long as you don't use the u8 modifier),
in C. So ASCII strings are also UTF-8 strings and there's no point in representing
them in hex. So on an ASCII system, if you have a string which is constrained
to be ASCII, there's not much point using the utf8 tag.
On an EBCDIC system it is different. The strings are no longer ASCII. So if you use
the Baby X resource compiler's "string" tag to add a string to the source code,
and compile it with a compiler whose execution character set is EBCDIC, you'll
get a string in EBCDIC. Which is usually what you want, but not always.
>
> > The "utf8" tag says
> > "output this string as a hex dump in UTF-8 format".
> > As you say, on ASCII machines it tends not to be much of an issue if the UTF-8
> > string is in the common subset of ASCII and UTF-8, because the encoding is
> > also the same. It's only important if you need the extended UTF-8 characters
> > but your source character set is strictly ASCII only.
> So it does not convert ASCII to UTF-8. In fact, it usually does the
> opposite: it converts UTF-8 to ASCII -- specifically the ASCII C source
> to represent the string using hex integer constants. That makes much
> more sense.
>
That's right.
On an ASCII system, the utf8 tag in the Baby X resource compiler is useful if
either your compiler or your editor won't accept non-ASCII characters, or
interprets them as ANSI 8 bit codes, because it converts UTF-8 to ASCII-encoded
hex.
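
To make the "UTF-8 to ASCII-encoded hex" conversion concrete, here is a minimal sketch in C. The function name utf8_to_c_array and its output format are assumptions made for illustration; this is not the Baby X resource compiler's actual code. It writes the bytes of a UTF-8 string as a C array definition that contains only ASCII characters, so the result survives an editor or compiler that mishandles non-ASCII source.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not taken from Baby X): formats the bytes of a
   NUL-terminated UTF-8 string as a C array definition using only ASCII
   characters. Returns the number of characters written, excluding the
   terminating NUL, or -1 if `out` is too small. */
int utf8_to_c_array(const char *name, const char *utf8,
                    char *out, size_t outsize)
{
    size_t len = strlen(utf8);
    size_t used;
    size_t i;
    int n;

    n = snprintf(out, outsize, "char %s[] = {", name);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    used = (size_t)n;

    /* Emit every byte, including the terminating NUL, as 0xNN. */
    for (i = 0; i <= len; i++) {
        n = snprintf(out + used, outsize - used, "%s0x%02X",
                     i ? ", " : "", (unsigned)(unsigned char)utf8[i]);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, outsize - used, "};");
    if (n < 0 || (size_t)n >= outsize - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}
```

Called with the name "cdata" and the bytes 43 44 41 54 41, it produces the six-element initializer discussed earlier in the thread. The same routine handles multi-byte sequences such as Greek letters unchanged, because it never interprets the bytes.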
>
> > The number of people with EBCDIC C compilers is very small. But they
> > tend to be dealing with machines worth many millions of pounds and
> > data of incalculably high value.
> The value of the machines and data don't have much to do with whether
> it's worth your while supporting EBCDIC. Will there be even one such
> user of the system? Will that user really not know how to pipe their
> data through, say, xmllint first?
>
Well it does. If you've got only one user who is a hobbyist and does a bit of
bedroom programming for casual games with a tiny circulation, then I think
you'd say that was a disappointment. If you've got only one user who is a mainframe
programmer and says that the program has been invaluable in helping him
process data worth millions of pounds, then I think you'd say that the effort
has been worthwhile.

It's likely that a user interested in EBCDIC would browse available parsers
and choose one that supported EBCDIC. With speculative development, you write
the program first and then hope the customers see it, decide it would be useful, and
come to you. You don't get a user first and then ask him what he needs.

xmllint might well be a better solution than reading the data in EBCDIC. But the
instructions say that a parser should accept EBCDIC.

On 21/08/2023 01:05, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>

>> Apparently it opens a can of worms because it makes the string a
>> char8_t * instead of a char *.
>
> Is the can of worms really there? The "apparently" makes me worry it's
> hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
> or unsigned char *. Where are the worms?
>
> But if I were writing such a program, I'd put effort into allowing
> different C standard outputs rather than dealing with EBCDIC source
> code. Someone who uses C23 will prefer
>
> char8_t *cdata = u8"CDATA";
>

Is there any reason not to write:

const char8_t * cdata = u8"CDATA";

?

If you are dealing with old code that takes non-const pointers even
though there is no write access through the pointers, then it might be more
convenient to have non-const pointers to pass to these functions. But
for new code, I prefer const pointers as much as possible.
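
As a sketch of how a const pointer to a u8 literal can stay usable with char-based code under both C11 and C23, assuming the "just cast" approach mentioned upthread: the names cdata_utf8 and is_cdata_tag are made up for this example. The cast is redundant in C11, where the literal is an array of char, and does the conversion in C23, where it is an array of char8_t.

```c
#include <string.h>

/* u8"CDATA" is char[] in C11 and char8_t[] (an unsigned char type) in
   C23. The cast gives one spelling that works for both; the bytes are
   UTF-8 either way, whatever the execution character set is. */
static const char *const cdata_utf8 = (const char *)u8"CDATA";

/* Hypothetical helper: returns 1 if `tag`, assumed to hold UTF-8 bytes,
   is exactly "CDATA", and 0 otherwise. */
int is_cdata_tag(const char *tag)
{
    return strcmp(tag, cdata_utf8) == 0;
}
```

Because the pointer is to const char, it can be passed to strcmp and similar functions without further casts, and an accidental write through it is diagnosed at compile time.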

Re: C vs Haskell for XML parsing

<3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32421&group=comp.lang.c#32421

Newsgroups: comp.lang.c
Date: Mon, 21 Aug 2023 02:59:40 -0700 (PDT)
In-Reply-To: <ubvan6$1rb3s$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:499:8a4d:8a30:789b;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:499:8a4d:8a30:789b
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk> <37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk> <250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk> <cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk> <7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk> <610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk> <ubvan6$1rb3s$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>
Subject: Re: C vs Haskell for XML parsing
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Mon, 21 Aug 2023 09:59:41 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3711

by: Malcolm McLean - Mon, 21 Aug 2023 09:59 UTC

On Monday, 21 August 2023 at 10:28:28 UTC+1, David Brown wrote:
> On 21/08/2023 01:05, Ben Bacarisse wrote:
> > Malcolm McLean <malcolm.ar...@gmail.com> writes:
> >
>
> >> Apparently it opens a can of worms because it makes the string a
> >> char8_t * instead of a char *.
> >
> > Is the can of worms really there? The "apparently" makes me worry it's
> > hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
> > or unsigned char *. Where are the worms?
> >
> > But if I were writing such a program, I'd put effort into allowing
> > different C standard outputs rather than dealing with EBCDIC source
> > code. Someone who uses C23 will prefer
> >
> > char8_t *cdata = u8"CDATA";
> >
> Is there any reason not to write :
>
> const char8_t * cdata = u8"CDATA";
>
> ?
>
> If you are dealing with old code that takes non-const pointers even
> though there is no write access through the pointers, then it might be more
> convenient to have non-const pointers to pass to these functions. But
> for new code, I prefer const pointers as much as possible.
>
Baby X has a const-free policy in its API.
That's partly because a lot of functions take opaque pointers, but do not in fact
change state. However that's because of the details of the Baby X implementation
and might change if Baby X is ported to a different platform. So "const" is
misleading.
Partly it's to avoid visual clutter.
Strings are a partial exception. The bbx_utf8* functions take const char * so that
they can be used with code which uses const.
const makes sense in embedded systems where you need to mark data which
is stored in physically read-only memory. But Baby X isn't designed for those
systems, and all memory is expected to be in RAM chips.

But C string literals are non-const by default, even though they are not writeable.
Someone might pass data intended as input to a parameter intended for output,
but such a mistake would almost always be caught very early in testing.

David Brown <david.brown@hesbynett.no> writes:

> On 21/08/2023 01:05, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>
>
>>> Apparently it opens a can of worms because it makes the string a
>>> char8_t * instead of a char *.
>> Is the can of worms really there? The "apparently" makes me worry it's
>> hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
>> or unsigned char *. Where are the worms?
>> But if I were writing such a program, I'd put effort into allowing
>> different C standard outputs rather than dealing with EBCDIC source
>> code. Someone who uses C23 will prefer
>> char8_t *cdata = u8"CDATA";
>>
>
> Is there any reason not to write :
>
> const char8_t * cdata = u8"CDATA";
>
> ?

Not for me, but Malcolm does not like const. I'd use it, or I'd write

char8_t cdata[] = u8"CDATA";

--
Ben.

On 21/08/2023 11:59, Malcolm McLean wrote:
> On Monday, 21 August 2023 at 10:28:28 UTC+1, David Brown wrote:
>> On 21/08/2023 01:05, Ben Bacarisse wrote:
>>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>>
>>
>>>> Apparently it opens a can of worms because it makes the string a
>>>> char8_t * instead of a char *.
>>>
>>> Is the can of worms really there? The "apparently" makes me worry it's
>>> hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
>>> or unsigned char *. Where are the worms?
>>>
>>> But if I were writing such a program, I'd put effort into allowing
>>> different C standard outputs rather than dealing with EBCDIC source
>>> code. Someone who uses C23 will prefer
>>>
>>> char8_t *cdata = u8"CDATA";
>>>
>> Is there any reason not to write :
>>
>> const char8_t * cdata = u8"CDATA";
>>
>> ?
>>
>> If you are dealing with old code that takes non-const pointers even
>> though there is no write access through the pointers, then it might be more
>> convenient to have non-const pointers to pass to these functions. But
>> for new code, I prefer const pointers as much as possible.
>>
> Baby X has a const-free policy in its API.

That would be a reason not to use "const" in the definition, but it is
just kicking the can down the road.

> That's partly because a lot of functions take opaque pointers, but do not in fact
> change state.

Functions that take const pointers are guaranteeing (barring bugs) that
they do not change the data pointed at, at least not via that pointer.
This is an extremely useful feature and makes using an API and reasoning
about it significantly easier. Throwing out that feature is a huge step
backwards for the users of your toolkit.

Proper use of "const" also allows several extra compiler checks, both in
the users' code, and in the implementation of the library. Only a fool
thinks they write such perfect code that they don't take advantage of
such cheap and simple checks.

If I am evaluating a piece of code, and I see it does not have "const"
for pointers that do not change data (at the very least, for function
parameters), I will assume the code is either ancient or written by
someone who does not really understand how to use the language. It's
not an absolute, of course, but it is a big red flag - much like lack of
"static" on file-local functions.

I would have no use of a "resource compiler" that did not declare the
resources as "const".

> However that's because of the details of the Baby X implementation
> and might change if Baby X is ported to a different platform. So "const" is
> misleading.

No, it is not.

If the function in question might change the data, it should be
non-const. If it will not change the data, it should be a const
pointer. That gives the user vital information.

If I see an API function "void show_string(char * p);", I have to assume
it may change the data pointed to. I have to assume I can't write
"show_string("Hello, world!");", but instead must make a copy to a
writeable array and send a pointer to that.
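
For comparison, a minimal sketch of the same function with a const-qualified parameter. The body is an assumption for illustration (show_string is only a name used in this post, not a real Baby X function). With this signature the caller can see from the declaration alone that passing a string literal is safe.

```c
#include <stdio.h>

/* Hypothetical API function. The const qualifier states, in a form the
   compiler checks, that the string is only read through `p`.
   Returns a non-negative value on success, EOF on a write error. */
int show_string(const char *p)
{
    return puts(p);
}
```

Both show_string("Hello, world!") and a call with a writeable char array compile cleanly, and any attempt inside the function to assign through p is rejected by the compiler.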

> Partly it's to avoid visual clutter.

Do you also use K&R-style implicit int to avoid clutter?

IMHO - and this is very much just my opinion - you'd be better off
changing your function naming conventions to avoid clutter, and keeping
things that provide important and useful information.

> Strings are a partial exception. The bbx_utf8* functions take const char * so that
> they can be used with code which uses const.
> const makes sense in embedded systems where you need to mark data which
> is stored in physically read-only memory. But Baby X isn't designed for those
> systems, and all memory is expected to be in RAM chips.

"const" makes sense everywhere that you have something you don't want to
change.

It has extra relevance for small embedded systems, but it is certainly
not limited to such systems.

>
> But C string literals are non-const by default, even though they are not writeable.
>

You do know that you can safely assign a pointer-to-non-const expression
to a pointer-to-const? The historical fact that C string literals
predate the "const" keyword in C does not mean you can't use const
pointers to point to C string literals.
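
A short sketch of that conversion, with made-up names (visible_length, demo). Adding const through a pointer is implicit and always allowed; it restricts only what can be done via that pointer, not the underlying object.

```c
#include <string.h>

/* Takes a read-only view, so it accepts const and non-const callers. */
size_t visible_length(const char *s)
{
    return strlen(s);
}

size_t demo(void)
{
    char writable[] = "mutable buffer";      /* array the caller may modify */
    const char *literal = "string literal";  /* literal viewed through const */
    char *p = writable;
    const char *q = p;    /* implicit conversion, no cast needed */

    writable[0] = 'M';    /* the object itself is still writeable */
    return visible_length(q) + visible_length(literal);
}
```

The reverse direction, from const char * to char *, needs an explicit cast, which is where the compiler's help comes from.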

> Someone might pass data intended as input to a parameter intended for output,
> but such a mistake would almost always be caught very early in testing.

And we all know that "almost always caught in testing" is /so/ much more
helpful than "always caught at compile time".

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Monday, 21 August 2023 at 00:05:42 UTC+1, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>> >> > if (!strcmp(tag, "CDATA") /* check for CDATA and process it. */
>> >> I can't stop you parsing <[CDATA[ and putting "CDATA" in a variable tag,
>> >> but it's not a good name.
>> >>
>> > Oh, there's a typo. A stray '<'. So of course the load would
>> > fail. That's why I'm writing another XML parser. The main motive is to
>> > get better error reports.
>> So you intended to write
>> <utf8 name="cdata">CDATA</utf8>
>> and CDATA had nothing to do with XML CDATA sections. What's the
>> reference to "a tag called 'CDATA'" then?
>>
> The Baby X resource compiler accepts a script file written in XML as its
> main input. (The script file usually contains paths to other input files). At
> the moment, it contains a simple XML parser which will adequately parse
> the subset of XML used for the script files, but isn't good enough to be a
> general purpose XML parser. Also, it doesn't have good error reporting. Which
> is a practical problem with XML scripts written by hand.
> So I need a new XML parser, which I'm writing at the moment. However the
> Baby X resource compiler, in its current state, can be used to assist the
> writing of the next generation of XML parser.

I can't see an answer to my question, but it was not really important.
I just wanted to know what you were saying.

>> > Apparently it opens a can of worms because it makes the string a
>> > char8_t * instead of a char *.
>> Is the can of worms really there? The "apparently" makes me worry it's
>> hearsay and FUD. In C11 it's a char[], but in C23 just cast to char *
>> or unsigned char *. Where are the worms?
>>
> A major advantage of UTF-8 is that it is transparent to UTF-8 naive programs,
> as long as they accept strings in ASCII and don't try to play with the top
> bit. If you use a different type for char and UTF-8 characters, you lose that
> interoperability.

So you ignored my suggestion and continued to insist that there is some
mysterious problem. Nothing in the type (other than, ironically, const
which you refuse to use) can prevent problems in a program that fiddles
with the top bit, so that's a red herring.

I don't use this stuff at the moment so I'd really like to know the
problems, not at the level of description you are using.

> In C++
> std::cout << " <Greek UTF-8> "
> and
> std::cout << u8" <Greek UTF-8> "
>
> can do different things, replacing <Greek UTF-8> with extended UTF-8
> characters.

I thought you were generating C. A flag to generate C++ strings would
be a better option for C++ output would it not? Especially as C++ is
moving away from supporting such C-style strings in ostream.

Anyway, I see no permission in C++ to mess with the characters in the
string. The C++ wording seems to be almost the same as the C wording.
What compiler does this?

> In C, it's more subtle.

That's even less precise. Where are the worms?

>> But if I were writing such a program, I'd put effort into allowing
>> different C standard outputs rather than dealing with EBCDIC source
>> code. Someone who uses C23 will prefer
>>
> I'm writing the XML parser component of the program at the moment. The
> XML parser always produces output in UTF-8. The current version only
> accepts XML input in UTF-8. But the instructions say to accept
> UTF-16, and with the current design, that's not too hard to do.
> However, there could be an attribute on the utf8 tag in the script files
> to say "output UTF-8 in human-readable form using the u8 prefix". That
> would be useful.

I think it would. And const could be an option too, especially if these
can be defaulted (otherwise it might well be easier just to add const to
the output oneself).

>> char8_t *cdata = u8"CDATA";
>> >> > Instead we need to write
>> >> >
>> >> > /* CDATA in UTF-8 */
>> >> > char *cdata = {0x43, 0x44, 0x54, 0x41, 0x00}:
>> >> I think you mean {0x43, 0x44, 0x41, 0x54, 0x41, 0x00};
>> >> > if (!strcmp(tag, cdata)) /* check for CDATA and process it */
>> >> You don't /need/ to, but it's one way.
>> >> > This is where the Baby X resource compiler comes to our rescue. It will
>> >> > convert ASCII to that form, with the utf-8 tag.
>> >> It comes into its own for people using EBCDIC for C source code?
>> >> That's a tiny user base. I am now completely lost. On other machines,
>> >> converting ASCII to UTF-8 is a no-op.
>> >>
>> > Yes. Instead of converting strings to UTF-8 by hand, which is error prone, the
>> > Baby X resource compiler will do it for you automatically.
>> It performs the no-op automatically and comes into its own only on
>> EBCDIC compilers, or was the "yes" a typo?
>>
> You can freely mix ASCII and UTF-8 (as long as you don't use the u8
> modifier), in C. So ASCII strings are also UTF-8 strings and there's
> no point in representing them in hex.
> So on an ASCII system, if you have a string which is constrained to be
> ASCII there's not much point using the utf8 tag. On an EBCDIC system
> it is different. The strings are no longer ASCII. So if you use the Baby
> X resource compiler's "string" tag to add a string to the source code,
> and compile it with a compiler whose execution character set is
> EBCDIC, you'll get a string in EBCDIC. Which is usually what you want,
> but not always.

I'm going to back out of this exchange. I can't seem to get a straight
answer.

>> > The number of people with EBCDIC C compilers is very small. But they
>> > tend to be dealing with machines worth many millions of pounds and
>> > data of incalculably high value.
>> The value of the machines and data don't have much to do with whether
>> it's worth your while supporting EBCDIC. Will there be even one such
>> user of the system? Will that user really not know how to pipe their
>> data through, say, xmllint first?
>>
> Well it does. If you've got only one user who is a hobbyist and does a
> bit of bedroom programming for casual games with a tiny circulation,
> then I think you'd say that was a disappointment. If you've got only
> one user who is a mainframe programmer and says that the program has
> been invaluable in helping him process data worth millions of pounds,
> then I think you'd say that the effort has been worthwhile.

I see we have different value systems.

--
Ben.

Re: C vs Haskell for XML parsing

<e9853969-42ce-48db-81e1-d37c8e4da59dn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=35451&group=comp.lang.c#35451

Newsgroups: comp.lang.c
Date: Mon, 21 Aug 2023 23:03:00 -0700 (PDT)
In-Reply-To: <ubvo4d$1tm0p$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:1c5f:c8af:ef99:3315;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:1c5f:c8af:ef99:3315
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk> <37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk> <250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk> <cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk> <7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk> <610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk> <ubvan6$1rb3s$1@dont-email.me> <3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>
<ubvo4d$1tm0p$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e9853969-42ce-48db-81e1-d37c8e4da59dn@googlegroups.com>
Subject: Re: C vs Haskell for XML parsing
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Tue, 22 Aug 2023 06:03:02 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 72

by: Malcolm McLean - Tue, 22 Aug 2023 06:03 UTC

On Monday, 21 August 2023 at 14:17:16 UTC+1, David Brown wrote:
>
> If the function in question might change the data, it should be
> non-const. If it will not change the data, it should be a const
> pointer. That gives the user vital information.
>
We have to document what the function does and what the parameters
mean anyway. const can help, but it isn't the main source of information,
and it isn't "vital". If it was "vital" then K and R C wouldn't have been a
viable programming language.

> If I see an API function "void show_string(char * p);", I have to assume
> it may change the data pointed to. I have to assume I can't write
> "show_string("Hello, world!");", but instead must make a copy to a
> writeable array and send a pointer to that.
>
No, you read the documentation. A function called strtoupper() pretty obviously
is designed to change the string passed to it. But it can't make just any change.
The changes it does make have to be described. Similarly "show_string" may,
for some reason, corrupt the string and leave it in an indeterminate state, but
that's part of the API contract.
>
> > Partly it's to avoid visual clutter.
> Do you also use K&R-style implicit int to avoid clutter?
>
I don't, but there's a strong case for making int the default integer type and
making the use of other types special purpose and rare. Implicit int was an
attempt to do that, and the idea was good. But the problem was that the
pattern "typename ... variable or function name" was too intuitive.
>
> It has extra relevance for small embedded systems, but it is certainly
> not limited to such systems.
>
Generally data is physically writeable on large systems. So "const" is an
artificial restriction.
>
> > But C string literals are non-const by default, even though they are not writeable.
> >
> You do know that you can safely assign a pointer-to-non-const expression
> to a pointer-to-const? The historical fact that C string literals
> predate the "const" keyword in C does not mean you can't use const
> pointers to point to C string literals.
>
You can. But it's only a safe thing to do "in the small". In the large, tainting
mutable data with const can mean that efficiency improvements become
impossible. A common situation is that the operation doesn't notionally
change the state (it returns a node in the tree, leaving the tree unaltered),
however in reality it's seldom used for random access, but to iterate through
all the nodes sequentially. So you can convert an O(log N) operation to an
O(1) operation by caching the last access. But not if you've const-poisoned the
tree.
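
A minimal sketch of the caching idea, using a singly linked list instead of a tree to keep it short (so the uncached cost here is O(N) rather than O(log N)). All names are invented for the example. The lookup does not change the logical contents, but it writes to the cache fields, which is why the container parameter cannot be a pointer to const.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

struct list {
    struct node *head;
    /* Cache of the most recent lookup; written even by "read-only" calls. */
    struct node *last_node;
    size_t last_index;
};

/* Returns the node at `index`, or NULL if out of range. Sequential calls
   with increasing indices cost O(1) each because the walk resumes from
   the cached position. The cache is not invalidated here, so this
   sketch assumes the list is not modified between lookups. */
struct node *list_at(struct list *l, size_t index)
{
    struct node *n = l->head;
    size_t i = 0;

    if (l->last_node && l->last_index <= index) {
        n = l->last_node;
        i = l->last_index;
    }
    while (n && i < index) {
        n = n->next;
        i++;
    }
    if (n) {
        l->last_node = n;
        l->last_index = index;
    }
    return n;
}
```

If list_at had been declared to take const struct list *, adding the cache later would mean either changing the signature for every caller or casting away const inside the function.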
>
> > Someone might pass data intended as input to a parameter intended for output,
> > but such a mistake would almost always be caught very early in testing.
> And we all know that "almost always caught in testing" is /so/ much more
> helpful than "always caught at compile time".
>
It's not an absolute. Sometimes using const will help to catch a bug. But it's a
fairly slight advantage. The main reason is that switching input and output
parameters will almost always be caught on the first test. But the other
reason is that, whilst the low-level function takes const, the data in the caller
will usually be mutable. Occasionally you see
strcpy(name, "Fred");
but far more often it's
strcpy(name, employee->name);

const will catch strcpy("Fred", name). (Except it won't, but never mind).
but not
strcpy(employee->name, name);
which is a far more likely bug.
So yes, there are a few situations in which const might help you catch bugs, but
it's not a big help, and they aren't very common.
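
To spell out which swapped-argument calls are and are not diagnosed, here is a small sketch with a hypothetical copy_string that has the same parameter qualifiers as strcpy. The cases are listed in the comment rather than compiled, since two of them are errors by design.

```c
/* Hypothetical copy routine with a const-qualified source, mirroring
   the parameter types of strcpy. Returns dst.

   Given:   char name[16];  const char *fred = "Fred";

   copy_string(name, fred);    fine
   copy_string(fred, name);    diagnosed: passing const char * to a
                               char * parameter discards the qualifier
   copy_string("Fred", name);  NOT diagnosed in C: a string literal has
                               type char[], so it converts to char *
                               (writing to it is undefined behaviour)
   copy_string(a, b) with both a and b plain char arrays: never
   diagnosed, whichever way round they are written. */
char *copy_string(char *dst, const char *src)
{
    char *d = dst;

    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```

So const catches the swap only when the intended source is itself held through a pointer to const, which is the case being argued about here.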

On 22/08/2023 08:03, Malcolm McLean wrote:
> On Monday, 21 August 2023 at 14:17:16 UTC+1, David Brown wrote:
>>
>> If the function in question might change the data, it should be
>> non-const. If it will not change the data, it should be a const
>> pointer. That gives the user vital information.
>>
> We have to document what the function does and what the parameters
> mean anyway. const can help, but it isn't the main source of information,
> and it isn't "vital". If it was "vital" then K and R C wouldn't have been a
> viable programming language.

C90 was a vast improvement over K&R C (and C99 another huge improvement
- changes after that have been relatively minor).

You can argue that every programming feature that is not in a Turing
Machine is not "vital". The reality, however, is that some features are
very useful for helping writing clear and correct programs. "const" is
one of these. It is no surprise that many modern languages make
everything "const" by default and require explicit keywords to support
modifiable data.

Arguing that you could get by without "const" when programming 35 years
ago is just nonsense.

This is, of course, your program and your decision - all I can do is
give you advice, and it is up to you to take it or leave it.

>
>> If I see an API function "void show_string(char * p);", I have to assume
>> it may change the data pointed to. I have to assume I can't write
>> "show_string("Hello, world!");", but instead must make a copy to a
>> writeable array and send a pointer to that.
>>
> No, you read the documentation.

The declaration, the types used, and the name of the function are part
of the documentation. They are particularly important parts of the
documentation, since they are the only parts that are guaranteed to be
in sync with the code. It is good practice to never put something in
comments or extra documentation if it can be expressed clearly in code -
that way it is always correct, often enforceable or checked by the
compiler and other automated checking tools, and automatically correct
in the documentation if you use tools such as doxygen.

I fully agree that the documentation is important and should be read. I
know that in practice, many people do not read documentation
sufficiently - and many people do not write sufficient documentation
(and many others fail to update and maintain it correctly).

Even the best documentation is no excuse for poor code.

> A function called strtoupper() pretty obviously
> is designed to change the string passed to it.

I disagree - it would depend on the signature and the way your API
works. If you have a type "String" that is a managed string type, and
you had a signature :

String strtoupper(String s);

then I would expect it to return a new String and leave all the caller's
data unaffected. (It would be a "Malcolm-function".)

If it were declared :

const char * strtoupper(const char * s);

then I would again expect the function to allocate new space and leave
the original string unchanged.

But if it were declared :

void strtoupper(char * s);

/then/ I would expect it to change the original string.
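
A sketch of the last two conventions side by side. The names upper_in_place and upper_copy are invented here to avoid the str-prefixed names that the C standard reserves for the library, and the allocating version returns char * rather than const char * so that the caller can free it.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Non-const parameter: the signature says the caller's buffer is modified. */
void upper_in_place(char *s)
{
    for (; *s; s++)
        *s = (char)toupper((unsigned char)*s);
}

/* Const parameter: the input is left untouched. The result is newly
   allocated and must be freed by the caller; NULL on allocation failure. */
char *upper_copy(const char *s)
{
    char *copy = malloc(strlen(s) + 1);

    if (!copy)
        return NULL;
    strcpy(copy, s);
    upper_in_place(copy);
    return copy;
}
```

With these two declarations a reader can tell which function is safe to call on a string literal without opening the documentation.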

Mistakenly assuming that users will make correct "obvious assumptions"
about behaviour based on function names is a recipe for disaster.

> But it can't make just any change.
> The changes it does make have to be described. Similarly "show_string" may,
> for some reason, corrupt the string and leave it in an indeterminate state, but
> that's part of the API contract.

I would suggest that anything that leaves data "corrupted and in an
indeterminate state" is a poor choice of API - even if it is documented.
(If the function is clearly a "data sink", and perhaps frees the
string's memory, that's a different matter.)

However, none of this gives any justification for not using "const"
on pointers when the function does not change the data.

>>
>>> Partly it's to avoid visual clutter.
>> Do you also use K&R-style implicit int to avoid clutter?
>>
> I don't, but there's a strong case for making int the default integer type and
> making the use of other types special purpose and rare. Implicit int was an
> attempt to do that, and the idea was good. But the problem was that the
> pattern "typename ... variable or function name" was too intuitive.

The question was rhetorical. I thought that was obvious.

> >
>> It has extra relevance for small embedded systems, but it is certainly
>> not limited to such systems.
>>
> Generally data is physically writeable on large systems. So "const" is an
> artificial restriction.

RAM is generally writeable, and thus "const" is a hugely important and
useful restriction.

>>
>>> But C string literals are non-const by default, even though they are not writeable.
>>>
>> You do know that you can safely assign a pointer-to-non-const expression
>> to a pointer-to-const? The historical fact that C string literals
>> predate the "const" keyword in C does not mean you can't use const
>> pointers to point to C string literals.
>>
> You can. But it's only a safe thing to do "in the small".

Rubbish.

> In the large, tainting
> mutable data with const can mean that efficiency improvements become
> impossible.

Rubbish.

> A common situation is that the operation doesn't notionally
> change the state (it returns a node in the tree, leaving the tree unaltered),
> however in reality it's seldom used for random access, but to iterate through
> all the nodes sequentially. So you can convert an O(log N) operation to an
> O(1) operation by caching the last access. But not if you've const-poisoned the
> tree.

You are talking about a very rare situation. The issue does arise - it
is the reason for the "mutable" keyword in C++. But it is rare, and
certainly not a rational justification for not using "const".

We are all aware that "const" can sometimes be unclear - in particular,
for complex objects, it does not pass through to objects indirectly
accessed through the higher level object. (i.e., if a type "String" is
a struct holding a length and a pointer to data, a "const String *"
pointer can still be used to modify the data pointed to by the String
object.)
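
A minimal sketch of that shallow-const behaviour, with a String type and a scribble function made up for the example:

```c
#include <stddef.h>

typedef struct {
    size_t length;
    char *data;   /* const on the struct does not extend to the pointee */
} String;

/* Compiles without a diagnostic even though `s` points to a const
   String: only the members themselves (length, and the pointer value
   held in data) are const, not the characters that data points to. */
void scribble(const String *s)
{
    if (s->length > 0)
        s->data[0] = '?';
    /* s->length = 0;  would be rejected: assignment to a const member */
}
```

An API that wants a read-only view to be deep would need a separate view type whose data member is const char *, or documentation that says so explicitly.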

That is a reason to be clear in your API design, types and
documentation. It is not a reason to abandon "const".

>>
>>> Someone might pass data intended as input to a parameter intended for output,
>>> but such a mistake would almost always be caught very early in testing.
>> And we all know that "almost always caught in testing" is /so/ much more
>> helpful than "always caught at compile time".
>>
> It's not an absolute.

It is absolutely an absolute (and obviously sarcasm). If an error can
be caught at compile time, that is better than hoping to catch it in
testing. And compile time checks do not preclude testing as well.

> Sometimes using const will help to catch a bug. But it's a
> fairly slight advantage. The main reason is that switching input and output
> parameters will almost always be caught on the first test. But the other
> reason is that, whilst the low-level function takes const, the data in caller
> will usually be mutable. Occasionally you see
> strcpy(name, "Fred");
> but far more often it's
> strcpy(name, employee->name);
>
> const will catch strcpy("Fred", name). (Except it won't, but never mind).
> but not
> strcpy(employee->name, name);
> which is a far more likely bug.

So what?

Are you trying to claim that because "const" will not catch all bugs, it
is useless?
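
For reference, this is roughly what a conforming compiler will and will not diagnose in the strcpy cases above. The Employee type and the set_name and get_name functions are hypothetical, and the comment block lists calls rather than executing them.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record type for the example. */
typedef struct {
    char name[32];
} Employee;

/* Bounded copy; the const on src documents and enforces "input only". */
void set_name(Employee *e, const char *src)
{
    assert(e != NULL && src != NULL);
    strncpy(e->name, src, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
}

/* Read-only access: the record cannot be modified through the result. */
const char *get_name(const Employee *e)
{
    return e->name;
}

/* What the compiler does and does not diagnose:
 *
 *   const char *fixed = "Fred";
 *   strcpy(fixed, e->name);        diagnosed: discards const
 *   strcpy(get_name(e), "Fred");   diagnosed: discards const
 *   strcpy("Fred", e->name);       NOT diagnosed in C: the literal is char[5]
 *   strcpy(e->name, name);         NOT diagnosed: both sides are mutable
 */
```

The third case is the "except it won't" from the quoted text: a C string literal has type char[N], not const char[N]. GCC's -Wwrite-strings option treats literals as const char[N] and so warns on that line as well.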

>
> So yes, there are few situations in which const might help you catch bugs, but
> it's not a big help, and they aren't very common.

I'm sorry, but I think you are completely wrong here. You are wrong
about the practice, you are wrong about the user experience. You are
wrong about how to write APIs and the role of declarations and
documentation. You've given no reasonable justification for not using
"const" - at least, not any justification I consider significant or that
comes close to outweighing the advantages of "const".

I consider the lack of appropriate use of "const" as a major red flag on
the quality of code, and that would affect my choice of code to use.

That's my opinion, and you are of course entirely free to disagree.

Re: C vs Haskell for XML parsing

<b21393a6-c4f5-436a-9975-8ffedd6bf20bn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=36158&group=comp.lang.c#36158

X-Received: by 2002:a05:622a:1893:b0:410:401b:6039 with SMTP id v19-20020a05622a189300b00410401b6039mr70259qtc.6.1692707940937;
Tue, 22 Aug 2023 05:39:00 -0700 (PDT)
X-Received: by 2002:a17:90b:30cb:b0:268:1be1:b835 with SMTP id
hi11-20020a17090b30cb00b002681be1b835mr2101388pjb.2.1692707940454; Tue, 22
Aug 2023 05:39:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 22 Aug 2023 05:38:59 -0700 (PDT)
In-Reply-To: <uc28id$2dc7f$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:1c5f:c8af:ef99:3315;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:1c5f:c8af:ef99:3315
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk> <37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk> <250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk> <cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk> <7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk> <610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk> <ubvan6$1rb3s$1@dont-email.me> <3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>
<ubvo4d$1tm0p$1@dont-email.me> <e9853969-42ce-48db-81e1-d37c8e4da59dn@googlegroups.com>
<uc28id$2dc7f$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b21393a6-c4f5-436a-9975-8ffedd6bf20bn@googlegroups.com>
Subject: Re: C vs Haskell for XML parsing
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Tue, 22 Aug 2023 12:39:00 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 74

by: Malcolm McLean - Tue, 22 Aug 2023 12:38 UTC

On Tuesday, 22 August 2023 at 13:10:11 UTC+1, David Brown wrote:
> > I would suggest that anything that leaves data "corrupted and in an
> indeterminate state" is a poor choice of API - even if it is documented.
> (If the function is clearly a "data sink", and perhaps frees the
> string's memory, that's a different matter.)
>
I just wrote a function to calculate a determinant by Gaussian elimination.
Because of the way it works, the matrix is diagonalised in place. You could
of course take a copy, but that would be a memory allocation, and the
main benefit of Gaussian elimination is that it is extremely fast. Or you
could make the matrix a const * and pass in a workspace, but then that's
not as intuitive (though I might in fact modify the function to do that).
>
> > A common situation is that the operation doesn't notionally
> > change the state (it returns a node in the tree, leaving the tree unaltered),
> > however in reality it's seldom used for random access, but to iterate through
> > all the nodes sequentially. So you can convert an O(log N) operation to an
> > O(1) operation by caching the last access. But not if you've const-poisoned the
> > tree.
> You are talking about a very rare situation. The issue does arise - it
> is the reason for the "mutable" keyword in C++. But it is rare, and
> certainly not a rational justification for not using "const".
>
It's rare "in the small". A function to calculate the employees' average salary can take
a const * and it's most unlikely to be rewritten to use the array as scratch memory
space. A function that takes the "world" as a parameter but doesn't change the
world's external state (maybe it renders it to a buffer) is a different matter. There might
well be a tree traversal in there.
>
> We are all aware that "const" can sometimes be unclear - in particular,
> for complex objects, it does not pass through to objects indirectly
> accessed through the higher level object. (i.e., if a type "String" is
> a struct holding a length and a pointer to data, a "const String *"
> pointer can still be used to modify the data pointed to by the String
> object.)
>
> That is a reason to be clear in your API design, types and
> documentation. It is not a reason to abandon "const".
>
It is a reason to abandon const. If const * is giving a misleading message, because
the pointer is in fact being used to access mutable data via nested members, then
I'd say you shouldn't use const qualification.
>
> Are you trying to claim that because "const" will not catch all bugs, it
> is useless?
>
I said it will catch some bugs. But a limited subset of bugs, the vast majority of
which are not dangerous because they mean that input and output parameters have
been switched, and that will almost always come out on the first test run. But
you can probably come up with a real example where that isn't the case. So it's
of limited value, but not of zero value. It's not that there is no case for it a rational
person could make at all.
> >
> > So yes, there are few situations in which const might help you catch bugs, but
> > it's not a big help, and they aren't very common.
> I'm sorry, but I think you are completely wrong here. You are wrong
> about the practice, you are wrong about the user experience. You are
> wrong about how to write APIs and the role of declarations and
> documentation. You've given no reasonable justification for not using
> "const" - at least, not any justification I consider significant or that
> comes close to outweighing the advantages of "const".
>
> I consider the lack of appropriate use of "const" as a major red flag on
> the quality of code, and that would affect my choice of code to use.
>
It appeals to a certain mindset. I agree.
>
> That's my opinion, and you are of course entirely free to disagree.
>
Baby X makes heavy use of opaque pointers and callbacks. The callbacks all
take a void * for context. A lot of the time, the context data won't be modified,
but it can't be a const void *, because the context pointer is free for the callback
function to use as it sees fit.
So those are two cases where data might be constant, but const is inappropriate.

On 22/08/2023 13:09, David Brown wrote:
> On 22/08/2023 08:03, Malcolm McLean wrote:
>> On Monday, 21 August 2023 at 14:17:16 UTC+1, David Brown wrote:
>>>
>>> If the function in question might change the data, it should be
>>> non-const. If it will not change the data, it should be a const
>>> pointer. That gives the user vital information.
>>>
>> We have to document what the function does and what the parameters
>> mean anyway. const can help, but it isn't the main source of information,
>> and it isn't "vital". If it was "vital" then K and R C wouldn't have
>> been a viable programming language.
>
> C90 was a vast improvement over K&R C (and C99 another huge improvement
> - changes after that have been relatively minor).
>
> You can argue that every programming feature that is not in a Turing
> Machine is not "vital". The reality, however, is that some features are
> very useful for helping writing clear and correct programs. "const" is
> one of these. It is no surprise that many modern languages make
> everything "const" by default and require explicit keywords to support
> modifiable data.

It is a rather unique feature in that you can take any working C
program, take out all the 'const's, and it will still compile and still
work.

But the downside of const is:

* It generates more clutter, making it harder to spot real problems

* Some people go mad with it, often pointlessly so

* It can give a false sense of security (or you can also stick
'const' in the wrong place)

* You can waste time tying yourself up in knots trying to get around
a 'const' in a data structure that seemed a good idea at first

> Arguing that you could get by without "const" when programming 35 years
> ago is just nonsense.
>
> This is, of course, your program and your decision - all I can do is
> give you advice, and it is up to you to take it or leave it.

And using it is your decision.

>> A function called strtoupper() pretty obviously
>> is designed to change the string passed to it.
>
> I disagree - it would depend on the signature and the way your API
> works. If you have a type "String" that is a managed string type, and
> you had a signature :
>
>     String strtoupper(String s);
>
> then I would expect it to return a new String and leave all the caller's
> data unaffected. (It would be a "Malcolm-function".)
>
> If it were declared :
>
>     const char * strtoupper(const char * s);
>
> then I would again expect the function to allocate new space and leave
> the original string unchanged.
>
> But if it were declared :
>
>     void strtoupper(char * s);
>
> /then/ I would expect it to change the original string.
>
> Mistakenly assuming that users will make correct "obvious assumptions"
> about behaviour based on function names is a recipe for disaster.

They're all poor names IMO. In a language where you can routinely pass and
return objects by value, then

strtoupper(s)

sounds like it will return a modified copy of s (no matter what the
signature is). I would use:

istrtoupper(s)

Using a leading 'i' is a prefix I tend to add to functions performing
in-place updates.
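
The two behaviours being discussed can be written side by side so that the signature, not only the name, says which one you get. This is a sketch, not anyone's actual API: upper_inplace and upper_copy are hypothetical names, chosen partly to stay clear of the str, is and to prefixes that the C standard reserves for future library functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* In-place variant: the non-const parameter says the argument is modified. */
void upper_inplace(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}

/* Copying variant: the const parameter says the argument is left alone.
   Returns a malloc'd string the caller must free, or NULL on failure. */
char *upper_copy(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;          /* include the terminator */
    char *out = malloc(n);
    if (out == NULL)
        return NULL;
    memcpy(out, s, n);
    upper_inplace(out);
    return out;
}
```

Passing a const char * to upper_inplace draws a diagnostic, so a caller who mixes up the two variants finds out at compile time.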

Re: C vs Haskell for XML parsing

<uc2dbv$2e4tg$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38261&group=comp.lang.c#38261

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: C vs Haskell for XML parsing
Date: Tue, 22 Aug 2023 15:31:43 +0200
Organization: A noiseless patient Spider
Lines: 112
Message-ID: <uc2dbv$2e4tg$1@dont-email.me>
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk>
<37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk>
<250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk>
<cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk>
<7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk>
<610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk> <ubvan6$1rb3s$1@dont-email.me>
<3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>
<ubvo4d$1tm0p$1@dont-email.me>
<e9853969-42ce-48db-81e1-d37c8e4da59dn@googlegroups.com>
<uc28id$2dc7f$1@dont-email.me>
<b21393a6-c4f5-436a-9975-8ffedd6bf20bn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Aug 2023 13:31:43 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="79c7796a19758559c02eca618b3888a5";
logging-data="2560944"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18sEBJvIoWhmtkBwRNt4z2r6gkkRQAIfUM="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.9.0
Cancel-Lock: sha1:Yor3pqsELEzNAjeTcQttHIa8zJE=
In-Reply-To: <b21393a6-c4f5-436a-9975-8ffedd6bf20bn@googlegroups.com>
Content-Language: en-GB

by: David Brown - Tue, 22 Aug 2023 13:31 UTC

On 22/08/2023 14:38, Malcolm McLean wrote:
> On Tuesday, 22 August 2023 at 13:10:11 UTC+1, David Brown wrote:
>>> I would suggest that anything that leaves data "corrupted and in an
>> indeterminate state" is a poor choice of API - even if it is documented.
>> (If the function is clearly a "data sink", and perhaps frees the
>> string's memory, that's a different matter.)
>>
> I just wrote a function to calculate a determinant by Gaussian elimination.
> Because of the way it works, the matrix is diagonalised in place. You could
> of course take a copy, but that would be a memory allocation, and the
> main benefit of Gaussian elimination is that it is extremely fast. Or you
> could make the matrix a const * and pass in a workspace, but then that's
> not as intuitive (though I might in fact modify the function to do that).

That is a function that modifies its argument in an appropriate way, for
good reason. It does not leave it in a corrupted and indeterminate state.

So I do not understand your point.

>>
>>> A common situation is that the operation doesn't notionally
>>> change the state (it returns a node in the tree, leaving the tree unaltered),
>>> however in reality it's seldom used for random access, but to iterate through
>>> all the nodes sequentially. So you can convert an O(log N) operation to an
>>> O(1) operation by caching the last access. But not if you've const-poisoned the
>>> tree.
>> You are talking about a very rare situation. The issue does arise - it
>> is the reason for the "mutable" keyword in C++. But it is rare, and
>> certainly not a rational justification for not using "const".
>>
> It's rare "in the small". A function to calculate the employees' average salary can take
> a const * and it's most unlikely to be rewritten to use the array as scratch memory
> space. A function that takes the "world" as a parameter but doesn't change the
> world's external state (maybe it renders it to a buffer) is a different matter. There might
> well be a tree traversal in there.

So don't use "const" on world-changing functions. Use "const" on
functions that don't change the thing being pointed at.

> >
>> We are all aware that "const" can sometimes be unclear - in particular,
>> for complex objects, it does not pass through to objects indirectly
>> accessed through the higher level object. (i.e., if a type "String" is
>> a struct holding a length and a pointer to data, a "const String *"
>> pointer can still be used to modify the data pointed to by the String
>> object.)
>>
>> That is a reason to be clear in your API design, types and
>> documentation. It is not a reason to abandon "const".
>>
> It is a reason to abandon const.

Don't be so defeatist.

> If const * is giving a misleading message, because
> the pointer is in fact being used to access mutable data via nested members, then
> I'd say you shouldn't use const qualification.

If you want, it is a reason not to use "const" in that particular case -
it is no reason not to use "const" in other cases. (You might consider
using it even in such cases, as long as there is no logical change to
the user-visible data - that is how "const" is interpreted in the C++
world. Use whatever is clearest for the users of your API.)

>
>>
>> Are you trying to claim that because "const" will not catch all bugs, it
>> is useless?
>>
> I said it will catch some bugs. But a limited subset of bugs, the vast majority of
> which are not dangerous because they mean that input and output parameters have
> been switched, and that will almost always come out on the first test run. But
> you can probably come up with a real example where that isn't the case. So it's
> of limited value, but not of zero value. It's not that there is no case for it a rational
> person could make at all.

I think we see things differently here. We agree that "const" is not
perfect or a magic cure for all kinds of bugs. I see the aid to the
user and the implementer as being a reason to use "const" despite its
limitations - you apparently see its limitations as a reason never to
use it. I simply can't understand why you think that way.

>>>
>>> So yes, there are few situations in which const might help you catch bugs, but
>>> it's not a big help, and they aren't very common.
>> I'm sorry, but I think you are completely wrong here. You are wrong
>> about the practice, you are wrong about the user experience. You are
>> wrong about how to write APIs and the role of declarations and
>> documentation. You've given no reasonable justification for not using
>> "const" - at least, not any justification I consider significant or that
>> comes close to outweighing the advantages of "const".
>>
>> I consider the lack of appropriate use of "const" as a major red flag on
>> the quality of code, and that would affect my choice of code to use.
>>
> It appeals to a certain mindset. I agree.
>>
>> That's my opinion, and you are of course entirely free to disagree.
>>
> Baby X makes heavy use of opaque pointers and callbacks. The callbacks all
> take a void * for context. A lot of the time, the context data won't be modified,
> but it can't be a const void *, because the context pointer is free for the callback
> function to use as it sees fit.
> So those are two cases where data might be constant, but const is inappropriate.

If it might be changed, don't use "const". No one suggested using
"const" on all pointers - merely on those where there will be no change.

Re: C vs Haskell for XML parsing

<d734d616-b18e-4e67-b858-f0eb0a636a87n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=43341&group=comp.lang.c#43341

X-Received: by 2002:a05:6214:14eb:b0:63c:e9df:a46b with SMTP id k11-20020a05621414eb00b0063ce9dfa46bmr77212qvw.3.1692712297710;
Tue, 22 Aug 2023 06:51:37 -0700 (PDT)
X-Received: by 2002:a63:6f86:0:b0:56c:50c0:fbad with SMTP id
k128-20020a636f86000000b0056c50c0fbadmr483284pgc.8.1692712296955; Tue, 22 Aug
2023 06:51:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 22 Aug 2023 06:51:36 -0700 (PDT)
In-Reply-To: <uc2dbv$2e4tg$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:1c5f:c8af:ef99:3315;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:1c5f:c8af:ef99:3315
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk> <37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk> <250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk> <cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk> <7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk> <610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk> <ubvan6$1rb3s$1@dont-email.me> <3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>
<ubvo4d$1tm0p$1@dont-email.me> <e9853969-42ce-48db-81e1-d37c8e4da59dn@googlegroups.com>
<uc28id$2dc7f$1@dont-email.me> <b21393a6-c4f5-436a-9975-8ffedd6bf20bn@googlegroups.com>
<uc2dbv$2e4tg$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d734d616-b18e-4e67-b858-f0eb0a636a87n@googlegroups.com>
Subject: Re: C vs Haskell for XML parsing
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Injection-Date: Tue, 22 Aug 2023 13:51:37 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 5776

by: Malcolm McLean - Tue, 22 Aug 2023 13:51 UTC

On Tuesday, 22 August 2023 at 14:32:01 UTC+1, David Brown wrote:
> On 22/08/2023 14:38, Malcolm McLean wrote:
>
> That is a function that modifies its argument in an appropriate way, for
> good reason. It does not leave it in a corrupted and indeterminate state.
>
If you write determinant() in the obvious way, using the recursive definition,
then you wouldn't modify the matrix. If you use Gaussian elimination you
diagonalise it. And it's not necessarily caller's business which algorithm is
used. So it's probably better to say that, on function exit, the matrix is indeterminate.
(If you've got a zero on the diagonal, the elimination method will produce a divide
by zero, so you have to resort to the recursive method).
>
> > It's rare "in the small". A function to calculate the employees' average salary can take
> > a const * and it's most unlikely to be rewritten to use the array as scratch memory
> > space. A function that takes the "world" as a parameter but doesn't change the
> > world's external state (maybe it renders it to a buffer) is a different matter. There might
> > well be a tree traversal in there.
> So don't use "const" on world-changing functions. Use "const" on
> functions that don't change the thing being pointed at.
>
Image *render(WORLD * world) is not a world-changing function, at least notionally.
Nothing about the world should change as a result of a render, all the baddies and
lights and cameras and so on should behave exactly as they did before the function
was called. However in reality you are probably caching quite a lot of state. So is
world a const * or not?
>
> > If const * is giving a misleading message, because
> > the pointer is in fact being used to access mutable data via nested members, then
> > I'd say you shouldn't use const qualification.
> If you want, it is a reason not to use "const" in that particular case -
> it is no reason not to use "const" in other cases. (You might consider
> using it even in such cases, as long as there is no logical change to
> the user-visible data - that is how "const" is interpreted in the C++
> world. Use whatever is clearest for the users of your API.)
>
C++ is a different language. In particular, you rarely pass const pointers, but you
do pass const references. In fact you shouldn't pass non-const references at
all, because then it's not obvious in caller that a variable might be modified. You
should pass a non-const pointer.
const works in a different way in C++.
>
> I think we see things differently here. We agree that "const" is not
> perfect or a magic cure for all kinds of bugs. I see the aid to the
> user and the implementer as being a reason to use "const" despite its
> limitations - you apparently see its limitations as a reason never to
> use it. I simply can't understand why you think that way.
>
Because in C const adds visual clutter. It's harder to simply glance at a
function prototype and take in what it does, if it is decorated with all sorts
of qualifiers. So you introduce errors as a result of code being harder to read.
>
> > Baby X makes heavy use of opaque pointers and callbacks. The callbacks all
> > take a void * for context. A lot of the time, the context data won't be modified,
> > but it can't be a const void *, because the context pointer is free for the callback
> > function to use as it sees fit.
> > So those are two cases where data might be constant, but const is inappropriate.
> If it might be changed, don't use "const". No one suggested using
> "const" on all pointers - merely on those where there will be no change.
>
With an opaque pointer, there often is no change. However if you specify that in
the function interface, then it's no longer an opaque pointer.

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>On Tuesday, 22 August 2023 at 13:10:11 UTC+1, David Brown wrote:

>> You are talking about a very rare situation. The issue does arise - it
>> is the reason for the "mutable" keyword in C++. But it is rare, and
>> certainly not a rational justification for not using "const".
>>
>It's rare "in the small". A function to calculate the employees' average salary can take
>a const * and it's most unlikely to be rewritten to use the array as scratch memory
>space.

It's unlikely to use an array at all. In the real world, it would
likely just be a programmatic SQL query to the employee database.

Re: C vs Haskell for XML parsing

<uc2qnl$2gh96$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=43357&group=comp.lang.c#43357

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: C vs Haskell for XML parsing
Date: Tue, 22 Aug 2023 19:19:48 +0200
Organization: A noiseless patient Spider
Lines: 99
Message-ID: <uc2qnl$2gh96$1@dont-email.me>
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com>
<ubi7hd$38q7d$1@dont-email.me> <87o7j6fu74.fsf@bsb.me.uk>
<37f1a926-972c-42c8-a276-8d3f6457ccb8n@googlegroups.com>
<877cptgbli.fsf@bsb.me.uk>
<250cc72c-f682-4986-96bd-80011967c8dbn@googlegroups.com>
<87o7j4vt6r.fsf@bsb.me.uk>
<cb35076d-f8ec-441c-a963-7077bd5f884cn@googlegroups.com>
<87jztqvhwf.fsf@bsb.me.uk>
<7f9fbbd6-7f5c-4e12-a73b-c9abe91b7f5bn@googlegroups.com>
<87jztpu2iu.fsf@bsb.me.uk>
<610a41a0-a3a3-4e01-a9a7-8b5e1fe31ec0n@googlegroups.com>
<87350dtive.fsf@bsb.me.uk> <ubvan6$1rb3s$1@dont-email.me>
<3c87ec37-8fe1-4171-9500-609fad6701b7n@googlegroups.com>
<ubvo4d$1tm0p$1@dont-email.me>
<e9853969-42ce-48db-81e1-d37c8e4da59dn@googlegroups.com>
<uc28id$2dc7f$1@dont-email.me>
<b21393a6-c4f5-436a-9975-8ffedd6bf20bn@googlegroups.com>
<uc2dbv$2e4tg$1@dont-email.me>
<d734d616-b18e-4e67-b858-f0eb0a636a87n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Aug 2023 17:19:49 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6a8184aae5c6a2ba6101fa600751a5eb";
logging-data="2639142"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+EmATExqOt98Gbj+prUvtX19i2az543E="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:yR135u61SNVY0CT6RXU8ViTiutg=
In-Reply-To: <d734d616-b18e-4e67-b858-f0eb0a636a87n@googlegroups.com>
Content-Language: en-GB

by: David Brown - Tue, 22 Aug 2023 17:19 UTC

On 22/08/2023 15:51, Malcolm McLean wrote:
> On Tuesday, 22 August 2023 at 14:32:01 UTC+1, David Brown wrote:
>> On 22/08/2023 14:38, Malcolm McLean wrote:
>>
>> That is a function that modifies its argument in an appropriate way, for
>> good reason. It does not leave it in a corrupted and indeterminate state.
>>
> If you write determinant() in the obvious way, using the recursive definition,
> then you wouldn't modify the matrix. If you use Gaussian elimination you
> diagonalise it. And it's not necessarily caller's business which algorithm is
> used. So it's probably better to say that, on function exit, the matrix is indeterminate.

I for one would not be happy with a "determinant" function that might or
might not trash my matrix. If the matrix is big enough that Gaussian
elimination is the best algorithm, copying it locally is negligible
overhead and makes the code much easier to use.

> (If you've got a zero on the diagonal, the elimination method will produce a divide
> by zero, so you have to resort to the recursive method).

(That is incorrect, but off-topic here. I refer you to Wikipedia,
google, or any maths book on the topic.)
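
As a rough illustration of both points (a const input with a local copy, and a zero on the diagonal handled by row swaps rather than by abandoning elimination), here is a sketch of a determinant using partial pivoting. The function name, the row-major layout and the "ok" out-parameter are assumptions made for the example, not anyone's posted code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Determinant of an n x n row-major matrix by Gaussian elimination with
   partial pivoting. The input is const; elimination runs on a private copy.
   Returns 0.0 for a singular matrix. On allocation failure *ok is set to 0. */
double determinant(const double *m, size_t n, int *ok)
{
    assert(m != NULL && n > 0 && ok != NULL);
    double *a = malloc(n * n * sizeof *a);
    if (a == NULL) {
        *ok = 0;
        return 0.0;
    }
    *ok = 1;
    memcpy(a, m, n * n * sizeof *a);

    double det = 1.0;
    for (size_t col = 0; col < n; col++) {
        /* Pick the row with the largest entry in this column as pivot. */
        size_t pivot = col;
        for (size_t r = col + 1; r < n; r++)
            if (abs_d(a[r * n + col]) > abs_d(a[pivot * n + col]))
                pivot = r;

        if (a[pivot * n + col] == 0.0) {   /* rest of column is zero: singular */
            det = 0.0;
            break;
        }
        if (pivot != col) {                /* a row swap flips the sign */
            for (size_t c = 0; c < n; c++) {
                double t = a[col * n + c];
                a[col * n + c] = a[pivot * n + c];
                a[pivot * n + c] = t;
            }
            det = -det;
        }
        det *= a[col * n + col];
        /* Eliminate the entries below the pivot. */
        for (size_t r = col + 1; r < n; r++) {
            double f = a[r * n + col] / a[col * n + col];
            for (size_t c = col; c < n; c++)
                a[r * n + c] -= f * a[col * n + c];
        }
    }
    free(a);
    return det;
}
```

The copy costs O(n^2) against O(n^3) for the elimination itself. A caller who cannot afford the allocation could be offered a second entry point that takes a workspace, without changing this signature.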

>>
>>> It's rare "in the small". A function to calculate the employees' average salary can take
>>> a const * and it's most unlikely to be rewritten to use the array as scratch memory
>>> space. A function that takes the "world" as a parameter but doesn't change the
>>> world's external state (maybe it renders it to a buffer) is a different matter. There might
>>> well be a tree traversal in there.
>> So don't use "const" on world-changing functions. Use "const" on
>> functions that don't change the thing being pointed at.
>>
> Image *render(WORLD * world) is not a world-changing function, at least notionally.
> Nothing about the world should change as a result of a render, all the baddies and
> lights and cameras and so on should behave exactly as they did before the function
> was called. However in reality you are probably caching quite a lot of state. So is
> world a const * or not?

That's up to you. But whatever you decide, it does not stop "const"
being useful other places.
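
If you do want the const version, one possible arrangement in C is to keep the cached state behind a pointer member, since const on the world does not extend to what its members point at. This is only a sketch: World, RenderCache and render_brightness are invented stand-ins, far simpler than a real renderer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scratch state that rendering may update. */
typedef struct {
    int valid;
    long light_sum;
} RenderCache;

/* Hypothetical world: logical state plus a pointer to scratch state. */
typedef struct {
    const int *lights;      /* logical state: light intensities */
    size_t light_count;
    RenderCache *cache;     /* not part of the logical state */
} World;

/* Notionally read-only: the world's logical state cannot be written
   through `world`, but the cache it points at can. */
long render_brightness(const World *world)
{
    assert(world != NULL && world->cache != NULL);
    RenderCache *cache = world->cache;
    if (!cache->valid) {
        long sum = 0;
        for (size_t i = 0; i < world->light_count; i++)
            sum += world->lights[i];
        cache->light_sum = sum;     /* fill the cache on first use */
        cache->valid = 1;
    }
    return cache->light_sum;
}
```

This is well defined as long as the RenderCache object itself was not defined const. Whoever changes the lights has to clear cache->valid, and the header needs a comment saying that const here means "no logical change".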

>>
>>> If const * is giving a misleading message, because
>>> the pointer is in fact being used to access mutable data via nested members, then
>>> I'd say you shouldn't use const qualification.
>> If you want, it is a reason not to use "const" in that particular case -
>> it is no reason not to use "const" in other cases. (You might consider
>> using it even in such cases, as long as there is no logical change to
>> the user-visible data - that is how "const" is interpreted in the C++
>> world. Use whatever is clearest for the users of your API.)
>>
> C++ is a different language. In particular, you rarely pass const pointers, but you
> do pass const references. In fact you shouldn't pass non-const references at
> all, because then it's not obvious in caller that a variable might be modified. You
> should pass a non-const pointer.
> const works in a different way in C++.

C++ is a different language, yes - but that does not mean you should not
use "const" in C.

"bool" in C++ and C are somewhat different too - does that mean you
should not use "bool" in C ?

>>
>> I think we see things differently here. We agree that "const" is not
>> perfect or a magic cure for all kinds of bugs. I see the aid to the
>> user and the implementer as being a reason to use "const" despite its
>> limitations - you apparently see its limitations as a reason never to
>> use it. I simply can't understand why you think that way.
>>
> Because in C const adds visual clutter.

What you call "visual clutter", other people call useful information.

> It's harder to simply glance at a
> function prototype and take in what it does, if it is decorated with all sorts
> of qualifiers. So you introduce errors as a result of code being harder to read.

This is from the person who thinks "thisfunctionhasaperfectlygoodname"
is easy to read? Surely you are joking.

>>
>>> Baby X makes heavy use of opaque pointers and callbacks. The callbacks all
>>> take a void * for context. A lot of the time, the context data won't be modified,
>>> but it can't be a const void *, because the context pointer is free for the callback
>>> function to use as it sees fit.
>>> So those are two cases where data might be constant, but const is inappropriate.
>> If it might be changed, don't use "const". No one suggested using
>> "const" on all pointers - merely on those where there will be no change.
>>
> With an opaque pointer, there often is no change. However if you specify that in
> the function interface, then it's no longer an opaque pointer.

As long as you use "void *" pointers, they are opaque, whether "const"
or not. (I'm not keen on "void *" - it is often an excuse not to give
more informative and safer types. But it has its uses sometimes.)
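
A short sketch of the two flavours of callback context, both opaque to the library code that passes them through. The names for_each, count_if, add_to_sum and above_limit are hypothetical; the standard library's qsort and bsearch comparators are a familiar case of the const void * form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types: one may modify its context, one may not. */
typedef void (*visit_fn)(int value, void *ctx);
typedef int (*pred_fn)(int value, const void *ctx);

/* Calls fn on every element; the context is passed through untouched. */
void for_each(const int *items, size_t n, visit_fn fn, void *ctx)
{
    assert(fn != NULL);
    for (size_t i = 0; i < n; i++)
        fn(items[i], ctx);
}

/* Counts elements accepted by pred; promises not to write through ctx. */
size_t count_if(const int *items, size_t n, pred_fn pred, const void *ctx)
{
    assert(pred != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (pred(items[i], ctx))
            count++;
    return count;
}

/* Example callbacks. */
void add_to_sum(int value, void *ctx)
{
    *(long *)ctx += value;                  /* context is an accumulator */
}

int above_limit(int value, const void *ctx)
{
    return value > *(const int *)ctx;       /* context is read-only */
}
```

In neither case does the library know what the context points at. The const form lets a caller pass the address of a const object without a cast, while the non-const form is the right choice when the callback is free to use the context as it sees fit.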

Re: C vs Haskell for XML parsing

<d651e08e-033d-4a90-8477-6a5fa13d30f3n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=43363&group=comp.lang.c#43363

Newsgroups: comp.lang.c
From: malcolm.arthur.mclean@gmail.com (Malcolm McLean)
Subject: Re: C vs Haskell for XML parsing
Date: Tue, 22 Aug 2023 21:59:45 -0700 (PDT)
Message-ID: <d651e08e-033d-4a90-8477-6a5fa13d30f3n@googlegroups.com>
In-Reply-To: <uc2qnl$2gh96$1@dont-email.me>

by: Malcolm McLean - Wed, 23 Aug 2023 04:59 UTC

On Tuesday, 22 August 2023 at 18:20:05 UTC+1, David Brown wrote:
> On 22/08/2023 15:51, Malcolm McLean wrote:
> > On Tuesday, 22 August 2023 at 14:32:01 UTC+1, David Brown wrote:
> >> On 22/08/2023 14:38, Malcolm McLean wrote:
> >>
> >> That is a function that modifies its argument in an appropriate way, for
> >> good reason. It does not leave it in a corrupted and indeterminate state.
> >>
> > If you write determinant() in the obvious way, using the recursive definition,
> > then you wouldn't modify the matrix. If you use Gaussian elimination you
> > diagonalise it. And it's not necessarily caller's business which algorithm is
> > used. So it's probably better to say that, on function exit, the matrix is indeterminate.
> I for one would not be happy with a "determinant" function that might or
> might not trash my matrix. If the matrix is big enough that Gaussian
> elimination is the best algorithm, copying it locally is negligible
> overhead and makes the code much easier to use.
>
No, that's not the case. These sorts of calculation tend to be in the inner algorithmic
core of the application.
> > (If you've got a zero on the diagonal, the elimination method will produce a divide
> > by zero, so you have to resort to the recursive method).
> (That is incorrect, but off-topic here. I refer you to Wikipedia,
> google, or any maths book on the topic.)
>
No, it's a known problem.
>
> "bool" in C++ and C are somewhat different too - does that mean you
> should not use "bool" in C ?
>
No, you shouldn't use bool in C. In C we use zero as false, non-zero as true,
and that doesn't play well with a boolean type. There is a case for returning
"bool" from a function, but it's deeply problematic to pass a bool as a parameter.
The reason is that

drawpath(mypath, false);

is meaningless to a person reading the code.
It should be

drawpath(mypath, PATH_OPEN) ;

Now we've at least some idea what the parameter means. So it needs to be an enum
or a defined integer constant, not a bool.

So bool is pretty useless and we're better off without it.
>
> > It's harder to simply glance at a
> > function prototype and take in what it does, if it is decorated with all sorts
> > of qualifiers. So you introduce errors as a result of code being harder to read.
> This is from the person who thinks "thisfunctionhasaperfectlygoodname"
> is easy to read? Surely you are joking.
>
You've got the highighting paradox. THIS IS BIG is easy to read. But text larded
with lots of capitals is far harder to read. Similarly one camelCase or under_score
is easy to read, when embedded in lower case text, but when you have many
such names, the text becomes quite hard to read

Your example refutes itself, and the text is quite easy to read. In fact, as you
are obviously not aware, scriptio continua was the norm for ancient manuscripts.
>
> > With an opaque pointer, there often is no change. However if you specify that in
> > the function interface, then it's no longer an opaque pointer.
> As long as you use "void *" pointers, they are opaque, whether "const"
> or not. (I'm not keen on "void *" - it is often an excuse not to give
> more informative and safer types. But it has its uses sometimes.)
>
A const void * is not an opaque pointer. We can say something about how the
called function will handle the data it points to.

Re: C vs Haskell for XML parsing

<uc4e4t$2rdlt$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=43365&group=comp.lang.c#43365

Newsgroups: comp.lang.c
From: david.brown@hesbynett.no (David Brown)
Subject: Re: C vs Haskell for XML parsing
Date: Wed, 23 Aug 2023 09:57:16 +0200
Message-ID: <uc4e4t$2rdlt$1@dont-email.me>
In-Reply-To: <d651e08e-033d-4a90-8477-6a5fa13d30f3n@googlegroups.com>

by: David Brown - Wed, 23 Aug 2023 07:57 UTC

On 23/08/2023 06:59, Malcolm McLean wrote:
> On Tuesday, 22 August 2023 at 18:20:05 UTC+1, David Brown wrote:
>> On 22/08/2023 15:51, Malcolm McLean wrote:
>>> On Tuesday, 22 August 2023 at 14:32:01 UTC+1, David Brown wrote:
>>>> On 22/08/2023 14:38, Malcolm McLean wrote:
>>>>
>>>> That is a function that modifies its argument in an appropriate way, for
>>>> good reason. It does not leave it in a corrupted and indeterminate state.
>>>>
>>> If you write determinant() in the obvious way, using the recursive definition,
>>> then you wouldn't modify the matrix. If you use Gaussian elimination you
>>> diagonalise it. And it's not necessarily caller's business which algorithm is
>>> used. So it's probably better to say that, on function exit, the matrix is indeterminate.
>> I for one would not be happy with a "determinant" function that might or
>> might not trash my matrix. If the matrix is big enough that Gaussian
>> elimination is the best algorithm, copying it locally is negligible
>> overhead and makes the code much easier to use.
>>
> No, that's not the case. These sorts of calculation tend to be in the inner algorithmic
> core of the application.

Experience in this group has taught me that your ideas of how things
"tend to be" are usually different from other people's. (Equally, I do
not expect you to give much credence to vague and unreferenced
assertions from me.)

All I can tell you here is that /I/ would not expect a "determinant"
function to trash the matrix I pass to it. If /I/ were writing a
determinant function, I would do so in a way that did not trash the
caller's data, and did not take noticeably longer. And if I were
convinced that some operations, such as this one, were significantly
more efficient without copying, and destruction was fine for a
significant proportion of use-cases, I'd make an explicit
"determinant_destructive" version. The non-destructive version would,
of course, take a const pointer.

In no circumstances would I make a function that left the caller's data
"corrupt" or "indeterminate".

(Oh, and I'd write it in C++ to give a much better user API here. C's
great for some things - but it's not the best choice for everything.)

>>> (If you've got a zero on the diagonal, the elimination method will produce a divide
>>> by zero, so you have to resort to the recursive method).
>> (That is incorrect, but off-topic here. I refer you to Wikipedia,
>> google, or any maths book on the topic.)
>>
> No, it's a known problem.

I note you haven't actually looked at references for it. It is a known
/consideration/ - not a known /problem/. When you hit a row with a zero
on the diagonal, you simply swap it with a row further down that does
not have zero in that column. Swapping rows multiplies the determinant
by -1. If there is no such row, the determinant is 0. (In fact, it is
common to swap rows for Gaussian elimination anyway to improve numerical
stability. But that is definitely off-topic here.)
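The row-swapping rule described above, together with the const and "determinant_destructive" split suggested earlier in this post, can be sketched roughly as follows. This is only an illustration, not code from anyone in the thread: the row-major flat array layout, the size_t dimension parameter, and the choice to return NAN on allocation failure are all assumptions made for the sketch.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical destructive variant: reduces the n x n row-major matrix `a`
   to upper triangular form in place and returns the determinant. */
double determinant_destructive(double *a, size_t n)
{
    double det = 1.0;

    for (size_t col = 0; col < n; col++) {
        /* Partial pivoting: choose the row at or below the diagonal with
           the largest absolute value in this column. */
        size_t pivot = col;
        for (size_t row = col + 1; row < n; row++) {
            if (fabs(a[row * n + col]) > fabs(a[pivot * n + col]))
                pivot = row;
        }

        /* No row has a non-zero entry in this column: determinant is 0. */
        if (a[pivot * n + col] == 0.0)
            return 0.0;

        if (pivot != col) {
            for (size_t k = 0; k < n; k++) {
                double tmp = a[col * n + k];
                a[col * n + k] = a[pivot * n + k];
                a[pivot * n + k] = tmp;
            }
            det = -det; /* swapping two rows multiplies the determinant by -1 */
        }

        det *= a[col * n + col];

        /* Eliminate the entries below the pivot. */
        for (size_t row = col + 1; row < n; row++) {
            double factor = a[row * n + col] / a[col * n + col];
            for (size_t k = col; k < n; k++)
                a[row * n + k] -= factor * a[col * n + k];
        }
    }
    return det;
}

/* Hypothetical non-destructive variant: takes a const pointer and works on
   a private copy. Returns NAN if the copy cannot be allocated (an arbitrary
   choice for this sketch). */
double determinant(const double *a, size_t n)
{
    if (n == 0)
        return 1.0; /* empty product */

    double *copy = malloc(n * n * sizeof *copy);
    if (copy == NULL)
        return NAN;

    memcpy(copy, a, n * n * sizeof *copy);
    double det = determinant_destructive(copy, n);
    free(copy);
    return det;
}
```

With this shape, a caller holding a matrix with a zero in the top-left corner can pass it to determinant() and still get an answer, and the caller's data is untouched afterwards. The copy costs O(n^2) against O(n^3) for the elimination itself.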

>>
>> "bool" in C++ and C are somewhat different too - does that mean you
>> should not use "bool" in C ?
>>
> No, you shouldn't use bool in C. In C we use zero as false, non-zero as true,
> and that doesn't play well with a boolean type. There is a case for returning
> "bool" from a function, but it's deeply problematic to pass a bool as a parameter.

Sorry, but that again is /utter/ bollocks. _Bool has been a type in C
since C99, and is a far more natural choice than "int" for true/false
indications.

> The reason is that
>
> drawpath(mypath, false);
>
> is meaningless to a person reading the code.
> It should be
>
> drawpath(mypath, PATH_OPEN) ;
>
> Now we've at least some idea what the parameter means. So it needs to be an enum
> or a defined integer constant, not a bool.

Do you really believe you are making a sound argument here? Or do you
realise that you are conflating completely different concepts? I'm
trying to think of a single example in this thread where you have
actually addressed the question, and actually justified your decisions.
But it's just a field of straw men fishing for red herrings.

>
> So bool is pretty useless and we're better off without it.

No, bool is pretty useful and we are better off having it.

It does not replace "enum", it replaces a 0/1 int.
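A small sketch of that division of labour, under assumptions: PATH_OPEN comes from the example quoted above, while PATH_CLOSED, struct path, and all three function names are invented here for illustration. An enum names a choice at the call site; bool carries a yes/no answer.

```c
#include <stdbool.h>

/* PATH_CLOSED is an assumed counterpart to the PATH_OPEN in the thread. */
enum path_mode { PATH_OPEN, PATH_CLOSED };

/* Hypothetical path type, just enough to hang the example on. */
struct path {
    int npoints;
    enum path_mode mode;
};

/* A mode selection reads well as an enum at the call site:
   path_set_mode(&p, PATH_CLOSED); */
void path_set_mode(struct path *p, enum path_mode mode)
{
    p->mode = mode;
}

/* A yes/no question reads well as bool. */
bool path_is_closed(const struct path *p)
{
    return p->mode == PATH_CLOSED;
}

/* Conversion to bool yields 1 for any non-zero value, so the
   "zero is false, non-zero is true" convention survives intact.
   Returning unsigned char here instead would truncate 0x100 to 0. */
bool flag_is_set(unsigned flags, unsigned mask)
{
    return flags & mask;
}
```

Here flag_is_set(0x100u, 0x100u) is true because the conversion to bool compares the value against zero rather than truncating it.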

>>
>>> It's harder to simply glance at a
>>> function prototype and take in what it does, if it is decorated with all sorts
>>> of qualifiers. So you introduce errors as a result of code being harder to read.
>> This is from the person who thinks "thisfunctionhasaperfectlygoodname"
>> is easy to read? Surely you are joking.
>>
> You've got the highighting paradox.

I note that when you try to join multiple words together without any
kind of separation, you make spelling mistakes. Normally I would not
comment on typos in a Usenet post, but surely you see the irony? You
want to use ridiculous names for your API functions, yet regularly fail
to spot errors in longer words in your prose.

> THIS IS BIG is easy to read. But text larded
> with lots of capitals is far harder to read. Similarly one camelCase or under_score
> is easy to read, when embedded in lower case text, but when you have many
> such names, the text becomes quite hard to read
>
> Your example refutes itself, and the text is quite easy to read.

Are you /seriously/ suggesting that "readfilefromdisk" is easy to read?
Better than, say, "read_file_from_disk" or "readFileFromDisk" ? (Not
that I think the camel-case version is particularly easy to read here -
but it is a world better than your choice of jumble.)

> In fact, as you
> are obviously not aware, scriptio continua was the norm for ancient manuscripts.

I am entirely aware of that. I have not studied such things
academically, but I have a far above average interest in history and
writing systems. Are you aware of /why/ ancient manuscripts (and other
old writing) were regularly written without spacing? I'll give you a
clue - it was /not/ in order to make the text easier to read.

>>
>>> With an opaque pointer, there often is no change. However if you specify that in
>>> the function interface, then it's no longer an opaque pointer.
>> As long as you use "void *" pointers, they are opaque, whether "const"
>> or not. (I'm not keen on "void *" - it is often an excuse not to give
>> more informative and safer types. But it has its uses sometimes.)
>>
> A const void * is not an opaque pointer. We can say something about how the
> called function will handle the data it points to.

Yes - that's a good thing.
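For the read-only case, a callback interface with a "const void *" context might look like the sketch below. The names count_if, predicate_fn, struct range and in_range are hypothetical and are not taken from Baby X or any other library mentioned in the thread. The context's type stays unknown to count_if; the qualifier only promises that the data will not be modified through the pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: the context is read-only to the callback. */
typedef bool (*predicate_fn)(int value, const void *context);

/* Counts the items for which pred returns true. count_if never looks
   inside context; it only passes the pointer along. */
size_t count_if(const int *items, size_t n,
                predicate_fn pred, const void *context)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (pred(items[i], context))
            count++;
    }
    return count;
}

/* Example context type known only to the caller and its callback. */
struct range {
    int low;
    int high;
};

bool in_range(int value, const void *context)
{
    const struct range *r = context; /* no cast needed from const void * */
    /* r->low = 0; would be rejected by the compiler: *r is const. */
    return value >= r->low && value <= r->high;
}
```

A caller would write something like count_if(items, n, in_range, &r) with r declared as a const struct range. Callbacks that do need to update their context would keep a plain "void *" parameter, which is the case Malcolm describes.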


"It runs like _x, where _x is something unsavory" -- Prof. Romas Aleliunas, CS 435

devel / comp.lang.c / Re: C vs Haskell for XML parsing

devel / comp.lang.c / Re: C vs Haskell for XML parsing

"It runs like _x, where _x is something unsavory" -- Prof. Romas Aleliunas, CS 435