
Scanning

<Scanning-20230119123241@ram.dialup.fu-berlin.de>

https://news.novabbs.org/devel/article-flat.php?id=4005&group=comp.programming#4005


Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.programming
Subject: Scanning
Date: 19 Jan 2023 12:10:36 GMT
Organization: Stefan Ram
Lines: 80
Expires: 1 Jan 2024 11:59:58 GMT
Message-ID: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de D7HEWa0VtCxn7qCPhg8I4Q9o3GKNQovSptklXEaJ0RTR1n
X-Copyright: (C) Copyright 2023 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Thu, 19 Jan 2023 12:10 UTC

Some idle thoughts about scanning (lexical analysis, or
rather what comes before it) ...

Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.

It is a function "get_next_token" that on each call will
return the next character from a file to its client (caller),
except that spaces at the end of a line will be skipped.

So we read the line and strip the spaces. (One line in
Python.)
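
A minimal sketch of that line-based approach, assuming the source is a text stream that can be iterated line by line and that '\n' is the only line terminator. The helper name `strip_trailing_spaces` is made up for illustration:

```python
import io

def strip_trailing_spaces(source):
    """Yield each line of `source` with the spaces before its line end removed."""
    for line in source:
        # Separate the line terminator (if any) from the rest of the line.
        body = line.rstrip('\n')
        eol = line[len(body):]
        # The "one line": drop the spaces at the end of the body.
        yield body.rstrip(' ') + eol

demo = io.StringIO('Howdy   \nthere!')
result = ''.join(strip_trailing_spaces(demo))
print(repr(result))
```

Every line is held in memory as a whole here, which is exactly the assumption questioned next.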

But how do I know in advance if the line will fit into
memory?

Perhaps because of such fears, traditional scanners¹ do not
read lines or, Heaven forbid, files, but only characters!

They do not use random access with respect to the text to be
scanned, but sequential access, although things would be
easier with random access.

So how would you do it with this style of programming (never
reading the whole line into memory)?

"I read a character. If it's a space, I peek at the next
character; if that's a space too, I start adding spaces to my
look-ahead buffer. If an EOL is encountered, the look-ahead
buffer is discarded. Otherwise, I have to start feeding my
client from the look-ahead buffer until the look-ahead buffer
is empty."

If I am concerned that a line will not fit in memory, how do
I know that the sequence of spaces at the end of a line will
fit in memory (the look-ahead buffer)? The look-ahead buffer
could be replaced by a counter. If you are paranoid, you
would use a 64-bit counter and check it for overflow!
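
The counter idea can be sketched in Python as a generator. This is only an illustration, assuming that the space character is the only blank and that '\n' ends a line; the name `strip_trailing_spaces` is made up here. Python integers do not overflow, so the overflow check would only matter in a language with fixed-width counters.

```python
def strip_trailing_spaces(chars):
    """Yield characters from `chars`, dropping spaces that end a line.

    Only a count of pending spaces is kept, never the spaces themselves.
    """
    pending = 0  # spaces seen but not yet known to be inside the line
    for ch in chars:
        if ch == ' ':
            pending += 1
        elif ch == '\n':
            pending = 0  # the spaces were at the end of the line: discard
            yield ch
        else:
            # The spaces were inside the line: hand them back first.
            for _ in range(pending):
                yield ' '
            pending = 0
            yield ch
    # Spaces at the very end of the input are dropped as well.

result = ''.join(strip_trailing_spaces(iter('Howdy   \nhi  there!  ')))
print(repr(result))
```

The memory used is one integer plus the current character, however long the line or the run of spaces is.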

Is it worth the effort with a look-ahead buffer and
sequential access? Should you just read a line, assuming
that a line will always fit into memory, and strip the
blanks the easy way, i.e., using random access? TIA for any
comments!

¹ An example of a traditional scanner:

It only ever calls "GetCh", never "GetLine". The code could
be easier to write by reading a whole line and then just
using functions that can look at that line using random
access to get the next symbol (maybe using regular
expressions). But a traditional scanner carefully only ever
reads a single character and manages a state.

PROCEDURE GetSym;

  VAR i : CARDINAL;

BEGIN
  WHILE ch <= ' ' DO GetCh END;
  IF ch = '/' THEN
    SkipLine;
    WHILE ch <= ' ' DO GetCh END
  END;
  IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
    i := 0;
    sym := literal;
    REPEAT
      IF i < IdLength THEN
        id [i] := ch;
        INC (i)
      END;
      IF ch > 'Z' THEN sym := ident END;
      GetCh
    ...

Re: Scanning

<tqbdu1$1hm7a$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=4006&group=comp.programming#4006

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: rjh@cpax.org.uk (Richard Heathfield)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Thu, 19 Jan 2023 12:43:45 +0000
Organization: Fix this later
Lines: 112
Message-ID: <tqbdu1$1hm7a$1@dont-email.me>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 19 Jan 2023 12:43:45 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="6fc9ceb16a8a5f34f36f3017f74a7b7e";
logging-data="1628394"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18incwfoObTbR3E2R61TbUEawl5TgrdytNjORijHMGkFQ=="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.4.2
Cancel-Lock: sha1:GnYZH3F0mOccVp97m1hr1dZEdik=
In-Reply-To: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
Content-Language: en-GB

by: Richard Heathfield - Thu, 19 Jan 2023 12:43 UTC

On 19/01/2023 12:10 pm, Stefan Ram wrote:
> Some idle thoughts about scanning (lexical analysis, or
> rather what comes before it) ...
>
> Let's take a very simple task: This scanner for text files
> has nothing more to do than to return every character,
> except to strip the spaces at the end of a line.
>
> It is a function "get_next_token" that on each call will
> return the next character from a file to its client (caller),
> except that spaces at the end of a line will be skipped.
>
> So we read the line and strip the spaces. (One line in
> Python.)
>
> But how do I know in advance if the line will fit into
> memory?
>
> Perhaps because of such fears, traditional scanners¹ do not
> read lines or, Heaven forbid, files, but only characters!
>
> They do not use random access with respect to the text to be
> scanned, but sequential access, although things would be
> easier with random access.
>
> So how would you do it with this style of programming (never
> reading the whole line into memory)?
>
> "I read a character. If it's a space, I peek at the next
> character, if that's a space, I start adding spaces to my
> look-ahead buffer. If an EOL is encountered, the look-ahead
> buffer is discarded. Otherwise, I have to start feeding my
> client from the lookahead buffer until the lookahead buffer
> is empty."
>
> If I am concerned that a line will not fit in memory, how do
> I know that the sequence of spaces at the end of a line will
> fit in memory (the look-ahead buffer)? The look-ahead buffer
> could be replaced by a counter. If you are paranoid, you
> would use a 64-bit counter and check it for overflow!
>
> Is it worth the effort with a look-ahead buffer and
> sequential access? Should you just read a line, assuming
> that a line will always fit into memory, and strip the
> blanks the easy way, i.e., using random access? TIA for any
> comments!
>
> 1
>
> an example of a traditional scanner:
>
> It only ever calls "GetCh", never "GetLine". The code could
> be easier to write by reading a whole line and then just
> using functions that can look at that line using random
> access to get the next symbol (maybe using regular
> expressions). But a traditional scanner carefully only ever
> reads a single character and manages a state.
>
> PROCEDURE GetSym;
>
> VAR i : CARDINAL;
>
> BEGIN
> WHILE ch <= ' ' DO GetCh END;
> IF ch = '/' THEN
> SkipLine;
> WHILE ch <= ' ' DO GetCh END
> END;
> IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> i := 0;
> sym := literal;
> REPEAT
> IF i < IdLength THEN
> id [i] := ch;
> INC (i)
> END;
> IF ch > 'Z' THEN sym := ident END;
> GetCh
> ...

man 3 realloc

This was a perennial comp.lang.c topic back in the day.

My interface looked (and still looks) like this:

#define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
#define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
#define FGDATA_REDUCE 1

int fgetline(char **line, size_t *size, size_t maxrecsize,
             FILE *fp, unsigned int flags, size_t *plen);

It's easier to use than it might look:

char *data = NULL; /* where will the data go? NULL is fine */
size_t size = 0;   /* how much space do we have right now? */
size_t len = 0;    /* after call, holds line length */

while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
{
  if(len > 0)

If you want fgetline.c and don't have 20 years of clc archives,
just yell.

--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

Re: Scanning

<e_ayL.85067$SdR7.59556@fx04.iad>

https://news.novabbs.org/devel/article-flat.php?id=4007&group=comp.programming#4007

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx04.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.6.1
Subject: Re: Scanning
Newsgroups: comp.programming
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
From: Richard@Damon-Family.org (Richard Damon)
Content-Language: en-US
In-Reply-To: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 95
Message-ID: <e_ayL.85067$SdR7.59556@fx04.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Thu, 19 Jan 2023 07:48:09 -0500
X-Received-Bytes: 4350

by: Richard Damon - Thu, 19 Jan 2023 12:48 UTC

On 1/19/23 7:10 AM, Stefan Ram wrote:
> Some idle thoughts about scanning (lexical analysis, or
> rather what comes before it) ...
>
> Let's take a very simple task: This scanner for text files
> has nothing more to do than to return every character,
> except to strip the spaces at the end of a line.
>
> It is a function "get_next_token" that on each call will
> return the next character from a file to its client (caller),
> except that spaces at the end of a line will be skipped.
>
> So we read the line and strip the spaces. (One line in
> Python.)
>
> But how do I know in advance if the line will fit into
> memory?
>
> Perhaps because of such fears, traditional scanners¹ do not
> read lines or, Heaven forbid, files, but only characters!
>
> They do not use random access with respect to the text to be
> scanned, but sequential access, although things would be
> easier with random access.
>
> So how would you do it with this style of programming (never
> reading the whole line into memory)?
>
> "I read a character. If it's a space, I peek at the next
> character, if that's a space, I start adding spaces to my
> look-ahead buffer. If an EOL is encountered, the look-ahead
> buffer is discarded. Otherwise, I have to start feeding my
> client from the lookahead buffer until the lookahead buffer
> is empty."
>
> If I am concerned that a line will not fit in memory, how do
> I know that the sequence of spaces at the end of a line will
> fit in memory (the look-ahead buffer)? The look-ahead buffer
> could be replaced by a counter. If you are paranoid, you
> would use a 64-bit counter and check it for overflow!
>
> Is it worth the effort with a look-ahead buffer and
> sequential access? Should you just read a line, assuming
> that a line will always fit into memory, and strip the
> blanks the easy way, i.e., using random access? TIA for any
> comments!
>
> 1
>
> an example of a traditional scanner:
>
> It only ever calls "GetCh", never "GetLine". The code could
> be easier to write by reading a whole line and then just
> using functions that can look at that line using random
> access to get the next symbol (maybe using regular
> expressions). But a traditional scanner carefully only ever
> reads a single character and manages a state.
>
> PROCEDURE GetSym;
>
> VAR i : CARDINAL;
>
> BEGIN
> WHILE ch <= ' ' DO GetCh END;
> IF ch = '/' THEN
> SkipLine;
> WHILE ch <= ' ' DO GetCh END
> END;
> IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> i := 0;
> sym := literal;
> REPEAT
> IF i < IdLength THEN
> id [i] := ch;
> INC (i)
> END;
> IF ch > 'Z' THEN sym := ident END;
> GetCh
> ...
>
>

Because of the particulars of this problem, you don't need a look-ahead
buffer, just a count of spaces you have seen and what character is after
the spaces.

If you had to handle multiple types of whitespace (an arbitrary mix of
tabs and spaces, for example), then you would need a buffer, and you
would need to handle the case of that sequence being "too long".

In general, parsers need a way to report that the file is too
"complicated" for them.

Even the simple counter version has a limit, reached whenever the type
of the counter overflows.
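
A rough Python sketch of the buffered variant with an explicit limit, assuming that blanks are spaces and tabs. The names `BlankRunTooLong`, `strip_trailing_blanks` and `max_run` are invented for this illustration:

```python
class BlankRunTooLong(Exception):
    """Raised when a run of blanks exceeds the configured limit."""

def strip_trailing_blanks(chars, max_run=1024):
    """Yield characters, dropping spaces and tabs that end a line."""
    pending = []  # blanks not yet known to end the line
    for ch in chars:
        if ch in ' \t':
            if len(pending) >= max_run:
                # Report the input as "too complicated" instead of
                # letting the buffer grow without bound.
                raise BlankRunTooLong(f'more than {max_run} consecutive blanks')
            pending.append(ch)
        elif ch == '\n':
            pending.clear()  # blanks ended the line: discard them
            yield ch
        else:
            yield from pending  # blanks were inside the line: replay them
            pending.clear()
            yield ch

result = ''.join(strip_trailing_blanks('a \t b\t \nc'))
print(repr(result))
```

The buffer is needed because a mix of tabs and spaces has to be replayed exactly, which a bare count cannot do.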

Re: Scanning

<scanners-20230119143810@ram.dialup.fu-berlin.de>

https://news.novabbs.org/devel/article-flat.php?id=4008&group=comp.programming#4008

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: 19 Jan 2023 13:38:36 GMT
Organization: Stefan Ram
Lines: 13
Expires: 1 Jan 2024 11:59:58 GMT
Message-ID: <scanners-20230119143810@ram.dialup.fu-berlin.de>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <tqbdu1$1hm7a$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de 1rgFsXokIzzsHqLRni5M4AjKoRV/KNe/Oqzh3t6Ay3P3OW
X-Copyright: (C) Copyright 2023 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Thu, 19 Jan 2023 13:38 UTC

Richard Heathfield <rjh@cpax.org.uk> writes:
>This was a perennial comp.lang.c topic back in the day.

But what about writing a scanner in languages with automatic
memory management where reading a whole line is very simple
and assuming an input language that limits line length to
some reasonable value, say, 1,000,000 characters?

In such a language, would there still be reasons not to
read the whole line into memory, but to read it char-by-char
as traditional scanners do?
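
One way to enforce such a limit while still reading line by line is to pass a size to `readline`. A rough sketch, assuming a text-mode stream with '\n' line ends; `MAX_LINE`, `LineTooLong` and `read_lines` are hypothetical names:

```python
import io

MAX_LINE = 1_000_000  # assumed limit of the input language, in characters

class LineTooLong(Exception):
    """Raised when a line body is longer than the permitted maximum."""

def read_lines(source, max_line=MAX_LINE):
    """Yield lines without trailing spaces, refusing overlong lines."""
    while True:
        # Ask for one character more than the limit: enough for a
        # maximal line plus its newline, and enough to detect an overrun.
        line = source.readline(max_line + 1)
        if not line:
            return  # end of input
        if len(line) > max_line and not line.endswith('\n'):
            raise LineTooLong(f'line exceeds {max_line} characters')
        yield line.rstrip('\n').rstrip(' ')

lines = list(read_lines(io.StringIO('ab  \ncd'), max_line=4))
print(lines)
```

With the size argument, memory use stays bounded even if the input violates the limit.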

Re: Scanning

<tqbhs2$nlo$1@gioia.aioe.org>

https://news.novabbs.org/devel/article-flat.php?id=4009&group=comp.programming#4009

Path: i2pn2.org!i2pn.org!aioe.org!2bOJBbN/dOuClNqbvu11SQ.user.46.165.242.91.POSTED!not-for-mail
From: mailbox@dmitry-kazakov.de (Dmitry A. Kazakov)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Thu, 19 Jan 2023 14:50:58 +0100
Organization: Aioe.org NNTP Server
Message-ID: <tqbhs2$nlo$1@gioia.aioe.org>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="24248"; posting-host="2bOJBbN/dOuClNqbvu11SQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.6.1
Content-Language: en-US
X-Notice: Filtered by postfilter v. 0.9.2

by: Dmitry A. Kazakov - Thu, 19 Jan 2023 13:50 UTC

On 2023-01-19 13:10, Stefan Ram wrote:

> But how do I know in advance if the line will fit into
> memory?

No idea, my parser reads a whole source line into the buffer.

> Perhaps because of such fears, traditional scanners¹ do not
> read lines or, Heaven forbid, files, but only characters!

I think it is more C/UNIX tradition coming from having neither proper
strings in the language nor lines/records in the filesystem.

> So how would you do it with this style of programming (never
> reading the whole line into memory)?

By never following this style and never using scanners, lexers,
tokenizers and other primitive stuff. I do all that in a single pass
that produces either the code or else the AST.

> "I read a character. If it's a space, I peek at the next
> character, if that's a space, I start adding spaces to my
> look-ahead buffer. If an EOL is encountered, the look-ahead
> buffer is discarded. Otherwise, I have to start feeding my
> client from the lookahead buffer until the lookahead buffer
> is empty."

Reasonable languages deploy the rule that one blank character is
equivalent to any number of blank characters, so you could simply pass
one single space further. Note that you have to annotate tokens by
source location anyway (another reason for ditching the scanner
altogether). So you do not need to care about what this blank was built
of. And yet another reason not to use a scanner is that the blank can be
a part of a, possibly malformed, comment or literal.
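
The two ideas in this paragraph (one blank stands for any run of blanks, and every item carries its source location) might look like this in Python. It is only a sketch with an invented name, `tokens_with_locations`, and it ignores comments and literals entirely:

```python
def tokens_with_locations(chars):
    """Yield (line, column, text) triples; a run of blanks becomes one ' '."""
    line, column = 1, 1
    in_blank_run = False
    for ch in chars:
        if ch in ' \t':
            if not in_blank_run:
                # Report the run once, at the position of its first blank.
                yield (line, column, ' ')
                in_blank_run = True
            column += 1
        elif ch == '\n':
            yield (line, column, '\n')
            line, column = line + 1, 1
            in_blank_run = False
        else:
            yield (line, column, ch)
            column += 1
            in_blank_run = False

items = list(tokens_with_locations('a  b\nc'))
print(items)
```

Because the columns keep counting through the collapsed blanks, later stages can still point at the exact source position.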

> Is it worth the effort with a look-ahead buffer and
> sequential access? Should you just read a line, assuming
> that a line will always fit into memory, and strip the
> blanks the easy way, i.e., using random access?

My parser works with an abstract source object. The implementation of
the source object maintains an internal line buffer, whose size is a
parameter. Whether it is set to 1TB or 1024 bytes, the parser does not care.
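
A hypothetical Python rendering of such a source object, not the poster's actual implementation: the class `Source`, its `next_line` method and the stand-in parser `count_words` are all assumptions made for illustration.

```python
import io

class Source:
    """Hypothetical source object that exposes one line at a time."""

    def __init__(self, stream, buffer_size=1024):
        self._stream = stream
        self._buffer_size = buffer_size  # longest line body the buffer may hold
        self.line = ''                   # current line, without its terminator
        self.line_number = 0

    def next_line(self):
        """Load the next line into the buffer; return False at end of input."""
        text = self._stream.readline(self._buffer_size + 1)
        if not text:
            return False
        if len(text) > self._buffer_size and not text.endswith('\n'):
            raise ValueError(f'line {self.line_number + 1} does not fit the buffer')
        self.line = text.rstrip('\n')
        self.line_number += 1
        return True

def count_words(source):
    """A stand-in "parser" that only sees lines through the source object."""
    total = 0
    while source.next_line():
        total += len(source.line.split())
    return total

text = 'one two\nthree\n'
small = count_words(Source(io.StringIO(text), buffer_size=8))
large = count_words(Source(io.StringIO(text), buffer_size=1_000_000))
print(small, large)
```

The parser code is the same for both buffer sizes; only the source object knows the limit and reports when a line exceeds it.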

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Scanning

<tqbi6d$1hm7a$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=4010&group=comp.programming#4010

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: rjh@cpax.org.uk (Richard Heathfield)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Thu, 19 Jan 2023 13:56:29 +0000
Organization: Fix this later
Lines: 35
Message-ID: <tqbi6d$1hm7a$2@dont-email.me>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
<tqbdu1$1hm7a$1@dont-email.me>
<scanners-20230119143810@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 19 Jan 2023 13:56:29 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="6fc9ceb16a8a5f34f36f3017f74a7b7e";
logging-data="1628394"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+m/RoLSEb7dDK4T9zvnjjAiArLIsZWFWwRaIp10WsqrA=="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.4.2
Cancel-Lock: sha1:Luxhp/9cw7b9GXc3AXLxG36ipKU=
In-Reply-To: <scanners-20230119143810@ram.dialup.fu-berlin.de>
Content-Language: en-GB

by: Richard Heathfield - Thu, 19 Jan 2023 13:56 UTC

On 19/01/2023 1:38 pm, Stefan Ram wrote:
> Richard Heathfield <rjh@cpax.org.uk> writes:
>> This was a perennial comp.lang.c topic back in the day.
>
> But what about writing a scanner in languages with automatic
> memory management where reading a whole line is very simple
> and assuming an input language that limits line length to
> some reasonable value, say, 1,000,000 characters?
>
> In such a language, would there still be reasons not to
> read the whole line into memory, but to read it char-by-char
> as traditional scanners do?

There are always reasons, and sometimes they conflict.

For example, memory management, which should be done by the
language because it's too important to be left to the programmer,
and which should be done by the programmer because it's too
important to be left to the language.

What are your priorities? Run speed? Speed of development? Code
re-use? Scalability? Programmer cost? Robustness? Security?

And what are your constraints?

I'm not asking for your answer to these questions. I'm just
pointing out that the answer to your question will depend at
least in part on the answers to mine.

--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

Re: Scanning

<scanner-20230119154238@ram.dialup.fu-berlin.de>

https://news.novabbs.org/devel/article-flat.php?id=4011&group=comp.programming#4011

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: 19 Jan 2023 14:48:29 GMT
Organization: Stefan Ram
Lines: 124
Expires: 1 Jan 2024 11:59:58 GMT
Message-ID: <scanner-20230119154238@ram.dialup.fu-berlin.de>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de 7IK5nrQIBVVw0pqydBaiuwhjEw+mNcjXvI2u3n8sFtWCfA
X-Copyright: (C) Copyright 2023 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Thu, 19 Jan 2023 14:48 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:
>Let's take a very simple task: This scanner for text files
>has nothing more to do than to return every character,
>except to strip the spaces at the end of a line.

Richard said that it matters what I need this for.

I'd like to implement a tiny markup language similar
to languages like "Markdown" or "reStructuredText".
It should ignore spaces at the end of lines.
I'm going to implement it in Python.

Here is a first draft of a scanner that strips
spaces at the end of lines. It works by reading
single characters from the source.

For demonstration purposes, I have written spaces
as underlines "_".

The demo takes

Howdy___\nthere!

as input and outputs

Howdy\nthere!\n

. (It also tries to insert '\n' at the end of a
source when there is no '\n' at the end.)

The input text is given in the source code via

input_text = iter( 'Howdy___\nthere!' )

. What I now need to do next is to write more
tests in order to find errors. (I avoided using
classes to make the code a bit easier to read for
the newsgroup, but the code also will be changed
soon to use a class definition.)

Python 3.9

main.py

def catcode( ch ):
    # 5 means: "this is a line terminator"
    # 10 means: "this is a blank space"
    # 11 means: "this is a plain character"
    if ch == '\n': return 5
    if ch == ' ': return 10
    if ch == '_': return 10 # for debugging, make "_" a space
    if ch == '\t': return 10
    return 11

spaces_seen = [] # a buffer for spaces collected
char_read = '' # a buffer allowing one-character lookahead
previous = '' # the previous character read by "get_next_character"
terminated = False # set after the last character of the source was read

def get_next_character():
    # insert EOL at the end of the last line if missing
    global previous
    global terminated
    global char_read
    if terminated: raise StopIteration
    if char_read:
        ch = char_read; char_read = ''
    else:
        try:
            ch = next( input_text )
        except StopIteration:
            if previous != '' and catcode( previous )!= 5:
                # if there is no EOL at EOF, insert one
                ch = '\n'
                terminated = True
            else:
                raise StopIteration
    previous = ch
    return ch

def get_next_token():
    # skip blanks at the end of a line
    global char_read
    global spaces_seen
    while True:
        if not spaces_seen:
            ch = get_next_character()
            if catcode( ch )== 10:
                spaces_seen =[ ch ]
                while True:
                    ch = get_next_character()
                    if catcode( ch )== 10:
                        spaces_seen += ch
                    elif catcode( ch )== 5:
                        spaces_seen = []
                        return( 0, ch, 5, f'{spaces_seen=}' )
                    else:
                        char_read = ch
                        break
            else:
                return( 0, ch, catcode( ch ), f'{spaces_seen=}')
        if spaces_seen:
            ch = spaces_seen.pop( 0 )
            return( 1, ch, catcode( ch ), f'{spaces_seen=}')

input_text = iter( 'Howdy___\nthere!' )

def main():
    result = ''
    while True:
        try:
            token = get_next_token()
            result += token[ 1 ]
        except StopIteration:
            break
    print( repr( result ))

main()

stdout

'Howdy\nthere!\n'
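
As a sketch of where the announced class definition and the additional tests could go, here is one possible shape. It is not the poster's later code: `TrailingBlankStripper` and `reference` are invented names, blanks are assumed to be spaces and tabs only (no "_" debugging aid), and the reference result is computed the easy, line-based way.

```python
from collections import deque

class TrailingBlankStripper:
    """Iterator over characters with blanks at line ends removed.

    Like the draft above, it appends a newline if a non-empty source
    does not end with one.
    """

    def __init__(self, chars):
        self._chars = iter(chars)
        self._pending = []     # blanks not yet known to end the line
        self._ready = deque()  # characters cleared for output
        self._previous = ''    # last character read from the source
        self._done = False

    def __iter__(self):
        return self

    def __next__(self):
        while not self._ready:
            if self._done:
                raise StopIteration
            ch = next(self._chars, None)
            if ch is None:
                self._done = True
                # Pending blanks are dropped; add the missing EOL if needed.
                if self._previous not in ('', '\n'):
                    self._ready.append('\n')
            elif ch in ' \t':
                self._pending.append(ch)
            elif ch == '\n':
                self._pending.clear()
                self._ready.append(ch)
            else:
                self._ready.extend(self._pending)
                self._pending.clear()
                self._ready.append(ch)
            if ch is not None:
                self._previous = ch
        return self._ready.popleft()

def reference(text):
    """Line-based reference result: rstrip each line, end each with a newline."""
    if text == '':
        return ''
    if text.endswith('\n'):
        text = text[:-1]
    return ''.join(line.rstrip(' \t') + '\n' for line in text.split('\n'))

cases = ['', 'Howdy   \nthere!', 'a\n', '\n', '\n\n', '  \n  ', 'a \t b\t', ' x']
for text in cases:
    assert ''.join(TrailingBlankStripper(text)) == reference(text), repr(text)
print(repr(''.join(TrailingBlankStripper('Howdy   \nthere!'))))
```

Keeping the state in an object instead of module globals also allows several sources to be scanned at the same time.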

Re: Scanning

<tqbm8u$1hm7b$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=4012&group=comp.programming#4012

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: rjh@cpax.org.uk (Richard Heathfield)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Thu, 19 Jan 2023 15:06:05 +0000
Organization: Fix this later
Lines: 29
Message-ID: <tqbm8u$1hm7b$1@dont-email.me>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
<scanner-20230119154238@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 19 Jan 2023 15:06:06 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="6fc9ceb16a8a5f34f36f3017f74a7b7e";
logging-data="1628395"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+VeRbosoKpGug7NhQuqaXSvBHk9TusLDzmG4lsb9F9mQ=="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.4.2
Cancel-Lock: sha1:2nUiJgxBIUMETJd9Qs/AcGUqf+Y=
In-Reply-To: <scanner-20230119154238@ram.dialup.fu-berlin.de>
Content-Language: en-GB

by: Richard Heathfield - Thu, 19 Jan 2023 15:06 UTC

On 19/01/2023 2:48 pm, Stefan Ram wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>> Let's take a very simple task: This scanner for text files
>> has nothing more to do than to return every character,
>> except to strip the spaces at the end of a line.
>
> Richard said that it matters what I need this for.
>
> I'd like to implement a tiny markup language

Okay, BIG job with lots of complicated, so strive to keep each
part relatively simple if you ever hope to get it working. Do it
in whatever way comes most natural to your programming style,
because that's how /you/ can define 'simple'. You're using
Python, so I guess you're not overly concerned by performance, so
do it the way you personally find easiest. I'm guessing you'll go
for line by line and lean on Python's memory management.

But write this down somewhere: if, further down the line, your
parser turns out to be too slow and the profiler blames this bit,
rewriting it to go byte by byte might well be one of the ways you
could speed it up.

--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

Re: Scanning

<iterations-20230119174516@ram.dialup.fu-berlin.de>

https://news.novabbs.org/devel/article-flat.php?id=4013&group=comp.programming#4013

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: 19 Jan 2023 16:46:08 GMT
Organization: Stefan Ram
Lines: 15
Expires: 1 Jan 2024 11:59:58 GMT
Message-ID: <iterations-20230119174516@ram.dialup.fu-berlin.de>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <scanner-20230119154238@ram.dialup.fu-berlin.de> <tqbm8u$1hm7b$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de 5qNBuHPNNbYV29aC4MW5qwXWTKgdq4D+9Oa7H8gO/jFIss
X-Copyright: (C) Copyright 2023 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Thu, 19 Jan 2023 16:46 UTC

Richard Heathfield <rjh@cpax.org.uk> writes:
>Okay, BIG job with lots of complicated, so strive to keep each
>part relatively simple if you ever hope to get it working.

I know I have to simplify things to make it work. To do this,
I have broken the development into several iterations.
During the first iteration, I just want to read a /single/
paragraph containing only simple words without any markup,
parse it into an internal format, and then generate two
output formats from it: HTML and plain text.

In the next iteration, I want to extend this to a sequence
of paragraphs. Still without any real markup.
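
A rough sketch of what these first two iterations could look like in Python. Everything here is an assumption for illustration: paragraphs are taken to be separated by blank lines, the internal format is a plain list of word lists, and the names `parse`, `to_html` and `to_text` are invented.

```python
import html
import textwrap

def parse(source_text):
    """Parse text into a list of paragraphs, each a list of words."""
    paragraphs = []
    words = []
    for line in source_text.split('\n'):
        if line.strip():
            # split() without arguments also ignores blanks at line ends
            words.extend(line.split())
        elif words:
            # a blank line closes the current paragraph
            paragraphs.append(words)
            words = []
    if words:
        paragraphs.append(words)
    return paragraphs

def to_html(paragraphs):
    """Render each paragraph as one <p> element."""
    return '\n'.join('<p>' + html.escape(' '.join(p)) + '</p>'
                     for p in paragraphs)

def to_text(paragraphs, width=72):
    """Render paragraphs as wrapped plain text separated by blank lines."""
    return '\n\n'.join(textwrap.fill(' '.join(p), width=width)
                       for p in paragraphs)

doc = parse('Hello   world \nagain\n\n\nSecond paragraph\n')
print(to_html(doc))
print(to_text(doc))
```

A single paragraph is just the special case of a one-element list, so the second iteration does not change the internal format.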

Re: Scanning

<87v8l2z9bv.fsf@bsb.me.uk>

https://news.novabbs.org/devel/article-flat.php?id=4014&group=comp.programming#4014

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: ben.usenet@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Thu, 19 Jan 2023 18:08:04 +0000
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <87v8l2z9bv.fsf@bsb.me.uk>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: reader01.eternal-september.org; posting-host="74200513a5ebd5d644fe041ee1cd9c3d";
logging-data="1743436"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/WfDSEHXlmqOXR7sSiYlGbI83w1i/9siM="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
Cancel-Lock: sha1:WQfW0hVdX2KQuNZLk2Phd6Ww82M=
sha1:8jyR4X/QSXKVpNmd6em99Zt3yms=
X-BSB-Auth: 1.4bc1f934776c72802ff4.20230119180804GMT.87v8l2z9bv.fsf@bsb.me.uk

by: Ben Bacarisse - Thu, 19 Jan 2023 18:08 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:

> Some idle thoughts about scanning (lexical analysis, or
> rather what comes before it) ...
>
> Let's take a very simple task: This scanner for text files
> has nothing more to do than to return every character,
> except to strip the spaces at the end of a line.
>
> It is a function "get_next_token" that on each call will
> return the next character from a file to its client (caller),
> except that spaces at the end of a line will be skipped.
>
> So we read the line and strip the spaces. (One line in
> Python.)
>
> But how do I know in advance if the line will fit into
> memory?

That's a huge assumption! There's no need to read the line just to skip
spaces at the end. All you need to do is read and count them so you can
"hand back" the right number of spaces if you don't see a newline
character.

But then this is not the real problem, I suspect. You probably want to
skip spaces and tabs and probably other things at the end of a line.
Then again, maybe you really want to replace multiple spaces with just
one at this stage of the processing? That is the trouble with cut-down
problem statements -- they can have simple solutions that don't apply in
the real case.

Mind you, I would try hard to avoid reading a line unless a line is
really an important structure. You might only need to store the
largest token.

--
Ben.

Re: Scanning

<scanner-20230120125512@ram.dialup.fu-berlin.de>

https://news.novabbs.org/devel/article-flat.php?id=4016&group=comp.programming#4016

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: 20 Jan 2023 12:16:57 GMT
Organization: Stefan Ram
Lines: 119
Expires: 1 Jan 2024 11:59:58 GMT
Message-ID: <scanner-20230120125512@ram.dialup.fu-berlin.de>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <scanner-20230119154238@ram.dialup.fu-berlin.de> <tqbm8u$1hm7b$1@dont-email.me> <iterations-20230119174516@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de FMmducLABx6fFwRR9oOg3w5skfkb7049n3duxXLF4/KtdE
X-Copyright: (C) Copyright 2023 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Fri, 20 Jan 2023 12:16 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:
>In the next iteration, I want to extend this to a sequence
>of paragraphs. Still without any real markup.

(As before, I was not able to shorten all lines of this post
to the 72 characters which are recommended for Usenet posts,
so please bear with me while some lines below will exceed
the length of 72 characters. I do not ignore Usenet customs
lightly, but only after painstaking consideration.)

This post is just a report and contains no questions to the
group, so please read on only if you are interested in the topic!

It was a bit difficult for me to figure out how to properly
do things, so I resorted to reading Chapter 8 of the TeXbook
where the scanning of TeX is explained. To verify my understanding,
I wrote small snippets of TeX. For example,

\tracingscantokens1
\tracingcommands3
\tracingonline1
H

\tracingscantokens0

(That is, one line containing only an "H" and then one empty line.)

gave this output on TeX:

{the letter H}
{horizontal mode: the letter H}
{blank space }
{\par}

This is because TeX converts the first \n (directly after "H")
to a blank space and the next (directly below "H") to the control
sequence "\par".

I then tried to imitate this.

Here are the test cases I wrote for my code in Python:

catcode_dict[ '\t' ]= catcode_of_space # repeated here for clarification
process( 'Howdy___\nthere!' )
process( ' Howdy___\n there!_' )
process( 'H__\n\n' )
process( ' Howdy\n\n there!\n\n' )
process( ' Howdy\n \n there!\n \n' )
process( ' Howdy\n\n\n there!\n \n\n' )
process( 'Howdy\n\t\nthere!' )
catcode_dict[ '\t' ]= catcode_of_other
process( 'Howdy\n\t\nthere! (catcode of tab temporarily rededfined to "other")' )
catcode_dict[ '\t' ]= catcode_of_space
process( '' )
process( ' ' )
process( ' ' )
process( r''' In a Galaxy, there lived a man.
He was happy when he was typing
paragraphs.''' )

As the output below shows, my scanner, just like TeX, ignores
a tab at the end of a line when the tab character has been
given the category of "space character" (as in plain TeX),
but not when it has been given the category of "other
character" (as in INITEX).

The output follows below. Most tests pass, but there is
still one error. (The error is: When the input is a sequence
of blanks, it produces [par], but should produce nothing.)
For demonstration purposes, the underscore "_" was made to
act like a blank space.

The actual output of the scanner is a sequence of tokens,
but it was assembled into a string for the demonstration
output below.

The output often ends with one space, because a '\n' is
added to the end of the input if it's missing, and this
then is being converted to a space. So, ironically, while
I set out to strip spaces at the end of lines, I now
sometimes add them to the end of lines!

'Howdy___\nthere!' (=input) ==>
'Howdy there! ' (=output)

' Howdy___\n there!_' (=input) ==>
'Howdy there! ' (=output)

'H__\n\n' (=input) ==>
'H [par]' (=output)

' Howdy\n\n there!\n\n' (=input) ==>
'Howdy [par]there! [par]' (=output)

' Howdy\n \n there!\n \n' (=input) ==>
'Howdy [par]there! [par]' (=output)

' Howdy\n\n\n there!\n \n\n' (=input) ==>
'Howdy [par][par]there! [par][par]' (=output)

'Howdy\n\t\nthere!' (=input) ==>
'Howdy [par]there! ' (=output)

'Howdy\n\t\nthere! (catcode of tab temporarily rededfined to "other")' (=input) ==>
'Howdy \t there! (catcode of tab temporarily rededfined to "other") ' (=output)

'' (=input) ==>
'' (=output)

' ' (=input) ==>
'[par]' (=output)

' In a Galaxy, there lived a man.\nHe was happy when he was typing\nparagraphs.' (=input) ==>
'In a Galaxy, there lived a man. He was happy when he was typing paragraphs. ' (=output)
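
The behaviour in the outputs above can be approximated with a three-state
scanner in the spirit of TeXbook Chapter 8 (state N: new line, M: middle of
line, S: skipping blanks). This is only a sketch under assumptions and not
the code that produced the output above: the names scan, SPACE, OTHER and
DEFAULT_CATCODES are hypothetical, the marker string "[par]" stands in for
the \par token, tokens are joined into a string right away, and the
underscore is given the category "space" as in the demonstration.

```python
SPACE, OTHER = "space", "other"  # hypothetical category names

# Assumed demo table: blank, underscore and tab all count as spaces.
DEFAULT_CATCODES = {" ": SPACE, "_": SPACE, "\t": SPACE}


def scan(text, catcodes=None):
    """Return the token stream for `text`, assembled into one string."""
    catcodes = DEFAULT_CATCODES if catcodes is None else catcodes
    lines = text.split("\n")
    if lines[-1] == "":
        lines.pop()  # input ended with "\n" (or was empty): no extra line
    out = []
    for line in lines:
        state = "N"  # every line starts in state N (new line)
        for ch in line:
            if catcodes.get(ch, OTHER) == SPACE:
                if state == "M":  # first space after text: one space token
                    out.append(" ")
                    state = "S"
                # in state N or S the space is ignored
            else:
                out.append(ch)
                state = "M"
        # End of line, supplied even when the last line lacks a "\n".
        if state == "N":
            out.append("[par]")  # line held nothing but ignored spaces
        elif state == "M":
            out.append(" ")  # line break after text becomes a space
        # in state S the end of line produces nothing
    return "".join(out)


if __name__ == "__main__":
    print(repr(scan("Howdy___\nthere!")))
    print(repr(scan(" ")))
```

With this scheme no separate stripping pass is needed: trailing blanks are
reached in state S and ignored, and the end of line then produces nothing.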

Re: Scanning

<tqf4jp$28cng$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=4017&group=comp.programming#4017

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: uaigh@icloud.com (Noel Duffy)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Sat, 21 Jan 2023 11:29:13 +1300
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <tqf4jp$28cng$1@dont-email.me>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de>
<scanner-20230119154238@ram.dialup.fu-berlin.de>
<tqbm8u$1hm7b$1@dont-email.me>
<iterations-20230119174516@ram.dialup.fu-berlin.de>
<scanner-20230120125512@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 20 Jan 2023 22:29:14 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="fc5041f7bc005598161040701d38dfb0";
logging-data="2372336"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19wpnE0K250nBF4E2FZkuCOrLMIXcRFI0A="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.6.1
Cancel-Lock: sha1:005b15nVHsRaJjux/7RWFjwH9Yk=
In-Reply-To: <scanner-20230120125512@ram.dialup.fu-berlin.de>
Content-Language: en-US

by: Noel Duffy - Fri, 20 Jan 2023 22:29 UTC

On 21/01/23 01:16, Stefan Ram wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) writes:
[..]
>
> The output often ends with one space, because a '\n' is
> added to the end of the input if it's missing, and this
> then is being converted to a space. So, ironically, while
> I set out to strip spaces at the end of lines, I now
> sometimes add them to the end of lines!

While I don't have any great insight to offer, I did write a small
markup engine a few years ago (in Object Pascal). What you say above
brought back memories of struggles I had with my code too. The
conclusion I came to at the time is that when it comes to things like
spacing, there are several equally valid ways to do it, and you'll
probably want different handling for different use-cases, so it's better
to parameterize it so that users of your code can set which handling
they prefer. I went with making it a parameter. It's a bit more work but
the flexibility is usually worth it in the long run.
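
One way such a parameter might look, sketched in Python to match the rest of
the thread (the engine described above was written in Object Pascal, and
none of these names come from it). TrailingSpace and normalize_line are
hypothetical, and the three policies are only examples of the "equally valid
ways" mentioned above.

```python
from enum import Enum


class TrailingSpace(Enum):
    """Hypothetical policies for spaces at the end of a line."""
    STRIP = "strip"        # remove them entirely
    KEEP = "keep"          # leave the line untouched
    COLLAPSE = "collapse"  # reduce a run of trailing spaces to one space


def normalize_line(line, policy=TrailingSpace.STRIP):
    """Apply the chosen trailing-space policy to one line (without its newline)."""
    stripped = line.rstrip(" ")
    if policy is TrailingSpace.STRIP:
        return stripped
    if policy is TrailingSpace.COLLAPSE:
        # Add a single space back only if there was at least one to begin with.
        return stripped + (" " if len(stripped) < len(line) else "")
    return line  # TrailingSpace.KEEP


if __name__ == "__main__":
    for policy in TrailingSpace:
        print(policy.name, repr(normalize_line("Howdy   ", policy)))
```

The caller picks the behaviour per use case while the default stays the
stripping variant discussed in this thread.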

Re: Scanning

<par-20230121155911@ram.dialup.fu-berlin.de>

https://news.novabbs.org/devel/article-flat.php?id=4018&group=comp.programming#4018

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: 21 Jan 2023 15:00:35 GMT
Organization: Stefan Ram
Lines: 15
Expires: 1 Jan 2024 11:59:58 GMT
Message-ID: <par-20230121155911@ram.dialup.fu-berlin.de>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <scanner-20230119154238@ram.dialup.fu-berlin.de> <tqbm8u$1hm7b$1@dont-email.me> <iterations-20230119174516@ram.dialup.fu-berlin.de> <scanner-20230120125512@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de Hpb4PPugg9tUdvG+4J7Z/Abd2KsW46Qy2/JGPy9Nx4pwPq
X-Copyright: (C) Copyright 2023 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Sat, 21 Jan 2023 15:00 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:
>The output follows below. Most tests pass, but there is
>still one error. (The error is: When the input is a sequence
>of blanks, it produces [par], but should produce nothing.)

In this case, the error was not in my code but in my
assumptions. In fact, TeX behaves exactly this way too.
When the input is empty, the output is empty (no tokens),
but when the input is exactly one space, this yields the
one token "\par". This is because a "\n" is added to
the last line if it was missing. The space is ignored.
This gives a "\n" at the start of the line, and a "\n"
at the start of a line yields the "\par" token.

Re: Scanning

<2YwRboOtvOaNvrTA9@bongo-ra.co>

https://news.novabbs.org/devel/article-flat.php?id=4019&group=comp.programming#4019

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: spibou@gmail.com (Spiros Bousbouras)
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Sun, 22 Jan 2023 16:44:44 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 19
Message-ID: <2YwRboOtvOaNvrTA9@bongo-ra.co>
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <scanner-20230119154238@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 22 Jan 2023 16:44:44 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="bd25c2bbe355dec6cc9c6b1dbfa1933f";
logging-data="3350359"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/VihzCUz/cam/3O1jg68VB"
Cancel-Lock: sha1:A93iONVmvNOWB4Si2PFpuTjIHzQ=
X-Server-Commands: nowebcancel
In-Reply-To: <scanner-20230119154238@ram.dialup.fu-berlin.de>
X-Organisation: Weyland-Yutani

by: Spiros Bousbouras - Sun, 22 Jan 2023 16:44 UTC

On 19 Jan 2023 14:48:29 GMT
ram@zedat.fu-berlin.de (Stefan Ram) wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) writes:
> >Let's take a very simple task: This scanner for text files
> >has nothing more to do than to return every character,
> >except to strip the spaces at the end of a line.
>
> Richard said that it matters what I need this for.
>
> I'd like to implement a tiny markup language similar
> to languages like "Markdown" or "reStructuredText".
> It should ignore spaces at the end of lines.
> I'm going to implement it in Python.

Does it need to have functionality where it produces output before it has
seen all the input ? If not , then I would not just read a whole line but a
whole file (or input) ! It seems extravagant but unless you have a realistic
scenario where you worry that the whole input won't fit into memory , it is
simplest to read the whole input into memory.
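
A minimal sketch of the whole-input approach in Python, assuming the input
fits into memory and that only the space character is to be stripped. The
function name strip_line_ends is hypothetical.

```python
def strip_line_ends(text):
    """Remove spaces at the end of every line of an in-memory string."""
    # split("\n") keeps a final empty piece when the text ends with "\n",
    # so joining the pieces again preserves the trailing newline.
    return "\n".join(line.rstrip(" ") for line in text.split("\n"))


if __name__ == "__main__":
    print(repr(strip_line_ends("Howdy   \n there! \n")))
```

The text would typically come from a single read of the whole file; after
that the scanner can use random access on the string without any look-ahead
bookkeeping.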

Re: Scanning

<5c08c3a3-514d-4c19-9d31-8ccd8f64b2bcn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=4029&group=comp.programming#4029

X-Received: by 2002:a05:620a:134e:b0:706:49fb:8049 with SMTP id c14-20020a05620a134e00b0070649fb8049mr933587qkl.36.1674812760909;
Fri, 27 Jan 2023 01:46:00 -0800 (PST)
X-Received: by 2002:a05:6808:7db:b0:367:163e:a5e with SMTP id
f27-20020a05680807db00b00367163e0a5emr1694860oij.162.1674812760648; Fri, 27
Jan 2023 01:46:00 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.programming
Date: Fri, 27 Jan 2023 01:46:00 -0800 (PST)
In-Reply-To: <tqbdu1$1hm7a$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=82.131.36.26; posting-account=ogslnwoAAACd9vU9PADzlWBA81fSuNpL
NNTP-Posting-Host: 82.131.36.26
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <tqbdu1$1hm7a$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5c08c3a3-514d-4c19-9d31-8ccd8f64b2bcn@googlegroups.com>
Subject: Re: Scanning
From: vvvvvvvvaaaaaaaaaaaaaaa@mail.ee (V V V V V V V V V V V V V V V V V V)
Injection-Date: Fri, 27 Jan 2023 09:46:00 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5464

by: V V V V V V V V V V - Fri, 27 Jan 2023 09:46 UTC

You are a devil !

On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:
> On 19/01/2023 12:10 pm, Stefan Ram wrote:
> > Some idle thoughts about scanning (lexical analysis, or
> > rather what comes before it) ...
> >
> > Let's take a very simple task: This scanner for text files
> > has nothing more to do than to return every character,
> > except to strip the spaces at the end of a line.
> >
> > It is a function "get_next_token" that on each call will
> > return the next character from a file to its client (caller),
> > except that spaces at the end of a line will skipped.
> >
> > So we read the line and strip the spaces. (One line in
> > Python.)
> >
> > But how do I know in advance if the line will fit into
> > memory?
> >
> > Perhaps because of such fears, traditional scanners¹ do not
> > read lines or, Heaven forbid, files, but only characters!
> >
> > They do not use random access with respect to the text to be
> > scanned, but sequential access, although things would be
> > easier with random access.
> >
> > So how would you do it with this style of programming (never
> > reading the whole line into memory)?
> >
> > "I read a character. If it's a space, I peek at the next
> > character, if that's a space, I start adding spaces to my
> > look-ahead buffer. If an EOL is encountered, the look-ahead
> > buffer is discarded. Otherwise, I have to start feeding my
> > client from the lookahead buffer until the lookahead buffer
> > is empty."
> >
> > If I am concerned that a line will not fit in memory, how do
> > I know that the sequence of spaces at the end of a line will
> > fit in memory (the look-ahead buffer)? The look-ahead buffer
> > could be replaced by a counter. If you are paranoid, you
> > would use a 64-bit counter and check it for overflow!
> >
> > Is it worth the effort with a look-ahead buffer and
> > sequential access? Should you just read a line, assuming
> > that a line will always fit into memory, and strip the
> > blanks the easy way, i.e., using random access? TIA for any
> > comments!
> >
> > 1
> >
> > an example of a traditional scanner:
> >
> > It only ever calls "GetCh", never "GetLine". The code could
> > be easier to write by reading a whole line and then just
> > using functions that can look at that line using random
> > access to get the next symbol (maybe using regular
> > expressions). But a traditional scanner carefully only ever
> > reads a single character and manages a state.
> >
> > PROCEDURE GetSym;
> >
> > VAR i : CARDINAL;
> >
> > BEGIN
> > WHILE ch <= ' ' DO GetCh END;
> > IF ch = '/' THEN
> > SkipLine;
> > WHILE ch <= ' ' DO GetCh END
> > END;
> > IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> > i := 0;
> > sym := literal;
> > REPEAT
> > IF i < IdLength THEN
> > id [i] := ch;
> > INC (i)
> > END;
> > IF ch > 'Z' THEN sym := ident END;
> > GetCh
> > ...
>
> man 3 realloc
>
> This was a perennial comp.lang.c topic back in the day.
>
> My interface looked (and still looks) like this:
>
> #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
> #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
> #define FGDATA_REDUCE 1
>
> int fgetline(char **line, size_t *size, size_t maxrecsize, FILE
> *fp, unsigned int flags, size_t *plen);
>
> It's easier to use than it might look:
>
> char *data = NULL; /* where will the data go? NULL is fine */
> size_t size = 0; /* how much space do we have right now? */
> size_t len = 0; /* after call, holds line length */
>
> while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
> {
> if(len > 0)
>
> If you want fgetline.c and don't have 20 years of clc archives,
> just yell.
>
> --
> Richard Heathfield
> Email: rjh at cpax dot org dot uk
> "Usenet is a strange place" - dmr 29 July 1999
> Sig line 4 vacant - apply within
