devel / comp.lang.tcl / Retrieving data from a thread

Hi.

My application used to take 7 seconds to load. A little long.

Now I am trying to implement some kind of spell checker so I just added
code to load and process a file.

set ::WordlistFile /home/tcl/longlist.txt
set ::Wordlist ""
set _fp [open $::WordlistFile r]
while {![eof $_fp]} {
    set _line [string trim [gets $_fp]]
    if {$_line == ""} {continue}
    lappend ::Wordlist $_line
}
close $_fp

Now it takes 10 seconds to load. Grumble.

I thought about using threads to load the word list and make it all
load faster.

In fact, it already loads another list. Maybe I could "threadify" both?

Anyway,

I've been reading about threads and I must say the existing documentation
is not very easy to understand. Code examples on google are pretty scarce
too. I found an interesting one on StackOverflow where Donal (DKF)
suggests using pools, but I couldn't make that work. I mean, I have to
"get" data with tpool::get? But when? How do I know when the job is done?
He also suggests tsv, but I found the relevant documentation hard to read
and understand.

Yet I've made prototypes that almost worked. Almost.

The last mile I need, I think, is retrieving data from the thread. More
specifically, ::Wordlist after it is built.

I wish I didn't have to call it explicitly. Just let the thread build
and set ::Wordlist whenever it feels ready. My goal is to let the thread
load the list while the rest of the application loads other things.
Of course, it must not take longer than a few seconds. It has to be done
when I begin to type.

Among so many attempts, I only came up with one that worked. But
the application wouldn't load any faster. I must have done something
wrong. Can you please enlighten me?

package require Thread
set thr [thread::create]
thread::send $thr "set wordlistfile $::WordlistFile"
thread::send -async $thr {
    set _fp [open $wordlistfile r]
    while {![eof $_fp]} {
        set _line [string trim [gets $_fp]]
        if {$_line == ""} {continue}
        lappend ::Wordlist $_line
    }
    close $_fp
}
puts "Now what?"

--
Luc

Re: Retrieving data from a thread

by: et99 - Tue, 12 Dec 2023 00:24 UTC

What is /home/tcl/longlist.txt?

Are these the words to look up, or your spelling dictionary? I hate to assume. Based on your variable names, I can't tell which it is.

But... I have a 400k-word dictionary I got online somewhere. It's one word per line. No spaces, nothing to trim and no blank lines to remove. When I timed your code against this simpler version:

set _fp [open $::WordlistFile r]
set data [read -nonewline $_fp]
set Wordlist [split $data \n]
close $_fp

It went from 700ms to 50ms to create Wordlist. But again, not knowing for sure what that file really is....

For example, if it is your spelling dictionary, I would preprocess it so the trim and blank line tests wouldn't be needed, and then I would store it in a Tcl array as a hash table where an [info exists dictionary($someword)] could be used to check spelling.
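A minimal sketch of that array idea, for anyone following along. The proc names (buildDictionary, isKnown) are made up for illustration, and folding keys to lower case is only one possible way to handle case, not something the post prescribes:

```tcl
# Hypothetical helpers (not from the original post): a Tcl array used as a
# hash table, with keys folded to lower case so lookups ignore case.
proc buildDictionary {words} {
    global dictionary
    array unset dictionary
    foreach w $words {
        set dictionary([string tolower $w]) 1
    }
    return [array size dictionary]
}

proc isKnown {word} {
    global dictionary
    return [info exists dictionary([string tolower $word])]
}

buildDictionary {Hello world Tcl}
puts [isKnown HELLO]   ;# the key "hello" exists
puts [isKnown wurld]   ;# no such key
```

Folding case at load time keeps each lookup a single hash probe, at the cost of losing the distinction between proper names and common words.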

As to threads, I would *highly* recommend you get a copy of Ashok's book, The Tcl Programming Language. It has a very good section on threads (and many other topics).

See here: https://www.magicsplat.com/

I purchased both the paper book and the pdf version (great for searching).

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

>[...]

My wordlist file has 1,248,300 lines. Each line is a word, yes.

I am currently using lsearch -nocase for the lookups. Do you know for
a fact that searching an array is faster than searching a list?
I'm not confident enough in my own methods to measure these things.

You see, case will matter if I use [info exists dictionary($someword)].
Handling case in that scenario will also add overhead.

(I just realized I will have to split my word list in two, common words
and proper names, because proper names must not be all lowercase.)

But the lookup is no problem. That is fast enough. I spell check every
word in a sentence (maximum 80 characters) in one fell swoop and it
never feels slow at all.

No corrections or any kind of guessing though, just checking whether
the words exist or not. The correction suggestion part is currently
in the R&D stage. Well, just R, no D yet.

The bottleneck is definitely in loading the word list.

I did a test without trim and found that the words are not found and a
false misspell is flagged for everything. I soon realized that the
newlines become part of each word so the upshot is they invalidate the
entire dataset. I really have to axe them.

But I will try your file reading code and see if it's faster.

Either way, I would like the opportunity to learn about threads.

I really can't afford any book right now. I don't want to go into
details, suffice it to say that I am in very, very bad financial condition
right now. Like, really, no joke.

--
Luc

On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:

>But... I have a 400k word dictionary I got online somewhere. It's one word
>per line. No spaces, nothing to trim and no blank lines to remove. When I
>timed your code against a simpler,
>
> set _fp [open $::WordlistFile r]
> set data [read -nonewline $_fp]
> set Wordlist [split $data \n]
> close $_fp
**************************

Well, I can say this, your code is many times as fast as mine.
Very noticeable difference.

I will be using that approach from now on.

Thank you.

--
Luc

Re: Retrieving data from a thread

by: et99 - Tue, 12 Dec 2023 06:10 UTC

OK, I guessed right: the list of words is a dictionary of sorts.

You can always just use the [time] command to measure anything you're trying out. Tcl arrays are (afaik) implemented as hash tables so they should be faster for lookup than lsearch.

Also, the array values can carry codes, in addition to marking inclusion in the list. For example:

set dictionary(Tcl) 1
set dictionary(tcl) 0

to indicate which one has a capital letter at the beginning.

As to threads. Here's my 2 cents:

I would create the thread like so:

package require Thread

set tid [thread::create {
    # init some variables here (they are global)
    proc load {args} {
        #...
        return $args/load
    }
    proc lookup {args} {
        #...
        return $args/lookup
    }
    proc maintid {tid} { ;# save main's tid so can thread::send back to main later
        set ::maintid $tid
        return $tid/maintid
    }
    # more procs
    thread::wait
}]

# and then use it like this:

set filename "listofwords.txt"
thread::send -async $tid [list load $filename] result1 ;# start the dictionary loading
# .... can do other inits here
vwait result1

set word "theword"
thread::send $tid [list lookup $word] result2

thread::send $tid [list maintid [thread::id]] result3 ;# send in main thread id

Here's the output from the above

This is my preference for structuring threads; it makes them look just like procs.

In the above, you can call the load one time, -async, and when you are done loading everything else, you would wait on that result variable.

But... There is a *gotcha*. If you vwait *AFTER* the value has already been set by the thread's return send, you will wait forever.

So.... I always do a 2-step:

unset -nocomplain result
thread::send -async .... result
#... can do other things here even update or vwait on other variables
if {![info exists result]} {vwait result} ;# when you are ready to wait for the result

With this code, you avoid that possible race condition where you allow the thread result to get set in between your thread::send and the vwait.
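The guard can be tried out without the Thread package. In this sketch an `after` callback stands in for the thread's reply; that substitution is an assumption made only to keep the example self-contained, since in both cases the variable is written from the event loop:

```tcl
# Stand-in for "thread::send -async $tid ... result": the event loop sets
# ::result later, much as the reply from a worker thread would.
unset -nocomplain ::result
after 0 {set ::result "loaded"}

# "Other things" that happen to run the event loop; the pending callback
# gets a chance to fire here, before we ever reach the wait.
update

# Guarded wait: if the variable already exists, skip the vwait, because
# a vwait on an already-written variable has nothing left to wake it.
if {![info exists ::result]} {vwait ::result}
puts "result = $::result"
```

The `unset -nocomplain` at the top matters as much as the `info exists` test: a stale value from an earlier call would otherwise make the guard skip the wait too early.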

Have fun with threads. Keep in mind that errors can pass silently. So, test your procs separately before copying them into the thread script. You can test for a dead thread with thread::exists.

A word on global variables. I love them. Some hate them. But in threads, they are not really global, since each thread runs in its own interpreter, so the globals in the thread are private to that thread. I like to call them continental variables :)

On 12/11/23 19:56, Luc wrote:
> On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
> ...
>
> My wordlist file has 1,248,300 lines. Each line is a word, yes.
>
> I am currently using lsearch -nocase for the lookups. Do you know for
> a fact that searching an array is faster than searching a list?

For a large list yes!

That all being said, you may want to step back and consider alternatives.

Suggestion: use SQLite...

1) Build "offline" (i.e. before you run your application) a SQLite DB
with a table that has one column, each row holding one of your words.
2) Have your application open the SQLite DB and do searches on the table.
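A rough sketch of those two steps, assuming the sqlite3 Tcl package is installed. The table name, column name, and the wordExists proc are invented for illustration, and an in-memory database is used here only so the example runs on its own; the real thing would be a file built ahead of time:

```tcl
package require sqlite3

# In a real setup this would be a file built once, offline; an in-memory
# database keeps the sketch self-contained. Table and column names are made up.
sqlite3 db :memory:
db eval {CREATE TABLE words(word TEXT PRIMARY KEY COLLATE NOCASE)}

# One transaction around all the inserts, rather than one commit per row.
set wordlist {apple Banana cherry}
db transaction {
    foreach w $wordlist {
        db eval {INSERT OR IGNORE INTO words(word) VALUES($w)}
    }
}

# Hypothetical lookup proc: returns 1 if the word is in the table, else 0.
proc wordExists {word} {
    return [db exists {SELECT 1 FROM words WHERE word = $word}]
}

puts [wordExists banana]
puts [wordExists durian]
```

The PRIMARY KEY gives the table an index, so a lookup does not scan every row, and COLLATE NOCASE makes the comparison case-insensitive, which is roughly what lsearch -nocase provides on a list.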

On Tue, 12 Dec 2023 07:51:30 -0600, Gerald Lester wrote:

>[...]

I thought about that. I just decided that plain txt files are easier
(or should I say more convenient) to manage. I know I will be adding
items to them as time goes by.

And like I said, I have zero problem with the lookup time. It's working
very fast, no delay whatsoever and I'm even running two queries on most
of the words.

--
Luc

Luc <luc@sep.invalid> wrote:
> I am currently using lsearch -nocase for the lookups. Do you know for
> a fact that searching an array is faster than searching a list?
> I'm not confident enough in my own methods to measure these things.

You don't need to be confident in your own methods. Tcl provides a
'time' command that does all the work for you, and reports the time
taken. So just mock up both versions, and use [time] to time the
execution of a bunch of lookups, and see which is faster.
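Such a mock-up could look like the sketch below. The word list is synthetic (word0 ... word99999) so that it runs anywhere; swap in the real list to get meaningful numbers. No timings are shown because they depend entirely on the machine:

```tcl
# Synthetic data so the sketch runs anywhere; substitute the real word list.
set words {}
for {set i 0} {$i < 100000} {incr i} {
    lappend words word$i
}
foreach w $words {
    set lookup($w) 1
}

# Look up the last element, the worst case for a sequential search.
set target word99999
puts "lsearch: [time {lsearch -exact $words $target} 100]"
puts "array:   [time {info exists lookup($target)} 100]"
```

The trailing 100 tells [time] to run each script 100 times and report the average, which smooths out one-off effects on the first run.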

Luc <luc@sep.invalid> wrote:
> On Mon, 11 Dec 2023 16:24:36 -0800, et99 wrote:
>
>>But... I have a 400k word dictionary I got online somewhere. It's one word
>>per line. No spaces, nothing to trim and no blank lines to remove. When I
>>timed your code against a simpler,
>>
>> set _fp [open $::WordlistFile r]
>> set data [read -nonewline $_fp]
>> set Wordlist [split $data \n]
>> close $_fp
> **************************
>
>
> Well, I can say this, your code is many times as fast as mine.
> Very noticeable difference.
>
> I will be using that approach from now on.

Your code was iterating a loop, reading one line at a time, and doing
trims and checks for blank lines, for each line in your file (which you
said was over 1M lines long in another post).

So you were looping, in Tcl, over one million times, running the code
within your loop (a string trim and a check for blank lines).

et99's code is doing a single 'read' to bring the entire file in all at
once (much more IO efficient than reading one line at a time -- but
trades off against higher memory use).

Then he does a single 'split' on newlines to create the list.

So he is running two Tcl statements, with most of the 'work' being done
inside the C code which implements those two commands, vs. running over
one million iterations of Tcl code to process one word at a time.

That is why et99's version is faster. But, et99's version presumes
that the words in the file are already "trimmed" and no blank lines
between any words. Both of which are reasonable to do as a "one-time"
preprocess of the words, so as to not have to repeat it each time the
file is loaded.
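That one-time preprocess could be a small script along these lines. The proc name cleanWordFile and the idea of writing a sorted, de-duplicated file are assumptions for illustration, not something from the posts above:

```tcl
# Hypothetical one-time cleanup: trim every line, drop blank lines and
# duplicates, and write the result sorted, one word per line.
proc cleanWordFile {inFile outFile} {
    set fp [open $inFile r]
    set raw [read $fp]
    close $fp

    set words {}
    foreach line [split $raw \n] {
        set w [string trim $line]   ;# also strips stray carriage returns
        if {$w ne ""} {
            lappend words $w
        }
    }
    set words [lsort -unique $words]

    set fp [open $outFile w]
    puts $fp [join $words \n]
    close $fp
    return [llength $words]
}
```

It would be run once after each edit of the raw file, e.g. `cleanWordFile raw.txt longlist.txt` (file names hypothetical). The application can then load the cleaned file with the plain read and split, with no per-line work at startup.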

Luc <luc@sep.invalid> wrote:
> The last mile I need, I think, is retrieving data from the thread.
> More specifically, ::Wordlist after it is built.

There are several ways to retrieve data from a thread.

1) use the 'tsv' module to set up a shared variable, have the sub-thread
store results in the shared variable, and have the main thread retrieve
the results from the shared variable (but will sometimes need a way to
synchronize the two).

2) don't use the -async option to thread::send, and just have the
result be 'returned'. But this makes the main thread wait, which
removes most of the value of running the sub-thread to gain
parallelism.

3) use the optional ?varname? with thread::send and -async to have the
result returned in varname, and vwait later to pickup the result.

4) send the data to process, and the id of the thread doing the send to
the sub-thread, and write the sub-thread to perform a thread::send to
the supplied id to set a variable with the result. I.e. something
like:

sending thread:
    thread::send -async $tid2 [list do-process $data [thread::id]]

receiving thread:
    proc do-process {data tid} {
        set result [do-something-with $data]
        thread::send -async $tid [list set ::thread_result $result]
    }

And there are probably more ways to "get back" the result than what
I've detailed above.
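Applied to the word list, option 4 might look roughly like this. It is a sketch, assuming the Thread package is available; the proc name loadWords and the throwaway three-word file are invented so the example runs on its own:

```tcl
package require Thread

# Small throwaway word file so the sketch is self-contained.
set ch [file tempfile ::WordlistFile]
puts $ch "alpha\nbeta\ngamma"
close $ch

# Worker thread: defines a hypothetical loadWords proc, then services events.
set worker [thread::create {
    proc loadWords {file mainTid} {
        set fp [open $file r]
        set words [split [read -nonewline $fp] \n]
        close $fp
        # Hand the finished list to the main thread by setting ::Wordlist there.
        thread::send -async $mainTid [list set ::Wordlist $words]
    }
    thread::wait
}]

unset -nocomplain ::Wordlist
thread::send -async $worker [list loadWords $::WordlistFile [thread::id]]

# ... the rest of the application start-up would run here ...

# Block only if the list has not arrived yet.
if {![info exists ::Wordlist]} {vwait ::Wordlist}
puts "loaded [llength $::Wordlist] words"

thread::release $worker
file delete $::WordlistFile
```

The main thread never asks for the list; the worker pushes it when ready. The assignment still only takes effect when the main thread's event loop runs, which a Tk application does on its own once start-up finishes.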

Do note that using threads will also bring you into the world of
parallel processing with all the complexity that implies, including
the need for synchronization at times. The
thread::(mutex|rwmutex|cond) commands provide the sync. primitives, but
you do have to use them as appropriate.

Luc <luc@sep.invalid> wrote:
> [...]
>
> I thought about that. I just decided that plain txt files are easier
> (or should I say more convenient) to manage. I know I will be adding
> items to them as time goes by.
>
> And like I said, I have zero problem with the lookup time. It's working
> very fast, no delay whatsoever and I'm even running two queries on most
> of the words.

You can keep your 'plain text' file, just set up a process to
'regenerate' the sqlite database whenever you update the plain text
file.

The advantage you get with sqlite is that all the preprocessing is done
ahead of time, and you only incur the "lookup time" when you do a
lookup.

A second advantage is your word list could be much larger than what you
can hold in memory when it is in a sqlite DB (although this advantage
has shrunk given the huge amount of RAM in modern systems).

Re: Retrieving data from a thread

by: et99 - Tue, 12 Dec 2023 20:39 UTC

On 12/12/2023 8:19 AM, Luc wrote:
> [...]
>
> I thought about that. I just decided that plain txt files are easier
> (or should I say more convenient) to manage. I know I will be adding
> items to them as time goes by.
>
> And like I said, I have zero problem with the lookup time. It's working
> very fast, no delay whatsoever and I'm even running two queries on most
> of the words.
>

One reason you may want a very fast lookup is you may want to eventually also make suggestions.

One way is to take a misspelled word and make changes, like swapping each pair of adjacent letters, or adding or removing a letter at every possible position in the word, and then look up the newly formed words. That's where having a very fast lookup can help.
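A small sketch of that approach, assuming the dictionary lives in a Tcl array keyed by word. The proc names (candidates, suggest) and the restriction to the letters a-z are choices made for this illustration, not a description of the original spell checker:

```tcl
# Hypothetical helper: every string one edit away from $word, where an edit
# is an adjacent-letter swap, a deletion, or an insertion of a letter a-z.
proc candidates {word} {
    set out {}
    set n [string length $word]
    for {set i 0} {$i < $n - 1} {incr i} {
        set j [expr {$i + 1}]
        set swapped [string index $word $j][string index $word $i]
        lappend out [string replace $word $i $j $swapped]
    }
    for {set i 0} {$i < $n} {incr i} {
        lappend out [string replace $word $i $i]
    }
    foreach c [split abcdefghijklmnopqrstuvwxyz ""] {
        for {set i 0} {$i <= $n} {incr i} {
            set head [string range $word 0 [expr {$i - 1}]]
            set tail [string range $word $i end]
            lappend out $head$c$tail
        }
    }
    return [lsort -unique $out]
}

# Hypothetical helper: keep only the candidates found in the named array.
proc suggest {word arrayName} {
    upvar 1 $arrayName known
    set hits {}
    foreach c [candidates $word] {
        if {[info exists known($c)]} {
            lappend hits $c
        }
    }
    return $hits
}

array set dictionary {hello 1 help 1 held 1}
puts [suggest hlelo dictionary]
puts [suggest helo dictionary]
```

A word of length n produces 28n + 25 candidates before duplicates are removed, so even a short word triggers well over a hundred lookups.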

Anyway, that's how I once did a spell checker. It worked decently well.

Incidentally, a tsv shared-variable array can be loaded like so:

proc load {WordlistFile} {
    set _fp [open $WordlistFile r]
    set data [read -nonewline $_fp]
    set Wordlist2 [split $data \n]
    foreach word $Wordlist2 {
        tsv::set dictionary $word 1
    }
    close $_fp
}

This takes only about 300ms to load my 400k dictionary. (A regular array was 200ms).

Then lookups are on the order of a few microseconds:

proc lookup {word} {
    return [tsv::exists dictionary $word]
}

% lookup hello
1
% lookup hellox
0
% time {lookup hello}
8 microseconds per iteration
% time {lookup hellox}
7 microseconds per iteration

I would think with lsearch, not finding a word might take the longest, if it's doing a sequential search.

Re: Retrieving data from a thread

by: Rich - Tue, 12 Dec 2023 20:49 UTC

et99 <et99@rocketship1.me> wrote:
> I would think with lsearch, not finding a word might take the
> longest, if it's doing a sequential search.

For the default, yes, because the default is a sequential search.

But, if your list elements are sorted, you can use the "-sorted"
option, which speeds it up. The man page simply says "will use a more
efficient searching algorithm to search list", I suspect "-sorted"
turns on a binary search of the list elements.

Re: Retrieving data from a thread

by: et99 - Tue, 12 Dec 2023 21:14 UTC

Ahhh, good point. A binary search might only be doing around 20 compares on a 1mil word list.

% lsearch $Wordlist2 hello
154092
% lsearch -sorted $Wordlist2 hello
154092
% lsearch $Wordlist2 hellox
-1

% time {lsearch $Wordlist2 hello}
1127 microseconds per iteration
% time {lsearch -sorted $Wordlist2 hello}
12 microseconds per iteration
% time {lsearch -sorted $Wordlist2 hellox}
12 microseconds per iteration

and when run 50 times,

% time {lsearch -sorted $Wordlist2 hello} 50
0.78 microseconds per iteration
% time {lsearch -sorted $Wordlist2 hellox} 50
0.66 microseconds per iteration

Hah! so, no need for an array and a hash lookup.
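One caveat when combining this with the case-insensitive lookups mentioned earlier in the thread: the list has to be sorted with the same comparison the search uses. A short sketch, with a made-up helper name (inList) and a toy list:

```tcl
# The list must be sorted with the same comparison the search will use.
set words [lsort -nocase -unique {pear Apple banana cherry}]

# Hypothetical helper: 1 if word is present (ignoring case), else 0.
proc inList {list word} {
    return [expr {[lsearch -sorted -nocase $list $word] >= 0}]
}

puts [inList $words APPLE]
puts [inList $words grape]
```

If the list is sorted one way and searched another (for example a plain lsort followed by lsearch -sorted -nocase), the search can miss words that are actually present.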

On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:

>You can keep your 'plain text' file, just setup a process to
>'regenerate' the sqlite database whenever you update the plain text
>file.
>
>The advantage you get with sqlite is that all the preprocessing is done
>ahead of time, and you only incur the "lookup time" when you do a
>lookup.
>
>A second advantage is your word list could be much larger than what you
>can hold in memory when it is in a sqlite DB (although this advantage
>has shrunk given the huge amount of RAM in modern systems).
>
**************************

Great tips, I always learn a lot here. Thank you.

However, I don't understand what you mean by "all the preprocessing is
done ahead of time."

What preprocessing? The file is "slurped" once at launch then the word
list is permanently available in a list. Why would access to that list
(in memory, I assume) be slower than access to a database (on disk, for
sure)?

--
Luc

On Tue, 12 Dec 2023 12:39:28 -0800, et99 wrote:

>One reason you may want a very fast lookup is you may want to eventually
>also make suggestions.

I still have to investigate what algorithms there are out there. I tried
a couple of my own and they failed quite beautifully.

Anyway, I spent some time testing my newly acquired word list by looking
up all kinds of far fetched words to see how thorough it was. Out of
convenience, I did it like this:

proc p.findword {word} {
puts -nonewline [lsearch -nocase $::BIGLIST $word]
}
foreach w {word this that something some other whatever dunno} {
puts "[p.findword $w] $w"
}

I went as far as 10 words at a time, and the output is instantaneous
for all the 10 lookups. With my old algorithm to read the list file
(the one million iterations), there was a very noticeable pause before
the output, but now that is instantaneous too.

So no worries about speed yet.

--
Luc

Re: Retrieving data from a thread

<ulasgl$3rq5s$1@dont-email.me>

by: Rich - Wed, 13 Dec 2023 00:05 UTC

Luc <luc@sep.invalid> wrote:
> On Tue, 12 Dec 2023 16:41:50 -0000 (UTC), Rich wrote:
>
>>You can keep your 'plain text' file, just setup a process to
>>'regenerate' the sqlite database whenever you update the plain text
>>file.
>>
>>The advantage you get with sqlite is that all the preprocessing is done
>>ahead of time, and you only incur the "lookup time" when you do a
>>lookup.
>>
>>A second advantage is your word list could be much larger than what you
>>can hold in memory when it is in a sqlite DB (although this advantage
>>has shrunk given the huge amount of RAM in modern systems).
>>
> **************************
>
> Great tips, I always learn a lot here. Thank you.
>
> However, I don't understand what you mean by "all the preprocessing is
> done ahead of time."
>
> What preprocessing?

Trimming spaces from the words only needs to be done once. Splitting
lines into words only needs to be done once.

> The file is "slurped" once at launch then the word list is
> permanently available in a list.

Part of your slurp involves a split on \n to convert the lines into
words. While that step, on modern CPUs, is fast enough that you have the
luxury of ignoring it, it is "preprocessing" that you are performing
every time you slurp the file in from disk.

> Why would access to that list (in memory, I assume) be slower than
> access to a database (on disk, for sure)?

For data that fits in RAM as a list, provided you keep the list sorted
and use lsearch's -sorted option, lookups in the list will likely beat
sqlite. But if the word list grows too large for memory (unlikely for
your specific use case, but very common for other kinds of data), or
you don't keep it sorted and have to fall back on lsearch's linear
search, then sqlite (provided you tell sqlite to index the lookup
column) will beat the list method in most cases.
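
A minimal sketch of the indexed-lookup idea, assuming the sqlite3 Tcl package is installed. The table name `words`, the helper `wordInDb`, and the three sample words are made up for illustration, and the in-memory database stands in for a real file on disk.

```tcl
package require sqlite3

# In-memory database for the sketch; a real setup would pass a file path instead.
sqlite3 db :memory:

# PRIMARY KEY makes sqlite build an index on the lookup column.
db eval {CREATE TABLE words(word TEXT PRIMARY KEY)}

# Load the (illustrative) word list in a single transaction.
db transaction {
    foreach w {apple banana cherry} {
        db eval {INSERT OR IGNORE INTO words(word) VALUES($w)}
    }
}

# Hypothetical helper: 1 if the word is in the table, else 0.
proc wordInDb {w} {
    return [db exists {SELECT 1 FROM words WHERE word = $w}]
}

puts "[wordInDb banana] [wordInDb bananax]"
```

With a file-backed database the load step runs only when the plain text file changes; each launch then pays only for the lookups it performs.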

Re: Retrieving data from a thread

<20231212213644.66ceb2c4@lud1.home>

by: Luc - Wed, 13 Dec 2023 00:36 UTC

On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:

>For things that fit in ram, and a list, and provided you have the list
>sorted, and use the -sorted option to list, then lookups in the list
>likely will beat sqlite. But, if the wordlist grows too large for
>memory (this is unlikely for your specific use case, but for other
>kinds of "data" is very common) or you don't keep it sorted so you have
>to use lsearch's linear search then sqlite (provided you tell sqlite to
>index the lookup column) will beat the list method in most cases.
>
**************************

Do you by any chance know what happens if I use lsearch -sorted on a
list that

A. is not perfectly or completely sorted (new items have been added to
the end)

B. I run the garden variety GNU 'sort' command on the word list file
so it may not comply exactly with whatever lsearch thinks should be
considered "sorted" (ascii, alnum, etc.)?

--
Luc

Re: Retrieving data from a thread

<20231212215114.4c6d08f3@lud1.home>

by: Luc - Wed, 13 Dec 2023 00:51 UTC

On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:

>Do you by any chance know what happens if I use lsearch -sorted on a
>list that
>
>A. is not perfectly or completely sorted (new items have been added to
>the end)
>
>B. I run the garden variety GNU 'sort' command on the word list file
>so it may not comply exactly with whatever lsearch thinks should be
>considered "sorted" (ascii, alnum, etc.)?
>
**************************

Whoa. I don't know what is going on, but something is going on and
it's bad.

proc p.findword {word} {
puts -nonewline [lsearch -nocase $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

9 out of 10 words found.

Now, using -sorted:

proc p.findword {word} {
puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
}
foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
puts "[p.findword $w] $w"
}

Visibly faster, but only 3 out of 10 words found.

Not good.

--
Luc

Re: Retrieving data from a thread

<ulb4s7$3sqqt$1@dont-email.me>

by: et99 - Wed, 13 Dec 2023 02:28 UTC

On 12/12/2023 4:51 PM, Luc wrote:
> On Tue, 12 Dec 2023 21:36:44 -0300, Luc wrote:
>
>> Do you by any chance know what happens if I use lsearch -sorted on a
>> list that
>>
>> A. is not perfectly or completely sorted (new items have been added to
>> the end)
>>
>> B. I run the garden variety GNU 'sort' command on the word list file
>> so it may not comply exactly with whatever lsearch thinks should be
>> considered "sorted" (ascii, alnum, etc.)?
>>
> **************************
>
> Whoa. I don't know what is going on, but something is going on and
> it's bad.
>
> proc p.findword {word} {
> puts -nonewline [lsearch -nocase $::BIGLIST $word]
> }
> foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
> puts "[p.findword $w] $w"
> }
>
>
> 9 out of 10 words found.
>
> Now, using -sorted:
>
> proc p.findword {word} {
> puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
> }
> foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
> puts "[p.findword $w] $w"
> }
>
>
> Visibly faster, but only 3 out of 10 words found.
>
> Not good.
>

What I meant by pre-processing was to take your list as cleaned up, sorted, etc. and write it out, once. Thereafter, you could use the read/split to restore it to memory quickly.
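
A minimal sketch of that write-once, read-many idea. The proc names and file paths are hypothetical, and the cleanup rules (trim, drop blank lines, remove duplicates) are assumptions based on the loader shown earlier in the thread.

```tcl
# One-time step: clean and sort a raw word file, then write it back out.
# Both paths are hypothetical placeholders supplied by the caller.
proc preprocessWordlist {rawPath cleanPath} {
    set fp [open $rawPath r]
    set raw [read $fp]
    close $fp

    set words {}
    foreach line [split $raw \n] {
        set line [string trim $line]
        if {$line ne ""} {lappend words $line}
    }
    # lsort's default (-ascii) order is what a plain "lsearch -sorted" expects.
    set words [lsort -unique $words]

    set fp [open $cleanPath w]
    puts -nonewline $fp [join $words \n]
    close $fp
    return [llength $words]
}

# Every later launch: read and split, with no per-line loop.
proc loadWordlist {cleanPath} {
    set fp [open $cleanPath r]
    set data [read $fp]
    close $fp
    return [split $data \n]
}
```

Because the cleaned file is written without a trailing newline, the split on reload does not produce an empty last element.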

If, however, you are going to be adding words during a run, you could just keep 2 lists. The second list would likely be very short if added by the user during a session. Merging new words in might be a pain, and re-sorting the entire list likewise.

On the other hand, this is a plus for using the array, since order isn't important there, as it's just hashing them.

But are you also going to let the user do a "save dictionary" after adding in new words? Programs never do stay simple :)
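
A minimal sketch of the array approach, where words added during a session need no merging or re-sorting. The array name `known`, the helper procs, and the sample words are hypothetical.

```tcl
# Illustrative word list; a real one would come from the file.
set wordlist {apple banana cherry}

# Build the lookup table once: the word is the key, the value is unused.
foreach w $wordlist {
    set ::known($w) 1
}

# Hypothetical helpers for lookups and for words added during a session.
proc hasWord {w} {
    return [info exists ::known($w)]
}
proc addWord {w} {
    set ::known($w) 1
}

addWord durian
puts "[hasWord durian] [hasWord Durian]"
```

Array keys are case-sensitive, so a case-insensitive check would need the keys and the query normalized first, for example with string tolower.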

Re: Retrieving data from a thread

<ulb6l3$3svjp$1@dont-email.me>

by: Rich - Wed, 13 Dec 2023 02:58 UTC

Luc <luc@sep.invalid> wrote:
> On Wed, 13 Dec 2023 00:05:41 -0000 (UTC), Rich wrote:
>
>>For things that fit in ram, and a list, and provided you have the list
>>sorted, and use the -sorted option to list, then lookups in the list
>>likely will beat sqlite. But, if the wordlist grows too large for
>>memory (this is unlikely for your specific use case, but for other
>>kinds of "data" is very common) or you don't keep it sorted so you have
>>to use lsearch's linear search then sqlite (provided you tell sqlite to
>>index the lookup column) will beat the list method in most cases.
>>
> **************************
>
> Do you by any chance know what happens if I use lsearch -sorted on a
> list that
>
> A. is not perfectly or completely sorted (new items have been added to
> the end)

Most likely the search will complete, but it may or may not find the
requested element.

Try creating a random ordered list from your long word list, then do a
lsearch -sorted on it and see what happens. ::struct::list shuffle
from Tcllib can randomly order your word list for you.

> B. I run the garden variety GNU 'sort' command on the word list file
> so it may not comply exactly with whatever lsearch thinks should be
> considered "sorted" (ascii, alnum, etc.)?

The manpage implies that -sorted expects the list to be sorted via the
-ascii comparison operator of lsort. That is Unicode code point order,
which GNU sort produces only in the C locale (LC_ALL=C); in other
locales sort follows the locale's collation rules and may differ.

But there are only two possibilities:

1) gnu sort sorts in the order that lsearch assumes for -sorted - in
which case you get back correct answers

2) gnu sort sorts slightly differently -- in which case you may get
back any of:

no answers at all (i.e. always a 'not found')

the wrong answers (although it is much more likely to simply return
"not found")

sometimes the right answer, sometimes a false "not found"

This third one would occur when the list is almost in the right order,
but only a few elements differ. If what you search for ends up being
in a segment that is ordered as expected, you get the right answer (or
a true "not found" answer). If what you search for ends up being in a
segment that is not ordered as expected, you most likely get back a
false "not found".
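
A minimal sketch of case A (a sorted list with a new word appended at the end). The five words are made up for illustration.

```tcl
# A sorted list with one new word appended out of order (illustrative data).
set words {apple banana cherry date}
lappend words aardvark

set linear [lsearch -exact  $words aardvark]   ;# scans every element
set binary [lsearch -sorted $words aardvark]   ;# assumes sorted input

# Re-sorting restores the precondition that -sorted relies on.
set words [lsort $words]
set fixed [lsearch -sorted $words aardvark]

puts "linear=$linear binary=$binary fixed=$fixed"
```

The binary search never looks at the appended tail here, because every probe tells it that the word must lie further left, so it should report a false "not found" until the list is sorted again.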

Re: Retrieving data from a thread

<ulb6oc$3svjp$2@dont-email.me>

by: Rich - Wed, 13 Dec 2023 03:00 UTC

Luc <luc@sep.invalid> wrote:
> Whoa. I don't know what is going on, but something is going on and
> it's bad.
>
> proc p.findword {word} {
> puts -nonewline [lsearch -nocase $::BIGLIST $word]
> }
> foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
> puts "[p.findword $w] $w"
> }
>
>
> 9 out of 10 words found.
>
> Now, using -sorted:
>
> proc p.findword {word} {
> puts -nonewline [lsearch -nocase -sorted $::BIGLIST $word]
> }
> foreach w {word1 word2 word3 word4 word5 word6 word7 word8 word9 word10} {
> puts "[p.findword $w] $w"
> }

When you use "-sorted", ::BIGLIST is, in fact, sorted, right?

> Visibly faster, but only 3 out of 10 words found.
>
> Not good.

Given the reduction in hits, this implies you do not have ::BIGLIST
sorted.

Re: Retrieving data from a thread

<20231213085435.67ed5c24@lud1.home>

by: Luc - Wed, 13 Dec 2023 11:54 UTC

On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:

>When you use "-sorted", ::BIGLIST is, in fact, sorted, right?
>
>> Visibly faster, but only 3 out of 10 words found.
>>
>> Not good.
>
>Given the reduction in hits, this implies you do not have ::BIGLIST
>sorted.
**************************

::BIGLIST is slurped straight from the file which was a merge of multiple
word lists and dictionaries I found here and there, then sorted with
sort -u to remove the duplicates.

So it is sorted, but I guess it's not sorted in the way that lsearch
expects.
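
One possible explanation, offered as an assumption rather than a diagnosis of the actual file: combining -nocase with -sorted makes lsearch compare case-insensitively, so the list has to be sorted case-insensitively as well, and GNU sort orders by the current locale unless LC_ALL=C is set. A minimal sketch with made-up words:

```tcl
# Sorted in plain code point order: capitals come before lowercase (illustrative data).
set words [lsort {alice Bob mary Zed}]

set linear [lsearch -exact  -nocase $words alice]
set binary [lsearch -sorted -nocase $words alice]

# Sort with the same comparison that the search will use.
set words [lsort -nocase $words]
set fixed [lsearch -sorted -nocase $words alice]

puts "linear=$linear binary=$binary fixed=$fixed"
```

Running lsort -nocase on ::BIGLIST once after loading it would be one way to test whether this is what reduced the hits.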

--
Luc

Re: Retrieving data from a thread

<20231213091038.100989ea@lud1.home>

by: Luc - Wed, 13 Dec 2023 12:10 UTC

On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:

>What I meant by pre-processing was to take your list as cleaned up,
>sorted, etc. and write it out, once. Thereafter, you could use the
>read/split to restore it to memory quickly.
>
>If, however, you are going to be adding words during a run, you could just
>keep 2 lists. The second list would likely be very short if added by the
>user during a session. Merging new words in might be a pain, and
>re-sorting the entire list likewise.
>
>On the other hand, this is a plus for using the array, since order isn't
>important there, as it's just hashing them.
>
>But are you also going to let the user do a "save dictionary" after adding
>in new words? Programs never do stay simple :)
>
**************************

Well, yes. It's done and in production already.

You see, names are simple. They have to begin with a capital letter.

But "begin" means it can be either Mary or MARY. For that I need some
kind of -nocase parameter or one normalization step plus a second lookup.
That may or may not defeat the superior speed of array lookups or more
likely just make the difference less meaningful.

Common words are less simple. In the beginning of a sentence, they must
begin with a capital letter. In the middle of a sentence, they must
begin with a small letter. But in either case it may be all upper case
too.

The shortest route I could think of was two lists: things and names.
1. Search in the first list with no case and that's it.
2. Not found? Search in the second list as is and that's it.
3. Still not found? Capitalize it and look for it again in the list
of names.
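
The three steps above can be sketched as follows. The list contents and the proc name `isKnownWord` are hypothetical; the real application loads its lists from files.

```tcl
# Illustrative data; the real lists are loaded from files.
set ::Things {house tree walk}
set ::Names  {Mary Paris}

# Hypothetical helper following the three steps above.
proc isKnownWord {word} {
    # 1. Common words: case-insensitive match.
    if {[lsearch -exact -nocase $::Things $word] >= 0} {return 1}
    # 2. Names: match exactly as typed.
    if {[lsearch -exact $::Names $word] >= 0} {return 1}
    # 3. Names again, after capitalizing (MARY becomes Mary).
    set cap [string totitle $word]
    return [expr {[lsearch -exact $::Names $cap] >= 0}]
}

foreach w {House MARY Zorg} {
    puts "$w [isKnownWord $w]"
}
```

As written, step 3 also accepts an all-lowercase "mary"; limiting it to all-caps input with string is upper would be stricter.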

In case you're wondering, the problem of capitalizing words (or not)
according to punctuation is taken care of by a completely different proc
that does auto correct according to another list. I actually use the
concept of auto correct to auto expand abbreviations and type faster.
That proc takes care of capitalization according to punctuation.
In a public application that would not be good enough, but since this
is for private use and is working as intended, I won't bother fixing
what ain't broken.

But another problem comes up.

In my current design, boxes with any problem cannot be approved and I am
not allowed to jump to the next one until the problem is properly fixed.
A "problem" currently means too many characters or an empty box. Empty
boxes may be desirable in certain circumstances so there is a "force"
command (and key shortcut) in case I want to override it. Misspellings
will just be a third kind of problem.

Workflow speed is always a priority with this thing so I implemented the
possibility of a double override action. The first override key press
will add all unknown words to the word list and the second override will
"approve and move forward."

But then I can't distinguish things from names. I can, but I guess I
would have to introduce a pop-up to decide which one every time. That
would slow things down. I thought that maybe it would be better to just
use one global word list and take care of casing with my own human
proofreading.

Then again, unknown words are highly likely to be proper names so I
decided to detect their case and send them straight to the names list
if they are written with a capital letter whether it's a name or not.
If they are not a name and happen to show up again in small letters,
then I will add them again, in which case they will go to the word list.

Now words or names are always added twice: to the list in memory and
appended to the file on disk.
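
A minimal sketch of that routing, with each word added both in memory and on disk. The global names, file names, and proc names are hypothetical placeholders, not the ones used in the real application.

```tcl
# Hypothetical globals: the two in-memory lists and their backing files.
set ::Things {}
set ::Names  {}
set ::ThingsFile things.txt
set ::NamesFile  names.txt

# Add a word to a global list in memory and append it to that list's file.
proc addTo {listVar path word} {
    upvar #0 $listVar list
    lappend list $word
    set fp [open $path a]
    puts $fp $word
    close $fp
}

# Route by the case of the first letter, as described above.
proc addUnknownWord {word} {
    if {[string is upper -strict [string index $word 0]]} {
        addTo ::Names $::NamesFile $word
    } else {
        addTo ::Things $::ThingsFile $word
    }
}
```

Words appended this way leave both the list and the file unsorted at the tail, which matters only if lookups rely on lsearch -sorted.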

--
Luc

Re: Retrieving data from a thread

<20231213091543.291e75d8@lud1.home>

by: Luc - Wed, 13 Dec 2023 12:15 UTC
