Rocksolid Light

tech / sci.electronics.design / Re: Predictive failures

Subject (Author)
* Predictive failures (Don Y)
+* Re: Predictive failures (Martin Rid)
|`* Re: Re: Predictive failures (Don Y)
| `* Re: Re: Predictive failures (Edward Rawde)
|  `* Re: Re: Predictive failures (Don Y)
|   +* Re: Re: Predictive failures (Edward Rawde)
|   |+* Re: Predictive failures (Joe Gwinn)
|   ||`- Re: Predictive failures (Edward Rawde)
|   |`- Re: Predictive failures (legg)
|   `* Re: Re: Predictive failures (Edward Rawde)
|    `* Re: Re: Predictive failures (Don Y)
|     +* Re: Re: Predictive failures (Edward Rawde)
|     |`* Re: Re: Predictive failures (Don Y)
|     | +* Re: Re: Predictive failures (Edward Rawde)
|     | |`* Re: Re: Predictive failures (Don Y)
|     | | `* Re: Re: Predictive failures (Edward Rawde)
|     | |  `* Re: Re: Predictive failures (Don Y)
|     | |   `* Re: Re: Predictive failures (Edward Rawde)
|     | |    `* Re: Re: Predictive failures (Don Y)
|     | |     `* Re: Re: Predictive failures (Edward Rawde)
|     | |      `* Re: Re: Predictive failures (Don Y)
|     | |       `* Re: Re: Predictive failures (Edward Rawde)
|     | |        `* Re: Re: Predictive failures (Don Y)
|     | |         `* Re: Re: Predictive failures (Edward Rawde)
|     | |          `* Re: Re: Predictive failures (Don Y)
|     | |           `- Re: Re: Predictive failures (Edward Rawde)
|     | `* Re: Predictive failures (Jasen Betts)
|     |  `- Re: Predictive failures (Don Y)
|     `* Re: Predictive failures (Liz Tuddenham)
|      `- Re: Predictive failures (Don Y)
+- Re: Predictive failures (john larkin)
+* Re: Predictive failures (Joe Gwinn)
|`* Re: Predictive failures (john larkin)
| `* Re: Predictive failures (Joe Gwinn)
|  +* Re: Predictive failures (john larkin)
|  |`* Re: Predictive failures (Joe Gwinn)
|  | `* Re: Predictive failures (John Larkin)
|  |  +* Re: Predictive failures (Joe Gwinn)
|  |  |`* Re: Predictive failures (John Larkin)
|  |  | +* Re: Predictive failures (Edward Rawde)
|  |  | |`* Re: Predictive failures (John Larkin)
|  |  | | `- Re: Predictive failures (Edward Rawde)
|  |  | `- Re: Predictive failures (Joe Gwinn)
|  |  `- Re: Predictive failures (Glen Walpert)
|  `* Re: Predictive failures (Phil Hobbs)
|   +- Re: Predictive failures (John Larkin)
|   `- Re: Predictive failures (Joe Gwinn)
+* Re: Predictive failures (Edward Rawde)
|`* Re: Predictive failures (Don Y)
| +* Re: Predictive failures (Edward Rawde)
| |+* Re: Predictive failures (Don Y)
| ||`- Re: Predictive failures (Edward Rawde)
| |`- Re: Predictive failures (Martin Brown)
| `* Re: Predictive failures (Chris Jones)
|  `* Re: Predictive failures (Don Y)
|   `- Re: Predictive failures (Don Y)
+* Re: Predictive failures (Martin Brown)
|+- Re: Predictive failures (Don Y)
|`* Re: Predictive failures (John Larkin)
| `* Re: Predictive failures (Bill Sloman)
|  `* Re: Predictive failures (Edward Rawde)
|   `* Re: Predictive failures (John Larkin)
|    `* Re: Predictive failures (Edward Rawde)
|     `* Re: Predictive failures (John Larkin)
|      `* Re: Predictive failures (John Larkin)
|       `- Re: Predictive failures (Edward Rawde)
+* Re: Predictive failures (Don)
|+* Re: Predictive failures (Edward Rawde)
||+- Re: Predictive failures (Don)
||`- Re: Predictive failures (Don Y)
|+* Re: Predictive failures (john larkin)
||`* Re: Predictive failures (Don)
|| `* Re: Predictive failures (John Larkin)
||  `- Re: Predictive failures (Don)
|`- Re: Predictive failures (Don Y)
`* Re: Predictive failures (Buzz McCool)
 `* Re: Predictive failures (Don Y)
  +* Re: Predictive failures (Glen Walpert)
  |`- Re: Predictive failures (Don Y)
  `* Re: Predictive failures (boB)
   `* Re: Predictive failures (Don Y)
    `* Re: Predictive failures (boB)
     `- Re: Predictive failures (Don Y)

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136481&group=sci.electronics.design#136481

From: buzz_mccool@yahoo.com (Buzz McCool)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Thu, 18 Apr 2024 10:18:08 -0700
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <uvrkki$2c9fl$1@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me>
 by: Buzz McCool - Thu, 18 Apr 2024 17:18 UTC

On 4/15/2024 10:13 AM, Don Y wrote:
> Is there a general rule of thumb for signalling the likelihood of
> an "imminent" (for some value of "imminent") hardware failure?

This reminded me of some past efforts in this area. It was never
demonstrated to me (given ample opportunity) that this technology
actually worked on intermittently failing hardware I had, so be cautious
in applying it in any future endeavors.

https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136484&group=sci.electronics.design#136484

From: blockedofcourse@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Thu, 18 Apr 2024 15:05:07 -0700
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <uvs5eu$2g9e9$2@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me>
 by: Don Y - Thu, 18 Apr 2024 22:05 UTC

On 4/18/2024 10:18 AM, Buzz McCool wrote:
> On 4/15/2024 10:13 AM, Don Y wrote:
>> Is there a general rule of thumb for signalling the likelihood of
>> an "imminent" (for some value of "imminent") hardware failure?
>
> This reminded me of some past efforts in this area. It was never demonstrated
> to me (given ample opportunity) that this technology actually worked on
> intermittently failing hardware I had, so be cautious in applying it in any
> future endeavors.

Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
whack-a-mole.

> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

Thanks for that. I didn't find it in my collection so its addition will
be welcome.

Sun has historically been aggressive in trying to increase availability,
especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).

I am now seeing similar features in Dell servers. But, the *actual*
implementation details are always shrouded in mystery.

But, it is obvious (for "always on" systems) that there are many things
that can silently fail that will only manifest some time later -- if at
all and possibly complicated by other failures that may have been
precipitated by it.

Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some
baked in "hard limit".
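That trend-watching step is easy to mock up: fit a least-squares line to
recent samples of whatever metric you monitor and extrapolate to the hard
limit. A minimal sketch (the function name, the sample data, and the 85 C
limit are all invented for illustration; real monitors would filter noise
and use robust fits):

```python
# Minimal trend watcher: least-squares slope over (time, value) samples,
# extrapolated to a hard limit.

def time_to_limit(samples, limit):
    """samples: list of (time, value). Returns the estimated time at which
    the fitted line crosses `limit`, or None if the trend is flat or
    heading away from it."""
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    v_mean = sum(v for _, v in samples) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        return None                      # all samples at the same instant
    slope = num / den
    if slope <= 0:
        return None                      # not trending toward the limit
    intercept = v_mean - slope * t_mean
    return (limit - intercept) / slope   # time at which value == limit

# e.g. a temperature creeping upward: warn well before it reaches 85 C
readings = [(0, 70.0), (1, 71.0), (2, 72.1), (3, 73.0)]
estimate = time_to_limit(readings, 85.0)
```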

E.g., only the memory that you actively REFERENCE in a product is ever
checked for errors! Bit rot may not be detected until some time after it
has occurred -- when you eventually access that memory (and the memory
controller throws an error).

This is paradoxically amusing; code to HANDLE errors is likely the least
accessed code in a product. So, bit rot IN that code is more likely
to go unnoticed -- until it is referenced (by some error condition)
and the error event complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the more uncertain you
will be of the handlers' abilities to address faults that DO manifest!

The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced
WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
verify that it is intact, NOW?

[One common flaw with RAID implementations and naive reliance on that
technology]
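The patrol-read/scrub idea reads naturally as code: re-hash everything
against prestored signatures so rot is caught NOW rather than when the file
is finally needed. A sketch only; the manifest format and names here are
invented:

```python
# Toy "patrol read": walk a manifest of {relative_path: sha256 hexdigest},
# re-hash each file, and report any whose content has silently changed.
import hashlib
import json
import os

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(root, manifest_path):
    """Return the list of files whose current hash differs from the
    manifest recorded earlier."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    bad = []
    for rel, expected in manifest.items():
        if sha256_of(os.path.join(root, rel)) != expected:
            bad.append(rel)
    return bad
```

A daemon would run this against whatever volumes happen to be mounted and
record a last-verified date per volume.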

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136487&group=sci.electronics.design#136487

From: nospam@null.void (Glen Walpert)
Subject: Re: Predictive failures
Newsgroups: sci.electronics.design
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me>
<uvs5eu$2g9e9$2@dont-email.me>
Lines: 70
Message-ID: <PLjUN.6944$59Pb.4425@fx16.iad>
Date: Fri, 19 Apr 2024 01:27:11 GMT
 by: Glen Walpert - Fri, 19 Apr 2024 01:27 UTC

On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:

> On 4/18/2024 10:18 AM, Buzz McCool wrote:
>> On 4/15/2024 10:13 AM, Don Y wrote:
>>> Is there a general rule of thumb for signalling the likelihood of an
>>> "imminent" (for some value of "imminent") hardware failure?
>>
>> This reminded me of some past efforts in this area. It was never
>> demonstrated to me (given ample opportunity) that this technology
>> actually worked on intermittently failing hardware I had, so be
>> cautious in applying it in any future endeavors.
>
> Intermittent failures are the bane of all designers. Until something is
> reliably observable, trying to address the problem is largely
> whack-a-mole.
>
>> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
>
> Thanks for that. I didn't find it in my collection so its addition
> will be welcome.
>
> Sun has historically been aggressive in trying to increase availability,
> especially on big iron. In fact, such a "prediction" led me to discard
> a small server, yesterday (no time to dick with failing hardware!).
>
> I am now seeing similar features in Dell servers. But, the *actual*
> implementation details are always shrouded in mystery.
>
> But, it is obvious (for "always on" systems) that there are many things
> that can silently fail that will only manifest some time later -- if at
> all and possibly complicated by other failures that may have been
> precipitated by it.
>
> Sorting out WHAT to monitor is the tricky part. Then, having the
> ability to watch for trends can give you an inkling that something is
> headed in the wrong direction -- before it actually exceeds some baked
> in "hard limit".
>
> E.g., only the memory that you actively REFERENCE in a product is ever
> checked for errors! Bit rot may not be detected until some time after
> it has occurred -- when you eventually access that memory (and the
> memory controller throws an error).
>
> This is paradoxically amusing; code to HANDLE errors is likely the least
> accessed code in a product. So, bit rot IN that code is more likely to
> go unnoticed -- until it is referenced (by some error condition)
> and the error event complicated by the attendant error in the handler!
> The more reliable your code (fewer faults), the more uncertain you will
> be of the handlers' abilities to address faults that DO manifest!
>
> The same applies to secondary storage media. How will you know if
> some-rarely-accessed-file is intact and ready to be referenced WHEN
> NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
> is intact, NOW?
>
> [One common flaw with RAID implementations and naive reliance on that
> technology]

RAID, even with backups, is unsuited to high reliability storage of large
databases. Distributed storage can be of much higher reliability:

https://telnyx.com/resources/what-is-distributed-storage

<https://towardsdatascience.com/introduction-to-distributed-data-
storage-2ee03e02a11d>

This requires successful retrieval of any n of m data files, normally from
different locations, where n can be arbitrarily smaller than m depending
on your needs. Overkill for small databases but required for high
reliability storage of very large databases.
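The n-of-m retrieval works because the stored files carry redundancy, as in
an erasure code. A toy 2-of-3 example, where any two of three shards
reconstruct the data (real distributed stores use Reed-Solomon codes with
arbitrary n and m; this XOR-parity version is for illustration only):

```python
# Toy 2-of-3 erasure code: two data shards plus an XOR parity shard;
# ANY two of the three suffice to rebuild the original.

def split(data: bytes):
    if len(data) % 2:
        data += b"\x00"                  # toy framing: pad to even length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def recover(shards):
    """shards: [a, b, parity] with at most one entry None (lost)."""
    a, b, parity = shards
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))   # a = b XOR parity
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))   # b = a XOR parity
    return a + b

a, b, p = split(b"predictive")
assert recover([None, b, p]) == b"predictive"   # shard 0 lost, data intact
```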

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136488&group=sci.electronics.design#136488

From: blockedofcourse@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Thu, 18 Apr 2024 20:08:17 -0700
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <uvsn7d$2n8f9$2@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me>
<uvs5eu$2g9e9$2@dont-email.me> <PLjUN.6944$59Pb.4425@fx16.iad>
 by: Don Y - Fri, 19 Apr 2024 03:08 UTC

On 4/18/2024 6:27 PM, Glen Walpert wrote:
> On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:
>
>> The same applies to secondary storage media. How will you know if
>> some-rarely-accessed-file is intact and ready to be referenced WHEN
>> NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
>> is intact, NOW?
>>
>> [One common flaw with RAID implementations and naive reliance on that
>> technology]
>
> RAID, even with backups, is unsuited to high reliability storage of large
> databases. Distributed storage can be of much higher reliability:
>
> https://telnyx.com/resources/what-is-distributed-storage
>
> <https://towardsdatascience.com/introduction-to-distributed-data-
> storage-2ee03e02a11d>
>
> This requires successful retrieval of any n of m data files, normally from
> different locations, where n can be arbitrarily smaller than m depending
> on your needs. Overkill for small databases but required for high
> reliability storage of very large databases.

This is effectively how I maintain my archive. Except that the
media are all "offline", requiring a human operator (me) to
fetch the required volumes in order to locate the desired files.

Unlike mirroring (or other RAID technologies), my scheme places
no constraints as to the "containers" holding the data. E.g.,

DISK43 /somewhere/in/filesystem/ fileofinterest
DISK21 >some>other>place anothernameforfile
CDROM77 \yet\another\place archive.type /where/in/archive foo

Can all yield the same "content" (as verified by their prestored signatures).
Knowing the hash of each object means you can verify its contents from a
single instance instead of looking for confirmation via other instance(s).

[Hashes take up considerably less space than a duplicate copy would]

This makes it easy to create multiple instances of particular "content"
without imposing constraints on how it is named, stored, located, etc.

I.e., pull a disk out of a system, catalog its contents, slap an adhesive
label on it (to be human-readable) and add it to your store.

(If I could mount all of the volumes -- because I wouldn't know which volume
might be needed -- then access wouldn't require a human operator, regardless
of where the volumes were actually mounted or the peculiarities of the
systems on which they are mounted! But, you can have a daemon that watches to
see WHICH volumes are presently accessible and have it initiate a patrol
read of their contents while the media are being accessed "for whatever OTHER
reason" -- and track the time/date of last "verification" so you know which
volumes haven't been checked, recently)

The inconvenience of requiring human intervention is offset by the lack of
wear on the media (as well as BTUs to keep it accessible) and the ease of
creating NEW content/copies. NOT useful for data that needs to be accessed
frequently but excellent for "archives"/repositories -- that can be mounted,
accessed and DUPLICATED to online/nearline storage for normal use.
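A minimal sketch of such a hash-keyed catalog, where any number of
(volume, path) instances map to one content hash and a single reachable
instance suffices for verification (the class and volume names here are
invented for illustration):

```python
# Content-addressed catalog: the hash pins the content, so naming,
# storage location, and container format are unconstrained.
import hashlib

class Catalog:
    def __init__(self):
        self.instances = {}     # hexdigest -> list of (volume, path)

    def add(self, volume, path, content: bytes):
        digest = hashlib.sha256(content).hexdigest()
        self.instances.setdefault(digest, []).append((volume, path))
        return digest

    def locations(self, digest):
        """Every known instance of this content, on any volume."""
        return self.instances.get(digest, [])

    def verify(self, digest, content: bytes):
        """Check a retrieved instance against its prestored signature."""
        return hashlib.sha256(content).hexdigest() == digest

cat = Catalog()
d = cat.add("DISK43", "/somewhere/in/filesystem/fileofinterest", b"same content")
cat.add("DISK21", ">some>other>place/anothernameforfile", b"same content")
assert len(cat.locations(d)) == 2
assert cat.verify(d, b"same content")
```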

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136497&group=sci.electronics.design#136497

From: boB@K7IQ.com (boB)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Fri, 19 Apr 2024 11:16:02 -0700
Message-ID: <v4d52jtf67qnie0d7kk7evfjakmcvtoo07@4ax.com>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me> <uvs5eu$2g9e9$2@dont-email.me>
Lines: 69
 by: boB - Fri, 19 Apr 2024 18:16 UTC

On Thu, 18 Apr 2024 15:05:07 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/18/2024 10:18 AM, Buzz McCool wrote:
>> On 4/15/2024 10:13 AM, Don Y wrote:
>>> Is there a general rule of thumb for signalling the likelihood of
>>> an "imminent" (for some value of "imminent") hardware failure?
>>
>> This reminded me of some past efforts in this area. It was never demonstrated
>> to me (given ample opportunity) that this technology actually worked on
>> intermittently failing hardware I had, so be cautious in applying it in any
>> future endeavors.
>
>Intermittent failures are the bane of all designers. Until something
>is reliably observable, trying to address the problem is largely
>whack-a-mole.
>

The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.

>> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
>
>Thanks for that. I didn't find it in my collection so its addition will
>be welcome.

Yes, neat paper.

boB

>
>Sun has historically been aggressive in trying to increase availability,
>especially on big iron. In fact, such a "prediction" led me to discard
>a small server, yesterday (no time to dick with failing hardware!).
>
>I am now seeing similar features in Dell servers. But, the *actual*
>implementation details are always shrouded in mystery.
>
>But, it is obvious (for "always on" systems) that there are many things
>that can silently fail that will only manifest some time later -- if at
>all and possibly complicated by other failures that may have been
>precipitated by it.
>
>Sorting out WHAT to monitor is the tricky part. Then, having the
>ability to watch for trends can give you an inkling that something is
>headed in the wrong direction -- before it actually exceeds some
>baked in "hard limit".
>
>E.g., only the memory that you actively REFERENCE in a product is ever
>checked for errors! Bit rot may not be detected until some time after it
>has occurred -- when you eventually access that memory (and the memory
>controller throws an error).
>
>This is paradoxically amusing; code to HANDLE errors is likely the least
>accessed code in a product. So, bit rot IN that code is more likely
>to go unnoticed -- until it is referenced (by some error condition)
>and the error event complicated by the attendant error in the handler!
>The more reliable your code (fewer faults), the more uncertain you
>will be of the handlers' abilities to address faults that DO manifest!
>
>The same applies to secondary storage media. How will you know if
>some-rarely-accessed-file is intact and ready to be referenced
>WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
>verify that it is intact, NOW?
>
>[One common flaw with RAID implementations and naive reliance on that
>technology]

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136498&group=sci.electronics.design#136498

From: blockedofcourse@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Fri, 19 Apr 2024 12:10:22 -0700
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <uvufj0$35g6g$1@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me>
<uvs5eu$2g9e9$2@dont-email.me> <v4d52jtf67qnie0d7kk7evfjakmcvtoo07@4ax.com>
 by: Don Y - Fri, 19 Apr 2024 19:10 UTC

On 4/19/2024 11:16 AM, boB wrote:
>> Intermittent failures are the bane of all designers. Until something
>> is reliably observable, trying to address the problem is largely
>> whack-a-mole.
>
> The problem I have with troubleshooting intermittent failures is that
> they are only intermittent sometimes.

My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
failure/fault but, because reproducing it is "hard", just pretend it
never happened! Really? Do you think the circuit/code is self-healing???

You're going to "bless" a product that you, personally, know has a fault...

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136525&group=sci.electronics.design#136525

From: boB@K7IQ.com (boB)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Sun, 21 Apr 2024 12:37:58 -0700
Message-ID: <qkqa2j1kiat08unqmb0lf8lcqn0tttm0m4@4ax.com>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me> <uvs5eu$2g9e9$2@dont-email.me> <v4d52jtf67qnie0d7kk7evfjakmcvtoo07@4ax.com> <uvufj0$35g6g$1@dont-email.me>
Lines: 26
 by: boB - Sun, 21 Apr 2024 19:37 UTC

On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/19/2024 11:16 AM, boB wrote:
>>> Intermittent failures are the bane of all designers. Until something
>>> is reliably observable, trying to address the problem is largely
>>> whack-a-mole.
>>
>> The problem I have with troubleshooting intermittent failures is that
>> they are only intermittent sometimes.
>
>My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
>failure/fault but, because reproducing it is "hard", just pretend it
>never happened! Really? Do you think the circuit/code is self-healing???
>
>You're going to "bless" a product that you, personally, know has a fault...
>

Yes, it may be hard to replicate but you just have to try and try
again sometimes. Or create something that exercises the unit or
software to make it happen and automatically catch it in the act.

I don't care to have to do that very often. When I do, I just try to
make it a challenge.

boB
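
The "exercise it and catch it in the act" approach can be sketched as a
seeded random harness: because each trial's inputs derive from a recorded
seed, any failure it trips is exactly replayable. (`flaky_unit` and its one
rare failing input are stand-ins invented for illustration.)

```python
# Seeded stress harness: random but replayable inputs, stopping the moment
# an invariant breaks.
import random

def flaky_unit(x):
    # pretend hardware/software that misbehaves on exactly one rare input
    return x * 2 if x != 1234 else x * 2 + 1

def hunt(invariant, max_iters=20_000):
    """Return (seed, input) for the first failing trial, or None."""
    for seed in range(max_iters):
        rng = random.Random(seed)        # the seed alone replays the trial
        x = rng.randrange(2_000)
        if not invariant(x, flaky_unit(x)):
            return seed, x               # enough to reproduce it exactly
    return None

caught = hunt(lambda x, y: y == 2 * x)
```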

Re: Predictive failures


https://news.novabbs.org/tech/article-flat.php?id=136529&group=sci.electronics.design#136529

From: blockedofcourse@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Sun, 21 Apr 2024 14:23:32 -0700
Organization: A noiseless patient Spider
Lines: 68
Message-ID: <v0404p$gii6$1@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me>
<uvs5eu$2g9e9$2@dont-email.me> <v4d52jtf67qnie0d7kk7evfjakmcvtoo07@4ax.com>
<uvufj0$35g6g$1@dont-email.me> <qkqa2j1kiat08unqmb0lf8lcqn0tttm0m4@4ax.com>
 by: Don Y - Sun, 21 Apr 2024 21:23 UTC

On 4/21/2024 12:37 PM, boB wrote:
> On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 4/19/2024 11:16 AM, boB wrote:
>>>> Intermittent failures are the bane of all designers. Until something
>>>> is reliably observable, trying to address the problem is largely
>>>> whack-a-mole.
>>>
>>> The problem I have with troubleshooting intermittent failures is that
>>> they are only intermittent sometimes.
>>
>> My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
>> failure/fault but, because reproducing it is "hard", just pretend it
>> never happened! Really? Do you think the circuit/code is self-healing???
>>
>> You're going to "bless" a product that you, personally, know has a fault...
>
> Yes, it may be hard to replicate but you just have to try and try
> again sometimes. Or create something that exercises the unit or
> software to make it happen and automatically catch it in the act.

I think this was the perfect application for Google Glass! It
seems a given that whenever you stumble on one of these "events",
you aren't concentrating on how you GOT there; you didn't expect
the failure to manifest so weren't keeping track of your actions.

If, instead, you could "rewind" a recording of everything that you
had done up to that point, it would likely go a long way towards
helping you recreate the problem!

When you get a "report" of someone encountering some anomalous
behavior, it's easy to shrug it off because they are often very
imprecise in describing their actions; details (crucial) are
often missing or a subject of "fantasy". Is the person sure
that the machine wasn't doing exactly what it SHOULD in that
SPECIFIC situation??

OTOH, when it happens to YOU, you know that the report isn't
a fluke. But, you are just as weak on the details as those third-party
reporters!

> I don't care to have to do that very often. When I do, I just try to
> make it a challenge.

Being able to break a design into small pieces goes a long way to
improving its quality. Taking "contractual design" to its extreme
lets you build small, validatable modules that stand a greater
chance of working in concert.

Unfortunately, few have the discipline for such detail, hoping,
instead, to test bigger units (if they do ANY formal testing at all!)

Think of how little formal testing goes into a hardware design.
Aside from imposing inputs and outputs at their extremes, what
*really* happens before a design is released to manufacturing?
(I haven't seen a firm that does a rigorous shake-n-bake in
more than 40 years!)

And, how much less goes into software -- where it is relatively easy to
build test scaffolding and implement regression tests to ensure new
releases don't reintroduce old bugs...
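
The regression-test discipline in miniature: once a bug is found and fixed,
a test pinning the fix joins the suite forever, so a later release cannot
quietly reintroduce it. (The parser and its "bug #42" below are
hypothetical.)

```python
# Regression testing in miniature: the test for (hypothetical) bug #42,
# a crash on a leading '+', stays in the suite after the fix ships.

def parse_setpoint(text):
    """Parse an operator-entered setpoint; bug #42 was a crash on '+3.3'."""
    return float(text.strip().lstrip("+"))

def test_bug42_leading_plus():
    assert parse_setpoint(" +3.3 ") == 3.3

def test_plain_value():
    assert parse_setpoint("5") == 5.0

# run the suite (a real project would let a test runner collect these)
test_bug42_leading_plus()
test_plain_value()
```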

When the emphasis (Management) is getting product out the door,
it's easy to see engineering (and manufacturing) disciplines suffer.

:<
