
Unusual OCR problem

<u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>

Path: i2pn2.org!i2pn.org!news.nntp4.net!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!usenet.inf.ed.ac.uk!.POSTED!not-for-mail
From: richard@cogsci.ed.ac.uk (Richard Tobin)
Newsgroups: uk.comp.sys.mac
Subject: Unusual OCR problem
Date: Fri, 16 Jun 2023 14:36:44 +0000 (UTC)
Organization: Language Technology Group, University of Edinburgh
Lines: 9
Message-ID: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>
NNTP-Posting-Host: macaroni.inf.ed.ac.uk
X-Trace: macpro.inf.ed.ac.uk 1686926204 55346 129.215.197.42 (16 Jun 2023 14:36:44 GMT)
X-Complaints-To: usenet@macpro.inf.ed.ac.uk
NNTP-Posting-Date: Fri, 16 Jun 2023 14:36:44 +0000 (UTC)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: richard@cogsci.ed.ac.uk (Richard Tobin)
 by: Richard Tobin - Fri, 16 Jun 2023 14:36 UTC

I have a 100+ page assembler listing of a 6502 program, printed on
a dot-matrix printer in 1983. Does anyone know of an OCR program
that might handle this?

Ones trained on English are likely to be hopeless at everything except
the comments.

-- Richard

Re: Unusual OCR problem

<kf3b6gF93ssU1@mid.individual.net>

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!lilly.ping.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: jaimie@usually.sessile.org (Jaimie Vandenbergh)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: 16 Jun 2023 14:50:57 GMT
Lines: 21
Message-ID: <kf3b6gF93ssU1@mid.individual.net>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=fixed
Content-Transfer-Encoding: 8bit
X-Trace: individual.net PolkgIa10YGb+Sm6CfCELQpKqqULjnJNo/V/QRzrFJ51Mw85sO
Cancel-Lock: sha1:ank2OUBpIefGmcEHHEn2BoxefQU=
User-Agent: Usenapp for MacOS
X-Usenapp: v1.27.1/l - Full License
 by: Jaimie Vandenbergh - Fri, 16 Jun 2023 14:50 UTC

On 16 Jun 2023 at 15:36:44 BST, "Richard Tobin" <Richard Tobin> wrote:

> I have a 100+ page assembler listing of a 6502 program, printed on
> a dot-matrix printer in 1983. Does anyone know of an OCR program
> that might handle this?
>
> Ones trained on English are likely to be hopeless at everything except
> the comments.
>
> -- Richard

Have you tried taking photos and using the iPhone/Mac text recognition?
I've had success with dot-matrix printouts before, but at the 99% kinda
level so you do need to check it all.

Cheers - Jaimie
--
Tomorrow (noun) - A mystical land where 99 per cent
of all human productivity, motivation and achievement
is stored.
-- http://thedoghousediaries.com/3474

Re: Unusual OCR problem

<u6hu9h$1mlf$1@macpro.inf.ed.ac.uk>

Path: i2pn2.org!i2pn.org!news.nntp4.net!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!usenet.inf.ed.ac.uk!.POSTED!not-for-mail
From: richard@cogsci.ed.ac.uk (Richard Tobin)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: Fri, 16 Jun 2023 15:14:57 +0000 (UTC)
Organization: Language Technology Group, University of Edinburgh
Lines: 20
Message-ID: <u6hu9h$1mlf$1@macpro.inf.ed.ac.uk>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk> <kf3b6gF93ssU1@mid.individual.net>
NNTP-Posting-Host: macaroni.inf.ed.ac.uk
X-Trace: macpro.inf.ed.ac.uk 1686928497 55983 129.215.197.42 (16 Jun 2023 15:14:57 GMT)
X-Complaints-To: usenet@macpro.inf.ed.ac.uk
NNTP-Posting-Date: Fri, 16 Jun 2023 15:14:57 +0000 (UTC)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: richard@cogsci.ed.ac.uk (Richard Tobin)
 by: Richard Tobin - Fri, 16 Jun 2023 15:14 UTC

In article <kf3b6gF93ssU1@mid.individual.net>,
Jaimie Vandenbergh <jaimie@usually.sessile.org> wrote:

>Have you tried taking photos and using the iPhone/Mac text recognition?
>I've had success with dot-matrix printouts before, but at the 99% kinda
>level so you do need to check it all.

I tried copying and pasting from Preview but it was terrible. For example,

LDXIM (LISPEN-LISVAL-1)/256+1

was read as

LOXIM CLISPENLISVAL-1) /25641

Surprisingly, pointing my Android phone at the screen worked much
better. It got that example right, but there are still lots of errors
and it is misinterpreting it as columns of text.
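
A misread like LOXIM for LDXIM can often be repaired after the fact, because the opcode field can only hold one of a small set of mnemonics. The following is a minimal sketch of that idea using Python's standard difflib module. The mnemonic list is a hypothetical subset made up for illustration (only LDXIM appears in this thread); a real run would need the full set accepted by the assembler that produced the listing.

```python
import difflib

# Hypothetical subset of mnemonics, for illustration only. Only LDXIM is
# taken from the thread; the rest are assumed placeholders.
KNOWN_MNEMONICS = ["LDXIM", "LDAIM", "LDYIM", "STAZ", "JSR", "RTS"]


def correct_mnemonic(token, known=KNOWN_MNEMONICS, cutoff=0.6):
    """Return the closest known mnemonic, or the token unchanged if none is close."""
    if token in known:
        return token
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_mnemonic("LOXIM"))  # LDXIM
```

This only helps with the opcode column. Operands such as (LISPEN-LISVAL-1)/256+1 would need a different check, for example against the symbol names defined elsewhere in the listing.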

-- Richard

Re: Unusual OCR problem

<kf3k7hFaferU1@mid.individual.net>

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!lilly.ping.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: jaimie@usually.sessile.org (Jaimie Vandenbergh)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: 16 Jun 2023 17:25:05 GMT
Lines: 31
Message-ID: <kf3k7hFaferU1@mid.individual.net>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk> <kf3b6gF93ssU1@mid.individual.net> <u6hu9h$1mlf$1@macpro.inf.ed.ac.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=fixed
Content-Transfer-Encoding: 8bit
X-Trace: individual.net 8oxfWjbkGpfZZqkUIcJWewYU2qONj1SmSMVpk9hkk/8K6FmPlj
Cancel-Lock: sha1:+/+uUkAJrEQpwvdAUnXCITiOMXA=
User-Agent: Usenapp for MacOS
X-Usenapp: v1.27.1/l - Full License
 by: Jaimie Vandenbergh - Fri, 16 Jun 2023 17:25 UTC

On 16 Jun 2023 at 16:14:57 BST, "Richard Tobin" <Richard Tobin> wrote:

> In article <kf3b6gF93ssU1@mid.individual.net>,
> Jaimie Vandenbergh <jaimie@usually.sessile.org> wrote:
>
>> Have you tried taking photos and using the iPhone/Mac text recognition?
>> I've had success with dot-matrix printouts before, but at the 99% kinda
>> level so you do need to check it all.
>
> I tried copying and pasting from Preview but it was terrible. For example,
>
> LDXIM (LISPEN-LISVAL-1)/256+1
>
> was read as
>
> LOXIM CLISPENLISVAL-1) /25641

Oof! You can see where it was going with that, but...

> Surprisingly, pointing my Android phone at the screen worked much
> better. It got that example right, but there are still lots of errors
> and it is misinterpreting it as columns of text.

Aw. Android does this stuff in datacentre rather than on-device, so has
a lot more server to throw at it, but...

Cheers - Jaimie (but...)

--
Reality is what doesn't go away when you stop believing in it
-- Philip K Dick

Re: Unusual OCR problem

<6bb79be7-638b-7d37-921b-b301698fb73e@scorecrow.com>

Path: i2pn2.org!i2pn.org!news.furie.org.uk!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: 07.013@scorecrow.com (Bruce Horrocks)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: Fri, 16 Jun 2023 22:43:10 +0100
Lines: 26
Message-ID: <6bb79be7-638b-7d37-921b-b301698fb73e@scorecrow.com>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: individual.net Fldg0ylSVBhjncvSoblWlAHjg5Ira0WA1lPPn8A7uYI9YyOjdo
Cancel-Lock: sha1:ha7et2Sv14/jw86S33aJl9O5n24=
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.11.2
Content-Language: en-GB
In-Reply-To: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>
 by: Bruce Horrocks - Fri, 16 Jun 2023 21:43 UTC

On 16/06/2023 15:36, Richard Tobin wrote:
> I have a 100+ page assembler listing of a 6502 program, printed on
> a dot-matrix printer in 1983. Does anyone know of an OCR program
> that might handle this?
>
> Ones trained on English are likely to be hopeless at everything except
> the comments.

Not aware of anything off the shelf, but not had to do it myself.

This project used a custom dot-matrix training set for Google's
Tesseract which hugely improved their recognition rate.
<https://github.com/ameera3/OCR_Expiration_Date>

An alternative is to pre-process your scanned images by duplicating and
then merging a very slightly vertically or diagonally shifted copy of
each page on top of itself. This has the effect of filling in the dots
on the dot matrix font so the normal OCR software can work a lot better.

ImageMagick will be able to do the duplicate and merge bit.
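
To show what the duplicate-and-merge step does to the pixels, here is a minimal sketch in plain Python rather than ImageMagick. It assumes the page has already been reduced to a binary bitmap (a list of rows, 1 for ink and 0 for paper); the function name and the default one-pixel vertical shift are illustrative choices, not anything taken from a real tool.

```python
def shift_and_merge(bitmap, dx=0, dy=1):
    """OR a binary bitmap with a copy of itself shifted by (dx, dy) pixels.

    bitmap is a list of rows, each a list of 0 (paper) or 1 (ink).
    Pixels shifted in from outside the image are treated as paper.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    merged = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            sy, sx = y - dy, x - dx  # source pixel that lands on (x, y)
            if 0 <= sy < height and 0 <= sx < width and bitmap[sy][sx]:
                merged[y][x] = 1
    return merged


# A vertical stroke printed as separate dots with one-pixel gaps.
stroke = [[1], [0], [1], [0], [1]]
print(shift_and_merge(stroke))  # [[1], [1], [1], [1], [1]]
```

The shift needs to be about the size of the gap between dots. Too large a shift starts to merge neighbouring strokes, so it is probably worth trying a few values on one page before processing all of them.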

Regards,
--
Bruce Horrocks
Surrey, England

Re: Unusual OCR problem

<CkC*GxZiz@news.chiark.greenend.org.uk>

Path: i2pn2.org!i2pn.org!news.nntp4.net!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!.POSTED.chiark.greenend.org.uk!not-for-mail
From: theom+news@chiark.greenend.org.uk (Theo)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: 16 Jun 2023 22:43:28 +0100 (BST)
Organization: University of Cambridge, England
Message-ID: <CkC*GxZiz@news.chiark.greenend.org.uk>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>
Injection-Info: chiark.greenend.org.uk; posting-host="chiark.greenend.org.uk:212.13.197.229";
logging-data="10861"; mail-complaints-to="abuse@chiark.greenend.org.uk"
User-Agent: tin/1.8.3-20070201 ("Scotasay") (UNIX) (Linux/5.10.0-22-amd64 (x86_64))
Originator: theom@chiark.greenend.org.uk ([212.13.197.229])
 by: Theo - Fri, 16 Jun 2023 21:43 UTC

Richard Tobin <richard@cogsci.ed.ac.uk> wrote:
> I have a 100+ page assembler listing of a 6502 program, printed on
> a dot-matrix printer in 1983. Does anyone know of an OCR program
> that might handle this?

I wonder if there's a way to train the OCR on the specific dot matrix font?
There were only a limited number of dot positions and so it may have to
infer that a collection of dots represents a particular character, being the best
guess from the limited character set.

One that's expecting whole printed characters may get very confused by
discontiguous matrixes of dots.
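
The "best guess from a limited character set" idea can be sketched as nearest-template matching: compare each character cell against a table of known glyph bitmaps and pick the one with the fewest differing dots. The 3x5 glyphs below are invented for illustration and are not any real printer's font; a real table would have to be built from the printer manual or from clean sample scans.

```python
# Hypothetical 3x5 glyph templates ('#' = dot, '.' = no dot). A real
# dot-matrix font would be larger and would need its own table.
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "D": ("##.", "#.#", "#.#", "#.#", "##."),
}


def hamming(a, b):
    """Count the dot positions at which two equal-sized glyphs differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def classify(cell, templates=TEMPLATES):
    """Return the character whose template differs from the cell in the fewest dots."""
    return min(templates, key=lambda ch: hamming(cell, templates[ch]))


# A 'D' with one dot missing, as a faded ribbon might produce.
noisy_d = ("##.", "#.#", "#..", "#.#", "##.")
print(classify(noisy_d))  # D
```

This assumes each cell has already been aligned to the dot grid, which is the hard part with a scan of 40-year-old paper.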

Theo

Re: Unusual OCR problem

<u6jr22$167g9$1@dont-email.me>

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chrisridd@mac.com (Chris Ridd)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: Sat, 17 Jun 2023 09:32:01 +0100
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <u6jr22$167g9$1@dont-email.me>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk>
<CkC*GxZiz@news.chiark.greenend.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 17 Jun 2023 08:32:02 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="01f26d8ac6e0640558a8a0c0ba936167";
logging-data="1252873"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18/XNe17A/G42j/kKn8wo4SsNm1kOgc35Y="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.12.0
Cancel-Lock: sha1:b4KXqaKF6j5iWW07WMl8xtB6758=
In-Reply-To: <CkC*GxZiz@news.chiark.greenend.org.uk>
 by: Chris Ridd - Sat, 17 Jun 2023 08:32 UTC

On 16/06/2023 22:43, Theo wrote:
> Richard Tobin <richard@cogsci.ed.ac.uk> wrote:
>> I have a 100+ page assembler listing of a 6502 program, printed on
>> a dot-matrix printer in 1983. Does anyone know of an OCR program
>> that might handle this?
>
> I wonder if there's a way to train the OCR on the specific dot matrix font?
> There were only a limited number of dot positions and so it may have to
> infer that a collection of dots represents a particular character, being the best
> guess from the limited character set.
>
> One that's expecting whole printed characters may get very confused by
> discontiguous matrixes of dots.

If the OCR knew the material was monospaced, that would probably improve
the accuracy too.
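
One way to use the monospacing is to skip character segmentation altogether and cut each text line into equal-width cells. The sketch below assumes the line has already been isolated and deskewed, and that the cell width in pixels is known; the function name and the string-per-row representation are illustrative assumptions.

```python
def split_into_cells(line_rows, cell_width):
    """Cut one text line's pixel rows into fixed-width character cells.

    line_rows is a sequence of equal-length strings, one per pixel row.
    Any partial cell left over at the right-hand edge is dropped.
    """
    width = len(line_rows[0]) if line_rows else 0
    cells = []
    for left in range(0, width - cell_width + 1, cell_width):
        cells.append(tuple(row[left:left + cell_width] for row in line_rows))
    return cells


line = ("##.###",
        "#.##.#")
print(split_into_cells(line, 3))
# [('##.', '#.#'), ('###', '#.#')]
```

Each cell can then be recognised on its own, and the column position of a cell also says which field of the assembler listing (address, opcode, operand, comment) it belongs to.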

--
Chris

Re: Unusual OCR problem

<1qchm5o.hzpahq1sbcu48N%liz@poppyrecords.invalid.invalid>

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.imp.ch!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: liz@poppyrecords.invalid.invalid (Liz Tuddenham)
Newsgroups: uk.comp.sys.mac
Subject: Re: Unusual OCR problem
Date: Sat, 17 Jun 2023 20:59:38 +0100
Organization: Poppy Records
Lines: 33
Message-ID: <1qchm5o.hzpahq1sbcu48N%liz@poppyrecords.invalid.invalid>
References: <u6hs1s$1m1i$1@macpro.inf.ed.ac.uk> <6bb79be7-638b-7d37-921b-b301698fb73e@scorecrow.com>
X-Trace: individual.net D8qHwIJN74iktEDB+1+cNwlRRb0EcAWP4ooEETLvW9b84nUQ+l
X-Orig-Path: liz
Cancel-Lock: sha1:KEMkl563SwSa7wmJpFeXeNGaY6Y=
User-Agent: MacSOUP/2.4.6
 by: Liz Tuddenham - Sat, 17 Jun 2023 19:59 UTC

Bruce Horrocks <07.013@scorecrow.com> wrote:

> On 16/06/2023 15:36, Richard Tobin wrote:
> > I have a 100+ page assembler listing of a 6502 program, printed on
> > a dot-matrix printer in 1983. Does anyone know of an OCR program
> > that might handle this?
> >
> > Ones trained on English are likely to be hopeless at everything except
> > the comments.
>
> Not aware of anything off the shelf, but not had to do it myself.
>
> This project used a custom dot-matrix training set for Google's
> Tesseract which hugely improved their recognition rate.
> <https://github.com/ameera3/OCR_Expiration_Date>
>
> An alternative is to pre-process your scanned images by duplicating and
> then merging a very slightly vertically or diagonally shifted copy of
> each page on top of itself. This has the effect of filling in the dots
> on the dot matrix font so the normal OCR software can work a lot better.
>
> ImageMagick will be able to do the duplicate and merge bit.

In Photoshop LE, selecting 'Blur more' followed by 'Sharpen more' gets
rid of the worst of the screening dots on newspaper photographs.
Perhaps something like that would work on a dot matrix print?
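
A rough analogue of that blur-then-sharpen idea, sketched in plain Python: a 3x3 box blur followed by a hard threshold. This is a stand-in technique, not a reimplementation of Photoshop's filters, and it assumes a greyscale grid where 0 is black ink and 255 is white paper. The cutoff of 128 is an arbitrary starting point that would need tuning against real scans.

```python
def box_blur(gray):
    """3x3 mean filter over a grid of 0-255 values; edges use the neighbours that exist."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            vals = [gray[j][i]
                    for j in range(max(0, y - 1), min(height, y + 2))
                    for i in range(max(0, x - 1), min(width, x + 2))]
            row.append(sum(vals) // len(vals))
        out.append(row)
    return out


def threshold(gray, cutoff=128):
    """Snap every pixel to pure black (0) or pure white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in gray]


# Two-pixel-wide dots separated by a one-pixel gap of paper.
dots = [[0, 0, 255, 0, 0]] * 3
print(threshold(box_blur(dots)))
# [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
```

The blur only closes gaps that are narrower than the dots themselves. Where gap and dot are the same width the pattern survives the threshold, so the scan resolution matters.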

--
~ Liz Tuddenham ~
(Remove the ".invalid"s and add ".co.uk" to reply)
www.poppyrecords.co.uk

