Rocksolid Light - rec.photo.digital - Re: Convert book pdf to an audio podcast to send as an email attachment

Re: Convert book pdf to an audio podcast to send as an email attachment

<u6gir2$ngdo$1@dont-email.me>

https://news.novabbs.org/tech/article-flat.php?id=14233&group=rec.photo.digital#14233

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: occassionally-confused@nospam.co.uk (Peter)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,alt.comp.software.firefox,rec.photo.digital
Subject: Re: Convert book pdf to an audio podcast to send as an email attachment
Followup-To: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Date: Fri, 16 Jun 2023 03:53:54 +0100
Organization: -
Lines: 66
Message-ID: <u6gir2$ngdo$1@dont-email.me>
References: <u675kg$34cnq$1@dont-email.me> <u67kqs$365rm$1@dont-email.me> <u68ng6$3fr13$1@dont-email.me> <u6fdam$f678$1@dont-email.me> <u6fdmc$f835$1@dont-email.me> <u6fei4$fab4$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 16 Jun 2023 02:53:22 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="138b9b70fefa1097b4278351920597b6";
logging-data="770488"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19m61OqP2DuR5u/NHsTxfLr"
Cancel-Lock: sha1:GFpFnMjyyPixGf8A67ZFQYrGREc=
X-No-Archive: yes
X-Newsreader: Forte Agent 3.3/32.846

by: Peter - Fri, 16 Jun 2023 02:53 UTC

knuttle <keith_nuttle@yahoo.com> wrote
> A quick Google seems to indicate that OCR is part of the professional
> reader, or a paid subscription

I have Adobe Acrobat 6, which is a decade or more old but I tried that
after seeing your suggestion about the "professional reader" (which doesn't
really make sense to me as the professional is a writer and not a reader).

Adobe Acrobat 6 Standard has a menu item for
"Document > Paper Capture > Start Capture" and then a choice of
All pages, Current page, and From Page x to y.

The default "Paper Capture Settings" language is
"Primary OCR Language = English"
(although you can set any language in their long list).

The default "Paper Capture Settings" "PDF Output Style" is set to
"Searchable Image (Exact)",
but there were also options for "Searchable Image (Compact)" and
"Formatted Text and Graphics".

The default "Paper Capture Settings" "Downsample Images" dpi is set to
"None" but there were options for "Low (300 dpi)", "Medium (150 dpi)",
and "High (72 dpi)".

I thought I was heading toward the finish line when I tried it on a sample
full-text page of a 200-page document whose text I couldn't select - but
the professional Adobe Acrobat 6 Paper Capture feature erred out saying
"Acrobat could not run Paper Capture on the page because
of the following error: This page contains renderable text."

I thought I'd be sneaky and run the Adobe Acrobat Professional feature of
"File > Reduce File Size > Compatible with Acrobat 4.0 and later"
But it erred out in trying to convert the PDF to Acrobat 4 saying
"This PDF cannot be made compatible with Acrobat 4.0 because it uses
transparency." It did, however, convert to the Acrobat 5.0 and 6.0 versions.

Unfortunately I wasn't sneaky enough as it still complained that
"This page contains renderable text."

Only then did it slowly dawn on me that if the file contains "renderable
text", then maybe I could just save the entire file to plain text, which worked.
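
For anyone without Acrobat, roughly the same "save as plain text" step can be sketched with Poppler's pdftotext command-line tool instead of Acrobat's own export. This is a different tool from the one used above, and it assumes pdftotext is installed and on the PATH. The function names and the file names book.pdf and book.txt are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for Poppler's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout as best it can
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def save_pdf_as_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext on one file; return False if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        return False
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return True


if __name__ == "__main__":
    # Illustrative file names only.
    print(pdftotext_command("book.pdf", "book.txt"))
```

If the output file comes back empty, the PDF most likely has no text layer at all and needs OCR first.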

Which means there must be at least 3 kinds of PDF text documents:
Real PDF (with selectable text, which we can presume can be OCR'd)
Renderable text (without selectable text, which apparently can't be OCR'd)
Bitmap text (without selectable text, which we can presume can be OCR'd)

Being confused about how to tell which is which, I dug up a PDF where I could
select the text, and THAT was able to be OCR'd by the professional Acrobat:
Paper Capture
"Rasterizing page and sending to Paper Capture..."
Performing page recognition
Converting to indexed color
Thresholding image
Deskewing image
Finding rules and frames
Reading characters
Forming words
Grouping characters and words
Writing ACP file
Do you want to save the changes to "file.pdf" before closing?

But where's the OCR results?
--
fup set to replace firefox with r.p.d instead.

Re: Convert book pdf to an audio podcast to send as an email attachment

<u6hf4o$qc6g$1@dont-email.me>

https://news.novabbs.org/tech/article-flat.php?id=14236&group=rec.photo.digital#14236

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nospam@needed.invalid (Paul)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Subject: Re: Convert book pdf to an audio podcast to send as an email attachment
Date: Fri, 16 Jun 2023 06:56:23 -0400
Organization: A noiseless patient Spider
Lines: 46
Message-ID: <u6hf4o$qc6g$1@dont-email.me>
References: <u675kg$34cnq$1@dont-email.me> <u67kqs$365rm$1@dont-email.me>
<u68ng6$3fr13$1@dont-email.me> <u6fdam$f678$1@dont-email.me>
<u6fdmc$f835$1@dont-email.me> <u6fei4$fab4$1@dont-email.me>
<u6gir2$ngdo$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 16 Jun 2023 10:56:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e9e635255937d9c5b290d92db16de619";
logging-data="864464"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19aV0i8RHaSlulMHXis0Ok1vgOF/7QJVeY="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:GlOkAfSx85w0Z4IbL6TZokn553E=
Content-Language: en-US
In-Reply-To: <u6gir2$ngdo$1@dont-email.me>

by: Paul - Fri, 16 Jun 2023 10:56 UTC

On 6/15/2023 10:53 PM, Peter wrote:

> Adobe Acrobat Professional

> Paper Capture
> "Rasterizing page and sending to Paper Capture..."
> Performing page recognition
> Converting to indexed color
> Thresholding image
> Deskewing image
> Finding rules and frames
> Reading characters
> Forming words
> Grouping characters and words
> Writing ACP file
> Do you want to save the changes to "file.pdf" before closing?
>
> But where's the OCR results?
>
The text precisely overlays the scanned characters underneath.

If any of the bitmap below the letter "sticks out", that means
an OCR error has been made.

In the free Acrobat Reader, File : Properties : Fonts
is your friend. The number and type of fonts listed
there tell you the kind of document you're dealing with.

For the Sony service manual, which is just a scan of some paper
stored in a PDF, the fonts list is blank.

Finding no fonts means you need to run OCR.

When I run the tool that uses Tesseract, Tesseract itself does
not add the PDF text; the thing running Tesseract on your
behalf adds overlay text on top of the bitmap images. The Font
properties after it is finished? Just one font is listed,
and it is a magical font which does not render in the reader.
You can only detect that it is present by doing a "select all"
and then a copy/paste.

When a document has a dozen fonts listed, it has "rendered text"
in it, and the tools won't let you add OCR overlay text
to a document that already has "rendered text".
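
The font-list rule of thumb above can be sketched in a few lines of Python. This version assumes Poppler's pdffonts utility rather than the Reader's Properties dialog, and it counts the rows printed after the dashed separator line. The function names and the 0 / 1 / many thresholds are illustrative assumptions drawn from the description above, not a guaranteed classification.

```python
import subprocess


def count_fonts(pdffonts_output):
    """Count font rows in the table printed by pdffonts.

    Rows are taken to be the non-empty lines after the dashed separator.
    """
    count = 0
    seen_separator = False
    for line in pdffonts_output.splitlines():
        if not seen_separator:
            if line.startswith("---"):
                seen_separator = True
            continue
        if line.strip():
            count += 1
    return count


def guess_pdf_kind(font_count):
    """Heuristic only: thresholds follow the rule of thumb, not any spec."""
    if font_count == 0:
        return "scan only: run OCR"
    if font_count == 1:
        return "possibly a scan with an OCR overlay"
    return "rendered text: OCR not needed"


def list_fonts(pdf_path):
    """Return pdffonts output for a file (requires Poppler to be installed)."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A single listed font can also be an ordinary one-font text document, so for that case the "select all" and copy/paste check is still worth doing.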

Paul
