'ocr' Service

enaio® 12.0

Together with an OCR component, the 'ocr' service converts image documents into text documents that are used for full-text indexing. It can also create PDF files with hidden text, various PDF formats, and highly compressed PDFs.

ABBYY FineReader or Tesseract can be integrated as an OCR component.

Tesseract is enabled by default for new installations of enaio®.

Tesseract is part of the 'ocr' service and does not need to be installed separately. ABBYY FineReader needs to be installed separately.

Enabling an OCR Component

Tesseract is enabled by default for new installations of enaio® by assigning the tesseract profile in the servicewatcher-sw.yml file located in the \service-manager\config\ directory.

The existing file servicewatcher-sw.yml is not changed during updates and upgrades.

Example:

- name: ocrservice
  type: microservice
  profiles: prod,cloud,blue,tesseract
  instances: 1
  memory: 512M
  port: 7241-7250
  path: ${appBase}/ocrservice/ocrservice-app.jar
                                                            

To enable ABBYY FineReader, first remove the tesseract profile from the profiles entry.
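
As a sketch of that change, the entry from the example above would look as follows once the tesseract profile has been removed. The remaining profiles and values are simply carried over from that example and are assumptions about your installation, not required values.

```yaml
# servicewatcher-sw.yml (sketch): 'ocr' service entry without the tesseract profile.
# All other values are copied from the example above; adjust them to your installation.
- name: ocrservice
  type: microservice
  profiles: prod,cloud,blue
  instances: 1
  memory: 512M
  port: 7241-7250
  path: ${appBase}/ocrservice/ocrservice-app.jar
```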

Configuring Tesseract

Tesseract is configured via the ocr-prod.yml configuration file, which is created in the \service-manager\config\ directory.

Example:

engine:
  parallelJobs: 2
tesseract:
  pdfFormat: "PDF_A_1B"
                                                            

The values in the example correspond to the default values.

Parameters:

Parameter	Value
engine.parallelJobs	The maximum number of parallel jobs is specified by the 'TES' license. Default: 2. If there are several installations of enaio® service-manager with the 'ocr' service, the number is distributed across the installations.
tesseract.pdfFormat	PDF format. Possible values: PDF, PDF_A_1B (default), PDF_A_1A, PDF_A_2A, PDF_A_2U, PDF_A_3A, PDF_A_3U.

The following languages are supported: English, French, German, Italian, and Spanish.
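
For illustration, a minimal ocr-prod.yml sketch that changes only the PDF format and leaves the number of parallel jobs at its default. The choice of PDF_A_2U is arbitrary; any value listed for tesseract.pdfFormat could be used instead.

```yaml
# ocr-prod.yml (sketch): only the PDF format differs from the defaults.
engine:
  parallelJobs: 2        # default; the upper limit is set by the 'TES' license
tesseract:
  pdfFormat: "PDF_A_2U"  # illustrative choice from the documented values
```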

Configuring the File Size

The following parameters in the ocr-prod.yml configuration file can be used to configure the file size of the PDF files that Tesseract creates. Size reduction is disabled by default.

Parameter	Value
tesseract.pdf.image.resize.enabled	Enables size reduction. Values: false, true. Default: false.
tesseract.pdf.image.resize.max-resolution	Resolution (in DPI). Values: 300, 200, 150. Default: 200.
tesseract.pdf.image.resize.quality	Quality. Values from 100 (maximum quality) to 1 (minimum quality). Default: 75.

Example with default values:

tesseract:
  pdf:
    image:
      resize:
        enabled: false
        max-resolution: 200
        quality: 75
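
Since size reduction is disabled by default, it only takes effect once enabled is set to true. The following sketch shows one possible combination; the resolution of 150 DPI and the quality of 60 are illustrative values within the documented ranges, not recommendations.

```yaml
# ocr-prod.yml (sketch): size reduction switched on with illustrative values.
tesseract:
  pdf:
    image:
      resize:
        enabled: true        # default is false
        max-resolution: 150  # documented values: 300, 200, 150
        quality: 60          # 100 = maximum quality, 1 = minimum quality
```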
                                                            

Configuring ABBYY FineReader

The 'ocr' service works with the following default settings:

PDF profile: format	PDF/A1b
PDF profile: method	Balanced
Text profile	Predefined: TextExport.ini
File transfer to enaio® rendition-plus	Stream
Number of cores for ABBYY FineReader	1

These settings can be modified via the ocr-prod.yml configuration file, which is created in the \service-manager\config\ directory.

Example of a configuration in the ocr-prod.yml file:

finereader:
  profile:
    pdfa: PDFA1bBalanced.ini
    text: TextExport.ini
  engine:
    parallelJobs: 1
rest:
  transferPolicy: stream
                                                            

The example corresponds to the default settings.

Only the settings that differ from the default settings need to be specified.

Customizing the profile

Customizing the PDF profile

To customize the PDF format and the method, assign a value of the form <format><method>.ini to the finereader:profile:pdfa property. The value combines the notation of the format with the name of the method, for example PDFA1bBalanced.ini.

The following formats can be created:

Format	Notation
PDF	PDF
PDF/A1a	PDFA1a
PDF/A1b	PDFA1b
PDF/A2a	PDFA2a
PDF/A2u	PDFA2u
PDF/A3a	PDFA3a
PDF/A3u	PDFA3u

The following methods are available:

Method	Description
MaxQuality	Produces results with the best resolution. Speed and degree of compression are secondary.
MaxSpeed	Produces results according to the fastest method. Quality and degree of compression are secondary.
MinSize	Produces results with the smallest file size. Speed and quality are secondary.
Balanced	Produces results with a balanced ratio of quality, speed, and degree of compression.
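
As a sketch of how a value is composed, the following fragment combines the format notation PDFA2u with the method MinSize. The resulting file name is derived purely from the <format><method>.ini rule; it is assumed here that a profile with this name is available.

```yaml
# ocr-prod.yml (sketch): format PDF/A2u + method MinSize -> PDFA2uMinSize.ini
finereader:
  profile:
    pdfa: PDFA2uMinSize.ini  # name derived from the rule, assumed to exist
```

The text profile and the other settings are omitted because only values that differ from the defaults need to be specified.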

Customizing the text profile

Currently, only the text profile TextExport.ini is available for creating texts.

Specifying the File Transfer

Assign a value to the rest:transferPolicy property to specify how the files are transferred.

Transfer Types:
File transfer	Description
stream	Transfer via an HTTP stream
fileref	Transfer via file system reference
auto	The transfer type is selected automatically. The IP address of the endpoint of enaio® rendition-plus is used to determine whether enaio® rendition-plus and the 'ocr' service are running on the same computer. If they are, the transfer is done via file system references; otherwise, it is done via an HTTP stream.
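
A minimal sketch that switches the transfer from the default stream to file system references. It is an assumption here that enaio® rendition-plus and the 'ocr' service run on the same computer, as described for the auto transfer type.

```yaml
# ocr-prod.yml (sketch): transfer files via file system reference.
# Assumes both services run on the same computer.
rest:
  transferPolicy: fileref
```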

Integrating a Profile File

You can customize the profile or create your own profile file with additional settings and integrate it via the ocr-prod.yml configuration file.

Integration example:

finereader: 
  profile:   
    pdfa: 'file://d:/enaio/OCRconfig/custom_ocr.ini'
engine:   
  parallelJobs: 4
rest:   
  transferPolicy: 'auto'  
                                                            

Example of a profile file:

[PDFExportParams]
Scenario = PES_Balanced
PDFAComplianceMode = PCM_Pdfa_1b

[PrepareImageMode]
CorrectSkew = false

[PagePreprocessingParams]
CorrectOrientation=true
CorrectSkew=TSPV_No
CorrectGeometry=TSPV_No

[RecognizerParams]
TextLanguage = German,French,English
DetectLanguage = true
BalancedMode=true

[PageAnalysisParams]
DetectVerticalEuropeanText=true

[ObjectsExtractionParams]
DetectTextOnPictures=true
                                                            

You can find information on the settings in the ABBYY FineReader documentation.

Examples of settings areas:

[PDFExportParams]	Setting the parameters for exporting recognized text into PDF format.
[PagePreprocessingParams]	Setting parameters for preprocessing pages.
[PrepareImageMode]	Setting parameters for optimizing images prior to processing.
[RecognizerParams]	Setting recognizer parameters such as language settings.
[PageAnalysisParams]	Setting parameters for layout analyses.
[ObjectsExtractionParams]	Setting parameters for extracting objects.