'ocr' Service


Together with an OCR component, the 'ocr' service converts image documents into text documents that are used for full-text indexing. It can also create PDF files with hidden text, PDFs in various PDF formats, and highly compressed PDFs.

ABBYY FineReader or Tesseract can be integrated as an OCR component.

Tesseract is enabled by default for new installations of enaio®.

Tesseract is part of the 'ocr' service and does not need to be installed separately. ABBYY FineReader needs to be installed separately.

Enabling an OCR Component

For new installations of enaio®, Tesseract is enabled by assigning the tesseract profile to the 'ocr' service in the servicewatcher-sw.yml file, which is located in the \service-manager\config\ directory.

An existing servicewatcher-sw.yml file is not changed during updates and upgrades.

Example:

- name: ocrservice
  type: microservice
  profiles: prod,cloud,blue,tesseract
  instances: 1
  memory: 512M
  port: 7241-7250
  path: ${appBase}/ocrservice/ocrservice-app.jar

To enable ABBYY FineReader, first remove the tesseract profile from the profiles entry.
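As a sketch of this change, the ocrservice entry from the example above might look as follows once the tesseract profile has been removed. All other values are simply carried over from that example and may differ in an actual installation.

```yaml
# servicewatcher-sw.yml (excerpt): 'ocr' service without the tesseract profile,
# so that ABBYY FineReader can be used as the OCR component.
# Values other than "profiles" are copied from the example above.
- name: ocrservice
  type: microservice
  profiles: prod,cloud,blue
  instances: 1
  memory: 512M
  port: 7241-7250
  path: ${appBase}/ocrservice/ocrservice-app.jar
```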

Configuring Tesseract

Tesseract is configured via the ocr-prod.yml configuration file, which is created in the \service-manager\config\ directory.

Example:

engine:
  parallelJobs: 2
tesseract:
  pdfFormat: "PDF_A_1B"

The values in the example correspond to the default values.

Parameters:

Parameter: engine.parallelJobs
Value: The maximum number of parallel jobs is specified by the 'TES' license.

  Default: 2

  If there are several installations of enaio® service-manager with the 'ocr' service, the number is distributed across the installations.

Parameter: tesseract.pdfFormat
Value: PDF format

  • PDF
  • PDF_A_1B (default)
  • PDF_A_1A
  • PDF_A_2A
  • PDF_A_2U
  • PDF_A_3A
  • PDF_A_3U
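For illustration, a configuration that deviates from the defaults might look like the following sketch. The values 4 and PDF_A_2U are chosen arbitrarily here; the number of parallel jobs that can actually be used depends on the 'TES' license.

```yaml
# ocr-prod.yml: Tesseract configuration with non-default values (illustrative only)
engine:
  parallelJobs: 4          # assumption: the 'TES' license permits 4 parallel jobs
tesseract:
  pdfFormat: "PDF_A_2U"    # one of the PDF format values listed above
```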

The following languages are supported: English, French, German, Italian, and Spanish.

Configuring ABBYY FineReader

The 'ocr' service works with the following default settings:

PDF profile: format                       PDF/A1b
PDF profile: method                       Balanced
Text profile                              Predefined: TextExport.ini
File transfer to enaio® rendition-plus    Stream
Number of cores for ABBYY FineReader      1

These settings can be changed via the ocr-prod.yml configuration file located in the \service-manager\config\ directory.

Example of a configuration in the ocr-prod.yml file:

finereader:
  profile:
    pdfa: PDFA1bBalanced.ini
    text: TextExport.ini
  engine:
    parallelJobs: 1
rest:
  transferPolicy: stream

The example corresponds to the default settings.

Only the settings that differ from the default settings need to be specified.
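For example, assuming only the number of parallel jobs is to be changed, the ocr-prod.yml file could be reduced to the following sketch. The value 2 is an arbitrary illustration; all settings that are not listed keep their defaults.

```yaml
# ocr-prod.yml: only the setting that differs from the defaults is specified
finereader:
  engine:
    parallelJobs: 2   # illustrative value; the default is 1
```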

Integrating a Profile File

You can customize the profile or create your own profile file with additional settings and integrate it via the ocr-prod.yml configuration file.

Integration example:

finereader:
  profile:
    pdfa: 'file://d:/enaio/OCRconfig/custom_ocr.ini'
engine:
  parallelJobs: 4
rest:
  transferPolicy: 'auto'

Example of a profile file:

[PDFExportParams]
Scenario = PES_Balanced
PDFAComplianceMode = PCM_Pdfa_1b

[PrepareImageMode]
CorrectSkew = false

[PagePreprocessingParams]
CorrectOrientation = true
CorrectSkew = TSPV_No
CorrectGeometry = TSPV_No

[RecognizerParams]
TextLanguage = German,French,English
DetectLanguage = true
BalancedMode = true

[PageAnalysisParams]
DetectVerticalEuropeanText = true

[ObjectsExtractionParams]
DetectTextOnPictures = true

You can find information on the settings in the ABBYY FineReader documentation.

Examples of settings areas:

[PDFExportParams]            Setting parameters for exporting recognized text into PDF format.
[PagePreprocessingParams]    Setting parameters for preprocessing pages.
[PrepareImageMode]           Setting parameters for optimizing images prior to processing.
[RecognizerParams]           Setting recognizer parameters such as language settings.
[PageAnalysisParams]         Setting parameters for layout analyses.
[ObjectsExtractionParams]    Setting parameters for extracting objects.