'ocr' Service
Together with the text recognition software ABBYY FineReader, the 'ocr' service converts image documents into text documents that are used for full-text indexing. It can also be used to create PDF files with hidden text, various PDF/A formats, and highly compressed PDFs.
Configuration
The 'ocr' service works with the following default settings:
Setting | Default |
---|---|
PDF profile: format | PDF/A1b |
PDF profile: method | Balanced |
Text profile | Predefined: TextExport.ini |
File transfer to enaio® rendition-plus | Stream |
Number of cores for ABBYY FineReader | 1 |
These settings can be changed via the ocr-prod.yml configuration file located in the \servicemanager\config\ directory.
Example of a configuration in the ocr-prod.yml file:
finereader:
  profile:
    pdfa: PDFA1bBalanced.ini
    text: TextExport.ini
  engine:
    numberOfCores: 1
rest:
  transferPolicy: stream
The example corresponds to the default settings.
Only the settings that differ from the default settings need to be specified.
Customizing the PDF profile
To customize the PDF format and the method, assign a value to the finereader:profile:pdfa property. The value is composed of the format notation and the method: <format><method>.ini
The following formats can be created:
Format | Notation |
---|---|
PDF | PDF |
PDF/A1a | PDFA1a |
PDF/A1b | PDFA1b |
PDF/A2a | PDFA2a |
PDF/A2u | PDFA2u |
PDF/A3a | PDFA3a |
PDF/A3u | PDFA3u |

The following methods are available:

Method | Description |
---|---|
MaxQuality | Produces results with the best resolution. Speed and degree of compression are secondary. |
MaxSpeed | Produces results according to the fastest method. Quality and degree of compression are secondary. |
MinSize | Produces results with the smallest file size. Speed and quality are secondary. |
Balanced | Produces results with a balanced ratio of quality, speed, and degree of compression. |
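
As an illustration of the naming scheme, the following fragment combines the notation PDFA2u with the method MinSize. The combination is chosen only as an example and assumes that a profile exists for every pair of format and method listed above.

```yaml
# ocr-prod.yml (sketch): create PDF/A2u files with the smallest file size.
# <format> = PDFA2u, <method> = MinSize  ->  PDFA2uMinSize.ini
finereader:
  profile:
    pdfa: PDFA2uMinSize.ini
```

Because only settings that differ from the defaults need to be specified, the remaining properties can be left out of the file.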
Customizing the text profile
Currently, only the text profile TextExport.ini is available for creating texts.
Assign a value to the rest:transferPolicy property to specify how files are transferred to enaio® rendition-plus.
File transfer | Description |
---|---|
stream | Transfer via an HTTP stream |
fileref | Transfer via file system reference |
auto | The transfer type is selected automatically. The IP address of the endpoint of enaio® rendition-plus is used to determine whether enaio® rendition-plus and the 'ocr' service are running on the same computer. If they are, the transfer is done via file system references; otherwise, it is done via an HTTP stream. |
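
For example, if enaio® rendition-plus and the 'ocr' service are known to run on the same computer, the transfer could be fixed to file system references instead of relying on automatic detection. The fragment below is a sketch under that assumption, not a recommended setting.

```yaml
# ocr-prod.yml (sketch): always transfer files via file system reference.
# Assumption: both services run on the same computer and can reach the same files.
rest:
  transferPolicy: fileref
```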
The maximum number of cores ABBYY FineReader can work with depends on the license that was purchased.
Entry: finereader:engine:numberOfCores: <number>
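
Written out in the ocr-prod.yml file, the entry might look as follows. The value 2 is only an example and assumes a license that permits at least two cores.

```yaml
# ocr-prod.yml (sketch): let ABBYY FineReader use two cores.
# Assumption: the purchased license allows at least 2 cores.
finereader:
  engine:
    numberOfCores: 2
```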
Integrating a Profile File
You can customize the profile or create your own profile file with additional settings and integrate it via the ocr-prod.yml configuration file.
Integration example:
finereader:
  profile:
    pdfa: 'file://d:/enaio/OCRconfig/custom_ocr.ini'
  engine:
    numberOfCores: 4
rest:
  transferPolicy: 'auto'
Example of a profile file:
[PDFExportParams]
Scenario = PES_Balanced
PDFAComplianceMode = PCM_Pdfa_1b
[PrepareImageMode]
CorrectSkew = false
[PagePreprocessingParams]
CorrectOrientation=true
CorrectSkew=TSPV_No
CorrectGeometry=TSPV_No
[RecognizerParams]
TextLanguage = German,French,English
DetectLanguage = true
BalancedMode=true
[PageAnalysisParams]
DetectVerticalEuropeanText=true
[ObjectsExtractionParams]
DetectTextOnPictures=true
You can find information on the settings in the ABBYY FineReader documentation.
Examples of settings areas:
Section | Description |
---|---|
[PDFExportParams] | Setting the parameters for exporting recognized text into PDF format. |
[PagePreprocessingParams] | Setting parameters for preprocessing pages. |
[PrepareImageMode] | Setting parameters for optimizing images prior to processing. |
[RecognizerParams] | Setting recognizer parameters such as language settings. |
[PageAnalysisParams] | Setting parameters for layout analyses. |
[ObjectsExtractionParams] | Setting parameters for extracting objects. |