Installing an OCR Component
The 'ocrservice' microservice includes the text recognition software ABBYY FineReader, which generates text documents for full-text indexing and PDF files with hidden text from image documents. For this to work, the 'Index file for full-text search' property must be set for the relevant object types.
Tesseract and ABBYY FineReader are available as OCR components.
Feature | FineReader | Tesseract |
---|---|---|
License | SMUA license provided by OPTIMAL SYSTEMS | License-free (Apache License, Version 2.0) |
Installation | Installation via a setup file included as part of the installation data | Part of the yuuvis® RAD service-manager installation |
Languages | Additional fees may apply for further languages | Additional languages available free of charge |
Supported image formats | FineReader documentation | Tesseract documentation |
PDF rendition with hidden text | Yes | Yes |
PDF/A rendition | Yes | No |
Barcode recognition | Yes | No |
Number of cores | May be subject to additional fees depending on the license | Not license-based, no additional costs; default: 4 |
Tesseract
Tesseract is installed as part of yuuvis® RAD service-manager. Tesseract is preconfigured if the corresponding option is activated.
The <service-manager>\config\ocr-prod.yml configuration file is created during installation. It contains the languages that were specified during installation and can be edited to include additional or different languages.
Example:
tesseract:
  languages: deu,eng
The 'ocrservice' service for Tesseract is included in the <service-manager>\config\servicewatcher-sw.yml configuration file:
- name: ocrservice
  type: microservice
  profiles: prod,cloud,red,tesseract
  instances: 1
  memory: 512M
  port: 7241-7250
  path: ${appBase}/ocrservice/ocrservice-app.jar
  env:
    ProgramData: null
    ALLUSERSPROFILE: null
    #OMP_THREAD_LIMIT: 4
The OCR engine must be configured in the route.properties configuration file located in the \rendition-plus\webapps\osrenditioncache\WEB-INF\classes\config\ directory:
ocr-engine=finereader
The finereader value activates OCR in general, regardless of whether ABBYY FineReader or Tesseract is used as the OCR component.
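To confirm which value is actually set, the file can be read back with a few lines of Python. This is only a minimal sketch and not part of yuuvis® RAD: the function name `read_ocr_engine` is hypothetical, and it assumes that route.properties consists of plain `key=value` lines with `#` or `!` comment lines.

```python
from typing import Optional


def read_ocr_engine(properties_text: str) -> Optional[str]:
    """Return the value of 'ocr-engine' from properties-style text, or None if absent."""
    result = None
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ocr-engine":
            # Keep the last occurrence, as later entries override earlier ones.
            result = value.strip()
    return result
```

Passing the contents of route.properties, read as text, should return finereader for the setting shown above.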
Languages for Tesseract
The following languages are available for Tesseract:
Abbreviation | Language |
---|---|
chi_sim | Chinese (simplified) |
chi_sim_vert | Chinese vertical (simplified) |
deu | German |
eng | English |
fra | French |
ind | Indonesian |
ita | Italian |
jpn | Japanese |
jpn_vert | Japanese vertical |
kor | Korean |
kor_vert | Korean vertical |
msa | Malay |
spa | Spanish |
tha | Thai |
The language files for these languages are installed in the \<service-manager>\data\tesseract_data directory. Other languages are available for download and must be copied to this directory.
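After editing the language list, it can be useful to check that a language file exists for every configured language. The following is a minimal sketch under stated assumptions: the helper `missing_language_files` is hypothetical, the language list is passed in the same comma-separated form as in ocr-prod.yml, and the files are assumed to follow Tesseract's usual `<abbreviation>.traineddata` naming.

```python
from pathlib import Path
from typing import List


def missing_language_files(languages: str, data_dir: Path) -> List[str]:
    """Return the language codes that have no <code>.traineddata file in data_dir."""
    # Split the comma-separated list and ignore empty entries and stray spaces.
    codes = [code.strip() for code in languages.split(",") if code.strip()]
    return [
        code
        for code in codes
        if not (data_dir / f"{code}.traineddata").is_file()
    ]
```

Called with the configured language list and the path to the tesseract_data directory, the function returns the abbreviations whose files still have to be downloaded and copied; an empty list means that all of them are present.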
ABBYY FineReader
To install ABBYY FineReader, you need a license file, which you can purchase from OPTIMAL SYSTEMS and copy into the \bin and \bin64 directories of the ABBYY FineReader installation.
ABBYY FineReader must be installed on a computer on which yuuvis® RAD service-manager is installed with the 'ocr', 'adminservice', 'discoveryservice', and 'renditionsidecar' services.
ABBYY FineReader is installed via the setup.exe application from the \finereader installation directory. Follow the installation dialogs.
After installation, settings for PDF creation in particular can be adapted via the ocr-prod.yml configuration file located in the \<service-manager>\config\ directory.
Integration into the Microservice Infrastructure
To integrate ABBYY FineReader, follow these steps:
- Enter the number of instances in the servicewatcher-sw.yml configuration file located in the \<service-manager>\config\ directory:
    - name: ocrservice
      type: microservice
      profiles: prod,cloud,red
      instances: 0
      memory: 128M
      port: 7241-7250
      path: ${appBase}/ocrservice/ocrservice-app.jar
      env:
        ProgramData: null
        ALLUSERSPROFILE: null
- If yuuvis® RAD rendition-plus is not installed on the same computer, enter the IP address of the yuuvis® RAD rendition-plus computer in the application-red.yml configuration file located in the \<service-manager>\config\ directory:
    yuuvis.rendition.server: <host>:8090
- Configure ABBYY FineReader as the OCR component in the route.properties configuration file located in the \rendition-plus\webapps\osrenditioncache\WEB-INF\classes\config\ directory:
    ocr-engine=finereader
The finereader value activates OCR in general, regardless of whether ABBYY FineReader or Tesseract is used as the OCR component.
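When yuuvis® RAD rendition-plus runs on a different computer, it may help to confirm that the host and port entered for yuuvis.rendition.server can be reached from the OCR computer. The sketch below is an assumption-laden illustration, not a yuuvis® RAD tool: the helper `is_reachable` is hypothetical, and a successful TCP connection only shows that the port is open, not that rendition-plus itself is answering.

```python
import socket


def is_reachable(host: str, port: int = 8090, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

The default port 8090 is taken from the example entry above; pass a different port if the installation uses one.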
Uninstalling
You can uninstall ABBYY FineReader via the Windows Control Panel.
Updates
For information on updating components, see Release Information.