How to run OCR on primary documents

Prerequisites

Before you start, make sure you have:

Admin access to your Uwazi instance
A primary document saved as a PDF, with a supported language set

OCR has two parts. First, an admin turns on the OCR trigger in settings. Then an admin or editor runs OCR on a PDF from the document viewer. The sections below cover each part.

Steps

Turn on the OCR trigger

Go to Settings > Collection.
Find the Services card.

If you don't see this card, your instance isn't connected to an OCR service.
Turn on the Document OCR trigger toggle.
Select Save.

Run OCR on a document

Open a PDF primary document in the document viewer.
Find the OCR control in the toolbar at the top of the document. It sits next to the page navigator and the Plain text link.

The control appears only for admin and editor roles, and only when the trigger is on.
Check the control's label. If it reads Unsupported OCR language, set a supported language on the document first, then reopen it.
Select OCR PDF.
The control changes to In OCR queue. The OCR service now works in the background, so you can leave the page.
Wait for the control to show OCR completed with a check mark. The document reloads on its own.
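
The visibility rule from the steps above (admin or editor role, and the trigger turned on) can be written as a single check. This is a minimal sketch for illustration only: the `Role` type and the `showsOcrControl` function are hypothetical names, not part of Uwazi's code or API.

```typescript
// Hypothetical types and names, used only to illustrate the rule in this guide.
type Role = 'admin' | 'editor' | 'other';

// The OCR control appears only when BOTH conditions hold:
// the user is an admin or editor, and the Document OCR trigger is on.
function showsOcrControl(role: Role, triggerOn: boolean): boolean {
  const roleAllowed = role === 'admin' || role === 'editor';
  return roleAllowed && triggerOn;
}
```

If you expect the control and don't see it, check both conditions: your role, and whether an admin has saved the trigger setting.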

note

OCR can't be undone from the interface. Uwazi keeps your original PDF as an attachment on the same entity, and moves any linked references to the new file.

Reading the OCR control

The control shows the document's current state. Use this table to read it:

What you see	What it means
OCR PDF	Ready to start. Select it to begin.
In OCR queue	Submitted. The service is processing the document.
OCR completed	Done. The document is now text-searchable.
Unsupported OCR language	The service can't read the document's language.
OCR error	The service couldn't process the document. Try again.

If the control stays on In OCR queue for a long time, the OCR service or its background worker may be down.
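
For readers who prefer to see the table as data, the labels and the action each one calls for can be sketched as a lookup. The label strings come from the table above. The `OcrLabel` type, the `NEXT_STEP` map, and the `nextStep` function are hypothetical names chosen for this example and do not reflect how Uwazi stores OCR status internally.

```typescript
// Labels exactly as they appear on the OCR control.
type OcrLabel =
  | 'OCR PDF'
  | 'In OCR queue'
  | 'OCR completed'
  | 'Unsupported OCR language'
  | 'OCR error';

// Hypothetical lookup: what to do when you see each label.
const NEXT_STEP: Record<OcrLabel, string> = {
  'OCR PDF': 'Select the control to start OCR.',
  'In OCR queue': 'Wait. The service is processing the document.',
  'OCR completed': 'Nothing to do. The document is text-searchable.',
  'Unsupported OCR language': 'Set a supported language, then reopen the document.',
  'OCR error': 'Try again.',
};

// Returns the suggested action for a given control label.
function nextStep(label: OcrLabel): string {
  return NEXT_STEP[label];
}
```

Because `Record<OcrLabel, string>` requires an entry for every label, the compiler flags the map if a label is added to the type without a matching action.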

Result

Your primary document is now a text-searchable PDF. Readers can search and select its text, and the original scan stays on the entity as an attachment.

tip

Set the right language on each document before you run OCR. The service uses that language to recognize the text, and an unsupported language blocks the run.
