How to run OCR on primary documents
Prerequisites
Before you start, make sure you have:
- Admin access to your Uwazi instance, to turn on the OCR trigger (an editor role is enough to run OCR once the trigger is on)
- A primary document saved as a PDF, with a supported language set
OCR has two parts. First, an admin turns on the OCR trigger in settings. Then an admin or editor runs OCR on a PDF from the document viewer. The sections below cover each part.
Steps
Turn on the OCR trigger
- Go to Settings > Collection.
- Find the Services card.
  If you don't see this card, your instance isn't connected to an OCR service.
- Turn on the Document OCR trigger toggle.
- Select Save.
Run OCR on a document
- Open a PDF primary document in the document viewer.
- Find the OCR control in the toolbar at the top of the document. It sits next to the page navigator and the Plain text link.
  The control appears only for admin and editor roles, and only when the trigger is on.
- Check the control's label. If it reads Unsupported OCR language, set a supported language on the document first, then reopen it.
- Select OCR PDF.
- The control changes to In OCR queue. The OCR service now works in the background, so you can leave the page.
- Wait for the control to show OCR completed with a check mark. The document reloads on its own.
OCR can't be undone from the interface. Uwazi keeps your original PDF as an attachment on the same entity, and moves any linked references to the new file.
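If you track OCR runs from a script rather than by watching the page, the waiting step can be sketched as a simple polling loop. This is a minimal sketch, not Uwazi's own code: `get_status` is a hypothetical callable that you would supply to return the control's current label, and the interval and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable

# Labels as they appear on the OCR control in this guide.
IN_QUEUE = "In OCR queue"
COMPLETED = "OCR completed"
ERROR = "OCR error"


def wait_for_ocr(
    get_status: Callable[[], str],  # hypothetical: returns the current control label
    interval_s: float = 30.0,       # assumed polling interval
    timeout_s: float = 3600.0,      # assumed upper bound on waiting
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document leaves the queue, or raise after timeout_s."""
    waited = 0.0
    while True:
        status = get_status()
        if status != IN_QUEUE:
            # Typically COMPLETED or ERROR; the caller decides what to do next.
            return status
        if waited >= timeout_s:
            raise TimeoutError(
                "Still in OCR queue; the OCR service or its worker may be down"
            )
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.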
Reading the OCR control
The control shows the document's current state. Use this table to read it:
| What you see | What it means |
|---|---|
| OCR PDF | Ready to start. Select it to begin. |
| In OCR queue | Submitted. The document is waiting for the service or being processed by it. |
| OCR completed | Done. The document is now text-searchable. |
| Unsupported OCR language | The service can't read the document's language. |
| OCR error | The service couldn't process the document. Try again. |
If the control stays on In OCR queue much longer than similar documents usually take, the OCR service or its background worker may be down. Ask whoever maintains your instance to check both.
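For support scripts or internal checklists, the states above can be kept in a small lookup. This is an illustrative sketch only: the labels and meanings come from the table in this section, while the `ControlState` type, the `explain` helper, and the suggested next steps are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlState:
    meaning: str
    next_step: str


# Labels mirror the OCR control; next_step wording is an assumption based on this guide.
OCR_CONTROL = {
    "OCR PDF": ControlState("Ready to start.", "Select it to begin."),
    "In OCR queue": ControlState("Submitted.", "Wait; you can leave the page."),
    "OCR completed": ControlState("Done.", "The document is now text-searchable."),
    "Unsupported OCR language": ControlState(
        "The service can't read the document's language.",
        "Set a supported language, then reopen the document.",
    ),
    "OCR error": ControlState(
        "The service couldn't process the document.", "Try again."
    ),
}


def explain(label: str) -> str:
    """Return a one-line explanation for a control label."""
    state = OCR_CONTROL.get(label)
    if state is None:
        return f"Unknown label: {label!r}"
    return f"{label}: {state.meaning} {state.next_step}"
```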
Result
Your primary document is now a text-searchable PDF. Readers can search and select its text, and the original scan stays on the entity as an attachment.
Set the right language on each document before you run OCR. The service uses that language to recognize the text, and an unsupported language blocks the run.
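When you prepare many documents at once, a quick language check before you start can save failed runs. The sketch below assumes you have exported a list of (file name, language code) pairs and that you know which languages your OCR service supports; the function name, the sample codes, and the sample file names are all hypothetical, and the real list of supported languages depends on your instance.

```python
from typing import Iterable


def split_by_ocr_language(
    documents: Iterable[tuple[str, str]],  # (file name, language code) pairs
    supported: set[str],                   # languages your OCR service supports
) -> tuple[list[str], list[str]]:
    """Split documents into those ready for OCR and those blocked by language."""
    ready: list[str] = []
    blocked: list[str] = []
    for name, language in documents:
        if language in supported:
            ready.append(name)
        else:
            blocked.append(name)
    return ready, blocked


# Example with placeholder data (not a real list of supported languages).
ready, blocked = split_by_ocr_language(
    [("report.pdf", "en"), ("letter.pdf", "xx")],
    supported={"en", "es"},
)
print("Ready for OCR:", ready)
print("Fix language first:", blocked)
```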
See also
- How to extract metadata from your documents — pull field values from documents once they hold text
- How to extract paragraphs from documents — split a text-searchable document into paragraph entities
- How machine-learning (ML) extraction works — why OCR comes before extraction in the pipeline