Workflow Overview

The overview below shows how the OCR Library types and functions relate to each other:

OCR Preparation

•PXODocument is the main document structure that the PDF-XChange PRO SDK uses.

•OCR_Init sets up a new PXODocument in order to load input files and perform OCR.

•OCR_LoadA and OCR_LoadW load input files into the PXODocument object’s input layer.

•OCR_GetText processes a PXODocument and then formats and returns the plain text.

•OCR_MakeSearchable processes a PXODocument and then generates a new output layer that contains searchable PDF results.

•OCR_SaveW and OCR_SaveA can be used to save these results.

•OCR_GetField performs OCR on a PXODocument and then formats and returns the plain text.

•OCR_GetFields performs multiple OCR operations on a PXODocument and then formats and returns the plain text.

•PXO_FieldInputFlags is an enumerated type that determines the style of input coordinates that OCR_GetField and OCR_GetFields use.

•OCR_SetCallback sets the callback function for the PDF rasterization process of PXODocument structures.

•PXO_CallbackStage is an enumerated type that is passed to the user-defined callback function that OCR_SetCallback registers.

•OCRp_Page performs OCR on a specified page of a PXODocument, then returns the results in a structure that can be queried for text layout details.

•OCRp_Field performs OCR on a specified area of a PXODocument, then returns the results in a structure that can be queried for text layout details.

•OCR_GetNumInputPages returns the number of pages in the input layer of the PXODocument.

•OCR_Delete deletes the PXODocument, which is a necessary step once all functions are complete.

•PXO_Options is the OCR options input structure that determines variables for the OCR process. It utilizes:

•PXO_Language to determine the language used for OCR.

•OCR_RegionMode to improve the accuracy of page segmentation for specific text formats.

•OCR_ImageProcessingFlags to enable additional operations when images are processed.

•PXO_Pagelist is an input type used to store PDF page numbers for OCR operations.

•OCR_NewPagelist initializes a new PXO_Pagelist structure.

•OCR_AddPage adds a new input document page number to the PXO_Pagelist structure.

•OCR_NumPages returns the number of input page numbers stored in the PXO_Pagelist structure.

•OCR_GetPageByIndex returns a specified input document page number from the PXO_Pagelist structure.

•OCR_PagesToInputFields duplicates an input field for the pages that the input PXO_Pagelist structure specifies.

•OCR_ReleasePagelist releases the memory that PXO_Pagelist structures use, which is a necessary step once all functions are complete.

•PXO_InputField is an input structure for OCR.

•OCR_NewInputFields initializes a new PXO_InputField structure.

•OCR_GetInputFieldByIndex returns the PXO_InputField stored at a specified zero-based index in its InFields structure.

•PXO_FieldInputFlags is an enumerated type that specifies the flags used to determine the style of input coordinates that PXO_InputField, OCR_GetField and OCR_GetFields use.

•OCR_ReleaseInputFields deletes PXO_InputField structures, which is a necessary step once all functions are complete.

•PXO_InputFields contains a list of PXO_InputField structures for zonal/regional OCR.

•OCR_AddInputField adds a new PXO_InputField to a PXO_InputFields structure.

•OCR_LoadTemplateW loads a list of input fields into a PXO_InputFields structure from an ASCII text input file.

•OCR_SaveTemplateW saves a list of input fields from a PXO_InputFields structure to a template file.

•OCR_NumInputFields returns the number of input fields stored in PXO_InputFields.

•OCR_ReleaseInputFields frees the memory that PXO_InputFields structures use, which is a necessary step once all functions are complete.

•OCR_RasterPageSettings converts PDF coordinates to/from rasterized page image coordinates, which is a necessary step for some of the Low-Level Functions.

•OCRp_RasterRectToPDF utilizes this structure.

•OCRp_Page and OCRp_Field return OCR_RasterPageSettings.

•OCRp_RasterRectToPDF converts results from OCRp_Page or OCRp_Field into PDF coordinates to make it possible to query them for text layout details.

•OCR_SymbolBox is a structure that contains a single character and, when available, descriptive information from the OCR process. It uses OCR_Baseline to store the baseline for text elements.
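The conversion between rasterized page image coordinates and PDF coordinates mentioned above can be illustrated with a minimal sketch. This is not the SDK's implementation: the SketchRasterSettings and SketchRect types and the sketch_raster_rect_to_pdf function are hypothetical stand-ins, and the actual layout of OCR_RasterPageSettings and the signature of OCRp_RasterRectToPDF are not given in this overview. The sketch assumes a raster image with its origin at the top-left and y increasing downward, a PDF page with its origin at the bottom-left and 72 units per inch, and no page rotation or crop offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in; NOT the SDK's OCR_RasterPageSettings layout. */
typedef struct {
    double dpi;        /* rasterization resolution in pixels per inch */
    int    height_px;  /* height of the rasterized page in pixels     */
} SketchRasterSettings;

/* Hypothetical rectangle type used for both pixel and PDF coordinates. */
typedef struct {
    double left, top, right, bottom;
} SketchRect;

/*
 * Convert a rectangle from raster pixels (top-left origin, y down) to
 * PDF units (bottom-left origin, y up, 72 units per inch).
 * Assumes no page rotation and no crop-box offset.
 */
SketchRect sketch_raster_rect_to_pdf(const SketchRasterSettings *s, SketchRect px)
{
    assert(s != NULL && s->dpi > 0.0);

    const double scale = 72.0 / s->dpi;  /* PDF units per pixel */
    SketchRect pdf;

    pdf.left   = px.left  * scale;
    pdf.right  = px.right * scale;
    /* Flip the vertical axis: pixel row 0 is the top of the page. */
    pdf.top    = (s->height_px - px.top)    * scale;
    pdf.bottom = (s->height_px - px.bottom) * scale;
    return pdf;
}
```

With a page rasterized at 144 DPI and 1584 pixels high (11 inches), a pixel rectangle whose left edge is at x = 144 maps to a PDF left edge of 72 units, that is, one inch from the left of the page.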

OCR Results Hierarchy

Results are returned in a hierarchy after the OCR process is performed:

•PXO_Page is the top level of the OCR results hierarchy.

•PXO_Page may contain PXO_Region members (see below).

•OCRp_PageText returns plain text from specified PXO_Page structures.

•OCRp_RegionCountFromPage returns the number of regions in the specified PXO_Page structure.

•OCRp_GetRegionFromPage returns the requested output region from the specified PXO_Page structure.

•OCRp_FreePage is used to delete PXO_Page structures and free the memory that they use.

•PXO_Region is the second level of the OCR results hierarchy.

•OCRp_GetRegionFromPage can be used to return PXO_Region members. (N.B. OCRp_FreePage must be used to free memory when this process is complete, which will also delete the associated PXO_Region members.)

•PXO_Region structures may contain OCR_SymbolBox members, which OCRp_GetSymbolFromRegion can be used to access.

•OCRp_SymbolCountFromRegion returns the number of symbols in the specified PXO_Region structure.

•OCRp_GetSymbolFromRegion returns a requested symbol from the specified PXO_Region structure.
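The page, region and symbol levels described above follow a "count, then get by index" access pattern, which can be sketched as follows. Every type and function in this block is a hypothetical stand-in with a sketch_ or Sketch prefix; none of them is the SDK's API, and the real signatures of OCRp_RegionCountFromPage, OCRp_GetRegionFromPage, OCRp_SymbolCountFromRegion and OCRp_GetSymbolFromRegion are not given in this overview. The sketch only shows the shape of the nested traversal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three levels of the results hierarchy. */
typedef struct { char ch; } SketchSymbol;
typedef struct { const SketchSymbol *symbols; int symbol_count; } SketchRegion;
typedef struct { const SketchRegion *regions; int region_count; } SketchPage;

/* Hypothetical accessors that mirror the count/get-by-index pattern. */
static int sketch_region_count(const SketchPage *p) { return p->region_count; }

static const SketchRegion *sketch_get_region(const SketchPage *p, int i)
{
    return (i >= 0 && i < p->region_count) ? &p->regions[i] : NULL;
}

static int sketch_symbol_count(const SketchRegion *r) { return r->symbol_count; }

static const SketchSymbol *sketch_get_symbol(const SketchRegion *r, int i)
{
    return (i >= 0 && i < r->symbol_count) ? &r->symbols[i] : NULL;
}

/*
 * Walk page -> regions -> symbols and copy the characters into out,
 * separating regions with a newline. Output is truncated to fit cap
 * and is always NUL-terminated when cap > 0.
 * Returns the number of characters written, excluding the terminator.
 */
size_t sketch_collect_text(const SketchPage *page, char *out, size_t cap)
{
    assert(page != NULL && out != NULL);
    if (cap == 0)
        return 0;

    size_t n = 0;
    int regions = sketch_region_count(page);

    for (int r = 0; r < regions; ++r) {
        const SketchRegion *region = sketch_get_region(page, r);
        int symbols = sketch_symbol_count(region);

        for (int s = 0; s < symbols && n + 1 < cap; ++s)
            out[n++] = sketch_get_symbol(region, s)->ch;

        if (r + 1 < regions && n + 1 < cap)
            out[n++] = '\n';
    }
    out[n] = '\0';
    return n;
}
```

In this sketch nothing is heap-allocated, so there is no cleanup step. With the SDK itself, the page structure would be released with OCRp_FreePage once the traversal is finished, as noted above.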