When scanning to SharePoint, capturing pre-existing images, and creating searchable PDFs, there are several features you should make sure your capture software supports. Below is a laundry list:
- PDF Image + Hidden Text is the preferred format. Most scanning devices/applications will allow you to create PDFs, but note that these are image PDFs, and not searchable. The de facto standard right now in the imaging industry is the PDF Image + Hidden Text format. This requires a capable OCR engine to produce the text layer, and is what I call a “suitcase” document: it contains a pristine image, and a hidden text layer for search.
- Ensure your document capture software can import PDF files. Just about every organization has pre-existing scanned PDF files. In almost every case, these are purely PDF Image format, and cannot be searched or crawled through the PDF iFilter in SharePoint. If your capture application can import and process PDFs, you have the ability to harvest these documents, extract metadata, and OCR them to create searchable PDFs in the PDF Image + Hidden Text format.
- Require the ability to create and populate custom PDF headers. PDF headers allow custom metadata to be built into the core PDF file. Why is this necessary? Once again, I go back to the “suitcase” analogy: you always want to pack everything you need. If you create a searchable PDF and pack metadata into the headers, the file is now an all-inclusive data package. Headers speed up search, and provide flexibility if you ever export files or import your PDFs into another system.
- Require support for the latest standard. PDF/A is an ISO standard (ISO 19005) whose goal was to build a file format suitable for long-term archiving. Ensure your capture software can produce it.
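
To make the "harvest your existing PDFs" point concrete, here is a minimal Python sketch of the triage step: deciding which imported PDFs are image-only and need to go through OCR. It assumes the text of each page has already been pulled out by whatever PDF library or capture tool you use. The names `PdfTriage` and `triage_pdf`, the `min_chars` threshold, and the sample file names are all hypothetical and chosen for illustration only; this is not how any particular capture product does it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PdfTriage:
    """Summary of how much of a PDF already carries a text layer."""
    name: str
    pages: int
    pages_with_text: int

    @property
    def needs_ocr(self) -> bool:
        # Any page without usable text means the file is not fully searchable.
        return self.pages_with_text < self.pages


def triage_pdf(name: str, page_texts: List[str], min_chars: int = 20) -> PdfTriage:
    """Classify one PDF from the text already extracted from each page.

    page_texts is assumed to come from an external extraction step.
    min_chars is an arbitrary threshold (an assumption) that ignores
    stray characters so they do not count as a real text layer.
    """
    with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    return PdfTriage(name=name, pages=len(page_texts), pages_with_text=with_text)


# Hypothetical sample data: one image-only scan and one searchable PDF.
scanned = triage_pdf("invoice_scan.pdf", ["", "  "])
searchable = triage_pdf("contract.pdf", ["This agreement is made between the parties."])

# Only the image-only file is queued for OCR.
ocr_queue = [doc.name for doc in (scanned, searchable) if doc.needs_ocr]
```

In this sketch a file is flagged as soon as a single page lacks text, because a partially searchable document still leaves gaps when SharePoint crawls it. Whether you re-OCR the whole file or only the missing pages depends on what your capture software allows.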