PDFs and SharePoint: What is recommended?

When scanning to SharePoint, capturing pre-existing images, and creating searchable PDFs, there are several things you should make sure you can enable in your capture software.  Below is a laundry list:

  1. PDF + Hidden Text is the preferred format.  Most scanning devices/applications will allow you to create PDFs, but note that these are image PDFs, and not searchable.  The de facto standard right now in the imaging industry is the PDF image + Hidden Text format.  This requires a capable OCR engine to produce the text layer, and is what I call a “suitcase” document: it contains a pristine image, and a hidden text layer for search.  
  2. Ensure your document capture software can import PDF files.  Just about every organization has pre-existing scanned PDF files.  In almost every case, these are purely PDF Image format, and cannot be searched, or crawled through the PDF iFilter in SharePoint.  If your capture application can import and process PDFs, you have the ability to harvest these documents, extract metadata, and OCR them to create searchable PDFs in the PDF Image + Hidden Text format.
  3. Require the ability to create and populate custom PDF headers.  PDF headers allow custom metadata to be built into the core PDF file.  Why is this necessary?  Once again, I go back to the “suitcase” analogy: you always want to pack everything you need.  If you create a searchable PDF, and pack metadata into the headers, the file is now an all-inclusive data package.  Headers speed up search, and provide flexibility if you ever export files, or import your PDFs into another system.
  4. Require support for the latest standard.  PDF/A is the latest and greatest standard, and the goal of this ISO standard (ISO 19005) was to build a file format suitable for long-term archiving.  Ensure you can support this option.

Enabling the PDF iFilter in SharePoint to Crawl Searchable PDFs

  • Out of the box, Microsoft SharePoint will not full-text index PDFs.  There are several steps to enable PDF indexing, and to make sure you see the Adobe icon next to PDFs in SharePoint.
  • You will first need Adobe Reader, which includes the Adobe PDF iFilter.  You can download it from http://get.adobe.com/reader/
  • You will need to grab the Acrobat PDF icon.  This will display the PDF icon next to PDF documents in Microsoft SharePoint.  You can download it from http://www.adobe.com/images/pdficon_small.gif
  • You will now need to add the PDF file type to the Extensions List for SharePoint search by editing the registry
    • Start the registry editor, by going to Run, and typing regedit
    • Open up HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\{Random GUID}\Gather\Search\Extensions\ExtensionList
    • You will need to add “pdf” to the list as a new String Value. Find the highest number in the list, typically 37, and create a new String Value named with the next number (38), with “pdf” as its data
  • Add the Acrobat PDF icon you downloaded above to the Microsoft SharePoint templates directory. Copy the icon called pdficon_small.gif into the folder “%programfiles%\Common Files\Microsoft Shared\Web Server Extensions\12\TEMPLATE\IMAGES”.
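
The ExtensionList step above comes down to finding the highest numbered value and adding the next one. Here is a minimal Python sketch of that logic, meant as an illustration rather than a deployment script. The function names and the sample dictionary are made up for this example, it works on an in-memory copy of the key rather than touching the registry, and it assumes numbering would start at 1 if the list were empty.

```python
def next_extension_name(existing_names):
    """Return the next free numeric value name for the ExtensionList key."""
    # Only purely numeric names take part in the numbering.
    numbers = [int(name) for name in existing_names if name.isdigit()]
    # Assumption: an empty list would start counting at 1.
    return str(max(numbers) + 1) if numbers else "1"


def plan_pdf_entry(existing):
    """Decide which String Value to add so that "pdf" is in the list.

    `existing` maps value name -> extension, mirroring the registry key.
    Returns (name, data) for the value to create, or None if "pdf" is
    already registered.
    """
    if "pdf" in (ext.lower() for ext in existing.values()):
        return None  # already present; nothing to add
    return next_extension_name(existing.keys()), "pdf"


# Hypothetical snapshot of the key: names "1".."37" with placeholder data.
sample = {str(i): f"ext{i}" for i in range(1, 38)}
print(plan_pdf_entry(sample))  # ('38', 'pdf')
```

On a real server you would still make the change in regedit as described; the sketch only shows why the new value ends up as 38 when the list stops at 37.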

  • Now you will have to bind the Acrobat PDF icon to the PDF file type
    • Open the “%programfiles%\Common Files\Microsoft Shared\Web Server Extensions\12\TEMPLATE\XML\DOCICON.XML” file
    • Locate the <ByExtension> section of the file (inside <DocIcons>).
    • Add the mapping below:
      <Mapping Key="pdf" Value="pdficon_small.gif" OpenControl="" />
  • Change the iFilter mapping in the registry
    • Go to Start, and run regedit
    • Open the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\
    • Add (or modify) the .pdf key
    • Add a Multi-String value with the data {E8978DA6-047F-4E3D-9C78-CDBE46041603}, or modify the existing value if another GUID is already there.
    • Open the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\
    • Add (or modify) the .pdf key
    • Add a Multi-String value with the data {4C904448-74A9-11D0-AF6E-00C04FD8DC02}, or modify the existing value if another GUID is already there.
  • You will need to add the Adobe Reader folder to the Path environment variable
    • Open the System icon in the Control Panel
    • Open the Advanced tab
    • Go to Environment Variables
    • Edit the Path variable
    • Add your Reader folder to the Path list, e.g. C:\Program Files\Adobe\Reader 9.0\Reader
  • Restart the Search service by restarting your server or executing the following commands:
    • Run: net stop osearch
    • Run: net start osearch
    • Open a command prompt and run iisreset
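
If you would rather script the DOCICON.XML edit than make it by hand, something along these lines could work. Treat it as a rough sketch under assumptions: the function name and the tiny sample document are invented for illustration, and the element names (ByExtension, Mapping) follow the usual layout of that file, so check them against your own copy first. It uses only Python's standard xml.etree.ElementTree module and works on a string, so nothing on the server is touched.

```python
import xml.etree.ElementTree as ET


def add_pdf_mapping(xml_text, icon="pdficon_small.gif"):
    """Return xml_text with a pdf icon mapping added (or updated)."""
    root = ET.fromstring(xml_text)
    section = root.find("ByExtension")
    if section is None:
        raise ValueError("no <ByExtension> section found")
    for mapping in section.findall("Mapping"):
        if mapping.get("Key", "").lower() == "pdf":
            mapping.set("Value", icon)  # update an existing entry in place
            break
    else:
        # No pdf entry yet: append one, mirroring the mapping shown above.
        ET.SubElement(section, "Mapping",
                      {"Key": "pdf", "Value": icon, "OpenControl": ""})
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for DOCICON.XML.
sample = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif" />
  </ByExtension>
</DocIcons>"""

updated = add_pdf_mapping(sample)
print(updated)
```

Running the function a second time on its own output leaves a single pdf entry, so it is safe to repeat. Back up the real file before replacing it, and restart IIS afterwards as in the last step above.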