PDF iFilter and 64 bit platforms: Crawling PDFs in SharePoint 2010

Steps to configure the Adobe PDF iFilter (from TechNet):

 

  1. Install PDF iFilter 9.0 (64 bit) from http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025
  2. Download the PDF icon image from the Adobe web site (http://www.adobe.com/misc/linking.html) and copy it to C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\TEMPLATE\IMAGES\
  3. Add the following entry to the docIcon.xml file, which can be found at C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\TEMPLATE\XML (the Value must match the file name of the icon copied in step 2):
    <Mapping Key="pdf" Value="pdf16.gif" />
  4. Add the pdf file type on the File Types page of the Search Service Application
  5. Open regedit
  6. Navigate to the following location:
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\ContentIndexCommon\Filters\Extension
  7. Right-click the Extension key, click New > Key, and name the new key .pdf
  8. Set the (Default) value of the new key to the following GUID:
    {E8978DA6-047F-4E3D-9C78-CDBE46041603}
  9. Restart the SharePoint Server Search 14 service
  10. Reboot the SharePoint servers in the farm
  11. Create a test site (with any out-of-box site template), create a document library, and upload one or more sample PDF documents
  12. Perform a FULL crawl so the PDF content is indexed
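
Step 3 can also be scripted rather than edited by hand. The following is a minimal Python sketch and is not part of the TechNet steps. It adds the Mapping entry to the text of docIcon.xml, assuming the Mapping elements sit under a <ByExtension> element, which is how the file is normally laid out. The function name add_pdf_mapping is made up for illustration. Back up the real file before replacing it, since ElementTree drops comments and the XML declaration when it re-serializes.

```python
import xml.etree.ElementTree as ET


def add_pdf_mapping(doc_icon_xml: str, key: str = "pdf", value: str = "pdf16.gif") -> str:
    """Return docIcon.xml text with a Mapping for `key` added if it is missing.

    Assumption: Mapping elements live under <ByExtension>.
    """
    root = ET.fromstring(doc_icon_xml)
    by_ext = root.find("ByExtension")
    if by_ext is None:
        raise ValueError("no <ByExtension> element found in docIcon.xml text")

    # Leave the text untouched if the extension is already mapped.
    for mapping in by_ext.findall("Mapping"):
        if mapping.get("Key", "").lower() == key.lower():
            return doc_icon_xml

    ET.SubElement(by_ext, "Mapping", {"Key": key, "Value": value})
    return ET.tostring(root, encoding="unicode")


# Usage on a tiny stand-in for the real file (sample content is illustrative only).
sample = (
    "<DocIcons><ByExtension>"
    '<Mapping Key="doc" Value="icdoc.gif" />'
    "</ByExtension></DocIcons>"
)
print(add_pdf_mapping(sample))
```

To apply it for real, read docIcon.xml from the TEMPLATE\XML folder named in step 3, pass its text to the function, and write the result back on each server that needs the mapping.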

Once the crawl completes, PDF content should show up in search results.
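
The registry change in steps 5 through 8 can be captured in a .reg file, which is handy when several servers need the same key. Below is a minimal Python sketch that only builds the file text from the key path and GUID given above. The function name build_reg_file is hypothetical, and saving and importing the file (for example by opening it with regedit) is left to you. Review the output before importing it anywhere.

```python
# Key path and GUID copied from steps 6 and 8 above.
FILTER_EXTENSION_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0"
    r"\Search\Setup\ContentIndexCommon\Filters\Extension"
)
PDF_IFILTER_GUID = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"


def build_reg_file(extension: str = ".pdf", guid: str = PDF_IFILTER_GUID) -> str:
    """Build .reg file text that creates the extension key and sets its default value."""
    lines = [
        "Windows Registry Editor Version 5.00",  # standard .reg header
        "",
        f"[{FILTER_EXTENSION_KEY}\\{extension}]",  # new subkey, e.g. ...\Extension\.pdf
        f'@="{guid}"',  # "@" denotes the key's (Default) value
        "",
    ]
    return "\n".join(lines)


print(build_reg_file())
```

The restart and full crawl in the remaining steps are still required after the key is imported.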

NOTE: In a SharePoint Foundation 2010 environment, additional steps are required instead of step 4 above. See:

Adding Searchable File Types to SharePoint Foundation 2010
http://support.microsoft.com/kb/2518465

PDFs and SharePoint: What is recommended?

When scanning to SharePoint, capturing pre-existing images, and creating searchable PDFs, there are several things you should make sure you can enable in your capture software.  Below is a laundry list:

  1. PDF + Hidden Text is the preferred format.  Most scanning devices/applications will allow you to create PDFs, but note that these are image PDFs, and not searchable.  The de facto standard right now in the imaging industry is the PDF image + Hidden Text format.  This requires a capable OCR engine to produce the text layer, and is what I call a “suitcase” document: it contains a pristine image, and a hidden text layer for search.  
  2. Ensure your document capture software can import PDF files.   Just about every organization has pre-existing scanned PDF files.  In almost every case, these are purely PDF Image format, and cannot be searched or crawled through the PDF iFilter in SharePoint.  If your capture application can import and process PDFs, you have the ability to harvest these documents, extract metadata, and OCR them to create searchable PDFs in the PDF Image + Hidden Text format.
  3. Require the ability to create and populate custom PDF headers.   PDF headers allow custom metadata to be built into the core PDF file.  Why is this necessary?  Once again, I always go back to the “suitcase” analogy: you always want to pack everything you need.  If you create a searchable PDF, and pack metadata into the headers, the file is now an all-inclusive data package.  Headers speed up search, and provide flexibility if you ever export files, or import your PDFs into another system.
  4. Require support for the latest standard.  PDF/A is the latest and greatest standard, and the goal of this ISO standard was to build a file format suitable for long-term archiving.  Ensure your capture software supports this option.