PDF iFilter and 64 bit platforms: Crawling PDFs in SharePoint 2010

Steps to configure the Adobe PDF iFilter (based on the TechNet guidance):

 

  1. Install PDF iFilter 9.0 (64 bit) from http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025
  2. Download the PDF icon image from the Adobe web site http://www.adobe.com/misc/linking.html and copy it to C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\TEMPLATE\IMAGES\
  3. Add the following entry to the docIcon.xml file, which can be found at: C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\TEMPLATE\XML
    <Mapping Key="pdf" Value="pdf16.gif" />
  4. Add the pdf file type on the File Types page under the Search Service Application
  5. Open regedit
  6. Navigate to the following location:
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\ContentIndexCommon\Filters\Extension
  7. Right-click > Click New > Key to create a new key named .pdf
  8. Set the default value of the new key to the following GUID:
    {E8978DA6-047F-4E3D-9C78-CDBE46041603}
  • Restart the SharePoint Server Search 14 service
  • Reboot the SharePoint servers in the farm
  • Create a test site (with any out-of-box site template), create a document library, and upload any sample PDF document(s).
  • Perform a FULL crawl so the PDF content is indexed.

Once the crawl is completed we will get search results.
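
For anyone who would rather script steps 3 and 5 through 8 than hand-edit files, here is a minimal Python sketch. It only builds text: the updated docIcon.xml content and a .reg file you could import with regedit. The function names are my own, and it assumes the mapping belongs under a <ByExtension> element inside docIcon.xml, so check your own file (and back it up) before writing anything back.

```python
import xml.etree.ElementTree as ET

# GUID and registry path are the ones given in the steps above.
PDF_IFILTER_GUID = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"
FILTER_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search"
    r"\Setup\ContentIndexCommon\Filters\Extension"
)


def add_pdf_mapping(docicon_xml: str, icon: str = "pdf16.gif") -> str:
    """Return docIcon.xml text with a pdf mapping under <ByExtension>.

    Assumption: extension mappings live in a <ByExtension> element.
    """
    root = ET.fromstring(docicon_xml)
    by_ext = root.find("ByExtension")
    if by_ext is None:
        raise ValueError("no <ByExtension> element found")
    # Leave the file untouched if a pdf mapping is already present.
    for mapping in by_ext.findall("Mapping"):
        if mapping.get("Key", "").lower() == "pdf":
            return docicon_xml
    ET.SubElement(by_ext, "Mapping", {"Key": "pdf", "Value": icon})
    return ET.tostring(root, encoding="unicode")


def build_reg_file(extension: str = ".pdf", guid: str = PDF_IFILTER_GUID) -> str:
    """Build .reg text that creates the extension key and sets its default value."""
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{FILTER_KEY}\\{extension}]\r\n"
        f'@="{guid}"\r\n'
    )


# Tiny stand-in for the real docIcon.xml, for illustration only.
sample = (
    '<DocIcons><ByExtension>'
    '<Mapping Key="doc" Value="icdoc.gif" />'
    '</ByExtension></DocIcons>'
)
updated = add_pdf_mapping(sample)
reg_text = build_reg_file()
print(updated)
print(reg_text)
```

Running the mapping function twice is harmless, since it returns the text unchanged when a pdf entry already exists. The service restart, reboot, and full crawl still have to happen afterwards.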

NOTE: If this is a SharePoint Foundation 2010 environment additional steps will be required instead of step 4 above

Adding Searchable File Types to SharePoint Foundation 2010
http://support.microsoft.com/kb/2518465

New SharePoint 2010 Limitations on Storage

Holy cow…4TB on content DBs.  Below is quoted from TechNet:

Content databases of up to 4 TB are supported when the following requirements are met:

  • Disk sub-system performance of 0.25 IOPs per GB. 2 IOPs per GB is recommended for optimal performance.
  • You must have developed plans for high availability, disaster recovery, future capacity, and performance testing.

You should also carefully consider the following factors:

  • Requirements for backup and restore may not be met by the native SharePoint Server 2010 backup for content databases larger than 200 GB. It is recommended to evaluate and test SharePoint Server 2010 backup and alternative backup solutions to determine the best solution for your specific environment.
  • It is strongly recommended to have proactive skilled administrator management of the SharePoint Server 2010 and SQL Server installations.
  • The complexity of customizations and configurations on SharePoint Server 2010 may necessitate refactoring (or splitting) of data into multiple content databases. Seek advice from a skilled professional architect and perform testing to determine the optimum content database size for your implementation. Examples of complexity may include custom code deployments, use of more than 20 columns in property promotion, or features listed as not to be used in the over 4 TB section below.
  • Refactoring of site collections allows for scale out of a SharePoint Server 2010 implementation across multiple content databases. This permits SharePoint Server 2010 implementations to scale indefinitely. This refactoring will be easier and faster when content databases are less than 200 GB.
  • It is suggested that for ease of backup and restore that individual site collections within a content database be limited to 100 GB. For more information, see Site collection limits.

For more information on SharePoint Server 2010 data size planning, see Storage and SQL Server capacity planning and configuration (SharePoint Server 2010).
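
To put the IOPs figures in perspective, here is a quick back-of-the-envelope sketch in Python. The function name is mine, and I am treating 4 TB as 4,096 GB; the per-GB ratios are the ones quoted above.

```python
MIN_IOPS_PER_GB = 0.25         # minimum ratio quoted from TechNet
RECOMMENDED_IOPS_PER_GB = 2.0  # recommended ratio quoted from TechNet


def required_iops(db_size_gb: float, iops_per_gb: float) -> float:
    """Disk sub-system IOPs needed for a content database of the given size."""
    return db_size_gb * iops_per_gb


size_gb = 4 * 1024  # a 4 TB content database, expressed in GB
minimum = required_iops(size_gb, MIN_IOPS_PER_GB)
recommended = required_iops(size_gb, RECOMMENDED_IOPS_PER_GB)
print(f"{size_gb} GB: minimum {minimum:.0f} IOPs, recommended {recommended:.0f} IOPs")
# prints "4096 GB: minimum 1024 IOPs, recommended 8192 IOPs"
```

In other words, a full 4 TB content database needs storage that can sustain roughly a thousand IOPs at the bare minimum, and about eight times that to meet the recommendation.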

Document Routing and Microsoft SharePoint

I see a ton of companies struggling with the question: How do I get my copiers to scan to SharePoint?

I go back and forth on the idea of panel applications that enable intelligent routing at the copier.  It always comes back to contention at the device.  I recall one instance where an admin using eCopy had all her documents piled on the copier and was scanning one document at a time, sending each one to SharePoint.  During her 20 minutes of copier hoarding, at least 10 people walked up, and walked away.

There are several things that I believe are absolutely critical to enabling copiers as scanning and capture onramps to SharePoint:

  1. Document Separators are an absolute requirement!!!  You have to be able to take a whole stack of documents, place barcode/routing separators between them, throw them all in the hopper and hit the green button.
  2. Intelligent Routing is required.  Separators need to provide document intelligence, and give the user the ability to pre-index the document through the use of a barcode creation utility, or an Optical Mark Recognition (OMR) routing sheet with check boxes.
  3. Flexibility in routing is required.  An application that can provide automatic routing to SharePoint based on barcodes or checkboxes can provide ultimate flexibility for the users.  The ability to route to site, library and folder is necessary, and the ability to set content type and file naming is also key.

Here is a sample of a routing sheet:   Scanning Route-SP-Dynamic-Template
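
To make the separator idea concrete, here is a rough Python sketch of how a stack of scanned pages could be split into documents and routed. Everything in it is hypothetical: the "ROUTE|site|library|folder|content type" barcode payload, the page dictionaries, and the class names are made up for illustration and do not reflect how any particular capture product works.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEPARATOR_PREFIX = "ROUTE|"  # hypothetical barcode payload prefix


@dataclass
class Route:
    site: str
    library: str
    folder: str
    content_type: str


@dataclass
class Document:
    route: Route
    pages: List[str] = field(default_factory=list)


def parse_route(barcode: str) -> Route:
    """Parse 'ROUTE|site|library|folder|content_type' into a Route."""
    parts = barcode[len(SEPARATOR_PREFIX):].split("|")
    if len(parts) != 4:
        raise ValueError(f"unexpected separator payload: {barcode!r}")
    return Route(*parts)


def split_batch(pages: List[dict]) -> List[Document]:
    """Split a scanned stack into documents at each separator sheet."""
    documents: List[Document] = []
    current: Optional[Document] = None
    for page in pages:
        barcode = page.get("barcode") or ""
        if barcode.startswith(SEPARATOR_PREFIX):
            # A separator starts a new document and is not kept as a page.
            current = Document(route=parse_route(barcode))
            documents.append(current)
        elif current is None:
            raise ValueError("batch must start with a separator sheet")
        else:
            current.pages.append(page["image"])
    return documents


# One stack, two documents, dropped in the hopper together.
batch = [
    {"barcode": "ROUTE|/sites/finance|Invoices|2011|Invoice"},
    {"image": "page-001.tif"},
    {"image": "page-002.tif"},
    {"barcode": "ROUTE|/sites/hr|Personnel|Smith|Employee Record"},
    {"image": "page-003.tif"},
]
docs = split_batch(batch)
for doc in docs:
    print(doc.route.site, doc.route.library, doc.route.folder, len(doc.pages))
```

The point of the design is that all the indexing decisions are made before anyone walks up to the copier, so the time at the device is one button press per stack.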

To Folder or Not to Folder (in SharePoint). That is the question.

Should I use folders in SharePoint?

I am always in search of opinions on the use of folders within SharePoint.

Arguments For Folders in SharePoint:

  • End users are comfortable with them.  The transition to any new technology is always easier, and adoption rates are higher, when end users can apply “old school” ways to a new interface.
  • Folders, although merely logical in SharePoint, provide a hierarchical structure, and some standardization.
  • For the power user, you can get rid of the infantile folders, and create a custom view that eliminates them.
  • There is always the item limit within any view.  The guidance for SharePoint 2007 was 2,000 items per view, and SharePoint 2010 sets the default list view threshold at 5,000 items.  My understanding is that folders in SharePoint can break up your library into segments so you don't need to worry about these limits when rendering a list.
  • Logical structure can help down the line for any reorganization, export or migration of data and files.
  • For scanning to SharePoint, most advanced capture technologies provide custom foldering as a migration method to SharePoint.  Why not use it if it is there?

Arguments Against Folders in SharePoint Libraries:

  • Folders are “old school”, and have no place within SharePoint libraries, especially in SharePoint 2010.  Customized views, content types and document sets should be utilized for organization and viewing.
  • SharePoint should not be used like a file system.  It is a database, and the search interface should be used to find what you are looking for in the content databases, rather than the folder “Hunt and Peck” method.
  • Encouraging end users to create folders within a SharePoint library will only lead to the same end users “gone wild” scenario that happened to our file share system.

Did I miss anything?  Any enhancement of this post would be greatly appreciated.

Questions to ask before you start your SharePoint scanning, imaging or capture project

So you want to use Microsoft SharePoint as storage for scanned images? Take a quick breath and don’t charge in too fast, as there are many facets of this type of project that need to be considered.

What type of volume are you scanning on a daily basis?

  
You need to take a deep dive into departmental and end user needs, and really look at the volume of pages they need to image and capture. This brings up a point I discuss on a daily basis: Do you want to scan or capture? You may read this and say, what in the world are you talking about? Here is the explanation:
Let’s create a definition and define a feature set for scanning applications. A scanning application is just a means to take paper and quickly and easily convert it to digital form. Scanning applications are well suited to environments with very basic needs, and what I call “onesie-twosie” scanning, or low volume environments. Their feature sets provide very basic functionality, and may allow the use of basic separation, and very basic integrations with SharePoint. The majority of scanning hardware vendors bundle these applications with their hardware, although there are vendors that have taken it to the next level, and provide enhanced scanning capabilities beyond the typical bundled software.
Document Capture software can be utilized for basic scanning needs, but takes you to a whole new level from a “capture” perspective. These applications typically have a number of ways to “slice and dice” documents, and really focus on efficiency, and minimizing the time required to scan, index and capture data. Capture software provides numerous ways to automatically populate columns, including barcode reading, database lookups, OCR, and data extraction. True capture applications provide integration with scanners, folders with images, SharePoint WebDAV folders, etc. Any organization that is serious about processing paper documents, and wants to do it in the most efficient, standardized manner, should look seriously at advanced capture applications.
Capture applications are typically well suited to high volume situations or in situations where data can be extracted automatically. Scanning applications are suited for very simple operations, and usually suited to low volume.

What type of scanning device(s) are you going to utilize?

 
There are only a few applications out there that will provide you with the ability to scan from any type of device. Are you going to use network based scanning devices or direct connect scanners? Look into support in these specific areas:
• What types of drivers are supported? ISIS, TWAIN, and VRS should all be supported.
• Can hot folder functionality provide the auto-import and processing of all different image types, PDF included? Hot folder functionality should span local, network and WebDAV folders.
Beware of “panel” based applications. They are typically very static, and can create a line at the MFP/Copier as people enter information about their documents at the actual device.
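
If "hot folder" is a new term, the idea is simply a watched folder: anything dropped into it gets picked up and processed automatically. Here is a bare-bones Python sketch of one polling pass. The function name and extension list are my own choices, and it assumes the folder (local, a network share, or a mapped WebDAV location) is reachable as an ordinary file system path.

```python
import os
import tempfile
from typing import Iterable, List, Set

# Illustrative list of image types a hot folder might accept.
IMAGE_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def find_new_files(folder: str, seen: Set[str],
                   extensions: Iterable[str] = IMAGE_EXTENSIONS) -> List[str]:
    """Return image files in the folder that have not been picked up yet."""
    allowed = {ext.lower() for ext in extensions}
    new_files = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        ext = os.path.splitext(name)[1].lower()
        if os.path.isfile(path) and ext in allowed and path not in seen:
            seen.add(path)  # remember it so the next pass skips it
            new_files.append(path)
    return new_files


# Demo against a temporary folder standing in for the hot folder.
with tempfile.TemporaryDirectory() as hot_folder:
    for name in ("scan1.pdf", "scan2.TIF", "notes.txt"):
        open(os.path.join(hot_folder, name), "w").close()
    seen: Set[str] = set()
    first_pass = [os.path.basename(p) for p in find_new_files(hot_folder, seen)]
    second_pass = find_new_files(hot_folder, seen)

print(first_pass)   # ['scan1.pdf', 'scan2.TIF']
print(second_pass)  # []
```

A real capture product would also wait for files to finish copying and move or archive them after import; this sketch leaves those parts out.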


What output format do you want in the SharePoint libraries?

 
Scanning and capture applications today provide a broad array of image output formats, but the standard seems to be PDF Image with Hidden Text. This provides an all-in-one container for the original image and the searchable text. Install the PDF iFilter, and you have a searchable content store. Some specialized uses may require other formats. For instance, if you are importing JPEGs with EXIF tags with your advanced capture application, you will want to keep the original JPEG file with tags intact rather than performing a conversion.


What Scanning and Capture features will be necessary in your environment?


What features should you look for? This is the most difficult question of them all, and you really need to find an application with a broad feature set to make sure you can cover today’s needs, and the needs of your organization in the future. This blog post is a great place to start:
Trends in Scanning and Capture




How much storage space will I require? Where are you going to store your images?


Just a few stats here to get you on your way:
• The standard scanned page can be estimated at 50 KB in size (at 300 DPI)
• A file cabinet contains between 10,000 and 12,000 pages
This can give you a quick idea of how much storage will be required, and let you do some growth estimation over time.
You should also use these numbers to see if you should use the SharePoint DB for content storage, or utilize Remote BLOB Storage (RBS). SharePoint 2010 with SQL Server 2008 R2 allows this without the need for additional software through the FILESTREAM provider.
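
Here is the arithmetic as a small Python sketch, in case you want to plug in your own numbers. The function name and the 20-cabinet example are mine; the 50 KB per page and 10,000 to 12,000 pages per cabinet figures are the estimates above.

```python
PAGE_SIZE_KB = 50  # estimated size of one scanned page at 300 DPI


def estimate_storage_gb(cabinets: int, pages_per_cabinet: int,
                        page_size_kb: float = PAGE_SIZE_KB) -> float:
    """Estimated storage in GB for a number of file cabinets."""
    total_kb = cabinets * pages_per_cabinet * page_size_kb
    return total_kb / (1024 * 1024)  # KB -> GB


# A single cabinet at the low and high page estimates.
low = estimate_storage_gb(1, 10_000)
high = estimate_storage_gb(1, 12_000)
print(f"One cabinet: {low * 1024:.0f} to {high * 1024:.0f} MB")

# A hypothetical 20-cabinet back-file conversion at the high estimate.
print(f"Twenty cabinets: about {estimate_storage_gb(20, 12_000):.1f} GB")
```

So a single cabinet lands somewhere around half a gigabyte, which makes it easy to compare a back-file conversion against the content database guidance for your farm.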


How will I view images once they are in SharePoint?


Without a viewer add-on, SharePoint will require you to open an image to view pages. This can be problematic if you are serving up large image files. Definitely take a look at some of the image viewer add-ons for SharePoint. My favorite, VizitSP SharePoint Viewer, provides the ability to view/preview, annotate, image process, search (column based and full text) and have multiple images open in a tabbed view. This is an absolute necessity if you are going to give end users the best experience possible.

Just some questions to get the gears turning and make sure you get all the pieces to the puzzle.

SharePoint 2010 and Document Sets

So many good posts coming out on the web for 2010. I am working to figure out all the angles on how to improve SharePoint as an imaging, scanning and capture platform. Document sets seem to be a great focal point. Here is a great article outlining them and how to use them:

Document Sets and SharePoint 2010