Poll Results: What type of scanners are companies using to scan to SharePoint?

I conducted a poll over about 3 months to see what types of devices people are using to scan to Microsoft SharePoint.  The results are in tune with what I see in the field: folks are using a distributed scanning model with SharePoint to put scanning into the hands of the knowledge workers.  Below are the results:

The Two Most Popular Features for Scanning to SharePoint

So, when I look at all the customers I have worked with, and examine the feature sets that are most applied in a SharePoint environment, there are two that stand out: Routing Sheets and Advanced Data Extraction.  I would say 90% of all my customers use these features in some way to make the process automated and efficient.  So, what are they?  How do they work?  Both are outlined below:

Routing Sheets

I have mentioned these quite a bit on the blog, and they lend themselves nicely to distributed scanning from MFPs/copiers, fax machines, network scanners, etc.  A routing sheet is a combination of barcodes and/or checkboxes that allows end users to index a document prior to the scan.  The information can then be translated into metadata.  Reading the checkboxes requires Optical Mark Recognition, or OMR, so make sure your scanning product supports it.  Below are some samples:

Legal Routing Sheet

HR Routing Sheet
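
To make the step from routing sheet to metadata concrete, here is a minimal Python sketch. It assumes the barcode and OMR engines have already done their work and handed back a decoded barcode string and a list of checked box labels. The `KEY=VALUE;KEY=VALUE` barcode layout, the checkbox labels and the column names are all made up for illustration and do not come from any particular product.

```python
def routing_sheet_to_metadata(barcode_text, checked_boxes, checkbox_map):
    """Turn decoded routing sheet values into a metadata dictionary.

    barcode_text  -- decoded barcode string, assumed to look like
                     "Department=HR;EmployeeID=10482" (hypothetical layout)
    checked_boxes -- labels of the checkboxes the OMR engine found marked
    checkbox_map  -- hypothetical mapping: checkbox label -> (column, value)
    """
    metadata = {}
    # Barcode fields become columns directly.
    for pair in barcode_text.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            metadata[key.strip()] = value.strip()
    # Each marked checkbox sets one column to a fixed value.
    for label in checked_boxes:
        if label in checkbox_map:
            column, value = checkbox_map[label]
            metadata[column] = value
    return metadata


# Hypothetical checkbox layout for an HR routing sheet.
HR_CHECKBOXES = {
    "Confidential": ("Confidential", "Yes"),
    "Resume": ("DocType", "Resume"),
    "Review": ("DocType", "Performance Review"),
}

example = routing_sheet_to_metadata(
    "Department=HR;EmployeeID=10482",
    ["Resume", "Confidential"],
    HR_CHECKBOXES,
)
print(example)
```

The point of the sketch is that the user does the indexing with a pen before the scan, and the software only has to map marks and barcode values to columns.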

Advanced Data Extraction (ADE)

Many of the solutions out there today support what is called Zoning, or the ability to pick information from a specific area on a page and enter it as metadata.  ADE takes that to a whole new level and provides the ability to match patterns and extract information.  So if a customer needs an order number that is 6 digits, and always starts with a 7, the extraction engine can search the whole page and extract.  This is a huge time saver, and allows the utmost in automation and verification of data.
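
The order number example can be sketched with a regular expression. This is only an illustration in Python of the pattern matching idea, not how any specific extraction engine is built; it assumes the page has already been run through OCR and is available as plain text, and the sample page content is invented.

```python
import re

# Six digits, the first of which is 7, and not part of a longer run of digits.
ORDER_NUMBER = re.compile(r"(?<!\d)7\d{5}(?!\d)")


def extract_order_numbers(page_text):
    """Return every six-digit number starting with 7 found in the OCR text."""
    return ORDER_NUMBER.findall(page_text)


# Invented OCR output; the phone number also starts with 7 but is too long to match.
sample_page = (
    "ACME Supply  Invoice 2291\n"
    "Phone 7035551234  Order No: 712345\n"
    "Ship date 07/12/2010  Total 1,700.00\n"
)
print(extract_order_numbers(sample_page))
```

Unlike zoning, nothing here depends on where the number sits on the page, which is why the same rule keeps working when the layout shifts from one vendor's form to the next.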

 

Imaging File Size Comparison for Planning – Color and DPI

When planning for scanning to SharePoint, here is a quick matrix for the impact DPI and color can have on file size, and the size of your content DBs.

Scanning Mode / DPI           File Size
Black and White – 200 DPI     26 KB
Black and White – 300 DPI     38 KB
Black and White – 400 DPI     51 KB
Black and White – 600 DPI     80 KB
Grayscale – 300 DPI           301 KB
Color – 300 DPI               577 KB
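
To turn the matrix into a planning number, here is a rough Python sketch. It assumes each figure in the matrix is the size of one scanned page, and the 2,000 pages per day and 250 scanning days per year are placeholder volumes to swap for your own. It only counts the image files themselves, not versions, indexes or other overhead in the content database.

```python
# Sizes in KB from the matrix above, assumed to be per scanned page.
SIZE_KB = {
    ("bw", 200): 26,
    ("bw", 300): 38,
    ("bw", 400): 51,
    ("bw", 600): 80,
    ("grayscale", 300): 301,
    ("color", 300): 577,
}


def yearly_storage_gb(mode, dpi, pages_per_day, scan_days_per_year=250):
    """Estimate raw image storage added per year, in GB (1 GB = 1024 * 1024 KB)."""
    total_kb = SIZE_KB[(mode, dpi)] * pages_per_day * scan_days_per_year
    return total_kb / (1024 * 1024)


# Placeholder volume: 2,000 pages per day.
for mode, dpi in SIZE_KB:
    gb = yearly_storage_gb(mode, dpi, pages_per_day=2000)
    print(f"{mode:>9} @ {dpi} DPI: {gb:6.1f} GB/year")
```

Running it side by side for black and white and color makes the cost of a color default on a busy device easy to see.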

5 Tips for Optimizing Image Size When Scanning to SharePoint

I find quite a few customers are not optimizing their scanning process and are creating very large image files, slamming their network and bloating their content databases.  Below are 5 tips to live by when scanning to SharePoint:

  1. Scanning at anything greater than 300 DPI is unnecessary.  DPI can be a huge killer, and really bloat your file size.  For most instances, 200 DPI is perfectly fine for archive purposes.  If you are using OCR or performing data extraction, 300 DPI will give you a great quality image.  Anything beyond that adds little usable quality for these purposes, but the file size keeps climbing: in my file size comparison, black and white at 600 DPI comes out at roughly double the size of 300 DPI.
  2. Use color and grayscale sparingly.  Color and grayscale files can be massive, and can be a huge burden on many different aspects of any SharePoint system.  Use them only when absolutely necessary, as black and white images are perfectly acceptable in almost every instance.
  3. Image processing is key.  Having an image processing engine that can despeckle, deshade and remove black borders will reduce file size and conserve storage.
  4. Check your copiers.  Most copiers today like to show off their fancy color capabilities and typically come with default settings to create color scans.  Check DPI and color settings to make sure your users are not unknowingly creating massive files.
  5. TIFF or PDF?  This could be a whole separate conversation, and possibly a future post.  For the same scanned image and compression, there really is no difference in file size, and I find PDF is becoming the de facto standard in imaging.
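
Tips 1, 2 and 4 are easy to turn into a quick check of device settings. The Python sketch below is only an illustration: the profile names and the idea of keeping scan profiles as simple records are assumptions, and the thresholds are the 200 and 300 DPI guidelines from the tips above.

```python
def check_scan_profile(name, dpi, color_mode, ocr=False):
    """Return a list of warnings for one scan profile, following the tips above."""
    warnings = []
    if dpi > 300:
        warnings.append(f"{name}: {dpi} DPI is above 300; lower it")
    elif dpi > 200 and not ocr:
        warnings.append(f"{name}: {dpi} DPI with no OCR; 200 is usually enough for archive")
    if color_mode != "bw":
        warnings.append(f"{name}: {color_mode} scanning; use only if the documents need it")
    return warnings


# Hypothetical profiles, as they might be read off the devices.
profiles = [
    {"name": "Lobby copier default", "dpi": 600, "color_mode": "color"},
    {"name": "AP invoices", "dpi": 300, "color_mode": "bw", "ocr": True},
]
for profile in profiles:
    for warning in check_scan_profile(**profile):
        print(warning)
```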

 

 

5 Keys to a Successful SharePoint Scanning Project

Below are 5 primary keys to implementing a successful SharePoint scanning / imaging project:

1.  Make sure you do some in depth storage planning.  

When imaging to SharePoint or Office 365, you need to plan not only for storage requirements, but also for the load on your network.  Scanning, if done incorrectly, can create a huge burden on your network and bloat your content databases.  More info here: SharePoint Scanning Storage Planning

2.  Leverage existing scanning devices for the pilot project.

Giving users a familiar interface will go a long way toward acceptance.  Make it easy, and leverage copiers or other scanners already within the organization to make the transition to paperless workflows familiar.  More on scanning hardware here: SharePoint Scanning Hardware

3.  Involve end users in SharePoint design.

I have seen so many projects where IT just builds what they think users want.  Make the layout of the site a collaborative effort, and build your site and library structures accordingly.  Map paper documents to digital, and leverage content types and managed metadata.  Finally, capture drives search, so make sure appropriate columns are put in place so users can find, sort and create views simply and easily.

4.  Leverage folders for quick adoption.

Here we go, the old folder argument.  Along with creating a familiar environment, users love folders, and they give quite a bit of power in the SharePoint world.  Adding them costs nothing, and they can be turned off for users who don’t want them.  Use folders.

5.  Automation is key, and necessary for standardization.

Make sure you utilize a scanning application that allows for a standardization rule set.  Site, library, content type, folder, file naming and terms should all be able to be controlled and set automatically.  Automation makes standardization easy and totally transparent, giving you a repeatable, consistent scanning and capture process.
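
As a sketch of what such a rule set might look like, the Python below maps a document type to a destination and a naming pattern. Everything in it is hypothetical (the `CaptureRule` fields, the site and library names, the template placeholders); real capture products have their own configuration formats, and the sketch stops short of actually uploading anything to SharePoint.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaptureRule:
    """One standardization rule: where a document type goes and how it is named."""
    site: str
    library: str
    content_type: str
    folder_template: str
    filename_template: str


# Hypothetical rule set, keyed by document type.
RULES = {
    "invoice": CaptureRule(
        site="/sites/finance",
        library="Accounts Payable",
        content_type="Invoice",
        folder_template="{vendor}/{year}",
        filename_template="{vendor}_{invoice_number}_{scan_date}.pdf",
    ),
}


def resolve_destination(doc_type, metadata, scan_date):
    """Apply the rule for doc_type to the captured metadata."""
    rule = RULES[doc_type]
    fields = dict(metadata, year=scan_date.year, scan_date=scan_date.isoformat())
    return {
        "site": rule.site,
        "library": rule.library,
        "content_type": rule.content_type,
        "folder": rule.folder_template.format(**fields),
        "filename": rule.filename_template.format(**fields),
    }


dest = resolve_destination(
    "invoice",
    {"vendor": "Contoso", "invoice_number": "712345"},
    date(2012, 3, 15),
)
print(dest)
```

Because the user never types a file name or picks a folder, every invoice lands in the same place with the same naming, no matter who scanned it.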

PDFs and SharePoint: What is recommended?

When scanning to SharePoint, capturing pre-existing images, and creating searchable PDFs, there are several things you should make sure you can enable in your capture software.  Below is a laundry list:

  1. PDF + Hidden Text is the preferred format.  Most scanning devices/applications will allow you to create PDFs, but note that these are image PDFs, and not searchable.  The de facto standard right now in the imaging industry is the PDF image + Hidden Text format.  This requires a capable OCR engine to produce the text layer, and is what I call a “suitcase” document: it contains a pristine image, and a hidden text layer for search.  
  2. Ensure your document capture software can import PDF files.  Just about every organization has pre-existing scanned PDF files.  In almost every case, these are purely PDF Image format, and cannot be searched, or crawled through the PDF iFilter in SharePoint.  If your capture application can import and process PDFs, you have the ability to harvest these documents, extract metadata, and OCR them to create searchable PDFs in the PDF Image + Hidden Text format.
  3. Require the ability to create and populate custom PDF headers.  PDF headers allow custom metadata to be built into the core PDF file.  Why is this necessary?  Once again, I always go back to the “suitcase” analogy: you always want to pack everything you need.  If you create a searchable PDF, and pack metadata into the headers, the file is now an all-inclusive data package.  Headers speed up search, and provide flexibility if you ever export files, or import your PDFs into another system.
  4. Require support for the latest standard.  PDF/A is the latest and greatest standard, and the goal of this ISO standard was to build a file format suitable for long-term archiving.  Ensure your capture software supports this option.

SharePoint Scanning Planning – Part 4 – Document Scanning Models

Document Scanning Models

After doing some planning on the hardware types and document scanning volumes, the next step is to examine what type of model you need to deploy.  There are typically 3 standard models for document scanning and capture: Centralized, Decentralized and Distributed.

Each model has its own pros/cons, and below I will examine each, and dive into some detail.

Centralized

Ah, the centralized model.  Some call this old school scanning and capture, as for many years, this was the only way to get the job done, and convert your paper to digital form.  This model provides a centralized scanning center to provide mass conversion for the organization.  The operation can be run by in house personnel, be managed by a services provider in house, or be outsourced to a scanning service bureau.  It requires high volume/high speed hardware, and typically utilizes advanced capture software to allow for the utmost in automation and efficiency.  The software and hardware operators are typically highly trained, and there are usually only a few of them.  Paper and/or digital media is shipped to the centralized location and processed through a set, standardized capture workflow.

Centralized Pros

  • Easily standardized process due to a limited number of skilled/trained scan operators
  • High speed hardware/software results in minimal processing time once paper is received
  • Centralized reporting and control of overall process
  • No loading on WAN infrastructure
  • Centralized backup and restore

Centralized Cons

  • Usually a high time delay for availability of documents
  • High cost due to shipping of documents
  • High maintenance costs
  • High training costs to bring on new operators
  • Disaster recovery planning issues if centralized site is down
  • Operators are typically not knowledgeable in the documents they are indexing

Decentralized

Over time, as bandwidth and scanning hardware/software prices went down, the obvious move was to decentralize the whole scanning and capture process.  This move placed scanning in the branches, and allowed the whole document capture process to be performed by those who had working knowledge of the documents.  Smaller, desktop class hardware could be used, and most capture companies made batch scanning and upload to the centralized repository simple to accomplish.

Decentralized Pros

  • Scan operators are well versed in the documents they scan
  • Documents are available almost immediately
  • No shipping or transfer costs for documents
  • Branch control of the whole scanning process

Decentralized Cons

  • Standardization can be an issue
  • No centralized control or reporting
  • WAN Bandwidth consumption can be high
  • Licensing costs can be high depending on software utilized

Distributed

The advance of network-based scanning devices and the lowering of bandwidth pricing led to the newest model, the Distributed Model.  Distributed Scanning allows for just about anyone in the organization to walk up to a network scanning device/scanning copier/fax machine and send documents to a repository.  The devices are typically multi-faceted, and along with repository integration, can provide scan to network folder, FTP and email.  Collaborative back-end systems, like Microsoft SharePoint, lend themselves nicely to this model, as they allow anyone to participate in a Document Workspace.

Distributed Pros

  • Put scanning in the hands of everyone in the organization
  • Provides a great launching pad for collaborative solutions
  • Simple, easy to use interfaces allow for minimal training and quick adoption
  • Capture and indexing is now in the hands of the true document owner
  • One-to-many solution provides a single device to service many users

Distributed Cons

  • Lack of standardization without additional software
  • Security and document control can be major issues
  • Bandwidth from smaller branches can be a problem with larger scans
  • Lack of hardware integrations with back-end systems
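
The bandwidth concern is easy to put rough numbers on. The Python sketch below estimates how long a batch takes to leave a branch, using the 26K and 577K figures from the file size matrix earlier. The 1.5 Mbps uplink, the 50 page batch and the assumption that scanning may use half of the link are all placeholders, and the calculation ignores protocol overhead.

```python
def upload_seconds(pages, kb_per_page, uplink_mbps, usable_fraction=0.5):
    """Rough time to send a batch of scans over a branch uplink.

    kb_per_page is in kilobytes; uplink_mbps is in megabits per second.
    usable_fraction is a guess at how much of the link scanning may use.
    """
    bits = pages * kb_per_page * 1024 * 8
    usable_bits_per_second = uplink_mbps * 1_000_000 * usable_fraction
    return bits / usable_bits_per_second


# Placeholder branch: 50 page batch over a 1.5 Mbps uplink.
for label, kb in [("B&W 200 DPI", 26), ("Color 300 DPI", 577)]:
    secs = upload_seconds(pages=50, kb_per_page=kb, uplink_mbps=1.5)
    print(f"{label}: about {secs:.0f} seconds for 50 pages")
```

On these assumptions the same batch goes from seconds to minutes when the device is left on color, which is where small branches start to feel it.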

So, most organizations today are combining the above models to create a Hybrid Scanning and Capture solution, and leveraging all the strengths together to minimize the weaknesses of any one model.   Another strategy is to tie scanning models to specific business processes, as most lend themselves nicely to specific scanning and capture workflows.

Hardware and Choosing Your Scanning Model

 

Most organizations will choose their model to leverage their existing hardware investment.  That can lead to decisions that seem good at the time, but on deeper examination it can make more sense to realign the hardware with the best model.  Take, for example, a company that instantly leans toward a distributed model and attempts to leverage a copier fleet that is currently under lease.  As the part of this guide that covers scanning hardware explains, copiers will not always fit the type of scanning you need to perform.  Consider a branch accounting department that is looking to scan receipts or check stubs.  Will the copier perform well with mixed original sizes?  Just a word of caution: examine the paper, workflow, and document types to get the best feel and adopt the best model.