Poll Results: What type of scanners are companies using to scan to SharePoint?

I conducted a poll over about 3 months to see what type of devices people were using to scan to Microsoft SharePoint.  The results are in tune with what I see in the field, as folks are using a distributed scanning model with SP to put scanning into the hands of the knowledge workers.  Below are the results:

Oops. Did we back up the file cabinets?

If you look at the headlines over the past few years, you cannot help but notice the number of natural disasters that have occurred. In my conferences with IT and Departmental Management, I always pose the question when discussing business continuity or disaster planning: Do you have a plan for your paper?

Just about every company has implemented some type of plan for backing up their important digital files. Some go to the extreme with data snapshots that can be recovered from multiple locations. But companies typically don’t take the same strategy with their paper assets. The good ole file cabinet, the protector of all things paper, will provide protection, right?

Companies need to take a good hard look at their paper, and assess the business impact should disaster destroy their file room. Backing up your paper nowadays is not hard, nor expensive when compared to the legal implications and time it would take to reproduce (if possible) contracts, customer files, sales records and the like.

Any paper backup plan involves a concept I call Bridging the Gap (BTG). BTG involves hardware and capture software to digitize and build the bridge to the digital world, and then a repository on the “other side” to house the records and make search and retrieval simple. The repository can be as simple as a set of named network folders, or as complex as a true ECM system like MS SharePoint. Take the initiative and back up your paper today.

The Two Most Popular Features for Scanning to SharePoint

So, when I look at all the customers I have worked with, and examine the feature sets that are most applied in a SharePoint environment, there are two that stand out: Routing Sheets and Advanced Data Extraction.  I would say 90% of all my customers use these features in some way to make the process automated and efficient.  So, what are they?  How do they work?  Both are outlined below:

Routing Sheets

I have mentioned these quite a bit on the blog, and they lend themselves nicely to distributed scanning from MFPs/copiers, fax machines, network scanners, etc.  A routing sheet is a combo of barcodes and/or checkboxes that allows end users to index prior to the scan.  The information can then be translated into metadata.  Reading the checkboxes requires Optical Mark Recognition, or OMR, so make sure your scanning product supports it.  Below are some samples:

Legal Routing Sheet

HR Routing Sheet
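To make the routing sheet idea concrete, here is a minimal sketch of how a decoded sheet could be turned into metadata.  It assumes the barcode has already been decoded and that an OMR engine reports a fill ratio (0.0 to 1.0) for each checkbox zone.  The field names, checkbox names and the 0.4 threshold are all hypothetical, not taken from any particular product.

```python
def read_routing_sheet(barcode_value, checkbox_fill, threshold=0.4):
    """Turn a decoded routing sheet into a metadata dictionary.

    barcode_value: the string decoded from the sheet's barcode (assumed input).
    checkbox_fill: mapping of checkbox name -> fraction of dark pixels in its zone.
    threshold:     hypothetical cut-off above which a box counts as checked.
    """
    checked = sorted(
        name for name, fill in checkbox_fill.items() if fill >= threshold
    )
    return {"RoutingCode": barcode_value, "DocumentTypes": checked}


# Example: an HR sheet where "Resume" and "Review" were ticked by the user.
metadata = read_routing_sheet(
    "HR-000123",
    {"Resume": 0.72, "Benefits": 0.05, "Review": 0.55},
)
print(metadata)
```

The resulting dictionary is the kind of structure a capture tool could then map onto repository columns.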

Advanced Data Extraction (ADE)

Many of the solutions out there today support what is called Zoning, or the ability to pick information from a specific area on a page and enter it as metadata.  ADE takes that to a whole new level and provides the ability to match patterns and extract information.  So if a customer needs an order number that is 6 digits, and always starts with a 7, the extraction engine can search the whole page and extract.  This is a huge time saver, and allows the utmost in automation and verification of data.
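The order number example above can be sketched with a regular expression.  This is only an illustration of pattern-based extraction over OCR text, not how any specific extraction engine works; the function name and sample text are made up for the example.

```python
import re

# A standalone 6-digit number whose first digit is 7.
# The lookarounds stop the pattern from matching inside a longer number.
ORDER_NUMBER = re.compile(r"(?<!\d)7\d{5}(?!\d)")


def extract_order_numbers(page_text):
    """Return every 6-digit number starting with 7 found anywhere in the text."""
    return ORDER_NUMBER.findall(page_text)


sample_text = "Invoice 2231 for order 712345, shipped 07/12/2010. Ref 7123456."
print(extract_order_numbers(sample_text))  # ['712345']
```

Note that the 7-digit reference and the date are skipped, which is the practical difference between matching a pattern anywhere on the page and reading a fixed zone.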

 

What is “Bridging the Gap”?

 The movement towards an office with less paper and more efficiency can be quite difficult, and with the wrong tools can end in failure.  The key challenge is a process I call “Bridging the Gap”, which uses several applications to create a bridge between the physical and digital world, and helps create a seamless process.  So what is required?  How do you create the bridge?

 

On one side of the gap, you have your physical environment: file cabinets, inboxes, stacks of folders on desks, etc.  There are two components that facilitate the crossing:

  • Scanning Hardware – scanners allow the conversion of paper documents into digital documents or images.  Organizations can use scanning copiers, fax machines or dedicated scanners to digitize.
  • Capture Software – capture software works with the scanning hardware to create an efficient and automated bridging process.  It controls the flow of digitized documents, standardizing how they are routed, and using OCR, Barcodes, Advanced Data Extraction (ADE) and other features to automate the collection of information.  It spans the gap and creates a connection to the other side or the repository.

Once the gap has been spanned, the documents need to land somewhere, just as physical documents land in a file cabinet, an inbox on someone's desk or another location in the organization.  Below are the two components that exist on the far side of the gap:


  • Workflow Software – think of this as the digital inbox and outbox…on steroids.  Workflow Software is utilized to create a digital mirror of your physical processes.  It can move files around, create approval steps, automatically send email and perform logic that usually requires intervention by a human.  Some organizations don't have this component on the other side of the gap.
  • Repository –  Think of the repository as a temporary and permanent file cabinet that can hold files during a workflow process, or as an archive copy once the whole process is complete.  You can search, sort and organize, print, distribute and copy.  Most repositories can allow full text search, if the capture software has created a searchable file format, and also allow column based searching for specific criteria.

I have seen many organizations try to bridge the gap while missing one of the pieces above, or with a piece that cannot suit all their needs.  A missing component can impact the overall value of the system.  For example, take a scanning copier that an AP department uses to scan invoices.  They email themselves the scans, open them, rename them and then save them into their repository.  Without capture software to automate the naming and routing, this is a highly inefficient process.  Without capture, files are not made searchable through OCR, and this can also reduce efficiency during search.  Another example might be the lack of a repository that can provide all the bits and pieces an organization may require.  Take the organization that just saves PDFs to a network directory.  This may be fine for many organizations that merely need a simple archive to house their files.  But what about an audit event, or legal issue that may require extensive searching and sorting?


“Bridging the Gap” and creating an office with less paper can provide an organization countless benefits with proper planning and design, and the inclusion of all the above components.

PDFs and SharePoint: What is recommended??

When scanning to SharePoint, capturing pre-existing images, and creating searchable PDFs, there are several things you should make sure you can enable in your capture software.  Below is a laundry list:

  1. PDF + Hidden Text is the preferred format.  Most scanning devices/applications will allow you to create PDFs, but note that these are image PDFs, and not searchable.  The de facto standard right now in the imaging industry is the PDF image + Hidden Text format.  This requires a capable OCR engine to produce the text layer, and is what I call a “suitcase” document: it contains a pristine image, and a hidden text layer for search.  
  2. Ensure your document capture software can import PDF files.   Just about every organization has pre-existing scanned PDF files.  In almost every case, these are purely PDF Image format, and cannot be searched, or crawled through the PDF iFilter in SharePoint.  If your capture application can import and process PDFs, you have the ability to harvest these documents, extract metadata, and OCR them to create searchable PDFs, or PDF Image + Hidden Text format.
  3. Require the ability to create and populate custom PDF headers.   PDF headers allow custom metadata to be built into the core PDF file.  Why is this necessary?  Once again, I always go back to the “suitcase” analogy: you always want to pack everything you need.  If you create a searchable PDF, and pack metadata into the headers, the file is now an all-inclusive data package.  Headers speed up search, and provide flexibility if you ever export files, or import your PDFs into another system.
  4. Require support for the latest standard.  PDF/A (ISO 19005) is the latest and greatest standard, and the goal of this ISO standard was to build a file format suitable for long term archiving.  Ensure you can support this option.

SharePoint Scanning Planning – Part 2 – Separation

Document Examination and Separation

One of the key steps in preparing for document scanning and capture is to identify how you will separate or split documents.  What is separation and how does it work?  Details below:

For those of you that are new to document management and capture, document separation is the notion of how we can determine when a document begins and ends.  With most simple scanning software, this process is easy.  You load a single document in the feeder, click scan, and when it is done, you name it and save it.  With advanced capture, you can load multiple documents into the feeder, scan them all at once, and use a separation method to split them into individual digital documents.    This is a massive time saver.  Imagine loading 20 individual documents into a scanner one at a time, scanning each individually, and then entering information about each.   Below are some key separation methods any advanced capture suite should have:

Fixed Page Count Separation – This allows you to split based on a certain page count.  So if you scan a stack of 100 pages made up of two page forms, you will have 50 separate documents in your capture interface.
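Fixed page count separation is simple enough to sketch in a few lines.  The snippet below assumes the scanned batch is just an ordered list of page identifiers; the function name is made up for the example.

```python
def split_by_page_count(pages, pages_per_document):
    """Split an ordered batch of pages into documents of a fixed length."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    return [
        pages[start:start + pages_per_document]
        for start in range(0, len(pages), pages_per_document)
    ]


# A batch of 100 scanned pages made up of two page forms.
batch = [f"page-{n}" for n in range(1, 101)]
documents = split_by_page_count(batch, 2)
print(len(documents))  # 50
```

If the batch length is not a multiple of the page count, the last document in this sketch is simply shorter, which is usually a sign that a page was missed or double fed.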

Barcode Separation – probably the most pervasive separation method is a barcode separator.  Place a sheet with a specific barcode pattern between each document, and you are off to the races.  To give you the most flexibility, applications should support the following enhanced barcode separation methods:

  • Separate on any barcode
  • Separate on specific barcode terms and patterns
  • Separate on barcode type
  • Separate on barcode count
  • Separate on a certain number of barcodes on a page
  • Separate when a barcode changes

You want to make sure your barcode engine supports 1D and 2D barcodes without the purchase of any expensive modules or add-ons, and it should also have a simple feature that lets you split 2D barcodes and identify separation terms.
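Two of the barcode methods above, separating on a specific pattern and separating when a barcode changes, can be sketched as follows.  The sketch assumes a barcode engine has already decoded each page and that every page is a dictionary with an optional "barcodes" list; that page structure and the function names are assumptions for illustration only.

```python
import re


def separate_on_barcode(pages, pattern=None, drop_separator=True):
    """Start a new document at every page carrying a matching barcode.

    pattern=None means "separate on any barcode"; otherwise the barcode
    value must fully match the regular expression.
    """
    documents, current = [], []
    for page in pages:
        values = page.get("barcodes", [])
        is_separator = any(
            pattern is None or re.fullmatch(pattern, value) for value in values
        )
        if is_separator:
            if current:
                documents.append(current)
            # Separator sheets are usually thrown away; keep them if asked.
            current = [] if drop_separator else [page]
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


def separate_on_barcode_change(pages):
    """Start a new document only when the first barcode value on a page changes."""
    documents, current, last_value = [], [], None
    for page in pages:
        values = page.get("barcodes", [])
        value = values[0] if values else None
        if value is not None and value != last_value:
            if current:
                documents.append(current)
            current = []
            last_value = value
        current.append(page)
    if current:
        documents.append(current)
    return documents


sheets = [
    {"id": 1, "barcodes": ["SEP"]}, {"id": 2}, {"id": 3},
    {"id": 4, "barcodes": ["SEP"]}, {"id": 5},
]
print([[p["id"] for p in doc] for doc in separate_on_barcode(sheets, "SEP")])

forms = [
    {"id": 1, "barcodes": ["A100"]}, {"id": 2, "barcodes": ["A100"]},
    {"id": 3}, {"id": 4, "barcodes": ["A200"]},
]
print([[p["id"] for p in doc] for doc in separate_on_barcode_change(forms)])
```

The first call drops the separator sheets and yields pages 2-3 and page 5 as two documents; the second keeps every page and only splits when the barcode value moves from A100 to A200.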

Patch Code Separation – So what the heck is a patch code?  Just an old school horizontal barcode.  Below is an example.  If you work in the medical field, most medical billing forms will have these on them, and some scanners actually support using patch codes to shift scanner settings during the scanning process.  For flexibility, choose an application that supports patch code separation.

Scanning Patch Code

Patch Code Example

Optical Character Recognition (OCR) Separation – OCR is the process of converting a scanned or imported image into searchable text.  OCR separation searches for a key word, term or phrase on the document, and will recognize that page as the first page in a new document.  This is a preferred method, as you don’t have to kill trees to print cover sheets, and it makes document preparation simple (no inserting separator sheets).  For example, if you are scanning contracts, and you want to split when you find an 8 digit contract number in the right hand corner, this comes in very handy.  There are several key requirements your application must meet to make sure you get high separation accuracy:

  • Scan at 300 DPI and use an app that has image processing software to clean up the page.  Also, your image processing engine must allow processing of imported PDFs and TIFFs if you plan to harvest documents.  Some image correction/processing engines only work with scanners.
  • Ensure your capture application allows you to use expression matching (regular expressions) so you have the utmost flexibility in finding separation patterns.
  • Character sets are key.  These provide the ability to tell the OCR engine the type of characters you are looking for (A-Z, 0-9, etc), so if it misidentifies a character, it auto-corrects the information.
  • Finally, top line applications also allow you to separate when OCR terms change.  So you can look for that contract number, and only split when you find a new one.
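The requirements above (expression matching, character sets, and splitting only when the term changes) can be combined in a small sketch for the 8 digit contract number example.  It assumes OCR has already produced the text of the zone on each page.  The look-alike substitution table is a deliberately crude stand-in for what a real OCR engine does with a digits-only character set, and all names here are hypothetical.

```python
import re

# Assumed look-alike corrections when the field is known to be digits only.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

# Eight characters that are digits or common digit look-alikes, standing alone.
CONTRACT_CANDIDATE = re.compile(r"(?<![0-9A-Za-z])[0-9OoIlSB]{8}(?![0-9A-Za-z])")


def find_contract_number(zone_text):
    """Return the corrected 8 digit contract number in the zone, or None."""
    for match in CONTRACT_CANDIDATE.finditer(zone_text):
        raw = match.group()
        # Require at least one real digit so ordinary words are not accepted.
        if any(ch.isdigit() for ch in raw):
            return raw.translate(DIGIT_LOOKALIKES)
    return None


def separate_on_term_change(page_zone_texts):
    """Group page indexes into documents, splitting when the number changes."""
    documents, current, last_number = [], [], None
    for index, text in enumerate(page_zone_texts):
        number = find_contract_number(text)
        if number is not None and number != last_number:
            if current:
                documents.append(current)
            current = []
            last_number = number
        current.append(index)
    if current:
        documents.append(current)
    return documents


zones = ["Contract 2O1OOO17", "page 2 of 3", "Contract 20100017", "Contract 20100018"]
print(separate_on_term_change(zones))  # [[0, 1, 2], [3]]
```

The first page's misread letter O characters are corrected to zeros, so the third page is recognized as belonging to the same contract rather than starting a new document.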

Intelligent Character Recognition (ICR) Separation – ICR is the process of converting scanned images of hand printing to text.  This method can be utilized to split pages when certain patterns in hand printing are detected.  Note:  all of the features required to ensure accuracy for OCR separation should also be considered if you utilize this method.

Document Import and Separation – There are several separation methods that can be key to success if you need to import large volumes of documents, or you want to process documents scanned from copiers, network scanners, or fax machines.  Below are the separation methods required for any document capture from imported files:

  • New File Separation – This method of separation will look at a directory, pick up files, and maintain each new file as its own digital document.
  • Folder-based separation – This is a key method if you are importing documents and want to combine them based on the folder.  One example might be a law firm that has a folder structure of case documents on different subjects for the case and wants to combine each folder into a single PDF file.
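The difference between the two import methods can be shown with a short sketch.  It works on a plain list of relative file paths rather than a live directory, so the folder layout and file names below are invented for the law firm example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def new_file_separation(file_paths):
    """Every imported file stays its own document."""
    return [[PurePosixPath(path).name] for path in sorted(file_paths)]


def folder_based_separation(file_paths):
    """All files in the same folder are combined into one document."""
    groups = defaultdict(list)
    for path in sorted(file_paths):
        parsed = PurePosixPath(path)
        groups[str(parsed.parent)].append(parsed.name)
    return dict(groups)


imported = [
    "cases/smith/pleadings/complaint.tif",
    "cases/smith/pleadings/answer.tif",
    "cases/smith/discovery/request.tif",
]
print(new_file_separation(imported))
print(folder_based_separation(imported))
```

With new file separation the three scans remain three documents; with folder-based separation the two pleadings are merged into one document and the discovery request stands alone.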

Blank Page Separation – I only mention this as I would always, always avoid it unless absolutely necessary, especially if you are scanning in duplex.  Most implementations of this method, unless operated under strict preparation by knowledgeable operators, become an absolute mess. (Just my humble opinion ;)   )

Separation Scripting – Finally, for those rare and special occasions, you always want a product that has a pre-built scripting interface for customizing the whole process if necessary.  Now let me be clear: not a sales rep “Yeah we can do that” (which usually means $20,000 in professional services), but a product that has simple hooks into the separation function and asks for a simple “yes or no” based on some parameter or criteria, which anyone with basic scripting skills can write.  When would you use something like this?  Usually for very complex jobs where the original documents cannot be modified, but you need to put some logic in place to split documents.
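As an illustration of what such a hook might look like, here is a minimal sketch in which the capture loop asks a user-written function a yes/no question for every page.  The hook signature, the page dictionary and the "Page 1 of" rule are all assumptions for the example; real products expose their own scripting interfaces.

```python
def separate_with_hook(pages, starts_new_document):
    """Ask the hook, page by page, whether a new document should start."""
    documents = []
    for page in pages:
        if not documents or starts_new_document(documents[-1], page):
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def first_page_marker(current_document, page):
    """Hypothetical rule: split wherever the OCR text says 'Page 1 of'."""
    return "Page 1 of" in page["text"]


scanned = [
    {"id": 1, "text": "Statement Page 1 of 2"},
    {"id": 2, "text": "Statement Page 2 of 2"},
    {"id": 3, "text": "Statement Page 1 of 1"},
]
result = separate_with_hook(scanned, first_page_marker)
print([[p["id"] for p in doc] for doc in result])  # [[1, 2], [3]]
```

The point of the design is that the hook only answers yes or no; everything else about the batch stays under the control of the capture application.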

The last separation topic I want to cover is something called triggered separation.  Let me set the stage on this one, and describe a process which is near and dear to every accounting manager’s heart: invoices.  So you have a stack of invoices, some single page, some multi-page, and you are struck with a dilemma.  If I use barcode separators, and I have 100 single page invoices, do I really have to put 100 barcode separators between them all?  Separation triggers allow you to scan single page and multi-page documents all together.  So in this example, you can stack your singles, and then put separators between the invoices in your stack of variable length invoices.  Put a trigger sheet between the two stacks (this tells the capture software to switch from single page separation to barcode-based separation), and scan the whole stack in one fell swoop.  This is a huge time saver in high volume environments, and can allow you to also build redundant separation logic, so you get the highest accuracy in separation with the least amount of document preparation.  Phewwww.  That was geeky.
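The invoice scenario can be sketched as a small state machine.  The sketch assumes each sheet in the batch has already been classified as an ordinary page, a barcode separator or a trigger sheet; the "kind" labels and the function name are invented for illustration.

```python
def triggered_separation(sheets):
    """Single page separation until a trigger sheet, then barcode separation."""
    documents, current = [], []
    mode = "single_page"
    for sheet in sheets:
        kind = sheet["kind"]
        if kind == "trigger":
            # The trigger sheet itself is discarded and switches the mode.
            if current:
                documents.append(current)
                current = []
            mode = "barcode"
        elif mode == "single_page":
            documents.append([sheet["id"]])
        elif kind == "separator":
            if current:
                documents.append(current)
                current = []
        else:
            current.append(sheet["id"])
    if current:
        documents.append(current)
    return documents


stack = [
    {"id": 1, "kind": "page"}, {"id": 2, "kind": "page"}, {"id": 3, "kind": "page"},
    {"id": "T", "kind": "trigger"},
    {"id": 4, "kind": "page"}, {"id": 5, "kind": "page"},
    {"id": "S", "kind": "separator"},
    {"id": 6, "kind": "page"}, {"id": 7, "kind": "page"}, {"id": 8, "kind": "page"},
]
print(triggered_separation(stack))  # [[1], [2], [3], [4, 5], [6, 7, 8]]
```

Three single page invoices need no separator sheets at all, and only one separator is needed between the two multi-page invoices that follow the trigger.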

Do you really need all of this?  Does separation have to be that complex?  The whole goal here is to have as much as you possibly can in the tool kit to ensure you can meet all the capture needs within your organization.  I liken it to buying the base model of a car with no accessories, and then wishing every day you had power windows, the iPod Kit, cruise control, 4WD, etc.

So now you have examined your documents, and figured out how to efficiently scan and split.

SharePoint Scanning Planning – Part 1 – Storage and Sizing

With SharePoint Scanning and Capture, as with any project, planning is essential to success.  If you are going to use scanning software to send scanned images to a SharePoint Content Database, you need to lay some ground work.  This is the first in a series of planning articles.

One of the key areas of planning for any scanning/capture implementation is sizing and storage.   Many of the customers we work with have no real grasp on the volume of paper they deal with on a day to day basis, and when they make the migration to digitizing their paper, they are often quite surprised at the amount of paper they push through the system.  Obviously, this can cause some serious issues on many different fronts.   So how do you estimate the amount of paper?  There are several key conversion factors used by the document management industry, as outlined below:

 

Description                              Number of Pages          Storage
1 Scanned Page – 8.5 x 11                1                        50KB
1 Scanned Page – 11×17                   1                        100KB
1 File Cabinet – 4 drawers               10,000                   500MB
1 Box                                    2,500                    125MB
1 Linear Inch                            100                      5MB
1 E Size Engineering Drawing (48×36)     16 (as 8.5×11 pages)     800KB

This table is a basic planning tool, and can be used as a starting point.  One thing to remember is that these are all standard pages: full text pages, not full image magazine pages.  The other thing to keep in mind is that, for boxes and file cabinets, we have listed the average number of pages contained within.  In the imaging world, we deal with images, not pages.  What is the difference?  A page may have 2 sides, which are converted digitally into 2 images.  So effectively, if you are scanning a box of double sided pages, you will have to double the storage required.
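The conversion factors lend themselves to a quick calculation.  The sketch below uses the page counts and the 50KB per image figure from the table, treats 1MB as 1,000KB to match the table's rounding, and adds a hypothetical double_sided_fraction parameter to account for pages that produce two images.

```python
KB_PER_IMAGE = 50  # one standard 8.5 x 11 text page, from the table
PAGES_PER_UNIT = {"file_cabinet": 10_000, "box": 2_500, "linear_inch": 100}


def estimate_storage_mb(unit, quantity, double_sided_fraction=0.0):
    """Estimate storage in MB for a quantity of cabinets, boxes or inches.

    double_sided_fraction: share of pages printed on both sides (0.0 to 1.0);
    each such page yields two images instead of one.
    """
    pages = PAGES_PER_UNIT[unit] * quantity
    images = pages * (1 + double_sided_fraction)
    return images * KB_PER_IMAGE / 1000


print(estimate_storage_mb("box", 4))                             # 500.0
print(estimate_storage_mb("box", 4, double_sided_fraction=1.0))  # 1000.0
```

Four boxes of single sided pages match one file cabinet at 500MB, and the same boxes double to 1,000MB if every page is double sided.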

Some other key factors that can contribute to storage and sizing:

DPI Setting – one of the key questions we always receive is What DPI should I set on my scanner?  For most basic scanning and archive applications, you can set your scanner to 200 DPI.  If you are doing OCR or any type of advanced data extraction, you always want a 300 DPI image for maximum accuracy.  Anything beyond that is just a space killer, will slow down your process and really bloat your files.

Black and White, Greyscale and Color – always use black and white scanning to keep file sizes at an absolute minimum.  Greyscale and color scanning should only be used when absolutely necessary, as file sizes are just crazy.  Below is a table of file sizes for the same letter.  The letter was about 50% page coverage.

 

Scanning Mode/DPI              File Size
Black and White – 200 DPI      26K
Black and White – 300 DPI      38K
Black and White – 400 DPI      51K
Black and White – 600 DPI      80K
Greyscale – 300 DPI            301K
Color – 300 DPI                577K

Image Processing – image cleanup can significantly reduce file sizes, and it is very important to use this feature whenever you can.  Despeckle, deshade, border removal, etc. will eliminate unnecessary noise in scanned images, and reduce your storage requirement by 10-30% depending on the quality of your documents.

Image Format – There is a lot of misinformation on the market about TIFF versus PDF.  I always hear “We want to store as TIFF because PDFs are just too big.”  Just not the case.  An image scanned to PDF is just a TIFF in PDF clothing (Or a PDF wrapper to be more exact).  The PDF overhead is almost negligible.  The de facto standard in imaging today is rapidly becoming the PDF image with hidden text.  This gives you a nice little file with the pristine image, and converted OCR text in the background.  The text layer adds negligible size to the file.

So now, with all this info, you can estimate volume in images, and then come up with required storage on a monthly, yearly or project basis.
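As a rough worked example of that estimate, the sketch below plugs the sample letter's file sizes into a monthly calculation.  The sizes come from the table above and apply only to that one letter at about 50% coverage; the 21 working days per month and the cleanup_savings parameter are assumptions you should replace with your own numbers.

```python
# Measured size in KB of the same sample letter, keyed by (mode, DPI).
SAMPLE_KB = {
    ("bw", 200): 26, ("bw", 300): 38, ("bw", 400): 51, ("bw", 600): 80,
    ("grey", 300): 301, ("color", 300): 577,
}


def monthly_storage_mb(images_per_day, mode, dpi, working_days=21, cleanup_savings=0.0):
    """Estimate monthly storage in MB (1MB taken as 1,000KB).

    cleanup_savings: fraction saved by image processing, e.g. 0.1 to 0.3.
    """
    kb_per_image = SAMPLE_KB[(mode, dpi)] * (1 - cleanup_savings)
    return images_per_day * working_days * kb_per_image / 1000


bw_month = monthly_storage_mb(500, "bw", 300)
color_month = monthly_storage_mb(500, "color", 300)
print(bw_month, color_month)  # 399.0 6058.5
```

At 500 images a day, the same workload needs roughly 0.4GB a month in black and white versus about 6GB in color, which is why the scanning mode matters so much for sizing.  Multiply by 12 for a yearly figure.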

SharePoint Governance and Document Capture

So, SharePoint Governance is the big topic of late, and the ability to control and manage your SharePoint infrastructure is extremely important.  The whole concept applies even more so to document scanning and capture, and there are some key requirements you need to look for in any application that will be your onramp to SharePoint 2010:

  • Make sure that the application has the ability to control the whole scanning process from start to finish
  • A Quality Assurance module is absolutely required to allow for validation and checking of not only image quality, but also the quality of your data
  • Fields within the scanning application should map not only to columns in SharePoint, but also to file naming standards and folder naming.  This will allow for standardization of the overall process.
  • Fields should also be allowed to map to libraries for dynamic, and controlled placement of documents into specific repositories
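The field mapping requirements above can be illustrated with a small sketch that turns captured index fields into a library, a folder path, a file name and column values.  It does not call SharePoint at all; the templates, field names and the list of characters replaced in names are assumptions for the example, so check the naming rules for your own SharePoint version.

```python
import re

# Characters assumed to be unsafe in file and folder names.
INVALID_CHARS = re.compile(r'[~"#%&*:<>?/\\{|}]')


def clean(value):
    """Replace unsafe characters so a field value can be used in a name."""
    return INVALID_CHARS.sub("_", str(value)).strip()


def build_destination(fields, library_field, folder_template, name_template, extension="pdf"):
    """Map captured fields to a library, folder, file name and column values."""
    safe = {key: clean(value) for key, value in fields.items()}
    return {
        "library": safe[library_field],
        "folder": folder_template.format(**safe),
        "file_name": name_template.format(**safe) + "." + extension,
        "columns": dict(fields),  # original values go to the columns
    }


captured = {
    "department": "Accounts Payable",
    "vendor": "Acme & Sons",
    "invoice_number": "712345",
    "year": "2010",
}
destination = build_destination(
    captured, "department", "{vendor}/{year}", "{vendor}_{invoice_number}"
)
print(destination["library"], destination["folder"], destination["file_name"])
```

Because the naming rules live in two templates rather than in each operator's head, every invoice from the same vendor lands in the same place with the same naming pattern.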

The overall control of this process will ensure organization, and allow governance of end user operations when capturing documents to the SharePoint Platform.

A Little more on Scanning and Capture Models

So, in examining a corporate strategy on how best to deploy a scanning and capture solution for SharePoint, there are typically 3 models:

  • Centralized
  • De-centralized
  • Distributed

Each model has its own pros/cons, and below I will examine each, and dive into some detail.

 

Centralized

Ah, the centralized model.  Some call this old school scanning and capture, as for many years, this was the only way to get the job done, and convert your paper to digital form.  This model provides a centralized scanning center to provide mass conversion for the organization.  The operation can be run by in house personnel, be managed by a services provider in house, or be outsourced to a scanning service bureau.  It requires high volume/high speed hardware, and typically utilizes advanced capture software to allow for the utmost in automation and efficiency.  The software and hardware operators are typically highly trained, and there are usually only a few of them.  Paper and/or digital media is shipped to the centralized location and processed through a set, standardized capture workflow.

Centralized Pros

  • Easily standardized process due to a limited number of skilled/trained scan operators
  • High speed hardware/software results in minimal processing time once paper is received
  • Centralized reporting and control of overall process
  • No loading on WAN infrastructure
  • Centralized backup and restore

Centralized Cons

  • Usually a high time delay for availability of documents
  • High cost due to shipping of documents
  • High maintenance costs
  • High training costs to bring on new operators
  • Disaster recovery planning issues if centralized site is down
  • Operators are typically not knowledgeable in the documents they are indexing

 

Decentralized

Over time, as bandwidth and scanning hardware/software prices went down, the obvious move was to decentralize the whole scanning and capture process.  This move placed scanning in the branches, and allowed the whole document capture process to be performed by those who had working knowledge of the documents.  Smaller, desktop class hardware could be used, and most capture companies made batch scanning and upload to the centralized repository simple to accomplish.

Decentralized Pros

  • Scan operators are well versed in the documents they scan
  • Documents are available almost immediately
  • No shipping or transfer costs for documents
  • Branch control of the whole scanning process

 

Decentralized Cons

  • Standardization can be an issue
  • No centralized control or reporting
  • WAN Bandwidth consumption can be high
  • Licensing costs can be high depending on software utilized

 

Distributed

The advance of network-based scanning devices and the lowering of bandwidth pricing led to the newest model, the Distributed Model.  Distributed Scanning allows for just about anyone in the organization to walk up to a network scanning device/scanning copier/fax machine and send documents to a repository.  The devices are typically multi-faceted, and along with repository integration, can provide scan to network folder, FTP and email.  Collaborative back-end systems, like Microsoft SharePoint, lend themselves nicely to this model, as they allow anyone to participate in a Document Workspace.

Distributed Pros

  • Put scanning in the hands of everyone in the organization
  • Provides a great launching pad for collaborative solutions
  • Simple, easy to use interfaces allow for minimal training and quick adoption
  • Capture and indexing is now in the hands of the true document owner
  • One-to-many solution provides a single device to service many users

Distributed Cons

  • Lack of standardization without software addition
  • Security and document control can be major issues
  • Bandwidth from smaller branches can be a problem with larger scans
  • Lack of hardware integrations with back-end systems

So, most organizations today are combining the above models to create a Hybrid Scanning and Capture solution, and leveraging all the strengths together to minimize the weaknesses of any one model.   Another strategy is to tie scanning models to specific business processes, as most lend themselves nicely to specific scanning and capture workflows.

For more information, view a webinar on Distributed Scanning and Capture at the link below:

Distributed Scanning and Capture Webinar