What is “Bridging the Gap”?

 The movement towards an office with less paper and more efficiency can be quite difficult, and with the wrong tools it can end in failure.  The key challenge is a process I call “Bridging the Gap”, which uses several applications to create a bridge between the physical and digital worlds, and helps create a seamless process.  So what is required?  How do you create the bridge?

 

On one side of the gap, you have your physical environment: file cabinets, inboxes, stacks of folders on desks, etc.  There are two components that facilitate the crossing:

  • Scanning Hardware – scanners allow the conversion of paper documents into digital documents or images.  Organizations can use scanning copiers, fax machines or dedicated scanners to digitize.
  • Capture Software – capture software works with the scanning hardware to create an efficient and automated bridging process.  It controls the flow of digitized documents, standardizing how they are routed, and using OCR, Barcodes, Advanced Data Extraction (ADE) and other features to automate the collection of information.  It spans the gap and creates a connection to the other side or the repository.

Once the gap has been spanned, the documents need to land somewhere, just as physical documents land in a file cabinet, an inbox on someone's desk or another location in the organization.  Below are the two components that exist on the far side of the gap:


  • Workflow Software – Think of this as the digital inbox and outbox…on steroids.  Workflow software is used to create a digital mirror of your physical processes.  It can move files around, create approval steps, send email automatically and perform logic that usually requires human intervention.  Some organizations don't have this component on the other side of the gap.
  • Repository – Think of the repository as both a temporary and a permanent file cabinet: it can hold files during a workflow process, or keep an archive copy once the whole process is complete.  You can search, sort and organize, print, distribute and copy.  Most repositories allow full text search, provided the capture software has created a searchable file format, and also allow column-based searching for specific criteria.

I have seen many organizations try to bridge the gap while missing one of the pieces above, or with a piece that cannot meet all their needs.  A missing component can reduce the overall value of the system.  For example, take a scanning copier that an AP department uses to scan invoices.  They email themselves the scans, open them, rename them and then save them into their repository.  Without capture software to automate the naming and routing, this is a highly inefficient process.  Without capture, files are also not made searchable through OCR, which reduces efficiency during search.  Another example might be a repository that cannot provide all the bits and pieces an organization may require.  Take the organization that just saves PDFs to a network directory.  This may be fine for many organizations that merely need a simple archive to house their files.  But what about an audit event, or a legal issue that may require extensive searching and sorting?


“Bridging the Gap” and creating an office with less paper can provide an organization countless benefits with proper planning and design, and the inclusion of all the above components.

Imaging File Size Comparison for Planning – Color and DPI

When planning for scanning to SharePoint, here is a quick matrix for the impact DPI and color can have on file size, and the size of your content DBs.

Scanning Mode/DPI            File Size
Black and White – 200 DPI    26K
Black and White – 300 DPI    38K
Black and White – 400 DPI    51K
Black and White – 600 DPI    80K
Greyscale – 300 DPI          301K
Color – 300 DPI              577K
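
To make the matrix easier to compare at a glance, here is a minimal Python sketch that expresses each setting as a multiple of a baseline. The sizes are copied from the matrix above; the function name `relative_size` and the choice of black and white at 200 DPI as the baseline are my own assumptions for illustration.

```python
# Per-page file sizes in KB, copied from the matrix above.
PAGE_SIZE_KB = {
    ("Black and White", 200): 26,
    ("Black and White", 300): 38,
    ("Black and White", 400): 51,
    ("Black and White", 600): 80,
    ("Greyscale", 300): 301,
    ("Color", 300): 577,
}


def relative_size(mode, dpi, baseline=("Black and White", 200)):
    """Return how many times larger a page is than the baseline setting.

    The baseline is an assumption chosen for illustration, not a standard.
    """
    return PAGE_SIZE_KB[(mode, dpi)] / PAGE_SIZE_KB[baseline]


if __name__ == "__main__":
    for (mode, dpi), size_kb in PAGE_SIZE_KB.items():
        ratio = relative_size(mode, dpi)
        print(f"{mode} at {dpi} DPI: {size_kb}K ({ratio:.1f}x baseline)")
```

By these numbers, a color page at 300 DPI is roughly 22 times the size of a black and white page at 200 DPI, which is the kind of difference that shows up quickly in a content DB.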

5 Tips for Optimizing Image Size When Scanning to SharePoint

I find quite a few customers are not optimizing their scanning process, and are creating very large image files, slamming their network and bloating their content databases.  Below are 5 tips to live by when scanning to SharePoint:

  1. Scanning at anything greater than 300 DPI is unnecessary.   DPI can be a huge killer, and really bloat your file size.  For most instances, 200 DPI is perfectly fine for archive purposes.  If you are using OCR or performing data extraction, 300 DPI will give you a great quality image.  Anything beyond that will give you no better quality, but increases the file size significantly.
  2. Use color and grayscale sparingly.  Color and grayscale files can be massive, and can be a huge burden on many different aspects of any SharePoint system.  Use them only when absolutely necessary, as black and white images are perfectly acceptable in almost every instance.
  3. Image processing is key.  Having an image processing engine that can despeckle, deshade and remove black borders will reduce file size and conserve storage.
  4. Check your copiers.  Most copiers today like to show off their fancy color capabilities and typically come with default settings to create color scans.  Check DPI and color settings to make sure your users are not unknowingly creating massive files.
  5. TIFF or PDF?  This can be a whole additional conversation, and a possible next post.  There really is no difference in file size for the same scanned image, and I find PDF is becoming the de facto standard in imaging.

 

 

PDFs and SharePoint: What is recommended?

When scanning to SharePoint, capturing pre-existing images, and creating searchable PDFs, there are several things you should make sure you can enable in your capture software.  Below is a laundry list:

  1. PDF + Hidden Text is the preferred format.  Most scanning devices/applications will allow you to create PDFs, but note that these are image PDFs, and not searchable.  The de facto standard right now in the imaging industry is the PDF image + Hidden Text format.  This requires a capable OCR engine to produce the text layer, and is what I call a “suitcase” document: it contains a pristine image, and a hidden text layer for search.  
  2. Ensure your document capture software can import PDF files.   Just about every organization has pre-existing scanned PDF files.  In almost every case, these are purely PDF Image format, and cannot be searched, or crawled through the PDF iFilter in SharePoint.  If your capture application can import and process PDFs, you have the ability to harvest these documents, extract metadata, and OCR them to create searchable PDFs, or PDF Image + Hidden Text format.
  3. Require the ability to create and populate custom PDF headers.   PDF headers allow custom metadata to be built into the core PDF file.  Why is this necessary?  Once again, I always go back to the “suitcase” analogy: you always want to pack everything you need.  If you create a searchable PDF, and pack metadata into the headers, the file is now an all-inclusive data package.  Headers speed up search, and provide flexibility if you ever export files, or import your PDFs into another system.
  4. Require support for the latest standard.  PDF/A is the latest and greatest standard, and the goal of this ISO standard was to build a file format suitable for long term archiving.  Ensure you can support this option.

PSIGEN Releases PSI:Capture 4.0

Ok, talk about a game changer.  Take a look at version 4.0 of PSI:Capture.  The new release from the mature document capture company has over 100 new features.  It provides the ability to perform Intelligent Character Recognition (ICR) to read hand printing, a whole set of new forms processing technology, enhanced Optical Character Recognition (OCR) for SharePoint, and Dynamic Routing for SharePoint.  For a list of features and functions, go to Document Capture 4.0-PSI:Capture.

 

What scanning and capture model should you choose?

Model, what the heck does that mean?

In traditional scanning and capture, there are 3 well recognized scanning models: centralized, decentralized and distributed.  Below I will cover each in detail:

  • Centralized – Ah, centralized…the old school method.  Imagine a room with ten blue hairs, feeding big iron scanners, and the hum of paper over rollers filling the air.  This is the traditional scanning model, where paper is shipped to a centralized location, and a few highly trained operators with high speed scanners capture and process paper.  This process is easily standardized, but usually the operators are not the knowledge workers that know most about the documents.
  • Decentralized – As bandwidth got cheaper, companies began to look for ways to put the scanning task into the hands of the end users.  The decentralized model provides branch level scanning, usually with smaller desktop hardware, and gives more control to the knowledge workers.  Things get scanned more quickly, and the indexing process is less error prone.
  • Distributed – With the advent of network connected scanners, copiers and fax machines, distributed scanning has evolved to be the model of choice for SharePoint.  It puts the scanning and capture task in the hands of everyone in the organization.  It does have some drawbacks though:  usually you need some software to standardize and govern the whole process, security becomes an issue with scanner availability, and most manufacturers have limited integration options for ECM.

Typically, a SharePoint Scanning and Capture environment requires some type of Hybrid Solution that can be a mesh of all models.  Beware, you will need a capture application that can prosper in all different types of environments.

 

Scanning to SharePoint: Capture Drives Search

So what is the most important part of any SharePoint scanning and capture implementation?

In any ECM system, Capture drives search.

I cannot emphasize this point enough, and time spent in assessing the front end process of any organization will pay huge dividends over time.   Below are 3 key focus areas to examine during the planning phase of any SharePoint ECM / Scanning and Capture implementation:

  1. Standardization is key. Define your site, library, folder and column structure on day one.  Use a capture technology that can create a repeatable, standardized capture process regardless of the user or device.  Ensure that the technology you utilize has the ability to create dynamic library, folder name and file name structures.  Don’t settle for hard-coded folder and file naming structures where you have no control.  This is absolutely critical to “findability” for all users.
  2. Automation creates repeatable processes. The less human intervention, the better.  Technologies like Advanced Data Extraction (ADE) provide automated column population, field validation and exception processing.  Automating the whole collection process drives correct search criteria, and allows for the utmost in efficiency.
  3. Create PDF Image with Hidden text through OCR.  Column based search is great for most of our needs, but to provide the most powerful search repository, adding Optical Character Recognition (OCR) to any process is critical.  Choose a technology that provides multiple OCR engine options.  Why?  Not all OCR engines are optimized for all operations.  Some are built for speed, some built for accuracy, others are just built.
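
To make the first point about dynamic library, folder and file name structures concrete, here is a minimal Python sketch of the idea. It is not how any particular capture product works: the function names `sanitize` and `build_target_path`, the template syntax and the field names are all hypothetical, and the set of characters replaced is an assumption chosen for illustration rather than SharePoint's exact naming rules.

```python
import re


def sanitize(value):
    """Replace characters that are awkward in folder and file names.

    The character set here is an illustrative assumption, not an official list.
    """
    cleaned = re.sub(r'[\\/:*?"<>|#%&{}~]+', "_", str(value))
    return cleaned.strip(" ._")


def build_target_path(template, metadata):
    """Fill a naming template with sanitized index values.

    Raises KeyError if the template names a field that metadata lacks,
    so incomplete documents can be sent to exception handling.
    """
    clean = {key: sanitize(val) for key, val in metadata.items()}
    return template.format(**clean)


if __name__ == "__main__":
    # Hypothetical template and index values for an AP invoice.
    template = "{library}/{vendor}/{year}/INV-{invoice_no}.pdf"
    fields = {
        "library": "AP Invoices",
        "vendor": "Acme, Inc.",
        "year": 2010,
        "invoice_no": "10/442",
    }
    print(build_target_path(template, fields))
```

Because the structure lives in one template rather than in each user's habits, every document lands in a predictable place regardless of who scanned it or on which device.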

Questions to ask before you start your SharePoint scanning, imaging or capture project

So you want to use Microsoft SharePoint as storage for scanned images? Take a quick breath and don’t charge in too fast, as there are many facets of this type of project that need to be considered.

What type of volume are you scanning on a daily basis?

  
You need to take a deep dive into departmental and end user needs, and really look at the volume of pages they need to image and capture. This brings up a point I discuss on a daily basis: Do you want to scan or capture? You may read this and wonder what in the world I am talking about, so here is an explanation:
Let’s create a definition and define a feature set for scanning applications. A scanning application is just a means to take paper, and quickly and easily convert it from paper to digital form. They are well suited to environments with very basic needs, and what I call “onesie-twosie” scanning, or low volume environments. Their feature sets provide very basic functionality, and may allow the use of basic separation, and very basic integrations with SharePoint. The majority of scanning hardware vendors bundle these applications with their hardware, although there are vendors that have taken it to the next level, and provide enhanced scanning capabilities beyond the typical bundled software.
Document Capture software can be utilized for basic scanning needs, but takes you to a whole new level from a “capture” perspective. These applications typically have a number of ways to “slice and dice” documents, and really focus on efficiency, and minimizing the time required to scan, index and capture data. Capture software provides numerous ways to automatically populate columns, including barcode reading, database lookups, OCR, and data extraction. True capture applications provide integration with scanners, folders with images, SharePoint WebDAV folders, etc. Any organization that is serious about processing paper documents, and wants to do it in the most efficient, standardized manner, should look seriously at advanced capture applications.
Capture applications are typically well suited to high volume situations or to situations where data can be extracted automatically. Scanning applications are suited to very simple operations, and usually to low volume.

What type of scanning device(s) are you going to utilize?

 
There are only a few applications out there that will provide you with the ability to scan from any type of device. Are you going to use network based scanning devices or direct connect scanners? Look into support in these specific areas:
• What type of drivers are supported? ISIS, TWAIN, and VRS should all be supported.
• Can hot folder functionality provide the auto-import and processing of all different image types, PDF included? Hot folder functionality should span local, network and WebDAV folders.
Beware of “panel” based applications. They are typically very static, and can create a line at the MFP/copier as people enter information about their documents at the actual device.
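
For readers unfamiliar with hot folders, the idea is simply a watched directory: anything dropped into it gets picked up and processed automatically. Here is a minimal Python sketch of the pickup step only. The function name `find_new_files` and the extension list are assumptions for illustration, and a real capture application would also need to handle files that are still being written, which this sketch does not.

```python
from pathlib import Path

# Assumed set of importable extensions; adjust to what your application accepts.
IMAGE_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}


def find_new_files(hot_folder, already_seen):
    """Return importable files in hot_folder that have not been reported yet.

    already_seen is a set of paths that is updated in place, so calling this
    function repeatedly (for example on a timer) reports each file only once.
    """
    new_files = []
    for path in sorted(Path(hot_folder).iterdir()):
        is_image = path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        if is_image and path not in already_seen:
            already_seen.add(path)
            new_files.append(path)
    return new_files
```

The same function works for a local folder or a mapped network share, since both are just paths; a WebDAV folder would need to be mapped or mounted first for this approach to apply.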


What output format do you want in the SharePoint libraries?

 
Scanning and capture applications today provide a broad array of image output formats, but the standard seems to be PDF Image with Hidden Text. This provides an all in one container for the original image and the searchable text. Install the PDF iFilter, and you have a searchable content store. There are some specialized usages that may require other formats. For instance, if you are importing JPEGs with EXIF tags with your advanced capture application, you will want to keep the original JPEG file with tags intact rather than performing a conversion.


What Scanning and Capture features will be necessary in your environment?


What features should you look for? This is the most difficult question of them all, and you really need to find an application that has a broad and expansive feature set to make sure you can cover today’s needs, and the needs of your organization in the future. This blog post is a great place to start:
Trends in Scanning and Capture




How much storage space will I require? Where are you going to store your images?


Just a few stats here to get you on your way:
• The standard scanned page can be estimated at 50K in size (at 300 DPI)
• A file cabinet contains between 10,000 and 12,000 pages
This can give you a quick idea of how much storage will be required, and let you do some growth estimation over time.
You should also use these numbers to see if you should use the SharePoint DB for content storage, or utilize Remote BLOB Storage (RBS). SharePoint 2010 with SQL Server 2008 R2 allows this without the need for additional software through the FILESTREAM provider.
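
The page-size and file-cabinet estimates above can be turned into a quick calculation. Here is a minimal Python sketch; the 50K page size and the 10,000 to 12,000 pages per cabinet come from the stats above, while the function names and the daily volume used in the example are hypothetical.

```python
PAGE_SIZE_KB = 50                     # estimated size of a scanned page at 300 DPI
PAGES_PER_CABINET = (10_000, 12_000)  # low and high page counts per file cabinet


def cabinet_storage_mb(cabinets):
    """Return (low, high) storage estimates in MB for a number of file cabinets.

    Uses 1 MB = 1024 KB.
    """
    low, high = (pages * PAGE_SIZE_KB * cabinets / 1024 for pages in PAGES_PER_CABINET)
    return low, high


def projected_storage_gb(start_gb, pages_per_day, working_days):
    """Project storage in GB after scanning pages_per_day for working_days."""
    added_gb = pages_per_day * working_days * PAGE_SIZE_KB / (1024 * 1024)
    return start_gb + added_gb


if __name__ == "__main__":
    low, high = cabinet_storage_mb(10)
    print(f"10 cabinets: {low:,.0f} MB to {high:,.0f} MB")
    # Hypothetical growth: 2,000 pages a day over 250 working days.
    print(f"One year of scanning: {projected_storage_gb(0, 2000, 250):.2f} GB")
```

A single cabinet works out to roughly 490 to 590 MB at these estimates, so backfile conversion is usually a smaller storage concern than day-forward volume in color or grayscale.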


How will I view images once they are in SharePoint?


Without a viewer add-on, SharePoint will require you to open an image to view pages. This can be problematic if you are serving up large image files. Definitely take a look at some of the image viewer add-ons for SharePoint. My favorite, VizitSP SharePoint Viewer, provides the ability to view/preview, annotate, image process, search (column based and full text) and have multiple images open in a tabbed view. This is an absolute necessity if you are going to give end users the best experience possible.

Just some questions to get the gears turning and make sure you get all the pieces to the puzzle.

SharePoint 2010 and Document Sets

So many good posts are coming out on the web for 2010. I am working to figure out all the angles on how to improve SharePoint as an imaging, scanning and capture platform. Document sets seem to be a great focal point. Here is a great article outlining them and how to use them:

Document Sets and SharePoint 2010