Imaging File Size Comparison for Planning – Color and DPI

When planning for scanning to SharePoint, here is a quick matrix showing the impact DPI and color can have on file size, and therefore on the size of your content DBs.

Scanning Mode/DPI            File Size
Black and White – 200 DPI    26 KB
Black and White – 300 DPI    38 KB
Black and White – 400 DPI    51 KB
Black and White – 600 DPI    80 KB
Greyscale – 300 DPI          301 KB
Color – 300 DPI              577 KB
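
To make the matrix easier to compare, here is a minimal sketch that expresses each measured size as a multiple of the smallest scan (black and white at 200 DPI). The sizes are copied from the table above; the dictionary and function names are only illustrative.

```python
# File sizes in KB, copied from the table above (same letter, different settings).
SCAN_SIZES_KB = {
    ("Black and White", 200): 26,
    ("Black and White", 300): 38,
    ("Black and White", 400): 51,
    ("Black and White", 600): 80,
    ("Greyscale", 300): 301,
    ("Color", 300): 577,
}


def size_multipliers(sizes_kb, baseline=("Black and White", 200)):
    """Return each size as a multiple of the baseline setting, rounded to 1 decimal."""
    base = sizes_kb[baseline]
    return {key: round(kb / base, 1) for key, kb in sizes_kb.items()}


if __name__ == "__main__":
    for (mode, dpi), factor in size_multipliers(SCAN_SIZES_KB).items():
        print(f"{mode} at {dpi} DPI: {factor}x the baseline")
```

By these measurements, a color scan at 300 DPI comes out at roughly 22 times the size of a black and white scan at 200 DPI, and greyscale at roughly 12 times.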

5 Tips for Optimizing Image Size When Scanning to SharePoint

I find quite a few customers are not optimizing their scanning process, and as a result are creating very large image files, slamming their network and bloating their content databases.  Below are 5 tips to live by when scanning to SharePoint:

  1. Scanning at anything greater than 300 DPI is unnecessary.  DPI can be a huge killer, and really bloat your file size.  For most instances, 200 DPI is perfectly fine for archive purposes.  If you are using OCR or performing data extraction, 300 DPI will give you a great quality image.  Anything beyond that gives you no meaningful gain in quality, but keeps increasing the file size.
  2. Use color and grayscale sparingly.  Color and grayscale files can be massive, and can be a huge burden on many different aspects of any SharePoint system.  Use them only when absolutely necessary, as black and white images are perfectly acceptable in almost every instance.
  3. Image processing is key.  Having an image processing engine that can despeckle, deshade and remove black borders will reduce file size and conserve storage.
  4. Check your copiers.  Most copiers today like to show off their fancy color capabilities and typically come with default settings to create color scans.  Check DPI and color settings to make sure your users are not unknowingly creating massive files.
  5. TIFF or PDF?  This could be a whole additional conversation, and possibly a future post.  There really is no meaningful difference in file size for the same scanned image, and I find PDF is becoming the de facto standard in imaging.

New SharePoint 2010 Limitations on Storage

Holy cow…4TB on content DBs.  Below is quoted from TechNet:

Content databases of up to 4 TB are supported when the following requirements are met:

  • Disk sub-system performance of 0.25 IOPs per GB. 2 IOPs per GB is recommended for optimal performance.
  • You must have developed plans for high availability, disaster recovery, future capacity, and performance testing.

You should also carefully consider the following factors:

  • Requirements for backup and restore may not be met by the native SharePoint Server 2010 backup for content databases larger than 200 GB. It is recommended to evaluate and test SharePoint Server 2010 backup and alternative backup solutions to determine the best solution for your specific environment.
  • It is strongly recommended to have proactive skilled administrator management of the SharePoint Server 2010 and SQL Server installations.
  • The complexity of customizations and configurations on SharePoint Server 2010 may necessitate refactoring (or splitting) of data into multiple content databases. Seek advice from a skilled professional architect and perform testing to determine the optimum content database size for your implementation. Examples of complexity may include custom code deployments, use of more than 20 columns in property promotion, or features listed as not to be used in the over 4 TB section below.
  • Refactoring of site collections allows for scale out of a SharePoint Server 2010 implementation across multiple content databases. This permits SharePoint Server 2010 implementations to scale indefinitely. This refactoring will be easier and faster when content databases are less than 200 GB.
  • It is suggested that for ease of backup and restore that individual site collections within a content database be limited to 100 GB. For more information, see Site collection limits.

For more information on SharePoint Server 2010 data size planning, see Storage and SQL Server capacity planning and configuration (SharePoint Server 2010).
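
As a quick worked example of the IOPS figures quoted above, the sketch below multiplies a content database size by the minimum (0.25) and recommended (2) IOPS per GB. It assumes 1 TB = 1024 GB, and the function name is hypothetical, not part of any SharePoint or SQL Server tooling.

```python
MIN_IOPS_PER_GB = 0.25         # minimum quoted from TechNet above
RECOMMENDED_IOPS_PER_GB = 2.0  # recommended for optimal performance


def required_iops(db_size_gb):
    """Return (minimum, recommended) disk IOPS for a content database of the given size."""
    return db_size_gb * MIN_IOPS_PER_GB, db_size_gb * RECOMMENDED_IOPS_PER_GB


if __name__ == "__main__":
    # 200 GB, 1 TB and 4 TB content databases (1 TB taken as 1024 GB).
    for size_gb in (200, 1024, 4096):
        minimum, recommended = required_iops(size_gb)
        print(f"{size_gb} GB: at least {minimum:.0f} IOPS, {recommended:.0f} recommended")
```

Under that assumption, a full 4 TB content database needs at least 1,024 IOPS from the disk subsystem, and 8,192 IOPS to reach the recommended level.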

SharePoint Scanning Planning – Part 1 – Storage and Sizing

With SharePoint Scanning and Capture, as with any project, planning is essential to success.  If you are going to use scanning software to send scanned images to a SharePoint Content Database, you need to lay some ground work.  This is the first in a series of planning articles.

One of the key areas of planning for any scanning/capture implementation is sizing and storage.  Many of the customers we work with have no real grasp of the volume of paper they deal with on a day-to-day basis, and when they make the move to digitizing their paper, they are often quite surprised at the amount of paper they push through the system.  Underestimating that volume can cause serious issues with storage, network load and database growth.  So how do you estimate the amount of paper?  There are several key conversion factors used by the document management industry, as outlined below:

Description                             Number of Pages           Storage
1 Scanned Page – 8.5×11                 1                         50 KB
1 Scanned Page – 11×17                  1                         100 KB
1 File Cabinet – 4 drawers              10,000                    500 MB
1 Box                                   2,500                     125 MB
1 Linear Inch                           100                       5 MB
1 E Size Engineering Drawing (48×36)    16 (8.5×11 equivalents)   800 KB

This table is a basic planning tool, and can be used as a starting point.  One thing to remember is that these are all standard pages.  Not full image magazine pages, but full text pages.  The other thing to keep in mind is that, for boxes and file cabinets, we have listed the average number of pages they contain.  In the imaging world, we deal with images, not pages.  What is the difference?  A page may have 2 sides, which are converted digitally into 2 images.  So effectively, if you are scanning a box of double-sided pages, you will have to double the storage required.
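
The conversion factors above can be turned into a rough estimator. The sketch below is only an illustration: it takes the file cabinet figure as 10,000 pages (which is what 500 MB at 50 KB per page implies), uses decimal units (1 MB = 1000 KB) to match the table's round numbers, and the names are made up for this example.

```python
# Conversion factors from the table above: average pages per container.
PAGES_PER_UNIT = {
    "file_cabinet": 10_000,  # 4 drawers
    "box": 2_500,
    "linear_inch": 100,
}
KB_PER_IMAGE = 50  # one side of a scanned 8.5 x 11 text page


def estimate_storage_mb(unit, count, double_sided=False):
    """Estimate storage in MB (1 MB = 1000 KB, matching the table's round numbers)."""
    pages = PAGES_PER_UNIT[unit] * count
    images = pages * 2 if double_sided else pages  # each side becomes one image
    return images * KB_PER_IMAGE / 1000


if __name__ == "__main__":
    print(estimate_storage_mb("box", 1))                     # 125.0
    print(estimate_storage_mb("box", 1, double_sided=True))  # 250.0
    print(estimate_storage_mb("file_cabinet", 3))            # 1500.0
```

The `double_sided` flag is the pages-versus-images point in code form: the same box of paper costs twice the storage when both sides are scanned.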

Some other key factors that can contribute to storage and sizing:

DPI Setting – one of the key questions we always receive is “What DPI should I set on my scanner?”  For most basic scanning and archive applications, you can set your scanner to 200 DPI.  If you are doing OCR or any type of advanced data extraction, you want a 300 DPI image for maximum accuracy.  Anything beyond that is just a space killer that will slow down your process and really bloat your files.

Black and White, Greyscale and Color – always use black and white scanning to keep file sizes at an absolute minimum.  Greyscale and color scanning should only be used when absolutely necessary, as file sizes are just crazy.  Below is a table of file sizes for the same letter.  The letter was about 50% page coverage.

Scanning Mode/DPI            File Size
Black and White – 200 DPI    26 KB
Black and White – 300 DPI    38 KB
Black and White – 400 DPI    51 KB
Black and White – 600 DPI    80 KB
Greyscale – 300 DPI          301 KB
Color – 300 DPI              577 KB

Image Processing – image cleanup can significantly reduce file sizes, and it is very important to use this feature whenever you can.  Despeckle, deshade, border removal, etc. will eliminate unnecessary noise in scanned images, and reduce your storage requirement by 10-30% depending on the quality of your documents.

Image Format – There is a lot of misinformation on the market about TIFF versus PDF.  I always hear “We want to store as TIFF because PDFs are just too big.”  Just not the case.  An image scanned to PDF is just a TIFF in PDF clothing (or in a PDF wrapper, to be more exact).  The PDF overhead is almost negligible.  The de facto standard in imaging today is rapidly becoming the PDF image with hidden text.  This gives you a nice little file with the pristine image, and the converted OCR text in the background.  The text layer adds negligible size to the file.

So now, with all this info, you can estimate volume in images, and then come up with required storage on a monthly, yearly or project basis.
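
That last step can be sketched as a small calculation: images per day, times average image size, times scanning days, less whatever image cleanup saves. The daily volume, the 250 scanning days per year and the 20% cleanup saving below are made-up inputs for illustration (20% simply sits inside the 10-30% range mentioned above); the 38 KB figure is the black and white 300 DPI size from the table.

```python
def yearly_storage_gb(images_per_day, kb_per_image, scan_days_per_year=250,
                      cleanup_saving=0.0):
    """Estimate yearly storage in GB (decimal units: 1 GB = 1,000,000 KB)."""
    if not 0.0 <= cleanup_saving < 1.0:
        raise ValueError("cleanup_saving must be a fraction between 0 and 1")
    total_kb = images_per_day * kb_per_image * scan_days_per_year
    return total_kb * (1.0 - cleanup_saving) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 5,000 images a day at 38 KB each.
    raw = yearly_storage_gb(5_000, 38)
    cleaned = yearly_storage_gb(5_000, 38, cleanup_saving=0.20)
    print(f"Per year: {raw:.1f} GB raw, {cleaned:.1f} GB after cleanup")
    print(f"Per month: {raw / 12:.1f} GB raw")
```

Swap in your own daily volume and the file size for your chosen scanning mode, and remember to count images rather than pages when your documents are double sided.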