Archive for the ‘CodeProject’ Category

GZipStream is helpful, but has some missing features

Monday, July 27th, 2009

I recently had to work around a problem in a particularly ugly way (which I won’t detail :-) ), so after that painful experience I opted to create a class to solve my specific issue in a sane and reusable manner! Out of this unexpected need the class “GZipHelper” was born. It is really just a wrapper around the base .Net System.IO.Compression.GZipStream. It was kind of a sad day, as I really didn’t want to be writing this type of wrapper code. I was hoping the functionality would be natively available in the existing GZipStream class so I could get on with solving the real business problem at hand.

First, it should be said that the standard GZipStream does what I’m sure the MS engineers expected it to do, which was HTTP-based compression (at least I think that was its intended purpose). However, it is certainly not a fully featured class that gives programmers quick and easy access to GZip compression.

Specifically, the problem I needed to solve was finding out how big any given “.GZ” file would be once decompressed, without fully reading and decompressing it. It seemed trivial enough, since “gzip.exe -l” does what I needed, but no amount of hunting within MSDN helped. So on to the ever-handy GZip Wikipedia entry, which detailed enough of the file format and provided the reference to the “GZIP file format specification version 4.3”.

So, armed with this information, we can start to decode the GZip file format to extract the length. In fact, this class checks whether the file is GZip compressed and returns the decompressed length if it is, or the regular file length if it is not.
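
The core idea can be sketched in a few lines. This is not the GZipHelper source, just a minimal illustration under the RFC 1952 layout: a GZip file starts with the bytes 0x1f 0x8b, and its last four bytes (ISIZE) hold the uncompressed size, little-endian, modulo 2^32. The class and method names are made up for this example, and it assumes a single-member archive smaller than 4 GB.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only (not part of GZipHelper).
public static class GZipLengthSketch
{
    // Returns the decompressed length stored in the GZip trailer, or the
    // plain file length when the file does not start with the GZip magic bytes.
    public static long GetDecompressedLength(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            // A GZip member has at least a 10-byte header and an 8-byte trailer.
            if (fs.Length < 18)
            {
                return fs.Length;
            }

            // ID1 = 0x1f and ID2 = 0x8b identify a GZip file (RFC 1952).
            int id1 = fs.ReadByte();
            int id2 = fs.ReadByte();
            if (id1 != 0x1f || id2 != 0x8b)
            {
                return fs.Length;
            }

            // ISIZE: the last four bytes, little-endian, size modulo 2^32.
            byte[] isize = new byte[4];
            fs.Seek(-4, SeekOrigin.End);
            int read = 0;
            while (read < isize.Length)
            {
                int n = fs.Read(isize, read, isize.Length - read);
                if (n == 0)
                {
                    throw new EndOfStreamException();
                }
                read += n;
            }

            return (uint)isize[0]
                 | ((uint)isize[1] << 8)
                 | ((uint)isize[2] << 16)
                 | ((uint)isize[3] << 24);
        }
    }
}
```

Because ISIZE is only 32 bits wide, the value wraps for originals of 4 GB or more, and for a file made of several concatenated GZip members it only describes the last one.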

The following class functions have been implemented (see the bottom of the article for the link to the full project):

   /// <summary>
   /// Utility class to help with managing GZip (.gz) files in .Net
   /// </summary>
   /// <remarks>
   /// This is a trivial wrapper class on top of <see cref="GZipStream"/> that does a little magic
   /// under the covers by looking at the underlying data format and retrieves the
   /// stored data information within the GZip compressed file.
   /// </remarks>
   public class GZipHelper
   {
      /// <summary>
      /// Gets the compressed file details
      /// </summary>
      /// <param name="filename">The filename.</param>
      /// <returns>True if file exists, else false</returns>
      public bool GetFileDetails(string filename);

      /// <summary>
      /// Gets the compressed file information from a file stream
      /// </summary>
      /// <param name="fileStream">The file stream.</param>
      /// <remarks>
      /// Definitions provided by RFC 1952 - GZIP File Format Specification (May 1996).
      /// Coding was performed against ftp://ftp.isi.edu/in-notes/rfc1952.txt
      /// </remarks>
      public void GetFileInformation(FileStream fileStream);

      /// <summary>
      /// Compresses the file.
      /// </summary>
      /// <param name="filename">The filename.</param>
      /// <param name="overWriteExisting">if set to <c>true</c>, an existing output file is overwritten.</param>
      public void CompressFile(string filename, bool overWriteExisting);

      /// <summary>
      /// Decompresses the file.
      /// </summary>
      /// <param name="filename">The filename.</param>
      /// <param name="overWriteExisting">if set to <c>true</c>, an existing output file is overwritten.</param>
      /// <returns></returns>
      public bool DecompressFile(string filename, bool overWriteExisting);

      /// <summary>
      /// Returns a seekable stream into either a file or compressed file (defaults read-only)
      /// </summary>
      /// <remarks>
      /// Decompresses the stream into a <see cref="MemoryStream"/> if the file is compressed
      /// otherwise just returns back a regular <see cref="FileStream"/> as a <see cref="Stream"/>
      /// </remarks>
      /// <param name="filename">The filename to open.</param>
      /// <returns>Reference to opened stream</returns>
      public Stream GetSeekableStream(string filename);
   }

In addition to this, the following properties are available:

  • CompressedLength – Size of the compressed file (or regular file size if not compressed)
  • DecompressedLength – Size of the file if it were uncompressed (or regular file size if not compressed)
  • IsTextFile – Indicates if GZip thought the file was text based, potentially leading to better compression
  • CompressionModeValue – Numeric indication of the compression mode used
  • CRC16Present – Indicates a CRC16 is available for the file
  • ExtraFieldsPresent – Additional meta fields are available in the file
  • FileNamePresent – GZip contains the original file name
  • FileCommentPresent – Compressed file has a comment associated with it
  • IsCompressed – Indicates if the file is GZip compressed or not
  • CompressedDate – If stored this is the date the file was compressed.
  • CRC32 – CRC32 value associated with the file
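
Most of these properties map directly onto the fixed 10-byte GZip header described in RFC 1952: bytes 0-1 are the magic number, byte 2 the compression method (CM), byte 3 the flag byte (FLG) and bytes 4-7 the timestamp (MTIME, seconds since 1970, zero when not stored). The sketch below shows one way that decoding could look. It is an illustration only; the class name and members are assumptions, not the actual GZipHelper code.

```csharp
using System;
using System.IO;

// Hypothetical header reader for illustration only (not the GZipHelper source).
public sealed class GZipHeaderSketch
{
    public bool IsCompressed { get; private set; }
    public int CompressionModeValue { get; private set; } // CM, 8 = deflate
    public bool IsTextFile { get; private set; }          // FLG bit 0 (FTEXT)
    public bool CRC16Present { get; private set; }        // FLG bit 1 (FHCRC)
    public bool ExtraFieldsPresent { get; private set; }  // FLG bit 2 (FEXTRA)
    public bool FileNamePresent { get; private set; }     // FLG bit 3 (FNAME)
    public bool FileCommentPresent { get; private set; }  // FLG bit 4 (FCOMMENT)
    public DateTime? CompressedDateUtc { get; private set; }

    public static GZipHeaderSketch Read(Stream stream)
    {
        GZipHeaderSketch info = new GZipHeaderSketch();

        // Read the fixed 10-byte header; Read may return fewer bytes than asked for.
        byte[] header = new byte[10];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        // Not GZip: leave IsCompressed false and every other member at its default.
        if (read < header.Length || header[0] != 0x1f || header[1] != 0x8b)
        {
            return info;
        }

        info.IsCompressed = true;
        info.CompressionModeValue = header[2];

        byte flags = header[3];
        info.IsTextFile = (flags & 0x01) != 0;
        info.CRC16Present = (flags & 0x02) != 0;
        info.ExtraFieldsPresent = (flags & 0x04) != 0;
        info.FileNamePresent = (flags & 0x08) != 0;
        info.FileCommentPresent = (flags & 0x10) != 0;

        // MTIME is a little-endian Unix timestamp; zero means "no date stored".
        uint mtime = (uint)header[4]
                   | ((uint)header[5] << 8)
                   | ((uint)header[6] << 16)
                   | ((uint)header[7] << 24);
        if (mtime != 0)
        {
            DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
            info.CompressedDateUtc = epoch.AddSeconds(mtime);
        }

        return info;
    }
}
```

The CRC32 and the decompressed length are not in the header at all; they sit in the 8-byte trailer at the end of the file (CRC32 first, then ISIZE).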

Along with the project there are MSTest harnesses to test the class (trivial implementations). So the features of the class are:

  • Can trivially determine the true file size (regardless of whether the file was compressed via GZip or is uncompressed). This makes your code path much more readable if you are dealing with mixed file types.
  • Provides a seekable stream into the compressed file via a MemoryStream. The key is that you don’t need to worry about the compression (unless you are reading in BIG files, since the whole decompressed file is held in memory), as you get back a Stream for either a regular file or a compressed file, and both support seeking. This can be handy if your program assumes it can Seek in the stream and you need to access GZip files!
  • Trivial file decompression, which also honors the CompressedDate: if that date is set, the decompressed file is given it as its creation date.
  • Trivial file compression. Unfortunately, at the time of writing I’ve not updated the header to include the date of the compressed file. This may come in a later version (and if so I’ll update the blog :-) – but definitely no promises!).
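
To make the seekable stream idea concrete, here is a rough sketch of the technique: peek at the magic bytes, and if the file is GZip, inflate it completely into a MemoryStream; otherwise hand back the FileStream itself. The names are invented for this example and the real GetSeekableStream may well differ in its details.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical sketch of the "seekable stream" technique (not the GZipHelper source).
public static class SeekableGZipSketch
{
    public static Stream OpenSeekable(string path)
    {
        FileStream file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        try
        {
            // GZip files start with 0x1f 0x8b; anything else is returned as-is.
            bool isGZip = file.Length >= 18
                       && file.ReadByte() == 0x1f
                       && file.ReadByte() == 0x8b;
            file.Seek(0, SeekOrigin.Begin);
            if (!isGZip)
            {
                return file;
            }

            // Inflate everything into memory so the caller can seek freely.
            MemoryStream memory = new MemoryStream();
            using (GZipStream gzip = new GZipStream(file, CompressionMode.Decompress))
            {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = gzip.Read(buffer, 0, buffer.Length)) > 0)
                {
                    memory.Write(buffer, 0, n);
                }
            }

            memory.Seek(0, SeekOrigin.Begin);
            return memory;
        }
        catch
        {
            // Don't leak the file handle if anything goes wrong.
            file.Dispose();
            throw;
        }
    }
}
```

The trade-off is the one mentioned above: the entire decompressed content lives in memory, so this suits small and medium files rather than multi-gigabyte ones.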

Simple example usages are (taken straight from the unit tests!):


// Perform a file compression
GZipHelper actual = new GZipHelper();
actual.CompressFile(_fileName, true);

// Perform a file decompression
GZipHelper actual = new GZipHelper();
string fileName = "CSharpHackerSmallTest.txt.gz";
actual.DecompressFile(fileName, true);

// Get a seekable stream
GZipHelper actual = new GZipHelper();
using (Stream dataStream = actual.GetSeekableStream("CSharpHackerSmallTest.txt.gz"))
{
    // Silly seek - but it just shows it can be done
    dataStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(dataStream);
    string contents = sr.ReadToEnd();

    Assert.AreEqual(119, contents.Length);
}

// Gets the natural decompressed file length from a compressed file.
GZipHelper actual = new GZipHelper();
actual.GetFileDetails("CSharpHackerSmallTest.txt.gz");
Assert.AreEqual(119, actual.DecompressedLength);

Finally it should be noted that by all accounts the standard implementation of GZipStream in the base .Net libraries (actually the DeflateStream) has a problem when attempting to compress random or already compressed data. There is a Microsoft Connect article [http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93930] that details the issue.

The GZipStream and DeflateStream classes can _significantly_ increase the size of “compressed” data. That means, they don’t just add a few header bytes as stand-alone compressors do, but they _inflate_ the data by as much as 50%. This is apparently because these classes do not check for incompressible data which is a standard feature of all stand-alone compressors. Both classes work fine when the data actually can be compressed.

Please refer to this thread for more details:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=179704&SiteID=1
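
If you want to see how your own runtime behaves, a quick measurement along these lines should do. This is only a sketch with made-up names; it compresses a buffer of pseudo-random bytes (which is effectively incompressible) and reports the resulting size, so you can compare it with the input size yourself. No particular result is implied here, since it depends on the framework version. As far as I know, newer framework versions (4.5 onwards) use zlib internally, so the overhead there may be far smaller than the figure quoted above.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical measurement helper for illustration only.
public static class IncompressibleDataSketch
{
    // Compresses the buffer with GZipStream and returns the compressed size in bytes.
    public static long CompressedSize(byte[] data)
    {
        using (MemoryStream output = new MemoryStream())
        {
            // leaveOpen = true so the MemoryStream can still be inspected afterwards.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.Length;
        }
    }

    // Pseudo-random bytes from a fixed seed, so runs are repeatable.
    public static byte[] RandomBytes(int count, int seed)
    {
        byte[] data = new byte[count];
        new Random(seed).NextBytes(data);
        return data;
    }
}
```

For example, compare `IncompressibleDataSketch.CompressedSize(IncompressibleDataSketch.RandomBytes(1 << 20, 42))` against `1 << 20` on the frameworks you need to support.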

The base implementation worked for me and met my specific needs without bringing in any third-party DLLs. Incidentally, that also has the nice benefit, for those looking to bring this into proprietary software, of avoiding any licensing discussions with supervisors! If you want a more robust GZipStream implementation you can check out http://dotnetzip.codeplex.com/. It apparently offers a drop-in replacement, but this class could still be useful even if you use that replacement as well.

I hope this helps someone out there :-)

[Download GZipHelper (Source + Project) Here]

This download link will always have the latest and greatest version.

Gareth

How to write SHA256Sum in C# (or MD5Sum, SHA1Sum)

Sunday, July 12th, 2009

Occasionally you may have the need to create a file ‘fingerprint’ using one of the well known and supported hash programs. The common hash algorithms are:

  • MD5 - Don’t use it if you can avoid it, as it has known vulnerabilities and should not be relied on for security.
  • RIPEMD160 - Supported by .Net, but not heavily used. I recommend using SHA256 or SHA512 instead.
  • SHA1 - If you can avoid it, use SHA256 or SHA512 instead.
  • SHA2 Family
    • SHA256
    • SHA384
    • SHA512

If you haven’t come across MD5Sum.exe, SHA1Sum.exe or SHA256Sum.exe, you can find native Windows ports here (or, if you are looking for the more official GNU versions, they can be found here). If you are just looking for the command line tools, that is probably enough. However, sometimes you may need to do all this work yourself in C#; if so, this is the article that should help guide you!

First of all, this is going to be fairly simple, as the .Net library supports all of the above hash formats, so all we are really doing is showing the best way to use the supplied .Net classes to perform your hashing. So on to the magic (note that, to avoid width formatting issues, this isn’t exactly how I normally format the code!):

/// <summary>
/// Performs the SHA1 Hash function on file
/// </summary>
/// <param name="filename">
/// The filename to be hashed.
/// </param>
/// <returns>
/// SHA1 Hash value associated with the file
/// </returns>
public static string SHA1HashFile(string filename)
{
   string hashedValue = string.Empty;

   //create our SHA1 provider
   SHA1CryptoServiceProvider hashAlgorithm = new SHA1CryptoServiceProvider();

   //hash the data from the file
   byte[] hashedData = hashAlgorithm.ComputeHash(File.ReadAllBytes(filename));

   //loop through each byte in the returned byte array to convert into printed ASCII
   foreach (byte b in hashedData)
   {
      hashedValue += String.Format("{0,2:x2}", b);
   }

   //return the hashed value to the caller
   return hashedValue;
}

This does the SHA1 hashing of the supplied file, and matches the output of the GNU version of SHA1Sum.exe. I told you it was simple :-) . Dissecting this code should be pretty trivial:

  1. Create the SHA1 Service provider (System.Security.Cryptography.SHA1CryptoServiceProvider)
  2. Call ComputeHash passing in a byte array.
  3. Take the result and output it as text. In case it wasn’t obvious, the result of the hash is a binary blob, hence the need to format it into a string-friendly representation.

However, there are really 2 problems with this code:

  1. File.ReadAllBytes – This returns a byte array of the file, pretty much as you would expect! The problem is that if this is a very big file, for example a DVD image, the entire file needs to be loaded into memory before it gets hashed. Obviously not the most optimal approach!
  2. This is completely locked into SHA1; you need a new function for every other hashing algorithm. Not a biggie, but it would be nice to get some reuse now and then. It is definitely useful in the full example, where we have to use some fallback processing if an algorithm is not available.

Thankfully, fixing this is still pretty trivial. To fix issue 1, rather than using the ComputeHash overload that takes a byte array, use the one that takes a stream. This avoids the need to have the entire file in memory before the hash process can start. Out of curiosity I looked up the publicly available source code for the function to check it was in fact doing what I thought. Thankfully it is simple and obvious:

...
// Default the buffer size to 4K.
byte[] buffer = new byte[4096];
int bytesRead;
do {
   bytesRead = inputStream.Read(buffer, 0, 4096);
   if (bytesRead > 0) {
     HashCore(buffer, 0, bytesRead);
  }
} while (bytesRead > 0);
...

So we can see that the stream version, HashAlgorithm.ComputeHash(Stream), only uses a small chunk of memory while calculating the hash value. So we are safe from big files potentially killing the application.

Issue 2: the .Net team did a nice job of creating base classes, one of which is HashAlgorithm. This is the abstract class that defines the hashing ‘interface’, and all of the framework’s hash algorithms derive from it. So we can use this to our advantage:
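
If you ever need to drive that loop yourself (say, to report progress on a big file), the same chunked approach is available through the public TransformBlock and TransformFinalBlock methods on HashAlgorithm. The following is a minimal sketch; the class and method names are mine, and the 4K buffer simply mirrors the framework snippet above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper showing chunked hashing through the public HashAlgorithm API.
public static class ChunkedHashSketch
{
    public static byte[] HashInChunks(Stream input, HashAlgorithm algorithm)
    {
        byte[] buffer = new byte[4096];
        int bytesRead;

        // Feed the algorithm one buffer at a time; only 4K is ever held in memory.
        while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            algorithm.TransformBlock(buffer, 0, bytesRead, null, 0);
        }

        // An empty final block tells the algorithm that the input is complete.
        algorithm.TransformFinalBlock(buffer, 0, 0);
        return algorithm.Hash;
    }
}
```

The result should be identical to what ComputeHash(Stream) returns for the same data; a progress callback could be added inside the loop.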

/// <summary>
/// Performs a Hash operation on the supplied file.
/// </summary>
/// <param name="filename">
/// The filename to be hashed.
/// </param>
/// <returns>
/// Selected Hash value associated with the file
/// </returns>
public static string HashFile(
          string filename
          , HashAlgorithm hashAlgorithm)
{
   if (!File.Exists(filename))
   {
      throw new ArgumentException(filename + " must exist", "filename");
   }

   string hashedValue = string.Empty;
   byte[] hashedData = null;

   // Create the stream
   using (FileStream fs = File.Open(filename, FileMode.Open, FileAccess.Read))
   {
      hashedData = hashAlgorithm.ComputeHash(fs);
   }

   //loop through each byte in the returned byte array
   foreach (byte b in hashedData)
   {
      //convert each byte and append
      hashedValue += String.Format("{0,2:x2}", b);
   }

   //return the hashed value
   return hashedValue;
}

Now we have a generic function that returns a hash value from any supported .Net hash algorithm. To call it you could just do “HashFile(filename, new SHA1CryptoServiceProvider())”. Voila! It performs well and can trivially support any hashing class the .Net framework implements.

OK, so let’s get a little more adventurous now. Attached to this entry is a very simple (aka not fully featured) HashSum source code that allows the same executable to be used to provide all the above hashing! However, as a word of caution, you need to be a little careful with the more advanced hashing. For example, Microsoft supplies both “SHA256CryptoServiceProvider” and “SHA256Managed”. On the surface they look pretty much the same, except that SHA256CryptoServiceProvider is only available in .Net 3.5 (or higher). However, since it uses the operating system’s cryptographic service providers, it may not be available on the platform your program is running on. If that is the case, a PlatformNotSupportedException is thrown:
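
As a usage illustration, here is a compact, self-contained variant of the same idea. It is a sketch rather than the attached project’s code: the names HashFileSketch, HashFileHex and Sha256OfFile are made up, it builds the hex string with a StringBuilder instead of string concatenation, and it obtains the algorithm from the SHA256.Create() factory method, which is an assumption of convenience rather than what the article’s code uses.

```csharp
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical variant of the generic hash function, for illustration only.
public static class HashFileSketch
{
    // Hashes the file with whichever algorithm the caller supplies.
    public static string HashFileHex(string filename, HashAlgorithm hashAlgorithm)
    {
        byte[] hashedData;
        using (FileStream fs = File.OpenRead(filename))
        {
            hashedData = hashAlgorithm.ComputeHash(fs);
        }

        // Two lowercase hex digits per byte, the same format the *sum tools print.
        StringBuilder hex = new StringBuilder(hashedData.Length * 2);
        foreach (byte b in hashedData)
        {
            hex.Append(b.ToString("x2"));
        }
        return hex.ToString();
    }

    // Convenience wrapper that also disposes the algorithm when done.
    public static string Sha256OfFile(string filename)
    {
        using (SHA256 sha256 = SHA256.Create())
        {
            return HashFileHex(filename, sha256);
        }
    }
}
```

Swapping algorithms is then just a matter of passing a different instance, for example `HashFileSketch.HashFileHex(path, MD5.Create())`.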

System.PlatformNotSupportedException was unhandled
  Message="The specified cryptographic algorithm is not supported on this platform."
  Source="System.Core"
  StackTrace:
       at System.Security.Cryptography.CapiNative.AcquireCsp(String keyContainer, String providerName, ProviderType providerType, CryptAcquireContextFlags flags, Boolean throwPlatformException)
       at System.Security.Cryptography.CapiHashAlgorithm..ctor(String provider, ProviderType providerType, AlgorithmId algorithm)
       at System.Security.Cryptography.SHA256CryptoServiceProvider..ctor()
       at CSharpHacker.Hash.HashSum.Main(String[] args) in X:\GIT\FileHash\FileSum.cs:line 76
       at System.AppDomain._nExecuteAssembly(Assembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException:

Hmmm, you don’t read about that little chestnut in the MSDN help! So if you know you are only going to be running on Windows 2008, Vista, Windows 7 or later, you can just use the “SHA256CryptoServiceProvider” version. However, if you may have to support systems such as XP (potentially Windows 2003 as well), you will have to use the Managed versions. The safest route is to provide a graceful fallback mechanism (possibly with a warning) that detects that the CSP version cannot be used and uses the managed code version instead. This provides the best of both worlds: if the platform supports the CSP version you use that (which should give you a speed increase), otherwise you use the managed solution.

case "SHA256SUM":
default:
   try
   {
      hashAlgorithm = new SHA256CryptoServiceProvider();
   }
   catch (PlatformNotSupportedException)
   {
      // Fall back to the managed version if the CSP
      // is not supported on this platform.
      hashAlgorithm = new SHA256Managed();
   }
   break;

This is a second benefit of the “HashAlgorithm” approach: the code responsible for the generic hashing function doesn’t need to know which version (or even which algorithm) it is using.

You also have to bear in mind that if you are going to use “SHA256CryptoServiceProvider” (or equivalent) you have to be using .Net 3.5 or greater.

Click to download Sample CSharp FileSum project

This is a .Net 3.5 project that picks the algorithm based on the executable’s file name. So if you rename FileSum.exe to “SHA512Sum.exe” it will perform the SHA512 hash on the input file, “MD5Sum.exe” the MD5 hash, etc. If the name is not a recognized one, it defaults to the SHA256 algorithm. This is not designed to be a wholesale replacement for SHA256Sum etc., but more of a guide on how to write a fully featured version. So things missing from this include (and are not limited to :-) ):

  • No support for wildcards (simple enough to add but not there)
  • No support for checking files match the input file (‘-c’ or ‘--check’)
  • Only binary mode is supported, no support for ASCII/Text mode (‘-t’ or ‘--text’)
  • No support for standard input processing
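
The name-based algorithm selection described above boils down to a switch on the executable name. The sketch below shows the general shape; it is an approximation rather than the code from the download. The class and method names are invented, and it uses the static Create() factory methods instead of the CSP classes with a managed fallback.

```csharp
using System.IO;
using System.Security.Cryptography;

// Hypothetical sketch of picking the hash algorithm from the program name.
public static class AlgorithmPickerSketch
{
    public static HashAlgorithm FromProgramName(string programPath)
    {
        // "SHA512Sum.exe" -> "SHA512SUM", so the match is case-insensitive.
        string name = Path.GetFileNameWithoutExtension(programPath).ToUpperInvariant();

        switch (name)
        {
            case "MD5SUM":
                return MD5.Create();
            case "SHA1SUM":
                return SHA1.Create();
            case "SHA384SUM":
                return SHA384.Create();
            case "SHA512SUM":
                return SHA512.Create();
            case "SHA256SUM":
            default:
                // Unrecognized names fall back to SHA256.
                return SHA256.Create();
        }
    }
}
```

In a console program the path of the running executable could be taken from `Environment.GetCommandLineArgs()[0]` and the returned algorithm passed straight to the generic hashing function.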

Hope you found this useful,

Gareth
