Archive for July, 2009

GZipStream is helpful, but has some missing features

Monday, July 27th, 2009

I recently had to work around a problem in a particularly ugly way (which I won’t detail :-) ), so after that painful experience I opted to create a class to solve my specific issue in a sane and reusable manner! Out of this unexpected need the class “GZipHelper” was born. It is really just a wrapper around the base .Net System.IO.Compression.GZipStream. It was kind of a sad day, as I really didn’t want to be writing this type of wrapper code; I was hoping the functionality would be natively available in the existing GZipStream class so I could get on with solving the real business problem at hand.

Firstly, it should be said that the standard GZipStream does what I’m sure the MS engineers designed it to do, which was HTTP-based compression (at least I think that was its intended purpose). However, it is certainly not a fully featured class, and it is not easy to use for programmers looking for quick and helpful access to GZip compression.

Specifically, the problem I needed to solve was finding out how big any given “.GZ” file would be once decompressed, without fully reading and decompressing it. It seemed trivial enough – “gzip.exe -l” does what I needed – but no amount of hunting within MSDN helped. So it was on to the ever-handy GZip Wikipedia entry, which detailed enough of the file format and provided a reference to the “GZIP file format specification version 4.3”.

So, armed with this information, we can start to decode the GZip file format to extract the length. In fact, this class checks whether the file is GZip compressed and returns the decompressed length if it is, or the regular file length if it is not.
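As a rough illustration of the idea (not the actual GZipHelper code), the sketch below reads the decompressed length straight from the file. It relies on two facts from RFC 1952: a gzip member starts with the magic bytes 0x1f 0x8b, and its last four bytes (the ISIZE field) hold the uncompressed size, little-endian, modulo 2^32. The class and method names here are hypothetical, and the sketch assumes a single-member file smaller than 4 GB.

```csharp
using System.IO;

// Hypothetical helper, for illustration only.
public static class GZipLengthSketch
{
    // Convenience overload that opens the file read-only.
    public static long GetDecompressedLength(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return GetDecompressedLength(fs);
        }
    }

    // Returns the ISIZE trailer value for a gzip stream, or the plain
    // stream length when the gzip magic bytes are not present.
    public static long GetDecompressedLength(Stream stream)
    {
        // A gzip member has a 10-byte header and an 8-byte trailer,
        // so anything shorter than 18 bytes cannot be gzip.
        if (stream.Length < 18) return stream.Length;

        stream.Seek(0, SeekOrigin.Begin);
        int id1 = stream.ReadByte();
        int id2 = stream.ReadByte();
        if (id1 != 0x1f || id2 != 0x8b) return stream.Length;

        // ISIZE is the final four bytes of the member.
        stream.Seek(-4, SeekOrigin.End);
        byte[] trailer = new byte[4];
        int read = 0;
        while (read < trailer.Length)
        {
            int n = stream.Read(trailer, read, trailer.Length - read);
            if (n <= 0) throw new EndOfStreamException();
            read += n;
        }

        // Assemble the little-endian 32-bit value.
        uint isize = (uint)(trailer[0] | (trailer[1] << 8) |
                            (trailer[2] << 16) | (trailer[3] << 24));
        return isize;
    }
}
```

Because ISIZE is stored modulo 2^32, the value wraps for inputs of 4 GB or more, and for files made of several concatenated gzip members it only describes the last one.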

The following class functions have been implemented (see the bottom of the article for the link to the full project):

   /// <summary>
   /// Utility class to help with managing GZip (.gz) files in .Net
   /// </summary>
   /// <remarks>
   /// This is a trivial wrapper class on top of <see cref="GZipStream"/> that does a little magic
   /// under the covers by looking at the underlying data format and retrieves the
   /// stored data information within the GZip compressed file.
   /// </remarks>
   public class GZipHelper
   {
      /// <summary>
      /// Gets the compressed file details
      /// </summary>
      /// <param name="filename">The filename.</param>
      /// <returns>True if file exists, else false</returns>
      public bool GetFileDetails(string filename);

      /// <summary>
      /// Gets the compressed file information from a file stream
      /// </summary>
      /// <param name="fileStream">The file stream.</param>
      /// <remarks>
      /// Definitions provided by RFC 1952 -GZIP File Format Specification (May 1996).
      /// Coding was performed against ftp://ftp.isi.edu/in-notes/rfc1952.txt
      /// </remarks>
      public void GetFileInformation(FileStream fileStream);

      /// <summary>
      /// Compresses the file.
      /// </summary>
      /// <param name="filename">The filename.</param>
      /// <param name="overWriteExisting">if set to <c>true</c> [over write existing].</param>
      public void CompressFile(string filename, bool overWriteExisting);

      /// <summary>
      /// Decompresses the file.
      /// </summary>
      /// <param name="filename">The filename.</param>
      /// <param name="overWriteExisting">if set to <c>true</c> [over write existing].</param>
      /// <returns></returns>
      public bool DecompressFile(string filename, bool overWriteExisting);

      /// <summary>
      /// Returns a seekable stream into either a file or compressed file (defaults read-only)
      /// </summary>
      /// <remarks>
      /// Decompresses the stream into a <see cref="MemoryStream"/> if the file is compressed
      /// otherwise just returns back a regular <see cref="FileStream"/> as a <see cref="Stream"/>
      /// </remarks>
      /// <param name="filename">The filename to open.</param>
      /// <returns>Reference to opened stream</returns>
      public Stream GetSeekableStream(string filename);
   }

In addition to these methods, the following properties are available:

  • CompressedLength – Size of the compressed file (or regular file size if not compressed)
  • DecompressedLength – Size of the file if it were uncompressed (or regular file size if not compressed)
  • IsTextFile – Indicates if GZip thought the file was text based, potentially leading to better compression
  • CompressionModeValue – Numeric indication of the compression mode used
  • CRC16Present – Indicates a CRC16 is available for the file
  • ExtraFieldsPresent – Additional meta fields are available in the file
  • FileNamePresent – GZip contains the original file name
  • FileCommentPresent – Compressed file has a comment associated with it
  • IsCompressed – Indicates if the file is GZip compressed or not
  • CompressedDate – If stored this is the date the file was compressed.
  • CRC32 – CRC32 value associated with the file

Along with the project there are MSTest harnesses to test the class (trivial implementations). So the features of the class are:

  • Can trivially determine the true file size (regardless of whether the file was compressed via GZip or is uncompressed). This makes your code path much more readable if you are dealing with mixed file types.
  • Provides a seekable stream into the compressed file via a MemoryStream. The key is that you don’t need to worry about the compression (unless you are reading in BIG files), as you get back a Stream for either a regular file or a compressed file, and both support seeking. This can be handy if your program assumes it can Seek in the stream and you need to access GZip files!
  • Trivial file decompression, which also honors the CompressedDate: if that date is set, the decompressed file is given it as its creation date.
  • Trivial file compression. Unfortunately, at the time of writing I’ve not updated the header to include the date of the compressed file. This may come in a later version (and if so I’ll update the blog :-) – but definitely no promises!).
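To make the seekable-stream idea concrete, here is a minimal sketch of how such a method might work; it is an approximation, not the code from the download. GZipStream itself does not support seeking, so the compressed case is fully decompressed into a MemoryStream. The class and method names are hypothetical, and Stream.CopyTo needs .NET 4.0 or later (on older frameworks a manual read/write loop would be required).

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class SeekableGZipSketch
{
    // Returns a seekable, read-only stream over the file's contents,
    // decompressing into memory first when the file looks like gzip.
    public static Stream OpenSeekable(string path)
    {
        FileStream file = File.OpenRead(path);

        // Check size and magic bytes; && short-circuits, so short files
        // are never read.
        bool isGZip = file.Length >= 18
                      && file.ReadByte() == 0x1f
                      && file.ReadByte() == 0x8b;
        file.Seek(0, SeekOrigin.Begin);

        // A plain file is already seekable; the caller disposes it.
        if (!isGZip) return file;

        using (file)
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        {
            var memory = new MemoryStream();
            gzip.CopyTo(memory);
            memory.Seek(0, SeekOrigin.Begin);
            return memory;
        }
    }
}
```

The whole decompressed payload is held in memory, which is why this approach is a poor fit for very large files.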

Simple example usages are (taken straight from the unit tests!):


// Perform a file compression
GZipHelper actual = new GZipHelper();
actual.CompressFile(_fileName, true);

// Perform a file decompression
GZipHelper actual = new GZipHelper();
string fileName = "CSharpHackerSmallTest.txt.gz";
actual.DecompressFile(fileName, true);

// Get a seekable stream
GZipHelper actual = new GZipHelper();
using (Stream dataStream = actual.GetSeekableStream("CSharpHackerSmallTest.txt.gz"))
{
    // Silly seek - but it just shows it can be done
    dataStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(dataStream);
    string contents = sr.ReadToEnd();

    Assert.AreEqual(119, contents.Length);
}

// Gets natural decompressed file length from a compressed file.
GZipHelper actual = new GZipHelper();
actual.GetFileDetails("CSharpHackerSmallTest.txt.gz");
Assert.AreEqual(119, actual.DecompressedLength);

Finally it should be noted that by all accounts the standard implementation of GZipStream in the base .Net libraries (actually the DeflateStream) has a problem when attempting to compress random or already compressed data. There is a Microsoft Connect article [http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93930] that details the issue.

The GZipStream and DeflateStream classes can _significantly_ increase the size of “compressed” data. That means, they don’t just add a few header bytes as stand-alone compressors do, but they _inflate_ the data by as much as 50%. This is apparently because these classes do not check for incompressible data which is a standard feature of all stand-alone compressors. Both classes work fine when the data actually can be compressed.

Please refer to this thread for more details:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=179704&SiteID=1
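If you want to see how your own framework version behaves, a small measurement along these lines may help. This is only a sketch with hypothetical names: it compresses a buffer in memory and reports the output size, so you can compare highly compressible input (all zeros) against incompressible input (random bytes). The result depends on the .Net version in use, so no particular figure is claimed here.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical measurement helper, for illustration only.
public static class GZipExpansionSketch
{
    // Returns the number of bytes GZipStream produces for the input.
    public static long CompressedSize(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // leaveOpen: true so the length can be read after the
            // GZipStream has been disposed and has flushed its trailer.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.Length;
        }
    }

    // Ratio above 1.0 means the "compressed" output grew.
    public static double ExpansionRatio(byte[] data)
    {
        return (double)CompressedSize(data) / data.Length;
    }

    // Builds a buffer of pseudo-random (effectively incompressible) bytes.
    public static byte[] RandomBytes(int count, int seed)
    {
        byte[] buffer = new byte[count];
        new Random(seed).NextBytes(buffer);
        return buffer;
    }
}
```

For example, comparing `ExpansionRatio(new byte[100000])` with `ExpansionRatio(GZipExpansionSketch.RandomBytes(100000, 1))` shows whether incompressible input is being inflated on your machine.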

The base implementation worked for me and met my specific needs without bringing in any third-party DLLs, which incidentally has the nice benefit, for those looking to use this in proprietary software, of avoiding any licensing discussions with supervisors! If you want a more robust GZipStream implementation you can check out http://dotnetzip.codeplex.com/. This apparently has a drop-in replacement, but this class could still be useful even if you use that drop-in replacement as well.

I hope this helps someone out there :-)

[Download GZipHelper (Source + Project) Here]

This download link will always have the latest and greatest version.

Gareth

Good Tech news of the day

Thursday, July 23rd, 2009

Wow this has been a crazy week – much stuff to blog about, but need to unjam the funnel!

SQL 2008 Cumulative Updates Released

Tuesday, July 21st, 2009

Two new CU updates have been released for SQL 2008

Things that stand out are:

  • [970399]  FIX: The MAXDOP option for a running query or the max degree of parallelism option for the sp_configure stored procedure does not work in SQL Server 2008
  • [969844] FIX: You receive inconsistent results when you run index-related DMVs to return statistical information about missing indexes in SQL Server 2005 or in SQL Server 2008
  • [969997] FIX: You receive an incorrect result when you query data from a linked server that is created by using an index OLE DB provider in SQL Server 2005 or in SQL Server 2008
  • [970507] FIX: Error message in SQL Server 2008 when you run an INSERT SELECT statement on a table: “Violation of PRIMARY KEY constraint ‘<PrimaryKey>’. Cannot insert duplicate key in object ‘<TableName>’”
  • [971064] FIX: Quotation marks are rendered incorrectly when you export a SQL Server 2008 Reporting Services report to a .csv file

Obviously you need to read the details to understand whether you have hit any of the problems these fixes address, and definitely don’t apply an update unless you need a fix!

Gareth

CLR, DLLs and a blast from the past

Tuesday, July 21st, 2009

The CLR and performance folks have told us old timers that some things never change :-) , although if you read closely, they do!

[A primer on setting base-addresses for managed DLLs] covers how, for performance reasons, you probably want to [NGen] your code (on the remote computer) and ensure you have your base addresses set up to avoid clashes. The interesting thing is that with [ASLR] enabled on Vista, Windows 7, Windows Server 2008 and higher, setting base addresses doesn’t help :-) .

So for the performance-conscious .Net coders out there targeting fast startups, definitely have a read.

Gareth