A Field Guide to Project 2

by Seth Nielson, updated by Emily Fortuna

Table of Contents

Overview

This guide highlights a few points in the RFCs and provides some exercises, examples, and tools to help give you some direction for prototyping and reverse-engineering the gzip project. The exercises are completely optional, but may help you understand the gzip project.

Background

This guide is designed to be used after, repeat AFTER several reads through RFC 1951 and 1952.

Have you read them yet? No? Ok, come back when you have read them.

Ok. You've read them now? We can continue.

A quick skip through the RFCs

Experiments

Below are some optional experiments to help you understand gzip compresses files. Don't worry about spending a lot of time designing or "cleaning up" code. This is the sand box. Hey, it might even be the litter box. Behave appropriately. You might, for instance create an "Experiments" class with a main method and a bunch of static methods that test various things.

Bit Manipulation Experiments

As bit manipulations take up so much of this assignment, this is one of the most important parts of the code. Even though you may not fully understand the RFC's at this point, it should be clear after some careful reading that you will have need of various bit controls.

The very first thing you should be able to do is print a byte. Again, you may find Java Bits Manipulation Guide to be helpful. In other words, if you have a byteToString() method, then System.out.println(byteToString((byte)255); should produce:

11111111
Given the various changes in bit order described in the RFC, try giving it a parameter to print out LSB or MSB first (i.e., "1" would become 00000001 in MSB-first order and 10000000 in LSB-first order).

Now that you have print-byte working, write a setBit() function for setting an individual bit. Write the corresponding getBit(). You will use a bit index, of course. Make it clear what this means.

A tougher experiment is to be able to convert several bytes to a number as illustrated in both RFC's. You could probably set a limit on 4 bytes, so try that. Make two functions. The first will take a long and produce four bytes in the ordering described in section 3.2 of RFC 1951. The second will take one to four bytes in the RFC's described format and produce a long.

This final bit experiment is somewhat challenging, but if you succeed, it will make much of the assignment easier. Write a class that will allow you to stream an arbitrary number of bytes as a stream of bits. Be able to stream the bits in either LSB (bit) first or MSB (bit) first.

Gzip Meta Experiments

Now we come to processing the GZip file itself. If you aren't familiar with gzip, it is pretty simple. Start with a file with just a few characters in it, perhaps named gzip-experiment1. From the command line, in the directory with that file, run gzip gzip-experiment1 and it should destroy the file and replace it with gzip-experiment1.gz.

For printing the bytes to the screen, use your byteToString() method and put spaces between each byte.

Your next experiment is to process the header. This really shouldn't be too challenging. Remember this is a prototype. A big huge series of if statements inside a while loop will do just fine. You do not need to store anything at all. Your goal here is to be able to find the start of the first compressed block. It will also help you learn how to write your own simple headers.

Now you can implement the pseudo-code described in section 8 (appendix) of RFC 1952. Be careful, though, Java has no "unsigned" modifier. you should be able to pull the CRC from the last 8 bytes of your gzip-experiment1.gz binary (or hexadecimal) dump. This wouldn't work in a multi-member gzip file, of course, but you don't have to worry about that here. To compute the CRC on the uncompressed text, you'll need to have access to the raw text. Make sure the text compressed in your test file is something simple, like "abc". Then, you can extract the crc and verify it without having to handle the whole file.

Make sure you can also process the size. This should be nothing more than using your multi-byte to long converter you constructed previously.

BTYPE 00 Experiments

It is difficult to get gzip to produce files with blocks using no compression (BType00). At the bottom of this document is information about a "modified" gzip from dsandler that will produce compressed files with specific block types.

Also, I have provided you with a small file that has one block of type BTYPE00 here. For this experiment, print out the binary on paper (either using your printer, or transcribe it). Now, go through and mark with pen what all of the different bytes represent. Can you translate everything?

Now that you can do that, write a small BType00 encoding function. It should take a string in, and output (to file) a compressed (BTYPE00) file (for file manipulations, look up FileInputStream and FileOutputStream). A reminder on the components:

Try it out! Can gunzip uncompress it?! When it does, congratulations! You just wrote a gzip-compliant file!

BTYPE 01 Experiments

The harder bit manipulations start happening with these types of blocks. The first thing you need to be able to do is figure out what the RFC means when it tries (and fails) to describe bit order.

Here is some binary of a BTYPE01 block. You can extract your own, for additional examples, by writing a function that reads in a gzip file, strips the header, and the last 8 bytes. However, a lot of files created by gzip will end up being BTYPE10 blocks, so be careful.

To be able to decompress the following code, you will have to be able to use the default Huffman trees.

01001011 01001100 01001010 01001110 
01001001 11001101 11001101 01001100 
01001001 11001001 01001001 01001101 
00000100 00110001 10111001 00000000

Can you uncompress it? Remember, this is AFTER the header. As a hint, there is only one block, so you know that the first bit should be BFINAL.

By now, you should have determined that the previous example had no lengths or distances. This one does:

01001011 01001100 01001010 01001110 
01001001 01001101 01001011 01001111 
10000100 01010000 01011100 00000000

This is important: are you sure you got the length and distance right? Double check those extra bits and make sure that you get them in the right order. Don't let this bite you later.

Other Experiments

This is pretty much the end of the guide. I hope it has been helpful. Just a few ideas for further experiments (not an exhaustive list):

Other tools: gzip-DEBUG, xxd, and GzipInfo

Finally, we point you to a couple of other tools you may find useful as you wrestle with gzip. First, a version of the gzip executable that lets you choose the BTYPE of compressed blocks:

Path: news.rice.edu!newsfeed.rice.edu!rice!not-for-mail
From: Daniel Sandler 
Newsgroups: rice.owlnews.comp314
Subject: Debug version of gzip on owlnet, and usage of xxd
Date: Wed, 22 Feb 2006 22:35:02 +0000 (UTC)

By popular demand, I have recompiled gzip 1.2.4 with -DDEBUG and -DFORCE_METHOD
and placed the binary on Owlnet. You're free to use it for testing and
reverse-engineering (along with stock gzip).

What do these compile-time #defines do?

-DDEBUG: enables some trace messages by default, and lots more if you supply
         -v on the command line (e.g. gzip -v foo)

-DFORCE_METHOD: turns compression modes -1, -2, and -3 into "special" modes:
         -1: output gzip header, entire input file, and CRC
             (this is NOT a valid gzip file, so it's not really useful)
         -2: valid gzip file with each block stored (no compression)
         -3: valid gzip file with static Huffman trees only 

To use it, run (on Owlnet):

    ~comp314/src/gzip-DEBUG/gzip [flags] [files]

Another useful tool hinted at on the assignment is xxd. For example, here's the
xxd dump of a static-tree compressed version of the single byte 'A' (0x41):

    # use -3 to force static Huffman trees
    $ echo -n A | ~comp314/src/gzip-DEBUG/gzip -3 > A.gz
    [...gzip's debug output on stdout...]
    $ xxd -b A.gz
    0000000: 00011111 10001011 00001000 00000000 00101101 11100000  ....-.
    0000006: 11111100 01000011 00000000 00000011 01110011 00000100  .C..s.
    000000c: 00000000 10001011 10011110 11011001 11010011 00000001  ......
    0000012: 00000000 00000000 00000000                             ...

Very handy for figuring out the correct way to write out things like the CRC32
(in this case 10001011 10011110 11011001 11010011, which is bytewise
little-endian for 0xd3d99e8b or 3554254475).

Happy hacking,
-ds

xxd is an extremely handy hexadecimal (and binary!) dump tool accessible through your own local command line.

Lastly, GzipInfo is a tool created by Emily Fortuna that reads in a .gz file and returns an annotated .html file of the binary dump (sort of like a "smart" xxd viewable in your browser). We have also provided three examples of its output. The first two files are btype00 and btype01 versions of text file containing abc\n. The last file is the alphabet repeated four times, compressed with btype10. The colors of the annotations match up with the bits that they correspond to, and bits with a common purpose (say, MTIME or a Huffman code, which may cross byte boundaries) share a color. If you're confused in reading the Huffman Codes, remember that you start at the leftmost byte, and pick the rightmost bit of that byte and read right to left.