p2: GZip
Overview
Gzip is a name applied to both a compression file format and a utility that can compress and decompress data using the DEFLATE algorithm. The gzip utility is widely used and shows up in a variety of places where you might not expect it. Did you know that it is part of the most recent HTTP specification?
Your assignment is to implement a gzip-compliant compressor and decompressor. When you are finished, your program will be able to decompress files compressed with gzip and will be able to compress files that gzip can decompress.
Requirements
Unlike the project page for ADVENTURE, this page will be quite light on listing requirements. This is because the rules for compliance are described in various RFC (Request-For-Comments) documents. The two RFC's that you will need to understand are:
- RFC 1951 - DEFLATE Compressed Data Format Specification 1.3
- RFC 1952 - GZIP file format specification version 4.3
While it is not required, you may find that RFC 1950, which discusses the ZLIB data format, to also be helpful. Please note that ZLIB is not the same as GZIP, although they both use DEFLATE.
These RFC's form the backbone requirements documentation for this project. Note that the gzip utility has a large number of features not discussed in the RFC's including the ability to decompress files compressed by Unix's compress tool, the knowledge to convert the various CR/LF/CR-LF line terminations between various operating systems, and so forth. You do not have to implement any of these features.
Also, unlike the real gzip program, you only have to deal with system.in and system.out.
You must, however, build a compressor that can produce files decompressable by gzip and a decompressor that can decompress files compressed by gzip. In reading the specifications, you will learn that there are three major compression block-types: uncompressed, compressed with fixed Huffman codes, and compressed with dynamic Huffman codes. For full credit, your compressor and decompressor must handle all three block types correctly.
You will also read in the RFC's that gzip has a number of error-checking devices built into the format. Your code should report all possible errors and inconsistencies correctly. You should definitely describe all possible error conditions in your specification.
In addition to these requirements for your final submission, we are making some additional requirements at each milestone. The following sections outline what you will be expected to do. Towards the end of this document is a grading summary.
Basic Milestones
As before, we will have three basic milestone phasess for this project. These milestones, and their due dates are:
| Milestone | Due Date |
|---|---|
| Specification | 2/28/2008 |
| Beta | 3/13/2008 |
| Final | 3/20/2008 |
As usual, your milestones are due on 12:01am on the date listed above. Note that your specs are due right before "midterm recess." (Not to be confused with "spring recess," which happens in April.)
WARNING: The computers hosting your subversion server will be undergoing maintenance during midterm recess. That means you won't necessarily be able to reach subversion during the break. The machines will be physically moving to a different building and will be logically connecting to a different network (from RiceNet1 to RiceNet2). This may cause unplanned complications. Of course, we'll try to minimize disruptions, and we promise (as best we can) that everything will be back up before you get back home from the beach.
Functionality Milestones
Unlike the milestones for ADVENTURE, the milestones for gzip will require that specific functionality milestones are also met. This extra requirement is necessary because you will need the functionality prototypes to help you understand the requirements. Keep in mind that the RFC's that describe gzip's format are very poorly written. Here is one small example:
In other words, if one were to print out the compressed data as a sequence of bytes, starting with the first byte at the *right* margin and proceeding to the *left*, with the most- significant bit of each byte on the left as usual, one would be able to parse the result from right to left, with fixed-width elements in the correct MSB-to-LSB order and Huffman codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position).
Reading this paragraph in context provides a little more understanding, but not much. This can making spec-writing very difficult. Even though there is a specification milestone, your project specification will be under development continuously with your code. We will describe more of what is expected in your specification in the next section.
Listed below are the functional requirements for each milestone. See the RFC's for more about what these requirements mean.
- Specification Prototype Capabilities: compress and decompress "Uncompressed Blocks" (BType 00)
- Beta Program Capabilities: compress and decompress gzip files using the fixed Huffman codes (BType 01)
The code must be in a more "mature" form by the beta phase. It must be able to read in a real gzip file and output the same. This means that you will have to be able to handle all of the header components, the CRC, and both the BTYPE 00 and BTYPE 01 block types. You should also have unit tests for this code.
It is important that you consider these functionality milestones as bare minimums. Certainly by the time you reach your spec meeting with your graders, you will want to be able to discuss how you are are handling BType 01 blocks, and similarly, how you are handling BType 10 blocks by your beta meeting.
More about the Specification
The specification you will submit for the specification milestone will probably not completely describe how all of your low-level operations will work, nor all of the low-level datastructures you will use. There is no point in writing how something will work, until you know how it will work. The whole reason for the prototyping is so that you can understand the (poorly-written) RFC's sufficiently to provide a detailed design.
To help you with your prototyping, we are providing this guide. It helps you reverse-engineer how gzip works so that you can understand the RFC's better. By working through these examples, you will be able to figure out what a lot of the confusing statements in the RFC mean. Then, you will be able to fill in your spec with a solid design.
However, the guide does not describe experiments for figuring out all of the components of your gzip program. Specifically, it says nothing about how to encode or decode the dynamic huffman trees into the byte stream or how to effectively do hash-chaining. Until you have figured out how these components will work, you will not be able to provide a design for them. It is most likely that you will not have have designs for these and/or other critical parts of the assignment by the spec due date. What then should your spec say about those?
We are requiring that, for ever feature that you do not know how to implement, you must provide detailed methods for determining how that feature will work. You might use one of the methods used in the guide or you might come up with something completely unique. But you must provide a plan for filling in all the blanks.
Your development cycle will work something like this:
- Specify how you will figure out what some part of the RFC means, or how you will implement some specific feature
- Gain understanding about the RFC and your design by experimentation and prototyping
- Update your specification by removing the experiment and inserting the detailed design
- Re-implement the prototyped functionality as mature and stable code.
You should still include a functional test plan for this project. You are not required to implement the functional test plan, but you are required to write one. In the subsequent section "Testing and Design", we describe more about the unit tests that you are required to implement. Please remember that functional testing and unit testing are not the same.
Here is an example of a functional test.
TEST COMPRESSOR-1BLOCK-MAXSIZE: Create a text file of maximum 1-block uncompressed size (65,535 as per RFC 1951) and compress. Uncompress using gzip. Verify that the gzip-uncompressed file and the original file are the same using a comparison utility.
You should fill in unit-test details as you fill in design details. For example, if you have a special utility class for manipulating a byte called ZipByte with a method for setting any one of the eight bits called setBit(), you might have something like this:
TEST ZipByteTest.setBitPositiveTest1() - sets all bits 0-7 to 1, and then bits 0-7 to 0. After each "setBit", calls "getBit" to verify the change
TEST ZipByteTest.setBitUnderflowTest() - Attempts to set bit -1 and verifies that an "IndexOutOfBoundsException" is thrown
TEST ZipByteTest.setBitOverflowTest() - Attempts to set bit 8 and verifies that an "IndexOutOfBoundsException" is thrown
Notebooks
We highly recommend, but do not require, that you keep a notebook detailing your experiments.
One simple method might be to get a three-ring binder, print out the RFC's, and keep notes in the margins and on the back-pages. A more traditional approach would include purchasing a speckled composition notebook (no reason to spend 25$ for an engineering notebook) for keeping a log of your experiments and results. Given the pair-programming model, perhaps an electronic log would also be effective. In fact, you could download an electronic text copy of the RFC's to your shared directory and make comments inline with your initials and the date.
Graders are looking to see that you have made significant progress by the milestone meetings. Such records will not only demonstrate the progress that you have made, but will also provide a better framework for the graders to offer some helpful suggestions.
Milestones Recap
In summary, by the spec deadline, you must have a spec that details how you will implement high and low-level details for the parts you understand, and also detail how you will figure out the parts that you do not understand. You must also have a working prototype that can compress and decompress BType 00 blocks (but handling the entire file format, like headers and footers, is not required).
By the beta deadline, you must have a more mature program that can compress and decompress BType 01 and Btype 00 blocks, as well as handle the gzip header and footer. The code should report errors in the file. Your spec should be updated with the remainder of the design for the project.
Of course, by the final, your code must handle BType 00, 01, and 10. It is advisable to keep logs and/or notes throughout your development process. We have provided a guide to help you prototype, and reverse-engineer gzip so help you understand the RFC's and be able to write a good design in your specification.
Testing and Design
We have stated many times in class that the hardest part of this assignment is understanding the RFC's and coding the algorithms. However, this does not mean that the design and testing aspects of the project are not valuable nor requisite.
With careful preparation that includes prototyping, and good test-driven design, your final code will require little in the way of debugging.
As a reminder, Unit Tests should cover all public methods of all classes. In other words, if you have someClass.someMethod() you should have someClassTest.someMethodTest() unless someMethod is incredibly trivial.
Grading
As before, the grading for the three phases is described on the syllabus.
As a reminder, the grading breaks down like this:
- Spec Milestone: 5 points
- Beta Milestone: 5 points
- Final Submission: 15 points
- Documentation: 5 points
- Implementation: 5 points
- Unit Testing: 5 points
Please note the increased emphasis on unit testing. For the last project it was 3 points. For this project, it is worth 5 points.
Teams
The teams for project 2 are here. Sign up for your milestone meetings.
Miscellany
While it's not a formal requirement for this assignment, performance does matter. The algorithms you choose will have a non-trivial impact on the performance of your system. At the end of the assignment, we will have a "bake-off" where we will compare your work to your peers in terms of running speed, memory usage, and compression ratios on various file types. We will almost certainly hand out "cool points" to teams that excel relative to their peers on any of these metrics.
If you've implemented multiple algorithms or have a switch that can change the behavior of your system, make sure it's clearly documented in your README file so your graders will know which versions to use in which parts of the bake-off.
As a final note, you may be interested to learn more about this evolution of this assignment. This assignment replaced an earlier assignment on fast searching, where you, the student, would implement something akin to the search interface in iTunes. Under the hood, you're doing essentially the same thing here (i.e., building a data structure to help you quickly find strings with a matching prefix).
When we first did the gzip assignment in 2006, the assignment was pretty threadbare. We said, "here are the RFCs, have fun." This was painful, yet the students, in the end, felt strongly that future generations should indeed implement the real gzip specification rather than some simplified compressor that we might otherwise specify. To reduce the pain, we've added all of the information above, the guides, and so forth. You're still going to have a lot of work ahead of you, but don't panic. You'll make it.