SPECTS 2007 - Small Text Compression

Small Text Compression

Dwight Wilcox and B. Scott Jaffa

International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2007)
San Diego, California (USA), July 16-18, 2007

The Small Text Compression algorithm exploits two observations: most characters in a text message come from a limited set of alphabetic characters, and uppercase and lowercase characters are rarely intermixed at random. The set of possible input characters is partitioned into four subsets: (1) uppercase letters, (2) lowercase letters, (3) decimal digits, punctuation marks, and symbols, and (4) all remaining characters. Non-text characters, called state characters, are inserted into the compressed sequence to indicate transitions from one subset to another. Both compression and expansion, which restores the original text, can be implemented using the presented state machine designs. Most 8-bit characters are replaced with 5-bit compressed characters. Test results show that for English text the compressed output is typically two-thirds the size of the original input.
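The subset-switching idea described above can be illustrated with a minimal sketch. It is not the authors' design: the abstract does not give the code tables or the state machine, so everything named here is an assumption made for illustration. This includes the subset contents (UPPER, LOWER, SYMBOL), the placement of the space character in every subset, the use of codes 28 to 30 as state characters, the use of code 31 as a one-character escape for the fourth subset, and the choice of lowercase as the starting state. Output is a string of '0' and '1' characters, so packing into bytes and end-of-stream padding are left out.

```python
import string

# Hypothetical code tables (assumed, not taken from the paper).
# Each table holds at most 28 characters, leaving codes 28..31 free.
UPPER = string.ascii_uppercase + " "
LOWER = string.ascii_lowercase + " "
SYMBOL = string.digits + " .,;:!?'\"()-/@&$%#"
TABLES = [UPPER, LOWER, SYMBOL]

SHIFT_BASE = 28  # assumed state characters: 28, 29, 30 select subset 0, 1, 2
ESCAPE = 31      # assumed escape: the next 8 bits hold one raw character


def compress(text: str, state: int = 1) -> str:
    """Encode text as a bit string of 5-bit codes (plus 8-bit escapes)."""
    bits = []
    for ch in text:
        if ch not in TABLES[state]:
            # Find a subset that contains the character, if any.
            target = next((i for i, t in enumerate(TABLES) if ch in t), None)
            if target is None:
                # Fourth subset: escape code followed by the raw 8-bit value.
                if ord(ch) > 255:
                    raise ValueError(f"not an 8-bit character: {ch!r}")
                bits.append(format(ESCAPE, "05b"))
                bits.append(format(ord(ch), "08b"))
                continue
            # Emit a state character and switch to the new subset.
            bits.append(format(SHIFT_BASE + target, "05b"))
            state = target
        bits.append(format(TABLES[state].index(ch), "05b"))
    return "".join(bits)


def expand(bits: str, state: int = 1) -> str:
    """Decode a bit string produced by compress()."""
    out = []
    pos = 0
    while pos + 5 <= len(bits):
        code = int(bits[pos:pos + 5], 2)
        pos += 5
        if code == ESCAPE:
            out.append(chr(int(bits[pos:pos + 8], 2)))
            pos += 8
        elif code >= SHIFT_BASE:
            state = code - SHIFT_BASE  # state character: change subset
        else:
            out.append(TABLES[state][code])
    return "".join(out)


message = "Hello World, call me at 555-1234!"
packed = compress(message)
assert expand(packed) == message
print(len(packed), "bits instead of", 8 * len(message))
```

Without any state characters the best possible ratio is 5/8 of the input size; each subset change costs an extra 5 bits in this sketch, so text that switches subsets often compresses less well.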