Small Text Compression
Dwight Wilcox and B. Scott Jaffa
International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2007)
San Diego, California (USA), July 16-18, 2007
The Small Text Compression algorithm exploits two observations: most of the characters in a text message come from a limited set of alphabetic characters, and uppercase and lowercase characters are rarely intermixed at random. The set of possible input text characters is partitioned into four character subsets: (1) uppercase letters, (2) lowercase letters, (3) decimal digits, punctuation marks, and symbols, and (4) all remaining characters. Non-text characters, called state characters, are inserted into the compressed sequence to indicate transitions from one subset to another. Both compression and expansion, which restores the original text, can be implemented using the presented state machine designs. Most 8-bit characters are replaced with 5-bit compressed characters. Test results show that for English text the compressed output is typically two-thirds the size of the original input.
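The state-machine idea summarized above can be illustrated with a minimal sketch. It is not the authors' design: the summary does not give the code tables, so the three 28-entry tables, the choice to include space and period in every table, the initial lowercase state, the use of codes 28 to 30 as state characters, and the use of code 31 as an escape that carries a raw 8-bit character (standing in for the fourth subset) are all assumptions made for illustration. The names `compress` and `expand` are hypothetical.

```python
# Illustrative sketch only: tables, code assignments, and the initial state
# are assumptions, not the tables from the paper.
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ."
LOWER = "abcdefghijklmnopqrstuvwxyz ."
SYMBOL = "0123456789 .,;:!?'\"()-/@#$%&"
TABLES = [UPPER, LOWER, SYMBOL]   # each table holds 28 characters (codes 0..27)
SWITCH_BASE = 28                  # assumed state characters: 28, 29, 30 select TABLES[0..2]
ESCAPE = 31                       # assumed escape: the next 8 bits are a raw character
INITIAL_STATE = 1                 # assumed: start in the lowercase subset


def compress(text: str) -> bytes:
    """Pack text into 5-bit codes, inserting state characters on subset changes."""
    bits = []
    state = INITIAL_STATE
    for byte in text.encode("latin-1"):      # 8-bit input characters only
        ch = chr(byte)
        if ch not in TABLES[state]:
            for i, table in enumerate(TABLES):
                if ch in table:
                    bits.append(format(SWITCH_BASE + i, "05b"))  # state character
                    state = i
                    break
            else:
                # Character is in none of the tables: escape plus raw 8 bits.
                bits.append(format(ESCAPE, "05b") + format(byte, "08b"))
                continue
        bits.append(format(TABLES[state].index(ch), "05b"))
    stream = "".join(bits)
    # Pad with 1-bits so that trailing padding can only look like a truncated escape.
    stream += "1" * (-len(stream) % 8)
    return int(stream, 2).to_bytes(len(stream) // 8, "big") if stream else b""


def expand(data: bytes) -> str:
    """Reverse compress() by tracking the same subset state."""
    stream = "".join(format(b, "08b") for b in data)
    out = bytearray()
    state = INITIAL_STATE
    pos = 0
    while len(stream) - pos >= 5:
        code = int(stream[pos:pos + 5], 2)
        pos += 5
        if code == ESCAPE:
            if len(stream) - pos < 8:
                break                         # only padding remains
            out.append(int(stream[pos:pos + 8], 2))
            pos += 8
        elif code >= SWITCH_BASE:
            state = code - SWITCH_BASE        # state character: change subset
        else:
            out.append(ord(TABLES[state][code]))
    return out.decode("latin-1")
```

In this sketch, a run of characters from a single subset costs 5 bits each, so the best case is 5/8 of the input size; each subset change adds one extra 5-bit state character, and each escaped character costs 13 bits. Text that switches subsets frequently therefore compresses less well than long single-case runs.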