home | programming | software | artwork | literature | poetry contact me | host
c++ | excel

Programming, C++

[data compression]
RLE - run length encoding
MTF - move to front
BWT - burrows wheeler
Diatomic encoding
The static model

[bit io routines]
Bits & bytes
Bit io demo

'Diatomic Encoding (or RC, Replacement Coding)'

This is an idea I ran with after examining files for repeated bytes & patterns. After which I sat down and wrote the code for this encoding application. I have since learnt (through reading Data Compression - second Edition, by Gilbert Held) that is is termed 'Diatomic Encoding'. Lots of different compression techniques use pattern matching to reduce the end filesize. I noticed in some files that if you could free up 1 byte out of the possible 0-255 chars, you could then use this 'spare' byte to replace repeated patterns or just pairs of bytes to force a dramatic reduction. Indeed if you then use brute force to repeatedly re-encode (or even add some RLE / Delta Encoding) even greater gains can be made. Obviously a lot of ideas or new concepts that you may have usually have been visited by someone, somewhere before. I have tried to put my own twist on this form of 'Diatomic Encoding' and was relatively suprised with the results. In most cases this simple technique almost performs as well as RLE.

Basically to encode a file I find the 2 lowest counts of all symbols read in. I then find the highest pair of bytes read in. For example lowest #1: 'h', lowest #2: 'g', highest pair is 'aa'. Therefore I first replace all occurances of 'h' with 'g' keeping a bitmask of swap flags (i.e. 1 if i swap, 0 if I don't). Then as I have freed up the 'h' I replace any occurances of 'aa' with the single 'h'. I write out as a header 'h','g','a','a',1 (meaning read in bitmask as 1 byte), 1 (bitmask bits formatted to single byte). Please see the code below for implementation or see my bit io routines for examples of bit functions. The decoding is very easy and fast. This method and its code is not rocket science, but it is easy to write, quick to execute and does actually reduce filesize in most cases (even .JPG'S) to some extent. NB: the bitmask in this encoding routine is to specify whether we replaced one byte with another i.e. if we replaced 'h' with 'g' and then we had a normal 'g' we would have a bitmask of '10'.

 Click to view code as text file...   Click to download file... 

[screen shot]

Diatomic (RC) Encoding - screen shot of Application running encoding technique on an Excel file