It is a lossless, text-based log compressor and indexer that uses
regexless parsing and needs no prior knowledge of the log format
to achieve good compression. The compression mechanism is IO-bound
rather than CPU-bound, and it uses memory to reduce IO latency.
Why use it?
I was researching methods of parsing text-based log data without
prior knowledge of its format and backed into several
interesting properties of my parsing strategy. One advantage was
that by creating a template that represented the static and variable
parts of a given log line and storing the variable parts in memory, I
could represent any log line by providing the unique identifier for the
format template and a unique identifier for each of the variables.
This allowed me to store several hundred bytes with a few dozen.
With optimization, I could get the ratio up to 400 to 1, and the
strategy was very fast. The second surprise was that by noting the
line number where each variable was seen while I parsed the logs, I
created a search index of all variables on the fly.
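The idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the tool's actual code: the rule used here to separate static from variable tokens (any whitespace-delimited token containing a digit is treated as variable) is an assumption made for the sketch, and the class and method names are invented for it.

```python
from collections import defaultdict

def is_variable(token):
    # Assumed heuristic for this sketch: tokens containing digits
    # are variable; everything else is part of the static template.
    return any(ch.isdigit() for ch in token)

class LogCompressor:
    def __init__(self):
        self.template_ids = {}          # template tuple -> template id
        self.variable_ids = {}          # variable string -> variable id
        self.index = defaultdict(list)  # variable id -> line numbers seen
        self.encoded = []               # one (template id, variable ids) per line

    def add_line(self, lineno, line):
        tokens = line.split()
        # The template keeps the static tokens and marks variable slots.
        template = tuple('<*>' if is_variable(t) else t for t in tokens)
        tid = self.template_ids.setdefault(template, len(self.template_ids))
        vids = []
        for t in tokens:
            if is_variable(t):
                vid = self.variable_ids.setdefault(t, len(self.variable_ids))
                self.index[vid].append(lineno)  # search index built on the fly
                vids.append(vid)
        # The whole line is now just a template id plus a few variable ids.
        self.encoded.append((tid, tuple(vids)))

    def decode(self, lineno):
        # Reverse the mappings to reconstruct the original line losslessly.
        tid, vids = self.encoded[lineno]
        template = next(t for t, i in self.template_ids.items() if i == tid)
        id_to_var = {i: v for v, i in self.variable_ids.items()}
        vars_iter = iter(vids)
        return ' '.join(id_to_var[next(vars_iter)] if tok == '<*>' else tok
                        for tok in template)
```

Because repeated lines that differ only in their variables share one template id, each stored line shrinks to a handful of small integers, and the `index` table doubles as the variable search index described above.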
I have used the tool on several occasions to store large volumes of log data where free disk space was limited.
What is in the works?
If there is some interest in this tool, I will migrate the build to GNU
autotools and post the source code. My latest version uses a
fairly low-tech compression algorithm for storing and identifying
the variables in RAM and on disk. That could be improved. Some
who have played with the code believe that 1000 to 1 compression is
possible. That has yet to be demonstrated.