Lossless data compression is a class of data
compression algorithms that allows the
exact original data to be reconstructed from the compressed data.
This can be contrasted to lossy
data compression, which does not allow the exact original data
to be reconstructed from the compressed data.
Lossless data compression is used in many
applications. For example, it is used in the popular ZIP
file format and in the Unix tool gzip. It is also often used as a
component within lossy data compression technologies.
Lossless compression is used when it is important
that the original and the decompressed data be identical, or when
it cannot be assumed that a deviation from the original is
harmless. Typical examples are executable programs and source
code. Some image file formats, like PNG
or
GIF, use only lossless compression, while others like TIFF and
MNG may use either lossless or lossy methods.
Lossless compression techniques
Most lossless compression programs use two kinds of algorithms: one generates a statistical model for the input data, and another maps the input data to bit sequences using this model in such a way that "probable" (e.g. frequently encountered) data produces shorter output than "improbable" data. Encoding algorithms used to produce the bit sequences include the following (a minimal Huffman coding sketch is given after the list):
- Huffman coding (also used by DEFLATE)
- Arithmetic coding, which does not impose the restrictions on symbol probabilities that Huffman coding does
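As a rough illustration of the encoding stage, here is a minimal Huffman coding sketch in Python. The input string and helper names are invented for the example, and a real implementation would also have to store or transmit the code table alongside the encoded bits.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    freq = Counter(data)
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    if tie_breaker == 1:                      # degenerate input: one distinct symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # merge the two least frequent trees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie_breaker, (left, right)))
        tie_breaker += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):           # internal node: descend both branches
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:                                 # leaf: a symbol
            codes[tree] = prefix
    assign(heap[0][2], "")
    return codes

text = "abracadabra"
codes = huffman_code(text)
encoded = "".join(codes[c] for c in text)
print(codes)
print(len(encoded), "bits, versus", 8 * len(text), "bits uncompressed")
```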
Lossless compression methods may be categorized
according to the type of data they are designed to compress. While,
in principle, any general-purpose lossless compression algorithm
(general-purpose meaning that it can accept any binary input) can
be used on any type of data, many are unable to achieve significant
compression on data that is not of the form they were designed for.
Many of the lossless compression techniques used for text also work
reasonably well for indexed images.
Text
Statistical modeling algorithms for text (or text-like binary data such as executables) include the following (a minimal LZW sketch is given after the list):
- Context Tree Weighting method (CTW)
- Burrows-Wheeler transform (block sorting preprocessing that makes compression more efficient)
- LZ77 (used by DEFLATE)
- LZW
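As an illustration of dictionary-based methods such as LZW, the following is a minimal sketch in Python; it deliberately omits details of real implementations such as variable code widths, dictionary size limits, and dictionary resets.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit a list of dictionary codes; the dictionary starts with all single bytes."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    output = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                           # keep extending the current match
        else:
            output.append(dictionary[w])     # emit the longest known string
            dictionary[wc] = next_code       # learn the new string
            next_code += 1
            w = bytes([byte])
    if w:
        output.append(dictionary[w])
    return output

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same dictionary on the fly from the code stream."""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    result = bytearray(w)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                # special case: code not yet defined
            entry = w + w[:1]
        result.extend(entry)
        dictionary[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return bytes(result)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(sample)
assert lzw_decompress(codes) == sample
print(len(sample), "bytes ->", len(codes), "codes")
```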
Multimedia
Some techniques take advantage of the specific characteristics of images, such as the common phenomenon of contiguous 2-D areas of similar tones. For example, every pixel but the first is replaced by the difference to its left neighbor. This leads to small values having a much higher probability than large values. The same idea is often applied to sound files, and can compress files that contain mostly low frequencies and low volumes. For images this step can be repeated by taking the difference to the pixel above, and in videos the difference to the corresponding pixel in the next frame can be taken.
A hierarchical version of this technique takes
neighboring pairs of data points, stores their difference and sum,
and on a higher level with lower resolution continues with the
sums. This is called the discrete wavelet transform. JPEG 2000
additionally uses data points from other pairs and multiplication
factors to mix them into the difference. These factors must be
integers, so that the result is an integer under all circumstances.
The values are therefore increased, which increases file size, but
the distribution of values will hopefully be more sharply peaked.
The adaptive encoding uses the probabilities from
the previous sample in sound encoding, from the left and upper
pixel in image encoding, and additionally from the previous frame
in video encoding. In the wavelet transformation the probabilities
are also passed through the hierarchy.
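A minimal sketch of the left-neighbor differencing step described at the start of this section, for a single row of 8-bit samples (the example values are invented). The step itself is exactly invertible, so it loses nothing; it only reshapes the value distribution for a later entropy coder.

```python
def delta_encode(row):
    """Replace every sample but the first with the difference to its left neighbor (mod 256)."""
    return [row[0]] + [(row[i] - row[i - 1]) % 256 for i in range(1, len(row))]

def delta_decode(deltas):
    """Invert the differencing exactly, so the transform is lossless."""
    row = [deltas[0]]
    for d in deltas[1:]:
        row.append((row[-1] + d) % 256)
    return row

row = [100, 101, 101, 103, 104, 104, 104, 106]   # smoothly varying tones
deltas = delta_encode(row)
assert delta_decode(deltas) == row
print(deltas)   # mostly small values (0, 1, 2), which an entropy coder favors
```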
Historical legal issues
Many of these methods are implemented in open-source and proprietary tools, particularly LZW and its variants. Some algorithms are patented in the United States and other countries, and their legal use requires licensing by the patent holder. Because of patents on certain kinds of LZW compression, and in particular licensing practices by patent holder Unisys that many developers considered abusive, some open-source advocates encouraged people to avoid using the Graphics Interchange Format (GIF) for compressing image files in favor of Portable Network Graphics (PNG), which combines the LZ77-based DEFLATE algorithm with a selection of domain-specific prediction filters. However, the patents on LZW have now expired.

Many of the lossless compression techniques used
for text also work reasonably well for indexed
images, but there are other techniques that do not work for
typical text yet are useful for some images (particularly simple
bitmaps), and still other techniques that take advantage of the
specific characteristics of images (such as the common phenomenon
of contiguous 2-D areas of similar tones, and the fact that color
images are usually dominated by a limited range of the colors
representable in the color space).
As mentioned previously, lossless sound
compression is a somewhat specialised area. Lossless sound
compression algorithms can take advantage of the repeating patterns
shown by the wave-like nature of the data – essentially using
models to predict the "next" value and encoding the (hopefully
small) difference between the expected value and the actual data.
If the difference between the predicted and the actual data (called
the "error") tends to be small, then certain difference values
(like 0, +1, -1 etc. on sample values) become very frequent, which
can be exploited by encoding them in few output bits.
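A minimal sketch of this predict-and-encode-the-error idea, here using a second-order linear predictor on an invented sine-wave signal. Real codecs such as FLAC use similar fixed or adaptive linear predictors, followed by an entropy coder for the residuals; this is only an illustration of why the residuals come out small.

```python
import math

def predict_residuals(samples):
    """Second-order fixed predictor: predict s[n] = 2*s[n-1] - s[n-2], store the error."""
    residuals = list(samples[:2])            # warm-up samples stored verbatim
    for n in range(2, len(samples)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        residuals.append(samples[n] - prediction)
    return residuals

def reconstruct(residuals):
    """Rebuild the original samples exactly from the residuals."""
    samples = list(residuals[:2])
    for e in residuals[2:]:
        prediction = 2 * samples[-1] - samples[-2]
        samples.append(prediction + e)
    return samples

wave = [round(1000 * math.sin(2 * math.pi * 440 * n / 44100)) for n in range(64)]
res = predict_residuals(wave)
assert reconstruct(res) == wave
print(max(abs(e) for e in res[2:]), "max residual, versus",
      max(abs(s) for s in wave), "max sample value")
```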
It is sometimes beneficial to compress only the
differences between two versions of a file (or, in video
compression, of an image). This is called delta
compression (from the Greek letter Δ which is
commonly used in mathematics to denote a difference), but the term
is typically only used if both versions are meaningful outside
compression and decompression. For example, while the process of
compressing the error in the above-mentioned lossless audio
compression scheme could be described as delta
compression from the approximated sound wave to the original
sound wave, the approximated version of the sound wave is not
meaningful in any other context.
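One way (among several) to realise delta compression between two file versions is to supply the old version as a preset dictionary to a DEFLATE-based compressor, so that unchanged content is encoded as back-references. The sketch below uses Python's zlib with invented file contents; only the receiver that already holds the old version can decompress the delta.

```python
import os
import zlib

old_version = os.urandom(30_000)                                      # a previous version
new_version = old_version[:10_000] + b"PATCH" + old_version[10_000:]  # a small edit

# Compress the new version with the old version as a preset dictionary.
comp = zlib.compressobj(zdict=old_version)
delta = comp.compress(new_version) + comp.flush()

# Decompression needs the same dictionary, i.e. the old version.
decomp = zlib.decompressobj(zdict=old_version)
assert decomp.decompress(delta) + decomp.flush() == new_version

print(len(new_version), "bytes in the new version;",
      len(zlib.compress(new_version)), "bytes compressed on its own;",
      len(delta), "bytes with the old version as a dictionary")
```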
Lossless compression methods
No lossless compression algorithm can efficiently compress all possible data, and completely random data streams cannot be compressed. For this reason, many different algorithms exist that are designed either with a specific type of input data in mind or with specific assumptions about what kinds of redundancy the uncompressed data are likely to contain.

Some of the most common lossless compression
algorithms are listed below.
General purpose
- Run-length encoding – a simple scheme that provides good compression of data containing many runs of the same value (a minimal sketch is given after this list)
- LZW – used by GIF and the Unix compress utility, among others
- DEFLATE – used by gzip and modern versions of ZIP, and as part of the compression process of PNG, PPP, HTTP, and SSH
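A minimal run-length encoding sketch, with an invented bitmap scanline as input; long runs of identical bytes collapse to (count, value) pairs.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (count, value) byte pairs; counts are capped at 255 to fit one byte."""
    runs = []
    for b in data:
        if runs and runs[-1][1] == b and runs[-1][0] < 255:
            runs[-1][0] += 1
        else:
            runs.append([1, b])
    return bytes(x for run in runs for x in run)

def rle_decode(encoded: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(encoded), 2):
        count, value = encoded[i], encoded[i + 1]
        out.extend([value] * count)
    return bytes(out)

scanline = b"\x00" * 70 + b"\xff" * 10 + b"\x00" * 20   # one row of a simple bitmap
packed = rle_encode(scanline)
assert rle_decode(packed) == scanline
print(len(scanline), "bytes ->", len(packed), "bytes")
```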
Audio
- Apple Lossless – ALAC (Apple Lossless Audio Codec)
- ATRAC Advanced Lossless
- Audio Lossless Coding – also known as MPEG-4 ALS
- MPEG-4 SLS – also known as HD-AAC
- Direct Stream Transfer – DST
- Dolby TrueHD
- DTS-HD Master Audio
- Free Lossless Audio Codec – FLAC
- Meridian Lossless Packing – MLP
- Monkey's Audio – Monkey's Audio APE
- OptimFROG
- RealPlayer – RealAudio Lossless
- Shorten – SHN
- TTA – True Audio Lossless
- WavPack – WavPack lossless
- WMA Lossless – Windows Media Lossless
Graphics
- ABO – Adaptive Binary Optimization
- GIF – (lossless, but restricted to a palette of at most 256 colors)
- JBIG2 – (lossless or lossy compression of B&W images)
- JPEG-LS – (lossless/near-lossless compression standard)
- JPEG 2000 – (includes a lossless compression method)
- JPEG XR – (formerly WMPhoto and HD Photo; includes a lossless compression method)
- PGF – Progressive Graphics File (lossless or lossy compression)
- PNG – Portable Network Graphics
- TIFF
Video
Cryptography
Cryptosystems
often compress data before encryption for added security;
compression prior to encryption helps remove redundancies and
patterns that might facilitate cryptanalysis. However, many
ordinary lossless compression algorithms introduce predictable
patterns (such as headers, wrappers, and tables) into the
compressed data that may actually make cryptanalysis easier.
Therefore, cryptosystems often incorporate specialized compression
algorithms specific to the cryptosystem—or at least
demonstrated or widely held to be cryptographically
secure—rather than standard compression algorithms that
are efficient but provide potential opportunities for
cryptanalysis.
Limitations
Lossless data compression algorithms cannot
guarantee compression for all input data sets. In other words, for
any (lossless) data compression algorithm, there will be an input
data set that does not get smaller when processed by the algorithm.
This is easily proven with elementary mathematics using a counting
argument, as follows:
- Assume that each file is represented as a string of bits of some arbitrary length.
- Suppose that there is a compression algorithm that transforms every file into a distinct file which is no longer than the original file, and that at least one file will be compressed into something that is shorter than itself.
- Let M be the least number such that there is a file F with length M bits that compresses to something shorter. Let N be the length (in bits) of the compressed version of F.
- Because N < M, every file of length N keeps its size during compression. There are 2^N such files. Together with F, this makes 2^N + 1 files which all compress into one of the 2^N files of length N.
- But 2^N is smaller than 2^N + 1, so by the pigeonhole principle there must be some file of length N which is simultaneously the output of the compression function on two different inputs. That file cannot be decompressed reliably (which of the two originals should it yield?), which contradicts the assumption that the algorithm was lossless.
- We must therefore conclude that our original hypothesis (that the compression function makes no file longer) is necessarily untrue.
Any lossless compression algorithm that makes
some files shorter must necessarily make some files longer, but it
is not necessary that those files become very much longer. Most
practical compression algorithms provide an "escape" facility that
can turn off the normal coding for files that would become longer
by being encoded. Then the only increase in size is a few bits to
tell the decoder that the normal coding has been turned off for the
entire input. For example, DEFLATE compressed
files never need to grow by more than 5 bytes per 65,535 bytes of
input.
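As a quick check of this behaviour, Python's zlib module (a DEFLATE implementation) can be applied to incompressible random bytes; the exact figures will vary from run to run, but the growth stays small.

```python
import os
import zlib

random_data = os.urandom(1 << 20)          # 1 MiB of effectively incompressible input
compressed = zlib.compress(random_data)
overhead = len(compressed) - len(random_data)
print(f"{len(random_data)} -> {len(compressed)} bytes ({overhead} bytes of overhead)")
# DEFLATE falls back to "stored" blocks for such data, so the growth is limited
# to a few bytes of framing per 64 KiB block rather than blowing up.
```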
In fact, if we consider files of length N and assume that all such
files are equally probable, then for any lossless compression that
reduces the size of some file, the expected length of a compressed
file (averaged over all possible files of length N) must
necessarily be greater than N. So if we know nothing about the
properties of the data we are compressing, we might as well not
compress it at all. A lossless compression algorithm is only useful
when we are more likely to compress certain types of files than
others; then the algorithm could be designed to compress those
types of data better.
Thus, the main lesson from the argument is not
that one risks big losses, but merely that one cannot always win.
To choose an algorithm always means implicitly to select a subset
of all files that will become usefully shorter. This is the
theoretical reason why we need to have different compression
algorithms for different kinds of files: there cannot be any
algorithm that is good for all kinds of data.
The "trick" that allows lossless compression
algorithms, used on the type of data they were designed for, to
consistently compress such files to a shorter form is that the
files the algorithm are designed to act on all have some form of
easily-modeled
redundancy that the algorithm is designed to remove, and thus
belong to the subset of files that that algorithm can make shorter,
whereas other files would not get compressed or even get bigger.
Algorithms are generally quite specifically tuned to a particular
type of file: for example, lossless audio compression programs do
not work well on text files, and vice versa.
In particular, files of random data cannot be
consistently compressed by any conceivable lossless data
compression algorithm: indeed, this result is used to define the
concept of randomness in
algorithmic complexity theory.
There have been many claims over the years of
companies achieving "perfect compression", in which an arbitrary
number N of random bits can always be compressed to N - 1 bits. This is, of
course, impossible: if such an algorithm existed, it could be
applied repeatedly to losslessly reduce any file to length 0. These
kinds of claims can be safely discarded without even looking at any
further details regarding the purported compression scheme.
An algorithm that is asserted to
be able to losslessly compress any data stream is provably
impossible. In a more general sense, any compression algorithm
whose proposed properties contradict fundamental laws of mathematics may be called
magic.
Mathematical background
Any compression
algorithm can be viewed as a function
that maps sequences of units (normally octets) into other sequences of
the same units. Compression is successful if the resulting sequence
is shorter than the original sequence. In order for a compression
algorithm to be considered lossless, there needs to exist
a reverse mapping from compressed bit sequences to original bit
sequences; that is to say, the compression method would need to
encapsulate a bijection between "plain" and
"compressed" bit sequences.
The sequences of length N or less are clearly a
strict superset of the sequences of length N-1 or less. It follows
that there are more sequences of length N or less than there are
sequences of length N-1 or less. It therefore follows from the
pigeonhole
principle that it is not possible to map every sequence of
length N or less to a unique sequence of length N-1 or less.
Therefore it is not possible to produce an algorithm that reduces
the size of every possible input sequence.
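A small numeric check of this counting argument (plain Python, no compression involved); the loop bound is arbitrary.

```python
# Count the bit sequences of length at most N and at most N-1. The first set
# is strictly larger, so no injective ("lossless") mapping from the first set
# into the second can exist, by the pigeonhole principle.
for n in range(1, 11):
    at_most_n = sum(2 ** k for k in range(n + 1))        # = 2**(n+1) - 1
    at_most_n_minus_1 = sum(2 ** k for k in range(n))    # = 2**n - 1
    assert at_most_n > at_most_n_minus_1
    print(n, at_most_n, at_most_n_minus_1)
```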
Psychological background
Most everyday files are relatively 'sparse' in an
information
entropy sense, and thus most lossless algorithms a layperson
is likely to apply to regular files compress them relatively well.
This may, through misapplication of intuition, lead some
individuals to conclude that a well-designed compression algorithm
can compress any input, and would thus constitute a magic
compression algorithm.
Points of application in real compression theory
Real compression algorithm designers accept that
streams of high information entropy cannot be compressed, and
accordingly, include facilities for detecting and handling this
condition. An obvious way of detection is applying a raw
compression algorithm and testing if its output is smaller than its
input. Sometimes, detection is made by heuristics; for example, a
compression application may consider files whose names end in
".zip", ".arj" or ".lha" uncompressible without any more
sophisticated detection. A common way of handling this situation is
quoting the input, or uncompressible parts of the input, verbatim
in the output, minimising the compression overhead. For example,
the zip data format specifies the 'compression method' of 'Stored'
for input files that have been copied into the archive verbatim.
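For instance, Python's zipfile module exposes this choice directly; the sketch below, with an invented incompressible payload, compares the 'Deflated' and 'Stored' methods for the same member.

```python
import io
import os
import zipfile

payload = os.urandom(100_000)        # stand-in for a file that is already compressed

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("deflated.bin", payload, compress_type=zipfile.ZIP_DEFLATED)
    archive.writestr("stored.bin", payload, compress_type=zipfile.ZIP_STORED)

with zipfile.ZipFile(buffer) as archive:
    for info in archive.infolist():
        print(info.filename, info.file_size, "->", info.compress_size)
# For incompressible input the DEFLATE member ends up slightly larger than the
# original, while the 'Stored' member is copied into the archive verbatim.
```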
The Million Random Number Challenge
Mark Nelson, frustrated by the many cranks in
comp.compression
claiming to have invented a magic compression algorithm,
constructed a 415,241-byte binary file (http://marknelson.us/attachments/million-digit-challenge/AMillionRandomDigits.bin)
of highly entropic content, and issued a public challenge of $100
to anyone who could write a program that, together with its input,
would be smaller than the provided binary data yet able to
reconstitute ("decompress") it without error.
The FAQ for the comp.compression
newsgroup contains a
challenge by Mike Goldman offering $5,000 for a program that can
compress random data. Patrick Craig took up the challenge, but
rather than compressing the data, he split it up into separate
files, all of which ended in the digit '5', which was not stored as
part of the file. Omitting this character allowed the resulting
files (plus, in accordance with the rules, the size of the program
that reassembled them) to be smaller than the original file.
However, no actual compression took place: the information
stored in the names of the files was necessary in order to
reassemble them in the correct order into the original file, and
this information was not taken into account in the file size
comparison. The files themselves are thus not sufficient to
reconstitute the original file; the file names are also necessary.
A full history of the event, including discussion on whether or not
the challenge was technically met, is on Patrick Craig's web
site.
External links
- Lossless data compression Benchmarks and Tests
- Comparison of Lossless Audio Compressors at Hydrogenaudio Wiki
- Comparing lossless and lossy audio formats for music archiving
- Links to data compression topics and tutorials
- http://www.data-compression.com/theory.html — data-compression.com's overview of data compression and its fundamental limitations
- http://www.faqs.org/faqs/compression-faq/part2/section-4.html — comp.compression's FAQ item 73, What is the theoretical compression limit?
- http://www.c10n.info/archives/439 — c10n.info's overview of US patent #7,096,360, "[a]n "Frequency-Time Based Data Compression Method" supporting the compression, encryption, decompression, and decryption and persistence of many binary digits through frequencies where each frequency represents many bits."