1.1. Byte and Its Representation
1.2. Multibyte Integers
2. Overall Structure of .lzma File
2.1.1.1. Header Magic Bytes
2.1.2.2. Backward Size
2.1.2.4. Footer Magic Bytes
3.1.1. Block Header Size
3.1.3. Compressed Size
3.1.4. Uncompressed Size
3.1.5. List of Filter Flags
4.2. Number of Records
4.3.2. Uncompressed Size
5.3.3. Branch/Call/Jump Filters for Executables
5.3.4.1. Format of the Encoded Output
5.4. Custom Filter IDs
5.4.1. Reserved Custom Filter ID Ranges
6. Cyclic Redundancy Checks

This document describes the .lzma file format (filename suffix
`.lzma', MIME type `application/x-lzma'). It is intended that
this format replace the format used by the LZMA_Alone tool
included in LZMA SDK up to and including version 4.57.

IMPORTANT: The version described in this document is a
draft, NOT a final, official version. Changes are possible.

0.1. Copyright Notices

Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org>
Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>

Copying and distribution of this file, with or without
modification, are permitted in any medium without royalty
provided the copyright notice and this notice are preserved.
Modified versions must be marked as such.

All source code examples given in this document are put into
the public domain by the authors of this document.

Special thanks for helping with this document go to
Igor Pavlov. Thanks for helping with this document go to
Mark Adler, H. Peter Anvin, and Mikko Pouru.

Last modified: 2008-09-07 10:20+0300

(A changelog will be kept once the first official version
has been released.)

The keywords `must', `must not', `required', `should',
`should not', `recommended', `may', and `optional' in this
document are to be interpreted as described in [RFC-2119].
These words are not capitalized in this document.

Indicating a warning means displaying a message, returning
appropriate exit status, or something else to let the user
know that something worth warning occurred. The operation
should still finish if a warning is indicated.

Indicating an error means displaying a message, returning
appropriate exit status, or something else to let the user
know that something prevented successfully finishing the
operation. The operation must be aborted once an error has
been indicated.

1.1. Byte and Its Representation

In this document, a byte is always 8 bits.

A `nul byte' has all bits unset. That is, the value of a nul
byte is 0x00.

To represent byte blocks, this document uses notation that
is similar to the notation used in [RFC-1952]:

    +---+---+
    |  Foo  |   Two bytes; that is, some of the vertical bars
    +---+---+   can be missing.

    +=======+
    |  Foo  |   Zero or more bytes.
    +=======+

In this document, a boxed byte or a byte sequence declared
using this notation is called `a field'. The example field
above would be called `the Foo field' or plain `Foo'.

1.2. Multibyte Integers

Multibyte integers of static length, such as CRC values,
are stored in little endian byte order (least significant
byte first).

When smaller values are more likely than bigger values (for
example file sizes), multibyte integers are encoded in a
variable-length representation:
  - Numbers in the range [0, 127] are copied as is, and take
    one byte of space.
  - Bigger numbers will occupy two or more bytes. All but the
    last byte of the multibyte representation have the highest
    bit set.

For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in the future if
needed.
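
As a rough illustration of the limits above: every encoded byte
carries seven bits of the value, so nine bytes carry 9 * 7 = 63
bits. The sketch below computes how many bytes a value occupies.
The names vli_encoded_size and VLI_MAX are hypothetical and are
not defined by this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest value that fits in 63 bits (2^63 - 1). */
#define VLI_MAX (UINT64_MAX / 2)

/*
 * Hypothetical helper: number of bytes needed to store num in the
 * variable-length representation, or 0 if num needs more than
 * 63 bits.
 */
static size_t
vli_encoded_size(uint64_t num)
{
    if (num > VLI_MAX)
        return 0;

    size_t size = 1;            /* the last byte holds 7 bits */
    while (num >= 0x80) {       /* each earlier byte holds 7 more */
        num >>= 7;
        ++size;
    }

    assert(size <= 9);          /* 9 * 7 = 63 bits */
    return size;
}
```

For example, 127 still fits in one byte, 128 needs two bytes,
and the largest 63-bit value needs all nine bytes.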

The following C code illustrates encoding and decoding of
variable-length integers. The functions return the number of
bytes occupied by the integer (1-9), or zero on error.

    #include <sys/types.h>
    #include <inttypes.h>

    size_t
    encode(uint8_t buf[static 9], uint64_t num)
    {
        // Values that need more than 63 bits cannot be encoded.
        if (num > UINT64_MAX / 2)
            return 0;

        size_t i = 0;

        while (num >= 0x80) {
            buf[i++] = (uint8_t)(num) | 0x80;
            num >>= 7;
        }

        buf[i++] = (uint8_t)(num);

        return i;
    }

    size_t
    decode(const uint8_t buf[], size_t size_max, uint64_t *num)
    {
        if (size_max == 0)
            return 0;

        if (size_max > 9)
            size_max = 9;

        *num = buf[0] & 0x7F;
        size_t i = 0;

        while (buf[i++] & 0x80) {
            // Never read past the buffer, and reject a nul byte
            // as the last byte of a multibyte integer.
            if (i >= size_max || buf[i] == 0x00)
                return 0;

            *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
        }

        return i;
    }

2. Overall Structure of .lzma File

    +========+================+========+================+
    | Stream | Stream Padding | Stream | Stream Padding | ...
    +========+================+========+================+

A file usually contains only one Stream. However, it is
possible to concatenate multiple Streams together with no
additional processing. It is up to the implementation to
decide if the decoder will continue decoding from the next
Stream once the end of the first Stream has been reached.

    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
    |     Stream Header     | Block | Block | ... | Block |
    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

         +=======+-+-+-+-+-+-+-+-+-+-+-+-+
    ---> | Index |     Stream Footer     |
         +=======+-+-+-+-+-+-+-+-+-+-+-+-+

All the above fields have a size that is a multiple of four
bytes. If Stream is used as an internal part of another file
format, it is recommended to make the Stream start at an offset
that is a multiple of four bytes.

Stream Header, Index, and Stream Footer are always present in
a Stream. The maximum size of the Index field is 16 GiB
(2^34 bytes).

There are zero or more Blocks. The maximum number of Blocks is
limited only by the maximum size of the Index field.

Total size of a Stream must be less than 8 EiB (2^63 bytes).
The same limit applies to the total amount of uncompressed
data stored in a Stream.

If an implementation supports handling .lzma files with
multiple concatenated Streams, it may apply the above limits
to the file as a whole instead of applying them per Stream.

    +---+---+---+---+---+---+-------+------+--+--+--+--+
    |  Header Magic Bytes   | Stream Flags |   CRC32   |
    +---+---+---+---+---+---+-------+------+--+--+--+--+

2.1.1.1. Header Magic Bytes

The first six (6) bytes of the Stream are the so-called Header
Magic Bytes. They can be used to identify the file type.

    Using a C array and ASCII:
    const uint8_t HEADER_MAGIC[6]
            = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 };

    In plain hexadecimal:
    FF 4C 5A 4D 41 00

  - The first byte (0xFF) was chosen so that the files cannot
    be erroneously detected as being in LZMA_Alone format, in
    which the first byte is in the range [0x00, 0xE0].
  - The sixth byte (0x00) was chosen to prevent applications
    from misdetecting the file as a text file.

If the Header Magic Bytes don't match, the decoder must
indicate an error.

2.1.1.2. Stream Flags

The first byte of Stream Flags is always a nul byte. In the
future this byte may be used to indicate a new Stream version
or other Stream properties.

The second byte of Stream Flags is a bit field:

    Bit(s)  Mask  Description
     0-3    0x0F  Type of Check (see Section 3.3):
                      ID    Size      Check name
                      0x00   0 bytes  None
                      0x01   4 bytes  CRC32
                      0x02   4 bytes  (Reserved)
                      0x03   4 bytes  (Reserved)
                      0x04   8 bytes  CRC64
                      0x05   8 bytes  (Reserved)
                      0x06   8 bytes  (Reserved)
                      0x07  16 bytes  (Reserved)
                      0x08  16 bytes  (Reserved)
                      0x09  16 bytes  (Reserved)
                      0x0A  32 bytes  SHA-256
                      0x0B  32 bytes  (Reserved)
                      0x0C  32 bytes  (Reserved)
                      0x0D  64 bytes  (Reserved)
                      0x0E  64 bytes  (Reserved)
                      0x0F  64 bytes  (Reserved)
     4-7    0xF0  Reserved for future use; must be zero for now.

Implementations must support at least the Check IDs 0x00 (None)
and 0x01 (CRC32). Supporting other Check IDs is optional. If
an unsupported Check is used, the decoder should indicate a
warning or error.

If any reserved bit is set, the decoder must indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Stream Header
incorrectly.

The CRC32 is calculated from the Stream Flags field. It is
stored as an unsigned 32-bit little endian integer. If the
calculated value does not match the stored one, the decoder
must indicate an error.

The idea is that Stream Flags would always be two bytes, even
if new features are needed. This way old decoders will be able
to verify the CRC32 calculated from Stream Flags, and thus
distinguish between corrupt files (CRC32 doesn't match) and
files that the decoder doesn't support (CRC32 matches but
Stream Flags has reserved bits set).
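
The Stream Header layout described above can be sketched as
follows. This is only an illustration under stated assumptions:
write_stream_header and crc32_bitwise are hypothetical names,
and the CRC32 is computed bit by bit with the polynomial used
in Section 6 instead of a lookup table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32 (reflected polynomial 0xEDB88320), no table. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc >> 1)
                    ^ ((crc & 1) ? UINT32_C(0xEDB88320) : 0);
    }
    return ~crc;
}

/*
 * Hypothetical helper: write the 12-byte Stream Header for the
 * given Check ID into out. Returns 0 on success, or -1 if
 * check_id would set reserved bits.
 */
static int
write_stream_header(uint8_t out[12], uint8_t check_id)
{
    static const uint8_t magic[6]
            = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 };

    assert(out != NULL);
    if (check_id & 0xF0)
        return -1;

    memcpy(out, magic, 6);      /* Header Magic Bytes */
    out[6] = 0x00;              /* first flags byte: always nul */
    out[7] = check_id;          /* second flags byte: Type of Check */

    /* The CRC32 covers only the two Stream Flags bytes. */
    const uint32_t crc = crc32_bitwise(out + 6, 2);
    for (int i = 0; i < 4; ++i) /* little endian */
        out[8 + i] = (uint8_t)(crc >> (8 * i));

    return 0;
}
```

A decoder would do the reverse: compare the first six bytes
against the magic, verify the CRC32 of bytes 6-7, and only then
look at the reserved bits.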

    +-+-+-+-+---+---+---+---+-------+------+----------+---------+
    | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
    +-+-+-+-+---+---+---+---+-------+------+----------+---------+

The CRC32 is calculated from the Backward Size and Stream Flags
fields. It is stored as an unsigned 32-bit little endian
integer. If the calculated value does not match the stored one,
the decoder must indicate an error.

The reason to have the CRC32 field before the Backward Size and
Stream Flags fields is to keep the four-byte fields aligned to
a multiple of four bytes.

2.1.2.2. Backward Size

Backward Size is stored as a 32-bit little endian integer,
which indicates the size of the Index field as a multiple of
four bytes, the minimum value being four bytes:

    real_backward_size = (stored_backward_size + 1) * 4;

Using a fixed-size integer to store this value makes it
slightly simpler to parse the Stream Footer when the
application needs to parse the Stream backwards.

2.1.2.3. Stream Flags

This is a copy of the Stream Flags field from the Stream
Header. The information stored in Stream Flags is needed
when parsing the Stream backwards. The decoder must compare
the Stream Flags fields in both Stream Header and Stream
Footer, and indicate an error if they are not identical.

2.1.2.4. Footer Magic Bytes

As the last step of the decoding process, the decoder must
verify the existence of Footer Magic Bytes. If they don't
match, an error must be indicated.

    Using a C array and ASCII:
    const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

The primary reason to have Footer Magic Bytes is to make
it easier to detect incomplete files quickly, without
uncompressing. If the file does not end with Footer Magic Bytes
(excluding Stream Padding described in Section 2.2), it cannot
be undamaged, unless someone has intentionally appended garbage
after the end of the Stream.
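
When a Stream is parsed backwards, the Stream Footer checks
described above could be combined roughly as follows. This is
a sketch, not a complete decoder: parse_stream_footer, read32le,
and crc32_bitwise are hypothetical names, and the comparison of
Stream Flags against the Stream Header is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 (reflected polynomial 0xEDB88320), no table. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc >> 1)
                    ^ ((crc & 1) ? UINT32_C(0xEDB88320) : 0);
    }
    return ~crc;
}

/* Read an unsigned 32-bit little endian integer. */
static uint32_t
read32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: validate a 12-byte Stream Footer and
 * return the real size of the Index field in bytes, or 0 on
 * error.
 */
static uint64_t
parse_stream_footer(const uint8_t footer[12])
{
    assert(footer != NULL);

    /* Footer Magic Bytes: a cheap test for truncated files. */
    if (footer[10] != 'Y' || footer[11] != 'Z')
        return 0;

    /* CRC32 covers Backward Size (4 bytes) and Stream Flags (2). */
    if (read32le(footer) != crc32_bitwise(footer + 4, 6))
        return 0;

    /* real_backward_size = (stored_backward_size + 1) * 4 */
    return ((uint64_t)read32le(footer + 4) + 1) * 4;
}
```

The multiplication is done in 64 bits because the largest
stored value, 0xFFFFFFFF, maps to 2^34 bytes, which does not
fit in a 32-bit integer.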

2.2. Stream Padding

Only the decoders that support decoding of concatenated Streams
must support Stream Padding.

Stream Padding must contain only nul bytes. Any non-nul byte
should be considered as the beginning of a new Stream. To
preserve the four-byte alignment of consecutive Streams, the
size of Stream Padding must be a multiple of four bytes. Empty
Stream Padding is allowed.

Note that non-empty Stream Padding is allowed at the end of the
file; there doesn't need to be a new Stream after non-empty
Stream Padding. This can be convenient in certain situations.

The possibility of Padding should be taken into account when
designing an application that parses the Stream backwards.

    +==============+=================+=======+
    | Block Header | Compressed Data | Check |
    +==============+=================+=======+

    +-------------------+-------------+=================+
    | Block Header Size | Block Flags | Compressed Size |
    +-------------------+-------------+=================+

         +===================+======================+
    ---> | Uncompressed Size | List of Filter Flags |
         +===================+======================+

         +================+--+--+--+--+
    ---> | Header Padding |   CRC32   |
         +================+--+--+--+--+

3.1.1. Block Header Size

This field overlaps with the Index Indicator field (see
Section 4.1).

This field contains the size of the Block Header field,
including the Block Header Size field itself. Valid values are
in the range [0x01, 0xFF], which indicate the size of the Block
Header as multiples of four bytes, the minimum size being eight
bytes:

    real_header_size = (encoded_header_size + 1) * 4;

If a bigger Block Header is needed in the future, a new field
can be added between the current Block Header and Compressed
Data fields. The presence of this new field would be indicated
in the Block Flags field.

3.1.2. Block Flags

The first byte of the Block Flags field is a bit field:

    Bit(s)  Mask  Description
     0-1    0x03  Number of filters (1-4)
     2-5    0x3C  Reserved for future use; must be zero for now.
      6     0x40  The Compressed Size field is present.
      7     0x80  The Uncompressed Size field is present.

If any reserved bit is set, the decoder must indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Block Header
incorrectly.

3.1.3. Compressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

This field contains the size of the Compressed Data field as
a multiple of four bytes, the minimum value being four bytes:

    real_compressed_size = (stored_compressed_size + 1) * 4;

The size is stored using the encoding described in Section 1.2.
If the Compressed Size does not match the real size of the
Compressed Data field, the decoder must indicate an error.

3.1.4. Uncompressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Uncompressed Size field contains the size of the Block
after uncompressing. Uncompressed Size is stored using the
encoding described in Section 1.2. If the Uncompressed Size
does not match the real uncompressed size, the decoder must
indicate an error.

Storing the Compressed Size and Uncompressed Size fields serves
several purposes:
  - The decoder knows how much memory it needs to allocate
    for a temporary buffer in multithreaded mode.
  - Simple error detection: wrong size indicates a broken file.
  - Seeking forwards to a specific location in streamed mode.

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

3.1.5. List of Filter Flags

    +================+================+     +================+
    | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
    +================+================+     +================+

The number of Filter Flags fields is stored in the Block Flags
field (see Section 3.1.2).

The format of each Filter Flags field is as follows:

    +===========+====================+===================+
    | Filter ID | Size of Properties | Filter Properties |
    +===========+====================+===================+

Both Filter ID and Size of Properties are stored using the
encoding described in Section 1.2. Size of Properties indicates
the size of the Filter Properties field in bytes. The list of
officially defined Filter IDs and the formats of their Filter
Properties are described in Section 5.3.

Filter IDs greater than or equal to 0x4000_0000_0000_0000
(2^62) are reserved for implementation-specific internal use.
These Filter IDs must never be used in List of Filter Flags.

3.1.6. Header Padding

This field contains as many nul bytes as are needed to make
the Block Header have the size specified in Block Header Size.
If any of the bytes are not nul bytes, the decoder must
indicate an error. It is possible that there is a new field
present which the decoder is not aware of, and can thus parse
the Block Header incorrectly.

The CRC32 is calculated over everything in the Block Header
field except the CRC32 field itself. It is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder must indicate
an error.

Verifying the CRC32 of the Block Header before parsing the
actual contents allows the decoder to distinguish between
corrupt and unsupported files.

The format of Compressed Data depends on Block Flags and List
of Filter Flags. Excluding the descriptions of the simplest
filters in Section 5.3, the format of the filter-specific
encoded data is out of scope of this document.

If the natural size of Compressed Data is not a multiple of
four bytes, it must be padded with 1-3 nul bytes to make it
a multiple of four bytes.

3.3. Check

The type and size of the Check field depend on which bits
are set in the Stream Flags field (see Section 2.1.1.2).

The Check, when used, is calculated from the original
uncompressed data. If the calculated Check does not match the
stored one, the decoder must indicate an error. If the selected
type of Check is not supported by the decoder, it must indicate
a warning or error.

    +-----------------+=========================+
    | Index Indicator | Number of Index Records |
    +-----------------+=========================+

         +=================+=========+-+-+-+-+
    ---> | List of Records | Padding | CRC32 |
         +=================+=========+-+-+-+-+

Index serves several purposes. Using it, one can
  - verify that all Blocks in a Stream have been processed;
  - find out the uncompressed size of a Stream; and
  - quickly access the beginning of any Block (random access).

4.1. Index Indicator

This field overlaps with the Block Header Size field (see
Section 3.1.1). The value of Index Indicator is always 0x00.

4.2. Number of Records

This field indicates how many Records there are in the List
of Records field, and thus how many Blocks there are in the
Stream. The value is stored using the encoding described in
Section 1.2. If the decoder has decoded all the Blocks of the
Stream, and then notices that the Number of Records doesn't
match the real number of Blocks, the decoder must indicate an
error.

List of Records consists of as many Records as indicated by the
Number of Records field:

    +========+========+
    | Record | Record | ...
    +========+========+

Each Record contains two fields:

    +============+===================+
    | Total Size | Uncompressed Size |
    +============+===================+

If the decoder has decoded all the Blocks of the Stream, it
must verify that the contents of the Records match the real
Total Size and Uncompressed Size of the respective Blocks.

Implementation hint: It is possible to verify the Index with
constant memory usage by calculating for example SHA256 of both
the real size values and the List of Records, then comparing
the check values. Implementing this using a non-cryptographic
check like CRC32 should be avoided unless small code size is
important.

If the decoder supports random-access reading, it must verify
that Total Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
part of a Block is decoded, the decoder must verify that the
processed sizes don't exceed the sizes stored in the Index.

The Total Size field indicates the encoded size of the
respective Block as a multiple of four bytes, the minimum
value being four bytes:

    real_total_size = (stored_total_size + 1) * 4;

The value is stored using the encoding described in Section
1.2.

4.3.2. Uncompressed Size

This field indicates the Uncompressed Size of the respective
Block in bytes. The value is stored using the encoding
described in Section 1.2.

The Padding field must contain 0-3 nul bytes to pad the Index
to a multiple of four bytes.

The CRC32 is calculated over everything in the Index field
except the CRC32 field itself. The CRC32 is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder must indicate
an error.

The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
the output of one filter is the input of another filter. The
following figure illustrates the direction of data flow.

            v   Uncompressed Data   ^
            |       Filter 0        |
    Encoder |       Filter 1        | Decoder
            |         ...           |
            |       Filter n        |
            v    Compressed Data    ^

Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
best results, an archiver tool should make sure that all
PowerPC executable files in the archive stream start at
offsets that are multiples of four bytes.

Some filters, for example LZMA, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can also be beneficial when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA filter. Because not
only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA settings
so that the LZMA encoder can take advantage of its
four-byte-aligned input data.

The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
  - speed, if the filtered data is handled multiple bytes at
    a time by the filter-specific encoder and decoder,
    because accessing aligned data in computer memory is
    fast; and
  - compression ratio, if the output data is later compressed
    with an external compression tool.

If filters were allowed to be chained freely, it would be
possible to create malicious files that would be very slow to
decode. Such files could be used to create denial of service
attacks.

Slow files could occur when multiple filters are chained:

    v   Compressed input data
    |   Filter 1 decoder (last filter)
    |   Filter 0 decoder (non-last filter)
    v   Uncompressed output data

The decoder of the last filter in the chain produces a lot of
output from little input. Another filter in the chain takes the
output of the last filter, and produces very little output
while consuming a lot of input. As a result, a lot of data is
moved inside the filter chain, but the filter chain as a whole
gets very little work done.

To prevent such slow files, there are restrictions on how the
filters can be chained. These restrictions must be taken into
account when designing new filters.

The maximum number of filters in the chain has been limited to
four, thus there can be at most three non-last filters. Of
these three non-last filters, only two are allowed to change
the size of the data.

The non-last filters that change the size of the data must
have a limit on how much the decoder can compress the data: the
decoder should produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations must be avoided.

The above limitations guarantee that if the last filter in the
chain produces 4n bytes of output, the chain as a whole will
produce at least n bytes of output.

LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding.

    Size of Filter Properties:  5 bytes
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes
    Alignment of input data:    Adjustable to 1/2/4/8/16 byte(s)

At the time of writing, there is no other documentation about
how LZMA works than the source code in LZMA SDK. Once such
documentation gets written, it will probably be published as
a separate document, because including the documentation here
would lengthen this document considerably.

The format of the Filter Properties field is as follows:

    +-----------------+----+----+----+----+
    | LZMA Properties |  Dictionary Size  |
    +-----------------+----+----+----+----+

The LZMA Properties field contains three properties. An
abbreviation is given in parentheses, followed by the value
range of the property. The field consists of

    1) the number of literal context bits (lc, [0, 4]);
    2) the number of literal position bits (lp, [0, 4]); and
    3) the number of position bits (pb, [0, 4]).

In addition to the above ranges, the sum of lc and lp must not
exceed four. Note that this limit didn't exist in the old
LZMA_Alone format, which allowed lc to be in the range [0, 8].

The properties are encoded using the following formula:

    LZMA Properties = (pb * 5 + lp) * 9 + lc
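
As a worked example of the formula, lc = 3, lp = 0, and pb = 2
give (2 * 5 + 0) * 9 + 3 = 93 (0x5D). A minimal sketch of the
encoding direction follows; encode_lzma_properties is a
hypothetical name, and the range checks simply restate the
limits given above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack lc, lp, and pb into the LZMA
 * Properties byte. Returns -1 if the values are outside the
 * ranges allowed by this format.
 */
static int
encode_lzma_properties(unsigned lc, unsigned lp, unsigned pb)
{
    if (lc > 4 || lp > 4 || pb > 4 || lc + lp > 4)
        return -1;

    const unsigned prop = (pb * 5 + lp) * 9 + lc;

    /* (4 * 5 + 4) * 9 + 4 = 220 is a safe upper bound. */
    assert(prop <= UINT8_MAX);
    return (int)prop;
}
```

The result always fits in one byte, which is why the whole
LZMA Properties field is a single byte.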

The following C code illustrates a straightforward way to
decode the properties:

    uint8_t lc, lp, pb;
    uint8_t prop = get_lzma_properties();
    if (prop > (4 * 5 + 4) * 9 + 8)
        return LZMA_PROPERTIES_ERROR;

    pb = prop / (9 * 5);
    prop -= pb * 9 * 5;
    lp = prop / 9;
    lc = prop - lp * 9;

    if (lc + lp > 4)
        return LZMA_PROPERTIES_ERROR;

Dictionary Size is encoded as an unsigned 32-bit little endian
integer.

LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder and
for uncompressed chunks, eases stateful decoder
implementations, and improves support for multithreading. For
most uses, it is recommended to use LZMA2 instead of LZMA.

    Size of Filter Properties:  1 byte
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes
    Alignment of input data:    Adjustable to 1/2/4/8/16 byte(s)

The format of the one-byte Filter Properties field is as
follows:

    Bits   Mask   Description
    0-5    0x3F   Dictionary Size
    6-7    0xC0   Reserved for future use; must be zero for now.

Dictionary Size is encoded with one-bit mantissa and five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest
is 4 GiB - 1 B.

    Raw value   Mantissa   Exponent   Dictionary size
        0           2         11         4 KiB
        1           3         11         6 KiB
        2           2         12         8 KiB
        3           3         12        12 KiB
       ...
       38           2         30      2048 MiB
       39           3         30      3072 MiB
       40           2         31      4096 MiB - 1 B

Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:

    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        dictionary_size = UINT32_MAX;
    } else {
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }
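
The same calculation can be wrapped into a function, and an
encoder can search for the smallest raw value that is big
enough for a wanted dictionary size. The sketch below assumes
that rounding up to the next representable size is acceptable;
lzma2_dict_size and lzma2_dict_raw are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode the six-bit raw value into a
 * dictionary size in bytes. Returns 0 for raw values above 40.
 */
static uint32_t
lzma2_dict_size(uint8_t raw)
{
    if (raw > 40)
        return 0;
    if (raw == 40)
        return UINT32_MAX;

    /* Mantissa is 2 or 3; exponent is raw / 2 + 11. */
    return (UINT32_C(2) | (raw & 1)) << (raw / 2 + 11);
}

/*
 * Hypothetical helper: smallest raw value whose dictionary size
 * is at least `wanted' bytes.
 */
static uint8_t
lzma2_dict_raw(uint32_t wanted)
{
    uint8_t raw = 0;

    /* Terminates at 40 at the latest, which gives UINT32_MAX. */
    while (lzma2_dict_size(raw) < wanted)
        ++raw;

    assert(raw <= 40);
    return raw;
}
```

For example, a wanted size of 1 MiB maps to the raw value 16,
and anything up to 4 KiB maps to the raw value 0.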

5.3.3. Branch/Call/Jump Filters for Executables

These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
files. This conversion increases redundancy and thus
compression ratio.

    Size of Filter Properties:  0 or 4 bytes
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

    Detecting when all of the data has been decoded:
        Uncompressed size:      Yes
        End of Payload Marker:  No

Below is the list of filters in this category. The alignment
is the same for both input and output data.

    Filter ID   Alignment   Description
      0x04       1 byte     x86 filter (BCJ)
      0x05       4 bytes    PowerPC (big endian) filter
      0x06      16 bytes    IA64 filter
      0x07       4 bytes    ARM (little endian) filter
      0x08       2 bytes    ARM Thumb (little endian) filter
      0x09       4 bytes    SPARC filter

If the size of Filter Properties is four bytes, the Filter
Properties field contains the start offset used for address
conversions. It is stored as an unsigned 32-bit little endian
integer. If the size of Filter Properties is zero, the start
offset is zero.

Setting the start offset may be useful if an executable has
multiple sections, and there are many cross-section calls.
Taking advantage of this feature usually requires usage of

The Delta filter may increase compression ratio when the value
of the next byte correlates with the value of an earlier byte
at a specified distance.

    Size of Filter Properties:  1 byte
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No
    Alignment of output data:   Same as the original input data

The Properties byte indicates the delta distance, which can be
1-256 bytes backwards from the current byte: 0x00 indicates
a distance of 1 byte and 0xFF a distance of 256 bytes.

5.3.4.1. Format of the Encoded Output

The code below illustrates both encoding and decoding with
the Delta filter.

    // Distance is in the range [1, 256].
    const unsigned int distance = get_properties_byte() + 1;
    uint8_t pos = 0;
    uint8_t delta[256];

    memset(delta, 0, sizeof(delta));

    while (1) {
        const int byte = read_byte();
        if (byte == EOF)
            break;

        uint8_t tmp = delta[(uint8_t)(distance + pos)];
        if (is_encoder) {
            tmp = (uint8_t)(byte) - tmp;
            delta[pos] = (uint8_t)(byte);
        } else {
            tmp = (uint8_t)(byte) + tmp;
            delta[pos] = tmp;
        }

        write_byte(tmp);
        --pos;
    }
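
To see what the filter does to actual data, the same logic can
be applied to a buffer in memory. The following sketch is an
adaptation, not part of the format: delta_filter is a
hypothetical name, and the buffer-based interface replaces the
read_byte() and write_byte() calls of the code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: apply the Delta filter to buf in place.
 * distance must be in the range [1, 256].
 */
static void
delta_filter(uint8_t *buf, size_t size, unsigned distance,
        bool encode)
{
    assert(distance >= 1 && distance <= 256);

    uint8_t history[256];   /* the last 256 unfiltered bytes */
    uint8_t pos = 0;        /* wraps around modulo 256 */
    memset(history, 0, sizeof(history));

    for (size_t i = 0; i < size; ++i) {
        const uint8_t prev = history[(uint8_t)(distance + pos)];
        const uint8_t out = encode
                ? (uint8_t)(buf[i] - prev)
                : (uint8_t)(buf[i] + prev);

        /* The history always holds the unfiltered byte. */
        history[pos] = encode ? buf[i] : out;
        buf[i] = out;
        --pos;
    }
}
```

With a distance of 1, the input bytes 1, 2, 3, 4 are encoded as
1, 1, 1, 1, and decoding that output gives back the original
bytes.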

5.4. Custom Filter IDs

If a developer wants to use custom Filter IDs, he has two
choices. The first choice is to contact Lasse Collin and ask
him to allocate a range of IDs for the developer.

The second choice is to generate a 40-bit random integer,
which the developer can use as his personal Developer ID.
To minimize the risk of collisions, Developer ID has to be
a randomly generated integer, not a manually selected
"hex word". The following command, which works on many free
operating systems, can be used to generate Developer ID:

    dd if=/dev/urandom bs=5 count=1 | hexdump

The developer can then use his Developer ID to create unique
(well, hopefully unique) Filter IDs.

    Bits    Mask                    Description
     0-15   0x0000_0000_0000_FFFF   Filter ID
    16-55   0x00FF_FFFF_FFFF_0000   Developer ID
    56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F

The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this Section for how to
request a custom ID range.
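
The bit layout above could be assembled as in the sketch below.
The function name make_custom_filter_id and the Developer ID
used in the example are made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine a 40-bit Developer ID and a
 * 16-bit developer-specific Filter ID into a custom Filter ID.
 */
static uint64_t
make_custom_filter_id(uint64_t developer_id, uint16_t local_id)
{
    assert(developer_id < (UINT64_C(1) << 40));

    return (UINT64_C(0x3F) << 56)       /* static prefix */
            | (developer_id << 16)      /* bits 16-55 */
            | local_id;                 /* bits 0-15 */
}
```

For example, the made-up Developer ID 0x01_2345_6789 and the
local Filter ID 0x0001 give 0x3F01_2345_6789_0001, which stays
below the range reserved for internal use (2^62 and above).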

5.4.1. Reserved Custom Filter ID Ranges

    0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
    0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility

6. Cyclic Redundancy Checks

There are several incompatible variations to calculate CRC32
and CRC64. For simplicity and clarity, complete examples are
provided to calculate the checks as they are used in this file
format. Implementations may use different code as long as it
gives identical results.

The program below reads data from standard input, calculates
the CRC32 and CRC64 values, and prints the calculated values
as big endian hexadecimal strings to standard output.

    #include <sys/types.h>
    #include <inttypes.h>
    #include <stdio.h>

    uint32_t crc32_table[256];
    uint64_t crc64_table[256];

    void
    init(void)
    {
        static const uint32_t poly32 = UINT32_C(0xEDB88320);
        static const uint64_t poly64
                = UINT64_C(0xC96C5795D7870F42);

        for (size_t i = 0; i < 256; ++i) {
            uint32_t crc32 = i;
            uint64_t crc64 = i;

            for (size_t j = 0; j < 8; ++j) {
                if (crc32 & 1)
                    crc32 = (crc32 >> 1) ^ poly32;
                else
                    crc32 >>= 1;

                if (crc64 & 1)
                    crc64 = (crc64 >> 1) ^ poly64;
                else
                    crc64 >>= 1;
            }

            crc32_table[i] = crc32;
            crc64_table[i] = crc64;
        }
    }

    uint32_t
    crc32(const uint8_t *buf, size_t size, uint32_t crc)
    {
        crc = ~crc;
        for (size_t i = 0; i < size; ++i)
            crc = crc32_table[buf[i] ^ (crc & 0xFF)]
                    ^ (crc >> 8);
        return ~crc;
    }

    uint64_t
    crc64(const uint8_t *buf, size_t size, uint64_t crc)
    {
        crc = ~crc;
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)]
                    ^ (crc >> 8);
        return ~crc;
    }

    int
    main(void)
    {
        init();

        uint32_t value32 = 0;
        uint64_t value64 = 0;
        uint64_t total_size = 0;
        uint8_t buf[8192];

        while (1) {
            const size_t buf_size = fread(buf, 1, 8192, stdin);
            if (buf_size == 0)
                break;

            total_size += buf_size;
            value32 = crc32(buf, buf_size, value32);
            value64 = crc64(buf, buf_size, value64);
        }

        printf("Bytes:  %" PRIu64 "\n", total_size);
        printf("CRC-32: 0x%08" PRIX32 "\n", value32);
        printf("CRC-64: 0x%016" PRIX64 "\n", value64);

        return 0;
    }

LZMA SDK - The original LZMA implementation
http://7-zip.org/sdk.html

LZMA Utils - LZMA adapted to POSIX-like systems
http://tukaani.org/lzma/

[RFC-1952]
GZIP file format specification version 4.3
http://www.ietf.org/rfc/rfc1952.txt
  - Notation of byte boxes in section `2.1. Overall conventions'

[RFC-2119]
Key words for use in RFCs to Indicate Requirement Levels
http://www.ietf.org/rfc/rfc2119.txt

GNU tar 1.16.1 manual
http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
  - Node 9.4.2 `Blocking Factor', paragraph that begins
    `gzip will complain about trailing garbage'
  - Note that this URL points to the latest version of the
    manual, and may some day not contain the note which is in
    1.16.1. For the exact version of the manual, download GNU
    tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz