Version 1.0.0 (2009-01-14)

0.1. Notices and Acknowledgements
0.2. Getting the Latest Version
1.1. Byte and Its Representation
1.2. Multibyte Integers
2. Overall Structure of .xz File
2.1.1.1. Header Magic Bytes
2.1.2.2. Backward Size
2.1.2.4. Footer Magic Bytes
3.1.1. Block Header Size
3.1.3. Compressed Size
3.1.4. Uncompressed Size
3.1.5. List of Filter Flags
4.2. Number of Records
4.3.2. Uncompressed Size
5.3.2. Branch/Call/Jump Filters for Executables
5.3.3.1. Format of the Encoded Output
5.4. Custom Filter IDs
5.4.1. Reserved Custom Filter ID Ranges
6. Cyclic Redundancy Checks

This document describes the .xz file format (filename suffix
".xz", MIME type "application/x-xz"). It is intended that
this format replace the old .lzma format used by LZMA SDK and

0.1. Notices and Acknowledgements

This file format was designed by Lasse Collin
<lasse.collin@tukaani.org> and Igor Pavlov.

Special thanks for helping with this document go to
Ville Koskinen. Thanks for helping with this document go to
Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

This document has been put into the public domain.

0.2. Getting the Latest Version

The latest official version of this document can be downloaded
from <http://tukaani.org/xz/xz-file-format.txt>.

Specific versions of this document have a filename
xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
For example, the version 1.0.0 of this document is available
at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.

    Version   Date          Description
    1.0.0     2009-01-14    The first official version

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC-2119].

Indicating a warning means displaying a message, returning
appropriate exit status, or doing something else to let the
user know that something worth warning occurred. The operation
SHOULD still finish if a warning is indicated.

Indicating an error means displaying a message, returning
appropriate exit status, or doing something else to let the
user know that something prevented successfully finishing the
operation. The operation MUST be aborted once an error has

1.1. Byte and Its Representation

In this document, byte is always 8 bits.

A "null byte" has all bits unset. That is, the value of a null

To represent byte blocks, this document uses notation that
is similar to the notation used in [RFC-1952]:

    | Foo |     Two bytes; that is, some of the vertical bars
    +---+---+   can be missing.

    | Foo |     Zero or more bytes.

In this document, a boxed byte or a byte sequence declared
using this notation is called "a field". The example field
above would be called "the Foo field" or plain "Foo".

If there are many fields, they may be split to multiple lines.
This is indicated with an arrow ("--->"):

The above is equivalent to this:

1.2. Multibyte Integers

Multibyte integers of static length, such as CRC values,
are stored in little endian byte order (least significant

When smaller values are more likely than bigger values (for
example file sizes), multibyte integers are encoded in a
variable-length representation:
  - Numbers in the range [0, 127] are copied as is, and take
  - Bigger numbers will occupy two or more bytes. All but the
    last byte of the multibyte representation have the highest

For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in future if needed.

The following C code illustrates encoding and decoding of
variable-length integers. The functions return the number of
bytes occupied by the integer (1-9), or zero on error.

    #include <inttypes.h>

    encode(uint8_t buf[static 9], uint64_t num)
        if (num > UINT64_MAX / 2)
        while (num >= 0x80) {
            buf[i++] = (uint8_t)(num) | 0x80;
        buf[i++] = (uint8_t)(num);

    decode(const uint8_t buf[], size_t size_max, uint64_t *num)
        *num = buf[0] & 0x7F;
        while (buf[i++] & 0x80) {
            if (i >= size_max || buf[i] == 0x00)
            *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
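
The statements shown above can be assembled into a complete,
compilable form. The following is a sketch, not normative text:
the statements that are not visible above (the early returns, the
shift by seven bits, and the clamping of size_max to nine) are
assumptions chosen to match the encoding described in this
Section.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode num into buf. Returns the number of bytes written (1-9),
   or 0 if num does not fit into 63 bits. */
size_t
encode(uint8_t buf[static 9], uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;

    size_t i = 0;

    while (num >= 0x80) {
        /* Low seven bits with the highest bit set: more bytes follow. */
        buf[i++] = (uint8_t)(num) | 0x80;
        num >>= 7;
    }

    /* The last byte has the highest bit clear. */
    buf[i++] = (uint8_t)(num);

    return i;
}

/* Decode an integer from at most size_max bytes of buf. Returns the
   number of bytes consumed (1-9), or 0 on error. */
size_t
decode(const uint8_t buf[], size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;

    if (size_max > 9)
        size_max = 9;

    *num = buf[0] & 0x7F;
    size_t i = 0;

    while (buf[i++] & 0x80) {
        /* Reject truncated input and a non-minimal encoding whose
           continuation byte is 0x00. */
        if (i >= size_max || buf[i] == 0x00)
            return 0;

        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }

    return i;
}
```

For example, the value 300 (0x12C) is encoded as the two bytes
0xAC 0x02: the low seven bits 0x2C with the highest bit set,
followed by the remaining bits 0x02.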

2. Overall Structure of .xz File

A standalone .xz file consists of one or more Streams which may
have Stream Padding between or after them:

    +========+================+========+================+
    | Stream | Stream Padding | Stream | Stream Padding | ...
    +========+================+========+================+

While a typical file contains only one Stream and no Stream
Padding, a decoder handling standalone .xz files SHOULD support
files that have more than one Stream or Stream Padding.

In contrast to standalone .xz files, when the .xz file format
is used as an internal part of some other file format or
communication protocol, it usually is expected that the decoder
stops after the first Stream, and doesn't look for Stream
Padding or possibly other Streams.

    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
    |     Stream Header     | Block | Block | ... | Block |
    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

         +=======+-+-+-+-+-+-+-+-+-+-+-+-+
    ---> | Index |     Stream Footer     |
         +=======+-+-+-+-+-+-+-+-+-+-+-+-+

All the above fields have a size that is a multiple of four. If
Stream is used as an internal part of another file format, it
is RECOMMENDED to make the Stream start at an offset that is
a multiple of four bytes.

Stream Header, Index, and Stream Footer are always present in
a Stream. The maximum size of the Index field is 16 GiB (2^34).

There are zero or more Blocks. The maximum number of Blocks is
limited only by the maximum size of the Index field.

Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
The same limit applies to the total amount of uncompressed
data stored in a Stream.

If an implementation supports handling .xz files with multiple
concatenated Streams, it MAY apply the above limits to the file
as a whole instead of applying them on a per-Stream basis.

    +---+---+---+---+---+---+-------+------+--+--+--+--+
    |  Header Magic Bytes   | Stream Flags |   CRC32   |
    +---+---+---+---+---+---+-------+------+--+--+--+--+

2.1.1.1. Header Magic Bytes

The first six (6) bytes of the Stream are so-called Header
Magic Bytes. They can be used to identify the file type.

Using a C array and ASCII:
    const uint8_t HEADER_MAGIC[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

In plain hexadecimal:

  - The first byte (0xFD) was chosen so that the files cannot
    be erroneously detected as being in .lzma format, in which
    the first byte is in the range [0x00, 0xE0].
  - The sixth byte (0x00) was chosen to prevent applications
    from misdetecting the file as a text file.

If the Header Magic Bytes don't match, the decoder MUST

2.1.1.2. Stream Flags

The first byte of Stream Flags is always a null byte. In future
this byte may be used to indicate new Stream version or other

The second byte of Stream Flags is a bit field:

    Bit(s)  Mask  Description
     0-3    0x0F  Type of Check (see Section 3.4):
                      0x02     4 bytes    (Reserved)
                      0x03     4 bytes    (Reserved)
                      0x05     8 bytes    (Reserved)
                      0x06     8 bytes    (Reserved)
                      0x07    16 bytes    (Reserved)
                      0x08    16 bytes    (Reserved)
                      0x09    16 bytes    (Reserved)
                      0x0A    32 bytes    SHA-256
                      0x0B    32 bytes    (Reserved)
                      0x0C    32 bytes    (Reserved)
                      0x0D    64 bytes    (Reserved)
                      0x0E    64 bytes    (Reserved)
                      0x0F    64 bytes    (Reserved)
     4-7    0xF0  Reserved for future use; MUST be zero for now.

Implementations SHOULD support at least the Check IDs 0x00
(None) and 0x01 (CRC32). Supporting other Check IDs is
OPTIONAL. If an unsupported Check is used, the decoder SHOULD
indicate a warning or error.
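
Because the size of the Check field follows a regular pattern in
the table above (every group of three Check IDs doubles the
size), a decoder can compute the size even for Check IDs it does
not support, and can therefore skip the Check field. The sketch
below illustrates this; the function name check_size is
hypothetical, and the sizes for the IDs 0x00 (taken to be zero
bytes, since it means no Check), 0x01, and 0x04 are assumed to
follow the same pattern as the rows shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Check field for a Check ID in the range
   [0x00, 0x0F]. check_size is a hypothetical helper name. */
size_t
check_size(uint8_t check_id)
{
    if (check_id == 0x00)
        return 0;   /* None: assumed to mean no Check field */

    /* IDs 1-3 give 4 bytes, 4-6 give 8, 7-9 give 16,
       10-12 give 32, and 13-15 give 64. */
    return (size_t)4 << ((check_id - 1) / 3);
}
```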

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Stream Header

The CRC32 is calculated from the Stream Flags field. It is
stored as an unsigned 32-bit little endian integer. If the
calculated value does not match the stored one, the decoder
MUST indicate an error.

The idea is that Stream Flags would always be two bytes, even
if new features are needed. This way old decoders will be able
to verify the CRC32 calculated from Stream Flags, and thus
distinguish between corrupt files (CRC32 doesn't match) and
files that the decoder doesn't support (CRC32 matches but
Stream Flags has reserved bits set).
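
The order of checks described above (magic bytes, then CRC32,
then reserved bits) can be sketched as follows. This is an
illustration only: the names parse_stream_header, crc32_bitwise,
and the enum values are hypothetical, and the bit-at-a-time CRC32
is a compact stand-in for the table-based code in Section 6,
assumed to give identical results.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit-at-a-time CRC32 using the polynomial 0xEDB88320. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = UINT32_C(0xFFFFFFFF);

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }

    return ~crc;
}

enum header_status {
    HEADER_OK,
    HEADER_BAD_MAGIC,    /* not an .xz Stream */
    HEADER_CORRUPT,      /* CRC32 doesn't match */
    HEADER_UNSUPPORTED   /* CRC32 matches, reserved bits set */
};

/* Validate a 12-byte Stream Header. On success, the Check ID is
   stored to *check_id. */
enum header_status
parse_stream_header(const uint8_t in[static 12], uint8_t *check_id)
{
    static const uint8_t magic[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

    if (memcmp(in, magic, sizeof(magic)) != 0)
        return HEADER_BAD_MAGIC;

    /* The CRC32 covers only the two Stream Flags bytes and is
       stored in little endian byte order. */
    const uint32_t stored = (uint32_t)in[8]
            | (uint32_t)in[9] << 8
            | (uint32_t)in[10] << 16
            | (uint32_t)in[11] << 24;

    if (crc32_bitwise(in + 6, 2) != stored)
        return HEADER_CORRUPT;

    /* The first byte must be a null byte, and bits 4-7 of the
       second byte are reserved. */
    if (in[6] != 0x00 || (in[7] & 0xF0) != 0)
        return HEADER_UNSUPPORTED;

    *check_id = in[7] & 0x0F;
    return HEADER_OK;
}
```

Verifying the CRC32 before looking at the reserved bits is what
lets the caller tell a corrupt header apart from a header written
by a newer encoder.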

    +-+-+-+-+---+---+---+---+-------+------+----------+---------+
    | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
    +-+-+-+-+---+---+---+---+-------+------+----------+---------+

The CRC32 is calculated from the Backward Size and Stream Flags
fields. It is stored as an unsigned 32-bit little endian
integer. If the calculated value does not match the stored one,
the decoder MUST indicate an error.

The reason to have the CRC32 field before the Backward Size and
Stream Flags fields is to keep the four-byte fields aligned to
a multiple of four bytes.

2.1.2.2. Backward Size

Backward Size is stored as a 32-bit little endian integer,
which indicates the size of the Index field as multiple of
four bytes, minimum value being four bytes:

    real_backward_size = (stored_backward_size + 1) * 4;

If the stored value does not match the real size of the Index
field, the decoder MUST indicate an error.

Using a fixed-size integer to store Backward Size makes
it slightly simpler to parse the Stream Footer when the
application needs to parse the Stream backwards.

2.1.2.3. Stream Flags

This is a copy of the Stream Flags field from the Stream
Header. The information stored to Stream Flags is needed
when parsing the Stream backwards. The decoder MUST compare
the Stream Flags fields in both Stream Header and Stream
Footer, and indicate an error if they are not identical.

2.1.2.4. Footer Magic Bytes

As the last step of the decoding process, the decoder MUST
verify the existence of Footer Magic Bytes. If they don't
match, an error MUST be indicated.

Using a C array and ASCII:
    const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

The primary reason to have Footer Magic Bytes is to make
it easier to detect incomplete files quickly, without
uncompressing. If the file does not end with Footer Magic Bytes
(excluding Stream Padding described in Section 2.2), it cannot
be undamaged, unless someone has intentionally appended garbage
after the end of the Stream.
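
When parsing a Stream backwards, the Stream Footer is the first
thing read, and Backward Size tells where the Index begins. The
following sketch shows only that step, assuming the 12-byte
layout drawn above. The function name index_size_from_footer is
hypothetical, and the CRC32 and Stream Flags checks that a real
decoder MUST also perform are left out for brevity.

```c
#include <stdint.h>

/* Returns the size of the Index field in bytes, or 0 if the
   Footer Magic Bytes don't match. footer points to the 12-byte
   Stream Footer. CRC32 verification is omitted in this sketch. */
uint64_t
index_size_from_footer(const uint8_t footer[static 12])
{
    if (footer[10] != 'Y' || footer[11] != 'Z')
        return 0;

    /* Backward Size occupies bytes 4-7, little endian. */
    const uint32_t stored = (uint32_t)footer[4]
            | (uint32_t)footer[5] << 8
            | (uint32_t)footer[6] << 16
            | (uint32_t)footer[7] << 24;

    /* real_backward_size = (stored_backward_size + 1) * 4 */
    return ((uint64_t)stored + 1) * 4;
}
```

The Index then starts index_size_from_footer(footer) bytes before
the first byte of the Stream Footer. The largest stored value,
0xFFFFFFFF, gives 2^34 bytes, which matches the 16 GiB limit of
the Index field.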

Only the decoders that support decoding of concatenated Streams
MUST support Stream Padding.

Stream Padding MUST contain only null bytes. To preserve the
four-byte alignment of consecutive Streams, the size of Stream
Padding MUST be a multiple of four bytes. Empty Stream Padding

Note that non-empty Stream Padding is allowed at the end of the
file; there doesn't need to be a new Stream after non-empty
Stream Padding. This can be convenient in certain situations

The possibility of Stream Padding MUST be taken into account
when designing an application that parses Streams backwards
and supports concatenated Streams.
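
For a backward-parsing application, one way to handle this is to
skip trailing null bytes before looking for the Stream Footer.
Because the Stream Footer ends with the non-null Footer Magic
Bytes, every trailing null byte belongs to Stream Padding. The
sketch below assumes the end of the file is available in memory;
the function name trailing_stream_padding is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count the Stream Padding at the end of buf. Returns false if
   the number of trailing null bytes is not a multiple of four;
   otherwise stores the padding size to *padding. */
bool
trailing_stream_padding(const uint8_t *buf, size_t size,
        size_t *padding)
{
    size_t n = 0;

    while (n < size && buf[size - 1 - n] == 0x00)
        ++n;

    if (n % 4 != 0)
        return false;

    *padding = n;
    return true;
}
```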

    +==============+=================+===============+=======+
    | Block Header | Compressed Data | Block Padding | Check |
    +==============+=================+===============+=======+

    +-------------------+-------------+=================+
    | Block Header Size | Block Flags | Compressed Size |
    +-------------------+-------------+=================+

         +===================+======================+
    ---> | Uncompressed Size | List of Filter Flags |
         +===================+======================+

         +================+--+--+--+--+
    ---> | Header Padding |   CRC32   |
         +================+--+--+--+--+

3.1.1. Block Header Size

This field overlaps with the Index Indicator field (see

This field contains the size of the Block Header field,
including the Block Header Size field itself. Valid values are
in the range [0x01, 0xFF], which indicate the size of the Block
Header as multiples of four bytes, minimum size being eight

    real_header_size = (encoded_header_size + 1) * 4;

If bigger Block Header is needed in future, a new field can be
added between the current Block Header and Compressed Data
fields. The presence of this new field would be indicated in

The first byte of the Block Flags field is a bit field:

    Bit(s)  Mask  Description
     0-1    0x03  Number of filters (1-4)
     2-5    0x3C  Reserved for future use; MUST be zero for now.
      6     0x40  The Compressed Size field is present.
      7     0x80  The Uncompressed Size field is present.
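
The first two bytes of a Block Header can be interpreted as in
the sketch below. The struct and function names are hypothetical.
The sketch assumes that the two-bit filter count stores the
number of filters minus one, which is how the values 0-3 can
represent the range 1-4 given in the table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct block_flags {
    unsigned int filter_count;      /* 1-4 */
    bool has_compressed_size;
    bool has_uncompressed_size;
};

/* Size of the Block Header in bytes, or 0 if the byte is 0x00,
   which is the Index Indicator and not a Block Header Size. */
size_t
block_header_size(uint8_t encoded)
{
    return encoded == 0x00 ? 0 : ((size_t)encoded + 1) * 4;
}

/* Returns false if any reserved bit is set. */
bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    if (flags & 0x3C)
        return false;

    /* Assumption: the stored value is the filter count minus one. */
    out->filter_count = (flags & 0x03) + 1u;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```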

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Block Header

3.1.3. Compressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Compressed Size field contains the size of the Compressed
Data field, which MUST be non-zero. Compressed Size is stored
using the encoding described in Section 1.2. If the Compressed
Size doesn't match the size of the Compressed Data field, the
decoder MUST indicate an error.

3.1.4. Uncompressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Uncompressed Size field contains the size of the Block
after uncompressing. Uncompressed Size is stored using the
encoding described in Section 1.2. If the Uncompressed Size
does not match the real uncompressed size, the decoder MUST

Storing the Compressed Size and Uncompressed Size fields serves
  - The decoder knows how much memory it needs to allocate
    for a temporary buffer in multithreaded mode.
  - Simple error detection: wrong size indicates a broken file.
  - Seeking forwards to a specific location in streamed mode.

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

3.1.5. List of Filter Flags

    +================+================+     +================+
    | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
    +================+================+     +================+

The number of Filter Flags fields is stored in the Block Flags
field (see Section 3.1.2).

The format of each Filter Flags field is as follows:

    +===========+====================+===================+
    | Filter ID | Size of Properties | Filter Properties |
    +===========+====================+===================+

Both Filter ID and Size of Properties are stored using the
encoding described in Section 1.2. Size of Properties indicates
the size of the Filter Properties field as bytes. The list of
officially defined Filter IDs and the formats of their Filter
Properties are described in Section 5.3.

Filter IDs greater than or equal to 0x4000_0000_0000_0000
(2^62) are reserved for implementation-specific internal use.
These Filter IDs MUST never be used in List of Filter Flags.

3.1.6. Header Padding

This field contains as many null bytes as are needed to make
the Block Header have the size specified in Block Header Size.
If any of the bytes are not null bytes, the decoder MUST
indicate an error. It is possible that there is a new field
present which the decoder is not aware of, and can thus parse
the Block Header incorrectly.

The CRC32 is calculated over everything in the Block Header
field except the CRC32 field itself. It is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate

Verifying the CRC32 of the Block Header before parsing the
actual contents allows the decoder to distinguish between
corrupt and unsupported files.

The format of Compressed Data depends on Block Flags and List
of Filter Flags. Excluding the descriptions of the simplest
filters in Section 5.3, the format of the filter-specific
encoded data is out of scope of this document.

Block Padding MUST contain 0-3 null bytes to make the size of
the Block a multiple of four bytes. This can be needed when
the size of Compressed Data is not a multiple of four.

The type and size of the Check field depends on which bits
are set in the Stream Flags field (see Section 2.1.1.2).

The Check, when used, is calculated from the original
uncompressed data. If the calculated Check does not match the
stored one, the decoder MUST indicate an error. If the selected
type of Check is not supported by the decoder, it SHOULD
indicate a warning or error.

    +-----------------+===================+
    | Index Indicator | Number of Records |
    +-----------------+===================+

         +=================+===============+-+-+-+-+
    ---> | List of Records | Index Padding | CRC32 |
         +=================+===============+-+-+-+-+

Index serves several purposes. Using it, one can
  - verify that all Blocks in a Stream have been processed;
  - find out the uncompressed size of a Stream; and
  - quickly access the beginning of any Block (random access).

This field overlaps with the Block Header Size field (see
Section 3.1.1). The value of Index Indicator is always 0x00.

4.2. Number of Records

This field indicates how many Records there are in the List
of Records field, and thus how many Blocks there are in the
Stream. The value is stored using the encoding described in
Section 1.2. If the decoder has decoded all the Blocks of the
Stream, and then notices that the Number of Records doesn't
match the real number of Blocks, the decoder MUST indicate an

List of Records consists of as many Records as indicated by the
Number of Records field:

    | Record | Record | ...

Each Record contains information about one Block:

    +===============+===================+
    | Unpadded Size | Uncompressed Size |
    +===============+===================+

If the decoder has decoded all the Blocks of the Stream, it
MUST verify that the contents of the Records match the real
Unpadded Size and Uncompressed Size of the respective Blocks.

Implementation hint: It is possible to verify the Index with
constant memory usage by calculating for example SHA-256 of
both the real size values and the List of Records, then
comparing the hash values. Implementing this using a
non-cryptographic hash like CRC32 SHOULD be avoided unless
small code size is important.

If the decoder supports random-access reading, it MUST verify
that Unpadded Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
a part of a Block is decoded, the decoder MUST verify that the
processed sizes don't exceed the sizes stored in the Index.

This field indicates the size of the Block excluding the Block
Padding field. That is, Unpadded Size is the size of the Block
Header, Compressed Data, and Check fields. Unpadded Size is
stored using the encoding described in Section 1.2. The value
MUST never be zero; with the current structure of Blocks, the
actual minimum value for Unpadded Size is five.

Implementation note: Because the size of the Block Padding
field is not included in Unpadded Size, calculating the total
size of a Stream or doing random-access reading requires
calculating the actual size of the Blocks by rounding Unpadded
Sizes up to the next multiple of four.
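
The rounding in the implementation note can be sketched as
follows. The function names are hypothetical, and the array of
Unpadded Sizes is assumed to have been decoded from the List of
Records already. The constant 12 is the size of the Stream
Header (six Header Magic Bytes, two bytes of Stream Flags, and
four bytes of CRC32).

```c
#include <stddef.h>
#include <stdint.h>

/* Round an Unpadded Size up to the next multiple of four to get
   the real size of the Block including Block Padding. */
uint64_t
padded_block_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Offset of the Block with the given index, relative to the
   beginning of the Stream. */
uint64_t
block_offset(const uint64_t *unpadded_sizes, size_t index)
{
    uint64_t offset = 12;   /* Stream Header */

    for (size_t i = 0; i < index; ++i)
        offset += padded_block_size(unpadded_sizes[i]);

    return offset;
}
```

For example, with Unpadded Sizes of 21, 40, and 7 bytes, the
Blocks begin at offsets 12, 36, and 76.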

The reason to exclude Block Padding from Unpadded Size is to
ease making a raw copy of Compressed Data without Block
Padding. This can be useful, for example, if someone wants
to convert Streams to some other file format quickly.

4.3.2. Uncompressed Size

This field indicates the Uncompressed Size of the respective
Block as bytes. The value is stored using the encoding
described in Section 1.2.

This field MUST contain 0-3 null bytes to pad the Index to
a multiple of four bytes.

The CRC32 is calculated over everything in the Index field
except the CRC32 field itself. The CRC32 is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate

The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
the output of one filter is the input of another filter. The
following figure illustrates the direction of data flow.

    v Uncompressed Data ^
    Encoder | Filter 1 | Decoder

Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
best results, an archiver tool should make sure that all
PowerPC executable files in the archive stream start at
offsets that are multiples of four bytes.

Some filters, for example LZMA2, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can be beneficial also when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA2 filter. Because
not only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA2 settings
so that the LZMA2 encoder can take advantage of its
four-byte-aligned input data.

The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
  - speed, if the filtered data is handled multiple bytes at
    a time by the filter-specific encoder and decoder,
    because accessing aligned data in computer memory is
  - compression ratio, if the output data is later compressed
    with an external compression tool.

If filters could be chained freely, it would be possible to
create malicious files that would be very slow to decode.
Such files could be used to create denial of service

Slow files could occur when multiple filters are chained:

    v Compressed input data
    | Filter 1 decoder (last filter)
    | Filter 0 decoder (non-last filter)
    v Uncompressed output data

The decoder of the last filter in the chain produces a lot of
output from little input. Another filter in the chain takes the
output of the last filter, and produces very little output
while consuming a lot of input. As a result, a lot of data is
moved inside the filter chain, but the filter chain as a whole
gets very little work done.

To prevent this kind of slow file, there are restrictions on
how the filters can be chained. These restrictions MUST be
taken into account when designing new filters.

The maximum number of filters in the chain has been limited to
four, thus there can be at most three non-last filters.
Of these three non-last filters, only two are allowed to change
the size of the data.

The non-last filters that change the size of the data MUST
have a limit on how much the decoder can compress the data: the
decoder SHOULD produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations MUST be avoided.

The above limitations guarantee that if the last filter in the
chain produces 4n bytes of output, the chain as a whole will
produce at least n bytes of output.

LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding

LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder,
uncompressed chunks, eases stateful decoder implementations,
and improves support for multithreading. Thus, the plain LZMA
will not be supported in this file format.

    Size of Filter Properties:  1 byte
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes
      Input data:               Adjustable to 1/2/4/8/16 byte(s)

The format of the one-byte Filter Properties field is as

    Bits   Mask   Description
    0-5    0x3F   Dictionary Size
    6-7    0xC0   Reserved for future use; MUST be zero for now.

Dictionary Size is encoded with one-bit mantissa and five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest

    Raw value   Mantissa   Exponent   Dictionary size
       40           2         31      4096 MiB - 1 B

Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:

    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        dictionary_size = UINT32_MAX;
    } else {
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }

5.3.2. Branch/Call/Jump Filters for Executables

These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
files. This conversion increases redundancy and thus

    Size of Filter Properties:  0 or 4 bytes
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

Below is the list of filters in this category. The alignment
is the same for both input and output data.

    Filter ID   Alignment   Description
      0x04       1 byte     x86 filter (BCJ)
      0x05       4 bytes    PowerPC (big endian) filter
      0x06      16 bytes    IA64 filter
      0x07       4 bytes    ARM (little endian) filter
      0x08       2 bytes    ARM Thumb (little endian) filter
      0x09       4 bytes    SPARC filter

If the size of Filter Properties is four bytes, the Filter
Properties field contains the start offset used for address
conversions. It is stored as an unsigned 32-bit little endian
integer. If the size of Filter Properties is zero, the start

Setting the start offset may be useful if an executable has
multiple sections, and there are many cross-section calls.
Taking advantage of this feature usually requires usage of
the Subblock filter, whose design is not complete yet.

The Delta filter may increase compression ratio when the value
of the next byte correlates with the value of an earlier byte
at specified distance.

    Size of Filter Properties:  1 byte
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No
      Output data:              Same as the original input data

The Properties byte indicates the delta distance, which can be
1-256 bytes backwards from the current byte: 0x00 indicates
distance of 1 byte and 0xFF distance of 256 bytes.

5.3.3.1. Format of the Encoded Output

The code below illustrates both encoding and decoding with

    // Distance is in the range [1, 256].
    const unsigned int distance = get_properties_byte() + 1;
    memset(delta, 0, sizeof(delta));
        const int byte = read_byte();
        uint8_t tmp = delta[(uint8_t)(distance + pos)];
            tmp = (uint8_t)(byte) - tmp;
            delta[pos] = (uint8_t)(byte);
            tmp = (uint8_t)(byte) + tmp;
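
The statements above can be put into a self-contained function.
The following is a sketch under stated assumptions: it filters a
whole buffer in place instead of using read_byte() and
write_byte(), the function name delta_filter and its parameters
are hypothetical, and the parts that are not visible above (the
history buffer of 256 bytes, the decrementing position, and the
update of the history in the decoder branch) are filled in so
that decoding reverses encoding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Apply the Delta filter to data in place. props is the Filter
   Properties byte; the whole input is assumed to be passed in a
   single call, so the history starts as all null bytes. */
void
delta_filter(uint8_t *data, size_t size, uint8_t props,
        bool is_encoder)
{
    /* Distance is in the range [1, 256]. */
    const unsigned int distance = (unsigned int)props + 1;
    uint8_t pos = 0;
    uint8_t delta[256];   /* the last 256 unfiltered bytes */

    memset(delta, 0, sizeof(delta));

    for (size_t i = 0; i < size; ++i) {
        const uint8_t byte = data[i];

        /* pos decreases on every byte, so the byte stored
           "distance" steps ago is at index distance + pos,
           taken modulo 256. */
        uint8_t tmp = delta[(uint8_t)(distance + pos)];

        if (is_encoder) {
            tmp = (uint8_t)(byte - tmp);
            delta[pos] = byte;
        } else {
            tmp = (uint8_t)(byte + tmp);
            delta[pos] = tmp;
        }

        data[i] = tmp;
        --pos;
    }
}
```

With a distance of one byte (props is 0x00), the input
1, 2, 3, 4, 10 is encoded as 1, 1, 1, 1, 6, and decoding that
output restores the original bytes.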

5.4. Custom Filter IDs

If a developer wants to use custom Filter IDs, he has two
choices. The first choice is to contact Lasse Collin and ask
him to allocate a range of IDs for the developer.

The second choice is to generate a 40-bit random integer,
which the developer can use as his personal Developer ID.
To minimize the risk of collisions, Developer ID has to be
a randomly generated integer, not manually selected "hex word".
The following command, which works on many free operating
systems, can be used to generate Developer ID:

    dd if=/dev/urandom bs=5 count=1 | hexdump

The developer can then use his Developer ID to create unique
(well, hopefully unique) Filter IDs.

    Bits    Mask                     Description
     0-15   0x0000_0000_0000_FFFF    Filter ID
    16-55   0x00FF_FFFF_FFFF_0000    Developer ID
    56-62   0x3F00_0000_0000_0000    Static prefix: 0x3F

The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this Section how to
request a custom ID range.
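
Combining the three bit fields of the table above is a matter of
shifting and masking, as the sketch below shows. The function
name custom_filter_id is hypothetical, and the Developer ID used
in the example is made up for illustration; it is not an
allocated ID.

```c
#include <stdint.h>

/* Build a custom Filter ID from a 40-bit Developer ID and a
   16-bit developer-specific Filter ID. */
uint64_t
custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    return (UINT64_C(0x3F) << 56)                           /* prefix */
            | ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
            | filter_id;
}
```

For example, custom_filter_id(0x0123456789, 0x0001) gives
0x3F01_2345_6789_0001. Because the static prefix is 0x3F, the
result is always below 2^62, so it never collides with the IDs
reserved for implementation-specific internal use.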

5.4.1. Reserved Custom Filter ID Ranges

    0x0000_0300 - 0x0000_04FF    Reserved to ease .7z compatibility
    0x0002_0000 - 0x0007_FFFF    Reserved to ease .7z compatibility
    0x0200_0000 - 0x07FF_FFFF    Reserved to ease .7z compatibility

6. Cyclic Redundancy Checks

There are several incompatible variations to calculate CRC32
and CRC64. For simplicity and clarity, complete examples are
provided to calculate the checks as they are used in this file
format. Implementations MAY use different code as long as it
gives identical results.

The program below reads data from standard input, calculates
the CRC32 and CRC64 values, and prints the calculated values
as big endian hexadecimal strings to standard output.

    #include <inttypes.h>

    uint32_t crc32_table[256];
    uint64_t crc64_table[256];

        static const uint32_t poly32 = UINT32_C(0xEDB88320);
        static const uint64_t poly64
                = UINT64_C(0xC96C5795D7870F42);

        for (size_t i = 0; i < 256; ++i) {
            for (size_t j = 0; j < 8; ++j) {
                    crc32 = (crc32 >> 1) ^ poly32;
                    crc64 = (crc64 >> 1) ^ poly64;
            crc32_table[i] = crc32;
            crc64_table[i] = crc64;

    crc32(const uint8_t *buf, size_t size, uint32_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc32_table[buf[i] ^ (crc & 0xFF)]

    crc64(const uint8_t *buf, size_t size, uint64_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)]

        uint32_t value32 = 0;
        uint64_t value64 = 0;
        uint64_t total_size = 0;

            const size_t buf_size
                    = fread(buf, 1, sizeof(buf), stdin);
            total_size += buf_size;
            value32 = crc32(buf, buf_size, value32);
            value64 = crc64(buf, buf_size, value64);

        printf("Bytes:  %" PRIu64 "\n", total_size);
        printf("CRC-32: 0x%08" PRIX32 "\n", value32);
        printf("CRC-64: 0x%016" PRIX64 "\n", value64);
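
The table setup and the two update functions can be written out
in full as in the sketch below. The polynomials, table names, and
function signatures are the ones shown above; the function name
crc_init, the conditional form of the inner loop, and the
inversion of the CRC value on entry and exit are assumptions,
chosen so that the results match the commonly published check
values for these CRC variants.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];
static uint64_t crc64_table[256];

/* Fill the lookup tables. Must be called once before crc32()
   or crc64(). crc_init is a hypothetical name. */
void
crc_init(void)
{
    static const uint32_t poly32 = UINT32_C(0xEDB88320);
    static const uint64_t poly64
            = UINT64_C(0xC96C5795D7870F42);

    for (size_t i = 0; i < 256; ++i) {
        uint32_t c32 = (uint32_t)i;
        uint64_t c64 = i;

        for (size_t j = 0; j < 8; ++j) {
            c32 = (c32 & 1) ? (c32 >> 1) ^ poly32 : c32 >> 1;
            c64 = (c64 & 1) ? (c64 >> 1) ^ poly64 : c64 >> 1;
        }

        crc32_table[i] = c32;
        crc64_table[i] = c64;
    }
}

/* Update a running CRC32. Pass 0 as crc for the first call and
   the previous return value for the following calls. */
uint32_t
crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i)
        crc = crc32_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);

    return ~crc;
}

/* Update a running CRC64; used the same way as crc32(). */
uint64_t
crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i)
        crc = crc64_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);

    return ~crc;
}
```

As a quick self-test, the nine ASCII bytes "123456789" are
expected to give the CRC32 value 0xCBF43926 and the CRC64 value
0x995DC9BBDF1939FA.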

LZMA SDK - The original LZMA implementation
http://7-zip.org/sdk.html

LZMA Utils - LZMA adapted to POSIX-like systems
http://tukaani.org/lzma/

XZ Utils - The next generation of LZMA Utils
http://tukaani.org/xz/

GZIP file format specification version 4.3
http://www.ietf.org/rfc/rfc1952.txt
  - Notation of byte boxes in section "2.1. Overall conventions"

Key words for use in RFCs to Indicate Requirement Levels
http://www.ietf.org/rfc/rfc2119.txt

http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
  - Node 9.4.2 "Blocking Factor", paragraph that begins
    "gzip will complain about trailing garbage"
  - Note that this URL points to the latest version of the
    manual, and may some day not contain the note which is in
    1.21. For the exact version of the manual, download GNU
    tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz