Version 1.0.1 (2009-06-01)

0.1. Notices and Acknowledgements
0.2. Getting the Latest Version
1.1. Byte and Its Representation
1.2. Multibyte Integers
2. Overall Structure of .xz File
2.1.1.1. Header Magic Bytes
2.1.2.2. Backward Size
2.1.2.4. Footer Magic Bytes
3.1.1. Block Header Size
3.1.3. Compressed Size
3.1.4. Uncompressed Size
3.1.5. List of Filter Flags
4.2. Number of Records
4.3.2. Uncompressed Size
5.3.2. Branch/Call/Jump Filters for Executables
5.3.3.1. Format of the Encoded Output
5.4. Custom Filter IDs
5.4.1. Reserved Custom Filter ID Ranges
6. Cyclic Redundancy Checks

This document describes the .xz file format (filename suffix
".xz", MIME type "application/x-xz"). It is intended that this
format replace the old .lzma format used by LZMA SDK and
LZMA Utils.

0.1. Notices and Acknowledgements

This file format was designed by Lasse Collin
<lasse.collin@tukaani.org> and Igor Pavlov.

Special thanks for helping with this document go to
Ville Koskinen. Thanks for helping with this document go to
Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

This document has been put into the public domain.

0.2. Getting the Latest Version

The latest official version of this document can be downloaded
from <http://tukaani.org/xz/xz-file-format.txt>.

Specific versions of this document have a filename
xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
For example, the version 1.0.0 of this document is available
at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.

    Version   Date          Description

    1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
                            clarifications to Sections 2, 2.2,

    1.0.0     2009-01-14    The first official version

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC-2119].

Indicating a warning means displaying a message, returning
an appropriate exit status, or doing something else to let the
user know that something worth warning about occurred. The
operation SHOULD still finish if a warning is indicated.

Indicating an error means displaying a message, returning
an appropriate exit status, or doing something else to let the
user know that something prevented the operation from finishing
successfully. The operation MUST be aborted once an error has
been indicated.

1.1. Byte and Its Representation

In this document, a byte is always 8 bits.

A "null byte" has all bits unset. That is, the value of a null
byte is 0x00.

To represent byte blocks, this document uses notation that
is similar to the notation used in [RFC-1952]:

    +-------+
    |  Foo  |   One byte.
    +-------+

    +---+---+
    |  Foo  |   Two bytes; that is, some of the vertical bars
    +---+---+   can be missing.

    +=======+
    |  Foo  |   Zero or more bytes.
    +=======+

In this document, a boxed byte or a byte sequence declared
using this notation is called "a field". The example field
above would be called "the Foo field" or plain "Foo".

If there are many fields, they may be split to multiple lines.
This is indicated with an arrow ("--->"):

    +=====+
    | Foo |
    +=====+

         +=====+
    ---> | Bar |
         +=====+

The above is equivalent to this:

    +=====+=====+
    | Foo | Bar |
    +=====+=====+

1.2. Multibyte Integers

Multibyte integers of static length, such as CRC values,
are stored in little endian byte order (least significant
byte first).

When smaller values are more likely than bigger values (for
example file sizes), multibyte integers are encoded in a
variable-length representation:
  - Numbers in the range [0, 127] are copied as is, and take
    one byte of space.
  - Bigger numbers will occupy two or more bytes. All but the
    last byte of the multibyte representation have the highest
    (eighth) bit set.

For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in the future if
needed.

The following C code illustrates encoding and decoding of
variable-length integers. The functions return the number of
bytes occupied by the integer (1-9), or zero on error.

    #include <inttypes.h>

    encode(uint8_t buf[static 9], uint64_t num)
        if (num > UINT64_MAX / 2)
        while (num >= 0x80) {
            buf[i++] = (uint8_t)(num) | 0x80;
        buf[i++] = (uint8_t)(num);

    decode(const uint8_t buf[], size_t size_max, uint64_t *num)
        *num = buf[0] & 0x7F;
        while (buf[i++] & 0x80) {
            if (i >= size_max || buf[i] == 0x00)
            *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
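
As a self-contained illustration of the variable-length encoding
described in this section, the following is a minimal sketch. The
function names vli_encode and vli_decode are hypothetical and are
not taken from this document; the 63-bit limit, the nine-byte
maximum, and the rejection of a trailing 0x00 byte follow the text
and code fragments above.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode num into buf; returns bytes written (1-9) or 0 on error. */
size_t vli_encode(uint8_t buf[9], uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;                       /* more than 63 bits */

    size_t i = 0;
    while (num >= 0x80) {
        buf[i++] = (uint8_t)num | 0x80; /* low 7 bits, continuation set */
        num >>= 7;
    }
    buf[i++] = (uint8_t)num;            /* last byte, highest bit clear */
    return i;
}

/* Decode from buf (at most size_max bytes available);
 * returns bytes consumed (1-9) or 0 on error. */
size_t vli_decode(const uint8_t *buf, size_t size_max, uint64_t *num)
{
    if (size_max > 9)
        size_max = 9;

    uint64_t value = 0;
    for (size_t i = 0; i < size_max; ++i) {
        const uint8_t b = buf[i];
        if (i > 0 && b == 0x00)
            return 0;                   /* non-minimal encoding */
        value |= (uint64_t)(b & 0x7F) << (i * 7);
        if ((b & 0x80) == 0) {
            *num = value;
            return i + 1;
        }
    }
    return 0;                           /* truncated or too long */
}
```

For example, the value 128 encodes to the two bytes 0x80 0x01, and
any value up to 127 encodes to a single byte equal to the value.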

2. Overall Structure of .xz File

A standalone .xz file consists of one or more Streams which may
have Stream Padding between or after them:

    +========+================+========+================+
    | Stream | Stream Padding | Stream | Stream Padding | ...
    +========+================+========+================+

The sizes of Stream and Stream Padding are always multiples
of four bytes, thus the size of every valid .xz file MUST be
a multiple of four bytes.

While a typical file contains only one Stream and no Stream
Padding, a decoder handling standalone .xz files SHOULD support
files that have more than one Stream or Stream Padding.

In contrast to standalone .xz files, when the .xz file format
is used as an internal part of some other file format or
communication protocol, it usually is expected that the decoder
stops after the first Stream, and doesn't look for Stream
Padding or possibly other Streams.

    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
    |     Stream Header     | Block | Block | ... | Block |
    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

         +=======+-+-+-+-+-+-+-+-+-+-+-+-+
    ---> | Index |     Stream Footer     |
         +=======+-+-+-+-+-+-+-+-+-+-+-+-+

All the above fields have a size that is a multiple of four. If
Stream is used as an internal part of another file format, it
is RECOMMENDED to make the Stream start at an offset that is
a multiple of four bytes.

Stream Header, Index, and Stream Footer are always present in
a Stream. The maximum size of the Index field is 16 GiB (2^34).

There are zero or more Blocks. The maximum number of Blocks is
limited only by the maximum size of the Index field.

Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
The same limit applies to the total amount of uncompressed
data stored in a Stream.

If an implementation supports handling .xz files with multiple
concatenated Streams, it MAY apply the above limits to the file
as a whole instead of applying them on a per-Stream basis.

    +---+---+---+---+---+---+-------+------+--+--+--+--+
    |  Header Magic Bytes   | Stream Flags |   CRC32   |
    +---+---+---+---+---+---+-------+------+--+--+--+--+

2.1.1.1. Header Magic Bytes

The first six (6) bytes of the Stream are the so-called Header
Magic Bytes. They can be used to identify the file type.

Using a C array and ASCII:

    const uint8_t HEADER_MAGIC[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

In plain hexadecimal:

    FD 37 7A 58 5A 00

  - The first byte (0xFD) was chosen so that the files cannot
    be erroneously detected as being in .lzma format, in which
    the first byte is in the range [0x00, 0xE0].
  - The sixth byte (0x00) was chosen to prevent applications
    from misdetecting the file as a text file.

If the Header Magic Bytes don't match, the decoder MUST
indicate an error.

2.1.1.2. Stream Flags

The first byte of Stream Flags is always a null byte. In the
future this byte may be used to indicate new Stream version or
other Stream properties.

The second byte of Stream Flags is a bit field:

    Bit(s)  Mask  Description
     0-3    0x0F  Type of Check (see Section 3.4):
                      ID    Size      Check name
                      0x00   0 bytes  None
                      0x01   4 bytes  CRC32
                      0x02   4 bytes  (Reserved)
                      0x03   4 bytes  (Reserved)
                      0x04   8 bytes  CRC64
                      0x05   8 bytes  (Reserved)
                      0x06   8 bytes  (Reserved)
                      0x07  16 bytes  (Reserved)
                      0x08  16 bytes  (Reserved)
                      0x09  16 bytes  (Reserved)
                      0x0A  32 bytes  SHA-256
                      0x0B  32 bytes  (Reserved)
                      0x0C  32 bytes  (Reserved)
                      0x0D  64 bytes  (Reserved)
                      0x0E  64 bytes  (Reserved)
                      0x0F  64 bytes  (Reserved)
     4-7    0xF0  Reserved for future use; MUST be zero for now.

Implementations SHOULD support at least the Check IDs 0x00
(None) and 0x01 (CRC32). Supporting other Check IDs is
OPTIONAL. If an unsupported Check is used, the decoder SHOULD
indicate a warning or error.
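
Because the Check sizes in the table above grow in groups of three
IDs, a decoder can derive the size of the Check field from the ID
even when it does not support the Check itself. The sketch below
shows one way to do that; the function name check_size_for_id is
illustrative only, and it assumes that ID 0x00 (None) means the
Check field is empty.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Check field for a given Check ID
 * (the low four bits of the second Stream Flags byte). */
size_t check_size_for_id(uint8_t check_id)
{
    check_id &= 0x0F;
    if (check_id == 0x00)
        return 0;                 /* None: no Check field */

    /* IDs 1-3 -> 4, 4-6 -> 8, 7-9 -> 16, A-C -> 32, D-F -> 64 */
    return (size_t)4 << ((check_id - 1) / 3);
}
```

Knowing the size lets a decoder skip an unsupported Check and
continue after indicating a warning.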

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Stream Header
incorrectly.

The CRC32 is calculated from the Stream Flags field. It is
stored as an unsigned 32-bit little endian integer. If the
calculated value does not match the stored one, the decoder
MUST indicate an error.

The idea is that Stream Flags would always be two bytes, even
if new features are needed. This way old decoders will be able
to verify the CRC32 calculated from Stream Flags, and thus
distinguish between corrupt files (CRC32 doesn't match) and
files that the decoder doesn't support (CRC32 matches but
Stream Flags has reserved bits set).
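
The distinction between corrupt and unsupported files can be
sketched as follows. This is only an illustration under stated
assumptions: the Stream Header is taken to be 12 bytes (six magic
bytes, two Stream Flags bytes, four CRC32 bytes), and the names
parse_stream_header, crc32_bitwise, and the result codes are
hypothetical. The CRC32 uses the reflected polynomial 0xEDB88320
given in Section 6.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum header_result {
    HEADER_OK,
    HEADER_BAD_MAGIC,
    HEADER_CORRUPT,      /* CRC32 does not match */
    HEADER_UNSUPPORTED   /* CRC32 matches, reserved bits are set */
};

/* Bit-at-a-time CRC32; slow but short. */
static uint32_t crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return ~crc;
}

enum header_result
parse_stream_header(const uint8_t hdr[12], uint8_t *check_id)
{
    static const uint8_t magic[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
    if (memcmp(hdr, magic, sizeof(magic)) != 0)
        return HEADER_BAD_MAGIC;

    /* Verify the CRC32 before interpreting Stream Flags. */
    const uint32_t stored = (uint32_t)hdr[8]
            | (uint32_t)hdr[9] << 8
            | (uint32_t)hdr[10] << 16
            | (uint32_t)hdr[11] << 24;
    if (crc32_bitwise(hdr + 6, 2) != stored)
        return HEADER_CORRUPT;

    /* First byte must be null; bits 4-7 of the second are reserved. */
    if (hdr[6] != 0x00 || (hdr[7] & 0xF0) != 0)
        return HEADER_UNSUPPORTED;

    *check_id = hdr[7] & 0x0F;
    return HEADER_OK;
}
```

Every result other than HEADER_OK corresponds to a case where the
decoder indicates an error; the separate codes only let it word
the message differently.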

    +-+-+-+-+---+---+---+---+-------+------+----------+---------+
    | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
    +-+-+-+-+---+---+---+---+-------+------+----------+---------+

The CRC32 is calculated from the Backward Size and Stream Flags
fields. It is stored as an unsigned 32-bit little endian
integer. If the calculated value does not match the stored one,
the decoder MUST indicate an error.

The reason to have the CRC32 field before the Backward Size and
Stream Flags fields is to keep the four-byte fields aligned to
a multiple of four bytes.

2.1.2.2. Backward Size

Backward Size is stored as a 32-bit little endian integer,
which indicates the size of the Index field as a multiple of
four bytes, minimum value being four bytes:

    real_backward_size = (stored_backward_size + 1) * 4;

If the stored value does not match the real size of the Index
field, the decoder MUST indicate an error.

Using a fixed-size integer to store Backward Size makes
it slightly simpler to parse the Stream Footer when the
application needs to parse the Stream backwards.

2.1.2.3. Stream Flags

This is a copy of the Stream Flags field from the Stream
Header. The information stored to Stream Flags is needed
when parsing the Stream backwards. The decoder MUST compare
the Stream Flags fields in both Stream Header and Stream
Footer, and indicate an error if they are not identical.

2.1.2.4. Footer Magic Bytes

As the last step of the decoding process, the decoder MUST
verify the existence of Footer Magic Bytes. If they don't
match, an error MUST be indicated.

Using a C array and ASCII:

    const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

In hexadecimal:

    59 5A

The primary reason to have Footer Magic Bytes is to make
it easier to detect incomplete files quickly, without
uncompressing. If the file does not end with Footer Magic Bytes
(excluding Stream Padding described in Section 2.2), it cannot
be undamaged, unless someone has intentionally appended garbage
after the end of the Stream.

Only the decoders that support decoding of concatenated Streams
MUST support Stream Padding.

Stream Padding MUST contain only null bytes. To preserve the
four-byte alignment of consecutive Streams, the size of Stream
Padding MUST be a multiple of four bytes. Empty Stream Padding
is allowed. If these requirements are not met, the decoder MUST
indicate an error.
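
The two Stream Padding requirements (only null bytes, size a
multiple of four) can be checked with a few lines of code. The
following is a minimal sketch; the function name
stream_padding_is_valid is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if pad[0..size) is valid Stream Padding. */
bool stream_padding_is_valid(const uint8_t *pad, size_t size)
{
    if (size % 4 != 0)
        return false;             /* breaks four-byte alignment */

    for (size_t i = 0; i < size; ++i)
        if (pad[i] != 0x00)
            return false;         /* only null bytes are allowed */

    return true;                  /* includes empty padding */
}
```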

Note that non-empty Stream Padding is allowed at the end of the
file; there doesn't need to be a new Stream after non-empty
Stream Padding. This can be convenient in certain situations.

The possibility of Stream Padding MUST be taken into account
when designing an application that parses Streams backwards
and supports concatenated Streams.

    +==============+=================+===============+=======+
    | Block Header | Compressed Data | Block Padding | Check |
    +==============+=================+===============+=======+

    +-------------------+-------------+=================+
    | Block Header Size | Block Flags | Compressed Size |
    +-------------------+-------------+=================+

         +===================+======================+
    ---> | Uncompressed Size | List of Filter Flags |
         +===================+======================+

         +================+--+--+--+--+
    ---> | Header Padding |   CRC32   |
         +================+--+--+--+--+

3.1.1. Block Header Size

This field overlaps with the Index Indicator field (see
Section 4.1).

This field contains the size of the Block Header field,
including the Block Header Size field itself. Valid values are
in the range [0x01, 0xFF], which indicate the size of the Block
Header as multiples of four bytes, minimum size being eight
bytes:

    real_header_size = (encoded_header_size + 1) * 4;

If a bigger Block Header is needed in the future, a new field
can be added between the current Block Header and Compressed
Data fields. The presence of this new field would be indicated
in the Block Flags field.

The first byte of the Block Flags field is a bit field:

    Bit(s)  Mask  Description
     0-1    0x03  Number of filters (1-4)
     2-5    0x3C  Reserved for future use; MUST be zero for now.
     6      0x40  The Compressed Size field is present.
     7      0x80  The Uncompressed Size field is present.

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Block Header
incorrectly.
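
The Block Header Size formula and the Block Flags bit field can
be decoded as in the sketch below. The struct and function names
are hypothetical. The sketch assumes that the two-bit filter count
stores the number of filters minus one, which is how a two-bit
field can express the range 1-4.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct block_flags {
    unsigned filter_count;        /* 1-4 */
    bool has_compressed_size;
    bool has_uncompressed_size;
};

/* encoded is in [0x01, 0xFF]; 0x00 would be an Index Indicator. */
size_t block_header_real_size(uint8_t encoded)
{
    return ((size_t)encoded + 1) * 4;
}

/* Returns false if a reserved bit is set (decoder indicates an error). */
bool parse_block_flags(uint8_t flags, struct block_flags *out)
{
    if (flags & 0x3C)
        return false;

    out->filter_count = (flags & 0x03) + 1u;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```

With these definitions, an encoded size of 0x01 gives the minimum
Block Header size of eight bytes and 0xFF gives the maximum of
1024 bytes.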

3.1.3. Compressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Compressed Size field contains the size of the Compressed
Data field, which MUST be non-zero. Compressed Size is stored
using the encoding described in Section 1.2. If the Compressed
Size doesn't match the size of the Compressed Data field, the
decoder MUST indicate an error.

3.1.4. Uncompressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Uncompressed Size field contains the size of the Block
after uncompressing. Uncompressed Size is stored using the
encoding described in Section 1.2. If the Uncompressed Size
does not match the real uncompressed size, the decoder MUST
indicate an error.

Storing the Compressed Size and Uncompressed Size fields serves
several purposes:
  - The decoder knows how much memory it needs to allocate
    for a temporary buffer in multithreaded mode.
  - Simple error detection: wrong size indicates a broken file.
  - Seeking forwards to a specific location in streamed mode.

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

3.1.5. List of Filter Flags

    +================+================+     +================+
    | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
    +================+================+     +================+

The number of Filter Flags fields is stored in the Block Flags
field (see Section 3.1.2).

The format of each Filter Flags field is as follows:

    +===========+====================+===================+
    | Filter ID | Size of Properties | Filter Properties |
    +===========+====================+===================+

Both Filter ID and Size of Properties are stored using the
encoding described in Section 1.2. Size of Properties indicates
the size of the Filter Properties field in bytes. The list of
officially defined Filter IDs and the formats of their Filter
Properties are described in Section 5.3.

Filter IDs greater than or equal to 0x4000_0000_0000_0000
(2^62) are reserved for implementation-specific internal use.
These Filter IDs MUST never be used in List of Filter Flags.

3.1.6. Header Padding

This field contains as many null bytes as are needed to make
the Block Header have the size specified in Block Header Size.
If any of the bytes are not null bytes, the decoder MUST
indicate an error. It is possible that there is a new field
present which the decoder is not aware of, and can thus parse
the Block Header incorrectly.

The CRC32 is calculated over everything in the Block Header
field except the CRC32 field itself. It is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.

Verifying the CRC32 of the Block Header before parsing the
actual contents allows the decoder to distinguish between
corrupt and unsupported files.

The format of Compressed Data depends on Block Flags and List
of Filter Flags. Excluding the descriptions of the simplest
filters in Section 5.3, the format of the filter-specific
encoded data is out of scope of this document.

Block Padding MUST contain 0-3 null bytes to make the size of
the Block a multiple of four bytes. This can be needed when
the size of Compressed Data is not a multiple of four. If any
of the bytes in Block Padding are not null bytes, the decoder
MUST indicate an error.

The type and size of the Check field depend on which bits
are set in the Stream Flags field (see Section 2.1.1.2).

The Check, when used, is calculated from the original
uncompressed data. If the calculated Check does not match the
stored one, the decoder MUST indicate an error. If the selected
type of Check is not supported by the decoder, it SHOULD
indicate a warning or error.

    +-----------------+===================+
    | Index Indicator | Number of Records |
    +-----------------+===================+

         +=================+===============+-+-+-+-+
    ---> | List of Records | Index Padding | CRC32 |
         +=================+===============+-+-+-+-+

Index serves several purposes. Using it, one can
  - verify that all Blocks in a Stream have been processed;
  - find out the uncompressed size of a Stream; and
  - quickly access the beginning of any Block (random access).

This field overlaps with the Block Header Size field (see
Section 3.1.1). The value of Index Indicator is always 0x00.

4.2. Number of Records

This field indicates how many Records there are in the List
of Records field, and thus how many Blocks there are in the
Stream. The value is stored using the encoding described in
Section 1.2. If the decoder has decoded all the Blocks of the
Stream, and then notices that the Number of Records doesn't
match the real number of Blocks, the decoder MUST indicate an
error.

List of Records consists of as many Records as indicated by the
Number of Records field:

    +========+========+
    | Record | Record | ...
    +========+========+

Each Record contains information about one Block:

    +===============+===================+
    | Unpadded Size | Uncompressed Size |
    +===============+===================+

If the decoder has decoded all the Blocks of the Stream, it
MUST verify that the contents of the Records match the real
Unpadded Size and Uncompressed Size of the respective Blocks.

Implementation hint: It is possible to verify the Index with
constant memory usage by calculating for example SHA-256 of
both the real size values and the List of Records, then
comparing the hash values. Implementing this using a
non-cryptographic hash like CRC32 SHOULD be avoided unless
small code size is important.

If the decoder supports random-access reading, it MUST verify
that Unpadded Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
part of a Block is decoded, the decoder MUST verify that the
processed sizes don't exceed the sizes stored in the Index.

This field indicates the size of the Block excluding the Block
Padding field. That is, Unpadded Size is the size of the Block
Header, Compressed Data, and Check fields. Unpadded Size is
stored using the encoding described in Section 1.2. The value
MUST never be zero; with the current structure of Blocks, the
actual minimum value for Unpadded Size is five.

Implementation note: Because the size of the Block Padding
field is not included in Unpadded Size, calculating the total
size of a Stream or doing random-access reading requires
calculating the actual size of the Blocks by rounding Unpadded
Sizes up to the next multiple of four.
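
The rounding described in the implementation note can be sketched
as follows. The names padded_block_size, index_record, and
block_offset are hypothetical, and the sketch assumes a 12-byte
Stream Header (six magic bytes, two Stream Flags bytes, and a
four-byte CRC32) in front of the first Block.

```c
#include <stddef.h>
#include <stdint.h>

struct index_record {
    uint64_t unpadded_size;
    uint64_t uncompressed_size;
};

/* Round Unpadded Size up to the next multiple of four. */
uint64_t padded_block_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Offset of Block n from the start of the Stream. Passing the
 * number of Records as n gives the offset of the Index. */
uint64_t block_offset(const struct index_record *records, size_t n)
{
    uint64_t offset = 12;         /* assumed Stream Header size */
    for (size_t i = 0; i < n; ++i)
        offset += padded_block_size(records[i].unpadded_size);
    return offset;
}
```

A random-access reader would combine these offsets with a running
sum of the Uncompressed Size values to find the Block that holds
a given uncompressed position.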

The reason to exclude Block Padding from Unpadded Size is to
ease making a raw copy of Compressed Data without Block
Padding. This can be useful, for example, if someone wants
to convert Streams to some other file format quickly.

4.3.2. Uncompressed Size

This field indicates the Uncompressed Size of the respective
Block in bytes. The value is stored using the encoding
described in Section 1.2.

This field MUST contain 0-3 null bytes to pad the Index to
a multiple of four bytes. If any of the bytes are not null
bytes, the decoder MUST indicate an error.

The CRC32 is calculated over everything in the Index field
except the CRC32 field itself. The CRC32 is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.

The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
the output of one filter is the input of another filter. The
following figure illustrates the direction of data flow.

            v   Uncompressed Data  ^
    Encoder |       Filter 1       | Decoder

Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
best results, an archiver tool should make sure that all
PowerPC executable files in the archive stream start at
offsets that are multiples of four bytes.

Some filters, for example LZMA2, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can be beneficial also when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA2 filter. Because
not only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA2
settings so that the LZMA2 encoder can take advantage of its
four-byte-aligned input data.

The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
  - speed, if the filtered data is handled multiple bytes at
    a time by the filter-specific encoder and decoder,
    because accessing aligned data in computer memory is
    fast; and
  - compression ratio, if the output data is later compressed
    with an external compression tool.

If filters were allowed to be chained freely, it would be
possible to create malicious files that would be very slow to
decode. Such files could be used to create denial of service
attacks.

Slow files could occur when multiple filters are chained:

    v   Compressed input data
    |   Filter 1 decoder (last filter)
    |   Filter 0 decoder (non-last filter)
    v   Uncompressed output data

The decoder of the last filter in the chain produces a lot of
output from little input. Another filter in the chain takes the
output of the last filter, and produces very little output
while consuming a lot of input. As a result, a lot of data is
moved inside the filter chain, but the filter chain as a whole
gets very little work done.

To prevent such slow files, there are restrictions on
how the filters can be chained. These restrictions MUST be
taken into account when designing new filters.

The maximum number of filters in the chain has been limited to
four, thus there can be at most three non-last filters.
Of these three non-last filters, only two are allowed to change
the size of the data.

Non-last filters that change the size of the data MUST
have a limit on how much the decoder can compress the data: the
decoder SHOULD produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations MUST be avoided.

The above limitations guarantee that if the last filter in the
chain produces 4n bytes of output, the chain as a whole will
produce at least n bytes of output.

LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding.

LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder and
for uncompressed chunks, eases stateful decoder implementations,
and improves support for multithreading. Thus, the plain LZMA
will not be supported in this file format.

    Size of Filter Properties:  1 byte
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes
    Input data:                 Adjustable to 1/2/4/8/16 byte(s)

The format of the one-byte Filter Properties field is as
follows:

    Bits   Mask   Description
    0-5    0x3F   Dictionary Size
    6-7    0xC0   Reserved for future use; MUST be zero for now.

Dictionary Size is encoded with one-bit mantissa and five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest
is 4 GiB.

    Raw value   Mantissa   Exponent   Dictionary size
       40          2          31      4096 MiB - 1 B

Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:

    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        dictionary_size = UINT32_MAX;
    } else {
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }

5.3.2. Branch/Call/Jump Filters for Executables

These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
files. This conversion increases redundancy and thus
compression ratio.

    Size of Filter Properties:  0 or 4 bytes
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

Below is the list of filters in this category. The alignment
is the same for both input and output data.

    Filter ID   Alignment   Description
      0x04       1 byte     x86 filter (BCJ)
      0x05       4 bytes    PowerPC (big endian) filter
      0x06      16 bytes    IA64 filter
      0x07       4 bytes    ARM (little endian) filter
      0x08       2 bytes    ARM Thumb (little endian) filter
      0x09       4 bytes    SPARC filter

If the size of Filter Properties is four bytes, the Filter
Properties field contains the start offset used for address
conversions. It is stored as an unsigned 32-bit little endian
integer. The start offset MUST be a multiple of the alignment
of the filter as listed in the table above; if it isn't, the
decoder MUST indicate an error. If the size of Filter
Properties is zero, the start offset is zero.
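
The start offset rules can be checked as in the following sketch.
The function names bcj_alignment and bcj_start_offset are
hypothetical; the alignments are copied from the table above, and
the sketch treats any Filter Properties size other than 0 or 4 as
an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Alignment in bytes for a BCJ Filter ID, or 0 if not a BCJ filter. */
uint32_t bcj_alignment(uint64_t filter_id)
{
    switch (filter_id) {
    case 0x04: return 1;   /* x86 */
    case 0x05: return 4;   /* PowerPC (big endian) */
    case 0x06: return 16;  /* IA64 */
    case 0x07: return 4;   /* ARM (little endian) */
    case 0x08: return 2;   /* ARM Thumb (little endian) */
    case 0x09: return 4;   /* SPARC */
    default:   return 0;
    }
}

/* Decodes the start offset; returns false where the decoder
 * would indicate an error. */
bool bcj_start_offset(uint64_t filter_id, const uint8_t *props,
        size_t props_size, uint32_t *start_offset)
{
    const uint32_t alignment = bcj_alignment(filter_id);
    if (alignment == 0)
        return false;

    if (props_size == 0) {
        *start_offset = 0;
        return true;
    }
    if (props_size != 4)
        return false;

    /* Unsigned 32-bit little endian integer. */
    const uint32_t value = (uint32_t)props[0]
            | (uint32_t)props[1] << 8
            | (uint32_t)props[2] << 16
            | (uint32_t)props[3] << 24;
    if (value % alignment != 0)
        return false;

    *start_offset = value;
    return true;
}
```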

Setting the start offset may be useful if an executable has
multiple sections, and there are many cross-section calls.
Taking advantage of this feature usually requires usage of
the Subblock filter, whose design is not complete yet.

The Delta filter may increase compression ratio when the value
of the next byte correlates with the value of an earlier byte
at specified distance.

    Size of Filter Properties:  1 byte
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No
    Output data:                Same as the original input data

The Properties byte indicates the delta distance, which can be
1-256 bytes backwards from the current byte: 0x00 indicates
distance of 1 byte and 0xFF distance of 256 bytes.

5.3.3.1. Format of the Encoded Output

The code below illustrates both encoding and decoding with
the Delta filter.

    // Distance is in the range [1, 256].
    const unsigned int distance = get_properties_byte() + 1;

    memset(delta, 0, sizeof(delta));

        const int byte = read_byte();

        uint8_t tmp = delta[(uint8_t)(distance + pos)];
            tmp = (uint8_t)(byte) - tmp;
            delta[pos] = (uint8_t)(byte);
            tmp = (uint8_t)(byte) + tmp;
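
As a self-contained illustration of the same technique, the sketch
below applies the Delta filter to a buffer in place instead of
reading and writing one byte at a time. The function names
delta_encode and delta_decode are hypothetical. The 256-byte
history indexed by a wrapping 8-bit position mirrors the delta
array and pos variable in the code above, and the history is
assumed to start as all zeros.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* distance is in the range [1, 256]. */
void delta_encode(uint8_t *buf, size_t size, unsigned int distance)
{
    uint8_t history[256];
    uint8_t pos = 0;
    memset(history, 0, sizeof(history));

    for (size_t i = 0; i < size; ++i) {
        /* Original byte seen `distance` positions earlier. */
        const uint8_t prev = history[(uint8_t)(distance + pos)];
        history[pos] = buf[i];
        buf[i] = (uint8_t)(buf[i] - prev);
        --pos;                    /* wraps from 0 to 255 */
    }
}

void delta_decode(uint8_t *buf, size_t size, unsigned int distance)
{
    uint8_t history[256];
    uint8_t pos = 0;
    memset(history, 0, sizeof(history));

    for (size_t i = 0; i < size; ++i) {
        /* Decoded byte produced `distance` positions earlier. */
        const uint8_t prev = history[(uint8_t)(distance + pos)];
        buf[i] = (uint8_t)(buf[i] + prev);
        history[pos] = buf[i];
        --pos;
    }
}
```

For example, with distance 1 the bytes 10 12 14 16 18 encode to
10 2 2 2 2, which a later filter such as LZMA2 can compress more
easily.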

5.4. Custom Filter IDs

A developer who wants to use custom Filter IDs has two
choices. The first choice is to contact Lasse Collin and ask
him to allocate a range of IDs for the developer.

The second choice is to generate a 40-bit random integer,
which the developer can use as a personal Developer ID.
To minimize the risk of collisions, Developer ID has to be
a randomly generated integer, not manually selected "hex word".
The following command, which works on many free operating
systems, can be used to generate Developer ID:

    dd if=/dev/urandom bs=5 count=1 | hexdump

The developer can then use the Developer ID to create unique
(well, hopefully unique) Filter IDs.

    Bits    Mask                    Description
     0-15   0x0000_0000_0000_FFFF   Filter ID
    16-55   0x00FF_FFFF_FFFF_0000   Developer ID
    56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F

The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this Section for how to
request a custom ID range.
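
The bit layout in the table above can be assembled with a few
shifts. The following sketch uses a hypothetical function name,
make_custom_filter_id; it masks the Developer ID to 40 bits so
that a wider argument cannot disturb the static prefix.

```c
#include <stdint.h>

/* Combine a 40-bit Developer ID and a 16-bit per-developer
 * Filter ID into a 63-bit custom Filter ID. */
uint64_t make_custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    return (UINT64_C(0x3F) << 56)
            | ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
            | (uint64_t)filter_id;
}
```

Every value produced this way is below 2^62, so it stays outside
the range reserved for implementation-specific internal use.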

5.4.1. Reserved Custom Filter ID Ranges

    0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
    0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
    0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility

6. Cyclic Redundancy Checks

There are several incompatible variations to calculate CRC32
and CRC64. For simplicity and clarity, complete examples are
provided to calculate the checks as they are used in this file
format. Implementations MAY use different code as long as it
gives identical results.

The program below reads data from standard input, calculates
the CRC32 and CRC64 values, and prints the calculated values
as big endian hexadecimal strings to standard output.

    #include <inttypes.h>

    uint32_t crc32_table[256];
    uint64_t crc64_table[256];

        static const uint32_t poly32 = UINT32_C(0xEDB88320);
        static const uint64_t poly64
                = UINT64_C(0xC96C5795D7870F42);

        for (size_t i = 0; i < 256; ++i) {
            for (size_t j = 0; j < 8; ++j) {
                    crc32 = (crc32 >> 1) ^ poly32;
                    crc64 = (crc64 >> 1) ^ poly64;
            crc32_table[i] = crc32;
            crc64_table[i] = crc64;

    crc32(const uint8_t *buf, size_t size, uint32_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc32_table[buf[i] ^ (crc & 0xFF)]

    crc64(const uint8_t *buf, size_t size, uint64_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)]

        uint32_t value32 = 0;
        uint64_t value64 = 0;
        uint64_t total_size = 0;

            const size_t buf_size
                    = fread(buf, 1, sizeof(buf), stdin);

            total_size += buf_size;
            value32 = crc32(buf, buf_size, value32);
            value64 = crc64(buf, buf_size, value64);

        printf("Bytes:  %" PRIu64 "\n", total_size);
        printf("CRC-32: 0x%08" PRIX32 "\n", value32);
        printf("CRC-64: 0x%016" PRIX64 "\n", value64);
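
For checking an independent implementation, a table-free variant
can be handy. The sketch below is not the program from this
document; it is a bit-at-a-time version that assumes the same
reflected polynomials (0xEDB88320 and 0xC96C5795D7870F42), an
initial value of all ones, and a final inversion. The function
names crc32_update and crc64_update are hypothetical. As in the
program above, the caller passes 0 for the first chunk and the
previous return value for later chunks.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t crc32_update(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return ~crc;
}

uint64_t crc64_update(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1)
                    ? (crc >> 1) ^ UINT64_C(0xC96C5795D7870F42)
                    : crc >> 1;
    }
    return ~crc;
}
```

The nine ASCII bytes "123456789" are a common test input; with
these parameters they give CRC32 0xCBF43926 and CRC64
0x995DC9BBDF1939FA.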

LZMA SDK - The original LZMA implementation
http://7-zip.org/sdk.html

LZMA Utils - LZMA adapted to POSIX-like systems
http://tukaani.org/lzma/

XZ Utils - The next generation of LZMA Utils
http://tukaani.org/xz/

[RFC-1952]
GZIP file format specification version 4.3
http://www.ietf.org/rfc/rfc1952.txt
  - Notation of byte boxes in section "2.1. Overall conventions"

[RFC-2119]
Key words for use in RFCs to Indicate Requirement Levels
http://www.ietf.org/rfc/rfc2119.txt

GNU tar 1.21 manual
http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
  - Node 9.4.2 "Blocking Factor", paragraph that begins
    "gzip will complain about trailing garbage"
  - Note that this URL points to the latest version of the
    manual, and may some day not contain the note which is in
    1.21. For the exact version of the manual, download GNU
    tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz