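The .xz File Format

Version 1.0.3 (2009-06-05)

0. Preface
    0.1. Notices and Acknowledgements
    0.2. Getting the Latest Version
    0.3. Version History
1. Conventions
    1.1. Byte and Its Representation
    1.2. Multibyte Integers
2. Overall Structure of .xz File
    2.1. Stream
        2.1.1. Stream Header
            2.1.1.1. Header Magic Bytes
            2.1.1.2. Stream Flags
            2.1.1.3. CRC32
        2.1.2. Stream Footer
            2.1.2.1. CRC32
            2.1.2.2. Backward Size
            2.1.2.3. Stream Flags
            2.1.2.4. Footer Magic Bytes
    2.2. Stream Padding
3. Block
    3.1. Block Header
        3.1.1. Block Header Size
        3.1.2. Block Flags
        3.1.3. Compressed Size
        3.1.4. Uncompressed Size
        3.1.5. List of Filter Flags
        3.1.6. Header Padding
        3.1.7. CRC32
    3.2. Compressed Data
    3.3. Block Padding
    3.4. Check
4. Index
    4.1. Index Indicator
    4.2. Number of Records
    4.3. List of Records
        4.3.1. Unpadded Size
        4.3.2. Uncompressed Size
    4.4. Index Padding
    4.5. CRC32
5. Filter Chains
    5.1. Alignment
    5.2. Security
    5.3. Filters
        5.3.1. LZMA2
        5.3.2. Branch/Call/Jump Filters for Executables
        5.3.3. Delta
            5.3.3.1. Format of the Encoded Output
    5.4. Custom Filter IDs
        5.4.1. Reserved Custom Filter ID Ranges
6. Cyclic Redundancy Checks
7. References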

0. Preface

This document describes the .xz file format (filename suffix
".xz", MIME type "application/x-xz"). It is intended that this
format replace the old .lzma format used by LZMA SDK and
LZMA Utils.

0.1. Notices and Acknowledgements

This file format was designed by Lasse Collin
<lasse.collin@tukaani.org> and Igor Pavlov.

Special thanks for helping with this document go to
Ville Koskinen. Thanks for helping with this document go to
Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

This document has been put into the public domain.

0.2. Getting the Latest Version

The latest official version of this document can be downloaded
from <http://tukaani.org/xz/xz-file-format.txt>.

Specific versions of this document have a filename
xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
For example, the version 1.0.0 of this document is available
at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.

0.3. Version History

    Version   Date          Description

    1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4

    1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1

    1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
                            clarifications to Sections 2, 2.2,

    1.0.0     2009-01-14    The first official version

1. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC-2119].

Indicating a warning means displaying a message, returning
appropriate exit status, or doing something else to let the
user know that something worth warning occurred. The operation
SHOULD still finish if a warning is indicated.

Indicating an error means displaying a message, returning
appropriate exit status, or doing something else to let the
user know that something prevented successfully finishing the
operation. The operation MUST be aborted once an error has
been indicated.

1.1. Byte and Its Representation

In this document, byte is always 8 bits.

A "null byte" has all bits unset. That is, the value of a null
byte is 0x00.

To represent byte blocks, this document uses notation that
is similar to the notation used in [RFC-1952]:

    +-------+
    |  Foo  |   One byte.
    +-------+

    +---+---+
    |  Foo  |   Two bytes; that is, some of the vertical bars
    +---+---+   can be missing.

    +=======+
    |  Foo  |   Zero or more bytes.
    +=======+

In this document, a boxed byte or a byte sequence declared
using this notation is called "a field". The example field
above would be called "the Foo field" or plain "Foo".

If there are many fields, they may be split to multiple lines.
This is indicated with an arrow ("--->"):

    +=====+
    | Foo |
    +=====+

         +=====+
    ---> | Bar |
         +=====+

The above is equivalent to this:

    +=====+=====+
    | Foo | Bar |
    +=====+=====+

1.2. Multibyte Integers

Multibyte integers of static length, such as CRC values,
are stored in little endian byte order (least significant
byte first).

When smaller values are more likely than bigger values (for
example file sizes), multibyte integers are encoded in a
variable-length representation:
  - Numbers in the range [0, 127] are copied as is, and take
    one byte of space.
  - Bigger numbers will occupy two or more bytes. All but the
    last byte of the multibyte representation have the highest
    (eighth) bit set.

For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in future if needed.

The following C code illustrates encoding and decoding of
variable-length integers. The functions return the number of
bytes occupied by the integer (1-9), or zero on error.

    #include <inttypes.h>

    encode(uint8_t buf[static 9], uint64_t num)
        if (num > UINT64_MAX / 2)
        while (num >= 0x80) {
            buf[i++] = (uint8_t)(num) | 0x80;
        buf[i++] = (uint8_t)(num);

    decode(const uint8_t buf[], size_t size_max, uint64_t *num)
        *num = buf[0] & 0x7F;
        while (buf[i++] & 0x80) {
            if (i >= size_max || buf[i] == 0x00)
            *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
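
As a sketch of how the two functions described above fit
together, the following self-contained listing spells them out
in full so that it compiles on its own. The names vli_encode and
vli_decode are illustrative and are not taken from any library;
the logic follows the rules of this section (seven payload bits
per byte, highest bit set on all but the last byte, at most nine
bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode num into buf; returns bytes written (1-9), or 0 if num
 * does not fit in 63 bits. */
size_t
vli_encode(uint8_t buf[9], uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;

    size_t i = 0;
    while (num >= 0x80) {
        buf[i++] = (uint8_t)(num & 0x7F) | 0x80; /* continuation bit */
        num >>= 7;
    }
    buf[i++] = (uint8_t)num;    /* last byte: highest bit unset */
    return i;
}

/* Decode from buf, reading at most size_max bytes; returns the
 * number of bytes consumed (1-9), or 0 on error. */
size_t
vli_decode(const uint8_t buf[], size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;
    if (size_max > 9)
        size_max = 9;

    *num = buf[0] & 0x7F;
    size_t i = 0;
    while (buf[i++] & 0x80) {
        /* Reject truncated input and non-minimal encodings. */
        if (i >= size_max || buf[i] == 0x00)
            return 0;
        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }
    return i;
}
```

For example, the value 300 encodes to the two bytes 0xAC 0x02,
and decoding those two bytes yields 300 again.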

2. Overall Structure of .xz File

A standalone .xz file consists of one or more Streams which may
have Stream Padding between or after them:

    +========+================+========+================+
    | Stream | Stream Padding | Stream | Stream Padding | ...
    +========+================+========+================+

The sizes of Stream and Stream Padding are always multiples
of four bytes, thus the size of every valid .xz file MUST be
a multiple of four bytes.

While a typical file contains only one Stream and no Stream
Padding, a decoder handling standalone .xz files SHOULD support
files that have more than one Stream or Stream Padding.

In contrast to standalone .xz files, when the .xz file format
is used as an internal part of some other file format or
communication protocol, it usually is expected that the decoder
stops after the first Stream, and doesn't look for Stream
Padding or possibly other Streams.

2.1. Stream

    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
    |     Stream Header     | Block | Block | ... | Block |
    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

         +=======+-+-+-+-+-+-+-+-+-+-+-+-+
    ---> | Index |     Stream Footer     |
         +=======+-+-+-+-+-+-+-+-+-+-+-+-+

All the above fields have a size that is a multiple of four. If
Stream is used as an internal part of another file format, it
is RECOMMENDED to make the Stream start at an offset that is
a multiple of four bytes.

Stream Header, Index, and Stream Footer are always present in
a Stream. The maximum size of the Index field is 16 GiB (2^34).

There are zero or more Blocks. The maximum number of Blocks is
limited only by the maximum size of the Index field.

Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
The same limit applies to the total amount of uncompressed
data stored in a Stream.

If an implementation supports handling .xz files with multiple
concatenated Streams, it MAY apply the above limits to the file
as a whole instead of applying them on a per-Stream basis.

2.1.1. Stream Header

    +---+---+---+---+---+---+-------+------+--+--+--+--+
    |  Header Magic Bytes   | Stream Flags |   CRC32   |
    +---+---+---+---+---+---+-------+------+--+--+--+--+

2.1.1.1. Header Magic Bytes

The first six (6) bytes of the Stream are the so-called Header
Magic Bytes. They can be used to identify the file type.

Using a C array and ASCII:

    const uint8_t HEADER_MAGIC[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

In plain hexadecimal:

    FD 37 7A 58 5A 00

Notes:
  - The first byte (0xFD) was chosen so that the files cannot
    be erroneously detected as being in .lzma format, in which
    the first byte is in the range [0x00, 0xE0].
  - The sixth byte (0x00) was chosen to prevent applications
    from misdetecting the file as a text file.

If the Header Magic Bytes don't match, the decoder MUST
indicate an error.

2.1.1.2. Stream Flags

The first byte of Stream Flags is always a null byte. In the
future, this byte may be used to indicate a new Stream version
or other Stream properties.

The second byte of Stream Flags is a bit field:

    Bit(s)  Mask  Description
     0-3    0x0F  Type of Check (see Section 3.4):
                      ID    Size      Check name
                      0x00   0 bytes  None
                      0x01   4 bytes  CRC32
                      0x02   4 bytes  (Reserved)
                      0x03   4 bytes  (Reserved)
                      0x04   8 bytes  CRC64
                      0x05   8 bytes  (Reserved)
                      0x06   8 bytes  (Reserved)
                      0x07  16 bytes  (Reserved)
                      0x08  16 bytes  (Reserved)
                      0x09  16 bytes  (Reserved)
                      0x0A  32 bytes  SHA-256
                      0x0B  32 bytes  (Reserved)
                      0x0C  32 bytes  (Reserved)
                      0x0D  64 bytes  (Reserved)
                      0x0E  64 bytes  (Reserved)
                      0x0F  64 bytes  (Reserved)
     4-7    0xF0  Reserved for future use; MUST be zero for now.

Implementations SHOULD support at least the Check IDs 0x00
(None) and 0x01 (CRC32). Supporting other Check IDs is
OPTIONAL. If an unsupported Check is used, the decoder SHOULD
indicate a warning or error.
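
The Check sizes listed for the sixteen Check IDs follow a
regular pattern: ID 0x00 has no Check, and every following group
of three IDs doubles the size, starting from four bytes. A
decoder that wants to skip over an unsupported Check can
therefore compute its size instead of storing a table. The
sketch below is only an observation about the listed sizes; the
function name check_size is hypothetical.

```c
#include <stdint.h>

/* Size in bytes of the Check field for a four-bit Check ID. */
uint32_t
check_size(uint8_t check_id)
{
    check_id &= 0x0F;               /* only bits 0-3 carry the ID */
    if (check_id == 0x00)
        return 0;                   /* None */

    /* IDs 1-3: 4 bytes, 4-6: 8, 7-9: 16, 10-12: 32, 13-15: 64 */
    return UINT32_C(4) << ((check_id - 1) / 3);
}
```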

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Stream Header
incorrectly.

2.1.1.3. CRC32

The CRC32 is calculated from the Stream Flags field. It is
stored as an unsigned 32-bit little endian integer. If the
calculated value does not match the stored one, the decoder
MUST indicate an error.

The idea is that Stream Flags would always be two bytes, even
if new features are needed. This way old decoders will be able
to verify the CRC32 calculated from Stream Flags, and thus
distinguish between corrupt files (CRC32 doesn't match) and
files that the decoder doesn't support (CRC32 matches but
Stream Flags has reserved bits set).
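
The distinction between corrupt and unsupported files can be
sketched as follows. This is a minimal illustration, not a
reference implementation: the enum, the function names, and the
compact bitwise CRC32 routine are all assumptions made for this
example. The CRC32 uses the same reflected polynomial
(0xEDB88320) as the code in Section 6.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum header_status {
    HEADER_OK, HEADER_BAD_MAGIC, HEADER_CORRUPT, HEADER_UNSUPPORTED
};

/* Bitwise CRC32; slower than a table but short. */
uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return ~crc;
}

/* Validate a 12-byte Stream Header and extract the Check ID. */
enum header_status
parse_stream_header(const uint8_t hdr[12], uint8_t *check_id)
{
    static const uint8_t magic[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
    if (memcmp(hdr, magic, sizeof(magic)) != 0)
        return HEADER_BAD_MAGIC;

    /* Stored CRC32 is little endian and covers Stream Flags only. */
    const uint32_t stored = (uint32_t)hdr[8]
            | ((uint32_t)hdr[9] << 8)
            | ((uint32_t)hdr[10] << 16)
            | ((uint32_t)hdr[11] << 24);
    if (crc32_bitwise(hdr + 6, 2) != stored)
        return HEADER_CORRUPT;

    /* The CRC32 matched, so reserved bits mean "unsupported". */
    if (hdr[6] != 0x00 || (hdr[7] & 0xF0) != 0)
        return HEADER_UNSUPPORTED;

    *check_id = hdr[7] & 0x0F;
    return HEADER_OK;
}
```

Checking the CRC32 before looking at the reserved bits is what
lets the caller report the two failure modes separately.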

2.1.2. Stream Footer

    +-+-+-+-+---+---+---+---+-------+------+----------+---------+
    | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
    +-+-+-+-+---+---+---+---+-------+------+----------+---------+

2.1.2.1. CRC32

The CRC32 is calculated from the Backward Size and Stream Flags
fields. It is stored as an unsigned 32-bit little endian
integer. If the calculated value does not match the stored one,
the decoder MUST indicate an error.

The reason to have the CRC32 field before the Backward Size and
Stream Flags fields is to keep the four-byte fields aligned to
a multiple of four bytes.

2.1.2.2. Backward Size

Backward Size is stored as a 32-bit little endian integer,
which indicates the size of the Index field as a multiple of
four bytes, minimum value being four bytes:

    real_backward_size = (stored_backward_size + 1) * 4;

If the stored value does not match the real size of the Index
field, the decoder MUST indicate an error.

Using a fixed-size integer to store Backward Size makes
it slightly simpler to parse the Stream Footer when the
application needs to parse the Stream backwards.
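
A minimal sketch of reading Backward Size from a Stream Footer
is shown below. The function names are hypothetical. The result
is computed in 64 bits because the largest stored value
(0xFFFFFFFF) gives (2^32) * 4 = 2^34 bytes, which matches the
16 GiB maximum Index size stated in Section 2.1 but does not fit
in a 32-bit integer.

```c
#include <stdint.h>

/* Read an unsigned 32-bit little endian integer. */
uint32_t
read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Real size of the Index in bytes, given the 12-byte Stream Footer. */
uint64_t
footer_index_size(const uint8_t footer[12])
{
    /* Bytes 0-3 hold the CRC32; Backward Size is in bytes 4-7. */
    const uint64_t stored_backward_size = read_le32(footer + 4);
    return (stored_backward_size + 1) * 4;
}
```

When parsing backwards, the Index then starts
footer_index_size() bytes before the first byte of the Stream
Footer.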

2.1.2.3. Stream Flags

This is a copy of the Stream Flags field from the Stream
Header. The information stored in Stream Flags is needed
when parsing the Stream backwards. The decoder MUST compare
the Stream Flags fields in both Stream Header and Stream
Footer, and indicate an error if they are not identical.

2.1.2.4. Footer Magic Bytes

As the last step of the decoding process, the decoder MUST
verify the existence of Footer Magic Bytes. If they don't
match, an error MUST be indicated.

Using a C array and ASCII:

    const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

The primary reason to have Footer Magic Bytes is to make
it easier to detect incomplete files quickly, without
uncompressing. If the file does not end with Footer Magic Bytes
(excluding Stream Padding described in Section 2.2), it cannot
be undamaged, unless someone has intentionally appended garbage
after the end of the Stream.

2.2. Stream Padding

Only the decoders that support decoding of concatenated Streams
MUST support Stream Padding.

Stream Padding MUST contain only null bytes. To preserve the
four-byte alignment of consecutive Streams, the size of Stream
Padding MUST be a multiple of four bytes. Empty Stream Padding
is allowed. If these requirements are not met, the decoder MUST
indicate an error.

Note that non-empty Stream Padding is allowed at the end of the
file; there doesn't need to be a new Stream after non-empty
Stream Padding. This can be convenient in certain situations
[GNU-tar].

The possibility of Stream Padding MUST be taken into account
when designing an application that parses Streams backwards
and supports concatenated Streams.

3. Block

    +==============+=================+===============+=======+
    | Block Header | Compressed Data | Block Padding | Check |
    +==============+=================+===============+=======+

3.1. Block Header

    +-------------------+-------------+=================+
    | Block Header Size | Block Flags | Compressed Size |
    +-------------------+-------------+=================+

         +===================+======================+
    ---> | Uncompressed Size | List of Filter Flags |
         +===================+======================+

         +================+--+--+--+--+
    ---> | Header Padding |   CRC32   |
         +================+--+--+--+--+

3.1.1. Block Header Size

This field overlaps with the Index Indicator field (see
Section 4.1).

This field contains the size of the Block Header field,
including the Block Header Size field itself. Valid values are
in the range [0x01, 0xFF], which indicate the size of the Block
Header as multiples of four bytes, minimum size being eight
bytes:

    real_header_size = (encoded_header_size + 1) * 4;

If a bigger Block Header is needed in the future, a new field
can be added between the current Block Header and Compressed
Data fields. The presence of this new field would be indicated
in the Block Header field.

3.1.2. Block Flags

The first byte of the Block Flags field is a bit field:

    Bit(s)  Mask  Description
     0-1    0x03  Number of filters (1-4)
     2-5    0x3C  Reserved for future use; MUST be zero for now.
     6      0x40  The Compressed Size field is present.
     7      0x80  The Uncompressed Size field is present.

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Block Header
incorrectly.
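
The first two bytes of a Block Header can be interpreted along
the following lines. This is a sketch under one stated
assumption: the two-bit value n in bits 0-1 is taken to mean
n + 1 filters, which is the natural reading of the range 1-4.
The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct block_flags {
    unsigned filter_count;          /* 1-4 */
    bool has_compressed_size;
    bool has_uncompressed_size;
};

/* Real Block Header size in bytes for an encoded value in
 * the range [0x01, 0xFF]. */
unsigned
block_header_size(uint8_t encoded)
{
    return ((unsigned)encoded + 1) * 4;
}

/* Returns false if a reserved bit is set, in which case the
 * decoder has to indicate an error. */
bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    if (flags & 0x3C)
        return false;

    out->filter_count = (flags & 0x03) + 1u;   /* assumed mapping */
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```

With this formula the Block Header is between 8 bytes (encoded
value 0x01) and 1024 bytes (encoded value 0xFF) long.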

3.1.3. Compressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Compressed Size field contains the size of the Compressed
Data field, which MUST be non-zero. Compressed Size is stored
using the encoding described in Section 1.2. If the Compressed
Size doesn't match the size of the Compressed Data field, the
decoder MUST indicate an error.

3.1.4. Uncompressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Uncompressed Size field contains the size of the Block
after uncompressing. Uncompressed Size is stored using the
encoding described in Section 1.2. If the Uncompressed Size
does not match the real uncompressed size, the decoder MUST
indicate an error.

Storing the Compressed Size and Uncompressed Size fields serves
several purposes:
  - The decoder knows how much memory it needs to allocate
    for a temporary buffer in multithreaded mode.
  - Simple error detection: wrong size indicates a broken file.
  - Seeking forwards to a specific location in streamed mode.

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

3.1.5. List of Filter Flags

    +================+================+     +================+
    | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
    +================+================+     +================+

The number of Filter Flags fields is stored in the Block Flags
field (see Section 3.1.2).

The format of each Filter Flags field is as follows:

    +===========+====================+===================+
    | Filter ID | Size of Properties | Filter Properties |
    +===========+====================+===================+

Both Filter ID and Size of Properties are stored using the
encoding described in Section 1.2. Size of Properties indicates
the size of the Filter Properties field in bytes. The list of
officially defined Filter IDs and the formats of their Filter
Properties are described in Section 5.3.

Filter IDs greater than or equal to 0x4000_0000_0000_0000
(2^62) are reserved for implementation-specific internal use.
These Filter IDs MUST never be used in List of Filter Flags.

3.1.6. Header Padding

This field contains as many null bytes as are needed to make
the Block Header have the size specified in Block Header Size.
If any of the bytes are not null bytes, the decoder MUST
indicate an error. It is possible that there is a new field
present which the decoder is not aware of, and can thus parse
the Block Header incorrectly.

3.1.7. CRC32

The CRC32 is calculated over everything in the Block Header
field except the CRC32 field itself. It is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.

Verifying the CRC32 of the Block Header before parsing the
actual contents allows the decoder to distinguish between
corrupt and unsupported files.

3.2. Compressed Data

The format of Compressed Data depends on Block Flags and List
of Filter Flags. Excluding the descriptions of the simplest
filters in Section 5.3, the format of the filter-specific
encoded data is out of scope of this document.

3.3. Block Padding

Block Padding MUST contain 0-3 null bytes to make the size of
the Block a multiple of four bytes. This can be needed when
the size of Compressed Data is not a multiple of four. If any
of the bytes in Block Padding are not null bytes, the decoder
MUST indicate an error.

3.4. Check

The type and size of the Check field depend on which bits
are set in the Stream Flags field (see Section 2.1.1.2).

The Check, when used, is calculated from the original
uncompressed data. If the calculated Check does not match the
stored one, the decoder MUST indicate an error. If the selected
type of Check is not supported by the decoder, it SHOULD
indicate a warning or error.

4. Index

    +-----------------+===================+
    | Index Indicator | Number of Records |
    +-----------------+===================+

         +=================+===============+-+-+-+-+
    ---> | List of Records | Index Padding | CRC32 |
         +=================+===============+-+-+-+-+

Index serves several purposes. Using it, one can
  - verify that all Blocks in a Stream have been processed;
  - find out the uncompressed size of a Stream; and
  - quickly access the beginning of any Block (random access).

4.1. Index Indicator

This field overlaps with the Block Header Size field (see
Section 3.1.1). The value of Index Indicator is always 0x00.

4.2. Number of Records

This field indicates how many Records there are in the List
of Records field, and thus how many Blocks there are in the
Stream. The value is stored using the encoding described in
Section 1.2. If the decoder has decoded all the Blocks of the
Stream, and then notices that the Number of Records doesn't
match the real number of Blocks, the decoder MUST indicate an
error.

4.3. List of Records

List of Records consists of as many Records as indicated by the
Number of Records field:

    +========+========+
    | Record | Record | ...
    +========+========+

Each Record contains information about one Block:

    +===============+===================+
    | Unpadded Size | Uncompressed Size |
    +===============+===================+

If the decoder has decoded all the Blocks of the Stream, it
MUST verify that the contents of the Records match the real
Unpadded Size and Uncompressed Size of the respective Blocks.

Implementation hint: It is possible to verify the Index with
constant memory usage by calculating for example SHA-256 of
both the real size values and the List of Records, then
comparing the hash values. Implementing this using a
non-cryptographic hash like CRC32 SHOULD be avoided unless
small code size is important.

If the decoder supports random-access reading, it MUST verify
that Unpadded Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
part of a Block is decoded, the decoder MUST verify that the
processed sizes don't exceed the sizes stored in the Index.

4.3.1. Unpadded Size

This field indicates the size of the Block excluding the Block
Padding field. That is, Unpadded Size is the size of the Block
Header, Compressed Data, and Check fields. Unpadded Size is
stored using the encoding described in Section 1.2. The value
MUST never be zero; with the current structure of Blocks, the
actual minimum value for Unpadded Size is five.

Implementation note: Because the size of the Block Padding
field is not included in Unpadded Size, calculating the total
size of a Stream or doing random-access reading requires
calculating the actual size of the Blocks by rounding Unpadded
Sizes up to the next multiple of four.
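
The rounding described in the implementation note can be
sketched as follows. The function names are hypothetical, and
the array of Unpadded Sizes is assumed to have been read from
the Records of the Index in order.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an Unpadded Size up to the next multiple of four. */
uint64_t
padded_block_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Offset of Block number `target` (counting from zero) relative
 * to the start of the first Block in the Stream. */
uint64_t
block_offset(const uint64_t unpadded_sizes[], size_t target)
{
    uint64_t offset = 0;
    for (size_t i = 0; i < target; ++i)
        offset += padded_block_size(unpadded_sizes[i]);
    return offset;
}
```

Because the Stream Header is 12 bytes long (six magic bytes,
two bytes of Stream Flags, and a four-byte CRC32), adding 12 to
the returned value gives the offset from the beginning of the
Stream.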

The reason to exclude Block Padding from Unpadded Size is to
ease making a raw copy of Compressed Data without Block
Padding. This can be useful, for example, if someone wants
to convert Streams to some other file format quickly.

4.3.2. Uncompressed Size

This field indicates the Uncompressed Size of the respective
Block in bytes. The value is stored using the encoding
described in Section 1.2.

4.4. Index Padding

This field MUST contain 0-3 null bytes to pad the Index to
a multiple of four bytes. If any of the bytes are not null
bytes, the decoder MUST indicate an error.

4.5. CRC32

The CRC32 is calculated over everything in the Index field
except the CRC32 field itself. The CRC32 is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.

5. Filter Chains

The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
the output of one filter is the input of another filter. The
following figure illustrates the direction of data flow.

            v   Uncompressed Data   ^
            |       Filter 0        |
    Encoder |       Filter 1        | Decoder
            |       Filter n        |
            v    Compressed Data    |

5.1. Alignment

Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
best results, an archiver tool should make sure that all
PowerPC executable files in the archive stream start at
offsets that are multiples of four bytes.

Some filters, for example LZMA2, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can be beneficial also when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA2 filter. Because
not only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA2
settings so that the LZMA2 encoder can take advantage of its
four-byte-aligned input data.

The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
  - speed, if the filtered data is handled multiple bytes at
    a time by the filter-specific encoder and decoder,
    because accessing aligned data in computer memory is
    usually faster; and
  - compression ratio, if the output data is later compressed
    with an external compression tool.

5.2. Security

If filters could be chained freely, it would be possible to
create malicious files that would be very slow to decode. Such
files could be used to create denial of service attacks.

Slow files could occur when multiple filters are chained:

    v   Compressed input data
    |   Filter 1 decoder (last filter)
    |   Filter 0 decoder (non-last filter)
    v   Uncompressed output data

The decoder of the last filter in the chain produces a lot of
output from little input. Another filter in the chain takes the
output of the last filter, and produces very little output
while consuming a lot of input. As a result, a lot of data is
moved inside the filter chain, but the filter chain as a whole
gets very little work done.

To prevent such slow files, there are restrictions on
how the filters can be chained. These restrictions MUST be
taken into account when designing new filters.

The maximum number of filters in the chain has been limited to
four, thus there can be at most three non-last filters.
Of these three non-last filters, only two are allowed to change
the size of the data.

The non-last filters that change the size of the data MUST
have a limit on how much the decoder can compress the data: the
decoder SHOULD produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations MUST be avoided.

The above limitations guarantee that if the last filter in the
chain produces 4n bytes of output, the chain as a whole will
produce at least n bytes of output.

5.3. Filters

5.3.1. LZMA2

LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding
algorithms.

LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder and
for uncompressed chunks, eases stateful decoder implementations,
and improves support for multithreading. Thus, the plain LZMA
will not be supported in this file format.

    Filter ID:                  0x21
    Size of Filter Properties:  1 byte
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes

    Preferred alignment:
        Input data:             Adjustable to 1/2/4/8/16 byte(s)
        Output data:            1 byte

The format of the one-byte Filter Properties field is as
follows:

    Bits   Mask   Description
    0-5    0x3F   Dictionary Size
    6-7    0xC0   Reserved for future use; MUST be zero for now.

Dictionary Size is encoded with one-bit mantissa and five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest
is 4 GiB.

    Raw value   Mantissa   Exponent   Dictionary size
        0           2         11         4 KiB
        1           3         11         6 KiB
        2           2         12         8 KiB
        3           3         12        12 KiB
        4           2         13        16 KiB
        5           3         13        24 KiB
        6           2         14        32 KiB
      ...         ...        ...       ...
       35           3         28       768 MiB
       36           2         29      1024 MiB
       37           3         29      1536 MiB
       38           2         30      2048 MiB
       39           3         30      3072 MiB
       40           2         31      4096 MiB - 1 B

Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:

    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        dictionary_size = UINT32_MAX;
    } else {
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }
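
To try the formula on concrete values, it can be wrapped into a
standalone function as in the sketch below. The function name
and the 0 / -1 return convention are assumptions made for this
example; the check of the reserved bits follows the Filter
Properties layout described in this section.

```c
#include <stdint.h>

/* Decode the LZMA2 Filter Properties byte. Returns 0 on success,
 * or -1 if reserved bits are set or the value is above 40. */
int
lzma2_dict_size(uint8_t props, uint32_t *dict_size)
{
    if (props & 0xC0)
        return -1;                  /* reserved bits MUST be zero */

    const uint8_t bits = props & 0x3F;
    if (bits > 40)
        return -1;                  /* bigger than 4 GiB */

    if (bits == 40) {
        *dict_size = UINT32_MAX;    /* 4096 MiB - 1 B */
    } else {
        *dict_size = UINT32_C(2) | (bits & 1);   /* mantissa 2 or 3 */
        *dict_size <<= bits / 2 + 11;            /* exponent */
    }
    return 0;
}
```

For instance, a raw value of 0 gives 4096 bytes (4 KiB), 1 gives
6144 bytes (6 KiB), and 39 gives 3221225472 bytes (3072 MiB).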

5.3.2. Branch/Call/Jump Filters for Executables

These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
files. This conversion increases redundancy and thus
compression ratio.

    Size of Filter Properties:  0 or 4 bytes
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

Below is the list of filters in this category. The alignment
is the same for both input and output data.

    Filter ID   Alignment   Description
      0x04       1 byte     x86 filter (BCJ)
      0x05       4 bytes    PowerPC (big endian) filter
      0x06      16 bytes    IA64 filter
      0x07       4 bytes    ARM (little endian) filter
      0x08       2 bytes    ARM Thumb (little endian) filter
      0x09       4 bytes    SPARC filter

If the size of Filter Properties is four bytes, the Filter
Properties field contains the start offset used for address
conversions. It is stored as an unsigned 32-bit little endian
integer. The start offset MUST be a multiple of the alignment
of the filter as listed in the table above; if it isn't, the
decoder MUST indicate an error. If the size of Filter
Properties is zero, the start offset is zero.
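
The start offset rules can be checked with a few lines of code.
The sketch below is illustrative only; the function names are
hypothetical, and the alignments are copied from the table in
this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Alignment in bytes for a Branch/Call/Jump Filter ID, or 0 if
 * the ID is not one of the filters listed in this section. */
uint32_t
bcj_alignment(uint64_t filter_id)
{
    switch (filter_id) {
    case 0x04: return 1;    /* x86 */
    case 0x05: return 4;    /* PowerPC (big endian) */
    case 0x06: return 16;   /* IA64 */
    case 0x07: return 4;    /* ARM (little endian) */
    case 0x08: return 2;    /* ARM Thumb (little endian) */
    case 0x09: return 4;    /* SPARC */
    default:   return 0;
    }
}

/* Decode the start offset from Filter Properties. Returns false
 * if the ID is unknown, the size is neither 0 nor 4, or the
 * offset is not a multiple of the filter's alignment. */
bool
bcj_start_offset(uint64_t filter_id, const uint8_t *props,
        uint32_t props_size, uint32_t *start_offset)
{
    const uint32_t align = bcj_alignment(filter_id);
    if (align == 0)
        return false;

    if (props_size == 0) {
        *start_offset = 0;          /* default start offset */
        return true;
    }
    if (props_size != 4)
        return false;

    *start_offset = (uint32_t)props[0] | ((uint32_t)props[1] << 8)
            | ((uint32_t)props[2] << 16)
            | ((uint32_t)props[3] << 24);
    return *start_offset % align == 0;
}
```

As an example, a start offset of 8 is acceptable for the ARM
Thumb filter (alignment 2) but not for the IA64 filter
(alignment 16).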

Setting the start offset may be useful if an executable has
multiple sections, and there are many cross-section calls.
Taking advantage of this feature usually requires usage of
the Subblock filter, whose design is not complete yet.

5.3.3. Delta

The Delta filter may increase compression ratio when the value
of the next byte correlates with the value of an earlier byte
at specified distance.

    Filter ID:                  0x03
    Size of Filter Properties:  1 byte
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

    Preferred alignment:
        Input data:             1 byte
        Output data:            Same as the original input data

The Properties byte indicates the delta distance, which can be
1-256 bytes backwards from the current byte: 0x00 indicates
distance of 1 byte and 0xFF distance of 256 bytes.

5.3.3.1. Format of the Encoded Output

The code below illustrates both encoding and decoding with
the Delta filter.

    // Distance is in the range [1, 256].
    const unsigned int distance = get_properties_byte() + 1;

    memset(delta, 0, sizeof(delta));

        const int byte = read_byte();

        uint8_t tmp = delta[(uint8_t)(distance + pos)];
            tmp = (uint8_t)(byte) - tmp;
            delta[pos] = (uint8_t)(byte);
            tmp = (uint8_t)(byte) + tmp;
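
A complete, buffer-based variant of the same idea is sketched
below so that the filter can be tried in isolation. It is an
illustration under stated assumptions, not the normative code:
the function name and its in-place buffer interface are
hypothetical, and the 256-byte history array indexed by a
wrapping 8-bit position mirrors the delta[] and pos variables
used above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Apply the Delta filter to buf in place. props is the Filter
 * Properties byte (distance - 1); encode selects the direction. */
void
delta_filter(uint8_t *buf, size_t size, uint8_t props, bool encode)
{
    const unsigned int distance = props + 1u;   /* range [1, 256] */
    uint8_t pos = 0;
    uint8_t delta[256];                         /* history buffer */

    memset(delta, 0, sizeof(delta));

    for (size_t i = 0; i < size; ++i) {
        const uint8_t byte = buf[i];
        /* Value of the original byte `distance` positions back. */
        uint8_t tmp = delta[(uint8_t)(distance + pos)];

        if (encode) {
            tmp = (uint8_t)(byte - tmp);
            delta[pos] = byte;      /* remember the original byte */
        } else {
            tmp = (uint8_t)(byte + tmp);
            delta[pos] = tmp;       /* remember the restored byte */
        }

        buf[i] = tmp;
        --pos;                      /* wraps around from 0 to 255 */
    }
}
```

With a distance of one byte, the input 1 2 3 4 6 is encoded as
1 1 1 1 2, and decoding that output restores the original bytes.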

5.4. Custom Filter IDs

If a developer wants to use custom Filter IDs, there are two
choices. The first choice is to contact Lasse Collin and ask
him to allocate a range of IDs for the developer.

The second choice is to generate a 40-bit random integer,
which the developer can use as a personal Developer ID.
To minimize the risk of collisions, Developer ID has to be
a randomly generated integer, not manually selected "hex word".
The following command, which works on many free operating
systems, can be used to generate Developer ID:

    dd if=/dev/urandom bs=5 count=1 | hexdump

The developer can then use the Developer ID to create unique
(well, hopefully unique) Filter IDs.

    Bits    Mask                    Description
     0-15   0x0000_0000_0000_FFFF   Filter ID
    16-55   0x00FF_FFFF_FFFF_0000   Developer ID
    56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F

The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this Section for how to
request a custom ID range.
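
Combining the three parts of the bit layout above is a matter
of shifting and masking, as the following sketch shows. The
function name is hypothetical, and the Developer ID used in the
example afterwards is made up for illustration rather than
randomly generated.

```c
#include <stdint.h>

/* Build a custom Filter ID from a 40-bit Developer ID and a
 * 16-bit developer-specific Filter ID. */
uint64_t
custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    return (UINT64_C(0x3F) << 56)                       /* prefix */
            | ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
            | filter_id;
}
```

For example, the Developer ID 0x0123456789 and the Filter ID
0x0001 combine to 0x3F01_2345_6789_0001, which stays below the
range starting at 2^62 that Section 3.1.5 reserves for
implementation-specific internal use.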

5.4.1. Reserved Custom Filter ID Ranges

    Range                       Description
    0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
    0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
    0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility

6. Cyclic Redundancy Checks

There are several incompatible ways to calculate CRC32
and CRC64. For simplicity and clarity, complete examples are
provided to calculate the checks as they are used in this file
format. Implementations MAY use different code as long as it
gives identical results.

The program below reads data from standard input, calculates
the CRC32 and CRC64 values, and prints the calculated values
as big endian hexadecimal strings to standard output.

    #include <inttypes.h>

    uint32_t crc32_table[256];
    uint64_t crc64_table[256];

        static const uint32_t poly32 = UINT32_C(0xEDB88320);
        static const uint64_t poly64
                = UINT64_C(0xC96C5795D7870F42);

        for (size_t i = 0; i < 256; ++i) {
            for (size_t j = 0; j < 8; ++j) {
                    crc32 = (crc32 >> 1) ^ poly32;
                    crc64 = (crc64 >> 1) ^ poly64;
            crc32_table[i] = crc32;
            crc64_table[i] = crc64;

    crc32(const uint8_t *buf, size_t size, uint32_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc32_table[buf[i] ^ (crc & 0xFF)]

    crc64(const uint8_t *buf, size_t size, uint64_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)]

        uint32_t value32 = 0;
        uint64_t value64 = 0;
        uint64_t total_size = 0;

            const size_t buf_size
                    = fread(buf, 1, sizeof(buf), stdin);

            total_size += buf_size;
            value32 = crc32(buf, buf_size, value32);
            value64 = crc64(buf, buf_size, value64);

        printf("Bytes:  %" PRIu64 "\n", total_size);
        printf("CRC-32: 0x%08" PRIX32 "\n", value32);
        printf("CRC-64: 0x%016" PRIX64 "\n", value64);
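
For experimenting with the two checks outside of a complete
program, the table-driven calculation can be condensed into the
self-contained sketch below. It uses the same polynomials as
the code above; the function names crc_init, crc32_xz, and
crc64_xz are hypothetical, and it is assumed (as is usual for
these CRC variants) that the value is inverted before and after
processing.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];
static uint64_t crc64_table[256];

/* Build both lookup tables; call once before the functions below. */
void
crc_init(void)
{
    for (size_t i = 0; i < 256; ++i) {
        uint32_t c32 = (uint32_t)i;
        uint64_t c64 = (uint64_t)i;
        for (int j = 0; j < 8; ++j) {
            c32 = (c32 & 1)
                    ? (c32 >> 1) ^ UINT32_C(0xEDB88320)
                    : c32 >> 1;
            c64 = (c64 & 1)
                    ? (c64 >> 1) ^ UINT64_C(0xC96C5795D7870F42)
                    : c64 >> 1;
        }
        crc32_table[i] = c32;
        crc64_table[i] = c64;
    }
}

/* Pass crc = 0 for the first chunk of data, and the previous
 * return value for every following chunk. */
uint32_t
crc32_xz(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i)
        crc = crc32_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
    return ~crc;
}

uint64_t
crc64_xz(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i)
        crc = crc64_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
    return ~crc;
}
```

A common sanity check is the nine-byte ASCII string "123456789",
for which these functions give 0xCBF43926 (CRC32) and
0x995DC9BBDF1939FA (CRC64).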

7. References

LZMA SDK - The original LZMA implementation
http://7-zip.org/sdk.html

LZMA Utils - LZMA adapted to POSIX-like systems
http://tukaani.org/lzma/

XZ Utils - The next generation of LZMA Utils
http://tukaani.org/xz/

[RFC-1952]
    GZIP file format specification version 4.3
    http://www.ietf.org/rfc/rfc1952.txt
      - Notation of byte boxes in section "2.1. Overall conventions"

[RFC-2119]
    Key words for use in RFCs to Indicate Requirement Levels
    http://www.ietf.org/rfc/rfc2119.txt

[GNU-tar]
    GNU tar 1.21 manual
    http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
      - Node 9.4.2 "Blocking Factor", paragraph that begins
        "gzip will complain about trailing garbage"
      - Note that this URL points to the latest version of the
        manual, and may some day not contain the note which is in
        1.21. For the exact version of the manual, download GNU
        tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz