5 Version 1.0.2 (2009-06-04)
9 0.1. Notices and Acknowledgements
10 0.2. Getting the Latest Version
13 1.1. Byte and Its Representation
14 1.2. Multibyte Integers
15 2. Overall Structure of .xz File
18 2.1.1.1. Header Magic Bytes
23 2.1.2.2. Backward Size
25 2.1.2.4. Footer Magic Bytes
29 3.1.1. Block Header Size
31 3.1.3. Compressed Size
32 3.1.4. Uncompressed Size
33 3.1.5. List of Filter Flags
41 4.2. Number of Records
44 4.3.2. Uncompressed Size
52 5.3.2. Branch/Call/Jump Filters for Executables
54 5.3.3.1. Format of the Encoded Output
55 5.4. Custom Filter IDs
56 5.4.1. Reserved Custom Filter ID Ranges
57 6. Cyclic Redundancy Checks
63 This document describes the .xz file format (filename suffix
64 ".xz", MIME type "application/x-xz"). It is intended that this
65 this format replace the old .lzma format used by LZMA SDK and
69 0.1. Notices and Acknowledgements
71 This file format was designed by Lasse Collin
72 <lasse.collin@tukaani.org> and Igor Pavlov.
74 Special thanks for helping with this document goes to
75 Ville Koskinen. Thanks for helping with this document goes to
76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
78 This document has been put into the public domain.
81 0.2. Getting the Latest Version
83 The latest official version of this document can be downloaded
84 from <http://tukaani.org/xz/xz-file-format.txt>.
86 Specific versions of this document have a filename
87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
88 For example, the version 1.0.0 of this document is available
89 at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
94 Version Date Description
96 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1
98 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor
99 clarifications to Sections 2, 2.2,
102 1.0.0 2009-01-14 The first official version
107 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
108 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
109 document are to be interpreted as described in [RFC-2119].
111 Indicating a warning means displaying a message, returning
112 appropriate exit status, or doing something else to let the
113 user know that something worth warning occurred. The operation
114 SHOULD still finish if a warning is indicated.
116 Indicating an error means displaying a message, returning
117 appropriate exit status, or doing something else to let the
118 user know that something prevented successfully finishing the
119 operation. The operation MUST be aborted once an error has
123 1.1. Byte and Its Representation
125 In this document, byte is always 8 bits.
127 A "null byte" has all bits unset. That is, the value of a null
130 To represent byte blocks, this document uses notation that
131 is similar to the notation used in [RFC-1952]:
138 | Foo | Two bytes; that is, some of the vertical bars
139 +---+---+ can be missing.
142 | Foo | Zero or more bytes.
145 In this document, a boxed byte or a byte sequence declared
146 using this notation is called "a field". The example field
147 above would be called "the Foo field" or plain "Foo".
149 If there are many fields, they may be split to multiple lines.
150 This is indicated with an arrow ("--->"):
160 The above is equivalent to this:
167 1.2. Multibyte Integers
169 Multibyte integers of static length, such as CRC values,
170 are stored in little endian byte order (least significant
173 When smaller values are more likely than bigger values (for
174 example file sizes), multibyte integers are encoded in a
175 variable-length representation:
176 - Numbers in the range [0, 127] are copied as is, and take
178 - Bigger numbers will occupy two or more bytes. All but the
179 last byte of the multibyte representation have the highest
182 For now, the value of the variable-length integers is limited
183 to 63 bits, which limits the encoded size of the integer to
184 nine bytes. These limits may be increased in future if needed.
186 The following C code illustrates encoding and decoding of
187 variable-length integers. The functions return the number of
188 bytes occupied by the integer (1-9), or zero on error.
191 #include <inttypes.h>
194 encode(uint8_t buf[static 9], uint64_t num)
196 if (num > UINT64_MAX / 2)
201 while (num >= 0x80) {
202 buf[i++] = (uint8_t)(num) | 0x80;
206 buf[i++] = (uint8_t)(num);
212 decode(const uint8_t buf[], size_t size_max, uint64_t *num)
220 *num = buf[0] & 0x7F;
223 while (buf[i++] & 0x80) {
224 if (i >= size_max || buf[i] == 0x00)
227 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
234 2. Overall Structure of .xz File
236 A standalone .xz files consist of one or more Streams which may
237 have Stream Padding between or after them:
239 +========+================+========+================+
240 | Stream | Stream Padding | Stream | Stream Padding | ...
241 +========+================+========+================+
243 The sizes of Stream and Stream Padding are always multiples
244 of four bytes, thus the size of every valid .xz file MUST be
245 a multiple of four bytes.
247 While a typical file contains only one Stream and no Stream
248 Padding, a decoder handling standalone .xz files SHOULD support
249 files that have more than one Stream or Stream Padding.
251 In contrast to standalone .xz files, when the .xz file format
252 is used as an internal part of some other file format or
253 communication protocol, it usually is expected that the decoder
254 stops after the first Stream, and doesn't look for Stream
255 Padding or possibly other Streams.
260 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
261 | Stream Header | Block | Block | ... | Block |
262 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
264 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
265 ---> | Index | Stream Footer |
266 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
268 All the above fields have a size that is a multiple of four. If
269 Stream is used as an internal part of another file format, it
270 is RECOMMENDED to make the Stream start at an offset that is
271 a multiple of four bytes.
273 Stream Header, Index, and Stream Footer are always present in
274 a Stream. The maximum size of the Index field is 16 GiB (2^34).
276 There are zero or more Blocks. The maximum number of Blocks is
277 limited only by the maximum size of the Index field.
279 Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
280 The same limit applies to the total amount of uncompressed
281 data stored in a Stream.
283 If an implementation supports handling .xz files with multiple
284 concatenated Streams, it MAY apply the above limits to the file
285 as a whole instead of limiting per Stream basis.
290 +---+---+---+---+---+---+-------+------+--+--+--+--+
291 | Header Magic Bytes | Stream Flags | CRC32 |
292 +---+---+---+---+---+---+-------+------+--+--+--+--+
295 2.1.1.1. Header Magic Bytes
297 The first six (6) bytes of the Stream are so called Header
298 Magic Bytes. They can be used to identify the file type.
300 Using a C array and ASCII:
301 const uint8_t HEADER_MAGIC[6]
302 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
304 In plain hexadecimal:
308 - The first byte (0xFD) was chosen so that the files cannot
309 be erroneously detected as being in .lzma format, in which
310 the first byte is in the range [0x00, 0xE0].
311 - The sixth byte (0x00) was chosen to prevent applications
312 from misdetecting the file as a text file.
314 If the Header Magic Bytes don't match, the decoder MUST
318 2.1.1.2. Stream Flags
320 The first byte of Stream Flags is always a null byte. In future
321 this byte may be used to indicate new Stream version or other
324 The second byte of Stream Flags is a bit field:
326 Bit(s) Mask Description
327 0-3 0x0F Type of Check (see Section 3.4):
331 0x02 4 bytes (Reserved)
332 0x03 4 bytes (Reserved)
334 0x05 8 bytes (Reserved)
335 0x06 8 bytes (Reserved)
336 0x07 16 bytes (Reserved)
337 0x08 16 bytes (Reserved)
338 0x09 16 bytes (Reserved)
339 0x0A 32 bytes SHA-256
340 0x0B 32 bytes (Reserved)
341 0x0C 32 bytes (Reserved)
342 0x0D 64 bytes (Reserved)
343 0x0E 64 bytes (Reserved)
344 0x0F 64 bytes (Reserved)
345 4-7 0xF0 Reserved for future use; MUST be zero for now.
347 Implementations SHOULD support at least the Check IDs 0x00
348 (None) and 0x01 (CRC32). Supporting other Check IDs is
349 OPTIONAL. If an unsupported Check is used, the decoder SHOULD
350 indicate a warning or error.
352 If any reserved bit is set, the decoder MUST indicate an error.
353 It is possible that there is a new field present which the
354 decoder is not aware of, and can thus parse the Stream Header
360 The CRC32 is calculated from the Stream Flags field. It is
361 stored as an unsigned 32-bit little endian integer. If the
362 calculated value does not match the stored one, the decoder
363 MUST indicate an error.
365 The idea is that Stream Flags would always be two bytes, even
366 if new features are needed. This way old decoders will be able
367 to verify the CRC32 calculated from Stream Flags, and thus
368 distinguish between corrupt files (CRC32 doesn't match) and
369 files that the decoder doesn't support (CRC32 matches but
370 Stream Flags has reserved bits set).
375 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
376 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
377 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
382 The CRC32 is calculated from the Backward Size and Stream Flags
383 fields. It is stored as an unsigned 32-bit little endian
384 integer. If the calculated value does not match the stored one,
385 the decoder MUST indicate an error.
387 The reason to have the CRC32 field before the Backward Size and
388 Stream Flags fields is to keep the four-byte fields aligned to
389 a multiple of four bytes.
392 2.1.2.2. Backward Size
394 Backward Size is stored as a 32-bit little endian integer,
395 which indicates the size of the Index field as multiple of
396 four bytes, minimum value being four bytes:
398 real_backward_size = (stored_backward_size + 1) * 4;
400 If the stored value does not match the real size of the Index
401 field, the decoder MUST indicate an error.
403 Using a fixed-size integer to store Backward Size makes
404 it slightly simpler to parse the Stream Footer when the
405 application needs to parse the Stream backwards.
408 2.1.2.3. Stream Flags
410 This is a copy of the Stream Flags field from the Stream
411 Header. The information stored to Stream Flags is needed
412 when parsing the Stream backwards. The decoder MUST compare
413 the Stream Flags fields in both Stream Header and Stream
414 Footer, and indicate an error if they are not identical.
417 2.1.2.4. Footer Magic Bytes
419 As the last step of the decoding process, the decoder MUST
420 verify the existence of Footer Magic Bytes. If they don't
421 match, an error MUST be indicated.
423 Using a C array and ASCII:
424 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
429 The primary reason to have Footer Magic Bytes is to make
430 it easier to detect incomplete files quickly, without
431 uncompressing. If the file does not end with Footer Magic Bytes
432 (excluding Stream Padding described in Section 2.2), it cannot
433 be undamaged, unless someone has intentionally appended garbage
434 after the end of the Stream.
439 Only the decoders that support decoding of concatenated Streams
440 MUST support Stream Padding.
442 Stream Padding MUST contain only null bytes. To preserve the
443 four-byte alignment of consecutive Streams, the size of Stream
444 Padding MUST be a multiple of four bytes. Empty Stream Padding
445 is allowed. If these requirements are not met, the decoder MUST
448 Note that non-empty Stream Padding is allowed at the end of the
449 file; there doesn't need to be a new Stream after non-empty
450 Stream Padding. This can be convenient in certain situations
453 The possibility of Stream Padding MUST be taken into account
454 when designing an application that parses Streams backwards,
455 and the application supports concatenated Streams.
460 +==============+=================+===============+=======+
461 | Block Header | Compressed Data | Block Padding | Check |
462 +==============+=================+===============+=======+
467 +-------------------+-------------+=================+
468 | Block Header Size | Block Flags | Compressed Size |
469 +-------------------+-------------+=================+
471 +===================+======================+
472 ---> | Uncompressed Size | List of Filter Flags |
473 +===================+======================+
475 +================+--+--+--+--+
476 ---> | Header Padding | CRC32 |
477 +================+--+--+--+--+
480 3.1.1. Block Header Size
482 This field overlaps with the Index Indicator field (see
485 This field contains the size of the Block Header field,
486 including the Block Header Size field itself. Valid values are
487 in the range [0x01, 0xFF], which indicate the size of the Block
488 Header as multiples of four bytes, minimum size being eight
491 real_header_size = (encoded_header_size + 1) * 4;
493 If bigger Block Header is needed in future, a new field can be
494 added between the current Block Header and Compressed Data
495 fields. The presence of this new field would be indicated in
501 The first byte of the Block Flags field is a bit field:
503 Bit(s) Mask Description
504 0-1 0x03 Number of filters (1-4)
505 2-5 0x3C Reserved for future use; MUST be zero for now.
506 6 0x40 The Compressed Size field is present.
507 7 0x80 The Uncompressed Size field is present.
509 If any reserved bit is set, the decoder MUST indicate an error.
510 It is possible that there is a new field present which the
511 decoder is not aware of, and can thus parse the Block Header
515 3.1.3. Compressed Size
517 This field is present only if the appropriate bit is set in
518 the Block Flags field (see Section 3.1.2).
520 The Compressed Size field contains the size of the Compressed
521 Data field, which MUST be non-zero. Compressed Size is stored
522 using the encoding described in Section 1.2. If the Compressed
523 Size doesn't match the size of the Compressed Data field, the
524 decoder MUST indicate an error.
527 3.1.4. Uncompressed Size
529 This field is present only if the appropriate bit is set in
530 the Block Flags field (see Section 3.1.2).
532 The Uncompressed Size field contains the size of the Block
533 after uncompressing. Uncompressed Size is stored using the
534 encoding described in Section 1.2. If the Uncompressed Size
535 does not match the real uncompressed size, the decoder MUST
538 Storing the Compressed Size and Uncompressed Size fields serves
540 - The decoder knows how much memory it needs to allocate
541 for a temporary buffer in multithreaded mode.
542 - Simple error detection: wrong size indicates a broken file.
543 - Seeking forwards to a specific location in streamed mode.
545 It should be noted that the only reliable way to determine
546 the real uncompressed size is to uncompress the Block,
547 because the Block Header and Index fields may contain
548 (intentionally or unintentionally) invalid information.
551 3.1.5. List of Filter Flags
553 +================+================+ +================+
554 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
555 +================+================+ +================+
557 The number of Filter Flags fields is stored in the Block Flags
558 field (see Section 3.1.2).
560 The format of each Filter Flags field is as follows:
562 +===========+====================+===================+
563 | Filter ID | Size of Properties | Filter Properties |
564 +===========+====================+===================+
566 Both Filter ID and Size of Properties are stored using the
567 encoding described in Section 1.2. Size of Properties indicates
568 the size of the Filter Properties field as bytes. The list of
569 officially defined Filter IDs and the formats of their Filter
570 Properties are described in Section 5.3.
572 Filter IDs greater than or equal to 0x4000_0000_0000_0000
573 (2^62) are reserved for implementation-specific internal use.
574 These Filter IDs MUST never be used in List of Filter Flags.
577 3.1.6. Header Padding
579 This field contains as many null byte as it is needed to make
580 the Block Header have the size specified in Block Header Size.
581 If any of the bytes are not null bytes, the decoder MUST
582 indicate an error. It is possible that there is a new field
583 present which the decoder is not aware of, and can thus parse
584 the Block Header incorrectly.
589 The CRC32 is calculated over everything in the Block Header
590 field except the CRC32 field itself. It is stored as an
591 unsigned 32-bit little endian integer. If the calculated
592 value does not match the stored one, the decoder MUST indicate
595 By verifying the CRC32 of the Block Header before parsing the
596 actual contents allows the decoder to distinguish between
597 corrupt and unsupported files.
602 The format of Compressed Data depends on Block Flags and List
603 of Filter Flags. Excluding the descriptions of the simplest
604 filters in Section 5.3, the format of the filter-specific
605 encoded data is out of scope of this document.
610 Block Padding MUST contain 0-3 null bytes to make the size of
611 the Block a multiple of four bytes. This can be needed when
612 the size of Compressed Data is not a multiple of four. If any
613 of the bytes in Block Padding are not null bytes, the decoder
614 MUST indicate an error.
619 The type and size of the Check field depends on which bits
620 are set in the Stream Flags field (see Section 2.1.1.2).
622 The Check, when used, is calculated from the original
623 uncompressed data. If the calculated Check does not match the
624 stored one, the decoder MUST indicate an error. If the selected
625 type of Check is not supported by the decoder, it SHOULD
626 indicate a warning or error.
631 +-----------------+===================+
632 | Index Indicator | Number of Records |
633 +-----------------+===================+
635 +=================+===============+-+-+-+-+
636 ---> | List of Records | Index Padding | CRC32 |
637 +=================+===============+-+-+-+-+
639 Index serves several purposes. Using it, one can
640 - verify that all Blocks in a Stream have been processed;
641 - find out the uncompressed size of a Stream; and
642 - quickly access the beginning of any Block (random access).
647 This field overlaps with the Block Header Size field (see
648 Section 3.1.1). The value of Index Indicator is always 0x00.
651 4.2. Number of Records
653 This field indicates how many Records there are in the List
654 of Records field, and thus how many Blocks there are in the
655 Stream. The value is stored using the encoding described in
656 Section 1.2. If the decoder has decoded all the Blocks of the
657 Stream, and then notices that the Number of Records doesn't
658 match the real number of Blocks, the decoder MUST indicate an
664 List of Records consists of as many Records as indicated by the
665 Number of Records field:
668 | Record | Record | ...
671 Each Record contains information about one Block:
673 +===============+===================+
674 | Unpadded Size | Uncompressed Size |
675 +===============+===================+
677 If the decoder has decoded all the Blocks of the Stream, it
678 MUST verify that the contents of the Records match the real
679 Unpadded Size and Uncompressed Size of the respective Blocks.
681 Implementation hint: It is possible to verify the Index with
682 constant memory usage by calculating for example SHA-256 of
683 both the real size values and the List of Records, then
684 comparing the hash values. Implementing this using
685 non-cryptographic hash like CRC32 SHOULD be avoided unless
686 small code size is important.
688 If the decoder supports random-access reading, it MUST verify
689 that Unpadded Size and Uncompressed Size of every completely
690 decoded Block match the sizes stored in the Index. If only
691 partial Block is decoded, the decoder MUST verify that the
692 processed sizes don't exceed the sizes stored in the Index.
697 This field indicates the size of the Block excluding the Block
698 Padding field. That is, Unpadded Size is the size of the Block
699 Header, Compressed Data, and Check fields. Unpadded Size is
700 stored using the encoding described in Section 1.2. The value
701 MUST never be zero; with the current structure of Blocks, the
702 actual minimum value for Unpadded Size is five.
704 Implementation note: Because the size of the Block Padding
705 field is not included in Unpadded Size, calculating the total
706 size of a Stream or doing random-access reading requires
707 calculating the actual size of the Blocks by rounding Unpadded
708 Sizes up to the next multiple of four.
710 The reason to exclude Block Padding from Unpadded Size is to
711 ease making a raw copy of Compressed Data without Block
712 Padding. This can be useful, for example, if someone wants
713 to convert Streams to some other file format quickly.
716 4.3.2. Uncompressed Size
718 This field indicates the Uncompressed Size of the respective
719 Block as bytes. The value is stored using the encoding
720 described in Section 1.2.
725 This field MUST contain 0-3 null bytes to pad the Index to
726 a multiple of four bytes. If any of the bytes are not null
727 bytes, the decoder MUST indicate an error.
732 The CRC32 is calculated over everything in the Index field
733 except the CRC32 field itself. The CRC32 is stored as an
734 unsigned 32-bit little endian integer. If the calculated
735 value does not match the stored one, the decoder MUST indicate
741 The Block Flags field defines how many filters are used. When
742 more than one filter is used, the filters are chained; that is,
743 the output of one filter is the input of another filter. The
744 following figure illustrates the direction of data flow.
746 v Uncompressed Data ^
748 Encoder | Filter 1 | Decoder
755 Alignment of uncompressed input data is usually the job of
756 the application producing the data. For example, to get the
757 best results, an archiver tool should make sure that all
758 PowerPC executable files in the archive stream start at
759 offsets that are multiples of four bytes.
761 Some filters, for example LZMA2, can be configured to take
762 advantage of specified alignment of input data. Note that
763 taking advantage of aligned input can be benefical also when
764 a filter is not the first filter in the chain. For example,
765 if you compress PowerPC executables, you may want to use the
766 PowerPC filter and chain that with the LZMA2 filter. Because
767 not only the input but also the output alignment of the PowerPC
768 filter is four bytes, it is now benefical to set LZMA2 settings
769 so that the LZMA2 encoder can take advantage of its
770 four-byte-aligned input data.
772 The output of the last filter in the chain is stored to the
773 Compressed Data field, which is is guaranteed to be aligned
774 to a multiple of four bytes relative to the beginning of the
775 Stream. This can increase
776 - speed, if the filtered data is handled multiple bytes at
777 a time by the filter-specific encoder and decoder,
778 because accessing aligned data in computer memory is
780 - compression ratio, if the output data is later compressed
781 with an external compression tool.
786 If filters would be allowed to be chained freely, it would be
787 possible to create malicious files, that would be very slow to
788 decode. Such files could be used to create denial of service
791 Slow files could occur when multiple filters are chained:
793 v Compressed input data
794 | Filter 1 decoder (last filter)
795 | Filter 0 decoder (non-last filter)
796 v Uncompressed output data
798 The decoder of the last filter in the chain produces a lot of
799 output from little input. Another filter in the chain takes the
800 output of the last filter, and produces very little output
801 while consuming a lot of input. As a result, a lot of data is
802 moved inside the filter chain, but the filter chain as a whole
803 gets very little work done.
805 To prevent this kind of slow files, there are restrictions on
806 how the filters can be chained. These restrictions MUST be
807 taken into account when designing new filters.
809 The maximum number of filters in the chain has been limited to
810 four, thus there can be at maximum of three non-last filters.
811 Of these three non-last filters, only two are allowed to change
812 the size of the data.
814 The non-last filters, that change the size of the data, MUST
815 have a limit how much the decoder can compress the data: the
816 decoder SHOULD produce at least n bytes of output when the
817 filter is given 2n bytes of input. This limit is not
818 absolute, but significant deviations MUST be avoided.
820 The above limitations guarantee that if the last filter in the
821 chain produces 4n bytes of output, the chain as a whole will
822 produce at least n bytes of output.
829 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
830 compression algorithm with high compression ratio and fast
831 decompression. LZMA is based on LZ77 and range coding
834 LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
835 LZMA internally, but adds support for flushing the encoder,
836 uncompressed chunks, eases stateful decoder implementations,
837 and improves support for multithreading. Thus, the plain LZMA
838 will not be supported in this file format.
841 Size of Filter Properties: 1 byte
842 Changes size of data: Yes
843 Allow as a non-last filter: No
844 Allow as the last filter: Yes
847 Input data: Adjustable to 1/2/4/8/16 byte(s)
850 The format of the one-byte Filter Properties field is as
853 Bits Mask Description
854 0-5 0x3F Dictionary Size
855 6-7 0xC0 Reserved for future use; MUST be zero for now.
857 Dictionary Size is encoded with one-bit mantissa and five-bit
858 exponent. The smallest dictionary size is 4 KiB and the biggest
861 Raw value Mantissa Exponent Dictionary size
875 40 2 31 4096 MiB - 1 B
877 Instead of having a table in the decoder, the dictionary size
878 can be decoded using the following C code:
880 const uint8_t bits = get_dictionary_flags() & 0x3F;
882 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
884 uint32_t dictionary_size;
886 dictionary_size = UINT32_MAX;
888 dictionary_size = 2 | (bits & 1);
889 dictionary_size <<= bits / 2 + 11;
893 5.3.2. Branch/Call/Jump Filters for Executables
895 These filters convert relative branch, call, and jump
896 instructions to their absolute counterparts in executable
897 files. This conversion increases redundancy and thus
900 Size of Filter Properties: 0 or 4 bytes
901 Changes size of data: No
902 Allow as a non-last filter: Yes
903 Allow as the last filter: No
905 Below is the list of filters in this category. The alignment
906 is the same for both input and output data.
908 Filter ID Alignment Description
909 0x04 1 byte x86 filter (BCJ)
910 0x05 4 bytes PowerPC (big endian) filter
911 0x06 16 bytes IA64 filter
912 0x07 4 bytes ARM (little endian) filter
913 0x08 2 bytes ARM Thumb (little endian) filter
914 0x09 4 bytes SPARC filter
916 If the size of Filter Properties is four bytes, the Filter
917 Properties field contains the start offset used for address
918 conversions. It is stored as an unsigned 32-bit little endian
919 integer. The start offset MUST be a multiple of the alignment
920 of the filter as listed in the table above; if it isn't, the
921 decoder MUST indicate an error. If the size of Filter
922 Properties is zero, the start offset is zero.
924 Setting the start offset may be useful if an executable has
925 multiple sections, and there are many cross-section calls.
926 Taking advantage of this feature usually requires usage of
927 the Subblock filter, whose design is not complete yet.
932 The Delta filter may increase compression ratio when the value
933 of the next byte correlates with the value of an earlier byte
934 at specified distance.
937 Size of Filter Properties: 1 byte
938 Changes size of data: No
939 Allow as a non-last filter: Yes
940 Allow as the last filter: No
944 Output data: Same as the original input data
946 The Properties byte indicates the delta distance, which can be
947 1-256 bytes backwards from the current byte: 0x00 indicates
948 distance of 1 byte and 0xFF distance of 256 bytes.
951 5.3.3.1. Format of the Encoded Output
953 The code below illustrates both encoding and decoding with
956 // Distance is in the range [1, 256].
957 const unsigned int distance = get_properties_byte() + 1;
961 memset(delta, 0, sizeof(delta));
964 const int byte = read_byte();
968 uint8_t tmp = delta[(uint8_t)(distance + pos)];
970 tmp = (uint8_t)(byte) - tmp;
971 delta[pos] = (uint8_t)(byte);
973 tmp = (uint8_t)(byte) + tmp;
982 5.4. Custom Filter IDs
984 If a developer wants to use custom Filter IDs, he has two
985 choices. The first choice is to contact Lasse Collin and ask
986 him to allocate a range of IDs for the developer.
988 The second choice is to generate a 40-bit random integer,
989 which the developer can use as his personal Developer ID.
990 To minimalize the risk of collisions, Developer ID has to be
991 a randomly generated integer, not manually selected "hex word".
992 The following command, which works on many free operating
993 systems, can be used to generate Developer ID:
995 dd if=/dev/urandom bs=5 count=1 | hexdump
997 The developer can then use his Developer ID to create unique
998 (well, hopefully unique) Filter IDs.
1000 Bits Mask Description
1001 0-15 0x0000_0000_0000_FFFF Filter ID
1002 16-55 0x00FF_FFFF_FFFF_0000 Developer ID
1003 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F
1005 The resulting 63-bit integer will use 9 bytes of space when
1006 stored using the encoding described in Section 1.2. To get
1007 a shorter ID, see the beginning of this Section how to
1008 request a custom ID range.
1011 5.4.1. Reserved Custom Filter ID Ranges
1014 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility
1015 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
1016 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility
1019 6. Cyclic Redundancy Checks
1021 There are several incompatible variations to calculate CRC32
1022 and CRC64. For simplicity and clarity, complete examples are
1023 provided to calculate the checks as they are used in this file
1024 format. Implementations MAY use different code as long as it
1025 gives identical results.
1027 The program below reads data from standard input, calculates
1028 the CRC32 and CRC64 values, and prints the calculated values
1029 as big endian hexadecimal strings to standard output.
1032 #include <inttypes.h>
1035 uint32_t crc32_table[256];
1036 uint64_t crc64_table[256];
1041 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1042 static const uint64_t poly64
1043 = UINT64_C(0xC96C5795D7870F42);
1045 for (size_t i = 0; i < 256; ++i) {
1049 for (size_t j = 0; j < 8; ++j) {
1051 crc32 = (crc32 >> 1) ^ poly32;
1056 crc64 = (crc64 >> 1) ^ poly64;
1061 crc32_table[i] = crc32;
1062 crc64_table[i] = crc64;
1067 crc32(const uint8_t *buf, size_t size, uint32_t crc)
1070 for (size_t i = 0; i < size; ++i)
1071 crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1077 crc64(const uint8_t *buf, size_t size, uint64_t crc)
1080 for (size_t i = 0; i < size; ++i)
1081 crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1091 uint32_t value32 = 0;
1092 uint64_t value64 = 0;
1093 uint64_t total_size = 0;
1097 const size_t buf_size
1098 = fread(buf, 1, sizeof(buf), stdin);
1102 total_size += buf_size;
1103 value32 = crc32(buf, buf_size, value32);
1104 value64 = crc64(buf, buf_size, value64);
1107 printf("Bytes: %" PRIu64 "\n", total_size);
1108 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1109 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1117 LZMA SDK - The original LZMA implementation
1118 http://7-zip.org/sdk.html
1120 LZMA Utils - LZMA adapted to POSIX-like systems
1121 http://tukaani.org/lzma/
1123 XZ Utils - The next generation of LZMA Utils
1124 http://tukaani.org/xz/
1127 GZIP file format specification version 4.3
1128 http://www.ietf.org/rfc/rfc1952.txt
1129 - Notation of byte boxes in section "2.1. Overall conventions"
1132 Key words for use in RFCs to Indicate Requirement Levels
1133 http://www.ietf.org/rfc/rfc2119.txt
1137 http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1138 - Node 9.4.2 "Blocking Factor", paragraph that begins
1139 "gzip will complain about trailing garbage"
1140 - Note that this URL points to the latest version of the
1141 manual, and may some day not contain the note which is in
1142 1.21. For the exact version of the manual, download GNU
1143 tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz