0.1. Notices and Acknowledgements
1.1. Byte and Its Representation
1.2. Multibyte Integers
2. Overall Structure of .xz File
2.1.1.1. Header Magic Bytes
2.1.2.2. Backward Size
2.1.2.4. Footer Magic Bytes
3.1.1. Block Header Size
3.1.3. Compressed Size
3.1.4. Uncompressed Size
3.1.5. List of Filter Flags
4.2. Number of Records
4.3.2. Uncompressed Size
5.3.2. Branch/Call/Jump Filters for Executables
5.3.3.1. Format of the Encoded Output
5.4. Custom Filter IDs
5.4.1. Reserved Custom Filter ID Ranges
6. Cyclic Redundancy Checks
This document describes the .xz file format (filename suffix
".xz", MIME type "application/x-xz"). It is intended that
this format replace the old .lzma format used by LZMA SDK and
64 IMPORTANT: The version described in this document is a
65 draft, NOT a final, official version. Changes
69 0.1. Notices and Acknowledgements
71 This file format was designed by Lasse Collin
72 <lasse.collin@tukaani.org> and Igor Pavlov.
Special thanks for helping with this document go to Ville
Koskinen. Thanks for helping with this document also go to
Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
78 This document has been put into the public domain.
83 Last modified: 2008-12-05 12:45+0200
85 (A changelog will be kept once the first official version
91 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
92 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
93 document are to be interpreted as described in [RFC-2119].
95 Indicating a warning means displaying a message, returning
96 appropriate exit status, or doing something else to let the
97 user know that something worth warning occurred. The operation
98 SHOULD still finish if a warning is indicated.
100 Indicating an error means displaying a message, returning
101 appropriate exit status, or doing something else to let the
102 user know that something prevented successfully finishing the
103 operation. The operation MUST be aborted once an error has
107 1.1. Byte and Its Representation
In this document, a byte is always 8 bits.
111 A "null byte" has all bits unset. That is, the value of a null
114 To represent byte blocks, this document uses notation that
115 is similar to the notation used in [RFC-1952]:
122 | Foo | Two bytes; that is, some of the vertical bars
123 +---+---+ can be missing.
126 | Foo | Zero or more bytes.
129 In this document, a boxed byte or a byte sequence declared
130 using this notation is called "a field". The example field
131 above would be called "the Foo field" or plain "Foo".
133 If there are many fields, they may be split to multiple lines.
134 This is indicated with an arrow ("--->"):
144 The above is equivalent to this:
151 1.2. Multibyte Integers
153 Multibyte integers of static length, such as CRC values,
154 are stored in little endian byte order (least significant
157 When smaller values are more likely than bigger values (for
158 example file sizes), multibyte integers are encoded in a
159 variable-length representation:
160 - Numbers in the range [0, 127] are copied as is, and take
162 - Bigger numbers will occupy two or more bytes. All but the
163 last byte of the multibyte representation have the highest
For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in the future if needed.
170 The following C code illustrates encoding and decoding of
171 variable-length integers. The functions return the number of
172 bytes occupied by the integer (1-9), or zero on error.
174 #include <sys/types.h>
175 #include <inttypes.h>
178 encode(uint8_t buf[static 9], uint64_t num)
180 if (num > UINT64_MAX / 2)
185 while (num >= 0x80) {
186 buf[i++] = (uint8_t)(num) | 0x80;
190 buf[i++] = (uint8_t)(num);
196 decode(const uint8_t buf[], size_t size_max, uint64_t *num)
204 *num = buf[0] & 0x7F;
207 while (buf[i++] & 0x80) {
208 if (i >= size_max || buf[i] == 0x00)
211 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
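The fragments above omit several lines. One way to fill them in is sketched below; it is an illustration rather than the normative text. The names vli_encode and vli_decode are chosen here and do not appear in this document. The decoder rejects input that ends too early and encodings whose last byte is 0x00 (which would not be the shortest form).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode num into buf. Returns the number of bytes written (1-9),
 * or 0 if num does not fit in 63 bits. */
size_t
vli_encode(uint8_t buf[static 9], uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;

    size_t i = 0;
    while (num >= 0x80) {
        /* Low seven bits, with the highest bit set to mark "more". */
        buf[i++] = (uint8_t)(num) | 0x80;
        num >>= 7;
    }
    buf[i++] = (uint8_t)(num);
    return i;
}

/* Decode an integer from at most size_max bytes of buf. Returns the
 * number of bytes consumed (1-9), or 0 on error. */
size_t
vli_decode(const uint8_t buf[], size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;
    if (size_max > 9)
        size_max = 9;

    *num = buf[0] & 0x7F;
    size_t i = 0;
    while (buf[i++] & 0x80) {
        /* Truncated input, or a trailing 0x00 byte. */
        if (i >= size_max || buf[i] == 0x00)
            return 0;
        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }
    return i;
}
```

For example, the value 300 encodes to the two bytes 0xAC 0x02.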
218 2. Overall Structure of .xz File
A standalone .xz file consists of one or more Streams, which may
have Stream Padding between or after them:
223 +========+================+========+================+
224 | Stream | Stream Padding | Stream | Stream Padding | ...
225 +========+================+========+================+
227 While a typical file contains only one Stream and no Stream
228 Padding, a decoder handling standalone .xz files SHOULD support
229 files that have more than one Stream or Stream Padding.
231 In contrast to standalone .xz files, when the .xz file format
232 is used as an internal part of some other file format or
233 communication protocol, it usually is expected that the decoder
234 stops after the first Stream, and doesn't look for Stream
235 Padding or possibly other Streams.
240 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
241 | Stream Header | Block | Block | ... | Block |
242 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
244 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
245 ---> | Index | Stream Footer |
246 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
248 All the above fields have a size that is a multiple of four. If
249 Stream is used as an internal part of another file format, it
250 is RECOMMENDED to make the Stream start at an offset that is
251 a multiple of four bytes.
253 Stream Header, Index, and Stream Footer are always present in
254 a Stream. The maximum size of the Index field is 16 GiB (2^34).
256 There are zero or more Blocks. The maximum number of Blocks is
257 limited only by the maximum size of the Index field.
259 Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
260 The same limit applies to the total amount of uncompressed
261 data stored in a Stream.
If an implementation supports handling .xz files with multiple
concatenated Streams, it MAY apply the above limits to the file
as a whole instead of applying them to each Stream separately.
270 +---+---+---+---+---+---+-------+------+--+--+--+--+
271 | Header Magic Bytes | Stream Flags | CRC32 |
272 +---+---+---+---+---+---+-------+------+--+--+--+--+
275 2.1.1.1. Header Magic Bytes
The first six (6) bytes of the Stream are the so-called Header
Magic Bytes. They can be used to identify the file type.
280 Using a C array and ASCII:
281 const uint8_t HEADER_MAGIC[6]
282 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
284 In plain hexadecimal:
288 - The first byte (0xFD) was chosen so that the files cannot
289 be erroneously detected as being in .lzma format, in which
290 the first byte is in the range [0x00, 0xE0].
291 - The sixth byte (0x00) was chosen to prevent applications
292 from misdetecting the file as a text file.
294 If the Header Magic Bytes don't match, the decoder MUST
298 2.1.1.2. Stream Flags
300 The first byte of Stream Flags is always a null byte. In future
301 this byte may be used to indicate new Stream version or other
304 The second byte of Stream Flags is a bit field:
306 Bit(s) Mask Description
307 0-3 0x0F Type of Check (see Section 3.4):
311 0x02 4 bytes (Reserved)
312 0x03 4 bytes (Reserved)
314 0x05 8 bytes (Reserved)
315 0x06 8 bytes (Reserved)
316 0x07 16 bytes (Reserved)
317 0x08 16 bytes (Reserved)
318 0x09 16 bytes (Reserved)
319 0x0A 32 bytes SHA-256
320 0x0B 32 bytes (Reserved)
321 0x0C 32 bytes (Reserved)
322 0x0D 64 bytes (Reserved)
323 0x0E 64 bytes (Reserved)
324 0x0F 64 bytes (Reserved)
325 4-7 0xF0 Reserved for future use; MUST be zero for now.
327 Implementations SHOULD support at least the Check IDs 0x00
328 (None) and 0x01 (CRC32). Supporting other Check IDs is
329 OPTIONAL. If an unsupported Check is used, the decoder SHOULD
330 indicate a warning or error.
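The Check sizes in the table follow a regular pattern, so a decoder does not need a lookup table. The sketch below is illustrative only. The function names are hypothetical, and the sizes for IDs 0x00 (None, assumed to be 0 bytes), 0x01, and 0x04 are assumptions, because those rows are not shown above.

```c
#include <stdint.h>

/* Size of the Check field in bytes for a given Check ID, or -1 if the
 * ID does not fit in four bits. IDs 0x01-0x03 use 4 bytes, 0x04-0x06
 * use 8, 0x07-0x09 use 16, 0x0A-0x0C use 32, and 0x0D-0x0F use 64. */
int
check_size(uint8_t check_id)
{
    if (check_id > 0x0F)
        return -1;
    if (check_id == 0x00)
        return 0; /* Assumed: no Check field at all. */
    return 4 << ((check_id - 1) / 3);
}

/* Extract the Check ID from the two-byte Stream Flags field.
 * Returns -1 if the first byte is not null or a reserved bit is set. */
int
stream_flags_check_id(const uint8_t flags[2])
{
    if (flags[0] != 0x00 || (flags[1] & 0xF0) != 0)
        return -1;
    return flags[1] & 0x0F;
}
```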
If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Stream Header
incorrectly.
340 The CRC32 is calculated from the Stream Flags field. It is
341 stored as an unsigned 32-bit little endian integer. If the
342 calculated value does not match the stored one, the decoder
343 MUST indicate an error.
345 The idea is that Stream Flags would always be two bytes, even
346 if new features are needed. This way old decoders will be able
347 to verify the CRC32 calculated from Stream Flags, and thus
348 distinguish between corrupt files (CRC32 doesn't match) and
349 files that the decoder doesn't support (CRC32 matches but
350 Stream Flags has reserved bits set).
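The distinction between corrupt and unsupported files can be sketched as follows. This is a minimal illustration, not a reference implementation. The enum and function names are hypothetical, and the CRC32 helper assumes the common convention of inverting the value before and after the calculation, with the polynomial used in Section 6.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum header_status {
    HEADER_OK,
    HEADER_NOT_XZ,      /* Header Magic Bytes do not match */
    HEADER_CORRUPT,     /* CRC32 does not match */
    HEADER_UNSUPPORTED  /* CRC32 matches but reserved bits are set */
};

/* Bit-by-bit CRC32; slow but short. */
uint32_t
header_crc32(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return ~crc;
}

/* Validate the 12-byte Stream Header. */
enum header_status
check_stream_header(const uint8_t hdr[12])
{
    static const uint8_t magic[6] = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
    if (memcmp(hdr, magic, sizeof(magic)) != 0)
        return HEADER_NOT_XZ;

    /* The stored CRC32 is little endian and covers Stream Flags only. */
    const uint32_t stored = (uint32_t)hdr[8]
            | ((uint32_t)hdr[9] << 8)
            | ((uint32_t)hdr[10] << 16)
            | ((uint32_t)hdr[11] << 24);
    if (header_crc32(hdr + 6, 2) != stored)
        return HEADER_CORRUPT;

    /* Only now is it safe to interpret the flags. */
    if (hdr[6] != 0x00 || (hdr[7] & 0xF0) != 0)
        return HEADER_UNSUPPORTED;

    return HEADER_OK;
}
```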
355 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
356 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
357 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
362 The CRC32 is calculated from the Backward Size and Stream Flags
363 fields. It is stored as an unsigned 32-bit little endian
364 integer. If the calculated value does not match the stored one,
365 the decoder MUST indicate an error.
367 The reason to have the CRC32 field before the Backward Size and
368 Stream Flags fields is to keep the four-byte fields aligned to
369 a multiple of four bytes.
372 2.1.2.2. Backward Size
Backward Size is stored as a 32-bit little endian integer,
which indicates the size of the Index field as a multiple of
four bytes, the minimum value being four bytes:
378 real_backward_size = (stored_backward_size + 1) * 4;
380 If the stored value does not match the real size of the Index
381 field, the decoder MUST indicate an error.
383 Using a fixed-size integer to store Backward Size makes
384 it slightly simpler to parse the Stream Footer when the
385 application needs to parse the Stream backwards.
388 2.1.2.3. Stream Flags
390 This is a copy of the Stream Flags field from the Stream
391 Header. The information stored to Stream Flags is needed
392 when parsing the Stream backwards. The decoder MUST compare
393 the Stream Flags fields in both Stream Header and Stream
394 Footer, and indicate an error if they are not identical.
397 2.1.2.4. Footer Magic Bytes
399 As the last step of the decoding process, the decoder MUST
400 verify the existence of Footer Magic Bytes. If they don't
401 match, an error MUST be indicated.
403 Using a C array and ASCII:
404 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
409 The primary reason to have Footer Magic Bytes is to make
410 it easier to detect incomplete files quickly, without
411 uncompressing. If the file does not end with Footer Magic Bytes
412 (excluding Stream Padding described in Section 2.2), it cannot
413 be undamaged, unless someone has intentionally appended garbage
414 after the end of the Stream.
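Parsing the 12-byte Stream Footer could look like the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, and the CRC32 helper assumes the common convention of inverting the value before and after the calculation. The caller is still expected to compare the returned Stream Flags with those of the Stream Header.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC32; slow but short. */
uint32_t
footer_crc32(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return ~crc;
}

/* Read an unsigned 32-bit little endian integer. */
uint32_t
read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parse the Stream Footer. On success, returns 0 and stores the real
 * size of the Index field and a copy of Stream Flags. Returns -1 if
 * the Footer Magic Bytes or the CRC32 do not match. */
int
parse_stream_footer(const uint8_t footer[12], uint64_t *index_size,
        uint8_t flags[2])
{
    if (footer[10] != 'Y' || footer[11] != 'Z')
        return -1;

    /* CRC32 covers Backward Size (4 bytes) and Stream Flags (2 bytes). */
    if (footer_crc32(footer + 4, 6) != read_le32(footer))
        return -1;

    /* real_backward_size = (stored_backward_size + 1) * 4 */
    *index_size = ((uint64_t)read_le32(footer + 4) + 1) * 4;
    flags[0] = footer[8];
    flags[1] = footer[9];
    return 0;
}
```

An application that reads a Stream backwards can use the returned Index size to locate the start of the Index field, which ends immediately before the Stream Footer.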
419 Only the decoders that support decoding of concatenated Streams
420 MUST support Stream Padding.
422 Stream Padding MUST contain only null bytes. To preserve the
423 four-byte alignment of consecutive Streams, the size of Stream
424 Padding MUST be a multiple of four bytes. Empty Stream Padding
427 Note that non-empty Stream Padding is allowed at the end of the
428 file; there doesn't need to be a new Stream after non-empty
429 Stream Padding. This can be convenient in certain situations
432 The possibility of Padding MUST be taken into account when
433 designing an application that parses Streams backwards, and
434 the application supports concatenated Streams.
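Because the Stream Footer ends in the non-null Footer Magic Bytes, a backward parser can find the end of the last Stream by skipping trailing null bytes. The helper below is a hypothetical sketch of that step; its name and its error convention (returning SIZE_MAX) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the null bytes at the end of buf. Returns the size of the
 * trailing Stream Padding, or SIZE_MAX if that size is not a multiple
 * of four bytes. */
size_t
trailing_stream_padding(const uint8_t *buf, size_t size)
{
    size_t padding = 0;
    while (padding < size && buf[size - 1 - padding] == 0x00)
        ++padding;

    return (padding % 4 == 0) ? padding : SIZE_MAX;
}
```

The Stream Footer of the last Stream then occupies the 12 bytes that end at offset size - padding.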
439 +==============+=================+===============+=======+
440 | Block Header | Compressed Data | Block Padding | Check |
441 +==============+=================+===============+=======+
446 +-------------------+-------------+=================+
447 | Block Header Size | Block Flags | Compressed Size |
448 +-------------------+-------------+=================+
450 +===================+======================+
451 ---> | Uncompressed Size | List of Filter Flags |
452 +===================+======================+
454 +================+--+--+--+--+
455 ---> | Header Padding | CRC32 |
456 +================+--+--+--+--+
459 3.1.1. Block Header Size
461 This field overlaps with the Index Indicator field (see
464 This field contains the size of the Block Header field,
465 including the Block Header Size field itself. Valid values are
466 in the range [0x01, 0xFF], which indicate the size of the Block
467 Header as multiples of four bytes, minimum size being eight
470 real_header_size = (encoded_header_size + 1) * 4;
If a bigger Block Header is needed in the future, a new field can be
added between the current Block Header and Compressed Data
fields. The presence of this new field would be indicated in
480 The first byte of the Block Flags field is a bit field:
482 Bit(s) Mask Description
483 0-1 0x03 Number of filters (1-4)
484 2-5 0x3C Reserved for future use; MUST be zero for now.
485 6 0x40 The Compressed Size field is present.
486 7 0x80 The Uncompressed Size field is present.
If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Block Header
incorrectly.
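Decoding the first two bytes of the Block Header might be sketched as follows. The struct and function names are hypothetical. The sketch assumes that the two-bit filter count stores the number of filters minus one, which is how the range 1-4 fits into two bits.

```c
#include <stddef.h>
#include <stdint.h>

struct block_flags {
    unsigned int filter_count;  /* 1-4 */
    int has_compressed_size;
    int has_uncompressed_size;
};

/* real_header_size = (encoded_header_size + 1) * 4. An encoded value
 * of 0x00 is not a Block Header Size; it is the Index Indicator. */
size_t
block_header_real_size(uint8_t encoded_header_size)
{
    return ((size_t)encoded_header_size + 1) * 4;
}

/* Returns 0 on success, or -1 if a reserved bit is set. */
int
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    if (flags & 0x3C)
        return -1;

    out->filter_count = (flags & 0x03) + 1U;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return 0;
}
```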
494 3.1.3. Compressed Size
496 This field is present only if the appropriate bit is set in
497 the Block Flags field (see Section 3.1.2).
499 The Compressed Size field contains the size of the Compressed
500 Data field, which MUST be non-zero. Compressed Size is stored
501 using the encoding described in Section 1.2. If the Compressed
502 Size doesn't match the size of the Compressed Data field, the
503 decoder MUST indicate an error.
506 3.1.4. Uncompressed Size
508 This field is present only if the appropriate bit is set in
509 the Block Flags field (see Section 3.1.2).
511 The Uncompressed Size field contains the size of the Block
512 after uncompressing. Uncompressed Size is stored using the
513 encoding described in Section 1.2. If the Uncompressed Size
514 does not match the real uncompressed size, the decoder MUST
517 Storing the Compressed Size and Uncompressed Size fields serves
519 - The decoder knows how much memory it needs to allocate
520 for a temporary buffer in multithreaded mode.
521 - Simple error detection: wrong size indicates a broken file.
522 - Seeking forwards to a specific location in streamed mode.
524 It should be noted that the only reliable way to determine
525 the real uncompressed size is to uncompress the Block,
526 because the Block Header and Index fields may contain
527 (intentionally or unintentionally) invalid information.
530 3.1.5. List of Filter Flags
532 +================+================+ +================+
533 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
534 +================+================+ +================+
536 The number of Filter Flags fields is stored in the Block Flags
537 field (see Section 3.1.2).
539 The format of each Filter Flags field is as follows:
541 +===========+====================+===================+
542 | Filter ID | Size of Properties | Filter Properties |
543 +===========+====================+===================+
545 Both Filter ID and Size of Properties are stored using the
546 encoding described in Section 1.2. Size of Properties indicates
547 the size of the Filter Properties field as bytes. The list of
548 officially defined Filter IDs and the formats of their Filter
549 Properties are described in Section 5.3.
551 Filter IDs greater than or equal to 0x4000_0000_0000_0000
552 (2^62) are reserved for implementation-specific internal use.
553 These Filter IDs MUST never be used in List of Filter Flags.
556 3.1.6. Header Padding
This field contains as many null bytes as are needed to make
the Block Header have the size specified in Block Header Size.
560 If any of the bytes are not null bytes, the decoder MUST
561 indicate an error. It is possible that there is a new field
562 present which the decoder is not aware of, and can thus parse
563 the Block Header incorrectly.
568 The CRC32 is calculated over everything in the Block Header
569 field except the CRC32 field itself. It is stored as an
570 unsigned 32-bit little endian integer. If the calculated
571 value does not match the stored one, the decoder MUST indicate
Verifying the CRC32 of the Block Header before parsing the
actual contents allows the decoder to distinguish between
corrupt and unsupported files.
581 The format of Compressed Data depends on Block Flags and List
582 of Filter Flags. Excluding the descriptions of the simplest
583 filters in Section 5.3, the format of the filter-specific
584 encoded data is out of scope of this document.
589 Block Padding MUST contain 0-3 null bytes to make the size of
590 the Block a multiple of four bytes. This can be needed when
591 the size of Compressed Data is not a multiple of four.
The type and size of the Check field depend on which bits
are set in the Stream Flags field (see Section 2.1.1.2).
599 The Check, when used, is calculated from the original
600 uncompressed data. If the calculated Check does not match the
601 stored one, the decoder MUST indicate an error. If the selected
602 type of Check is not supported by the decoder, it MUST indicate
608 +-----------------+=========================+
609 | Index Indicator | Number of Index Records |
610 +-----------------+=========================+
612 +=================+=========+-+-+-+-+
613 ---> | List of Records | Padding | CRC32 |
614 +=================+=========+-+-+-+-+
Index serves several purposes. Using it, one can
617 - verify that all Blocks in a Stream have been processed;
618 - find out the uncompressed size of a Stream; and
619 - quickly access the beginning of any Block (random access).
624 This field overlaps with the Block Header Size field (see
625 Section 3.1.1). The value of Index Indicator is always 0x00.
628 4.2. Number of Records
630 This field indicates how many Records there are in the List
631 of Records field, and thus how many Blocks there are in the
632 Stream. The value is stored using the encoding described in
633 Section 1.2. If the decoder has decoded all the Blocks of the
634 Stream, and then notices that the Number of Records doesn't
635 match the real number of Blocks, the decoder MUST indicate an
641 List of Records consists of as many Records as indicated by the
642 Number of Records field:
645 | Record | Record | ...
648 Each Record contains information about one Block:
650 +===============+===================+
651 | Unpadded Size | Uncompressed Size |
652 +===============+===================+
654 If the decoder has decoded all the Blocks of the Stream, it
655 MUST verify that the contents of the Records match the real
656 Unpadded Size and Uncompressed Size of the respective Blocks.
Implementation hint: It is possible to verify the Index with
constant memory usage by calculating, for example, the SHA-256 of
both the real size values and the List of Records, then comparing
the check values. Implementing this using a non-cryptographic
check like CRC32 SHOULD be avoided unless small code size is
If the decoder supports random-access reading, it MUST verify
that Unpadded Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
part of a Block is decoded, the decoder MUST verify that the
processed sizes don't exceed the sizes stored in the Index.
674 This field indicates the size of the Block excluding the Block
675 Padding field. That is, Unpadded Size is the size of the Block
676 Header, Compressed Data, and Check fields. Unpadded Size is
677 stored using the encoding described in Section 1.2. The value
678 MUST never be zero; with the current structure of Blocks, the
679 actual minimum value for Unpadded Size is five.
681 Implementation note: Because the size of the Block Padding
682 field is not included in Unpadded Size, calculating the total
683 size of a Stream or doing random-access reading requires
684 calculating the actual size of the Blocks by rounding Unpadded
685 Sizes up to the next multiple of four.
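The rounding described in the implementation note can be sketched as follows. The function names are hypothetical, and the array of Unpadded Sizes stands in for values read from the Records of the Index.

```c
#include <stddef.h>
#include <stdint.h>

/* Actual size of a Block: Unpadded Size rounded up to the next
 * multiple of four. */
uint64_t
block_padded_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Size of the Block Padding field (0-3 bytes). */
unsigned int
block_padding_size(uint64_t unpadded_size)
{
    return (unsigned int)(block_padded_size(unpadded_size)
            - unpadded_size);
}

/* Total size of all Blocks in a Stream, given the Unpadded Size of
 * each Block. */
uint64_t
blocks_total_size(const uint64_t *unpadded_sizes, size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += block_padded_size(unpadded_sizes[i]);
    return total;
}
```

The offset of Block k relative to the end of the Stream Header is the total of the padded sizes of Blocks 0 to k-1.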
687 The reason to exclude Block Padding from Unpadded Size is to
688 ease making a raw copy of Compressed Data without Block
689 Padding. This can be useful, for example, if someone wants
690 to convert Streams to some other file format quickly.
693 4.3.2. Uncompressed Size
695 This field indicates the Uncompressed Size of the respective
696 Block as bytes. The value is stored using the encoding
697 described in Section 1.2.
702 This field MUST contain 0-3 null bytes to pad the Index to
703 a multiple of four bytes.
708 The CRC32 is calculated over everything in the Index field
709 except the CRC32 field itself. The CRC32 is stored as an
710 unsigned 32-bit little endian integer. If the calculated
711 value does not match the stored one, the decoder MUST indicate
717 The Block Flags field defines how many filters are used. When
718 more than one filter is used, the filters are chained; that is,
719 the output of one filter is the input of another filter. The
720 following figure illustrates the direction of data flow.
722 v Uncompressed Data ^
724 Encoder | Filter 1 | Decoder
731 Alignment of uncompressed input data is usually the job of
732 the application producing the data. For example, to get the
733 best results, an archiver tool should make sure that all
734 PowerPC executable files in the archive stream start at
735 offsets that are multiples of four bytes.
Some filters, for example LZMA2, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can be beneficial also when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA2 filter. Because
not only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA2 settings
so that the LZMA2 encoder can take advantage of its
four-byte-aligned input data.
The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
752 - speed, if the filtered data is handled multiple bytes at
753 a time by the filter-specific encoder and decoder,
754 because accessing aligned data in computer memory is
756 - compression ratio, if the output data is later compressed
757 with an external compression tool.
If filters could be chained freely, it would be
possible to create malicious files that would be very slow to
decode. Such files could be used to create denial of service
767 Slow files could occur when multiple filters are chained:
769 v Compressed input data
770 | Filter 1 decoder (last filter)
771 | Filter 0 decoder (non-last filter)
772 v Uncompressed output data
774 The decoder of the last filter in the chain produces a lot of
775 output from little input. Another filter in the chain takes the
776 output of the last filter, and produces very little output
777 while consuming a lot of input. As a result, a lot of data is
778 moved inside the filter chain, but the filter chain as a whole
779 gets very little work done.
To prevent such slow files, there are restrictions on
how the filters can be chained. These restrictions MUST be
taken into account when designing new filters.
The maximum number of filters in the chain has been limited to
four, so there can be at most three non-last filters.
Of these three non-last filters, only two are allowed to change
the size of the data.
Non-last filters that change the size of the data MUST
have a limit on how much the decoder can compress the data: the
decoder SHOULD produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations MUST be avoided.
796 The above limitations guarantee that if the last filter in the
797 chain produces 4n bytes of output, the chain as a whole will
798 produce at least n bytes of output.
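A decoder could enforce these chaining restrictions with a check along the following lines. This is a sketch only. The struct filter_traits and the function name are hypothetical; the three flags mirror the "Changes size of data", "Allow as a non-last filter", and "Allow as the last filter" attributes listed for each filter in Section 5.3.

```c
#include <stddef.h>

struct filter_traits {
    int changes_size;    /* "Changes size of data" */
    int allow_non_last;  /* "Allow as a non-last filter" */
    int allow_last;      /* "Allow as the last filter" */
};

/* Returns 1 if the chain obeys the restrictions above, 0 otherwise.
 * chain[0] is the first filter applied by the encoder, and
 * chain[count - 1] is the last filter in the chain. */
int
filter_chain_is_valid(const struct filter_traits *chain, size_t count)
{
    if (count < 1 || count > 4)
        return 0;

    size_t size_changing = 0;
    for (size_t i = 0; i + 1 < count; ++i) {
        if (!chain[i].allow_non_last)
            return 0;
        if (chain[i].changes_size)
            ++size_changing;
    }

    /* At most two non-last filters may change the size of the data. */
    if (size_changing > 2)
        return 0;

    return chain[count - 1].allow_last != 0;
}
```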
LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding
LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder and
for uncompressed chunks, eases stateful decoder implementations,
and improves support for multithreading. Thus, the plain LZMA
will not be supported in this file format.
817 Size of Filter Properties: 1 byte
818 Changes size of data: Yes
819 Allow as a non-last filter: No
820 Allow as the last filter: Yes
823 Input data: Adjustable to 1/2/4/8/16 byte(s)
826 The format of the one-byte Filter Properties field is as
829 Bits Mask Description
830 0-5 0x3F Dictionary Size
831 6-7 0xC0 Reserved for future use; MUST be zero for now.
833 Dictionary Size is encoded with one-bit mantissa and five-bit
834 exponent. The smallest dictionary size is 4 KiB and the biggest
837 Raw value Mantissa Exponent Dictionary size
851 40 2 31 4096 MiB - 1 B
853 Instead of having a table in the decoder, the dictionary size
854 can be decoded using the following C code:
    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        dictionary_size = UINT32_MAX;
    } else {
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }
869 5.3.2. Branch/Call/Jump Filters for Executables
871 These filters convert relative branch, call, and jump
872 instructions to their absolute counterparts in executable
873 files. This conversion increases redundancy and thus
876 Size of Filter Properties: 0 or 4 bytes
877 Changes size of data: No
878 Allow as a non-last filter: Yes
879 Allow as the last filter: No
881 Below is the list of filters in this category. The alignment
882 is the same for both input and output data.
884 Filter ID Alignment Description
885 0x04 1 byte x86 filter (BCJ)
886 0x05 4 bytes PowerPC (big endian) filter
887 0x06 16 bytes IA64 filter
888 0x07 4 bytes ARM (little endian) filter
889 0x08 2 bytes ARM Thumb (little endian) filter
890 0x09 4 bytes SPARC filter
892 If the size of Filter Properties is four bytes, the Filter
893 Properties field contains the start offset used for address
894 conversions. It is stored as an unsigned 32-bit little endian
895 integer. If the size of Filter Properties is zero, the start
898 Setting the start offset may be useful if an executable has
899 multiple sections, and there are many cross-section calls.
900 Taking advantage of this feature usually requires usage of
906 The Delta filter may increase compression ratio when the value
907 of the next byte correlates with the value of an earlier byte
908 at specified distance.
911 Size of Filter Properties: 1 byte
912 Changes size of data: No
913 Allow as a non-last filter: Yes
914 Allow as the last filter: No
918 Output data: Same as the original input data
920 The Properties byte indicates the delta distance, which can be
921 1-256 bytes backwards from the current byte: 0x00 indicates
922 distance of 1 byte and 0xFF distance of 256 bytes.
925 5.3.3.1. Format of the Encoded Output
927 The code below illustrates both encoding and decoding with
930 // Distance is in the range [1, 256].
931 const unsigned int distance = get_properties_byte() + 1;
935 memset(delta, 0, sizeof(delta));
938 const int byte = read_byte();
942 uint8_t tmp = delta[(uint8_t)(distance + pos)];
944 tmp = (uint8_t)(byte) - tmp;
945 delta[pos] = (uint8_t)(byte);
947 tmp = (uint8_t)(byte) + tmp;
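The fragments above leave out parts of the loop. A self-contained variant that works on a memory buffer instead of read_byte() and write_byte() is sketched below. The function name delta_code and its in-place interface are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode or decode size bytes of data in place. distance is in the
 * range [1, 256], that is, the Properties byte plus one. */
void
delta_code(uint8_t *data, size_t size, unsigned int distance,
        int is_encoder)
{
    uint8_t pos = 0;
    uint8_t history[256];
    memset(history, 0, sizeof(history));

    for (size_t i = 0; i < size; ++i) {
        const uint8_t byte = data[i];

        /* The original byte seen "distance" bytes earlier; zero if
         * there is no such byte yet. */
        uint8_t tmp = history[(uint8_t)(distance + pos)];

        if (is_encoder) {
            tmp = (uint8_t)(byte - tmp);
            history[pos] = byte;
        } else {
            tmp = (uint8_t)(byte + tmp);
            history[pos] = tmp;
        }

        data[i] = tmp;
        --pos; /* Wraps from 0 to 255. */
    }
}
```

With a distance of 1, the input bytes 1, 2, 4, 7 encode to 1, 1, 2, 3, and decoding restores the original bytes.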
956 5.4. Custom Filter IDs
958 If a developer wants to use custom Filter IDs, he has two
959 choices. The first choice is to contact Lasse Collin and ask
960 him to allocate a range of IDs for the developer.
The second choice is to generate a 40-bit random integer,
which the developer can use as his personal Developer ID.
To minimize the risk of collisions, Developer ID has to be
a randomly generated integer, not a manually selected "hex word".
The following command, which works on many free operating
systems, can be used to generate Developer ID:
969 dd if=/dev/urandom bs=5 count=1 | hexdump
971 The developer can then use his Developer ID to create unique
972 (well, hopefully unique) Filter IDs.
974 Bits Mask Description
975 0-15 0x0000_0000_0000_FFFF Filter ID
976 16-55 0x00FF_FFFF_FFFF_0000 Developer ID
977 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F
The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this Section for how to
request a custom ID range.
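Combining the three parts of the bit layout above into one Filter ID can be sketched like this. The function name is hypothetical, and the Developer ID used in the example is arbitrary.

```c
#include <stdint.h>

/* Build a custom Filter ID from a 40-bit Developer ID and a 16-bit
 * developer-chosen Filter ID, using the static prefix 0x3F. */
uint64_t
make_custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    return (UINT64_C(0x3F) << 56)
            | ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
            | filter_id;
}
```

For example, Developer ID 0x0123456789 and Filter ID 0x0001 give 0x3F01_2345_6789_0001, which stays below the range reserved for implementation-specific internal use (2^62 and above).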
985 5.4.1. Reserved Custom Filter ID Ranges
988 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
989 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility
992 6. Cyclic Redundancy Checks
994 There are several incompatible variations to calculate CRC32
995 and CRC64. For simplicity and clarity, complete examples are
996 provided to calculate the checks as they are used in this file
997 format. Implementations MAY use different code as long as it
998 gives identical results.
1000 The program below reads data from standard input, calculates
1001 the CRC32 and CRC64 values, and prints the calculated values
1002 as big endian hexadecimal strings to standard output.
1004 #include <sys/types.h>
1005 #include <inttypes.h>
1008 uint32_t crc32_table[256];
1009 uint64_t crc64_table[256];
1014 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1015 static const uint64_t poly64
1016 = UINT64_C(0xC96C5795D7870F42);
1018 for (size_t i = 0; i < 256; ++i) {
1022 for (size_t j = 0; j < 8; ++j) {
1024 crc32 = (crc32 >> 1) ^ poly32;
1029 crc64 = (crc64 >> 1) ^ poly64;
1034 crc32_table[i] = crc32;
1035 crc64_table[i] = crc64;
1040 crc32(const uint8_t *buf, size_t size, uint32_t crc)
1043 for (size_t i = 0; i < size; ++i)
1044 crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1050 crc64(const uint8_t *buf, size_t size, uint64_t crc)
1053 for (size_t i = 0; i < size; ++i)
1054 crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1064 uint32_t value32 = 0;
1065 uint64_t value64 = 0;
1066 uint64_t total_size = 0;
1070 const size_t buf_size = fread(buf, 1, 8192, stdin);
1074 total_size += buf_size;
1075 value32 = crc32(buf, buf_size, value32);
1076 value64 = crc64(buf, buf_size, value64);
1079 printf("Bytes: %" PRIu64 "\n", total_size);
1080 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1081 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
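Since the listing above is incomplete, the table setup and the two update functions are sketched below without the main() function. The function names are chosen for this example. Inverting the value at the start and end of each call is an assumption based on the common CRC32 and CRC64 conventions; those lines are not visible in the fragments above. Passing the previous return value as crc continues a calculation across buffers.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];
static uint64_t crc64_table[256];

/* Fill both lookup tables. Call once before using the functions below. */
void
xz_crc_init(void)
{
    const uint32_t poly32 = UINT32_C(0xEDB88320);
    const uint64_t poly64 = UINT64_C(0xC96C5795D7870F42);

    for (size_t i = 0; i < 256; ++i) {
        uint32_t c32 = (uint32_t)i;
        uint64_t c64 = i;

        for (size_t j = 0; j < 8; ++j) {
            c32 = (c32 & 1) ? (c32 >> 1) ^ poly32 : c32 >> 1;
            c64 = (c64 & 1) ? (c64 >> 1) ^ poly64 : c64 >> 1;
        }

        crc32_table[i] = c32;
        crc64_table[i] = c64;
    }
}

/* Update a CRC32; pass 0 as crc for the first buffer. */
uint32_t
xz_crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i)
        crc = crc32_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
    return ~crc;
}

/* Update a CRC64; pass 0 as crc for the first buffer. */
uint64_t
xz_crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i)
        crc = crc64_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
    return ~crc;
}
```

A common sanity check is the nine ASCII bytes "123456789", for which these functions return 0xCBF43926 and 0x995DC9BBDF1939FA.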
1089 LZMA SDK - The original LZMA implementation
1090 http://7-zip.org/sdk.html
1092 LZMA Utils - LZMA adapted to POSIX-like systems
1093 http://tukaani.org/lzma/
1096 GZIP file format specification version 4.3
1097 http://www.ietf.org/rfc/rfc1952.txt
1098 - Notation of byte boxes in section "2.1. Overall conventions"
1101 Key words for use in RFCs to Indicate Requirement Levels
1102 http://www.ietf.org/rfc/rfc2119.txt
1106 http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1107 - Node 9.4.2 "Blocking Factor", paragraph that begins
1108 "gzip will complain about trailing garbage"
1109 - Note that this URL points to the latest version of the
1110 manual, and may some day not contain the note which is in
1111 1.20. For the exact version of the manual, download GNU
1112 tar 1.20: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.20.tar.gz