1.1. Byte and Its Representation
1.2. Multibyte Integers
2. Overall Structure of .xz File
2.1.1.1. Header Magic Bytes
2.1.2.2. Backward Size
2.1.2.4. Footer Magic Bytes
3.1.1. Block Header Size
3.1.3. Compressed Size
3.1.4. Uncompressed Size
3.1.5. List of Filter Flags
4.2. Number of Records
4.3.2. Uncompressed Size
5.3.2. Branch/Call/Jump Filters for Executables
5.3.3.1. Format of the Encoded Output
5.4. Custom Filter IDs
5.4.1. Reserved Custom Filter ID Ranges
6. Cyclic Redundancy Checks

This document describes the .xz file format (filename suffix
".xz", MIME type "application/x-xz"). It is intended that
this format replace the old .lzma format used by LZMA SDK and
LZMA Utils.

IMPORTANT: The version described in this document is a
draft, NOT a final, official version. Changes

0.1. Copyright Notices

Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org>
Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>

Copying and distribution of this file, with or without
modification, are permitted in any medium without royalty
provided the copyright notice and this notice are preserved.
Modified versions must be marked as such.

All source code examples given in this document are put into
the public domain by the authors of this document.

Special thanks for helping with this document go to
Igor Pavlov. Thanks for helping with this document go to
Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

Last modified: 2008-11-03 00:35+0200

(A changelog will be kept once the first official version

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC-2119].

Indicating a warning means displaying a message, returning
an appropriate exit status, or doing something else to let the
user know that something worth warning occurred. The operation
SHOULD still finish if a warning is indicated.

Indicating an error means displaying a message, returning
an appropriate exit status, or doing something else to let the
user know that something prevented successfully finishing the
operation. The operation MUST be aborted once an error has
been indicated.

1.1. Byte and Its Representation

In this document, a byte is always 8 bits.

A "null byte" has all bits unset. That is, the value of a null
byte is 0x00.

To represent byte blocks, this document uses notation that
is similar to the notation used in [RFC-1952]:

    |  Foo  |   Two bytes; that is, some of the vertical bars
    +---+---+   can be missing.

    |  Foo  |   Zero or more bytes.

In this document, a boxed byte or a byte sequence declared
using this notation is called "a field". The example field
above would be called "the Foo field" or plain "Foo".

If there are many fields, they may be split to multiple lines.
This is indicated with an arrow ("--->"):

The above is equivalent to this:

1.2. Multibyte Integers

Multibyte integers of static length, such as CRC values,
are stored in little endian byte order (least significant
byte first).

When smaller values are more likely than bigger values (for
example file sizes), multibyte integers are encoded in a
variable-length representation:
  - Numbers in the range [0, 127] are copied as is, and take
    one byte of space.
  - Bigger numbers will occupy two or more bytes. All but the
    last byte of the multibyte representation have the highest
    bit set.

For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in the future if needed.

The following C code illustrates encoding and decoding of
variable-length integers. The functions return the number of
bytes occupied by the integer (1-9), or zero on error.

    #include <sys/types.h>
    #include <inttypes.h>

    encode(uint8_t buf[static 9], uint64_t num)
        if (num > UINT64_MAX / 2)
        while (num >= 0x80) {
            buf[i++] = (uint8_t)(num) | 0x80;
        buf[i++] = (uint8_t)(num);

    decode(const uint8_t buf[], size_t size_max, uint64_t *num)
        *num = buf[0] & 0x7F;
        while (buf[i++] & 0x80) {
            if (i >= size_max || buf[i] == 0x00)
            *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
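
As a self-contained illustration of the variable-length encoding
described in this section, the following sketch gives complete
function bodies that compile on their own. The names vli_encode and
vli_decode, the choice of headers, and the static linkage are
assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode num into buf. Returns the number of bytes written (1-9),
 * or 0 if num does not fit in 63 bits. */
static size_t
vli_encode(uint8_t buf[static 9], uint64_t num)
{
	if (num > UINT64_MAX / 2)
		return 0;

	size_t i = 0;

	/* Emit seven bits at a time, least significant group first.
	 * Every byte except the last has its highest bit set. */
	while (num >= 0x80) {
		buf[i++] = (uint8_t)(num) | 0x80;
		num >>= 7;
	}

	buf[i++] = (uint8_t)(num);
	return i;
}

/* Decode an integer from at most size_max bytes of buf. Returns the
 * number of bytes consumed (1-9), or 0 on error. */
static size_t
vli_decode(const uint8_t buf[], size_t size_max, uint64_t *num)
{
	if (size_max == 0)
		return 0;

	if (size_max > 9)
		size_max = 9;

	*num = buf[0] & 0x7F;
	size_t i = 0;

	while (buf[i++] & 0x80) {
		/* Reject truncated input and encodings that end in a
		 * 0x00 byte, which would not be the shortest form. */
		if (i >= size_max || buf[i] == 0x00)
			return 0;

		*num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
	}

	return i;
}
```

For example, the value 300 is encoded as the two bytes 0xAC 0x02,
and decoding those two bytes yields 300 again.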

2. Overall Structure of .xz File

A standalone .xz file consists of one or more Streams which may
have Stream Padding between or after them:

    +========+================+========+================+
    | Stream | Stream Padding | Stream | Stream Padding | ...
    +========+================+========+================+

While a typical file contains only one Stream and no Stream
Padding, a decoder handling standalone .xz files SHOULD support
files that have more than one Stream or Stream Padding.

In contrast to standalone .xz files, when the .xz file format
is used as an internal part of some other file format or
communication protocol, it usually is expected that the decoder
stops after the first Stream, and doesn't look for Stream
Padding or possibly other Streams.

    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
    |     Stream Header     | Block | Block | ... | Block |
    +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

         +=======+-+-+-+-+-+-+-+-+-+-+-+-+
    ---> | Index |     Stream Footer     |
         +=======+-+-+-+-+-+-+-+-+-+-+-+-+

All the above fields have a size that is a multiple of four. If
a Stream is used as an internal part of another file format, it
is RECOMMENDED to make the Stream start at an offset that is
a multiple of four bytes.

Stream Header, Index, and Stream Footer are always present in
a Stream. The maximum size of the Index field is 16 GiB (2^34).

There are zero or more Blocks. The maximum number of Blocks is
limited only by the maximum size of the Index field.

Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
The same limit applies to the total amount of uncompressed
data stored in a Stream.

If an implementation supports handling .xz files with multiple
concatenated Streams, it MAY apply the above limits to the file
as a whole instead of on a per-Stream basis.

    +---+---+---+---+---+---+-------+------+--+--+--+--+
    |  Header Magic Bytes   | Stream Flags |   CRC32   |
    +---+---+---+---+---+---+-------+------+--+--+--+--+

2.1.1.1. Header Magic Bytes

The first six (6) bytes of the Stream are the so-called Header
Magic Bytes. They can be used to identify the file type.

    Using a C array and ASCII:
    const uint8_t HEADER_MAGIC[6]
            = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

    In plain hexadecimal:
    FD 37 7A 58 5A 00

  - The first byte (0xFD) was chosen so that the files cannot
    be erroneously detected as being in .lzma format, in which
    the first byte is in the range [0x00, 0xE0].
  - The sixth byte (0x00) was chosen to prevent applications
    from misdetecting the file as a text file.

If the Header Magic Bytes don't match, the decoder MUST
indicate an error.

2.1.1.2. Stream Flags

The first byte of Stream Flags is always a null byte. In the future
this byte may be used to indicate a new Stream version or other

The second byte of Stream Flags is a bit field:

    Bit(s)  Mask  Description
     0-3    0x0F  Type of Check (see Section 3.4):
                      ID    Size      Check name
                      0x02   4 bytes  (Reserved)
                      0x03   4 bytes  (Reserved)
                      0x05   8 bytes  (Reserved)
                      0x06   8 bytes  (Reserved)
                      0x07  16 bytes  (Reserved)
                      0x08  16 bytes  (Reserved)
                      0x09  16 bytes  (Reserved)
                      0x0A  32 bytes  SHA-256
                      0x0B  32 bytes  (Reserved)
                      0x0C  32 bytes  (Reserved)
                      0x0D  64 bytes  (Reserved)
                      0x0E  64 bytes  (Reserved)
                      0x0F  64 bytes  (Reserved)
     4-7    0xF0  Reserved for future use; MUST be zero for now.

Implementations SHOULD support at least the Check IDs 0x00
(None) and 0x01 (CRC32). Supporting other Check IDs is
OPTIONAL. If an unsupported Check is used, the decoder SHOULD
indicate a warning or error.

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Stream Header
incorrectly.

The CRC32 is calculated from the Stream Flags field. It is
stored as an unsigned 32-bit little endian integer. If the
calculated value does not match the stored one, the decoder
MUST indicate an error.

The idea is that Stream Flags would always be two bytes, even
if new features are needed. This way old decoders will be able
to verify the CRC32 calculated from Stream Flags, and thus
distinguish between corrupt files (CRC32 doesn't match) and
files that the decoder doesn't support (CRC32 matches but
Stream Flags has reserved bits set).
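
The distinction between corrupt and unsupported files can be
sketched as follows. This is only an illustration under stated
assumptions: the function and enum names are hypothetical, the
Stream Header is assumed to be available as twelve contiguous
bytes, and a simple bitwise CRC32 is used instead of a table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32 with the reflected polynomial 0xEDB88320. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
	uint32_t crc = ~UINT32_C(0);
	for (size_t i = 0; i < size; ++i) {
		crc ^= buf[i];
		for (int j = 0; j < 8; ++j)
			crc = (crc >> 1)
				^ ((crc & 1) ? UINT32_C(0xEDB88320) : 0);
	}
	return ~crc;
}

enum header_status {
	HEADER_OK,
	HEADER_BAD_MAGIC,
	HEADER_CORRUPT,
	HEADER_UNSUPPORTED
};

/* hdr holds the 12-byte Stream Header. On HEADER_OK, *check_id
 * receives the Type of Check (0x00-0x0F). */
static enum header_status
check_stream_header(const uint8_t hdr[static 12], uint8_t *check_id)
{
	static const uint8_t magic[6]
			= { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

	if (memcmp(hdr, magic, sizeof(magic)) != 0)
		return HEADER_BAD_MAGIC;

	/* The stored CRC32 is little endian and covers only the two
	 * Stream Flags bytes. */
	const uint32_t stored = (uint32_t)hdr[8]
			| (uint32_t)hdr[9] << 8
			| (uint32_t)hdr[10] << 16
			| (uint32_t)hdr[11] << 24;

	if (crc32_bitwise(hdr + 6, 2) != stored)
		return HEADER_CORRUPT;

	/* The CRC32 matched, so reserved bits that are set mean
	 * "unsupported" rather than "corrupt". */
	if (hdr[6] != 0x00 || (hdr[7] & 0xF0) != 0)
		return HEADER_UNSUPPORTED;

	*check_id = hdr[7] & 0x0F;
	return HEADER_OK;
}
```

Checking the CRC32 before looking at the reserved bits is what
lets the caller report the two failure modes separately.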

    +-+-+-+-+---+---+---+---+-------+------+----------+---------+
    | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
    +-+-+-+-+---+---+---+---+-------+------+----------+---------+

The CRC32 is calculated from the Backward Size and Stream Flags
fields. It is stored as an unsigned 32-bit little endian
integer. If the calculated value does not match the stored one,
the decoder MUST indicate an error.

The reason to have the CRC32 field before the Backward Size and
Stream Flags fields is to keep the four-byte fields aligned to
a multiple of four bytes.

2.1.2.2. Backward Size

Backward Size is stored as a 32-bit little endian integer,
which indicates the size of the Index field as a multiple of
four bytes, minimum value being four bytes:

    real_backward_size = (stored_backward_size + 1) * 4;

If the stored value does not match the real size of the Index
field, the decoder MUST indicate an error.

Using a fixed-size integer to store Backward Size makes
it slightly simpler to parse the Stream Footer when the
application needs to parse the Stream backwards.
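
A minimal sketch of the Backward Size calculation is shown below.
The function name and the byte-array parameter are assumptions of
this example. The result is widened to 64 bits because the largest
stored value maps to 2^34 bytes, which does not fit in 32 bits.

```c
#include <stdint.h>

/* Convert the four stored little endian bytes of Backward Size to
 * the real size of the Index field in bytes. */
static uint64_t
real_backward_size(const uint8_t stored[static 4])
{
	const uint32_t value = (uint32_t)stored[0]
			| (uint32_t)stored[1] << 8
			| (uint32_t)stored[2] << 16
			| (uint32_t)stored[3] << 24;

	/* Widen before adding one so that 0xFFFFFFFF does not wrap. */
	return ((uint64_t)value + 1) * 4;
}
```

A stored value of zero gives the minimum size of four bytes, and
the maximum stored value gives 16 GiB, which matches the maximum
size of the Index field stated in Section 2.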

2.1.2.3. Stream Flags

This is a copy of the Stream Flags field from the Stream
Header. The information stored to Stream Flags is needed
when parsing the Stream backwards. The decoder MUST compare
the Stream Flags fields in both Stream Header and Stream
Footer, and indicate an error if they are not identical.

2.1.2.4. Footer Magic Bytes

As the last step of the decoding process, the decoder MUST
verify the existence of Footer Magic Bytes. If they don't
match, an error MUST be indicated.

    Using a C array and ASCII:
    const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

The primary reason to have Footer Magic Bytes is to make
it easier to detect incomplete files quickly, without
uncompressing. If the file does not end with Footer Magic Bytes
(excluding Stream Padding described in Section 2.2), it cannot
be undamaged, unless someone has intentionally appended garbage
after the end of the Stream.

Only the decoders that support decoding of concatenated Streams
MUST support Stream Padding.

Stream Padding MUST contain only null bytes. To preserve the
four-byte alignment of consecutive Streams, the size of Stream
Padding MUST be a multiple of four bytes. Empty Stream Padding

Note that non-empty Stream Padding is allowed at the end of the
file; there doesn't need to be a new Stream after non-empty
Stream Padding. This can be convenient in certain situations

The possibility of Padding MUST be taken into account when
designing an application that parses Streams backwards and
supports concatenated Streams.
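
When parsing backwards, the padding has to be skipped before the
Stream Footer can be located. The sketch below assumes the tail of
the file is already in memory and uses a hypothetical function
name. It relies on the fact that a Stream Footer ends with the
non-null Footer Magic Bytes, so all trailing null bytes belong to
Stream Padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the size of the Stream Padding at the end of buf, or
 * (size_t)-1 if the number of trailing null bytes is not a
 * multiple of four. */
static size_t
trailing_stream_padding(const uint8_t *buf, size_t size)
{
	size_t padding = 0;

	while (padding < size && buf[size - 1 - padding] == 0x00)
		++padding;

	return (padding % 4 == 0) ? padding : (size_t)-1;
}
```

The Stream Footer then ends at offset size minus the returned
padding size.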

    +==============+=================+===============+=======+
    | Block Header | Compressed Data | Block Padding | Check |
    +==============+=================+===============+=======+

    +-------------------+-------------+=================+
    | Block Header Size | Block Flags | Compressed Size |
    +-------------------+-------------+=================+

         +===================+======================+
    ---> | Uncompressed Size | List of Filter Flags |
         +===================+======================+

         +================+--+--+--+--+
    ---> | Header Padding |   CRC32   |
         +================+--+--+--+--+

3.1.1. Block Header Size

This field overlaps with the Index Indicator field (see
Section 4.1).

This field contains the size of the Block Header field,
including the Block Header Size field itself. Valid values are
in the range [0x01, 0xFF], which indicate the size of the Block
Header as multiples of four bytes, minimum size being eight
bytes:

    real_header_size = (encoded_header_size + 1) * 4;

If a bigger Block Header is needed in the future, a new field can be
added between the current Block Header and Compressed Data
fields. The presence of this new field would be indicated in

The first byte of the Block Flags field is a bit field:

    Bit(s)  Mask  Description
     0-1    0x03  Number of filters (1-4)
     2-5    0x3C  Reserved for future use; MUST be zero for now.
     6      0x40  The Compressed Size field is present.
     7      0x80  The Uncompressed Size field is present.

If any reserved bit is set, the decoder MUST indicate an error.
It is possible that there is a new field present which the
decoder is not aware of, and can thus parse the Block Header
incorrectly.
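
The Block Header Size formula and the Block Flags bit field can be
sketched together as shown below. The struct and function names are
hypothetical. The mapping of the two-bit value 0-3 to a filter count
of 1-4 is an assumption based on the "(1-4)" range given in the
table.

```c
#include <stdbool.h>
#include <stdint.h>

struct block_flags {
	unsigned int filter_count;      /* 1-4 */
	bool has_compressed_size;
	bool has_uncompressed_size;
};

/* Returns the real Block Header size in bytes, or 0 if the byte is
 * 0x00, which is the Index Indicator and not a Block Header Size. */
static unsigned int
block_header_size(uint8_t encoded_header_size)
{
	if (encoded_header_size == 0x00)
		return 0;

	return ((unsigned int)encoded_header_size + 1) * 4;
}

/* Returns false if any reserved bit (mask 0x3C) is set. */
static bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
	if (flags & 0x3C)
		return false;

	/* Assumption: the stored value is the filter count minus one. */
	out->filter_count = (flags & 0x03) + 1;
	out->has_compressed_size = (flags & 0x40) != 0;
	out->has_uncompressed_size = (flags & 0x80) != 0;
	return true;
}
```

With this formula the valid encoded values 0x01-0xFF correspond to
Block Header sizes of 8-1024 bytes.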

3.1.3. Compressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Compressed Size field contains the size of the Compressed
Data field, which MUST be non-zero. Compressed Size is stored
using the encoding described in Section 1.2. If the Compressed
Size doesn't match the size of the Compressed Data field, the
decoder MUST indicate an error.

3.1.4. Uncompressed Size

This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

The Uncompressed Size field contains the size of the Block
after uncompressing. Uncompressed Size is stored using the
encoding described in Section 1.2. If the Uncompressed Size
does not match the real uncompressed size, the decoder MUST
indicate an error.

Storing the Compressed Size and Uncompressed Size fields serves
several purposes:
  - The decoder knows how much memory it needs to allocate
    for a temporary buffer in multithreaded mode.
  - Simple error detection: wrong size indicates a broken file.
  - Seeking forwards to a specific location in streamed mode.

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

3.1.5. List of Filter Flags

    +================+================+     +================+
    | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
    +================+================+     +================+

The number of Filter Flags fields is stored in the Block Flags
field (see Section 3.1.2).

The format of each Filter Flags field is as follows:

    +===========+====================+===================+
    | Filter ID | Size of Properties | Filter Properties |
    +===========+====================+===================+

Both Filter ID and Size of Properties are stored using the
encoding described in Section 1.2. Size of Properties indicates
the size of the Filter Properties field in bytes. The list of
officially defined Filter IDs and the formats of their Filter
Properties are described in Section 5.3.

Filter IDs greater than or equal to 0x4000_0000_0000_0000
(2^62) are reserved for implementation-specific internal use.
These Filter IDs MUST never be used in List of Filter Flags.

3.1.6. Header Padding

This field contains as many null bytes as are needed to make
the Block Header have the size specified in Block Header Size.
If any of the bytes are not null bytes, the decoder MUST
indicate an error. It is possible that there is a new field
present which the decoder is not aware of, and can thus parse
the Block Header incorrectly.

The CRC32 is calculated over everything in the Block Header
field except the CRC32 field itself. It is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.

Verifying the CRC32 of the Block Header before parsing the
actual contents allows the decoder to distinguish between
corrupt and unsupported files.

The format of Compressed Data depends on Block Flags and List
of Filter Flags. Excluding the descriptions of the simplest
filters in Section 5.3, the format of the filter-specific
encoded data is out of scope of this document.

Block Padding MUST contain 0-3 null bytes to make the size of
the Block a multiple of four bytes. This can be needed when
the size of Compressed Data is not a multiple of four.

The type and size of the Check field depend on which bits
are set in the Stream Flags field (see Section 2.1.1.2).

The Check, when used, is calculated from the original
uncompressed data. If the calculated Check does not match the
stored one, the decoder MUST indicate an error. If the selected
type of Check is not supported by the decoder, it MUST indicate

    +-----------------+=========================+
    | Index Indicator | Number of Index Records |
    +-----------------+=========================+

         +=================+=========+-+-+-+-+
    ---> | List of Records | Padding | CRC32 |
         +=================+=========+-+-+-+-+

Index serves several purposes. Using it, one can
  - verify that all Blocks in a Stream have been processed;
  - find out the uncompressed size of a Stream; and
  - quickly access the beginning of any Block (random access).

This field overlaps with the Block Header Size field (see
Section 3.1.1). The value of Index Indicator is always 0x00.

4.2. Number of Records

This field indicates how many Records there are in the List
of Records field, and thus how many Blocks there are in the
Stream. The value is stored using the encoding described in
Section 1.2. If the decoder has decoded all the Blocks of the
Stream, and then notices that the Number of Records doesn't
match the real number of Blocks, the decoder MUST indicate an
error.

List of Records consists of as many Records as indicated by the
Number of Records field:

    | Record | Record | ...

Each Record contains information about one Block:

    +===============+===================+
    | Unpadded Size | Uncompressed Size |
    +===============+===================+

If the decoder has decoded all the Blocks of the Stream, it
MUST verify that the contents of the Records match the real
Unpadded Size and Uncompressed Size of the respective Blocks.

Implementation hint: It is possible to verify the Index with
constant memory usage by calculating, for example, SHA256 of both
the real size values and the List of Records, then comparing
the check values. Implementing this using a non-cryptographic
check like CRC32 SHOULD be avoided unless small code size is

If the decoder supports random-access reading, it MUST verify
that Unpadded Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
a partial Block is decoded, the decoder MUST verify that the
processed sizes don't exceed the sizes stored in the Index.

This field indicates the size of the Block excluding the Block
Padding field. That is, Unpadded Size is the size of the Block
Header, Compressed Data, and Check fields. Unpadded Size is
stored using the encoding described in Section 1.2. The value
MUST never be zero; with the current structure of Blocks, the
actual minimum value for Unpadded Size is five.

Implementation note: Because the size of the Block Padding
field is not included in Unpadded Size, calculating the total
size of a Stream or doing random-access reading requires
calculating the actual size of the Blocks by rounding Unpadded
Sizes up to the next multiple of four.
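
The rounding mentioned in the implementation note can be sketched
as follows. Both function names are hypothetical, and the input is
assumed to be small enough that adding three does not overflow,
which holds for any value within the 63-bit limit of Section 1.2.

```c
#include <stdint.h>

/* Actual size of a Block: Unpadded Size rounded up to the next
 * multiple of four. */
static uint64_t
padded_block_size(uint64_t unpadded_size)
{
	return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Number of Block Padding bytes (0-3) that follow Compressed Data. */
static unsigned int
block_padding_size(uint64_t unpadded_size)
{
	return (unsigned int)((4 - unpadded_size % 4) % 4);
}
```

For the minimum Unpadded Size of five, the Block occupies eight
bytes, three of which are Block Padding.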

The reason to exclude Block Padding from Unpadded Size is to
ease making a raw copy of Compressed Data without Block
Padding. This can be useful, for example, if someone wants
to convert Streams to some other file format quickly.

4.3.2. Uncompressed Size

This field indicates the Uncompressed Size of the respective
Block in bytes. The value is stored using the encoding
described in Section 1.2.

This field MUST contain 0-3 null bytes to pad the Index to
a multiple of four bytes.

The CRC32 is calculated over everything in the Index field
except the CRC32 field itself. The CRC32 is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.

The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
the output of one filter is the input of another filter. The
following figure illustrates the direction of data flow.

            v   Uncompressed Data  ^
    Encoder |       Filter 1       | Decoder

Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
best results, an archiver tool should make sure that all
PowerPC executable files in the archive stream start at
offsets that are multiples of four bytes.

Some filters, for example LZMA2, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can be beneficial also when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA2 filter. Because
not only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA2 settings
so that the LZMA2 encoder can take advantage of its
four-byte-aligned input data.

The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
  - speed, if the filtered data is handled multiple bytes at
    a time by the filter-specific encoder and decoder,
    because accessing aligned data in computer memory is
  - compression ratio, if the output data is later compressed
    with an external compression tool.

If filters were allowed to be chained freely, it would be
possible to create malicious files that would be very slow to
decode. Such files could be used to create denial of service
attacks.

Slow files could occur when multiple filters are chained:

    v   Compressed input data
    |   Filter 1 decoder (last filter)
    |   Filter 0 decoder (non-last filter)
    v   Uncompressed output data

The decoder of the last filter in the chain produces a lot of
output from little input. Another filter in the chain takes the
output of the last filter, and produces very little output
while consuming a lot of input. As a result, a lot of data is
moved inside the filter chain, but the filter chain as a whole
gets very little work done.

To prevent slow files of this kind, there are restrictions on
how the filters can be chained. These restrictions MUST be
taken into account when designing new filters.

The maximum number of filters in the chain has been limited to
four, thus there can be at most three non-last filters.
Of these three non-last filters, only two are allowed to change
the size of the data.

The non-last filters that change the size of the data MUST
have a limit on how much the decoder can compress the data: the
decoder SHOULD produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations MUST be avoided.

The above limitations guarantee that if the last filter in the
chain produces 4n bytes of output, the chain as a whole will
produce at least n bytes of output.
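
The chaining restrictions can be checked mechanically. The sketch
below is one possible way to do it and is not taken from any
existing implementation: each filter is described by a hypothetical
filter_traits struct whose three fields correspond to the "Changes
size of data", "Allow as a non-last filter", and "Allow as the last
filter" properties listed for each filter in Section 5.3.

```c
#include <stdbool.h>
#include <stddef.h>

struct filter_traits {
	bool changes_size;      /* Changes size of data */
	bool allow_non_last;    /* Allow as a non-last filter */
	bool allow_last;        /* Allow as the last filter */
};

/* chain[0] is the first filter applied by the encoder and
 * chain[count - 1] is the last one. */
static bool
chain_is_valid(const struct filter_traits chain[], size_t count)
{
	if (count < 1 || count > 4)
		return false;

	size_t size_changing = 0;

	/* Examine every non-last filter. */
	for (size_t i = 0; i + 1 < count; ++i) {
		if (!chain[i].allow_non_last)
			return false;

		if (chain[i].changes_size)
			++size_changing;
	}

	/* At most two non-last filters may change the data size. */
	if (size_changing > 2)
		return false;

	return chain[count - 1].allow_last;
}
```

With the properties given in Section 5.3, a chain of Delta followed
by LZMA2 passes this check, while LZMA2 followed by Delta does not.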

LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding

LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder,
uncompressed chunks, eases stateful decoder implementations,
and improves support for multithreading. Thus, the plain LZMA
will not be supported in this file format.

    Size of Filter Properties:  1 byte
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes

    Input data:                 Adjustable to 1/2/4/8/16 byte(s)

The format of the one-byte Filter Properties field is as
follows:

    Bits   Mask   Description
    0-5    0x3F   Dictionary Size
    6-7    0xC0   Reserved for future use; MUST be zero for now.

Dictionary Size is encoded with a one-bit mantissa and a five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest
is 4096 MiB - 1 B:

    Raw value   Mantissa   Exponent   Dictionary size
       40           2         31      4096 MiB - 1 B

Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:

    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        dictionary_size = UINT32_MAX;
    } else {
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }
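
The same calculation can be wrapped into a self-contained function
for experimentation. The function name, the bool return value, and
the check of the reserved bits inside the function are choices made
for this sketch only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode the one-byte LZMA2 Filter Properties field. Returns false
 * if a reserved bit is set or the raw value is bigger than 40. */
static bool
lzma2_dictionary_size(uint8_t props, uint32_t *dictionary_size)
{
	if (props & 0xC0)
		return false;

	const uint8_t bits = props & 0x3F;
	if (bits > 40)
		return false;

	if (bits == 40) {
		/* Special case: 4096 MiB - 1 B. */
		*dictionary_size = UINT32_MAX;
	} else {
		/* Mantissa is 2 or 3; the shift starts at 11 so that
		 * the raw value 0 maps to 4 KiB. */
		*dictionary_size = 2 | (bits & 1);
		*dictionary_size <<= bits / 2 + 11;
	}

	return true;
}
```

A raw value of 0 decodes to 4096 bytes and a raw value of 1 to
6144 bytes, so odd raw values select sizes halfway between two
powers of two.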

5.3.2. Branch/Call/Jump Filters for Executables

These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
files. This conversion increases redundancy and thus

    Size of Filter Properties:  0 or 4 bytes
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

Below is the list of filters in this category. The alignment
is the same for both input and output data.

    Filter ID   Alignment   Description
      0x04       1 byte     x86 filter (BCJ)
      0x05       4 bytes    PowerPC (big endian) filter
      0x06      16 bytes    IA64 filter
      0x07       4 bytes    ARM (little endian) filter
      0x08       2 bytes    ARM Thumb (little endian) filter
      0x09       4 bytes    SPARC filter

If the size of Filter Properties is four bytes, the Filter
Properties field contains the start offset used for address
conversions. It is stored as an unsigned 32-bit little endian
integer. If the size of Filter Properties is zero, the start

Setting the start offset may be useful if an executable has
multiple sections, and there are many cross-section calls.
Taking advantage of this feature usually requires usage of

The Delta filter may increase compression ratio when the value
of the next byte correlates with the value of an earlier byte
at a specified distance.

    Size of Filter Properties:  1 byte
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

    Output data:                Same as the original input data

The Properties byte indicates the delta distance, which can be
1-256 bytes backwards from the current byte: 0x00 indicates
a distance of 1 byte and 0xFF a distance of 256 bytes.

5.3.3.1. Format of the Encoded Output

The code below illustrates both encoding and decoding with
the Delta filter:

    // Distance is in the range [1, 256].
    const unsigned int distance = get_properties_byte() + 1;

    memset(delta, 0, sizeof(delta));

        const int byte = read_byte();

        uint8_t tmp = delta[(uint8_t)(distance + pos)];
            tmp = (uint8_t)(byte) - tmp;
            delta[pos] = (uint8_t)(byte);
            tmp = (uint8_t)(byte) + tmp;
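
A self-contained, buffer-based variant of the same idea is sketched
below. It is an illustration under stated assumptions: the function
name delta_filter, the in-place operation on a memory buffer, and
the is_encoder parameter are choices of this example, whereas the
history indexing follows the lines shown above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Apply the Delta filter in place. props is the Filter Properties
 * byte, so the distance is props + 1, in the range [1, 256]. */
static void
delta_filter(uint8_t *data, size_t size, uint8_t props, bool is_encoder)
{
	const unsigned int distance = (unsigned int)props + 1;
	uint8_t pos = 0;
	uint8_t delta[256];

	/* Bytes before the start of the data are treated as zero. */
	memset(delta, 0, sizeof(delta));

	for (size_t i = 0; i < size; ++i) {
		const uint8_t byte = data[i];

		/* pos decreases by one per byte, so the original byte
		 * seen "distance" bytes ago is at distance + pos. */
		uint8_t tmp = delta[(uint8_t)(distance + pos)];

		if (is_encoder) {
			tmp = (uint8_t)(byte - tmp);
			delta[pos] = byte;
		} else {
			tmp = (uint8_t)(byte + tmp);
			delta[pos] = tmp;
		}

		data[i] = tmp;
		--pos;
	}
}
```

With a distance of one, the input bytes 1, 2, 3, 4 are encoded as
1, 1, 1, 1, and decoding that output restores the original bytes.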

5.4. Custom Filter IDs

If a developer wants to use custom Filter IDs, he has two
choices. The first choice is to contact Lasse Collin and ask
him to allocate a range of IDs for the developer.

The second choice is to generate a 40-bit random integer,
which the developer can use as his personal Developer ID.
To minimize the risk of collisions, Developer ID has to be
a randomly generated integer, not manually selected "hex word".
The following command, which works on many free operating
systems, can be used to generate Developer ID:

    dd if=/dev/urandom bs=5 count=1 | hexdump

The developer can then use his Developer ID to create unique
(well, hopefully unique) Filter IDs.

    Bits    Mask                    Description
     0-15   0x0000_0000_0000_FFFF   Filter ID
    16-55   0x00FF_FFFF_FFFF_0000   Developer ID
    56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F

The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this Section for how to
request a custom ID range.
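
Composing such an ID from its three parts can be sketched as shown
below. The function and parameter names are hypothetical, and the
Developer ID used in the usage note is an arbitrary example value,
not an allocated one.

```c
#include <stdint.h>

/* Combine a 40-bit Developer ID and a 16-bit developer-chosen
 * Filter ID with the static prefix 0x3F in bits 56-62. */
static uint64_t
custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
	return (UINT64_C(0x3F) << 56)
		| ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
		| filter_id;
}
```

For example, the Developer ID 0x0123456789 and the Filter ID 0x0001
give 0x3F01_2345_6789_0001, which stays below the range reserved
for implementation-specific internal use in Section 3.1.5.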

5.4.1. Reserved Custom Filter ID Ranges

    0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
    0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility

6. Cyclic Redundancy Checks

There are several incompatible ways to calculate CRC32
and CRC64. For simplicity and clarity, complete examples are
provided to calculate the checks as they are used in this file
format. Implementations MAY use different code as long as it
gives identical results.

The program below reads data from standard input, calculates
the CRC32 and CRC64 values, and prints the calculated values
as big endian hexadecimal strings to standard output.

    #include <sys/types.h>
    #include <inttypes.h>

    uint32_t crc32_table[256];
    uint64_t crc64_table[256];

        static const uint32_t poly32 = UINT32_C(0xEDB88320);
        static const uint64_t poly64
                = UINT64_C(0xC96C5795D7870F42);

        for (size_t i = 0; i < 256; ++i) {
            for (size_t j = 0; j < 8; ++j) {
                    crc32 = (crc32 >> 1) ^ poly32;
                    crc64 = (crc64 >> 1) ^ poly64;
            crc32_table[i] = crc32;
            crc64_table[i] = crc64;

    crc32(const uint8_t *buf, size_t size, uint32_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc32_table[buf[i] ^ (crc & 0xFF)]

    crc64(const uint8_t *buf, size_t size, uint64_t crc)
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)]

        uint32_t value32 = 0;
        uint64_t value64 = 0;
        uint64_t total_size = 0;

            const size_t buf_size = fread(buf, 1, 8192, stdin);

            total_size += buf_size;
            value32 = crc32(buf, buf_size, value32);
            value64 = crc64(buf, buf_size, value64);

        printf("Bytes:  %" PRIu64 "\n", total_size);
        printf("CRC-32: 0x%08" PRIX32 "\n", value32);
        printf("CRC-64: 0x%016" PRIX64 "\n", value64);
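
For quick experiments, the table-driven calculation can also be
written as a small set of functions without the input loop. This is
a sketch under stated assumptions: the names init_tables, xz_crc32,
and xz_crc64 are invented for this example, and the initial and
final inversion of the CRC value follows the common convention for
these polynomials.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];
static uint64_t crc64_table[256];

/* Fill both lookup tables. Must be called once before use. */
static void
init_tables(void)
{
	static const uint32_t poly32 = UINT32_C(0xEDB88320);
	static const uint64_t poly64 = UINT64_C(0xC96C5795D7870F42);

	for (size_t i = 0; i < 256; ++i) {
		uint32_t crc32 = (uint32_t)i;
		uint64_t crc64 = i;

		for (size_t j = 0; j < 8; ++j) {
			crc32 = (crc32 & 1)
				? (crc32 >> 1) ^ poly32 : crc32 >> 1;
			crc64 = (crc64 & 1)
				? (crc64 >> 1) ^ poly64 : crc64 >> 1;
		}

		crc32_table[i] = crc32;
		crc64_table[i] = crc64;
	}
}

/* Pass 0 as crc for the first call and the previous return value
 * when continuing over more data. */
static uint32_t
xz_crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
	crc = ~crc;
	for (size_t i = 0; i < size; ++i)
		crc = crc32_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
	return ~crc;
}

static uint64_t
xz_crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
	crc = ~crc;
	for (size_t i = 0; i < size; ++i)
		crc = crc64_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
	return ~crc;
}
```

A common sanity check is the nine-byte ASCII string "123456789",
for which these functions return 0xCBF43926 and 0x995DC9BBDF1939FA.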

LZMA SDK - The original LZMA implementation
http://7-zip.org/sdk.html

LZMA Utils - LZMA adapted to POSIX-like systems
http://tukaani.org/lzma/

[RFC-1952]
GZIP file format specification version 4.3
http://www.ietf.org/rfc/rfc1952.txt
  - Notation of byte boxes in section "2.1. Overall conventions"

[RFC-2119]
Key words for use in RFCs to Indicate Requirement Levels
http://www.ietf.org/rfc/rfc2119.txt

http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
  - Node 9.4.2 "Blocking Factor", paragraph that begins
    "gzip will complain about trailing garbage"
  - Note that this URL points to the latest version of the
    manual, and may some day not contain the note which is in
    1.20. For the exact version of the manual, download GNU
    tar 1.20: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.20.tar.gz