doc/file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.0.0 (2009-01-14)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <http://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95         1.0.0     2008-01-14    The first official version
  96
  97
  98 1. Conventions
  99
 100         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 101         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 102         document are to be interpreted as described in [RFC-2119].
 103
 104         Indicating a warning means displaying a message, returning
 105         appropriate exit status, or doing something else to let the
 106         user know that something worth warning occurred. The operation
 107         SHOULD still finish if a warning is indicated.
 108
 109         Indicating an error means displaying a message, returning
 110         appropriate exit status, or doing something else to let the
 111         user know that something prevented successfully finishing the
 112         operation. The operation MUST be aborted once an error has
 113         been indicated.
 114
 115
 116 1.1. Byte and Its Representation
 117
 118         In this document, byte is always 8 bits.
 119
 120         A "null byte" has all bits unset. That is, the value of a null
 121         byte is 0x00.
 122
 123         To represent byte blocks, this document uses notation that
 124         is similar to the notation used in [RFC-1952]:
 125
 126             +-------+
 127             |  Foo  |   One byte.
 128             +-------+
 129
 130             +---+---+
 131             |  Foo  |   Two bytes; that is, some of the vertical bars
 132             +---+---+   can be missing.
 133
 134             +=======+
 135             |  Foo  |   Zero or more bytes.
 136             +=======+
 137
 138         In this document, a boxed byte or a byte sequence declared
 139         using this notation is called "a field". The example field
 140         above would be called "the Foo field" or plain "Foo".
 141
 142         If there are many fields, they may be split to multiple lines.
 143         This is indicated with an arrow ("--->"):
 144
 145             +=====+
 146             | Foo |
 147             +=====+
 148
 149                  +=====+
 150             ---> | Bar |
 151                  +=====+
 152
 153         The above is equivalent to this:
 154
 155             +=====+=====+
 156             | Foo | Bar |
 157             +=====+=====+
 158
 159
 160 1.2. Multibyte Integers
 161
 162         Multibyte integers of static length, such as CRC values,
 163         are stored in little endian byte order (least significant
 164         byte first).
 165
 166         When smaller values are more likely than bigger values (for
 167         example file sizes), multibyte integers are encoded in a
 168         variable-length representation:
 169           - Numbers in the range [0, 127] are copied as is, and take
 170             one byte of space.
 171           - Bigger numbers will occupy two or more bytes. All but the
 172             last byte of the multibyte representation have the highest
 173             (eighth) bit set.
 174
 175         For now, the value of the variable-length integers is limited
 176         to 63 bits, which limits the encoded size of the integer to
 177         nine bytes. These limits may be increased in future if needed.
 178
 179         The following C code illustrates encoding and decoding of
 180         variable-length integers. The functions return the number of
 181         bytes occupied by the integer (1-9), or zero on error.
 182
 183             #include <stddef.h>
 184             #include <inttypes.h>
 185
 186             size_t
 187             encode(uint8_t buf[static 9], uint64_t num)
 188             {
 189                 if (num > UINT64_MAX / 2)
 190                     return 0;
 191
 192                 size_t i = 0;
 193
 194                 while (num >= 0x80) {
 195                     buf[i++] = (uint8_t)(num) | 0x80;
 196                     num >>= 7;
 197                 }
 198
 199                 buf[i++] = (uint8_t)(num);
 200
 201                 return i;
 202             }
 203
 204             size_t
 205             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 206             {
 207                 if (size_max == 0)
 208                     return 0;
 209
 210                 if (size_max > 9)
 211                     size_max = 9;
 212
 213                 *num = buf[0] & 0x7F;
 214                 size_t i = 0;
 215
 216                 while (buf[i++] & 0x80) {
 217                     if (i >= size_max || buf[i] == 0x00)
 218                         return 0;
 219
 220                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 221                 }
 222
 223                 return i;
 224             }
 225
 226
 227 2. Overall Structure of .xz File
 228
 229         A standalone .xz files consist of one or more Streams which may
 230         have Stream Padding between or after them:
 231
 232             +========+================+========+================+
 233             | Stream | Stream Padding | Stream | Stream Padding | ...
 234             +========+================+========+================+
 235
 236         While a typical file contains only one Stream and no Stream
 237         Padding, a decoder handling standalone .xz files SHOULD support
 238         files that have more than one Stream or Stream Padding.
 239
 240         In contrast to standalone .xz files, when the .xz file format
 241         is used as an internal part of some other file format or
 242         communication protocol, it usually is expected that the decoder
 243         stops after the first Stream, and doesn't look for Stream
 244         Padding or possibly other Streams.
 245
 246
 247 2.1. Stream
 248
 249         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 250         |     Stream Header     | Block | Block | ... | Block |
 251         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 252
 253              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 254         ---> | Index |     Stream Footer     |
 255              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 256
 257         All the above fields have a size that is a multiple of four. If
 258         Stream is used as an internal part of another file format, it
 259         is RECOMMENDED to make the Stream start at an offset that is
 260         a multiple of four bytes.
 261
 262         Stream Header, Index, and Stream Footer are always present in
 263         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 264
 265         There are zero or more Blocks. The maximum number of Blocks is
 266         limited only by the maximum size of the Index field.
 267
 268         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 269         The same limit applies to the total amount of uncompressed
 270         data stored in a Stream.
 271
 272         If an implementation supports handling .xz files with multiple
 273         concatenated Streams, it MAY apply the above limits to the file
 274         as a whole instead of limiting per Stream basis.
 275
 276
 277 2.1.1. Stream Header
 278
 279         +---+---+---+---+---+---+-------+------+--+--+--+--+
 280         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 281         +---+---+---+---+---+---+-------+------+--+--+--+--+
 282
 283
 284 2.1.1.1. Header Magic Bytes
 285
 286         The first six (6) bytes of the Stream are so called Header
 287         Magic Bytes. They can be used to identify the file type.
 288
 289             Using a C array and ASCII:
 290             const uint8_t HEADER_MAGIC[6]
 291                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 292
 293             In plain hexadecimal:
 294             FD 37 7A 58 5A 00
 295
 296         Notes:
 297           - The first byte (0xFD) was chosen so that the files cannot
 298             be erroneously detected as being in .lzma format, in which
 299             the first byte is in the range [0x00, 0xE0].
 300           - The sixth byte (0x00) was chosen to prevent applications
 301             from misdetecting the file as a text file.
 302
 303         If the Header Magic Bytes don't match, the decoder MUST
 304         indicate an error.
 305
 306
 307 2.1.1.2. Stream Flags
 308
 309         The first byte of Stream Flags is always a null byte. In future
 310         this byte may be used to indicate new Stream version or other
 311         Stream properties.
 312
 313         The second byte of Stream Flags is a bit field:
 314
 315             Bit(s)  Mask  Description
 316              0-3    0x0F  Type of Check (see Section 3.4):
 317                               ID    Size      Check name
 318                               0x00   0 bytes  None
 319                               0x01   4 bytes  CRC32
 320                               0x02   4 bytes  (Reserved)
 321                               0x03   4 bytes  (Reserved)
 322                               0x04   8 bytes  CRC64
 323                               0x05   8 bytes  (Reserved)
 324                               0x06   8 bytes  (Reserved)
 325                               0x07  16 bytes  (Reserved)
 326                               0x08  16 bytes  (Reserved)
 327                               0x09  16 bytes  (Reserved)
 328                               0x0A  32 bytes  SHA-256
 329                               0x0B  32 bytes  (Reserved)
 330                               0x0C  32 bytes  (Reserved)
 331                               0x0D  64 bytes  (Reserved)
 332                               0x0E  64 bytes  (Reserved)
 333                               0x0F  64 bytes  (Reserved)
 334              4-7    0xF0  Reserved for future use; MUST be zero for now.
 335
 336         Implementations SHOULD support at least the Check IDs 0x00
 337         (None) and 0x01 (CRC32). Supporting other Check IDs is
 338         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 339         indicate a warning or error.
 340
 341         If any reserved bit is set, the decoder MUST indicate an error.
 342         It is possible that there is a new field present which the
 343         decoder is not aware of, and can thus parse the Stream Header
 344         incorrectly.
 345
 346
 347 2.1.1.3. CRC32
 348
 349         The CRC32 is calculated from the Stream Flags field. It is
 350         stored as an unsigned 32-bit little endian integer. If the
 351         calculated value does not match the stored one, the decoder
 352         MUST indicate an error.
 353
 354         The idea is that Stream Flags would always be two bytes, even
 355         if new features are needed. This way old decoders will be able
 356         to verify the CRC32 calculated from Stream Flags, and thus
 357         distinguish between corrupt files (CRC32 doesn't match) and
 358         files that the decoder doesn't support (CRC32 matches but
 359         Stream Flags has reserved bits set).
 360
 361
 362 2.1.2. Stream Footer
 363
 364         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 365         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 366         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 367
 368
 369 2.1.2.1. CRC32
 370
 371         The CRC32 is calculated from the Backward Size and Stream Flags
 372         fields. It is stored as an unsigned 32-bit little endian
 373         integer. If the calculated value does not match the stored one,
 374         the decoder MUST indicate an error.
 375
 376         The reason to have the CRC32 field before the Backward Size and
 377         Stream Flags fields is to keep the four-byte fields aligned to
 378         a multiple of four bytes.
 379
 380
 381 2.1.2.2. Backward Size
 382
 383         Backward Size is stored as a 32-bit little endian integer,
 384         which indicates the size of the Index field as multiple of
 385         four bytes, minimum value being four bytes:
 386
 387             real_backward_size = (stored_backward_size + 1) * 4;
 388
 389         If the stored value does not match the real size of the Index
 390         field, the decoder MUST indicate an error.
 391
 392         Using a fixed-size integer to store Backward Size makes
 393         it slightly simpler to parse the Stream Footer when the
 394         application needs to parse the Stream backwards.
 395
 396
 397 2.1.2.3. Stream Flags
 398
 399         This is a copy of the Stream Flags field from the Stream
 400         Header. The information stored to Stream Flags is needed
 401         when parsing the Stream backwards. The decoder MUST compare
 402         the Stream Flags fields in both Stream Header and Stream
 403         Footer, and indicate an error if they are not identical.
 404
 405
 406 2.1.2.4. Footer Magic Bytes
 407
 408         As the last step of the decoding process, the decoder MUST
 409         verify the existence of Footer Magic Bytes. If they don't
 410         match, an error MUST be indicated.
 411
 412             Using a C array and ASCII:
 413             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 414
 415             In hexadecimal:
 416             59 5A
 417
 418         The primary reason to have Footer Magic Bytes is to make
 419         it easier to detect incomplete files quickly, without
 420         uncompressing. If the file does not end with Footer Magic Bytes
 421         (excluding Stream Padding described in Section 2.2), it cannot
 422         be undamaged, unless someone has intentionally appended garbage
 423         after the end of the Stream.
 424
 425
 426 2.2. Stream Padding
 427
 428         Only the decoders that support decoding of concatenated Streams
 429         MUST support Stream Padding.
 430
 431         Stream Padding MUST contain only null bytes. To preserve the
 432         four-byte alignment of consecutive Streams, the size of Stream
 433         Padding MUST be a multiple of four bytes. Empty Stream Padding
 434         is allowed.
 435
 436         Note that non-empty Stream Padding is allowed at the end of the
 437         file; there doesn't need to be a new Stream after non-empty
 438         Stream Padding. This can be convenient in certain situations
 439         [GNU-tar].
 440
 441         The possibility of Stream Padding MUST be taken into account
 442         when designing an application that parses Streams backwards,
 443         and the application supports concatenated Streams.
 444
 445
 446 3. Block
 447
 448         +==============+=================+===============+=======+
 449         | Block Header | Compressed Data | Block Padding | Check |
 450         +==============+=================+===============+=======+
 451
 452
 453 3.1. Block Header
 454
 455         +-------------------+-------------+=================+
 456         | Block Header Size | Block Flags | Compressed Size |
 457         +-------------------+-------------+=================+
 458
 459              +===================+======================+
 460         ---> | Uncompressed Size | List of Filter Flags |
 461              +===================+======================+
 462
 463              +================+--+--+--+--+
 464         ---> | Header Padding |   CRC32   |
 465              +================+--+--+--+--+
 466
 467
 468 3.1.1. Block Header Size
 469
 470         This field overlaps with the Index Indicator field (see
 471         Section 4.1).
 472
 473         This field contains the size of the Block Header field,
 474         including the Block Header Size field itself. Valid values are
 475         in the range [0x01, 0xFF], which indicate the size of the Block
 476         Header as multiples of four bytes, minimum size being eight
 477         bytes:
 478
 479             real_header_size = (encoded_header_size + 1) * 4;
 480
 481         If bigger Block Header is needed in future, a new field can be
 482         added between the current Block Header and Compressed Data
 483         fields. The presence of this new field would be indicated in
 484         the Block Header.
 485
 486
 487 3.1.2. Block Flags
 488
 489         The first byte of the Block Flags field is a bit field:
 490
 491             Bit(s)  Mask  Description
 492              0-1    0x03  Number of filters (1-4)
 493              2-5    0x3C  Reserved for future use; MUST be zero for now.
 494               6     0x40  The Compressed Size field is present.
 495               7     0x80  The Uncompressed Size field is present.
 496
 497         If any reserved bit is set, the decoder MUST indicate an error.
 498         It is possible that there is a new field present which the
 499         decoder is not aware of, and can thus parse the Block Header
 500         incorrectly.
 501
 502
 503 3.1.3. Compressed Size
 504
 505         This field is present only if the appropriate bit is set in
 506         the Block Flags field (see Section 3.1.2).
 507
 508         The Compressed Size field contains the size of the Compressed
 509         Data field, which MUST be non-zero. Compressed Size is stored
 510         using the encoding described in Section 1.2. If the Compressed
 511         Size doesn't match the size of the Compressed Data field, the
 512         decoder MUST indicate an error.
 513
 514
 515 3.1.4. Uncompressed Size
 516
 517         This field is present only if the appropriate bit is set in
 518         the Block Flags field (see Section 3.1.2).
 519
 520         The Uncompressed Size field contains the size of the Block
 521         after uncompressing. Uncompressed Size is stored using the
 522         encoding described in Section 1.2. If the Uncompressed Size
 523         does not match the real uncompressed size, the decoder MUST
 524         indicate an error.
 525
 526         Storing the Compressed Size and Uncompressed Size fields serves
 527         several purposes:
 528           - The decoder knows how much memory it needs to allocate
 529             for a temporary buffer in multithreaded mode.
 530           - Simple error detection: wrong size indicates a broken file.
 531           - Seeking forwards to a specific location in streamed mode.
 532
 533         It should be noted that the only reliable way to determine
 534         the real uncompressed size is to uncompress the Block,
 535         because the Block Header and Index fields may contain
 536         (intentionally or unintentionally) invalid information.
 537
 538
 539 3.1.5. List of Filter Flags
 540
 541         +================+================+     +================+
 542         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 543         +================+================+     +================+
 544
 545         The number of Filter Flags fields is stored in the Block Flags
 546         field (see Section 3.1.2).
 547
 548         The format of each Filter Flags field is as follows:
 549
 550             +===========+====================+===================+
 551             | Filter ID | Size of Properties | Filter Properties |
 552             +===========+====================+===================+
 553
 554         Both Filter ID and Size of Properties are stored using the
 555         encoding described in Section 1.2. Size of Properties indicates
 556         the size of the Filter Properties field as bytes. The list of
 557         officially defined Filter IDs and the formats of their Filter
 558         Properties are described in Section 5.3.
 559
 560         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 561         (2^62) are reserved for implementation-specific internal use.
 562         These Filter IDs MUST never be used in List of Filter Flags.
 563
 564
 565 3.1.6. Header Padding
 566
 567         This field contains as many null byte as it is needed to make
 568         the Block Header have the size specified in Block Header Size.
 569         If any of the bytes are not null bytes, the decoder MUST
 570         indicate an error. It is possible that there is a new field
 571         present which the decoder is not aware of, and can thus parse
 572         the Block Header incorrectly.
 573
 574
 575 3.1.7. CRC32
 576
 577         The CRC32 is calculated over everything in the Block Header
 578         field except the CRC32 field itself. It is stored as an
 579         unsigned 32-bit little endian integer. If the calculated
 580         value does not match the stored one, the decoder MUST indicate
 581         an error.
 582
 583         By verifying the CRC32 of the Block Header before parsing the
 584         actual contents allows the decoder to distinguish between
 585         corrupt and unsupported files.
 586
 587
 588 3.2. Compressed Data
 589
 590         The format of Compressed Data depends on Block Flags and List
 591         of Filter Flags. Excluding the descriptions of the simplest
 592         filters in Section 5.3, the format of the filter-specific
 593         encoded data is out of scope of this document.
 594
 595
 596 3.3. Block Padding
 597
 598         Block Padding MUST contain 0-3 null bytes to make the size of
 599         the Block a multiple of four bytes. This can be needed when
 600         the size of Compressed Data is not a multiple of four.
 601
 602
 603 3.4. Check
 604
 605         The type and size of the Check field depends on which bits
 606         are set in the Stream Flags field (see Section 2.1.1.2).
 607
 608         The Check, when used, is calculated from the original
 609         uncompressed data. If the calculated Check does not match the
 610         stored one, the decoder MUST indicate an error. If the selected
 611         type of Check is not supported by the decoder, it SHOULD
 612         indicate a warning or error.
 613
 614
 615 4. Index
 616
 617         +-----------------+===================+
 618         | Index Indicator | Number of Records |
 619         +-----------------+===================+
 620
 621              +=================+===============+-+-+-+-+
 622         ---> | List of Records | Index Padding | CRC32 |
 623              +=================+===============+-+-+-+-+
 624
 625         Index serves several purporses. Using it, one can
 626           - verify that all Blocks in a Stream have been processed;
 627           - find out the uncompressed size of a Stream; and
 628           - quickly access the beginning of any Block (random access).
 629
 630
 631 4.1. Index Indicator
 632
 633         This field overlaps with the Block Header Size field (see
 634         Section 3.1.1). The value of Index Indicator is always 0x00.
 635
 636
 637 4.2. Number of Records
 638
 639         This field indicates how many Records there are in the List
 640         of Records field, and thus how many Blocks there are in the
 641         Stream. The value is stored using the encoding described in
 642         Section 1.2. If the decoder has decoded all the Blocks of the
 643         Stream, and then notices that the Number of Records doesn't
 644         match the real number of Blocks, the decoder MUST indicate an
 645         error.
 646
 647
 648 4.3. List of Records
 649
 650         List of Records consists of as many Records as indicated by the
 651         Number of Records field:
 652
 653             +========+========+
 654             | Record | Record | ...
 655             +========+========+
 656
 657         Each Record contains information about one Block:
 658
 659             +===============+===================+
 660             | Unpadded Size | Uncompressed Size |
 661             +===============+===================+
 662
 663         If the decoder has decoded all the Blocks of the Stream, it
 664         MUST verify that the contents of the Records match the real
 665         Unpadded Size and Uncompressed Size of the respective Blocks.
 666
 667         Implementation hint: It is possible to verify the Index with
 668         constant memory usage by calculating for example SHA-256 of
 669         both the real size values and the List of Records, then
 670         comparing the hash values. Implementing this using
 671         non-cryptographic hash like CRC32 SHOULD be avoided unless
 672         small code size is important.
 673
 674         If the decoder supports random-access reading, it MUST verify
 675         that Unpadded Size and Uncompressed Size of every completely
 676         decoded Block match the sizes stored in the Index. If only
 677         partial Block is decoded, the decoder MUST verify that the
 678         processed sizes don't exceed the sizes stored in the Index.
 679
 680
 681 4.3.1. Unpadded Size
 682
 683         This field indicates the size of the Block excluding the Block
 684         Padding field. That is, Unpadded Size is the size of the Block
 685         Header, Compressed Data, and Check fields. Unpadded Size is
 686         stored using the encoding described in Section 1.2. The value
 687         MUST never be zero; with the current structure of Blocks, the
 688         actual minimum value for Unpadded Size is five.
 689
 690         Implementation note: Because the size of the Block Padding
 691         field is not included in Unpadded Size, calculating the total
 692         size of a Stream or doing random-access reading requires
 693         calculating the actual size of the Blocks by rounding Unpadded
 694         Sizes up to the next multiple of four.
 695
 696         The reason to exclude Block Padding from Unpadded Size is to
 697         ease making a raw copy of Compressed Data without Block
 698         Padding. This can be useful, for example, if someone wants
 699         to convert Streams to some other file format quickly.
 700
 701
 702 4.3.2. Uncompressed Size
 703
 704         This field indicates the Uncompressed Size of the respective
 705         Block as bytes. The value is stored using the encoding
 706         described in Section 1.2.
 707
 708
 709 4.4. Index Padding
 710
 711         This field MUST contain 0-3 null bytes to pad the Index to
 712         a multiple of four bytes.
 713
 714
 715 4.5. CRC32
 716
 717         The CRC32 is calculated over everything in the Index field
 718         except the CRC32 field itself. The CRC32 is stored as an
 719         unsigned 32-bit little endian integer. If the calculated
 720         value does not match the stored one, the decoder MUST indicate
 721         an error.
 722
 723
 724 5. Filter Chains
 725
 726         The Block Flags field defines how many filters are used. When
 727         more than one filter is used, the filters are chained; that is,
 728         the output of one filter is the input of another filter. The
 729         following figure illustrates the direction of data flow.
 730
 731                     v   Uncompressed Data   ^
 732                     |       Filter 0        |
 733             Encoder |       Filter 1        | Decoder
 734                     |       Filter n        |
 735                     v    Compressed Data    ^
 736
 737
 738 5.1. Alignment
 739
 740         Alignment of uncompressed input data is usually the job of
 741         the application producing the data. For example, to get the
 742         best results, an archiver tool should make sure that all
 743         PowerPC executable files in the archive stream start at
 744         offsets that are multiples of four bytes.
 745
 746         Some filters, for example LZMA2, can be configured to take
 747         advantage of specified alignment of input data. Note that
 748         taking advantage of aligned input can be benefical also when
 749         a filter is not the first filter in the chain. For example,
 750         if you compress PowerPC executables, you may want to use the
 751         PowerPC filter and chain that with the LZMA2 filter. Because
 752         not only the input but also the output alignment of the PowerPC
 753         filter is four bytes, it is now benefical to set LZMA2 settings
 754         so that the LZMA2 encoder can take advantage of its
 755         four-byte-aligned input data.
 756
 757         The output of the last filter in the chain is stored to the
 758         Compressed Data field, which is is guaranteed to be aligned
 759         to a multiple of four bytes relative to the beginning of the
 760         Stream. This can increase
 761           - speed, if the filtered data is handled multiple bytes at
 762             a time by the filter-specific encoder and decoder,
 763             because accessing aligned data in computer memory is
 764             usually faster; and
 765           - compression ratio, if the output data is later compressed
 766             with an external compression tool.
 767
 768
 769 5.2. Security
 770
 771         If filters would be allowed to be chained freely, it would be
 772         possible to create malicious files, that would be very slow to
 773         decode. Such files could be used to create denial of service
 774         attacks.
 775
 776         Slow files could occur when multiple filters are chained:
 777
 778             v   Compressed input data
 779             |   Filter 1 decoder (last filter)
 780             |   Filter 0 decoder (non-last filter)
 781             v   Uncompressed output data
 782
 783         The decoder of the last filter in the chain produces a lot of
 784         output from little input. Another filter in the chain takes the
 785         output of the last filter, and produces very little output
 786         while consuming a lot of input. As a result, a lot of data is
 787         moved inside the filter chain, but the filter chain as a whole
 788         gets very little work done.
 789
 790         To prevent this kind of slow files, there are restrictions on
 791         how the filters can be chained. These restrictions MUST be
 792         taken into account when designing new filters.
 793
 794         The maximum number of filters in the chain has been limited to
 795         four, thus there can be at maximum of three non-last filters.
 796         Of these three non-last filters, only two are allowed to change
 797         the size of the data.
 798
 799         The non-last filters, that change the size of the data, MUST
 800         have a limit how much the decoder can compress the data: the
 801         decoder SHOULD produce at least n bytes of output when the
 802         filter is given 2n bytes of input. This  limit is not
 803         absolute, but significant deviations MUST be avoided.
 804
 805         The above limitations guarantee that if the last filter in the
 806         chain produces 4n bytes of output, the chain as a whole will
 807         produce at least n bytes of output.
 808
 809
 810 5.3. Filters
 811
 812 5.3.1. LZMA2
 813
 814         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
 815         compression algorithm with high compression ratio and fast
 816         decompression. LZMA is based on LZ77 and range coding
 817         algorithms.
 818
 819         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 820         LZMA internally, but adds support for flushing the encoder,
 821         uncompressed chunks, eases stateful decoder implementations,
 822         and improves support for multithreading. Thus, the plain LZMA
 823         will not be supported in this file format.
 824
 825             Filter ID:                  0x21
 826             Size of Filter Properties:  1 byte
 827             Changes size of data:       Yes
 828             Allow as a non-last filter: No
 829             Allow as the last filter:   Yes
 830
 831             Preferred alignment:
 832                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 833                 Output data:            1 byte
 834
 835         The format of the one-byte Filter Properties field is as
 836         follows:
 837
 838             Bits   Mask   Description
 839             0-5    0x3F   Dictionary Size
 840             6-7    0xC0   Reserved for future use; MUST be zero for now.
 841
 842         Dictionary Size is encoded with one-bit mantissa and five-bit
 843         exponent. The smallest dictionary size is 4 KiB and the biggest
 844         is 4 GiB.
 845
 846             Raw value   Mantissa   Exponent   Dictionary size
 847                 0           2         11         4 KiB
 848                 1           3         11         6 KiB
 849                 2           2         12         8 KiB
 850                 3           3         12        12 KiB
 851                 4           2         13        16 KiB
 852                 5           3         13        24 KiB
 853                 6           2         14        32 KiB
 854               ...         ...        ...      ...
 855                35           3         27       768 MiB
 856                36           2         28      1024 MiB
 857                37           3         29      1536 MiB
 858                38           2         30      2048 MiB
 859                39           3         30      3072 MiB
 860                40           2         31      4096 MiB - 1 B
 861
 862         Instead of having a table in the decoder, the dictionary size
 863         can be decoded using the following C code:
 864
 865             const uint8_t bits = get_dictionary_flags() & 0x3F;
 866             if (bits > 40)
 867                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 868
 869             uint32_t dictionary_size;
 870             if (bits == 40) {
 871                 dictionary_size = UINT32_MAX;
 872             } else {
 873                 dictionary_size = 2 | (bits & 1);
 874                 dictionary_size <<= bits / 2 + 11;
 875             }
 876
 877
 878 5.3.2. Branch/Call/Jump Filters for Executables
 879
 880         These filters convert relative branch, call, and jump
 881         instructions to their absolute counterparts in executable
 882         files. This conversion increases redundancy and thus
 883         compression ratio.
 884
 885             Size of Filter Properties:  0 or 4 bytes
 886             Changes size of data:       No
 887             Allow as a non-last filter: Yes
 888             Allow as the last filter:   No
 889
 890         Below is the list of filters in this category. The alignment
 891         is the same for both input and output data.
 892
 893             Filter ID   Alignment   Description
 894               0x04       1 byte     x86 filter (BCJ)
 895               0x05       4 bytes    PowerPC (big endian) filter
 896               0x06      16 bytes    IA64 filter
 897               0x07       4 bytes    ARM (little endian) filter
 898               0x08       2 bytes    ARM Thumb (little endian) filter
 899               0x09       4 bytes    SPARC filter
 900
 901         If the size of Filter Properties is four bytes, the Filter
 902         Properties field contains the start offset used for address
 903         conversions. It is stored as an unsigned 32-bit little endian
 904         integer. If the size of Filter Properties is zero, the start
 905         offset is zero.
 906
 907         Setting the start offset may be useful if an executable has
 908         multiple sections, and there are many cross-section calls.
 909         Taking advantage of this feature usually requires usage of
 910         the Subblock filter, whose design is not complete yet.
 911
 912
 913 5.3.3. Delta
 914
 915         The Delta filter may increase compression ratio when the value
 916         of the next byte correlates with the value of an earlier byte
 917         at specified distance.
 918
 919             Filter ID:                  0x03
 920             Size of Filter Properties:  1 byte
 921             Changes size of data:       No
 922             Allow as a non-last filter: Yes
 923             Allow as the last filter:   No
 924
 925             Preferred alignment:
 926                 Input data:             1 byte
 927                 Output data:            Same as the original input data
 928
 929         The Properties byte indicates the delta distance, which can be
 930         1-256 bytes backwards from the current byte: 0x00 indicates
 931         distance of 1 byte and 0xFF distance of 256 bytes.
 932
 933
 934 5.3.3.1. Format of the Encoded Output
 935
 936         The code below illustrates both encoding and decoding with
 937         the Delta filter.
 938
 939             // Distance is in the range [1, 256].
 940             const unsigned int distance = get_properties_byte() + 1;
 941             uint8_t pos = 0;
 942             uint8_t delta[256];
 943
 944             memset(delta, 0, sizeof(delta));
 945
 946             while (1) {
 947                 const int byte = read_byte();
 948                 if (byte == EOF)
 949                     break;
 950
 951                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 952                 if (is_encoder) {
 953                     tmp = (uint8_t)(byte) - tmp;
 954                     delta[pos] = (uint8_t)(byte);
 955                 } else {
 956                     tmp = (uint8_t)(byte) + tmp;
 957                     delta[pos] = tmp;
 958                 }
 959
 960                 write_byte(tmp);
 961                 --pos;
 962             }
 963
 964
 965 5.4. Custom Filter IDs
 966
 967         If a developer wants to use custom Filter IDs, he has two
 968         choices. The first choice is to contact Lasse Collin and ask
 969         him to allocate a range of IDs for the developer.
 970
 971         The second choice is to generate a 40-bit random integer,
 972         which the developer can use as his personal Developer ID.
 973         To minimalize the risk of collisions, Developer ID has to be
 974         a randomly generated integer, not manually selected "hex word".
 975         The following command, which works on many free operating
 976         systems, can be used to generate Developer ID:
 977
 978             dd if=/dev/urandom bs=5 count=1 | hexdump
 979
 980         The developer can then use his Developer ID to create unique
 981         (well, hopefully unique) Filter IDs.
 982
 983             Bits    Mask                    Description
 984              0-15   0x0000_0000_0000_FFFF   Filter ID
 985             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
 986             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
 987
 988         The resulting 63-bit integer will use 9 bytes of space when
 989         stored using the encoding described in Section 1.2. To get
 990         a shorter ID, see the beginning of this Section how to
 991         request a custom ID range.
 992
 993
 994 5.4.1. Reserved Custom Filter ID Ranges
 995
 996         Range                       Description
 997         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
 998         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
 999         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1000
1001
1002 6. Cyclic Redundancy Checks
1003
1004         There are several incompatible variations to calculate CRC32
1005         and CRC64. For simplicity and clarity, complete examples are
1006         provided to calculate the checks as they are used in this file
1007         format. Implementations MAY use different code as long as it
1008         gives identical results.
1009
1010         The program below reads data from standard input, calculates
1011         the CRC32 and CRC64 values, and prints the calculated values
1012         as big endian hexadecimal strings to standard output.
1013
1014             #include <stddef.h>
1015             #include <inttypes.h>
1016             #include <stdio.h>
1017
1018             uint32_t crc32_table[256];
1019             uint64_t crc64_table[256];
1020
1021             void
1022             init(void)
1023             {
1024                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1025                 static const uint64_t poly64
1026                         = UINT64_C(0xC96C5795D7870F42);
1027
1028                 for (size_t i = 0; i < 256; ++i) {
1029                     uint32_t crc32 = i;
1030                     uint64_t crc64 = i;
1031
1032                     for (size_t j = 0; j < 8; ++j) {
1033                         if (crc32 & 1)
1034                             crc32 = (crc32 >> 1) ^ poly32;
1035                         else
1036                             crc32 >>= 1;
1037
1038                         if (crc64 & 1)
1039                             crc64 = (crc64 >> 1) ^ poly64;
1040                         else
1041                             crc64 >>= 1;
1042                     }
1043
1044                     crc32_table[i] = crc32;
1045                     crc64_table[i] = crc64;
1046                 }
1047             }
1048
1049             uint32_t
1050             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1051             {
1052                 crc = ~crc;
1053                 for (size_t i = 0; i < size; ++i)
1054                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1055                             ^ (crc >> 8);
1056                 return ~crc;
1057             }
1058
1059             uint64_t
1060             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1061             {
1062                 crc = ~crc;
1063                 for (size_t i = 0; i < size; ++i)
1064                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1065                             ^ (crc >> 8);
1066                 return ~crc;
1067             }
1068
1069             int
1070             main()
1071             {
1072                 init();
1073
1074                 uint32_t value32 = 0;
1075                 uint64_t value64 = 0;
1076                 uint64_t total_size = 0;
1077                 uint8_t buf[8192];
1078
1079                 while (1) {
1080                     const size_t buf_size
1081                             = fread(buf, 1, sizeof(buf), stdin);
1082                     if (buf_size == 0)
1083                         break;
1084
1085                     total_size += buf_size;
1086                     value32 = crc32(buf, buf_size, value32);
1087                     value64 = crc64(buf, buf_size, value64);
1088                 }
1089
1090                 printf("Bytes:  %" PRIu64 "\n", total_size);
1091                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1092                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1093
1094                 return 0;
1095             }
1096
1097
1098 7. References
1099
1100         LZMA SDK - The original LZMA implementation
1101         http://7-zip.org/sdk.html
1102
1103         LZMA Utils - LZMA adapted to POSIX-like systems
1104         http://tukaani.org/lzma/
1105
1106         XZ Utils - The next generation of LZMA Utils
1107         http://tukaani.org/xz/
1108
1109         [RFC-1952]
1110         GZIP file format specification version 4.3
1111         http://www.ietf.org/rfc/rfc1952.txt
1112           - Notation of byte boxes in section "2.1. Overall conventions"
1113
1114         [RFC-2119]
1115         Key words for use in RFCs to Indicate Requirement Levels
1116         http://www.ietf.org/rfc/rfc2119.txt
1117
1118         [GNU-tar]
1119         GNU tar 1.21 manual
1120         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1121           - Node 9.4.2 "Blocking Factor", paragraph that begins
1122             "gzip will complain about trailing garbage"
1123           - Note that this URL points to the latest version of the
1124             manual, and may some day not contain the note which is in
1125             1.21. For the exact version of the manual, download GNU
1126             tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1127