doc/file-format.txt

   1
   2 The .xz File Format
   3 -------------------
   4
   5         0. Preface
   6            0.1. Notices and Acknowledgements
   7            0.2. Changes
   8         1. Conventions
   9            1.1. Byte and Its Representation
  10            1.2. Multibyte Integers
  11         2. Overall Structure of .xz File
  12            2.1. Stream
  13                 2.1.1. Stream Header
  14                        2.1.1.1. Header Magic Bytes
  15                        2.1.1.2. Stream Flags
  16                        2.1.1.3. CRC32
  17                 2.1.2. Stream Footer
  18                        2.1.2.1. CRC32
  19                        2.1.2.2. Backward Size
  20                        2.1.2.3. Stream Flags
  21                        2.1.2.4. Footer Magic Bytes
  22            2.2. Stream Padding
  23         3. Block
  24            3.1. Block Header
  25                 3.1.1. Block Header Size
  26                 3.1.2. Block Flags
  27                 3.1.3. Compressed Size
  28                 3.1.4. Uncompressed Size
  29                 3.1.5. List of Filter Flags
  30                 3.1.6. Header Padding
  31                 3.1.7. CRC32
  32            3.2. Compressed Data
  33            3.3. Block Padding
  34            3.4. Check
  35         4. Index
  36            4.1. Index Indicator
  37            4.2. Number of Records
  38            4.3. List of Records
  39                 4.3.1. Unpadded Size
  40                 4.3.2. Uncompressed Size
  41            4.4. Index Padding
  42            4.5. CRC32
  43         5. Filter Chains
  44            5.1. Alignment
  45            5.2. Security
  46            5.3. Filters
  47                 5.3.1. LZMA2
  48                 5.3.2. Branch/Call/Jump Filters for Executables
  49                 5.3.3. Delta
  50                        5.3.3.1. Format of the Encoded Output
  51            5.4. Custom Filter IDs
  52                 5.4.1. Reserved Custom Filter ID Ranges
  53         6. Cyclic Redundancy Checks
  54         7. References
  55
  56
  57 0. Preface
  58
  59         This document describes the .xz file format (filename suffix
  60         ".xz", MIME type "application/x-xz"). It is intended that this
  61         this format replace the old .lzma format used by LZMA SDK and
  62         LZMA Utils.
  63
  64         IMPORTANT:  The version described in this document is a
  65                     draft, NOT a final, official version. Changes
  66                     are possible.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to Ville
  75         Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Changes
  82
  83         Last modified: 2008-12-05 12:45+0200
  84
  85         (A changelog will be kept once the first official version
  86         is made.)
  87
  88
  89 1. Conventions
  90
  91         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
  92         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
  93         document are to be interpreted as described in [RFC-2119].
  94
  95         Indicating a warning means displaying a message, returning
  96         appropriate exit status, or doing something else to let the
  97         user know that something worth warning occurred. The operation
  98         SHOULD still finish if a warning is indicated.
  99
 100         Indicating an error means displaying a message, returning
 101         appropriate exit status, or doing something else to let the
 102         user know that something prevented successfully finishing the
 103         operation. The operation MUST be aborted once an error has
 104         been indicated.
 105
 106
 107 1.1. Byte and Its Representation
 108
 109         In this document, byte is always 8 bits.
 110
 111         A "null byte" has all bits unset. That is, the value of a null
 112         byte is 0x00.
 113
 114         To represent byte blocks, this document uses notation that
 115         is similar to the notation used in [RFC-1952]:
 116
 117             +-------+
 118             |  Foo  |   One byte.
 119             +-------+
 120
 121             +---+---+
 122             |  Foo  |   Two bytes; that is, some of the vertical bars
 123             +---+---+   can be missing.
 124
 125             +=======+
 126             |  Foo  |   Zero or more bytes.
 127             +=======+
 128
 129         In this document, a boxed byte or a byte sequence declared
 130         using this notation is called "a field". The example field
 131         above would be called "the Foo field" or plain "Foo".
 132
 133         If there are many fields, they may be split to multiple lines.
 134         This is indicated with an arrow ("--->"):
 135
 136             +=====+
 137             | Foo |
 138             +=====+
 139
 140                  +=====+
 141             ---> | Bar |
 142                  +=====+
 143
 144         The above is equivalent to this:
 145
 146             +=====+=====+
 147             | Foo | Bar |
 148             +=====+=====+
 149
 150
 151 1.2. Multibyte Integers
 152
 153         Multibyte integers of static length, such as CRC values,
 154         are stored in little endian byte order (least significant
 155         byte first).
 156
 157         When smaller values are more likely than bigger values (for
 158         example file sizes), multibyte integers are encoded in a
 159         variable-length representation:
 160           - Numbers in the range [0, 127] are copied as is, and take
 161             one byte of space.
 162           - Bigger numbers will occupy two or more bytes. All but the
 163             last byte of the multibyte representation have the highest
 164             (eighth) bit set.
 165
 166         For now, the value of the variable-length integers is limited
 167         to 63 bits, which limits the encoded size of the integer to
 168         nine bytes. These limits may be increased in future if needed.
 169
 170         The following C code illustrates encoding and decoding of
 171         variable-length integers. The functions return the number of
 172         bytes occupied by the integer (1-9), or zero on error.
 173
 174             #include <sys/types.h>
 175             #include <inttypes.h>
 176
 177             size_t
 178             encode(uint8_t buf[static 9], uint64_t num)
 179             {
 180                 if (num > UINT64_MAX / 2)
 181                     return 0;
 182
 183                 size_t i = 0;
 184
 185                 while (num >= 0x80) {
 186                     buf[i++] = (uint8_t)(num) | 0x80;
 187                     num >>= 7;
 188                 }
 189
 190                 buf[i++] = (uint8_t)(num);
 191
 192                 return i;
 193             }
 194
 195             size_t
 196             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 197             {
 198                 if (size_max == 0)
 199                     return 0;
 200
 201                 if (size_max > 9)
 202                     size_max = 9;
 203
 204                 *num = buf[0] & 0x7F;
 205                 size_t i = 0;
 206
 207                 while (buf[i++] & 0x80) {
 208                     if (i >= size_max || buf[i] == 0x00)
 209                         return 0;
 210
 211                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 212                 }
 213
 214                 return i;
 215             }
 216
 217
 218 2. Overall Structure of .xz File
 219
 220         A standalone .xz files consist of one or more Streams which may
 221         have Stream Padding between or after them:
 222
 223             +========+================+========+================+
 224             | Stream | Stream Padding | Stream | Stream Padding | ...
 225             +========+================+========+================+
 226
 227         While a typical file contains only one Stream and no Stream
 228         Padding, a decoder handling standalone .xz files SHOULD support
 229         files that have more than one Stream or Stream Padding.
 230
 231         In contrast to standalone .xz files, when the .xz file format
 232         is used as an internal part of some other file format or
 233         communication protocol, it usually is expected that the decoder
 234         stops after the first Stream, and doesn't look for Stream
 235         Padding or possibly other Streams.
 236
 237
 238 2.1. Stream
 239
 240         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 241         |     Stream Header     | Block | Block | ... | Block |
 242         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 243
 244              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 245         ---> | Index |     Stream Footer     |
 246              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 247
 248         All the above fields have a size that is a multiple of four. If
 249         Stream is used as an internal part of another file format, it
 250         is RECOMMENDED to make the Stream start at an offset that is
 251         a multiple of four bytes.
 252
 253         Stream Header, Index, and Stream Footer are always present in
 254         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 255
 256         There are zero or more Blocks. The maximum number of Blocks is
 257         limited only by the maximum size of the Index field.
 258
 259         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 260         The same limit applies to the total amount of uncompressed
 261         data stored in a Stream.
 262
 263         If an implementation supports handling .xz files with multiple
 264         concatenated Streams, it MAY apply the above limits to the file
 265         as a whole instead of limiting per Stream basis.
 266
 267
 268 2.1.1. Stream Header
 269
 270         +---+---+---+---+---+---+-------+------+--+--+--+--+
 271         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 272         +---+---+---+---+---+---+-------+------+--+--+--+--+
 273
 274
 275 2.1.1.1. Header Magic Bytes
 276
 277         The first six (6) bytes of the Stream are so called Header
 278         Magic Bytes. They can be used to identify the file type.
 279
 280             Using a C array and ASCII:
 281             const uint8_t HEADER_MAGIC[6]
 282                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 283
 284             In plain hexadecimal:
 285             FD 37 7A 58 5A 00
 286
 287         Notes:
 288           - The first byte (0xFD) was chosen so that the files cannot
 289             be erroneously detected as being in .lzma format, in which
 290             the first byte is in the range [0x00, 0xE0].
 291           - The sixth byte (0x00) was chosen to prevent applications
 292             from misdetecting the file as a text file.
 293
 294         If the Header Magic Bytes don't match, the decoder MUST
 295         indicate an error.
 296
 297
 298 2.1.1.2. Stream Flags
 299
 300         The first byte of Stream Flags is always a null byte. In future
 301         this byte may be used to indicate new Stream version or other
 302         Stream properties.
 303
 304         The second byte of Stream Flags is a bit field:
 305
 306             Bit(s)  Mask  Description
 307              0-3    0x0F  Type of Check (see Section 3.4):
 308                               ID    Size      Check name
 309                               0x00   0 bytes  None
 310                               0x01   4 bytes  CRC32
 311                               0x02   4 bytes  (Reserved)
 312                               0x03   4 bytes  (Reserved)
 313                               0x04   8 bytes  CRC64
 314                               0x05   8 bytes  (Reserved)
 315                               0x06   8 bytes  (Reserved)
 316                               0x07  16 bytes  (Reserved)
 317                               0x08  16 bytes  (Reserved)
 318                               0x09  16 bytes  (Reserved)
 319                               0x0A  32 bytes  SHA-256
 320                               0x0B  32 bytes  (Reserved)
 321                               0x0C  32 bytes  (Reserved)
 322                               0x0D  64 bytes  (Reserved)
 323                               0x0E  64 bytes  (Reserved)
 324                               0x0F  64 bytes  (Reserved)
 325              4-7    0xF0  Reserved for future use; MUST be zero for now.
 326
 327         Implementations SHOULD support at least the Check IDs 0x00
 328         (None) and 0x01 (CRC32). Supporting other Check IDs is
 329         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 330         indicate a warning or error.
 331
 332         If any reserved bit is set, the decoder MUST indicate an error.
 333         It is possible that there is a new field present which the
 334         decoder is not aware of, and can thus parse the Stream Header
 335         incorrectly.
 336
 337
 338 2.1.1.3. CRC32
 339
 340         The CRC32 is calculated from the Stream Flags field. It is
 341         stored as an unsigned 32-bit little endian integer. If the
 342         calculated value does not match the stored one, the decoder
 343         MUST indicate an error.
 344
 345         The idea is that Stream Flags would always be two bytes, even
 346         if new features are needed. This way old decoders will be able
 347         to verify the CRC32 calculated from Stream Flags, and thus
 348         distinguish between corrupt files (CRC32 doesn't match) and
 349         files that the decoder doesn't support (CRC32 matches but
 350         Stream Flags has reserved bits set).
 351
 352
 353 2.1.2. Stream Footer
 354
 355         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 356         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 357         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 358
 359
 360 2.1.2.1. CRC32
 361
 362         The CRC32 is calculated from the Backward Size and Stream Flags
 363         fields. It is stored as an unsigned 32-bit little endian
 364         integer. If the calculated value does not match the stored one,
 365         the decoder MUST indicate an error.
 366
 367         The reason to have the CRC32 field before the Backward Size and
 368         Stream Flags fields is to keep the four-byte fields aligned to
 369         a multiple of four bytes.
 370
 371
 372 2.1.2.2. Backward Size
 373
 374         Backward Size is stored as a 32-bit little endian integer,
 375         which indicates the size of the Index field as multiple of
 376         four bytes, minimum value being four bytes:
 377
 378             real_backward_size = (stored_backward_size + 1) * 4;
 379
 380         If the stored value does not match the real size of the Index
 381         field, the decoder MUST indicate an error.
 382
 383         Using a fixed-size integer to store Backward Size makes
 384         it slightly simpler to parse the Stream Footer when the
 385         application needs to parse the Stream backwards.
 386
 387
 388 2.1.2.3. Stream Flags
 389
 390         This is a copy of the Stream Flags field from the Stream
 391         Header. The information stored to Stream Flags is needed
 392         when parsing the Stream backwards. The decoder MUST compare
 393         the Stream Flags fields in both Stream Header and Stream
 394         Footer, and indicate an error if they are not identical.
 395
 396
 397 2.1.2.4. Footer Magic Bytes
 398
 399         As the last step of the decoding process, the decoder MUST
 400         verify the existence of Footer Magic Bytes. If they don't
 401         match, an error MUST be indicated.
 402
 403             Using a C array and ASCII:
 404             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 405
 406             In hexadecimal:
 407             59 5A
 408
 409         The primary reason to have Footer Magic Bytes is to make
 410         it easier to detect incomplete files quickly, without
 411         uncompressing. If the file does not end with Footer Magic Bytes
 412         (excluding Stream Padding described in Section 2.2), it cannot
 413         be undamaged, unless someone has intentionally appended garbage
 414         after the end of the Stream.
 415
 416
 417 2.2. Stream Padding
 418
 419         Only the decoders that support decoding of concatenated Streams
 420         MUST support Stream Padding.
 421
 422         Stream Padding MUST contain only null bytes. To preserve the
 423         four-byte alignment of consecutive Streams, the size of Stream
 424         Padding MUST be a multiple of four bytes. Empty Stream Padding
 425         is allowed.
 426
 427         Note that non-empty Stream Padding is allowed at the end of the
 428         file; there doesn't need to be a new Stream after non-empty
 429         Stream Padding. This can be convenient in certain situations
 430         [GNU-tar].
 431
 432         The possibility of Padding MUST be taken into account when
 433         designing an application that parses Streams backwards, and
 434         the application supports concatenated Streams.
 435
 436
 437 3. Block
 438
 439         +==============+=================+===============+=======+
 440         | Block Header | Compressed Data | Block Padding | Check |
 441         +==============+=================+===============+=======+
 442
 443
 444 3.1. Block Header
 445
 446         +-------------------+-------------+=================+
 447         | Block Header Size | Block Flags | Compressed Size |
 448         +-------------------+-------------+=================+
 449
 450              +===================+======================+
 451         ---> | Uncompressed Size | List of Filter Flags |
 452              +===================+======================+
 453
 454              +================+--+--+--+--+
 455         ---> | Header Padding |   CRC32   |
 456              +================+--+--+--+--+
 457
 458
 459 3.1.1. Block Header Size
 460
 461         This field overlaps with the Index Indicator field (see
 462         Section 4.1).
 463
 464         This field contains the size of the Block Header field,
 465         including the Block Header Size field itself. Valid values are
 466         in the range [0x01, 0xFF], which indicate the size of the Block
 467         Header as multiples of four bytes, minimum size being eight
 468         bytes:
 469
 470             real_header_size = (encoded_header_size + 1) * 4;
 471
 472         If bigger Block Header is needed in future, a new field can be
 473         added between the current Block Header and Compressed Data
 474         fields. The presence of this new field would be indicated in
 475         the Block Header.
 476
 477
 478 3.1.2. Block Flags
 479
 480         The first byte of the Block Flags field is a bit field:
 481
 482             Bit(s)  Mask  Description
 483              0-1    0x03  Number of filters (1-4)
 484              2-5    0x3C  Reserved for future use; MUST be zero for now.
 485               6     0x40  The Compressed Size field is present.
 486               7     0x80  The Uncompressed Size field is present.
 487
 488         If any reserved bit is set, the decoder MUST indicate an error.
 489         It is possible that there is a new field present which the
 490         decoder is not aware of, and can thus parse the Block Header
 491         incorrectly.
 492
 493
 494 3.1.3. Compressed Size
 495
 496         This field is present only if the appropriate bit is set in
 497         the Block Flags field (see Section 3.1.2).
 498
 499         The Compressed Size field contains the size of the Compressed
 500         Data field, which MUST be non-zero. Compressed Size is stored
 501         using the encoding described in Section 1.2. If the Compressed
 502         Size doesn't match the size of the Compressed Data field, the
 503         decoder MUST indicate an error.
 504
 505
 506 3.1.4. Uncompressed Size
 507
 508         This field is present only if the appropriate bit is set in
 509         the Block Flags field (see Section 3.1.2).
 510
 511         The Uncompressed Size field contains the size of the Block
 512         after uncompressing. Uncompressed Size is stored using the
 513         encoding described in Section 1.2. If the Uncompressed Size
 514         does not match the real uncompressed size, the decoder MUST
 515         indicate an error.
 516
 517         Storing the Compressed Size and Uncompressed Size fields serves
 518         several purposes:
 519           - The decoder knows how much memory it needs to allocate
 520             for a temporary buffer in multithreaded mode.
 521           - Simple error detection: wrong size indicates a broken file.
 522           - Seeking forwards to a specific location in streamed mode.
 523
 524         It should be noted that the only reliable way to determine
 525         the real uncompressed size is to uncompress the Block,
 526         because the Block Header and Index fields may contain
 527         (intentionally or unintentionally) invalid information.
 528
 529
 530 3.1.5. List of Filter Flags
 531
 532         +================+================+     +================+
 533         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 534         +================+================+     +================+
 535
 536         The number of Filter Flags fields is stored in the Block Flags
 537         field (see Section 3.1.2).
 538
 539         The format of each Filter Flags field is as follows:
 540
 541             +===========+====================+===================+
 542             | Filter ID | Size of Properties | Filter Properties |
 543             +===========+====================+===================+
 544
 545         Both Filter ID and Size of Properties are stored using the
 546         encoding described in Section 1.2. Size of Properties indicates
 547         the size of the Filter Properties field as bytes. The list of
 548         officially defined Filter IDs and the formats of their Filter
 549         Properties are described in Section 5.3.
 550
 551         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 552         (2^62) are reserved for implementation-specific internal use.
 553         These Filter IDs MUST never be used in List of Filter Flags.
 554
 555
 556 3.1.6. Header Padding
 557
 558         This field contains as many null byte as it is needed to make
 559         the Block Header have the size specified in Block Header Size.
 560         If any of the bytes are not null bytes, the decoder MUST
 561         indicate an error. It is possible that there is a new field
 562         present which the decoder is not aware of, and can thus parse
 563         the Block Header incorrectly.
 564
 565
 566 3.1.7. CRC32
 567
 568         The CRC32 is calculated over everything in the Block Header
 569         field except the CRC32 field itself. It is stored as an
 570         unsigned 32-bit little endian integer. If the calculated
 571         value does not match the stored one, the decoder MUST indicate
 572         an error.
 573
 574         By verifying the CRC32 of the Block Header before parsing the
 575         actual contents allows the decoder to distinguish between
 576         corrupt and unsupported files.
 577
 578
 579 3.2. Compressed Data
 580
 581         The format of Compressed Data depends on Block Flags and List
 582         of Filter Flags. Excluding the descriptions of the simplest
 583         filters in Section 5.3, the format of the filter-specific
 584         encoded data is out of scope of this document.
 585
 586
 587 3.3. Block Padding
 588
 589         Block Padding MUST contain 0-3 null bytes to make the size of
 590         the Block a multiple of four bytes. This can be needed when
 591         the size of Compressed Data is not a multiple of four.
 592
 593
 594 3.4. Check
 595
 596         The type and size of the Check field depends on which bits
 597         are set in the Stream Flags field (see Section 2.1.1.2).
 598
 599         The Check, when used, is calculated from the original
 600         uncompressed data. If the calculated Check does not match the
 601         stored one, the decoder MUST indicate an error. If the selected
 602         type of Check is not supported by the decoder, it MUST indicate
 603         a warning or error.
 604
 605
 606 4. Index
 607
 608         +-----------------+=========================+
 609         | Index Indicator | Number of Index Records |
 610         +-----------------+=========================+
 611
 612              +=================+=========+-+-+-+-+
 613         ---> | List of Records | Padding | CRC32 |
 614              +=================+=========+-+-+-+-+
 615
 616         Index serves several purporses. Using it, one can
 617           - verify that all Blocks in a Stream have been processed;
 618           - find out the uncompressed size of a Stream; and
 619           - quickly access the beginning of any Block (random access).
 620
 621
 622 4.1. Index Indicator
 623
 624         This field overlaps with the Block Header Size field (see
 625         Section 3.1.1). The value of Index Indicator is always 0x00.
 626
 627
 628 4.2. Number of Records
 629
 630         This field indicates how many Records there are in the List
 631         of Records field, and thus how many Blocks there are in the
 632         Stream. The value is stored using the encoding described in
 633         Section 1.2. If the decoder has decoded all the Blocks of the
 634         Stream, and then notices that the Number of Records doesn't
 635         match the real number of Blocks, the decoder MUST indicate an
 636         error.
 637
 638
 639 4.3. List of Records
 640
 641         List of Records consists of as many Records as indicated by the
 642         Number of Records field:
 643
 644             +========+========+
 645             | Record | Record | ...
 646             +========+========+
 647
 648         Each Record contains information about one Block:
 649
 650             +===============+===================+
 651             | Unpadded Size | Uncompressed Size |
 652             +===============+===================+
 653
 654         If the decoder has decoded all the Blocks of the Stream, it
 655         MUST verify that the contents of the Records match the real
 656         Unpadded Size and Uncompressed Size of the respective Blocks.
 657
 658         Implementation hint: It is possible to verify the Index with
 659         constant memory usage by calculating for example SHA256 of both
 660         the real size values and the List of Records, then comparing
 661         the check values. Implementing this using non-cryptographic
 662         check like CRC32 SHOULD be avoided unless small code size is
 663         important.
 664
 665         If the decoder supports random-access reading, it MUST verify
 666         that Unpadded Size and Uncompressed Size of every completely
 667         decoded Block match the sizes stored in the Index. If only
 668         partial Block is decoded, the decoder MUST verify that the
 669         processed sizes don't exceed the sizes stored in the Index.
 670
 671
 672 4.3.1. Unpadded Size
 673
 674         This field indicates the size of the Block excluding the Block
 675         Padding field. That is, Unpadded Size is the size of the Block
 676         Header, Compressed Data, and Check fields. Unpadded Size is
 677         stored using the encoding described in Section 1.2. The value
 678         MUST never be zero; with the current structure of Blocks, the
 679         actual minimum value for Unpadded Size is five.
 680
 681         Implementation note: Because the size of the Block Padding
 682         field is not included in Unpadded Size, calculating the total
 683         size of a Stream or doing random-access reading requires
 684         calculating the actual size of the Blocks by rounding Unpadded
 685         Sizes up to the next multiple of four.
 686
 687         The reason to exclude Block Padding from Unpadded Size is to
 688         ease making a raw copy of Compressed Data without Block
 689         Padding. This can be useful, for example, if someone wants
 690         to convert Streams to some other file format quickly.
 691
 692
 693 4.3.2. Uncompressed Size
 694
 695         This field indicates the Uncompressed Size of the respective
 696         Block as bytes. The value is stored using the encoding
 697         described in Section 1.2.
 698
 699
 700 4.4. Index Padding
 701
 702         This field MUST contain 0-3 null bytes to pad the Index to
 703         a multiple of four bytes.
 704
 705
 706 4.5. CRC32
 707
 708         The CRC32 is calculated over everything in the Index field
 709         except the CRC32 field itself. The CRC32 is stored as an
 710         unsigned 32-bit little endian integer. If the calculated
 711         value does not match the stored one, the decoder MUST indicate
 712         an error.
 713
 714
 715 5. Filter Chains
 716
 717         The Block Flags field defines how many filters are used. When
 718         more than one filter is used, the filters are chained; that is,
 719         the output of one filter is the input of another filter. The
 720         following figure illustrates the direction of data flow.
 721
 722                     v   Uncompressed Data   ^
 723                     |       Filter 0        |
 724             Encoder |       Filter 1        | Decoder
 725                     |       Filter n        |
 726                     v    Compressed Data    ^
 727
 728
 729 5.1. Alignment
 730
 731         Alignment of uncompressed input data is usually the job of
 732         the application producing the data. For example, to get the
 733         best results, an archiver tool should make sure that all
 734         PowerPC executable files in the archive stream start at
 735         offsets that are multiples of four bytes.
 736
 737         Some filters, for example LZMA2, can be configured to take
 738         advantage of specified alignment of input data. Note that
 739         taking advantage of aligned input can be benefical also when
 740         a filter is not the first filter in the chain. For example,
 741         if you compress PowerPC executables, you may want to use the
 742         PowerPC filter and chain that with the LZMA2 filter. Because
 743         not only the input but also the output alignment of the PowerPC
 744         filter is four bytes, it is now benefical to set LZMA2 settings
 745         so that the LZMA2 encoder can take advantage of its
 746         four-byte-aligned input data.
 747
 748         The output of the last filter in the chain is stored to the
 749         Compressed Data field, which is is guaranteed to be aligned
 750         to a multiple of four bytes relative to the beginning of the
 751         Stream. This can increase
 752           - speed, if the filtered data is handled multiple bytes at
 753             a time by the filter-specific encoder and decoder,
 754             because accessing aligned data in computer memory is
 755             usually faster; and
 756           - compression ratio, if the output data is later compressed
 757             with an external compression tool.
 758
 759
 760 5.2. Security
 761
 762         If filters would be allowed to be chained freely, it would be
 763         possible to create malicious files, that would be very slow to
 764         decode. Such files could be used to create denial of service
 765         attacks.
 766
 767         Slow files could occur when multiple filters are chained:
 768
 769             v   Compressed input data
 770             |   Filter 1 decoder (last filter)
 771             |   Filter 0 decoder (non-last filter)
 772             v   Uncompressed output data
 773
 774         The decoder of the last filter in the chain produces a lot of
 775         output from little input. Another filter in the chain takes the
 776         output of the last filter, and produces very little output
 777         while consuming a lot of input. As a result, a lot of data is
 778         moved inside the filter chain, but the filter chain as a whole
 779         gets very little work done.
 780
 781         To prevent this kind of slow files, there are restrictions on
 782         how the filters can be chained. These restrictions MUST be
 783         taken into account when designing new filters.
 784
 785         The maximum number of filters in the chain has been limited to
 786         four, thus there can be at maximum of three non-last filters.
 787         Of these three non-last filters, only two are allowed to change
 788         the size of the data.
 789
 790         The non-last filters, that change the size of the data, MUST
 791         have a limit how much the decoder can compress the data: the
 792         decoder SHOULD produce at least n bytes of output when the
 793         filter is given 2n bytes of input. This  limit is not
 794         absolute, but significant deviations MUST be avoided.
 795
 796         The above limitations guarantee that if the last filter in the
 797         chain produces 4n bytes of output, the chain as a whole will
 798         produce at least n bytes of output.
 799
 800
 801 5.3. Filters
 802
 803 5.3.1. LZMA2
 804
 805         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
 806         compression algorithm with high compression ratio and fast
 807         decompression. LZMA is based on LZ77 and range coding
 808         algorithms.
 809
 810         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 811         LZMA internally, but adds support for flushing the encoder,
 812         uncompressed chunks, eases stateful decoder implementations,
 813         and improves support for multithreading. Thus, the plain LZMA
 814         will not be supported in this file format.
 815
 816             Filter ID:                  0x21
 817             Size of Filter Properties:  1 byte
 818             Changes size of data:       Yes
 819             Allow as a non-last filter: No
 820             Allow as the last filter:   Yes
 821
 822             Preferred alignment:
 823                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 824                 Output data:            1 byte
 825
 826         The format of the one-byte Filter Properties field is as
 827         follows:
 828
 829             Bits   Mask   Description
 830             0-5    0x3F   Dictionary Size
 831             6-7    0xC0   Reserved for future use; MUST be zero for now.
 832
 833         Dictionary Size is encoded with one-bit mantissa and five-bit
 834         exponent. The smallest dictionary size is 4 KiB and the biggest
 835         is 4 GiB.
 836
 837             Raw value   Mantissa   Exponent   Dictionary size
 838                 0           2         11         4 KiB
 839                 1           3         11         6 KiB
 840                 2           2         12         8 KiB
 841                 3           3         12        12 KiB
 842                 4           2         13        16 KiB
 843                 5           3         13        24 KiB
 844                 6           2         14        32 KiB
 845               ...         ...        ...      ...
 846                35           3         27       768 MiB
 847                36           2         28      1024 MiB
 848                37           3         29      1536 MiB
 849                38           2         30      2048 MiB
 850                39           3         30      3072 MiB
 851                40           2         31      4096 MiB - 1 B
 852
 853         Instead of having a table in the decoder, the dictionary size
 854         can be decoded using the following C code:
 855
 856             const uint8_t bits = get_dictionary_flags() & 0x3F;
 857             if (bits > 40)
 858                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 859
 860             uint32_t dictionary_size;
 861             if (bits == 40) {
 862                 dictionary_size = UINT32_MAX;
 863             } else {
 864                 dictionary_size = 2 | (bits & 1);
 865                 dictionary_size <<= bits / 2 + 11;
 866             }
 867
 868
 869 5.3.2. Branch/Call/Jump Filters for Executables
 870
 871         These filters convert relative branch, call, and jump
 872         instructions to their absolute counterparts in executable
 873         files. This conversion increases redundancy and thus
 874         compression ratio.
 875
 876             Size of Filter Properties:  0 or 4 bytes
 877             Changes size of data:       No
 878             Allow as a non-last filter: Yes
 879             Allow as the last filter:   No
 880
 881         Below is the list of filters in this category. The alignment
 882         is the same for both input and output data.
 883
 884             Filter ID   Alignment   Description
 885               0x04       1 byte     x86 filter (BCJ)
 886               0x05       4 bytes    PowerPC (big endian) filter
 887               0x06      16 bytes    IA64 filter
 888               0x07       4 bytes    ARM (little endian) filter
 889               0x08       2 bytes    ARM Thumb (little endian) filter
 890               0x09       4 bytes    SPARC filter
 891
 892         If the size of Filter Properties is four bytes, the Filter
 893         Properties field contains the start offset used for address
 894         conversions. It is stored as an unsigned 32-bit little endian
 895         integer. If the size of Filter Properties is zero, the start
 896         offset is zero.
 897
 898         Setting the start offset may be useful if an executable has
 899         multiple sections, and there are many cross-section calls.
 900         Taking advantage of this feature usually requires usage of
 901         the Subblock filter.
 902
 903
 904 5.3.3. Delta
 905
 906         The Delta filter may increase compression ratio when the value
 907         of the next byte correlates with the value of an earlier byte
 908         at specified distance.
 909
 910             Filter ID:                  0x03
 911             Size of Filter Properties:  1 byte
 912             Changes size of data:       No
 913             Allow as a non-last filter: Yes
 914             Allow as the last filter:   No
 915
 916             Preferred alignment:
 917                 Input data:             1 byte
 918                 Output data:            Same as the original input data
 919
 920         The Properties byte indicates the delta distance, which can be
 921         1-256 bytes backwards from the current byte: 0x00 indicates
 922         distance of 1 byte and 0xFF distance of 256 bytes.
 923
 924
 925 5.3.3.1. Format of the Encoded Output
 926
 927         The code below illustrates both encoding and decoding with
 928         the Delta filter.
 929
 930             // Distance is in the range [1, 256].
 931             const unsigned int distance = get_properties_byte() + 1;
 932             uint8_t pos = 0;
 933             uint8_t delta[256];
 934
 935             memset(delta, 0, sizeof(delta));
 936
 937             while (1) {
 938                 const int byte = read_byte();
 939                 if (byte == EOF)
 940                     break;
 941
 942                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 943                 if (is_encoder) {
 944                     tmp = (uint8_t)(byte) - tmp;
 945                     delta[pos] = (uint8_t)(byte);
 946                 } else {
 947                     tmp = (uint8_t)(byte) + tmp;
 948                     delta[pos] = tmp;
 949                 }
 950
 951                 write_byte(tmp);
 952                 --pos;
 953             }
 954
 955
 956 5.4. Custom Filter IDs
 957
 958         If a developer wants to use custom Filter IDs, he has two
 959         choices. The first choice is to contact Lasse Collin and ask
 960         him to allocate a range of IDs for the developer.
 961
 962         The second choice is to generate a 40-bit random integer,
 963         which the developer can use as his personal Developer ID.
 964         To minimalize the risk of collisions, Developer ID has to be
 965         a randomly generated integer, not manually selected "hex word".
 966         The following command, which works on many free operating
 967         systems, can be used to generate Developer ID:
 968
 969             dd if=/dev/urandom bs=5 count=1 | hexdump
 970
 971         The developer can then use his Developer ID to create unique
 972         (well, hopefully unique) Filter IDs.
 973
 974             Bits    Mask                    Description
 975              0-15   0x0000_0000_0000_FFFF   Filter ID
 976             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
 977             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
 978
 979         The resulting 63-bit integer will use 9 bytes of space when
 980         stored using the encoding described in Section 1.2. To get
 981         a shorter ID, see the beginning of this Section how to
 982         request a custom ID range.
 983
 984
 985 5.4.1. Reserved Custom Filter ID Ranges
 986
 987         Range                       Description
 988         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
 989         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
 990
 991
 992 6. Cyclic Redundancy Checks
 993
 994         There are several incompatible variations to calculate CRC32
 995         and CRC64. For simplicity and clarity, complete examples are
 996         provided to calculate the checks as they are used in this file
 997         format. Implementations MAY use different code as long as it
 998         gives identical results.
 999
1000         The program below reads data from standard input, calculates
1001         the CRC32 and CRC64 values, and prints the calculated values
1002         as big endian hexadecimal strings to standard output.
1003
1004             #include <sys/types.h>
1005             #include <inttypes.h>
1006             #include <stdio.h>
1007
1008             uint32_t crc32_table[256];
1009             uint64_t crc64_table[256];
1010
1011             void
1012             init(void)
1013             {
1014                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1015                 static const uint64_t poly64
1016                         = UINT64_C(0xC96C5795D7870F42);
1017
1018                 for (size_t i = 0; i < 256; ++i) {
1019                     uint32_t crc32 = i;
1020                     uint64_t crc64 = i;
1021
1022                     for (size_t j = 0; j < 8; ++j) {
1023                         if (crc32 & 1)
1024                             crc32 = (crc32 >> 1) ^ poly32;
1025                         else
1026                             crc32 >>= 1;
1027
1028                         if (crc64 & 1)
1029                             crc64 = (crc64 >> 1) ^ poly64;
1030                         else
1031                             crc64 >>= 1;
1032                     }
1033
1034                     crc32_table[i] = crc32;
1035                     crc64_table[i] = crc64;
1036                 }
1037             }
1038
1039             uint32_t
1040             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1041             {
1042                 crc = ~crc;
1043                 for (size_t i = 0; i < size; ++i)
1044                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1045                             ^ (crc >> 8);
1046                 return ~crc;
1047             }
1048
1049             uint64_t
1050             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1051             {
1052                 crc = ~crc;
1053                 for (size_t i = 0; i < size; ++i)
1054                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1055                             ^ (crc >> 8);
1056                 return ~crc;
1057             }
1058
1059             int
1060             main()
1061             {
1062                 init();
1063
1064                 uint32_t value32 = 0;
1065                 uint64_t value64 = 0;
1066                 uint64_t total_size = 0;
1067                 uint8_t buf[8192];
1068
1069                 while (1) {
1070                     const size_t buf_size = fread(buf, 1, 8192, stdin);
1071                     if (buf_size == 0)
1072                         break;
1073
1074                     total_size += buf_size;
1075                     value32 = crc32(buf, buf_size, value32);
1076                     value64 = crc64(buf, buf_size, value64);
1077                 }
1078
1079                 printf("Bytes:  %" PRIu64 "\n", total_size);
1080                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1081                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1082
1083                 return 0;
1084             }
1085
1086
1087 7. References
1088
1089         LZMA SDK - The original LZMA implementation
1090         http://7-zip.org/sdk.html
1091
1092         LZMA Utils - LZMA adapted to POSIX-like systems
1093         http://tukaani.org/lzma/
1094
1095         [RFC-1952]
1096         GZIP file format specification version 4.3
1097         http://www.ietf.org/rfc/rfc1952.txt
1098           - Notation of byte boxes in section "2.1. Overall conventions"
1099
1100         [RFC-2119]
1101         Key words for use in RFCs to Indicate Requirement Levels
1102         http://www.ietf.org/rfc/rfc2119.txt
1103
1104         [GNU-tar]
1105         GNU tar 1.20 manual
1106         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1107           - Node 9.4.2 "Blocking Factor", paragraph that begins
1108             "gzip will complain about trailing garbage"
1109           - Note that this URL points to the latest version of the
1110             manual, and may some day not contain the note which is in
1111             1.20. For the exact version of the manual, download GNU
1112             tar 1.20: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.20.tar.gz
1113