doc/xz-file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.0.4 (2009-08-27)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <http://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.0.4     2009-08-27    Language improvements in Sections 1.2,
  97                                 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
  98
  99         1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
 100
 101         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
 102
 103         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
 104                                 clarifications to Sections 2, 2.2,
 105                                 3.3, 4.4, and 5.3.2
 106
 107         1.0.0     2009-01-14    The first official version
 108
 109
 110 1. Conventions
 111
 112         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 113         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 114         document are to be interpreted as described in [RFC-2119].
 115
 116         Indicating a warning means displaying a message, returning
 117         appropriate exit status, or doing something else to let the
 118         user know that something worth warning occurred. The operation
 119         SHOULD still finish if a warning is indicated.
 120
 121         Indicating an error means displaying a message, returning
 122         appropriate exit status, or doing something else to let the
 123         user know that something prevented successfully finishing the
 124         operation. The operation MUST be aborted once an error has
 125         been indicated.
 126
 127
 128 1.1. Byte and Its Representation
 129
 130         In this document, byte is always 8 bits.
 131
 132         A "null byte" has all bits unset. That is, the value of a null
 133         byte is 0x00.
 134
 135         To represent byte blocks, this document uses notation that
 136         is similar to the notation used in [RFC-1952]:
 137
 138             +-------+
 139             |  Foo  |   One byte.
 140             +-------+
 141
 142             +---+---+
 143             |  Foo  |   Two bytes; that is, some of the vertical bars
 144             +---+---+   can be missing.
 145
 146             +=======+
 147             |  Foo  |   Zero or more bytes.
 148             +=======+
 149
 150         In this document, a boxed byte or a byte sequence declared
 151         using this notation is called "a field". The example field
 152         above would be called "the Foo field" or plain "Foo".
 153
 154         If there are many fields, they may be split to multiple lines.
 155         This is indicated with an arrow ("--->"):
 156
 157             +=====+
 158             | Foo |
 159             +=====+
 160
 161                  +=====+
 162             ---> | Bar |
 163                  +=====+
 164
 165         The above is equivalent to this:
 166
 167             +=====+=====+
 168             | Foo | Bar |
 169             +=====+=====+
 170
 171
 172 1.2. Multibyte Integers
 173
 174         Multibyte integers of static length, such as CRC values,
 175         are stored in little endian byte order (least significant
 176         byte first).
 177
 178         When smaller values are more likely than bigger values (for
 179         example file sizes), multibyte integers are encoded in a
 180         variable-length representation:
 181           - Numbers in the range [0, 127] are copied as is, and take
 182             one byte of space.
 183           - Bigger numbers will occupy two or more bytes. All but the
 184             last byte of the multibyte representation have the highest
 185             (eighth) bit set.
 186
 187         For now, the value of the variable-length integers is limited
 188         to 63 bits, which limits the encoded size of the integer to
 189         nine bytes. These limits may be increased in the future if
 190         needed.
 191
 192         The following C code illustrates encoding and decoding of
 193         variable-length integers. The functions return the number of
 194         bytes occupied by the integer (1-9), or zero on error.
 195
 196             #include <stddef.h>
 197             #include <inttypes.h>
 198
 199             size_t
 200             encode(uint8_t buf[static 9], uint64_t num)
 201             {
 202                 if (num > UINT64_MAX / 2)
 203                     return 0;
 204
 205                 size_t i = 0;
 206
 207                 while (num >= 0x80) {
 208                     buf[i++] = (uint8_t)(num) | 0x80;
 209                     num >>= 7;
 210                 }
 211
 212                 buf[i++] = (uint8_t)(num);
 213
 214                 return i;
 215             }
 216
 217             size_t
 218             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 219             {
 220                 if (size_max == 0)
 221                     return 0;
 222
 223                 if (size_max > 9)
 224                     size_max = 9;
 225
 226                 *num = buf[0] & 0x7F;
 227                 size_t i = 0;
 228
 229                 while (buf[i++] & 0x80) {
 230                     if (i >= size_max || buf[i] == 0x00)
 231                         return 0;
 232
 233                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 234                 }
 235
 236                 return i;
 237             }
 238
 239
 240 2. Overall Structure of .xz File
 241
 242         A standalone .xz files consist of one or more Streams which may
 243         have Stream Padding between or after them:
 244
 245             +========+================+========+================+
 246             | Stream | Stream Padding | Stream | Stream Padding | ...
 247             +========+================+========+================+
 248
 249         The sizes of Stream and Stream Padding are always multiples
 250         of four bytes, thus the size of every valid .xz file MUST be
 251         a multiple of four bytes.
 252
 253         While a typical file contains only one Stream and no Stream
 254         Padding, a decoder handling standalone .xz files SHOULD support
 255         files that have more than one Stream or Stream Padding.
 256
 257         In contrast to standalone .xz files, when the .xz file format
 258         is used as an internal part of some other file format or
 259         communication protocol, it usually is expected that the decoder
 260         stops after the first Stream, and doesn't look for Stream
 261         Padding or possibly other Streams.
 262
 263
 264 2.1. Stream
 265
 266         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 267         |     Stream Header     | Block | Block | ... | Block |
 268         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 269
 270              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 271         ---> | Index |     Stream Footer     |
 272              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 273
 274         All the above fields have a size that is a multiple of four. If
 275         Stream is used as an internal part of another file format, it
 276         is RECOMMENDED to make the Stream start at an offset that is
 277         a multiple of four bytes.
 278
 279         Stream Header, Index, and Stream Footer are always present in
 280         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 281
 282         There are zero or more Blocks. The maximum number of Blocks is
 283         limited only by the maximum size of the Index field.
 284
 285         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 286         The same limit applies to the total amount of uncompressed
 287         data stored in a Stream.
 288
 289         If an implementation supports handling .xz files with multiple
 290         concatenated Streams, it MAY apply the above limits to the file
 291         as a whole instead of limiting per Stream basis.
 292
 293
 294 2.1.1. Stream Header
 295
 296         +---+---+---+---+---+---+-------+------+--+--+--+--+
 297         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 298         +---+---+---+---+---+---+-------+------+--+--+--+--+
 299
 300
 301 2.1.1.1. Header Magic Bytes
 302
 303         The first six (6) bytes of the Stream are so called Header
 304         Magic Bytes. They can be used to identify the file type.
 305
 306             Using a C array and ASCII:
 307             const uint8_t HEADER_MAGIC[6]
 308                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 309
 310             In plain hexadecimal:
 311             FD 37 7A 58 5A 00
 312
 313         Notes:
 314           - The first byte (0xFD) was chosen so that the files cannot
 315             be erroneously detected as being in .lzma format, in which
 316             the first byte is in the range [0x00, 0xE0].
 317           - The sixth byte (0x00) was chosen to prevent applications
 318             from misdetecting the file as a text file.
 319
 320         If the Header Magic Bytes don't match, the decoder MUST
 321         indicate an error.
 322
 323
 324 2.1.1.2. Stream Flags
 325
 326         The first byte of Stream Flags is always a null byte. In the
 327         future, this byte may be used to indicate a new Stream version
 328         or other Stream properties.
 329
 330         The second byte of Stream Flags is a bit field:
 331
 332             Bit(s)  Mask  Description
 333              0-3    0x0F  Type of Check (see Section 3.4):
 334                               ID    Size      Check name
 335                               0x00   0 bytes  None
 336                               0x01   4 bytes  CRC32
 337                               0x02   4 bytes  (Reserved)
 338                               0x03   4 bytes  (Reserved)
 339                               0x04   8 bytes  CRC64
 340                               0x05   8 bytes  (Reserved)
 341                               0x06   8 bytes  (Reserved)
 342                               0x07  16 bytes  (Reserved)
 343                               0x08  16 bytes  (Reserved)
 344                               0x09  16 bytes  (Reserved)
 345                               0x0A  32 bytes  SHA-256
 346                               0x0B  32 bytes  (Reserved)
 347                               0x0C  32 bytes  (Reserved)
 348                               0x0D  64 bytes  (Reserved)
 349                               0x0E  64 bytes  (Reserved)
 350                               0x0F  64 bytes  (Reserved)
 351              4-7    0xF0  Reserved for future use; MUST be zero for now.
 352
 353         Implementations SHOULD support at least the Check IDs 0x00
 354         (None) and 0x01 (CRC32). Supporting other Check IDs is
 355         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 356         indicate a warning or error.
 357
 358         If any reserved bit is set, the decoder MUST indicate an error.
 359         It is possible that there is a new field present which the
 360         decoder is not aware of, and can thus parse the Stream Header
 361         incorrectly.
 362
 363
 364 2.1.1.3. CRC32
 365
 366         The CRC32 is calculated from the Stream Flags field. It is
 367         stored as an unsigned 32-bit little endian integer. If the
 368         calculated value does not match the stored one, the decoder
 369         MUST indicate an error.
 370
 371         The idea is that Stream Flags would always be two bytes, even
 372         if new features are needed. This way old decoders will be able
 373         to verify the CRC32 calculated from Stream Flags, and thus
 374         distinguish between corrupt files (CRC32 doesn't match) and
 375         files that the decoder doesn't support (CRC32 matches but
 376         Stream Flags has reserved bits set).
 377
 378
 379 2.1.2. Stream Footer
 380
 381         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 382         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 383         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 384
 385
 386 2.1.2.1. CRC32
 387
 388         The CRC32 is calculated from the Backward Size and Stream Flags
 389         fields. It is stored as an unsigned 32-bit little endian
 390         integer. If the calculated value does not match the stored one,
 391         the decoder MUST indicate an error.
 392
 393         The reason to have the CRC32 field before the Backward Size and
 394         Stream Flags fields is to keep the four-byte fields aligned to
 395         a multiple of four bytes.
 396
 397
 398 2.1.2.2. Backward Size
 399
 400         Backward Size is stored as a 32-bit little endian integer,
 401         which indicates the size of the Index field as multiple of
 402         four bytes, minimum value being four bytes:
 403
 404             real_backward_size = (stored_backward_size + 1) * 4;
 405
 406         If the stored value does not match the real size of the Index
 407         field, the decoder MUST indicate an error.
 408
 409         Using a fixed-size integer to store Backward Size makes
 410         it slightly simpler to parse the Stream Footer when the
 411         application needs to parse the Stream backwards.
 412
 413
 414 2.1.2.3. Stream Flags
 415
 416         This is a copy of the Stream Flags field from the Stream
 417         Header. The information stored to Stream Flags is needed
 418         when parsing the Stream backwards. The decoder MUST compare
 419         the Stream Flags fields in both Stream Header and Stream
 420         Footer, and indicate an error if they are not identical.
 421
 422
 423 2.1.2.4. Footer Magic Bytes
 424
 425         As the last step of the decoding process, the decoder MUST
 426         verify the existence of Footer Magic Bytes. If they don't
 427         match, an error MUST be indicated.
 428
 429             Using a C array and ASCII:
 430             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 431
 432             In hexadecimal:
 433             59 5A
 434
 435         The primary reason to have Footer Magic Bytes is to make
 436         it easier to detect incomplete files quickly, without
 437         uncompressing. If the file does not end with Footer Magic Bytes
 438         (excluding Stream Padding described in Section 2.2), it cannot
 439         be undamaged, unless someone has intentionally appended garbage
 440         after the end of the Stream.
 441
 442
 443 2.2. Stream Padding
 444
 445         Only the decoders that support decoding of concatenated Streams
 446         MUST support Stream Padding.
 447
 448         Stream Padding MUST contain only null bytes. To preserve the
 449         four-byte alignment of consecutive Streams, the size of Stream
 450         Padding MUST be a multiple of four bytes. Empty Stream Padding
 451         is allowed. If these requirements are not met, the decoder MUST
 452         indicate an error.
 453
 454         Note that non-empty Stream Padding is allowed at the end of the
 455         file; there doesn't need to be a new Stream after non-empty
 456         Stream Padding. This can be convenient in certain situations
 457         [GNU-tar].
 458
 459         The possibility of Stream Padding MUST be taken into account
 460         when designing an application that parses Streams backwards,
 461         and the application supports concatenated Streams.
 462
 463
 464 3. Block
 465
 466         +==============+=================+===============+=======+
 467         | Block Header | Compressed Data | Block Padding | Check |
 468         +==============+=================+===============+=======+
 469
 470
 471 3.1. Block Header
 472
 473         +-------------------+-------------+=================+
 474         | Block Header Size | Block Flags | Compressed Size |
 475         +-------------------+-------------+=================+
 476
 477              +===================+======================+
 478         ---> | Uncompressed Size | List of Filter Flags |
 479              +===================+======================+
 480
 481              +================+--+--+--+--+
 482         ---> | Header Padding |   CRC32   |
 483              +================+--+--+--+--+
 484
 485
 486 3.1.1. Block Header Size
 487
 488         This field overlaps with the Index Indicator field (see
 489         Section 4.1).
 490
 491         This field contains the size of the Block Header field,
 492         including the Block Header Size field itself. Valid values are
 493         in the range [0x01, 0xFF], which indicate the size of the Block
 494         Header as multiples of four bytes, minimum size being eight
 495         bytes:
 496
 497             real_header_size = (encoded_header_size + 1) * 4;
 498
 499         If a Block Header bigger than 1024 bytes is needed in the
 500         future, a new field can be added between the Block Header and
 501         Compressed Data fields. The presence of this new field would
 502         be indicated in the Block Header field.
 503
 504
 505 3.1.2. Block Flags
 506
 507         The Block Flags field is a bit field:
 508
 509             Bit(s)  Mask  Description
 510              0-1    0x03  Number of filters (1-4)
 511              2-5    0x3C  Reserved for future use; MUST be zero for now.
 512               6     0x40  The Compressed Size field is present.
 513               7     0x80  The Uncompressed Size field is present.
 514
 515         If any reserved bit is set, the decoder MUST indicate an error.
 516         It is possible that there is a new field present which the
 517         decoder is not aware of, and can thus parse the Block Header
 518         incorrectly.
 519
 520
 521 3.1.3. Compressed Size
 522
 523         This field is present only if the appropriate bit is set in
 524         the Block Flags field (see Section 3.1.2).
 525
 526         The Compressed Size field contains the size of the Compressed
 527         Data field, which MUST be non-zero. Compressed Size is stored
 528         using the encoding described in Section 1.2. If the Compressed
 529         Size doesn't match the size of the Compressed Data field, the
 530         decoder MUST indicate an error.
 531
 532
 533 3.1.4. Uncompressed Size
 534
 535         This field is present only if the appropriate bit is set in
 536         the Block Flags field (see Section 3.1.2).
 537
 538         The Uncompressed Size field contains the size of the Block
 539         after uncompressing. Uncompressed Size is stored using the
 540         encoding described in Section 1.2. If the Uncompressed Size
 541         does not match the real uncompressed size, the decoder MUST
 542         indicate an error.
 543
 544         Storing the Compressed Size and Uncompressed Size fields serves
 545         several purposes:
 546           - The decoder knows how much memory it needs to allocate
 547             for a temporary buffer in multithreaded mode.
 548           - Simple error detection: wrong size indicates a broken file.
 549           - Seeking forwards to a specific location in streamed mode.
 550
 551         It should be noted that the only reliable way to determine
 552         the real uncompressed size is to uncompress the Block,
 553         because the Block Header and Index fields may contain
 554         (intentionally or unintentionally) invalid information.
 555
 556
 557 3.1.5. List of Filter Flags
 558
 559         +================+================+     +================+
 560         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 561         +================+================+     +================+
 562
 563         The number of Filter Flags fields is stored in the Block Flags
 564         field (see Section 3.1.2).
 565
 566         The format of each Filter Flags field is as follows:
 567
 568             +===========+====================+===================+
 569             | Filter ID | Size of Properties | Filter Properties |
 570             +===========+====================+===================+
 571
 572         Both Filter ID and Size of Properties are stored using the
 573         encoding described in Section 1.2. Size of Properties indicates
 574         the size of the Filter Properties field as bytes. The list of
 575         officially defined Filter IDs and the formats of their Filter
 576         Properties are described in Section 5.3.
 577
 578         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 579         (2^62) are reserved for implementation-specific internal use.
 580         These Filter IDs MUST never be used in List of Filter Flags.
 581
 582
 583 3.1.6. Header Padding
 584
 585         This field contains as many null byte as it is needed to make
 586         the Block Header have the size specified in Block Header Size.
 587         If any of the bytes are not null bytes, the decoder MUST
 588         indicate an error. It is possible that there is a new field
 589         present which the decoder is not aware of, and can thus parse
 590         the Block Header incorrectly.
 591
 592
 593 3.1.7. CRC32
 594
 595         The CRC32 is calculated over everything in the Block Header
 596         field except the CRC32 field itself. It is stored as an
 597         unsigned 32-bit little endian integer. If the calculated
 598         value does not match the stored one, the decoder MUST indicate
 599         an error.
 600
 601         By verifying the CRC32 of the Block Header before parsing the
 602         actual contents allows the decoder to distinguish between
 603         corrupt and unsupported files.
 604
 605
 606 3.2. Compressed Data
 607
 608         The format of Compressed Data depends on Block Flags and List
 609         of Filter Flags. Excluding the descriptions of the simplest
 610         filters in Section 5.3, the format of the filter-specific
 611         encoded data is out of scope of this document.
 612
 613
 614 3.3. Block Padding
 615
 616         Block Padding MUST contain 0-3 null bytes to make the size of
 617         the Block a multiple of four bytes. This can be needed when
 618         the size of Compressed Data is not a multiple of four. If any
 619         of the bytes in Block Padding are not null bytes, the decoder
 620         MUST indicate an error.
 621
 622
 623 3.4. Check
 624
 625         The type and size of the Check field depends on which bits
 626         are set in the Stream Flags field (see Section 2.1.1.2).
 627
 628         The Check, when used, is calculated from the original
 629         uncompressed data. If the calculated Check does not match the
 630         stored one, the decoder MUST indicate an error. If the selected
 631         type of Check is not supported by the decoder, it SHOULD
 632         indicate a warning or error.
 633
 634
 635 4. Index
 636
 637         +-----------------+===================+
 638         | Index Indicator | Number of Records |
 639         +-----------------+===================+
 640
 641              +=================+===============+-+-+-+-+
 642         ---> | List of Records | Index Padding | CRC32 |
 643              +=================+===============+-+-+-+-+
 644
 645         Index serves several purposes. Using it, one can
 646           - verify that all Blocks in a Stream have been processed;
 647           - find out the uncompressed size of a Stream; and
 648           - quickly access the beginning of any Block (random access).
 649
 650
 651 4.1. Index Indicator
 652
 653         This field overlaps with the Block Header Size field (see
 654         Section 3.1.1). The value of Index Indicator is always 0x00.
 655
 656
 657 4.2. Number of Records
 658
 659         This field indicates how many Records there are in the List
 660         of Records field, and thus how many Blocks there are in the
 661         Stream. The value is stored using the encoding described in
 662         Section 1.2. If the decoder has decoded all the Blocks of the
 663         Stream, and then notices that the Number of Records doesn't
 664         match the real number of Blocks, the decoder MUST indicate an
 665         error.
 666
 667
 668 4.3. List of Records
 669
 670         List of Records consists of as many Records as indicated by the
 671         Number of Records field:
 672
 673             +========+========+
 674             | Record | Record | ...
 675             +========+========+
 676
 677         Each Record contains information about one Block:
 678
 679             +===============+===================+
 680             | Unpadded Size | Uncompressed Size |
 681             +===============+===================+
 682
 683         If the decoder has decoded all the Blocks of the Stream, it
 684         MUST verify that the contents of the Records match the real
 685         Unpadded Size and Uncompressed Size of the respective Blocks.
 686
 687         Implementation hint: It is possible to verify the Index with
 688         constant memory usage by calculating for example SHA-256 of
 689         both the real size values and the List of Records, then
 690         comparing the hash values. Implementing this using
 691         non-cryptographic hash like CRC32 SHOULD be avoided unless
 692         small code size is important.
 693
 694         If the decoder supports random-access reading, it MUST verify
 695         that Unpadded Size and Uncompressed Size of every completely
 696         decoded Block match the sizes stored in the Index. If only
 697         partial Block is decoded, the decoder MUST verify that the
 698         processed sizes don't exceed the sizes stored in the Index.
 699
 700
 701 4.3.1. Unpadded Size
 702
 703         This field indicates the size of the Block excluding the Block
 704         Padding field. That is, Unpadded Size is the size of the Block
 705         Header, Compressed Data, and Check fields. Unpadded Size is
 706         stored using the encoding described in Section 1.2. The value
 707         MUST never be zero; with the current structure of Blocks, the
 708         actual minimum value for Unpadded Size is five.
 709
 710         Implementation note: Because the size of the Block Padding
 711         field is not included in Unpadded Size, calculating the total
 712         size of a Stream or doing random-access reading requires
 713         calculating the actual size of the Blocks by rounding Unpadded
 714         Sizes up to the next multiple of four.
 715
 716         The reason to exclude Block Padding from Unpadded Size is to
 717         ease making a raw copy of Compressed Data without Block
 718         Padding. This can be useful, for example, if someone wants
 719         to convert Streams to some other file format quickly.
 720
 721
 722 4.3.2. Uncompressed Size
 723
 724         This field indicates the Uncompressed Size of the respective
 725         Block as bytes. The value is stored using the encoding
 726         described in Section 1.2.
 727
 728
 729 4.4. Index Padding
 730
 731         This field MUST contain 0-3 null bytes to pad the Index to
 732         a multiple of four bytes. If any of the bytes are not null
 733         bytes, the decoder MUST indicate an error.
 734
 735
 736 4.5. CRC32
 737
 738         The CRC32 is calculated over everything in the Index field
 739         except the CRC32 field itself. The CRC32 is stored as an
 740         unsigned 32-bit little endian integer. If the calculated
 741         value does not match the stored one, the decoder MUST indicate
 742         an error.
 743
 744
 745 5. Filter Chains
 746
 747         The Block Flags field defines how many filters are used. When
 748         more than one filter is used, the filters are chained; that is,
 749         the output of one filter is the input of another filter. The
 750         following figure illustrates the direction of data flow.
 751
 752                     v   Uncompressed Data   ^
 753                     |       Filter 0        |
 754             Encoder |       Filter 1        | Decoder
 755                     |       Filter n        |
 756                     v    Compressed Data    ^
 757
 758
 759 5.1. Alignment
 760
 761         Alignment of uncompressed input data is usually the job of
 762         the application producing the data. For example, to get the
 763         best results, an archiver tool should make sure that all
 764         PowerPC executable files in the archive stream start at
 765         offsets that are multiples of four bytes.
 766
 767         Some filters, for example LZMA2, can be configured to take
 768         advantage of specified alignment of input data. Note that
 769         taking advantage of aligned input can be beneficial also when
 770         a filter is not the first filter in the chain. For example,
 771         if you compress PowerPC executables, you may want to use the
 772         PowerPC filter and chain that with the LZMA2 filter. Because
 773         not only the input but also the output alignment of the PowerPC
 774         filter is four bytes, it is now beneficial to set LZMA2
 775         settings so that the LZMA2 encoder can take advantage of its
 776         four-byte-aligned input data.
 777
 778         The output of the last filter in the chain is stored to the
 779         Compressed Data field, which is is guaranteed to be aligned
 780         to a multiple of four bytes relative to the beginning of the
 781         Stream. This can increase
 782           - speed, if the filtered data is handled multiple bytes at
 783             a time by the filter-specific encoder and decoder,
 784             because accessing aligned data in computer memory is
 785             usually faster; and
 786           - compression ratio, if the output data is later compressed
 787             with an external compression tool.
 788
 789
 790 5.2. Security
 791
 792         If filters would be allowed to be chained freely, it would be
 793         possible to create malicious files, that would be very slow to
 794         decode. Such files could be used to create denial of service
 795         attacks.
 796
 797         Slow files could occur when multiple filters are chained:
 798
 799             v   Compressed input data
 800             |   Filter 1 decoder (last filter)
 801             |   Filter 0 decoder (non-last filter)
 802             v   Uncompressed output data
 803
 804         The decoder of the last filter in the chain produces a lot of
 805         output from little input. Another filter in the chain takes the
 806         output of the last filter, and produces very little output
 807         while consuming a lot of input. As a result, a lot of data is
 808         moved inside the filter chain, but the filter chain as a whole
 809         gets very little work done.
 810
 811         To prevent this kind of slow files, there are restrictions on
 812         how the filters can be chained. These restrictions MUST be
 813         taken into account when designing new filters.
 814
 815         The maximum number of filters in the chain has been limited to
 816         four, thus there can be at maximum of three non-last filters.
 817         Of these three non-last filters, only two are allowed to change
 818         the size of the data.
 819
 820         The non-last filters, that change the size of the data, MUST
 821         have a limit how much the decoder can compress the data: the
 822         decoder SHOULD produce at least n bytes of output when the
 823         filter is given 2n bytes of input. This  limit is not
 824         absolute, but significant deviations MUST be avoided.
 825
 826         The above limitations guarantee that if the last filter in the
 827         chain produces 4n bytes of output, the chain as a whole will
 828         produce at least n bytes of output.
 829
 830
 831 5.3. Filters
 832
 833 5.3.1. LZMA2
 834
 835         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
 836         compression algorithm with high compression ratio and fast
 837         decompression. LZMA is based on LZ77 and range coding
 838         algorithms.
 839
 840         LZMA2 is an extension on top of the original LZMA. LZMA2 uses
 841         LZMA internally, but adds support for flushing the encoder,
 842         uncompressed chunks, eases stateful decoder implementations,
 843         and improves support for multithreading. Thus, the plain LZMA
 844         will not be supported in this file format.
 845
 846             Filter ID:                  0x21
 847             Size of Filter Properties:  1 byte
 848             Changes size of data:       Yes
 849             Allow as a non-last filter: No
 850             Allow as the last filter:   Yes
 851
 852             Preferred alignment:
 853                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 854                 Output data:            1 byte
 855
 856         The format of the one-byte Filter Properties field is as
 857         follows:
 858
 859             Bits   Mask   Description
 860             0-5    0x3F   Dictionary Size
 861             6-7    0xC0   Reserved for future use; MUST be zero for now.
 862
 863         Dictionary Size is encoded with one-bit mantissa and five-bit
 864         exponent. The smallest dictionary size is 4 KiB and the biggest
 865         is 4 GiB.
 866
 867             Raw value   Mantissa   Exponent   Dictionary size
 868                 0           2         11         4 KiB
 869                 1           3         11         6 KiB
 870                 2           2         12         8 KiB
 871                 3           3         12        12 KiB
 872                 4           2         13        16 KiB
 873                 5           3         13        24 KiB
 874                 6           2         14        32 KiB
 875               ...         ...        ...      ...
 876                35           3         27       768 MiB
 877                36           2         28      1024 MiB
 878                37           3         29      1536 MiB
 879                38           2         30      2048 MiB
 880                39           3         30      3072 MiB
 881                40           2         31      4096 MiB - 1 B
 882
 883         Instead of having a table in the decoder, the dictionary size
 884         can be decoded using the following C code:
 885
 886             const uint8_t bits = get_dictionary_flags() & 0x3F;
 887             if (bits > 40)
 888                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 889
 890             uint32_t dictionary_size;
 891             if (bits == 40) {
 892                 dictionary_size = UINT32_MAX;
 893             } else {
 894                 dictionary_size = 2 | (bits & 1);
 895                 dictionary_size <<= bits / 2 + 11;
 896             }
 897
 898
 899 5.3.2. Branch/Call/Jump Filters for Executables
 900
 901         These filters convert relative branch, call, and jump
 902         instructions to their absolute counterparts in executable
 903         files. This conversion increases redundancy and thus
 904         compression ratio.
 905
 906             Size of Filter Properties:  0 or 4 bytes
 907             Changes size of data:       No
 908             Allow as a non-last filter: Yes
 909             Allow as the last filter:   No
 910
 911         Below is the list of filters in this category. The alignment
 912         is the same for both input and output data.
 913
 914             Filter ID   Alignment   Description
 915               0x04       1 byte     x86 filter (BCJ)
 916               0x05       4 bytes    PowerPC (big endian) filter
 917               0x06      16 bytes    IA64 filter
 918               0x07       4 bytes    ARM (little endian) filter
 919               0x08       2 bytes    ARM Thumb (little endian) filter
 920               0x09       4 bytes    SPARC filter
 921
 922         If the size of Filter Properties is four bytes, the Filter
 923         Properties field contains the start offset used for address
 924         conversions. It is stored as an unsigned 32-bit little endian
 925         integer. The start offset MUST be a multiple of the alignment
 926         of the filter as listed in the table above; if it isn't, the
 927         decoder MUST indicate an error. If the size of Filter
 928         Properties is zero, the start offset is zero.
 929
 930         Setting the start offset may be useful if an executable has
 931         multiple sections, and there are many cross-section calls.
 932         Taking advantage of this feature usually requires usage of
 933         the Subblock filter, whose design is not complete yet.
 934
 935
 936 5.3.3. Delta
 937
 938         The Delta filter may increase compression ratio when the value
 939         of the next byte correlates with the value of an earlier byte
 940         at specified distance.
 941
 942             Filter ID:                  0x03
 943             Size of Filter Properties:  1 byte
 944             Changes size of data:       No
 945             Allow as a non-last filter: Yes
 946             Allow as the last filter:   No
 947
 948             Preferred alignment:
 949                 Input data:             1 byte
 950                 Output data:            Same as the original input data
 951
 952         The Properties byte indicates the delta distance, which can be
 953         1-256 bytes backwards from the current byte: 0x00 indicates
 954         distance of 1 byte and 0xFF distance of 256 bytes.
 955
 956
 957 5.3.3.1. Format of the Encoded Output
 958
 959         The code below illustrates both encoding and decoding with
 960         the Delta filter.
 961
 962             // Distance is in the range [1, 256].
 963             const unsigned int distance = get_properties_byte() + 1;
 964             uint8_t pos = 0;
 965             uint8_t delta[256];
 966
 967             memset(delta, 0, sizeof(delta));
 968
 969             while (1) {
 970                 const int byte = read_byte();
 971                 if (byte == EOF)
 972                     break;
 973
 974                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 975                 if (is_encoder) {
 976                     tmp = (uint8_t)(byte) - tmp;
 977                     delta[pos] = (uint8_t)(byte);
 978                 } else {
 979                     tmp = (uint8_t)(byte) + tmp;
 980                     delta[pos] = tmp;
 981                 }
 982
 983                 write_byte(tmp);
 984                 --pos;
 985             }
 986
 987
 988 5.4. Custom Filter IDs
 989
 990         If a developer wants to use custom Filter IDs, he has two
 991         choices. The first choice is to contact Lasse Collin and ask
 992         him to allocate a range of IDs for the developer.
 993
 994         The second choice is to generate a 40-bit random integer,
 995         which the developer can use as his personal Developer ID.
 996         To minimize the risk of collisions, Developer ID has to be
 997         a randomly generated integer, not manually selected "hex word".
 998         The following command, which works on many free operating
 999         systems, can be used to generate Developer ID:
1000
1001             dd if=/dev/urandom bs=5 count=1 | hexdump
1002
1003         The developer can then use his Developer ID to create unique
1004         (well, hopefully unique) Filter IDs.
1005
1006             Bits    Mask                    Description
1007              0-15   0x0000_0000_0000_FFFF   Filter ID
1008             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1009             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1010
1011         The resulting 63-bit integer will use 9 bytes of space when
1012         stored using the encoding described in Section 1.2. To get
1013         a shorter ID, see the beginning of this Section how to
1014         request a custom ID range.
1015
1016
1017 5.4.1. Reserved Custom Filter ID Ranges
1018
1019         Range                       Description
1020         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1021         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1022         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1023
1024
1025 6. Cyclic Redundancy Checks
1026
1027         There are several incompatible variations to calculate CRC32
1028         and CRC64. For simplicity and clarity, complete examples are
1029         provided to calculate the checks as they are used in this file
1030         format. Implementations MAY use different code as long as it
1031         gives identical results.
1032
1033         The program below reads data from standard input, calculates
1034         the CRC32 and CRC64 values, and prints the calculated values
1035         as big endian hexadecimal strings to standard output.
1036
1037             #include <stddef.h>
1038             #include <inttypes.h>
1039             #include <stdio.h>
1040
1041             uint32_t crc32_table[256];
1042             uint64_t crc64_table[256];
1043
1044             void
1045             init(void)
1046             {
1047                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1048                 static const uint64_t poly64
1049                         = UINT64_C(0xC96C5795D7870F42);
1050
1051                 for (size_t i = 0; i < 256; ++i) {
1052                     uint32_t crc32 = i;
1053                     uint64_t crc64 = i;
1054
1055                     for (size_t j = 0; j < 8; ++j) {
1056                         if (crc32 & 1)
1057                             crc32 = (crc32 >> 1) ^ poly32;
1058                         else
1059                             crc32 >>= 1;
1060
1061                         if (crc64 & 1)
1062                             crc64 = (crc64 >> 1) ^ poly64;
1063                         else
1064                             crc64 >>= 1;
1065                     }
1066
1067                     crc32_table[i] = crc32;
1068                     crc64_table[i] = crc64;
1069                 }
1070             }
1071
1072             uint32_t
1073             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1074             {
1075                 crc = ~crc;
1076                 for (size_t i = 0; i < size; ++i)
1077                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1078                             ^ (crc >> 8);
1079                 return ~crc;
1080             }
1081
1082             uint64_t
1083             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1084             {
1085                 crc = ~crc;
1086                 for (size_t i = 0; i < size; ++i)
1087                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1088                             ^ (crc >> 8);
1089                 return ~crc;
1090             }
1091
1092             int
1093             main()
1094             {
1095                 init();
1096
1097                 uint32_t value32 = 0;
1098                 uint64_t value64 = 0;
1099                 uint64_t total_size = 0;
1100                 uint8_t buf[8192];
1101
1102                 while (1) {
1103                     const size_t buf_size
1104                             = fread(buf, 1, sizeof(buf), stdin);
1105                     if (buf_size == 0)
1106                         break;
1107
1108                     total_size += buf_size;
1109                     value32 = crc32(buf, buf_size, value32);
1110                     value64 = crc64(buf, buf_size, value64);
1111                 }
1112
1113                 printf("Bytes:  %" PRIu64 "\n", total_size);
1114                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1115                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1116
1117                 return 0;
1118             }
1119
1120
1121 7. References
1122
1123         LZMA SDK - The original LZMA implementation
1124         http://7-zip.org/sdk.html
1125
1126         LZMA Utils - LZMA adapted to POSIX-like systems
1127         http://tukaani.org/lzma/
1128
1129         XZ Utils - The next generation of LZMA Utils
1130         http://tukaani.org/xz/
1131
1132         [RFC-1952]
1133         GZIP file format specification version 4.3
1134         http://www.ietf.org/rfc/rfc1952.txt
1135           - Notation of byte boxes in section "2.1. Overall conventions"
1136
1137         [RFC-2119]
1138         Key words for use in RFCs to Indicate Requirement Levels
1139         http://www.ietf.org/rfc/rfc2119.txt
1140
1141         [GNU-tar]
1142         GNU tar 1.21 manual
1143         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1144           - Node 9.4.2 "Blocking Factor", paragraph that begins
1145             "gzip will complain about trailing garbage"
1146           - Note that this URL points to the latest version of the
1147             manual, and may some day not contain the note which is in
1148             1.21. For the exact version of the manual, download GNU
1149             tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1150