doc/xz-file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.0.1 (2009-06-01)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <http://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
  97                                 clarifications to Sections 2, 2.2,
  98                                 3.3, 4.4, and 5.3.2
  99
 100         1.0.0     2009-01-14    The first official version
 101
 102
 103 1. Conventions
 104
 105         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 106         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 107         document are to be interpreted as described in [RFC-2119].
 108
 109         Indicating a warning means displaying a message, returning
 110         appropriate exit status, or doing something else to let the
 111         user know that something worth warning occurred. The operation
 112         SHOULD still finish if a warning is indicated.
 113
 114         Indicating an error means displaying a message, returning
 115         appropriate exit status, or doing something else to let the
 116         user know that something prevented successfully finishing the
 117         operation. The operation MUST be aborted once an error has
 118         been indicated.
 119
 120
 121 1.1. Byte and Its Representation
 122
 123         In this document, byte is always 8 bits.
 124
 125         A "null byte" has all bits unset. That is, the value of a null
 126         byte is 0x00.
 127
 128         To represent byte blocks, this document uses notation that
 129         is similar to the notation used in [RFC-1952]:
 130
 131             +-------+
 132             |  Foo  |   One byte.
 133             +-------+
 134
 135             +---+---+
 136             |  Foo  |   Two bytes; that is, some of the vertical bars
 137             +---+---+   can be missing.
 138
 139             +=======+
 140             |  Foo  |   Zero or more bytes.
 141             +=======+
 142
 143         In this document, a boxed byte or a byte sequence declared
 144         using this notation is called "a field". The example field
 145         above would be called "the Foo field" or plain "Foo".
 146
 147         If there are many fields, they may be split to multiple lines.
 148         This is indicated with an arrow ("--->"):
 149
 150             +=====+
 151             | Foo |
 152             +=====+
 153
 154                  +=====+
 155             ---> | Bar |
 156                  +=====+
 157
 158         The above is equivalent to this:
 159
 160             +=====+=====+
 161             | Foo | Bar |
 162             +=====+=====+
 163
 164
 165 1.2. Multibyte Integers
 166
 167         Multibyte integers of static length, such as CRC values,
 168         are stored in little endian byte order (least significant
 169         byte first).
 170
 171         When smaller values are more likely than bigger values (for
 172         example file sizes), multibyte integers are encoded in a
 173         variable-length representation:
 174           - Numbers in the range [0, 127] are copied as is, and take
 175             one byte of space.
 176           - Bigger numbers will occupy two or more bytes. All but the
 177             last byte of the multibyte representation have the highest
 178             (eighth) bit set.
 179
 180         For now, the value of the variable-length integers is limited
 181         to 63 bits, which limits the encoded size of the integer to
 182         nine bytes. These limits may be increased in future if needed.
 183
 184         The following C code illustrates encoding and decoding of
 185         variable-length integers. The functions return the number of
 186         bytes occupied by the integer (1-9), or zero on error.
 187
 188             #include <stddef.h>
 189             #include <inttypes.h>
 190
 191             size_t
 192             encode(uint8_t buf[static 9], uint64_t num)
 193             {
 194                 if (num > UINT64_MAX / 2)
 195                     return 0;
 196
 197                 size_t i = 0;
 198
 199                 while (num >= 0x80) {
 200                     buf[i++] = (uint8_t)(num) | 0x80;
 201                     num >>= 7;
 202                 }
 203
 204                 buf[i++] = (uint8_t)(num);
 205
 206                 return i;
 207             }
 208
 209             size_t
 210             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 211             {
 212                 if (size_max == 0)
 213                     return 0;
 214
 215                 if (size_max > 9)
 216                     size_max = 9;
 217
 218                 *num = buf[0] & 0x7F;
 219                 size_t i = 0;
 220
 221                 while (buf[i++] & 0x80) {
 222                     if (i >= size_max || buf[i] == 0x00)
 223                         return 0;
 224
 225                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 226                 }
 227
 228                 return i;
 229             }
 230
 231
 232 2. Overall Structure of .xz File
 233
 234         A standalone .xz files consist of one or more Streams which may
 235         have Stream Padding between or after them:
 236
 237             +========+================+========+================+
 238             | Stream | Stream Padding | Stream | Stream Padding | ...
 239             +========+================+========+================+
 240
 241         The sizes of Stream and Stream Padding are always multiples
 242         of four bytes, thus the size of every valid .xz file MUST be
 243         a multiple of four bytes.
 244
 245         While a typical file contains only one Stream and no Stream
 246         Padding, a decoder handling standalone .xz files SHOULD support
 247         files that have more than one Stream or Stream Padding.
 248
 249         In contrast to standalone .xz files, when the .xz file format
 250         is used as an internal part of some other file format or
 251         communication protocol, it usually is expected that the decoder
 252         stops after the first Stream, and doesn't look for Stream
 253         Padding or possibly other Streams.
 254
 255
 256 2.1. Stream
 257
 258         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 259         |     Stream Header     | Block | Block | ... | Block |
 260         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 261
 262              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 263         ---> | Index |     Stream Footer     |
 264              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 265
 266         All the above fields have a size that is a multiple of four. If
 267         Stream is used as an internal part of another file format, it
 268         is RECOMMENDED to make the Stream start at an offset that is
 269         a multiple of four bytes.
 270
 271         Stream Header, Index, and Stream Footer are always present in
 272         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 273
 274         There are zero or more Blocks. The maximum number of Blocks is
 275         limited only by the maximum size of the Index field.
 276
 277         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 278         The same limit applies to the total amount of uncompressed
 279         data stored in a Stream.
 280
 281         If an implementation supports handling .xz files with multiple
 282         concatenated Streams, it MAY apply the above limits to the file
 283         as a whole instead of limiting per Stream basis.
 284
 285
 286 2.1.1. Stream Header
 287
 288         +---+---+---+---+---+---+-------+------+--+--+--+--+
 289         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 290         +---+---+---+---+---+---+-------+------+--+--+--+--+
 291
 292
 293 2.1.1.1. Header Magic Bytes
 294
 295         The first six (6) bytes of the Stream are so called Header
 296         Magic Bytes. They can be used to identify the file type.
 297
 298             Using a C array and ASCII:
 299             const uint8_t HEADER_MAGIC[6]
 300                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 301
 302             In plain hexadecimal:
 303             FD 37 7A 58 5A 00
 304
 305         Notes:
 306           - The first byte (0xFD) was chosen so that the files cannot
 307             be erroneously detected as being in .lzma format, in which
 308             the first byte is in the range [0x00, 0xE0].
 309           - The sixth byte (0x00) was chosen to prevent applications
 310             from misdetecting the file as a text file.
 311
 312         If the Header Magic Bytes don't match, the decoder MUST
 313         indicate an error.
 314
 315
 316 2.1.1.2. Stream Flags
 317
 318         The first byte of Stream Flags is always a null byte. In future
 319         this byte may be used to indicate new Stream version or other
 320         Stream properties.
 321
 322         The second byte of Stream Flags is a bit field:
 323
 324             Bit(s)  Mask  Description
 325              0-3    0x0F  Type of Check (see Section 3.4):
 326                               ID    Size      Check name
 327                               0x00   0 bytes  None
 328                               0x01   4 bytes  CRC32
 329                               0x02   4 bytes  (Reserved)
 330                               0x03   4 bytes  (Reserved)
 331                               0x04   8 bytes  CRC64
 332                               0x05   8 bytes  (Reserved)
 333                               0x06   8 bytes  (Reserved)
 334                               0x07  16 bytes  (Reserved)
 335                               0x08  16 bytes  (Reserved)
 336                               0x09  16 bytes  (Reserved)
 337                               0x0A  32 bytes  SHA-256
 338                               0x0B  32 bytes  (Reserved)
 339                               0x0C  32 bytes  (Reserved)
 340                               0x0D  64 bytes  (Reserved)
 341                               0x0E  64 bytes  (Reserved)
 342                               0x0F  64 bytes  (Reserved)
 343              4-7    0xF0  Reserved for future use; MUST be zero for now.
 344
 345         Implementations SHOULD support at least the Check IDs 0x00
 346         (None) and 0x01 (CRC32). Supporting other Check IDs is
 347         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 348         indicate a warning or error.
 349
 350         If any reserved bit is set, the decoder MUST indicate an error.
 351         It is possible that there is a new field present which the
 352         decoder is not aware of, and can thus parse the Stream Header
 353         incorrectly.
 354
 355
 356 2.1.1.3. CRC32
 357
 358         The CRC32 is calculated from the Stream Flags field. It is
 359         stored as an unsigned 32-bit little endian integer. If the
 360         calculated value does not match the stored one, the decoder
 361         MUST indicate an error.
 362
 363         The idea is that Stream Flags would always be two bytes, even
 364         if new features are needed. This way old decoders will be able
 365         to verify the CRC32 calculated from Stream Flags, and thus
 366         distinguish between corrupt files (CRC32 doesn't match) and
 367         files that the decoder doesn't support (CRC32 matches but
 368         Stream Flags has reserved bits set).
 369
 370
 371 2.1.2. Stream Footer
 372
 373         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 374         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 375         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 376
 377
 378 2.1.2.1. CRC32
 379
 380         The CRC32 is calculated from the Backward Size and Stream Flags
 381         fields. It is stored as an unsigned 32-bit little endian
 382         integer. If the calculated value does not match the stored one,
 383         the decoder MUST indicate an error.
 384
 385         The reason to have the CRC32 field before the Backward Size and
 386         Stream Flags fields is to keep the four-byte fields aligned to
 387         a multiple of four bytes.
 388
 389
 390 2.1.2.2. Backward Size
 391
 392         Backward Size is stored as a 32-bit little endian integer,
 393         which indicates the size of the Index field as multiple of
 394         four bytes, minimum value being four bytes:
 395
 396             real_backward_size = (stored_backward_size + 1) * 4;
 397
 398         If the stored value does not match the real size of the Index
 399         field, the decoder MUST indicate an error.
 400
 401         Using a fixed-size integer to store Backward Size makes
 402         it slightly simpler to parse the Stream Footer when the
 403         application needs to parse the Stream backwards.
 404
 405
 406 2.1.2.3. Stream Flags
 407
 408         This is a copy of the Stream Flags field from the Stream
 409         Header. The information stored to Stream Flags is needed
 410         when parsing the Stream backwards. The decoder MUST compare
 411         the Stream Flags fields in both Stream Header and Stream
 412         Footer, and indicate an error if they are not identical.
 413
 414
 415 2.1.2.4. Footer Magic Bytes
 416
 417         As the last step of the decoding process, the decoder MUST
 418         verify the existence of Footer Magic Bytes. If they don't
 419         match, an error MUST be indicated.
 420
 421             Using a C array and ASCII:
 422             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 423
 424             In hexadecimal:
 425             59 5A
 426
 427         The primary reason to have Footer Magic Bytes is to make
 428         it easier to detect incomplete files quickly, without
 429         uncompressing. If the file does not end with Footer Magic Bytes
 430         (excluding Stream Padding described in Section 2.2), it cannot
 431         be undamaged, unless someone has intentionally appended garbage
 432         after the end of the Stream.
 433
 434
 435 2.2. Stream Padding
 436
 437         Only the decoders that support decoding of concatenated Streams
 438         MUST support Stream Padding.
 439
 440         Stream Padding MUST contain only null bytes. To preserve the
 441         four-byte alignment of consecutive Streams, the size of Stream
 442         Padding MUST be a multiple of four bytes. Empty Stream Padding
 443         is allowed. If these requirements are not met, the decoder MUST
 444         indicate an error.
 445
 446         Note that non-empty Stream Padding is allowed at the end of the
 447         file; there doesn't need to be a new Stream after non-empty
 448         Stream Padding. This can be convenient in certain situations
 449         [GNU-tar].
 450
 451         The possibility of Stream Padding MUST be taken into account
 452         when designing an application that parses Streams backwards,
 453         and the application supports concatenated Streams.
 454
 455
 456 3. Block
 457
 458         +==============+=================+===============+=======+
 459         | Block Header | Compressed Data | Block Padding | Check |
 460         +==============+=================+===============+=======+
 461
 462
 463 3.1. Block Header
 464
 465         +-------------------+-------------+=================+
 466         | Block Header Size | Block Flags | Compressed Size |
 467         +-------------------+-------------+=================+
 468
 469              +===================+======================+
 470         ---> | Uncompressed Size | List of Filter Flags |
 471              +===================+======================+
 472
 473              +================+--+--+--+--+
 474         ---> | Header Padding |   CRC32   |
 475              +================+--+--+--+--+
 476
 477
 478 3.1.1. Block Header Size
 479
 480         This field overlaps with the Index Indicator field (see
 481         Section 4.1).
 482
 483         This field contains the size of the Block Header field,
 484         including the Block Header Size field itself. Valid values are
 485         in the range [0x01, 0xFF], which indicate the size of the Block
 486         Header as multiples of four bytes, minimum size being eight
 487         bytes:
 488
 489             real_header_size = (encoded_header_size + 1) * 4;
 490
 491         If bigger Block Header is needed in future, a new field can be
 492         added between the current Block Header and Compressed Data
 493         fields. The presence of this new field would be indicated in
 494         the Block Header.
 495
 496
 497 3.1.2. Block Flags
 498
 499         The first byte of the Block Flags field is a bit field:
 500
 501             Bit(s)  Mask  Description
 502              0-1    0x03  Number of filters (1-4)
 503              2-5    0x3C  Reserved for future use; MUST be zero for now.
 504               6     0x40  The Compressed Size field is present.
 505               7     0x80  The Uncompressed Size field is present.
 506
 507         If any reserved bit is set, the decoder MUST indicate an error.
 508         It is possible that there is a new field present which the
 509         decoder is not aware of, and can thus parse the Block Header
 510         incorrectly.
 511
 512
 513 3.1.3. Compressed Size
 514
 515         This field is present only if the appropriate bit is set in
 516         the Block Flags field (see Section 3.1.2).
 517
 518         The Compressed Size field contains the size of the Compressed
 519         Data field, which MUST be non-zero. Compressed Size is stored
 520         using the encoding described in Section 1.2. If the Compressed
 521         Size doesn't match the size of the Compressed Data field, the
 522         decoder MUST indicate an error.
 523
 524
 525 3.1.4. Uncompressed Size
 526
 527         This field is present only if the appropriate bit is set in
 528         the Block Flags field (see Section 3.1.2).
 529
 530         The Uncompressed Size field contains the size of the Block
 531         after uncompressing. Uncompressed Size is stored using the
 532         encoding described in Section 1.2. If the Uncompressed Size
 533         does not match the real uncompressed size, the decoder MUST
 534         indicate an error.
 535
 536         Storing the Compressed Size and Uncompressed Size fields serves
 537         several purposes:
 538           - The decoder knows how much memory it needs to allocate
 539             for a temporary buffer in multithreaded mode.
 540           - Simple error detection: wrong size indicates a broken file.
 541           - Seeking forwards to a specific location in streamed mode.
 542
 543         It should be noted that the only reliable way to determine
 544         the real uncompressed size is to uncompress the Block,
 545         because the Block Header and Index fields may contain
 546         (intentionally or unintentionally) invalid information.
 547
 548
 549 3.1.5. List of Filter Flags
 550
 551         +================+================+     +================+
 552         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 553         +================+================+     +================+
 554
 555         The number of Filter Flags fields is stored in the Block Flags
 556         field (see Section 3.1.2).
 557
 558         The format of each Filter Flags field is as follows:
 559
 560             +===========+====================+===================+
 561             | Filter ID | Size of Properties | Filter Properties |
 562             +===========+====================+===================+
 563
 564         Both Filter ID and Size of Properties are stored using the
 565         encoding described in Section 1.2. Size of Properties indicates
 566         the size of the Filter Properties field as bytes. The list of
 567         officially defined Filter IDs and the formats of their Filter
 568         Properties are described in Section 5.3.
 569
 570         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 571         (2^62) are reserved for implementation-specific internal use.
 572         These Filter IDs MUST never be used in List of Filter Flags.
 573
 574
 575 3.1.6. Header Padding
 576
 577         This field contains as many null byte as it is needed to make
 578         the Block Header have the size specified in Block Header Size.
 579         If any of the bytes are not null bytes, the decoder MUST
 580         indicate an error. It is possible that there is a new field
 581         present which the decoder is not aware of, and can thus parse
 582         the Block Header incorrectly.
 583
 584
 585 3.1.7. CRC32
 586
 587         The CRC32 is calculated over everything in the Block Header
 588         field except the CRC32 field itself. It is stored as an
 589         unsigned 32-bit little endian integer. If the calculated
 590         value does not match the stored one, the decoder MUST indicate
 591         an error.
 592
 593         By verifying the CRC32 of the Block Header before parsing the
 594         actual contents allows the decoder to distinguish between
 595         corrupt and unsupported files.
 596
 597
 598 3.2. Compressed Data
 599
 600         The format of Compressed Data depends on Block Flags and List
 601         of Filter Flags. Excluding the descriptions of the simplest
 602         filters in Section 5.3, the format of the filter-specific
 603         encoded data is out of scope of this document.
 604
 605
 606 3.3. Block Padding
 607
 608         Block Padding MUST contain 0-3 null bytes to make the size of
 609         the Block a multiple of four bytes. This can be needed when
 610         the size of Compressed Data is not a multiple of four. If any
 611         of the bytes in Block Padding are not null bytes, the decoder
 612         MUST indicate an error.
 613
 614
 615 3.4. Check
 616
 617         The type and size of the Check field depends on which bits
 618         are set in the Stream Flags field (see Section 2.1.1.2).
 619
 620         The Check, when used, is calculated from the original
 621         uncompressed data. If the calculated Check does not match the
 622         stored one, the decoder MUST indicate an error. If the selected
 623         type of Check is not supported by the decoder, it SHOULD
 624         indicate a warning or error.
 625
 626
 627 4. Index
 628
 629         +-----------------+===================+
 630         | Index Indicator | Number of Records |
 631         +-----------------+===================+
 632
 633              +=================+===============+-+-+-+-+
 634         ---> | List of Records | Index Padding | CRC32 |
 635              +=================+===============+-+-+-+-+
 636
 637         Index serves several purporses. Using it, one can
 638           - verify that all Blocks in a Stream have been processed;
 639           - find out the uncompressed size of a Stream; and
 640           - quickly access the beginning of any Block (random access).
 641
 642
 643 4.1. Index Indicator
 644
 645         This field overlaps with the Block Header Size field (see
 646         Section 3.1.1). The value of Index Indicator is always 0x00.
 647
 648
 649 4.2. Number of Records
 650
 651         This field indicates how many Records there are in the List
 652         of Records field, and thus how many Blocks there are in the
 653         Stream. The value is stored using the encoding described in
 654         Section 1.2. If the decoder has decoded all the Blocks of the
 655         Stream, and then notices that the Number of Records doesn't
 656         match the real number of Blocks, the decoder MUST indicate an
 657         error.
 658
 659
 660 4.3. List of Records
 661
 662         List of Records consists of as many Records as indicated by the
 663         Number of Records field:
 664
 665             +========+========+
 666             | Record | Record | ...
 667             +========+========+
 668
 669         Each Record contains information about one Block:
 670
 671             +===============+===================+
 672             | Unpadded Size | Uncompressed Size |
 673             +===============+===================+
 674
 675         If the decoder has decoded all the Blocks of the Stream, it
 676         MUST verify that the contents of the Records match the real
 677         Unpadded Size and Uncompressed Size of the respective Blocks.
 678
 679         Implementation hint: It is possible to verify the Index with
 680         constant memory usage by calculating for example SHA-256 of
 681         both the real size values and the List of Records, then
 682         comparing the hash values. Implementing this using
 683         non-cryptographic hash like CRC32 SHOULD be avoided unless
 684         small code size is important.
 685
 686         If the decoder supports random-access reading, it MUST verify
 687         that Unpadded Size and Uncompressed Size of every completely
 688         decoded Block match the sizes stored in the Index. If only
 689         partial Block is decoded, the decoder MUST verify that the
 690         processed sizes don't exceed the sizes stored in the Index.
 691
 692
 693 4.3.1. Unpadded Size
 694
 695         This field indicates the size of the Block excluding the Block
 696         Padding field. That is, Unpadded Size is the size of the Block
 697         Header, Compressed Data, and Check fields. Unpadded Size is
 698         stored using the encoding described in Section 1.2. The value
 699         MUST never be zero; with the current structure of Blocks, the
 700         actual minimum value for Unpadded Size is five.
 701
 702         Implementation note: Because the size of the Block Padding
 703         field is not included in Unpadded Size, calculating the total
 704         size of a Stream or doing random-access reading requires
 705         calculating the actual size of the Blocks by rounding Unpadded
 706         Sizes up to the next multiple of four.
 707
 708         The reason to exclude Block Padding from Unpadded Size is to
 709         ease making a raw copy of Compressed Data without Block
 710         Padding. This can be useful, for example, if someone wants
 711         to convert Streams to some other file format quickly.
 712
 713
 714 4.3.2. Uncompressed Size
 715
 716         This field indicates the Uncompressed Size of the respective
 717         Block as bytes. The value is stored using the encoding
 718         described in Section 1.2.
 719
 720
 721 4.4. Index Padding
 722
 723         This field MUST contain 0-3 null bytes to pad the Index to
 724         a multiple of four bytes. If any of the bytes are not null
 725         bytes, the decoder MUST indicate an error.
 726
 727
 728 4.5. CRC32
 729
 730         The CRC32 is calculated over everything in the Index field
 731         except the CRC32 field itself. The CRC32 is stored as an
 732         unsigned 32-bit little endian integer. If the calculated
 733         value does not match the stored one, the decoder MUST indicate
 734         an error.
 735
 736
 737 5. Filter Chains
 738
 739         The Block Flags field defines how many filters are used. When
 740         more than one filter is used, the filters are chained; that is,
 741         the output of one filter is the input of another filter. The
 742         following figure illustrates the direction of data flow.
 743
 744                     v   Uncompressed Data   ^
 745                     |       Filter 0        |
 746             Encoder |       Filter 1        | Decoder
 747                     |       Filter n        |
 748                     v    Compressed Data    ^
 749
 750
 751 5.1. Alignment
 752
 753         Alignment of uncompressed input data is usually the job of
 754         the application producing the data. For example, to get the
 755         best results, an archiver tool should make sure that all
 756         PowerPC executable files in the archive stream start at
 757         offsets that are multiples of four bytes.
 758
 759         Some filters, for example LZMA2, can be configured to take
 760         advantage of specified alignment of input data. Note that
 761         taking advantage of aligned input can be benefical also when
 762         a filter is not the first filter in the chain. For example,
 763         if you compress PowerPC executables, you may want to use the
 764         PowerPC filter and chain that with the LZMA2 filter. Because
 765         not only the input but also the output alignment of the PowerPC
 766         filter is four bytes, it is now benefical to set LZMA2 settings
 767         so that the LZMA2 encoder can take advantage of its
 768         four-byte-aligned input data.
 769
 770         The output of the last filter in the chain is stored to the
 771         Compressed Data field, which is is guaranteed to be aligned
 772         to a multiple of four bytes relative to the beginning of the
 773         Stream. This can increase
 774           - speed, if the filtered data is handled multiple bytes at
 775             a time by the filter-specific encoder and decoder,
 776             because accessing aligned data in computer memory is
 777             usually faster; and
 778           - compression ratio, if the output data is later compressed
 779             with an external compression tool.
 780
 781
 782 5.2. Security
 783
 784         If filters would be allowed to be chained freely, it would be
 785         possible to create malicious files, that would be very slow to
 786         decode. Such files could be used to create denial of service
 787         attacks.
 788
 789         Slow files could occur when multiple filters are chained:
 790
 791             v   Compressed input data
 792             |   Filter 1 decoder (last filter)
 793             |   Filter 0 decoder (non-last filter)
 794             v   Uncompressed output data
 795
 796         The decoder of the last filter in the chain produces a lot of
 797         output from little input. Another filter in the chain takes the
 798         output of the last filter, and produces very little output
 799         while consuming a lot of input. As a result, a lot of data is
 800         moved inside the filter chain, but the filter chain as a whole
 801         gets very little work done.
 802
 803         To prevent this kind of slow files, there are restrictions on
 804         how the filters can be chained. These restrictions MUST be
 805         taken into account when designing new filters.
 806
 807         The maximum number of filters in the chain has been limited to
 808         four, thus there can be at maximum of three non-last filters.
 809         Of these three non-last filters, only two are allowed to change
 810         the size of the data.
 811
 812         The non-last filters, that change the size of the data, MUST
 813         have a limit how much the decoder can compress the data: the
 814         decoder SHOULD produce at least n bytes of output when the
 815         filter is given 2n bytes of input. This  limit is not
 816         absolute, but significant deviations MUST be avoided.
 817
 818         The above limitations guarantee that if the last filter in the
 819         chain produces 4n bytes of output, the chain as a whole will
 820         produce at least n bytes of output.
 821
 822
 823 5.3. Filters
 824
 825 5.3.1. LZMA2
 826
 827         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
 828         compression algorithm with high compression ratio and fast
 829         decompression. LZMA is based on LZ77 and range coding
 830         algorithms.
 831
 832         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 833         LZMA internally, but adds support for flushing the encoder,
 834         uncompressed chunks, eases stateful decoder implementations,
 835         and improves support for multithreading. Thus, the plain LZMA
 836         will not be supported in this file format.
 837
 838             Filter ID:                  0x21
 839             Size of Filter Properties:  1 byte
 840             Changes size of data:       Yes
 841             Allow as a non-last filter: No
 842             Allow as the last filter:   Yes
 843
 844             Preferred alignment:
 845                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 846                 Output data:            1 byte
 847
 848         The format of the one-byte Filter Properties field is as
 849         follows:
 850
 851             Bits   Mask   Description
 852             0-5    0x3F   Dictionary Size
 853             6-7    0xC0   Reserved for future use; MUST be zero for now.
 854
 855         Dictionary Size is encoded with one-bit mantissa and five-bit
 856         exponent. The smallest dictionary size is 4 KiB and the biggest
 857         is 4 GiB.
 858
 859             Raw value   Mantissa   Exponent   Dictionary size
 860                 0           2         11         4 KiB
 861                 1           3         11         6 KiB
 862                 2           2         12         8 KiB
 863                 3           3         12        12 KiB
 864                 4           2         13        16 KiB
 865                 5           3         13        24 KiB
 866                 6           2         14        32 KiB
 867               ...         ...        ...      ...
 868                35           3         27       768 MiB
 869                36           2         28      1024 MiB
 870                37           3         29      1536 MiB
 871                38           2         30      2048 MiB
 872                39           3         30      3072 MiB
 873                40           2         31      4096 MiB - 1 B
 874
 875         Instead of having a table in the decoder, the dictionary size
 876         can be decoded using the following C code:
 877
 878             const uint8_t bits = get_dictionary_flags() & 0x3F;
 879             if (bits > 40)
 880                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 881
 882             uint32_t dictionary_size;
 883             if (bits == 40) {
 884                 dictionary_size = UINT32_MAX;
 885             } else {
 886                 dictionary_size = 2 | (bits & 1);
 887                 dictionary_size <<= bits / 2 + 11;
 888             }
 889
 890
 891 5.3.2. Branch/Call/Jump Filters for Executables
 892
 893         These filters convert relative branch, call, and jump
 894         instructions to their absolute counterparts in executable
 895         files. This conversion increases redundancy and thus
 896         compression ratio.
 897
 898             Size of Filter Properties:  0 or 4 bytes
 899             Changes size of data:       No
 900             Allow as a non-last filter: Yes
 901             Allow as the last filter:   No
 902
 903         Below is the list of filters in this category. The alignment
 904         is the same for both input and output data.
 905
 906             Filter ID   Alignment   Description
 907               0x04       1 byte     x86 filter (BCJ)
 908               0x05       4 bytes    PowerPC (big endian) filter
 909               0x06      16 bytes    IA64 filter
 910               0x07       4 bytes    ARM (little endian) filter
 911               0x08       2 bytes    ARM Thumb (little endian) filter
 912               0x09       4 bytes    SPARC filter
 913
 914         If the size of Filter Properties is four bytes, the Filter
 915         Properties field contains the start offset used for address
 916         conversions. It is stored as an unsigned 32-bit little endian
 917         integer. The start offset MUST be a multiple of the alignment
 918         of the filter as listed in the table above; if it isn't, the
 919         decoder MUST indicate an error. If the size of Filter
 920         Properties is zero, the start offset is zero.
 921
 922         Setting the start offset may be useful if an executable has
 923         multiple sections, and there are many cross-section calls.
 924         Taking advantage of this feature usually requires usage of
 925         the Subblock filter, whose design is not complete yet.
 926
 927
 928 5.3.3. Delta
 929
 930         The Delta filter may increase compression ratio when the value
 931         of the next byte correlates with the value of an earlier byte
 932         at specified distance.
 933
 934             Filter ID:                  0x03
 935             Size of Filter Properties:  1 byte
 936             Changes size of data:       No
 937             Allow as a non-last filter: Yes
 938             Allow as the last filter:   No
 939
 940             Preferred alignment:
 941                 Input data:             1 byte
 942                 Output data:            Same as the original input data
 943
 944         The Properties byte indicates the delta distance, which can be
 945         1-256 bytes backwards from the current byte: 0x00 indicates
 946         distance of 1 byte and 0xFF distance of 256 bytes.
 947
 948
 949 5.3.3.1. Format of the Encoded Output
 950
 951         The code below illustrates both encoding and decoding with
 952         the Delta filter.
 953
 954             // Distance is in the range [1, 256].
 955             const unsigned int distance = get_properties_byte() + 1;
 956             uint8_t pos = 0;
 957             uint8_t delta[256];
 958
 959             memset(delta, 0, sizeof(delta));
 960
 961             while (1) {
 962                 const int byte = read_byte();
 963                 if (byte == EOF)
 964                     break;
 965
 966                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 967                 if (is_encoder) {
 968                     tmp = (uint8_t)(byte) - tmp;
 969                     delta[pos] = (uint8_t)(byte);
 970                 } else {
 971                     tmp = (uint8_t)(byte) + tmp;
 972                     delta[pos] = tmp;
 973                 }
 974
 975                 write_byte(tmp);
 976                 --pos;
 977             }
 978
 979
 980 5.4. Custom Filter IDs
 981
 982         If a developer wants to use custom Filter IDs, he has two
 983         choices. The first choice is to contact Lasse Collin and ask
 984         him to allocate a range of IDs for the developer.
 985
 986         The second choice is to generate a 40-bit random integer,
 987         which the developer can use as his personal Developer ID.
 988         To minimalize the risk of collisions, Developer ID has to be
 989         a randomly generated integer, not manually selected "hex word".
 990         The following command, which works on many free operating
 991         systems, can be used to generate Developer ID:
 992
 993             dd if=/dev/urandom bs=5 count=1 | hexdump
 994
 995         The developer can then use his Developer ID to create unique
 996         (well, hopefully unique) Filter IDs.
 997
 998             Bits    Mask                    Description
 999              0-15   0x0000_0000_0000_FFFF   Filter ID
1000             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1001             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1002
1003         The resulting 63-bit integer will use 9 bytes of space when
1004         stored using the encoding described in Section 1.2. To get
1005         a shorter ID, see the beginning of this Section how to
1006         request a custom ID range.
1007
1008
1009 5.4.1. Reserved Custom Filter ID Ranges
1010
1011         Range                       Description
1012         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1013         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1014         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1015
1016
1017 6. Cyclic Redundancy Checks
1018
1019         There are several incompatible variations to calculate CRC32
1020         and CRC64. For simplicity and clarity, complete examples are
1021         provided to calculate the checks as they are used in this file
1022         format. Implementations MAY use different code as long as it
1023         gives identical results.
1024
1025         The program below reads data from standard input, calculates
1026         the CRC32 and CRC64 values, and prints the calculated values
1027         as big endian hexadecimal strings to standard output.
1028
1029             #include <stddef.h>
1030             #include <inttypes.h>
1031             #include <stdio.h>
1032
1033             uint32_t crc32_table[256];
1034             uint64_t crc64_table[256];
1035
1036             void
1037             init(void)
1038             {
1039                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1040                 static const uint64_t poly64
1041                         = UINT64_C(0xC96C5795D7870F42);
1042
1043                 for (size_t i = 0; i < 256; ++i) {
1044                     uint32_t crc32 = i;
1045                     uint64_t crc64 = i;
1046
1047                     for (size_t j = 0; j < 8; ++j) {
1048                         if (crc32 & 1)
1049                             crc32 = (crc32 >> 1) ^ poly32;
1050                         else
1051                             crc32 >>= 1;
1052
1053                         if (crc64 & 1)
1054                             crc64 = (crc64 >> 1) ^ poly64;
1055                         else
1056                             crc64 >>= 1;
1057                     }
1058
1059                     crc32_table[i] = crc32;
1060                     crc64_table[i] = crc64;
1061                 }
1062             }
1063
1064             uint32_t
1065             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1066             {
1067                 crc = ~crc;
1068                 for (size_t i = 0; i < size; ++i)
1069                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1070                             ^ (crc >> 8);
1071                 return ~crc;
1072             }
1073
1074             uint64_t
1075             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1076             {
1077                 crc = ~crc;
1078                 for (size_t i = 0; i < size; ++i)
1079                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1080                             ^ (crc >> 8);
1081                 return ~crc;
1082             }
1083
1084             int
1085             main()
1086             {
1087                 init();
1088
1089                 uint32_t value32 = 0;
1090                 uint64_t value64 = 0;
1091                 uint64_t total_size = 0;
1092                 uint8_t buf[8192];
1093
1094                 while (1) {
1095                     const size_t buf_size
1096                             = fread(buf, 1, sizeof(buf), stdin);
1097                     if (buf_size == 0)
1098                         break;
1099
1100                     total_size += buf_size;
1101                     value32 = crc32(buf, buf_size, value32);
1102                     value64 = crc64(buf, buf_size, value64);
1103                 }
1104
1105                 printf("Bytes:  %" PRIu64 "\n", total_size);
1106                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1107                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1108
1109                 return 0;
1110             }
1111
1112
1113 7. References
1114
1115         LZMA SDK - The original LZMA implementation
1116         http://7-zip.org/sdk.html
1117
1118         LZMA Utils - LZMA adapted to POSIX-like systems
1119         http://tukaani.org/lzma/
1120
1121         XZ Utils - The next generation of LZMA Utils
1122         http://tukaani.org/xz/
1123
1124         [RFC-1952]
1125         GZIP file format specification version 4.3
1126         http://www.ietf.org/rfc/rfc1952.txt
1127           - Notation of byte boxes in section "2.1. Overall conventions"
1128
1129         [RFC-2119]
1130         Key words for use in RFCs to Indicate Requirement Levels
1131         http://www.ietf.org/rfc/rfc2119.txt
1132
1133         [GNU-tar]
1134         GNU tar 1.21 manual
1135         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1136           - Node 9.4.2 "Blocking Factor", paragraph that begins
1137             "gzip will complain about trailing garbage"
1138           - Note that this URL points to the latest version of the
1139             manual, and may some day not contain the note which is in
1140             1.21. For the exact version of the manual, download GNU
1141             tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1142