doc/xz-file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.0.2 (2009-06-04)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <http://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
  97
  98         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
  99                                 clarifications to Sections 2, 2.2,
 100                                 3.3, 4.4, and 5.3.2
 101
 102         1.0.0     2009-01-14    The first official version
 103
 104
 105 1. Conventions
 106
 107         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 108         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 109         document are to be interpreted as described in [RFC-2119].
 110
 111         Indicating a warning means displaying a message, returning
 112         appropriate exit status, or doing something else to let the
 113         user know that something worth warning occurred. The operation
 114         SHOULD still finish if a warning is indicated.
 115
 116         Indicating an error means displaying a message, returning
 117         appropriate exit status, or doing something else to let the
 118         user know that something prevented successfully finishing the
 119         operation. The operation MUST be aborted once an error has
 120         been indicated.
 121
 122
 123 1.1. Byte and Its Representation
 124
 125         In this document, byte is always 8 bits.
 126
 127         A "null byte" has all bits unset. That is, the value of a null
 128         byte is 0x00.
 129
 130         To represent byte blocks, this document uses notation that
 131         is similar to the notation used in [RFC-1952]:
 132
 133             +-------+
 134             |  Foo  |   One byte.
 135             +-------+
 136
 137             +---+---+
 138             |  Foo  |   Two bytes; that is, some of the vertical bars
 139             +---+---+   can be missing.
 140
 141             +=======+
 142             |  Foo  |   Zero or more bytes.
 143             +=======+
 144
 145         In this document, a boxed byte or a byte sequence declared
 146         using this notation is called "a field". The example field
 147         above would be called "the Foo field" or plain "Foo".
 148
 149         If there are many fields, they may be split to multiple lines.
 150         This is indicated with an arrow ("--->"):
 151
 152             +=====+
 153             | Foo |
 154             +=====+
 155
 156                  +=====+
 157             ---> | Bar |
 158                  +=====+
 159
 160         The above is equivalent to this:
 161
 162             +=====+=====+
 163             | Foo | Bar |
 164             +=====+=====+
 165
 166
 167 1.2. Multibyte Integers
 168
 169         Multibyte integers of static length, such as CRC values,
 170         are stored in little endian byte order (least significant
 171         byte first).
 172
 173         When smaller values are more likely than bigger values (for
 174         example file sizes), multibyte integers are encoded in a
 175         variable-length representation:
 176           - Numbers in the range [0, 127] are copied as is, and take
 177             one byte of space.
 178           - Bigger numbers will occupy two or more bytes. All but the
 179             last byte of the multibyte representation have the highest
 180             (eighth) bit set.
 181
 182         For now, the value of the variable-length integers is limited
 183         to 63 bits, which limits the encoded size of the integer to
 184         nine bytes. These limits may be increased in future if needed.
 185
 186         The following C code illustrates encoding and decoding of
 187         variable-length integers. The functions return the number of
 188         bytes occupied by the integer (1-9), or zero on error.
 189
 190             #include <stddef.h>
 191             #include <inttypes.h>
 192
 193             size_t
 194             encode(uint8_t buf[static 9], uint64_t num)
 195             {
 196                 if (num > UINT64_MAX / 2)
 197                     return 0;
 198
 199                 size_t i = 0;
 200
 201                 while (num >= 0x80) {
 202                     buf[i++] = (uint8_t)(num) | 0x80;
 203                     num >>= 7;
 204                 }
 205
 206                 buf[i++] = (uint8_t)(num);
 207
 208                 return i;
 209             }
 210
 211             size_t
 212             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 213             {
 214                 if (size_max == 0)
 215                     return 0;
 216
 217                 if (size_max > 9)
 218                     size_max = 9;
 219
 220                 *num = buf[0] & 0x7F;
 221                 size_t i = 0;
 222
 223                 while (buf[i++] & 0x80) {
 224                     if (i >= size_max || buf[i] == 0x00)
 225                         return 0;
 226
 227                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 228                 }
 229
 230                 return i;
 231             }
 232
 233
 234 2. Overall Structure of .xz File
 235
 236         A standalone .xz files consist of one or more Streams which may
 237         have Stream Padding between or after them:
 238
 239             +========+================+========+================+
 240             | Stream | Stream Padding | Stream | Stream Padding | ...
 241             +========+================+========+================+
 242
 243         The sizes of Stream and Stream Padding are always multiples
 244         of four bytes, thus the size of every valid .xz file MUST be
 245         a multiple of four bytes.
 246
 247         While a typical file contains only one Stream and no Stream
 248         Padding, a decoder handling standalone .xz files SHOULD support
 249         files that have more than one Stream or Stream Padding.
 250
 251         In contrast to standalone .xz files, when the .xz file format
 252         is used as an internal part of some other file format or
 253         communication protocol, it usually is expected that the decoder
 254         stops after the first Stream, and doesn't look for Stream
 255         Padding or possibly other Streams.
 256
 257
 258 2.1. Stream
 259
 260         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 261         |     Stream Header     | Block | Block | ... | Block |
 262         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 263
 264              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 265         ---> | Index |     Stream Footer     |
 266              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 267
 268         All the above fields have a size that is a multiple of four. If
 269         Stream is used as an internal part of another file format, it
 270         is RECOMMENDED to make the Stream start at an offset that is
 271         a multiple of four bytes.
 272
 273         Stream Header, Index, and Stream Footer are always present in
 274         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 275
 276         There are zero or more Blocks. The maximum number of Blocks is
 277         limited only by the maximum size of the Index field.
 278
 279         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 280         The same limit applies to the total amount of uncompressed
 281         data stored in a Stream.
 282
 283         If an implementation supports handling .xz files with multiple
 284         concatenated Streams, it MAY apply the above limits to the file
 285         as a whole instead of limiting per Stream basis.
 286
 287
 288 2.1.1. Stream Header
 289
 290         +---+---+---+---+---+---+-------+------+--+--+--+--+
 291         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 292         +---+---+---+---+---+---+-------+------+--+--+--+--+
 293
 294
 295 2.1.1.1. Header Magic Bytes
 296
 297         The first six (6) bytes of the Stream are so called Header
 298         Magic Bytes. They can be used to identify the file type.
 299
 300             Using a C array and ASCII:
 301             const uint8_t HEADER_MAGIC[6]
 302                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 303
 304             In plain hexadecimal:
 305             FD 37 7A 58 5A 00
 306
 307         Notes:
 308           - The first byte (0xFD) was chosen so that the files cannot
 309             be erroneously detected as being in .lzma format, in which
 310             the first byte is in the range [0x00, 0xE0].
 311           - The sixth byte (0x00) was chosen to prevent applications
 312             from misdetecting the file as a text file.
 313
 314         If the Header Magic Bytes don't match, the decoder MUST
 315         indicate an error.
 316
 317
 318 2.1.1.2. Stream Flags
 319
 320         The first byte of Stream Flags is always a null byte. In future
 321         this byte may be used to indicate new Stream version or other
 322         Stream properties.
 323
 324         The second byte of Stream Flags is a bit field:
 325
 326             Bit(s)  Mask  Description
 327              0-3    0x0F  Type of Check (see Section 3.4):
 328                               ID    Size      Check name
 329                               0x00   0 bytes  None
 330                               0x01   4 bytes  CRC32
 331                               0x02   4 bytes  (Reserved)
 332                               0x03   4 bytes  (Reserved)
 333                               0x04   8 bytes  CRC64
 334                               0x05   8 bytes  (Reserved)
 335                               0x06   8 bytes  (Reserved)
 336                               0x07  16 bytes  (Reserved)
 337                               0x08  16 bytes  (Reserved)
 338                               0x09  16 bytes  (Reserved)
 339                               0x0A  32 bytes  SHA-256
 340                               0x0B  32 bytes  (Reserved)
 341                               0x0C  32 bytes  (Reserved)
 342                               0x0D  64 bytes  (Reserved)
 343                               0x0E  64 bytes  (Reserved)
 344                               0x0F  64 bytes  (Reserved)
 345              4-7    0xF0  Reserved for future use; MUST be zero for now.
 346
 347         Implementations SHOULD support at least the Check IDs 0x00
 348         (None) and 0x01 (CRC32). Supporting other Check IDs is
 349         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 350         indicate a warning or error.
 351
 352         If any reserved bit is set, the decoder MUST indicate an error.
 353         It is possible that there is a new field present which the
 354         decoder is not aware of, and can thus parse the Stream Header
 355         incorrectly.
 356
 357
 358 2.1.1.3. CRC32
 359
 360         The CRC32 is calculated from the Stream Flags field. It is
 361         stored as an unsigned 32-bit little endian integer. If the
 362         calculated value does not match the stored one, the decoder
 363         MUST indicate an error.
 364
 365         The idea is that Stream Flags would always be two bytes, even
 366         if new features are needed. This way old decoders will be able
 367         to verify the CRC32 calculated from Stream Flags, and thus
 368         distinguish between corrupt files (CRC32 doesn't match) and
 369         files that the decoder doesn't support (CRC32 matches but
 370         Stream Flags has reserved bits set).
 371
 372
 373 2.1.2. Stream Footer
 374
 375         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 376         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 377         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 378
 379
 380 2.1.2.1. CRC32
 381
 382         The CRC32 is calculated from the Backward Size and Stream Flags
 383         fields. It is stored as an unsigned 32-bit little endian
 384         integer. If the calculated value does not match the stored one,
 385         the decoder MUST indicate an error.
 386
 387         The reason to have the CRC32 field before the Backward Size and
 388         Stream Flags fields is to keep the four-byte fields aligned to
 389         a multiple of four bytes.
 390
 391
 392 2.1.2.2. Backward Size
 393
 394         Backward Size is stored as a 32-bit little endian integer,
 395         which indicates the size of the Index field as multiple of
 396         four bytes, minimum value being four bytes:
 397
 398             real_backward_size = (stored_backward_size + 1) * 4;
 399
 400         If the stored value does not match the real size of the Index
 401         field, the decoder MUST indicate an error.
 402
 403         Using a fixed-size integer to store Backward Size makes
 404         it slightly simpler to parse the Stream Footer when the
 405         application needs to parse the Stream backwards.
 406
 407
 408 2.1.2.3. Stream Flags
 409
 410         This is a copy of the Stream Flags field from the Stream
 411         Header. The information stored to Stream Flags is needed
 412         when parsing the Stream backwards. The decoder MUST compare
 413         the Stream Flags fields in both Stream Header and Stream
 414         Footer, and indicate an error if they are not identical.
 415
 416
 417 2.1.2.4. Footer Magic Bytes
 418
 419         As the last step of the decoding process, the decoder MUST
 420         verify the existence of Footer Magic Bytes. If they don't
 421         match, an error MUST be indicated.
 422
 423             Using a C array and ASCII:
 424             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 425
 426             In hexadecimal:
 427             59 5A
 428
 429         The primary reason to have Footer Magic Bytes is to make
 430         it easier to detect incomplete files quickly, without
 431         uncompressing. If the file does not end with Footer Magic Bytes
 432         (excluding Stream Padding described in Section 2.2), it cannot
 433         be undamaged, unless someone has intentionally appended garbage
 434         after the end of the Stream.
 435
 436
 437 2.2. Stream Padding
 438
 439         Only the decoders that support decoding of concatenated Streams
 440         MUST support Stream Padding.
 441
 442         Stream Padding MUST contain only null bytes. To preserve the
 443         four-byte alignment of consecutive Streams, the size of Stream
 444         Padding MUST be a multiple of four bytes. Empty Stream Padding
 445         is allowed. If these requirements are not met, the decoder MUST
 446         indicate an error.
 447
 448         Note that non-empty Stream Padding is allowed at the end of the
 449         file; there doesn't need to be a new Stream after non-empty
 450         Stream Padding. This can be convenient in certain situations
 451         [GNU-tar].
 452
 453         The possibility of Stream Padding MUST be taken into account
 454         when designing an application that parses Streams backwards,
 455         and the application supports concatenated Streams.
 456
 457
 458 3. Block
 459
 460         +==============+=================+===============+=======+
 461         | Block Header | Compressed Data | Block Padding | Check |
 462         +==============+=================+===============+=======+
 463
 464
 465 3.1. Block Header
 466
 467         +-------------------+-------------+=================+
 468         | Block Header Size | Block Flags | Compressed Size |
 469         +-------------------+-------------+=================+
 470
 471              +===================+======================+
 472         ---> | Uncompressed Size | List of Filter Flags |
 473              +===================+======================+
 474
 475              +================+--+--+--+--+
 476         ---> | Header Padding |   CRC32   |
 477              +================+--+--+--+--+
 478
 479
 480 3.1.1. Block Header Size
 481
 482         This field overlaps with the Index Indicator field (see
 483         Section 4.1).
 484
 485         This field contains the size of the Block Header field,
 486         including the Block Header Size field itself. Valid values are
 487         in the range [0x01, 0xFF], which indicate the size of the Block
 488         Header as multiples of four bytes, minimum size being eight
 489         bytes:
 490
 491             real_header_size = (encoded_header_size + 1) * 4;
 492
 493         If bigger Block Header is needed in future, a new field can be
 494         added between the current Block Header and Compressed Data
 495         fields. The presence of this new field would be indicated in
 496         the Block Header.
 497
 498
 499 3.1.2. Block Flags
 500
 501         The first byte of the Block Flags field is a bit field:
 502
 503             Bit(s)  Mask  Description
 504              0-1    0x03  Number of filters (1-4)
 505              2-5    0x3C  Reserved for future use; MUST be zero for now.
 506               6     0x40  The Compressed Size field is present.
 507               7     0x80  The Uncompressed Size field is present.
 508
 509         If any reserved bit is set, the decoder MUST indicate an error.
 510         It is possible that there is a new field present which the
 511         decoder is not aware of, and can thus parse the Block Header
 512         incorrectly.
 513
 514
 515 3.1.3. Compressed Size
 516
 517         This field is present only if the appropriate bit is set in
 518         the Block Flags field (see Section 3.1.2).
 519
 520         The Compressed Size field contains the size of the Compressed
 521         Data field, which MUST be non-zero. Compressed Size is stored
 522         using the encoding described in Section 1.2. If the Compressed
 523         Size doesn't match the size of the Compressed Data field, the
 524         decoder MUST indicate an error.
 525
 526
 527 3.1.4. Uncompressed Size
 528
 529         This field is present only if the appropriate bit is set in
 530         the Block Flags field (see Section 3.1.2).
 531
 532         The Uncompressed Size field contains the size of the Block
 533         after uncompressing. Uncompressed Size is stored using the
 534         encoding described in Section 1.2. If the Uncompressed Size
 535         does not match the real uncompressed size, the decoder MUST
 536         indicate an error.
 537
 538         Storing the Compressed Size and Uncompressed Size fields serves
 539         several purposes:
 540           - The decoder knows how much memory it needs to allocate
 541             for a temporary buffer in multithreaded mode.
 542           - Simple error detection: wrong size indicates a broken file.
 543           - Seeking forwards to a specific location in streamed mode.
 544
 545         It should be noted that the only reliable way to determine
 546         the real uncompressed size is to uncompress the Block,
 547         because the Block Header and Index fields may contain
 548         (intentionally or unintentionally) invalid information.
 549
 550
 551 3.1.5. List of Filter Flags
 552
 553         +================+================+     +================+
 554         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 555         +================+================+     +================+
 556
 557         The number of Filter Flags fields is stored in the Block Flags
 558         field (see Section 3.1.2).
 559
 560         The format of each Filter Flags field is as follows:
 561
 562             +===========+====================+===================+
 563             | Filter ID | Size of Properties | Filter Properties |
 564             +===========+====================+===================+
 565
 566         Both Filter ID and Size of Properties are stored using the
 567         encoding described in Section 1.2. Size of Properties indicates
 568         the size of the Filter Properties field as bytes. The list of
 569         officially defined Filter IDs and the formats of their Filter
 570         Properties are described in Section 5.3.
 571
 572         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 573         (2^62) are reserved for implementation-specific internal use.
 574         These Filter IDs MUST never be used in List of Filter Flags.
 575
 576
 577 3.1.6. Header Padding
 578
 579         This field contains as many null byte as it is needed to make
 580         the Block Header have the size specified in Block Header Size.
 581         If any of the bytes are not null bytes, the decoder MUST
 582         indicate an error. It is possible that there is a new field
 583         present which the decoder is not aware of, and can thus parse
 584         the Block Header incorrectly.
 585
 586
 587 3.1.7. CRC32
 588
 589         The CRC32 is calculated over everything in the Block Header
 590         field except the CRC32 field itself. It is stored as an
 591         unsigned 32-bit little endian integer. If the calculated
 592         value does not match the stored one, the decoder MUST indicate
 593         an error.
 594
 595         By verifying the CRC32 of the Block Header before parsing the
 596         actual contents allows the decoder to distinguish between
 597         corrupt and unsupported files.
 598
 599
 600 3.2. Compressed Data
 601
 602         The format of Compressed Data depends on Block Flags and List
 603         of Filter Flags. Excluding the descriptions of the simplest
 604         filters in Section 5.3, the format of the filter-specific
 605         encoded data is out of scope of this document.
 606
 607
 608 3.3. Block Padding
 609
 610         Block Padding MUST contain 0-3 null bytes to make the size of
 611         the Block a multiple of four bytes. This can be needed when
 612         the size of Compressed Data is not a multiple of four. If any
 613         of the bytes in Block Padding are not null bytes, the decoder
 614         MUST indicate an error.
 615
 616
 617 3.4. Check
 618
 619         The type and size of the Check field depends on which bits
 620         are set in the Stream Flags field (see Section 2.1.1.2).
 621
 622         The Check, when used, is calculated from the original
 623         uncompressed data. If the calculated Check does not match the
 624         stored one, the decoder MUST indicate an error. If the selected
 625         type of Check is not supported by the decoder, it SHOULD
 626         indicate a warning or error.
 627
 628
 629 4. Index
 630
 631         +-----------------+===================+
 632         | Index Indicator | Number of Records |
 633         +-----------------+===================+
 634
 635              +=================+===============+-+-+-+-+
 636         ---> | List of Records | Index Padding | CRC32 |
 637              +=================+===============+-+-+-+-+
 638
 639         Index serves several purposes. Using it, one can
 640           - verify that all Blocks in a Stream have been processed;
 641           - find out the uncompressed size of a Stream; and
 642           - quickly access the beginning of any Block (random access).
 643
 644
 645 4.1. Index Indicator
 646
 647         This field overlaps with the Block Header Size field (see
 648         Section 3.1.1). The value of Index Indicator is always 0x00.
 649
 650
 651 4.2. Number of Records
 652
 653         This field indicates how many Records there are in the List
 654         of Records field, and thus how many Blocks there are in the
 655         Stream. The value is stored using the encoding described in
 656         Section 1.2. If the decoder has decoded all the Blocks of the
 657         Stream, and then notices that the Number of Records doesn't
 658         match the real number of Blocks, the decoder MUST indicate an
 659         error.
 660
 661
 662 4.3. List of Records
 663
 664         List of Records consists of as many Records as indicated by the
 665         Number of Records field:
 666
 667             +========+========+
 668             | Record | Record | ...
 669             +========+========+
 670
 671         Each Record contains information about one Block:
 672
 673             +===============+===================+
 674             | Unpadded Size | Uncompressed Size |
 675             +===============+===================+
 676
 677         If the decoder has decoded all the Blocks of the Stream, it
 678         MUST verify that the contents of the Records match the real
 679         Unpadded Size and Uncompressed Size of the respective Blocks.
 680
 681         Implementation hint: It is possible to verify the Index with
 682         constant memory usage by calculating for example SHA-256 of
 683         both the real size values and the List of Records, then
 684         comparing the hash values. Implementing this using
 685         non-cryptographic hash like CRC32 SHOULD be avoided unless
 686         small code size is important.
 687
 688         If the decoder supports random-access reading, it MUST verify
 689         that Unpadded Size and Uncompressed Size of every completely
 690         decoded Block match the sizes stored in the Index. If only
 691         partial Block is decoded, the decoder MUST verify that the
 692         processed sizes don't exceed the sizes stored in the Index.
 693
 694
 695 4.3.1. Unpadded Size
 696
 697         This field indicates the size of the Block excluding the Block
 698         Padding field. That is, Unpadded Size is the size of the Block
 699         Header, Compressed Data, and Check fields. Unpadded Size is
 700         stored using the encoding described in Section 1.2. The value
 701         MUST never be zero; with the current structure of Blocks, the
 702         actual minimum value for Unpadded Size is five.
 703
 704         Implementation note: Because the size of the Block Padding
 705         field is not included in Unpadded Size, calculating the total
 706         size of a Stream or doing random-access reading requires
 707         calculating the actual size of the Blocks by rounding Unpadded
 708         Sizes up to the next multiple of four.
 709
 710         The reason to exclude Block Padding from Unpadded Size is to
 711         ease making a raw copy of Compressed Data without Block
 712         Padding. This can be useful, for example, if someone wants
 713         to convert Streams to some other file format quickly.
 714
 715
 716 4.3.2. Uncompressed Size
 717
 718         This field indicates the Uncompressed Size of the respective
 719         Block as bytes. The value is stored using the encoding
 720         described in Section 1.2.
 721
 722
 723 4.4. Index Padding
 724
 725         This field MUST contain 0-3 null bytes to pad the Index to
 726         a multiple of four bytes. If any of the bytes are not null
 727         bytes, the decoder MUST indicate an error.
 728
 729
 730 4.5. CRC32
 731
 732         The CRC32 is calculated over everything in the Index field
 733         except the CRC32 field itself. The CRC32 is stored as an
 734         unsigned 32-bit little endian integer. If the calculated
 735         value does not match the stored one, the decoder MUST indicate
 736         an error.
 737
 738
 739 5. Filter Chains
 740
 741         The Block Flags field defines how many filters are used. When
 742         more than one filter is used, the filters are chained; that is,
 743         the output of one filter is the input of another filter. The
 744         following figure illustrates the direction of data flow.
 745
 746                     v   Uncompressed Data   ^
 747                     |       Filter 0        |
 748             Encoder |       Filter 1        | Decoder
 749                     |       Filter n        |
 750                     v    Compressed Data    ^
 751
 752
 753 5.1. Alignment
 754
 755         Alignment of uncompressed input data is usually the job of
 756         the application producing the data. For example, to get the
 757         best results, an archiver tool should make sure that all
 758         PowerPC executable files in the archive stream start at
 759         offsets that are multiples of four bytes.
 760
 761         Some filters, for example LZMA2, can be configured to take
 762         advantage of specified alignment of input data. Note that
 763         taking advantage of aligned input can be benefical also when
 764         a filter is not the first filter in the chain. For example,
 765         if you compress PowerPC executables, you may want to use the
 766         PowerPC filter and chain that with the LZMA2 filter. Because
 767         not only the input but also the output alignment of the PowerPC
 768         filter is four bytes, it is now benefical to set LZMA2 settings
 769         so that the LZMA2 encoder can take advantage of its
 770         four-byte-aligned input data.
 771
 772         The output of the last filter in the chain is stored to the
 773         Compressed Data field, which is is guaranteed to be aligned
 774         to a multiple of four bytes relative to the beginning of the
 775         Stream. This can increase
 776           - speed, if the filtered data is handled multiple bytes at
 777             a time by the filter-specific encoder and decoder,
 778             because accessing aligned data in computer memory is
 779             usually faster; and
 780           - compression ratio, if the output data is later compressed
 781             with an external compression tool.
 782
 783
 784 5.2. Security
 785
 786         If filters would be allowed to be chained freely, it would be
 787         possible to create malicious files, that would be very slow to
 788         decode. Such files could be used to create denial of service
 789         attacks.
 790
 791         Slow files could occur when multiple filters are chained:
 792
 793             v   Compressed input data
 794             |   Filter 1 decoder (last filter)
 795             |   Filter 0 decoder (non-last filter)
 796             v   Uncompressed output data
 797
 798         The decoder of the last filter in the chain produces a lot of
 799         output from little input. Another filter in the chain takes the
 800         output of the last filter, and produces very little output
 801         while consuming a lot of input. As a result, a lot of data is
 802         moved inside the filter chain, but the filter chain as a whole
 803         gets very little work done.
 804
 805         To prevent this kind of slow files, there are restrictions on
 806         how the filters can be chained. These restrictions MUST be
 807         taken into account when designing new filters.
 808
 809         The maximum number of filters in the chain has been limited to
 810         four, thus there can be at maximum of three non-last filters.
 811         Of these three non-last filters, only two are allowed to change
 812         the size of the data.
 813
 814         The non-last filters, that change the size of the data, MUST
 815         have a limit how much the decoder can compress the data: the
 816         decoder SHOULD produce at least n bytes of output when the
 817         filter is given 2n bytes of input. This  limit is not
 818         absolute, but significant deviations MUST be avoided.
 819
 820         The above limitations guarantee that if the last filter in the
 821         chain produces 4n bytes of output, the chain as a whole will
 822         produce at least n bytes of output.
 823
 824
 825 5.3. Filters
 826
 827 5.3.1. LZMA2
 828
 829         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
 830         compression algorithm with high compression ratio and fast
 831         decompression. LZMA is based on LZ77 and range coding
 832         algorithms.
 833
 834         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 835         LZMA internally, but adds support for flushing the encoder,
 836         uncompressed chunks, eases stateful decoder implementations,
 837         and improves support for multithreading. Thus, the plain LZMA
 838         will not be supported in this file format.
 839
 840             Filter ID:                  0x21
 841             Size of Filter Properties:  1 byte
 842             Changes size of data:       Yes
 843             Allow as a non-last filter: No
 844             Allow as the last filter:   Yes
 845
 846             Preferred alignment:
 847                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 848                 Output data:            1 byte
 849
 850         The format of the one-byte Filter Properties field is as
 851         follows:
 852
 853             Bits   Mask   Description
 854             0-5    0x3F   Dictionary Size
 855             6-7    0xC0   Reserved for future use; MUST be zero for now.
 856
 857         Dictionary Size is encoded with one-bit mantissa and five-bit
 858         exponent. The smallest dictionary size is 4 KiB and the biggest
 859         is 4 GiB.
 860
 861             Raw value   Mantissa   Exponent   Dictionary size
 862                 0           2         11         4 KiB
 863                 1           3         11         6 KiB
 864                 2           2         12         8 KiB
 865                 3           3         12        12 KiB
 866                 4           2         13        16 KiB
 867                 5           3         13        24 KiB
 868                 6           2         14        32 KiB
 869               ...         ...        ...      ...
 870                35           3         27       768 MiB
 871                36           2         28      1024 MiB
 872                37           3         29      1536 MiB
 873                38           2         30      2048 MiB
 874                39           3         30      3072 MiB
 875                40           2         31      4096 MiB - 1 B
 876
 877         Instead of having a table in the decoder, the dictionary size
 878         can be decoded using the following C code:
 879
 880             const uint8_t bits = get_dictionary_flags() & 0x3F;
 881             if (bits > 40)
 882                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 883
 884             uint32_t dictionary_size;
 885             if (bits == 40) {
 886                 dictionary_size = UINT32_MAX;
 887             } else {
 888                 dictionary_size = 2 | (bits & 1);
 889                 dictionary_size <<= bits / 2 + 11;
 890             }
 891
 892
 893 5.3.2. Branch/Call/Jump Filters for Executables
 894
 895         These filters convert relative branch, call, and jump
 896         instructions to their absolute counterparts in executable
 897         files. This conversion increases redundancy and thus
 898         compression ratio.
 899
 900             Size of Filter Properties:  0 or 4 bytes
 901             Changes size of data:       No
 902             Allow as a non-last filter: Yes
 903             Allow as the last filter:   No
 904
 905         Below is the list of filters in this category. The alignment
 906         is the same for both input and output data.
 907
 908             Filter ID   Alignment   Description
 909               0x04       1 byte     x86 filter (BCJ)
 910               0x05       4 bytes    PowerPC (big endian) filter
 911               0x06      16 bytes    IA64 filter
 912               0x07       4 bytes    ARM (little endian) filter
 913               0x08       2 bytes    ARM Thumb (little endian) filter
 914               0x09       4 bytes    SPARC filter
 915
 916         If the size of Filter Properties is four bytes, the Filter
 917         Properties field contains the start offset used for address
 918         conversions. It is stored as an unsigned 32-bit little endian
 919         integer. The start offset MUST be a multiple of the alignment
 920         of the filter as listed in the table above; if it isn't, the
 921         decoder MUST indicate an error. If the size of Filter
 922         Properties is zero, the start offset is zero.
 923
 924         Setting the start offset may be useful if an executable has
 925         multiple sections, and there are many cross-section calls.
 926         Taking advantage of this feature usually requires usage of
 927         the Subblock filter, whose design is not complete yet.
 928
 929
 930 5.3.3. Delta
 931
 932         The Delta filter may increase compression ratio when the value
 933         of the next byte correlates with the value of an earlier byte
 934         at specified distance.
 935
 936             Filter ID:                  0x03
 937             Size of Filter Properties:  1 byte
 938             Changes size of data:       No
 939             Allow as a non-last filter: Yes
 940             Allow as the last filter:   No
 941
 942             Preferred alignment:
 943                 Input data:             1 byte
 944                 Output data:            Same as the original input data
 945
 946         The Properties byte indicates the delta distance, which can be
 947         1-256 bytes backwards from the current byte: 0x00 indicates
 948         distance of 1 byte and 0xFF distance of 256 bytes.
 949
 950
 951 5.3.3.1. Format of the Encoded Output
 952
 953         The code below illustrates both encoding and decoding with
 954         the Delta filter.
 955
 956             // Distance is in the range [1, 256].
 957             const unsigned int distance = get_properties_byte() + 1;
 958             uint8_t pos = 0;
 959             uint8_t delta[256];
 960
 961             memset(delta, 0, sizeof(delta));
 962
 963             while (1) {
 964                 const int byte = read_byte();
 965                 if (byte == EOF)
 966                     break;
 967
 968                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 969                 if (is_encoder) {
 970                     tmp = (uint8_t)(byte) - tmp;
 971                     delta[pos] = (uint8_t)(byte);
 972                 } else {
 973                     tmp = (uint8_t)(byte) + tmp;
 974                     delta[pos] = tmp;
 975                 }
 976
 977                 write_byte(tmp);
 978                 --pos;
 979             }
 980
 981
 982 5.4. Custom Filter IDs
 983
 984         If a developer wants to use custom Filter IDs, he has two
 985         choices. The first choice is to contact Lasse Collin and ask
 986         him to allocate a range of IDs for the developer.
 987
 988         The second choice is to generate a 40-bit random integer,
 989         which the developer can use as his personal Developer ID.
 990         To minimalize the risk of collisions, Developer ID has to be
 991         a randomly generated integer, not manually selected "hex word".
 992         The following command, which works on many free operating
 993         systems, can be used to generate Developer ID:
 994
 995             dd if=/dev/urandom bs=5 count=1 | hexdump
 996
 997         The developer can then use his Developer ID to create unique
 998         (well, hopefully unique) Filter IDs.
 999
1000             Bits    Mask                    Description
1001              0-15   0x0000_0000_0000_FFFF   Filter ID
1002             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1003             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1004
1005         The resulting 63-bit integer will use 9 bytes of space when
1006         stored using the encoding described in Section 1.2. To get
1007         a shorter ID, see the beginning of this Section how to
1008         request a custom ID range.
1009
1010
1011 5.4.1. Reserved Custom Filter ID Ranges
1012
1013         Range                       Description
1014         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1015         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1016         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1017
1018
1019 6. Cyclic Redundancy Checks
1020
1021         There are several incompatible variations to calculate CRC32
1022         and CRC64. For simplicity and clarity, complete examples are
1023         provided to calculate the checks as they are used in this file
1024         format. Implementations MAY use different code as long as it
1025         gives identical results.
1026
1027         The program below reads data from standard input, calculates
1028         the CRC32 and CRC64 values, and prints the calculated values
1029         as big endian hexadecimal strings to standard output.
1030
1031             #include <stddef.h>
1032             #include <inttypes.h>
1033             #include <stdio.h>
1034
1035             uint32_t crc32_table[256];
1036             uint64_t crc64_table[256];
1037
1038             void
1039             init(void)
1040             {
1041                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1042                 static const uint64_t poly64
1043                         = UINT64_C(0xC96C5795D7870F42);
1044
1045                 for (size_t i = 0; i < 256; ++i) {
1046                     uint32_t crc32 = i;
1047                     uint64_t crc64 = i;
1048
1049                     for (size_t j = 0; j < 8; ++j) {
1050                         if (crc32 & 1)
1051                             crc32 = (crc32 >> 1) ^ poly32;
1052                         else
1053                             crc32 >>= 1;
1054
1055                         if (crc64 & 1)
1056                             crc64 = (crc64 >> 1) ^ poly64;
1057                         else
1058                             crc64 >>= 1;
1059                     }
1060
1061                     crc32_table[i] = crc32;
1062                     crc64_table[i] = crc64;
1063                 }
1064             }
1065
1066             uint32_t
1067             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1068             {
1069                 crc = ~crc;
1070                 for (size_t i = 0; i < size; ++i)
1071                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1072                             ^ (crc >> 8);
1073                 return ~crc;
1074             }
1075
1076             uint64_t
1077             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1078             {
1079                 crc = ~crc;
1080                 for (size_t i = 0; i < size; ++i)
1081                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1082                             ^ (crc >> 8);
1083                 return ~crc;
1084             }
1085
1086             int
1087             main()
1088             {
1089                 init();
1090
1091                 uint32_t value32 = 0;
1092                 uint64_t value64 = 0;
1093                 uint64_t total_size = 0;
1094                 uint8_t buf[8192];
1095
1096                 while (1) {
1097                     const size_t buf_size
1098                             = fread(buf, 1, sizeof(buf), stdin);
1099                     if (buf_size == 0)
1100                         break;
1101
1102                     total_size += buf_size;
1103                     value32 = crc32(buf, buf_size, value32);
1104                     value64 = crc64(buf, buf_size, value64);
1105                 }
1106
1107                 printf("Bytes:  %" PRIu64 "\n", total_size);
1108                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1109                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1110
1111                 return 0;
1112             }
1113
1114
1115 7. References
1116
1117         LZMA SDK - The original LZMA implementation
1118         http://7-zip.org/sdk.html
1119
1120         LZMA Utils - LZMA adapted to POSIX-like systems
1121         http://tukaani.org/lzma/
1122
1123         XZ Utils - The next generation of LZMA Utils
1124         http://tukaani.org/xz/
1125
1126         [RFC-1952]
1127         GZIP file format specification version 4.3
1128         http://www.ietf.org/rfc/rfc1952.txt
1129           - Notation of byte boxes in section "2.1. Overall conventions"
1130
1131         [RFC-2119]
1132         Key words for use in RFCs to Indicate Requirement Levels
1133         http://www.ietf.org/rfc/rfc2119.txt
1134
1135         [GNU-tar]
1136         GNU tar 1.21 manual
1137         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1138           - Node 9.4.2 "Blocking Factor", paragraph that begins
1139             "gzip will complain about trailing garbage"
1140           - Note that this URL points to the latest version of the
1141             manual, and may some day not contain the note which is in
1142             1.21. For the exact version of the manual, download GNU
1143             tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1144