doc/file-format.txt

   1
   2 The .xz File Format
   3 -------------------
   4
   5         0. Preface
   6            0.1. Copyright Notices
   7            0.2. Changes
   8         1. Conventions
   9            1.1. Byte and Its Representation
  10            1.2. Multibyte Integers
  11         2. Overall Structure of .xz File
  12            2.1. Stream
  13                 2.1.1. Stream Header
  14                        2.1.1.1. Header Magic Bytes
  15                        2.1.1.2. Stream Flags
  16                        2.1.1.3. CRC32
  17                 2.1.2. Stream Footer
  18                        2.1.2.1. CRC32
  19                        2.1.2.2. Backward Size
  20                        2.1.2.3. Stream Flags
  21                        2.1.2.4. Footer Magic Bytes
  22            2.2. Stream Padding
  23         3. Block
  24            3.1. Block Header
  25                 3.1.1. Block Header Size
  26                 3.1.2. Block Flags
  27                 3.1.3. Compressed Size
  28                 3.1.4. Uncompressed Size
  29                 3.1.5. List of Filter Flags
  30                 3.1.6. Header Padding
  31                 3.1.7. CRC32
  32            3.2. Compressed Data
  33            3.3. Block Padding
  34            3.4. Check
  35         4. Index
  36            4.1. Index Indicator
  37            4.2. Number of Records
  38            4.3. List of Records
  39                 4.3.1. Unpadded Size
  40                 4.3.2. Uncompressed Size
  41            4.4. Index Padding
  42            4.5. CRC32
  43         5. Filter Chains
  44            5.1. Alignment
  45            5.2. Security
  46            5.3. Filters
  47                 5.3.1. LZMA2
  48                 5.3.2. Branch/Call/Jump Filters for Executables
  49                 5.3.3. Delta
  50                        5.3.3.1. Format of the Encoded Output
  51            5.4. Custom Filter IDs
  52                 5.4.1. Reserved Custom Filter ID Ranges
  53         6. Cyclic Redundancy Checks
  54         7. References
  55
  56
  57 0. Preface
  58
  59         This document describes the .xz file format (filename suffix
  60         ".xz", MIME type "application/x-xz"). It is intended that this
  61         this format replace the old .lzma format used by LZMA SDK and
  62         LZMA Utils.
  63
  64         IMPORTANT:  The version described in this document is a
  65                     draft, NOT a final, official version. Changes
  66                     are possible.
  67
  68
  69 0.1. Copyright Notices
  70
  71         Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org>
  72         Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>
  73
  74         Copying and distribution of this file, with or without
  75         modification, are permitted in any medium without royalty
  76         provided the copyright notice and this notice are preserved.
  77         Modified versions must be marked as such.
  78
  79         All source code examples given in this document are put into
  80         the public domain by the authors of this document.
  81
  82         Special thanks for helping with this document goes to
  83         Igor Pavlov. Thanks for helping with this document goes to
  84         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  85
  86
  87 0.2. Changes
  88
  89         Last modified: 2008-11-03 00:35+0200
  90
  91         (A changelog will be kept once the first official version
  92         is made.)
  93
  94
  95 1. Conventions
  96
  97         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
  98         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
  99         document are to be interpreted as described in [RFC-2119].
 100
 101         Indicating a warning means displaying a message, returning
 102         appropriate exit status, or doing something else to let the
 103         user know that something worth warning occurred. The operation
 104         SHOULD still finish if a warning is indicated.
 105
 106         Indicating an error means displaying a message, returning
 107         appropriate exit status, or doing something else to let the
 108         user know that something prevented successfully finishing the
 109         operation. The operation MUST be aborted once an error has
 110         been indicated.
 111
 112
 113 1.1. Byte and Its Representation
 114
 115         In this document, byte is always 8 bits.
 116
 117         A "null byte" has all bits unset. That is, the value of a null
 118         byte is 0x00.
 119
 120         To represent byte blocks, this document uses notation that
 121         is similar to the notation used in [RFC-1952]:
 122
 123             +-------+
 124             |  Foo  |   One byte.
 125             +-------+
 126
 127             +---+---+
 128             |  Foo  |   Two bytes; that is, some of the vertical bars
 129             +---+---+   can be missing.
 130
 131             +=======+
 132             |  Foo  |   Zero or more bytes.
 133             +=======+
 134
 135         In this document, a boxed byte or a byte sequence declared
 136         using this notation is called "a field". The example field
 137         above would be called "the Foo field" or plain "Foo".
 138
 139         If there are many fields, they may be split to multiple lines.
 140         This is indicated with an arrow ("--->"):
 141
 142             +=====+
 143             | Foo |
 144             +=====+
 145
 146                  +=====+
 147             ---> | Bar |
 148                  +=====+
 149
 150         The above is equivalent to this:
 151
 152             +=====+=====+
 153             | Foo | Bar |
 154             +=====+=====+
 155
 156
 157 1.2. Multibyte Integers
 158
 159         Multibyte integers of static length, such as CRC values,
 160         are stored in little endian byte order (least significant
 161         byte first).
 162
 163         When smaller values are more likely than bigger values (for
 164         example file sizes), multibyte integers are encoded in a
 165         variable-length representation:
 166           - Numbers in the range [0, 127] are copied as is, and take
 167             one byte of space.
 168           - Bigger numbers will occupy two or more bytes. All but the
 169             last byte of the multibyte representation have the highest
 170             (eighth) bit set.
 171
 172         For now, the value of the variable-length integers is limited
 173         to 63 bits, which limits the encoded size of the integer to
 174         nine bytes. These limits may be increased in future if needed.
 175
 176         The following C code illustrates encoding and decoding of
 177         variable-length integers. The functions return the number of
 178         bytes occupied by the integer (1-9), or zero on error.
 179
 180             #include <sys/types.h>
 181             #include <inttypes.h>
 182
 183             size_t
 184             encode(uint8_t buf[static 9], uint64_t num)
 185             {
 186                 if (num > UINT64_MAX / 2)
 187                     return 0;
 188
 189                 size_t i = 0;
 190
 191                 while (num >= 0x80) {
 192                     buf[i++] = (uint8_t)(num) | 0x80;
 193                     num >>= 7;
 194                 }
 195
 196                 buf[i++] = (uint8_t)(num);
 197
 198                 return i;
 199             }
 200
 201             size_t
 202             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 203             {
 204                 if (size_max == 0)
 205                     return 0;
 206
 207                 if (size_max > 9)
 208                     size_max = 9;
 209
 210                 *num = buf[0] & 0x7F;
 211                 size_t i = 0;
 212
 213                 while (buf[i++] & 0x80) {
 214                     if (i >= size_max || buf[i] == 0x00)
 215                         return 0;
 216
 217                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 218                 }
 219
 220                 return i;
 221             }
 222
 223
 224 2. Overall Structure of .xz File
 225
 226         A standalone .xz files consist of one or more Streams which may
 227         have Stream Padding between or after them:
 228
 229             +========+================+========+================+
 230             | Stream | Stream Padding | Stream | Stream Padding | ...
 231             +========+================+========+================+
 232
 233         While a typical file contains only one Stream and no Stream
 234         Padding, a decoder handling standalone .xz files SHOULD support
 235         files that have more than one Stream or Stream Padding.
 236
 237         In contrast to standalone .xz files, when the .xz file format
 238         is used as an internal part of some other file format or
 239         communication protocol, it usually is expected that the decoder
 240         stops after the first Stream, and doesn't look for Stream
 241         Padding or possibly other Streams.
 242
 243
 244 2.1. Stream
 245
 246         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 247         |     Stream Header     | Block | Block | ... | Block |
 248         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 249
 250              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 251         ---> | Index |     Stream Footer     |
 252              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 253
 254         All the above fields have a size that is a multiple of four. If
 255         Stream is used as an internal part of another file format, it
 256         is RECOMMENDED to make the Stream start at an offset that is
 257         a multiple of four bytes.
 258
 259         Stream Header, Index, and Stream Footer are always present in
 260         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 261
 262         There are zero or more Blocks. The maximum number of Blocks is
 263         limited only by the maximum size of the Index field.
 264
 265         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 266         The same limit applies to the total amount of uncompressed
 267         data stored in a Stream.
 268
 269         If an implementation supports handling .xz files with multiple
 270         concatenated Streams, it MAY apply the above limits to the file
 271         as a whole instead of limiting per Stream basis.
 272
 273
 274 2.1.1. Stream Header
 275
 276         +---+---+---+---+---+---+-------+------+--+--+--+--+
 277         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 278         +---+---+---+---+---+---+-------+------+--+--+--+--+
 279
 280
 281 2.1.1.1. Header Magic Bytes
 282
 283         The first six (6) bytes of the Stream are so called Header
 284         Magic Bytes. They can be used to identify the file type.
 285
 286             Using a C array and ASCII:
 287             const uint8_t HEADER_MAGIC[6]
 288                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 289
 290             In plain hexadecimal:
 291             FD 37 7A 58 5A 00
 292
 293         Notes:
 294           - The first byte (0xFD) was chosen so that the files cannot
 295             be erroneously detected as being in .lzma format, in which
 296             the first byte is in the range [0x00, 0xE0].
 297           - The sixth byte (0x00) was chosen to prevent applications
 298             from misdetecting the file as a text file.
 299
 300         If the Header Magic Bytes don't match, the decoder MUST
 301         indicate an error.
 302
 303
 304 2.1.1.2. Stream Flags
 305
 306         The first byte of Stream Flags is always a null byte. In future
 307         this byte may be used to indicate new Stream version or other
 308         Stream properties.
 309
 310         The second byte of Stream Flags is a bit field:
 311
 312             Bit(s)  Mask  Description
 313              0-3    0x0F  Type of Check (see Section 3.4):
 314                               ID    Size      Check name
 315                               0x00   0 bytes  None
 316                               0x01   4 bytes  CRC32
 317                               0x02   4 bytes  (Reserved)
 318                               0x03   4 bytes  (Reserved)
 319                               0x04   8 bytes  CRC64
 320                               0x05   8 bytes  (Reserved)
 321                               0x06   8 bytes  (Reserved)
 322                               0x07  16 bytes  (Reserved)
 323                               0x08  16 bytes  (Reserved)
 324                               0x09  16 bytes  (Reserved)
 325                               0x0A  32 bytes  SHA-256
 326                               0x0B  32 bytes  (Reserved)
 327                               0x0C  32 bytes  (Reserved)
 328                               0x0D  64 bytes  (Reserved)
 329                               0x0E  64 bytes  (Reserved)
 330                               0x0F  64 bytes  (Reserved)
 331              4-7    0xF0  Reserved for future use; MUST be zero for now.
 332
 333         Implementations SHOULD support at least the Check IDs 0x00
 334         (None) and 0x01 (CRC32). Supporting other Check IDs is
 335         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 336         indicate a warning or error.
 337
 338         If any reserved bit is set, the decoder MUST indicate an error.
 339         It is possible that there is a new field present which the
 340         decoder is not aware of, and can thus parse the Stream Header
 341         incorrectly.
 342
 343
 344 2.1.1.3. CRC32
 345
 346         The CRC32 is calculated from the Stream Flags field. It is
 347         stored as an unsigned 32-bit little endian integer. If the
 348         calculated value does not match the stored one, the decoder
 349         MUST indicate an error.
 350
 351         The idea is that Stream Flags would always be two bytes, even
 352         if new features are needed. This way old decoders will be able
 353         to verify the CRC32 calculated from Stream Flags, and thus
 354         distinguish between corrupt files (CRC32 doesn't match) and
 355         files that the decoder doesn't support (CRC32 matches but
 356         Stream Flags has reserved bits set).
 357
 358
 359 2.1.2. Stream Footer
 360
 361         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 362         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 363         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 364
 365
 366 2.1.2.1. CRC32
 367
 368         The CRC32 is calculated from the Backward Size and Stream Flags
 369         fields. It is stored as an unsigned 32-bit little endian
 370         integer. If the calculated value does not match the stored one,
 371         the decoder MUST indicate an error.
 372
 373         The reason to have the CRC32 field before the Backward Size and
 374         Stream Flags fields is to keep the four-byte fields aligned to
 375         a multiple of four bytes.
 376
 377
 378 2.1.2.2. Backward Size
 379
 380         Backward Size is stored as a 32-bit little endian integer,
 381         which indicates the size of the Index field as multiple of
 382         four bytes, minimum value being four bytes:
 383
 384             real_backward_size = (stored_backward_size + 1) * 4;
 385
 386         If the stored value does not match the real size of the Index
 387         field, the decoder MUST indicate an error.
 388
 389         Using a fixed-size integer to store Backward Size makes
 390         it slightly simpler to parse the Stream Footer when the
 391         application needs to parse the Stream backwards.
 392
 393
 394 2.1.2.3. Stream Flags
 395
 396         This is a copy of the Stream Flags field from the Stream
 397         Header. The information stored to Stream Flags is needed
 398         when parsing the Stream backwards. The decoder MUST compare
 399         the Stream Flags fields in both Stream Header and Stream
 400         Footer, and indicate an error if they are not identical.
 401
 402
 403 2.1.2.4. Footer Magic Bytes
 404
 405         As the last step of the decoding process, the decoder MUST
 406         verify the existence of Footer Magic Bytes. If they don't
 407         match, an error MUST be indicated.
 408
 409             Using a C array and ASCII:
 410             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 411
 412             In hexadecimal:
 413             59 5A
 414
 415         The primary reason to have Footer Magic Bytes is to make
 416         it easier to detect incomplete files quickly, without
 417         uncompressing. If the file does not end with Footer Magic Bytes
 418         (excluding Stream Padding described in Section 2.2), it cannot
 419         be undamaged, unless someone has intentionally appended garbage
 420         after the end of the Stream.
 421
 422
 423 2.2. Stream Padding
 424
 425         Only the decoders that support decoding of concatenated Streams
 426         MUST support Stream Padding.
 427
 428         Stream Padding MUST contain only null bytes. To preserve the
 429         four-byte alignment of consecutive Streams, the size of Stream
 430         Padding MUST be a multiple of four bytes. Empty Stream Padding
 431         is allowed.
 432
 433         Note that non-empty Stream Padding is allowed at the end of the
 434         file; there doesn't need to be a new Stream after non-empty
 435         Stream Padding. This can be convenient in certain situations
 436         [GNU-tar].
 437
 438         The possibility of Padding MUST be taken into account when
 439         designing an application that parses Streams backwards, and
 440         the application supports concatenated Streams.
 441
 442
 443 3. Block
 444
 445         +==============+=================+===============+=======+
 446         | Block Header | Compressed Data | Block Padding | Check |
 447         +==============+=================+===============+=======+
 448
 449
 450 3.1. Block Header
 451
 452         +-------------------+-------------+=================+
 453         | Block Header Size | Block Flags | Compressed Size |
 454         +-------------------+-------------+=================+
 455
 456              +===================+======================+
 457         ---> | Uncompressed Size | List of Filter Flags |
 458              +===================+======================+
 459
 460              +================+--+--+--+--+
 461         ---> | Header Padding |   CRC32   |
 462              +================+--+--+--+--+
 463
 464
 465 3.1.1. Block Header Size
 466
 467         This field overlaps with the Index Indicator field (see
 468         Section 4.1).
 469
 470         This field contains the size of the Block Header field,
 471         including the Block Header Size field itself. Valid values are
 472         in the range [0x01, 0xFF], which indicate the size of the Block
 473         Header as multiples of four bytes, minimum size being eight
 474         bytes:
 475
 476             real_header_size = (encoded_header_size + 1) * 4;
 477
 478         If bigger Block Header is needed in future, a new field can be
 479         added between the current Block Header and Compressed Data
 480         fields. The presence of this new field would be indicated in
 481         the Block Header.
 482
 483
 484 3.1.2. Block Flags
 485
 486         The first byte of the Block Flags field is a bit field:
 487
 488             Bit(s)  Mask  Description
 489              0-1    0x03  Number of filters (1-4)
 490              2-5    0x3C  Reserved for future use; MUST be zero for now.
 491               6     0x40  The Compressed Size field is present.
 492               7     0x80  The Uncompressed Size field is present.
 493
 494         If any reserved bit is set, the decoder MUST indicate an error.
 495         It is possible that there is a new field present which the
 496         decoder is not aware of, and can thus parse the Block Header
 497         incorrectly.
 498
 499
 500 3.1.3. Compressed Size
 501
 502         This field is present only if the appropriate bit is set in
 503         the Block Flags field (see Section 3.1.2).
 504
 505         The Compressed Size field contains the size of the Compressed
 506         Data field, which MUST be non-zero. Compressed Size is stored
 507         using the encoding described in Section 1.2. If the Compressed
 508         Size doesn't match the size of the Compressed Data field, the
 509         decoder MUST indicate an error.
 510
 511
 512 3.1.4. Uncompressed Size
 513
 514         This field is present only if the appropriate bit is set in
 515         the Block Flags field (see Section 3.1.2).
 516
 517         The Uncompressed Size field contains the size of the Block
 518         after uncompressing. Uncompressed Size is stored using the
 519         encoding described in Section 1.2. If the Uncompressed Size
 520         does not match the real uncompressed size, the decoder MUST
 521         indicate an error.
 522
 523         Storing the Compressed Size and Uncompressed Size fields serves
 524         several purposes:
 525           - The decoder knows how much memory it needs to allocate
 526             for a temporary buffer in multithreaded mode.
 527           - Simple error detection: wrong size indicates a broken file.
 528           - Seeking forwards to a specific location in streamed mode.
 529
 530         It should be noted that the only reliable way to determine
 531         the real uncompressed size is to uncompress the Block,
 532         because the Block Header and Index fields may contain
 533         (intentionally or unintentionally) invalid information.
 534
 535
 536 3.1.5. List of Filter Flags
 537
 538         +================+================+     +================+
 539         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 540         +================+================+     +================+
 541
 542         The number of Filter Flags fields is stored in the Block Flags
 543         field (see Section 3.1.2).
 544
 545         The format of each Filter Flags field is as follows:
 546
 547             +===========+====================+===================+
 548             | Filter ID | Size of Properties | Filter Properties |
 549             +===========+====================+===================+
 550
 551         Both Filter ID and Size of Properties are stored using the
 552         encoding described in Section 1.2. Size of Properties indicates
 553         the size of the Filter Properties field as bytes. The list of
 554         officially defined Filter IDs and the formats of their Filter
 555         Properties are described in Section 5.3.
 556
 557         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 558         (2^62) are reserved for implementation-specific internal use.
 559         These Filter IDs MUST never be used in List of Filter Flags.
 560
 561
 562 3.1.6. Header Padding
 563
 564         This field contains as many null byte as it is needed to make
 565         the Block Header have the size specified in Block Header Size.
 566         If any of the bytes are not null bytes, the decoder MUST
 567         indicate an error. It is possible that there is a new field
 568         present which the decoder is not aware of, and can thus parse
 569         the Block Header incorrectly.
 570
 571
 572 3.1.7. CRC32
 573
 574         The CRC32 is calculated over everything in the Block Header
 575         field except the CRC32 field itself. It is stored as an
 576         unsigned 32-bit little endian integer. If the calculated
 577         value does not match the stored one, the decoder MUST indicate
 578         an error.
 579
 580         By verifying the CRC32 of the Block Header before parsing the
 581         actual contents allows the decoder to distinguish between
 582         corrupt and unsupported files.
 583
 584
 585 3.2. Compressed Data
 586
 587         The format of Compressed Data depends on Block Flags and List
 588         of Filter Flags. Excluding the descriptions of the simplest
 589         filters in Section 5.3, the format of the filter-specific
 590         encoded data is out of scope of this document.
 591
 592
 593 3.3. Block Padding
 594
 595         Block Padding MUST contain 0-3 null bytes to make the size of
 596         the Block a multiple of four bytes. This can be needed when
 597         the size of Compressed Data is not a multiple of four.
 598
 599
 600 3.4. Check
 601
 602         The type and size of the Check field depends on which bits
 603         are set in the Stream Flags field (see Section 2.1.1.2).
 604
 605         The Check, when used, is calculated from the original
 606         uncompressed data. If the calculated Check does not match the
 607         stored one, the decoder MUST indicate an error. If the selected
 608         type of Check is not supported by the decoder, it MUST indicate
 609         a warning or error.
 610
 611
 612 4. Index
 613
 614         +-----------------+=========================+
 615         | Index Indicator | Number of Index Records |
 616         +-----------------+=========================+
 617
 618              +=================+=========+-+-+-+-+
 619         ---> | List of Records | Padding | CRC32 |
 620              +=================+=========+-+-+-+-+
 621
 622         Index serves several purporses. Using it, one can
 623           - verify that all Blocks in a Stream have been processed;
 624           - find out the uncompressed size of a Stream; and
 625           - quickly access the beginning of any Block (random access).
 626
 627
 628 4.1. Index Indicator
 629
 630         This field overlaps with the Block Header Size field (see
 631         Section 3.1.1). The value of Index Indicator is always 0x00.
 632
 633
 634 4.2. Number of Records
 635
 636         This field indicates how many Records there are in the List
 637         of Records field, and thus how many Blocks there are in the
 638         Stream. The value is stored using the encoding described in
 639         Section 1.2. If the decoder has decoded all the Blocks of the
 640         Stream, and then notices that the Number of Records doesn't
 641         match the real number of Blocks, the decoder MUST indicate an
 642         error.
 643
 644
 645 4.3. List of Records
 646
 647         List of Records consists of as many Records as indicated by the
 648         Number of Records field:
 649
 650             +========+========+
 651             | Record | Record | ...
 652             +========+========+
 653
 654         Each Record contains information about one Block:
 655
 656             +===============+===================+
 657             | Unpadded Size | Uncompressed Size |
 658             +===============+===================+
 659
 660         If the decoder has decoded all the Blocks of the Stream, it
 661         MUST verify that the contents of the Records match the real
 662         Unpadded Size and Uncompressed Size of the respective Blocks.
 663
 664         Implementation hint: It is possible to verify the Index with
 665         constant memory usage by calculating for example SHA256 of both
 666         the real size values and the List of Records, then comparing
 667         the check values. Implementing this using non-cryptographic
 668         check like CRC32 SHOULD be avoided unless small code size is
 669         important.
 670
 671         If the decoder supports random-access reading, it MUST verify
 672         that Unpadded Size and Uncompressed Size of every completely
 673         decoded Block match the sizes stored in the Index. If only
 674         partial Block is decoded, the decoder MUST verify that the
 675         processed sizes don't exceed the sizes stored in the Index.
 676
 677
 678 4.3.1. Unpadded Size
 679
 680         This field indicates the size of the Block excluding the Block
 681         Padding field. That is, Unpadded Size is the size of the Block
 682         Header, Compressed Data, and Check fields. Unpadded Size is
 683         stored using the encoding described in Section 1.2. The value
 684         MUST never be zero; with the current structure of Blocks, the
 685         actual minimum value for Unpadded Size is five.
 686
 687         Implementation note: Because the size of the Block Padding
 688         field is not included in Unpadded Size, calculating the total
 689         size of a Stream or doing random-access reading requires
 690         calculating the actual size of the Blocks by rounding Unpadded
 691         Sizes up to the next multiple of four.
 692
 693         The reason to exclude Block Padding from Unpadded Size is to
 694         ease making a raw copy of Compressed Data without Block
 695         Padding. This can be useful, for example, if someone wants
 696         to convert Streams to some other file format quickly.
 697
 698
 699 4.3.2. Uncompressed Size
 700
 701         This field indicates the Uncompressed Size of the respective
 702         Block as bytes. The value is stored using the encoding
 703         described in Section 1.2.
 704
 705
 706 4.4. Index Padding
 707
 708         This field MUST contain 0-3 null bytes to pad the Index to
 709         a multiple of four bytes.
 710
 711
 712 4.5. CRC32
 713
 714         The CRC32 is calculated over everything in the Index field
 715         except the CRC32 field itself. The CRC32 is stored as an
 716         unsigned 32-bit little endian integer. If the calculated
 717         value does not match the stored one, the decoder MUST indicate
 718         an error.
 719
 720
 721 5. Filter Chains
 722
 723         The Block Flags field defines how many filters are used. When
 724         more than one filter is used, the filters are chained; that is,
 725         the output of one filter is the input of another filter. The
 726         following figure illustrates the direction of data flow.
 727
 728                     v   Uncompressed Data   ^
 729                     |       Filter 0        |
 730             Encoder |       Filter 1        | Decoder
 731                     |       Filter n        |
 732                     v    Compressed Data    ^
 733
 734
 735 5.1. Alignment
 736
 737         Alignment of uncompressed input data is usually the job of
 738         the application producing the data. For example, to get the
 739         best results, an archiver tool should make sure that all
 740         PowerPC executable files in the archive stream start at
 741         offsets that are multiples of four bytes.
 742
 743         Some filters, for example LZMA2, can be configured to take
 744         advantage of specified alignment of input data. Note that
 745         taking advantage of aligned input can be benefical also when
 746         a filter is not the first filter in the chain. For example,
 747         if you compress PowerPC executables, you may want to use the
 748         PowerPC filter and chain that with the LZMA2 filter. Because
 749         not only the input but also the output alignment of the PowerPC
 750         filter is four bytes, it is now benefical to set LZMA2 settings
 751         so that the LZMA2 encoder can take advantage of its
 752         four-byte-aligned input data.
 753
 754         The output of the last filter in the chain is stored to the
 755         Compressed Data field, which is is guaranteed to be aligned
 756         to a multiple of four bytes relative to the beginning of the
 757         Stream. This can increase
 758           - speed, if the filtered data is handled multiple bytes at
 759             a time by the filter-specific encoder and decoder,
 760             because accessing aligned data in computer memory is
 761             usually faster; and
 762           - compression ratio, if the output data is later compressed
 763             with an external compression tool.
 764
 765
 766 5.2. Security
 767
 768         If filters would be allowed to be chained freely, it would be
 769         possible to create malicious files, that would be very slow to
 770         decode. Such files could be used to create denial of service
 771         attacks.
 772
 773         Slow files could occur when multiple filters are chained:
 774
 775             v   Compressed input data
 776             |   Filter 1 decoder (last filter)
 777             |   Filter 0 decoder (non-last filter)
 778             v   Uncompressed output data
 779
 780         The decoder of the last filter in the chain produces a lot of
 781         output from little input. Another filter in the chain takes the
 782         output of the last filter, and produces very little output
 783         while consuming a lot of input. As a result, a lot of data is
 784         moved inside the filter chain, but the filter chain as a whole
 785         gets very little work done.
 786
 787         To prevent this kind of slow files, there are restrictions on
 788         how the filters can be chained. These restrictions MUST be
 789         taken into account when designing new filters.
 790
 791         The maximum number of filters in the chain has been limited to
 792         four, thus there can be at maximum of three non-last filters.
 793         Of these three non-last filters, only two are allowed to change
 794         the size of the data.
 795
 796         The non-last filters, that change the size of the data, MUST
 797         have a limit how much the decoder can compress the data: the
 798         decoder SHOULD produce at least n bytes of output when the
 799         filter is given 2n bytes of input. This  limit is not
 800         absolute, but significant deviations MUST be avoided.
 801
 802         The above limitations guarantee that if the last filter in the
 803         chain produces 4n bytes of output, the chain as a whole will
 804         produce at least n bytes of output.
 805
 806
 807 5.3. Filters
 808
 809 5.3.1. LZMA2
 810
 811         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
 812         compression algorithm with high compression ratio and fast
 813         decompression. LZMA is based on LZ77 and range coding
 814         algorithms.
 815
 816         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 817         LZMA internally, but adds support for flushing the encoder,
 818         uncompressed chunks, eases stateful decoder implementations,
 819         and improves support for multithreading. Thus, the plain LZMA
 820         will not be supported in this file format.
 821
 822             Filter ID:                  0x21
 823             Size of Filter Properties:  1 byte
 824             Changes size of data:       Yes
 825             Allow as a non-last filter: No
 826             Allow as the last filter:   Yes
 827
 828             Preferred alignment:
 829                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 830                 Output data:            1 byte
 831
 832         The format of the one-byte Filter Properties field is as
 833         follows:
 834
 835             Bits   Mask   Description
 836             0-5    0x3F   Dictionary Size
 837             6-7    0xC0   Reserved for future use; MUST be zero for now.
 838
 839         Dictionary Size is encoded with one-bit mantissa and five-bit
 840         exponent. The smallest dictionary size is 4 KiB and the biggest
 841         is 4 GiB.
 842
 843             Raw value   Mantissa   Exponent   Dictionary size
 844                 0           2         11         4 KiB
 845                 1           3         11         6 KiB
 846                 2           2         12         8 KiB
 847                 3           3         12        12 KiB
 848                 4           2         13        16 KiB
 849                 5           3         13        24 KiB
 850                 6           2         14        32 KiB
 851               ...         ...        ...      ...
 852                35           3         27       768 MiB
 853                36           2         28      1024 MiB
 854                37           3         29      1536 MiB
 855                38           2         30      2048 MiB
 856                39           3         30      3072 MiB
 857                40           2         31      4096 MiB - 1 B
 858
 859         Instead of having a table in the decoder, the dictionary size
 860         can be decoded using the following C code:
 861
 862             const uint8_t bits = get_dictionary_flags() & 0x3F;
 863             if (bits > 40)
 864                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 865
 866             uint32_t dictionary_size;
 867             if (bits == 40) {
 868                 dictionary_size = UINT32_MAX;
 869             } else {
 870                 dictionary_size = 2 | (bits & 1);
 871                 dictionary_size <<= bits / 2 + 11;
 872             }
 873
 874
 875 5.3.2. Branch/Call/Jump Filters for Executables
 876
 877         These filters convert relative branch, call, and jump
 878         instructions to their absolute counterparts in executable
 879         files. This conversion increases redundancy and thus
 880         compression ratio.
 881
 882             Size of Filter Properties:  0 or 4 bytes
 883             Changes size of data:       No
 884             Allow as a non-last filter: Yes
 885             Allow as the last filter:   No
 886
 887         Below is the list of filters in this category. The alignment
 888         is the same for both input and output data.
 889
 890             Filter ID   Alignment   Description
 891               0x04       1 byte     x86 filter (BCJ)
 892               0x05       4 bytes    PowerPC (big endian) filter
 893               0x06      16 bytes    IA64 filter
 894               0x07       4 bytes    ARM (little endian) filter
 895               0x08       2 bytes    ARM Thumb (little endian) filter
 896               0x09       4 bytes    SPARC filter
 897
 898         If the size of Filter Properties is four bytes, the Filter
 899         Properties field contains the start offset used for address
 900         conversions. It is stored as an unsigned 32-bit little endian
 901         integer. If the size of Filter Properties is zero, the start
 902         offset is zero.
 903
 904         Setting the start offset may be useful if an executable has
 905         multiple sections, and there are many cross-section calls.
 906         Taking advantage of this feature usually requires usage of
 907         the Subblock filter.
 908
 909
 910 5.3.3. Delta
 911
 912         The Delta filter may increase compression ratio when the value
 913         of the next byte correlates with the value of an earlier byte
 914         at specified distance.
 915
 916             Filter ID:                  0x03
 917             Size of Filter Properties:  1 byte
 918             Changes size of data:       No
 919             Allow as a non-last filter: Yes
 920             Allow as the last filter:   No
 921
 922             Preferred alignment:
 923                 Input data:             1 byte
 924                 Output data:            Same as the original input data
 925
 926         The Properties byte indicates the delta distance, which can be
 927         1-256 bytes backwards from the current byte: 0x00 indicates
 928         distance of 1 byte and 0xFF distance of 256 bytes.
 929
 930
 931 5.3.3.1. Format of the Encoded Output
 932
 933         The code below illustrates both encoding and decoding with
 934         the Delta filter.
 935
 936             // Distance is in the range [1, 256].
 937             const unsigned int distance = get_properties_byte() + 1;
 938             uint8_t pos = 0;
 939             uint8_t delta[256];
 940
 941             memset(delta, 0, sizeof(delta));
 942
 943             while (1) {
 944                 const int byte = read_byte();
 945                 if (byte == EOF)
 946                     break;
 947
 948                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 949                 if (is_encoder) {
 950                     tmp = (uint8_t)(byte) - tmp;
 951                     delta[pos] = (uint8_t)(byte);
 952                 } else {
 953                     tmp = (uint8_t)(byte) + tmp;
 954                     delta[pos] = tmp;
 955                 }
 956
 957                 write_byte(tmp);
 958                 --pos;
 959             }
 960
 961
 962 5.4. Custom Filter IDs
 963
 964         If a developer wants to use custom Filter IDs, he has two
 965         choices. The first choice is to contact Lasse Collin and ask
 966         him to allocate a range of IDs for the developer.
 967
 968         The second choice is to generate a 40-bit random integer,
 969         which the developer can use as his personal Developer ID.
 970         To minimalize the risk of collisions, Developer ID has to be
 971         a randomly generated integer, not manually selected "hex word".
 972         The following command, which works on many free operating
 973         systems, can be used to generate Developer ID:
 974
 975             dd if=/dev/urandom bs=5 count=1 | hexdump
 976
 977         The developer can then use his Developer ID to create unique
 978         (well, hopefully unique) Filter IDs.
 979
 980             Bits    Mask                    Description
 981              0-15   0x0000_0000_0000_FFFF   Filter ID
 982             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
 983             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
 984
 985         The resulting 63-bit integer will use 9 bytes of space when
 986         stored using the encoding described in Section 1.2. To get
 987         a shorter ID, see the beginning of this Section how to
 988         request a custom ID range.
 989
 990
 991 5.4.1. Reserved Custom Filter ID Ranges
 992
 993         Range                       Description
 994         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
 995         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
 996
 997
 998 6. Cyclic Redundancy Checks
 999
1000         There are several incompatible variations to calculate CRC32
1001         and CRC64. For simplicity and clarity, complete examples are
1002         provided to calculate the checks as they are used in this file
1003         format. Implementations MAY use different code as long as it
1004         gives identical results.
1005
1006         The program below reads data from standard input, calculates
1007         the CRC32 and CRC64 values, and prints the calculated values
1008         as big endian hexadecimal strings to standard output.
1009
1010             #include <sys/types.h>
1011             #include <inttypes.h>
1012             #include <stdio.h>
1013
1014             uint32_t crc32_table[256];
1015             uint64_t crc64_table[256];
1016
1017             void
1018             init(void)
1019             {
1020                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1021                 static const uint64_t poly64
1022                         = UINT64_C(0xC96C5795D7870F42);
1023
1024                 for (size_t i = 0; i < 256; ++i) {
1025                     uint32_t crc32 = i;
1026                     uint64_t crc64 = i;
1027
1028                     for (size_t j = 0; j < 8; ++j) {
1029                         if (crc32 & 1)
1030                             crc32 = (crc32 >> 1) ^ poly32;
1031                         else
1032                             crc32 >>= 1;
1033
1034                         if (crc64 & 1)
1035                             crc64 = (crc64 >> 1) ^ poly64;
1036                         else
1037                             crc64 >>= 1;
1038                     }
1039
1040                     crc32_table[i] = crc32;
1041                     crc64_table[i] = crc64;
1042                 }
1043             }
1044
1045             uint32_t
1046             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1047             {
1048                 crc = ~crc;
1049                 for (size_t i = 0; i < size; ++i)
1050                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1051                             ^ (crc >> 8);
1052                 return ~crc;
1053             }
1054
1055             uint64_t
1056             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1057             {
1058                 crc = ~crc;
1059                 for (size_t i = 0; i < size; ++i)
1060                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1061                             ^ (crc >> 8);
1062                 return ~crc;
1063             }
1064
1065             int
1066             main()
1067             {
1068                 init();
1069
1070                 uint32_t value32 = 0;
1071                 uint64_t value64 = 0;
1072                 uint64_t total_size = 0;
1073                 uint8_t buf[8192];
1074
1075                 while (1) {
1076                     const size_t buf_size = fread(buf, 1, 8192, stdin);
1077                     if (buf_size == 0)
1078                         break;
1079
1080                     total_size += buf_size;
1081                     value32 = crc32(buf, buf_size, value32);
1082                     value64 = crc64(buf, buf_size, value64);
1083                 }
1084
1085                 printf("Bytes:  %" PRIu64 "\n", total_size);
1086                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1087                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1088
1089                 return 0;
1090             }
1091
1092
1093 7. References
1094
1095         LZMA SDK - The original LZMA implementation
1096         http://7-zip.org/sdk.html
1097
1098         LZMA Utils - LZMA adapted to POSIX-like systems
1099         http://tukaani.org/lzma/
1100
1101         [RFC-1952]
1102         GZIP file format specification version 4.3
1103         http://www.ietf.org/rfc/rfc1952.txt
1104           - Notation of byte boxes in section "2.1. Overall conventions"
1105
1106         [RFC-2119]
1107         Key words for use in RFCs to Indicate Requirement Levels
1108         http://www.ietf.org/rfc/rfc2119.txt
1109
1110         [GNU-tar]
1111         GNU tar 1.20 manual
1112         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1113           - Node 9.4.2 "Blocking Factor", paragraph that begins
1114             "gzip will complain about trailing garbage"
1115           - Note that this URL points to the latest version of the
1116             manual, and may some day not contain the note which is in
1117             1.20. For the exact version of the manual, download GNU
1118             tar 1.20: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.20.tar.gz
1119