doc/file-format.txt

   1
   2 The .xz File Format
   3 -------------------
   4
   5         0. Preface
   6            0.1. Copyright Notices
   7            0.2. Changes
   8         1. Conventions
   9            1.1. Byte and Its Representation
  10            1.2. Multibyte Integers
  11         2. Overall Structure of .xz File
  12            2.1. Stream
  13                 2.1.1. Stream Header
  14                        2.1.1.1. Header Magic Bytes
  15                        2.1.1.2. Stream Flags
  16                        2.1.1.3. CRC32
  17                 2.1.2. Stream Footer
  18                        2.1.2.1. CRC32
  19                        2.1.2.2. Backward Size
  20                        2.1.2.3. Stream Flags
  21                        2.1.2.4. Footer Magic Bytes
  22            2.2. Stream Padding
  23         3. Block
  24            3.1. Block Header
  25                 3.1.1. Block Header Size
  26                 3.1.2. Block Flags
  27                 3.1.3. Compressed Size
  28                 3.1.4. Uncompressed Size
  29                 3.1.5. List of Filter Flags
  30                 3.1.6. Header Padding
  31                 3.1.7. CRC32
  32            3.2. Compressed Data
  33            3.3. Check
  34         4. Index
  35            4.1. Index Indicator
  36            4.2. Number of Records
  37            4.3. List of Records
  38                 4.3.1. Total Size
  39                 4.3.2. Uncompressed Size
  40            4.4. Index Padding
  41            4.5. CRC32
  42         5. Filter Chains
  43            5.1. Alignment
  44            5.2. Security
  45            5.3. Filters
  46                 5.3.1. LZMA2
  47                 5.3.2. Branch/Call/Jump Filters for Executables
  48                 5.3.3. Delta
  49                        5.3.3.1. Format of the Encoded Output
  50            5.4. Custom Filter IDs
  51                 5.4.1. Reserved Custom Filter ID Ranges
  52         6. Cyclic Redundancy Checks
  53         7. References
  54
  55
  56 0. Preface
  57
  58         This document describes the .xz file format (filename suffix
  59         `.xz', MIME type `application/x-xz'). It is intended that this
  60         this format replace the old .lzma format used by LZMA SDK and
  61         LZMA Utils.
  62
  63         IMPORTANT:  The version described in this document is a
  64                     draft, NOT a final, official version. Changes
  65                     are possible.
  66
  67
  68 0.1. Copyright Notices
  69
  70         Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org>
  71         Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>
  72
  73         Copying and distribution of this file, with or without
  74         modification, are permitted in any medium without royalty
  75         provided the copyright notice and this notice are preserved.
  76         Modified versions must be marked as such.
  77
  78         All source code examples given in this document are put into
  79         the public domain by the authors of this document.
  80
  81         Special thanks for helping with this document goes to
  82         Igor Pavlov. Thanks for helping with this document goes to
  83         Mark Adler, H. Peter Anvin, and Mikko Pouru.
  84
  85
  86 0.2. Changes
  87
  88         Last modified: 2008-09-24 21:05+0300
  89
  90         (A changelog will be kept once the first official version
  91         is made.)
  92
  93
  94 1. Conventions
  95
  96         The keywords `must', `must not', `required', `should',
  97         `should not', `recommended', `may', and `optional' in this
  98         document are to be interpreted as described in [RFC-2119].
  99         These words are not capitalized in this document.
 100
 101         Indicating a warning means displaying a message, returning
 102         appropriate exit status, or something else to let the user
 103         know that something worth warning occurred. The operation
 104         should still finish if a warning is indicated.
 105
 106         Indicating an error means displaying a message, returning
 107         appropriate exit status, or something else to let the user
 108         know that something prevented successfully finishing the
 109         operation. The operation must be aborted once an error has
 110         been indicated.
 111
 112
 113 1.1. Byte and Its Representation
 114
 115         In this document, byte is always 8 bits.
 116
 117         A `nul byte' has all bits unset. That is, the value of a nul
 118         byte is 0x00.
 119
 120         To represent byte blocks, this document uses notation that
 121         is similar to the notation used in [RFC-1952]:
 122
 123             +-------+
 124             |  Foo  |   One byte.
 125             +-------+
 126
 127             +---+---+
 128             |  Foo  |   Two bytes; that is, some of the vertical bars
 129             +---+---+   can be missing.
 130
 131             +=======+
 132             |  Foo  |   Zero or more bytes.
 133             +=======+
 134
 135         In this document, a boxed byte or a byte sequence declared
 136         using this notation is called `a field'. The example field
 137         above would be called `the Foo field' or plain `Foo'.
 138
 139
 140 1.2. Multibyte Integers
 141
 142         Multibyte integers of static length, such as CRC values,
 143         are stored in little endian byte order (least significant
 144         byte first).
 145
 146         When smaller values are more likely than bigger values (for
 147         example file sizes), multibyte integers are encoded in a
 148         variable-length representation:
 149           - Numbers in the range [0, 127] are copied as is, and take
 150             one byte of space.
 151           - Bigger numbers will occupy two or more bytes. All but the
 152             last byte of the multibyte representation have the highest
 153             (eighth) bit set.
 154
 155         For now, the value of the variable-length integers is limited
 156         to 63 bits, which limits the encoded size of the integer to
 157         nine bytes. These limits may be increased in future if needed.
 158
 159         The following C code illustrates encoding and decoding of
 160         variable-length integers. The functions return the number of
 161         bytes occupied by the integer (1-9), or zero on error.
 162
 163             #include <sys/types.h>
 164             #include <inttypes.h>
 165
 166             size_t
 167             encode(uint8_t buf[static 9], uint64_t num)
 168             {
 169                 if (num >= UINT64_MAX / 2)
 170                     return 0;
 171
 172                 size_t i = 0;
 173
 174                 while (num >= 0x80) {
 175                     buf[i++] = (uint8_t)(num) | 0x80;
 176                     num >>= 7;
 177                 }
 178
 179                 buf[i++] = (uint8_t)(num);
 180
 181                 return i;
 182             }
 183
 184             size_t
 185             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 186             {
 187                 if (size_max == 0)
 188                     return 0;
 189
 190                 if (size_max > 9)
 191                     size_max = 9;
 192
 193                 *num = buf[0] & 0x7F;
 194                 size_t i = 0;
 195
 196                 while (buf[i++] & 0x80) {
 197                     if (i > size_max || buf[i] == 0x00)
 198                         return 0;
 199
 200                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 201                 }
 202
 203                 return i;
 204             }
 205
 206
 207 2. Overall Structure of .xz File
 208
 209         +========+================+========+================+
 210         | Stream | Stream Padding | Stream | Stream Padding | ...
 211         +========+================+========+================+
 212
 213         A file contains usually only one Stream. However, it is
 214         possible to concatenate multiple Streams together with no
 215         additional processing. It is up to the implementation to
 216         decide if the decoder will continue decoding from the next
 217         Stream once the end of the first Stream has been reached.
 218
 219
 220 2.1. Stream
 221
 222         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 223         |     Stream Header     | Block | Block | ... | Block |
 224         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 225
 226              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 227         ---> | Index |     Stream Footer     |
 228              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 229
 230         All the above fields have a size that is a multiple of four. If
 231         Stream is used as an internal part of another file format, it
 232         is recommended to make the Stream start at an offset that is
 233         a multiple of four bytes.
 234
 235         Stream Header, Index, and Stream Footer are always present in
 236         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 237
 238         There are zero or more Blocks. The maximum number of Blocks is
 239         limited only by the maximum size of the Index field.
 240
 241         Total size of a Stream must be less than 8 EiB (2^63 bytes).
 242         The same limit applies to the total amount of uncompressed
 243         data stored in a Stream.
 244
 245         If an implementation supports handling .xz files with multiple
 246         concatenated Streams, it may apply the above limits to the file
 247         as a whole instead of limiting per Stream basis.
 248
 249
 250 2.1.1. Stream Header
 251
 252         +---+---+---+---+---+---+-------+------+--+--+--+--+
 253         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 254         +---+---+---+---+---+---+-------+------+--+--+--+--+
 255
 256
 257 2.1.1.1. Header Magic Bytes
 258
 259         The first six (6) bytes of the Stream are so called Header
 260         Magic Bytes. They can be used to identify the file type.
 261
 262             Using a C array and ASCII:
 263             const uint8_t HEADER_MAGIC[6]
 264                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 265
 266             In plain hexadecimal:
 267             FD 37 7A 58 5A 00
 268
 269         Notes:
 270           - The first byte (0xFD) was chosen so that the files cannot
 271             be erroneously detected as being in .lzma format, in which
 272             the first byte is in the range [0x00, 0xE0].
 273           - The sixth byte (0x00) was chosen to prevent applications
 274             from misdetecting the file as a text file.
 275
 276         If the Header Magic Bytes don't match, the decoder must
 277         indicate an error.
 278
 279
 280 2.1.1.2. Stream Flags
 281
 282         The first byte of Stream Flags is always a nul byte. In future
 283         this byte may be used to indicate new Stream version or other
 284         Stream properties.
 285
 286         The second byte of Stream Flags is a bit field:
 287
 288             Bit(s)  Mask  Description
 289              0-3    0x0F  Type of Check (see Section 3.3):
 290                               ID    Size      Check name
 291                               0x00   0 bytes  None
 292                               0x01   4 bytes  CRC32
 293                               0x02   4 bytes  (Reserved)
 294                               0x03   4 bytes  (Reserved)
 295                               0x04   8 bytes  CRC64
 296                               0x05   8 bytes  (Reserved)
 297                               0x06   8 bytes  (Reserved)
 298                               0x07  16 bytes  (Reserved)
 299                               0x08  16 bytes  (Reserved)
 300                               0x09  16 bytes  (Reserved)
 301                               0x0A  32 bytes  SHA-256
 302                               0x0B  32 bytes  (Reserved)
 303                               0x0C  32 bytes  (Reserved)
 304                               0x0D  64 bytes  (Reserved)
 305                               0x0E  64 bytes  (Reserved)
 306                               0x0F  64 bytes  (Reserved)
 307              4-7    0xF0  Reserved for future use; must be zero for now.
 308
 309         Implementations must support at least the Check IDs 0x00 (None)
 310         and 0x01 (CRC32). Supporting other Check IDs is optional. If
 311         an unsupported Check is used, the decoder should indicate a
 312         warning or error.
 313
 314         If any reserved bit is set, the decoder must indicate an error.
 315         It is possible that there is a new field present which the
 316         decoder is not aware of, and can thus parse the Stream Header
 317         incorrectly.
 318
 319
 320 2.1.1.3. CRC32
 321
 322         The CRC32 is calculated from the Stream Flags field. It is
 323         stored as an unsigned 32-bit little endian integer. If the
 324         calculated value does not match the stored one, the decoder
 325         must indicate an error.
 326
 327         The idea is that Stream Flags would always be two bytes, even
 328         if new features are needed. This way old decoders will be able
 329         to verify the CRC32 calculated from Stream Flags, and thus
 330         distinguish between corrupt files (CRC32 doesn't match) and
 331         files that the decoder doesn't support (CRC32 matches but
 332         Stream Flags has reserved bits set).
 333
 334
 335 2.1.2. Stream Footer
 336
 337         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 338         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 339         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 340
 341
 342 2.1.2.1. CRC32
 343
 344         The CRC32 is calculated from the Backward Size and Stream Flags
 345         fields. It is stored as an unsigned 32-bit little endian
 346         integer. If the calculated value does not match the stored one,
 347         the decoder must indicate an error.
 348
 349         The reason to have the CRC32 field before the Backward Size and
 350         Stream Flags fields is to keep the four-byte fields aligned to
 351         a multiple of four bytes.
 352
 353
 354 2.1.2.2. Backward Size
 355
 356         Backward Size is stored as a 32-bit little endian integer,
 357         which indicates the size of the Index field as multiple of
 358         four bytes, minimum value being four bytes:
 359
 360             real_backward_size = (stored_backward_size + 1) * 4;
 361
 362         Using a fixed-size integer to store this value makes it
 363         slightly simpler to parse the Stream Footer when the
 364         application needs to parse the Stream backwards.
 365
 366
 367 2.1.2.3. Stream Flags
 368
 369         This is a copy of the Stream Flags field from the Stream
 370         Header. The information stored to Stream Flags is needed
 371         when parsing the Stream backwards. The decoder must compare
 372         the Stream Flags fields in both Stream Header and Stream
 373         Footer, and indicate an error if they are not identical.
 374
 375
 376 2.1.2.4. Footer Magic Bytes
 377
 378         As the last step of the decoding process, the decoder must
 379         verify the existence of Footer Magic Bytes. If they don't
 380         match, an error must be indicated.
 381
 382             Using a C array and ASCII:
 383             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 384
 385             In hexadecimal:
 386             59 5A
 387
 388         The primary reason to have Footer Magic Bytes is to make
 389         it easier to detect incomplete files quickly, without
 390         uncompressing. If the file does not end with Footer Magic Bytes
 391         (excluding Stream Padding described in Section 2.2), it cannot
 392         be undamaged, unless someone has intentionally appended garbage
 393         after the end of the Stream.
 394
 395
 396 2.2. Stream Padding
 397
 398         Only the decoders that support decoding of concatenated Streams
 399         must support Stream Padding.
 400
 401         Stream Padding must contain only nul bytes. Any non-nul byte
 402         should be considered as the beginning of a new Stream. To
 403         preserve the four-byte alignment of consecutive Streams, the
 404         size of Stream Padding must be a multiple of four bytes. Empty
 405         Stream Padding is allowed.
 406
 407         Note that non-empty Stream Padding is allowed at the end of the
 408         file; there doesn't need to be a new Stream after non-empty
 409         Stream Padding. This can be convenient in certain situations
 410         [GNU-tar].
 411
 412         The possibility of Padding should be taken into account when
 413         designing an application that parses the Stream backwards.
 414
 415
 416 3. Block
 417
 418         +==============+=================+=======+
 419         | Block Header | Compressed Data | Check |
 420         +==============+=================+=======+
 421
 422
 423 3.1. Block Header
 424
 425         +-------------------+-------------+=================+
 426         | Block Header Size | Block Flags | Compressed Size |
 427         +-------------------+-------------+=================+
 428
 429              +===================+======================+
 430         ---> | Uncompressed Size | List of Filter Flags |
 431              +===================+======================+
 432
 433              +================+--+--+--+--+
 434         ---> | Header Padding |   CRC32   |
 435              +================+--+--+--+--+
 436
 437
 438 3.1.1. Block Header Size
 439
 440         This field overlaps with the Index Indicator field (see
 441         Section 4.1).
 442
 443         This field contains the size of the Block Header field,
 444         including the Block Header Size field itself. Valid values are
 445         in the range [0x01, 0xFF], which indicate the size of the Block
 446         Header as multiples of four bytes, minimum size being eight
 447         bytes:
 448
 449             real_header_size = (encoded_header_size + 1) * 4;
 450
 451         If bigger Block Header is needed in future, a new field can be
 452         added between the current Block Header and Compressed Data
 453         fields. The presence of this new field would be indicated in
 454         the Block Header.
 455
 456
 457 3.1.2. Block Flags
 458
 459         The first byte of the Block Flags field is a bit field:
 460
 461             Bit(s)  Mask  Description
 462              0-1    0x03  Number of filters (1-4)
 463              2-5    0x3C  Reserved for future use; must be zero for now.
 464               6     0x40  The Compressed Size field is present.
 465               7     0x80  The Uncompressed Size field is present.
 466
 467         If any reserved bit is set, the decoder must indicate an error.
 468         It is possible that there is a new field present which the
 469         decoder is not aware of, and can thus parse the Block Header
 470         incorrectly.
 471
 472
 473 3.1.3. Compressed Size
 474
 475         This field is present only if the appropriate bit is set in
 476         the Block Flags field (see Section 3.1.2).
 477
 478         This field contains the size of the Compressed Data field as
 479         multiple of four bytes, minimum value being four bytes:
 480
 481             real_compressed_size = (stored_compressed_size + 1) * 4;
 482
 483         The size is stored using the encoding described in Section 1.2.
 484         If the Compressed Size does not match the real size of the
 485         Compressed Data field, the decoder must indicate an error.
 486
 487
 488 3.1.4. Uncompressed Size
 489
 490         This field is present only if the appropriate bit is set in
 491         the Block Flags field (see Section 3.1.2).
 492
 493         The Uncompressed Size field contains the size of the Block
 494         after uncompressing. Uncompressed Size is stored using the
 495         encoding described in Section 1.2. If the Uncompressed Size
 496         does not match the real uncompressed size, the decoder must
 497         indicate an error.
 498
 499         Storing the Compressed Size and Uncompressed Size fields serves
 500         several purposes:
 501           - The decoder knows how much memory it needs to allocate
 502             for a temporary buffer in multithreaded mode.
 503           - Simple error detection: wrong size indicates a broken file.
 504           - Seeking forwards to a specific location in streamed mode.
 505
 506         It should be noted that the only reliable way to determine
 507         the real uncompressed size is to uncompress the Block,
 508         because the Block Header and Index fields may contain
 509         (intentionally or unintentionally) invalid information.
 510
 511
 512 3.1.5. List of Filter Flags
 513
 514         +================+================+     +================+
 515         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 516         +================+================+     +================+
 517
 518         The number of Filter Flags fields is stored in the Block Flags
 519         field (see Section 3.1.2).
 520
 521         The format of each Filter Flags field is as follows:
 522
 523             +===========+====================+===================+
 524             | Filter ID | Size of Properties | Filter Properties |
 525             +===========+====================+===================+
 526
 527         Both Filter ID and Size of Properties are stored using the
 528         encoding described in Section 1.2. Size of Properties indicates
 529         the size of the Filter Properties field as bytes. The list of
 530         officially defined Filter IDs and the formats of their Filter
 531         Properties are described in Section 5.3.
 532
 533         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 534         (2^62) are reserved for implementation-specific internal use.
 535         These Filter IDs must never be used in List of Filter Flags.
 536
 537
 538 3.1.6. Header Padding
 539
 540         This field contains as many nul byte as it is needed to make
 541         the Block Header have the size specified in Block Header Size.
 542         If any of the bytes are not nul bytes, the decoder must
 543         indicate an error. It is possible that there is a new field
 544         present which the decoder is not aware of, and can thus parse
 545         the Block Header incorrectly.
 546
 547
 548 3.1.7. CRC32
 549
 550         The CRC32 is calculated over everything in the Block Header
 551         field except the CRC32 field itself. It is stored as an
 552         unsigned 32-bit little endian integer. If the calculated
 553         value does not match the stored one, the decoder must indicate
 554         an error.
 555
 556         By verifying the CRC32 of the Block Header before parsing the
 557         actual contents allows the decoder to distinguish between
 558         corrupt and unsupported files.
 559
 560
 561 3.2. Compressed Data
 562
 563         The format of Compressed Data depends on Block Flags and List
 564         of Filter Flags. Excluding the descriptions of the simplest
 565         filters in Section 5.3, the format of the filter-specific
 566         encoded data is out of scope of this document.
 567
 568         If the natural size of Compressed Data is not a multiple of
 569         four bytes, it must be padded with 1-3 nul bytes to make it
 570         a multiple of four bytes.
 571
 572
 573 3.3. Check
 574
 575         The type and size of the Check field depends on which bits
 576         are set in the Stream Flags field (see Section 2.1.1.2).
 577
 578         The Check, when used, is calculated from the original
 579         uncompressed data. If the calculated Check does not match the
 580         stored one, the decoder must indicate an error. If the selected
 581         type of Check is not supported by the decoder, it must indicate
 582         a warning or error.
 583
 584
 585 4. Index
 586
 587         +-----------------+=========================+
 588         | Index Indicator | Number of Index Records |
 589         +-----------------+=========================+
 590
 591              +=================+=========+-+-+-+-+
 592         ---> | List of Records | Padding | CRC32 |
 593              +=================+=========+-+-+-+-+
 594
 595         Index serves several purporses. Using it, one can
 596           - verify that all Blocks in a Stream have been processed;
 597           - find out the uncompressed size of a Stream; and
 598           - quickly access the beginning of any Block (random access).
 599
 600
 601 4.1. Index Indicator
 602
 603         This field overlaps with the Block Header Size field (see
 604         Section 3.1.1). The value of Index Indicator is always 0x00.
 605
 606
 607 4.2. Number of Records
 608
 609         This field indicates how many Records there are in the List
 610         of Records field, and thus how many Blocks there are in the
 611         Stream. The value is stored using the encoding described in
 612         Section 1.2. If the decoder has decoded all the Blocks of the
 613         Stream, and then notices that the Number of Records doesn't
 614         match the real number of Blocks, the decoder must indicate an
 615         error.
 616
 617
 618 4.3. List of Records
 619
 620         List of Records consists of as many Records as indicated by the
 621         Number of Records field:
 622
 623             +========+========+
 624             | Record | Record | ...
 625             +========+========+
 626
 627         Each Record contains two fields:
 628
 629             +============+===================+
 630             | Total Size | Uncompressed Size |
 631             +============+===================+
 632
 633         If the decoder has decoded all the Blocks of the Stream, it
 634         must verify that the contents of the Records match the real
 635         Total Size and Uncompressed Size of the respective Blocks.
 636
 637         Implementation hint: It is possible to verify the Index with
 638         constant memory usage by calculating for example SHA256 of both
 639         the real size values and the List of Records, then comparing
 640         the check values. Implementing this using non-cryptographic
 641         check like CRC32 should be avoided unless small code size is
 642         important.
 643
 644         If the decoder supports random-access reading, it must verify
 645         that Total Size and Uncompressed Size of every completely
 646         decoded Block match the sizes stored in the Index. If only
 647         partial Block is decoded, the decoder must verify that the
 648         processed sizes don't exceed the sizes stored in the Index.
 649
 650
 651 4.3.1. Total Size
 652
 653         This field indicates the encoded size of the respective Block
 654         as multiples of four bytes, minimum value being four bytes:
 655
 656             real_total_size = (stored_total_size + 1) * 4;
 657
 658         The value is stored using the encoding described in Section
 659         1.2.
 660
 661
 662 4.3.2. Uncompressed Size
 663
 664         This field indicates the Uncompressed Size of the respective
 665         Block as bytes. The value is stored using the encoding
 666         described in Section 1.2.
 667
 668
 669 4.4. Index Padding
 670
 671         This field must contain 0-3 nul bytes to pad the Index to
 672         a multiple of four bytes.
 673
 674
 675 4.5. CRC32
 676
 677         The CRC32 is calculated over everything in the Index field
 678         except the CRC32 field itself. The CRC32 is stored as an
 679         unsigned 32-bit little endian integer. If the calculated
 680         value does not match the stored one, the decoder must indicate
 681         an error.
 682
 683
 684 5. Filter Chains
 685
 686         The Block Flags field defines how many filters are used. When
 687         more than one filter is used, the filters are chained; that is,
 688         the output of one filter is the input of another filter. The
 689         following figure illustrates the direction of data flow.
 690
 691                     v   Uncompressed Data   ^
 692                     |       Filter 0        |
 693             Encoder |       Filter 1        | Decoder
 694                     |       Filter n        |
 695                     v    Compressed Data    ^
 696
 697
 698 5.1. Alignment
 699
 700         Alignment of uncompressed input data is usually the job of
 701         the application producing the data. For example, to get the
 702         best results, an archiver tool should make sure that all
 703         PowerPC executable files in the archive stream start at
 704         offsets that are multiples of four bytes.
 705
 706         Some filters, for example LZMA2, can be configured to take
 707         advantage of specified alignment of input data. Note that
 708         taking advantage of aligned input can be benefical also when
 709         a filter is not the first filter in the chain. For example,
 710         if you compress PowerPC executables, you may want to use the
 711         PowerPC filter and chain that with the LZMA2 filter. Because
 712         not only the input but also the output alignment of the PowerPC
 713         filter is four bytes, it is now benefical to set LZMA2 settings
 714         so that the LZMA2 encoder can take advantage of its
 715         four-byte-aligned input data.
 716
 717         The output of the last filter in the chain is stored to the
 718         Compressed Data field, which is is guaranteed to be aligned
 719         to a multiple of four bytes relative to the beginning of the
 720         Stream. This can increase
 721           - speed, if the filtered data is handled multiple bytes at
 722             a time by the filter-specific encoder and decoder,
 723             because accessing aligned data in computer memory is
 724             usually faster; and
 725           - compression ratio, if the output data is later compressed
 726             with an external compression tool.
 727
 728
 729 5.2. Security
 730
 731         If filters would be allowed to be chained freely, it would be
 732         possible to create malicious files, that would be very slow to
 733         decode. Such files could be used to create denial of service
 734         attacks.
 735
 736         Slow files could occur when multiple filters are chained:
 737
 738             v   Compressed input data
 739             |   Filter 1 decoder (last filter)
 740             |   Filter 0 decoder (non-last filter)
 741             v   Uncompressed output data
 742
 743         The decoder of the last filter in the chain produces a lot of
 744         output from little input. Another filter in the chain takes the
 745         output of the last filter, and produces very little output
 746         while consuming a lot of input. As a result, a lot of data is
 747         moved inside the filter chain, but the filter chain as a whole
 748         gets very little work done.
 749
 750         To prevent this kind of slow files, there are restrictions on
 751         how the filters can be chained. These restrictions must be
 752         taken into account when designing new filters.
 753
 754         The maximum number of filters in the chain has been limited to
 755         four, thus there can be at maximum of three non-last filters.
 756         Of these three non-last filters, only two are allowed to change
 757         the size of the data.
 758
 759         The non-last filters, that change the size of the data, must
 760         have a limit how much the decoder can compress the data: the
 761         decoder should produce at least n bytes of output when the
 762         filter is given 2n bytes of input. This  limit is not
 763         absolute, but significant deviations must be avoided.
 764
 765         The above limitations guarantee that if the last filter in the
 766         chain produces 4n bytes of output, the chain as a whole will
 767         produce at least n bytes of output.
 768
 769
 770 5.3. Filters
 771
 772 5.3.1. LZMA2
 773
 774         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
 775         compression algorithm with high compression ratio and fast
 776         decompression. LZMA is based on LZ77 and range coding
 777         algorithms.
 778
 779         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 780         LZMA internally, but adds support for flushing the encoder,
 781         uncompressed chunks, eases stateful decoder implementations,
 782         and improves support for multithreading. Thus, the plain LZMA
 783         will not be supported in this file format.
 784
 785             Filter ID:                  0x21
 786             Size of Filter Properties:  1 byte
 787             Changes size of data:       Yes
 788             Allow as a non-last filter: No
 789             Allow as the last filter:   Yes
 790
 791             Preferred alignment:
 792                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 793                 Output data:            1 byte
 794
 795         The format of the one-byte Filter Properties field is as
 796         follows:
 797
 798             Bits   Mask   Description
 799             0-5    0x3F   Dictionary Size
 800             6-7    0xC0   Reserved for future use; must be zero for now.
 801
 802         Dictionary Size is encoded with one-bit mantissa and five-bit
 803         exponent. The smallest dictionary size is 4 KiB and the biggest
 804         is 4 GiB.
 805
 806             Raw value   Mantissa   Exponent   Dictionary size
 807                 0           2         11         4 KiB
 808                 1           3         11         6 KiB
 809                 2           2         12         8 KiB
 810                 3           3         12        12 KiB
 811                 4           2         13        16 KiB
 812                 5           3         13        24 KiB
 813                 6           2         14        32 KiB
 814               ...         ...        ...      ...
 815                35           3         27       768 MiB
 816                36           2         28      1024 MiB
 817                37           3         29      1536 MiB
 818                38           2         30      2048 MiB
 819                39           3         30      3072 MiB
 820                40           2         31      4096 MiB - 1 B
 821
 822         Instead of having a table in the decoder, the dictionary size
 823         can be decoded using the following C code:
 824
 825             const uint8_t bits = get_dictionary_flags() & 0x3F;
 826             if (bits > 40)
 827                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 828
 829             uint32_t dictionary_size;
 830             if (bits == 40) {
 831                 dictionary_size = UINT32_MAX;
 832             } else {
 833                 dictionary_size = 2 | (bits & 1);
 834                 dictionary_size <<= bits / 2 + 11;
 835             }
 836
 837
 838 5.3.2. Branch/Call/Jump Filters for Executables
 839
 840         These filters convert relative branch, call, and jump
 841         instructions to their absolute counterparts in executable
 842         files. This conversion increases redundancy and thus
 843         compression ratio.
 844
 845             Size of Filter Properties:  0 or 4 bytes
 846             Changes size of data:       No
 847             Allow as a non-last filter: Yes
 848             Allow as the last filter:   No
 849
 850             Detecting when all of the data has been decoded:
 851                 Uncompressed size:      Yes
 852                 End of Payload Marker:  No
 853                 End of Input:           Yes
 854
 855         Below is the list of filters in this category. The alignment
 856         is the same for both input and output data.
 857
 858             Filter ID   Alignment   Description
 859               0x04       1 byte     x86 filter (BCJ)
 860               0x05       4 bytes    PowerPC (big endian) filter
 861               0x06      16 bytes    IA64 filter
 862               0x07       4 bytes    ARM (little endian) filter
 863               0x08       2 bytes    ARM Thumb (little endian) filter
 864               0x09       4 bytes    SPARC filter
 865
 866         If the size of Filter Properties is four bytes, the Filter
 867         Properties field contains the start offset used for address
 868         conversions. It is stored as an unsigned 32-bit little endian
 869         integer. If the size of Filter Properties is zero, the start
 870         offset is zero.
 871
 872         Setting the start offset may be useful if an executable has
 873         multiple sections, and there are many cross-section calls.
 874         Taking advantage of this feature usually requires usage of
 875         the Subblock filter.
 876
 877
 878 5.3.3. Delta
 879
 880         The Delta filter may increase compression ratio when the value
 881         of the next byte correlates with the value of an earlier byte
 882         at specified distance.
 883
 884             Filter ID:                  0x03
 885             Size of Filter Properties:  1 byte
 886             Changes size of data:       No
 887             Allow as a non-last filter: Yes
 888             Allow as the last filter:   No
 889
 890             Preferred alignment:
 891                 Input data:             1 byte
 892                 Output data:            Same as the original input data
 893
 894         The Properties byte indicates the delta distance, which can be
 895         1-256 bytes backwards from the current byte: 0x00 indicates
 896         distance of 1 byte and 0xFF distance of 256 bytes.
 897
 898
 899 5.3.3.1. Format of the Encoded Output
 900
 901         The code below illustrates both encoding and decoding with
 902         the Delta filter.
 903
 904             // Distance is in the range [1, 256].
 905             const unsigned int distance = get_properties_byte() + 1;
 906             uint8_t pos = 0;
 907             uint8_t delta[256];
 908
 909             memset(delta, 0, sizeof(delta));
 910
 911             while (1) {
 912                 const int byte = read_byte();
 913                 if (byte == EOF)
 914                     break;
 915
 916                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 917                 if (is_encoder) {
 918                     tmp = (uint8_t)(byte) - tmp;
 919                     delta[pos] = (uint8_t)(byte);
 920                 } else {
 921                     tmp = (uint8_t)(byte) + tmp;
 922                     delta[pos] = tmp;
 923                 }
 924
 925                 write_byte(tmp);
 926                 --pos;
 927             }
 928
 929
 930 5.4. Custom Filter IDs
 931
 932         If a developer wants to use custom Filter IDs, he has two
 933         choices. The first choice is to contact Lasse Collin and ask
 934         him to allocate a range of IDs for the developer.
 935
 936         The second choice is to generate a 40-bit random integer,
 937         which the developer can use as his personal Developer ID.
 938         To minimalize the risk of collisions, Developer ID has to be
 939         a randomly generated integer, not manually selected "hex word".
 940         The following command, which works on many free operating
 941         systems, can be used to generate Developer ID:
 942
 943             dd if=/dev/urandom bs=5 count=1 | hexdump
 944
 945         The developer can then use his Developer ID to create unique
 946         (well, hopefully unique) Filter IDs.
 947
 948             Bits    Mask                    Description
 949              0-15   0x0000_0000_0000_FFFF   Filter ID
 950             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
 951             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
 952
 953         The resulting 63-bit integer will use 9 bytes of space when
 954         stored using the encoding described in Section 1.2. To get
 955         a shorter ID, see the beginning of this Section how to
 956         request a custom ID range.
 957
 958
 959 5.4.1. Reserved Custom Filter ID Ranges
 960
 961         Range                       Description
 962         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
 963         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
 964
 965
 966 6. Cyclic Redundancy Checks
 967
 968         There are several incompatible variations to calculate CRC32
 969         and CRC64. For simplicity and clarity, complete examples are
 970         provided to calculate the checks as they are used in this file
 971         format. Implementations may use different code as long as it
 972         gives identical results.
 973
 974         The program below reads data from standard input, calculates
 975         the CRC32 and CRC64 values, and prints the calculated values
 976         as big endian hexadecimal strings to standard output.
 977
 978             #include <sys/types.h>
 979             #include <inttypes.h>
 980             #include <stdio.h>
 981
 982             uint32_t crc32_table[256];
 983             uint64_t crc64_table[256];
 984
 985             void
 986             init(void)
 987             {
 988                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
 989                 static const uint64_t poly64
 990                         = UINT64_C(0xC96C5795D7870F42);
 991
 992                 for (size_t i = 0; i < 256; ++i) {
 993                     uint32_t crc32 = i;
 994                     uint64_t crc64 = i;
 995
 996                     for (size_t j = 0; j < 8; ++j) {
 997                         if (crc32 & 1)
 998                             crc32 = (crc32 >> 1) ^ poly32;
 999                         else
1000                             crc32 >>= 1;
1001
1002                         if (crc64 & 1)
1003                             crc64 = (crc64 >> 1) ^ poly64;
1004                         else
1005                             crc64 >>= 1;
1006                     }
1007
1008                     crc32_table[i] = crc32;
1009                     crc64_table[i] = crc64;
1010                 }
1011             }
1012
1013             uint32_t
1014             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1015             {
1016                 crc = ~crc;
1017                 for (size_t i = 0; i < size; ++i)
1018                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1019                             ^ (crc >> 8);
1020                 return ~crc;
1021             }
1022
1023             uint64_t
1024             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1025             {
1026                 crc = ~crc;
1027                 for (size_t i = 0; i < size; ++i)
1028                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1029                             ^ (crc >> 8);
1030                 return ~crc;
1031             }
1032
1033             int
1034             main()
1035             {
1036                 init();
1037
1038                 uint32_t value32 = 0;
1039                 uint64_t value64 = 0;
1040                 uint64_t total_size = 0;
1041                 uint8_t buf[8192];
1042
1043                 while (1) {
1044                     const size_t buf_size = fread(buf, 1, 8192, stdin);
1045                     if (buf_size == 0)
1046                         break;
1047
1048                     total_size += buf_size;
1049                     value32 = crc32(buf, buf_size, value32);
1050                     value64 = crc64(buf, buf_size, value64);
1051                 }
1052
1053                 printf("Bytes:  %" PRIu64 "\n", total_size);
1054                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1055                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1056
1057                 return 0;
1058             }
1059
1060
1061 7. References
1062
1063         LZMA SDK - The original LZMA implementation
1064         http://7-zip.org/sdk.html
1065
1066         LZMA Utils - LZMA adapted to POSIX-like systems
1067         http://tukaani.org/lzma/
1068
1069         [RFC-1952]
1070         GZIP file format specification version 4.3
1071         http://www.ietf.org/rfc/rfc1952.txt
1072           - Notation of byte boxes in section `2.1. Overall conventions'
1073
1074         [RFC-2119]
1075         Key words for use in RFCs to Indicate Requirement Levels
1076         http://www.ietf.org/rfc/rfc2119.txt
1077
1078         [GNU-tar]
1079         GNU tar 1.16.1 manual
1080         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1081           - Node 9.4.2 `Blocking Factor', paragraph that begins
1082             `gzip will complain about trailing garbage'
1083           - Note that this URL points to the latest version of the
1084             manual, and may some day not contain the note which is in
1085             1.16.1. For the exact version of the manual, download GNU
1086             tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz
1087