doc/file-format.txt

   1
   2 The .lzma File Format
   3 ---------------------
   4
   5         0. Preface
   6            0.1. Copyright Notices
   7            0.2. Changes
   8         1. Conventions
   9            1.1. Byte and Its Representation
  10            1.2. Multibyte Integers
  11         2. Overall Structure of .lzma File
  12            2.1. Stream
  13                 2.1.1. Stream Header
  14                        2.1.1.1. Header Magic Bytes
  15                        2.1.1.2. Stream Flags
  16                        2.1.1.3. CRC32
  17                 2.1.2. Stream Footer
  18                        2.1.2.1. CRC32
  19                        2.1.2.2. Backward Size
  20                        2.1.2.3. Stream Flags
  21                        2.1.2.4. Footer Magic Bytes
  22            2.2. Stream Padding
  23         3. Block
  24            3.1. Block Header
  25                 3.1.1. Block Header Size
  26                 3.1.2. Block Flags
  27                 3.1.3. Compressed Size
  28                 3.1.4. Uncompressed Size
  29                 3.1.5. List of Filter Flags
  30                 3.1.6. Header Padding
  31                 3.1.7. CRC32
  32            3.2. Compressed Data
  33            3.3. Check
  34         4. Index
  35            4.1. Index Indicator
  36            4.2. Number of Records
  37            4.3. List of Records
  38                 4.3.1. Total Size
  39                 4.3.2. Uncompressed Size
  40            4.4. Index Padding
  41            4.5. CRC32
  42         5. Filter Chains
  43            5.1. Alignment
  44            5.2. Security
  45            5.3. Filters
  46                 5.3.1. LZMA
  47                 5.3.2. LZMA2
  48                 5.3.3. Branch/Call/Jump Filters for Executables
  49                 5.3.4. Delta
  50                        5.3.4.1. Format of the Encoded Output
  51            5.4. Custom Filter IDs
  52                 5.4.1. Reserved Custom Filter ID Ranges
  53         6. Cyclic Redundancy Checks
  54         7. References
  55
  56
  57 0. Preface
  58
  59         This document describes the .lzma file format (filename suffix
  60         `.lzma', MIME type `application/x-lzma'). It is intended that
  61         this format replace the format used by the LZMA_Alone tool
  62         included in LZMA SDK up to and including version 4.57.
  63
  64         IMPORTANT:  The version described in this document is a
  65                     draft, NOT a final, official version. Changes
  66                     are possible.
  67
  68
  69 0.1. Copyright Notices
  70
  71         Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org>
  72         Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>
  73
  74         Copying and distribution of this file, with or without
  75         modification, are permitted in any medium without royalty
  76         provided the copyright notice and this notice are preserved.
  77         Modified versions must be marked as such.
  78
  79         All source code examples given in this document are put into
  80         the public domain by the authors of this document.
  81
  82         Special thanks for helping with this document goes to
  83         Igor Pavlov. Thanks for helping with this document goes to
  84         Mark Adler, H. Peter Anvin, and Mikko Pouru.
  85
  86
  87 0.2. Changes
  88
  89         Last modified: 2008-09-03 14:10+0300
  90
  91         (A changelog will be kept once the first official version
  92         is made.)
  93
  94
  95 1. Conventions
  96
  97         The keywords `must', `must not', `required', `should',
  98         `should not', `recommended', `may', and `optional' in this
  99         document are to be interpreted as described in [RFC-2119].
 100         These words are not capitalized in this document.
 101
 102         Indicating a warning means displaying a message, returning
 103         appropriate exit status, or something else to let the user
 104         know that something worth warning occurred. The operation
 105         should still finish if a warning is indicated.
 106
 107         Indicating an error means displaying a message, returning
 108         appropriate exit status, or something else to let the user
 109         know that something prevented successfully finishing the
 110         operation. The operation must be aborted once an error has
 111         been indicated.
 112
 113
 114 1.1. Byte and Its Representation
 115
 116         In this document, byte is always 8 bits.
 117
 118         A `nul byte' has all bits unset. That is, the value of a nul
 119         byte is 0x00.
 120
 121         To represent byte blocks, this document uses notation that
 122         is similar to the notation used in [RFC-1952]:
 123
 124             +-------+
 125             |  Foo  |   One byte.
 126             +-------+
 127
 128             +---+---+
 129             |  Foo  |   Two bytes; that is, some of the vertical bars
 130             +---+---+   can be missing.
 131
 132             +=======+
 133             |  Foo  |   Zero or more bytes.
 134             +=======+
 135
 136         In this document, a boxed byte or a byte sequence declared
 137         using this notation is called `a field'. The example field
 138         above would be called `the Foo field' or plain `Foo'.
 139
 140
 141 1.2. Multibyte Integers
 142
 143         Multibyte integers of static length, such as CRC values,
 144         are stored in little endian byte order (least significant
 145         byte first).
 146
 147         When smaller values are more likely than bigger values (for
 148         example file sizes), multibyte integers are encoded in a
 149         variable-length representation:
 150           - Numbers in the range [0, 127] are copied as is, and take
 151             one byte of space.
 152           - Bigger numbers will occupy two or more bytes. All but the
 153             last byte of the multibyte representation have the highest
 154             (eighth) bit set.
 155
 156         For now, the value of the variable-length integers is limited
 157         to 63 bits, which limits the encoded size of the integer to
 158         nine bytes. These limits may be increased in future if needed.
 159
 160         The following C code illustrates encoding and decoding of
 161         variable-length integers. The functions return the number of
 162         bytes occupied by the integer (1-9), or zero on error.
 163
 164             #include <sys/types.h>
 165             #include <inttypes.h>
 166
 167             size_t
 168             encode(uint8_t buf[static 9], uint64_t num)
 169             {
 170                 if (num >= UINT64_MAX / 2)
 171                     return 0;
 172
 173                 size_t i = 0;
 174
 175                 while (num >= 0x80) {
 176                     buf[i++] = (uint8_t)(num) | 0x80;
 177                     num >>= 7;
 178                 }
 179
 180                 buf[i++] = (uint8_t)(num);
 181
 182                 return i;
 183             }
 184
 185             size_t
 186             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 187             {
 188                 if (size_max == 0)
 189                     return 0;
 190
 191                 if (size_max > 9)
 192                     size_max = 9;
 193
 194                 *num = buf[0] & 0x7F;
 195                 size_t i = 0;
 196
 197                 while (buf[i++] & 0x80) {
 198                     if (i > size_max || buf[i] == 0x00)
 199                         return 0;
 200
 201                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 202                 }
 203
 204                 return i;
 205             }
 206
 207
 208 2. Overall Structure of .lzma File
 209
 210         +========+================+========+================+
 211         | Stream | Stream Padding | Stream | Stream Padding | ...
 212         +========+================+========+================+
 213
 214         A file contains usually only one Stream. However, it is
 215         possible to concatenate multiple Streams together with no
 216         additional processing. It is up to the implementation to
 217         decide if the decoder will continue decoding from the next
 218         Stream once the end of the first Stream has been reached.
 219
 220
 221 2.1. Stream
 222
 223         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 224         |     Stream Header     | Block | Block | ... | Block |
 225         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 226
 227              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 228         ---> | Index |     Stream Footer     |
 229              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 230
 231         All the above fields have a size that is a multiple of four. If
 232         Stream is used as an internal part of another file format, it
 233         is recommended to make the Stream start at an offset that is
 234         a multiple of four bytes.
 235
 236         Stream Header, Index, and Stream Footer are always present in
 237         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 238
 239         There are zero or more Blocks. The maximum number of Blocks is
 240         limited only by the maximum size of the Index field.
 241
 242         Total size of a Stream must be less than 8 EiB (2^63 bytes).
 243         The same limit applies to the total amount of uncompressed
 244         data stored in a Stream.
 245
 246         If an implementation supports handling .lzma files with
 247         multiple concatenated Streams, it may apply the above limits
 248         to the file as a whole instead of limiting per Stream basis.
 249
 250
 251 2.1.1. Stream Header
 252
 253         +---+---+---+---+---+---+-------+------+--+--+--+--+
 254         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 255         +---+---+---+---+---+---+-------+------+--+--+--+--+
 256
 257
 258 2.1.1.1. Header Magic Bytes
 259
 260         The first six (6) bytes of the Stream are so called Header
 261         Magic Bytes. They can be used to identify the file type.
 262
 263             Using a C array and ASCII:
 264             const uint8_t HEADER_MAGIC[6]
 265                     = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 };
 266
 267             In plain hexadecimal:
 268             FF 4C 5A 4D 41 00
 269
 270         Notes:
 271           - The first byte (0xFF) was chosen so that the files cannot
 272             be erroneously detected as being in LZMA_Alone format, in
 273             which the first byte is in the range [0x00, 0xE0].
 274           - The sixth byte (0x00) was chosen to prevent applications
 275             from misdetecting the file as a text file.
 276
 277         If the Header Magic Bytes don't match, the decoder must
 278         indicate an error.
 279
 280
 281 2.1.1.2. Stream Flags
 282
 283         The first byte of Stream Flags is always a nul byte. In future
 284         this byte may be used to indicate new Stream version or other
 285         Stream properties.
 286
 287         The second byte of Stream Flags is a bit field:
 288
 289             Bit(s)  Mask  Description
 290              0-3    0x0F  Type of Check (see Section 3.3):
 291                               ID    Size      Check name
 292                               0x00   0 bytes  None
 293                               0x01   4 bytes  CRC32
 294                               0x02   4 bytes  (Reserved)
 295                               0x03   4 bytes  (Reserved)
 296                               0x04   8 bytes  CRC64
 297                               0x05   8 bytes  (Reserved)
 298                               0x06   8 bytes  (Reserved)
 299                               0x07  16 bytes  (Reserved)
 300                               0x08  16 bytes  (Reserved)
 301                               0x09  16 bytes  (Reserved)
 302                               0x0A  32 bytes  SHA-256
 303                               0x0B  32 bytes  (Reserved)
 304                               0x0C  32 bytes  (Reserved)
 305                               0x0D  64 bytes  (Reserved)
 306                               0x0E  64 bytes  (Reserved)
 307                               0x0F  64 bytes  (Reserved)
 308              4-7    0xF0  Reserved for future use; must be zero for now.
 309
 310         Implementations must support at least the Check IDs 0x00 (None)
 311         and 0x01 (CRC32). Supporting other Check IDs is optional. If
 312         an unsupported Check is used, the decoder should indicate a
 313         warning or error.
 314
 315         If any reserved bit is set, the decoder must indicate an error.
 316         It is possible that there is a new field present which the
 317         decoder is not aware of, and can thus parse the Stream Header
 318         incorrectly.
 319
 320
 321 2.1.1.3. CRC32
 322
 323         The CRC32 is calculated from the Stream Flags field. It is
 324         stored as an unsigned 32-bit little endian integer. If the
 325         calculated value does not match the stored one, the decoder
 326         must indicate an error.
 327
 328         The idea is that Stream Flags would always be two bytes, even
 329         if new features are needed. This way old decoders will be able
 330         to verify the CRC32 calculated from Stream Flags, and thus
 331         distinguish between corrupt files (CRC32 doesn't match) and
 332         files that the decoder doesn't support (CRC32 matches but
 333         Stream Flags has reserved bits set).
 334
 335
 336 2.1.2. Stream Footer
 337
 338         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 339         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 340         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 341
 342
 343 2.1.2.1. CRC32
 344
 345         The CRC32 is calculated from the Backward Size and Stream Flags
 346         fields. It is stored as an unsigned 32-bit little endian
 347         integer. If the calculated value does not match the stored one,
 348         the decoder must indicate an error.
 349
 350         The reason to have the CRC32 field before the Backward Size and
 351         Stream Flags fields is to keep the four-byte fields aligned to
 352         a multiple of four bytes.
 353
 354
 355 2.1.2.2. Backward Size
 356
 357         Backward Size is stored as a 32-bit little endian integer,
 358         which indicates the size of the Index field as multiple of
 359         four bytes, minimum value being four bytes:
 360
 361             real_backward_size = (stored_backward_size + 1) * 4;
 362
 363         Using a fixed-size integer to store this value makes it
 364         slightly simpler to parse the Stream Footer when the
 365         application needs to parse the Stream backwards.
 366
 367
 368 2.1.2.3. Stream Flags
 369
 370         This is a copy of the Stream Flags field from the Stream
 371         Header. The information stored to Stream Flags is needed
 372         when parsing the Stream backwards. The decoder must compare
 373         the Stream Flags fields in both Stream Header and Stream
 374         Footer, and indicate an error if they are not identical.
 375
 376
 377 2.1.2.4. Footer Magic Bytes
 378
 379         As the last step of the decoding process, the decoder must
 380         verify the existence of Footer Magic Bytes. If they don't
 381         match, an error must be indicated.
 382
 383             Using a C array and ASCII:
 384             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 385
 386             In hexadecimal:
 387             59 5A
 388
 389         The primary reason to have Footer Magic Bytes is to make
 390         it easier to detect incomplete files quickly, without
 391         uncompressing. If the file does not end with Footer Magic Bytes
 392         (excluding Stream Padding described in Section 2.2), it cannot
 393         be undamaged, unless someone has intentionally appended garbage
 394         after the end of the Stream.
 395
 396
 397 2.2. Stream Padding
 398
 399         Only the decoders that support decoding of concatenated Streams
 400         must support Stream Padding.
 401
 402         Stream Padding must contain only nul bytes. Any non-nul byte
 403         should be considered as the beginning of a new Stream. To
 404         preserve the four-byte alignment of consecutive Streams, the
 405         size of Stream Padding must be a multiple of four bytes. Empty
 406         Stream Padding is allowed.
 407
 408         Note that non-empty Stream Padding is allowed at the end of the
 409         file; there doesn't need to be a new Stream after non-empty
 410         Stream Padding. This can be convenient in certain situations
 411         [GNU-tar].
 412
 413         The possibility of Padding should be taken into account when
 414         designing an application that parses the Stream backwards.
 415
 416
 417 3. Block
 418
 419         +==============+=================+=======+
 420         | Block Header | Compressed Data | Check |
 421         +==============+=================+=======+
 422
 423
 424 3.1. Block Header
 425
 426         +-------------------+-------------+=================+
 427         | Block Header Size | Block Flags | Compressed Size |
 428         +-------------------+-------------+=================+
 429
 430              +===================+======================+
 431         ---> | Uncompressed Size | List of Filter Flags |
 432              +===================+======================+
 433
 434              +================+--+--+--+--+
 435         ---> | Header Padding |   CRC32   |
 436              +================+--+--+--+--+
 437
 438
 439 3.1.1. Block Header Size
 440
 441         This field overlaps with the Index Indicator field (see
 442         Section 4.1).
 443
 444         This field contains the size of the Block Header field,
 445         including the Block Header Size field itself. Valid values are
 446         in the range [0x01, 0xFF], which indicate the size of the Block
 447         Header as multiples of four bytes, minimum size being eight
 448         bytes:
 449
 450             real_header_size = (encoded_header_size + 1) * 4;
 451
 452         If bigger Block Header is needed in future, a new field can be
 453         added between the current Block Header and Compressed Data
 454         fields. The presence of this new field would be indicated in
 455         the Block Header.
 456
 457
 458 3.1.2. Block Flags
 459
 460         The first byte of the Block Flags field is a bit field:
 461
 462             Bit(s)  Mask  Description
 463              0-1    0x03  Number of filters (1-4)
 464              2-5    0x3C  Reserved for future use; must be zero for now.
 465               6     0x40  The Compressed Size field is present.
 466               7     0x80  The Uncompressed Size field is present.
 467
 468         If any reserved bit is set, the decoder must indicate an error.
 469         It is possible that there is a new field present which the
 470         decoder is not aware of, and can thus parse the Block Header
 471         incorrectly.
 472
 473
 474 3.1.3. Compressed Size
 475
 476         This field is present only if the appropriate bit is set in
 477         the Block Flags field (see Section 3.1.2).
 478
 479         This field contains the size of the Compressed Data field as
 480         multiple of four bytes, minimum value being four bytes:
 481
 482             real_compressed_size = (stored_compressed_size + 1) * 4;
 483
 484         The size is stored using the encoding described in Section 1.2.
 485         If the Compressed Size does not match the real size of the
 486         Compressed Data field, the decoder must indicate an error.
 487
 488
 489 3.1.4. Uncompressed Size
 490
 491         This field is present only if the appropriate bit is set in
 492         the Block Flags field (see Section 3.1.2).
 493
 494         The Uncompressed Size field contains the size of the Block
 495         after uncompressing. Uncompressed Size is stored using the
 496         encoding described in Section 1.2. If the Uncompressed Size
 497         does not match the real uncompressed size, the decoder must
 498         indicate an error.
 499
 500         Storing the Compressed Size and Uncompressed Size fields serves
 501         several purposes:
 502           - The decoder knows how much memory it needs to allocate
 503             for a temporary buffer in multithreaded mode.
 504           - Simple error detection: wrong size indicates a broken file.
 505           - Seeking forwards to a specific location in streamed mode.
 506
 507         It should be noted that the only reliable way to determine
 508         the real uncompressed size is to uncompress the Block,
 509         because the Block Header and Index fields may contain
 510         (intentionally or unintentionally) invalid information.
 511
 512
 513 3.1.5. List of Filter Flags
 514
 515         +================+================+     +================+
 516         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 517         +================+================+     +================+
 518
 519         The number of Filter Flags fields is stored in the Block Flags
 520         field (see Section 3.1.2).
 521
 522         The format of each Filter Flags field is as follows:
 523
 524             +===========+====================+===================+
 525             | Filter ID | Size of Properties | Filter Properties |
 526             +===========+====================+===================+
 527
 528         Both Filter ID and Size of Properties are stored using the
 529         encoding described in Section 1.2. Size of Properties indicates
 530         the size of the Filter Properties field as bytes. The list of
 531         officially defined Filter IDs and the formats of their Filter
 532         Properties are described in Section 5.3.
 533
 534         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 535         (2^62) are reserved for implementation-specific internal use.
 536         These Filter IDs must never be used in List of Filter Flags.
 537
 538
 539 3.1.6. Header Padding
 540
 541         This field contains as many nul byte as it is needed to make
 542         the Block Header have the size specified in Block Header Size.
 543         If any of the bytes are not nul bytes, the decoder must
 544         indicate an error. It is possible that there is a new field
 545         present which the decoder is not aware of, and can thus parse
 546         the Block Header incorrectly.
 547
 548
 549 3.1.7. CRC32
 550
 551         The CRC32 is calculated over everything in the Block Header
 552         field except the CRC32 field itself. It is stored as an
 553         unsigned 32-bit little endian integer. If the calculated
 554         value does not match the stored one, the decoder must indicate
 555         an error.
 556
 557         By verifying the CRC32 of the Block Header before parsing the
 558         actual contents allows the decoder to distinguish between
 559         corrupt and unsupported files.
 560
 561
 562 3.2. Compressed Data
 563
 564         The format of Compressed Data depends on Block Flags and List
 565         of Filter Flags. Excluding the descriptions of the simplest
 566         filters in Section 5.3, the format of the filter-specific
 567         encoded data is out of scope of this document.
 568
 569         If the natural size of Compressed Data is not a multiple of
 570         four bytes, it must be padded with 1-3 nul bytes to make it
 571         a multiple of four bytes.
 572
 573
 574 3.3. Check
 575
 576         The type and size of the Check field depends on which bits
 577         are set in the Stream Flags field (see Section 2.1.1.2).
 578
 579         The Check, when used, is calculated from the original
 580         uncompressed data. If the calculated Check does not match the
 581         stored one, the decoder must indicate an error. If the selected
 582         type of Check is not supported by the decoder, it must indicate
 583         a warning or error.
 584
 585
 586 4. Index
 587
 588         +-----------------+=========================+
 589         | Index Indicator | Number of Index Records |
 590         +-----------------+=========================+
 591
 592              +=================+=========+-+-+-+-+
 593         ---> | List of Records | Padding | CRC32 |
 594              +=================+=========+-+-+-+-+
 595
 596         Index serves several purporses. Using it, one can
 597           - verify that all Blocks in a Stream have been processed;
 598           - find out the uncompressed size of a Stream; and
 599           - quickly access the beginning of any Block (random access).
 600
 601
 602 4.1. Index Indicator
 603
 604         This field overlaps with the Block Header Size field (see
 605         Section 3.1.1). The value of Index Indicator is always 0x00.
 606
 607
 608 4.2. Number of Records
 609
 610         This field indicates how many Records there are in the List
 611         of Records field, and thus how many Blocks there are in the
 612         Stream. The value is stored using the encoding described in
 613         Section 1.2. If the decoder has decoded all the Blocks of the
 614         Stream, and then notices that the Number of Records doesn't
 615         match the real number of Blocks, the decoder must indicate an
 616         error.
 617
 618
 619 4.3. List of Records
 620
 621         List of Records consists of as many Records as indicated by the
 622         Number of Records field:
 623
 624             +========+========+
 625             | Record | Record | ...
 626             +========+========+
 627
 628         Each Record contains two fields:
 629
 630             +============+===================+
 631             | Total Size | Uncompressed Size |
 632             +============+===================+
 633
 634         If the decoder has decoded all the Blocks of the Stream, it
 635         must verify that the contents of the Records match the real
 636         Total Size and Uncompressed Size of the respective Blocks.
 637
 638         Implementation hint: It is possible to verify the Index with
 639         constant memory usage by calculating for example SHA256 of both
 640         the real size values and the List of Records, then comparing
 641         the check values. Implementing this using non-cryptographic
 642         check like CRC32 should be avoided unless small code size is
 643         important.
 644
 645         If the decoder supports random-access reading, it must verify
 646         that Total Size and Uncompressed Size of every completely
 647         decoded Block match the sizes stored in the Index. If only
 648         partial Block is decoded, the decoder must verify that the
 649         processed sizes don't exceed the sizes stored in the Index.
 650
 651
 652 4.3.1. Total Size
 653
 654         This field indicates the encoded size of the respective Block
 655         as multiples of four bytes, minimum value being four bytes:
 656
 657             real_total_size = (stored_total_size + 1) * 4;
 658
 659         The value is stored using the encoding described in Section
 660         1.2.
 661
 662
 663 4.3.2. Uncompressed Size
 664
 665         This field indicates the Uncompressed Size of the respective
 666         Block as bytes. The value is stored using the encoding
 667         described in Section 1.2.
 668
 669
 670 4.4. Index Padding
 671
 672         This field must contain 0-3 nul bytes to pad the Index to
 673         a multiple of four bytes.
 674
 675
 676 4.5. CRC32
 677
 678         The CRC32 is calculated over everything in the Index field
 679         except the CRC32 field itself. The CRC32 is stored as an
 680         unsigned 32-bit little endian integer. If the calculated
 681         value does not match the stored one, the decoder must indicate
 682         an error.
 683
 684
 685 5. Filter Chains
 686
 687         The Block Flags field defines how many filters are used. When
 688         more than one filter is used, the filters are chained; that is,
 689         the output of one filter is the input of another filter. The
 690         following figure illustrates the direction of data flow.
 691
 692                     v   Uncompressed Data   ^
 693                     |       Filter 0        |
 694             Encoder |       Filter 1        | Decoder
 695                     |       Filter n        |
 696                     v    Compressed Data    ^
 697
 698
 699 5.1. Alignment
 700
 701         Alignment of uncompressed input data is usually the job of
 702         the application producing the data. For example, to get the
 703         best results, an archiver tool should make sure that all
 704         PowerPC executable files in the archive stream start at
 705         offsets that are multiples of four bytes.
 706
 707         Some filters, for example LZMA, can be configured to take
 708         advantage of specified alignment of input data. Note that
 709         taking advantage of aligned input can be benefical also when
 710         a filter is not the first filter in the chain. For example,
 711         if you compress PowerPC executables, you may want to use the
 712         PowerPC filter and chain that with the LZMA filter. Because not
 713         only the input but also the output alignment of the PowerPC
 714         filter is four bytes, it is now benefical to set LZMA settings
 715         so that the LZMA encoder can take advantage of its
 716         four-byte-aligned input data.
 717
 718         The output of the last filter in the chain is stored to the
 719         Compressed Data field, which is is guaranteed to be aligned
 720         to a multiple of four bytes relative to the beginning of the
 721         Stream. This can increase
 722           - speed, if the filtered data is handled multiple bytes at
 723             a time by the filter-specific encoder and decoder,
 724             because accessing aligned data in computer memory is
 725             usually faster; and
 726           - compression ratio, if the output data is later compressed
 727             with an external compression tool.
 728
 729
 730 5.2. Security
 731
 732         If filters would be allowed to be chained freely, it would be
 733         possible to create malicious files, that would be very slow to
 734         decode. Such files could be used to create denial of service
 735         attacks.
 736
 737         Slow files could occur when multiple filters are chained:
 738
 739             v   Compressed input data
 740             |   Filter 1 decoder (last filter)
 741             |   Filter 0 decoder (non-last filter)
 742             v   Uncompressed output data
 743
 744         The decoder of the last filter in the chain produces a lot of
 745         output from little input. Another filter in the chain takes the
 746         output of the last filter, and produces very little output
 747         while consuming a lot of input. As a result, a lot of data is
 748         moved inside the filter chain, but the filter chain as a whole
 749         gets very little work done.
 750
 751         To prevent this kind of slow files, there are restrictions on
 752         how the filters can be chained. These restrictions must be
 753         taken into account when designing new filters.
 754
 755         The maximum number of filters in the chain has been limited to
 756         four, thus there can be at maximum of three non-last filters.
 757         Of these three non-last filters, only two are allowed to change
 758         the size of the data.
 759
 760         The non-last filters, that change the size of the data, must
 761         have a limit how much the decoder can compress the data: the
 762         decoder should produce at least n bytes of output when the
 763         filter is given 2n bytes of input. This  limit is not
 764         absolute, but significant deviations must be avoided.
 765
 766         The above limitations guarantee that if the last filter in the
 767         chain produces 4n bytes of output, the chain as a whole will
 768         produce at least n bytes of output.
 769
 770
 771 5.3. Filters
 772
 773 5.3.1. LZMA
 774
 775         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
 776         compression algorithm with high compression ratio and fast
 777         decompression. LZMA is based on LZ77 and range coding
 778         algorithms.
 779
 780             Filter ID:                  0x40
 781             Size of Filter Properties:  5 bytes
 782             Changes size of data:       Yes
 783             Allow as a non-last filter: No
 784             Allow as the last filter:   Yes
 785
 786             Preferred alignment:
 787                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 788                 Output data:            1 byte
 789
 790         At the time of writing, there is no other documentation about
 791         how LZMA works than the source code in LZMA SDK. Once such
 792         documentation gets written, it will probably be published as
 793         a separate document, because including the documentation here
 794         would lengthen this document considerably.
 795
 796         The format of the Filter Properties field is as follows:
 797
 798             +-----------------+----+----+----+----+
 799             | LZMA Properties |  Dictionary Size  |
 800             +-----------------+----+----+----+----+
 801
 802         The LZMA Properties field contains three properties. An
 803         abbreviation is given in parentheses, followed by the value
 804         range of the property. The field consists of
 805
 806             1) the number of literal context bits (lc, [0, 4]);
 807             2) the number of literal position bits (lp, [0, 4]); and
 808             3) the number of position bits (pb, [0, 4]).
 809
 810         In addition to above ranges, the sum of lc and lp must not
 811         exceed four. Note that this limit didn't exist in the old
 812         LZMA_Alone format, which allowed lc to be in the range [0, 8].
 813
 814         The properties are encoded using the following formula:
 815
 816             LZMA Properties = (pb * 5 + lp) * 9 + lc
 817
 818         The following C code illustrates a straightforward way to
 819         decode the properties:
 820
 821             uint8_t lc, lp, pb;
 822             uint8_t prop = get_lzma_properties();
 823             if (prop > (4 * 5 + 4) * 9 + 8)
 824                 return LZMA_PROPERTIES_ERROR;
 825
 826             pb = prop / (9 * 5);
 827             prop -= pb * 9 * 5;
 828             lp = prop / 9;
 829             lc = prop - lp * 9;
 830
 831             if (lc + lp > 4)
 832                 return LZMA_PROPERTIES_ERROR;
 833
 834         Dictionary Size is encoded as unsigned 32-bit little endian
 835         integer.
 836
 837
 838 5.3.2. LZMA2
 839
 840         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 841         LZMA internally, but adds support for flushing the encoder,
 842         uncompressed chunks, eases stateful decoder implementations,
 843         and improves support for multithreading. For most uses, it is
 844         recommended to use LZMA2 instead of LZMA.
 845
 846             Filter ID:                  0x21
 847             Size of Filter Properties:  1 byte
 848             Changes size of data:       Yes
 849             Allow as a non-last filter: No
 850             Allow as the last filter:   Yes
 851
 852             Preferred alignment:
 853                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 854                 Output data:            1 byte
 855
 856         The format of the one-byte Filter Properties field is as
 857         follows:
 858
 859             Bits   Mask   Description
 860             0-5    0x3F   Dictionary Size
 861             6-7    0xC0   Reserved for future use; must be zero for now.
 862
 863         Dictionary Size is encoded with one-bit mantissa and five-bit
 864         exponent. The smallest dictionary size is 4 KiB and the biggest
 865         is 4 GiB.
 866
 867             Raw value   Mantissa   Exponent   Dictionary size
 868                 0           2         11         4 KiB
 869                 1           3         11         6 KiB
 870                 2           2         12         8 KiB
 871                 3           3         12        12 KiB
 872                 4           2         13        16 KiB
 873                 5           3         13        24 KiB
 874                 6           2         14        32 KiB
 875               ...         ...        ...      ...
 876                35           3         27       768 MiB
 877                36           2         28      1024 MiB
 878                37           3         29      1536 MiB
 879                38           2         30      2048 MiB
 880                39           3         30      3072 MiB
 881                40           2         31      4096 MiB - 1 B
 882
 883         Instead of having a table in the decoder, the dictionary size
 884         can be decoded using the following C code:
 885
 886             const uint8_t bits = get_dictionary_flags() & 0x3F;
 887             if (bits > 40)
 888                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 889
 890             uint32_t dictionary_size;
 891             if (bits == 40) {
 892                 dictionary_size = UINT32_MAX;
 893             } else {
 894                 dictionary_size = 2 | (bits & 1);
 895                 dictionary_size <<= bits / 2 + 11;
 896             }
 897
 898
 899 5.3.3. Branch/Call/Jump Filters for Executables
 900
 901         These filters convert relative branch, call, and jump
 902         instructions to their absolute counterparts in executable
 903         files. This conversion increases redundancy and thus
 904         compression ratio.
 905
 906             Size of Filter Properties:  0 or 4 bytes
 907             Changes size of data:       No
 908             Allow as a non-last filter: Yes
 909             Allow as the last filter:   No
 910
 911             Detecting when all of the data has been decoded:
 912                 Uncompressed size:      Yes
 913                 End of Payload Marker:  No
 914                 End of Input:           Yes
 915
 916         Below is the list of filters in this category. The alignment
 917         is the same for both input and output data.
 918
 919             Filter ID   Alignment   Description
 920               0x04       1 byte     x86 filter (BCJ)
 921               0x05       4 bytes    PowerPC (big endian) filter
 922               0x06      16 bytes    IA64 filter
 923               0x07       4 bytes    ARM (little endian) filter
 924               0x08       2 bytes    ARM Thumb (little endian) filter
 925               0x09       4 bytes    SPARC filter
 926
 927         If the size of Filter Properties is four bytes, the Filter
 928         Properties field contains the start offset used for address
 929         conversions. It is stored as an unsigned 32-bit little endian
 930         integer. If the size of Filter Properties is zero, the start
 931         offset is zero.
 932
 933         Setting the start offset may be useful if an executable has
 934         multiple sections, and there are many cross-section calls.
 935         Taking advantage of this feature usually requires usage of
 936         the Subblock filter.
 937
 938
 939 5.3.4. Delta
 940
 941         The Delta filter may increase compression ratio when the value
 942         of the next byte correlates with the value of an earlier byte
 943         at specified distance.
 944
 945             Filter ID:                  0x03
 946             Size of Filter Properties:  1 byte
 947             Changes size of data:       No
 948             Allow as a non-last filter: Yes
 949             Allow as the last filter:   No
 950
 951             Preferred alignment:
 952                 Input data:             1 byte
 953                 Output data:            Same as the original input data
 954
 955         The Properties byte indicates the delta distance, which can be
 956         1-256 bytes backwards from the current byte: 0x00 indicates
 957         distance of 1 byte and 0xFF distance of 256 bytes.
 958
 959
 960 5.3.4.1. Format of the Encoded Output
 961
 962         The code below illustrates both encoding and decoding with
 963         the Delta filter.
 964
 965             // Distance is in the range [1, 256].
 966             const unsigned int distance = get_properties_byte() + 1;
 967             uint8_t pos = 0;
 968             uint8_t delta[256];
 969
 970             memset(delta, 0, sizeof(delta));
 971
 972             while (1) {
 973                 const int byte = read_byte();
 974                 if (byte == EOF)
 975                     break;
 976
 977                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 978                 if (is_encoder) {
 979                     tmp = (uint8_t)(byte) - tmp;
 980                     delta[pos] = (uint8_t)(byte);
 981                 } else {
 982                     tmp = (uint8_t)(byte) + tmp;
 983                     delta[pos] = tmp;
 984                 }
 985
 986                 write_byte(tmp);
 987                 --pos;
 988             }
 989
 990
 991 5.4. Custom Filter IDs
 992
 993         If a developer wants to use custom Filter IDs, he has two
 994         choices. The first choice is to contact Lasse Collin and ask
 995         him to allocate a range of IDs for the developer.
 996
 997         The second choice is to generate a 40-bit random integer,
 998         which the developer can use as his personal Developer ID.
 999         To minimalize the risk of collisions, Developer ID has to be
1000         a randomly generated integer, not manually selected "hex word".
1001         The following command, which works on many free operating
1002         systems, can be used to generate Developer ID:
1003
1004             dd if=/dev/urandom bs=5 count=1 | hexdump
1005
1006         The developer can then use his Developer ID to create unique
1007         (well, hopefully unique) Filter IDs.
1008
1009             Bits    Mask                    Description
1010              0-15   0x0000_0000_0000_FFFF   Filter ID
1011             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1012             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1013
1014         The resulting 63-bit integer will use 9 bytes of space when
1015         stored using the encoding described in Section 1.2. To get
1016         a shorter ID, see the beginning of this Section how to
1017         request a custom ID range.
1018
1019
1020 5.4.1. Reserved Custom Filter ID Ranges
1021
1022         Range                       Description
1023         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1024         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1025
1026
1027 6. Cyclic Redundancy Checks
1028
1029         There are several incompatible variations to calculate CRC32
1030         and CRC64. For simplicity and clarity, complete examples are
1031         provided to calculate the checks as they are used in this file
1032         format. Implementations may use different code as long as it
1033         gives identical results.
1034
1035         The program below reads data from standard input, calculates
1036         the CRC32 and CRC64 values, and prints the calculated values
1037         as big endian hexadecimal strings to standard output.
1038
1039             #include <sys/types.h>
1040             #include <inttypes.h>
1041             #include <stdio.h>
1042
1043             uint32_t crc32_table[256];
1044             uint64_t crc64_table[256];
1045
1046             void
1047             init(void)
1048             {
1049                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1050                 static const uint64_t poly64
1051                         = UINT64_C(0xC96C5795D7870F42);
1052
1053                 for (size_t i = 0; i < 256; ++i) {
1054                     uint32_t crc32 = i;
1055                     uint64_t crc64 = i;
1056
1057                     for (size_t j = 0; j < 8; ++j) {
1058                         if (crc32 & 1)
1059                             crc32 = (crc32 >> 1) ^ poly32;
1060                         else
1061                             crc32 >>= 1;
1062
1063                         if (crc64 & 1)
1064                             crc64 = (crc64 >> 1) ^ poly64;
1065                         else
1066                             crc64 >>= 1;
1067                     }
1068
1069                     crc32_table[i] = crc32;
1070                     crc64_table[i] = crc64;
1071                 }
1072             }
1073
1074             uint32_t
1075             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1076             {
1077                 crc = ~crc;
1078                 for (size_t i = 0; i < size; ++i)
1079                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1080                             ^ (crc >> 8);
1081                 return ~crc;
1082             }
1083
1084             uint64_t
1085             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1086             {
1087                 crc = ~crc;
1088                 for (size_t i = 0; i < size; ++i)
1089                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1090                             ^ (crc >> 8);
1091                 return ~crc;
1092             }
1093
1094             int
1095             main()
1096             {
1097                 init();
1098
1099                 uint32_t value32 = 0;
1100                 uint64_t value64 = 0;
1101                 uint64_t total_size = 0;
1102                 uint8_t buf[8192];
1103
1104                 while (1) {
1105                     const size_t buf_size = fread(buf, 1, 8192, stdin);
1106                     if (buf_size == 0)
1107                         break;
1108
1109                     total_size += buf_size;
1110                     value32 = crc32(buf, buf_size, value32);
1111                     value64 = crc64(buf, buf_size, value64);
1112                 }
1113
1114                 printf("Bytes:  %" PRIu64 "\n", total_size);
1115                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1116                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1117
1118                 return 0;
1119             }
1120
1121
1122 7. References
1123
1124         LZMA SDK - The original LZMA implementation
1125         http://7-zip.org/sdk.html
1126
1127         LZMA Utils - LZMA adapted to POSIX-like systems
1128         http://tukaani.org/lzma/
1129
1130         [RFC-1952]
1131         GZIP file format specification version 4.3
1132         http://www.ietf.org/rfc/rfc1952.txt
1133           - Notation of byte boxes in section `2.1. Overall conventions'
1134
1135         [RFC-2119]
1136         Key words for use in RFCs to Indicate Requirement Levels
1137         http://www.ietf.org/rfc/rfc2119.txt
1138
1139         [GNU-tar]
1140         GNU tar 1.16.1 manual
1141         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1142           - Node 9.4.2 `Blocking Factor', paragraph that begins
1143             `gzip will complain about trailing garbage'
1144           - Note that this URL points to the latest version of the
1145             manual, and may some day not contain the note which is in
1146             1.16.1. For the exact version of the manual, download GNU
1147             tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz
1148