doc/xz-file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.0.3 (2009-06-05)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <http://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
  97
  98         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
  99
 100         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
 101                                 clarifications to Sections 2, 2.2,
 102                                 3.3, 4.4, and 5.3.2
 103
 104         1.0.0     2009-01-14    The first official version
 105
 106
 107 1. Conventions
 108
 109         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 110         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 111         document are to be interpreted as described in [RFC-2119].
 112
 113         Indicating a warning means displaying a message, returning
 114         appropriate exit status, or doing something else to let the
 115         user know that something worth warning occurred. The operation
 116         SHOULD still finish if a warning is indicated.
 117
 118         Indicating an error means displaying a message, returning
 119         appropriate exit status, or doing something else to let the
 120         user know that something prevented successfully finishing the
 121         operation. The operation MUST be aborted once an error has
 122         been indicated.
 123
 124
 125 1.1. Byte and Its Representation
 126
 127         In this document, byte is always 8 bits.
 128
 129         A "null byte" has all bits unset. That is, the value of a null
 130         byte is 0x00.
 131
 132         To represent byte blocks, this document uses notation that
 133         is similar to the notation used in [RFC-1952]:
 134
 135             +-------+
 136             |  Foo  |   One byte.
 137             +-------+
 138
 139             +---+---+
 140             |  Foo  |   Two bytes; that is, some of the vertical bars
 141             +---+---+   can be missing.
 142
 143             +=======+
 144             |  Foo  |   Zero or more bytes.
 145             +=======+
 146
 147         In this document, a boxed byte or a byte sequence declared
 148         using this notation is called "a field". The example field
 149         above would be called "the Foo field" or plain "Foo".
 150
 151         If there are many fields, they may be split to multiple lines.
 152         This is indicated with an arrow ("--->"):
 153
 154             +=====+
 155             | Foo |
 156             +=====+
 157
 158                  +=====+
 159             ---> | Bar |
 160                  +=====+
 161
 162         The above is equivalent to this:
 163
 164             +=====+=====+
 165             | Foo | Bar |
 166             +=====+=====+
 167
 168
 169 1.2. Multibyte Integers
 170
 171         Multibyte integers of static length, such as CRC values,
 172         are stored in little endian byte order (least significant
 173         byte first).
 174
 175         When smaller values are more likely than bigger values (for
 176         example file sizes), multibyte integers are encoded in a
 177         variable-length representation:
 178           - Numbers in the range [0, 127] are copied as is, and take
 179             one byte of space.
 180           - Bigger numbers will occupy two or more bytes. All but the
 181             last byte of the multibyte representation have the highest
 182             (eighth) bit set.
 183
 184         For now, the value of the variable-length integers is limited
 185         to 63 bits, which limits the encoded size of the integer to
 186         nine bytes. These limits may be increased in future if needed.
 187
 188         The following C code illustrates encoding and decoding of
 189         variable-length integers. The functions return the number of
 190         bytes occupied by the integer (1-9), or zero on error.
 191
 192             #include <stddef.h>
 193             #include <inttypes.h>
 194
 195             size_t
 196             encode(uint8_t buf[static 9], uint64_t num)
 197             {
 198                 if (num > UINT64_MAX / 2)
 199                     return 0;
 200
 201                 size_t i = 0;
 202
 203                 while (num >= 0x80) {
 204                     buf[i++] = (uint8_t)(num) | 0x80;
 205                     num >>= 7;
 206                 }
 207
 208                 buf[i++] = (uint8_t)(num);
 209
 210                 return i;
 211             }
 212
 213             size_t
 214             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 215             {
 216                 if (size_max == 0)
 217                     return 0;
 218
 219                 if (size_max > 9)
 220                     size_max = 9;
 221
 222                 *num = buf[0] & 0x7F;
 223                 size_t i = 0;
 224
 225                 while (buf[i++] & 0x80) {
 226                     if (i >= size_max || buf[i] == 0x00)
 227                         return 0;
 228
 229                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 230                 }
 231
 232                 return i;
 233             }
 234
 235
 236 2. Overall Structure of .xz File
 237
 238         A standalone .xz files consist of one or more Streams which may
 239         have Stream Padding between or after them:
 240
 241             +========+================+========+================+
 242             | Stream | Stream Padding | Stream | Stream Padding | ...
 243             +========+================+========+================+
 244
 245         The sizes of Stream and Stream Padding are always multiples
 246         of four bytes, thus the size of every valid .xz file MUST be
 247         a multiple of four bytes.
 248
 249         While a typical file contains only one Stream and no Stream
 250         Padding, a decoder handling standalone .xz files SHOULD support
 251         files that have more than one Stream or Stream Padding.
 252
 253         In contrast to standalone .xz files, when the .xz file format
 254         is used as an internal part of some other file format or
 255         communication protocol, it usually is expected that the decoder
 256         stops after the first Stream, and doesn't look for Stream
 257         Padding or possibly other Streams.
 258
 259
 260 2.1. Stream
 261
 262         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 263         |     Stream Header     | Block | Block | ... | Block |
 264         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 265
 266              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 267         ---> | Index |     Stream Footer     |
 268              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 269
 270         All the above fields have a size that is a multiple of four. If
 271         Stream is used as an internal part of another file format, it
 272         is RECOMMENDED to make the Stream start at an offset that is
 273         a multiple of four bytes.
 274
 275         Stream Header, Index, and Stream Footer are always present in
 276         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 277
 278         There are zero or more Blocks. The maximum number of Blocks is
 279         limited only by the maximum size of the Index field.
 280
 281         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 282         The same limit applies to the total amount of uncompressed
 283         data stored in a Stream.
 284
 285         If an implementation supports handling .xz files with multiple
 286         concatenated Streams, it MAY apply the above limits to the file
 287         as a whole instead of limiting per Stream basis.
 288
 289
 290 2.1.1. Stream Header
 291
 292         +---+---+---+---+---+---+-------+------+--+--+--+--+
 293         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 294         +---+---+---+---+---+---+-------+------+--+--+--+--+
 295
 296
 297 2.1.1.1. Header Magic Bytes
 298
 299         The first six (6) bytes of the Stream are so called Header
 300         Magic Bytes. They can be used to identify the file type.
 301
 302             Using a C array and ASCII:
 303             const uint8_t HEADER_MAGIC[6]
 304                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 305
 306             In plain hexadecimal:
 307             FD 37 7A 58 5A 00
 308
 309         Notes:
 310           - The first byte (0xFD) was chosen so that the files cannot
 311             be erroneously detected as being in .lzma format, in which
 312             the first byte is in the range [0x00, 0xE0].
 313           - The sixth byte (0x00) was chosen to prevent applications
 314             from misdetecting the file as a text file.
 315
 316         If the Header Magic Bytes don't match, the decoder MUST
 317         indicate an error.
 318
 319
 320 2.1.1.2. Stream Flags
 321
 322         The first byte of Stream Flags is always a null byte. In future
 323         this byte may be used to indicate new Stream version or other
 324         Stream properties.
 325
 326         The second byte of Stream Flags is a bit field:
 327
 328             Bit(s)  Mask  Description
 329              0-3    0x0F  Type of Check (see Section 3.4):
 330                               ID    Size      Check name
 331                               0x00   0 bytes  None
 332                               0x01   4 bytes  CRC32
 333                               0x02   4 bytes  (Reserved)
 334                               0x03   4 bytes  (Reserved)
 335                               0x04   8 bytes  CRC64
 336                               0x05   8 bytes  (Reserved)
 337                               0x06   8 bytes  (Reserved)
 338                               0x07  16 bytes  (Reserved)
 339                               0x08  16 bytes  (Reserved)
 340                               0x09  16 bytes  (Reserved)
 341                               0x0A  32 bytes  SHA-256
 342                               0x0B  32 bytes  (Reserved)
 343                               0x0C  32 bytes  (Reserved)
 344                               0x0D  64 bytes  (Reserved)
 345                               0x0E  64 bytes  (Reserved)
 346                               0x0F  64 bytes  (Reserved)
 347              4-7    0xF0  Reserved for future use; MUST be zero for now.
 348
 349         Implementations SHOULD support at least the Check IDs 0x00
 350         (None) and 0x01 (CRC32). Supporting other Check IDs is
 351         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 352         indicate a warning or error.
 353
 354         If any reserved bit is set, the decoder MUST indicate an error.
 355         It is possible that there is a new field present which the
 356         decoder is not aware of, and can thus parse the Stream Header
 357         incorrectly.
 358
 359
 360 2.1.1.3. CRC32
 361
 362         The CRC32 is calculated from the Stream Flags field. It is
 363         stored as an unsigned 32-bit little endian integer. If the
 364         calculated value does not match the stored one, the decoder
 365         MUST indicate an error.
 366
 367         The idea is that Stream Flags would always be two bytes, even
 368         if new features are needed. This way old decoders will be able
 369         to verify the CRC32 calculated from Stream Flags, and thus
 370         distinguish between corrupt files (CRC32 doesn't match) and
 371         files that the decoder doesn't support (CRC32 matches but
 372         Stream Flags has reserved bits set).
 373
 374
 375 2.1.2. Stream Footer
 376
 377         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 378         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 379         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 380
 381
 382 2.1.2.1. CRC32
 383
 384         The CRC32 is calculated from the Backward Size and Stream Flags
 385         fields. It is stored as an unsigned 32-bit little endian
 386         integer. If the calculated value does not match the stored one,
 387         the decoder MUST indicate an error.
 388
 389         The reason to have the CRC32 field before the Backward Size and
 390         Stream Flags fields is to keep the four-byte fields aligned to
 391         a multiple of four bytes.
 392
 393
 394 2.1.2.2. Backward Size
 395
 396         Backward Size is stored as a 32-bit little endian integer,
 397         which indicates the size of the Index field as multiple of
 398         four bytes, minimum value being four bytes:
 399
 400             real_backward_size = (stored_backward_size + 1) * 4;
 401
 402         If the stored value does not match the real size of the Index
 403         field, the decoder MUST indicate an error.
 404
 405         Using a fixed-size integer to store Backward Size makes
 406         it slightly simpler to parse the Stream Footer when the
 407         application needs to parse the Stream backwards.
 408
 409
 410 2.1.2.3. Stream Flags
 411
 412         This is a copy of the Stream Flags field from the Stream
 413         Header. The information stored to Stream Flags is needed
 414         when parsing the Stream backwards. The decoder MUST compare
 415         the Stream Flags fields in both Stream Header and Stream
 416         Footer, and indicate an error if they are not identical.
 417
 418
 419 2.1.2.4. Footer Magic Bytes
 420
 421         As the last step of the decoding process, the decoder MUST
 422         verify the existence of Footer Magic Bytes. If they don't
 423         match, an error MUST be indicated.
 424
 425             Using a C array and ASCII:
 426             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 427
 428             In hexadecimal:
 429             59 5A
 430
 431         The primary reason to have Footer Magic Bytes is to make
 432         it easier to detect incomplete files quickly, without
 433         uncompressing. If the file does not end with Footer Magic Bytes
 434         (excluding Stream Padding described in Section 2.2), it cannot
 435         be undamaged, unless someone has intentionally appended garbage
 436         after the end of the Stream.
 437
 438
 439 2.2. Stream Padding
 440
 441         Only the decoders that support decoding of concatenated Streams
 442         MUST support Stream Padding.
 443
 444         Stream Padding MUST contain only null bytes. To preserve the
 445         four-byte alignment of consecutive Streams, the size of Stream
 446         Padding MUST be a multiple of four bytes. Empty Stream Padding
 447         is allowed. If these requirements are not met, the decoder MUST
 448         indicate an error.
 449
 450         Note that non-empty Stream Padding is allowed at the end of the
 451         file; there doesn't need to be a new Stream after non-empty
 452         Stream Padding. This can be convenient in certain situations
 453         [GNU-tar].
 454
 455         The possibility of Stream Padding MUST be taken into account
 456         when designing an application that parses Streams backwards,
 457         and the application supports concatenated Streams.
 458
 459
 460 3. Block
 461
 462         +==============+=================+===============+=======+
 463         | Block Header | Compressed Data | Block Padding | Check |
 464         +==============+=================+===============+=======+
 465
 466
 467 3.1. Block Header
 468
 469         +-------------------+-------------+=================+
 470         | Block Header Size | Block Flags | Compressed Size |
 471         +-------------------+-------------+=================+
 472
 473              +===================+======================+
 474         ---> | Uncompressed Size | List of Filter Flags |
 475              +===================+======================+
 476
 477              +================+--+--+--+--+
 478         ---> | Header Padding |   CRC32   |
 479              +================+--+--+--+--+
 480
 481
 482 3.1.1. Block Header Size
 483
 484         This field overlaps with the Index Indicator field (see
 485         Section 4.1).
 486
 487         This field contains the size of the Block Header field,
 488         including the Block Header Size field itself. Valid values are
 489         in the range [0x01, 0xFF], which indicate the size of the Block
 490         Header as multiples of four bytes, minimum size being eight
 491         bytes:
 492
 493             real_header_size = (encoded_header_size + 1) * 4;
 494
 495         If bigger Block Header is needed in future, a new field can be
 496         added between the current Block Header and Compressed Data
 497         fields. The presence of this new field would be indicated in
 498         the Block Header.
 499
 500
 501 3.1.2. Block Flags
 502
 503         The first byte of the Block Flags field is a bit field:
 504
 505             Bit(s)  Mask  Description
 506              0-1    0x03  Number of filters (1-4)
 507              2-5    0x3C  Reserved for future use; MUST be zero for now.
 508               6     0x40  The Compressed Size field is present.
 509               7     0x80  The Uncompressed Size field is present.
 510
 511         If any reserved bit is set, the decoder MUST indicate an error.
 512         It is possible that there is a new field present which the
 513         decoder is not aware of, and can thus parse the Block Header
 514         incorrectly.
 515
 516
 517 3.1.3. Compressed Size
 518
 519         This field is present only if the appropriate bit is set in
 520         the Block Flags field (see Section 3.1.2).
 521
 522         The Compressed Size field contains the size of the Compressed
 523         Data field, which MUST be non-zero. Compressed Size is stored
 524         using the encoding described in Section 1.2. If the Compressed
 525         Size doesn't match the size of the Compressed Data field, the
 526         decoder MUST indicate an error.
 527
 528
 529 3.1.4. Uncompressed Size
 530
 531         This field is present only if the appropriate bit is set in
 532         the Block Flags field (see Section 3.1.2).
 533
 534         The Uncompressed Size field contains the size of the Block
 535         after uncompressing. Uncompressed Size is stored using the
 536         encoding described in Section 1.2. If the Uncompressed Size
 537         does not match the real uncompressed size, the decoder MUST
 538         indicate an error.
 539
 540         Storing the Compressed Size and Uncompressed Size fields serves
 541         several purposes:
 542           - The decoder knows how much memory it needs to allocate
 543             for a temporary buffer in multithreaded mode.
 544           - Simple error detection: wrong size indicates a broken file.
 545           - Seeking forwards to a specific location in streamed mode.
 546
 547         It should be noted that the only reliable way to determine
 548         the real uncompressed size is to uncompress the Block,
 549         because the Block Header and Index fields may contain
 550         (intentionally or unintentionally) invalid information.
 551
 552
 553 3.1.5. List of Filter Flags
 554
 555         +================+================+     +================+
 556         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 557         +================+================+     +================+
 558
 559         The number of Filter Flags fields is stored in the Block Flags
 560         field (see Section 3.1.2).
 561
 562         The format of each Filter Flags field is as follows:
 563
 564             +===========+====================+===================+
 565             | Filter ID | Size of Properties | Filter Properties |
 566             +===========+====================+===================+
 567
 568         Both Filter ID and Size of Properties are stored using the
 569         encoding described in Section 1.2. Size of Properties indicates
 570         the size of the Filter Properties field as bytes. The list of
 571         officially defined Filter IDs and the formats of their Filter
 572         Properties are described in Section 5.3.
 573
 574         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 575         (2^62) are reserved for implementation-specific internal use.
 576         These Filter IDs MUST never be used in List of Filter Flags.
 577
 578
 579 3.1.6. Header Padding
 580
 581         This field contains as many null byte as it is needed to make
 582         the Block Header have the size specified in Block Header Size.
 583         If any of the bytes are not null bytes, the decoder MUST
 584         indicate an error. It is possible that there is a new field
 585         present which the decoder is not aware of, and can thus parse
 586         the Block Header incorrectly.
 587
 588
 589 3.1.7. CRC32
 590
 591         The CRC32 is calculated over everything in the Block Header
 592         field except the CRC32 field itself. It is stored as an
 593         unsigned 32-bit little endian integer. If the calculated
 594         value does not match the stored one, the decoder MUST indicate
 595         an error.
 596
 597         By verifying the CRC32 of the Block Header before parsing the
 598         actual contents allows the decoder to distinguish between
 599         corrupt and unsupported files.
 600
 601
 602 3.2. Compressed Data
 603
 604         The format of Compressed Data depends on Block Flags and List
 605         of Filter Flags. Excluding the descriptions of the simplest
 606         filters in Section 5.3, the format of the filter-specific
 607         encoded data is out of scope of this document.
 608
 609
 610 3.3. Block Padding
 611
 612         Block Padding MUST contain 0-3 null bytes to make the size of
 613         the Block a multiple of four bytes. This can be needed when
 614         the size of Compressed Data is not a multiple of four. If any
 615         of the bytes in Block Padding are not null bytes, the decoder
 616         MUST indicate an error.
 617
 618
 619 3.4. Check
 620
 621         The type and size of the Check field depends on which bits
 622         are set in the Stream Flags field (see Section 2.1.1.2).
 623
 624         The Check, when used, is calculated from the original
 625         uncompressed data. If the calculated Check does not match the
 626         stored one, the decoder MUST indicate an error. If the selected
 627         type of Check is not supported by the decoder, it SHOULD
 628         indicate a warning or error.
 629
 630
 631 4. Index
 632
 633         +-----------------+===================+
 634         | Index Indicator | Number of Records |
 635         +-----------------+===================+
 636
 637              +=================+===============+-+-+-+-+
 638         ---> | List of Records | Index Padding | CRC32 |
 639              +=================+===============+-+-+-+-+
 640
 641         Index serves several purposes. Using it, one can
 642           - verify that all Blocks in a Stream have been processed;
 643           - find out the uncompressed size of a Stream; and
 644           - quickly access the beginning of any Block (random access).
 645
 646
 647 4.1. Index Indicator
 648
 649         This field overlaps with the Block Header Size field (see
 650         Section 3.1.1). The value of Index Indicator is always 0x00.
 651
 652
 653 4.2. Number of Records
 654
 655         This field indicates how many Records there are in the List
 656         of Records field, and thus how many Blocks there are in the
 657         Stream. The value is stored using the encoding described in
 658         Section 1.2. If the decoder has decoded all the Blocks of the
 659         Stream, and then notices that the Number of Records doesn't
 660         match the real number of Blocks, the decoder MUST indicate an
 661         error.
 662
 663
 664 4.3. List of Records
 665
 666         List of Records consists of as many Records as indicated by the
 667         Number of Records field:
 668
 669             +========+========+
 670             | Record | Record | ...
 671             +========+========+
 672
 673         Each Record contains information about one Block:
 674
 675             +===============+===================+
 676             | Unpadded Size | Uncompressed Size |
 677             +===============+===================+
 678
 679         If the decoder has decoded all the Blocks of the Stream, it
 680         MUST verify that the contents of the Records match the real
 681         Unpadded Size and Uncompressed Size of the respective Blocks.
 682
 683         Implementation hint: It is possible to verify the Index with
 684         constant memory usage by calculating for example SHA-256 of
 685         both the real size values and the List of Records, then
 686         comparing the hash values. Implementing this using
 687         non-cryptographic hash like CRC32 SHOULD be avoided unless
 688         small code size is important.
 689
 690         If the decoder supports random-access reading, it MUST verify
 691         that Unpadded Size and Uncompressed Size of every completely
 692         decoded Block match the sizes stored in the Index. If only
 693         partial Block is decoded, the decoder MUST verify that the
 694         processed sizes don't exceed the sizes stored in the Index.
 695
 696
 697 4.3.1. Unpadded Size
 698
 699         This field indicates the size of the Block excluding the Block
 700         Padding field. That is, Unpadded Size is the size of the Block
 701         Header, Compressed Data, and Check fields. Unpadded Size is
 702         stored using the encoding described in Section 1.2. The value
 703         MUST never be zero; with the current structure of Blocks, the
 704         actual minimum value for Unpadded Size is five.
 705
 706         Implementation note: Because the size of the Block Padding
 707         field is not included in Unpadded Size, calculating the total
 708         size of a Stream or doing random-access reading requires
 709         calculating the actual size of the Blocks by rounding Unpadded
 710         Sizes up to the next multiple of four.
 711
 712         The reason to exclude Block Padding from Unpadded Size is to
 713         ease making a raw copy of Compressed Data without Block
 714         Padding. This can be useful, for example, if someone wants
 715         to convert Streams to some other file format quickly.
 716
 717
 718 4.3.2. Uncompressed Size
 719
 720         This field indicates the Uncompressed Size of the respective
 721         Block as bytes. The value is stored using the encoding
 722         described in Section 1.2.
 723
 724
 725 4.4. Index Padding
 726
 727         This field MUST contain 0-3 null bytes to pad the Index to
 728         a multiple of four bytes. If any of the bytes are not null
 729         bytes, the decoder MUST indicate an error.
 730
 731
 732 4.5. CRC32
 733
 734         The CRC32 is calculated over everything in the Index field
 735         except the CRC32 field itself. The CRC32 is stored as an
 736         unsigned 32-bit little endian integer. If the calculated
 737         value does not match the stored one, the decoder MUST indicate
 738         an error.
 739
 740
 741 5. Filter Chains
 742
 743         The Block Flags field defines how many filters are used. When
 744         more than one filter is used, the filters are chained; that is,
 745         the output of one filter is the input of another filter. The
 746         following figure illustrates the direction of data flow.
 747
 748                     v   Uncompressed Data   ^
 749                     |       Filter 0        |
 750             Encoder |       Filter 1        | Decoder
 751                     |       Filter n        |
 752                     v    Compressed Data    ^
 753
 754
 755 5.1. Alignment
 756
 757         Alignment of uncompressed input data is usually the job of
 758         the application producing the data. For example, to get the
 759         best results, an archiver tool should make sure that all
 760         PowerPC executable files in the archive stream start at
 761         offsets that are multiples of four bytes.
 762
 763         Some filters, for example LZMA2, can be configured to take
 764         advantage of specified alignment of input data. Note that
 765         taking advantage of aligned input can be beneficial also when
 766         a filter is not the first filter in the chain. For example,
 767         if you compress PowerPC executables, you may want to use the
 768         PowerPC filter and chain that with the LZMA2 filter. Because
 769         not only the input but also the output alignment of the PowerPC
 770         filter is four bytes, it is now beneficial to set LZMA2
 771         settings so that the LZMA2 encoder can take advantage of its
 772         four-byte-aligned input data.
 773
 774         The output of the last filter in the chain is stored to the
 775         Compressed Data field, which is is guaranteed to be aligned
 776         to a multiple of four bytes relative to the beginning of the
 777         Stream. This can increase
 778           - speed, if the filtered data is handled multiple bytes at
 779             a time by the filter-specific encoder and decoder,
 780             because accessing aligned data in computer memory is
 781             usually faster; and
 782           - compression ratio, if the output data is later compressed
 783             with an external compression tool.
 784
 785
 786 5.2. Security
 787
 788         If filters would be allowed to be chained freely, it would be
 789         possible to create malicious files, that would be very slow to
 790         decode. Such files could be used to create denial of service
 791         attacks.
 792
 793         Slow files could occur when multiple filters are chained:
 794
 795             v   Compressed input data
 796             |   Filter 1 decoder (last filter)
 797             |   Filter 0 decoder (non-last filter)
 798             v   Uncompressed output data
 799
 800         The decoder of the last filter in the chain produces a lot of
 801         output from little input. Another filter in the chain takes the
 802         output of the last filter, and produces very little output
 803         while consuming a lot of input. As a result, a lot of data is
 804         moved inside the filter chain, but the filter chain as a whole
 805         gets very little work done.
 806
 807         To prevent this kind of slow files, there are restrictions on
 808         how the filters can be chained. These restrictions MUST be
 809         taken into account when designing new filters.
 810
 811         The maximum number of filters in the chain has been limited to
 812         four, thus there can be at maximum of three non-last filters.
 813         Of these three non-last filters, only two are allowed to change
 814         the size of the data.
 815
 816         The non-last filters, that change the size of the data, MUST
 817         have a limit how much the decoder can compress the data: the
 818         decoder SHOULD produce at least n bytes of output when the
 819         filter is given 2n bytes of input. This  limit is not
 820         absolute, but significant deviations MUST be avoided.
 821
 822         The above limitations guarantee that if the last filter in the
 823         chain produces 4n bytes of output, the chain as a whole will
 824         produce at least n bytes of output.
 825
 826
 827 5.3. Filters
 828
 829 5.3.1. LZMA2
 830
 831         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
 832         compression algorithm with high compression ratio and fast
 833         decompression. LZMA is based on LZ77 and range coding
 834         algorithms.
 835
 836         LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
 837         LZMA internally, but adds support for flushing the encoder,
 838         uncompressed chunks, eases stateful decoder implementations,
 839         and improves support for multithreading. Thus, the plain LZMA
 840         will not be supported in this file format.
 841
 842             Filter ID:                  0x21
 843             Size of Filter Properties:  1 byte
 844             Changes size of data:       Yes
 845             Allow as a non-last filter: No
 846             Allow as the last filter:   Yes
 847
 848             Preferred alignment:
 849                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 850                 Output data:            1 byte
 851
 852         The format of the one-byte Filter Properties field is as
 853         follows:
 854
 855             Bits   Mask   Description
 856             0-5    0x3F   Dictionary Size
 857             6-7    0xC0   Reserved for future use; MUST be zero for now.
 858
 859         Dictionary Size is encoded with one-bit mantissa and five-bit
 860         exponent. The smallest dictionary size is 4 KiB and the biggest
 861         is 4 GiB.
 862
 863             Raw value   Mantissa   Exponent   Dictionary size
 864                 0           2         11         4 KiB
 865                 1           3         11         6 KiB
 866                 2           2         12         8 KiB
 867                 3           3         12        12 KiB
 868                 4           2         13        16 KiB
 869                 5           3         13        24 KiB
 870                 6           2         14        32 KiB
 871               ...         ...        ...      ...
 872                35           3         27       768 MiB
 873                36           2         28      1024 MiB
 874                37           3         29      1536 MiB
 875                38           2         30      2048 MiB
 876                39           3         30      3072 MiB
 877                40           2         31      4096 MiB - 1 B
 878
 879         Instead of having a table in the decoder, the dictionary size
 880         can be decoded using the following C code:
 881
 882             const uint8_t bits = get_dictionary_flags() & 0x3F;
 883             if (bits > 40)
 884                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 885
 886             uint32_t dictionary_size;
 887             if (bits == 40) {
 888                 dictionary_size = UINT32_MAX;
 889             } else {
 890                 dictionary_size = 2 | (bits & 1);
 891                 dictionary_size <<= bits / 2 + 11;
 892             }
 893
 894
 895 5.3.2. Branch/Call/Jump Filters for Executables
 896
 897         These filters convert relative branch, call, and jump
 898         instructions to their absolute counterparts in executable
 899         files. This conversion increases redundancy and thus
 900         compression ratio.
 901
 902             Size of Filter Properties:  0 or 4 bytes
 903             Changes size of data:       No
 904             Allow as a non-last filter: Yes
 905             Allow as the last filter:   No
 906
 907         Below is the list of filters in this category. The alignment
 908         is the same for both input and output data.
 909
 910             Filter ID   Alignment   Description
 911               0x04       1 byte     x86 filter (BCJ)
 912               0x05       4 bytes    PowerPC (big endian) filter
 913               0x06      16 bytes    IA64 filter
 914               0x07       4 bytes    ARM (little endian) filter
 915               0x08       2 bytes    ARM Thumb (little endian) filter
 916               0x09       4 bytes    SPARC filter
 917
 918         If the size of Filter Properties is four bytes, the Filter
 919         Properties field contains the start offset used for address
 920         conversions. It is stored as an unsigned 32-bit little endian
 921         integer. The start offset MUST be a multiple of the alignment
 922         of the filter as listed in the table above; if it isn't, the
 923         decoder MUST indicate an error. If the size of Filter
 924         Properties is zero, the start offset is zero.
 925
 926         Setting the start offset may be useful if an executable has
 927         multiple sections, and there are many cross-section calls.
 928         Taking advantage of this feature usually requires usage of
 929         the Subblock filter, whose design is not complete yet.
 930
 931
 932 5.3.3. Delta
 933
 934         The Delta filter may increase compression ratio when the value
 935         of the next byte correlates with the value of an earlier byte
 936         at specified distance.
 937
 938             Filter ID:                  0x03
 939             Size of Filter Properties:  1 byte
 940             Changes size of data:       No
 941             Allow as a non-last filter: Yes
 942             Allow as the last filter:   No
 943
 944             Preferred alignment:
 945                 Input data:             1 byte
 946                 Output data:            Same as the original input data
 947
 948         The Properties byte indicates the delta distance, which can be
 949         1-256 bytes backwards from the current byte: 0x00 indicates
 950         distance of 1 byte and 0xFF distance of 256 bytes.
 951
 952
 953 5.3.3.1. Format of the Encoded Output
 954
 955         The code below illustrates both encoding and decoding with
 956         the Delta filter.
 957
 958             // Distance is in the range [1, 256].
 959             const unsigned int distance = get_properties_byte() + 1;
 960             uint8_t pos = 0;
 961             uint8_t delta[256];
 962
 963             memset(delta, 0, sizeof(delta));
 964
 965             while (1) {
 966                 const int byte = read_byte();
 967                 if (byte == EOF)
 968                     break;
 969
 970                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 971                 if (is_encoder) {
 972                     tmp = (uint8_t)(byte) - tmp;
 973                     delta[pos] = (uint8_t)(byte);
 974                 } else {
 975                     tmp = (uint8_t)(byte) + tmp;
 976                     delta[pos] = tmp;
 977                 }
 978
 979                 write_byte(tmp);
 980                 --pos;
 981             }
 982
 983
 984 5.4. Custom Filter IDs
 985
 986         If a developer wants to use custom Filter IDs, he has two
 987         choices. The first choice is to contact Lasse Collin and ask
 988         him to allocate a range of IDs for the developer.
 989
 990         The second choice is to generate a 40-bit random integer,
 991         which the developer can use as his personal Developer ID.
 992         To minimize the risk of collisions, Developer ID has to be
 993         a randomly generated integer, not manually selected "hex word".
 994         The following command, which works on many free operating
 995         systems, can be used to generate Developer ID:
 996
 997             dd if=/dev/urandom bs=5 count=1 | hexdump
 998
 999         The developer can then use his Developer ID to create unique
1000         (well, hopefully unique) Filter IDs.
1001
1002             Bits    Mask                    Description
1003              0-15   0x0000_0000_0000_FFFF   Filter ID
1004             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1005             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1006
1007         The resulting 63-bit integer will use 9 bytes of space when
1008         stored using the encoding described in Section 1.2. To get
1009         a shorter ID, see the beginning of this Section how to
1010         request a custom ID range.
1011
1012
1013 5.4.1. Reserved Custom Filter ID Ranges
1014
1015         Range                       Description
1016         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1017         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1018         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1019
1020
1021 6. Cyclic Redundancy Checks
1022
1023         There are several incompatible variations to calculate CRC32
1024         and CRC64. For simplicity and clarity, complete examples are
1025         provided to calculate the checks as they are used in this file
1026         format. Implementations MAY use different code as long as it
1027         gives identical results.
1028
1029         The program below reads data from standard input, calculates
1030         the CRC32 and CRC64 values, and prints the calculated values
1031         as big endian hexadecimal strings to standard output.
1032
1033             #include <stddef.h>
1034             #include <inttypes.h>
1035             #include <stdio.h>
1036
1037             uint32_t crc32_table[256];
1038             uint64_t crc64_table[256];
1039
1040             void
1041             init(void)
1042             {
1043                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1044                 static const uint64_t poly64
1045                         = UINT64_C(0xC96C5795D7870F42);
1046
1047                 for (size_t i = 0; i < 256; ++i) {
1048                     uint32_t crc32 = i;
1049                     uint64_t crc64 = i;
1050
1051                     for (size_t j = 0; j < 8; ++j) {
1052                         if (crc32 & 1)
1053                             crc32 = (crc32 >> 1) ^ poly32;
1054                         else
1055                             crc32 >>= 1;
1056
1057                         if (crc64 & 1)
1058                             crc64 = (crc64 >> 1) ^ poly64;
1059                         else
1060                             crc64 >>= 1;
1061                     }
1062
1063                     crc32_table[i] = crc32;
1064                     crc64_table[i] = crc64;
1065                 }
1066             }
1067
1068             uint32_t
1069             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1070             {
1071                 crc = ~crc;
1072                 for (size_t i = 0; i < size; ++i)
1073                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1074                             ^ (crc >> 8);
1075                 return ~crc;
1076             }
1077
1078             uint64_t
1079             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1080             {
1081                 crc = ~crc;
1082                 for (size_t i = 0; i < size; ++i)
1083                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1084                             ^ (crc >> 8);
1085                 return ~crc;
1086             }
1087
1088             int
1089             main()
1090             {
1091                 init();
1092
1093                 uint32_t value32 = 0;
1094                 uint64_t value64 = 0;
1095                 uint64_t total_size = 0;
1096                 uint8_t buf[8192];
1097
1098                 while (1) {
1099                     const size_t buf_size
1100                             = fread(buf, 1, sizeof(buf), stdin);
1101                     if (buf_size == 0)
1102                         break;
1103
1104                     total_size += buf_size;
1105                     value32 = crc32(buf, buf_size, value32);
1106                     value64 = crc64(buf, buf_size, value64);
1107                 }
1108
1109                 printf("Bytes:  %" PRIu64 "\n", total_size);
1110                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1111                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1112
1113                 return 0;
1114             }
1115
1116
1117 7. References
1118
1119         LZMA SDK - The original LZMA implementation
1120         http://7-zip.org/sdk.html
1121
1122         LZMA Utils - LZMA adapted to POSIX-like systems
1123         http://tukaani.org/lzma/
1124
1125         XZ Utils - The next generation of LZMA Utils
1126         http://tukaani.org/xz/
1127
1128         [RFC-1952]
1129         GZIP file format specification version 4.3
1130         http://www.ietf.org/rfc/rfc1952.txt
1131           - Notation of byte boxes in section "2.1. Overall conventions"
1132
1133         [RFC-2119]
1134         Key words for use in RFCs to Indicate Requirement Levels
1135         http://www.ietf.org/rfc/rfc2119.txt
1136
1137         [GNU-tar]
1138         GNU tar 1.21 manual
1139         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1140           - Node 9.4.2 "Blocking Factor", paragraph that begins
1141             "gzip will complain about trailing garbage"
1142           - Note that this URL points to the latest version of the
1143             manual, and may some day not contain the note which is in
1144             1.21. For the exact version of the manual, download GNU
1145             tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1146