doc/file-format.txt

   1
   2 The .lzma File Format
   3 ---------------------
   4
   5         0. Preface
   6           0.1. Copyright Notices
   7           0.2. Changes
   8         1. Conventions
   9           1.1. Byte and Its Representation
  10           1.2. Multibyte Integers
  11         2. Stream
  12           2.1. Stream Types
  13             2.1.1. Single-Block Stream
  14             2.1.2. Multi-Block Stream
  15           2.2. Stream Header
  16             2.2.1. Header Magic Bytes
  17             2.2.2. Stream Flags
  18             2.2.3. CRC32
  19         3. Block
  20           3.1. Block Header
  21             3.1.1. Block Flags
  22             3.1.2. Compressed Size
  23             3.1.3. Uncompressed Size
  24             3.1.4. List of Filter Flags
  25               3.1.4.1. Misc
  26               3.1.4.2. External ID
  27               3.1.4.3. External Size of Properties
  28               3.1.4.4. Filter Properties
  29             3.1.5. CRC32
  30             3.1.6. Header Padding
  31           3.2. Compressed Data
  32           3.3. Block Footer
  33             3.3.1. Check
  34             3.3.2. Stream Footer
  35               3.3.2.1. Uncompressed Size
  36               3.3.2.2. Backward Size
  37               3.3.2.3. Stream Flags
  38               3.3.2.4. Footer Magic Bytes
  39             3.3.3. Footer Padding
  40         4. Filters
  41           4.1. Detecting when All Data Has Been Decoded
  42             4.1.1. With Uncompressed Size
  43             4.1.2. With End of Input
  44             4.1.3. With End of Payload Marker
  45           4.2. Alignment
  46           4.3. Filters
  47             4.3.1. Copy
  48             4.3.2. Subblock
  49               4.3.2.1. Format of the Encoded Output
  50             4.3.3. Delta
  51               4.3.3.1. Format of the Encoded Output
  52             4.3.4. LZMA
  53               4.3.4.1. LZMA Properties
  54               4.3.4.2. Dictionary Flags
  55             4.3.5. Branch/Call/Jump Filters for Executables
  56         5. Metadata
  57           5.1. Metadata Flags
  58           5.2. Size of Header Metadata Block
  59           5.3. Total Size
  60           5.4. Uncompressed Size
  61           5.5. Index
  62             5.5.1. Number of Data Blocks
  63             5.5.2. Total Sizes
  64             5.5.3. Uncompressed Sizes
  65           5.6. Extra
  66             5.6.1. 0x00: Dummy/Padding
  67             5.6.2. 0x01: OpenPGP Signature
  68             5.6.3. 0x02: Filter Information
  69             5.6.4. 0x03: Comment
  70             5.6.5. 0x04: List of Checks
  71             5.6.6. 0x05: Original Filename
  72             5.6.7. 0x07: Modification Time
  73             5.6.8. 0x09: High-Resolution Modification Time
  74             5.6.9. 0x0B: MIME Type
  75             5.6.10. 0x0D: Homepage URL
  76         6. Custom Filter and Extra Record IDs
  77           6.1. Reserved Custom Filter ID Ranges
  78         7. Cyclic Redundancy Checks
  79         8. References
  80           8.1. Normative References
  81           8.2. Informative References
  82
  83
  84 0. Preface
  85
  86         This document describes the .lzma file format (filename suffix
  87         `.lzma', MIME type `application/x-lzma'). It is intended that
  88         this format replace the format used by the LZMA_Alone tool
  89         included in LZMA SDK up to and including version 4.57.
  90
  91         IMPORTANT:  The version described in this document is a
  92                     draft, NOT a final, official version. Changes
  93                     are possible.
  94
  95
  96 0.1. Copyright Notices
  97
  98         Copyright (C) 2006, 2007 Lasse Collin <lasse.collin@tukaani.org>
  99         Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>
 100
 101         Copying and distribution of this file, with or without
 102         modification, are permitted in any medium without royalty
 103         provided the copyright notice and this notice are preserved.
 104         Modified versions must be marked as such.
 105
 106         All source code examples given in this document are put into
 107         the public domain by the authors of this document.
 108
 109         Thanks for helping with this document goes to Igor Pavlov,
 110         Mark Adler and Mikko Pouru.
 111
 112
 113 0.2. Changes
 114
 115         Last modified: 2008-02-01 19:25+0200
 116
 117         (A changelog will be kept once the first official version
 118         is made.)
 119
 120
 121 1. Conventions
 122
 123         The keywords `must', `must not', `required', `should',
 124         `should not', `recommended', `may', and `optional' in this
 125         document are to be interpreted as described in [RFC-2119].
 126         These words are not capitalized in this document.
 127
 128         Indicating a warning means displaying a message, returning
 129         appropriate exit status, or something else to let the user
 130         know that something worth warning occurred. The operation
 131         should still finish if a warning is indicated.
 132
 133         Indicating an error means displaying a message, returning
 134         appropriate exit status, or something else to let the user
 135         know that something prevented successfully finishing the
 136         operation. The operation must be aborted once an error has
 137         been indicated.
 138
 139
 140 1.1. Byte and Its Representation
 141
 142         In this document, byte is always 8 bits.
 143
 144         A `nul byte' has all bits unset. That is, the value of a nul
 145         byte is 0x00.
 146
 147         To represent byte blocks, this document uses notation that
 148         is similar to the notation used in [RFC-1952]:
 149
 150             +-------+
 151             |  Foo  |   One byte.
 152             +-------+
 153
 154             +---+---+
 155             |  Foo  |   Two bytes; that is, some of the vertical bars
 156             +---+---+   can be missing.
 157
 158             +=======+
 159             |  Foo  |   Zero or more bytes.
 160             +=======+
 161
 162         In this document, a boxed byte or a byte sequence declared
 163         using this notation is called `a field'. The example field
 164         above would be called called `the Foo field' or plain `Foo'.
 165
 166
 167 1.2. Multibyte Integers
 168
 169         Multibyte integers of static length, such as CRC values,
 170         are stored in little endian byte order (least significant
 171         byte first).
 172
 173         When smaller values are more likely than bigger values (e.g.
 174         file sizes), multibyte integers are encoded in a simple
 175         variable-length representation:
 176           - Numbers in the range [0, 127] are copied as is, and take
 177             one byte of space.
 178           - Bigger numbers will occupy two or more bytes. The lowest
 179             seven bits of every byte are used for data; the highest
 180             (eighth) bit indicates either that
 181               0) the byte is in the middle of the byte sequence, or
 182               1) the byte is the first or the last byte.
 183
 184         For now, the value of the variable-length integers is limited
 185         to 63 bits, which limits the encoded size of the integer to
 186         nine bytes. These limits may be increased in future if needed.
 187
 188         Note that the encoding is not as optimal as it could be. For
 189         example, it is possible to encode the number 42 using any
 190         number of bytes between one and nine. This is convenient
 191         for non-streamed encoders, that write Compressed Size or
 192         Uncompressed Size fields to the Block Header (see Section 3.1)
 193         after the Compressed Data field is written to the disk.
 194
 195         In several situations, the decoder needs to compare that two
 196         fields contain identical information. When comparing fields
 197         using the encoding described in this Section, the decoder must
 198         consider two fields identical if their decoded values are
 199         identical; it does not matter if the encoded variable-length
 200         representations differ.
 201
 202         The following C code illustrates encoding and decoding 63-bit
 203         variables; the highest bit of uint64_t must be unset. The
 204         functions return the number of bytes occupied by the integer
 205         (1-9), or zero on error.
 206
 207             #include <sys/types.h>
 208             #include <inttypes.h>
 209
 210             size_t
 211             encode(uint8_t buf[static 9], uint64_t num)
 212             {
 213                 if (num >= (UINT64_C(1) << (9 * 7)))
 214                     return 0;
 215                 if (num <= 0x7F) {
 216                     buf[0] = num;
 217                     return 1;
 218                 }
 219                 buf[0] = (num & 0x7F) | 0x80;
 220                 num >>= 7;
 221                 size_t i = 1;
 222                 while (num >= 0x80) {
 223                     buf[i++] = num & 0x7F;
 224                     num >>= 7;
 225                 }
 226                 buf[i++] = num | 0x80;
 227                 return i;
 228             }
 229
 230             size_t
 231             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 232             {
 233                 if (size_max == 0)
 234                     return 0;
 235                 if (size_max > 9)
 236                     size_max = 9;
 237                 *num = buf[0] & 0x7F;
 238                 if (!(buf[0] & 0x80))
 239                     return 1;
 240                 size_t i = 1;
 241                 do {
 242                     if (i == size_max)
 243                         return 0;
 244                     *num |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
 245                 } while (!(buf[i++] & 0x80));
 246                 return i;
 247             }
 248
 249             size_t
 250             decode_reverse(const uint8_t buf[], size_t size_max,
 251                     uint64_t *num)
 252             {
 253                 if (size_max == 0)
 254                     return 0;
 255                 const size_t end = size_max > 9 ? size_max - 9 : 0;
 256                 size_t i = size_max - 1;
 257                 *num = buf[i] & 0x7F;
 258                 if (!(buf[i] & 0x80))
 259                     return 1;
 260                 do {
 261                     if (i-- == end)
 262                         return 0;
 263                     *num <<= 7;
 264                     *num |= buf[i] & 0x7F;
 265                 } while (!(buf[i] & 0x80));
 266                 return size_max - i;
 267             }
 268
 269
 270 2. Stream
 271
 272         +========+========+========+
 273         | Stream | Stream | Stream | ...
 274         +========+========+========+
 275
 276         A file contains usually only one Stream. However, it is
 277         possible to concatenate multiple Streams together with no
 278         additional processing. It is up to the implementation to
 279         decide if the decoder will continue decoding from the next
 280         Stream once the end of the first Stream has been reached.
 281
 282
 283 2.1. Stream Types
 284
 285         There are two types of Streams: Single-Block Streams and
 286         Multi-Block Streams. Decoders conforming to this specification
 287         must support at least Single-Block Streams. Supporting
 288         Multi-Block Streams is optional. If the decoder supports only
 289         Single-Block Streams, the documentation of the decoder should
 290         mention this fact clearly.
 291
 292
 293 2.1.1. Single-Block Stream
 294
 295         +===============+============+
 296         | Stream Header | Data Block |
 297         +===============+============+
 298
 299         As the name says, a Single-Block Stream has exactly one Block.
 300         The Block must be a Data Block; Metadata Blocks are not allowed
 301         in Single-Block Streams.
 302
 303
 304 2.1.2. Multi-Block Stream
 305
 306         +===============+=======================+
 307         | Stream Header | Header Metadata Block |
 308         +===============+=======================+
 309
 310              +============+     +============+=======================+
 311         ---> | Data Block | ... | Data Block | Footer Metadata Block |
 312              +============+     +============+=======================+
 313
 314         Notes:
 315           - Stream Header is mandatory.
 316           - Header Metadata Block is optional.
 317           - Each Multi-Block Stream has at least one Data Block. The
 318             maximum number of Data Blocks is not limited.
 319           - Footer Metadata Block is mandatory.
 320
 321
 322 2.2. Stream Header
 323
 324         +---+---+---+---+---+---+--------------+--+--+--+--+
 325         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 326         +---+---+---+---+---+---+--------------+--+--+--+--+
 327
 328
 329 2.2.1. Header Magic Bytes
 330
 331         The first six (6) bytes of the Stream are so called Header
 332         Magic Bytes. They can be used to identify the file type.
 333
 334             Using a C array and ASCII:
 335             const uint8_t HEADER_MAGIC[6]
 336                     = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 };
 337
 338             In plain hexadecimal:
 339             FF 4C 5A 4D 41 00
 340
 341         Notes:
 342           - The first byte (0xFF) was chosen so that the files cannot
 343             be erroneously detected as being in LZMA_Alone format, in
 344             which the first byte is in the the range [0x00, 0xE0].
 345           - The sixth byte (0x00) was chosen to prevent applications
 346             from misdetecting the file as a text file.
 347
 348
 349 2.2.2. Stream Flags
 350
 351         Bit(s)  Mask  Description
 352          0-2    0x07  Type of Check (see Section 3.3.1):
 353                           ID    Size      Check name
 354                           0x00   0 bytes  None
 355                           0x01   4 bytes  CRC32
 356                           0x02   4 bytes  (Reserved)
 357                           0x03   8 bytes  CRC64
 358                           0x04  16 bytes  (Reserved)
 359                           0x05  32 bytes  SHA-256
 360                           0x06  32 bytes  (Reserved)
 361                           0x07  64 bytes  (Reserved)
 362           3     0x08  The CRC32 field is present in Block Headers.
 363           4     0x10  If unset, this is a Single-Block Stream; if set,
 364                       this is a Multi-Block Stream.
 365          5-7    0xE0  Reserved for future use; must be zero for now.
 366
 367         Implementations must support at least the Check IDs 0x00 (None)
 368         and 0x01 (CRC32). Supporting other Check IDs is optional. If an
 369         unsupported Check is used, the decoder must indicate a warning
 370         or error.
 371
 372         If any reserved bit is set, the decoder must indicate an error.
 373         It is possible that there is a new field present which the
 374         decoder is not aware of, and can thus parse the Stream Header
 375         incorrectly.
 376
 377
 378 2.2.3. CRC32
 379
 380         The CRC32 is calculated from the Stream Flags field. It is
 381         stored as an unsigned 32-bit little endian integer. If the
 382         calculated value does not match the stored one, the decoder
 383         must indicate an error.
 384
 385         Note that this field is always present; the bit in Stream Flags
 386         controls only presence of CRC32 in Block Headers.
 387
 388
 389 3. Block
 390
 391         +==============+=================+==============+
 392         | Block Header | Compressed Data | Block Footer |
 393         +==============+=================+==============+
 394
 395         There are two types of Blocks:
 396           - Data Blocks hold the actual compressed data.
 397           - Metadata Blocks hold the Index, Extra, and a few other
 398             non-data fields (see Section 5).
 399
 400         The type of the Block is indicated by the corresponding bit
 401         in the Block Flags field (see Section 3.1.1).
 402
 403
 404 3.1. Block Header
 405
 406         +------+------+=================+===================+
 407         | Block Flags | Compressed Size | Uncompressed Size |
 408         +------+------+=================+===================+
 409
 410              +======================+--+--+--+--+================+
 411         ---> | List of Filter Flags |   CRC32   | Header Padding |
 412              +======================+--+--+--+--+================+
 413
 414
 415 3.1.1. Block Flags
 416
 417         The first byte of the Block Flags field is a bit field:
 418
 419             Bit(s)  Mask  Description
 420              0-2    0x07  Number of filters (0-7)
 421               3     0x08  Use End of Payload Marker (even if
 422                           Uncompressed Size is stored to Block Header).
 423               4     0x10  The Compressed Size field is present.
 424               5     0x20  The Uncompressed Size field is present.
 425               6     0x40  Reserved for future use; must be zero for now.
 426               7     0x80  This is a Metadata Block.
 427
 428         The second byte of the Block Flags field is also a bit field:
 429
 430             Bit(s)  Mask  Description
 431              0-4    0x1F  Size of the Header Padding field (0-31 bytes)
 432              5-7    0xE0  Reserved for future use; must be zero for now.
 433
 434         The decoder must indicate an error if End of Payload Marker
 435         is not used and Uncompressed Size is not stored to the Block
 436         Header. Because of this, the first byte of Block Flags can
 437         never be a nul byte. This is useful when detecting beginning
 438         of the Block after Footer Padding (see Section 3.3.3).
 439
 440         If any reserved bit is set, the decoder must indicate an error.
 441         It is possible that there is a new field present which the
 442         decoder is not aware of, and can thus parse the Block Header
 443         incorrectly.
 444
 445
 446 3.1.2. Compressed Size
 447
 448         This field is present only if the appropriate bit is set in
 449         the Block Flags field (see Section 3.1.1).
 450
 451         This field contains the size of the Compressed Data field.
 452         The size is stored using the encoding described in Section 1.2.
 453         If the Compressed Size does not match the real size of the
 454         Compressed Data field, the decoder must indicate an error.
 455
 456         Having the Compressed Size field in the Block Header can be
 457         useful for multithreaded decoding when seeking is not possible.
 458         If the Blocks are small enough, the decoder can read multiple
 459         Blocks into its internal buffer, and decode the Blocks in
 460         parallel.
 461
 462         Compressed Size can also be useful when seeking forwards to
 463         a specific location in streamed mode: the decoder can quickly
 464         skip over irrelevant Blocks, without decoding them.
 465
 466
 467 3.1.3. Uncompressed Size
 468
 469         This field is present only if the appropriate bit is set in
 470         the Block Flags field (see Section 3.1.1).
 471
 472         The Uncompressed Size field contains the size of the Block
 473         after uncompressing.
 474
 475         Storing Uncompressed Size serves several purposes:
 476           - The decoder will know when all of the data has been
 477             decoded without an explicit End of Payload Marker.
 478           - The decoder knows how much memory it needs to allocate
 479             for a temporary buffer in multithreaded mode.
 480           - Simple error detection: wrong size indicates a broken file.
 481           - Sometimes it is useful to know the file size without
 482             uncompressing the file.
 483
 484         It should be noted that the only reliable way to find out what
 485         the real uncompressed size is is to uncompress the Block,
 486         because the Block Header and Metadata Block fields may contain
 487         (intentionally or unintentionally) invalid information.
 488
 489         Uncompressed Size is stored using the encoding described in
 490         Section 1.2. If the Uncompressed Size does not match the
 491         real uncompressed size, the decoder must indicate an error.
 492
 493
 494 3.1.4. List of Filter Flags
 495
 496         +================+================+     +================+
 497         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 498         +================+================+     +================+
 499
 500         The number of Filter Flags fields is stored in the Block Flags
 501         field (see Section 3.1.1). As a special case, if the number of
 502         Filter Flags fields is zero, it is equivalent to having the
 503         Copy filter as the only filter.
 504
 505         The format of each Filter Flags field is as follows:
 506
 507             +------+=============+=============================+
 508             | Misc | External ID | External Size of Properties |
 509             +------+=============+=============================+
 510
 511                  +===================+
 512             ---> | Filter Properties |
 513                  +===================+
 514
 515         The list of officially defined Filter IDs and the formats of
 516         their Filter Properties are described in Section 4.3.
 517
 518
 519 3.1.4.1. Misc
 520
 521         To save space, the most commonly used Filter IDs and the
 522         Size of Filter Properties are encoded in a single byte.
 523         Depending on the contents of the Misc field, Filter ID is
 524         the value of the Misc or External ID field.
 525
 526             Value          Filter ID        Size of Filter Properties
 527             0x00 - 0x1F    Misc             0 bytes
 528             0x20 - 0x3F    Misc             1 byte
 529             0x40 - 0x5F    Misc             2 bytes
 530             0x60 - 0x7F    Misc             3 bytes
 531             0x80 - 0x9F    Misc             4 bytes
 532             0xA0 - 0xBF    Misc             5 bytes
 533             0xC0 - 0xDF    Misc             6 bytes
 534             0xE0 - 0xFE    External ID      0-30 bytes
 535             0xFF           External ID      External Size of Properties
 536
 537         The following code demonstrates parsing the Misc field and,
 538         when needed, the External ID and External Size of Properties
 539         fields.
 540
 541             uint64_t id;
 542             uint64_t properties_size;
 543             uint8_t misc = read_byte();
 544
 545             if (misc >= 0xE0) {
 546                 id = read_variable_length_integer();
 547
 548                 if (misc == 0xFF)
 549                     properties_size = read_variable_length_integer();
 550                 else
 551                     properties_size = misc - 0xE0;
 552
 553             } else {
 554                 id = misc;
 555                 properties_size = misc / 0x20;
 556             }
 557
 558
 559 3.1.4.2. External ID
 560
 561         This field is present only if the Misc field contains a value
 562         that indicates usage of External ID. The External ID is stored
 563         using the encoding described in Section 1.2.
 564
 565
 566 3.1.4.3. External Size of Properties
 567
 568         This field is present only if the Misc field contains a value
 569         that indicates usage of External Size of Properties. The size
 570         of Filter Properties is stored using the encoding described in
 571         Section 1.2.
 572
 573
 574 3.1.4.4. Filter Properties
 575
 576         Size of this field depends on the Misc field (Section 3.1.4.1)
 577         and, if present, External Size of Properties field (Section
 578         3.1.4.3). The format of this field is depends on the selected
 579         filter; see Section 4.3 for details.
 580
 581
 582 3.1.5. CRC32
 583
 584         This field is present only if the appropriate bit is set in
 585         the Stream Flags field (see Section 2.2.2).
 586
 587         The CRC32 is calculated over everything in the Block Header
 588         field except the Header Padding field and the CRC32 field
 589         itself. It is stored as an unsigned 32-bit little endian
 590         integer. If the calculated value does not match the stored
 591         one, the decoder must indicate an error.
 592
 593
 594 3.1.6. Header Padding
 595
 596         This field contains as many nul bytes as indicated by the value
 597         stored in the Header Flags field. If the Header Padding field
 598         contains any non-nul bytes, the decoder must indicate an error.
 599
 600         The intent of the Header Padding field is to allow alignment
 601         of Compressed Data. The usefulness of alignment is described
 602         in Section 4.3.
 603
 604
 605 3.2. Compressed Data
 606
 607         The format of Compressed Data depends on Block Flags and List
 608         of Filter Flags. Excluding the descriptions of the simplest
 609         filters in Section 4, the format of the filter-specific encoded
 610         data is out of scope of this document.
 611
 612         Note a special case: if End of Payload Marker (see Section
 613         3.1.1) is not used and Uncompressed Size is zero, the size
 614         of the Compressed Data field is always zero.
 615
 616
 617 3.3. Block Footer
 618
 619         +=======+===============+================+
 620         | Check | Stream Footer | Footer Padding |
 621         +=======+===============+================+
 622
 623
 624 3.3.1. Check
 625
 626         The type and size of the Check field depends on which bits
 627         are set in the Stream Flags field (see Section 2.2.2).
 628
 629         The Check, when used, is calculated from the original
 630         uncompressed data. If the calculated Check does not match the
 631         stored one, the decoder must indicate an error. If the selected
 632         type of Check is not supported by the decoder, it must indicate
 633         a warning or error.
 634
 635
 636 3.3.2. Stream Footer
 637
 638         +===================+===============+--------------+
 639         | Uncompressed Size | Backward Size | Stream Flags |
 640         +===================+===============+--------------+
 641
 642              +----------+---------+
 643         ---> | Footer Magic Bytes |
 644              +----------+---------+
 645
 646         Stream Footer is present only in
 647           - Data Block of a Single-Block Stream; and
 648           - Footer Metadata Block of a Multi-Block Stream.
 649
 650         The Stream Footer field is placed inside Block Footer, because
 651         no padding is allowed between Check and Stream Footer.
 652
 653
 654 3.3.2.1. Uncompressed Size
 655
 656         This field is present only in the Data Block of a Single-Block
 657         Stream if Uncompressed Size is not stored to the Block Header
 658         (see Section 3.1.1). Without the Uncompressed Size field in
 659         Stream Footer it would not be possible to quickly find out
 660         the Uncompressed Size of the Stream in all cases.
 661
 662         Uncompressed Size is stored using the encoding described in
 663         Section 1.2. If the stored value does not match the real
 664         uncompressed size of the Single-Block Stream, the decoder must
 665         indicate an error.
 666
 667
 668 3.3.2.2. Backward Size
 669
 670         This field contains the total size of the Block Header,
 671         Compressed Data, Check, and Uncompressed Size fields. The
 672         value is stored using the encoding described in Section 1.2.
 673         If the Backward Size does not match the real total size of
 674         the appropriate fields, the decoder must indicate an error.
 675
 676         Implementations reading the Stream backwards should notice
 677         that the value in this field can never be zero.
 678
 679
 680 3.3.2.3. Stream Flags
 681
 682         This is a copy of the Stream Flags field from the Stream
 683         Header. The information stored to Stream Flags is needed
 684         when parsing the Stream backwards.
 685
 686
 687 3.3.2.4. Footer Magic Bytes
 688
 689         As the last step of the decoding process, the decoder must
 690         verify the existence of Footer Magic Bytes. If they are not
 691         found, an error must be indicated.
 692
 693             Using a C array and ASCII:
 694             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 695
 696             In hexadecimal:
 697             59 5A
 698
 699         The primary reason to have Footer Magic Bytes is to make
 700         it easier to detect incomplete files quickly, without
 701         uncompressing. If the file does not end with Footer Magic Bytes
 702         (excluding Footer Padding described in Section 3.3.3), it
 703         cannot be undamaged, unless someone has intentionally appended
 704         garbage after the end of the Stream. (Appending garbage at the
 705         end of the file does not prevent uncompressing the file, but
 706         may give a warning or error depending on the decoder
 707         implementation.)
 708
 709
 710 3.3.3. Footer Padding
 711
 712         In certain situations it is convenient to be able to pad
 713         Blocks or Streams to be multiples of, for example, 512 bytes.
 714         Footer Padding makes this possible. Note that this is in no
 715         way required to enforce alignment in the way described in
 716         Section 4.3; the Header Padding field is enough for that.
 717
 718         When Footer Padding is used, it must contain only nul bytes.
 719         Any non-nul byte should be considered as the beginning of
 720         a new Block or Stream.
 721
 722         The possibility of Padding should be taken into account when
 723         designing an application that wants to find out information
 724         about a Stream by parsing Footer Metadata Block.
 725
 726         Support for Padding was inspired by a related note in
 727         [GNU-tar].
 728
 729
 730 4. Filters
 731
 732         The Block Flags field defines how many filters are used. When
 733         more than one filter is used, the filters are chained; that is,
 734         the output of one filter is the input of another filter. The
 735         following figure illustrates the direction of data flow.
 736
 737                     v   Uncompressed Data   ^
 738                     |       Filter 0        |
 739             Encoder |       Filter 1        | Decoder
 740                     |         ...           |
 741                     |       Filter n        |
 742                     v    Compressed Data    ^
 743
 744         The filters are independent from each other, except that they
 745         must cooperate a little to make it possible, in all cases, to
 746         detect when all of the data has been decoded. In addition, the
 747         filters should cooperate in the encoder to keep the alignment
 748         optimal.
 749
 750
 751 4.1. Detecting when All Data Has Been Decoded
 752
 753         There must be a way for the decoder to detect when all of the
 754         Compressed Data has been decoded. This is simple when only
 755         one filter is used, but a bit more complex when multiple
 756         filters are chained.
 757
 758         This file format supports three methods to detect when all of
 759         the data has been decoded:
 760           - Uncompressed size
 761           - End of Input
 762           - End of Payload Marker
 763
 764         In both encoder and decoder, filters are initialized starting
 765         from the first filter in the chain. For each filter, one of
 766         these three methods is used.
 767
 768
 769 4.1.1. With Uncompressed Size
 770
 771         This method is the only method supported by all filters.
 772         It must be used when uncompressed size is known by the
 773         filter-specific encoder or decoder. In practice this means
 774         that Uncompressed Size has been stored to the Block Header.
 775
 776         In case of the first filter in the chain, the uncompressed size
 777         given to the filter-specific encoder or decoder equals the
 778         Uncompressed Size stored in the Block Header. For the rest of
 779         the filters in the chain, uncompressed size is the size of the
 780         output data of the previous filter in the chain.
 781
 782         Note that when Use End of Payload Marker bit is set in Block
 783         Flags, Uncompressed Size is considered to be unknown even if
 784         it was present in the Block Header. Thus, if End of Payload
 785         Marker is used, uncompressed size of all of the filters in
 786         the chain is unknown, and can never be used to detect when
 787         all of the data has been decoded.
 788
 789         Once the correct number of bytes has been written out, the
 790         filter-specific decoder indicates to its caller that all of
 791         the data has been decoded. If the filter-specific decoder
 792         detects End of Input or End of Payload Marker before the
 793         correct number of bytes is decoded, the decoder must indicate
 794         an error.
 795
 796
 797 4.1.2. With End of Input
 798
 799         Most filters will know that all of the data has been decoded
 800         when the End of Input data has been reached. Once the filter
 801         knows that it has received the input data in its entirety,
 802         it finishes its job, and indicates to its caller that all of
 803         the data has been decoded. The filter-specific decoder must
 804         indicate an error if it detects End of Payload Marker.
 805
 806         Note that this method can work only when the filter is not
 807         the last filter in the chain, because only another filter
 808         can indicate the End of Input data. In practice this means,
 809         that a filter later in the chain must support embedding
 810         End of Payload Marker.
 811
 812         When a filter that cannot embed End of Payload Marker is the
 813         last filter in the chain, Subblock filter is appended to the
 814         chain as an implicit filter. In the simplest case, this occurs
 815         when no filters are specified, and the End of Payload Marker
 816         bit is set in Block Flags.
 817
 818
 819 4.1.3. With End of Payload Marker
 820
 821         End of Payload Marker is a filter-specific bit sequence that
 822         indicates the end of data. It is supported by only a few
 823         filters. It is used when uncompressed size is unknown, and
 824         the filter
 825           - doesn't support End of Input; or
 826           - is the last filter in the chain.
 827
 828         End of Payload Marker is embedded at the end of the encoded
 829         data by the filter-specific encoder. When the filter-specific
 830         decoder detects the embedded End of Payload Marker, the decoder
 831         knows that all of the data has been decoded. Then it finishes
 832         its job, and indicates to its caller that all of the data has
 833         been decoded. If the filter-specific decoder detects End of
 834         Input before End of Payload Marker, the decoder must indicate
 835         an error.
 836
 837         If the filter supports both End of Input and End of Payload
 838         Marker, the former is used, unless the filter is the last
 839         filter in the chain.
 840
 841
 842 4.2. Alignment
 843
 844         Some filters give better compression ratio or are faster
 845         when the input or output data is aligned. For optimal results,
 846         the encoder should try to enforce proper alignment when
 847         possible. Not enforcing alignment in the encoder is not
 848         an error. Thus, the decoder must be able to handle files with
 849         suboptimal alignment.
 850
 851         Alignment of uncompressed input data is usually the job of
 852         the application producing the data. For example, to get the
 853         best results, an archiver tool should make sure that all
 854         PowerPC executable files in the archive stream start at
 855         offsets that are multiples of four bytes.
 856
 857         Some filters, for example LZMA, can be configured to take
 858         advantage of specified alignment of input data. Note that
 859         taking advantage of aligned input can be benefical also when
 860         a filter is not the first filter in the chain. For example,
 861         if you compress PowerPC executables, you may want to use the
 862         PowerPC filter and chain that with the LZMA filter. Because not
 863         only the input but also the output alignment of the PowerPC
 864         filter is four bytes, it is now benefical to set LZMA settings
 865         so that the LZMA encoder can take advantage of its
 866         four-byte-aligned input data.
 867
 868         The output of the last filter in the chain is stored to the
 869         Compressed Data field. Aligning Compressed Data appropriately
 870         can increase
 871           - speed, if the filtered data is handled multiple bytes at
 872             a time by the filter-specific encoder and decoder,
 873             because accessing aligned data in computer memory is
 874             usually faster; and
 875           - compression ratio, if the output data is later compressed
 876             with an external compression tool.
 877
 878         Compressed Data in a Stream can be aligned by using the Header
 879         Padding field in the Block Header.
 880
 881
 882 4.3. Filters
 883
 884 4.3.1. Copy
 885
 886         This is a dummy filter that simply copies all data from input
 887         to output unmodified.
 888
 889             Filter ID:                  0x00
 890             Size of Filter Properties:  0 bytes
 891             Changes size of data:       No
 892
 893             Detecting when all of the data has been decoded:
 894                 Uncompressed size:      Yes
 895                 End of Payload Marker:  No
 896                 End of Input:           Yes
 897
 898             Preferred alignment:
 899                 Input data:             1 byte
 900                 Output data:            1 byte
 901
 902
 903 4.3.2. Subblock
 904
 905         The Subblock filter can be used to
 906           - embed End of Payload Marker when the otherwise last
 907             filter in the chain does not support embedding it; and
 908           - apply additional filters in the middle of a Block.
 909
 910             Filter ID:                  0x01
 911             Size of Filter Properties:  0 bytes
 912             Changes size of data:       Yes, unpredictably
 913
 914             Detecting when all of the data has been decoded:
 915                 Uncompressed size:      Yes
 916                 End of Payload Marker:  Yes
 917                 End of Input:           Yes
 918
 919             Preferred alignment:
 920                 Input data:             1 byte
 921                 Output data:            Freely adjustable
 922
 923
 924 4.3.2.1. Format of the Encoded Output
 925
 926         The encoded data from the Subblock filter consist of zero or
 927         more Subblocks:
 928
 929             +==========+==========+
 930             | Subblock | Subblock | ...
 931             +==========+==========+
 932
 933         Each Subblock contains two fields:
 934
 935             +----------------+===============+
 936             | Subblock Flags | Subblock Data |
 937             +----------------+===============+
 938
 939         Subblock Flags is a bitfield:
 940
 941             Bits   Mask   Description
 942             0-3    0x0F   The interpretation of these bits depend on
 943                           the Subblock Type:
 944                             - 0x20    Bits 0-3 for Size
 945                             - 0x30    Bits 0-3 for Repeat Count
 946                             - Other   These bits must be zero.
 947             4-7    0xF0   Subblock Type:
 948                             - 0x00: Padding
 949                             - 0x10: End of Payload Marker
 950                             - 0x20: Data
 951                             - 0x30: Repeating Data
 952                             - 0x40: Set Subfilter
 953                             - 0x50: Unset Subfilter
 954                           If some other value is detected, the decoder
 955                           must indicate an error.
 956
 957         The format of the Subblock Data field depends on Subblock Type.
 958
 959         Subblocks with the Subblock Type 0x00 (Padding) don't have a
 960         Subblock Data field. These Subblocks can be useful for fixing
 961         alignment. There can be at maximum of 31 consecutive Subblocks
 962         with this Subblock Type; if there are more, the decoder must
 963         indicate an error.
 964
 965         Subblock with the Subblock Type 0x10 (End of Payload Marker)
 966         doesn't have a Subblock Data field. The decoder must indicate
 967         an error if this Subblock Type is detected when Subfilter is
 968         enabled, or when the Subblock filter is not supposed to embed
 969         the End of Payload Marker.
 970
 971         Subblocks with the Subblock Type 0x20 (Data) contain the rest
 972         of the Size, which is followed by Size + 1 bytes in the Data
 973         field (that is, Data can never be empty):
 974
 975             +------+------+------+======+
 976             | Bits 4-27 for Size | Data |
 977             +------+------+------+======+
 978
 979         Subblocks with the Subblock Type 0x30 (Repeating Data) contain
 980         the rest of the Repeat Count, the Size of the Data, and finally
 981         the actual Data to be repeated:
 982
 983             +---------+---------+--------+------+======+
 984             | Bits 4-27 for Repeat Count | Size | Data |
 985             +---------+---------+--------+------+======+
 986
 987         The size of the Data field is Size + 1. It is repeated Repeat
 988         Count + 1 times. That is, the minimum size of Data is one byte;
 989         the maximum size of Data is 256 bytes. The minimum number of
 990         repeats is one; the maximum number of repeats is 2^28.
 991
 992         If Subfilter is not used, the Data field of Subblock Types 0x20
 993         and 0x30 is the output of the decoded Subblock filter. If
 994         Subfilter is used, Data is the input of the Subfilter, and the
 995         decoded output of the Subfilter is the decoded output of the
 996         Subblock filter.
 997
 998         Subblocks with the Subblock Type 0x40 (Set Subfilter) contain
 999         a Filter Flags field in Subblock Data:
1000
1001             +==============+
1002             | Filter Flags |
1003             +==============+
1004
1005         It is an error to set the Subfilter to Filter ID 0x00 (Copy)
1006         or 0x01 (Subblock). All the other Filter IDs are allowed.
1007         The decoder must indicate an error if this Subblock Type is
1008         detected when a Subfilter is already enabled.
1009
1010         Subblocks with the Subblock Type 0x50 (Unset Subfilter) don't
1011         have a Subblock Data field. There must be at least one Subblock
1012         with Subblock Type 0x20 or 0x30 between Subblocks with Subblock
1013         Type 0x40 and 0x50; if there isn't, the decoder must indicate
1014         an error.
1015
1016         Subblock Types 0x40 and 0x50 are always used as a pair: If the
1017         Subblock filter has been enabled with Subblock Type 0x40, it
1018         must always be disabled later with Subblock Type 0x50.
1019         Disabling must be done even if the Subfilter used End of
1020         Payload Marker; after the Subfilter has detected End of Payload
1021         Marker, the next Subblock that is not Padding must unset the
1022         Subfilter.
1023
1024         When the Subblock filter is used as an implicit filter to embed
1025         End of Payload marker, the Subblock Types 0x40 and 0x50 (Set or
1026         Unset Subfilter) must not be used. The decoder must indicate an
1027         error if it detects any of these Subblock Types in an implicit
1028         Subblock filter.
1029
1030         The following code illustrates the basic structure of a
1031         Subblock decoder.
1032
1033             uint32_t consecutive_padding = 0;
1034             bool got_output_with_subfilter = false;
1035
1036             while (true) {
1037                 uint32_t size;
1038                 uint32_t repeat;
1039                 uint8_t flags = read_byte();
1040
1041                 if (flags != 0)
1042                     consecutive_padding = 0;
1043
1044                 switch (flags >> 4) {
1045                     case 0:
1046                         // Padding
1047                         if (flags & 0x0F)
1048                             return DATA_ERROR;
1049                         if (++consecutive_padding == 32)
1050                             return DATA_ERROR;
1051                         break;
1052
1053                     case 1:
1054                         // End of Payload Marker
1055                         if (flags & 0x0F)
1056                             return DATA_ERROR;
1057                         if (subfilter_enabled || !allow_eopm)
1058                             return DATA_ERROR;
1059                         break;
1060
1061                     case 2:
1062                         // Data
1063                         size = flags & 0x0F;
1064                         for (size_t i = 4; i < 28; i += 8)
1065                             size |= (uint32_t)(read_byte()) << i;
1066
1067                         // If any output is produced, this will
1068                         // set got_output_with_subfilter to true.
1069                         copy_data(size);
1070                         break;
1071
1072                     case 3:
1073                         // Repeating Data
1074                         repeat = flags & 0x0F;
1075                         for (size_t i = 4; i < 28; i += 8)
1076                             repeat |= (uint32_t)(read_byte()) << i;
1077                         size = read_byte();
1078
1079                         // If any output is produced, this will
1080                         // set got_output_with_subfilter to true.
1081                         copy_repeating_data(size, repeat);
1082                         break;
1083
1084                     case 4:
1085                         // Set Subfilter
1086                         if (flags & 0x0F)
1087                             return DATA_ERROR;
1088                         if (subfilter_enabled)
1089                             return DATA_ERROR;
1090                         got_output_with_subfilter = false;
1091                         set_subfilter();
1092                         break;
1093
1094                     case 5:
1095                         // Unset Subfilter
1096                         if (flags & 0x0F)
1097                             return DATA_ERROR;
1098                         if (!subfilter_enabled)
1099                             return DATA_ERROR;
1100                         if (!got_output_with_subfilter)
1101                             return DATA_ERROR;
1102                         unset_subfilter();
1103                         break;
1104
1105                     default:
1106                         return DATA_ERROR;
1107                 }
1108             }
1109
1110
1111 4.3.3. Delta
1112
1113         The Delta filter may increase compression ratio when the value
1114         of the next byte correlates with the value of an earlier byte
1115         at specified distance.
1116
1117             Filter ID:                  0x20
1118             Size of Filter Properties:  1 byte
1119             Changes size of data:       No
1120
1121             Detecting when all of the data has been decoded:
1122                 Uncompressed size:      Yes
1123                 End of Payload Marker:  No
1124                 End of Input:           Yes
1125
1126             Preferred alignment:
1127                 Input data:             1 byte
1128                 Output data:            Same as the original input data
1129
1130         The Properties byte indicates the delta distance, which can be
1131         1-256 bytes backwards from the current byte: 0x00 indicates
1132         distance of 1 byte and 0xFF distance of 256 bytes.
1133
1134
1135 4.3.3.1. Format of the Encoded Output
1136
1137         The code below illustrates both encoding and decoding with
1138         the Delta filter.
1139
1140             // Distance is in the range [1, 256].
1141             const unsigned int distance = get_properties_byte() + 1;
1142             uint8_t pos = 0;
1143             uint8_t delta[256];
1144
1145             memset(delta, 0, sizeof(delta));
1146
1147             while (1) {
1148                 const int byte = read_byte();
1149                 if (byte == EOF)
1150                     break;
1151
1152                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
1153                 if (is_encoder) {
1154                     tmp = (uint8_t)(byte) - tmp;
1155                     delta[pos] = (uint8_t)(byte);
1156                 } else {
1157                     tmp = (uint8_t)(byte) + tmp;
1158                     delta[pos] = tmp;
1159                 }
1160
1161                 write_byte(tmp);
1162                 --pos;
1163             }
1164
1165
1166 4.3.4. LZMA
1167
1168         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
1169         compression algorithm with high compression ratio and fast
1170         decompression. LZMA based on LZ77 and range coding algorithms.
1171
1172             Filter ID:                  0x40
1173             Size of Filter Properties:  2 bytes
1174             Changes size of data:       Yes, unpredictably
1175
1176             Detecting when all of the data has been decoded:
1177                 Uncompressed size:      Yes
1178                 End of Payload Marker:  Yes
1179                 End of Input:           No
1180
1181             Preferred alignment:
1182                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
1183                 Output data:            1 byte
1184
1185         At the time of writing, there is no other documentation about
1186         how LZMA works than the source code in LZMA SDK. Once such
1187         documentation gets written, it will probably be published as
1188         a separate document, because including the documentation here
1189         would lengthen this document considerably.
1190
1191         The format of the Filter Properties field is as follows:
1192
1193             +-----------------+------------------+
1194             | LZMA Properties | Dictionary Flags |
1195             +-----------------+------------------+
1196
1197
1198 4.3.4.1. LZMA Properties
1199
1200         The LZMA Properties field contains three properties. An
1201         abbreviation is given in parentheses, followed by the value
1202         range of the property. The field consists of
1203
1204             1) the number of literal context bits (lc, [0, 8]);
1205             2) the number of literal position bits (lp, [0, 4]); and
1206             3) the number of position bits (pb, [0, 4]).
1207
1208         They are encoded using the following formula:
1209
1210             LZMA Properties = (pb * 5 + lp) * 9 + lc
1211
1212         The following C code illustrates a straightforward way to
1213         decode the properties:
1214
1215             uint8_t lc, lp, pb;
1216             uint8_t prop = get_lzma_properties() & 0xFF;
1217             if (prop > (4 * 5 + 4) * 9 + 8)
1218                 return LZMA_PROPERTIES_ERROR;
1219
1220             pb = prop / (9 * 5);
1221             prop -= pb * 9 * 5;
1222             lp = prop / 9;
1223             lc = prop - lp * 9;
1224
1225
1226 4.3.4.2. Dictionary Flags
1227
1228         Currently the lowest six bits of the Dictionary Flags field
1229         are in use:
1230
1231             Bits   Mask   Description
1232             0-5    0x3F   Dictionary Size
1233             6-7    0xC0   Reserved for future use; must be zero for now.
1234
1235         Dictionary Size is encoded with one-bit mantissa and five-bit
1236         exponent. To avoid wasting space, one-byte dictionary has its
1237         own special value.
1238
1239             Raw value   Mantissa   Exponent   Dictionary size
1240                 0           1          0      1 byte
1241                 1           2          0      2 bytes
1242                 2           3          0      3 bytes
1243                 3           2          1      4 bytes
1244                 4           3          1      6 bytes
1245                 5           2          2      8 bytes
1246                 6           3          2      12 bytes
1247                 7           2          3      16 bytes
1248                 8           3          3      24 bytes
1249                 9           2          4      32 bytes
1250               ...         ...        ...      ...
1251                61           2         30      2 GiB
1252                62           3         30      3 GiB
1253                63           2         31      4 GiB (*)
1254
1255             (*) The real maximum size of the dictionary is one byte
1256                 less than 4 GiB, because the distance of 4 GiB is
1257                 reserved for End of Payload Marker.
1258
1259         Instead of having a table in the decoder, the dictionary size
1260         can be decoded using the following C code:
1261
1262             uint64_t dictionary_size;
1263             const uint8_t bits = get_dictionary_flags() & 0x3F;
1264             if (bits == 0) {
1265                 dictionary_size = 1;
1266             } else {
1267                 dictionary_size = 2 | ((bits + 1) & 1);
1268                 dictionary_size = dictionary_size << ((bits - 1)  / 2);
1269             }
1270
1271
1272 4.3.5. Branch/Call/Jump Filters for Executables
1273
1274         These filters convert relative branch, call, and jump
1275         instructions to their absolute counterparts in executable
1276         files. This conversion increases redundancy and thus
1277         compression ratio.
1278
1279             Size of Filter Properties:  0 or 4 bytes
1280             Changes size of data:       No
1281
1282             Detecting when all of the data has been decoded:
1283                 Uncompressed size:      Yes
1284                 End of Payload Marker:  No
1285                 End of Input:           Yes
1286
1287         Below is the list of filters in this category. The alignment
1288         is the same for both input and output data.
1289
1290             Filter ID   Alignment   Description
1291               0x04       1 byte     x86 filter (BCJ)
1292               0x05       4 bytes    PowerPC (big endian) filter
1293               0x06      16 bytes    IA64 filter
1294               0x07       4 bytes    ARM (little endian) filter
1295               0x08       2 bytes    ARM Thumb (little endian) filter
1296               0x09       4 bytes    SPARC filter
1297
1298         If the size of Filter Properties is four bytes, the Filter
1299         Properties field contains the start offset used for address
1300         conversions. It is stored as an unsigned 32-bit little endian
1301         integer. If the size of Filter Properties is zero, the start
1302         offset is zero.
1303
1304         Setting the start offset may be useful if an executable has
1305         multiple sections, and there are many cross-section calls.
1306         Taking advantage of this feature usually requires usage of
1307         the Subblock filter.
1308
1309
1310 5. Metadata
1311
1312         Metadata is stored in Metadata Blocks, which can be in the
1313         beginning or at the end of a Multi-Block Stream. Because of
1314         Blocks, it is possible to compress Metadata in the same way
1315         as the actual data is compressed. This Section describes the
1316         format of the data stored in Metadata Blocks.
1317
1318             +----------------+===============================+
1319             | Metadata Flags | Size of Header Metadata Block |
1320             +----------------+===============================+
1321
1322                  +============+===================+=======+=======+
1323             ---> | Total Size | Uncompressed Size | Index | Extra |
1324                  +============+===================+=======+=======+
1325
1326         Stream must be parseable backwards. That is, there must be
1327         a way to locate the beginning of the Stream by starting from
1328         the end of the Stream. Thus, the Footer Metadata Block must
1329         contain the Total Size field or the Index field. If the Stream
1330         has Header Metadata Block, also the Size of Header Metadata
1331         Block field must be present in Footer Metadata Block.
1332
1333         It must be possible to quickly locate the Blocks in
1334         non-streamed mode. Thus, the Index field must be present
1335         at least in one Metadata Block.
1336
1337         If the above conditions are not met, the decoder must indicate
1338         an error.
1339
1340         There should be no additional data after the last field. If
1341         there is, the the decoder should indicate an error.
1342
1343
1344 5.1. Metadata Flags
1345
1346         This field describes which fields are present in a Metadata
1347         Block:
1348
1349             Bit(s)  Mask   Desription
1350               0     0x01   Size of Header Metadata Block is present.
1351               1     0x02   Total Size is present.
1352               2     0x04   Uncompressed Size is present.
1353               3     0x08   Index is present.
1354              4-6    0x70   Reserve for future use; must be zero for now.
1355               7     0x80   Extra is present.
1356
1357         If any reserved bit is set, the decoder must indicate an error.
1358         It is possible that there is a new field present which the
1359         decoder is not aware of, and can thus parse the Metadata
1360         incorrectly.
1361
1362
1363 5.2. Size of Header Metadata Block
1364
1365         This field is present only if the appropriate bit is set in
1366         the Metadata Flags field (see Section 5.1).
1367
1368         Size of Header Metadata Block is needed to make it possible to
1369         parse the Stream backwards. The size is stored using the
1370         encoding described in Section 1.2. The decoder must verify that
1371         that the value stored in this field is non-zero. In Footer
1372         Metadata Block, the decoder must also verify that the stored
1373         size matches the real size of Header Metadata Block. In the
1374         Header Meatadata Block, the value of this field is ignored as
1375         long as it is not zero.
1376
1377
1378 5.3. Total Size
1379
1380         This field is present only if the appropriate bit is set in the
1381         Metadata Flags field (see Section 5.1).
1382
1383         This field contains the total size of the Data Blocks in the
1384         Stream. Total Size is stored using the encoding described in
1385         Section 1.2. If the stored value does not match the real total
1386         size of the Data Blocks, the decoder must indicate an error.
1387         The value of this field must be non-zero.
1388
1389         Total Size can be used to quickly locate the beginning or end
1390         of the Stream. This can be useful for example when doing
1391         random-access reading, and the Index field is not in the
1392         Metadata Block currently being read.
1393
1394         It is useless to have both Total Size and Index in the same
1395         Metadata Block, because Total Size can be calculated from the
1396         Index field.
1397
1398
1399 5.4. Uncompressed Size
1400
1401         This field is present only if the appropriate bit is set in the
1402         Metadata Flags field (see Section 5.1).
1403
1404         This field contains the total uncompressed size of the Data
1405         Blocks in the Stream. Uncompresssed Size is stored using the
1406         encoding described in Section 1.2. If the stored value does not
1407         match the real uncompressed size of the Data Blocks, the
1408         decoder must indicate an error.
1409
1410         It is useless to have both Uncompressed Size and Index in
1411         the same Metadata Block, because Uncompressed Size can be
1412         calculated from the Index field.
1413
1414
1415 5.5. Index
1416
1417         +=======================+=============+====================+
1418         | Number of Data Blocks | Total Sizes | Uncompressed Sizes |
1419         +=======================+=============+====================+
1420
1421         Index serves several purporses. Using it, one can
1422           - verify that all Blocks in a Stream have been processed;
1423           - find out the Uncompressed Size of a Stream; and
1424           - quickly access the beginning of any Block (random access).
1425
1426
1427 5.5.1. Number of Data Blocks
1428
1429         This field contains the number of Data Blocks in the Stream.
1430         The value is stored using the encoding described in Section
1431         1.2. If the decoder has decoded all the Data Blocks of the
1432         Stream, and then notices that the Number of Records doesn't
1433         match the real number of Data Blocks, the decoder must
1434         indicate an error. The value of this field must be non-zero.
1435
1436
1437 5.5.2. Total Sizes
1438
1439         +============+============+
1440         | Total Size | Total Size | ...
1441         +============+============+
1442
1443         This field lists the Total Sizes of every Data Block in the
1444         Stream. There are as many Total Size fields as indicated by
1445         the Number of Data Blocks field.
1446
1447         Total Size is the size of Block Header, Compressed Data, and
1448         Block Footer. It is stored using the encoding described in
1449         Section 1.2. If the Total Sizes do not match the real sizes
1450         of respective Blocks, the decoder should indicate an error.
1451         All the Total Size fields must have a non-zero value.
1452
1453
1454 5.5.3. Uncompressed Sizes
1455
1456         +===================+===================+
1457         | Uncompressed Size | Uncompressed Size | ...
1458         +===================+===================+
1459
1460         This field lists the Uncompressed Sizes of every Data Block
1461         in the Stream. There are as many Uncompressed Size fields as
1462         indicated by the Number of Records field.
1463
1464         Uncompressed Sizes are stored using the encoding described
1465         in Section 1.2. If the Uncompressed Sizes do not match the
1466         real sizes of respective Blocks, the decoder shoud indicate
1467         an error.
1468
1469
1470 5.6. Extra
1471
1472         This field is present only if the appropriate bit is set in the
1473         Metadata Flags field (see Section 5.1). Note that the bit does
1474         not indicate that there is any data in the Extra field; it only
1475         indicates that Extra may be non-empty.
1476
1477         The Extra field contains only information that is not required
1478         to properly uncompress the Stream or to do random-access
1479         reading. Supporting the Extra field is optional. In case the
1480         decoder doesn't support the Extra field, it should silently
1481         ignore it.
1482
1483         Extra consists of zero or more Records:
1484
1485             +========+========+
1486             | Record | Record | ...
1487             +========+========+
1488
1489         Excluding Records with Record ID 0x00, each Record contains
1490         three fields:
1491
1492             +==========+==============+======+
1493             | Reord ID | Size of Data | Data |
1494             +==========+==============+======+
1495
1496         The Record ID and Size of Data are stored using the encoding
1497         described in Section 1.2. Data can be binary or UTF-8
1498         [RFC-3629] strings. Non-UTF-8 strings should be avoided.
1499         Because the Size of Data is known, there is no need to
1500         terminate strings with a nul byte, although doing so should
1501         not be considered an error.
1502
1503         The Record IDs are divided in two categories:
1504           - Safe-to-Copy Records may be preserved as is when the
1505             Stream is modified in ways that don't change the actual
1506             uncompressed data. Examples of such operatings include
1507             recompressing and adding, modifying, or deleting unrelated
1508             Extra Records.
1509           - Unsafe-to-Copy Records should be removed (and possibly
1510             recreated) when any kind of changes are made to the Stream.
1511
1512         When the actual uncompressed data is modified, all Records
1513         should be removed (and possibly recreated), unless the
1514         application knows that the Data stored to the Record(s) is
1515         still valid.
1516
1517         The following subsections describe the standard Record IDs and
1518         the format of their Data fields. Safe-to-Copy Records have an
1519         odd ID, while Unsafe-to-Copy Records have an even ID.
1520
1521
1522 5.6.1. 0x00: Dummy/Padding
1523
1524         This Record is special, because it doesn't have the Size of
1525         Data or Data fields.
1526
1527         Dummy Records can be used, for example, to fill Metadata Block
1528         when a few bytes of extra space has been reserved for it. There
1529         can be any number of Dummy Records.
1530
1531
1532 5.6.2. 0x01: OpenPGP Signature
1533
1534         OpenPGP signature is computed from uncompressed data. The
1535         signature can be used to verify that the contents of a Stream
1536         has been created by a trustworthy source.
1537
1538         If the decoder supports decoding concatenated Streams, it
1539         must indicate an error when verifying OpenPGP signatures if
1540         there is more than one Stream.
1541
1542         OpenPGP format is documented in [RFC-2440].
1543
1544
1545 5.6.3. 0x02: Filter Information
1546
1547         The Filter Information Record contains information about the
1548         filters used in the Stream. This field can be used to quickly
1549           - display which filters are used in each Block;
1550           - check if all the required filters are supported by the
1551             current decoder version; and
1552           - check how much memory is required to decode each Block.
1553
1554         The format of the Filter Information field is as follows:
1555
1556             +=================+=================+
1557             | Block 0 Filters | Block 1 Filters | ...
1558             +=================+=================+
1559
1560         There can be at maximum of as many Block Filters fields as
1561         there are Data Blocks in the Stream. The format of the Block
1562         Filters field is as follows:
1563
1564             +------------------+======================+============+
1565             | Block Properties | List of Filter Flags | Subfilters |
1566             +------------------+======================+============+
1567
1568         Block Properties is a bitfield:
1569
1570             Bit(s)  Mask  Description
1571              0-2    0x07  Number of filters (0-7)
1572               3     0x08  End of Payload Marker is used.
1573               4     0x10  The Subfilters field is present.
1574              5-7    0xE0  Reserved for future use; must be zero for now.
1575
1576         The contents of the List of Filter Flags field must match the
1577         List of Filter Flags field in the respective Block Header.
1578
1579         The Subfilters field may be present only if the List of Filter
1580         Flags contains a Filter Flags field for a Subblock filter. The
1581         format of the Subfilters field is as follows:
1582
1583             +======================+=========================+
1584             | Number of Subfilters | List of Subfilter Flags |
1585             +======================+=========================+
1586
1587         The value stored in the Number of Subfilters field is stored
1588         using the encoding described in Section 1.2. The List of
1589         Subfilter Flags field contains as many Filter Flags fields
1590         as indicated by the Number of Subfilters field. These Filter
1591         Flags fields list some or all the Subfilters used via the
1592         Subblock filter. The order of the listed Subfilters is not
1593         significant.
1594
1595         Decoders supporting this Record should indicate a warning or
1596         error if this Record contains Filter Flags that are not
1597         actually used by the respective Blocks.
1598
1599
1600 5.6.4. 0x03: Comment
1601
1602         Free-form comment is stored in UTF-8 [RFC-3629] encoding.
1603
1604         The beginning of a new line should be indicated using the
1605         ASCII Line Feed character (0x0A). When the Line Feed character
1606         is not the native way to indicate new line in the underlying
1607         operating system, the encoder and decoder should convert the
1608         newline characters to and from Line Feeds.
1609
1610
1611 5.6.5. 0x04: List of Checks
1612
1613         +=======+=======+
1614         | Check | Check | ...
1615         +=======+=======+
1616
1617         There are as many Check fields as there are Blocks in the
1618         Stream. The size of Check fields depend on Stream Flags
1619         (see Section 2.2.2).
1620
1621         Decoders supporting this Record should indicate a warning or
1622         error if the Checks don't match the respective Blocks.
1623
1624
1625 5.6.6. 0x05: Original Filename
1626
1627         Original filename is stored in UTF-8 [RFC-3629] encoding.
1628
1629         The filename must not include any path, only the filename
1630         itself. Special care must be taken to prevent directory
1631         traversal vulnerabilities.
1632
1633         When files are moved between different operating systems, it
1634         is possible that filename valid in the source system is not
1635         valid in the target system. It is implementation defined how
1636         the decoder handles this kind of situations.
1637
1638
1639 5.6.7. 0x07: Modification Time
1640
1641         Modification time is stored as POSIX time, as an unsigned
1642         little endian integer. The number of bits depends on the
1643         Size of Data field. Note that the usage of unsigned integer
1644         limits the earliest representable time to 1970-01-01T00:00:00.
1645
1646
1647 5.6.8. 0x09: High-Resolution Modification Time
1648
1649         This Record extends the `0x04: Modification time' Record with
1650         a subsecond time information. There are two supported formats
1651         of this field, which can be distinguished by looking at the
1652         Size of Data field.
1653
1654         Size  Data
1655           3   [0; 9,999,999] times 100 nanoseconds
1656           4   [0; 999,999,999] nanoseconds
1657
1658         The value is stored as an unsigned 24-bit or 32-bit little
1659         endian integer.
1660
1661
1662 5.6.9. 0x0B: MIME Type
1663
1664         MIME type of the uncompressed Stream. This can be used to
1665         detect the content type. [IANA-MIME]
1666
1667
1668 5.6.10. 0x0D: Homepage URL
1669
1670         This field can be used, for example, when distributing software
1671         packages (sources or binaries). The field would indicate the
1672         homepage of the program.
1673
1674         For details on how to encode URLs, see [RFC-1738].
1675
1676
1677 6. Custom Filter and Extra Record IDs
1678
1679         If a developer wants to use custom Filter or Extra Record IDs,
1680         he has two choices. The first choice is to contact Lasse Collin
1681         and ask him to allocate a range of IDs for the developer.
1682
1683         The second choice is to generate a 40-bit random integer,
1684         which the developer can use as his personal Developer ID.
1685         To minimalize the risk of collisions, Developer ID has to be
1686         a randomly generated integer, not manually selected "hex word".
1687         The following command, which works on many free operating
1688         systems, can be used to generate Developer ID:
1689
1690             dd if=/dev/urandom bs=5 count=1 | hexdump
1691
1692         The developer can then use his Developer ID to create unique
1693         (well, hopefully unique) Filter and Extra Record IDs.
1694
1695             Bits    Mask                    Description
1696              0-15   0x0000_0000_0000_FFFF   Filter or Extra Record ID
1697             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1698             56-62   0x7F00_0000_0000_0000   Static prefix: 0x7F
1699
1700         The resulting 63-bit integer will use 9 bytes of space when
1701         stored using the encoding described in Section 1.2. To get
1702         a shorter ID, see the beginning of this Section how to
1703         request a custom ID range.
1704
1705         Note that Filter and Metadata Record IDs are in their own
1706         namespaces. That is, you can use the same ID value as Filter ID
1707         and Metadata Record ID, and the meanings of the IDs do not need
1708         to be related to each other.
1709
1710
1711 6.1. Reserved Custom Filter ID Ranges
1712
1713         Range                       Description
1714         0x0000_0000 - 0x0000_00DF   IDs fitting into the Misc field
1715         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1716         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1717
1718
1719 7. Cyclic Redundancy Checks
1720
1721         There are several incompatible variations to calculate CRC32
1722         and CRC64. For simplicity and clarity, complete examples are
1723         provided to calculate the checks as they are used in this file
1724         format. Implementations may use different code as long as it
1725         gives identical results.
1726
1727         The program below reads data from standard input, calculates
1728         the CRC32 and CRC64 values, and prints the calculated values
1729         as big endian hexadecimal strings to standard output.
1730
1731             #include <sys/types.h>
1732             #include <inttypes.h>
1733             #include <stdio.h>
1734
1735             uint32_t crc32_table[256];
1736             uint64_t crc64_table[256];
1737
1738             void
1739             init(void)
1740             {
1741                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1742                 static const uint64_t poly64
1743                         = UINT64_C(0xC96C5795D7870F42);
1744
1745                 for (size_t i = 0; i < 256; ++i) {
1746                     uint32_t crc32 = i;
1747                     uint64_t crc64 = i;
1748
1749                     for (size_t j = 0; j < 8; ++j) {
1750                         if (crc32 & 1)
1751                             crc32 = (crc32 >> 1) ^ poly32;
1752                         else
1753                             crc32 >>= 1;
1754
1755                         if (crc64 & 1)
1756                             crc64 = (crc64 >> 1) ^ poly64;
1757                         else
1758                             crc64 >>= 1;
1759                     }
1760
1761                     crc32_table[i] = crc32;
1762                     crc64_table[i] = crc64;
1763                 }
1764             }
1765
1766             uint32_t
1767             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1768             {
1769                 crc = ~crc;
1770                 for (size_t i = 0; i < size; ++i)
1771                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1772                             ^ (crc >> 8);
1773                 return ~crc;
1774             }
1775
1776             uint64_t
1777             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1778             {
1779                 crc = ~crc;
1780                 for (size_t i = 0; i < size; ++i)
1781                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1782                             ^ (crc >> 8);
1783                 return ~crc;
1784             }
1785
1786             int
1787             main()
1788             {
1789                 init();
1790
1791                 uint32_t value32 = 0;
1792                 uint64_t value64 = 0;
1793                 uint64_t total_size = 0;
1794                 uint8_t buf[8192];
1795
1796                 while (1) {
1797                     const size_t buf_size = fread(buf, 1, 8192, stdin);
1798                     if (buf_size == 0)
1799                         break;
1800
1801                     total_size += buf_size;
1802                     value32 = crc32(buf, buf_size, value32);
1803                     value64 = crc64(buf, buf_size, value64);
1804                 }
1805
1806                 printf("Bytes:  %" PRIu64 "\n", total_size);
1807                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1808                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1809
1810                 return 0;
1811             }
1812
1813
1814 8. References
1815
1816 8.1. Normative References
1817
1818         [RFC-1738]
1819         Uniform Resource Locators (URL)
1820         http://www.ietf.org/rfc/rfc1738.txt
1821
1822         [RFC-2119]
1823         Key words for use in RFCs to Indicate Requirement Levels
1824         http://www.ietf.org/rfc/rfc2119.txt
1825
1826         [RFC-2440]
1827         OpenPGP Message Format
1828         http://www.ietf.org/rfc/rfc2440.txt
1829
1830         [RFC-3629]
1831         UTF-8, a transformation format of ISO 10646
1832         http://www.ietf.org/rfc/rfc3629.txt
1833
1834         [IANA-MIME]
1835         MIME Media Types
1836         http://www.iana.org/assignments/media-types/
1837
1838
1839 8.2. Informative References
1840
1841         LZMA SDK - The original LZMA implementation
1842         http://7-zip.org/sdk.html
1843
1844         LZMA Utils - LZMA adapted to POSIX-like systems
1845         http://tukaani.org/lzma/
1846
1847         [RFC-1952]
1848         GZIP file format specification version 4.3
1849         http://www.ietf.org/rfc/rfc1952.txt
1850           - Notation of byte boxes in section `2.1. Overall conventions'
1851
1852         [GNU-tar]
1853         GNU tar 1.16.1 manual
1854         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1855           - Node 9.4.2 `Blocking Factor', paragraph that begins
1856             `gzip will complain about trailing garbage'
1857           - Note that this URL points to the latest version of the
1858             manual, and may some day not contain the note which is in
1859             1.16.1. For the exact version of the manual, download GNU
1860             tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz
1861