doc/liblzma-advanced.txt

   1
   2 Advanced features of liblzma
   3 ----------------------------
   4
   5 0. Introduction
   6
   7     Most developers need only the basic features of liblzma. These
   8     features allow single-threaded encoding and decoding of .lzma files
   9     in streamed mode.
  10
  11     In some cases developers want more. The .lzma file format is
  12     designed to allow multi-threaded encoding and decoding and limited
  13     random-access reading. These features are possible in non-streamed
  14     mode and, to a limited extent, also in streamed mode.
  15
  16     To take advantage of these features, the application needs a custom
  17     .lzma file format handler. liblzma provides a set of tools to ease
  18     this task, but it's still quite a bit of work to get a good custom
  19     .lzma handler done.
  20
  21
  22 1. Where to begin
  23
  24     Start by reading the .lzma file format specification. Understanding
  25     the basics of the .lzma file structure is required to implement a
  26     custom .lzma file handler and to understand the rest of this document.
  27
  28
  29 2. The basic components
  30
  31 2.1. Stream Header and tail
  32
  33     Stream Header begins the .lzma Stream and Stream tail ends it. Stream
  34     Header is defined in the file format specification, but Stream tail
  35     isn't (thus I write "tail" with a lower-case letter). Stream tail is
  36     simply the Stream Flags and the Footer Magic Bytes fields together.
  37     liblzma does it this way because the Block coders take care of the
  38     remaining fields of the Stream Footer.
  39
  40     For now, the size of Stream Header is fixed to 11 bytes. The header
  41     <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
  42     should use instead of a hardcoded number. Similarly, Stream tail
  43     is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.
  44
  45     It is possible that a future version of the .lzma format will have
  46     a variable-sized Stream Header and tail. At the time of writing this
  47     seems so unlikely that it was considered simplest to use constants
  48     instead of providing functions to get and store the sizes of the
  49     Stream Header and tail.
  50
  51
  52 2.x. Stream tail
  53
  54     For now, the size of Stream tail is fixed to 3 bytes. The header
  55     <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
  56     should use instead of a hardcoded number.
  57
  58
  59 3. Keeping track of size information
  60
  61     The lzma_info_* functions found in <lzma/info.h> should ease the
  62     task of keeping track of sizes of the Blocks and also the Stream
  63     as a whole. Using these functions is strongly recommended, because
  64     there are surprisingly many situations where an error can occur,
  65     and these functions check for possible errors every time some new
  66     information becomes available.
  67
  68     If you find lzma_info_* functions lacking something that you would
  69     find useful, please contact the author.
  70
  71
  72 3.1. Start offset of the Stream
  73
  74     If you are storing the .lzma Stream inside another file format, or
  75     for some other reason are placing the .lzma Stream somewhere other
  76     than the beginning of the file, you should set the starting offset
  77     of the Stream using lzma_info_start_offset_set().
  78
  79     The start offset of the Stream is used for two distinct purposes.
  80     First, knowing the start offset of the Stream allows
  81     lzma_info_alignment_get() to correctly calculate the alignment of
  82     every Block. This information is given to the Block encoder, which
  83     will calculate the size of Header Padding so that Compressed Data
  84     is aligned at an optimal offset.
  85
  86     The other use of the start offset of the Stream is in random-access
  87     reading. If you set the start offset of the Stream, lzma_info_locate()
  88     will be able to calculate the offset relative to the beginning of the
  89     file containing the Stream (instead of offset relative to the
  90     beginning of the Stream).
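
         The translation from a Stream-relative offset to a file offset
         can be sketched without liblzma itself. The helper below is a
         hypothetical illustration (stream_to_file_offset() is not a
         liblzma function); it only shows the addition and the overflow
         check that such a translation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma: convert an offset that is
 * relative to the beginning of the Stream into an offset relative to
 * the beginning of the file containing the Stream. Returns false if
 * the sum would not fit in uint64_t, which indicates bogus input.
 */
bool
stream_to_file_offset(uint64_t stream_start, uint64_t offset_in_stream,
		uint64_t *file_offset)
{
	assert(file_offset != NULL);

	if (offset_in_stream > UINT64_MAX - stream_start)
		return false;

	*file_offset = stream_start + offset_in_stream;
	return true;
}
```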
  91
  92
  93 3.2. Size of Stream Header
  94
  95     While the size of Stream Header is constant (11 bytes) in the current
  96     version of the .lzma file format, this may change in the future.
  97
  98
  99 3.3. Size of Header Metadata Block
 100
 101     This information is needed when doing random-access reading, and
 102     to verify the value of this field stored in Footer Metadata Block.
 103
 104
 105 3.4. Total Size of the Data Blocks
 106
 107
 108 3.5. Uncompressed Size of Data Blocks
 109
 110
 111 3.6. Index
 112
 113
 114
 115
 116 x. Alignment
 117
 118     There are a few slightly different types of alignment issues when
 119     working with .lzma files.
 120
 121     The .lzma format doesn't strictly require any kind of alignment.
 122     However, if the encoder carefully optimizes the alignment in all
 123     situations, it can improve compression ratio, speed of the encoder
 124     and decoder, and slightly help if the files get damaged and need
 125     recovery.
 126
 127     Alignment has the most significant effect on compression ratio. FIXME
 128
 129
 130 x.1. Compression ratio
 131
 132     Some filters take advantage of the alignment of the input data.
 133     To get the best compression ratio, make sure that you feed these
 134     filters correctly aligned data.
 135
 136     Some filters (e.g. LZMA) don't necessarily mind too much if the
 137     input doesn't match the preferred alignment. With these filters
 138     the penalty in compression ratio depends on the specific type of
 139     data being compressed.
 140
 141     Other filters (e.g. PowerPC executable filter) won't work at all
 142     with data that is improperly aligned. While the data can still
 143     be de-filtered back to its original form, the benefit of the
 144     filtering (better compression ratio) is completely lost, because
 145     these filters expect certain patterns at properly aligned offsets.
 146     The compression ratio may even be worse with incorrectly aligned input
 147     than without the filter.
 148
 149
 150 x.1.1. Inter-filter alignment
 151
 152     When there are multiple filters chained, checking the alignment can
 153     be useful not only with the input of the first filter and output of
 154     the last filter, but also between the filters.
 155
 156     Inter-filter alignment is especially important with the Subblock filter.
 157
 158
 159 x.1.2. Further compression with external tools
 160
 161     This is a relatively rare situation in practice, but still worth
 162     understanding.
 163
 164     Let's say that there are several SPARC executables, which are each
 165     filtered to separate .lzma files using only the SPARC filter. If
 166     Uncompressed Size is written to the Block Header, the size of Block
 167     Header may vary between the .lzma files. If no Padding is used in
 168     the Block Header to correct the alignment, the starting offset of
 169     the Compressed Data field will be differently aligned in different
 170     .lzma files.
 171
 172     All these .lzma files are archived into a single .tar archive. Due
 173     to the nature of the .tar format, every file is aligned inside the
 174     archive to an offset that is a multiple of 512 bytes.
 175
 176     The .tar archive is compressed into a new .lzma file using the LZMA
 177     filter with options that prefer input alignment of four bytes. Now
 178     if the independent .lzma files don't have the same alignment of
 179     the Compressed Data fields, the LZMA filter will be unable to take
 180     advantage of the input alignment between the files in the .tar
 181     archive, which reduces compression ratio.
 182
 183     Thus, even if you have only a single Block per file, it can help the
 184     compression ratio to align the Compressed Data to an optimal offset.
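
         To make this concrete: because every .tar member starts at a
         multiple of 512 bytes, the alignment of Compressed Data modulo
         four depends only on how many bytes precede it inside the .lzma
         file. The sketch below computes how many padding bytes would
         bring an offset up to the next aligned position. The function
         name is hypothetical, and how the padding is actually stored in
         the Block Header is defined by the file format specification
         and not shown here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of padding bytes needed so that data
 * which would otherwise start at `offset` starts at the next multiple
 * of `alignment` instead. `alignment` must be non-zero.
 */
uint32_t
padding_for_alignment(uint64_t offset, uint32_t alignment)
{
	assert(alignment != 0);

	/* The outer modulo maps "already aligned" to zero padding. */
	return (uint32_t)((alignment - offset % alignment) % alignment);
}
```

         With made-up sizes: if the headers before Compressed Data take
         13 bytes in one file and 14 bytes in another, the helper gives
         3 and 2 bytes of padding for an alignment of four, so that
         Compressed Data starts at offset 16 in both files.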
 185
 186
 187 x.2. Speed
 188
 189     Most modern computers are faster when multi-byte data is located
 190     at aligned offsets in RAM. Proper alignment of the Compressed Data
 191     fields can slightly increase the speed of some filters.
 192
 193
 194 x.3. Recovery
 195
 196     Aligning every Block Header to start at an offset with big enough
 197     alignment may ease or at least speed up recovery of broken files.
 198
 199
 200 y. Typical use cases
 201
 202 y.x. Parsing the Stream backwards
 203
 204     You may need to parse the Stream backwards if you need to get
 205     information such as the sizes of the Stream, Index, or Extra.
 206     The basic procedure to do this follows.
 207
 208     Locate the end of the Stream. If the Stream is stored as is in a
 209     standalone .lzma file, simply seek to the end of the file and start
 210     reading backwards using an appropriate buffer size. The file format
 211     specification allows an arbitrary amount of Footer Padding (zero or
 212     more NUL bytes), which you skip before decoding the Stream tail.
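
         Skipping the Footer Padding in one buffer can be sketched as
         follows. This is a minimal illustration with a hypothetical
         function name; reading the file in chunks from the end and
         keeping track of file offsets is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: `buf` holds `size` bytes read from the end of
 * the file. Returns how many bytes remain once the trailing NUL bytes
 * (Footer Padding) are skipped. A return value of zero means that the
 * whole buffer was padding and an earlier chunk has to be read.
 */
size_t
skip_footer_padding(const uint8_t *buf, size_t size)
{
	assert(buf != NULL || size == 0);

	while (size > 0 && buf[size - 1] == 0x00)
		--size;

	return size;
}
```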
 213
 214     Once you have located the end of the Stream (a non-NUL byte), make
 215     sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
 216     Stream in a buffer. If there aren't enough bytes left in the file,
 217     the file is too small to contain a valid Stream. Decode the Stream
 218     tail using lzma_stream_tail_decoder(). Store the offset of the first
 219     byte of the Stream tail; you will need it later.
 220
 221     You may now want to do some internal verification, e.g. check that
 222     the Check type is supported by the liblzma build you are using.
 223
 224     Decode the Backward Size field with lzma_vli_reverse_decode(). The
 225     field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
 226     Backward Size is not zero. Store the offset of the first byte of
 227     the Backward Size; you will need it later.
 228
 229     Now you know the Total Size of the last Block of the Stream. It's the
 230     value of Backward Size plus the size of the Backward Size field. Note
 231     that you cannot use lzma_vli_size() to calculate the size since there
 232     might be padding; you need to use the real observed size of the
 233     Backward Size field.
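
         The sum can be computed as in the sketch below. The function
         name and signature are assumptions made for illustration; the
         field size passed in is the number of bytes the Backward Size
         field was observed to occupy in the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the Total Size of the last Block from
 * the decoded Backward Size and the observed size of the Backward
 * Size field. Returns false for a zero Backward Size, a zero-sized
 * field, or a sum that would overflow uint64_t.
 */
bool
last_block_total_size(uint64_t backward_size, size_t field_size,
		uint64_t *total_size)
{
	assert(total_size != NULL);

	if (backward_size == 0 || field_size == 0)
		return false;

	if (field_size > UINT64_MAX - backward_size)
		return false;

	*total_size = backward_size + field_size;
	return true;
}
```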
 234
 235     At this point, the operation continues differently for Single-Block
 236     and Multi-Block Streams.
 237
 238
 239 y.x.1. Single-Block Stream
 240
 241     There might be an Uncompressed Size field present in the Stream Footer.
 242     You cannot know it for sure unless you have already parsed the Block
 243     Header earlier. For security reasons, you probably want to try to
 244     decode the Uncompressed Size field, but you must not indicate any
 245     error if decoding fails. Later you can give the decoded Uncompressed
 246     Size to the Block decoder if Uncompressed Size isn't otherwise known;
 247     this prevents it from producing too much output in case of a (possibly
 248     intentionally) corrupt file.
 249
 250     Calculate the start offset of the Stream:
 251
 252         backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE
 253
 254     backward_offset is the offset of the first byte of the Backward Size
 255     field. Remember to check for integer overflows, which can occur with
 256     invalid input files.
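
         Since the offsets are unsigned, the overflow to guard against
         here is wrap-around in the subtractions. A minimal sketch
         follows; the function name is hypothetical, and
         STREAM_HEADER_SIZE is a local stand-in for
         LZMA_STREAM_HEADER_SIZE so that the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for LZMA_STREAM_HEADER_SIZE (11 bytes, per this document). */
#define STREAM_HEADER_SIZE 11

/*
 * Hypothetical helper: compute the start offset of a Single-Block
 * Stream as backward_offset - backward_size - STREAM_HEADER_SIZE.
 * Returns false if either subtraction would wrap around, which means
 * that the input file is invalid.
 */
bool
single_block_stream_start(uint64_t backward_offset, uint64_t backward_size,
		uint64_t *stream_start)
{
	assert(stream_start != NULL);

	if (backward_size > backward_offset)
		return false;

	const uint64_t block_start = backward_offset - backward_size;
	if (block_start < STREAM_HEADER_SIZE)
		return false;

	*stream_start = block_start - STREAM_HEADER_SIZE;
	return true;
}
```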
 257
 258     Seek to the beginning of the Stream. Decode the Stream Header using
 259     lzma_stream_header_decoder(). Verify that the decoded Stream Flags
 260     match the values found in the Stream tail. You can use the
 261     lzma_stream_flags_is_equal() macro for this.
 262
 263     Decode the Block Header. Verify that it isn't a Metadata Block, since
 264     Single-Block Streams cannot have Metadata. If Uncompressed Size is
 265     present in the Block Header, the value you tried to decode from the
 266     Stream Footer must be ignored, since Uncompressed Size wasn't actually
 267     present there. If Block Header doesn't have Uncompressed Size, and
 268     decoding the Uncompressed Size field from the Stream Footer failed,
 269     the file is corrupt.
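
         The rule for choosing between the two possible sources of
         Uncompressed Size can be written down as a small decision
         function. This is only a sketch of the logic described above;
         the function and its parameters are hypothetical and do not
         correspond to any liblzma API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the Uncompressed Size of a Single-Block
 * Stream. The Block Header value wins when present, in which case the
 * value tentatively decoded from the Stream Footer is ignored. If the
 * Block Header has no size and decoding from the Stream Footer failed,
 * the file is corrupt and false is returned.
 */
bool
resolve_uncompressed_size(bool header_has_size, uint64_t header_size,
		bool footer_decoded, uint64_t footer_size, uint64_t *result)
{
	assert(result != NULL);

	if (header_has_size) {
		*result = header_size;
		return true;
	}

	if (footer_decoded) {
		*result = footer_size;
		return true;
	}

	return false;
}
```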
 270
 271     If you were only looking for the Uncompressed Size of the Stream,
 272     you now have that information, and you can stop processing the Stream.
 273
 274     To decode the Block, the same instructions apply as described in
 275     FIXME. However, because you have some extra known information decoded
 276     from the Stream Footer, you should give this information to the Block
 277     decoder so that it can verify it while decoding:
 278       - If Uncompressed Size is not present in the Block Header, set
 279         lzma_options_block.uncompressed_size to the value you decoded
 280         from the Stream Footer.
 281       - Always set lzma_options_block.total_size to backward_size +
 282         size_of_backward_size (you calculated this sum earlier already).
 283
 284
 285 y.x.2. Multi-Block Stream
 286
 287     Calculate the start offset of the Footer Metadata Block:
 288
 289         backward_offset - backward_size
 290
 291     backward_offset is the offset of the first byte of the Backward Size
 292     field. Remember to check for integer overflows, which can occur with
 293     broken input files.
 294
 295     Decode the Block Header. Verify that it is a Metadata Block. Set
 296     lzma_options_block.total_size to backward_size + size_of_backward_size
 297     (you calculated this sum earlier already). Then decode the Footer
 298     Metadata Block.
 299
 300     Store the decoded Footer Metadata in the lzma_info structure using
 301     lzma_info_set_metadata(). Also set the offset of the Backward Size
 302     field using lzma_info_size_set(). Then you can get the start offset
 303     of the Stream using lzma_info_size_get(). Note that any of these steps
 304     may fail so don't omit error checking.
 305
 306     Seek to the beginning of the Stream. Decode the Stream Header using
 307     lzma_stream_header_decoder(). Verify that the decoded Stream Flags
 308     match the values found in the Stream tail. You can use the
 309     lzma_stream_flags_is_equal() macro for this.
 310
 311     If you were only looking for the Uncompressed Size of the Stream,
 312     it's possible that you already have it now. If Uncompressed Size (or
 313     whatever information you were looking for) isn't available yet,
 314     continue by also decoding the Header Metadata Block. (If some
 315     information is missing, the Header Metadata Block has to be present.)
 316
 317     Decoding the Data Blocks goes the same way as described in FIXME.
 318
 319
 320 y.x.3. Variations
 321
 322     If you know the offset of the beginning of the Stream, you may want
 323     to parse the Stream Header before parsing the Stream tail.
 324