Advanced features of liblzma
----------------------------

Most developers need only the basic features of liblzma. These
features allow single-threaded encoding and decoding of .lzma files.

In some cases developers want more. The .lzma file format is
designed to allow multi-threaded encoding and decoding and limited
random-access reading. These features are possible in non-streamed
mode and, to a limited extent, also in streamed mode.

To take advantage of these features, the application needs a custom
.lzma file format handler. liblzma provides a set of tools to ease
this task, but it's still quite a bit of work to get a good custom
handler working.

Start by reading the .lzma file format specification. Understanding
the basics of the .lzma file structure is required to implement a
custom .lzma file handler and to understand the rest of this document.

2. The basic components

2.1. Stream Header and tail

Stream Header begins the .lzma Stream and Stream tail ends it. Stream
Header is defined in the file format specification, but Stream tail
isn't (thus I write "tail" with a lower-case letter). Stream tail is
simply the Stream Flags and the Footer Magic Bytes fields together.
liblzma does it this way because the Block coders take care of the
remaining fields of the Stream Footer.

For now, the size of Stream Header is fixed to 11 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
should use instead of a hardcoded number. Similarly, Stream tail
is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.

It is possible that a future version of the .lzma format will have
variable-sized Stream Header and tail. At the time of writing, this
seems so unlikely that it was considered simplest to use constants
instead of providing functions to get and store the sizes of the
Stream Header and tail.

For now, the size of Stream tail is fixed to 3 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
should use instead of a hardcoded number.

3. Keeping track of size information

The lzma_info_* functions found in <lzma/info.h> should ease the
task of keeping track of the sizes of the Blocks and of the Stream
as a whole. Using these functions is strongly recommended, because
there are surprisingly many situations where an error can occur,
and these functions check for possible errors every time some new
information becomes available.

If you find the lzma_info_* functions lacking something that you
would find useful, please contact the author.

3.1. Start offset of the Stream

If you are storing the .lzma Stream inside another file format, or
for some other reason are placing the .lzma Stream somewhere other
than the beginning of the file, you should tell liblzma the starting
offset of the Stream using lzma_info_start_offset_set().

The start offset of the Stream is used for two distinct purposes.
First, knowing the start offset of the Stream allows
lzma_info_alignment_get() to correctly calculate the alignment of
every Block. This information is given to the Block encoder, which
will calculate the size of Header Padding so that Compressed Data
is aligned at an optimal offset.
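
The arithmetic behind such a padding calculation can be sketched as
follows. This is only an illustration of the idea, not the code the
Block encoder uses; padding_for_alignment() is a hypothetical helper
that is not part of liblzma, and the offset passed in is assumed to
already include the start offset of the Stream.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma: return how many padding
 * bytes are needed so that a field that would otherwise begin at
 * `offset` begins at the next multiple of `alignment` instead.
 */
static uint64_t
padding_for_alignment(uint64_t offset, uint64_t alignment)
{
    assert(alignment > 0);

    const uint64_t misalignment = offset % alignment;
    return misalignment == 0 ? 0 : alignment - misalignment;
}
```

For example, if Compressed Data would begin at file offset 113 and
the filter prefers four-byte alignment, three bytes of padding move
it to offset 116.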

Another use for the start offset of the Stream is in random-access
reading. If you set the start offset of the Stream, lzma_info_locate()
will be able to calculate the offset relative to the beginning of the
file containing the Stream (instead of the offset relative to the
beginning of the Stream).
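
The conversion itself is a single addition, but it should be guarded
against overflow. A minimal sketch, assuming 64-bit offsets;
file_offset_from_stream_offset() is a hypothetical name and does not
show how lzma_info_locate() is implemented:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma: convert an offset that is
 * relative to the beginning of the Stream into an offset relative to
 * the beginning of the containing file. Returns false on overflow.
 */
static bool
file_offset_from_stream_offset(uint64_t stream_start,
        uint64_t offset_in_stream, uint64_t *file_offset)
{
    if (offset_in_stream > UINT64_MAX - stream_start)
        return false;

    *file_offset = stream_start + offset_in_stream;
    return true;
}
```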

3.2. Size of Stream Header

While the size of Stream Header is constant (11 bytes) in the current
version of the .lzma file format, this may change in the future.

3.3. Size of Header Metadata Block

This information is needed when doing random-access reading, and
to verify the value of this field stored in Footer Metadata Block.

3.4. Total Size of the Data Blocks

3.5. Uncompressed Size of Data Blocks

There are a few slightly different types of alignment issues when
working with .lzma files.

The .lzma format doesn't strictly require any kind of alignment.
However, if the encoder carefully optimizes the alignment in all
situations, it can improve compression ratio, speed of the encoder
and decoder, and slightly help if the files get damaged and need
to be recovered.

Alignment has the most significant effect on compression ratio. FIXME

x.1. Compression ratio

Some filters take advantage of the alignment of the input data.
To get the best compression ratio, make sure that you feed these
filters correctly aligned data.

Some filters (e.g. LZMA) don't necessarily mind too much if the
input doesn't match the preferred alignment. With these filters
the penalty in compression ratio depends on the specific type of
data being compressed.

Other filters (e.g. PowerPC executable filter) won't work at all
with data that is improperly aligned. While the data can still
be de-filtered back to its original form, the benefit of the
filtering (better compression ratio) is completely lost, because
these filters expect certain patterns at properly aligned offsets.
The compression ratio may even be worse with incorrectly aligned
input than without the filter.

x.1.1. Inter-filter alignment

When there are multiple filters chained, checking the alignment can
be useful not only with the input of the first filter and output of
the last filter, but also between the filters.

Inter-filter alignment is especially important with the Subblock
filter.

x.1.2. Further compression with external tools

This is a relatively rare situation in practice, but still worth
considering.

Let's say that there are several SPARC executables, which are each
filtered to separate .lzma files using only the SPARC filter. If
Uncompressed Size is written to the Block Header, the size of Block
Header may vary between the .lzma files. If no Padding is used in
the Block Header to correct the alignment, the starting offset of
the Compressed Data field will be differently aligned in different
files.

All these .lzma files are archived into a single .tar archive. Due
to the nature of the .tar format, every file is aligned inside the
archive to an offset that is a multiple of 512 bytes.

The .tar archive is compressed into a new .lzma file using the LZMA
filter with options that prefer input alignment of four bytes. Now
if the independent .lzma files don't have the same alignment of
the Compressed Data fields, the LZMA filter will be unable to take
advantage of the input alignment between the files in the .tar
archive, which reduces compression ratio.

Thus, even if you have only a single Block per file, it can be good
for compression ratio to align the Compressed Data to an optimal
offset.
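
A small worked example makes the effect concrete. The helper below
is hypothetical and the byte counts are made up for illustration;
only the 512-byte member alignment comes from the .tar format.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma: how far the Compressed
 * Data of an archive member is from the alignment that the outer
 * filter prefers. `member_offset` is where the member's data begins
 * in the .tar archive, and `bytes_before_data` is the number of bytes
 * (Stream Header plus Block Header) preceding Compressed Data.
 */
static unsigned
data_misalignment(uint64_t member_offset, uint64_t bytes_before_data,
        unsigned alignment)
{
    assert(alignment > 0);
    return (unsigned)((member_offset + bytes_before_data) % alignment);
}
```

With members at offsets 512 and 1536 and unpadded headers of 13 and
14 bytes, the residues modulo four are 1 and 2, so the two Compressed
Data fields are aligned differently. If both headers are padded to
16 bytes, both residues are 0.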

Most modern computers are faster when multi-byte data is located
at aligned offsets in RAM. Proper alignment of the Compressed Data
fields can slightly increase the speed of some filters.

Aligning every Block Header to start at an offset with big enough
alignment may ease or at least speed up recovery of broken files.

y. Typical usage cases

y.x. Parsing the Stream backwards

You may need to parse the Stream backwards if you need to get
information such as the sizes of the Stream, Index, or Extra.
The basic procedure to do this follows.

Locate the end of the Stream. If the Stream is stored as is in a
standalone .lzma file, simply seek to the end of the file and start
reading backwards using an appropriate buffer size. The file format
specification allows an arbitrary amount of Footer Padding (zero or
more NUL bytes), which you skip before trying to decode the Stream
tail.
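
Skipping the padding in one buffer can be sketched like this. The
function is a hypothetical helper, not a liblzma API, and it works
on a chunk that has already been read from the end of the file:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma: given the last `size`
 * bytes read from the file, return how many of them remain once the
 * trailing NUL bytes (Footer Padding) are dropped. A return value of
 * zero means the whole buffer was padding.
 */
static size_t
strip_footer_padding(const uint8_t *buf, size_t size)
{
    while (size > 0 && buf[size - 1] == 0x00)
        --size;

    return size;
}
```

If the function returns zero, read the preceding chunk of the file
and repeat. If the beginning of the file is reached this way, the
file contains nothing but NUL bytes and cannot hold a valid Stream.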

Once you have located the end of the Stream (a non-NUL byte), make
sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
Stream in a buffer. If there aren't enough bytes left in the file,
the file is too small to contain a valid Stream. Decode the Stream
tail using lzma_stream_tail_decoder(). Store the offset of the first
byte of the Stream tail; you will need it later.

You may now want to do some internal verifications, for example
whether the Check type is supported by the liblzma build you are
using.

Decode the Backward Size field with lzma_vli_reverse_decode(). The
field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
Backward Size is not zero. Store the offset of the first byte of
the Backward Size; you will need it later.

Now you know the Total Size of the last Block of the Stream. It's the
value of Backward Size plus the size of the Backward Size field. Note
that you cannot use lzma_vli_size() to calculate the size since there
might be padding; you need to use the real observed size of the
Backward Size field.
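
The sum is simple, but since both values come from the file it is
worth guarding. A minimal sketch; block_total_size() is a
hypothetical helper, and `field_size` is assumed to be the number of
bytes the Backward Size field actually occupied in the file:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma: compute the Total Size of
 * the last Block from the decoded Backward Size and the observed size
 * of the Backward Size field. Returns false if Backward Size is zero
 * or if the sum would overflow.
 */
static bool
block_total_size(uint64_t backward_size, uint64_t field_size,
        uint64_t *total_size)
{
    if (backward_size == 0)
        return false;

    if (field_size > UINT64_MAX - backward_size)
        return false;

    *total_size = backward_size + field_size;
    return true;
}
```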

At this point, the operation continues differently for Single-Block
and Multi-Block Streams.

y.x.1. Single-Block Stream

There might be an Uncompressed Size field present in the Stream
Footer. You cannot know it for sure unless you have already parsed
the Block Header earlier. For security reasons, you probably want to
try to decode the Uncompressed Size field, but you must not indicate
any error if decoding fails. Later you can give the decoded
Uncompressed Size to the Block decoder if Uncompressed Size isn't
otherwise known; this prevents it from producing too much output in
case of a (possibly intentionally) corrupt file.

Calculate the start offset of the Stream:

    backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE

backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
corrupt files.
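
With unsigned offsets, the failure mode of the subtraction is a
wraparound below zero, so each step needs a check. A sketch under
the assumption that offsets are 64-bit unsigned integers;
single_block_stream_start() is a hypothetical helper, and the local
constant only stands in for LZMA_STREAM_HEADER_SIZE so that the
example compiles on its own:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for LZMA_STREAM_HEADER_SIZE (11 bytes per this document). */
#define EXAMPLE_STREAM_HEADER_SIZE 11

/*
 * Hypothetical helper, not part of liblzma: compute the start offset
 * of a Single-Block Stream from the offset of the Backward Size field
 * and the decoded Backward Size. Returns false if the values cannot
 * belong to a valid Stream.
 */
static bool
single_block_stream_start(uint64_t backward_offset,
        uint64_t backward_size, uint64_t *stream_start)
{
    if (backward_size > backward_offset)
        return false;

    const uint64_t block_start = backward_offset - backward_size;
    if (block_start < EXAMPLE_STREAM_HEADER_SIZE)
        return false;

    *stream_start = block_start - EXAMPLE_STREAM_HEADER_SIZE;
    return true;
}
```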

Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found in the Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.

Decode the Block Header. Verify that it isn't a Metadata Block, since
Single-Block Streams cannot have Metadata. If Uncompressed Size is
present in the Block Header, the value you tried to decode from the
Stream Footer must be ignored, since Uncompressed Size wasn't actually
present there. If Block Header doesn't have Uncompressed Size, and
decoding the Uncompressed Size field from the Stream Footer failed,

If you were only looking for the Uncompressed Size of the Stream,
you now have that information, and you can stop processing the Stream.

To decode the Block, the same instructions apply as described in
FIXME. However, because you have some extra known information decoded
from the Stream Footer, you should give this information to the Block
decoder so that it can verify it while decoding:
- If Uncompressed Size is not present in the Block Header, set
  lzma_options_block.uncompressed_size to the value you decoded
  from the Stream Footer.
- Always set lzma_options_block.total_size to backward_size +
  size_of_backward_size (you calculated this sum earlier already).
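
The choice between the two possible sources of Uncompressed Size can
be sketched as a small decision function. Everything here is
hypothetical: the function name, the parameters, and the
EXAMPLE_SIZE_UNKNOWN sentinel are made up for illustration, and the
sketch only reports the case where neither source is usable instead
of deciding how to handle it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "Uncompressed Size is not known". */
#define EXAMPLE_SIZE_UNKNOWN UINT64_MAX

/*
 * Hypothetical helper, not part of liblzma: pick the Uncompressed Size
 * to pass to the Block decoder. A value in the Block Header always
 * wins, because then the field was not really present in the Stream
 * Footer and whatever was decoded from there must be ignored.
 */
static uint64_t
pick_uncompressed_size(bool in_block_header, uint64_t header_value,
        bool footer_decoded, uint64_t footer_value)
{
    if (in_block_header)
        return header_value;

    if (footer_decoded)
        return footer_value;

    return EXAMPLE_SIZE_UNKNOWN;
}
```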

y.x.2. Multi-Block Stream

Calculate the start offset of the Footer Metadata Block:

    backward_offset - backward_size

backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
corrupt files.

Decode the Block Header. Verify that it is a Metadata Block. Set
lzma_options_block.total_size to backward_size + size_of_backward_size
(you calculated this sum earlier already). Then decode the Footer
Metadata Block.

Store the decoded Footer Metadata in the lzma_info structure using
lzma_info_set_metadata(). Also set the offset of the Backward Size
field using lzma_info_size_set(). Then you can get the start offset
of the Stream using lzma_info_size_get(). Note that any of these steps
may fail, so don't omit error checking.

Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found in the Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.

If you were only looking for the Uncompressed Size of the Stream,
it's possible that you already have it now. If Uncompressed Size (or
whatever information you were looking for) isn't available yet,
continue by also decoding the Header Metadata Block. (If some
information is missing, the Header Metadata Block has to be present.)

Decoding the Data Blocks goes the same way as described in FIXME.

If you know the offset of the beginning of the Stream, you may want
to parse the Stream Header before parsing the Stream tail.