This chapter is an edited version of Matthew Russotto's page "Microsoft's HTML Help (.chm) format", which is available at http://www.speakeasy.org/~russotto/chm/chmformat.htm. It is used with his knowledge and permission.
This is documentation on the ITSF format used by MS HH. This format has been reverse engineered in the past, but as far as is known this is the first freely available documentation on it, other than the predecessor of this chapter. One Usenet message indicates that CHM files are actually IStorage files documented in the MS Platform SDK. However, such documentation has been searched for without success. A code sample (in Delphi) that shows how to get an IStorage object representing a CHM from an ITStorage object has been located outside MS. No reference to ITStorage has been found in the MSDN or anywhere else. The DLL used to implement this format is itss.dll, and its resource information indicates that it is MS' InfoTech Storage System Library.
The word "section" is badly overloaded in this document. Sorry about that.
All numbers are in decimal unless otherwise indicated in the text. Hex numbers are indicated by 0x. All values within the ITSF file are Intel byte order (little endian) unless indicated otherwise.
The ITSF file begins with a short (0x38 byte) initial header. This is followed by the header section table, the offset to the content, and a number of bytes of information of unknown use. Collectively, this is the "header".
The header is followed by the header sections. There are two header sections. One header section is the file directory, the other contains the ITSF file length and some unknown data. Immediately following the header sections is the content. The image below shows the structure of the ITSF files.
The header starts with the initial header, which has the following format:
Offset | Type | Comment/Value |
0 | char[4] | ITSF |
4 | DWORD | 2 or 3 (version number) |
8 | DWORD | 0x58 in version 2 files, 0x60 in version 3 files (total header length, including header section table and following data) |
0xC | DWORD | 1 (unknown) |
0x10 | DWORD | Unknown checksum. |
0x14 | DWORD | LCID of the OS at the time of compilation, not the one stored in the HHP file. It is unknown whether or not this is the system LCID (from GetSystemDefaultLCID), the user LCID (from GetUserDefaultLCID) or the thread LCID (from GetThreadLocale). It is likely to be GetUserDefaultLCID since that is the function that itss.dll depends on. If you have the facility to check this please let us know of the result. |
0x18 | GUID | {7C01FD10-7BAA-11D0-9E0C-00A0C922E6EC} |
0x28 | GUID | {7C01FD11-7BAA-11D0-9E0C-00A0C922E6EC} |
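As an illustration of the table above, the following is a minimal Python sketch of decoding the initial header. The function name `parse_itsf_initial_header` and the dictionary keys are made up for this example, and it assumes the two GUIDs are stored in the usual Windows binary GUID layout (which is what `uuid.UUID(bytes_le=...)` expects).

```python
import struct
import uuid

def parse_itsf_initial_header(data: bytes) -> dict:
    """Decode the 0x38-byte initial header at the start of an ITSF file."""
    if len(data) < 0x38:
        raise ValueError("need at least 0x38 bytes")
    # Signature plus five little-endian DWORDs (offsets 0 to 0x17).
    signature, version, header_len, unknown_c, checksum, lcid = struct.unpack_from(
        "<4s5I", data, 0
    )
    if signature != b"ITSF":
        raise ValueError("not an ITSF file")
    return {
        "version": version,
        "header_length": header_len,
        "unknown_c": unknown_c,
        "checksum": checksum,
        "lcid": lcid,
        # Assumption: GUIDs use the standard Windows mixed-endian layout.
        "guid_1": uuid.UUID(bytes_le=data[0x18:0x28]),
        "guid_2": uuid.UUID(bytes_le=data[0x28:0x38]),
    }
```

A caller would typically read the first 0x38 bytes of the file and pass them in; the returned version number then decides whether the total header is 0x58 or 0x60 bytes long.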
It is followed by the header section table, which is 2 entries, where each entry is 0x10 bytes long and has this format:
Offset | Type | Comment/Value |
0 | QWORD | Offset of header section from beginning of ITSF file |
8 | QWORD | Length of header section |
In version 3 files, the header section table is followed by 8 bytes of additional header data. In version 2 files, this data is absent and the content section starts immediately after the directory.
Offset | Type | Comment/Value |
0 | QWORD | Offset within ITSF file of content section 0 |
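The header section table and the trailing content offset could be read roughly as follows. This is only a sketch: `parse_header_tail` is a hypothetical helper, and it takes the version number from the initial header as an argument. For version 2 files it leaves the content offset as `None`, since the field is not stored there.

```python
import struct

def parse_header_tail(data: bytes, version: int) -> dict:
    """Decode the header section table at 0x38 and, if present, the content offset."""
    sections = []
    for i in range(2):
        # Each table entry is two little-endian QWORDs: offset and length.
        offset, length = struct.unpack_from("<QQ", data, 0x38 + i * 0x10)
        sections.append((offset, length))
    result = {"sections": sections, "content_offset": None}
    if version >= 3:
        # The 8 extra bytes follow the table only in version 3 files.
        (result["content_offset"],) = struct.unpack_from("<Q", data, 0x58)
    return result
```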
This section contains the total size of the ITSF file, and not much else.
Offset | Type | Comment/Value |
0 | DWORD | 0x01fe (unknown) |
4 | DWORD | 0 (unknown) |
8 | QWORD | ITSF File Size |
0x10 | DWORD | 0 (unknown) |
0x14 | DWORD | 0 (unknown) |
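Since the only known field in this section is the file size, a reader might use it as a simple sanity check. The sketch below uses a hypothetical helper, `parse_file_size_section`, that returns the QWORD at offset 8 and ignores the unknown fields.

```python
import struct

def parse_file_size_section(section: bytes) -> int:
    """Return the ITSF file size stored at offset 8 of this header section."""
    # Two unknown DWORDs come first, then the QWORD file size.
    _unknown_0, _unknown_4, file_size = struct.unpack_from("<IIQ", section, 0)
    return file_size
```

Comparing the result with the actual size of the file on disk is one inexpensive way to detect truncated files.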
This section is the central part of the ITSF file: a directory of the files and information it contains.
The directory starts with a header; its format is as follows:
Offset | Type | Comment/Value |
0 | char[4] | ITSP |
4 | DWORD | 1 (version number) |
8 | DWORD | 0x54 (directory header length) |
0xC | DWORD | 0x0a (unknown) |
0x10 | DWORD | 0x1000 (directory chunk size) |
0x14 | DWORD | Density of quickref section, usually 2. |
0x18 | DWORD | Depth of the directory tree: 1 if there is no index, 2 if there is one level of PMGI chunks … |
0x1C | DWORD | Chunk number of root index chunk. -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug.) |
0x20 | DWORD | Chunk number of first PMGL (listing) chunk |
0x24 | DWORD | Chunk number of last PMGL (listing) chunk |
0x28 | DWORD | -1 (unknown) |
0x2C | DWORD | Number of directory chunks (total) |
0x30 | DWORD | LCID. 0x409 (en-us) is the only one seen. If you have a non-US version of HHW, please change your system and HHP locales to something other than en-us and other than the locale of HHW, compile a CHM and check its LCID at offset 168. It probably comes from the program that compiled the ITSF, and definitely not from the HHP file or from the OS. It is unknown which EXE/DLL this LCID comes from, but at a guess it would be ITSS.DLL, which provides the following GUID. |
0x34 | GUID | {5D02926A-212E-11D0-9DF9-00A0C922E6EC} |
0x44 | DWORD | 0x54 (this is the length again) |
0x48 | DWORD | -1 (unknown) |
0x4C | DWORD | -1 (unknown) |
0x50 | DWORD | -1 (unknown) |
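The twelve DWORDs between the signature and the GUID can be unpacked in one call. In this sketch the field names in `ITSP_FIELDS` are invented labels for the table rows above, and the values are read as signed integers so that the "-1" markers come out as -1 rather than 0xFFFFFFFF.

```python
import struct

# Hypothetical labels for the DWORDs at offsets 4 through 0x30.
ITSP_FIELDS = (
    "version", "header_length", "unknown_c", "chunk_size",
    "quickref_density", "depth", "root_index_chunk",
    "first_listing_chunk", "last_listing_chunk", "unknown_28",
    "num_chunks", "lcid",
)

def parse_itsp_header(data: bytes) -> dict:
    """Decode the DWORD fields of the ITSP directory header."""
    if data[:4] != b"ITSP":
        raise ValueError("not an ITSP directory header")
    # Signed reads keep the -1 sentinel values recognisable.
    values = struct.unpack_from("<12i", data, 4)
    return dict(zip(ITSP_FIELDS, values))
```

With these values, directory chunk k would start at the header length plus k times the chunk size, measured from the start of the directory section.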
The header is directly followed by the directory chunks. There are two types of directory chunks - index chunks, and listing chunks. The index chunk will be omitted if there is only one listing chunk. A listing chunk has the following format:
Offset | Type | Comment/Value |
0 | char[4] | PMGL |
4 | DWORD | Length of free space and/or quickref area at end of directory chunk |
8 | DWORD | 0 (unknown) |
0xC | DWORD | Chunk number of previous listing chunk when reading directory in sequence (-1 if this is the first listing chunk) |
0x10 | DWORD | Chunk number of next listing chunk when reading directory in sequence (-1 if this is the last listing chunk) |
0x14 | Directory listing entries to quickref area. Sorted case-insensitively by filename. Consecutive entries do not necessarily have increasing offsets. |
The format of a directory listing entry is:
Offset | Type | Comment/Value |
0 | BYTE | length of name |
1 | BYTEs | name (UTF-8 encoded) |
+0 | ENCINT | content section |
+0 | ENCINT | offset |
+0 | ENCINT | length |
The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate). The length also refers to length of the file in the section after decompression.
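A listing chunk can be walked entry by entry using the layout described above. The following sketch follows this chapter's description literally (a one-byte name length, then three ENCINT values) and assumes that the DWORD at offset 4 marks where the entries stop. The names `read_encint` and `iter_listing_entries` are made up for the example.

```python
import struct

def read_encint(buf: bytes, pos: int):
    """Decode one ENCINT starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not (byte & 0x80):  # high bit clear: this was the last byte
            return value, pos

def iter_listing_entries(chunk: bytes):
    """Yield (name, content section, offset, length) for one PMGL chunk."""
    if chunk[:4] != b"PMGL":
        raise ValueError("not a listing chunk")
    (tail_len,) = struct.unpack_from("<I", chunk, 4)
    pos = 0x14                    # entries start right after the chunk header
    end = len(chunk) - tail_len   # assumption: entries end where the tail begins
    while pos < end:
        name_len = chunk[pos]
        name = chunk[pos + 1:pos + 1 + name_len].decode("utf-8")
        pos += 1 + name_len
        section, pos = read_encint(chunk, pos)
        offset, pos = read_encint(chunk, pos)
        length, pos = read_encint(chunk, pos)
        yield name, section, offset, length
```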
There are two kinds of file represented in the directory: user data and format related files. The files which are format-related have names which begin with '::', the user data files have names which begin with "/".
Between the chunk entries and the quickref entries is chunk length - ( num entries / n + !!( num entries % n ) ) * 2 bytes worth of free space. This usually contains the same data from the same offsets in the previous chunk, and can be zeroed out with no effect on the decoder and a slight increase in the compressibility of the file with zip/gzip/bzip2 and probably other compressors. The free space usually contains partial or junk chunk entries, free space and/or quickref entries.
The quickref area is written backwards from the end of the chunk. One quickref entry exists for every n entries in the file, where n is calculated as 1 + (1 << quickref density). So for density = 2, n = 5.
Offset | Type | Comment/Value |
Chunklen-2 | WORD | Number of entries in the chunk |
Chunklen-4 | WORD | Offset of entry n from entry 0 |
Chunklen-6 | WORD | Offset of entry 2n from entry 0 |
Chunklen-8 | WORD | Offset of entry 3n from entry 0 |
... |
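The relationship between the quickref density and the entries that receive a quickref offset can be sketched as below. Both function names are hypothetical, and the second function assumes that offsets are recorded for entries n, 2n, 3n and so on, as long as such an entry exists in the chunk.

```python
def quickref_interval(density: int) -> int:
    """Return n, the number of entries covered by each quickref entry."""
    return 1 + (1 << density)

def quickref_entry_indices(num_entries: int, density: int) -> list:
    """List the entry numbers (n, 2n, 3n, ...) that have a quickref offset."""
    n = quickref_interval(density)
    return list(range(n, num_entries, n))
```

For the usual density of 2, a chunk holding 12 entries would therefore carry quickref offsets for entries 5 and 10, in addition to the entry count word at the very end of the chunk.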
An index chunk has the following format:
Offset | Type | Comment/Value |
0 | char[4] | PMGI |
4 | DWORD | Length of quickref/free area at end of directory chunk |
8 | Directory index entries (to quickref/free area) |
The format of a directory index entry is as follows:
Offset | Type | Comment/Value |
0 | BYTE | length of name |
1 | BYTEs | name (UTF-8 encoded) |
+0 | ENCINT | directory listing chunk which starts with name |
When higher-level indexes exist (when the depth of the index tree is 3 or higher), presumably the upper-level indexes will contain the numbers of lower-level index chunks rather than listing chunks.
The quickref area in a PMGI chunk is the same as in a PMGL chunk.
An ENCINT is a variable-length integer. The high bit of each byte indicates "continued to the next byte". Bytes are stored most significant to least significant. So, for example, 0xEA 0x15 is (((0xEA&0x7F)<<7)|0x15) = 0x3515.
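The ENCINT rule translates into a few lines of code. This is a sketch rather than a reference implementation; `decode_encint` and `encode_encint` are names chosen for the example, and the encoder is included only to show the round trip.

```python
def decode_encint(buf: bytes, pos: int = 0):
    """Decode an ENCINT at pos; return (value, position after the ENCINT)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        # Each byte contributes its low 7 bits, most significant group first.
        value = (value << 7) | (byte & 0x7F)
        if not (byte & 0x80):  # high bit clear: no continuation
            return value, pos

def encode_encint(value: int) -> bytes:
    """Encode a non-negative integer as an ENCINT."""
    groups = [value & 0x7F]            # final byte has the high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes set the high bit
        value >>= 7
    return bytes(reversed(groups))
```

Decoding the two bytes 0xEA 0x15 from the example above gives 0x3515 and consumes both bytes.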
The content typically immediately follows the header sections, and is at the location indicated by the QWORD following the header section table. All content section 0 locations in the directory are relative to that point. The other content sections are stored within content section 0.
There exists in content section 0 and in the directory a file called "::DataSpace/NameList". This file contains the names of all the content sections. The format is as follows:
Offset | Type | Comment/Value |
0 | WORD | Length of file, in words |
2 | WORD | Number of entries in file |
4 | Entries to the EOF |
Each entry:
Offset | Type | Comment/Value |
0 | WORD | Length of name in words, excluding terminating NIL |
2 | WORDs | Double-byte characters |
+0 | WORD | 0 |
Yes, the names have a length word and are also null-terminated; sort of a belt-and-suspenders approach. The coding system is likely UTF-16 (little endian).
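Reading the name list could look like the sketch below. It assumes, as suggested above, that the names are UTF-16 little endian; `parse_name_list` is a hypothetical helper name. The index of each name in the returned list is the content section number used in the directory entries.

```python
import struct

def parse_name_list(data: bytes) -> list:
    """Return the content section names stored in ::DataSpace/NameList."""
    _file_words, count = struct.unpack_from("<HH", data, 0)
    pos, names = 4, []
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", data, pos)
        pos += 2
        # name_len counts 16-bit characters and excludes the terminator.
        names.append(data[pos:pos + 2 * name_len].decode("utf-16-le"))
        pos += 2 * name_len + 2  # skip the characters and the zero word
    return names
```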
The section names seen so far are "Uncompressed" and "MSCompressed".
"Uncompressed" is self-explanatory. The section "MSCompressed" is compressed with MS' LZX algorithm.
For each section other than 0, there exists a file called '::DataSpace/Storage/<Section Name>/Content'. This file contains the compressed and/or encrypted data for the section. So, conceptually, getting a file from a nonzero section is a multi-step process. First you must get the content file from section 0. Then you decompress (if appropriate) the section. Then you get the desired file from your decompressed section.
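The multi-step lookup can be outlined in code. In the sketch below everything is illustrative: `directory` is assumed to be a mapping from file name to a (content section, offset, length) tuple built from the listing chunks, `section_names` comes from the name list, and `decompress` is a caller-supplied function standing in for the real transform (for example an LZX decoder), which is not implemented here.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[int, int, int]  # (content section, offset, length)

def extract_file(
    name: str,
    directory: Dict[str, Entry],
    section_names: List[str],
    section0: bytes,
    decompress: Callable[[str, bytes], bytes],
) -> bytes:
    """Fetch one file, decompressing its content section first if needed."""
    section, offset, length = directory[name]
    if section == 0:
        data = section0
    else:
        section_name = section_names[section]
        # Step 1: get the section's Content file out of section 0.
        content_name = "::DataSpace/Storage/%s/Content" % section_name
        _, c_off, c_len = directory[content_name]
        # Step 2: transform it into the decompressed section.
        data = decompress(section_name, section0[c_off:c_off + c_len])
    # Step 3: slice the requested file out of the (decompressed) section.
    return data[offset:offset + length]
```

A real reader would cache the decompressed section, or decompress only the blocks that cover the requested range, instead of decompressing everything for each file.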
There are several other files associated with each section.
The ControlData file for a section contains 0x20 bytes of information on the compression. The information is partially known:
Offset | Type | Comment/Value |
0 | DWORD | 6 (unknown) |
4 | ASCII | LZXC (compression type identifier) |
8 | DWORD | 2 (possibly numeric code for LZX) |
0xC | DWORD | The Huffman reset interval in 0x8000-byte blocks |
0x10 | DWORD | The window size in 0x8000-byte blocks |
0x14 | DWORD | sometimes 2, sometimes 1, sometimes 0 (unknown) |
0x18 | DWORD | 0 (unknown) |
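The known fields of this table can be turned into byte counts as sketched below. `parse_control_data` is a hypothetical helper; the conversion of the window size into a power-of-two exponent assumes the window size is an exact power of two, which holds for the value of 2 blocks (0x10000 bytes, exponent 16).

```python
import struct

def parse_control_data(data: bytes) -> dict:
    """Decode the known fields of a section's LZX compression info."""
    (_unknown_0, ident, _unknown_8, reset_blocks,
     window_blocks, _unknown_14, _unknown_18) = struct.unpack_from("<I4s5I", data, 0)
    if ident != b"LZXC":
        raise ValueError("section is not LZX compressed")
    window_bytes = window_blocks * 0x8000
    return {
        "reset_interval_bytes": reset_blocks * 0x8000,
        "window_size_bytes": window_bytes,
        # Assumption: the window size is a power of two.
        "lzx_window_bits": window_bytes.bit_length() - 1,
    }
```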
The SpanInfo file for a section contains a QWORD holding the uncompressed length of the section.
The Transform/List file for a section appears to have been intended to contain a list of GUIDs belonging to methods of decompressing (or otherwise transforming) the section. However, it actually contains only half of the string representation of a GUID, apparently because it was sized for characters but contains wide characters.
The compressed sections are compressed using LZX, a compression method MS also uses for its cabinet files. To confirm this, check the second DWORD of compression info in the ControlData file for the section; it should be 'LZXC'. To decompress, first read the file "::DataSpace/Storage/<SectionName>/Transform/{7FC28940-9D31-11D0-9B27-00A0C91E9C7C}/InstanceData/ResetTable". This reset table has the following format:
Offset | Type | Comment/Value |
0 | DWORD | 2 (unknown - possibly a version number) |
4 | DWORD | Number of entries in reset table |
8 | DWORD | 8 (unknown) |
0xC | DWORD | 0x28 (length of table header - area before table entries) |
0x10 | QWORD | Uncompressed Length |
0x18 | QWORD | Compressed Length |
0x20 | QWORD | 0x8000 (block size for locations below) |
0x28 | QWORD | Offset in compressed data of nth block boundary in uncompressed data (first offset = 0) |
Repeat QWORD offsets to EOF |
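A sketch of reading the reset table follows. The helper names `parse_reset_table` and `compressed_offset_for` are invented for this example; the second one shows how the table maps a position in the uncompressed data to the compressed offset of the block boundary at or before it.

```python
import struct

def parse_reset_table(data: bytes) -> dict:
    """Decode the reset table header and its list of block offsets."""
    _unknown_0, count, _unknown_8, header_len = struct.unpack_from("<4I", data, 0)
    uncompressed_len, compressed_len, block_size = struct.unpack_from("<3Q", data, 0x10)
    # The QWORD offsets start right after the table header.
    offsets = list(struct.unpack_from("<%dQ" % count, data, header_len))
    return {
        "uncompressed_length": uncompressed_len,
        "compressed_length": compressed_len,
        "block_size": block_size,
        "offsets": offsets,
    }

def compressed_offset_for(table: dict, uncompressed_pos: int) -> int:
    """Return the compressed offset of the block containing uncompressed_pos."""
    return table["offsets"][uncompressed_pos // table["block_size"]]
```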
Now you can finally obtain the section (from its Content file). The window size for the LZX compression is 16 (decimal) on all the files seen so far. This is specified by the DWORD at 0x10 in the ControlData file (but note that DWORD gives the window size in 0x8000-byte blocks, not the LZX code for the window size).
There is one change from LZX as defined by MS: After each Huffman reset interval (defined in the ControlData file, but in practice equal to the window size) of compressed data is processed, the decoder state is partially reset: that is, the Huffman length tables are cleared and the one-bit preprocessing header is reread. The LZ window is not cleared.
The rule that the input bit-stream is to be re-aligned to a 16-bit boundary after 0x8000 output characters have been processed IS in effect, despite this LZX not being part of a CAB file. The reset table tells you when this was done, though there seems to be no need for that during decompression; you can just keep track of the number of output characters. Furthermore, while this does not appear to be documented in the LZX format, the compressed stream is padded to a 0x8000-byte boundary.