NSFileWrapper serializedRepresentation

06/09/2015
Simon Rodriguez

With the last update of Aquarii, I published a small utility app online that lets people extract data from their backup archives. The reason is that backups are generated using the NSFileWrapper class and its serializedRepresentation method[1]: this is an easy way to turn a directory and its contents into one big file without having to use external compression libraries. But what if the user wants to get back things that cannot be exported from the app, or the full-size images of the items list? So I developed a small OS X app to do this job, basically a simple wrapper around NSFileWrapper[2].
But what if the user is on Windows or Linux? NSFileWrapper is an Apple class, included in Foundation, and I needed it to provide deserialization. The only solution left was to understand how NSFileWrapper builds a serialized archive, and how to manually extract the data I wanted from it. My only goal was to extract the files and subdirectories from an archive while preserving the hierarchy and names. Nothing more. It is important to note that the real NSFileWrapper implementation is private and can be changed by Apple whenever they need to (it may even have changed already, with backward compatibility).
Thus a bit of reverse-engineering. I created a simple command line tool to serialize any directory. And started doing tests.

An empty directory

Let's create an empty directory, name it Test1 and serialize it. We get the following hex content:
72 74 66 64 00 00 00 00 03 00 00 00 02 00 00 00 13 00 00 00 5F 5F 40 50 72 65 66 65 72 72 65 64 4E 61 6D 65 40 5F 5F 17 00 00 00 5F 5F 40 55 54 46 38 50 72 65 66 65 72 72 65 64 4E 61 6D 65 40 5F 5F 0D 00 00 00 0D 00 00 00 01 00 00 00 05 00 00 00 54 65 73 74 31 01 00 00 00 05 00 00 00 54 65 73 74 31

Using Hex Fiend to examine the file, we notice a few meaningful strings, which we can substitute for their hex bytes:
rtfd 00 00 00 00 03 00 00 00 02 00 00 00 13 00 00 00 __@PreferredName@__ 17 00 00 00 __@UTF8PreferredName@__ 0D 00 00 00 0D 00 00 00 01 00 00 00 05 00 00 00 Test1 01 00 00 00 05 00 00 00 Test1

The rtfd string comes from the fact that NSFileWrapper is, among other things, built to generate rich text format directories, i.e. RTF files bundled with images, sounds, etc. in a package. We can notice that each time we have a human-readable string, the 4-byte number before it is equal to the length (in bytes) of the string. This gives us another hint: numbers are stored little-endian. The strings __@PreferredName@__ and __@UTF8PreferredName@__ refer to properties of an NSFileWrapper instance. In our case, we would expect both properties to have the value Test1. So there seems to be a first part where properties are declared, and a second part where their values are stored, in the same order. But we don't know yet if the same rule applies to files and subdirectories.
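To make the "4-byte little-endian length, then the string" rule concrete, here is a minimal Python sketch that reads the Test1 dump above. The helper names read_u32 and read_string are mine, not part of any API, and the starting offset of 16 is simply what we get by counting the bytes before the first length in the dump.

```python
import struct

# Hex dump of the serialized empty directory Test1, copied from above.
DUMP = bytes.fromhex(
    "72 74 66 64 00 00 00 00 03 00 00 00 02 00 00 00 "
    "13 00 00 00 5F 5F 40 50 72 65 66 65 72 72 65 64 4E 61 6D 65 40 5F 5F "
    "17 00 00 00 5F 5F 40 55 54 46 38 50 72 65 66 65 72 72 65 64 4E 61 6D "
    "65 40 5F 5F "
    "0D 00 00 00 0D 00 00 00 "
    "01 00 00 00 05 00 00 00 54 65 73 74 31 "
    "01 00 00 00 05 00 00 00 54 65 73 74 31"
)

def read_u32(buf, pos):
    """Read a 4-byte little-endian unsigned integer; return (value, new_pos)."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4

def read_string(buf, pos):
    """Read a 4-byte length, then that many bytes decoded as UTF-8."""
    length, pos = read_u32(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length

pos = 16  # skip "rtfd" and the three 4-byte values that follow it
first, pos = read_string(DUMP, pos)
second, pos = read_string(DUMP, pos)
names_end = pos
print(first, second)  # __@PreferredName@__ __@UTF8PreferredName@__

pos += 8                            # skip the two 0D 00 00 00 values
marker, pos = read_u32(DUMP, pos)   # the 01 00 00 00 in front of the value
value, pos = read_string(DUMP, pos)
print(marker, value)  # 1 Test1
```

Reading the bytes as big-endian instead would give absurd lengths (0x13000000 for the first one), which is a quick way to confirm the byte order.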

A directory with one plain text file

We add a plain text file to the directory (now named Test2), and we examine the result of the serialization. We know that the text file is called loremipsum.txt, that its content begins with "Donec" and ends with "elit.", and that its size is 616 bytes:
rtfd 00 00 00 00 03 00 00 00 04 00 00 00 size:14 loremipsum.txt size:19 __@PreferredName@__ size:23 __@UTF8PreferredName@__ 01 00 00 00 2E 70 02 00 00 0D 00 00 00 0D 00 00 00 32 00 00 00 01 00 00 00 68 02 00 00 Donec ... elit. 01 00 00 00 size:5 Test2 01 00 00 00 size:5 Test2 01 00 00 00 2A 00 00 00 01 00 00 00 size:14 loremipsum.txt 10 00 00 00 21 F4 E9 55 A4 01 00 00 00 00 00 00 00 00 00 00

We replaced strings and their lengths as before. We notice that the rule seems to be the same for files and properties. In the second part, we see that each value is preceded by 8 bytes: the first four are always 01000000, probably denoting the beginning of a value/file. For the value Test2, the next 4 bytes indicate its length. Checking this hypothesis on the text file, we read a value of 616 (68 02 00 00), which is indeed its size in bytes.
When looking at the very beginning of the file, we see that after adding a single file, the fourth 4-byte group has gone from 2 to 4. Maybe it is the number of files/properties? But it seems to be off by one. And what about the weird repetition of loremipsum.txt at the end of the file? There seems to be one extra item in the serialized representation. We know that NSFileWrapper can preserve permissions and other attributes of each file, so they might be stored here. But if there is one extra value, there must be one extra item in the list (first part of the file). If we read what follows __@UTF8PreferredName@__ using the same logic as before (size of name followed by name), we read size:1 and the name "." (byte 2E). We recognize the standard name for the current directory (as in bash). This explains the overcount and the content at the end of the file. As all I want to do is extract the file data, we won't explore this part further.
We now have:
rtfd 00 00 00 00 03 00 00 00 | items:4 | size:14 loremipsum.txt | size:19 __@PreferredName@__ | size:23 __@UTF8PreferredName@__ | size:1 . | 70 02 00 00 0D 00 00 00 0D 00 00 00 32 00 00 00 | Begin:file size:616 Donec ...
... elit. | Begin:file size:5 Test2 | Begin:file size:5 Test2 | Begin:file size:42 directory-listing

What about the segment in the middle? Let's convert it to four 4-byte little-endian integers:
624 | 13 | 13 | 50
Just to recall, the sizes for each file were:
616 | 5 | 5 | 42
We can notice that each time, we have number = size + 8. Each entry therefore gives the size of the content plus the size of its header, in bytes (two 4-byte groups: 01000000 and the content size). So we finally have:
rtfd 00 00 00 00 03 00 00 00 | items:4 | size:14 loremipsum.txt | size:19 __@PreferredName@__ | size:23 __@UTF8PreferredName@__ | size:1 . | item1:624B | item2:13B | item3:13B | item4:42B | Begin:file size:616 Donec...elit. | Begin:file size:5 Test2 | Begin:file size:5 Test2 | Begin:file size:42 directory-listing
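The layout we have so far (item count, list of names, then one size per item) can be checked with a short Python sketch. The bytes below are rebuilt from the breakdown above rather than read from a real file, and the helper names (u32, name, parse_header) are hypothetical. The meaning of the two 4-byte values after rtfd is still unknown at this point, so the sketch just skips them.

```python
import struct

def u32(n):
    """Pack n as a 4-byte little-endian unsigned integer."""
    return struct.pack("<I", n)

def name(s):
    """Pack a string as a 4-byte length followed by its UTF-8 bytes."""
    raw = s.encode("utf-8")
    return u32(len(raw)) + raw

# First part of the Test2 archive, rebuilt from the values listed above.
HEADER = (
    b"rtfd" + u32(0) + u32(3) + u32(4)
    + name("loremipsum.txt") + name("__@PreferredName@__")
    + name("__@UTF8PreferredName@__") + name(".")
    + u32(624) + u32(13) + u32(13) + u32(50)
)

def parse_header(buf):
    """Return ([(name, block_size), ...], offset_of_first_block)."""
    if buf[:4] != b"rtfd":
        raise ValueError("missing rtfd marker")
    pos = 12  # "rtfd" plus two 4-byte values we do not interpret here
    (count,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    names = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        names.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    sizes = struct.unpack_from("<%dI" % count, buf, pos)
    pos += 4 * count
    return list(zip(names, sizes)), pos

entries, data_start = parse_header(HEADER)
for entry_name, block_size in entries:
    # Each block is an 8-byte header (01000000 + content size) plus content.
    print(entry_name, block_size, block_size - 8)
```

Since every entry in the size table includes the 8-byte header, summing the entries gives the offset of each block relative to data_start, so a reader can jump straight to any item without scanning the ones before it.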

But a raw text file is a really simple type of file, small and easily readable. What if we added something more complex?

Adding a binary file

We add a PNG file, named imapicture.png, 100482 bytes in size. The root directory is renamed Test3.
The beginning is just as we would expect it to be:
rtfd 00 00 00 00 03 00 00 00 | items:5 | size:14 loremipsum.txt | size:19 __@PreferredName@__ | size:23 __@UTF8PreferredName@__ | size:14 imapicture.png | size:1 . | item1:624B | item2:13B | item3:13B | item4:103801B | item5:88B | ...

Except that the header+content size of the PNG seems huge.
Then come the content of the loremipsum.txt file and the two preferred-name strings, with the same 8-byte header each time, just as before. After that, we would expect to get the PNG part:
01 00 00 00 00 00 00 80 82 88 01 00 E7 0C 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00... and so on: 3303 bytes of 00 before the real content of the file begins.

NSFileWrapper seems to work by blocks for binary/big files. If we look at the beginning of this sequence we have: the usual value 1, then a value indicating that this file has some padding (00 00 00 80)[3], and two other values: 100482 and 3303. The first one is equal to the real size of the file, and the second one is equal to the number of 00 bytes used for the padding. We can also notice that 100482+3303+4*4 = 103801 [4], the size given in the first part of the serialized archive for the PNG file.
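Here is a hedged Python sketch of how a reader could locate the content of a block in both cases. It assumes the padded form is recognized by the second value being exactly 0x80000000, which is only what we observed; it may well be a flag bit or something else. The function name locate_content is hypothetical.

```python
import struct

PADDED_FLAG = 0x80000000  # the constant 00 00 00 80 seen in the dump

def locate_content(buf, pos):
    """Return (content_offset, content_size, block_size) for the block at pos."""
    marker, second = struct.unpack_from("<II", buf, pos)
    if marker != 1:
        raise ValueError("not a file block")
    if second == PADDED_FLAG:
        # Padded form: marker, flag, real size, padding length, zeros, content.
        size, padding = struct.unpack_from("<II", buf, pos + 8)
        return pos + 16 + padding, size, 16 + padding + size
    # Plain form: marker, content size, content.
    return pos + 8, second, 8 + second

# The 16 header bytes of the PNG block, copied from the dump above.
PNG_HEADER = bytes.fromhex("01 00 00 00 00 00 00 80 82 88 01 00 E7 0C 00 00")
print(locate_content(PNG_HEADER, 0))  # (3319, 100482, 103801)

# A plain block, as seen for the Test2 value.
PLAIN = struct.pack("<II", 1, 5) + b"Test2"
print(locate_content(PLAIN, 0))  # (8, 5, 13)
```

The third value returned is the one that should match the entry in the size table, which gives a cheap consistency check while reading.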

Now we have a pretty good understanding of how files are stored in a NSFileWrapper serialized representation. We have enough information to be able to read the archive using a buffer, detecting files, their name, size, if they use padding or not, and to extract their data and write it to disk.

But what if we have subdirectories?

With a subdirectory

We now create a Test4 directory. Inside it, we put our previous Test3 directory, renamed SubTest3, and we add another text-based file at the root, 1984.txt.

The beginning is as usual, and the SubTest3 directory is registered as any other item.
rtfd 00 00 00 00 03 00 00 00 | items:6 | ... | size:8 SubTest3 | ...

And here is the beginning of the corresponding content:
03 00 00 00 03 00 00 00 0E 00 00 00 69 6D 61 70 69 63 74 75 72 65 2E 70 6E 67 01 00 00 00 2E 0E 00 00 00 6C 6F 72 65 6D 69 70 73 75 6D 2E 74 78 74 B5 92 01 00 58 00 00 00 70 02 00 00...

We notice that this doesn't begin with 01000000, but with 03000000. And the following 4-byte number can't be the length to read, as its value is also 3. So NSFileWrapper uses another rule when dealing with subdirectories. A reasonable idea is that it performs serialization recursively: it first serializes the subdirectory before including it in the main archive.

Let's apply what we learned in the previous section to this subdirectory data:
Begin:directory | items:3 | size:14 imapicture.png | size:1 . | size:14 loremipsum.txt | item1:103093B | item2:88B | item3:624B | Begin:file usesPadding size:100482B padding:2595B ...| Begin:file size:80B directory-listing | Begin:file size:616 Donec...elit.

This works! We can now provide deserialization of files and directories, respecting names and hierarchy, without relying on NSFileWrapper.
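Putting the pieces together, a recursive reader could look like the Python sketch below. It is not the code of my extraction tool, just an illustration resting on several assumptions: that the 03 00 00 00 following rtfd 00 00 00 00 at the top level is the same directory marker as in a subdirectory, that the size-table entry of a subdirectory covers its whole nested block, and that padding is detected by the 0x80000000 value. All function names are hypothetical, and pack_file / pack_directory exist only to build a tiny synthetic input.

```python
import struct
from pathlib import Path

SKIPPED = {"__@PreferredName@__", "__@UTF8PreferredName@__", "."}
PADDED_FLAG = 0x80000000

def u32(buf, pos):
    return struct.unpack_from("<I", buf, pos)[0]

def read_item(buf, pos):
    """Return bytes for a file block, or a dict for a directory block."""
    marker = u32(buf, pos)
    if marker == 3:
        return read_directory(buf, pos)
    if marker != 1:
        raise ValueError("unknown marker %d" % marker)
    size, start = u32(buf, pos + 4), pos + 8
    if size == PADDED_FLAG:  # assumed: this value announces padding
        size, padding = u32(buf, pos + 8), u32(buf, pos + 12)
        start = pos + 16 + padding
    return buf[start:start + size]

def read_directory(buf, pos):
    """Parse marker 3, item count, names, sizes, then each item block."""
    if u32(buf, pos) != 3:
        raise ValueError("expected a directory marker")
    count = u32(buf, pos + 4)
    pos += 8
    names = []
    for _ in range(count):
        length = u32(buf, pos)
        names.append(buf[pos + 4:pos + 4 + length].decode("utf-8"))
        pos += 4 + length
    sizes = [u32(buf, pos + 4 * i) for i in range(count)]
    pos += 4 * count
    tree = {}
    for name, size in zip(names, sizes):
        if name not in SKIPPED:  # properties and "." are not real files
            tree[name] = read_item(buf, pos)
        pos += size
    return tree

def read_archive(buf):
    if buf[:4] != b"rtfd":
        raise ValueError("missing rtfd marker")
    return read_directory(buf, 8)

def write_tree(tree, root):
    """Write the parsed tree to disk, keeping names and hierarchy."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for name, value in tree.items():
        if isinstance(value, dict):
            write_tree(value, root / name)
        else:
            (root / name).write_bytes(value)

# Synthetic input built with the same layout, for illustration only.
def pack_file(data):
    return struct.pack("<II", 1, len(data)) + data

def pack_directory(entries):
    names = b"".join(
        struct.pack("<I", len(n.encode())) + n.encode() for n in entries)
    sizes = b"".join(struct.pack("<I", len(b)) for b in entries.values())
    blocks = b"".join(entries.values())
    return struct.pack("<II", 3, len(entries)) + names + sizes + blocks

sub = pack_directory({"b.txt": pack_file(b"world")})
sample = b"rtfd" + bytes(4) + pack_directory(
    {"a.txt": pack_file(b"hello"), "Sub": sub})
print(read_archive(sample))
```

Returning a nested dict first and writing to disk in a separate step keeps the parsing easy to test without touching the file system. The value stored under __@PreferredName@__ could be read the same way to name the root directory.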

Thus, mischief managed.




  1. NSFileWrapper class reference

  2. hum.

  3. we can infer this by testing with other files and noticing that those 4 bytes remain constant.

  4. content + padding + header