Manual recovery of DOC contents


You're trying to open a document but instead of getting your text on the screen, the program presents you with an error message. The file is unreadable, corrupted... what to do? Here's a small, practical "first-aid" guide that can actually be useful for many formats, although it's written for Microsoft Word DOC recovery in particular.
Note: I'm keeping this as simple as possible, so I'm skipping most technical details and may even make some oversimplified statements.

You may have a backup
Before thinking about recovery, check whether you have a hidden backup copy of your corrupted file. Programs such as Word and OpenOffice.org Writer often create backup files automatically, in the same directory as the original file. These backups may be hidden when you browse the directory: to make them visible, follow the instructions here. If you're using Windows in another language, search Google for "show hidden files" (in your language).
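
If you're comfortable running a small script, the sketch below lists files in a directory whose names look like backup, autosave or temporary copies, whether or not they are hidden. It is only a rough aid: the function name find_backup_candidates and the name patterns (".wbk", ".asd", ".bak", ".tmp", a leading "~", "Backup of ") are my own assumptions about what such files are commonly called, so adjust them to the programs you use.

```python
from pathlib import Path

# Assumed name patterns for backup/autosave/temporary files.
# These are guesses at common conventions, not an exhaustive list.
BACKUP_SUFFIXES = (".wbk", ".asd", ".bak", ".tmp")
BACKUP_PREFIXES = ("~", "Backup of ")


def find_backup_candidates(directory):
    """Return the sorted names of files in `directory` that look like backups.

    Path.iterdir() also returns hidden files, so nothing needs to be
    un-hidden first.
    """
    candidates = []
    for entry in Path(directory).iterdir():
        if not entry.is_file():
            continue
        name = entry.name
        if name.lower().endswith(BACKUP_SUFFIXES) or name.startswith(BACKUP_PREFIXES):
            candidates.append(name)
    return sorted(candidates)


if __name__ == "__main__":
    # "." means the current directory; replace it with the folder of your file.
    for name in find_backup_candidates("."):
        print(name)
```

Anything it prints is only a candidate: open a copy of it, not the file itself, to see whether it really contains your text.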

Automatic recovery
If you can't find a backup, don't angrily delete the corrupted file: it's time to recover anything you can. Running a search on Google for "doc recovery" will get you heaps of software products that claim they can recover your entire document in a few clicks. The problem with such products is that they're expensive, and don't always work. If you're going to use one, first download a free demo/trial version to see if it can handle your file. But before doing so, there's a good old manual approach you should try. I even suspect most of these programs use the same technique, which I will now describe.

Manual recovery: theory
Word documents, and many other formats, store the document text as "raw" letters (like the ones you produce with Notepad) with the additional information (fonts, colours, etc.) around them in "computer code". Small errors in this code may cause Word to choke on the document and refuse to open it, even though all the text itself is still in there. So what you need to do is open the file with a raw reader that doesn't interpret any code, but simply displays the raw contents of the file. (One such raw reading program is called Edxor.) This will give you your document's text with a lot of gibberish around it: the "computer code". After applying simple filtering techniques that remove all non-textual characters, you should be left with just your raw text. That means you will have lost all formatting, tables, images, etc., but at least you've saved your text! In theory it should be possible to fetch e.g. images out of the document as well, but this requires specific knowledge of the structure of the "computer code" gibberish, and is very difficult to do manually. For recovery of more than raw text you may want to have a look at the programs mentioned above.

edxor.png
Edxor showing a Microsoft Word DOC file, with header gibberish on top, followed by plain text contents, and below that footer gibberish (not visible due to scrolling)
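
For readers who prefer a script over a raw reader, here is a minimal sketch of the same idea: read the file as raw bytes without interpreting anything, and turn every byte that isn't a printable ASCII character into a space. The function names wipe_non_ascii and read_raw_text are made up for this illustration, and treating tabs and line breaks as "text" is my assumption. It is not a description of how Edxor works internally.

```python
def wipe_non_ascii(raw: bytes) -> str:
    """Replace every byte that is not printable ASCII with a space.

    Tabs, line feeds and carriage returns are kept (an assumption),
    so paragraph breaks have a chance of surviving.
    """
    keep = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    return "".join(chr(b) if b in keep else " " for b in raw)


def read_raw_text(path) -> str:
    """Read a file as raw bytes and return it with the gibberish blanked out."""
    with open(path, "rb") as f:  # "rb": no decoding, no interpretation
        return wipe_non_ascii(f.read())


if __name__ == "__main__":
    # Always work on a duplicate; "copy_of_corrupted.doc" is a placeholder name.
    print(read_raw_text("copy_of_corrupted.doc"))
```

The output still contains the header and footer remains, now mostly as stray characters between long runs of spaces; those have to be cut away by hand.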

Manual recovery: practice
If you came here frustrated, looking for a solution to your corrupt file problem, I hope I can help you. Here's a simple description of what to do:

  1. Make a duplicate copy of your corrupted file by right-clicking on it, choosing Copy, and then right-clicking on empty space (anywhere in a folder/desktop) and choosing Paste.
  2. Download the Edxor program from here and start it.
  3. In the Edxor menu use File, Open, to browse to the directory where the duplicate copy of your file is, and open the file.
    You will now see a lot of gibberish, but when scrolling down you should recognise some of your text.
  4. Simultaneously press Ctrl+A to select the entire text field, then use the menu Format, Wipe Non-ASCII, as a first filter.
    This should clear a lot of gibberish. Now you can clearly see your text, although it's probably cut apart by spaces.
  5. Select and delete all the parts of the file that do not contain text you want to recover. These gibberish parts to be deleted are usually at the top of the text field ("header") and at the bottom ("footer"); in between them is your text. Edxor works just like a normal text processor, so use your mouse and delete/backspace keys.

    In some cases you may now have the clean text: you're done. But with most DOC files you will have just your text, except that each letter is separated from the next by a space, and each word by two spaces.
  6. To get these spaces out, use the option Replace in Edxor's Search menu. Three search&replace runs must be performed:
    1. First, search for two spaces ("  ") and have them replaced by a random character which doesn't occur anywhere else in your text, e.g. "~".
      This marks the gaps between words, where spaces belong, so that those spaces won't be removed.
    2. Then, search for one space (" ") and have it replaced by nothing (leave the Replace field totally empty).
      This removes the spaces between letters of the same word.
    3. Finally, search for the random character used in step 1 (e.g. "~") and replace it by one space.
      This restores the spaces between words that were marked at step 1.
    Each time, use the option Replace all to avoid clicking your finger lame, and make sure to click OK when asked to "Search from top".
  7. Now you have your text in plain format. Save it, copy/paste it to your text editor, whatever you want.
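
The three search&replace runs from step 6 can also be written down as a few lines of code, which may make the trick with the marker character easier to follow. This is just a sketch: the function name collapse_spacing is hypothetical, and the check that refuses a marker already present in the text is my own addition.

```python
def collapse_spacing(text: str, marker: str = "~") -> str:
    """Undo the 'one space between letters, two between words' spacing."""
    if marker in text:
        # The marker must not occur in the text, or it would turn into a space.
        raise ValueError("marker already occurs in the text; pick another one")
    text = text.replace("  ", marker)  # run 1: mark the gaps between words
    text = text.replace(" ", "")       # run 2: remove the spaces between letters
    return text.replace(marker, " ")   # run 3: turn the marks back into spaces


if __name__ == "__main__":
    print(collapse_spacing("H e l l o  w o r l d"))  # prints: Hello world
```

The order of the runs matters: if the single spaces were removed first, the double spaces between words would disappear with them and all words would be glued together.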

fields.png
Note that the Search and Replace fields in the search&replace window are very sensitive. A single space (" ") is different from two spaces ("  "), and both are different from an empty field (""), although you cannot easily see spaces entered in the field. To see them, use your mouse to select all contents of the field, as shown in the image.


© 2006-2007 Arnoud Onnink: arnie[at]arnie.frih.net
