A free template from Joomlashack

Cleaning metadata from OpenOffice.org documents
Written by Christopher Bumgarner   
Thursday, 07 August 2008

The word “metadata” seems to strike fear in many lawyers. Over the past few years, I have heard warning after warning about it, but few of those warnings come with any solutions. The fear of metadata is easy to explain: people fear the unknown. No one seems to know much about metadata: what it is, why we have it, and how to get rid of it. Don't worry. I will arm you with knowledge of metadata in OpenOffice.org documents, and with that knowledge you will defeat the fear of metadata.

What is metadata?

Technically speaking, “metadata” is “data about data.” When applied to office documents, metadata describes the document as a whole, rather than specific elements of the document. To see an example, open an OpenOffice.org document and open the properties window by going to File --> Properties. This window shows you some of the metadata in the current document, such as who created the document and when it was created. If you click on the other tabs in the window, you can see more metadata. Some of this information can be defined by the user; you can add a document title, subject, keywords and a description. You can also view statistics such as the number of words and paragraphs in the document. Notice that in the General tab of the properties window, most of the metadata cannot be changed. For example, you cannot erase the date and time the document was created. You can in fact erase this metadata, just not through this interface. I'll show you how later.

screenshot

 

This metadata may not look so scary at first. But there is more to the story. There are other elements of hidden information that can be easily uncovered, such as “tracked changes” and “annotations”. Keep in mind that every word processing document will contain metadata in one form or another, not just OpenOffice.org documents. The good thing about OpenOffice.org documents is that they use the Open Document Format, or ODF. The keyword here is “open”, meaning there are no secrets. Contrast this with MS Word documents prior to Word 2007. Those documents are kept in a closed, proprietary format that is not easily readable by most users. But if you use OpenOffice.org, there are no barriers between you and the data your document holds.

A little about ODF files

Before we get into the nitty-gritty, there are some details about ODF files that you should know about. First, OpenOffice.org files are not actually single files; they are multiple files that are compressed into one file. Let's take a peek inside one. If you use Linux, just right-click on any OpenOffice.org file and open it in an archive manager. It is not so easy to do on Windows; you have to play a little trick. First, make a backup copy of your document, then change the extension from “.odt” (for a text document) to “.zip”. This tricks Windows into thinking it is an ordinary zip file. Then you can double-click on the file to view its contents (don't forget to change the extension back to “.odt” when you are done). Here is what you should see.

screenshot
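
If you would rather not rename anything, the same peek inside the file can be sketched in a few lines of Python. This is only an illustration, not part of OpenOffice.org; the function name list_odf_members and the file name sample.odt are made up for the example.

```python
import zipfile


def list_odf_members(path):
    """Return the names of the files packed inside an ODF document."""
    # An .odt file is an ordinary zip archive, so it can be opened
    # directly without changing the extension.
    with zipfile.ZipFile(path) as odf:
        return odf.namelist()
```

Calling list_odf_members("sample.odt") on a text document should return names such as content.xml and meta.xml, along with the other files and directories shown in the screenshot.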

 

Your mileage may vary somewhat, depending on which archive manager you use. As you can see, the file actually contains several directories and several XML files. Never mind what XML stands for; the important thing to know is that it is human readable. That is, the information is stored in readable UTF-8 characters rather than binary mumbo jumbo. From this point, you can view the contents of any XML file simply by double-clicking on it if you use Windows; it will open the file in Internet Explorer. You can also extract all of the files using unzip, but that won't be necessary for this article. The two most important files for our purposes are content.xml and meta.xml. Not surprisingly, the content.xml file contains the document's contents, and the meta.xml file contains the metadata. Open the meta.xml file and here is what you will see.

screenshot

XML is a little bit like HTML in that information is nested in opening and closing tags, such as <office:meta> and </office:meta>. The tag with the slash is the closing tag. In the screenshot above, you can see what type of information is kept in this file. The generator tag tells you which program generated the file. As you can see in the above screenshot, the document was created with OpenOffice.org version 2.4 running on Linux. You can also see who created the document and when it was created in the <meta:initial-creator> and <meta:creation-date> tags respectively. The <dc:creator> and <dc:date> tags indicate the last person to edit the file and when it was last edited. The <meta:editing-cycles> tag indicates how many times the document has been edited, and <meta:editing-duration> tells us the length of time the document has been edited. My sample document has been edited for a whopping 41 seconds.
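
The tags described above can also be pulled out with a short script. The following is a minimal sketch using Python's standard library; the function name read_metadata is hypothetical, and the namespace addresses are the ones the ODF specification assigns to the office, meta and dc prefixes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace addresses behind the prefixes used in meta.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The tags discussed in the paragraph above.
FIELDS = [
    "meta:generator",
    "meta:initial-creator",
    "meta:creation-date",
    "dc:creator",
    "dc:date",
    "meta:editing-cycles",
    "meta:editing-duration",
]


def read_metadata(path):
    """Return a dict of the metadata tags found in the document's meta.xml."""
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("meta.xml"))
    found = {}
    for field in FIELDS:
        element = root.find("office:meta/" + field, NS)
        if element is not None:  # not every document carries every tag
            found[field] = element.text
    return found
```

Tags that are missing from a given document are simply left out of the result.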

A very interesting tag is the <meta:template> tag. This tag holds information about the template used to create the document. Not all documents are created from templates, but if one is, then some information about the template will be included in this tag. Notice that the location of the template is provided in the xlink:href attribute. If someone were to get hold of this information, they would have knowledge of my IT structure that could be used to infiltrate my system. It's not a big deal if the template is located in the default location (as mine is), but if you keep all of the firm's templates on your server, then a hacker might be able to use knowledge of your infrastructure against you.
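
To check whether one of your own documents gives away a template location, something like the sketch below could be used. The function name template_location is invented for this example; it simply reads the xlink:href attribute described above and returns None when the document has no <meta:template> tag.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def template_location(path):
    """Return the template path recorded in meta.xml, or None if absent."""
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("meta.xml"))
    template = root.find("office:meta/meta:template", NS)
    if template is None:
        return None  # the document was not created from a template
    return template.get(XLINK_HREF)
```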

Next are some tags for user-defined fields. You can define them in the properties window you saw earlier, but these are rarely used. Finally, you can see the document statistics at the very bottom.

Removing the Metadata

I'll let you decide how damaging this information could prove for the unwary. The good news is that all of this information is easily removed. While you have your .odt file open in your archive manager, simply delete the meta.xml file. Now, the information will no longer be retrievable. Keep in mind that if you edit the document after you delete the meta.xml file, OpenOffice.org will regenerate the file. Information such as the original creator and template path will not be regenerated, but the editing cycles and the last person to edit the file will be inserted into the new meta.xml file.
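
If you have many documents to clean, the deletion can be scripted. Python's zipfile module cannot delete an entry in place, so this sketch copies every file except meta.xml into a new document instead. The function name remove_meta is hypothetical, and like the manual method, the sketch leaves everything else in the archive untouched.

```python
import zipfile


def remove_meta(source, destination):
    """Copy an ODF file to destination, leaving out meta.xml."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(destination, "w") as dst:
        for item in src.infolist():
            if item.filename == "meta.xml":
                continue  # skip the metadata file
            # Reusing the original entry information keeps the files in
            # the same order and with the same compression as before.
            dst.writestr(item, src.read(item.filename))
```

Writing to a separate destination file means the original stays intact as a backup. Open the cleaned copy in OpenOffice.org once to confirm it still loads before you send it anywhere.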

Hidden Data

Metadata is not the only hidden danger in word processing documents. Other features can hold information in not-so-obvious places. Such features include tracked changes, comments, and notes. This information is not as easy to remove as the document metadata because it is not all kept in one place. Fortunately, the user has full control over these features and can remove the hidden data manually.

Tracked Changes

The first of these places I would like to cover is the “track changes” feature. This is a feature that all major word processors provide, and they all work basically the same way. In OpenOffice.org, you can activate the feature under the Edit --> Changes menu. Choose Record to activate the feature (you should then see a check mark next to the word 'Record'). Choose Show under the same menu and you can see the changes; new text will appear colored and underlined, while text that is deleted will appear colored with a strikethrough effect. Changes will appear in different colors depending on the user who made them. If you click on Show again the changes will be hidden and the document will look normal, but OpenOffice.org will still keep track of every change you make until you deactivate the feature. This can be a dangerous situation; the user may not be aware that OpenOffice.org is keeping track of the changes made.

These tracked changes are technically not “metadata”; they are not “data about data,” they are actual data. However, each change made comes with its own metadata. OpenOffice.org will not only store the change itself, but it will also store who made the change, when it was made, and what type of change was made.

Take a look at the next screenshot. Here you can see what tracked changes look like in the content.xml document. Text that is added is contained between <text:change-start> and <text:change-end> tags. Text that is deleted has a <text:change> tag in its place.

screenshot

No matter what the change, each change has a text:change-id attribute in the XML tag. This attribute acts as a unique ID that corresponds to the change's metadata elsewhere in the document (towards the beginning). Take a look at the next screenshot. This shot shows the code that keeps track of who made each change and when, and what type of change was made. Each change made will have its own <text:changed-region> tag with a text:id attribute that corresponds to the ID of the change itself. The first change listed in the screenshot is a format change as indicated by the <text:format-change> tag. The <office:change-info> tag contains nested tags about the user who made the change and when. The next change is a deletion as indicated by the <text:deletion> tag. The deletion tag not only indicates who made the deletion and when, but it also contains the text that was deleted, which you can see in the <text:p> tag. The next change is an insertion as indicated by the <text:insertion> tag.

screenshot
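
To see how much a stranger could learn from these tags, you can list every recorded change in a document. The sketch below walks the <text:changed-region> tags in content.xml and reports the ID, the type of change, who made it and when, plus the deleted text where there is any. The function name tracked_changes is invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def tracked_changes(path):
    """Return one dict per tracked change recorded in content.xml."""
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("content.xml"))
    changes = []
    for region in root.iter("{%s}changed-region" % NS["text"]):
        # The child tag names the kind of change:
        # insertion, deletion or format-change.
        for change in region:
            info = change.find("office:change-info", NS)
            if info is None:
                continue
            # Only deletions carry the removed text in <text:p> tags.
            removed = change.findall("text:p", NS)
            changes.append({
                "id": region.get("{%s}id" % NS["text"]),
                "type": change.tag.split("}")[1],
                "author": info.findtext("dc:creator", namespaces=NS),
                "date": info.findtext("dc:date", namespaces=NS),
                "deleted_text": "\n".join(
                    "".join(p.itertext()) for p in removed),
            })
    return changes
```

An empty list means the document holds no tracked changes, which makes the function a handy last check before a document leaves the office.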

So now that you know about the information that can be hidden with the tracked changes feature, we will talk about how to get rid of it. My first suggestion would be to never use this feature if you can help it. There is no need for it unless you are collaborating with someone else in drafting the document. Even then, don't turn on the feature until the first draft is done; otherwise your document will be a mess of nested changes. Only use it if you really need it.

To get rid of the tracked changes, first turn the feature off by clearing the check mark next to Edit --> Changes --> Record. Once you do that, click on Edit --> Changes --> Accept or Reject... You are then presented with a dialog box. In this box, you can accept or reject each individual change, or you can accept or reject all changes. If you accept all changes, your document will look just like it would if you had turned off “Show” under Edit --> Changes. Whether changes are accepted or rejected, all information about the change is destroyed. Not only will the changes themselves be gone, but also the change's metadata (who made the change and when).

Another feature of tracked changes is comments. A user can add comments to any change made. You can add a comment by placing your cursor anywhere inside the change and selecting Edit --> Changes --> Comment... You can view the comment in two ways. First, when you select Edit --> Changes --> Accept or Reject..., a dialog box opens that shows you each change made. The comments are in the far right-hand column. Second, you can hover your mouse pointer over the change, and a tooltip will appear. Under normal circumstances, the tooltip will only show you who made the change and when the change was made, but if you have Extended tips enabled, it will show you any comments made as well. (You can enable extended tips by opening Tools --> Options --> OpenOffice.org --> General, and checking Extended tips.)

Comments don't generate their own metadata. The comments will appear in the change's metadata, alongside the name of the person who made the original change and when. There is no way to tell who authored the comment. The good news is that getting rid of comments on changes is the same process as getting rid of the changes themselves. Just turn off the track changes feature and accept or reject all changes.

Notes

Similar to comments are notes. The difference is that you can insert notes anywhere—they are separate from tracked changes. To insert a note, go to Insert --> Note, type some text in the little text box that appears, and click OK. You will then see a small yellow box where the cursor is. If you hover your mouse pointer over the yellow box, the text of the note will appear. To edit the note, simply double-click on the yellow box. Notes are simple, and they are simple to remove as well. They can be deleted just like any text character with the delete or backspace key.
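
Small yellow boxes are easy to overlook in a long document, so a script can help you find any notes you missed. This sketch assumes that notes are stored in content.xml as <office:annotation> tags, which is how the ODF specification describes them; check it against one of your own documents before relying on it. The function name find_notes is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def find_notes(path):
    """Return one dict per note (assumed to be office:annotation tags)."""
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("content.xml"))
    notes = []
    for note in root.iter("{%s}annotation" % NS["office"]):
        paragraphs = note.findall("text:p", NS)
        notes.append({
            "author": note.findtext("dc:creator", namespaces=NS),
            "date": note.findtext("dc:date", namespaces=NS),
            "text": "\n".join("".join(p.itertext()) for p in paragraphs),
        })
    return notes
```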

Conclusion

While this is not a comprehensive study of every nook and cranny where metadata can hide, the information in this article should provide the vast majority of OOo users with the knowledge they need to combat inadvertent disclosure of secrets.

Last Updated ( Sunday, 17 August 2008 )
© 2008 Christopher Bumgarner. All rights reserved.