Home > Fun, HowTo, PL/SQL > (PL/SQL) CFBF File Reader

(PL/SQL) CFBF File Reader

In this post, I’ll be presenting a PL/SQL utility to read Compound File Binary Format (CFBF).
CFBF is originally a Microsoft proprietary format, which has now been published through the Open Specifications program.

In a nutshell, a CFBF file, or Compound Document File (CDF), is a container assembled like a FAT filesystem, which can be explored as a regular directory structure.

The purpose of the post is not to discuss about the format itself so I’ll let the reader browse the specs to get more familiar with it :

CFBF is the basis for various file formats used in Windows systems.
To cite a few, .MSI (Microsoft Windows Installer) and the well-known .DOC, .XLS, .PPT etc. from the Office 97-2003 Suite.

So, why am I interested in developing a CFBF PL/SQL reader?

  1. Because it’s fun
  2. Because password-encrypted Open Office XML files (.xlsx, .docx …) are stored in a CFBF container, and I thought it would be a good enhancement to my ExcelTable interface.

Source code available on GitHub :

/mbleron/oracle/CDFReader

 

1- PL/SQL Implementation

XUTL_CDF package is my implementation of the CFBF reader.
The main functionality is to extract stream(s) from the container as BLOB, based on their paths within the internal directory structure.

API description :
open_file
function open_file (
  p_file in blob
)
return cdf_handle;

Opens a CFBF file contained in the BLOB argument and returns a context handle.

close_file
procedure close_file (
  p_hdl in cdf_handle
);

Closes the context referenced by the handle argument.

Exceptions :
INVALID_HANDLE – if the handle does not reference a valid CDF context.

get_stream Overload 1 :

function get_stream (
  p_hdl    in cdf_handle
, p_path   in varchar2
, p_offset in integer default 0
)
return blob;

Extracts a single stream as BLOB from a compound file (context handle), using its path.
An optional p_offset argument is provided to extract the stream starting at that specific offset (zero-based).

Exceptions :
INVALID_HANDLE – if the handle does not reference a valid CDF context.
NO_STREAM_FOUND – if the specified path does not exist in the compound file directory.

get_stream Overload 2 :

function get_stream (
  p_file   in blob
, p_path   in varchar2
, p_offset in integer default 0
)
return blob;

Extracts a single stream as BLOB from a compound file (p_file), using its path.
An optional p_offset argument is provided to extract the stream starting at that specific offset (zero-based).

Exceptions :
NO_STREAM_FOUND – if the specified path does not exist in the compound file directory.

get_streams
function get_streams (
  p_file    in blob
, p_pattern in varchar2 default '*'
)
return entry_list_t pipelined;

Extracts all streams from the compound file p_file (BLOB), if their paths match the specified pattern argument (defaults to ‘*’).
The pattern may be a regular expression supported by Oracle’s implementation (REGEXP_LIKE function is used internally).
Must be used in conjunction with the TABLE operator.

is_cdf
function is_cdf (
  p_file in blob
) 
return boolean;

Returns true if the CFBF header signature is found in the input file, false otherwise.

The 2nd overload of get_stream function is a shorthand when we only need to extract a single stream from the compound file.
When extracting more than one stream, use open-get-close steps.

 

2- Examples

Extracting a single stream :

select xutl_cdf.get_stream(
         file2blob('TMP_DIR','sample_word.doc')
       , '/Data'
       )
from dual;

Extracting multiple streams :

DECLARE

  file     blob := file2blob('TMP_DIR','sample.dat');
  hdl      xutl_cdf.cdf_handle;
  stream1  blob;
  stream2  blob;
  
BEGIN
  
  hdl := xutl_cdf.open_file(file);
  stream1 := xutl_cdf.get_stream(hdl, '/Folder1/SomeStream');
  stream2 := xutl_cdf.get_stream(hdl, '/Folder1/SomeFolder/OtherStream');
  xutl_cdf.close_file(hdl);
  
END;
/

Projecting the content as a table :

SQL> select *
  2  from table(
  3         xutl_cdf.get_streams(
  4           file2blob('TMP_DIR','sample_word.doc')
  5         )
  6       )
  7  ;
 
PATH                                                 STREAM_SIZE CREATION_TIME MODIFIED_TIME STREAM
---------------------------------------------------- ----------- ------------- ------------- ------
/CompObj                                                    114                             <BLOB>
/DocumentSummaryInformation                                5500                             <BLOB>
/SummaryInformation                                         556                             <BLOB>
/1Table                                                    92780                             <BLOB>
/Data                                                    4141178                             <BLOB>
/MsoDataStore/ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ==/Item                  205                             <BLOB>
/MsoDataStore/ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ==/Properties            341                             <BLOB>
/ObjectPool/_1281533281/CompObj                              77                             <BLOB>
/ObjectPool/_1281533281/Ole                                  20                             <BLOB>
/ObjectPool/_1281533281/Ole10ItemName                         5                             <BLOB>
/ObjectPool/_1281533281/Ole10Native                        8964                             <BLOB>
/ObjectPool/_1281533281/ObjInfo                               6                             <BLOB>
/ObjectPool/_1412434795/CompObj                             117                             <BLOB>
/ObjectPool/_1412434795/Ole                                  20                             <BLOB>
/ObjectPool/_1412434795/ObjInfo                               6                             <BLOB>
/ObjectPool/_1412434795/DocumentSummaryInformation          336                             <BLOB>
/ObjectPool/_1412434795/SummaryInformation                  224                             <BLOB>
/ObjectPool/_1412434795/Workbook                          128971                             <BLOB>
/WordDocument                                             253868                             <BLOB>
 
19 rows selected.
SQL> select path, stream
  2  from table(
  3         xutl_cdf.get_streams(
  4           file2blob('TMP_DIR','crypto.xlsx')
  5         , '^/Encrypt'
  6         )
  7       )
  8  ;
 
PATH                 STREAM
-------------------- ------
/EncryptedPackage    <BLOB>
/EncryptionInfo      <BLOB>
 

 

Advertisements
  1. David Blake
    January 10, 2018 at 20:13

    Hi there,

    I’m also trying to understand CFBs. One thing I’ve never found out is; what is the meaning of the string in the MsoDataStore? i.e the ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ??

    In your example:
    /MsoDataStore/ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ==/Item
    /MsoDataStore/ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ==/Properties

    Could it be a SIgnatureProperty? Or is it stream locations (offsets)?

    • David Blake
      January 10, 2018 at 20:27

      Sorry should have added that to me the == (3d 3d) seems to imply padding for B64..

    • January 11, 2018 at 15:13

      Hi David,

      It’s not related to the CFB format itself.
      Searching the MSO-DOC specs for “MsoDataStore” redirects to MSO-OSHARED specs, §2.3.6 Custom XML Data Storage :

      The name of this storage MUST be “MsoDataStore”. Within this storage, zero or more sub-storages exist. The name of each of these sub-storages MUST be unique.

      So I take it the string is some kind of unique identifier, probably derived from a GUID.

  2. David Blake
    January 11, 2018 at 15:58

    Hi Marc,

    Thanks very much for your reply. If you are interested in US politics, I have an instance where it’s kind of important to find out :-)

    There is such a Sub-Storage in the documents altered by Guccifer2.0. (In case you don’t know: He’s allegedly the Russian Hacker behind the theft of the DNC emails).

    As far as I can tell this is the only identifying feature that can tie directly to him/her/them is the “unique name” of the sub-storage. So if it is tied to a GUID that would be really interesting! Any thoughts as to how to find out if it is? :-)

  3. David Blake
    January 11, 2018 at 16:00

    Oh and thanks for the link to MSO-Shared. I’ve had a good read and they don’t say how the name is derived. Just that it’s unique. That suggests you are right. As they know the GUID/MSID is unique then it would make sense to use that. But then again, two or more sub-storages are possible. They can’t use the same GUID for each …

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: