Home > Fun, HowTo, PL/SQL > (PL/SQL) CFBF File Reader

(PL/SQL) CFBF File Reader

In this post, I’ll be presenting a PL/SQL utility to read Compound File Binary Format (CFBF).
CFBF is originally a Microsoft proprietary format, which has now been published through the Open Specifications program.

In a nutshell, a CFBF file, or Compound Document File (CDF), is a container assembled like a FAT filesystem, which can be explored as a regular directory structure.

The purpose of the post is not to discuss about the format itself so I’ll let the reader browse the specs to get more familiar with it :

CFBF is the basis for various file formats used in Windows systems.
To cite a few, .MSI (Microsoft Windows Installer) and the well-known .DOC, .XLS, .PPT etc. from the Office 97-2003 Suite.

So, why am I interested in developing a CFBF PL/SQL reader?

  1. Because it’s fun
  2. Because password-encrypted Open Office XML files (.xlsx, .docx …) are stored in a CFBF container, and I thought it would be a good enhancement to my ExcelTable interface.

Source code available on GitHub :

/mbleron/oracle/CDFReader

 

1- PL/SQL Implementation

XUTL_CDF package is my implementation of the CFBF reader.
The main functionality is to extract stream(s) from the container as BLOB, based on their paths within the internal directory structure.

API description :
open_file
function open_file (
  p_file in blob
)
return cdf_handle;

Opens a CFBF file contained in the BLOB argument and returns a context handle.

close_file
procedure close_file (
  p_hdl in cdf_handle
);

Closes the context referenced by the handle argument.

Exceptions :
INVALID_HANDLE – if the handle does not reference a valid CDF context.

get_stream Overload 1 :

function get_stream (
  p_hdl    in cdf_handle
, p_path   in varchar2
, p_offset in integer default 0
)
return blob;

Extracts a single stream as BLOB from a compound file (context handle), using its path.
An optional p_offset argument is provided to extract the stream starting at that specific offset (zero-based).

Exceptions :
INVALID_HANDLE – if the handle does not reference a valid CDF context.
NO_STREAM_FOUND – if the specified path does not exist in the compound file directory.

get_stream Overload 2 :

function get_stream (
  p_file   in blob
, p_path   in varchar2
, p_offset in integer default 0
)
return blob;

Extracts a single stream as BLOB from a compound file (p_file), using its path.
An optional p_offset argument is provided to extract the stream starting at that specific offset (zero-based).

Exceptions :
NO_STREAM_FOUND – if the specified path does not exist in the compound file directory.

get_streams
function get_streams (
  p_file    in blob
, p_pattern in varchar2 default '*'
)
return entry_list_t pipelined;

Extracts all streams from the compound file p_file (BLOB), if their paths match the specified pattern argument (defaults to ‘*’).
The pattern may be a regular expression supported by Oracle’s implementation (REGEXP_LIKE function is used internally).
Must be used in conjunction with the TABLE operator.

is_cdf
function is_cdf (
  p_file in blob
) 
return boolean;

Returns true if the CFBF header signature is found in the input file, false otherwise.

The 2nd overload of get_stream function is a shorthand when we only need to extract a single stream from the compound file.
When extracting more than one stream, use open-get-close steps.

 

2- Examples

Extracting a single stream :

select xutl_cdf.get_stream(
         file2blob('TMP_DIR','sample_word.doc')
       , '/Data'
       )
from dual;

Extracting multiple streams :

DECLARE

  file     blob := file2blob('TMP_DIR','sample.dat');
  hdl      xutl_cdf.cdf_handle;
  stream1  blob;
  stream2  blob;
  
BEGIN
  
  hdl := xutl_cdf.open_file(file);
  stream1 := xutl_cdf.get_stream(hdl, '/Folder1/SomeStream');
  stream2 := xutl_cdf.get_stream(hdl, '/Folder1/SomeFolder/OtherStream');
  xutl_cdf.close_file(hdl);
  
END;
/

Projecting the content as a table :

SQL> select *
  2  from table(
  3         xutl_cdf.get_streams(
  4           file2blob('TMP_DIR','sample_word.doc')
  5         )
  6       )
  7  ;
 
PATH                                                 STREAM_SIZE CREATION_TIME MODIFIED_TIME STREAM
---------------------------------------------------- ----------- ------------- ------------- ------
/CompObj                                                    114                             <BLOB>
/DocumentSummaryInformation                                5500                             <BLOB>
/SummaryInformation                                         556                             <BLOB>
/1Table                                                    92780                             <BLOB>
/Data                                                    4141178                             <BLOB>
/MsoDataStore/ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ==/Item                  205                             <BLOB>
/MsoDataStore/ÐHÆÌÙÐ3ÌÙEOB2ÑQ5ÏÐÀÏÐQ==/Properties            341                             <BLOB>
/ObjectPool/_1281533281/CompObj                              77                             <BLOB>
/ObjectPool/_1281533281/Ole                                  20                             <BLOB>
/ObjectPool/_1281533281/Ole10ItemName                         5                             <BLOB>
/ObjectPool/_1281533281/Ole10Native                        8964                             <BLOB>
/ObjectPool/_1281533281/ObjInfo                               6                             <BLOB>
/ObjectPool/_1412434795/CompObj                             117                             <BLOB>
/ObjectPool/_1412434795/Ole                                  20                             <BLOB>
/ObjectPool/_1412434795/ObjInfo                               6                             <BLOB>
/ObjectPool/_1412434795/DocumentSummaryInformation          336                             <BLOB>
/ObjectPool/_1412434795/SummaryInformation                  224                             <BLOB>
/ObjectPool/_1412434795/Workbook                          128971                             <BLOB>
/WordDocument                                             253868                             <BLOB>
 
19 rows selected.
SQL> select path, stream
  2  from table(
  3         xutl_cdf.get_streams(
  4           file2blob('TMP_DIR','crypto.xlsx')
  5         , '^/Encrypt'
  6         )
  7       )
  8  ;
 
PATH                 STREAM
-------------------- ------
/EncryptedPackage    <BLOB>
/EncryptionInfo      <BLOB>
 

 

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: