ExcelTable 2.0 : new support for Excel 97-2003 files (.xls)

April 1, 2018 Leave a comment

Here’s the new version of ExcelTable, which can now read old Excel 97-2003 (.xls) files.
I’ve decided to add this feature not so long ago when I realized there was still a lot of those files around, although they were superseded by OOX-based files (.xlsx) more than a decade ago.

This new version is labeled 2.0 as it now requires hard dependencies that were formerly optional :

The interface remains otherwise unchanged.

I also took the opportunity to migrate ExcelTable project to its own GitHub repository :

/mbleron/ExcelTable

The dependencies were migrated as well to :

/mbleron/MSUtilities

 

The XLS reader

My implementation of the XLS reader is based on the official Microsoft specs published on MSDN :
[MS-XLS]: Excel Binary File Format (.xls) Structure

For an overview :
Understanding the Excel .xls Binary File Format

I’ll just say a few words about the format itself because it’s very well explained in the specs.
Basically, an .xls file is a CFBF container whose ‘Workbook’ binary stream contains the workbook structure and data :

SQL> select *
  2  from table(xutl_cdf.get_streams(file2blob('TMP_DIR','ooxdata2c.xls')));
 
PATH                              STREAM_SIZE CREATION_TIME MODIFIED_TIME STREAM
--------------------------------- ----------- ------------- ------------- ------
/CompObj                                 122                             <BLOB>
/DocumentSummaryInformation              236                             <BLOB>
/SummaryInformation                      236                             <BLOB>
/Workbook                               71871                             <BLOB>
 

The Workbook content is composed of a sequence of records sharing a common header structure (type and record size) followed by the actual record data. This format is known as BIFF8.
The work of the XLS reader is to scan through those records and extract the information we need to expose data with ExcelTable, the same way it’s done for .xlsx files :

  • Encryption information
  • Sheet list
  • Shared strings
  • Comments
  • Cell data

As said, encrypted .xls files are also supported. The RC4 encryption method used by default is described in the [MS-OFFCRYPTO] specs.
The corresponding key-derivation routines have been added in XUTL_OFFCRYPTO package.

Although the XLS reader is primarily meant to be used internally by ExcelTable, it also possesses a “raw” pipelined table interface to directly extract cell data from the Workbook stream.

Here’s an example :

SQL> select t.cellrow
  2       , t.cellcol
  3       , case when t.cellData.getTypeName() = 'SYS.VARCHAR2' then t.cellData.accessVarchar2() end as strval
  4       , case when t.cellData.getTypeName() = 'SYS.NUMBER' then t.cellData.accessNumber() end as numval
  5       , case when t.cellData.getTypeName() = 'SYS.CLOB' then t.cellData.accessClob() end as lobval
  6  from table(
  7         xutl_xls.getRows(
  8           p_file     => xutl_cdf.get_stream(file2blob('TMP_DIR','ooxdata2c.xls'),'/Workbook')
  9         , p_sheet    => 'DataSource'
 10         , p_password => 'pass123'
 11         , p_cols     => 'A,B,C,D,E,F'
 12         , p_firstRow =>  1
 13         , p_lastRow  =>  91
 14         )
 15       ) t
 16  ;
 
   CELLROW CELLCOL STRVAL             NUMVAL LOBVAL
---------- ------- -------------- ---------- --------
         1 A                               1 
         1 B       LINE-00001                
         1 C       ABCD                      
         1 D       1899-12                   
         2 A                               2 
         2 B       LINE-00002                
         2 C       ABC                       
         2 D                               2 
         2 F       TEST                      
         3 A                               3 
         3 B       LINE-00003                
         3 C       ABC                       
         3 D                               3 
         4 A                               4 
         4 B       LINE-00004                
         4 C       ABC                       
         4 D                               4 
 

Stay tuned for the upcoming addition : XLSB format…

 

Advertisements
Categories: PL/SQL Tags: , , ,

Reading Large XML Node Values using the DOM API

March 24, 2018 Leave a comment

As of Oracle release 11.1, DBMS_XMLDOM API has been extended with a set of stream-based functions and procedures to read and write large node data (i.e. data exceeding VARCHAR2 size limit) :
Link to documentation

Sounds great on the paper and well documented, but looking closer and actually trying those functionalities on real examples turns out to be a bit disappointing.
That’s why I could have subtitled this post “The Good, the Bad and the Ugly”.

 

1- Reading a large node value composed of single-byte characters : the “Good”

In this example, I’ll use a simple XML document containing a single 128K text node :

<test>XXXX ... XXXX</test>

and try to read it into a CLOB instance via the DOM streaming API.

declare

  xmldoc   xmltype;
  domdoc   dbms_xmldom.DOMDocument;
  node     dbms_xmldom.DOMNode;
  istream  utl_characterinputstream;
  tmp      varchar2(32767);
  nread    integer := 32767;
  output   clob;
  
begin
  
  select xmlelement("test", rpad(to_clob('X'),131072,'X'))
  into xmldoc
  from dual;
  
  domdoc := dbms_xmldom.newDOMDocument(xmldoc);
  node := dbms_xslprocessor.selectSingleNode(dbms_xmldom.makenode(domdoc), '/test/text()');
  
  dbms_lob.createtemporary(output, true);
  
  istream := dbms_xmldom.getNodeValueAsCharacterStream(node);
  
  loop
    istream.read(tmp, nread);
    exit when nread = 0;
    dbms_lob.writeappend(output, nread, tmp);
  end loop;
  
  istream.close();
  
  dbms_output.put_line('output size = '||dbms_lob.getlength(output));
  
  dbms_lob.freetemporary(output);
  dbms_xmldom.freeDocument(domdoc);
  
end;  
/

output size = 131072
 
PL/SQL procedure successfully completed.

Works great.
We retrieve an input stream for the large node by calling getNodeValueAsCharacterStream() function.
Node data is then read from that stream in 32K chunks, and appended to a temp CLOB.

This was tested on 12.1.0.2, with database character set AL32UTF8.
Note that the large string we want to extract in this example consists only in single-byte characters.

Let’s see what happens when the data contains multi-byte characters…

 

Read more…

How To : Generate Empty XML Attributes

February 17, 2018 1 comment

It is well-known that XML attributes may be generated using SQL/XML publishing function XMLAttributes().
It is also well-known that XMLAttributes() does not generate anything when passed an expression that evaluates to NULL :

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/adxdb/generation-of-XML-data-from-relational-data.html

If an attribute value expression evaluates to NULL, then no corresponding attribute is created.

So, how can we generate an empty attribute using SQL?

<root attr=""/>

Following are a compilation of “tricks” (more than official methods) to generate empty XML attributes, and empty namespace declarations in a SQL context.
I’ll divide them into two categories :

  • Document post-processing : target attributes we want to be empty are transformed after the complete document has been generated
  • In-place generation : empty attributes are directly generated in the same “flow”

 

Read more…

Categories: HowTo, SQL, XQuery Tags: , ,

ExcelTable 1.6 : support for cell comments

January 3, 2018 Leave a comment

Here’s the new version of ExcelTable, which can now extract cell comments as regular columns.
Not much change in the user interface, except an extended column syntax specification to declare a request for cell metadata instead of its value :

For example, using sample file ooxdata3.xlsx :

SQL> select t.*
  2  from table(
  3         ExcelTable.getRows(
  4           ExcelTable.getFile('TMP_DIR','ooxdata3.xlsx')
  5         , 'DataSource'
  6         , q'{
  7             "RN"             for ordinality
  8           , "SPARE2"         varchar2(30)   column 'F'
  9           , "SPARE2_COMMENT" varchar2(2000) column 'F' for metadata (comment)
 10           }'
 11         , '2:11'
 12         )
 13       ) t
 14  ;

 RN SPARE2 SPARE2_COMMENT
--- ------ ------------------------------
  1
  2
  3 OK     bleronm:
           This is a comment.

  4 OK
  5
  6 OK
  7
  8
  9        This is
           another comment
           on three lines

 10

10 rows selected.

 

Source code available on GitHub :

/mbleron/ExcelTable

 

A few words about the internals are following…

 

Read more…

Categories: Miscellaneous, PL/SQL, SQL, XQuery Tags:

How To : Nest Multiple JSON Path Expression Filters

July 9, 2017 Leave a comment

A short post about JSON Path Expressions.
I will focus on filter expressions (now officially documented as of Oracle 12.2) we can use in conjunction with JSON_EXISTS function, and more specifically how to nest multiple relative filter steps.

Here’s an example of what I mean :

{
  "_id":"53e3c6ed-9bfc-2730-e053-0100007f6afb",
  "content":{
    "name":"obj1", 
    "type":1, 
    "values":[
      {"name":"x", "v":1},
      {"name":"y", "v":2},
      {"name":"z", "v":1.5}
    ]
  }
}

and say we want to find documents (like the one above) where content.type = 1 and content.values contains an x item having v > 0.

 

Read more…

ExcelTable 1.3 : support for password-encrypted files

June 5, 2017 3 comments

Here’s the new version of ExcelTable, which can now read password-encrypted files.
It supports Standard and Agile encryption methods, as specified in [MS-OFFCRYPTO].

By default, Office 2007 will encrypt using the Standard method whereas Office 2010 and onwards use Agile encryption.
AES (128 or 256) is usually the default algorithm on standard Office installations.
Because latest Office versions (2013+) make use of SHA-2 hashing algorithms, Oracle 12c is required to read Excel documents encrypted in those versions.

Basically, the only change from ExcelTable 1.2 is the addition of an optional argument p_password in getRows() function :

function getRows (
  p_file     in blob
, p_sheet    in varchar2
, p_cols     in varchar2
, p_range    in varchar2 default null
, p_method   in binary_integer default DOM_READ
, p_password in varchar2 default null
) 
return anydataset pipelined
using ExcelTableImpl;

The following dependencies are also required :
XUTL_CDF
XUTL_OFFCRYPTO

 

Source code available on GitHub :

/mbleron/ExcelTable

 

A few words about the internals are following…

 

Read more…

(PL/SQL) CFBF File Reader

May 21, 2017 5 comments

In this post, I’ll be presenting a PL/SQL utility to read Compound File Binary Format (CFBF).
CFBF is originally a Microsoft proprietary format, which has now been published through the Open Specifications program.

In a nutshell, a CFBF file, or Compound Document File (CDF), is a container assembled like a FAT filesystem, which can be explored as a regular directory structure.

The purpose of the post is not to discuss about the format itself so I’ll let the reader browse the specs to get more familiar with it :

CFBF is the basis for various file formats used in Windows systems.
To cite a few, .MSI (Microsoft Windows Installer) and the well-known .DOC, .XLS, .PPT etc. from the Office 97-2003 Suite.

So, why am I interested in developing a CFBF PL/SQL reader?

  1. Because it’s fun
  2. Because password-encrypted Open Office XML files (.xlsx, .docx …) are stored in a CFBF container, and I thought it would be a good enhancement to my ExcelTable interface.

Source code available on GitHub :

/mbleron/MSUtilities/CDFReader

 

Read more…