Archive

Posts Tagged ‘ExcelTable’

ExcelTable 2.1 : support for XLSB files

April 22, 2018 5 comments

Latest enhancement to ExcelTable : support for Excel Binary File Format (.xlsb).

The .xlsb file format is quite similar to .xlsx as they are both using the same OPC package structure, i.e. a zip archive with different parts containing document data and metadata.

The difference lies in the internal format of the data parts.
In the xlsx format, most data parts (except purely binary parts like images) are XML, whereas in the xlsb format, those parts are binary (with a default .bin extension) :

Relationship parts (.rels) are still exposed as XML, so the workflow to read and extract data is the same as for xlsx, and already implemented in ExcelTable interface.
However, for modularity, reading the different binary parts (workbook, sharedStrings, comments, sheet) is handled by a separate PL/SQL package : XUTL_XLSB.

Technically, each binary part is organized much like the Workbook stream in the XLS format, i.e. as a sequence of records with a common header structure (type and size) followed by record data.
That makes .xlsb files usually smaller and faster to read than the same document saved as xlsx, especially for very large Excel files.

Source code, documentation and samples available on GitHub :

/mbleron/ExcelTable

 

The XLSB reader

My implementation of the XLSB reader resembles that of the XLS reader.
It is based on the official XLSB specs available on MSDN, like other Microsoft Office file formats :
[MS-XLSB]: Excel (.xlsb) Binary File Format

It uses the same open/iterate/close methods to interface with ExcelTable and feed cell data in a streaming fashion.

A pipelined table interface is provided, for debugging purpose, as it takes the sheet and sharedStrings parts as parameters :

function get_rows (
  p_sheet_part  in blob 
, p_sst_part    in blob
, p_cols        in varchar2 default null
, p_firstRow    in pls_integer default null
, p_lastRow     in pls_integer default null
)
return ExcelTableCellList
pipelined;

Sample usage, using sheet1.bin and sharedStrings.bin parts manually extracted from ooxdata3.xlsb :

SQL> select t.cellrow
  2       , t.cellcol
  3       , case when t.cellData.getTypeName() = 'SYS.VARCHAR2' then t.cellData.accessVarchar2() end as strval
  4       , case when t.cellData.getTypeName() = 'SYS.NUMBER' then t.cellData.accessNumber() end as numval
  5  from table(
  6         xutl_xlsb.get_rows(
  7           p_sheet_part => file2blob('TMP_DIR','sheet1.bin')
  8         , p_sst_part   => file2blob('TMP_DIR','sharedStrings.bin')
  9         , p_cols       => 'A,B,C,D,E,F'
 10         )
 11       ) t
 12  ;
 
   CELLROW CELLCOL STRVAL             NUMVAL
---------- ------- -------------- ----------
         1 A       SRNO           
         1 B       NAME           
         1 C       VALUE          
         1 D       DT             
         1 E       SPARE1         
         1 F       SPARE2         
         2 A                               1
         2 B       LINE-00001     
         2 C                      68459.0560
         2 D                      8598.67446                        

...

       100 A                              99
       100 B       LINE-00099     
       100 C                      7348.46845
       100 D                      5275.00033
       101 A                             100
       101 B       LINE-00100     
       101 C                      94805.7298
       101 D                      4492.44441
 
412 rows selected.
 

 

Advertisements
Categories: ExcelTable, PL/SQL Tags: ,

ExcelTable 2.0 : new support for Excel 97-2003 files (.xls)

April 1, 2018 Leave a comment

Here’s the new version of ExcelTable, which can now read old Excel 97-2003 (.xls) files.
I’ve decided to add this feature not so long ago when I realized there was still a lot of those files around, although they were superseded by OOX-based files (.xlsx) more than a decade ago.

This new version is labeled 2.0 as it now requires hard dependencies that were formerly optional :

The interface remains otherwise unchanged.

I also took the opportunity to migrate ExcelTable project to its own GitHub repository :

/mbleron/ExcelTable

The dependencies were migrated as well to :

/mbleron/MSUtilities

 

The XLS reader

My implementation of the XLS reader is based on the official Microsoft specs published on MSDN :
[MS-XLS]: Excel Binary File Format (.xls) Structure

For an overview :
Understanding the Excel .xls Binary File Format

I’ll just say a few words about the format itself because it’s very well explained in the specs.
Basically, an .xls file is a CFBF container whose ‘Workbook’ binary stream contains the workbook structure and data :

SQL> select *
  2  from table(xutl_cdf.get_streams(file2blob('TMP_DIR','ooxdata2c.xls')));
 
PATH                              STREAM_SIZE CREATION_TIME MODIFIED_TIME STREAM
--------------------------------- ----------- ------------- ------------- ------
/CompObj                                 122                             <BLOB>
/DocumentSummaryInformation              236                             <BLOB>
/SummaryInformation                      236                             <BLOB>
/Workbook                               71871                             <BLOB>
 

The Workbook content is composed of a sequence of records sharing a common header structure (type and record size) followed by the actual record data. This format is known as BIFF8.
The work of the XLS reader is to scan through those records and extract the information we need to expose data with ExcelTable, the same way it’s done for .xlsx files :

  • Encryption information
  • Sheet list
  • Shared strings
  • Comments
  • Cell data

As said, encrypted .xls files are also supported. The RC4 encryption method used by default is described in the [MS-OFFCRYPTO] specs.
The corresponding key-derivation routines have been added in XUTL_OFFCRYPTO package.

Although the XLS reader is primarily meant to be used internally by ExcelTable, it also possesses a “raw” pipelined table interface to directly extract cell data from the Workbook stream.

Here’s an example :

SQL> select t.cellrow
  2       , t.cellcol
  3       , case when t.cellData.getTypeName() = 'SYS.VARCHAR2' then t.cellData.accessVarchar2() end as strval
  4       , case when t.cellData.getTypeName() = 'SYS.NUMBER' then t.cellData.accessNumber() end as numval
  5       , case when t.cellData.getTypeName() = 'SYS.CLOB' then t.cellData.accessClob() end as lobval
  6  from table(
  7         xutl_xls.getRows(
  8           p_file     => xutl_cdf.get_stream(file2blob('TMP_DIR','ooxdata2c.xls'),'/Workbook')
  9         , p_sheet    => 'DataSource'
 10         , p_password => 'pass123'
 11         , p_cols     => 'A,B,C,D,E,F'
 12         , p_firstRow =>  1
 13         , p_lastRow  =>  91
 14         )
 15       ) t
 16  ;
 
   CELLROW CELLCOL STRVAL             NUMVAL LOBVAL
---------- ------- -------------- ---------- --------
         1 A                               1 
         1 B       LINE-00001                
         1 C       ABCD                      
         1 D       1899-12                   
         2 A                               2 
         2 B       LINE-00002                
         2 C       ABC                       
         2 D                               2 
         2 F       TEST                      
         3 A                               3 
         3 B       LINE-00003                
         3 C       ABC                       
         3 D                               3 
         4 A                               4 
         4 B       LINE-00004                
         4 C       ABC                       
         4 D                               4 
 

Stay tuned for the upcoming addition : XLSB format…

 

Categories: PL/SQL Tags: , , ,

ExcelTable 1.6 : support for cell comments

January 3, 2018 Leave a comment

Here’s the new version of ExcelTable, which can now extract cell comments as regular columns.
Not much change in the user interface, except an extended column syntax specification to declare a request for cell metadata instead of its value :

For example, using sample file ooxdata3.xlsx :

SQL> select t.*
  2  from table(
  3         ExcelTable.getRows(
  4           ExcelTable.getFile('TMP_DIR','ooxdata3.xlsx')
  5         , 'DataSource'
  6         , q'{
  7             "RN"             for ordinality
  8           , "SPARE2"         varchar2(30)   column 'F'
  9           , "SPARE2_COMMENT" varchar2(2000) column 'F' for metadata (comment)
 10           }'
 11         , '2:11'
 12         )
 13       ) t
 14  ;

 RN SPARE2 SPARE2_COMMENT
--- ------ ------------------------------
  1
  2
  3 OK     bleronm:
           This is a comment.

  4 OK
  5
  6 OK
  7
  8
  9        This is
           another comment
           on three lines

 10

10 rows selected.

 

Source code available on GitHub :

/mbleron/ExcelTable

 

A few words about the internals are following…

 

Read more…

Categories: Miscellaneous, PL/SQL, SQL, XQuery Tags:

ExcelTable 1.3 : support for password-encrypted files

June 5, 2017 3 comments

Here’s the new version of ExcelTable, which can now read password-encrypted files.
It supports Standard and Agile encryption methods, as specified in [MS-OFFCRYPTO].

By default, Office 2007 will encrypt using the Standard method whereas Office 2010 and onwards use Agile encryption.
AES (128 or 256) is usually the default algorithm on standard Office installations.
Because latest Office versions (2013+) make use of SHA-2 hashing algorithms, Oracle 12c is required to read Excel documents encrypted in those versions.

Basically, the only change from ExcelTable 1.2 is the addition of an optional argument p_password in getRows() function :

function getRows (
  p_file     in blob
, p_sheet    in varchar2
, p_cols     in varchar2
, p_range    in varchar2 default null
, p_method   in binary_integer default DOM_READ
, p_password in varchar2 default null
) 
return anydataset pipelined
using ExcelTableImpl;

The following dependencies are also required :
XUTL_CDF
XUTL_OFFCRYPTO

 

Source code available on GitHub :

/mbleron/ExcelTable

 

A few words about the internals are following…

 

Read more…

ExcelTable 1.2 : introducing streaming support for large files

November 15, 2016 1 comment

ExcelTable is my attempt at building a SQL query interface to read MS Excel files in xlsx (or xlsm) format.
Version 1.2 is now available with the following new features :

  • Streaming read support for large Excel files
  • setFetchSize() routine to limit the number of rows processed per request

 
/mbleron/ExcelTable

 

Streaming read method

The getRows() function has been extended with an additional optional argument p_method.
Allowed values are as follows :

-- Read methods  
DOM_READ     constant binary_integer := 0;
STREAM_READ  constant binary_integer := 1;

The default is 0 (DOM_READ).

The streaming method requires Java (StAX API) and is much more scalable than the DOM_READ method when accessing large files.
Please see the README for more details about what to install depending on the target database version.

Example on a 500,000-row file (bigfile.xlsx) :

select * 
from table(
       ExcelTable.getRows(
         ExcelTable.getFile('TMP_DIR','bigfile.xlsx')
       , 'data'      
       , q'{ 
            "ID"           number
          , "FIRST_NAME"   varchar2(15)
          , "LAST_NAME"    varchar2(20)
          , "EMAIL"        varchar2(30)
          , "GENDER"       varchar2(10)
          , "IP_ADDRESS"   varchar2(16)
          , "COMMENT"      varchar2(4000)
          , "IMAGE"        varchar2(4000)
          , "DT"           date  format 'DD/MM/YYYY'

         }'
       , 'A2'
       , 1
       )
     ) t ;

 

Read more…

Oracle SQL – Reading an Excel File (xlsx) as an External Table

June 21, 2016 21 comments

I’ve been thinking about it for quite a long time and never really had time to implement it, but it’s finally there : a pipelined table interface to read an Excel file (.xlsx) as if it were an external table.

It’s entirely implemented in PL/SQL using an object type (for the ODCI routines) and a package supporting the core functionalities.
Available for download on GitHub :

/mbleron/ExcelTable

Read more…