Hierarchical JSON Aggregation : the JSONNest function

June 17, 2018

This post echoes some of my previous work on the topic, this time applied to JSON data.

The idea is to build a user-defined aggregate function to automatically nest a hierarchically-ordered set of JSON objects, and return a single JSON document containing the whole hierarchy.

For example, from this adjacency list :

        ID  PARENT_ID NAME
---------- ---------- ------
         1            Item_1
         2          1 Item_2
         3          1 Item_3
         4          3 Item_4
         5          4 Item_5

to :

{
  "id": 1,
  "name": "Item_1",
  "children": [
    {
      "id": 2,
      "name": "Item_2"
    },
    {
      "id": 3,
      "name": "Item_3",
      "children": [
        {
          "id": 4,
          "name": "Item_4",
          "children": [
            {
              "id": 5,
              "name": "Item_5"
            }
          ]
        }
      ]
    }
  ]
}

 

The implementation I’ll present here requires the following :

  • An object type to hold the definition of each node (JSONNestItem)
  • A PL/SQL package for the core routines and context maintenance (JSONNestUtil)
  • An object type for ODCI routines (JSONNestImpl)
  • The aggregate function (JSONNest)

Source code available on GitHub :

/mbleron/oracle/JSONNest

Installation :

@JSONNestItem.tps
@JSONNestUtil.pks
@JSONNestUtil.pkb
@JSONNestImpl.tps
@JSONNestImpl.tpb
@JSONNest.fnc

 

Implementation

JSONNest requires at least Oracle 12.2.
It uses PL/SQL JSON object types introduced in that release, allowing fine-grained manipulation of JSON data.
The implementation is a little more complicated than expected because of a bug (which I reported) in 12.2, related to accessing JSON objects by reference :
ORA-7445 Or Corrupted Result When Nesting PL/SQL JSON Objects (Doc ID 2361514.1)

The bug is announced as fixed in 18c (18.1), but I haven't had the opportunity to check yet.
The workaround requires maintaining a stack of JSON objects. When I get to work on an 18c instance, I'll rewrite the code to use object references instead.

The function :

function JSONNest (input in JSONNestItem) 
return clob
aggregate using JSONNestImpl;

where JSONNestItem is an object type describing a node of the hierarchy :

create or replace type JSONNestItem as object (
  item_level    integer
, json_content  varchar2(32767)
, wrapper       varchar2(4000)
);
  • item_level : node level in the hierarchy, typically derived from the LEVEL pseudocolumn in a CONNECT BY query, or computed in a recursive CTE (a sketch of that variant follows the HR example below).
  • json_content : actual JSON content of this node. Must be a JSON object.
  • wrapper : name of the field containing the array of child nodes.
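
For a quick check, here's a self-contained sketch reproducing the introductory example above (assuming all the objects listed in the installation section are in place) :

with sample_data (id, parent_id, name) as (
  select 1, null, 'Item_1' from dual union all
  select 2, 1,    'Item_2' from dual union all
  select 3, 1,    'Item_3' from dual union all
  select 4, 3,    'Item_4' from dual union all
  select 5, 4,    'Item_5' from dual
)
select JSONNest(
         JSONNestItem(
           level
         , json_object('id' value id, 'name' value name)
         , 'children'
         )
       ) as result
from sample_data
connect by prior id = parent_id
start with parent_id is null
order siblings by name
;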

 

Example

I’ll use the usual HR.EMPLOYEES table to generate the hierarchy of employees :

select JSONNest(
         JSONNestItem(
           level
         , json_object(
             'id' value employee_id
           , 'name' value first_name || ' ' || last_name
           )
         , 'employees'
         )
       ) as result
from hr.employees
connect by prior employee_id = manager_id
start with manager_id is null
order siblings by last_name
;

Output (formatted) :

{
  "id": 100,
  "name": "Steven King",
  "employees": [
    {
      "id": 101,
      "name": "Neena Kochhar",
      "employees": [
        {
          "id": 108,
          "name": "Nancy Greenberg",
          "employees": [
            {
              "id": 109,
              "name": "Daniel Faviet"
            },
            {
              "id": 110,
              "name": "John Chen"
            },
            {
              "id": 111,
              "name": "Ismael Sciarra"
            },
            {
              "id": 112,
              "name": "Jose Manuel Urman"
            },
            {
              "id": 113,
              "name": "Luis Popp"
            }
          ]
        },
        {
          "id": 200,
          "name": "Jennifer Whalen"
        },
        {
          "id": 203,
          "name": "Susan Mavris"
        },
        {
          "id": 204,
          "name": "Hermann Baer"
        },
        {
          "id": 205,
          "name": "Shelley Higgins",
          "employees": [
            {
              "id": 206,
              "name": "William Gietz"
            }
          ]
        }
      ]
    },
    {
      "id": 102,
      "name": "Lex De Haan",
      "employees": [
        {
          "id": 103,
          "name": "Alexander Hunold",
          "employees": [
            {
              "id": 104,
              "name": "Bruce Ernst"
            },
            {
              "id": 105,
              "name": "David Austin"
            },
            {
              "id": 106,
              "name": "Valli Pataballa"
            },
            {
              "id": 107,
              "name": "Diana Lorentz"
            }
          ]
        }
      ]
    },
    {
      "id": 114,
      "name": "Den Raphaely",
      "employees": [
        {
          "id": 115,
          "name": "Alexander Khoo"
        },
        {
          "id": 116,
          "name": "Shelli Baida"
        },
        {
          "id": 117,
          "name": "Sigal Tobias"
        },
        {
          "id": 118,
          "name": "Guy Himuro"
        },
        {
          "id": 119,
          "name": "Karen Colmenares"
        }
      ]
    },
    {
      "id": 120,
      "name": "Matthew Weiss",
      "employees": [
        {
          "id": 125,
          "name": "Julia Nayer"
        },
        {
          "id": 126,
          "name": "Irene Mikkilineni"
        },
        {
          "id": 127,
          "name": "James Landry"
        },
        {
          "id": 128,
          "name": "Steven Markle"
        },
        {
          "id": 180,
          "name": "Winston Taylor"
        },
        {
          "id": 181,
          "name": "Jean Fleaur"
        },
        {
          "id": 182,
          "name": "Martha Sullivan"
        },
        {
          "id": 183,
          "name": "Girard Geoni"
        }
      ]
    },
    {
      "id": 121,
      "name": "Adam Fripp",
      "employees": [
        {
          "id": 129,
          "name": "Laura Bissot"
        },
        {
          "id": 130,
          "name": "Mozhe Atkinson"
        },
        {
          "id": 131,
          "name": "James Marlow"
        },
        {
          "id": 132,
          "name": "TJ Olson"
        },
        {
          "id": 184,
          "name": "Nandita Sarchand"
        },
        {
          "id": 185,
          "name": "Alexis Bull"
        },
        {
          "id": 186,
          "name": "Julia Dellinger"
        },
        {
          "id": 187,
          "name": "Anthony Cabrio"
        }
      ]
    },
    {
      "id": 122,
      "name": "Payam Kaufling",
      "employees": [
        {
          "id": 133,
          "name": "Jason Mallin"
        },
        {
          "id": 134,
          "name": "Michael Rogers"
        },
        {
          "id": 135,
          "name": "Ki Gee"
        },
        {
          "id": 136,
          "name": "Hazel Philtanker"
        },
        {
          "id": 188,
          "name": "Kelly Chung"
        },
        {
          "id": 189,
          "name": "Jennifer Dilly"
        },
        {
          "id": 190,
          "name": "Timothy Gates"
        },
        {
          "id": 191,
          "name": "Randall Perkins"
        }
      ]
    },
    {
      "id": 123,
      "name": "Shanta Vollman",
      "employees": [
        {
          "id": 137,
          "name": "Renske Ladwig"
        },
        {
          "id": 138,
          "name": "Stephen Stiles"
        },
        {
          "id": 139,
          "name": "John Seo"
        },
        {
          "id": 140,
          "name": "Joshua Patel"
        },
        {
          "id": 192,
          "name": "Sarah Bell"
        },
        {
          "id": 193,
          "name": "Britney Everett"
        },
        {
          "id": 194,
          "name": "Samuel McCain"
        },
        {
          "id": 195,
          "name": "Vance Jones"
        }
      ]
    },
    {
      "id": 124,
      "name": "Kevin Mourgos",
      "employees": [
        {
          "id": 141,
          "name": "Trenna Rajs"
        },
        {
          "id": 142,
          "name": "Curtis Davies"
        },
        {
          "id": 143,
          "name": "Randall Matos"
        },
        {
          "id": 144,
          "name": "Peter Vargas"
        },
        {
          "id": 196,
          "name": "Alana Walsh"
        },
        {
          "id": 197,
          "name": "Kevin Feeney"
        },
        {
          "id": 198,
          "name": "Donald OConnell"
        },
        {
          "id": 199,
          "name": "Douglas Grant"
        }
      ]
    },
    {
      "id": 145,
      "name": "John Russell",
      "employees": [
        {
          "id": 150,
          "name": "Peter Tucker"
        },
        {
          "id": 151,
          "name": "David Bernstein"
        },
        {
          "id": 152,
          "name": "Peter Hall"
        },
        {
          "id": 153,
          "name": "Christopher Olsen"
        },
        {
          "id": 154,
          "name": "Nanette Cambrault"
        },
        {
          "id": 155,
          "name": "Oliver Tuvault"
        }
      ]
    },
    {
      "id": 146,
      "name": "Karen Partners",
      "employees": [
        {
          "id": 156,
          "name": "Janette King"
        },
        {
          "id": 157,
          "name": "Patrick Sully"
        },
        {
          "id": 158,
          "name": "Allan McEwen"
        },
        {
          "id": 159,
          "name": "Lindsey Smith"
        },
        {
          "id": 160,
          "name": "Louise Doran"
        },
        {
          "id": 161,
          "name": "Sarath Sewall"
        }
      ]
    },
    {
      "id": 147,
      "name": "Alberto Errazuriz",
      "employees": [
        {
          "id": 162,
          "name": "Clara Vishney"
        },
        {
          "id": 163,
          "name": "Danielle Greene"
        },
        {
          "id": 164,
          "name": "Mattea Marvins"
        },
        {
          "id": 165,
          "name": "David Lee"
        },
        {
          "id": 166,
          "name": "Sundar Ande"
        },
        {
          "id": 167,
          "name": "Amit Banda"
        }
      ]
    },
    {
      "id": 148,
      "name": "Gerald Cambrault",
      "employees": [
        {
          "id": 168,
          "name": "Lisa Ozer"
        },
        {
          "id": 169,
          "name": "Harrison Bloom"
        },
        {
          "id": 170,
          "name": "Tayler Fox"
        },
        {
          "id": 171,
          "name": "William Smith"
        },
        {
          "id": 172,
          "name": "Elizabeth Bates"
        },
        {
          "id": 173,
          "name": "Sundita Kumar"
        }
      ]
    },
    {
      "id": 149,
      "name": "Eleni Zlotkey",
      "employees": [
        {
          "id": 174,
          "name": "Ellen Abel"
        },
        {
          "id": 175,
          "name": "Alyssa Hutton"
        },
        {
          "id": 176,
          "name": "Jonathon Taylor"
        },
        {
          "id": 177,
          "name": "Jack Livingston"
        },
        {
          "id": 178,
          "name": "Kimberely Grant"
        },
        {
          "id": 179,
          "name": "Charles Johnson"
        }
      ]
    },
    {
      "id": 201,
      "name": "Michael Hartstein",
      "employees": [
        {
          "id": 202,
          "name": "Pat Fay"
        }
      ]
    }
  ]
}
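
As mentioned above, item_level may also be computed in a recursive CTE instead of CONNECT BY. Here's a sketch of that variant (siblings ordered by full name, for simplicity); it relies on the SEARCH DEPTH FIRST clause to present rows to the aggregate in depth-first order, so the caveat is that this ordering must actually survive down to the aggregate :

with emp_tree (employee_id, full_name, lvl) as (
  select employee_id
       , first_name || ' ' || last_name
       , 1
  from hr.employees
  where manager_id is null
  union all
  select e.employee_id
       , e.first_name || ' ' || e.last_name
       , t.lvl + 1
  from hr.employees e
  join emp_tree t on e.manager_id = t.employee_id
)
search depth first by full_name set ord
select JSONNest(
         JSONNestItem(
           lvl
         , json_object('id' value employee_id, 'name' value full_name)
         , 'employees'
         )
       ) as result
from emp_tree
;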

 


Parsing sibling repeating XML groups : Part 2

June 3, 2018

This post is an addendum to a previously published How-To presenting various techniques to break sibling repeating groups into relational data :

<root>
  <id>1</id>
  <name>SMITH</name>
  <id>2</id>
  <name>WILSON</name>
  <id>3</id>
  <name>GRANT</name>
</root>

to :

ID    NAME
----- ---------
    1 SMITH
    2 WILSON
    3 GRANT

The topic was raised again recently on the Oracle Community forum, but with a case I didn't cover originally : what about optional nodes?
For instance, if id 2 has no name :

<root>
  <id>1</id>
  <name>SMITH</name>
  <id>2</id>
  <id>3</id>
  <name>GRANT</name>
</root>

The “following-sibling” trick will not return the required output :

SQL> select x.*
  2  from xmltable(
  3         'for $i in /root/id
  4          return element r {
  5            $i/following-sibling::name[1]
  6          , $i
  7          }'
  8         passing xmlparse(document
  9         '<root>
 10            <id>1</id><name>SMITH</name>
 11            <id>2</id>
 12            <id>3</id><name>GRANT</name>
 13          </root>')
 14         columns id    number       path 'id'
 15               , name  varchar2(15) path 'name'
 16       ) x
 17  ;
 
        ID NAME
---------- ---------------
         1 SMITH
         2 GRANT          
         3 GRANT

One way to work around this is to add another predicate to ensure that the node is actually the one coming directly after the current key element, and not one belonging to another logical sibling group further down in the document.

In the following example, I’ll use the node identity comparison operator ‘is’ :

SQL> select x.*
  2  from xmltable(
  3         'for $i in /root/id
  4          return element r {
  5            $i/following-sibling::name[1][preceding-sibling::id[1] is $i]
  6          , $i
  7          }'
  8         passing xmlparse(document
  9         '<root>
 10            <id>1</id><name>SMITH</name>
 11            <id>2</id>
 12            <id>3</id><name>GRANT</name>
 13          </root>')
 14         columns id    number       path 'id'
 15               , name  varchar2(15) path 'name'
 16       ) x
 17  ;
 
        ID NAME
---------- ---------------
         1 SMITH
         2                
         3 GRANT
 

 

Additional Approaches and Performance Considerations

The full XQuery approach (with or without the RETURNING SEQUENCE BY REF clause) is rather compact and maintainable, but it is also very slow when applied to large XML documents.
Mixed approaches using XQuery and SQL post-processing provide far better performance.
I’ll present below three other ways using PIVOT, MODEL and MATCH_RECOGNIZE SQL features.
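
As a teaser, here's a sketch of how the MATCH_RECOGNIZE variant could look (my own illustration, not necessarily the exact version presented in the full article). Child nodes are first unnested along with their names and positions, then each id row is matched with an optional name row directly following it :

select id, name
from (
  select x.node_name, x.node_value, x.rn
  from xmltable(
         '/root/*'
         passing xmlparse(document
         '<root>
            <id>1</id><name>SMITH</name>
            <id>2</id>
            <id>3</id><name>GRANT</name>
          </root>')
         columns node_name  varchar2(30) path 'local-name(.)'
               , node_value varchar2(15) path 'text()'
               , rn         for ordinality
       ) x
)
match_recognize (
  order by rn
  measures to_number(i.node_value) as id
         , n.node_value            as name
  pattern ( i n? )
  define i as node_name = 'id'
       , n as node_name = 'name'
);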


How To : Uncompress Raw DEFLATE Data without CRC32

May 27, 2018

Here’s a short post about uncompressing data compressed using DEFLATE.

As of Oracle 10g, one can use the UTL_COMPRESS utility to compress/uncompress data with the Lempel-Ziv algorithm (which DEFLATE is based on).
However, UTL_COMPRESS only handles input/output in the GZIP format :

The output of the UTL_COMPRESS compressed data is compatible with gzip(with -n option)/gunzip on a single file.

For a single file, the GZIP format is simply composed of a header, followed by blocks of compressed data and a trailer.
In its simplest form, the header is a sequence of 10 bytes :

Field    Size  Description
-------  ----  -----------------------------------------
ID1-ID2  2     GZIP signature (0x1F8B)
CM       1     compression method (0x08 = DEFLATE)
FLG      1     bit vector for extra fields
MTIME    4     modification time (UNIX format)
XFL      1     extra flags
OS       1     source Operating System (0xFF = Unknown)

The trailer consists of 8 bytes :

Field  Size  Description
-----  ----  ----------------------------------------
CRC32  4     CRC-32 checksum of the uncompressed data
ISIZE  4     size of the uncompressed data
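
To make that layout concrete, here's a small sketch reading both trailer fields back from a GZIP blob produced by UTL_COMPRESS (DBMS_LOB offsets are 1-based, so the 8-byte trailer starts at length-7) :

declare
  src  blob;
  gz   blob;
  len  integer;
begin
  dbms_lob.createtemporary(src, true);
  dbms_lob.writeappend(src, 18, utl_raw.cast_to_raw('Hello GZIP trailer'));
  gz  := utl_compress.lz_compress(src);
  len := dbms_lob.getlength(gz);
  -- CRC32 : bytes len-7 .. len-4 (stored little-endian)
  dbms_output.put_line('CRC32 = 0x'||rawtohex(dbms_lob.substr(gz, 4, len - 7)));
  -- ISIZE : last 4 bytes, little-endian
  dbms_output.put_line('ISIZE = '||utl_raw.cast_to_binary_integer(dbms_lob.substr(gz, 4, len - 3), utl_raw.little_endian));
end;
/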

So, if we only have a raw DEFLATE input, we must first wrap it in a GZIP container.
While it's easy to create the header, for instance using the following constant : 0x1F8B08000000000000FF, the CRC-32 and uncompressed size are generally unknown, unless shipped together with the compressed data (as in the ZIP file format).

To illustrate the problem and the solution, I'll use a 200,000-byte data sample composed of a sequence of 'X' characters, terminated by an 'A' character.
This sample data will be compressed using the UTL_COMPRESS.LZ_COMPRESS procedure, and the raw DEFLATE block extracted from the resulting GZIP file by skipping the 10-byte header and dropping the 8-byte trailer (hence the offset of 11 and amount of length-18 in the SUBSTR call below) :

SQL> declare
  2    input    blob;
  3    output   blob;
  4    deflate  raw(1024);
  5  begin
  6
  7    dbms_lob.createtemporary(input, true);
  8    for i in 1 .. 9999 loop
  9      dbms_lob.writeappend(input, 20, utl_raw.cast_to_raw('XXXXXXXXXXXXXXXXXXXX'));
 10    end loop;
 11    dbms_lob.writeappend(input, 20, utl_raw.cast_to_raw('XXXXXXXXXXXXXXXXXXXA'));
 12    dbms_output.put_line('Input size = '||dbms_lob.getlength(input));
 13
 14    output := utl_compress.lz_compress(input);
 15
 16    deflate := dbms_lob.substr(output, amount => dbms_lob.getlength(output)-18, offset => 11);
 17    dbms_output.put_line('Compressed size = '||utl_raw.length(deflate));
 18    dbms_output.put_line('DEFLATE data = ');
 19    dbms_output.put_line(deflate);
 20
 21  end;
 22  /
Input size = 200000
Compressed size = 213
DEFLATE data =
EDC1B1000000000230B62CF22789A17F5B0B00000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000009C32

PL/SQL procedure successfully completed.

 

Now, if we try to uncompress it without the trailer, it produces an incomplete output :

SQL> declare
  2
  3    input   blob := hextoraw(
  4            'EDC1B1000000000230B62CF22789A17F5B0B00000000000000000000000000000000000000000000'||
  5            '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  6            '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  7            '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  8            '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  9            '00000000000000000000009C32');
 10    tmp_gz  blob;
 11    output  blob;
 12
 13  begin
 14
 15    tmp_gz := hextoraw('1F8B08000000000000FF');
 16    dbms_lob.copy(tmp_gz, input, dbms_lob.getlength(input), 11, 1);
 17    dbms_lob.createtemporary(output, true);
 18    utl_compress.lz_uncompress(tmp_gz, output);
 19
 20    dbms_output.put_line('Uncompressed size = '||dbms_lob.getlength(output));
 21
 22  end;
 23  /
Uncompressed size = 195168

PL/SQL procedure successfully completed.

 

The solution is to use UTL_COMPRESS piecewise operations :

SQL> declare
  2
  3    input   blob := hextoraw(
  4                    'EDC1B1000000000230B62CF22789A17F5B0B00000000000000000000000000000000000000000000'||
  5                    '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  6                    '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  7                    '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  8                    '00000000000000000000000000000000000000000000000000000000000000000000000000000000'||
  9                    '00000000000000000000009C32');
 10    tmp_gz  blob;
 11    output  blob;
 12
 13    ctx     binary_integer;
 14    buf     raw(32767);
 15
 16  begin
 17
 18    tmp_gz := hextoraw('1F8B08000000000000FF');
 19    dbms_lob.copy(tmp_gz, input, dbms_lob.getlength(input), 11, 1);
 20    dbms_lob.createtemporary(output, true);
 21
 22    -- initialize piecewise uncompress context
 23    ctx := utl_compress.lz_uncompress_open(tmp_gz);
 24    -- uncompress data in chunks of 32KiB, until NO_DATA_FOUND is raised
 25    loop
 26      begin
 27        utl_compress.lz_uncompress_extract(ctx, buf);
 28      exception
 29        when no_data_found then
 30          exit;
 31      end;
 32      dbms_lob.writeappend(output, utl_raw.length(buf), buf);
 33    end loop;
 34    -- close context
 35    utl_compress.lz_uncompress_close(ctx);
 36
 37    dbms_output.put_line('Uncompressed size = '||dbms_lob.getlength(output));
 38    dbms_output.put_line('Tail = ');
 39    dbms_output.put_line(utl_raw.cast_to_varchar2(dbms_lob.substr(output, offset => dbms_lob.getlength(output)-20)));
 40
 41    dbms_lob.freetemporary(output);
 42
 43  end;
 44  /
Uncompressed size = 200000
Tail =
XXXXXXXXXXXXXXXXXXXXA

PL/SQL procedure successfully completed.

Done!
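
For convenience, the whole technique can be wrapped in a small standalone function. Here's a sketch (the name and interface are mine, not part of UTL_COMPRESS) :

create or replace function inflate (p_deflate in blob)
return blob
is
  tmp_gz  blob := hextoraw('1F8B08000000000000FF'); -- minimal GZIP header
  output  blob;
  ctx     binary_integer;
  buf     raw(32767);
begin
  -- append the raw DEFLATE data right after the 10-byte header
  dbms_lob.copy(tmp_gz, p_deflate, dbms_lob.getlength(p_deflate), 11, 1);
  dbms_lob.createtemporary(output, true);
  -- piecewise uncompress, which tolerates the missing trailer
  ctx := utl_compress.lz_uncompress_open(tmp_gz);
  loop
    begin
      utl_compress.lz_uncompress_extract(ctx, buf);
    exception
      when no_data_found then
        exit;
    end;
    dbms_lob.writeappend(output, utl_raw.length(buf), buf);
  end loop;
  utl_compress.lz_uncompress_close(ctx);
  return output;
end;
/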

 


On the Blowfish algorithm and PL/SQL integer arithmetic

May 9, 2018

My latest “weekend” project was a PL/SQL implementation of Bruce Schneier’s Blowfish encryption algorithm.

Blowfish is 25 years old but still used in a number of products, since it is license-free and still considered a strong encryption algorithm.
In particular, before AES-256 became the preferred choice, Blowfish was the cipher used for encrypting OpenDocument (ODF) files in products such as OpenOffice or LibreOffice.
I am planning to add ODF spreadsheet support to ExcelTable, so I set about writing a PL/SQL Blowfish utility.
 

Introduction

The algorithm is simple and easy to implement, provided the target language possesses the necessary set of “tools”, namely :

  • bitwise operators
  • (native) modular integer arithmetic support

The heart of Blowfish is the F-function used at each round of the Feistel network.
Here’s the function in pseudocode (Wikipedia) :

uint32_t S[4][256];

uint32_t f (uint32_t x) {
   uint32_t h = S[0][x >> 24] + S[1][x >> 16 & 0xff];
   return ( h ^ S[2][x >> 8 & 0xff] ) + S[3][x & 0xff];
}

Quite simple, isn’t it?

The input is a 32-bit unsigned integer, acting as a 4-byte vector.
Each byte respectively serves as an index into one of the four S-Boxes (initialized earlier using the key), and can be extracted from the input via right-shift operations.
The code also uses the XOR (^) operator, and addition modulo 2^32.

All this can be implemented efficiently in low-level languages such as C, which natively possess bitwise operators.
However, how to do it efficiently in PL/SQL?

On my first try, most of the implementation used the RAW data type to move 32-bit values around, converting back and forth to integer whenever an addition was necessary.

It seemed natural to me to use RAW values as they are the closest thing to bytes in PL/SQL, and I thought manipulating RAW should be relatively fast.
Well, big mistake here.
RAW values can be manipulated using UTL_RAW package, in particular : BIT_OR, BIT_AND, BIT_XOR, CONCAT, SUBSTR, CAST_TO_BINARY_INTEGER and CAST_FROM_BINARY_INTEGER functions, and calling those millions of times proves to be very costly (see example below).

So, what alternatives do we have?
 

PL/SQL integer arithmetic

The PL/SQL language has no native support for bitwise operators, except the BITAND function, which fortunately can be used to implement Exclusive-OR (XOR) :

XOR(x,y) = x + y - 2 * BITAND(x,y)

Right shift may be implemented using integer division by the corresponding power of 2 (with special care for rounding behaviour).
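
For illustration, here's a minimal sketch of both building blocks, using SIMPLE_INTEGER (more on that type below). This is just a sketch, not the actual XUTL_CRYPTO code :

declare
  function xor32 (x in simple_integer, y in simple_integer)
  return simple_integer
  is
  begin
    -- exact as unbounded integers : x + y = XOR(x,y) + 2*AND(x,y)
    return x + y - 2 * bitand(x, y);
  end;

  -- logical right shift of a 32-bit value held in a SIMPLE_INTEGER, for 1 <= n <= 31
  function rshift (x in simple_integer, n in pls_integer)
  return simple_integer
  is
  begin
    -- mask out the sign bit, divide, then put the shifted sign bit back
    return trunc(bitand(x, 2147483647) / power(2, n))
           + case when x < 0 then power(2, 31 - n) else 0 end;
  end;
begin
  dbms_output.put_line(rshift(-1, 24));  -- 255, i.e. 0xFFFFFFFF >> 24
  dbms_output.put_line(utl_raw.cast_from_binary_integer(xor32(-1, 305419896)));  -- EDCBA987
end;
/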

The second concern is about the available integer data types themselves.
The PL/SQL language provides the following :

  • INTEGER, actually an alias for NUMBER(38)
  • PLS_INTEGER, a true 32-bit signed integer
  • BINARY_INTEGER, same as PLS_INTEGER
  • SIMPLE_INTEGER, same as PLS_INTEGER with a NOT NULL constraint and different overflow semantics

Both PLS_INTEGER and its subtype SIMPLE_INTEGER use hardware arithmetic, which is a key feature here for optimization.
However, PLS_INTEGER is not suitable in the present case as it does not support modular addition :

SQL> declare
  2    x  pls_integer := utl_raw.cast_to_binary_integer('7FFFFFFF'); -- 2147483647
  3  begin
  4    x := x + 1;
  5  end;
  6  /
declare
*
ERROR at line 1:
ORA-01426: numeric overflow
ORA-06512: at line 4

whereas SIMPLE_INTEGER does :

SQL> declare
  2    x  simple_integer := utl_raw.cast_to_binary_integer('7FFFFFFF'); -- 2147483647
  3  begin
  4    x := x + 1;
  5    dbms_output.put_line(x||' [0x'||utl_raw.cast_from_binary_integer(x)||']');
  6  end;
  7  /
-2147483648 [0x80000000]

PL/SQL procedure successfully completed.

 

To emphasize the difference between RAW and integer processing, let’s consider another example.
XORing two values 10 million times, 1) as RAW, 2) as SIMPLE_INTEGER :

SQL> declare
  2    x  raw(4) := hextoraw('FFFFFFFF');
  3    y  raw(4) := hextoraw('12345678');
  4  begin
  5    for i in 1 .. 10000000 loop
  6      x := utl_raw.bit_xor(x,y);
  7    end loop;
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.72

SQL> declare
  2    x  simple_integer := -1;
  3    y  simple_integer := 305419896;
  4  begin
  5    for i in 1 .. 10000000 loop
  6      x := x + y - 2 * bitand(x,y);
  7    end loop;
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.48

Transposing RAW manipulation to integer arithmetic wherever possible reduced the execution time by an order of magnitude.
 

The Blowfish PL/SQL implementation

The XUTL_CRYPTO package (see link below) contains my implementation of the Blowfish algorithm, in the following cipher modes of operation :

  • Electronic Codebook (ECB)
  • Cipher Block Chaining (CBC)
  • Cipher Feedback (CFB)
  • Output Feedback (OFB)

Source code available on GitHub :

/mbleron/Crypto
 


ExcelTable 2.1 : support for XLSB files

April 22, 2018

Latest enhancement to ExcelTable : support for Excel Binary File Format (.xlsb).

The .xlsb file format is quite similar to .xlsx, as both use the same OPC package structure, i.e. a zip archive with different parts containing document data and metadata.

The difference lies in the internal format of the data parts.
In the xlsx format, most data parts (except purely binary parts like images) are XML, whereas in the xlsb format, those parts are binary (with a default .bin extension).
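
A quick way to peek at that package structure, assuming an APEX installation (for its APEX_ZIP package); file2blob is the same helper function used in the sample further below :

declare
  files  apex_zip.t_files;
begin
  -- list the parts of the OPC package (zip archive)
  files := apex_zip.get_files(p_zipped_blob => file2blob('TMP_DIR','ooxdata3.xlsb'));
  for i in 1 .. files.count loop
    dbms_output.put_line(files(i));
  end loop;
end;
/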

Relationship parts (.rels) are still exposed as XML, so the workflow to read and extract data is the same as for xlsx, and already implemented in the ExcelTable interface.
However, for modularity, reading the different binary parts (workbook, sharedStrings, comments, sheet) is handled by a separate PL/SQL package : XUTL_XLSB.

Technically, each binary part is organized much like the Workbook stream in the XLS format, i.e. as a sequence of records with a common header structure (type and size) followed by record data.
That makes .xlsb files usually smaller and faster to read than the same document saved as xlsx, especially for very large Excel files.

Source code, documentation and samples available on GitHub :

/mbleron/ExcelTable

 

The XLSB reader

My implementation of the XLSB reader resembles that of the XLS reader.
It is based on the official XLSB specs available on MSDN, like other Microsoft Office file formats :
[MS-XLSB]: Excel (.xlsb) Binary File Format

It uses the same open/iterate/close methods to interface with ExcelTable and feed cell data in a streaming fashion.

A pipelined table interface is provided for debugging purposes; it takes the sheet and sharedStrings parts as parameters :

function get_rows (
  p_sheet_part  in blob 
, p_sst_part    in blob
, p_cols        in varchar2 default null
, p_firstRow    in pls_integer default null
, p_lastRow     in pls_integer default null
)
return ExcelTableCellList
pipelined;

Sample usage, using sheet1.bin and sharedStrings.bin parts manually extracted from ooxdata3.xlsb :

SQL> select t.cellrow
  2       , t.cellcol
  3       , case when t.cellData.getTypeName() = 'SYS.VARCHAR2' then t.cellData.accessVarchar2() end as strval
  4       , case when t.cellData.getTypeName() = 'SYS.NUMBER' then t.cellData.accessNumber() end as numval
  5  from table(
  6         xutl_xlsb.get_rows(
  7           p_sheet_part => file2blob('TMP_DIR','sheet1.bin')
  8         , p_sst_part   => file2blob('TMP_DIR','sharedStrings.bin')
  9         , p_cols       => 'A,B,C,D,E,F'
 10         )
 11       ) t
 12  ;
 
   CELLROW CELLCOL STRVAL             NUMVAL
---------- ------- -------------- ----------
         1 A       SRNO           
         1 B       NAME           
         1 C       VALUE          
         1 D       DT             
         1 E       SPARE1         
         1 F       SPARE2         
         2 A                               1
         2 B       LINE-00001     
         2 C                      68459.0560
         2 D                      8598.67446                        

...

       100 A                              99
       100 B       LINE-00099     
       100 C                      7348.46845
       100 D                      5275.00033
       101 A                             100
       101 B       LINE-00100     
       101 C                      94805.7298
       101 D                      4492.44441
 
412 rows selected.
 

 


ExcelTable 2.0 : new support for Excel 97-2003 files (.xls)

April 1, 2018

Here’s the new version of ExcelTable, which can now read old Excel 97-2003 (.xls) files.
I decided to add this feature not so long ago, when I realized there were still a lot of those files around, although they were superseded by OOX-based files (.xlsx) more than a decade ago.

This new version is labeled 2.0 as it now requires hard dependencies that were formerly optional.

The interface remains otherwise unchanged.

I also took the opportunity to migrate the ExcelTable project to its own GitHub repository :

/mbleron/ExcelTable

The dependencies were migrated as well to :

/mbleron/MSUtilities

 

The XLS reader

My implementation of the XLS reader is based on the official Microsoft specs published on MSDN :
[MS-XLS]: Excel Binary File Format (.xls) Structure

For an overview :
Understanding the Excel .xls Binary File Format

I’ll just say a few words about the format itself because it’s very well explained in the specs.
Basically, an .xls file is a CFBF container whose ‘Workbook’ binary stream contains the workbook structure and data :

SQL> select *
  2  from table(xutl_cdf.get_streams(file2blob('TMP_DIR','ooxdata2c.xls')));
 
PATH                              STREAM_SIZE CREATION_TIME MODIFIED_TIME STREAM
--------------------------------- ----------- ------------- ------------- ------
/CompObj                                 122                             <BLOB>
/DocumentSummaryInformation              236                             <BLOB>
/SummaryInformation                      236                             <BLOB>
/Workbook                               71871                             <BLOB>
 

The Workbook content is composed of a sequence of records sharing a common header structure (type and record size) followed by the actual record data. This format is known as BIFF8.
The work of the XLS reader is to scan through those records and extract the information we need to expose data with ExcelTable, the same way it’s done for .xlsx files :

  • Encryption information
  • Sheet list
  • Shared strings
  • Comments
  • Cell data
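
As an aside, scanning those records is straightforward, since each header is just two 2-byte little-endian values (type, then data size). Here's a rough sketch dumping the first few record headers of a Workbook stream; an illustration only, not the actual XUTL_XLS code :

declare
  stream   blob := xutl_cdf.get_stream(file2blob('TMP_DIR','ooxdata2c.xls'), '/Workbook');
  pos      integer := 1;
  rec_typ  pls_integer;
  rec_sz   pls_integer;
begin
  for i in 1 .. 20 loop
    -- 4-byte header : record type (2 bytes LE), then record data size (2 bytes LE)
    rec_typ := utl_raw.cast_to_binary_integer(dbms_lob.substr(stream, 2, pos), utl_raw.little_endian);
    rec_sz  := utl_raw.cast_to_binary_integer(dbms_lob.substr(stream, 2, pos + 2), utl_raw.little_endian);
    dbms_output.put_line('record type = '||to_char(rec_typ,'XXXX')||', size = '||rec_sz);
    pos := pos + 4 + rec_sz;  -- jump to the next record header
  end loop;
end;
/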

As listed above, encrypted .xls files are also supported. The RC4 encryption method used by default is described in the [MS-OFFCRYPTO] specs.
The corresponding key-derivation routines have been added to the XUTL_OFFCRYPTO package.

Although the XLS reader is primarily meant to be used internally by ExcelTable, it also possesses a “raw” pipelined table interface to directly extract cell data from the Workbook stream.

Here’s an example :

SQL> select t.cellrow
  2       , t.cellcol
  3       , case when t.cellData.getTypeName() = 'SYS.VARCHAR2' then t.cellData.accessVarchar2() end as strval
  4       , case when t.cellData.getTypeName() = 'SYS.NUMBER' then t.cellData.accessNumber() end as numval
  5       , case when t.cellData.getTypeName() = 'SYS.CLOB' then t.cellData.accessClob() end as lobval
  6  from table(
  7         xutl_xls.getRows(
  8           p_file     => xutl_cdf.get_stream(file2blob('TMP_DIR','ooxdata2c.xls'),'/Workbook')
  9         , p_sheet    => 'DataSource'
 10         , p_password => 'pass123'
 11         , p_cols     => 'A,B,C,D,E,F'
 12         , p_firstRow =>  1
 13         , p_lastRow  =>  91
 14         )
 15       ) t
 16  ;
 
   CELLROW CELLCOL STRVAL             NUMVAL LOBVAL
---------- ------- -------------- ---------- --------
         1 A                               1 
         1 B       LINE-00001                
         1 C       ABCD                      
         1 D       1899-12                   
         2 A                               2 
         2 B       LINE-00002                
         2 C       ABC                       
         2 D                               2 
         2 F       TEST                      
         3 A                               3 
         3 B       LINE-00003                
         3 C       ABC                       
         3 D                               3 
         4 A                               4 
         4 B       LINE-00004                
         4 C       ABC                       
         4 D                               4 
 

Stay tuned for the upcoming addition : XLSB format…

 


Reading Large XML Node Values using the DOM API

March 24, 2018

As of Oracle release 11.1, the DBMS_XMLDOM API has been extended with a set of stream-based functions and procedures to read and write large node data (i.e. data exceeding the VARCHAR2 size limit) :
Link to documentation

Sounds great on paper and well documented, but looking closer and actually trying those features on real examples turns out to be a bit disappointing.
That's why I could have subtitled this post "The Good, the Bad and the Ugly".

 

1- Reading a large node value composed of single-byte characters : the “Good”

In this example, I’ll use a simple XML document containing a single 128K text node :

<test>XXXX ... XXXX</test>

and try to read it into a CLOB instance via the DOM streaming API.

declare

  xmldoc   xmltype;
  domdoc   dbms_xmldom.DOMDocument;
  node     dbms_xmldom.DOMNode;
  istream  utl_characterinputstream;
  tmp      varchar2(32767);
  nread    integer := 32767;
  output   clob;
  
begin
  
  select xmlelement("test", rpad(to_clob('X'),131072,'X'))
  into xmldoc
  from dual;
  
  domdoc := dbms_xmldom.newDOMDocument(xmldoc);
  node := dbms_xslprocessor.selectSingleNode(dbms_xmldom.makenode(domdoc), '/test/text()');
  
  dbms_lob.createtemporary(output, true);
  
  istream := dbms_xmldom.getNodeValueAsCharacterStream(node);
  
  loop
    istream.read(tmp, nread);
    exit when nread = 0;
    dbms_lob.writeappend(output, nread, tmp);
  end loop;
  
  istream.close();
  
  dbms_output.put_line('output size = '||dbms_lob.getlength(output));
  
  dbms_lob.freetemporary(output);
  dbms_xmldom.freeDocument(domdoc);
  
end;  
/

output size = 131072
 
PL/SQL procedure successfully completed.

Works great.
We retrieve an input stream for the large node by calling the getNodeValueAsCharacterStream() function.
Node data is then read from that stream in 32K chunks, and appended to a temp CLOB.

This was tested on 12.1.0.2, with database character set AL32UTF8.
Note that the large string we want to extract in this example consists only of single-byte characters.

Let’s see what happens when the data contains multi-byte characters…

 
