Archive for April, 2016

Simple CSV parsing using XMLTABLE or JSON_TABLE

April 22, 2016 2 comments

Oracle does not currently provide any built-in functionality to parse a flat file stored in the database as CLOB (or BLOB), like the External Table facility for external files.
Mike Kutz submitted the idea a couple of years ago : parse CLOBS/BLOBS using “external table” capability.

Parsing a simple CSV content, with no quoted field, is however possible with minimum effort using XMLTABLE or JSON_TABLE functions.
Both approaches are based on a little preprocessing to convert the data to the required format and pass it to the corresponding function.

The last part of this post will focus on the limitations.

Read more…

Categories: JSON, SQL, XML DB Tags: , ,

How To : Load XML from File with Encoding Detection

April 10, 2016 1 comment

This post was inspired by this recent OTN thread : LPX-00200: could not convert from encoding UTF-8 to UCS2, where it was asked if it were possible to load an XML file without actually knowing its character encoding.

We can already build an XMLType instance from a BFILE, using the corresponding constructor :

constructor function XMLType(
  xmlData    IN bfile
, csid       IN number
, schema     IN varchar2 := NULL
, validated  IN number   := 0
, wellformed IN number   := 0
return self as result 

However, the character encoding must be known upfront and passed to the constructor (csid argument).
For example :

SQL> select XMLSerialize(document
  2           xmltype(
  3             bfilename('TMP_DIR','test_utf-8_bom.xml')
  4           , nls_charset_id('AL32UTF8')
  5           )
  6         ) as xmldoc
  7  from dual ;
<?xml version="1.0" encoding="UTF-8"?>

Following is a little helper function that provides basic character encoding detection for XML files.
It should work for most 8-bit encodings (except EBCDIC), as well as UTF-16 (little and big-endian) and of course UTF-8.
The function uses this algorithm to detect the encoding :

  • If a byte order mark (BOM) is present then UTF-8, UTF-16LE or UTF-16BE is assumed, respectively :
    • 0xEFBBBF = UTF-8
    • 0xFFFE = UTF-16LE
    • 0xFEFF = UTF-16BE
  • If no BOM is found then it looks for an XML prolog :
    • If no prolog or encoding declaration is found then a default encoding is assumed.
    • Else, the specified encoding is used.

Read more…