
Simple CSV parsing using XMLTABLE or JSON_TABLE

Oracle does not currently provide any built-in functionality to parse a flat file stored in the database as a CLOB (or BLOB), the way the External Table facility does for external files.
Mike Kutz submitted the idea a couple of years ago : parse CLOBs/BLOBs using "external table" capability.

Parsing simple CSV content with no quoted fields is, however, possible with minimal effort using the XMLTABLE or JSON_TABLE functions.
Both approaches rely on a little preprocessing to convert the data to the required format before passing it to the corresponding function.

The last part of this post will focus on the limitations.

 

Using XMLTABLE

The idea is to convert the CSV content into a sequence of XML fragments, so that :

1,ABC,123
2,XYZ,567

will be preprocessed into :

<r><c>1</c><c>ABC</c><c>123</c></r>
<r><c>2</c><c>XYZ</c><c>567</c></r>
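
As a quick illustration, here's a minimal sketch running XMLTABLE directly over that two-row fragment (the column names c1, c2, c3 are arbitrary, and no escaping is needed since the sample contains no XML-reserved characters) :

select x.*
from xmltable(
       '/r'
       passing xmlparse(content
                 '<r><c>1</c><c>ABC</c><c>123</c></r>' ||
                 '<r><c>2</c><c>XYZ</c><c>567</c></r>'
               )
       columns c1 number        path 'c[1]'
             , c2 varchar2(10)  path 'c[2]'
             , c3 number        path 'c[3]'
     ) x ;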

 

For example, given a table like this :

create table csv_store (
  id    integer
, data  clob
);
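
As a side note, here is one possible way to populate such a table from an actual file. It's only a sketch, assuming a directory object named DATA_DIR and a file named sample.csv (both placeholders) :

declare
  l_clob  clob;
  l_bfile bfile := bfilename('DATA_DIR', 'sample.csv');  -- placeholder directory object and file name
  l_dest  integer := 1;
  l_src   integer := 1;
  l_ctx   integer := dbms_lob.default_lang_ctx;
  l_warn  integer;
begin
  -- create the row first, then load the file content into its CLOB column
  insert into csv_store (id, data)
  values (1, empty_clob())
  returning data into l_clob;

  dbms_lob.fileopen(l_bfile, dbms_lob.file_readonly);
  dbms_lob.loadclobfromfile(
    dest_lob     => l_clob
  , src_bfile    => l_bfile
  , amount       => dbms_lob.lobmaxsize
  , dest_offset  => l_dest
  , src_offset   => l_src
  , bfile_csid   => dbms_lob.default_csid
  , lang_context => l_ctx
  , warning      => l_warn
  );
  dbms_lob.fileclose(l_bfile);
  commit;
end;
/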

whose CLOB column contains this data (generated from DBA_OBJECTS) :

20,SYS,ICOL$,TABLE,2014-09-11T08:40:51,60BQ37Z6QU1FMRFETJRTXVG97MX58MTFSZLVSDDVTN5R217OU3KXRVZDM8WCUTLITMAE4SAM7XK0WNX1PH30HTTE7H8M6YEM9G56,821943.0591
4,SYS,TAB$,TABLE,2014-09-11T08:40:51,HA104XOIXFCPZDCMEX13J0AQO9GEQ6Q483QM1P0R9RXGLYSDCOCZ6GK6KHEV4EN6J3C02CDOPZJ4Z341XSDWNABTL5H53NXSU9HN,307403.4572
47,SYS,I_USER2,INDEX,2014-09-11T08:40:51,611NW45KO7HASSSPYZ48T650BGJ81RA962LOI27ZIAVY9YTLXW33OTBIHLWXWGTVA2ZBU0OGA0735YM9CNCPLLSNOHMEI1UTDV94,953217.7555
58,SYS,I_CCOL2,INDEX,2014-09-11T08:40:51,I28WA7TSXSU89NH1MJYKMPD1JAOVG73GOH9XJJDJKIXQAVKNDRYHZIZ1CWSRFW8M946S7UK61VBWLYPX1MDFQ3I1E2PYGL5UN3JH,723613.7001
36,SYS,I_OBJ1,INDEX,2014-09-11T08:40:51,5TFN9EKGPYEI5750RON6WRNGQ0PZVDB4JCPWR07SHGW7W3GDTIU0B56WU4N5RVL8B4EW7CS84KESV4TKIJ06IH0DHD1JVS0SKB43,449681.8371
5,SYS,CLU$,TABLE,2014-09-11T08:40:51,9LY28S69ZGDONBADMLSHGF24T1PWIWBEFILSSCSHHQ51DU8EML2SJ6L3NOZROHNFYYGGO3U3CYJU9BJGMSO22WH26KJHUFWFPMH2,929358.6482
10,SYS,C_USER#,CLUSTER,2014-09-11T08:40:51,3IZKLMTI4J9GNTRE4H5JZYIJRJ1K1W7F2F7EAVK11CM8C98KEK8LGLGV115NI0X7ZHUYPBF6TSRB7R454A6HHOA3GC1TWI80TH61,591723.4671

<91336 lines total>

We can then use this query to parse the CLOB into relational rows and columns :

select x.* 
from csv_store t
   , xmltable(
       '/r'
       passing xmlparse(content 
                 '<r><c>' ||
                 replace(
                   replace(
                     dbms_xmlgen.convert(t.data)
                   , ','
                   , '</c><c>'
                   )
                 , chr(10)
                 , '</c></r><r><c>'
                 ) ||
                 '</c></r>'
               )
       columns object_id   number        path 'c[1]'
             , owner       varchar2(30)  path 'c[2]'
             , object_name varchar2(128) path 'c[3]'
             , object_type varchar2(23)  path 'c[4]'
             , created     timestamp     path 'c[5]'
             , rand_string varchar2(100) path 'c[6]'
             , rand_number number        path 'c[7]'
     ) x
where t.id = 1 ;

A few comments about the query :

  • The DBMS_XMLGEN.CONVERT call is necessary to escape XML-reserved characters that may exist in the data.
  • In this example, the query expects a single LF (ASCII 10) character as record delimiter and a comma as field delimiter. Of course, that could be changed depending on the input file format (see the sketch after this list).
  • Though one may easily be added, the transient XML content generated this way doesn't have a root element, hence the use of the CONTENT option of the XMLPARSE function.
  • The projection defined in the COLUMNS clause leverages the built-in mapping between the XML and SQL data models to automatically convert the xs:dateTime format to a SQL TIMESTAMP. If a field format doesn't conform to the XML model, the projection must be defined as VARCHAR2 and the actual conversion performed in the SELECT clause.
  • This approach will only perform decently on version 11.2.0.4 and onwards, thanks to the optimization made on XMLTABLE for in-memory evaluation.
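
For instance, a possible variant of the preprocessing step for a semicolon-delimited file with CRLF line endings could look like this (only the two delimiters change, the COLUMNS clause stays the same) :

select x.*
from csv_store t
   , xmltable(
       '/r'
       passing xmlparse(content
                 '<r><c>' ||
                 replace(
                   replace(
                     dbms_xmlgen.convert(t.data)
                   , ';'                    -- field delimiter : semicolon instead of comma
                   , '</c><c>'
                   )
                 , chr(13) || chr(10)       -- record delimiter : CRLF instead of LF
                 , '</c></r><r><c>'
                 ) ||
                 '</c></r>'
               )
       columns object_id   number        path 'c[1]'
             , owner       varchar2(30)  path 'c[2]'
             , object_name varchar2(128) path 'c[3]'
             , object_type varchar2(23)  path 'c[4]'
             , created     timestamp     path 'c[5]'
             , rand_string varchar2(100) path 'c[6]'
             , rand_number number        path 'c[7]'
     ) x
where t.id = 1 ;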

 

Using JSON_TABLE

The idea is to generate an array of string arrays, so that :

1,ABC,123
2,XYZ,567

will be preprocessed into :

[["1","ABC","123"],
 ["2","XYZ","567"]]

 

The query will then look like this :

select x.object_id
     , x.owner
     , x.object_name
     , x.object_type
     , to_timestamp(created, 'YYYY-MM-DD"T"HH24:MI:SS') as created
     , x.rand_string, x.rand_number 
from csv_store t
   , json_table(
       '[["'||replace(replace(t.data, ',', '","'), chr(10), '"],["')||'"]]'
     , '$[*]'
       columns object_id   number        path '$[0]'
             , owner       varchar2(30)  path '$[1]'
             , object_name varchar2(128) path '$[2]'
             , object_type varchar2(23)  path '$[3]'
             , created     varchar2(19)  path '$[4]'
             , rand_string varchar2(100) path '$[5]'
             , rand_number number        path '$[6]'
     ) x
where t.id = 1 ;

(Of course, since we’re using JSON_TABLE, we can only run it on database version 12.1.0.2 and onwards.)

 

Performance considerations

Here’s a compilation of run times for both queries (CTAS), using input CSVs of different sizes.
The “n/a” label means the query didn’t finish, running out of PGA (ORA-04036).

Rows      Size (MiB)   XMLTABLE (s)   JSON_TABLE (s)
91336     15.5         3.8            2.2
182672    31           7.5            4.3
365344    62           14.5           9.4
730688    124          186            22
1461376   248          n/a            840
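
For reference, the timings were taken with CTAS statements of this kind, sketched here for the JSON_TABLE version (the target table name tmp_csv_json is arbitrary) :

create table tmp_csv_json as
select x.object_id
     , x.owner
     , x.object_name
     , x.object_type
     , to_timestamp(x.created, 'YYYY-MM-DD"T"HH24:MI:SS') as created
     , x.rand_string
     , x.rand_number
from csv_store t
   , json_table(
       '[["'||replace(replace(t.data, ',', '","'), chr(10), '"],["')||'"]]'
     , '$[*]'
       columns object_id   number        path '$[0]'
             , owner       varchar2(30)  path '$[1]'
             , object_name varchar2(128) path '$[2]'
             , object_type varchar2(23)  path '$[3]'
             , created     varchar2(19)  path '$[4]'
             , rand_string varchar2(100) path '$[5]'
             , rand_number number        path '$[6]'
     ) x
where t.id = 1 ;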

As we can see, the run time skyrockets around 500,000 rows for XMLTABLE and 1,000,000 rows for JSON_TABLE.
The chart below focuses on the apparently linear part :

[Chart : run time (s) vs. number of rows for XMLTABLE and JSON_TABLE, linear part only]

 
