How To : Parse repeating groups correlated by position as relational data

Here’s a new post in the category “How to deal with that unfriendly design?”
Again, since the title is not very explicit, an example will surely explain it all :

<root>
  <ids>
    <id>7369</id>
    <id>7499</id>
    <id>7521</id>
  </ids>
  <names>
    <name>SMITH</name>
    <name>ALLEN</name>
    <name>WARD</name>
  </names>
  <jobs>
    <job>CLERK</job>
    <job>SALESMAN</job>
    <job>SALESMAN</job>
  </jobs>
</root>

This is the result we want out of that data :

   ID NAME    JOB
----- ------- ----------  
 7369 SMITH   CLERK
 7499 ALLEN   SALESMAN
 7521 WARD    SALESMAN

 

0. Set up

This script creates an XMLType table (binary XML storage) and a sample XML document :

create table tmp_xml of xmltype;

insert into tmp_xml
select xmlelement("root"
       , xmlelement("ids", xmlagg(xmlelement("id", level)))
       , xmlelement("names", xmlagg(xmlelement("name", 'NAME'||to_char(level,'fm09999'))))
       , xmlelement("dates", xmlagg(xmlelement("date", cast(date '2015-01-01' + level/24 as timestamp(0)))))
       )
from dual
connect by level <= 10000;

The resulting XML document has the following structure :

<root>
  <ids>
    <id>1</id>
    <id>2</id>
    <id>3</id>
    
    <!-- up to 10000 -->
    
  </ids>
  <names>
    <name>NAME00001</name>
    <name>NAME00002</name>
    <name>NAME00003</name>
    
    <!-- up to NAME10000 -->
    
  </names>
  <dates>
    <date>2015-01-01T01:00:00</date>
    <date>2015-01-01T02:00:00</date>
    <date>2015-01-01T03:00:00</date>

    <!-- up to 2016-02-21T16:00:00 -->

  </dates>
</root>

Expected output :

    ID NAME         DT
------ ------------ --------------------
     1 NAME00001    01/01/2015 01:00:00
     2 NAME00002    01/01/2015 02:00:00
     3 NAME00003    01/01/2015 03:00:00
     4 NAME00004    01/01/2015 04:00:00
...

Following are two methods to achieve that.

 

1. XMLTABLE with a single XQuery expression

select /*+ no_xml_query_rewrite */ x.* 
from tmp_xml t
   , xmltable(
       'for $i at $p in $d/root/ids/id
        return element r {
          $i
        , $d/root/names/name[$p]
        , $d/root/dates/date[$p]
        }'
       passing t.object_value as "d"
       columns id    number(5)    path 'id'
             , name  varchar2(10) path 'name'
             , dt    timestamp    path 'date'
     ) x
;

Here, the FLWOR expression iterates the sequence of ids, gathering the index of each item into the positional variable $p. The return clause then builds a “row” element consisting in the current id and both the name and date elements at the same position in their corresponding groups.

Also note the usage of the NO_XML_QUERY_REWRITE hint which forces evaluation of the XQuery expression by the XVM.
Without the hint, Oracle resolves the XQuery using internal SQL functions but it’s very inefficient when positional iteration is used.

 

2. Multiple XMLTABLEs + PIVOT operation

with unpivoted_data (cn, rn, val) as (
  select 1, x.* from tmp_xml t, xmltable('/root/ids/id' passing t.object_value columns rn for ordinality, val varchar2(4000) path '.') x
  union all
  select 2, x.* from tmp_xml t, xmltable('/root/names/name' passing t.object_value columns rn for ordinality, val varchar2(4000) path '.') x
  union all
  select 3, x.* from tmp_xml t, xmltable('/root/dates/date' passing t.object_value columns rn for ordinality, val varchar2(4000) path '.') x
)
select to_number(id) as id
     , name
     , to_date(dt, 'YYYY-MM-DD"T"HH24:MI:SS') as dt 
from unpivoted_data
     pivot (
       min(val) for cn in (1 as "ID", 2 as "NAME", 3 as "DT")
     ) 
; 

The approach conducted here is more SQL-ish and leverages the power of set-based operations.
Each group (ids, names and dates) is broken into relational rows and assigned a common column number (CN). The whole dataset is then pivoted into three target columns, producing the required result.

 

3. Performance considerations and conclusion

First method, no hint :

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> insert into tmp_xml
  2  select xmlelement("root"
  3         , xmlelement("ids", xmlagg(xmlelement("id", level)))
  4         , xmlelement("names", xmlagg(xmlelement("name", 'NAME'||to_char(level,'fm09999'))))
  5         , xmlelement("dates", xmlagg(xmlelement("date", cast(date '2015-01-01' + level/24 as timestamp(0)))))
  6         )
  7  from dual
  8  connect by level <= 10000;

1 row created.

SQL> commit;

Commit complete.

SQL>
SQL>
SQL> set autotrace traceonly
SQL> set lines 200
SQL> set timing on
SQL>
SQL> select x.*
  2  from tmp_xml t
  3     , xmltable(
  4         'for $i at $p in $d/root/ids/id
  5          return element r {
  6            $i
  7          , $d/root/names/name[$p]
  8          , $d/root/dates/date[$p]
  9          }'
 10         passing t.object_value as "d"
 11         columns id    number(5)    path 'id'
 12               , name  varchar2(10) path 'name'
 13               , dt    timestamp    path 'date'
 14       ) x
 15  ;
^C

I had to cancel the query after 2 hours of execution, waiting on ‘direct path read’ I/O event.
Here’s the corresponding explain plan anyway :

Execution Plan
----------------------------------------------------------
Plan hash value: 3262674909

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name                   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                        |  8168 |   725K|    32   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                        |     1 |     4 |            |          |
|*  2 |   COLLECTION ITERATOR PICKLER FETCH  | XMLSEQUENCEFROMXMLTYPE |    82 |   328 |    29   (0)| 00:00:01 |
|   3 |    SORT AGGREGATE                    |                        |     1 |     4 |            |          |
|*  4 |     COLLECTION ITERATOR PICKLER FETCH| XMLSEQUENCEFROMXMLTYPE |    82 |   328 |    29   (0)| 00:00:01 |
|   5 |  NESTED LOOPS                        |                        |  8168 |   725K|    32   (0)| 00:00:01 |
|   6 |   TABLE ACCESS FULL                  | TMP_XML                |     1 |    87 |     3   (0)| 00:00:01 |
|   7 |   COLLECTION ITERATOR PICKLER FETCH  | XMLSEQUENCEFROMXMLTYPE |  8168 | 32672 |    29   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(SYS_XQFNPREDTRUTH(:B1,VALUE(KOKBF$) /* KOKBI$ */ )=1)
   4 - filter(SYS_XQFNPREDTRUTH(:B1,VALUE(KOKBF$) /* KOKBI$ */ )=1)

Note
-----
   - Unoptimized XML construct detected (enable XMLOptimizationCheck for more information)

 

First method, with NO_XML_QUERY_REWRITE hint :

SQL> select /*+ no_xml_query_rewrite */ x.*
  2  from tmp_xml t
  3     , xmltable(
  4         'for $i at $p in $d/root/ids/id
  5          return element r {
  6            $i
  7          , $d/root/names/name[$p]
  8          , $d/root/dates/date[$p]
  9          }'
 10         passing t.object_value as "d"
 11         columns id    number(5)    path 'id'
 12               , name  varchar2(10) path 'name'
 13               , dt    timestamp    path 'date'
 14       ) x
 15  ;

10000 rows selected.

Elapsed: 00:00:34.68

Execution Plan
----------------------------------------------------------
Plan hash value: 1966229631

--------------------------------------------------------------------------------------
| Id  | Operation                  | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |         |  8168 |   741K|    32   (0)| 00:00:01 |
|   1 |  NESTED LOOPS              |         |  8168 |   741K|    32   (0)| 00:00:01 |
|   2 |   TABLE ACCESS STORAGE FULL| TMP_XML |     1 |    87 |     3   (0)| 00:00:01 |
|   3 |   XMLTABLE EVALUATION      |         |       |       |            |          |
--------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          8  consistent gets
         44  physical reads
          0  redo size
     288929  bytes sent via SQL*Net to client
       7798  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      10000  rows processed

 

Second method :

SQL> with unpivoted_data (cn, rn, val) as (
  2    select 1, x.* from tmp_xml t, xmltable('/root/ids/id' passing t.object_value columns rn for ordinality, val varchar2(4000) path '.') x
  3    union all
  4    select 2, x.* from tmp_xml t, xmltable('/root/names/name' passing t.object_value columns rn for ordinality, val varchar2(4000) path '.') x
  5    union all
  6    select 3, x.* from tmp_xml t, xmltable('/root/dates/date' passing t.object_value columns rn for ordinality, val varchar2(4000) path '.') x
  7  )
  8  select to_number(id) as id
  9       , name
 10       , to_date(dt, 'YYYY-MM-DD"T"HH24:MI:SS') as dt
 11  from unpivoted_data
 12       pivot (
 13         min(val) for cn in (1 as "ID", 2 as "NAME", 3 as "DT")
 14       )
 15  ;

10000 rows selected.

Elapsed: 00:00:01.79

Execution Plan
----------------------------------------------------------
Plan hash value: 4047723032

-----------------------------------------------------------------------------------------
| Id  | Operation                     | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         | 24504 |    47M|    98   (3)| 00:00:02 |
|   1 |  SORT GROUP BY PIVOT          |         | 24504 |    47M|    98   (3)| 00:00:02 |
|   2 |   VIEW                        |         | 24504 |    47M|    96   (0)| 00:00:02 |
|   3 |    UNION-ALL                  |         |       |       |            |          |
|   4 |     NESTED LOOPS              |         |  8168 |    16M|    32   (0)| 00:00:01 |
|   5 |      TABLE ACCESS STORAGE FULL| TMP_XML |     1 |    87 |     3   (0)| 00:00:01 |
|   6 |      VIEW                     |         |  8168 |    15M|    29   (0)| 00:00:01 |
|   7 |       COUNT                   |         |       |       |            |          |
|   8 |        XPATH EVALUATION       |         |       |       |            |          |
|   9 |     NESTED LOOPS              |         |  8168 |    16M|    32   (0)| 00:00:01 |
|  10 |      TABLE ACCESS STORAGE FULL| TMP_XML |     1 |    87 |     3   (0)| 00:00:01 |
|  11 |      VIEW                     |         |  8168 |    15M|    29   (0)| 00:00:01 |
|  12 |       COUNT                   |         |       |       |            |          |
|  13 |        XPATH EVALUATION       |         |       |       |            |          |
|  14 |     NESTED LOOPS              |         |  8168 |    16M|    32   (0)| 00:00:01 |
|  15 |      TABLE ACCESS STORAGE FULL| TMP_XML |     1 |    87 |     3   (0)| 00:00:01 |
|  16 |      VIEW                     |         |  8168 |    15M|    29   (0)| 00:00:01 |
|  17 |       COUNT                   |         |       |       |            |          |
|  18 |        XPATH EVALUATION       |         |       |       |            |          |
-----------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         42  consistent gets
        707  physical reads
          0  redo size
     288929  bytes sent via SQL*Net to client
       7798  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
      10000  rows processed

 

Conclusion :

Method Rows fetched Overall time (second)
Single XQuery 10000 +∞
Single XQuery + NO_XML_QUERY_REWRITE hint 10000 35
XMLTABLEs + PIVOT 10000 1.8

 

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.