Home > Miscellaneous, PL/SQL, SQL, XML DB > Yet Another XML Flattening Technique

Yet Another XML Flattening Technique

As a follow-up to my previous post introducing XMLNest function, here’s now its “inverse” (to borrow from maths terminology) : XMLFlattenDOM, a PL/SQL DOM-based pipelined function.

We’ll see in the last part how this approach compares to the others described earlier :

  • Java pipelined function
  • XQuery/XSLT
  •  

    1. The XMLFlattenDOM function

    The purpose of the function is to produce an “Edge” dataset, exactly like the one used for the Java-based approach.
    Therefore, I’ll reuse the same object type and collection : XMLEdgeTableRow and XMLEdgeTable.

    And here’s the code :

    create or replace function XMLFlattenDOM (
      xmldoc in XMLType
    ) 
    return XMLEdgeTable 
    pipelined
    is
    
      XMLNS_ATTRIBUTE  constant varchar2(32) := 'xmlns';
    
      type stack_item is record (id integer, nlist dbms_xmldom.DOMNodeList, len integer, idx integer);
      type stack      is table of stack_item;
    
      doc              dbms_xmldom.DOMDocument;
      node             dbms_xmldom.DOMNode;
      nodeList         dbms_xmldom.DOMNodeList;
      attr             dbms_xmldom.DOMNode;
      attrList         dbms_xmldom.DOMNamedNodeMap;
      
      edge             XMLEdgeTableRow := new XMLEdgeTableRow(null, null, null, null, null, null);
      st               stack := stack();
      i                pls_integer := 0;
      id               integer := 0;
      backward         boolean := false;
      nodeType         pls_integer;
      prefix           varchar2(30);
        
    begin
      
      doc := dbms_xmldom.newDOMDocument(xmldoc);
      node := dbms_xmldom.makeNode(doc);
      nodeType := dbms_xmldom.getNodeType(node);
    
      loop
    
        if not backward and nodeType in (dbms_xmldom.ELEMENT_NODE, dbms_xmldom.DOCUMENT_NODE) then
          
          nodeList := dbms_xmldom.getChildNodes(node);       
          if not dbms_xmldom.isNull(nodeList) then
            st.extend;
            i := i+1;
            if id > 0 then
              st(i).id := id;
            end if;
            st(i).nlist := nodeList;
            st(i).len := dbms_xmldom.getLength(nodeList);
            st(i).idx := 0;
          end if;
    
          attrList := dbms_xmldom.getAttributes(node);
          
          if not dbms_xmldom.isNull(attrList) then
    
            edge.parent_node_id := id;
            
            for i in 0 .. dbms_xmldom.getLength(attrList)-1 loop
    
              attr := dbms_xmldom.item(attrList, i);
              prefix := dbms_xmldom.getPrefix(attr);
              dbms_xmldom.getLocalName(attr, edge.node_name);
              
              if not(prefix is null and edge.node_name = XMLNS_ATTRIBUTE 
                     or prefix = XMLNS_ATTRIBUTE and prefix is not null) 
              then 
                
                id := id + 1;
                edge.node_id := id;
                
                edge.node_type := 'attribute';
                edge.node_value := dbms_xmldom.getNodeValue(attr);
                dbms_xmldom.getNamespace(attr, edge.namespace_uri);
                
                pipe row (edge);
                
              end if;
              
            end loop;
          end if;
        
        end if;
        
        dbms_xmldom.freeNode(node);
        
        if st(i).idx < st(i).len then
          
          backward := false;
          
          node := dbms_xmldom.item(st(i).nlist, st(i).idx);
          edge.parent_node_id := st(i).id;
          
          id := id + 1;
          edge.node_id := id;
          
          nodeType := dbms_xmldom.getNodeType(node);
          
          if nodeType in (dbms_xmldom.ELEMENT_NODE, dbms_xmldom.PROCESSING_INSTRUCTION_NODE) then
            dbms_xmldom.getLocalName(node, edge.node_name);
          else
            edge.node_name := null;
          end if;  
          
          edge.node_type := 
            case nodeType
            when dbms_xmldom.ELEMENT_NODE then 'element'
            when dbms_xmldom.TEXT_NODE then 'text'
            when dbms_xmldom.CDATA_SECTION_NODE then 'text'
            when dbms_xmldom.PROCESSING_INSTRUCTION_NODE then 'processing-instruction'
            when dbms_xmldom.COMMENT_NODE then 'comment'
            end ;
          
          edge.node_value := dbms_xmldom.getNodeValue(node);
          dbms_xmldom.getNamespace(node, edge.namespace_uri);
          
          pipe row (edge);
    
          st(i).idx := st(i).idx + 1;
    
        else
          dbms_xmldom.freeNodeList(st(i).nlist);
          st.trim;
          i := i - 1;
          backward := true;
        end if;
    
        exit when st is empty;
    
      end loop;
    
      dbms_xmldom.freeDocument(doc);
      
      return;
      
    end XMLFlattenDOM;
    

    The implementation is actually simple. A DOMDocument is built over the input XMLType instance and each node is analyzed recursively (depth-first).
    Since we cannot call a pipelined function recursively, the document is processed using a loop and a stack structure to keep track of ancestors and the node list currently being iterated.

     

    2. Example

    SQL> select *
      2  from table(
      3         xmlflattendom(
      4           xmlparse(document
      5            '<root xmlns:ns0="test">
      6              <!-- my comment -->
      7              <item ns0:id="1" id2="2">ABC</item>
      8              <item2 xmlns="dummy"/>
      9              <ns0:item3>
     10                <item4>
     11                  <item5>123<sub><![CDATA[R&D]]></sub></item5>
     12                </item4>
     13              </ns0:item3>
     14              <?mypi test pi?>
     15            </root>'
     16           )
     17         )
     18       )
     19  ;
     
       NODE_ID NODE_NAME    NODE_TYPE               PARENT_NODE_ID NODE_VALUE       NAMESPACE_URI
    ---------- ------------ ----------------------- -------------- ---------------- -------------------
             1 root         element                                                 
             2              comment                              1  my comment      
             3 item         element                              1                  
             4 id           attribute                            3 1                test
             5 id2          attribute                            3 2                
             6              text                                 3 ABC              
             7 item2        element                              1                  dummy
             8 item3        element                              1                  test
             9 item4        element                              8                  
            10 item5        element                              9                  
            11              text                                10 123              
            12 sub          element                             10                  
            13              text                                12 R&D             
            14 mypi         processing-instruction               1 test pi          
     
    14 rows selected.
     
    

     

    3. Performance

    Using this XML file (3.8MB spreadsheetML file) stored in a binary XMLType table :

    create table tmp_xml of xmltype;
    
    insert into tmp_xml values (
      xmltype(bfilename('TMP_DIR','test_az.xml'), nls_charset_id('AL32UTF8'))
    );
    

    The comparison is made on the time to execute a CTAS using each of the four approaches.
    The table created this way (MY_EDGE_TABLE) is dropped before each run but that step is not shown in the following outputs.

    XMLFlattenDOM :

    Connected to:
    Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
    With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
    
    SQL> set timing on
    SQL>
    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml t
      4     , table(XMLFlattenDOM(t.object_value)) x
      5  ;
    
    Table created.
    
    Elapsed: 00:00:06.25
    SQL> select count(*) from my_edge_table;
    
      COUNT(*)
    ----------
        318115
    
    

    XMLFlatten (Java-based pipelined function) :

    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml t
      4     , table(XMLFlatten(t.object_value, 0)) x
      5  ;
    
    Table created.
    
    Elapsed: 00:00:03.44
    
    

    Recursive XQuery function :

    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml t
      4     , XMLTable(
      5       'declare function local:getChildren($e as node(), $pid as xs:string?) as element()*
      6        {
      7          for $i at $p in $e/(node()|@*)
      8          let $id := if ($pid) then concat($pid,".",$p) else "1"
      9          return element r
     10          {
     11            element node_id {$id}
     12          , element parent_node_id {$pid}
     13          , element node_name {local-name($i)}
     14          , element node_value { if (not($i instance of element())) then data($i) else () }
     15          , element node_type {
     16              typeswitch($i)
     17                case text()      return "text"
     18                case attribute() return "attribute"
     19                case element()   return "element"
     20                case processing-instruction() return "processing-instruction"
     21                case comment()   return "comment"
     22                default return "other"
     23            }
     24          , element namespace_uri {namespace-uri($i)}
     25          }
     26          | local:getChildren($i, $id)
     27        }; (: :)
     28        local:getChildren($d,())'
     29        passing t.object_value as "d"
     30        columns node_id         varchar2(100)   path 'node_id'
     31              , node_name       varchar2(2000)  path 'node_name'
     32              , node_value      varchar2(4000)  path 'node_value'
     33              , parent_node_id  varchar2(100)   path 'parent_node_id'
     34              , node_type       varchar2(30)    path 'node_type'
     35              , namespace_uri   varchar2(2000)  path 'namespace_uri'
     36       ) x
     37  ;
    
    Table created.
    
    Elapsed: 00:00:07.14
    
    

    XSLT + XQuery :

    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml
      4     , xmltable('/ROWSET/ROW'
      5         passing xmltransform(
      6                   object_value
      7                 , xdburitype('/public/xslt/xmlflatten.xsl').getXML()
      8                 )
      9         columns node_id         varchar2(100)  path '@id'
     10               , node_name       varchar2(2000) path '@name'
     11               , node_value      varchar2(4000) path 'text()'
     12               , parent_node_id  varchar2(100)  path '@pid'
     13               , node_type       varchar2(30)   path '@type'
     14               , namespace_uri   varchar2(2000) path '@nsuri'
     15       ) x
     16  ;
    
    Table created.
    
    Elapsed: 00:00:06.69
    
    

    Summary :

    xmlflatten_plot01

    The Java method is still leading with roughly half the time of the other approaches.
    Let’s now see the figures for a significantly bigger XML document, for example one of the MEDLINE samples (138MB) available here :

    XMLFlattenDOM :

    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml t
      4     , table(XMLFlattenDOM(t.object_value)) x
      5  ;
    
    Table created.
    
    Elapsed: 00:01:59.64
    SQL>
    SQL>
    SQL> select count(*) from my_edge_table;
    
      COUNT(*)
    ----------
       5871589
    
    

    XMLFlatten (Java-based pipelined function) :

    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml t
      4     , table(XMLFlatten(t.object_value, 0)) x
      5  ;
    
    Table created.
    
    Elapsed: 00:00:54.67
    
    

    XSLT + XQuery :

    SQL> create table my_edge_table as
      2  select x.*
      3  from tmp_xml
      4     , xmltable('/ROWSET/ROW'
      5         passing xmltransform(
      6                   object_value
      7                 , xdburitype('/public/xslt/xmlflatten.xsl').getXML()
      8                 )
      9         columns node_id         varchar2(100)  path '@id'
     10               , node_name       varchar2(2000) path '@name'
     11               , node_value      varchar2(4000) path 'text()'
     12               , parent_node_id  varchar2(100)  path '@pid'
     13               , node_type       varchar2(30)   path '@type'
     14               , namespace_uri   varchar2(2000) path '@nsuri'
     15       ) x
     16  ;
    create table my_edge_table as
    *
    ERROR at line 1:
    ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
    
    

    Oops! Running out of PGA for that one, and likewise with the recursive XQuery solution.

    We’re now beginning to see the limitation of full in-memory techniques compared to streaming or DOM-based ones whose memory footprint may be controlled more finely.

    Let’s conclude with a focus on XMLFlatten (Java) vs. XMLFlattenDOM.
    I’ve added a couple of intermediate-sized documents and plotted the following :

    xmlflatten_plot02

    Both techniques scale linearly within the range of the experiment.

     

    Advertisements
  1. paulzip
    April 6, 2016 at 18:22

    Nice. I’m wondering what performance would be like with an iterative XQuery example (depth first iterative rather than depth first recursive one) – which may prevent the PGA blowing – if that were possible using something like XQuery sequences as the stack?

    • April 8, 2016 at 19:20

      Thanks.
      XQuery is not a procedural language. In particular, it lacks a loop construct and variables are immutable, so I’m almost certain it’s not possible to implement the iterative approach.
      Without resorting to recursivity, I think the best we can do is walking down the tree in document order (using a descendant axis) to list all nodes and their parents but that preclude the generation of node IDs.

  2. paulzip
    April 11, 2016 at 12:29

    Could you not use nodes as a stack and iterate nodes? If that’s a stupid suggestion, please ignore, especially as your knowledge of XQuery is far superior to mine, Marc. Perhaps it isn’t possible.

  3. CT
    May 24, 2016 at 20:03

    Hi
    I am new to using xml documents in OracleDatabase. Your knowledge on the subject is very commendable. I am looking forward to define an XML document in a relational table structure. I was looking through the Oracle Documents and Found there is a way we can do this using object relational Model. I am wondering do you have any publication on this subject. If so could you please point me to the post or any other valuable resources.

    examples in oracle documentation is so confusing :-)
    Thank you
    CT

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s