Home > XML DB > Inside Binary XML

Inside Binary XML

As you might know, Oracle 11g introduced a new format to store and transfer XML content : Binary XML, also known as CSX (Compact Schema-aware XML), which provides optimization in both storage and processing areas.
Binary XML became the default XMLType storage method in version 11.2.0.2.

The purpose of this post is to present a short overview of the format, with a concrete example, as well as some Oracle specifics regarding Binary XML persistent storage in the database.

Although Binary XML has been designed with schema-awareness in mind, I’ll only focus on the non schema-based format.

 

Size optimization

Binary XML achieves metadata compression by using token tables to store and share unique references to elements or attributes, and a wide range of binary opcodes to encode nodes depending on their natures and actual sizes.

When using Binary XML in the database, by default, Oracle stores metadata in two internal token tables in the XDB schema, one for namespace URIs and the other for QNames (qualified names).

Table names are system-generated and can be retrieved using DBMS_CSX_ADMIN :

SQL> select xdb.dbms_csx_admin.NamespaceIdTable
  2       , xdb.dbms_csx_admin.QnameIdTable
  3  from dual;
 
NAMESPACEIDTABLE                     QNAMEIDTABLE
------------------------------------ ------------------------------------
XDB.X$NM46FTFDTC7V946BVPU5TH71DPAC   XDB.X$QN46FTFDTC7V946BVPU5TH71DPAC
 
SQL> desc XDB.X$QN46FTFDTC7V946BVPU5TH71DPAC                                    
 Name                                      Null?    Type                        
 ----------------------------------------- -------- ----------------------------
 NMSPCID                                            RAW(8)                      
 LOCALNAME                                          VARCHAR2(2000)              
 FLAGS                                              RAW(4)                      
 ID                                                 RAW(8)                      

SQL> desc XDB.X$NM46FTFDTC7V946BVPU5TH71DPAC
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 NMSPCURI                                           VARCHAR2(2000)
 ID                                                 RAW(8)
 

For example :

SQL> select qname.*
  2       , nmspc.nmspcuri
  3  from XDB.X$QN46FTFDTC7V946BVPU5TH71DPAC qname
  4       join XDB.X$NM46FTFDTC7V946BVPU5TH71DPAC nmspc on nmspc.id = qname.nmspcid
  5  where qname.id in (
  6    hextoraw('6C95')
  7  , hextoraw('1CA2')
  8  );
 
NMSPCID          LOCALNAME   FLAGS    ID               NMSPCURI
---------------- ----------- -------- ---------------- -------------------------------
5A81             item3       00       6C95             test
266A             type        01       1CA2             http://www.w3.org/1999/xlink
 

In this example, the first row stores metadata for an element (FLAGS=00) whose local name is “item3” and namespace URI is “test”.
The second row stores metadata for an attribute (FLAGS=01) whose local name is “type” and namespace URI is “http://www.w3.org/1999/xlink

Every single occurrence of those two QNames in any XML documents will be encoded using their corresponding token IDs.

 

Decoding Binary XML

Let’s now create a simple test case and get our hands dirty…

The below XML document will be used. It is sufficiently simple to not be overwhelmed by the number of nodes and yet contains at least one occurrence of each node types – element and attribute (namespace-qualified or not), comment, text and processing instruction :

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root xmlns:ns0="test">
  <!-- my comment -->
  <item ns0:id="1" id2="2">ABC</item>
  <item2 xmlns="dummy">
    <sub>123</sub>
  </item2>
  <item3>
    <item4>1</item4>
    <item4>2</item4>
    <item4>3</item4>
  </item3>
  <?mypi test pi?>
</root>

SQL> create table tmp_xml of xmltype;
 
Table created.
 
SQL> 
SQL> insert into tmp_xml values(
  2    xmlparse(document
  3  '<?xml version="1.0" encoding="UTF-8" standalone="no"?>
  4  <root xmlns:ns0="test">
  5    <!-- my comment -->
  6    <item ns0:id="1" id2="2">ABC</item>
  7    <item2 xmlns="dummy">
  8      <sub>123</sub>
  9    </item2>
 10    <item3>
 11      <item4>1</item4>
 12      <item4>2</item4>
 13      <item4>3</item4>
 14    </item3>
 15    <?mypi test pi?>
 16  </root>'
 17    )
 18  );
 
1 row inserted.
 

The binary content is stored in a hidden BLOB column, as SECUREFILE (unless otherwise specified during table creation).
The following queries shows the internal column (XMLDATA) created for the purpose, and its content :

SQL> select column_name, data_type, hidden_column, virtual_column
  2  from user_tab_cols
  3  where table_name = 'TMP_XML';
 
COLUMN_NAME      DATA_TYPE  HIDDEN_COLUMN VIRTUAL_COLUMN
---------------- ---------- ------------- --------------
SYS_NC_ROWINFO$  XMLTYPE    NO            YES
SYS_NC_OID$      RAW        YES           NO
XMLDATA          BLOB       YES           NO            
 
SQL> select t.xmldata                                                           
  2  from tmp_xml t;                                                            
                                                                                
XMLDATA                                                                         
--------------------------------------------------------------------------------
9F01639E00000FB20300005A8100016E7330C8150CDD0001AB0C206D7920636F6D6D656E7420C820
8DC0006ACA31C0007DB33202414243D9B200000039630002C8675BDD0002C0027C07313233D9C85D
B0C00056EC31D700320033D8D9A90B046D79706974657374207069D9A0

Based on the Oracle Binary XML RFC draft, we’ll now “manually” parse that binary stream and elucidate the format.

The opcodes mapping to byte values is missing from the RFC, probably intentionally because the document mentions a “section 16” (that doesn’t exist) where that mapping should be listed.
So I first had to reverse-engineered them myself before I realized they were actually available in the Java interface oracle.xml.binxml.BinXMLConstants.

I’ve summarized the binary structure of the sample document in the below table.

Actual node names (for elements, attributes or namespaces) were looked up in either the QName or Namespace token tables using the token ID extracted from the binary stream :

Opcodes Bytes Description RFC ref.
STRTSEC 9F Start of section §8.4.1, §8.5.5
01 CSX version
63 [01100011]
bit 0 = 1 : no inlined token
bit 1 = 1 : no schema
bit 2 = 0 : no CSX processor GUID
bit 3 = 0 : no Doc-ID
bit 4 = 0 : no Path-ID/Order key
bit 5 = 1 : ?
bit 6 = 1 : ?
bit 7 = 0 : ?
DOC 9E Document node §8.5.5
00 length of charset ID
000F [0000000000001111]
bit 0 = 1 : standalone declared in prolog
bit 1 = 1 : prolog present
bit 2 = 1 : encoding declared
bit 3 = 1 : XML version declared
bit 4 = 0 : standalone = “no”
bit 5 = 0 : ignore whitespaces
bit 6 = 0 : ?
bit 7 = 0 : ?
bits 8-15 = 00000000 : version = “1.0”
DEFPFX4 B2 Namespace definition §8.5.7
03 prefix length
00005A81 namespace token ID → “test”
0001 prefix ID = 1
6E7330 prefix data = “ns0”
PRPSTT2 C8 Start of element or attribute §8.5.8
150C 2-byte token ID → “root”
NMSPC DD Namespace node §8.5.9
0001 2-byte prefix ID = 0001 → xmlns:ns0=”test”
CMT1 AB Comment §8.5.6
0C 1-byte length
206D7920636F
6D6D656E7420
” my comment “
PRPSTT2 C8 Start of element or attribute
208D 2-byte token ID → “item”
PRPT2L1 C0 Simple element or attribute §8.5.8
00 1-byte data length (minus 1)
6ACA 2-byte token ID → “@id”
31 data = “1”
PRPT2L1 C0 Simple element or attribute
00 length = 1
7DB3 token ID → “@id2”
32 data = “2”
DATSTR3 02 3-byte string §8.5.1
414243 data = “ABC”
ENDPRP D9 End of current element → “item” §8.5.9
DEFPFX4 B2 Namespace definition
00 prefix length = 0 → no prefix
00003963 namespace token ID → “dummy”
0002 prefix ID = 2
PRPSTT2 C8 Start of element or attribute
675B 2-byte token ID → “item2”
NMSPC DD Namespace node
0002 prefix ID = 2 → xmlns=”dummy”
PRPT2L1 C0 Simple element or attribute
02 length = 3
7C07 token ID → “sub”
313233 data = “123”
ENDPRP D9 “</item2>”
PRPSTT2 C8 Start of element or attribute
5DB0 “item3”
PRPT2L1 C0 Simple element or attribute
00 length = 1
56EC token ID → “item4”
31 data = “1”
ARRBEG D7 Begin Array mode :
All following opcodes will implicitly serve as content for a new element of the same name as the last one (“item4”)
§8.5.9
DATSTR1 00 1-byte string
32 data = “2” → <item4>2</item4>
DATSTR1 00 1-byte string
33 <item4>3</item4>
ARREND D8 End Array mode §8.5.9
ENDPRP D9 “</item3>”
PI1L1 A9 Processing instruction §8.5.6
0B length of target + data = 11
04 length of target = 4
6D7970697465
7374207069
target = 6D797069 = “mypi”
data = 74657374207069 = “test pi”
ENDPRP D9 “</root>”
ENDSEC A0 End of section §8.5.5

 

The CSXReader package

For learning purpose, I’ve also created a program that reads a binary encoded stream and outputs back XML content as readable text.

I’ve made it available on GitHub :

/mbleron/oracle/CSXReader

Since the program accesses the internal token tables, the user needs the proper admin privileges (I used the DBA role), as well as an explicit grant on the DBMS_CSX_ADMIN package :

grant execute on xdb.dbms_csx_admin to <user>;

Example on the sample document :

SQL> select CSXReader.getXML(t.xmldata) from tmp_xml t;

CSXREADER.GETXML(T.XMLDATA)
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8" standalone="no"?><root xmlns:ns0="test"><!-
- my comment --><item ns0:id="1" id2="2">ABC</item><item2 xmlns="dummy"><sub>123
</sub></item2><item3><item4>1</item4><item4>2</item4><item4>3</item4></item3><?m
ypi test pi?></root>


SQL> select XMLSerialize(document CSXReader.getXMLType(t.xmldata) indent) from tmp_xml t;

XMLSERIALIZE(DOCUMENTCSXREADER.GETXMLTYPE(T.XMLDATA)INDENT)
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8" standalone='no'?>
<root xmlns:ns0="test">
  <!-- my comment -->
  <item ns0:id="1" id2="2">ABC</item>
  <item2 xmlns="dummy">
    <sub>123</sub>
  </item2>
  <item3>
    <item4>1</item4>
    <item4>2</item4>
    <item4>3</item4>
  </item3>
  <?mypi test pi?>
</root>

Again, except for learning purpose and fun (of course!), this program has no practical use in real life as Oracle already provides out-of-the-box and transparent serialization methods.

 

The Path-ID table

In addition to the two token tables we’ve seen so far, I’ll also say a few words about the third shared table : the Path-ID table.
It stores unique paths to elements and attributes (including namespace nodes) and is used in conjunction with unstructured XML indexes, where the PATH_ID column of the index PATH table matches the ID of the Path-ID table.

SQL> select xdb.dbms_csx_admin.PathIdTable
  2  from dual;

PATHIDTABLE
--------------------------------------------------------------------------------
XDB.X$PT46FTFDTC7V946BVPU5TH71DPAC

SQL> desc XDB.X$PT46FTFDTC7V946BVPU5TH71DPAC
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 PATH                                               RAW(2000)
 ID                                                 RAW(8)

For example :

SQL> select path
  2  from XDB.X$PT46FTFDTC7V946BVPU5TH71DPAC
  3  where id = hextoraw('775F');

PATH
--------------------------------------------
02150C0248D60247C0

The path is a sequence of QName token IDs, each prefixed by a length byte :

02

length = 2 bytes
150C 2-byte token ID
02 length = 2 bytes
48D6 2-byte token ID
02 length = 2 bytes
47C0 2-byte token ID

If we now create a simple unstructured XML index on TMP_XML table with the default options, all paths in the sample document will be indexed and the shared Path-ID table automatically populated.

create index tmp_xml_uxi on tmp_xml (object_value)
indextype is xdb.xmlindex;

Using the following recursive query, we can parse the binary-encoded paths and rebuild them in a human-readable form :

SQL> with tokenizer (id, path, len, step, localname) as (
  2    select id, utl_raw.substr(path, 2), utl_raw.cast_to_binary_integer(utl_raw.substr(path, 1, 1)), 0, ''
  3    from XDB.X$PT46FTFDTC7V946BVPU5TH71DPAC
  4    -- filter paths starting with "root"
  5    where utl_raw.substr(path,1,3) = hextoraw('02150C')
  6    union all
  7    select t.id
  8         , case when utl_raw.length(path) > len then utl_raw.substr(path, len+2) end
  9         , case when utl_raw.length(path) > len then utl_raw.cast_to_binary_integer(utl_raw.substr(path, len+1, 1)) end
 10         , step + 1
 11         , case when qn.flags = hextoraw('01') then '@' end || qn.localname
 12    from tokenizer t
 13         join XDB.X$QN46FTFDTC7V946BVPU5TH71DPAC qn on qn.id = utl_raw.substr(t.path, 1, t.len)
 14    where path is not null
 15  )
 16  select id
 17       , '/' || listagg(localname, '/') within group (order by step) as xpath
 18  from tokenizer
 19  where step > 0
 20  group by id
 21  order by xpath;
 
ID    XPATH
----- -------------------
6C75  /root
36B7  /root/@ns0
2DF1  /root/item
1244  /root/item/@id
63EA  /root/item/@id2
7863  /root/item2
47EC  /root/item2/@xmlns
7330  /root/item2/sub
4D1B  /root/item3
0FC5  /root/item3/item4
 
10 rows selected.
 

That concludes this short overview of binary XML.
To be done in a future article : schema-based binary XML and unstructured XML index internals…

 

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: