Home > HowTo, PL/SQL, SQL, XML DB > How To : Load XML from File with Encoding Detection

How To : Load XML from File with Encoding Detection

This post was inspired by this recent OTN thread : LPX-00200: could not convert from encoding UTF-8 to UCS2, where it was asked if it were possible to load an XML file without actually knowing its character encoding.

We can already build an XMLType instance from a BFILE, using the corresponding constructor :

constructor function XMLType(
  xmlData    IN bfile
, csid       IN number
, schema     IN varchar2 := NULL
, validated  IN number   := 0
, wellformed IN number   := 0
)
return self as result 
deterministic 
parallel_enable

However, the character encoding must be known upfront and passed to the constructor (csid argument).
For example :

SQL> select XMLSerialize(document
  2           xmltype(
  3             bfilename('TMP_DIR','test_utf-8_bom.xml')
  4           , nls_charset_id('AL32UTF8')
  5           )
  6         ) as xmldoc
  7  from dual ;
 
XMLDOC
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 

Following is a little helper function that provides basic character encoding detection for XML files.
It should work for most 8-bit encodings (except EBCDIC), as well as UTF-16 (little and big-endian) and of course UTF-8.
The function uses this algorithm to detect the encoding :

  • If a byte order mark (BOM) is present then UTF-8, UTF-16LE or UTF-16BE is assumed, respectively :
    • 0xEFBBBF = UTF-8
    • 0xFFFE = UTF-16LE
    • 0xFEFF = UTF-16BE
  • If no BOM is found then it looks for an XML prolog :
    • If no prolog or encoding declaration is found then a default encoding is assumed.
    • Else, the specified encoding is used.

 

The XMLLoadFromFile function

create or replace function XMLLoadFromFile(
  p_directory   in varchar2
, p_filename    in varchar2
, p_charset     in varchar2 default 'CHAR_CS'
) 
return XMLType 
is

  -- Byte order marks for UTF-8, UTF-16LE and UTF-16BE
  C_BOM_UTF8     constant raw(3) := hextoraw('EFBBBF');
  C_BOM_UTF16LE  constant raw(2) := hextoraw('FFFE');
  C_BOM_UTF16BE  constant raw(2) := hextoraw('FEFF');

  l_blob         blob;
  l_bfile        bfile := bfilename(p_directory, p_filename);
  l_dest_offset  integer := 1;
  l_src_offset   integer := 1;
  l_prolog_start integer;
  l_prolog_stop  integer;
  l_prolog       varchar2(32767);
  l_encoding     varchar2(30);
  l_charset      varchar2(30) := p_charset;
  
  l_bom          raw(3);
  
  l_result       xmltype;
  
begin
  
  dbms_lob.open(l_bfile, dbms_lob.lob_readonly);
  dbms_lob.createtemporary(l_blob, false);

  dbms_lob.loadblobfromfile(
    dest_lob    => l_blob
  , src_bfile   => l_bfile
  , amount      => dbms_lob.getlength(l_bfile)
  , dest_offset => l_dest_offset
  , src_offset  => l_src_offset
  );
  
  dbms_lob.close(l_bfile);
  
  l_bom := dbms_lob.substr(l_blob, 3);
  
  if l_bom = C_BOM_UTF8 then
    
    l_encoding := 'UTF-8';
    
  else
    
    case utl_raw.substr(l_bom, 1, 2)
    
    when C_BOM_UTF16LE then
      l_encoding := 'UTF-16LE';
    
    when C_BOM_UTF16BE then
      l_encoding := 'UTF-16BE';
      
    else
    
      l_prolog_start := dbms_lob.instr(l_blob, utl_raw.cast_to_raw('<?xml'));
      
      if l_prolog_start != 0 then
        
        l_prolog_stop := dbms_lob.instr(l_blob, utl_raw.cast_to_raw('?>'), l_prolog_start);
        l_prolog := utl_raw.cast_to_varchar2(dbms_lob.substr(l_blob, l_prolog_stop - l_prolog_start + 2, l_prolog_start));
        
        -- search for a well-formed XML declaration in the prolog 
        -- as per https://www.w3.org/TR/REC-xml/#NT-XMLDecl
        l_encoding := regexp_substr(
                        l_prolog
                      , '<\?xml'
                      || '\s+version\s*=\s*(''1\.[0-9]+''|"1\.[0-9]+")'     -- version
                      || '(\s+encoding\s*=\s*(''.*?''|".*?"))?'             -- encoding
                      || '(\s+standalone\s*=\s*(''(yes|no)''|"(yes|no)"))?' -- standalone  
                      || '\s*\?>'
                      , 1
                      , 1
                      , null
                      , 3
                      );
        
        -- trim apostrophes or quotes
        l_encoding := substr(l_encoding, 2, length(l_encoding)-2);
        
      end if;
      
    end case;    
      
  end if;
  
  if l_encoding is not null then
    l_charset := utl_i18n.map_charset(charset => l_encoding, flag => utl_i18n.IANA_TO_ORACLE);
  end if;
  
  l_result := xmltype(l_blob, nls_charset_id(l_charset));
  
  dbms_lob.freetemporary(l_blob);
  
  return l_result;

end XMLLoadFromFile;

The file referenced by the directory and filename arguments is first loaded in a BLOB using DBMS_LOB.loadBlobFromFile.
Then the content is examined for a BOM or a valid XML prolog and encoding declaration. If none is found, it uses the character set name passed in the optional p_charset argument (Oracle naming convention), which defaults to the database character set.
The function returns an XMLType instance built from the BLOB and the encoding information.

 

Examples

Here are some examples using a set of small XML files in different encodings.
I’ve included below screenshots of the files viewed in an hexadecimal editor so that we can see the BOM (if any) and the encoding of some specific characters I’ve placed there on purpose – an accented character (ô) and the euro sign (€) :

test_iso-8859-15.xml (ISO-8859-15)

xmlenc_iso-8859-15

test_windows-1252.xml (WINDOWS-1252)

xmlenc_windows-1252

test_utf-8.xml (UTF-8 w/o BOM)

xmlenc_utf-8

test_utf-8_bom.xml (UTF-8 w/ BOM)

xmlenc_utf-8_bom

test_utf-16be_bom.xml (UTF-16BE)

xmlenc_utf-16be_bom

test_utf-16le_bom.xml (UTF-16LE)

xmlenc_utf-16le_bom

We can see from the below that each file is correctly loaded and converted to my database character set (AL32UTF8) :

SQL> select value from nls_database_parameters where parameter = 'NLS_CHARACTERSET';
 
VALUE
--------------------------------------------------------------------------------
AL32UTF8
 
SQL> select XMLSerialize(document XMLLoadFromFile('TMP_DIR','test_iso-8859-15.xml')) as output
  2  from dual;
 
OUTPUT
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 
SQL> select XMLSerialize(document XMLLoadFromFile('TMP_DIR','test_windows-1252.xml')) as output
  2  from dual;
 
OUTPUT
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 
SQL> select XMLSerialize(document XMLLoadFromFile('TMP_DIR','test_utf-8.xml')) as output
  2  from dual;
 
OUTPUT
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 
SQL> select XMLSerialize(document XMLLoadFromFile('TMP_DIR','test_utf-8_bom.xml')) as output
  2  from dual;
 
OUTPUT
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 
SQL> select XMLSerialize(document XMLLoadFromFile('TMP_DIR','test_utf-16be_bom.xml')) as output
  2  from dual;
 
OUTPUT
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 
SQL> select XMLSerialize(document XMLLoadFromFile('TMP_DIR','test_utf-16le_bom.xml')) as output
  2  from dual;
 
OUTPUT
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<root>Contrôle:€</root>
 

 

Advertisements
  1. Anonymous
    April 11, 2016 at 09:46

    Thanks Marc.
    Igor

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s