204 lines
No EOL
17 KiB
HTML
204 lines
No EOL
17 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<!-- Generated by Apache Maven Doxia Site Renderer 1.4 at 2013-09-27 -->
|
|
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
|
<title>ROME - XML Charset Encoding detection, how Rss and atOM utilitiEs (ROME) helps getting the right charset encoding</title>
|
|
<style type="text/css" media="all">
|
|
@import url("../css/maven-base.css");
|
|
@import url("../css/maven-theme.css");
|
|
@import url("../css/site.css");
|
|
</style>
|
|
<link rel="stylesheet" href="../css/print.css" type="text/css" media="print" />
|
|
<meta name="author" content="mkurz" />
|
|
<meta name="Date-Creation-yyyymmdd" content="20110815" />
|
|
<meta name="Date-Revision-yyyymmdd" content="20130927" />
|
|
<meta http-equiv="Content-Language" content="en" />
|
|
|
|
</head>
|
|
<body class="composite">
|
|
<div id="banner">
|
|
<a href="http://github.com/rometools/" id="bannerLeft">
|
|
<img src="../images/romelogo.png" alt="ROME" />
|
|
</a>
|
|
<div class="clear">
|
|
<hr/>
|
|
</div>
|
|
</div>
|
|
<div id="breadcrumbs">
|
|
|
|
|
|
<div class="xright">
|
|
|
|
<span id="publishDate">Last Published: 2013-09-27</span>
|
|
| <span id="projectVersion">Version: 1.1-SNAPSHOT</span>
|
|
</div>
|
|
<div class="clear">
|
|
<hr/>
|
|
</div>
|
|
</div>
|
|
<div id="leftColumn">
|
|
<div id="navcolumn">
|
|
|
|
|
|
<h5>Rome</h5>
|
|
<ul>
|
|
<li class="none">
|
|
<a href="../index.html" title="Overview">Overview</a>
|
|
</li>
|
|
<li class="collapsed">
|
|
<a href="../HowRomeWorks/index.html" title="How Rome Works">How Rome Works</a>
|
|
</li>
|
|
<li class="none">
|
|
<a href="../RssAndAtOMUtilitiEsROMEV0.5AndAboveTutorialsAndArticles/index.html" title="Tutorials And Articles">Tutorials And Articles</a>
|
|
</li>
|
|
<li class="collapsed">
|
|
<a href="../ROMEReleases/index.html" title="Releases">Releases</a>
|
|
</li>
|
|
<li class="none">
|
|
<a href="../ROMEDevelopmentProposals/index.html" title="ROME Development Proposals">ROME Development Proposals</a>
|
|
</li>
|
|
</ul>
|
|
<h5>Project Documentation</h5>
|
|
<ul>
|
|
<li class="collapsed">
|
|
<a href="../project-info.html" title="Project Information">Project Information</a>
|
|
</li>
|
|
<li class="collapsed">
|
|
<a href="../project-reports.html" title="Project Reports">Project Reports</a>
|
|
</li>
|
|
</ul>
|
|
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
|
|
<img class="poweredBy" alt="Built by Maven" src="../images/logos/maven-feather.png" />
|
|
</a>
|
|
|
|
|
|
</div>
|
|
</div>
|
|
<div id="bodyColumn">
|
|
<div id="contentBox">
|
|
<div class="section">
|
|
<h2>XML Charset Encoding detection, how Rss and atOM utilitiEs (ROME) helps getting the right charset encoding<a name="XML_Charset_Encoding_detection_how_Rss_and_atOM_utilitiEs_ROME_helps_getting_the_right_charset_encoding"></a></h2>
|
|
<p>Determining the charset set encoding of an XML document it may prove not as easy as it seems, and when the XML document is served over HTTP things get a little more hairy.</p>
|
|
<p>Current Java libraries or utilities don't do this by default, A JAXP SAX parser may detect the charset encoding of a XML document looking at the first bytes of the stream as defined Section 4.3.3 and Appendix F.1 of the in <a class="externalLink" href="http://www.w3.org/TR/2004/REC-xml-20040204/">XML 1.0 (Third Edition)</a> specification. But Appendix F.1 is non-normative and not all Java XML parsers do it right now. For example the JAXP SAX parser implementation in <a class="externalLink" href="http://java.sun.com/j2se/1.4.2">J2SE 1.4.2</a> does not handle Appendix F.1 and <a class="externalLink" href="http://xml.apache.org/xerces2-j/">Xerces 2.6.2</a> just added support for it. But still this does not solve the whole problem. JAXP SAX parsers are not aware of the HTTP transport rules for charset encoding resolution as defined by <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a>. They are not because they operate on a byte stream or a char stream without any knowledge of the stream transport protocol, HTTP in this case.</p>
|
|
<p><a class="externalLink" href="http://diveintomark.org/">Mark Pilgrim</a> did a very good job explaining how the charset encoding should be determined, <a class="externalLink" href="http://diveintomark.org/archives/2004/02/13/xml-media-types">Determining the character encoding of a feed</a> and <a class="externalLink" href="http://www.xml.com/pub/a/2004/07/21/dive.html">XML on the Web Has Failed</a>.</p>
|
|
<p>To isolate developers from this issue ROME has a special character stream class, the <i>XmlReader</i>. The <i>XmlReader</i> class is a subclass of the <i>java.io.Reader</i> that detects the charset encoding of XML documents read from files, input streams, URLs and input streams obtained over HTTP. Ideally this should be built into the JAXP SAX classes, most likely in the InputSource class.</p>
|
|
<div class="section">
|
|
<h3>Default Lenient Behavior<a name="Default_Lenient_Behavior"></a></h3>
|
|
<p>It is very common for many sites, due to improper configuration or lack of knowledge, to declare an invalid charset encoding. Invalid according to the XML 1.0 specification and the RFC 3023. For example a mismatch between the implicit charset encoding in the HTTP content-type and the explicit charset encoding in the XML prolog.</p>
|
|
<p>Because of this, ROME XmlReader by default has a lenient detection. This lenient detection works in the following order:</p>
|
|
<ul>
|
|
<li>A strict charset encoding detection, as specified by the Algorithms below is attempted</li>
|
|
<li>If the content type was 'text/html' it replaces it with 'text/xml' and tries strict charset encoding detection again</li>
|
|
<li>If the XML prolog had a charset encoding that encoding is used</li>
|
|
<li>If the content type had a charset encoding that encoding is used</li>
|
|
<li>'UTF-8' is used</li></ul>
|
|
<p>The XmlReader class has 2 constructors that allow a strict (non-lenient) charset encoding detection to be performed. This constructors take a lenient boolean flag, the flag should be set to <i>false</i> for a strict detection.</p></div>
|
|
<div class="section">
|
|
<h3>The Algorithms per XML 1.0 and RFC 3023 specifications<a name="The_Algorithms_per_XML_1.0_and_RFC_3023_specifications"></a></h3>
|
|
<p>Following it's a detailed explanation on the algorithms the <i>XmlReader</i> uses to determine the charset encoding. These algorithms first appeared in <a class="externalLink" href="http://blogs.sun.com/roller/page/tucu/Weblog">Tucu's Weblog</a>.</p></div>
|
|
<div class="section">
|
|
<h3>Raw XML charset encoding detection<a name="Raw_XML_charset_encoding_detection"></a></h3>
|
|
<p>Detection of the charset encoding of a XML document without external information (i.e. reading a XML document from a file). Following <a class="externalLink" href="http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding">Section 4.3.3</a> and <a class="externalLink" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Appendix F.1</a> of the XML 1.0 specification the charset encoding of an XML document is determined as follows:</p>
|
|
<div class="source">
|
|
<pre>
|
|
BOMEnc : byte order mark. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
|
|
XMLGuessEnc: best guess using the byte representation of the first bytes of XML declaration
|
|
('<?xml...?>') if present. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
|
|
XMLEnc : encoding in the XML declaration ('<?xml encoding="..."?>'). Possible values: anything or NULL
|
|
|
|
if BOMEnc is NULL
|
|
if XMLGuessEnc is NULL or XMLEnc is NULL
|
|
encoding is 'UTF-8' [1.0]
|
|
else
|
|
if XMLEnc is 'UTF-16' and (XMLGuessEnc is 'UTF-16BE' or XMLGuessEnc is 'UTF-16LE')
|
|
encoding is XMLGuessEnc [1.1]
|
|
else
|
|
encoding is XMLEnc [1.2]
|
|
else
|
|
if BOMEnc is 'UTF-8'
|
|
if XMLGuessEnc is not NULL and XMLGuessEnc is not 'UTF-8'
|
|
ERROR, encoding mismatch [1.3]
|
|
if XMLEnc is not NULL and XMLEnc is not 'UTF-8'
|
|
ERROR, encoding mismatch [1.4]
|
|
encoding is 'UTF-8'
|
|
else
|
|
if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
|
|
if XMLGuessEnc is not NULL and XMLGuessEnc is not BOMEnc
|
|
ERROR, encoding mismatch [1.5]
|
|
if XMLEnc is not NULL and XMLEnc is not 'UTF-16' and XMLEnc is not BOMEnc
|
|
ERROR, encoding mismatch [1.6]
|
|
encoding is BOMEnc
|
|
else
|
|
ERROR, cannot happen given BOMEnc possible values (see above) [1.7]
|
|
</pre></div>
|
|
<p>Byte Order Mark encoding and XML guessed encoding detection rules are clearly explained in the <a class="externalLink" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">XML 1.0 Third Edition Appendix F.1</a>. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.</p>
|
|
<ul>
|
|
<li><b>#1.0.</b> There is no BOM, an encoding or encoding family cannot be guessed from the first bytes in the stream, defaulting to UTF-8.</li>
|
|
<li><b>#1.1.</b> Strictly following <a class="externalLink" href="http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding">XML 1.0 Third Edition Section 4.3.3 Charset Encoding in Entities</a> (2nd paragraph) no BOM and UTF-16 XML declaration encoding is an error. The BOM is required to identify if the encoding is BE or LE. But if XMLEnc was read it means that the encoding byte order of the stream was guessed from the first bytes in the stream, note that this is possible only if the document starts with a XML declaration. The logic verifies that there is no conflicting encoding information and relaxes the BOM requirement if a XML declaration (that allows guessing the byte order) is present.</li>
|
|
<li><b>#1.2.</b> XMLEnc is present, it is not UTF-16 (although it can be UTF-16BE or UTF-16LE), the guessed encoding is used to read the XML declaration encoding. Detecting encoding mismatches here it would require awareness of charset families, instead the algorithm relies on the charset encoding routines that will process the stream to discover and report mismatches. <i>IMPORTANT:</i> To handle other encodings (i.e. USC-4 or EBCDIC encoding families) the algorithm should be extended here.</li>
|
|
<li><b>#1.3, #1.4, #1.5, #1.6</b> There is an explicit encoding mismatch.</li>
|
|
<li><b>#1.7.</b> Given the currently assumed BOMEnc values this case cannot happen. <i>IMPORTANT:</i> To handle other encodings (i.e. USC-4 or EBCDIC encoding families) the algorithm should be extended here.</li></ul>
|
|
<div class="section">
|
|
<h4>XML over HTTP charset encoding detection<a name="XML_over_HTTP_charset_encoding_detection"></a></h4>
|
|
<p>Detection of the charset encoding of a XML document with external information (provided by HTTP). Following <a class="externalLink" href="http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding">Section 4.3.3</a>, <a class="externalLink" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Appendix F.1</a> and <a class="externalLink" href="http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-with-ext-info">Appendix F.2</a> of the XML 1.0 specification, plus <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> the charset encoding of an XML document served over HTTP is determined as follows:</p>
|
|
<div class="source">
|
|
<pre>
|
|
ContentType: Content-Type HTTP header
|
|
CTMime : MIME type defined in the ContentType
|
|
CTEnc : charset encoding defined in the ContentType, NULL otherwise
|
|
BOMEnc : byte order mark. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
|
|
XMLGuessEnc: best guess using the byte representation of the first bytes of XML declaration
|
|
('<?xml...?>') if present. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
|
|
XMLEnc : encoding in the XML declaration ('<?xml encoding="..."?>'). Possible values: anything or NULL
|
|
APP-XML : RFC 3023 defined 'application/*xml' types
|
|
TEXT-XML : RFC 3023 defined 'text/*xml' types
|
|
|
|
if CTMime is APP-XML or CTMime is TEXT-XML
|
|
if CTEnc is NULL
|
|
if CTMime is APP-XML
|
|
encoding is determined using the Raw XML charset encoding detection algorithm [2.0]
|
|
else
|
|
if CTMime is TEXT-XML
|
|
encoding is 'US-ASCII' [2.1]
|
|
else
|
|
if (CTEnc is 'UTF-16BE' or CTEnc is 'UTF-16LE') and BOMEnc is not NULL
|
|
ERROR, RFC 3023 explicitly forbids this [2.2]
|
|
else
|
|
if CTEnc is 'UTF-16'
|
|
if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
|
|
encoding is BOMEnc [2.3]
|
|
else
|
|
ERROR, missing BOM or encoding mismatch [2.4]
|
|
else
|
|
encoding is CTEnc [2.5]
|
|
else
|
|
ERROR, handling for other MIME types is undefined [2.6]
|
|
</pre></div>
|
|
<p>Byte Order Mark encoding and XML guessed encoding detection rules are clearly explained in the <a class="externalLink" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">XML 1.0 Third Edition Appendix F.1</a>. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.</p>
|
|
<ul>
|
|
<li><b>#2.0.</b> HTTP content type declares a MIME of application type and XML sub-type. There is not HTTP Content-type charset encoding information. The charset encoding is determined using the Raw XML charset encoding detection algorithm. Refer to <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> Section <i>3.2. Application/xml Registration</i>.</li>
|
|
<li><b>#2.1.</b> HTTP content type declares a MIME of text type and XML sub-type. There is not HTTP Content-type charset encoding information. The charset encoding is US-ASCII. Refer to <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> Section <i>3.1. Text/xml Registration</i>.</li>
|
|
<li><b>#2.2.</b> For text-XML and application-XML MIME types if the HTTP content type declares 'UTF-16BE' or 'UTF-16LE' as the charset encoding, a BOM is prohibited. Refer to <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> Section <i>4. The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset</i>.</li>
|
|
<li><b>#2.3.</b> For text-XML and application-XML MIME types if the HTTP content type declares 'UTF-16' as the charset encoding, a BOM is required and it must be used to determine the byte order. Refer to <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> Section <i>4. The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset</i>.</li>
|
|
<li><b>#2.4.</b> For text-XML and application-XML MIME types if the HTTP content type declares 'UTF-16' as the charset encoding, a BOM must be present and it must be 'UTF-16BE' or 'UTF-16LE'. Refer to <a class="externalLink" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> Section <i>4. The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset</i>.</li>
|
|
<li><b>#2.5.</b> For text-XML and application-XML MIME types if the HTTP content type declares a charset encoding other than the UTF-16 variants, that charset encoding is used, no other special handling is required.</li>
|
|
<li><b>#2.6.</b> There is not defined logic to determine the charset encoding based on the MIME type given by the HTTP content type.</li></ul></div></div></div>
|
|
</div>
|
|
</div>
|
|
<div class="clear">
|
|
<hr/>
|
|
</div>
|
|
<div id="footer">
|
|
<div class="xright">
|
|
Copyright © 2004-2013
|
|
<a href="http://www.rometools.org">ROME Project</a>.
|
|
All Rights Reserved.
|
|
|
|
</div>
|
|
<div class="clear">
|
|
<hr/>
|
|
</div>
|
|
</div>
|
|
</body>
|
|
</html> |