
UTF-16 Encoding Revisited

 

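The previous article covered where UTF-16 comes from. Every Unicode encoding, however, comes in two variants: with a signature and without one. The signature, also known as the byte-order mark (BOM), is an identifying character that a text editor automatically writes into the first bytes of a Unicode file.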

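Now for the UTF-16 variants. UTF-16 has two byte orders, UTF-16LE and UTF-16BE, and each of them can be written with or without a signature. To read and parse such a file correctly, the charset has to match the file's variant. In Java, the correspondence is currently as follows: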

Unicode: UTF-16LE with signature

UTF-16LE: UTF-16LE without signature

UTF-16BE: UTF-16BE without signature

UTF-16: UTF-16BE with signature
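As a rough illustration of how Java treats the signature when decoding: according to the `java.nio.charset.Charset` documentation, the `UTF-16` decoder honors a leading byte-order mark of either byte order (and assumes big-endian when there is none), while `UTF-16LE` and `UTF-16BE` do not strip it. The class name `BomDecodeDemo`, its method names, and the sample bytes are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // "A" in UTF-16LE, preceded by the little-endian signature FF FE
    static final byte[] LE_SIGNED = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
    // "A" in UTF-16BE, preceded by the big-endian signature FE FF
    static final byte[] BE_SIGNED = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};

    // The UTF-16 decoder reads the signature and picks the byte order from it
    static String decodeAuto(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    // The UTF-16LE decoder keeps the signature as the character U+FEFF
    static String decodeAsLittleEndian(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(decodeAuto(LE_SIGNED));                    // A
        System.out.println(decodeAuto(BE_SIGNED));                    // A
        System.out.println(decodeAsLittleEndian(LE_SIGNED).length()); // 2 (U+FEFF plus 'A')
    }
}
```

If a signed file is read with `UTF-16LE` or `UTF-16BE`, the leftover U+FEFF at the start of the text usually has to be removed by hand.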

The following code shows how to tell these four variants apart:

// imports needed by the enclosing class
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Guesses the encoding of a file from its leading bytes.
 * The returned values are descriptive labels, not Java charset names.
 * @author luoyifan
 * @date 2010-03-22
 */
private String getCharset(String filename) {
    String charset = "GBK"; // default: ANSI (GBK)
    byte[] first3Bytes = new byte[3];
    // try-with-resources closes the stream on every path, including the early return
    try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(filename))) {
        boolean checked = false;
        bis.mark(3); // the read limit must cover the 3 bytes read before reset()
        int read = bis.read(first3Bytes, 0, 3);
        if (read < 2) return charset; // empty or one-byte file: nothing to compare
        if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
            charset = "UTF-16LE-sign"; // UTF-16LE with signature
            checked = true;
        } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
            charset = "UTF-16BE-sign"; // UTF-16BE with signature
            checked = true;
        } else if (first3Bytes[0] == (byte) 0x22 && first3Bytes[1] == (byte) 0x00) {
            // UTF-16LE without signature; only matches files whose first character is '"'
            charset = "UTF-16LE-unsign";
            checked = true;
        } else if (first3Bytes[0] == (byte) 0x00 && first3Bytes[1] == (byte) 0x22) {
            // UTF-16BE without signature; only matches files whose first character is '"'
            charset = "UTF-16BE-unsign";
            checked = true;
        } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                && first3Bytes[2] == (byte) 0xBF) {
            charset = "UTF-8"; // UTF-8 with signature
            checked = true;
        }
        bis.reset();
        if (!checked) {
            // No known prefix: scan for a byte sequence that looks like UTF-8
            int loc = 0;
            while ((read = bis.read()) != -1) {
                loc++;
                if (read >= 0xF0) break;
                if (0x80 <= read && read <= 0xBF) // a lone byte in 0x80-0xBF also counts as GBK
                    break;
                if (0xC0 <= read && read <= 0xDF) {
                    read = bis.read();
                    if (0x80 <= read && read <= 0xBF) // two bytes (0xC0-0xDF)(0x80-0xBF)
                        // may also occur in GB encodings, so keep scanning
                        continue;
                    else break;
                } else if (0xE0 <= read && read <= 0xEF) { // can still be wrong, but rarely
                    read = bis.read();
                    if (0x80 <= read && read <= 0xBF) {
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            charset = "UTF-8";
                            break;
                        } else break;
                    } else break;
                }
            }
            // System.out.println(loc + " " + Integer.toHexString(read));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return charset;
}

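The code above is built on the leading bytes I observed, written here as signed Java bytes: UTF-16LE with signature starts with [-1, -2] (0xFF 0xFE), UTF-16BE with signature starts with [-2, -1] (0xFE 0xFF), UTF-16LE without signature started with [34, 0], and UTF-16BE without signature started with [0, 34]. Note that 34 is 0x22, the double quote character, so the last two pairs are simply the first character of the sample files.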
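The negative numbers come from Java's signed `byte` type, where 0xFF is stored as -1 and 0xFE as -2. The sketch below prints the first two bytes Java produces when encoding a string that begins with a double quote. `LeadingBytesDemo` and `firstTwo` are illustrative names, not part of the code above. Java's `UTF-16` encoder writes big-endian with a signature, which is why it yields [-2, -1].

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LeadingBytesDemo {
    // Encodes the text and returns its first two bytes, formatted as signed values
    static String firstTwo(String text, Charset cs) {
        return Arrays.toString(Arrays.copyOf(text.getBytes(cs), 2));
    }

    public static void main(String[] args) {
        String sample = "\"hi\""; // starts with '"' (U+0022)
        System.out.println((byte) 0xFF);                                 // -1
        System.out.println(firstTwo(sample, StandardCharsets.UTF_16));   // [-2, -1]
        System.out.println(firstTwo(sample, StandardCharsets.UTF_16LE)); // [34, 0]
        System.out.println(firstTwo(sample, StandardCharsets.UTF_16BE)); // [0, 34]
    }
}
```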

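So far, the UTF-16 files created on my system all start with the bytes listed above, but there are exceptions: when a UTF-8 file is converted to UTF-16 without a signature, the leading bytes turn out to be, for example, [70, 85] or [85, 70]. A file without a signature has no fixed prefix. Its leading bytes are just the encoding of its first character, so they differ from file to file.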
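Because the leading bytes of an unsigned file are not fixed, a different heuristic is needed for those files. One common alternative, which is not the method used in the code above, is to look at where the zero bytes fall: in text made up mostly of ASCII characters, UTF-16LE puts a zero at every odd offset and UTF-16BE at every even offset. The sketch below assumes such text; `Utf16Guess` and `guessByteOrder` are hypothetical names.

```java
class Utf16Guess {
    /**
     * Guesses the byte order of UTF-16 data without a signature by counting
     * zero bytes at even and odd offsets. Returns null when the counts tie,
     * for example for plain ASCII or UTF-8 input with no zero bytes.
     */
    static String guessByteOrder(byte[] data) {
        int zerosAtEven = 0;
        int zerosAtOdd = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                if (i % 2 == 0) zerosAtEven++;
                else zerosAtOdd++;
            }
        }
        if (zerosAtOdd > zerosAtEven) return "UTF-16LE"; // low byte first, zero high byte second
        if (zerosAtEven > zerosAtOdd) return "UTF-16BE"; // zero high byte first
        return null;
    }

    public static void main(String[] args) {
        byte[] le = "Fabc".getBytes(java.nio.charset.StandardCharsets.UTF_16LE);
        byte[] be = "Fabc".getBytes(java.nio.charset.StandardCharsets.UTF_16BE);
        System.out.println(guessByteOrder(le)); // UTF-16LE
        System.out.println(guessByteOrder(be)); // UTF-16BE
    }
}
```

This is unreliable for text that is mostly Chinese, since such characters rarely contain a zero byte in either position, so it is best treated as a hint rather than a verdict.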
