
Converting Between Unicode, UTF-8, and ANSI Encodings

Source: 程序员人生 | Published: 2014-02-01 00:26:24 | Views: 3076

1. ANSI
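On Windows, "ANSI" means the operating system's default legacy encoding rather than one fixed standard. Because every country has its own script, each defined its own national encoding, such as GB2312 for Simplified Chinese, and a system's ANSI code page is whichever of these matches its locale. These encodings share one rule: a character in the ASCII range (0-127) takes one byte, and in the double-byte encodings used for Chinese, Japanese, and Korean a character outside that range takes two bytes. As a result, the same byte string decodes differently under different code pages, except that the ASCII portion always decodes the same way.

Strictly speaking, a character set and an encoding are two different things: the character set defines which characters exist and how they are numbered, while the encoding defines how those numbers are stored as bytes. In this article, "ANSI" names the encoding family, and national standards such as GB2312 supply the character sets behind it.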
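The one-byte/two-byte rule above can be sketched as a character counter. This is a simplified illustration, not a decoder for any real code page: the function name CountDbcsChars is hypothetical, and treating every byte of 0x80 or above as a lead byte is an assumption (actual code pages such as GB2312 restrict the valid lead and trail ranges).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts characters in a byte string under a simplified double-byte rule:
// a byte below 0x80 is one ASCII character; any other byte is assumed to be
// a lead byte that forms one character together with the byte after it.
std::size_t CountDbcsChars(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++count) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80 || i + 1 == bytes.size()) {
            i += 1;  // ASCII, or a dangling lead byte at the very end
        } else {
            i += 2;  // lead byte plus trail byte
        }
    }
    assert(count <= bytes.size());  // never more characters than bytes
    return count;
}
```

For instance, the four bytes 'A', 0xD6, 0xD0, 'B' hold three characters when read as GB2312, because 0xD6 0xD0 is the single character "中".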

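2. Characters
A character is a symbol, such as '＃' or '◎'. Depending on the encoding, it occupies one or more bytes in storage, and the size differs between encodings. In UTF-16, which is what Windows calls "Unicode", most characters take 2 bytes and characters outside the Basic Multilingual Plane take 4. In UTF-8 a character takes 1 to 4 bytes (the original design allowed up to 6, but the current standard stops at 4).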
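To make the variable width of UTF-8 concrete, here is a minimal sketch that encodes a single code point. The function name EncodeUtf8 is made up for this illustration; the bit layout follows the standard UTF-8 scheme.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string EncodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
    if (cp > 0x10FFFF) return out;                 // outside the Unicode range
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    assert(out.size() >= 1 && out.size() <= 4);
    return out;
}
```

With this sketch, 'A' (U+0041) encodes to one byte, while "中" (U+4E2D) encodes to the three bytes E4 B8 AD.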

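3. Multibyte strings:
When the characters of an in-memory string are stored in an ANSI encoding, a single character may take one byte or several. Such a string is called an ANSI string or a multibyte string.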

4. Unicode:
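Unicode is the unified code: the scripts and symbols of all countries are numbered in one scheme. On Windows the term usually means UTF-16 stored in wchar_t, where most characters take two bytes and characters outside the Basic Multilingual Plane take four (a surrogate pair).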
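The two-byte versus four-byte cases in UTF-16 can be sketched as follows. EncodeUtf16 is a hypothetical helper written for this illustration, and it assumes a C++11 compiler for std::u16string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one code point as UTF-16 code units.
// Code points up to U+FFFF take one 16-bit unit; higher ones take a
// surrogate pair (two units, that is, four bytes).
std::u16string EncodeUtf16(std::uint32_t cp) {
    // Precondition: a valid code point that is not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const std::uint32_t v = cp - 0x10000;                // 20 significant bits
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

For example, "中" (U+4E2D) stays a single unit, whereas U+1F600 becomes the pair D83D DE00.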

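VC++ and other programming tools store Chinese text and other characters in narrow (char) strings using the operating system's encoding, which is usually the ANSI code page. This is why conversion to other encodings comes up.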

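5. ANSI and Unicode
If the ANSI string contains only ASCII characters (0-127), the CRT functions mbstowcs and wcstombs are sufficient, because Unicode simply widens each ASCII code from one byte to two while keeping its value. Both functions follow the current C locale, so other characters convert correctly only when a matching locale has been selected with setlocale.

size_t mbstowcs( wchar_t *wcstr, const char *mbstr, size_t count );
size_t wcstombs( char *mbstr, const wchar_t *wcstr, size_t count );

For characters outside ASCII, such as Chinese characters, use WideCharToMultiByte and MultiByteToWideChar.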
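As a small illustration of the ASCII case, the sketch below widens a narrow string with mbstowcs. The wrapper name WidenWithCrt is hypothetical, and the code assumes the default "C" locale, in which only ASCII input is portable.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Widens a narrow string with the CRT function mbstowcs.
// Bytes outside ASCII depend on the locale selected with setlocale.
std::wstring WidenWithCrt(const std::string& narrow) {
    // A multibyte string never decodes to more wide characters than bytes,
    // so a buffer of narrow.size() elements is always large enough.
    std::wstring wide(narrow.size(), L'\0');
    if (wide.empty()) return wide;
    const std::size_t n = std::mbstowcs(&wide[0], narrow.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();  // invalid input
    assert(n <= wide.size());
    wide.resize(n);  // drop the unused tail
    return wide;
}
```

The reverse direction with wcstombs follows the same pattern, with the buffer size given in bytes.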

int MultiByteToWideChar(
UINT CodePage, // code page
DWORD dwFlags, // character-type options
LPCSTR lpMultiByteStr, // string to map
int cbMultiByte, // number of bytes in string
LPWSTR lpWideCharStr, // wide-character buffer
int cchWideChar // size of buffer
);
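6. Multibyte to wide character (Unicode)
The string being converted does not have to be an ANSI multibyte string; UTF-8 input works as well.

CodePage: the code page of the source string, such as CP_ACP (ANSI) or CP_UTF8 (UTF-8);
dwFlags: 0;
lpMultiByteStr, cbMultiByte: the source string and its length in bytes;
lpWideCharStr, cchWideChar: the output buffer and its size in wide characters.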
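A common way to use these parameters is to call the function twice: once with a zero-size output buffer to learn the required length, then again to convert. The sketch below is Windows-only (hence the _WIN32 guard); the wrapper name Utf8ToWide is hypothetical, and passing MB_ERR_INVALID_CHARS instead of the 0 listed above is a choice made here so that malformed input is reported rather than silently replaced.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <cassert>
#include <climits>
#include <stdexcept>
#include <string>

// Converts UTF-8 bytes to a UTF-16 std::wstring.
std::wstring Utf8ToWide(const std::string& utf8) {
    assert(utf8.size() <= static_cast<std::size_t>(INT_MAX));  // API takes int lengths
    if (utf8.empty()) return std::wstring();
    const int srcLen = static_cast<int>(utf8.size());

    // First call: no output buffer, so the return value is the required
    // number of wide characters.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), srcLen, NULL, 0);
    if (needed <= 0) throw std::runtime_error("invalid UTF-8 input");

    // Second call: convert into a buffer of exactly that size.
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), srcLen, &wide[0], needed);
    if (written != needed) throw std::runtime_error("conversion failed");
    return wide;
}
#endif
```

Because the source length is passed explicitly, the function does not write a terminating null; std::wstring supplies its own.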

int WideCharToMultiByte(
UINT CodePage, // code page
DWORD dwFlags, // performance and mapping flags
LPCWSTR lpWideCharStr, // wide-character string
int cchWideChar, // number of chars in string
LPSTR lpMultiByteStr, // buffer for new string
int cbMultiByte, // size of buffer
LPCSTR lpDefaultChar, // default for unmappable chars
LPBOOL lpUsedDefaultChar // set when default char used
);
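7. Wide character (Unicode) to multibyte
The resulting string does not have to be in a multibyte (ANSI) character set.

CodePage: the code page to convert to, such as CP_ACP (ANSI) or CP_UTF8 (UTF-8);
dwFlags: 0;
lpWideCharStr, cchWideChar: the source string and its length in wide characters;
lpMultiByteStr, cbMultiByte: the output buffer and its size in bytes;
lpDefaultChar, lpUsedDefaultChar: the character substituted when a source character cannot be mapped, and a flag reporting whether the substitution happened. Both must be NULL when CodePage is CP_UTF8.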
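The role of lpUsedDefaultChar is easiest to see in a wrapper that reports lossy conversions. This is a Windows-only sketch under stated assumptions: the name WideToAnsi and its bool* output parameter are invented for the example, and the system ANSI code page is assumed to be a legacy (non-UTF-8) code page, since UTF-8 does not accept a non-NULL lpUsedDefaultChar.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <cassert>
#include <climits>
#include <string>

// Converts UTF-16 text to the system ANSI code page (CP_ACP).
// *lossy becomes true when at least one character had no mapping and was
// replaced by the code page's default character.
std::string WideToAnsi(const std::wstring& wide, bool* lossy) {
    assert(lossy != NULL && wide.size() <= static_cast<std::size_t>(INT_MAX));
    *lossy = false;
    if (wide.empty()) return std::string();
    const int srcLen = static_cast<int>(wide.size());

    // First call measures the required byte count.
    const int needed = WideCharToMultiByte(CP_ACP, 0, wide.data(), srcLen,
                                           NULL, 0, NULL, NULL);
    if (needed <= 0) return std::string();

    // Second call converts and records whether a default char was used.
    std::string ansi(static_cast<std::size_t>(needed), '\0');
    BOOL usedDefault = FALSE;
    const int written = WideCharToMultiByte(CP_ACP, 0, wide.data(), srcLen,
                                            &ansi[0], needed, NULL, &usedDefault);
    if (written <= 0) return std::string();
    ansi.resize(static_cast<std::size_t>(written));
    *lossy = (usedDefault != FALSE);
    return ansi;
}
#endif
```

A caller can check the flag to decide whether the ANSI result is safe to store or whether the original Unicode text should be kept instead.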

How do you convert a string that holds UTF-8 to ANSI?
First, convert UTF-8 to Unicode.
Second, convert Unicode to ANSI.

The code is as follows:

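// Converts UTF-8 text to the system ANSI code page by way of UTF-16.
// Requires <atlstr.h> (or the MFC headers). CStringA and CStringW are used
// explicitly so the code also compiles in Unicode builds.
// Note: strSource is trimmed in place. Returns 0 on success, 1 on failure.
int ConvUtf8ToAnsi(CStringA& strSource, CStringA& strChAnsi)
{
    strChAnsi.Empty();

    strSource.TrimLeft();
    strSource.TrimRight();
    if (strSource.IsEmpty())
        return 0;

    // Step 1: UTF-8 -> UTF-16. The first call only measures the result.
    int iLenByWChNeed = MultiByteToWideChar(CP_UTF8, 0,
        (LPCSTR)strSource, strSource.GetLength(),
        NULL, 0);
    if (iLenByWChNeed <= 0)
        return 1;

    CStringW strWChUnicode;
    int iLenByWchDone = MultiByteToWideChar(CP_UTF8, 0,
        (LPCSTR)strSource, strSource.GetLength(),
        strWChUnicode.GetBuffer(iLenByWChNeed),
        iLenByWChNeed);
    strWChUnicode.ReleaseBuffer(iLenByWchDone);

    // Step 2: UTF-16 -> ANSI. Again measure first, then convert.
    int iLenByChNeed = WideCharToMultiByte(CP_ACP, 0,
        (LPCWSTR)strWChUnicode, iLenByWchDone,
        NULL, 0,
        NULL, NULL);
    if (iLenByChNeed <= 0)
        return 1;

    int iLenByChDone = WideCharToMultiByte(CP_ACP, 0,
        (LPCWSTR)strWChUnicode, iLenByWchDone,
        strChAnsi.GetBuffer(iLenByChNeed),
        iLenByChNeed,
        NULL, NULL);
    strChAnsi.ReleaseBuffer(iLenByChDone);

    // Each conversion must have written exactly the measured length.
    if (iLenByWChNeed != iLenByWchDone || iLenByChNeed != iLenByChDone)
        return 1;

    return 0;
}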
8. Other ways to convert Unicode to ANSI
1. Call the CRT function wcstombs();
2. Use a CString constructor or assignment operator (MFC only);
3. Use the ATL string conversion macros.

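size_t wcstombs (
char* mbstr, // char buffer that receives the resulting ANSI string
const wchar_t* wcstr, // Unicode string to convert
size_t count ); // size in bytes of the buffer that mbstr points to

CString in MFC has constructors and assignment operators that accept Unicode strings, so in a non-Unicode build (where CString holds char data) you can let CString perform the conversion.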

// Assume wszSomeString is a Unicode string...
CString str1 ( wszSomeString ); // convert with the constructor
CString str2;
str2 = wszSomeString; // convert with the assignment operator

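ATL provides a convenient set of macros for string conversion. W2A() converts a Unicode string to an ANSI string (the mnemonic is "wide to ANSI"). OLE2A() is arguably the more precise choice, where "OLE" stands for a COM or OLE string. These macros require USES_CONVERSION to be declared in the enclosing function.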
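A hedged sketch of the macro in use follows. It assumes Visual C++ with ATL available (hence the _MSC_VER guard), and the wrapper name WideToAnsiAtl is hypothetical. The result is copied into a std::string immediately because W2A allocates its buffer on the stack, so the pointer it returns is only valid inside the calling function.

```cpp
#ifdef _MSC_VER
#include <atlbase.h>
#include <atlconv.h>
#include <cassert>
#include <string>

// Converts a null-terminated UTF-16 string to the ANSI code page with W2A.
std::string WideToAnsiAtl(const wchar_t* wide) {
    assert(wide != NULL);  // W2A yields NULL for NULL input
    USES_CONVERSION;       // declares the locals the conversion macros rely on
    // Copy right away: the W2A buffer lives on this function's stack frame.
    return std::string(W2A(wide));
}
#endif
```

Because of the stack allocation, calling W2A inside a long loop is usually avoided, since each call consumes more stack until the function returns.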

9. Viewing Unicode strings in the VC6 IDE
In the debugger's Watch window, append ",su" to the expression that holds the Unicode string.
