Alaska Software Inc. - Re: Encoding for a file UTF or Other.


Author: Andreas Gehrs-Pahl
Topic: Re: Encoding for a file UTF or Other.
Newsgroup: public.xbase++.data-access
Date: Mon, 14 Nov 2016 17:42:19 -0500

César, Thomas,

>>How could I know the encoding for a file?

You can do some simple checks, but depending on the circumstances, you won't 
be able to be 100% sure.

>>Is there any function in Alaska XBase for this?
>No.

Actually, there is a function in Windows to determine if a string is most 
probably (rather than certainly) Unicode text. The Windows API function 
"IsTextUnicode()" applies statistical tests for UTF-16 (it does not detect 
UTF-8) and is wrapped in Xbase++ as function "IsUnicode()". The Xbase++ 
wrapper will only test for UTF-16LE, and no other Unicode encodings are 
recognized as Unicode.

The implementation of "IsUnicode()" can be found, depending on your Xbase++ 
version, in either the "\Source\Runtime\DUI\activex-helpers.prg" or the 
"\Source\Sys\activex.prg" program file.

You can of course do some simple checks for yourself. The most basic option 
is to check for the BOM (Byte Order Mark), which would be the first few 
bytes of the string (or file): two for UTF-16, three for UTF-8, and four 
for UTF-32. Here is a list of BOM values:

	https://en.wikipedia.org/wiki/Byte_order_mark
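
The BOM check can be sketched in a few lines. This is only an illustration, 
written in Python rather than Xbase++; the function name detect_bom() is 
made up for this example, and the BOM constants come from Python's standard 
"codecs" module.

```python
import codecs

# Longer BOMs are listed first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


if __name__ == "__main__":
    print(detect_bom(codecs.BOM_UTF8 + b"abc"))
    print(detect_bom(b"plain text"))
```

A result of None only means that no BOM was found; the data may still be in 
any of these encodings.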

If a specific BOM is used, you should be able to rely on its existence alone 
to determine what encoding is used (check the longer BOMs first, because the 
UTF-32LE BOM starts with the UTF-16LE BOM). If no BOM is used, you can do 
some additional tests. For mostly ASCII text, if every second byte is a NULL 
character (chr(0)), you are probably dealing with UTF-16LE, and if every 
first byte is a NULL character, with UTF-16BE.
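
The NULL-byte test could look roughly like the following. This is a sketch 
in Python, not Xbase++, and it is only a heuristic: the function name 
guess_utf16_by_nuls() and the 90% threshold are assumptions chosen for this 
example, and the test only works for text that is mostly in the ASCII range.

```python
def guess_utf16_by_nuls(data: bytes, threshold: float = 0.9):
    """Guess the UTF-16 byte order from where NUL bytes fall; None if unclear."""
    if len(data) < 2:
        return None
    first = data[0::2]   # first byte of each 16-bit unit
    second = data[1::2]  # second byte of each 16-bit unit
    first_nuls = first.count(0) / len(first)
    second_nuls = second.count(0) / len(second)
    # ASCII-range text in UTF-16LE has the NUL in the second byte of each pair,
    # in UTF-16BE it has the NUL in the first byte.
    if second_nuls >= threshold and first_nuls < 1 - threshold:
        return "utf-16-le"
    if first_nuls >= threshold and second_nuls < 1 - threshold:
        return "utf-16-be"
    return None


if __name__ == "__main__":
    print(guess_utf16_by_nuls("Hello".encode("utf-16-le")))
    print(guess_utf16_by_nuls("Hello".encode("utf-16-be")))
    print(guess_utf16_by_nuls(b"Hello"))
```

Text that is mostly outside the ASCII range (Cyrillic or CJK, for example) 
contains few NULL bytes in UTF-16, so this test returns None for it.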

>>If this, Could I change from UTF-X to Win1252 or Win1252 to UTF-X?

Keep in mind that UTF-8 is byte-for-byte the same as ASCII for the first 128 
(ASCII) characters. UTF-16 and UTF-32 use the same values for those 
characters, just padded with one (UTF-16) or three (UTF-32) extra NULL bytes 
per character.
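
A quick way to see this is to encode the same ASCII string in each format. 
The snippet below is a Python illustration only; the variable names are 
arbitrary.

```python
text = "Abc"

as_ascii = text.encode("ascii")      # b'Abc'
as_utf8 = text.encode("utf-8")       # b'Abc' (identical to ASCII)
as_utf16 = text.encode("utf-16-le")  # b'A\x00b\x00c\x00'
as_utf32 = text.encode("utf-32-le")  # b'A\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'

for label, raw in (("ASCII", as_ascii), ("UTF-8", as_utf8),
                   ("UTF-16LE", as_utf16), ("UTF-32LE", as_utf32)):
    print(label, raw)
```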

Xbase++ has several conversion routines, such as: 
* Unicode2Str()  to convert UTF-16LE to ANSI/OEM
* Str2Unicode()  to convert ANSI/OEM to UTF-16LE
* ConvToAnsiCP() to convert OEM to ANSI
* ConvToOemCP()  to convert ANSI to OEM

The first two functions are again wrappers for the Windows API functions 
"WideCharToMultiByte()" and "MultiByteToWideChar()". Called directly, those 
API functions can also convert between UTF-16LE and other code pages, such 
as UTF-8 (CP_UTF8). The wrappers are implemented in the same source file as 
the "IsUnicode()" function.
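
To illustrate the round trip the question asks about (UTF-X to Win1252 and 
back), here is a sketch in Python rather than Xbase++. It does not use the 
Xbase++ functions listed above; the helper name convert() is made up for 
this example, and the encoding names are the ones Python's codecs accept.

```python
def convert(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode data from the source encoding and re-encode it in the target."""
    return data.decode(source).encode(target, errors=errors)


# "Caf\u00e9 \u20ac" is "Cafe" with an acute e, a space, and the euro sign.
utf8 = "Caf\u00e9 \u20ac".encode("utf-8")
ansi = convert(utf8, "utf-8", "cp1252")      # b'Caf\xe9 \x80'
utf16 = convert(ansi, "cp1252", "utf-16-le")

# Characters missing from the target code page cannot be converted exactly.
# With errors="replace" they become "?" instead of raising an exception.
lossy = convert("\u03a9".encode("utf-8"), "utf-8", "cp1252", errors="replace")

print(ansi, utf16, lossy)
```

Converting to Windows-1252 is lossy for any character that the code page 
does not contain, so the direction Win1252 to UTF-X is always safe, while 
the reverse direction is not.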

Whether a text uses ANSI (CP-1252 / Windows-1252), ISO 8859-1, or even an 
OEM character set cannot be determined from the bytes alone. What you see 
depends entirely on the code page and font used to display the characters, 
and there is no way to determine which glyph (letter, symbol, etc.) was 
intended for a particular byte or character.

If you only deal with 7-bit ASCII characters, UTF-8, ANSI, and ASCII will 
all be the same. If you deal with 8-bit character sets, different Code Pages 
might have a different symbol/glyph for the same character value, though.
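
As an example of the 8-bit case, the single byte 0xE9 decodes to a different 
character depending on the code page assumed. The snippet is a Python 
illustration; the function name glyphs_for() is hypothetical, and the code 
page names are those of Python's built-in codecs.

```python
def glyphs_for(byte_value: int, codepages):
    """Map each code page name to the character it assigns to byte_value."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in codepages}


result = glyphs_for(0xE9, ("cp1252", "latin-1", "cp437", "cp850"))
for name, char in result.items():
    # Print the Unicode code point so the output does not depend on the console font.
    print(f"{name}: U+{ord(char):04X}")
```

Here the ANSI code pages (cp1252, latin-1) give U+00E9 (e with acute 
accent), while the OEM code pages give U+0398 (Greek capital theta, cp437) 
and U+00DA (U with acute accent, cp850).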

Here is some additional reading material:

https://en.wikipedia.org/wiki/Character_encoding
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/Code_page
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16

Hope that helps,

Andreas

Andreas Gehrs-Pahl
Absolute Software, LLC

phone: (989) 723-9927
email: Andreas@AbsoluteSoftwareLLC.com
web:   http://www.AbsoluteSoftwareLLC.com
[F]:   https://www.facebook.com/AbsoluteSoftwareLLC