Author | Topic: Re: Encoding for a file UTF or Other. | |
---|---|---|
Andreas Gehrs-Pahl | Re: Encoding for a file UTF or Other. on Mon, 14 Nov 2016 17:42:19 -0500 César, Thomas, >>How could I know the encoding for a file? You can do some simple checks, but depending on the circumstances, you won't be able to be 100% sure. >>Is there any function in Alaska XBase for this? >No. Actually, there is a function in Windows that determines whether a string is most probably (rather than certainly) in a particular Unicode encoding, such as UTF-8 or UTF-16LE. The Windows API function "IsTextUnicode()" is wrapped in Xbase++ as the function "IsUnicode()". The Xbase++ wrapper tests only for UTF-16LE; it does not recognize any other Unicode encoding as Unicode. The implementation of "IsUnicode()" can be found, depending on your Xbase++ version, in either the "\Source\Runtime\DUI\activex-helpers.prg" or the "\Source\Sys\activex.prg" program file. You can of course do some simple checks yourself. The most basic option is to check for the BOM (Byte Order Mark), which would be the first two to five bytes of the string (or file). Here is a list of BOM values: https://en.wikipedia.org/wiki/Byte_order_mark If a specific BOM is present, you should be able to rely on its existence alone to determine which encoding is used. If no BOM is present, you can do some additional tests, such as checking whether every first or every second byte is a NULL character, chr(0). For text that consists mostly of ASCII characters, a NULL in every second byte indicates UTF-16LE, and a NULL in every first byte indicates UTF-16BE. >>If this, Could I change from UTF-X to Win1252 or Win1252 to UTF-X? Keep in mind that UTF-8 is the same as ASCII for the first 128 (ASCII) characters, which is also true for UTF-16/UTF-32, just with those extra NULL characters thrown in. 
Xbase++ has several conversion routines, such as: * Unicode2Str() to convert UTF-16LE to ANSI/OEM * Str2Unicode() to convert ANSI/OEM to UTF-16LE * ConvToAnsiCP() to convert OEM to ANSI * ConvToOemCP() to convert ANSI to OEM The first two functions are again wrappers for the Windows API functions "WideCharToMultiByte()" and "MultiByteToWideChar()", which can handle some additional formats (besides UTF-16LE), and they are implemented in the same source file as the "IsUnicode()" function. Which code page is used, that is, whether the text is ANSI (CP-1252 / Windows-1252), ISO 8859-1, or even an OEM character set, depends entirely on the font you are using to display the characters. There is no way to determine from the bytes alone which glyph (letter, symbol, etc.) should be displayed for a particular byte or character, though. If you only deal with 7-bit ASCII characters, UTF-8, ANSI, and ASCII will all be the same. If you deal with 8-bit character sets, different code pages might have a different symbol/glyph for the same character value, though. Here is some additional reading material: https://en.wikipedia.org/wiki/Character_encoding https://en.wikipedia.org/wiki/Unicode https://en.wikipedia.org/wiki/Code_page https://en.wikipedia.org/wiki/Windows-1252 https://en.wikipedia.org/wiki/UTF-8 https://en.wikipedia.org/wiki/UTF-16 Hope that helps, Andreas Andreas Gehrs-Pahl Absolute Software, LLC phone: (989) 723-9927 email: Andreas@AbsoluteSoftwareLLC.com web: http://www.AbsoluteSoftwareLLC.com [F]: https://www.facebook.com/AbsoluteSoftwareLLC |
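
To make the checks described above a bit more concrete, here is a rough sketch of the BOM test, the NULL-byte test, and a conversion to Windows-1252. It is written in Python rather than Xbase++, purely as a language-neutral illustration; it is not how "IsUnicode()" or "IsTextUnicode()" work internally. The names `guess_encoding` and `to_cp1252` are made up for this example, the strict "every other byte is NULL" rule only holds for text in the ASCII range, and falling back to Windows-1252 when nothing else matches is an assumption, not a detection.

```python
import codecs

# Known BOMs, longest first, so that the UTF-32LE mark (FF FE 00 00)
# is not mistaken for the UTF-16LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes):
    """Return (encoding_name, bom_length). This is a guess, not a proof."""
    # 1. A BOM, if present, is the strongest hint.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)

    # 2. No BOM: look at where the NULL bytes sit in a sample.
    sample = data[:4096]
    sample = sample[: len(sample) - len(sample) % 2]  # whole byte pairs only
    if sample:
        first_bytes = sample[0::2]   # 1st, 3rd, 5th ... byte
        second_bytes = sample[1::2]  # 2nd, 4th, 6th ... byte
        if second_bytes.count(0) == len(second_bytes) and first_bytes.count(0) == 0:
            return "utf-16-le", 0
        if first_bytes.count(0) == len(first_bytes) and second_bytes.count(0) == 0:
            return "utf-16-be", 0

    # 3. Plain ASCII and valid UTF-8 both decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8", 0
    except UnicodeDecodeError:
        # Assumption: anything else is treated as Windows-1252.
        return "cp1252", 0


def to_cp1252(data: bytes) -> bytes:
    """Convert data from its guessed encoding to Windows-1252."""
    encoding, bom_len = guess_encoding(data)
    text = data[bom_len:].decode(encoding, errors="replace")
    # Characters that do not exist in Windows-1252 become "?".
    return text.encode("cp1252", errors="replace")
```

For example, `to_cp1252("César".encode("utf-8"))` gives the five bytes `b"C\xe9sar"`, while the same name stored as Windows-1252 is not valid UTF-8 and therefore ends up in the fallback branch. An 8-bit file in some other code page would be misreported as Windows-1252 by this sketch, which is exactly the limitation described above.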