Author | Topic: Encoding for a file UTF or Other. | |
---|---|---|
César Calvo | Encoding for a file UTF or Other. on Sun, 13 Nov 2016 16:46:19 +0100 Good afternoon. How could I know the encoding for a file? I mean, whether it is UTF-16 or UTF-8, WIN1252, or something else. Is there any function in Alaska XBase for this? If so, could I convert from UTF-X to Win1252 or from Win1252 to UTF-X? Best. César. | |
Thomas Braun | Re: Encoding for a file UTF or Other. on Mon, 14 Nov 2016 08:23:44 +0100 César Calvo wrote: > How could I know the encoding for a file? Easy answer: detecting the encoding correctly every time is impossible, and automatic detection is highly unreliable: https://en.wikipedia.org/wiki/Charset_detection > Is there any function in Alaska XBase for this? No. > If this, Could I change from UTF-X to Win1252 or Win1252 to UTF-X? You could, but not without possible loss of information, since Win1252 cannot represent most Unicode characters. Thomas | |
Andreas Gehrs-Pahl | Re: Encoding for a file UTF or Other. on Mon, 14 Nov 2016 17:42:19 -0500 César, Thomas, >>How could I know the encoding for a file? You can do some simple checks, but depending on the circumstances, you won't be able to be 100% sure. >>Is there any function in Alaska XBase for this? >No. Actually, there is a function in Windows to determine if a string is most probably (rather than certainly) in a particular (Unicode) encoding, such as UTF-8 or UTF-16LE, etc. The Windows API function, "IsTextUnicode()" is wrapped in Xbase++ as function "IsUnicode()". The Xbase++ wrapper will only test for UTF-16LE, and no other Unicode encodings are recognized as Unicode. The implementation of "IsUnicode()" can be found, depending on your Xbase++ version, in either the "\Source\Runtime\DUI\activex-helpers.prg" or the "\Source\Sys\activex.prg" program file. You can of course do some simple checks for yourself. The most basic option is to check for the BOM (Byte Order Mark), which would be the first two to five characters of the string (or file). Here is a list of BOM values: https://en.wikipedia.org/wiki/Byte_order_mark If a specific BOM is used, you should be able to rely on its existence alone to determine what encoding is used, but if no BOM is used, you can do some additional tests, such as checking if every first or second byte is a NULL character -- chr(0) -- which would indicate that you are dealing with either UTF-16(LE) or UTF-16(BE) encoding, etc. >>If this, Could I change from UTF-X to Win1252 or Win1252 to UTF-X? Keep in mind that UTF-8 is the same as ASCII for the first 128 (ASCII) characters, which is also true for UTF-16/UTF-32, just with those extra NULL characters thrown in. 
Xbase++ has several conversion routines: Unicode2Str() converts UTF-16LE to ANSI/OEM, Str2Unicode() converts ANSI/OEM to UTF-16LE, ConvToAnsiCP() converts OEM to ANSI, and ConvToOemCP() converts ANSI to OEM. The first two functions are again wrappers for the Windows API functions "WideCharToMultiByte()" and "MultiByteToWideChar()", which can handle some additional formats (besides UTF-16LE), and they are implemented in the same source file as the "IsUnicode()" function. Which code page is used, or whether ANSI (CP-1252 / Windows-1252), ISO 8859-1, or even an OEM character set is used, depends entirely on the font you are using to display the characters. There is no way to determine from the data itself which glyph (letter, symbol, etc.) should be displayed for a particular byte or character, though. If you only deal with 7-bit ASCII characters, UTF-8, ANSI, and ASCII will all be the same. If you deal with 8-bit character sets, different Code Pages might have a different symbol/glyph for the same character value, though. Here is some additional reading material: https://en.wikipedia.org/wiki/Character_encoding https://en.wikipedia.org/wiki/Unicode https://en.wikipedia.org/wiki/Code_page https://en.wikipedia.org/wiki/Windows-1252 https://en.wikipedia.org/wiki/UTF-8 https://en.wikipedia.org/wiki/UTF-16 Hope that helps, Andreas Andreas Gehrs-Pahl Absolute Software, LLC phone: (989) 723-9927 email: Andreas@AbsoluteSoftwareLLC.com web: http://www.AbsoluteSoftwareLLC.com [F]: https://www.facebook.com/AbsoluteSoftwareLLC | |
Thomas Braun | Re: Encoding for a file UTF or Other. on Tue, 15 Nov 2016 11:03:08 +0100 Andreas Gehrs-Pahl wrote: > Actually, there is a function in Windows to determine if a string is most > probably (rather than certainly) in a particular (Unicode) encoding, such > as UTF-8 or UTF-16LE, etc. The Windows API function, "IsTextUnicode()" is > wrapped in Xbase++ as function "IsUnicode()". The Xbase++ wrapper will only > test for UTF-16LE, and no other Unicode encodings are recognized as Unicode. Good to know, so it is basically useless... and I'd rather have native support for Unicode strings at the Xbase++ language level. The lack of proper Unicode support in Xbase++ is one of the reasons why I'm using Python 3 and the Django Framework for all my new web application projects. > Here is some additional reading material: > > https://en.wikipedia.org/wiki/Character_encoding > https://en.wikipedia.org/wiki/Unicode > https://en.wikipedia.org/wiki/Code_page > https://en.wikipedia.org/wiki/Windows-1252 > https://en.wikipedia.org/wiki/UTF-8 > https://en.wikipedia.org/wiki/UTF-16 Well, that's a lot to read. Thomas | |
Steven Scheffer | Re: Encoding for a file UTF or Other. on Tue, 15 Nov 2016 23:22:09 +0100 Hi César, I have done some work on this. If you want, drop me an e-mail and I will send you my DLL. Steven César Calvo <ccalvoc@telefonica.net> wrote in message news:6d255e4$712b3326$a9f2f@news.alaska-software.com... >Good afternoon. > >How could I know the encoding for a file? > >I mean if is UTF-16 or UTF-8, WIN1252 or other. > >Is there any function in Alaska XBase for this? > >If this, Could I change from UTF-X to Win1252 or Win1252 to UTF-X? > >Best. >César. |
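
For readers who want to try the checks described in this thread (look for a BOM first, then look for NUL bytes in every second position), here is a rough sketch in Python 3, since that language came up above. It is not Xbase++ code and not anyone's posted implementation. The names `guess_encoding` and `convert`, the 4096-byte sample, the 80% NUL threshold, and the fallback to Windows-1252 are all assumptions made for illustration. The result is a guess, never a certainty.

```python
import codecs

# Longer BOMs come first: the UTF-32LE mark begins with the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes) -> tuple:
    """Return (encoding name, BOM length in bytes). Heuristic only."""
    # 1. A BOM, if present, is the strongest hint.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)

    # 2. No BOM: mostly-Latin UTF-16 text has a NUL in every second byte.
    sample = data[:4096]          # assumed sample size
    pairs = len(sample) // 2
    if pairs:
        if sample[1::2].count(0) > 0.8 * pairs:   # assumed threshold
            return "utf-16-le", 0
        if sample[0::2].count(0) > 0.8 * pairs:
            return "utf-16-be", 0

    # 3. Strict UTF-8 decoding rarely succeeds by accident on 8-bit text.
    #    Pure 7-bit ASCII also lands here, which is harmless.
    try:
        data.decode("utf-8")
        return "utf-8", 0
    except UnicodeDecodeError:
        pass

    # 4. Assumption: anything else is treated as Windows-1252.
    return "cp1252", 0


def convert(data: bytes, target: str) -> bytes:
    """Re-encode data into target; unrepresentable characters become '?'."""
    source, skip = guess_encoding(data)
    text = data[skip:].decode(source, errors="replace")
    return text.encode(target, errors="replace")


if __name__ == "__main__":
    print(guess_encoding("César".encode("utf-16-le")))
    print(convert("César".encode("utf-8"), "cp1252"))
```

Converting toward Win1252 uses `errors="replace"`, so any character that Win1252 cannot represent is silently turned into `?`. That is the loss of information Thomas warned about. Use `errors="strict"` instead if you would rather get an exception.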