Alaska Software Inc. - Encoding for a file UTF or Other.
César Calvo Encoding for a file UTF or Other.
on Sun, 13 Nov 2016 16:46:19 +0100
Good afternoon.

How could I know the encoding for a file?

I mean, whether it is UTF-16 or UTF-8, WIN1252 or other.

Is there any function in Alaska XBase for this?

If so, could I convert from UTF-X to Win1252 or from Win1252 to UTF-X?

Best.
César.
Thomas Braun Re: Encoding for a file UTF or Other.
on Mon, 14 Nov 2016 08:23:44 +0100
César Calvo wrote:

> How could I know the encoding for a file?

Easy answer: detecting the encoding correctly every time is impossible, and
automatic detection is unreliable:

https://en.wikipedia.org/wiki/Charset_detection

> Is there any function in Alaska XBase for this?

No.

> If so, could I convert from UTF-X to Win1252 or from Win1252 to UTF-X?

You could, but not without possible loss of information.

Thomas
Andreas Gehrs-Pahl Re: Encoding for a file UTF or Other.
on Mon, 14 Nov 2016 17:42:19 -0500
César, Thomas,

>>How could I know the encoding for a file?

You can do some simple checks, but depending on the circumstances, you won't 
be able to be 100% sure.

>>Is there any function in Alaska XBase for this?
>No.

Actually, there is a function in Windows to determine if a string is most 
probably (rather than certainly) Unicode text. The Windows API function 
"IsTextUnicode()" applies statistical tests for UTF-16 (little-endian, plus 
byte-reversed variants); it does not recognize UTF-8. It is wrapped in 
Xbase++ as function "IsUnicode()". The Xbase++ wrapper will only test for 
UTF-16LE, and no other Unicode encodings are recognized as Unicode.

The implementation of "IsUnicode()" can be found, depending on your Xbase++ 
version, in either the "\Source\Runtime\DUI\activex-helpers.prg" or the 
"\Source\Sys\activex.prg" program file.

You can of course do some simple checks for yourself. The most basic option 
is to check for the BOM (Byte Order Mark), which would be the first two to 
five bytes of the string (or file). Here is a list of BOM values:

	https://en.wikipedia.org/wiki/Byte_order_mark

If a BOM is present, you can usually rely on it alone to determine the 
encoding. Test the longer BOMs first, though, because the UTF-32LE BOM 
(FF FE 00 00) begins with the UTF-16LE BOM (FF FE). If no BOM is used, you 
can do some additional tests, such as checking whether every second byte is 
a NUL byte -- chr(0) -- which, for text in Latin scripts, would indicate 
UTF-16LE (NUL in the second byte of each pair) or UTF-16BE (NUL in the 
first byte of each pair).
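
A minimal sketch of these checks, written in Python rather than Xbase++ 
purely for illustration. The function name, the 4096-byte sample size, the 
80%/20% thresholds, and the final fallback to "cp1252" are all assumptions 
made for this example, not part of any library. It is a heuristic and can 
guess wrong, as discussed above.

```python
import codecs

# Longer BOMs come first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best guess for the encoding of data (hypothetical helper)."""
    # 1. A BOM, if present, is the strongest hint.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. No BOM: look at where the NUL bytes sit. Mostly-Latin UTF-16
    #    text has a NUL in one byte of every two-byte pair.
    sample = data[:4096]
    if len(sample) >= 2:
        half = len(sample) // 2
        even_nuls = sample[0::2].count(0)  # first byte of each pair
        odd_nuls = sample[1::2].count(0)   # second byte of each pair
        if odd_nuls > 0.8 * half and even_nuls < 0.2 * half:
            return "utf-16-le"
        if even_nuls > 0.8 * half and odd_nuls < 0.2 * half:
            return "utf-16-be"

    # 3. Text with bytes >= 0x80 rarely happens to be valid UTF-8 by
    #    accident, so a successful strict decode is a reasonable hint.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: treat everything else as Windows-1252. It could
        # just as well be another 8-bit code page.
        return "cp1252"
```

Typical use would be to read the file in binary mode and pass the bytes in, 
for example `guess_encoding(open(path, "rb").read())`, where `path` is 
whatever file you want to inspect.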

>>If so, could I convert from UTF-X to Win1252 or from Win1252 to UTF-X?

Keep in mind that UTF-8 encodes the first 128 (ASCII) characters exactly as 
ASCII does. UTF-16 and UTF-32 use the same values for those characters, just 
padded with extra NUL bytes to two or four bytes per character.
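
To make the byte layout concrete, here is a small Python illustration (Python 
is used only as a convenient way to show the bytes; the variable names are 
made up for this example, and `bytes.hex()` with a separator needs Python 
3.8 or later):

```python
text = "Abc"  # plain 7-bit ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")
utf16be_bytes = text.encode("utf-16-be")
utf32le_bytes = text.encode("utf-32-le")

# Print each encoding as space-separated hex bytes.
print(ascii_bytes.hex(" "))    # 41 62 63
print(utf8_bytes.hex(" "))     # 41 62 63
print(utf16le_bytes.hex(" "))  # 41 00 62 00 63 00
print(utf16be_bytes.hex(" "))  # 00 41 00 62 00 63
print(utf32le_bytes.hex(" "))  # 41 00 00 00 62 00 00 00 63 00 00 00
```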

Xbase++ has several conversion routines, such as: 
* Unicode2Str()  to convert UTF-16LE to ANSI/OEM
* Str2Unicode()  to convert ANSI/OEM to UTF-16LE
* ConvToAnsiCP() to convert OEM to ANSI
* ConvToOemCP()  to convert ANSI to OEM

The first two functions are again wrappers for the Windows API functions 
"WideCharToMultiByte()" and "MultiByteToWideChar()", which can handle some 
additional formats (besides UTF-16LE), and they are implemented in the same 
source file as the "IsUnicode()" function.
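
The Xbase++ functions themselves are not reproduced here. As a rough analogue 
only, the following Python sketch converts between UTF-8 and Windows-1252 and 
shows where information can get lost. Both helper names are hypothetical, and 
replacing unmappable characters with "?" is just one possible policy.

```python
def utf8_to_cp1252(data: bytes) -> bytes:
    """Convert UTF-8 bytes to Windows-1252 (hypothetical helper).

    Characters that do not exist in Windows-1252 are replaced with '?',
    which is where the loss of information happens.
    """
    return data.decode("utf-8").encode("cp1252", errors="replace")


def cp1252_to_utf8(data: bytes) -> bytes:
    """Convert Windows-1252 bytes to UTF-8 (hypothetical helper).

    This direction is lossless for defined bytes, but Python raises
    UnicodeDecodeError for the few byte values that Windows-1252 leaves
    undefined (for example 0x81).
    """
    return data.decode("cp1252").encode("utf-8")


original = "César €5 ≠ 日本"
narrow = utf8_to_cp1252(original.encode("utf-8"))
round_trip = cp1252_to_utf8(narrow).decode("utf-8")

print(round_trip)  # César €5 ? ??
```

The accented letter and the euro sign survive because Windows-1252 contains 
them; the other three characters do not and come back as question marks.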

Whether a file uses ANSI (CP-1252 / Windows-1252), ISO 8859-1, some other 
code page, or even an OEM character set cannot be determined from the bytes 
themselves. What you see depends entirely on the code page (and font) used 
to display the characters, and there is no way to determine which glyph 
(letter, symbol, etc.) was intended for a particular byte or character.

If you only deal with 7-bit ASCII characters, UTF-8, ANSI, and ASCII will 
all be the same. If you deal with 8-bit character sets, different Code Pages 
might have a different symbol/glyph for the same character value, though.
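
A short Python illustration of that last point: the same two byte values are 
decoded with four different 8-bit code pages. The helper name and the choice 
of bytes and code pages are arbitrary picks for this example.

```python
def decode_with(raw: bytes, codec_names):
    """Decode the same bytes with several code pages (hypothetical helper)."""
    return {name: raw.decode(name) for name in codec_names}


raw = bytes([0x80, 0xE9])
results = decode_with(raw, ["cp1252", "cp437", "cp850", "latin-1"])

for name, text in results.items():
    # ascii() keeps the output readable even for control characters.
    print(f"{name:8} {text!r:12} {ascii(text)}")
```

Under Windows-1252 the bytes read as "€é", under the OEM code page 437 as 
"ÇΘ", under code page 850 as "ÇÚ", and under ISO 8859-1 the first byte is an 
invisible control character followed by "é". Nothing in the bytes says which 
reading was intended.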

Here is some additional reading material:

https://en.wikipedia.org/wiki/Character_encoding
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/Code_page
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16

Hope that helps,

Andreas

Andreas Gehrs-Pahl
Absolute Software, LLC

phone: (989) 723-9927
email: Andreas@AbsoluteSoftwareLLC.com
web:   http://www.AbsoluteSoftwareLLC.com
[F]:   https://www.facebook.com/AbsoluteSoftwareLLC
Thomas Braun Re: Encoding for a file UTF or Other.
on Tue, 15 Nov 2016 11:03:08 +0100
Andreas Gehrs-Pahl wrote:

> Actually, there is a function in Windows to determine if a string is most 
> probably (rather than certainly) Unicode text. The Windows API function 
> "IsTextUnicode()" applies statistical tests for UTF-16 (little-endian, plus 
> byte-reversed variants); it does not recognize UTF-8. It is wrapped in 
> Xbase++ as function "IsUnicode()". The Xbase++ wrapper will only test for 
> UTF-16LE, and no other Unicode encodings are recognized as Unicode.

Good to know, so it is basically useless... and I'd rather have native
support for Unicode strings at the Xbase++ language level.

The lack of proper Unicode support in Xbase++ is one of the reasons why I'm
using Python 3 and the Django framework for all my new web application
projects.

> Here is some additional reading material:
> 
> https://en.wikipedia.org/wiki/Character_encoding
> https://en.wikipedia.org/wiki/Unicode
> https://en.wikipedia.org/wiki/Code_page
> https://en.wikipedia.org/wiki/Windows-1252
> https://en.wikipedia.org/wiki/UTF-8
> https://en.wikipedia.org/wiki/UTF-16

Well, that's a lot to read.

Thomas
Steven Scheffer Re: Encoding for a file UTF or Other.
on Tue, 15 Nov 2016 23:22:09 +0100
Hi César,

I have done some work on this.
If you want, drop me an email and I will send you my DLL.

Steven


César Calvo <ccalvoc@telefonica.net> wrote in message
news:6d255e4$712b3326$a9f2f@news.alaska-software.com...
>Good afternoon.
>
>How could I know the encoding for a file?
>
>I mean, whether it is UTF-16 or UTF-8, WIN1252 or other.
>
>Is there any function in Alaska XBase for this?
>
>If so, could I convert from UTF-X to Win1252 or from Win1252 to UTF-X?
>
>Best.
>César.