utf8 character types ?

utf8 character types ?

Kuniaki Mukai
Hi,

I have written a small script in SWI-Prolog that generates a dict named reserved_char_class,
mapping the character types of SWI-Prolog to character intervals.  The last three
entries of the dict were added by hand in order to parse text in UTF-8 encoding.

I know little about encodings such as UTF-8, but when I tested
regular expressions that use the utf8 character classes,
the automata synthesized by my regular expression compiler recognised
input text in Japanese characters (Kanji, multi-byte) as I expected.

I hope this very simple addition of character classes for UTF-8
will work without serious unexpected problems. If it does,
I, as a user of a multi-byte language, have to pay my respects to the inventors
and developers of the UTF-8 encoding.

reserved_char_class([
         alnum-['0'-'9','A'-'Z',a-z],
         alpha-['A'-'Z',a-z],
         ascii-['\000\'-'\177\'],
         cntrl-['\000\'-'\037\','\177\'-'\177\'],
         csym-['0'-'9','A'-'Z','_'-'_',a-z],
         csymf-['A'-'Z','_'-'_',a-z],
         digit-['0'-'9'],
         end_of_line-['\n'-'\r'],
         graph-[ (!)- (~)],
         lower-[a-z],
         newline-['\n'-'\n'],
         period-[ (!)- (!), ('.')- ('.'), (?)- (?)],
         prolog_atom_start-[a-z],
         prolog_identifier_continue-['0'-'9','A'-'Z','_'-'_',a-z],
         prolog_symbol-[ (#)- ($), (&)- (&), (*)- (+), (-)- (/), (:)- (:), (<)- (@), (\)- (\), (^)- (^), (~)- (~)],
         prolog_var_start-['A'-'Z','_'-'_'],
         punct-[ (!)- (/), (:)- (@),'['- ('`'),'{'- (~)],
         quote-['"'-'"','\''-'\'', ('`')- ('`')],
         space-['\t'-'\r',' '-' '],
         upper-['A'-'Z'],
         white-['\t'-'\t',' '-' '],
         utf8 -['\200\' - '\377\'],
         utf8c -['\200\' - '\277\'],       %  UTF8 byte after the first one.
         utf8b -['\300\' - '\377\']        %  the first byte of UTF8  codes.
                     
]).
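
The post does not say how the table is consulted, so the following is only a guess at one plausible use: testing whether a byte or character code falls into one of the intervals of a named class. The predicate names code_in_class/3 and sample_classes/1 are hypothetical, and the sample table copies just a few entries from the list above.

```prolog
% A few entries copied from reserved_char_class/1 above (hypothetical helper).
sample_classes([ digit-['0'-'9'],
                 lower-[a-z],
                 utf8c-['\200\'-'\277\'],   % continuation bytes
                 utf8b-['\300\'-'\377\']    % bytes treated as lead bytes
               ]).

% code_in_class(+Table, +Class, +Code)
% True if Code lies inside one of the Lo-Hi intervals listed for Class.
code_in_class(Table, Class, Code) :-
    memberchk(Class-Intervals, Table),
    member(Lo-Hi, Intervals),
    char_code(Lo, LoCode),
    char_code(Hi, HiCode),
    Code >= LoCode,
    Code =< HiCode,
    !.
```

For example, `sample_classes(T), code_in_class(T, utf8c, 188)` should succeed, since 188 lies between octal 200 (128) and octal 277 (191). Note that the utf8b interval is wider than what well-formed UTF-8 actually uses: the bytes 0xC0, 0xC1 and 0xF5 to 0xFF never occur as lead bytes in valid UTF-8.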

Regards,

Kuniaki Mukai

_______________________________________________
SWI-Prolog mailing list
[hidden email]
https://lists.iai.uni-bonn.de/mailman/listinfo.cgi/swi-prolog

Re: utf8 character types ?

Kilian Evang
Hi,

you didn't say what your reserved_char_class/1 is supposed to do, so I
cannot really comment on it. But you really want to read this article:

http://kunststube.net/encoding/

In particular, make sure that you understand the difference between a
character set (such as Unicode) and a character encoding (such as UTF-8).

Best,
Kilian

Re: utf8 character types ?

Kuniaki Mukai

On Sep 13, 2014, at 4:22, Kilian Evang <[hidden email]> wrote:

> Hi,
>
> you didn't say what your reserved_char_class/1 is supposed to do, so I
> cannot really comment on it.

My purpose is simple.

This is the familiar case of codes for an ASCII string:
?-  X = `ABC`, length(X, Length).
%@ X = [65, 66, 67],
%@ Length = 3.

But this one was a bit of a surprise to me:
?-  X = `漢字`, length(X, Length).
%@ X = [230, 188, 162, 229, 173, 151],
%@ Length = 6.
I expected X to be a list of length 2 of (positive) integers,
each of which may be large.

Knowing this, I realise that it is the programmer's responsibility to write code
that recognises multi-byte characters in the given list of integers.
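
For the case where the input really does arrive as raw bytes, delimiting the characters by hand could look roughly like the sketch below. It is a minimal DCG written for illustration, not anything taken from the regular expression compiler mentioned in this thread; the names bytes_codes/2, code_points//1, code_point//1 and continuation//3 are hypothetical. It checks lead and continuation byte ranges but, as a simplification, does not reject overlong three- and four-byte forms or surrogate code points.

```prolog
% bytes_codes(+Bytes, -Codes)
% Decode a list of UTF-8 bytes into a list of Unicode code points.
% Fails if the byte list is not (roughly) well-formed UTF-8.
bytes_codes(Bytes, Codes) :-
    phrase(code_points(Codes), Bytes).

code_points([C|Cs]) --> code_point(C), !, code_points(Cs).
code_points([])     --> [].

% A code point is a lead byte followed by 0 to 3 continuation bytes.
code_point(C) --> [B], { B =< 0x7F }, !, { C = B }.
code_point(C) --> [B], { B >= 0xC2, B =< 0xDF }, !,
    { C0 is B /\ 0x1F }, continuation(1, C0, C).
code_point(C) --> [B], { B >= 0xE0, B =< 0xEF }, !,
    { C0 is B /\ 0x0F }, continuation(2, C0, C).
code_point(C) --> [B], { B >= 0xF0, B =< 0xF4 },
    { C0 is B /\ 0x07 }, continuation(3, C0, C).

% continuation(+N, +Acc0, -Acc): consume N bytes in 0x80..0xBF,
% shifting six payload bits from each into the accumulator.
continuation(0, C, C) --> !.
continuation(N, C0, C) -->
    [B],
    { B >= 0x80, B =< 0xBF,
      C1 is (C0 << 6) \/ (B /\ 0x3F),
      N1 is N - 1 },
    continuation(N1, C1, C).
```

With the six bytes shown above, `bytes_codes([230,188,162,229,173,151], Cs)` gives `Cs = [28450, 23383]`, a list of length 2. SWI-Prolog also ships library(utf8), whose utf8_codes//1 covers the same ground, so a hand-written decoder like this is mainly useful for seeing how the byte classes fit together.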

> But you really want to read this article:
>
> http://kunststube.net/encoding/
>

Thank you for pointing me to the article.

> In particular, make sure that you understand the difference between a
> character set (such as Unicode) and a character encoding (such as UTF-8).

It is not clear to me what a character set is, but I suppose it
is some linearly ordered set of character images defined by an authority.

I hope I need not go into the details of character sets, because my purpose
is simply to delimit the codes of each character.

Thank you for your response.

Regards

Kuniaki



Re: utf8 character types ?

Jan Wielemaker-5
Hi Kuniaki,

On 09/13/2014 03:27 PM, Kuniaki Mukai wrote:
> But, this is a little bit surprise for me.
> ?-  X = `漢字`, length(X, Length).
> %@ X = [230, 188, 162, 229, 173, 151],
> %@ Length = 6.
> I expected that X is a list of length 2 of (positive) integers
> which may be large enough.

Which version are you using??  I get this (SWI-Prolog 7.1.22 running on
Linux with UTF-8 locale, but this should hold for all versions that are
not completely ancient):

1 ?- X = `漢字`.
X = [28450, 23383].

That is what is supposed to happen: all I/O is translated into Unicode,
which is used internally.  Each stream has an `encoding' property that
translates between the external world and internal strings or atoms.

I guess you can also put the system in ISO-Latin-1 mode, which makes it
basically 8-bit transparent and if you feed it some multibyte character
stream, it will store them internally as bytes and output them consistently.

The default I/O encoding is `text`, which basically means it will use
the C-library function to translate the input into Unicode.  If it
recognises that the native encoding is UTF-8 based, it will use its
own UTF-8 routines.
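
To see which encoding a running session has actually picked, the stream property and the default flag can be inspected, and the standard streams can be switched by hand. The sketch below uses only stream_property/2, current_prolog_flag/2 and set_stream/2; the wrapper names show_encoding/0 and force_utf8/0 are hypothetical, and whether forcing UTF-8 is the right cure depends on what the terminal or editor really sends.

```prolog
% Print the encoding of user_input and the default `encoding` flag.
show_encoding :-
    stream_property(user_input, encoding(In)),
    current_prolog_flag(encoding, Default),
    format("user_input: ~w, default flag: ~w~n", [In, Default]).

% Make the three standard streams decode and encode as UTF-8,
% regardless of what the locale initialization selected.
force_utf8 :-
    set_stream(user_input,  encoding(utf8)),
    set_stream(user_output, encoding(utf8)),
    set_stream(user_error,  encoding(utf8)).
```

If show_encoding reports something other than utf8 while the editor sends UTF-8 bytes, each byte is read as a separate character, which would match the six-element list reported earlier in this thread.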

        Cheers --- Jan

Re: utf8 character types ?

Kuniaki Mukai

Hi Jan,

I have found a workaround by chance:

Run the GUI Emacs.app, which is located in the Applications folder,
from the command line (instead of double-clicking the application icon):

%   /Applications/Emacs.app/Contents/MacOS/Emacs &

?- length(`漢字`, L).
L = 2.

I am so surprised at this difference that I am
reluctant to call it a bug; rather, it is my discovery
of a hidden feature of the system!
However, of course I would like to know a true fix for the problem.

Regards,

Kuniaki



> Hi Jan,
>
> On Sep 13, 2014, at 23:26, Jan Wielemaker <[hidden email]> wrote:
>
>> Hi Kuniaki,
>>
>> On 09/13/2014 03:27 PM, Kuniaki Mukai wrote:
>>> But, this is a little bit surprise for me.
>>> ?-  X = `漢字`, length(X, Length).
>>> %@ X = [230, 188, 162, 229, 173, 151],
>>> %@ Length = 6.
>>> I expected that X is a list of length 2 of (positive) integers
>>> which may be large enough.
>>
>> Which version are you using??  
>
> % swipl -v
> SWI-Prolog version 7.1.19 for x86_64-darwin13.3.0
>
>> I get this (SWI-Prolog 7.1.22 running on
>> Linux with UTF-8 locale, but this should hold for all versions that are
>> not completely ancient):
>>
>> 1 ?- X = `漢字`.
>> X = [28450, 23383].
>
> I have tested three cases after your suggestion:
>
> 1)    Run SWI-Prolog from command line:
> % swipl
> Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.1.19-11-gb66fd67)
>
> ?- X = `漢字`.
> X = [28450, 23383].
>
> 2)  Run SWI-Prolog at shell mode  in GNU Emacs 24.3.50.1
>
> The same as in the case 1). That is,
> ?- X = `漢字`.
> X = [28450, 23383].
>
> 3)  Run SWI-Prolog in ediprolog mode in GNU Emacs 24.3.50.1
> ?- X = `漢字`.
> %@ X = [230, 188, 162, 229, 173, 151].
>
> I cannot go further.
>
>> That is what is supposed to happen: all I/O is translated into Unicode,
>> which is used internally.  Each stream has an `encoding' property that
>> translates between the external world and internal strings or atoms.
>>
>> I guess you can also put the system in ISO-Latin-1 mode, which makes it
>> basically 8-bit transparent and if you feed it some multibyte character
>> stream, it will store them internally as bytes and output them consistently.
>
> It sounds like fixing my problem, I will try it anyhow, though
> I don't know  exactly what I should do.
>
>> The default I/O encoding is `text`, which basically means it will use
>> the C-library function to translate the input into Unicode.  If it
>> recognises that the native encoding is UTF-8 based, it will use its
>> own UTF-8 routines.
>
> Thank you. Following your suggestion I will try what I can do,
> holding ediprolog mode at hand.  
>
> Regards
>
> Kuniaki

Re: utf8 character types ?

Kuniaki Mukai
In reply to this post by Jan Wielemaker-5

Hi,

I remember a fix for a similar problem.

Open   ~/.MacOSX/environment.plist and add an entry for PATH.
I have set its value to the output of the following
command line.

% echo  $PATH

I still have only a vague idea of why this fix works, though.

Thank you for the clear explanation of encoding
in SWI-Prolog.


Kuniaki


Re: utf8 character types ?

Jan Wielemaker-5
On 09/14/2014 09:03 AM, Kuniaki Mukai wrote:

>
> Hi,
>
> I remember a fix for similar trouble.
>
> Open   ~/.MacOSX/environment.plist and add an entry for PATH.
> I have set the value to that obtained by the following
> command line.
>
> % echo  $PATH
>
> Still I have only vague idea why this fix works, though.

The only thing I know is that apps are started with a different
environment than running a program from the terminal.  What
matters is the setting of the environment variable LANG, which
drives the `locale' initialization.
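
A quick way to compare the two environments is to ask Prolog itself which locale variables it inherited, once when started from the terminal and once from the GUI-launched Emacs. The sketch below relies only on getenv/2; the predicate name show_locale_env/0 is hypothetical, and checking LC_ALL and LC_CTYPE next to LANG is an addition based on the usual POSIX locale conventions, not something stated in this thread.

```prolog
% Print the locale-related environment variables visible to this process.
show_locale_env :-
    forall(member(Var, ['LANG', 'LC_ALL', 'LC_CTYPE']),
           (   getenv(Var, Value)
           ->  format("~w=~w~n", [Var, Value])
           ;   format("~w is not set~n", [Var])
           )).
```

If LANG shows up in the terminal session but is reported as not set under the GUI-launched editor, that difference alone would explain why the same query yields code points in one case and raw bytes in the other.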

Thanks for sorting this out

        --- Jan

Re: utf8 character types ?

Kuniaki Mukai
In reply to this post by Kuniaki Mukai
Hi,

I made a mistake. ~/.MacOSX/environment.plist seems to be no longer supported,
at least on Mavericks.  However, there are several ways to set environment variables
such as LANG for a GUI application such as Emacs.app in my case. (Keywords for search:
MacOSX, Mavericks, launchd.conf, launchctl.)  Among them I have chosen
a general way, writing a tiny C script following an article from a site
in Japanese.  This time it works well even after rebooting, unlike
the wrong fix I posted in the previous message.

Kuniaki
