A brief guide to Perl character encoding

Tags: mojibake, perl, unicode

Credits



I originally wrote this at work, after the team had spent far too many days shouting at computers because of Mojibake. Thanks to my employer for permission to publish it, and to the several colleagues who provided useful feedback. Any errors are, of course, not their fault.

Table of contents


  • 12:45. Restate my assumptions

  • The Royal Road
  • Characters, representations, and strings
  • Source code encoding, the utf8 pragma, and why you shouldn’t use it

  • Input and output
  • PerlIO layers



  • The Encode module
  • Encode::encode
  • Encode::decode
  • Encode:: everything else


  • Debugging
  • The UTF8 flag
  • Devel::Peek
  • hexdump
  • PerlIO::get_layers


  • The many ways of writing a character
  • String literals
  • The chr function
  • Octal
  • Hexadecimal
  • By codepoint name
  • Other hexadecimal
  • In regular expressions
  • ASCII-encoded JSON strings in your code
  • Accented character vs character + combining accent


  • 12:45. Restate my assumptions

    We will normally want to read and write UTF-8 encoded data. Therefore you should make sure that your terminal can handle it. While we will occasionally have to deal with other encodings, and will often want to look at the byte sequences that we are reading and writing and not just the characters they represent, your life will still be much easier if you have a UTF-8 capable terminal. You can test your terminal thus:

    $ perl -E 'binmode(STDOUT, ":encoding(UTF-8)"); say "\N{GREEK SMALL LETTER LAMDA}"'
    

    That should print λ , a letter that looks a bit like a lower-case y mirrored through the horizontal axis.

    And if you pipe the output from that into hexdump -C you should see the byte sequence 0xce 0xbb 0x0a .

    The Royal Road

    Ideally, your code will only have to care about any of this at the edges - that is, where data enters and leaves the application. That could be when reading or writing a file, sending/receiving data across the network, making system calls, or talking to a database. And in many of these cases - especially talking to a database - you will be using a library which already handles everything for you. In a brand new code-base which doesn’t have to deal with any legacy baggage you should, in theory, only have to read this first section of this document.

    Alas, most real programming is a habitation of devils, who will beset you from all around and make you have to care about the rest of it.

    Characters, representations, and strings

    Perl can work with strings containing any character in Unicode. Characters are written in source code either as a literal character such as "m" or in several other ways. These are all equivalent:

    "m"
    chr(0x6d) # or chr(109), of course
    "\x{6d}"
    "\N{U+6d}"
    "\N{LATIN SMALL LETTER M}"
    

    As are these:

    chr(0x3bb)
    "\x{3bb}"
    "\N{U+3bb}"
    "\N{GREEK SMALL LETTER LAMDA}"
    

    Non-ASCII characters can also appear as literals in your code, for example "λ" , but this is not recommended - see the discussion of the utf8 pragma below. You can also use octal - "\154" - but this too is not recommended as hexadecimal encodings are marginally more familiar and easier to read.

    Internally, characters have a representation, a sequence of bytes that is unique for a particular combination of character and encoding. Most modern languages default to using UTF-8 for that representation, but perl is old enough to pre-date UTF-8 - and indeed to pre-date any concern for most character sets. For backward-compatibility reasons, and for compatibility with the many C libraries for which perl bindings exist, it was decided when perl sprouted its Unicode tentacle that the default representation should be ISO-Latin-1. This is a single-byte character set that covers most characters used in most modern Western European languages, and is a strict superset of ASCII.

    Any string consisting solely of characters in ISO-Latin-1 will by default be represented internally in ISO-Latin-1. Consider these strings:

    Release the raccoon! - consists solely of ASCII characters. ASCII is a subset of ISO-Latin-1, so the string’s internal representation is an ISO-Latin-1-encoded string of bytes.

    Libérez le raton laveur! - consists solely of characters that exist in ISO-Latin-1, so the string’s internal representation is an ISO-Latin-1-encoded string of bytes. The "é" character has code point 0xe9 and is represented as the byte 0xe9 internally.

    Rhyddhewch y racŵn! - the "ŵ" does not exist in ISO-Latin-1. But it does exist in Unicode, with code point 0x175. As soon as perl sees a non-ISO-Latin-1 character in a string, it switches to using something UTF-8-ish, so code point 0x175 is represented by byte sequence 0xc5 0xb5. Note that while valid characters’ internal representations are valid UTF-8 byte sequences, that UTF-8-ish internal encoding can also encode invalid characters.

    Libérez le raton laveur! Rhyddhewch y racŵn! - this contains both an "é" (which is in ISO-Latin-1) and a "ŵ" (which is not), so the whole string is UTF-8 encoded. The "ŵ" is as before encoded as byte sequence 0xc5 0xb5, but the "é" must also be UTF-8 encoded instead of ISO-Latin-1-encoded, so becomes byte sequence 0xc3 0xa9.

    But notice that ISO-Latin-1 not only contains ASCII, and characters like "é" (at code point 0xe9, remember), it also contains characters "Ã" (capital A with a tilde, code point 0xc3) and "©" (copyright symbol, code point 0xa9). So how do we tell the difference between the ISO-Latin-1 byte sequence 0xc3 0xa9 representing "Ã©" and the UTF-8 byte sequence 0xc3 0xa9 representing "é"? Remember that a representation is "a sequence of bytes that is unique for a particular combination of character and encoding". So perl stores the encoding as well as the byte sequence. It is stored as a single bit flag. If the flag is unset then the sequence is ISO-Latin-1, if it is set then it is UTF-8.

    Source code encoding, the utf8 pragma, and why you shouldn’t use it

    It is possible to put non-ASCII characters into your source code. For example, consider this file:

    my $string = "é";
    
    print "$string contains ".length($string)." characters\n";
    

    from which some problems arise. First, if the file is encoded in UTF-8, how can perl tell when it comes across the byte sequence 0xc3 0xa9 what encoding that is? Is it ISO-Latin-1? Well, it could be. Is it UTF-8? Again, it could be. In general, it isn’t possible to tell from a sequence of bytes what encoding is in use. For backward-compatibility reasons, perl assumes ISO-Latin-1.

    If you save that file encoded in UTF-8, and have a UTF-8-savvy terminal, that code will output:

    é contains 2 characters
    

    which is quite clearly wrong. It interpreted the 0xc3 0xa9 as two characters, but then when it spat those two characters out your terminal treated them as one.

    We can tell perl that the file contains UTF-8-encoded source code by adding a use utf8 . We also need to fix the output encoding - use utf8 doesn’t do that for you, it only asserts that the source file is UTF-8 encoded:

    use utf8;
    binmode(STDOUT, ":encoding(UTF-8)");
    
    my $string = "é";
    
    print "$string contains ".length($string)." character\n";
    

    (For more on output encoding see the next section)

    And now we get this:

    é contains 1 character
    

    Hurrah!

    At this point a second problem arises. Some editors aren’t very clever about encodings and even if they correctly read a file that is encoded in UTF-8, they will save it in ISO-Latin-1. VSCode for example is known to do this at least some of the time. If that happens, you’re still asserting via use utf8 that the file is UTF-8, but the "é" in the sample file will be encoded as byte 0xe9, and the following double-quote and semicolon as 0x22 0x3b. This results in a fatal error:

    Malformed UTF-8 character: \xe9\x22\x3b (unexpected non-continuation byte 0x22,
    immediately after start byte 0xe9; need 3 bytes, got 1) at ...
    

    So given that you’re basically screwed if you have non-ASCII source code no matter whether you use utf8 or not, I recommend that you just don’t do it. If you need a non-ASCII character in your code, use any of the many other ways of specifying it, and if necessary put a comment nearby so that whoever next has to fiddle with the code knows what it is:

    chr(0xe9);   # e-acute
    

    Input and output

    Strings aren’t the only things that have encodings. File handles do too. Just as perl defaults to assuming that your source code is encoded in ISO-Latin-1, it assumes unless told otherwise that file handles are also ISO-Latin-1, and so if you try to print "é" to a handle, what actually gets written is the byte 0xe9.

    Even if your source code has the use utf8 pragma, and your code contains the byte sequence 0xc3 0xa9, which will internally be decoded as the character "é", your handles are still ISO-Latin-1 and you'll get a single byte for that character. For how this happens see "PerlIO layers" below.

    Things get a bit more interesting if you try to send a non-ISO-Latin-1 character to an ISO-Latin-1 handle. Perl does the best it can and sends the internal representation - which is UTF-8, remember - to the handle and emits a warning "Wide character in print". Pay attention to the warnings!

    This behaviour is another common source of bugs. If you send the two strings "Libérez le raton laveur!" followed by "Rhyddhewch y racŵn!" to an ISO-Latin-1 handle, the first will sail through, correctly encoded, but the second will go through in perl’s internal UTF-8 representation. You’ve now got two different character encodings in your output stream, and no matter what encoding is expected at the other end you’ll get mojibake.

    PerlIO layers

    We’ve seen how by default input and output is assumed to be in ISO-Latin-1. But that can be changed. Perl has supported different encodings for I/O since the dawn of time - since at least perl 3.016. That’s when it started to automatically convert "\n" into "\r\n" and vice versa on MSDOS, and the binmode() function was introduced in case you wanted to open a file on DOS without any translation.

    These days this is implemented via PerlIO layers, which allow you to open a file with all kinds of translation layers, including ones which you write yourself or grab from the CPAN (see for example File::BOM ). You can also add and remove layers on a handle that is already open.

    Generally, these days, you will want to read and write either UTF-8 or raw binary, so you will open files like this:

    open(my $fh, ">:encoding(UTF-8)", "some.log")
    
    open(my $fh, "<:raw", "image.jpg")
    


    Or, to change the encoding on a handle that is already open:

    binmode(STDOUT, ":encoding(UTF-8)")
    


    (Note that an encoding applied to a bareword filehandle such as STDOUT takes effect globally!)

    Unless you need to worry about Windows, you will generally have only one layer on a handle that does anything significant (on Windows the :crlf layer is useful alongside others, so that you can cope with Windows’s marvellous backward compatibility with CP/M), but you can have more. In general, when you open a handle for reading, encodings are applied to the data left to right, in the order they are given in the open() call; when writing, they are applied right to left.

    If you think you need more than one layer, or want layers other than those shown in the examples above, see PerlIO .

    The Encode module

    The above explains the "royal road", where you are in complete control of how data gets into and out of your code. In that situation, you should never need to re-encode data, as it will always be Just A Bunch Of Characters whose underlying representation you don’t care about. That is, however, often not the case in the real world, where we are beset by demons. We sometimes have to deal with libraries that do their own encoding/decoding and expect us to supply them with a byte stream ( XML::LibXML , for example), or with code which has had wrong or partial fixes applied for the problems described above and whose currently-buggy behaviour other code now relies on (workarounds to fix up wrongly-encoded data, for example), so that a proper fix can’t easily be made.

    Encode::encode

    The Encode::encode() function takes a string of characters and returns a string of bytes that represent that string in your desired encoding. For example:

    my $string = "Libérez le raton laveur!";
    encode("UTF-8", $string, Encode::FB_CROAK|Encode::LEAVE_SRC);
    

    will return a string where the character "é" has been replaced by the two bytes 0xc3 0xa9. If the original string was encoded in UTF-8 then the underlying representation of the input and output strings will be the same, but their encodings (as stored in the single bit flag we mentioned earlier) will be different, and the output will be reported as being one character longer by the length() function.

    Encode::encode can sometimes for Complicated Internals Optimisation Reasons modify its input. To avoid this set the Encode::LEAVE_SRC bit in its third argument.

    If you are encoding to anything other than UTF-8, or if your string may contain characters outside of Unicode, then you should consider telling encode() to be strict about characters that it can't encode, such as when you try to encode "ŵ" into an ISO-Latin-1 byte sequence. That's what the Encode::FB_CROAK bit is about in the example - in real code the encode should be in a try / catch block to deal with the exception that may arise. Encode 's documentation has a whole section on handling malformed data.

    Encode::decode

    It is quite common for us to receive data, either from a network connection or from a library, which is a UTF-8-encoded byte stream. Naively treating this as ISO-Latin-1 characters will lead to doom and disaster, as the byte sequence 0xc3 0xa9 will, as already explained, be interpreted as the characters "Ã" and "©". Encode::decode() takes a bunch of bytes and turns them into characters assuming that they are in a specified encoding. For example, this will return a "é" character:

    decode("UTF-8", chr(0xc3).chr(0xa9), Encode::FB_CROAK)
    

    You should consider how to handle a byte stream that turns out to not be valid in your desired encoding and again I recommend use of Encode::FB_CROAK .

    Encode:: everything else

    The "Encode" module provides some other functions that, on the surface, look useful. They are, mostly, not.

    Remember how waaaay back I briefly mentioned that perl’s internal representation for non-ISO-Latin-1 characters was UTF-8-ish and how they could contain invalid characters? That’s why you shouldn’t use encode_utf8 or decode_utf8 . You may be tempted to use Encode::is_utf8() to check a string's encoding. Don't, for the same reason.

    You will generally not be calling encode() with a string literal as its input, but with a variable. If you do, though, any errors like "Modification of a read-only value attempted" are your own fault: you should have passed Encode::LEAVE_SRC .

    Don't even think about using the _utf8_on and _utf8_off functions. They are only useful for deliberately breaking things at a lower level than you should care about.

    Debugging

    The UTF8 flag

    The UTF8 flag is a reliable indicator that the underlying representation uses multiple bytes per non-ASCII character, but that’s about it. It is not a reliable indicator of whether a string’s underlying representation is valid UTF-8, or of whether the string is valid Unicode.

    The result of this:

    Encode::encode("UTF-8", chr(0xe9), Encode::LEAVE_SRC)
    

    is a string whose underlying representation is valid UTF-8 but the flag is off.

    This, on the other hand, has the flag on, but the underlying representation is not valid UTF-8 because the character is out of range:

    chr(2097153)
    

    This is an invalid character in Unicode, but perl encodes it (it has to encode it so it can store it) and turns the UTF8 flag on (so that it knows how the underlying representation is encoded):

    chr(0xfff8)
    

    And finally, this variable that someone else’s broken code might pass to you contains an invalid encoding of a valid character:

    my $str = chr(0xf0).chr(0x82).chr(0x82).chr(0x1c);
    Encode::_utf8_on($str);
    

    Devel::Peek

    This module is very useful for peeking inside perl variables, and in particular for seeing what perl thinks a string’s characters and its underlying representation are. It exports a Dump() function which prints details of the internals of its argument to STDERR. For example:

    $ perl -MDevel::Peek -E 'Dump(chr(0xe9))'
    SV = PV(0x7fa98980b690) at 0x7fa98a00bf90
      REFCNT = 1
      FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK)
      PV = 0x7fa989408170 "\351"\0
      CUR = 1
      LEN = 10

    The two important things to look at when debugging character encoding problems are the lines beginning FLAGS = and PV = . Here the UTF8 flag is not set, indicating that the string uses the single-byte ISO-Latin-1 encoding, and the string’s underlying representation is shown (in octal, annoyingly) as \351. Here’s what it looks like when the string contains a code point from outside ISO-Latin-1, or has been decoded as UTF-8 from a byte stream:

    $ perl -MDevel::Peek -E 'Dump(chr(0x3bb))'
    SV = PV(0x7ff37e80b090) at 0x7ff388012390
      REFCNT = 1
      FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK,UTF8)
      PV = 0x7ff37f907350 "\316\273"\0 [UTF8 "\x{3bb}"]
      CUR = 2
      LEN = 10

    The UTF8 flag is now shown, the underlying representation is shown as the two octal bytes \316\273, and the character that that byte sequence represents is also shown (in hexadecimal where necessary - mmm, consistency).

    hexdump

    For debugging input and output I recommend the external hexdump utility. Feed it a file and it will show you the bytes therein, avoiding any clever UTF-8 decoding that your terminal might otherwise do if you simply cat the file:

    $ cat greek
    αβγ
    $ hexdump -C greek
    00000000  ce b1 ce b2 ce b3 0a                              |.......|
    00000007

    It can of course also read from STDIN.

    PerlIO::get_layers

    If you are confident that your code is doing nothing wrong but your data is still being mangled on input or output, you can see which encoding layers are in use on a handle with the PerlIO::get_layers() function. PerlIO is a special built-in namespace, so you don’t need to use it; in fact, if you try to use PerlIO it will fail, because it doesn’t exist as a module. The layers are returned as a list, in the same order that they were given to open(). (If you care about the difference between the :utf8 and :encoding(UTF-8) layers, see the doco. If you have diligently followed this document’s wise advice you won’t be using :utf8 , so you won’t care.) Layers can apply to any handle, not just file handles; if you are dealing with a socket, remember that the input side and the output side can each have their own layers.

    See also the PerlIO manpage.

    The many ways of writing a character

    There are many ways of representing a character in your code.

    String literals

    "m"

    For the reasons explained above, only use these for ASCII characters.

    The chr function

    This function takes a number as its argument and returns the character with that codepoint. For example, chr(0x3bb) returns λ.

    Octal

    "\155"

    You can use up to three octal digits, but only for ISO-Latin-1 characters. Don’t, though: octal is a less familiar encoding than hexadecimal and marginally harder to read, and it also suffers from the "how long is this number" problem described below.

    Hexadecimal

    "\x{e9}"

    You can put as many hex digits as you like between the braces. There is also a brace-less version, "\xe9". That can only take one or two hex digits, so it is only valid for ISO-Latin-1 characters, and the lack of delimiters can lead to confusion and errors. Consider "\xa9". Is that \xa (the newline character) followed by the digit 9? Or is it \xa9, the copyright symbol? Brace-less \x is greedy: if it looks like there are two hex digits, it assumes that there are, and only if the first digit is followed by the end of the string or by a non-hex-digit does it assume that you meant the single-digit form. This means, for example, that \xap contains a single hex digit, and so is \x{0a}p: a newline followed by the letter p. I think you will agree that braces make things much clearer, and that the brace-less variant should therefore be considered deprecated.

    By codepoint name
    "\N{GREEK SMALL LETTER LAMDA}"
    

    This may sometimes be preferable to providing the (hexa)decimal codepoint with an associated comment, but it gets awful wordy awful fast. By default the name must correspond exactly to that in the Unicode standard. Shorter aliases are available if you ask for them, via the charnames pragma. The documentation only mentions this for the Greek and Cyrillic scripts, but they are available for all scripts which have letters. For example, these are equivalent:

    "\x{5d0}"
    
    "\N{HEBREW LETTER ALEF}"
    
    use charnames qw(hebrew);
    "\N{ALEF}"                  # א
    

    Be careful if you ask for character-set-specific aliases as there may be name clashes. Both Arabic and Hebrew have a letter called "alef", for example:

    use charnames qw(arabic);
    "\N{ALEF}"                  # ا
    
    use charnames qw(arabic hebrew);
    "\N{ALEF}"                  # Always Hebrew, no matter the order of the imports!
    

    A happy medium is to ask for :short aliases:

    use charnames qw(:short);
    "\N{ALEF}"                           # error
    "\N{hebrew:alef} \N{arabic:alef}"    # does what it says on the tin
    

    Other hexadecimal

    "\N{U+3bb}"
    

    This notation looks a little more like the U+-ish hexadecimal notations used in other languages, while also being a bit like the \N{...} notation for codepoint names. Unless you want to mix hexadecimal with codepoint names, you should probably not use this, and should prefer \x{...} , which is more familiar to perl programmers.

    In regular expressions

    You can use any of the \x{...} and \N{...} variants in regular expressions. You should also consider using the /a modifier, which does things like forcing \d to match only ASCII digits, so that it won’t match, for example, "৪" (BENGALI DIGIT FOUR), which looks like an 8 but isn’t one. See perlunicode .
