Chinese directory names and os.path.join in Python

2017-03-06 / 2017-03-08

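Encoding problems are something you are likely to run into in Python 2.x. In Python 3.x every string is handled uniformly as Unicode, whereas in Python 2.x Chinese text held in a str is a byte stream, that is, the binary encoding of the text. The same file name can be represented differently on different platforms; the encoding in use is reported by sys.getfilesystemencoding(). On Mac it is utf-8, on Linux it follows the locale (usually utf-8), and on Windows it may be utf-8 or the ANSI code page ('mbcs'), depending on the Python version.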
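To check what your own machine reports, a minimal sketch like the following may help. It assumes Python 3 (where the default encoding is utf-8); the variable names are only illustrative.

```python
import sys

# Encoding Python uses to convert between text file names and the
# byte names that the operating system stores.
fs_encoding = sys.getfilesystemencoding()

# Encoding used for implicit conversions between bytes and text
# ('ascii' on Python 2, 'utf-8' on Python 3).
default_encoding = sys.getdefaultencoding()

print("filesystem encoding:", fs_encoding)
print("default encoding:", default_encoding)
```

The values differ between platforms and Python versions, so it is worth running this on the machine where the problem actually occurs.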

os.path.join(path, filename)

If path and filename are both unicode objects, everything works as expected. But what happens if filename is a UTF-8 encoded str that contains “öäü.jpg”?

Python always produces a unicode object when it joins a str with a unicode object. To do so, it decodes the str with its default encoding (ascii), as if you had written:

os.path.join(path, filename.decode('ascii'))

Because öäü are not part of the ASCII code page, you get a UnicodeDecodeError! That is why you must convert the string to unicode explicitly:

os.path.join(path, filename.decode('utf-8'))
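The failure and the fix can be reproduced with a short sketch. This version is written for Python 3, so the Python 2 str is simulated with a bytes object; the directory name "photos" is a hypothetical placeholder.

```python
import os.path

path = u"photos"                        # hypothetical directory name (text)
filename = u"öäü.jpg".encode("utf-8")   # the bytes a Python 2 str would hold

# Decoding with ascii fails, because none of the leading bytes are ASCII.
try:
    filename.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the real encoding succeeds, and the join is text + text.
joined = os.path.join(path, filename.decode("utf-8"))

# ascii() keeps the output printable on any terminal encoding.
print(ascii(joined))
```

Note that Python 3 does not attempt the implicit ascii decode at all: joining a str with a bytes object there raises a TypeError instead.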

Here’s some interesting stuff from the documentation:

sys.getfilesystemencoding()

Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used. The result value depends on the operating system: On Mac OS X, the encoding is ‘utf-8’. On Unix, the encoding is the user’s preference according to the result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET) failed. On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns ‘mbcs’, as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names. On Windows 9x, the encoding is ‘mbcs’.

New in version 2.3.

If I understand this correctly, you should pass the file name as unicode:

f = open(unicode(path, encoding))
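As a rough illustration of passing text paths all the way through, the sketch below creates a scratch directory with a Chinese name and opens a file inside it. It assumes Python 3 and a filesystem encoding that can represent Chinese (such as utf-8); the directory and file names are made up for the example.

```python
import io
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                  # scratch directory for the demo
folder = os.path.join(base, u"中文目錄")    # hypothetical Chinese directory name
os.mkdir(folder)
file_path = os.path.join(folder, u"測試.txt")

try:
    # The path is text; the content encoding is stated explicitly.
    with io.open(file_path, "w", encoding="utf-8") as f:
        f.write(u"hello")
    with io.open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    names = os.listdir(folder)             # text path in, text names out
finally:
    shutil.rmtree(base)                    # clean up the scratch directory
```

Because every path here is text from the start, no implicit ascii decode is ever triggered.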

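A summary of Python encoding issues
A few concepts
ASCII:
Standard ASCII uses only 7 bits to represent a character, so it can encode at most 128 characters. Extended ASCII uses 8 bits per character and can still encode at most 256 characters.
Unicode:
Assigns every character a single code point, typically stored in 2 or even 4 bytes, so that all the world's characters can be encoded in one uniform scheme.
UTF:
Unicode Transformation Format. It specifies how to encode unicode into a byte sequence suitable for file storage and network transmission (unicode -> str). Other encodings such as gb2312, gb18030, and big5 serve the same purpose as UTF; only the encoding scheme differs.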
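To make the point about encodings concrete, the following sketch encodes the same two characters with three codecs and compares the resulting bytes. It assumes Python 3 (for the built-in ascii()); the codec names are the standard ones shipped with Python.

```python
text = u"中文"

# One piece of text, three different byte sequences.
encoded = {name: text.encode(name) for name in ("utf-8", "big5", "gb18030")}

for name, data in sorted(encoded.items()):
    # ascii() keeps the output printable on any terminal encoding.
    print(name, len(data), ascii(data))

# Each byte sequence decodes back to the same text with its own codec.
round_trips = [data.decode(name) == text for name, data in encoded.items()]
```

The UTF-8 form takes 6 bytes while the Big5 and GB18030 forms take 4 each, which is why bytes are meaningless without knowing which encoding produced them.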
Python has two data types for strings, str and unicode. Both are sequence types; see the description in the Python Language Reference:

Strings

The items of a string are characters. There is no separate
character type; a character is represented by a string of one item.
Characters represent (at least) 8-bit bytes. The built-in functions
chr() and ord() convert between characters and nonnegative integers
representing the byte values. Bytes with the values 0-127 usually
represent the corresponding ASCII values, but the interpretation of
values is up to the program. The string data type is also used to
represent arrays of bytes, e.g., to hold data read from a file.
(On systems whose native character set is not ASCII, strings
may use EBCDIC in their internal representation, provided the
functions chr() and ord() implement a mapping between ASCII and
EBCDIC, and string comparison preserves the ASCII order. Or perhaps
someone can propose a better rule?)

Unicode

The items of a Unicode object are Unicode code units. A
Unicode code unit is represented by a Unicode object of one item and
can hold either a 16-bit or 32-bit value representing a Unicode
ordinal (the maximum value for the ordinal is given in sys.maxunicode,
and depends on how Python is configured at compile time). Surrogate
pairs may be present in the Unicode object, and will be reported as
two separate items. The built-in functions unichr() and ord() convert
between code units and nonnegative integers representing the Unicode
ordinals as defined in the Unicode Standard 3.0. Conversion from and
to other encodings are possible through the Unicode method encode()
and the built-in function unicode().

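The key sentences here are:
 "The items of a string are characters", "The items of a Unicode object are Unicode code units", "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file."
 The first two sentences say what the items of str and unicode are (both are sequences). A sequence's __len__ returns the number of items it contains, which makes it easy to see why len('abcd') == 4 and len(u'我是中文') == 4.
 The third sentence tells us that data read from or written to a file is represented as a str, that is, as an array of bytes. This is presumably true of network transmission as well, not only file operations. That is why a unicode string has to be encoded before it is written to a file or sent over the network.
 Encoding and decoding in Python are simply conversions between these two forms: encoding is unicode -> str, and decoding is the reverse, str -> unicode.
 What remains is deciding when to encode or decode. Some libraries work in unicode, so when we send their return values over the network or write them to a file we have to encode them into a suitable form.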
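The encode-before-writing, decode-after-reading cycle might look like this. It is a sketch that assumes Python 3 naming (bytes for the encoded form) and uses a throwaway temporary file.

```python
import os
import tempfile

text = u"我是中文"

data = text.encode("utf-8")           # encode: text -> bytes

fd, tmp_path = tempfile.mkstemp()     # scratch file with an ASCII name
try:
    with os.fdopen(fd, "wb") as f:    # binary mode: the file only stores bytes
        f.write(data)
    with open(tmp_path, "rb") as f:
        raw = f.read()
finally:
    os.remove(tmp_path)

restored = raw.decode("utf-8")        # decode: bytes -> text
```

Here len(text) is 4 while len(data) is 12, since each of these characters takes three bytes in UTF-8.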
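 About the "encoding declaration" at the top of a file, that is, the # -*- coding: -*- line: Python assumes by default that script files are ASCII encoded, so when a file contains characters outside the ASCII range, the encoding declaration is needed to correct that.
 About the default encoding (sys.getdefaultencoding()): it is used when a decode happens without an explicitly specified encoding. Take the following code: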
 #! /usr/bin/env python
 # -*- coding: utf-8 -*-
 s = '中文'  # note: s is a str here, not unicode
 s.encode('gb18030')
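 The last line re-encodes s as gb18030, that is, it performs a unicode -> str conversion. Because s is already a str, Python automatically decodes s to unicode first and then encodes it as gb18030. Since the decoding is done automatically and we did not specify an encoding, Python uses the default encoding reported by sys.getdefaultencoding(). In many cases that is ASCII, and if s is not ASCII data an error occurs.
 In the example above, my default encoding is ascii, while s is encoded the same way as the source file, in utf-8, so it fails: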
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position
 0: ordinal not in range(128)
 There are two ways to fix this error.
 The first is to state the encoding of s explicitly:
 #! /usr/bin/env python
 # -*- coding: utf-8 -*-
 s = '中文'
 s.decode('utf-8').encode('gb18030')
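The same decode-then-encode step can be wrapped in a small helper. In the sketch below, transcode is a hypothetical name rather than a library function, and the Python 2 str is simulated with bytes so the code runs on Python 3.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes: decode with the real source encoding, then encode."""
    return data.decode(source_encoding).encode(target_encoding)


s = u"中文".encode("utf-8")   # the bytes a Python 2 str holds in a UTF-8 file

# What Python 2 does implicitly: decode with the ascii default first.
try:
    s.decode("ascii").encode("gb18030")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# The explicit version names the source encoding and succeeds.
converted = transcode(s, "utf-8", "gb18030")
print(message)
```

Naming both encodings at the call site keeps the conversion independent of whatever the interpreter's default encoding happens to be.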

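Python 2 splits strings into two kinds of objects, str and unicode. When the string instance "中文" is created in a UTF-8 source file, it is in its UTF-8 encoded state. To get the Unicode form of "中文", call .decode(). Alternatively, prefix the string with the letter u, as in u"中文", to create a unicode instance directly.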

input ###

print type("中文")
print type("中文".decode("utf-8"))
print type(u"中文")

output ###

<type 'str'>
<type 'unicode'>
<type 'unicode'>

To convert from the UTF-8 encoded state to the Big5 encoded state, you must first decode the encoded string to Unicode and then re-encode it as Big5.

print "中文" # encoded in utf-8
print "中文".decode("utf-8").encode("big5") # encoded in big5

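Beyond producing garbled text, mismatched encodings also cause errors in text analysis; the most obvious example is counting characters. When you use the built-in function len() to count the characters in "中文", the result is not the expected 2 but 6.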

input ###

print len("中文")
print len("中文".decode("utf-8"))
print len(u"中文")

output ###

6
2
2

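This follows from how len() works: it counts the items of the sequence, and the items of an encoded str are bytes, so an encoded string cannot give the correct character count. To fix this, always count on the unicode form. This is also why the first step in many Chinese text-processing packages is to convert strings to unicode.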
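A "convert to unicode first" step could be sketched as below. ensure_text is a hypothetical helper name, and the utf-8 default is an assumption that only holds when the input really is UTF-8.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = u"中文".encode("utf-8")          # encoded form: six bytes

byte_count = len(raw)                  # counts bytes
char_count = len(ensure_text(raw))     # counts characters
print(byte_count, char_count)
```

Calling the helper at the boundary where data enters the program means the rest of the code can rely on len() and slicing working per character.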

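Python 3 splits strings into two kinds of objects, str and bytes, which correspond to Unicode and the encoded state respectively. Notice the difference? It means that when you create a string instance such as "中文", it is already Unicode text. This Unicode-first design avoids many of the pitfalls caused by mismatched encodings, such as the len() character-count problem shown earlier.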

input ###

print(type("中文"))
print(type("中文".encode("utf-8")))
print(type(u"中文"))
print(len("中文"))

output ###

<class 'str'>
<class 'bytes'>
<class 'str'>
2

Sources:

Python not able to open file with non-english characters in path
http://stackoverflow.com/questions/2004137/unicodeencodeerror-on-joining-file-name

python os.walk filename ‘ascii’ codec can’t decode
https://stackoverflow.max-everyday.com/2017/03/os-walk-filename-ascii-codec-cant-decode/

Max's Programming Language Notes
