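While scraping data from a Facebook page, my request kept coming back with a 404 error, even though the same page could be downloaded through a browser or with the curl command.
After Googling a bit, I found that someone had already solved this problem. To my surprise, their source code reads like a remarkable Python textbook: it walks through the text-encoding problems you run into across Python versions, and comes with perfect solutions!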
Source code:
https://github.com/rg3/youtube-dl/blob/master/youtube_dl/compat.py
GitHub project:
https://github.com/rg3/youtube-dl
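To give a feel for the kind of version-compatibility handling that file is about, here is a minimal sketch in the same spirit. It is not copied from compat.py: the helper `to_text` is a hypothetical name made up for illustration, and the only assumption is that text is `unicode` on Python 2 and `str` on Python 3.

```python
# Pick the text type for the running interpreter.
try:
    compat_str = unicode  # Python 2: text lives in `unicode`
except NameError:
    compat_str = str  # Python 3: `str` is already text


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as text, decoding bytes if needed."""
    if isinstance(value, bytes) and not isinstance(value, compat_str):
        return value.decode(encoding)
    return value
```

With something like this, the rest of a script can call `to_text(...)` on whatever a download returns and work with text from there on, regardless of the Python version.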
Since the curl command works anyway, I first saved curl's output to an external file and then opened that file, which solved the problem.
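The workaround could be scripted roughly as follows. This is only a sketch: the function names `save_with_curl` and `read_saved_page` are made up for illustration, and it assumes that curl is available on the PATH and that the saved page is UTF-8 text.

```python
import subprocess


def save_with_curl(url, out_path):
    """Run `curl URL -o OUT_PATH`; assumes curl is installed and on PATH."""
    # Without --fail, curl exits with 0 even when the server answers 404,
    # so check=True only catches failures such as DNS or connection errors.
    subprocess.run(["curl", url, "-o", out_path], check=True)


def read_saved_page(path, encoding="utf-8"):
    """Open the file written by curl and return its contents as text."""
    # errors="replace" keeps reading even if a few bytes are not valid UTF-8.
    with open(path, encoding=encoding, errors="replace") as handle:
        return handle.read()
```

Calling `save_with_curl(url, "fb_curl.txt")` followed by `read_saved_page("fb_curl.txt")` would then mirror the two manual steps described here.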
curl command:
curl https://www.facebook.com/dai.d.jie -o fb_curl.txt -v
Output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
*   Trying 31.13.87.36...
* TCP_NODELAY set
* Connected to www.facebook.com (31.13.87.36) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.facebook.com
* Server certificate: DigiCert SHA2 High Assurance Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET /dai.d.jie HTTP/1.1
> Host: www.facebook.com
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< X-Frame-Options: DENY
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 0
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Origin: https://www.facebook.com
< Access-Control-Expose-Headers: X-FB-Debug, X-Loader-Length
< Vary: Origin
< Pragma: no-cache
< public-key-pins-report-only: max-age=500; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="r/mIkG3eEpVdm+u/ko/cwxzOMo1bk4TyHIlByibiA5E="; pin-sha256="q4PO2G2cbkZhZ82+JgmRUyGMoAeozA+BSXVXQWB8XWQ="; report-uri="http://reports.fb.com/hpkp/"
< access-control-allow-method: OPTIONS
< Expires: Sat, 01 Jan 2000 00:00:00 GMT
< Strict-Transport-Security: max-age=15552000; preload
< Cache-Control: private, no-cache, no-store, must-revalidate
< Set-Cookie: fr=0tKalL3WTb9wBS4Wd..BZb3wp.Mi.AAA.0.0.BZb3wp.AWVgsoOu; expires=Tue, 17-Oct-2017 15:35:05 GMT; Max-Age=7776000; path=/; domain=.facebook.com; secure; httponly
< Vary: Accept-Encoding
< Content-Type: text/html; charset=UTF-8
< X-FB-Debug: MO3rULHEHjyoIquCcfDT+xkVWnfF7yy3rRngUAWTDnV0ZKTSy0zEo1cY4E4XjxJ6uWidLsB+1KbIka6IFTQLvw==
< Date: Wed, 19 Jul 2017 15:35:05 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
<
{ [329 bytes data]
* Curl_http_done: called premature == 0
100  128k    0  128k    0     0   395k      0 --:--:-- --:--:-- --:--:--  396k
* Connection #0 to host www.facebook.com left intact
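Note that the output above shows the server answering `404 Not Found` while still sending about 128k of body, which curl writes to the file regardless. A possible reading, which is my assumption and not something I have confirmed against Facebook, is that a Python script sees the same 404 but `urllib` raises `HTTPError` before the body is read. Since `HTTPError` can itself be read like a response, the body can still be recovered. A minimal Python 3 sketch, where `fetch_even_on_error` is a hypothetical name and the default headers simply copy the ones curl sent above:

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_even_on_error(url, user_agent="curl/7.51.0", timeout=10):
    """Return (status_code, body_bytes), keeping the body of error responses."""
    # Mimic the request headers from the curl log; whether the server
    # treats other User-Agent values differently is not verified here.
    request = Request(url, headers={"User-Agent": user_agent, "Accept": "*/*"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as error:
        # HTTPError doubles as a response object, so the 404 body is readable.
        return error.code, error.read()
```

If this assumption holds, the returned bytes can be decoded as UTF-8 (the `Content-Type` header above says `charset=UTF-8`) and parsed directly, without going through an external file.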