urllib2.urlopen raises a 404 exception for URLs that the browser opens

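While scraping data from Facebook pages, I ran into a problem: the request fails with a 404 error, yet the same URL downloads fine through a browser or the curl command.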

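After some Googling, I found that someone had already solved this problem. To my surprise, the source code reads like a Python textbook: it works through the text-encoding issues that come up across Python versions and includes a clean fix for each.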

source code:

https://github.com/rg3/youtube-dl/blob/master/youtube_dl/compat.py

github project:

https://github.com/rg3/youtube-dl

 

Since the curl command works anyway, I saved curl's output to an external file and then opened that file from Python, which solved the problem.
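The workaround can be sketched in Python roughly as follows. This is a minimal sketch, not the exact script used for this post: the helper names `build_curl_command`, `download_with_curl`, and `read_saved_page` are made up for illustration, and it assumes `curl` is on the PATH and that the page is UTF-8.

```python
import io
import subprocess


def build_curl_command(url, out_path):
    # Hypothetical helper: same shape as the curl command in this post, minus -v.
    return ["curl", url, "-o", out_path]


def read_saved_page(path, encoding="utf-8"):
    # Read back the file that curl wrote; UTF-8 is an assumption.
    with io.open(path, encoding=encoding) as f:
        return f.read()


def download_with_curl(url, out_path):
    # Without --fail, curl exits with status 0 even when the server answers 404,
    # so check_call only raises on transport-level failures.
    subprocess.check_call(build_curl_command(url, out_path))
    return read_saved_page(out_path)
```

Calling `download_with_curl("https://www.facebook.com/dai.d.jie", "fb_curl.txt")` would run the same download as the command in this post and return the saved text.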


curl command:

curl https://www.facebook.com/dai.d.jie -o fb_curl.txt -v

Output:

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 31.13.87.36...
* TCP_NODELAY set
* Connected to www.facebook.com (31.13.87.36) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.facebook.com
* Server certificate: DigiCert SHA2 High Assurance Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET /dai.d.jie HTTP/1.1
> Host: www.facebook.com
> User-Agent: curl/7.51.0
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< X-Frame-Options: DENY
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 0
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Origin: https://www.facebook.com
< Access-Control-Expose-Headers: X-FB-Debug, X-Loader-Length
< Vary: Origin
< Pragma: no-cache
< public-key-pins-report-only: max-age=500; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="r/mIkG3eEpVdm+u/ko/cwxzOMo1bk4TyHIlByibiA5E="; pin-sha256="q4PO2G2cbkZhZ82+JgmRUyGMoAeozA+BSXVXQWB8XWQ="; report-uri="http://reports.fb.com/hpkp/"
< access-control-allow-method: OPTIONS
< Expires: Sat, 01 Jan 2000 00:00:00 GMT
< Strict-Transport-Security: max-age=15552000; preload
< Cache-Control: private, no-cache, no-store, must-revalidate
< Set-Cookie: fr=0tKalL3WTb9wBS4Wd..BZb3wp.Mi.AAA.0.0.BZb3wp.AWVgsoOu; expires=Tue, 17-Oct-2017 15:35:05 GMT; Max-Age=7776000; path=/; domain=.facebook.com; secure; httponly
< Vary: Accept-Encoding
< Content-Type: text/html; charset=UTF-8
< X-FB-Debug: MO3rULHEHjyoIquCcfDT+xkVWnfF7yy3rRngUAWTDnV0ZKTSy0zEo1cY4E4XjxJ6uWidLsB+1KbIka6IFTQLvw==
< Date: Wed, 19 Jul 2017 15:35:05 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< 
{ [329 bytes data]
* Curl_http_done: called premature == 0
100 128k 0 128k 0 0 395k 0 --:--:-- --:--:-- --:--:-- 396k
* Connection #0 to host www.facebook.com left intact
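One thing stands out in the log above: the server answers `HTTP/1.1 404 Not Found` to curl as well, but curl still writes the 128k body to the file. `urllib2.urlopen` raises `HTTPError` for that status instead of returning the body. Since `HTTPError` is itself a file-like response, one possible alternative to the curl workaround is to catch the exception and read the body from it. The sketch below is an assumption about how that could look, not something tested against Facebook; `fetch_even_on_error` and its `opener` parameter are hypothetical names.

```python
try:
    # Python 2
    from urllib2 import urlopen, HTTPError
except ImportError:
    # Python 3
    from urllib.request import urlopen
    from urllib.error import HTTPError


def fetch_even_on_error(url, opener=urlopen, timeout=10):
    """Return (status_code, body_bytes), keeping the body of 4xx/5xx responses."""
    try:
        response = opener(url, timeout=timeout)
        return response.getcode(), response.read()
    except HTTPError as err:
        # HTTPError carries the status code and the response body.
        return err.code, err.read()
```

With this, `status, body = fetch_even_on_error(url)` would hand back the page content together with the 404 status, and the caller decides whether the body is still useful.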

 
