2006-11-03

Let's try out urllib.

Pass a URL as the argument to urllib.urlopen
The object that comes back can be treated like a file

That's basically all there is to it. Easy, right?
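
For readers on Python 3, where the closest counterpart is urllib.request.urlopen, the same two points can be sketched roughly as below. The data: URL is only a stand-in chosen so the snippet runs without network access; an http:// URL would be opened the same way.

```python
import urllib.request

# Assumption: a data: URL stands in for a real http:// URL so that
# the example runs offline.
url = "data:text/plain;charset=utf-8,hello%20urllib"

# urlopen returns a file-like response object; read() yields bytes.
with urllib.request.urlopen(url) as response:
    body = response.read()

print(body)
```

Note that read() gives bytes on Python 3, so turning them into text is always an explicit decode step.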

# get.py
import sys
import urllib

print urllib.urlopen(sys.argv[1]).read()

Here is what a run looks like.

> python get.py http://d.hatena.ne.jp/pythonco/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=euc-jp">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<title>pythonco(ぱいそんこ)の日記</title>
<link rel="start" href="./" title="pythonco(ぱいそんこ)の日記">
<link rel="help" href="/help" title="ヘルプ">
<link rel="prev" href="/pythonco/?of=5" title="前の5日分">
...

> python get.py http://www.hatena.ne.jp
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="ja">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<link rel="start" href="http://www.hatena.ne.jp/" title="・  ・>
<link rel="stylesheet" href="/css/base.css" type="text/css" media="all">
<link rel="search" type="application/opensearchdescription+xml" href="http://search.hatena.ne.jp/opensearch/all.xml" tit
<link rel="search" type="application/opensearchdescription+xml" href="http://q.hatena.ne.jp/opensearch/question.xml" tit

<title>・・・/title>
<meta name="description" content="??・鋍腓障・・・吟㏍若・若х蕭罘純㏍違篋冴℡査膈篋阪罎膣・Q&A鐚純若激ｃ・・若・RSS・若
<meta name="keywords" content="・・・hatena,篋阪罎膣↑純若激ｃ・・若・㏍穐ゃ≪・蒔RSS,≪・祉壕В%
...

The output came out garbled. Nothing for it, so let's look at the charset in the response header and re-encode accordingly.

# get.py
import sys
import urllib

f = urllib.urlopen(sys.argv[1])
c = f.headers.getparam('charset')
print c
print unicode(f.read(), c).encode('euc-jp')

> python get.py http://www.hatena.ne.jp/
utf-8
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="ja">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<link rel="start" href="http://www.hatena.ne.jp/" title="はてな">
<link rel="stylesheet" href="/css/base.css" type="text/css" media="all">
<link rel="search" type="application/opensearchdescription+xml" href="http://search.hatena.ne.jp/opensearch/all.xml" tit
<link rel="search" type="application/opensearchdescription+xml" href="http://q.hatena.ne.jp/opensearch/question.xml" tit

<title>はてな</title>
<meta name="description" content="株式会社はてなが運営。キーワードで繋がる高機能ブログ、人がたずね人が答える人力検索（Q&
く便利になるサービスが揃っています。">
<meta name="keywords" content="はてな,hatena,人力検索,ソーシャルブックマーク,ブログ,ダイアリー,RSS,アクセス解析">
<link rel="stylesheet" href="/css/top.css" type="text/css" media="all">
<link rel="stylesheet" href="/css/minwidth.css" type="text/css" media="all">
<SCRIPT type="text/javascript">
<!--
var cookie_name = "PORTAL_TAB";
...
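
A Python 3 sketch of the same idea, reading the charset from the Content-Type header and decoding with it. The data: URL is an offline stand-in for a real page, and get_content_charset() plays the role that getparam('charset') plays in the script above.

```python
import urllib.request

# Assumption: an offline data: URL replaces a real http:// page.
# The percent-escapes are the UTF-8 bytes of the word "はてな".
url = "data:text/html;charset=utf-8,%E3%81%AF%E3%81%A6%E3%81%AA"

with urllib.request.urlopen(url) as response:
    charset = response.headers.get_content_charset()
    text = response.read().decode(charset)

# Re-encode for an EUC-JP terminal, as the original script does.
euc_bytes = text.encode("euc-jp")
print(charset, euc_bytes)
```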

I thought that settled it, but some servers don't return a charset at all, which is a problem.

> python get.py http://www.google.co.jp
None
Traceback (most recent call last):
  File "get.py", line 17, in ?
    print unicode(f.read(), c).encode('euc-jp')
TypeError: unicode() argument 2 must be string, not None
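
The failure comes from passing None as the encoding name. A minimal guard, sketched here in Python 3, is to fall back to a default when the header carries no charset. The fallback value "utf-8" and the helper name decode_body are assumptions for illustration, not something the original script does.

```python
import urllib.request

def decode_body(response, fallback="utf-8"):
    """Decode a response body, using `fallback` when no charset is sent.

    `decode_body` and its `fallback` default are hypothetical names
    introduced for this sketch.
    """
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset), charset

# Assumption: this offline data: URL mimics a server that sends
# Content-Type: text/html with no charset parameter.
with urllib.request.urlopen("data:text/html,%3Cp%3Ehi%3C/p%3E") as response:
    text, used = decode_body(response)

print(used, text)
```

A fixed fallback only helps when the guess happens to match the page's real encoding.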

Nothing for it, so let's try converting to unicode with every codec in turn and adopt the first one that doesn't raise an exception.

# get.py
import sys
import urllib
import encodings

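def encode(s, c):
  # Try every known codec alias; return the text re-encoded as c
  # together with the first alias for which both steps succeed.
  for i in encodings.aliases.aliases:
    try:
      return unicode(s, i).encode(c), i
    except Exception:
      pass
  # Nothing worked: hand back the raw bytes.
  return s, None

s, e = encode(urllib.urlopen(sys.argv[1]).read(), 'euc-jp')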
print e
print s

> python get.py http://www.google.co.jp
mskanji
<html><head><meta http-equiv="content-type" content="text/html; charset=Shift_JIS"><title>Google</title><style><!--
body,td,a,p,.h{font-family:arial,sans-serif}
.h{font-size:20px}
.h{color:#3366cc}
.q{color:#00c}
--></style>
<script>
<!--
function sf(){document.f.q.focus();}
// -->
</script>
</head><body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000 onload=sf() topmargin=3 marginheight=
idth=100%><font size=-1><a href="/url?sa=p&pref=ig&pval=3&q=http://www.google.co.jp/ig%3Fhl%3Dja&usg=__zzRfDx_8QtfWxWHbI
https://www.google.com/accounts/Login?continue=http://www.google.co.jp/&hl=ja">ログイン</a></font></div><img alt="Google
br><form action="/search" name=f><table border=0 cellspacing=0 cellpadding=4><tr><td nowrap><font size=-1><b>ウェブ</b>&
o.jp/imghp?ie=Shift_JIS&oe=Shift_JIS&hl=ja&tab=wi">イメージ</a>&nbsp;&nbsp;&nbsp;&nbsp;<a class=q href="http://news.goog
...

Well, I suppose this will do for now... maybe.
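
One reason to hesitate: the loop accepts the first alias for which decoding and re-encoding both succeed, and the order in which a dictionary yields its keys is not something to rely on, so the winner is not guaranteed to be the page's real encoding. A hedged alternative, sketched in Python 3, is to try a short ordered list of likely encodings. The list, its order, and the function name guess_decode are assumptions for illustration.

```python
def guess_decode(data, candidates=("utf-8", "euc-jp", "shift_jis")):
    """Return (text, codec) for the first candidate that decodes `data`.

    Hypothetical helper: the candidate list and its order are assumptions.
    Returns (None, None) when every candidate fails.
    """
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return None, None

# Sample bytes: a word from the page above, encoded as Shift_JIS.
sample = "ログイン".encode("shift_jis")
text, codec = guess_decode(sample)
print(codec)
```

Even a short list is still a guess; it only narrows down which wrong answers are possible.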

References

http://effbot.org/librarybook/simplehttpserver.htm

The reference overrides do_GET of SimpleHTTPServer.SimpleHTTPRequestHandler, but since all I need is do_GET and copyfile, I figured BaseHTTPServer.BaseHTTPRequestHandler would do.

# -*- coding: euc-jp -*-
import BaseHTTPServer
import os
import shutil
import StringIO
import urllib

class ProxyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
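  def encode(self, s, c):
    # Try a few Japanese encodings in order; return s re-encoded as c
    # plus the encoding that worked, or (s, None) if none did.
    for i in ('euc-jp', 'sjis', 'utf-8'):
      try:
        return unicode(s, i).encode(c), i
      except Exception:
        pass
    return s, None

  def do_GET(self):
    # In a proxy request, self.path holds the full URL.
    print 'get...', self.path
    f = urllib.urlopen(self.path)
    # Only rewrite text responses whose URL looks like a page.
    if 'text' in f.info().gettype() and os.path.splitext(self.path)[1] in ('', '.htm', '.html', '.cgi', '.php'):
      s, e = self.encode(f.read(), 'euc-jp')
      if e:
        s = s.replace('。', '（笑')
        # Convert back to the page's original encoding.
        f = StringIO.StringIO(self.encode(s, e)[0])
    # Only the body is copied; no status line or headers are sent.
    shutil.copyfileobj(f, self.wfile)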

port = 3128
print 'proxy at', port
BaseHTTPServer.HTTPServer(('', port), ProxyHandler).serve_forever()

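Usage

Start it with python proxy.py
Fiddle with your browser settings so it connects through the running proxy
- Point the HTTP proxy at the machine running the script, port 3128
When you view a Japanese page, every '。' (the Japanese full stop) turns into '（笑' (roughly "(lol") and it gets on your nerves

Sometimes the text doesn't get rewritten; that's by design. Just kidding. I don't really know why.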
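
One possible contributor to pages occasionally not being rewritten, offered as a guess rather than a diagnosis: the handler applies os.path.splitext to the whole URL, so a query string ends up inside the extension and the page fails the extension check. The Python 3 sketch below, using a made-up example.com URL and hypothetical helper names, shows the effect and a variant that looks only at the path component.

```python
import posixpath
from urllib.parse import urlsplit

REWRITABLE = ("", ".htm", ".html", ".cgi", ".php")

def ext_of_whole_url(url):
    # The handler uses os.path.splitext on the raw URL; posixpath is
    # used here so the result does not depend on the platform.
    return posixpath.splitext(url)[1]

def ext_of_path(url):
    # Variant: drop the query string before looking at the extension.
    return posixpath.splitext(urlsplit(url).path)[1]

url = "http://example.com/diary.php?page=2"  # hypothetical URL
print(ext_of_whole_url(url))
print(ext_of_path(url))
```

Other causes, such as browser caching or a page whose bytes decode under none of the three encodings, are just as plausible and are not covered by this sketch.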

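Processing flow

When the browser tries to GET something, do_GET is called and self.path holds the URL the browser is requesting
urllib.urlopen GETs self.path
The result is a file-like object, so read() pulls out just the string, which then gets rewritten
StringIO turns the string back into a file-like object
shutil.copyfileobj copies that file-like object into self.wfile
Whatever ends up in self.wfile is what the browser finally displays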
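
The read, rewrite, wrap, and copy steps can be tried without a network or a browser. In this Python 3 sketch, io.BytesIO objects stand in for both the fetched response and self.wfile, and the helper name rewrite_stream is hypothetical; '。' is replaced by '（笑' as in the handler.

```python
import io
import shutil

def rewrite_stream(source, charset):
    """Read bytes from `source`, swap the punctuation, return a new stream.

    Hypothetical helper mirroring read -> replace -> StringIO in do_GET.
    """
    text = source.read().decode(charset)
    text = text.replace("。", "（笑")
    return io.BytesIO(text.encode(charset))

# Stand-ins (assumptions): `fetched` plays the urlopen result and
# `wfile` plays self.wfile.
fetched = io.BytesIO("今日は晴れです。".encode("euc-jp"))
wfile = io.BytesIO()

shutil.copyfileobj(rewrite_stream(fetched, "euc-jp"), wfile)

# The raw bytes in wfile are what a browser would receive.
print(wfile.getvalue())
```

Decoding to text before replacing, rather than replacing on the encoded bytes, keeps the match aligned to whole characters.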