Crawling Wandoujia and Yingyongbao with requests through a proxy IP

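This article covers the following:

  1. A brief introduction to the requests library
  2. Crawling the Wandoujia and Yingyongbao sites with requests (through a proxy IP)
  3. Checking whether the proxy IP is actually used to reach the target site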

A brief introduction to the requests library

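A Session object lets you persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance, and it uses urllib3's connection pooling. So if you send several requests to the same host, the underlying TCP connection is reused, which can bring a significant performance gain (see HTTP persistent connection).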

Here is a simple example that sends a GET request through a Session object:

# coding=utf-8
import requests
from lxml import html

if __name__ == '__main__':
    s = requests.Session()
    url = 'https://www.wandoujia.com/apps/com.duowan.kiwi'
    resp = s.get(url)
    resp.encoding = 'utf-8'
    source = resp.text
    s_html = html.fromstring(source)
    # title = s_html.xpath("//span[@class='title']/text()")
    # print(title)  # Chinese text inside a list is shown as Unicode escapes: [u'\u864e\u7259\u76f4\u64ad']
    title = s_html.xpath("//span[@class='title']/text()")[0]
    print(title)  # 虎牙直播
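To make "persisting parameters across requests" concrete without touching the network, here is a minimal sketch that sets a default header on a session and then prepares, but does not send, a request. The User-Agent value `demo-crawler/0.1` is made up for illustration and is not part of the original article.

```python
import requests

s = requests.Session()
# Session-level defaults are merged into every request made through `s`.
s.headers.update({'User-Agent': 'demo-crawler/0.1'})  # hypothetical value

req = requests.Request('GET', 'https://www.wandoujia.com/apps/com.duowan.kiwi')
prepared = s.prepare_request(req)  # merges session settings; nothing is sent

print(prepared.headers['User-Agent'])  # demo-crawler/0.1
```

Every later `s.get(...)` call would carry the same header, which is what makes a session convenient for crawling many pages of one site.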

Internally, the get method calls the following method. Its parameters are documented in the docstring:

def request(self, method, url,
        params=None, data=None, headers=None, cookies=None, files=None,
        auth=None, timeout=None, allow_redirects=True, proxies=None,
        hooks=None, stream=None, verify=None, cert=None, json=None):
    """Constructs a :class:`Request <Request>`, prepares it and sends it.
    Returns :class:`Response <Response>` object.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query
        string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send
        in the body of the :class:`Request`.
    :param json: (optional) json to send in the body of the
        :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the
        :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the
        :class:`Request`.
    :param files: (optional) Dictionary of ``'filename': file-like-objects``
        for multipart encoding upload.
    :param auth: (optional) Auth tuple or callable to enable
        Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send
        data before giving up, as a float, or a :ref:`(connect timeout,
        read timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Set to True by default.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol or protocol and
        hostname to the URL of the proxy.
    :param stream: (optional) whether to immediately download the response
        content. Defaults to ``False``.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param cert: (optional) if String, path to ssl client cert file (.pem).
        If Tuple, ('cert', 'key') pair.
    :rtype: requests.Response
    """

See the official documentation for detailed usage.
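As a small illustration of the `params` argument described in the docstring above, the sketch below builds the Yingyongbao detail URL used later in this article from a dictionary instead of a hand-written query string. It only prepares the request, so nothing is sent over the network.

```python
import requests

req = requests.Request(
    'GET',
    'http://sj.qq.com/myapp/detail.htm',
    params={'apkName': 'com.tencent.mobileqq'},  # becomes the query string
)
prepared = req.prepare()

print(prepared.url)
# http://sj.qq.com/myapp/detail.htm?apkName=com.tencent.mobileqq
```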

Crawling the Wandoujia and Yingyongbao sites with requests (through a proxy IP)

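At the time of writing, the Wandoujia home page is https://www.wandoujia.com/ and the Yingyongbao home page is http://sj.qq.com/. The difference is obvious: one is served over https and the other over http, and that is exactly what matters for what we do next.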

Using a proxy IP with requests is very easy: just pass the proxies parameter when sending the request. So what else do we need to watch out for?

Without further ado, here is the code.
Code for Wandoujia:

# coding=utf-8
import requests
from lxml import html

if __name__ == '__main__':
    s = requests.Session()
    url = 'https://www.wandoujia.com/apps/com.duowan.kiwi'
    ip = {'https': 'http://81.16.248.34:61661'}  # Key point: the scheme key is https
    resp = s.get(url, proxies=ip)
    resp.encoding = 'utf-8'
    source = resp.text
    s_html = html.fromstring(source)
    title = s_html.xpath("//span[@class='title']/text()")[0]
    print(title)  # 虎牙直播

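The key point is the scheme used as the key in the proxies dict. Because the target site is https, the key must be https (provided the proxy IP supports https). With the key set to http the crawl still appears to work, but an http entry does not match an https target, so the page is actually fetched from our local IP. This is verified later in the article.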

ip = {'https': 'http://81.16.248.34:61661'}
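The matching rule can be inspected offline. The sketch below uses `select_proxy` from `requests.utils`, a helper inside requests that picks the proxy for a given URL. It is not part of the documented user-facing API, so treat this as an illustration under that assumption rather than a supported interface.

```python
from requests.utils import select_proxy

url = 'https://www.wandoujia.com/apps/com.duowan.kiwi'
proxy = 'http://81.16.248.34:61661'

# The key is compared with the scheme of the target URL, not of the proxy URL.
print(select_proxy(url, {'https': proxy}))  # http://81.16.248.34:61661
print(select_proxy(url, {'http': proxy}))   # None, so no proxy would be used
```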

Code for Yingyongbao:

# coding=utf-8
import requests
from lxml import html

if __name__ == '__main__':
    s = requests.Session()
    url = 'http://sj.qq.com/myapp/detail.htm?apkName=com.tencent.mobileqq'
    ip = {'http': 'http://81.16.248.34:61661'}  # Key point: the scheme key is http
    resp = s.get(url, proxies=ip)
    resp.encoding = 'utf-8'
    source = resp.text
    s_html = html.fromstring(source)
    title = s_html.xpath("//div[@class='det-name-int']/text()")[0]
    print(title)  # QQ

Likewise, the key point is the scheme key:

ip = {'http': 'http://81.16.248.34:61661'}

Checking whether the proxy IP is actually used to reach the target site

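First, here are two sites that report the visitor's IP, one served over https and one over http (the code below uses https://ip.cn/ and http://2018.ip138.com/ic.asp).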

Next we run four experiments that access a site with a proxy configured:

  1. http proxy IP -> http site
  2. http proxy IP -> https site
  3. https proxy IP -> http site
  4. https proxy IP -> https site
# coding=utf-8
import requests
from lxml import html


def http_ip_to_http():
    ip = {'http': 'http://81.16.248.34:61661'}
    resp = requests.get('http://2018.ip138.com/ic.asp', proxies=ip, timeout=30)
    resp.encoding = 'gb2312'
    source = resp.text
    s_html = html.fromstring(source)
    ip = s_html.xpath("//body/center/text()")[0]
    return ip


def http_ip_to_https():
    ip = {'http': 'http://81.16.248.34:61661'}
    resp = requests.get('https://ip.cn/', proxies=ip, timeout=30)
    resp.encoding = 'utf-8'
    source = resp.text
    s_html = html.fromstring(source)
    ip = s_html.xpath("//div[@class='well']/p/code/text()")[0]
    return ip


def https_ip_to_http():
    ip = {'https': 'http://81.16.248.34:61661'}
    resp = requests.get('http://2018.ip138.com/ic.asp', proxies=ip, timeout=30)
    resp.encoding = 'gb2312'
    source = resp.text
    s_html = html.fromstring(source)
    ip = s_html.xpath("//body/center/text()")[0]
    return ip


def https_ip_to_https():
    ip = {'https': 'http://81.16.248.34:61661'}
    resp = requests.get('https://ip.cn/', proxies=ip, timeout=30)
    resp.encoding = 'utf-8'
    source = resp.text
    s_html = html.fromstring(source)
    ip = s_html.xpath("//div[@class='well']/p/code/text()")[0]
    return ip


if __name__ == '__main__':
    print('1: %s' % http_ip_to_http())    # 您的IP是:[81.16.248.34] 来自:格鲁吉亚  ## proxy IP
    print('2: %s' % http_ip_to_https())   # 112.10.136.164  ## local IP
    print('3: %s' % https_ip_to_http())   # 您的IP是:[112.10.136.164] 来自:浙江省杭州市 移动  ## local IP
    print('4: %s' % https_ip_to_https())  # 81.16.248.34  ## proxy IP

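A few final notes

  1. The proxy IP used in this test was working at the time and supported both the http and https schemes; there is no guarantee that it still works.
  2. When writing the proxy configuration, if the proxy IP supports both http and https, it is best to write it like this: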
    ip = {'https': 'http://81.16.248.34:61661', 'http': 'http://81.16.248.34:61661'}
    
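  3. When the proxy IP cannot be applied to a request (for example, when no entry matches the scheme of the target URL, as in experiments 2 and 3), the local IP is used by default, so do not assume that setting a proxy IP means your traffic actually goes through it.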
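Notes 2 and 3 can be combined into a small guard that runs before crawling. This is only a sketch: `build_proxies` and `ensure_proxied` are hypothetical helper names introduced here, and `select_proxy` is a helper inside `requests.utils` rather than a documented user-facing API.

```python
from requests.utils import select_proxy


def build_proxies(proxy_url):
    """Route both http and https targets through the same proxy."""
    return {'http': proxy_url, 'https': proxy_url}


def ensure_proxied(url, proxies):
    """Return the proxy that applies to `url`, or raise if none matches."""
    proxy = select_proxy(url, proxies)
    if proxy is None:
        raise ValueError('no proxy entry matches %s' % url)
    return proxy


proxies = build_proxies('http://81.16.248.34:61661')
print(ensure_proxied('https://www.wandoujia.com/', proxies))
print(ensure_proxied('http://sj.qq.com/', proxies))
```

Failing loudly when no entry matches avoids crawling from the local IP by accident. It does not prove that the proxy itself is reachable.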

Update (18.11.27)

You can visit https://httpbin.org/ip to find out which IP your request comes from.
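A minimal sketch of that check follows, assuming the endpoint returns JSON with an `origin` field. The function name `visible_ip` and its `get` parameter are hypothetical; `get` exists only so the HTTP call can be swapped out for testing.

```python
import requests


def visible_ip(proxies=None, get=requests.get, timeout=30):
    """Return the IP address that https://httpbin.org/ip sees for this request."""
    resp = get('https://httpbin.org/ip', proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    return resp.json()['origin']
```

Calling `visible_ip()` and then `visible_ip({'https': 'http://81.16.248.34:61661'})` and comparing the two results shows whether the proxy is really in use. Because the URL is https, the proxies dict needs an https key.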