Crawl Web Content - Crawler Syntax for Special Scenarios Explained

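It has been a while since I last wrote anything. While scraping some data recently, I found that page structures can get pretty complicated: tags nested inside tags, target content chopped up by tags, and some content that only shows up after a "load more" network request. Downright scary. So today I am writing down a few representative cases for future reference.
New quest acquired: explain the scraping logic for a few special scenarios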

Concatenating a key and using a variable's value - %s %()

<div class="contson" id="contsoneae647c5c110">
	单车欲问边，属国过居延。
	<br />
	征蓬出汉塞，归雁入胡天。
	<br />
	大漠孤烟直，长河落日圆。
	<br />
	萧关逢候骑，都护在燕然。
</div>

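The goal is to grab all the poem lines via the id attribute. The id value, however, is built dynamically as contson plus a variable, so it has to be concatenated. The concatenated value is then used in the XPath like this: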

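# The id is the fixed prefix 'contson' plus the per-item id
content_id = 'contson' + item_id
# Substitute the id into the XPath with %-formatting
content_xpath = "div[@id='%s']/text()" % content_id
item['content'] = content.xpath(content_xpath).extract()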

Result:

'content': ['\n单车欲问边，属国过居延。',
             '征蓬出汉塞，归雁入胡天。',
             '大漠孤烟直，长河落日圆。',
             '萧关逢候骑，都护在燕然。\n']
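
The same idea can be tried without a crawl. The sketch below is only an illustration under assumptions: it uses the standard library's ElementTree in place of Scrapy's selector, wraps a shortened copy of the snippet in a body element, and the helper names build_content_xpath and poem_lines are made up for this example. ElementTree's XPath subset has no text() step, so itertext() collects the text instead.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the snippet above, wrapped so the div is a descendant.
MARKUP = """<body><div class="contson" id="contsoneae647c5c110">
单车欲问边，属国过居延。
<br />
征蓬出汉塞，归雁入胡天。
</div></body>"""


def build_content_xpath(item_id):
    """Concatenate the fixed prefix with the item id, then format the path."""
    content_id = "contson" + item_id
    return ".//div[@id='%s']" % content_id


def poem_lines(markup, item_id):
    """Return the non-empty text fragments of the matching div."""
    root = ET.fromstring(markup)
    node = root.find(build_content_xpath(item_id))
    if node is None:
        return []
    return [text.strip() for text in node.itertext() if text.strip()]


lines = poem_lines(MARKUP, "eae647c5c110")
print(lines)
```

An f-string would work just as well as %-formatting here; the point is only that the id has to be assembled before it goes into the path.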

Not containing a given attribute or child element - not

<div class="sonspic">
	<div class="cont" style="margin-top:13px;">
	<div class="divimg" style="margin-top:2px;">
		<a href="/authorv_52fceee85532.aspx"><img src="https://img.gushiwen.org/authorImg/wangwei.jpg" width="105" height="150" alt="王维"/></a>
	</div>
	<p style="height:22px;">
		<a style="font-size:18px; line-height:22px; height:22px;" href="/authorv_52fceee85532.aspx"><b>王维</b></a>
		<a style="margin-left:5px;" href="javascript:PlayAuthor(515)"><img id="speakerimgAuthor515" src="/img/speaker.png" / alt="" width="16" height="16"/></a>
		<span id="authorPlay515" style=" display:none;width:1px; height:1px;"></span>
	</p>
	<p style=" margin:0px;">
		王维（701年－761年，一说699年—761年），字摩诘，汉族，河东蒲州（今山西运城）人，祖籍山西祁县，唐朝诗人，有“诗佛”之称。苏轼评价其：“味摩诘之诗，诗中有画；观摩诘之画，画中有诗。”开元九年（721年）中进士，任太乐丞。王维是盛唐诗人的代表，今存诗400余首，重要诗作有《相思》《山居秋暝》等。王维精通佛学，受禅宗影响很大。佛教有一部《维摩诘经》，是王维名和字的由来。王维诗书画都很有名，非常多才多艺，音乐也很精通。与孟浩然合称“王孟”。
		<a href="/authors/authorvsw_52fceee85532A1.aspx">► 446篇诗文</a>
	</p>
</div>

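The goal is to extract the author biography from this block. It sits in the second <p>, but annoyingly there is another <p> above it. How do we locate it? Find every <p> and take the second one? Match the second one by its particular style attribute value? Neither is reliable. Notice that the first <p> contains a <span>, while the target <p> does not, so the value is extracted as follows: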

item['authorDetail'] = authorReltated.xpath("p[not(span)]/text()").extract()

Result:

'authorDetail': ['王维（701年－761年，一说699年—761年），字摩诘，汉族，河东蒲州（今山西运城）人，祖籍山西祁县，唐朝诗人，有“诗佛”之称。苏轼评价其：“味摩诘之诗，诗中有画；。王维精通佛学，受禅宗影响很大。佛教有一部《维摩诘经》，是王维名和字的由来。王维诗书画都很有名，非常多才多艺，音乐也很精通。与孟浩然合称“王孟”。']
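
To see what the p[not(span)] predicate selects, here is a rough standard-library sketch. It is not the Scrapy call itself: ElementTree's XPath subset has no not() function, so the filter is written out in Python, and the markup is an abridged stand-in for the block above. The helper names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abridged stand-in for the author block: two <p>, only the first has a <span>.
SNIPPET = """<div class="cont">
  <p style="height:22px;"><a href="/a">王维</a><span id="authorPlay515"></span></p>
  <p style=" margin:0px;">王维（701年－761年），唐朝诗人。<a href="/b">446篇诗文</a></p>
</div>"""


def paragraphs_without_span(markup):
    """Python spelling of p[not(span)]: child <p> elements with no <span> child."""
    root = ET.fromstring(markup)
    return [p for p in root.findall("p") if p.find("span") is None]


def own_text(p):
    """Python spelling of text(): the element's direct text nodes only."""
    parts = [p.text or ""] + [child.tail or "" for child in p]
    return [part.strip() for part in parts if part.strip()]


matches = paragraphs_without_span(SNIPPET)
author_detail = [text for p in matches for text in own_text(p)]
print(author_detail)
```

The link text inside the second <p> is left out because only direct text nodes are collected, which mirrors what /text() does in the XPath.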

load more - network requests

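Web pages often require a click on "load more" before more content is displayed. Broken down, the flow is: the click sends a network request, and the page is then rendered dynamically from the response. So it is handled as follows: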

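Inspect the network request

Open the browser's "Inspect Element"
Switch to the "Network" tab
Click "load more" to trigger the network request

Now we can see the details: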

Summary
URL: https://so.gushiwen.org/shiwen2017/ajaxshiwencont.aspx?id=eae647c5c110&value=yi
Status: 200
Source: Network
Address: 223.111.239.88:443

Request
:method: GET
:scheme: https
:authority: so.gushiwen.org
:path: /shiwen2017/ajaxshiwencont.aspx?id=eae647c5c110&value=yi
Cookie: Hm_lpvt_04660099568f561a75456483228a9516=1552025928; ASP.NET_SessionId=5vbb1cxxl5lectozzgalmwyb; Hm_lvt_04660099568f561a75456483228a9516=1551419804,1551665879,1551924009,1552025928
Accept: */*
Accept-Encoding: br, gzip, deflate
Host: so.gushiwen.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15
Accept-Language: zh-cn
Referer: https://so.gushiwen.org/shiwenv_eae647c5c110.aspx
Connection: keep-alive

Response
:status: 200
Content-Type: text/html; charset=utf-8
Set-Cookie: sec_tc=AQAAAAgAOWBM0AYAdYNoEC8xgGUPsd89; Path=/; Expires=Fri, 08-Mar-19 07:23:36 GMT; HttpOnly
Via: cache2.l2st4-4[48,200-0,M], cache9.l2st4-4[49,0], skunlun7.cn1418[0,200-0,H], skunlun9.cn1418[1,0]
Age: 5696
Expires: Fri, 08 Mar 2019 07:38:41 GMT
Timing-Allow-Origin: *
Cache-Control: public, max-age=7200
Date: Fri, 08 Mar 2019 05:38:40 GMT
Content-Encoding: gzip
Vary: Accept-Encoding, *
Last-Modified: Fri, 08 Mar 2019 05:38:41 GMT
x-swift-savetime: Fri, 08 Mar 2019 05:38:41 GMT
x-aspnet-version: 4.0.30319
x-cache: HIT TCP_MEM_HIT dirn:-2:-2
x-powered-by: UrlRewriter.NET 1.7.0, ASP.NET
Server: Tengine
ali-swift-global-savetime: 1552023521
x-swift-cachetime: 7200
eagleid: df6fef1d15520292164464124e

Query string parameters
id: eae647c5c110
value: yi
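
The captured URL can be taken apart and rebuilt with the standard library, which makes it clear that the item id is the part that varies per poem. This is a small sketch, not part of the original spider; build_ajax_url is a hypothetical helper, and the default value "yi" is simply copied from the captured URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# URL copied from the captured request above.
CAPTURED_URL = (
    "https://so.gushiwen.org/shiwen2017/ajaxshiwencont.aspx"
    "?id=eae647c5c110&value=yi"
)

parts = urlsplit(CAPTURED_URL)
params = parse_qs(parts.query)  # each value comes back as a list


def build_ajax_url(item_id, value="yi"):
    """Rebuild the endpoint URL for another item id (hypothetical helper)."""
    base = "%s://%s%s" % (parts.scheme, parts.netloc, parts.path)
    return base + "?" + urlencode({"id": item_id, "value": value})


rebuilt = build_ajax_url("eae647c5c110")
print(params)
print(rebuilt)
```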

Simulating the request

req_url = 'https://so.gushiwen.org/shiwen2017/ajaxshiwencont.aspx?id=%s&value=yi'
request_translation = scrapy.Request(req_url%item_id, callback=self.parse_translation, dont_filter=True)
request_translation.meta['item'] = item
request_translation.meta['item_id'] = item_id
yield request_translation

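Walkthrough

Build the URL
Create the request
Declare the request's callback as the "parse_translation" method; "dont_filter" is set to True so the request is not dropped when its URL falls outside "allowed_domains", which would make the request fail (it also bypasses the duplicate-request filter)
Put the data processed so far into "request.meta" to carry it into the callback method
Same as above
Yield the request; whatever the callback hands back becomes the final result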

After that, parsing can continue in the callback method.
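
The post does not show the callback itself, so the following is only a guess at its shape. It is written as a plain function with a stub response so it runs without Scrapy; in a spider it would be a method, and the response would be a real Scrapy response, whose meta and text attributes are the two used here. The "translation" and "id" field names and the placeholder markup are assumptions.

```python
from types import SimpleNamespace


def parse_translation(response):
    """Hypothetical callback: merge the AJAX response into the carried item."""
    # Data attached through request.meta is available again on response.meta.
    item = response.meta["item"]
    item_id = response.meta["item_id"]

    # Assumed field names; a real callback would more likely run further
    # XPath queries on the response than store the raw body.
    item["translation"] = response.text
    item["id"] = item_id
    yield item


# Stand-in for a Scrapy response, just enough for the sketch to run.
fake_response = SimpleNamespace(
    meta={"item": {"content": []}, "item_id": "eae647c5c110"},
    text="<p>placeholder</p>",
)
results = list(parse_translation(fake_response))
print(results)
```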

Aggregating scattered fragments -

<p>
	征蓬出汉塞
   <span style="color:#286345;">(sài)</span>，归雁
   <span style="color:#286345;">(yàn)</span>入胡天。
   <br />
   <span style="color:#286345;">征蓬：随风飘飞的蓬草，此处为诗人自喻。归雁：雁是候鸟，春天北飞，秋天南行，这里是指大雁北飞。胡天：胡人的领空。这里是指唐军占领的北方地方。</span>
</p>

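Here we can see that some of the target text sits directly in the <p>, while some is tucked into <span> elements. It is fragmented and needs to be aggregated: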

a_raw = response.xpath("//p[not(@style)]")
for a_raw_p in a_raw:
    # normalize-space() joins all descendant text and collapses whitespace
    a_raw_contents = a_raw_p.xpath("normalize-space()").extract()
    print(a_raw_contents)

Result

['征蓬出汉塞(sài)，归雁(yàn)入胡天。征蓬：随风飘飞的蓬草，此处为诗人自喻。归雁：雁是候鸟，春天北飞，秋天南行，这里是指大雁北飞。胡天：胡人的领空。这里是指唐军占领的北方地方。']
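
What normalize-space() does can be mimicked in a few lines of standard-library Python, which may help when checking results by hand. This is a sketch of the behaviour, not the XPath engine itself: it concatenates all descendant text, collapses runs of space, tab, CR, and LF into one space, and trims both ends. The markup is a shortened copy of the snippet above, and the function name is chosen for this example.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the annotated line, including its line breaks.
SNIPPET = """<p>
  征蓬出汉塞
  <span style="color:#286345;">(sài)</span>，归雁
  <span style="color:#286345;">(yàn)</span>入胡天。
</p>"""


def normalize_space(element):
    """Join every text node under element and collapse XPath whitespace."""
    text = "".join(element.itertext())
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


flattened = normalize_space(ET.fromstring(SNIPPET))
print(flattened)
```

Line breaks between fragments survive as single spaces, so the result only comes out with no spaces at all when the source markup has no whitespace between the pieces.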

Epilogue

To be continued...

This article is available under WTFPL-V2. Generally, everyone is permitted to copy and do what the fuck you want to.
P.S. Even so, a kind note that you were inspired by this site - Chen’s Alchemy would be appreciated

Permalink: http://yoursite.com/2019/03/08/python-Scrapy-parser-addon/