<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>布咯咯_rieuse</title>
<link href="/atom.xml" rel="self"/>
<link href="http://bulolo.cn/"/>
<updated>2017-06-20T23:52:59.130Z</updated>
<id>http://bulolo.cn/</id>
<author>
<name>布咯咯_rieuse</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Scrapy Crawler: Scraping Guokr Hot and Featured Q&amp;A</title>
<link href="http://bulolo.cn/2017/06/20/scrapy3/"/>
<id>http://bulolo.cn/2017/06/20/scrapy3/</id>
<published>2017-06-20T01:21:35.000Z</published>
<updated>2017-06-20T23:52:59.130Z</updated>
<content type="html"><![CDATA[<p><img src="http://upload-images.jianshu.io/upload_images/4701426-c5d788b76b2f843d.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="果壳问答.jpg"></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>继续练习Scrapy框架,这次抓取的果壳问答网站的热门问答和精彩问答相关信息,信息如下:标题,关注量,回答数目,简介等。之后保存到mongodb和json文件中以备后续使用。代码地址:<a href="https://github.com/rieuse/ScrapyStudy" target="_blank" rel="external">https://github.com/rieuse/ScrapyStudy</a><br><a id="more"></a></p>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li>IDE:Pycharm 2017</li>
<li>Python3.6</li>
<li>pymongo 3.4.0</li>
<li>scrapy 1.3.3<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1>1.首先进入果壳问答<a href="http://www.guokr.com/ask/" target="_blank" rel="external">http://www.guokr.com/ask/</a> ,我这次爬取的是热门问答和精彩问答的全部信息。</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-dcb73cc63b1c6a5c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="果壳"></p>
<p>2. Open the hot and featured Q&A sections; their pages share the same structure. The URLs are www.guokr.com/ask/hottest and www.guokr.com/ask/highlight , and both span many pages. Clicking the next page appends <code>?page=N</code> to the URL, where N is the page number, so every page URL can be generated with list comprehensions:<br><pre><code>start_urls = ['http://www.guokr.com/ask/hottest/?page={}'.format(n) for n in range(1, 8)] + ['http://www.guokr.com/ask/highlight/?page={}'.format(m) for m in range(1, 101)]</code></pre></p>
<p>3. Fields to scrape: each question's follower count, answer count, title, and summary.</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-3183c61c1135bc3a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="抓取内容"><br>4.网页结构分析:全部问答内容在class=”ask-list-cp”的ul下的li中,<br>所以对应的xpath地址如下,问答的单个信息的xpath取值是在全部信息的基础上取的。这里xpath选取比较灵活,可以使用属性,不同的相对位置。很多方式都可以选择到我们需要的数据,一种不成功就换其他的。比如这里的几个div都有自己单独的属性,就可以利用这个去选择。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">全部信息:/html/body/div[3]/div[1]/ul[2]/li</div><div class="line">关注://div[@class="ask-hot-nums"]/p[1]/span/text()</div><div class="line">回答://div[1]/p[2]/span/text()</div><div class="line">标题://div[2]/h2/a/text()</div><div class="line">简介://div[2]/p/text()</div><div class="line">链接://div[2]/h2/a/@href</div></pre></td></tr></table></figure></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-a064473ba2fd764b.gif?imageMogr2/auto-orient/strip" alt="GIF.gif"></p>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><p>分析好页面结构和数据位置就可以使用scrapy框架来抓取数据了。完整代码地址:<a href="https://github.com/rieuse/ScrapyStudy/tree/master/Guoke" target="_blank" rel="external">github.com/rieuse/ScrapyStudy</a></p>
<p>1. From the command line, create a new Scrapy project and then a spider:<br><pre><code>scrapy startproject Guoke
cd Guoke\Guoke\spiders
scrapy genspider guoke guokr.com</code></pre></p>
<p>2. Open items.py in the Guoke folder and change it to the following, defining the fields we scrape:<br><pre><code>import scrapy


class GuokeItem(scrapy.Item):
    title = scrapy.Field()
    Focus = scrapy.Field()
    answer = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()</code></pre></p>
<p>3. Configure middleware.py; together with the User-Agent settings in settings.py, this picks a random UA for each download, which helps avoid bans. Add the following to the existing code (more entries can be added to user_agent_list):<br><pre><code>import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            print(ua)
            request.headers.setdefault('User-Agent', ua)

    user_agent_list = [
        # note: the comma after the first string was missing in the original,
        # which silently merged the first two User-Agents into one
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    ]</code></pre></p>
<p>4. To be clear about the goal: the scraped data goes both into MongoDB and into a local JSON file, so the pipelines are set up as follows:<br><pre><code>import json
import pymongo
from scrapy.conf import settings


class GuokePipeline(object):
    def __init__(self):
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        self.db = self.client[settings['MONGO_DB']]
        self.post = self.db[settings['MONGO_COLL']]

    def process_item(self, item, spider):
        postItem = dict(item)
        self.post.insert(postItem)
        return item


class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('guoke.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):  # fixed: Scrapy calls close_spider on pipelines; the original 'spider_closed' was never invoked
        self.file.close()</code></pre></p>
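<p>Since JsonWriterPipeline writes one JSON object per line, the file can be read back later with a few lines (a small sketch, not part of the original post):</p>
<pre><code>import json

# each line of guoke.json is an independent JSON object
with open('guoke.json', encoding='utf-8') as f:
    items = [json.loads(line) for line in f if line.strip()]
print(len(items), 'items loaded')</code></pre>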
<p>5. settings.py also needs changes to enable the pipeline configuration so the data actually gets saved:<br><pre><code>BOT_NAME = 'Guoke'
SPIDER_MODULES = ['Guoke.spiders']
NEWSPIDER_MODULE = 'Guoke.spiders'

# MongoDB configuration
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "Guoke"  # database name
MONGO_COLL = "Guoke_info"  # collection

# entry point for the pipeline classes
ITEM_PIPELINES = {
    'Guoke.pipelines.JsonWriterPipeline': 300,
    'Guoke.pipelines.GuokePipeline': 300,
    }

# random User-Agent setup
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'Guoke.middlewares.RotateUserAgentMiddleware': 400,
}

ROBOTSTXT_OBEY = False  # do not follow the site's robots.txt policy
DOWNLOAD_DELAY = 1  # delay before downloading consecutive pages from the same site; limits crawl speed and eases server load
COOKIES_ENABLED = False  # disable cookies</code></pre></p>
<p>6. Finally, the key part: open guoke.py in the spiders folder and change it to the following; this is the main crawler. The start URLs combine the hot and featured answer pages.<br><pre><code># -*- coding: utf-8 -*-
import scrapy
from Guoke.items import GuokeItem


class GuokeSpider(scrapy.Spider):
    name = "guoke"
    allowed_domains = ["guokr.com"]
    start_urls = ['http://www.guokr.com/ask/hottest/?page={}'.format(n) for n in range(1, 8)] + [
        'http://www.guokr.com/ask/highlight/?page={}'.format(m) for m in range(1, 101)]

    def parse(self, response):
        item = GuokeItem()
        # the page-wide '//' XPaths below return one result per question in
        # document order, so the counter i pairs each result with the i-th li
        i = 0
        for content in response.xpath('/html/body/div[3]/div[1]/ul[2]/li'):
            item['title'] = content.xpath('//div[2]/h2/a/text()').extract()[i]
            item['Focus'] = content.xpath('//div[@class="ask-hot-nums"]/p[1]/span/text()').extract()[i]
            item['answer'] = content.xpath('//div[1]/p[2]/span/text()').extract()[i]
            item['link'] = content.xpath('//div[2]/h2/a/@href').extract()[i]
            item['content'] = content.xpath('//div[2]/p/text()').extract()[i]
            i += 1
            yield item</code></pre></p>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>先来看看抓取后的效果如何,mongodb我使用的可视化客户端是robomongodb,日常打开代码的工具是notepad++,atom,vscode都还不错推荐一波。代码都放在github中了,有喜欢的朋友可以点击 start follw,<a href="https://github.com/rieuse" target="_blank" rel="external">https://github.com/rieuse</a> 。<br>mongodb:<br><img src="http://upload-images.jianshu.io/upload_images/4701426-d2cfb2b67050ca30.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="mongodbg"><br>json文件:</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-d0178a74edfcfe00.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="json文件"></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-95302976921e8254.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="不断学习,继续加油!"></p>
]]></content>
<summary type="html">
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-c5d788b76b2f843d.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="果壳问答.jpg"></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>继续练习Scrapy框架,这次抓取的果壳问答网站的热门问答和精彩问答相关信息,信息如下:标题,关注量,回答数目,简介等。之后保存到mongodb和json文件中以备后续使用。代码地址:<a href="https://github.com/rieuse/ScrapyStudy" target="_blank" rel="external">https://github.com/rieuse/ScrapyStudy</a><br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="scpray" scheme="http://bulolo.cn/tags/scpray/"/>
<category term="信息" scheme="http://bulolo.cn/tags/%E4%BF%A1%E6%81%AF/"/>
</entry>
<entry>
<title>Python Crawler: Large-Scale Scraping of Detailed Audio Data from Ximalaya FM</title>
<link href="http://bulolo.cn/2017/06/18/scrapy2/"/>
<id>http://bulolo.cn/2017/06/18/scrapy2/</id>
<published>2017-06-18T01:21:35.000Z</published>
<updated>2017-06-20T23:51:51.844Z</updated>
<content type="html"><![CDATA[<p><img src="http://upload-images.jianshu.io/upload_images/4701426-3187de405166e2a0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="喜马拉雅"></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>本次爬取的是喜马拉雅的热门栏目下全部电台的每个频道的信息和频道中的每个音频数据的各种信息,然后把爬取的数据保存到mongodb以备后续使用。这次数据量在70万左右。音频数据包括音频下载地址,频道信息,简介等等,非常多。<br>昨天进行了人生中第一次面试,对方是一家人工智能大数据公司,我准备在这大二的暑假去实习,他们就要求有爬取过音频数据,所以我就来分析一下喜马拉雅的音频数据爬下来。目前我还在等待三面中,或者是通知最终面试消息。 (因为能得到一定肯定,不管成功与否都很开心)<br><a id="more"></a></p>
<hr>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li>IDE:Pycharm 2017</li>
<li>Python3.6</li>
<li>pymongo 3.4.0</li>
<li>requests 2.14.2</li>
<li>lxml 3.7.2</li>
<li>BeautifulSoup 4.5.3</li>
</ul>
<hr>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>1.首先进入这次爬取的主页面<a href="http://www.ximalaya.com/dq/all/" target="_blank" rel="external">http://www.ximalaya.com/dq/all/</a> ,可以看到每页12个频道,每个频道下面有很多的音频,有的频道中还有很多分页。抓取计划:循环84个页面,对每个页面解析后抓取每个频道的名称,图片链接,频道链接保存到mongodb。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-891631b667237aa2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="热门频道"></p>
<p>2. Open developer tools and inspect the page; the data locations are easy to find. The following code scrapes the info for every popular channel, ready to be saved to MongoDB:<br><pre><code>start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
for start_url in start_urls:
    html = requests.get(start_url, headers=headers1).text
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.find_all(class_="albumfaceOutter"):
        content = {
            'href': item.a['href'],
            'title': item.img['alt'],
            'img_url': item.img['src']
        }
        print(content)</code></pre></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-9615666111457d3a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="分析频道"></p>
<p>3. <strong>Next comes fetching all the audio data within each channel.</strong> The previous step collected each channel's link; for example, open <a href="http://www.ximalaya.com/6565682/album/237771" target="_blank" rel="external">http://www.ximalaya.com/6565682/album/237771</a> and inspect the page structure. Each audio track has its own ID, which can be read from an attribute on a div; split() and int() turn that attribute value into individual IDs (see the small example below).</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-d72a77797d0b258f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="频道页面分析"></p>
<p>4. Then click an audio link, open developer tools, refresh the page, and click XHR; one of the JSON requests contains all the details for that track:<br><pre><code>html = requests.get(url, headers=headers2).text
numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
for i in numlist:
    murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
    html = requests.get(murl, headers=headers1).text
    dic = json.loads(html)</code></pre></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-b808483245c72daa.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="音频页面分析"><br>5.上面只是对一个频道的主页面解析全部音频信息,但是实际上频道的音频链接是有很多分页的。</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">html = requests.get(url, headers=headers2).text</div><div class="line">ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')</div><div class="line">if len(ifanother):</div><div class="line"> num = ifanother[0]</div><div class="line"> print('本频道资源存在' + num + '个页面')</div><div class="line"> for n in range(1, int(num)):</div><div class="line"> print('开始解析{}个中的第{}个页面'.format(num, n))</div><div class="line"> url2 = url + '?page={}'.format(n)</div><div class="line"> # 之后就接解析音频页函数就行,后面有完整代码说明</div></pre></td></tr></table></figure>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-c86a52c57fd81fbf.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="分页"></p>
<p>6. Full code<br><strong>Full code: <a href="https://github.com/rieuse/learnPython/blob/master/Python%E7%88%AC%E8%99%AB%E6%97%A5%E8%AE%B0%E7%B3%BB%E5%88%97/Python%E7%88%AC%E5%8F%96%E6%97%A5%E8%AE%B0%E5%8D%81%EF%BC%9A%E6%8A%93%E5%8F%96%E5%96%9C%E9%A9%AC%E6%8B%89%E9%9B%85%E7%94%B5%E5%8F%B0%E9%9F%B3%E9%A2%91.py" target="_blank" rel="external">github.com/rieuse/learnPython</a></strong><br><pre><code>__author__ = '布咯咯_rieuse'

import json
import random
import time
import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient('localhost')
db = clients["XiMaLaYa"]
col1 = db["album"]
col2 = db["detaile"]

UA_LIST = []  # a long list of User-Agents for random rotation to avoid bans; omitted here
headers1 = {}  # request headers, omitted here
headers2 = {}  # request headers, omitted here


def get_url():
    start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, 'lxml')
        for item in soup.find_all(class_="albumfaceOutter"):
            content = {
                'href': item.a['href'],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            col1.insert(content)
            print('Saved one channel: ' + item.a['href'])
            print(content)
            another(item.a['href'])
        time.sleep(1)


def another(url):
    html = requests.get(url, headers=headers2).text
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    if len(ifanother):
        num = ifanother[0]
        print('This channel has ' + num + ' pages')
        for n in range(1, int(num)):
            print('Parsing page {} of {}'.format(n, num))
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    get_m4a(url)


def get_m4a(url):
    time.sleep(1)
    html = requests.get(url, headers=headers2).text
    numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        col2.insert(dic)
        print('Data from ' + murl + ' inserted into MongoDB')


if __name__ == '__main__':
    get_url()</code></pre></p>
<p>7. An asynchronous version runs faster: with changes along the lines sketched below, I measured nearly 100 more records per minute than the plain version. That source is also in the GitHub repo.</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-858ca9052b8d9585.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="异步"></p>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>这次抓取的数据量在70万左右,后续可以进行很多研究,比如播放量排行榜、时间区段排行、频道音频数量等等。后续继续学习使用科学计算和绘图工具做一些数据分析,清洗的工作。<br>贴出我的github地址,我的爬虫代码和学习写的代码都放进去了,有喜欢的朋友可以点击 start follw一起学习交流吧!<strong><a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></strong></p>
]]></content>
<summary type="html">
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-3187de405166e2a0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="喜马拉雅"></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>本次爬取的是喜马拉雅的热门栏目下全部电台的每个频道的信息和频道中的每个音频数据的各种信息,然后把爬取的数据保存到mongodb以备后续使用。这次数据量在70万左右。音频数据包括音频下载地址,频道信息,简介等等,非常多。<br>昨天进行了人生中第一次面试,对方是一家人工智能大数据公司,我准备在这大二的暑假去实习,他们就要求有爬取过音频数据,所以我就来分析一下喜马拉雅的音频数据爬下来。目前我还在等待三面中,或者是通知最终面试消息。 (因为能得到一定肯定,不管成功与否都很开心)<br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="scpray" scheme="http://bulolo.cn/tags/scpray/"/>
<category term="音频" scheme="http://bulolo.cn/tags/%E9%9F%B3%E9%A2%91/"/>
</entry>
<entry>
<title>Scrapy Crawler: Bulk-Scraping the Latest Meme Images from Doutula</title>
<link href="http://bulolo.cn/2017/06/12/scrapy1/"/>
<id>http://bulolo.cn/2017/06/12/scrapy1/</id>
<published>2017-06-12T01:21:35.000Z</published>
<updated>2017-06-12T03:38:40.664Z</updated>
<content type="html"><![CDATA[<p><img src="http://upload-images.jianshu.io/upload_images/4701426-732f41204c843fd8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<h1 id="一:目标"><a href="#一:目标" class="headerlink" title="一:目标"></a>一:目标</h1><p>第一次使用Scrapy框架遇到很多坑,坚持去搜索,修改代码就可以解决问题。这次爬取的是一个斗图网站的最新表情图片<a href="https://www.doutula.com/photo/list/" target="_blank" rel="external">www.doutula.com/photo/list</a>,练习使用Scrapy框架并且使用的随机user agent防止被ban,斗图表情包每日更新,一共可以抓取5万张左右的表情到硬盘中。为了节省时间我就抓取了1万多张。<br><a id="more"></a></p>
<hr>
<h1 id="二:Scrapy简介"><a href="#二:Scrapy简介" class="headerlink" title="二:Scrapy简介"></a>二:Scrapy简介</h1><p>Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架。 可以应用在包括数据挖掘,信息处理或存储历史数据等一系列的程序中。</p>
<blockquote>
<p>Workflow</p>
<ul>
<li>Create a Scrapy project</li>
<li>Define the Items to extract</li>
<li>Write a <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/spiders.html#topics-spiders" target="_blank" rel="external">spider</a> to crawl the site and extract <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/items.html#topics-items" target="_blank" rel="external">Items</a></li>
<li>Write an <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/item-pipeline.html#topics-item-pipeline" target="_blank" rel="external">Item Pipeline</a> to store the extracted Items (the data)</li>
</ul>
</blockquote>
<p>The diagram below shows Scrapy's architecture, including its components and an overview of the data flow through the system (green arrows). Each component is briefly described below with links to more detail, and a minimal sketch after the list shows where user code plugs into this flow.<br><img src="http://upload-images.jianshu.io/upload_images/4701426-26f8f9a2007beee4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br><strong> Components</strong> </p>
<ul>
<li><p>Scrapy Engine<br>The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data flow description below for details.</p>
</li>
<li><p>Scheduler<br>The scheduler accepts requests from the engine and enqueues them, feeding them back to the engine when asked.</p>
</li>
<li><p>Downloader<br>The downloader fetches page data and hands it to the engine, which passes it on to the spiders.</p>
</li>
<li><p>Spiders<br>Spiders are user-written classes that parse responses and extract items or additional URLs to follow. Each spider handles one specific site (or a few). See <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/spiders.html#topics-spiders" target="_blank" rel="external">Spiders</a>.</p>
</li>
<li><p>Item Pipeline<br>The Item Pipeline processes the items extracted by spiders; typical tasks are cleaning, validation, and persistence (for example, saving to a database). See <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/item-pipeline.html#topics-item-pipeline" target="_blank" rel="external">Item Pipeline</a>.</p>
</li>
<li><p>Downloader middlewares<br>Downloader middlewares are specific hooks between the engine and the downloader that process the responses the downloader passes to the engine. They provide an easy mechanism for extending Scrapy by plugging in custom code. See <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/downloader-middleware.html#topics-downloader-middleware" target="_blank" rel="external">Downloader Middleware</a>.</p>
</li>
<li><p>Spider middlewares<br>Spider middlewares are specific hooks between the engine and the spiders that process spider input (responses) and output (items and requests). They likewise provide an easy mechanism for extending Scrapy with custom code. See <a href="http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/spider-middleware.html#topics-spider-middleware" target="_blank" rel="external">Spider Middleware</a>.</p>
</li>
</ul>
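<p>To make the flow concrete, here is a minimal, illustrative spider and pipeline pair (my sketch, not from the original post; quotes.toscrape.com is a public scraping practice site). The pipeline still has to be enabled via ITEM_PIPELINES in settings.py:</p>
<pre><code># Illustrative only: a minimal spider and pipeline showing where user code
# plugs into the engine/scheduler/downloader flow described above.
import scrapy


class QuotesSpider(scrapy.Spider):  # Spider: parses responses, yields items or new requests
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']  # the engine feeds these to the scheduler

    def parse(self, response):  # the downloader returns the response via the engine
        for text in response.css('span.text::text').extract():
            yield {'text': text}  # yielded items travel on to the Item Pipeline


class PrintPipeline(object):  # Item Pipeline: clean, validate, persist
    def process_item(self, item, spider):
        print(item)
        return item  # return the item so any later pipeline can run</code></pre>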
<hr>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>1.从网站的主页进入最新斗图表情后网址是<a href="https://www.doutula.com/photo/list/" target="_blank" rel="external">https://www.doutula.com/photo/list/</a> ,点击第二页后看到网址变成了<a href="https://www.doutula.com/photo/list/?page=2" target="_blank" rel="external">https://www.doutula.com/photo/list/?page=2</a> ,那我们就知道了网址的构成最后的page就是不同的页数。那么spider中的start_urls开始入口就如下定义,爬取1到20页的图片表情。想下载更多表情页数你可以再增加。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">start_urls = ['https://www.doutula.com/photo/list/?page={}'.format(i) for i in range(1, 20)]</div></pre></td></tr></table></figure></p>
<p>2. Inspect the page structure in developer mode. Right-clicking and copying the XPath gives the a tags containing all the memes; a[1] means the first a, and dropping the [1] selects all of them.<br><pre><code>//*[@id="pic-detail"]/div/div[1]/div[2]/a</code></pre></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-df7feb858433c56f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>值得注意的是这里的表情有两种:一个jpg,一个gif动图。如果获取图片地址时只抓取a标签下面第一个img的src就会出错,所以我们要抓取img中的含有data-original的值。这里a标签下面还一个p标签是图片简介,我们也抓取下来作为图片文件的名称。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">图片的连接是 'http:' + content.xpath('//img/@data-original')</div><div class="line">图片的名称是 content.xpath('//p/text()')</div></pre></td></tr></table></figure></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-f67f62f1a92f7513.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<hr>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><p><strong>完整代码地址 <a href="https://github.com/rieuse/learnPython/tree/master/ScrapyDoutu" target="_blank" rel="external">github.com/rieuse/learnPython</a></strong><br>1.首先使用命令行工具输入代码创建一个新的Scrapy项目,之后创建一个爬虫。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">scrapy startproject ScrapyDoutu</div><div class="line">cd ScrapyDoutu\ScrapyDoutu\spiders</div><div class="line">scrapy genspider doutula doutula.com</div></pre></td></tr></table></figure></p>
<p>2. Open items.py in the ScrapyDoutu folder and change it to the following, defining the fields we scrape:<br><pre><code>import scrapy

class DoutuItem(scrapy.Item):
    img_url = scrapy.Field()
    name = scrapy.Field()</code></pre></p>
<p>3. Open doutula.py in the spiders folder and change it to the following; this is the main crawler:<br><pre><code># -*- coding: utf-8 -*-
import os
import scrapy
import requests
from ScrapyDoutu.items import DoutuItem  # fixed: the class defined in items.py is DoutuItem


class Doutu(scrapy.Spider):
    name = "doutu"
    allowed_domains = ["doutula.com", "sinaimg.cn"]
    start_urls = ['https://www.doutula.com/photo/list/?page={}'.format(i) for i in range(1, 40)]  # pages 1-39 for now

    def parse(self, response):
        i = 0
        for content in response.xpath('//*[@id="pic-detail"]/div/div[1]/div[2]/a'):
            item = DoutuItem()
            # the page-wide XPaths return one result per meme in document order,
            # so index i pairs each result with the i-th a tag; incrementing after
            # use keeps the indices 0-based (the original incremented before use,
            # skipping the first meme and overrunning the list at the end)
            item['img_url'] = 'http:' + content.xpath('//img/@data-original').extract()[i]
            item['name'] = content.xpath('//p/text()').extract()[i]
            i += 1
            try:
                if not os.path.exists('doutu'):
                    os.makedirs('doutu')
                r = requests.get(item['img_url'])
                filename = 'doutu\\{}'.format(item['name']) + item['img_url'][-4:]
                with open(filename, 'wb') as fo:
                    fo.write(r.content)
            except:
                print('Error')
            yield item</code></pre></p>
<p>Several parts of this code deserve attention:</p>
<ul>
<li>Because the images themselves are hosted on sinaimg.cn, that domain must also be added to the allowed_domains list.</li>
<li>item holds the fields you extract, and the [i] index works with the loop counter i to pick up the next tag's content on each pass; without it, the contents of all the tags would land in a single dict value.</li>
<li><code>filename = 'doutu\\{}'.format(item['name']) + item['img_url'][-4:]</code> builds each image's file name; item['img_url'][-4:] takes the last four characters of the image URL, so every file keeps the right suffix for its format (.jpg or .gif).</li>
<li>If an XPath fails to match, the log shows &lt;GET http://*****&gt; (referer: None).</li>
</ul>
<p>4. Configure settings.py. To crawl faster, raise CONCURRENT_REQUESTS and lower DOWNLOAD_DELAY, or set it to 0:</p>
<pre><code># -*- coding: utf-8 -*-
BOT_NAME = 'ScrapyDoutu'

SPIDER_MODULES = ['ScrapyDoutu.spiders']
NEWSPIDER_MODULE = 'ScrapyDoutu.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'ScrapyDoutu.middlewares.RotateUserAgentMiddleware': 400,
}

ROBOTSTXT_OBEY = False  # do not follow the site's robots.txt policy
CONCURRENT_REQUESTS = 16  # maximum number of concurrent requests the Scrapy downloader makes
DOWNLOAD_DELAY = 0.2  # delay before downloading consecutive pages from the same site; limits crawl speed and eases server load
COOKIES_ENABLED = False  # disable cookies</code></pre>
<p>5. Configure middleware.py to work with the UA settings above so a random User-Agent is used per download, which helps avoid bans. Add the following to the existing code; more entries can be added to user_agent_list:</p>
<pre><code>import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            print(ua)
            request.headers.setdefault('User-Agent', ua)

    user_agent_list = [
        # note: the comma after the first string was missing in the original,
        # which silently merged the first two User-Agents into one
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]</code></pre>
<p>6. The code is now complete, so run it:</p>
<pre><code>scrapy crawl doutu</code></pre>
<p>You can then watch it download while the User-Agent changes between requests.<br><img src="http://upload-images.jianshu.io/upload_images/4701426-d650dd3b084ba0fc.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<hr>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>学习使用Scrapy遇到很多坑,但是强大的搜索系统不会让我感觉孤单。所以感觉Scrapy还是很强大的也很意思,后面继续学习Scrapy的其他方面内容。</p>
<p>贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友可以点击 start follw一起学习交流吧!<strong><a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></strong></p>
]]></content>
<summary type="html">
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-732f41204c843fd8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<h1 id="一:目标"><a href="#一:目标" class="headerlink" title="一:目标"></a>一:目标</h1><p>第一次使用Scrapy框架遇到很多坑,坚持去搜索,修改代码就可以解决问题。这次爬取的是一个斗图网站的最新表情图片<a href="https://www.doutula.com/photo/list/" target="_blank" rel="external">www.doutula.com/photo/list</a>,练习使用Scrapy框架并且使用的随机user agent防止被ban,斗图表情包每日更新,一共可以抓取5万张左右的表情到硬盘中。为了节省时间我就抓取了1万多张。<br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="scpray" scheme="http://bulolo.cn/tags/scpray/"/>
<category term="图片" scheme="http://bulolo.cn/tags/%E5%9B%BE%E7%89%87/"/>
</entry>
<entry>
<title>Python Crawler Diary 9: Comparing Multiprocess and Async IO Crawl Speeds on the Wandoujia Design Awards</title>
<link href="http://bulolo.cn/2017/06/07/spider9/"/>
<id>http://bulolo.cn/2017/06/07/spider9/</id>
<published>2017-06-07T11:21:35.000Z</published>
<updated>2017-06-08T01:56:28.748Z</updated>
<content type="html"><![CDATA[<p><img src="http://upload-images.jianshu.io/upload_images/4701426-fdd4ccad379e3f01.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt=""></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>使用requests+BeautifulSoup或者xpath等网页解析工具就可以爬取大部分的网页 ,但是有时爬取的量很大时爬取的速度就让人头疼,今天我就使用三种方式来爬取豌豆荚的设计奖APP相关信息并保存到mongodb,从而对比速度让我们更清楚的认识这些东西用处。<br><a id="more"></a></p>
<blockquote>
<ul>
<li>Plain requests crawl</li>
<li>requests + Pool multiprocess crawl</li>
<li>asyncio + aiohttp asynchronous IO crawl</li>
</ul>
</blockquote>
<hr>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="Part 2: Environment"></a>Part 2: Environment</h1><ul>
<li>IDE: PyCharm 2017</li>
<li>Python 3.6</li>
<li>aiohttp 2.1.0</li>
<li>asyncio 3.4.3</li>
<li>pymongo 3.4.0</li>
</ul>
<hr>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>1.豌豆荚的设计奖首页是<a href="http://www.wandoujia.com/award" target="_blank" rel="external">http://www.wandoujia.com/award</a> 点击下一页之后就会发现网页地址变成了<a href="http://www.wandoujia.com/award?page=x" target="_blank" rel="external">http://www.wandoujia.com/award?page=x</a> x就是当前的页数。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-c3b0ddb8b3a89556.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt=""><br>2.然后来看看本次抓取的信息分布,我抓取的是每个设计奖的背景图片,APP名称,图标,获奖说明。进入浏览器开发者模式后即可查找信息位置。(使用Ctrl+Shift+C选择目标快速到达代码位置,同时这个夸克浏览器也挺不错的,简洁流畅推荐大家安装试试。)<br><img src="http://upload-images.jianshu.io/upload_images/4701426-13f049fd61f89010.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="夸克浏览器"><br>3.信息位置都找到了就可以使用BeautifulSoup来解析网页选择到这些数据,然后保存到mongodb。</p>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><h5 id="完整代码放在github中,github-com-rieuse-learnPython"><a href="#完整代码放在github中,github-com-rieuse-learnPython" class="headerlink" title="完整代码放在github中,github.com/rieuse/learnPython"></a>完整代码放在github中,<a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></h5><p>共用部分是url的构造,一些headers,代理部分。前几次爬虫可以不用headers和代理,但是测试几次后爬取的网站就可能给你封ip或者限速。我这里就需要这些反ban方法,因为我测试几次就呗网站限制了。<br>这里为了反反爬虫可以加入headers,User-Agent也是随机选择。再配合代理ip就很棒了。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div></pre></td><td class="code"><pre><div class="line"># 共用部分</div><div class="line">clients = pymongo.MongoClient('localhost')</div><div class="line">db = clients["wandoujia"]</div><div class="line">col = db["info"]</div><div class="line"></div><div class="line">urls = ['http://www.wandoujia.com/award?page={}'.format(num) for num in range(1, 46)]</div><div class="line">UA_LIST = [</div><div class="line"> "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",</div><div class="line"> "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",</div><div class="line"> "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"</div><div class="line">]</div><div class="line">headers = {</div><div class="line"> 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',</div><div class="line"> 'Accept-Encoding': 'gzip, deflate, sdch',</div><div class="line"> 'Accept-Language': 
'zh-CN,zh;q=0.8,en;q=0.6',</div><div class="line"> 'Connection': 'keep-alive',</div><div class="line"> 'Host': 'www.wandoujia.com',</div><div class="line"> 'User-Agent': random.choice(UA_LIST)</div><div class="line">}</div><div class="line"></div><div class="line">proxies = {</div><div class="line"> 'http': 'http://123.206.6.17:3128',</div><div class="line"> 'https': 'http://123.206.6.17:3128'</div><div class="line">}</div></pre></td></tr></table></figure></p>
<h5 id="方式一-正常requests爬取"><a href="#方式一-正常requests爬取" class="headerlink" title="方式一:正常requests爬取"></a>方式一:正常requests爬取</h5><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div></pre></td><td class="code"><pre><div class="line">def method_1():</div><div class="line"> start = time.time()</div><div class="line"> for url in urls:</div><div class="line"> html = requests.get(url, headers=headers, proxies=proxies).text</div><div class="line"> soup = BeautifulSoup(html, 'lxml')</div><div class="line"> title = soup.find_all(class_='title')</div><div class="line"> app_title = soup.find_all(class_='app-title')</div><div class="line"> item_cover = soup.find_all(class_='item-cover')</div><div class="line"> icon_cover = soup.select('div.list-wrap > ul > li > div.icon > img')</div><div class="line"> for title_i, app_title_i, item_cover_i, icon_cover_i in zip(title, app_title, item_cover, icon_cover):</div><div class="line"> content = {</div><div class="line"> 'title': title_i.get_text(),</div><div class="line"> 'app_title': app_title_i.get_text(),</div><div class="line"> 'item_cover': item_cover_i['data-original'],</div><div class="line"> 'icon_cover': icon_cover_i['data-original']</div><div class="line"> }</div><div class="line"> col.insert(content)</div><div class="line"> print('成功插入一组数据' + str(content))</div><div class="line"> print('一共用时:' + str(time.time() - start))</div><div class="line"></div><div class="line"></div><div class="line">if __name__ == '__main__':</div><div class="line"> method_1()</div></pre></td></tr></table></figure>
<p>执行这部分的代码后运行时间</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-0354e874cebff02b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="方法一时间"><br>之后mongodb的数据库中就有了这些数据。</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-7bdd7bd9cd8e12f8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="mongodb数据"></p>
<h5 id="方式二-使用Requests-Pool进程池爬取"><a href="#方式二-使用Requests-Pool进程池爬取" class="headerlink" title="方式二:使用Requests + Pool进程池爬取"></a>方式二:使用Requests + Pool进程池爬取</h5><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div></pre></td><td class="code"><pre><div class="line">def method_2(url):</div><div class="line"> html = requests.get(url, headers=headers, proxies=proxies).text</div><div class="line"> soup = BeautifulSoup(html, 'lxml')</div><div class="line"> title = soup.find_all(class_='title')</div><div class="line"> app_title = soup.find_all(class_='app-title')</div><div class="line"> item_cover = soup.find_all(class_='item-cover')</div><div class="line"> icon_cover = soup.select('div.list-wrap > ul > li > div.icon > img')</div><div class="line"> for title_i, app_title_i, item_cover_i, icon_cover_i in zip(title, app_title, item_cover, icon_cover):</div><div class="line"> content = {</div><div class="line"> 'title': title_i.get_text(),</div><div class="line"> 'app_title': app_title_i.get_text(),</div><div class="line"> 'item_cover': item_cover_i['data-original'],</div><div class="line"> 'icon_cover': icon_cover_i['data-original']</div><div class="line"> }</div><div class="line"> # time.sleep(1)</div><div class="line"> col.insert(content)</div><div class="line"> print('成功插入一组数据' + str(content))</div><div class="line"></div><div class="line"></div><div class="line">if __name__ == '__main__':</div><div class="line"> start = time.time()</div><div class="line"> pool = multiprocessing.Pool(4) # 使用4个进程</div><div class="line"> pool.map(method_2, urls) # map函数就是把后面urls列表中的url分别传递给method_2()函数</div><div class="line"> pool.close()</div><div class="line"> pool.join()</div><div class="line"> print('一共用时:' + str(time.time() - start))</div></pre></td></tr></table></figure>
<p>执行这部分的代码后运行时间,确实比方法一快了一些</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-312c9e164e44e4f5.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="方法二时间"></p>
<h5 id="方式三-使用Asyncio-Aiohttp异步IO爬取"><a href="#方式三-使用Asyncio-Aiohttp异步IO爬取" class="headerlink" title="方式三:使用Asyncio + Aiohttp异步IO爬取"></a>方式三:使用Asyncio + Aiohttp异步IO爬取</h5><p>使用这个方法需要对每个函数前面加async,表示成一个异步函数,调用asyncio.get_event_loop创建线程,run_until_complete方法负责安排执行 tasks中的任务。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div></pre></td><td class="code"><pre><div class="line">def method_3():</div><div class="line"> async def get_url(url):</div><div class="line"> async with aiohttp.ClientSession() as session:</div><div class="line"> async with session.get(url) as html:</div><div class="line"> response = await html.text(encoding="utf-8")</div><div class="line"> return response</div><div class="line"></div><div class="line"> async def parser(url):</div><div class="line"> html = await get_url(url)</div><div class="line"> soup = BeautifulSoup(html, 'lxml')</div><div class="line"> title = soup.find_all(class_='title')</div><div class="line"> app_title = soup.find_all(class_='app-title')</div><div class="line"> item_cover = soup.find_all(class_='item-cover')</div><div class="line"> icon_cover = soup.select('div.list-wrap > ul > li > div.icon > img')</div><div class="line"> for title_i, app_title_i, item_cover_i, icon_cover_i in zip(title, app_title, item_cover, icon_cover):</div><div class="line"> content = {</div><div class="line"> 'title': title_i.get_text(),</div><div class="line"> 'app_title': app_title_i.get_text(),</div><div class="line"> 'item_cover': item_cover_i['data-original'],</div><div class="line"> 'icon_cover': icon_cover_i['data-original']</div><div class="line"> }</div><div class="line"> col.insert(content)</div><div class="line"> print('成功插入一组数据' + str(content))</div><div class="line"></div><div class="line"> start = time.time()</div><div class="line"> loop = asyncio.get_event_loop()</div><div class="line"> tasks = [parser(url) for url in urls]</div><div class="line"> loop.run_until_complete(asyncio.gather(*tasks))</div><div class="line"> print(time.time() - start)</div><div class="line"></div><div class="line">if __name__ == '__main__':</div><div class="line"> method_3()</div></pre></td></tr></table></figure></p>
<p>This version is faster still:<br><img src="http://upload-images.jianshu.io/upload_images/4701426-fcccb5fcc9da33b9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Method 3 timing"></p>
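<p>One caveat, my addition rather than the post's: asyncio.gather above fires every request at once, which can hammer the target site. A minimal sketch of capping concurrency with asyncio.Semaphore, assuming the parser coroutine and urls list from method_3:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line">import asyncio</div><div class="line"></div><div class="line">sem = asyncio.Semaphore(10)  # at most 10 requests in flight at a time</div><div class="line"></div><div class="line">async def bounded_parser(url):</div><div class="line">    async with sem:  # wait for a free slot before starting the request</div><div class="line">        await parser(url)</div><div class="line"></div><div class="line">loop = asyncio.get_event_loop()</div><div class="line">loop.run_until_complete(asyncio.gather(*[bounded_parser(u) for u in urls]))</div></pre></td></tr></table></figure>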
<hr>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>使用三种方法爬取数据保存到mongodb,从这里可以看出使用Asyncio + Aiohttp的方法最快,比普通只用requests的方法快很多,如果处理更多的任务的时候使用异步IO是非常有效率的。</p>
<p>My crawler code and the study-notes basics are all on GitHub; if you like them, star and follow and let's learn together! <strong><a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a></strong></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-a7b1dda6931cfdee.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="加油!"></p>
]]></content>
<summary type="html">
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-fdd4ccad379e3f01.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt=""></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>使用requests+BeautifulSoup或者xpath等网页解析工具就可以爬取大部分的网页 ,但是有时爬取的量很大时爬取的速度就让人头疼,今天我就使用三种方式来爬取豌豆荚的设计奖APP相关信息并保存到mongodb,从而对比速度让我们更清楚的认识这些东西用处。<br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="异步" scheme="http://bulolo.cn/tags/%E5%BC%82%E6%AD%A5/"/>
<category term="多进程" scheme="http://bulolo.cn/tags/%E5%A4%9A%E8%BF%9B%E7%A8%8B/"/>
</entry>
<entry>
<title>Python Data Analysis: A Word Cloud of Douyu Danmu</title>
<link href="http://bulolo.cn/2017/06/04/data1/"/>
<id>http://bulolo.cn/2017/06/04/data1/</id>
<published>2017-06-04T03:21:35.000Z</published>
<updated>2017-06-04T10:26:42.383Z</updated>
<content type="html"><![CDATA[<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>上次把斗鱼弹幕数据抓取搞定后,我就拿来试试用词云分析看看效果,简单学习一下。这是弹幕抓拍去分析的对象是斗鱼主播大司马,因为他直播比较搞笑,虽然我不玩游戏,但是之前看他还是有意思。这次我使用的数据是弹幕爬取后保存到text中的,实现代码放在这里:<a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a>,有了这个数据后续就可以使用词云分析了。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-bc5ea565f7056a76.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="大司马老师上课"><br><a id="more"></a></p>
<h1 id="二:遇到的坑"><a href="#二:遇到的坑" class="headerlink" title="二:遇到的坑"></a>二:遇到的坑</h1><p>第一次用需要安装几个插件:jieba,scipy,wordcloud,但是这个wordcloud在win下面容易出错。解决方法是使用<a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/" target="_blank" rel="external">http://www.lfd.uci.edu/~gohlke/pythonlibs/</a> 网站下载对应的模块,然后复制到你的电脑目录中,之后使用命令行工具进入该文件夹中,然后执行一下安装操作,我是python3.6 64位电脑:<strong>pip install wordcloud‑1.3.1‑cp36‑cp36m‑win_amd64.whl</strong>其他版本请下载安装对应模块。</p>
<h1 id="三:简单原理"><a href="#三:简单原理" class="headerlink" title="三:简单原理"></a>三:简单原理</h1><ul>
<li>jieba does the segmentation, breaking a passage into individual words (like the "big bang" word-splitting feature on my Smartisan phone). The <strong>jieba.cut()</strong> function offers three modes:<br>full mode: fast, finds every word it can, but yields ambiguous overlaps;<br>precise mode: cuts as accurately as possible, the best fit for text analysis, and the default;<br>search-engine mode: precise mode plus re-splitting of long words to improve recall.<br>First time through you can ignore the modes, just get a picture out and enjoy it; a tiny demo of the three modes follows this list.</li>
<li>The segmented output then goes to the wordcloud module for the actual word analysis.</li>
<li>Finally matplotlib.pyplot draws the result.</li>
</ul>
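<p>The demo promised above, with an example sentence of my own rather than the post's:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">import jieba</div><div class="line"></div><div class="line">s = '我来到北京清华大学'</div><div class="line">print('/'.join(jieba.cut(s, cut_all=True)))  # full mode: every word it can find</div><div class="line">print('/'.join(jieba.cut(s)))  # precise mode, the default</div><div class="line">print('/'.join(jieba.cut_for_search(s)))  # search-engine mode: re-splits long words</div></pre></td></tr></table></figure>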
<h1 id="四:代码展现"><a href="#四:代码展现" class="headerlink" title="四:代码展现"></a>四:代码展现</h1><ul>
<li>1. Python needs very little code, which makes learning it more comfortable than most languages; the short listing below is all it takes to produce the word cloud. The input document is the scraped danmu - the code is on my GitHub, and it is simply the earlier crawler with the MongoDB output swapped for writing just the danmu text to a txt file.<br><img src="http://upload-images.jianshu.io/upload_images/4701426-9bc9735a6fac3e7f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">import jieba,jieba.analyse</div><div class="line">from wordcloud import WordCloud, ImageColorGenerator</div><div class="line">import matplotlib.pyplot as plt</div><div class="line">import os</div><div class="line">import PIL.Image as Image</div><div class="line">import numpy as np</div></pre></td></tr></table></figure>
</li>
</ul>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div></pre></td><td class="code"><pre><div class="line">with open('大司马即将上课前后.txt','r',encoding='utf-8') as f:</div><div class="line"> text = f.read()</div><div class="line"> f.close()</div><div class="line">cut_text = " ".join(jieba.cut(text)) #使用空格连接 进行中文分词</div><div class="line">cut_an= jieba.analyse.extract_tags(cut_text,30) # 关键词提取,返回权重最高的前30,数字可以不填默认20</div><div class="line"></div><div class="line">d = os.path.dirname(__file__) # 获取当前文件路径</div><div class="line">color_mask = np.array(Image.open(os.path.join(d,'img.jpg'))) # 设置图片</div><div class="line">cloud = WordCloud(</div><div class="line"> background_color='#F0F8FF', # 参数为设置背景颜色,默认颜色则为黑色</div><div class="line"> font_path="FZLTKHK--GBK1-0.ttf", # 使用指定字体可以显示中文,或者修改wordcloud.py文件字体设置并且放入相应字体文件</div><div class="line"> max_words=1000, # 词云显示的最大词数</div><div class="line"> font_step=10, # 步调太大,显示的词语就少了</div><div class="line"> mask=color_mask, #设置背景图片</div><div class="line"> random_state= 15, # 设置有多少种随机生成状态,即有多少种配色方案</div><div class="line"> min_font_size=15, #字体最小值</div><div class="line"> max_font_size=232, #字体最大值</div><div class="line"> )</div><div class="line">cloud.generate(cut_text) #对分词后的文本生成词云</div><div class="line">image_colors = ImageColorGenerator(color_mask) # 从背景图片生成颜色值</div><div class="line">plt.show(cloud.recolor(color_func=image_colors)) # 绘制时用背景图片做为颜色的图片</div><div class="line">plt.imshow(cloud) # 以图片的形式显示词云</div><div class="line">plt.axis('off') # 关闭坐标轴</div><div class="line">plt.show() # 展示图片</div><div class="line"></div><div class="line">cloud.to_file(os.path.join(d, 'pic.jpg')) # 图片大小将会按照 mask 保存</div></pre></td></tr></table></figure>
<ul>
<li>2. Run it and the cloud appears:</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-706611c5b3e75830.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="直播开始前弹幕"></p>
<p>I then ran the same analysis on the danmu from a stretch of time after Douyu streamer Wuhu Da Sima's stream had started; the result:</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-209d70682cd2c819.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="直播开始后弹幕"></p>
<ul>
<li>3. The clouds above are my own analysis; I also fed the danmu from the two time slices into <a href="http://www.bluemc.cn/" target="_blank" rel="external">bluemc</a>, which renders them like this:</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-3a97d0538b12521f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-581351180717765b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>通过两个时间段的词云分析,可以看的出来观众说的最多的,关注点是哪些。这次我做的词云也很简单,后续在研究研究让它更美观一些,精准一些。<br>贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友可以点击 start follw一起学习交流吧!<strong><a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a></strong></p>
]]></content>
<summary type="html">
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>上次把斗鱼弹幕数据抓取搞定后,我就拿来试试用词云分析看看效果,简单学习一下。这是弹幕抓拍去分析的对象是斗鱼主播大司马,因为他直播比较搞笑,虽然我不玩游戏,但是之前看他还是有意思。这次我使用的数据是弹幕爬取后保存到text中的,实现代码放在这里:<a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a>,有了这个数据后续就可以使用词云分析了。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-bc5ea565f7056a76.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="大司马老师上课"><br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="弹幕" scheme="http://bulolo.cn/tags/%E5%BC%B9%E5%B9%95/"/>
<category term="数据" scheme="http://bulolo.cn/tags/%E6%95%B0%E6%8D%AE/"/>
<category term="词云" scheme="http://bulolo.cn/tags/%E8%AF%8D%E4%BA%91/"/>
</entry>
<entry>
<title>Python Crawler Diary 8: Scraping Douyu Danmu in Real Time via the API</title>
<link href="http://bulolo.cn/2017/06/03/spider8/"/>
<id>http://bulolo.cn/2017/06/03/spider8/</id>
<published>2017-06-03T01:21:35.000Z</published>
<updated>2017-06-04T10:34:49.228Z</updated>
<content type="html"><![CDATA[<p><img src="http://upload-images.jianshu.io/upload_images/4701426-794cd62ab5883346.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="斗鱼"></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>这些天一直想做一个斗鱼爬取弹幕,但是socket搞的不清楚,而且这个斗鱼的api接口虽然开放了但是我在github上没有找到可以完美使用的代码。我看了好多文章,学了写然后总结一下。也为后面数据分析做准备,后面先对弹幕简单词云化,然后再对各个房间的数据可视化。<br>代码地址:<strong><a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a></strong><br><a id="more"></a></p>
<hr>
<blockquote>
<p>The room scraped this time belongs to the Douyu streamer Wuhu Da Sima, whose big audience makes the analysis easier (he's also from my home town, heh). Each danmu's uid, nickname, level and text get saved to MongoDB.</p>
</blockquote>
<p><strong>A look at the result first</strong></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-7e8fe2be0f306467.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br><img src="http://upload-images.jianshu.io/upload_images/4701426-75877568d0c8960a.gif?imageMogr2/auto-orient/strip" alt="GIF.gif"></p>
<hr>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li>IDE:Pycharm</li>
<li>Python3.6</li>
<li>pymongo 3.4.0</li>
</ul>
<hr>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>首先要想爬取弹幕要看看官方的<a href="https://github.com/rieuse/learnPython/blob/master/%E6%96%97%E9%B1%BC%E7%9B%B4%E6%92%AD%E5%BC%B9%E5%B9%95%E6%8A%93%E5%8F%96/%E6%96%97%E9%B1%BC%E5%BC%B9%E5%B9%95%E6%9C%8D%E5%8A%A1%E5%99%A8%E7%AC%AC%E4%B8%89%E6%96%B9%E6%8E%A5%E5%85%A5%E5%8D%8F%E8%AE%AEv1.4.1.pdf" target="_blank" rel="external">开发文档</a>。</p>
<ul>
<li>Point 1: how a message is framed:<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">def sendmsg(msgstr):</div><div class="line">    msg = msgstr.encode('utf-8')</div><div class="line">    data_length = len(msg) + 8</div><div class="line">    code = 689</div><div class="line">    msgHead = int.to_bytes(data_length, 4, 'little') \</div><div class="line">        + int.to_bytes(data_length, 4, 'little') + int.to_bytes(code, 4, 'little')</div><div class="line">    client.send(msgHead)</div><div class="line">    sent = 0</div><div class="line">    while sent < len(msg):</div><div class="line">        tn = client.send(msg[sent:])</div><div class="line">        sent = sent + tn</div></pre></td></tr></table></figure>
</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-6b585b39e45d96b2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<ul>
<li>Point 2: the login request; pass it to sendmsg to send it:<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">msg = 'type@=loginreq/username@=rieuse/password@=douyu/roomid@={}/\0'.format(roomid)</div><div class="line">sendmsg(msg)</div></pre></td></tr></table></figure>
</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-68bcbcf41c0cfb8c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<ul>
<li>Point 3: joining the danmu group so the server starts sending messages:<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">msg_more = 'type@=joingroup/rid@={}/gid@=-9999/\0'.format(roomid)</div><div class="line">sendmsg(msg_more)</div></pre></td></tr></table></figure>
</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-695f8c0712a62fc4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<ul>
<li><p>Point 4: keeping the session alive with a heartbeat</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">def keeplive():</div><div class="line"> while True:</div><div class="line"> msg = 'type@=keeplive/tick@=' + str(int(time.time())) + '/\0'</div><div class="line"> sendmsg(msg)</div><div class="line"> time.sleep(15)</div></pre></td></tr></table></figure>
</li>
<li><p>Point 5: decode the received bytes into an encoding we can read, then save them to MongoDB (a text file works too).</p>
</li>
</ul>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-1c54edfbb01f0a88.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<hr>
<ul>
<li>Additional notes<blockquote>
<p>That covers the API's main functions; what remains is the concrete implementation, which has a few details:</p>
<ul>
<li>1. The user enters a room number and we fetch the room's description.</li>
<li>2. After sending the request, Douyu's reply arrives as binary, so the data has to be decoded.</li>
<li>3. The fields captured are each danmu's uid, nickname, level and text. Some users have an empty level, which would cause errors downstream if left alone, so it gets a default:<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">if not level_more:</div><div class="line">    level_more = [b'0']</div></pre></td></tr></table></figure>
</li>
</ul>
</blockquote>
</li>
</ul>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><p><a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">点击查看完整代码</a><br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div><div class="line">72</div><div class="line">73</div><div class="line">74</div><div class="line">75</div><div class="line">76</div><div class="line">77</div><div class="line">78</div><div class="line">79</div><div class="line">80</div><div class="line">81</div><div class="line">82</div><div class="line">83</div><div class="line">84</div><div class="line">85</div><div class="line">86</div><div class="line">87</div></pre></td><td class="code"><pre><div class="line">import multiprocessing</div><div class="line">import socket</div><div class="line">import time</div><div class="line">import re</div><div class="line">import pymongo</div><div class="line">import requests</div><div class="line">from bs4 import BeautifulSoup</div><div class="line"></div><div class="line">clients = pymongo.MongoClient('localhost')</div><div class="line">db = clients["DouyuTV_danmu"]</div><div class="line">col = db["info"]</div><div class="line"></div><div class="line">client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)</div><div class="line">host = socket.gethostbyname("openbarrage.douyutv.com")</div><div class="line">port = 8601</div><div class="line">client.connect((host, port))</div><div class="line"></div><div class="line">danmu_path = re.compile(b'txt@=(.+?)/cid@')</div><div class="line">uid_path = re.compile(b'uid@=(.+?)/nn@')</div><div class="line">nickname_path = re.compile(b'nn@=(.+?)/txt@')</div><div class="line">level_path = 
re.compile(b'level@=([1-9][0-9]?)/sahf')</div><div class="line"></div><div class="line">def sendmsg(msgstr):</div><div class="line"> msg = msgstr.encode('utf-8')</div><div class="line"> data_length = len(msg) + 8</div><div class="line"> code = 689</div><div class="line"> msgHead = int.to_bytes(data_length, 4, 'little') \</div><div class="line"> + int.to_bytes(data_length, 4, 'little') + int.to_bytes(code, 4, 'little')</div><div class="line"> client.send(msgHead)</div><div class="line"> sent = 0</div><div class="line"> while sent < len(msg):</div><div class="line"> tn = client.send(msg[sent:])</div><div class="line"> sent = sent + tn</div><div class="line"></div><div class="line"></div><div class="line">def start(roomid):</div><div class="line"> msg = 'type@=loginreq/username@=rieuse/password@=douyu/roomid@={}/\0'.format(roomid)</div><div class="line"> sendmsg(msg)</div><div class="line"> msg_more = 'type@=joingroup/rid@={}/gid@=-9999/\0'.format(roomid)</div><div class="line"> sendmsg(msg_more)</div><div class="line"></div><div class="line"> print('---------------欢迎连接到{}的直播间---------------'.format(get_name(roomid)))</div><div class="line"> while True:</div><div class="line"> data = client.recv(1024)</div><div class="line"> uid_more = uid_path.findall(data)</div><div class="line"> nickname_more = nickname_path.findall(data)</div><div class="line"> level_more = level_path.findall(data)</div><div class="line"> danmu_more = danmu_path.findall(data)</div><div class="line"> if not level_more:</div><div class="line"> level_more = b'0'</div><div class="line"> if not data:</div><div class="line"> break</div><div class="line"> else:</div><div class="line"> for i in range(0, len(danmu_more)):</div><div class="line"> try:</div><div class="line"> product = {</div><div class="line"> 'uid': uid_more[0].decode(encoding='utf-8'),</div><div class="line"> 'nickname': nickname_more[0].decode(encoding='utf-8'),</div><div class="line"> 'level': level_more[0].decode(encoding='utf-8'),</div><div class="line"> 'danmu': danmu_more[0].decode(encoding='utf-8')</div><div class="line"> }</div><div class="line"> print(product)</div><div class="line"> col.insert(product)</div><div class="line"> print('成功导入mongodb')</div><div class="line"> except Exception as e:</div><div class="line"> print(e)</div><div class="line"></div><div class="line"></div><div class="line">def keeplive():</div><div class="line"> while True:</div><div class="line"> msg = 'type@=keeplive/tick@=' + str(int(time.time())) + '/\0'</div><div class="line"> sendmsg(msg)</div><div class="line"> time.sleep(15)</div><div class="line"></div><div class="line"></div><div class="line">def get_name(roomid):</div><div class="line"> r = requests.get("http://www.douyu.com/" + roomid)</div><div class="line"> soup = BeautifulSoup(r.text, 'lxml')</div><div class="line"> return soup.find('a', {'class', 'zb-name'}).string</div><div class="line"></div><div class="line"></div><div class="line">if __name__ == '__main__':</div><div class="line"> room_id = input('请出入房间ID: ')</div><div class="line"> p1 = multiprocessing.Process(target=start, args=(room_id,))</div><div class="line"> p2 = multiprocessing.Process(target=keeplive)</div><div class="line"> p1.start()</div><div class="line"> p2.start()</div></pre></td></tr></table></figure></p>
<h1 id="五:弹幕的后续使用"><a href="#五:弹幕的后续使用" class="headerlink" title="五:弹幕的后续使用"></a>五:弹幕的后续使用</h1><p>这里我们是将弹幕的几个信息,uid,用户昵称,等级,弹幕内容保存到mongodb,后续要对数据分析就可以直接拿出来,如果我们只需要弹幕那么就可以只把弹幕信息保存到txt文档中就行了。<br>贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友可以点击 start follw一起学习交流吧!<strong><a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a></strong></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-607dafc482f194fd.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="加油!"></p>
]]></content>
<summary type="html">
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-794cd62ab5883346.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="斗鱼"></p>
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>这些天一直想做一个斗鱼爬取弹幕,但是socket搞的不清楚,而且这个斗鱼的api接口虽然开放了但是我在github上没有找到可以完美使用的代码。我看了好多文章,学了写然后总结一下。也为后面数据分析做准备,后面先对弹幕简单词云化,然后再对各个房间的数据可视化。<br>代码地址:<strong><a href="https://github.com/rieuse/DouyuTV" target="_blank" rel="external">github.com/rieuse/DouyuTV</a></strong><br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="弹幕" scheme="http://bulolo.cn/tags/%E5%BC%B9%E5%B9%95/"/>
</entry>
<entry>
<title>Python Crawler Diary 7: Batch-Scraping and Saving HD Images from Huaban</title>
<link href="http://bulolo.cn/2017/05/21/%E7%88%AC%E8%99%AB7/"/>
<id>http://bulolo.cn/2017/05/21/爬虫7/</id>
<published>2017-05-21T12:21:35.000Z</published>
<updated>2017-05-26T07:24:14.059Z</updated>
<content type="html"><![CDATA[<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>嘀嘀嘀,上车请刷卡。昨天看到了不错的图片分享网——<a href="http://huaban.com/boards/favorite/beauty/" target="_blank" rel="external">花瓣</a>,里面的图片质量还不错,所以利用selenium+xpath我把它的妹子的栏目下爬取了下来,以图片栏目名称给文件夹命名分类保存到电脑中。这个妹子主页<a href="http://huaban.com/boards/favorite/beauty" target="_blank" rel="external">http://huaban.com/boards/favorite/beauty</a> 是动态加载的,如果想获取更多内容可以模拟下拉,这样就可以更多的图片资源。这种之前爬虫中也做过,但是因为网速不够快所以我就抓了19个栏目,一共500多张美图,也已经很满意了。<br><strong>先看看效果:</strong><br><img src="http://upload-images.jianshu.io/upload_images/4701426-fc6379612e2fe8d2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<a id="more"></a>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-ed7262dc8ff7c969.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li>IDE:Pycharm</li>
<li>Python3.6</li>
<li>lxml 3.7.2</li>
<li>Selenium 3.4.0</li>
<li>requests 2.12.4</li>
</ul>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>1.这次爬虫我开始做的思路是:进入这个网页<a href="http://huaban.com/boards/favorite/beauty" target="_blank" rel="external">http://huaban.com/boards/favorite/beauty</a> 然后来获取所有的图片栏目对应网址,然后进入每一个网页中去获取全部图片。(如下图所示)</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-70dc0b6228b7c976.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-f8129fa85d8817fb.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<p>2. But the images fetched that way are 236x354 thumbnails, not good enough. It was already past 1:30 a.m., so the next day I built a second version on top of the first: it also opens the page behind each thumbnail and grabs the high-definition image, like the one below.</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-8e96a09519b2fa7b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><p>1.第一步导入本次爬虫需要的模块<br><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">__author__ = <span class="string">'布咯咯_rieuse'</span></div><div class="line"><span class="keyword">from</span> selenium.webdriver.common.by <span class="keyword">import</span> By</div><div class="line"><span class="keyword">from</span> selenium.webdriver.support <span class="keyword">import</span> expected_conditions <span class="keyword">as</span> EC</div><div class="line"><span class="keyword">from</span> selenium.webdriver.support.ui <span class="keyword">import</span> WebDriverWait</div><div class="line"><span class="keyword">from</span> selenium <span class="keyword">import</span> webdriver</div><div class="line"><span class="keyword">import</span> requests</div><div class="line"><span class="keyword">import</span> lxml.html</div><div class="line"><span class="keyword">import</span> os</div></pre></td></tr></table></figure></p>
<p>2. Next comes the choice of webdriver, i.e. which browser to simulate: Firefox lets you watch the process, while the headless PhantomJS fetches resources quickly. ['--load-images=false', '--disk-cache=true'] means the simulated browser loads no images and uses the disk cache, which speeds things up a little. WebDriverWait caps the wait for the browser to load at 10 seconds, and set_window_size sets the simulated window size; on some sites, certain resources simply don't load if the window is too small.<br><figure class="highlight py"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']</span></div><div class="line"><span class="comment"># browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)</span></div><div class="line">browser = webdriver.Firefox()</div><div class="line">wait = WebDriverWait(browser, <span class="number">10</span>)</div><div class="line">browser.set_window_size(<span class="number">1400</span>, <span class="number">900</span>)</div></pre></td></tr></table></figure></p>
<p>3. parser(url, param) fetches and parses a page; the same lines are needed several times later, so a function keeps the code tidy. Its two parameters are the URL and the target of the explicit wait, which can be some block, button, image, etc. on the page.<br><figure class="highlight py"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">parser</span><span class="params">(url, param)</span>:</span></div><div class="line">    browser.get(url)</div><div class="line">    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, param)))</div><div class="line">    html = browser.page_source</div><div class="line">    doc = lxml.html.fromstring(html)</div><div class="line">    <span class="keyword">return</span> doc</div></pre></td></tr></table></figure></p>
<p>4. The code below parses the main page <a href="http://huaban.com/boards/favorite/beauty/" target="_blank" rel="external">http://huaban.com/boards/favorite/beauty/</a> and pulls out every board's URL and name; with the page in developer mode, work as shown in the screenshot to build the xpath. The board name is used to create a folder on disk, and some names contain characters that are illegal in file names and must be stripped; in my case a single * was the offender (a more thorough sanitizer is sketched after the screenshot below).<br><figure class="highlight py"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_main_url</span><span class="params">()</span>:</span></div><div class="line">    print(<span class="string">'Opening the main page, looking for links...'</span>)</div><div class="line">    <span class="keyword">try</span>:</div><div class="line">        doc = parser(<span class="string">'http://huaban.com/boards/favorite/beauty/'</span>, <span class="string">'#waterfall'</span>)</div><div class="line">        name = doc.xpath(<span class="string">'//*[@id="waterfall"]/div/a[1]/div[2]/h3/text()'</span>)</div><div class="line">        u = doc.xpath(<span class="string">'//*[@id="waterfall"]/div/a[1]/@href'</span>)</div><div class="line">        <span class="keyword">for</span> item, fileName <span class="keyword">in</span> zip(u, name):</div><div class="line">            main_url = <span class="string">'http://huaban.com'</span> + item</div><div class="line">            print(<span class="string">'Found board link '</span> + main_url)</div><div class="line">            <span class="keyword">if</span> <span class="string">'*'</span> <span class="keyword">in</span> fileName:</div><div class="line">                fileName = fileName.replace(<span class="string">'*'</span>, <span class="string">''</span>)</div><div class="line">            download(main_url, fileName)</div><div class="line">    <span class="keyword">except</span> Exception <span class="keyword">as</span> e:</div><div class="line">        print(e)</div></pre></td></tr></table></figure></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-15c45e7520131b7f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<p>5. With the board URLs and names collected, each board page gets parsed in turn. It only contains low-resolution thumbnails that we don't want, so the crawler enters each thumbnail's own page and parses out the real high-definition image URL. One more trap here: within a single board the images sit in two different DOM layouts, so I query both:<br><figure class="highlight py"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">img_url = doc.xpath(<span class="string">'//*[@id="baidu_image_holder"]/a/img/@src'</span>)</div><div class="line">img_url2 = doc.xpath(<span class="string">'//*[@id="baidu_image_holder"]/img/@src'</span>)</div></pre></td></tr></table></figure></p>
<p>That fetches the image addresses from both DOM layouts; the two lists are then merged:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">img_url +=img_url2</div></pre></td></tr></table></figure>
<p>Saving locally uses</p>
<figure class="highlight py"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">filename = <span class="string">'image\\{}\\'</span>.format(fileName) + str(i) + <span class="string">'.jpg'</span></div></pre></td></tr></table></figure>
<p>which means files are saved under an image directory beside the crawler script, with each picture placed in the sub-folder named after its board.</p>
<figure class="highlight py"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">download</span><span class="params">(main_url, fileName)</span>:</span></div><div class="line"> print(<span class="string">'-------准备下载中-------'</span>)</div><div class="line"> <span class="keyword">try</span>:</div><div class="line"> doc = parser(main_url, <span class="string">'# waterfall'</span>)</div><div class="line"> <span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(<span class="string">'image\\'</span> + fileName):</div><div class="line"> print(<span class="string">'创建文件夹...'</span>)</div><div class="line"> os.makedirs(<span class="string">'image\\'</span> + fileName)</div><div class="line"> link = doc.xpath(<span class="string">'//*[@id="waterfall"]/div/a/@href'</span>)</div><div class="line"> <span class="comment"># print(link)</span></div><div class="line"> i = <span class="number">0</span></div><div class="line"> <span class="keyword">for</span> item <span class="keyword">in</span> link:</div><div class="line"> i += <span class="number">1</span></div><div class="line"> minor_url = <span class="string">'http://huaban.com'</span> + item</div><div class="line"> doc = parser(minor_url, <span class="string">'# pin_view_page'</span>)</div><div class="line"> img_url = doc.xpath(<span class="string">'//*[@id="baidu_image_holder"]/a/img/@src'</span>)</div><div class="line"> img_url2 = doc.xpath(<span class="string">'//*[@id="baidu_image_holder"]/img/@src'</span>)</div><div class="line"> img_url +=img_url2</div><div class="line"> <span class="keyword">try</span>:</div><div class="line"> url = <span class="string">'http:'</span> + str(img_url[<span class="number">0</span>])</div><div class="line"> print(<span class="string">'正在下载第'</span> + str(i) + <span class="string">'张图片,地址:'</span> + url)</div><div class="line"> r = requests.get(url)</div><div class="line"> filename = <span class="string">'image\\{}\\'</span>.format(fileName) + str(i) + <span class="string">'.jpg'</span></div><div class="line"> <span class="keyword">with</span> open(filename, <span class="string">'wb'</span>) <span class="keyword">as</span> fo:</div><div class="line"> fo.write(r.content)</div><div class="line"> <span class="keyword">except</span> Exception:</div><div class="line"> print(<span class="string">'出错了!'</span>)</div><div class="line"> <span class="keyword">except</span> Exception:</div><div class="line"> print(<span class="string">'出错啦!'</span>)</div><div class="line"></div><div class="line"></div><div class="line"><span class="keyword">if</span> __name__ == <span 
class="string">'__main__'</span>:</div><div class="line"> get_main_url()</div></pre></td></tr></table></figure>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>这次爬虫继续练习了Selenium和xpath的使用,在网页分析的时候也遇到很多问题,只有不断练习才能把自己不会部分减少,当然这次爬取了500多张妹纸还是挺养眼的。<br>贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友一起学习交流吧!<strong><em><a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></em></strong></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-a2c3ad6cc56eed72.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
]]></content>
<summary type="html">
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>嘀嘀嘀,上车请刷卡。昨天看到了不错的图片分享网——<a href="http://huaban.com/boards/favorite/beauty/" target="_blank" rel="external">花瓣</a>,里面的图片质量还不错,所以利用selenium+xpath我把它的妹子的栏目下爬取了下来,以图片栏目名称给文件夹命名分类保存到电脑中。这个妹子主页<a href="http://huaban.com/boards/favorite/beauty" target="_blank" rel="external">http://huaban.com/boards/favorite/beauty</a> 是动态加载的,如果想获取更多内容可以模拟下拉,这样就可以更多的图片资源。这种之前爬虫中也做过,但是因为网速不够快所以我就抓了19个栏目,一共500多张美图,也已经很满意了。<br><strong>先看看效果:</strong><br><img src="http://upload-images.jianshu.io/upload_images/4701426-fc6379612e2fe8d2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="图片" scheme="http://bulolo.cn/tags/%E5%9B%BE%E7%89%87/"/>
</entry>
<entry>
<title>Python Crawler Diary 6: Scraping Amazon into MongoDB with Selenium + XPath + bs4</title>
<link href="http://bulolo.cn/2017/05/19/%E7%88%AC%E8%99%AB6/"/>
<id>http://bulolo.cn/2017/05/19/爬虫6/</id>
<published>2017-05-19T13:21:29.000Z</published>
<updated>2017-05-26T07:23:35.442Z</updated>
<content type="html"><![CDATA[<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><blockquote>
<p>上周末非常开心,第一次去北京然后参见了zealer和夸克浏览器的联合线下沙龙会议,和大家交流很多收获很多,最让我吃惊的是他们团队非常年轻就有各种能力,每个人都很强。一个结论:我要继续努力!<br>贴上我们的合影,我很帅!:)</p>
</blockquote>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-86fe7ad392d957de.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="zealer&夸克浏览器.jpg"></p>
<a id="more"></a>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-e7a4abb83027103a.JPG?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="夸克浏览器合影.JPG"></p>
<blockquote>
<p>This crawler uses Selenium to simulate typing a search keyword (I tested various book titles) and then saves the matching products from every page of results to MongoDB. All sorts of problems surfaced along the way: many sites can't be parsed cleanly in one pass, and Amazon is a bit odd. The fields extracted are the product name, image URL, price and date; since my starting goal was book keywords, the date is the publication date.</p>
</blockquote>
<p>For the keyword 'python', 300 records were scraped, as shown.<br><img src="http://upload-images.jianshu.io/upload_images/4701426-cc1ce81fb481de52.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="mongodb数据.png"></p>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li>IDE:Pycharm</li>
<li>Python3.6</li>
<li>Selenium 3.4.0</li>
<li>pymongo 3.3.0</li>
<li>BeautifulSoup 4.5.3</li>
</ul>
<h1 id="三:-爬虫中重要(keng)的部分"><a href="#三:-爬虫中重要(keng)的部分" class="headerlink" title="三: 爬虫中重要(keng)的部分"></a>三: 爬虫中重要(keng)的部分</h1><ul>
<li>BeautifulSoup could not extract the product date, and regular expressions failed too; in the end only xpath got it out.</li>
<li>Every product block has its own independent id rather than a shared class, so a regular expression is a good fit for matching them all.</li>
<li>The product name, image URL and price come from BeautifulSoup, while the date comes from xpath; to pack them into one dict per product and write it to MongoDB, zip() is the tool, iterating the two lists together like this (a note on zip's length behaviour follows this list):<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">for item, time in zip(content, date)</div></pre></td></tr></table></figure>
</li>
</ul>
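<p>The zip note promised above, mine rather than the post's: zip stops at the shorter input, so if the xpath date list comes back one item short, the extra products are dropped silently rather than raising an error:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">titles = ['book1', 'book2', 'book3']</div><div class="line">dates = ['2017-01', '2017-02']  # one date missing</div><div class="line">print(list(zip(titles, dates)))  # [('book1', '2017-01'), ('book2', '2017-02')] - book3 is lost</div></pre></td></tr></table></figure>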
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div><div class="line">72</div><div class="line">73</div><div class="line">74</div><div class="line">75</div><div class="line">76</div><div class="line">77</div><div class="line">78</div><div class="line">79</div><div class="line">80</div><div class="line">81</div><div class="line">82</div><div class="line">83</div><div class="line">84</div><div class="line">85</div><div class="line">86</div><div class="line">87</div><div class="line">88</div><div class="line">89</div><div class="line">90</div><div class="line">91</div><div class="line">92</div><div class="line">93</div><div class="line">94</div><div class="line">95</div><div class="line">96</div><div class="line">97</div><div class="line">98</div><div class="line">99</div><div class="line">100</div><div class="line">101</div></pre></td><td class="code"><pre><div class="line">from selenium.common.exceptions import TimeoutException</div><div class="line">from selenium.webdriver.common.by import By</div><div class="line">from selenium.webdriver.support import expected_conditions as EC</div><div class="line">from selenium.webdriver.support.ui import WebDriverWait</div><div class="line">from selenium import webdriver</div><div class="line">from bs4 import BeautifulSoup</div><div class="line">import lxml.html</div><div class="line">import pymongo</div><div class="line">import re</div><div class="line"></div><div class="line">MONGO_URL = 'localhost'</div><div class="line">MONGO_DB = 'amazon'</div><div class="line">MONGO_TABLE = 'amazon-python'</div><div class="line">SERVICE_ARGS = 
['--load-images=false', '--disk-cache=true']</div><div class="line">KEYWORD = 'python'</div><div class="line">client = pymongo.MongoClient(MONGO_URL)</div><div class="line">db = client[MONGO_DB]</div><div class="line"></div><div class="line">browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)</div><div class="line"># browser = webdriver.Firefox()</div><div class="line">wait = WebDriverWait(browser, 10)</div><div class="line">browser.set_window_size(1400, 900)</div><div class="line"></div><div class="line"></div><div class="line">def search():</div><div class="line"> print('正在搜索')</div><div class="line"> try:</div><div class="line"> browser.get('https://www.amazon.cn/')</div><div class="line"> input = wait.until(</div><div class="line"> EC.presence_of_element_located((By.CSS_SELECTOR, '# twotabsearchtextbox'))</div><div class="line"> )</div><div class="line"> submit = wait.until(</div><div class="line"> EC.element_to_be_clickable((By.CSS_SELECTOR, '# nav-search > form > div.nav-right > div > input')))</div><div class="line"> input.send_keys(KEYWORD)</div><div class="line"> submit.click()</div><div class="line"> total = wait.until(</div><div class="line"> EC.presence_of_element_located((By.CSS_SELECTOR, '# pagn > span.pagnDisabled')))</div><div class="line"> get_products()</div><div class="line"> print('一共' + total.text + '页')</div><div class="line"> return total.text</div><div class="line"> except TimeoutException:</div><div class="line"> return search()</div><div class="line"></div><div class="line"></div><div class="line">def next_page(number):</div><div class="line"> print('正在翻页', number)</div><div class="line"> try:</div><div class="line"> wait.until(EC.text_to_be_present_in_element(</div><div class="line"> (By.CSS_SELECTOR, '# pagnNextString'), '下一页'))</div><div class="line"> submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '# pagnNextString')))</div><div class="line"> submit.click()</div><div class="line"> wait.until(EC.text_to_be_present_in_element(</div><div class="line"> (By.CSS_SELECTOR, '.pagnCur'), str(number)))</div><div class="line"> get_products()</div><div class="line"> except TimeoutException:</div><div class="line"> next_page(number)</div><div class="line"></div><div class="line"></div><div class="line">def get_products():</div><div class="line"> try:</div><div class="line"> wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '# s-results-list-atf')))</div><div class="line"> html = browser.page_source</div><div class="line"> soup = BeautifulSoup(html, 'lxml')</div><div class="line"> doc = lxml.html.fromstring(html)</div><div class="line"> date = doc.xpath('//*[@class="s-result-item celwidget "]/div/div[2]/div[1]/span[2]/text()')</div><div class="line"> content = soup.find_all(attrs={"id": re.compile(r'result_\d+')})</div><div class="line"> for item, time in zip(content, date):</div><div class="line"> product = {</div><div class="line"> 'title': item.find(class_='s-access-title').get_text(),</div><div class="line"> 'image': item.find(class_='s-access-image cfMarker').get('src'),</div><div class="line"> 'price': item.find(class_='a-size-base a-color-price s-price a-text-bold').get_text(),</div><div class="line"> 'date': time</div><div class="line"> }</div><div class="line"> save_to_mongo(product)</div><div class="line"> print(product)</div><div class="line"> except Exception as e:</div><div class="line"> print(e)</div><div class="line"></div><div class="line"></div><div class="line">def save_to_mongo(result):</div><div class="line"> 
try:</div><div class="line"> if db[MONGO_TABLE].insert(result):</div><div class="line"> print('存储到mongodb成功', result)</div><div class="line"> except Exception:</div><div class="line"> print('存储到mongodb失败', result)</div><div class="line"></div><div class="line"></div><div class="line">def main():</div><div class="line"> try:</div><div class="line"> total = search()</div><div class="line"> total = int(re.compile('(\d+)').search(total).group(1))</div><div class="line"> for i in range(2, total + 1):</div><div class="line"> next_page(i)</div><div class="line"> except Exception as e:</div><div class="line"> print('出错啦', e)</div><div class="line"> finally:</div><div class="line"> browser.close()</div><div class="line"></div><div class="line"></div><div class="line">if __name__ == '__main__':</div><div class="line"> main()</div></pre></td></tr></table></figure>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>这次学习的东西还是很多,selenium用的模块很多,也利用了无头浏览器PhantomJS的不加载图片和缓存。爬取数据的时候使用了不同的方式,并用zip函数一起迭代保存为字典成功导入到mongodb中。<br>贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友一起学习交流吧!<strong><em><a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></em></strong></p>
]]></content>
<summary type="html">
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><blockquote>
<p>Last weekend was great: my first trip to Beijing, for the joint offline meetup of zealer and the Quark browser team. I talked with a lot of people and took away a lot; what surprised me most is how young the team is and how capable every one of them is. One conclusion: I must keep working hard!<br>Here's the group photo; I look sharp! :)</p>
</blockquote>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-86fe7ad392d957de.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="zealer&amp;夸克浏览器.jpg"></p>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
</entry>
<entry>
<title>Fibonacci in Python: Implementations and Optimizations</title>
<link href="http://bulolo.cn/2017/05/15/python1/"/>
<id>http://bulolo.cn/2017/05/15/python1/</id>
<published>2017-05-15T05:04:42.000Z</published>
<updated>2017-05-26T07:24:02.341Z</updated>
<content type="html"><![CDATA[<blockquote>
<p>Fibonacci questions come up in interviews all the time, so I read some material and am noting down these small points.</p>
</blockquote>
<h3 id="1-元组实现"><a href="#1-元组实现" class="headerlink" title="1. 元组实现"></a>1. 元组实现</h3><p>代码:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">fibs = [0, 1]</div><div class="line">for i in range(8):</div><div class="line"> fibs.append(fibs[-2] + fibs[-1])</div><div class="line">print(fibs)</div></pre></td></tr></table></figure></p>
<a id="more"></a>
<p>Output:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]</div></pre></td></tr></table></figure></p>
<h3 id="2-迭代器实现"><a href="#2-迭代器实现" class="headerlink" title="2. 迭代器实现"></a>2. 迭代器实现</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">class Fibs:</div><div class="line"> def __init__(self):</div><div class="line"> self.a = 0</div><div class="line"> self.b = 1</div><div class="line"></div><div class="line"> def next(self):</div><div class="line"> self.a, self.b = self.b, self.a + self.b</div><div class="line"> return self.a</div><div class="line"></div><div class="line"> def __iter__(self):</div><div class="line"> return self</div></pre></td></tr></table></figure>
<p>This yields an infinite sequence, which can be consumed like so:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">fibs = Fibs()</div><div class="line">for f in fibs:</div><div class="line">    if f > 1000:</div><div class="line">        print(f)</div><div class="line">        break</div><div class="line">    else:</div><div class="line">        print(f)</div></pre></td></tr></table></figure></p>
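<p>A small addition of mine: itertools can take a fixed number of terms from the infinite iterator without writing the break by hand:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">from itertools import islice</div><div class="line"></div><div class="line">print(list(islice(Fibs(), 10)))  # first 10 terms: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]</div></pre></td></tr></table></figure>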
<h3 id="3-通过定制类实现"><a href="#3-通过定制类实现" class="headerlink" title="3. 通过定制类实现"></a>3. 通过定制类实现</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line">class Fib(object):</div><div class="line"> def __getitem__(self, n):</div><div class="line"> if isinstance(n, int):</div><div class="line"> a, b = 1, 1</div><div class="line"> for x in range(n):</div><div class="line"> a, b = b, a + b</div><div class="line"> return a</div><div class="line"> elif isinstance(n, slice):</div><div class="line"> start = n.start</div><div class="line"> stop = n.stop</div><div class="line"> a, b = 1, 1</div><div class="line"> L = []</div><div class="line"> for x in range(stop):</div><div class="line"> if x >= start:</div><div class="line"> L.append(a)</div><div class="line"> a, b = b, a + b</div><div class="line"> return L</div><div class="line"> else:</div><div class="line"> raise TypeError("Fib indices must be integers")</div></pre></td></tr></table></figure>
<p>This gives a sequence-like structure whose data can be accessed by index or slice:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">f = Fib()</div><div class="line">print(f[0:10])</div></pre></td></tr></table></figure></p>
<h3 id="4-Python实现比较简易的斐波那契数列示例"><a href="#4-Python实现比较简易的斐波那契数列示例" class="headerlink" title="4.Python实现比较简易的斐波那契数列示例"></a>4.Python实现比较简易的斐波那契数列示例</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">i, j = 0, 1</div><div class="line">while i < 10000:</div><div class="line"> print( i,j, = j, i+j)</div></pre></td></tr></table></figure>
<p>The output:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765</div></pre></td></tr></table></figure></p>
<h3 id="5-列表生成式实现"><a href="#5-列表生成式实现" class="headerlink" title="5.列表生成式实现"></a>5.列表生成式实现</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">def fib(n):</div><div class="line"> if n == 1 or n == 0:</div><div class="line"> return 1</div><div class="line"> else:</div><div class="line"> return fib(n - 2) + fib(n - 1)</div><div class="line">print([fib(n) for n in range(10)])</div></pre></td></tr></table></figure>
<p>Computing the first n terms this way is simple, but as the diagram below shows it takes a long time, because the same values get recomputed over and over.</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-be5adfbba96c60a6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>这个时候我需要修改一下,加入<strong>缓存</strong>机制。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">def fib(n, cache=None):</div><div class="line"> if cache is None:</div><div class="line"> cache = {}</div><div class="line"> if n in cache:</div><div class="line"> return cache[n]</div><div class="line"> if n == 1 or n == 0:</div><div class="line"> return 1</div><div class="line"> else:</div><div class="line"> cache[n] = fib(n - 2, cache) + fib(n - 1, cache)</div><div class="line"> return cache[n]</div><div class="line">print([fib(n) for n in range(999)])</div></pre></td></tr></table></figure></p>
<p>Now even a large n is computed quickly.</p>
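<p>The standard library ships the same memoization built in: functools.lru_cache gives the cached version without threading a dict through the calls. An alternative sketch of mine:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">from functools import lru_cache</div><div class="line"></div><div class="line">@lru_cache(maxsize=None)  # remember every fib(n) ever computed</div><div class="line">def fib(n):</div><div class="line">    if n == 1 or n == 0:</div><div class="line">        return 1</div><div class="line">    return fib(n - 2) + fib(n - 1)</div><div class="line"></div><div class="line">print([fib(n) for n in range(999)])  # fast: each fib(n) is computed only once</div></pre></td></tr></table></figure>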
]]></content>
<summary type="html">
<blockquote>
<p>Fibonacci questions come up in interviews all the time, so I read some material and wrote these small techniques down.</p>
</blockquote>
<h3 id="1-元组实现"><a href="#1-元组实现" class="headerlink" title="1. 元组实现"></a>1. 元组实现</h3><p>代码:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">fibs = [0, 1]</div><div class="line">for i in range(8):</div><div class="line"> fibs.append(fibs[-2] + fibs[-1])</div><div class="line">print(fibs)</div></pre></td></tr></table></figure></p>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="技巧" scheme="http://bulolo.cn/tags/skill/"/>
<category term="算法" scheme="http://bulolo.cn/tags/%E7%AE%97%E6%B3%95/"/>
</entry>
<entry>
<title>Python Crawler Diary 5: Scraping Yidian Zixun's dynamically loaded articles with Selenium</title>
<link href="http://bulolo.cn/2017/05/05/%E7%88%AC%E8%99%AB5/"/>
<id>http://bulolo.cn/2017/05/05/爬虫5/</id>
<published>2017-05-05T13:21:22.000Z</published>
<updated>2017-05-26T07:23:32.817Z</updated>
<content type="html"><![CDATA[<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>这几天哈尔滨天气天天刮风下雨的挺烦的,抱个电脑去学校图书馆学编程回来还要被雨淋,马上专业课也要考试了,Python集中学习要等到几门课考完了。<br>今天使用Selenium来处理JS一点资讯文章动态加载问题,本来是想配合PhantomJS无界面浏览器来实现的,但是一直出问题等有空在找找原因吧,所以我就Firefox()了。</p>
<blockquote>
<p><strong>Goal: fetch Yidian Zixun's dynamically loaded article data and save it in CSV format</strong></p>
</blockquote>
<a id="more"></a>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li><p>Python3.6,Anaconda集成版本,方便管理各种模块。</p>
</li>
<li><p>Selenium 3.4.0</p>
</li>
</ul>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>1.先看看网站<a href="http://www.yidianzixun.com/channel/c6" target="_blank" rel="external">一点资讯</a>,的分析,红色部分是文章标题,文章作者,还有评价数目,这几个是我需要提取的数据,右边的按钮是用来刷新新文章的一会儿要用到。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-06a97c18ccf6ca8b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="一点资讯1.png"><br>2.进入开发者模式后找到相应位置可以看到文章链接,标题,文章作者,评论数目。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-888421d366bdf520.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="一点资讯2.png"><br>3.但是从下面看到首页的这样的新闻可以爬取的只有几个而已,我们如果想爬取多一点怎么办呢?当我们打开这个页面的时候,鼠标滚轮向下滚动的时候发现这些数据就变多了,说明这是一个JS动态加载数据的方式。我这次就要用到Selenium来模拟浏览器从而获取更多文章信息。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-fb628f5d5da9f800.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="一点资讯3.png"></p>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><p>1.首先把全部的模块导入一下,这次用到了好几个。selenium.webdriver用来模拟浏览器用到的;BeautifulSoup用来解析网页结构;csv模块用来把数据保存为csv格式;time用来延时的,不然网页没有加载完就解析数据,那么保存的数据不完整,不够多。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">from selenium.webdriver.common.keys import Keys</div><div class="line">from selenium import webdriver</div><div class="line">from bs4 import BeautifulSoup</div><div class="line">import csv,time</div></pre></td></tr></table></figure></p>
<p>2. webdriver.Firefox() starts a simulated Firefox; we then request the Yidian Zixun address and sleep 2 seconds to be sure it has loaded. I had planned to crawl with the headless PhantomJS here, but the results were not good enough, so Firefox it is for now.<br><pre><code>driver = webdriver.Firefox()
first_url = 'http://www.yidianzixun.com/channel/c6'
driver.get(first_url)
time.sleep(2)
</code></pre></p>
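<p>A fixed sleep is fragile: too short and the page is incomplete, too long and time is wasted. Selenium also offers explicit waits that block only until a condition holds. A minimal sketch, assuming we wait for the icon-refresh button described in the next step:<br><pre><code>from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the refresh button to appear, then continue
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'icon-refresh'))
)
</code></pre></p>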
<p>3. Next, simulate a mouse click on that refresh button so more data loads; you can add extra clicks yourself, I only use one here. icon-refresh is the button's place in the markup: in developer mode, press Ctrl+Shift+c and click the refresh button to jump straight to its element, following the steps in my screenshot.</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-ea432b133fa4bd58.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>4.<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">find_element_by_class_name().click()</div></pre></td></tr></table></figure></p>
<p>to simulate clicking the button. The module imported earlier,<br><pre><code>from selenium.webdriver.common.keys import Keys
</code></pre></p>
<p>is the module for simulating keyboard input:<br><pre><code>driver.find_element_by_class_name('icon-refresh').send_keys(Keys.DOWN)
</code></pre></p>
<p>simulates the <strong>↓</strong> arrow key, which rolls the scrollbar down and so triggers the dynamic loading of more article data. Afterwards we pause again so the page can finish loading. The for loop simply presses ↓ many more times.<br><pre><code>driver.find_element_by_class_name('icon-refresh').click()
for i in range(1, 90):
    driver.find_element_by_class_name('icon-refresh').send_keys(Keys.DOWN)
time.sleep(3)
</code></pre></p>
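<p>Sending arrow keys works, but driving the scroll position directly through JavaScript is a common alternative that does not depend on which element has focus. A hedged sketch using Selenium's execute_script:<br><pre><code>import time

# scroll to the bottom a few times, pausing so new items can load in
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
</code></pre></p>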
<p>5. By now the page is fully loaded and plenty of the article data we want is in place, so let's parse it and start scraping. Step 2 of the walkthrough above shows where the title, author, comment count, and article link live; once parsing is done, close the simulated browser.<br><pre><code>soup = BeautifulSoup(driver.page_source, 'lxml')
articles = []
for article in soup.find_all(class_='item doc style-small-image style-content-middle'):
    title = article.find(class_='doc-title').get_text()
    source = article.find(class_='source').get_text()
    comment = article.find(class_='comment-count').get_text()
    link = 'http://www.yidianzixun.com' + article.get('href')
    articles.append([title, source, comment, link])
driver.quit()
</code></pre></p>
<p>6. With the data in hand, all that remains is saving it in CSV format.<br><pre><code># newline='' stops the csv module from writing blank lines on Windows
with open('yidian.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['文章标题', '作者', '评论数', '文章地址'])
    for row in articles:
        writer.writerow(row)
</code></pre></p>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>这次爬虫练习了selenium模拟浏览器的各种操作,继续加油!<br>贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友一起学习交流吧!<strong><em><a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></em></strong></p>
]]></content>
<summary type="html">
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>这几天哈尔滨天气天天刮风下雨的挺烦的,抱个电脑去学校图书馆学编程回来还要被雨淋,马上专业课也要考试了,Python集中学习要等到几门课考完了。<br>今天使用Selenium来处理JS一点资讯文章动态加载问题,本来是想配合PhantomJS无界面浏览器来实现的,但是一直出问题等有空在找找原因吧,所以我就Firefox()了。</p>
<blockquote>
<p><strong>Goal: fetch Yidian Zixun's dynamically loaded article data and save it in CSV format</strong></p>
</blockquote>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
</entry>
<entry>
<title>Python Crawler Diary 4: Capturing the HLJU account, password and captcha with Charles and logging in</title>
<link href="http://bulolo.cn/2017/05/02/%E7%88%AC%E8%99%AB4/"/>
<id>http://bulolo.cn/2017/05/02/爬虫4/</id>
<published>2017-05-02T11:44:16.000Z</published>
<updated>2017-05-26T07:23:30.617Z</updated>
<content type="html"><![CDATA[<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>今天看了一篇安利Charles这个软件的文章,就拿来试试,我们大学的登录页面用开发者模式进去chrome有屏蔽相关模块,用火狐可以正常不过还是抓不到验证码这个js动态数据而且帐号密码的请求后Cookies并找不到。那么这个时候使用抓包软件就是一个好的方法之一了,之前也用过其他抓包软件,比如Fidder,今天用过Charles后才发现还有比Fidder好用的抓包软件,这个比较简洁,数据查找也很直观。</p>
<blockquote>
<p><strong>Goal:</strong> use the packet-capture tool Charles to analyze the page traffic and find the endpoints for the account, password, and captcha, then simulate the login in Python and fetch the post-login page.</p>
</blockquote>
<a id="more"></a>
<h1 id="二:运行环境"><a href="#二:运行环境" class="headerlink" title="二:运行环境"></a>二:运行环境</h1><ul>
<li><p>Python3.6,我用的是Anaconda集成版本,方便管理各种模块。</p>
</li>
<li><p>Charles 4.02; very simple to use, with an intuitive data view.</p>
</li>
</ul>
<h1 id="三:实例分析"><a href="#三:实例分析" class="headerlink" title="三:实例分析"></a>三:实例分析</h1><p>1.分析网站登录情况,网址是<a href="http://my.hlju.edu.cn/login.portal" target="_blank" rel="external">http://my.hlju.edu.cn/login.portal</a> 进去之后用火狐的浏览器进去开发者模式,看到了验证码地址captchaGenerate.portal?后面跟的随机数字代表的不同的验证码,我把这个配合主网址组成这个网址 <a href="http://my.hlju.edu.cn/captchaGenerate.portal?" target="_blank" rel="external">http://my.hlju.edu.cn/captchaGenerate.portal?</a> 在浏览器打开就是随机的验证码。</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-c69e3728e1ed797f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-b99d0c5a741ef9ce.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>2.验证码的网址已经找到了,现在我们使用Charles抓包工具,抓取登录时的数据分析一下,这一张是抓包后的图。</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-39b7a8e18720e86b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>3.然后点击这个userPasswordValidate.portal,可知道这个保存着登录的全部数据,我们点击一下From数据就变得整洁多了,可以看到有几个键值对这样我们帐号密码对应地址也找到了,之后就可以开始用Python模拟登录了。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">Login.Token1 *******</div><div class="line">Login.Token2 *******</div><div class="line">captcha w4dy</div><div class="line">goto http://my.hlju.edu.cn/loginSuccess.portal</div><div class="line">gotoOnFail http://my.hlju.edu.cn/loginFailure.portal</div></pre></td></tr></table></figure></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-7d84cb900c2e9388.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"></p>
<h1 id="四:实战代码"><a href="#四:实战代码" class="headerlink" title="四:实战代码"></a>四:实战代码</h1><p>帐号密码改成自己的学号密码即可模拟登录,之前爬虫都没有使用requests.session(),这里就需要因为用了这个回话对象,可以使几次请求都在同一个Cookie下进行,方便我们模拟登录后获取登录后的主页面。<br>会话对象让你能够跨请求保持某些参数。它也会在同一个 Session 实例发出的所有请求之间保持 cookie, 期间使用 urllib3 <a href="https://urllib3.readthedocs.io/en/latest/pools.html" target="_blank" rel="external">connection pooling</a> 功能。所以如果你向同一主机发送多个请求,底层的 TCP 连接将会被重用,从而带来显著的性能提升。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div></pre></td><td class="code"><pre><div class="line">import requests</div><div class="line">from PIL import Image</div><div class="line">from bs4 import BeautifulSoup</div><div class="line"></div><div class="line">url1 = 'http://my.hlju.edu.cn/captchaGenerate.portal?'</div><div class="line">url2 = 'http://my.hlju.edu.cn/userPasswordValidate.portal'</div><div class="line">url3 = 'http://my.hlju.edu.cn'</div><div class="line">headers = {</div><div class="line"> 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'</div><div class="line">}</div><div class="line">s = requests.session()</div><div class="line">response = s.get(url1, headers=headers)</div><div class="line">html = response.text</div><div class="line">soup = BeautifulSoup(html, 'html.parser')</div><div class="line">with open('img\code.jpg', 'wb') as f:</div><div class="line"> f.write(response.content)</div><div class="line">img = Image.open('img\code.jpg')</div><div class="line">img.show()</div><div class="line">data = {}</div><div class="line">data['Login.Token1'] = '帐号'</div><div class="line">data['Login.Token2'] = '密码'</div><div class="line">data['captcha'] = input('输入验证码:')</div><div class="line">data['goto'] = 'http://my.hlju.edu.cn/loginSuccess.portal'</div><div class="line">data['gotoOnFail'] = 'http://my.hlju.edu.cn/loginFailure.portal'</div><div class="line">response2 = s.post(url=url2, data=data, headers=headers)</div><div class="line">response3 = s.get(url3, headers=headers)</div><div class="line">print(response3.text)</div></pre></td></tr></table></figure></p>
<h1 id="五:总结"><a href="#五:总结" class="headerlink" title="五:总结"></a>五:总结</h1><p>这次练习了一下Charles抓包的使用和对抓包数据的分析,每天写一写小Demo,继续加油!<br>这里贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了,有喜欢的朋友一起学习交流吧!<a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">github.com/rieuse/learnPython</a></p>
]]></content>
<summary type="html">
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><p>今天看了一篇安利Charles这个软件的文章,就拿来试试,我们大学的登录页面用开发者模式进去chrome有屏蔽相关模块,用火狐可以正常不过还是抓不到验证码这个js动态数据而且帐号密码的请求后Cookies并找不到。那么这个时候使用抓包软件就是一个好的方法之一了,之前也用过其他抓包软件,比如Fidder,今天用过Charles后才发现还有比Fidder好用的抓包软件,这个比较简洁,数据查找也很直观。</p>
<blockquote>
<p><strong>Goal:</strong> use the packet-capture tool Charles to analyze the page traffic and find the endpoints for the account, password, and captcha, then simulate the login in Python and fetch the post-login page.</p>
</blockquote>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
</entry>
<entry>
<title>Python Crawler Diary 3: Scraping v2ex data and saving it with csv</title>
<link href="http://bulolo.cn/2017/05/02/%E7%88%AC%E8%99%AB3/"/>
<id>http://bulolo.cn/2017/05/02/爬虫3/</id>
<published>2017-05-02T04:44:11.000Z</published>
<updated>2017-05-26T07:23:28.117Z</updated>
<content type="html"><![CDATA[<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><hr>
<p>v2ex是一个汇集各类奇妙好玩的话题和流行动向的网站,有很多不错的问答。这次爬虫是五一期间做的,贴出来网址<a href="https://www.v2ex.com/?tab=all。" target="_blank" rel="external">https://www.v2ex.com/?tab=all。</a></p>
<blockquote>
<p><strong>Goal:</strong> scrape the article title, category, author, and article link from the all tab, then save them in CSV format.</p>
</blockquote>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-9ae03cf9f5b89721.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br><a id="more"></a></p>
<h2 id="二:说明"><a href="#二:说明" class="headerlink" title="#二:说明"></a>#二:说明</h2><ul>
<li>本次使用的是Python3.6版本</li>
<li>作者这个内容是js动态数据 使用xpath Beautifulsoup的tag和select都抓取不到,我试了试用正则表达式可以,目前还没学其他方法就这样头铁了。</li>
<li>使用csv保存数据的时候我发现writer.writerow()和writer.writerows()是不一样的,本次用的前者。</li>
</ul>
<h1 id="三:实战分析"><a href="#三:实战分析" class="headerlink" title="三:实战分析"></a>三:实战分析</h1><hr>
<p>1.导入本次使用的模块,csv, re, requests, BeautifulSoup。<br> <figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">import csv, requests, re</div><div class="line">from bs4 import BeautifulSoup</div></pre></td></tr></table></figure></p>
<p>2. Request and parse the page.<br><pre><code>url = 'https://www.v2ex.com/?tab=all'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
</code></pre></p>
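<p>As an aside, requests.get() happily returns error pages too; adding a timeout and a status check makes failures explicit. A small sketch of my own, not part of the original code:<br><pre><code>import requests

resp = requests.get('https://www.v2ex.com/?tab=all', timeout=10)
resp.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
html = resp.text
</code></pre></p>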
<p>3. First, a look at the page structure.</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-128de4d1b5ff7cd4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>然后来获取文章标题,分类,作者,文章地址,这里的标题和分类都很容易获取,使用BeautifulSoup解析后按照class就可以找到,然后使用get_text()即可获取我们需要的内容,最头疼的是作者和文章链接,我这里使用正则才把他们挖掘出来,不过也算是练习正则表达式的使用。最后把获取的内容都传给articles列表。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">articles = []</div><div class="line">for article in soup.find_all(class_='cell item'):</div><div class="line"> title = article.find(class_='item_title').get_text()</div><div class="line"> category = article.find(class_='node').get_text()</div><div class="line"> author = re.findall(r'(?<=<a href="/member/).+(?="><img)', str(article))[0]</div><div class="line"> u = article.select('.item_title > a')</div><div class="line"> link = 'https://www.v2ex.com' + re.findall(r'(?<=href=").+(?=")', str(u))[0]</div><div class="line"> articles.append([title, category, author, link])</div></pre></td></tr></table></figure></p>
<p>4. Save the list to CSV, writing a header row first.<br><pre><code># newline='' stops the csv module from writing blank lines on Windows
with open('v2ex.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['文章标题', '分类', '作者', '文章地址'])
    for row in articles:
        writer.writerow(row)
</code></pre></p>
<h1 id="四:总结"><a href="#四:总结" class="headerlink" title="四:总结"></a>四:总结</h1><hr>
<p>最后的效果:</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-62d7f0746576de00.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>这次爬取遇到了一些问题,慢慢的学会更多东西,爬虫让我非常快乐。我以后会坚持写下去,有喜欢的朋友一起学习交流吧!<br>这里贴出我的github地址,我的爬虫代码和学习的基础部分都放进去了。<br><a href="https://github.com/rieuse/learnPython" target="_blank" rel="external">https://github.com/rieuse/learnPython</a></p>
]]></content>
<summary type="html">
<h1 id="一:前言"><a href="#一:前言" class="headerlink" title="一:前言"></a>一:前言</h1><hr>
<p>v2ex是一个汇集各类奇妙好玩的话题和流行动向的网站,有很多不错的问答。这次爬虫是五一期间做的,贴出来网址<a href="https://www.v2ex.com/?tab=all。" target="_blank" rel="external">https://www.v2ex.com/?tab=all。</a></p>
<blockquote>
<p><strong>Goal:</strong> scrape the article title, category, author, and article link from the all tab, then save them in CSV format.</p>
</blockquote>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-9ae03cf9f5b89721.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
</entry>
<entry>
<title>Python Crawler Diary 2: Parsing HTML with lxml and printing the matching values</title>
<link href="http://bulolo.cn/2017/04/28/%E7%88%AC%E8%99%AB2/"/>
<id>http://bulolo.cn/2017/04/28/爬虫2/</id>
<published>2017-04-28T11:07:05.000Z</published>
<updated>2017-05-26T07:23:25.045Z</updated>
<content type="html"><![CDATA[<h1 id="一、前言"><a href="#一、前言" class="headerlink" title="一、前言"></a>一、前言</h1><p>今天我要做的是爬取凤凰网资讯的一个即时新闻列表的标题和对应链接,很简单的requests与lxml练习,同时使用xpath。贴出网址:<a href="http://news.ifeng.com/listpage/11502/0/1/rtlist.shtml" target="_blank" rel="external">http://news.ifeng.com/listpage/11502/0/1/rtlist.shtml</a></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-7aaff42d387ccea6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="凤凰资讯.png"><br><a id="more"></a></p>
<h1 id="二、运行环境"><a href="#二、运行环境" class="headerlink" title="二、运行环境"></a>二、运行环境</h1><ul>
<li>系统版本<br>Windows10 64位</li>
<li>Python版本<br>Python3.6 我用的是Anaconda集成版本</li>
<li>IDE<br>PyCharm 学生可以通过edu邮箱免费使用,不是学生的朋友可以试试社区版。</li>
</ul>
<h1 id="三、分析"><a href="#三、分析" class="headerlink" title="三、分析"></a>三、分析</h1><p>解析HTML常用方式有<strong>BeautifulSoup</strong>,<strong>lxml.html</strong>,性能方面lxml要优于BeautifulSoup,BeautifulSoup是基于DOM的,会解析整个DOM树,lxml只会局部遍历。</p>
<p>python3网络请求常用的有自带的urllib,第三方库requests,使用起来requests还是比urllib更简单明了,而且requests有更强的功能。</p>
<h1 id="四、实战"><a href="#四、实战" class="headerlink" title="四、实战"></a>四、实战</h1><p>首先导入今天需要的模块requests,lxml.html。</p>
<pre><code>import requests
import lxml.html
</code></pre><p>url is the target address and html holds the page's text content; lxml then parses it, which is what lets us extract the data we need.</p>
<pre><code>url = 'http://news.ifeng.com/listpage/11502/0/1/rtlist.shtml'
html = requests.get(url).text
doc = lxml.html.fromstring(html)
</code></pre><p>With parsing done, we first extract the article titles, using xpath to locate the tags they sit in; open the original page with F12 developer mode to inspect a title.<br><img src="http://upload-images.jianshu.io/upload_images/4701426-8f61ab5ce62f6e3d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="凤凰资讯标题.png"></p>
<pre><code>titles = doc.xpath('//div[@class="newsList"]/ul/li/a/text()')
href = doc.xpath('//div[@class="newsList"]/ul/li/a/@href')
</code></pre><p>The first line hands every title-matching piece of the page to the titles variable; the second hands each title's URL to href.</p>
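<p>Two separate xpath queries work here, but iterating over the anchor elements themselves keeps each title paired with its own link in a single pass, so the two lists can never drift apart. A small sketch using the same xpath:</p>
<pre><code># one query for the anchor elements; read the text and href off each one
for a in doc.xpath('//div[@class="newsList"]/ul/li/a'):
    print({'标题': a.text, '链接': a.get('href')})
</code></pre>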
<p>Many people find xpath queries hard to use or just tedious, so here is a recommendation: the Chrome extension XPath Helper. After installing it, reopen the browser and press Ctrl+Shift+x to bring up the XPath Helper panel; hold Shift and move the mouse to switch the query target.</p>
<p>The last step: pair each title with its URL, iterate, and print the result.</p>
<pre><code>for title, link in zip(titles, href):
    print({'标题': title, '链接': link})
</code></pre>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-6172297303818caf.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="凤凰资讯2.png"></p>
<h1 id="五、总结"><a href="#五、总结" class="headerlink" title="五、总结"></a>五、总结</h1><p>查询标签用BeautifulSoup也挺合适的,这次为了练习一下就使用了lxml 配合xpath。继续努力,给自己加油!ヾ(o◕∀◕)ノヾ</p>
]]></content>
<summary type="html">
<h1 id="一、前言"><a href="#一、前言" class="headerlink" title="一、前言"></a>一、前言</h1><p>今天我要做的是爬取凤凰网资讯的一个即时新闻列表的标题和对应链接,很简单的requests与lxml练习,同时使用xpath。贴出网址:<a href="http://news.ifeng.com/listpage/11502/0/1/rtlist.shtml" target="_blank" rel="external">http://news.ifeng.com/listpage/11502/0/1/rtlist.shtml</a></p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-7aaff42d387ccea6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="凤凰资讯.png"><br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
</entry>
<entry>
<title>Python Crawler Diary 1: Scraping the cast photos of Fast &amp; Furious 8 from Douban</title>
<link href="http://bulolo.cn/2017/04/27/%E7%88%AC%E8%99%AB1/"/>
<id>http://bulolo.cn/2017/04/27/爬虫1/</id>
<published>2017-04-27T08:34:19.000Z</published>
<updated>2017-05-26T07:24:36.772Z</updated>
<content type="html"><![CDATA[<h1 id="一、前言"><a href="#一、前言" class="headerlink" title="一、前言"></a>一、前言</h1><p>这是我第一次写文章,作为一个非计算机,编程类专业的大二学生,我希望能够给像我这样的入门的朋友一些帮助,也同时激励自己努力写代码。好了废话不多说,今天我做的爬虫是豆瓣的一个电影——速度与激情8的全部影人页面,贴出网址:<a href="https://movie.douban.com/subject/26260853/celebrities" target="_blank" rel="external">速度与激情8 全部影人</a>。<br><strong>目标</strong>:爬取速度与激情8中全部影人的图片并且用图中人物的名字给图片文件命名,最后保存在电脑中。<br><a id="more"></a><br><img src="http://upload-images.jianshu.io/upload_images/4701426-e240ffe03f1ae5d5.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="豆瓣1.png"></p>
<h1 id="二、运行环境"><a href="#二、运行环境" class="headerlink" title="二、运行环境"></a>二、运行环境</h1><ul>
<li>系统版本<br>Windows10 64位</li>
<li>Python版本<br>Python3.6 我用的是Anaconda集成版本</li>
<li>IDE<br>PyCharm 学生可以通过edu邮箱免费使用,不是学生的朋友可以试试社区版,不明白怎么安装的可以留言或者 私信我。</li>
</ul>
<h1 id="三、分析"><a href="#三、分析" class="headerlink" title="三、分析"></a>三、分析</h1><p>爬虫的三个要点:请求,解析,存储<br><strong>请求</strong>可以使用urllib Requests ,其中urllib是自带的, Requests是第三方库,功能更强大,本次使用的是urllib。<br><strong>解析</strong>我用的有正则表达式,xpath,本次使用的是正则表达式,主要是想自己用正则来练练 只看正则的说明不能理解其中的奥秘ヾ(o◕∀◕)ノヾ,必须多试试。<br><strong>储存</strong>常用的有保存到内存,数据库,硬盘中,本次是保存到电脑硬盘中</p>
<h1 id="四、实战"><a href="#四、实战" class="headerlink" title="四、实战"></a>四、实战</h1><p>首先导入我们需要的模块<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">import urllib.request</div><div class="line">import os</div><div class="line">import re</div></pre></td></tr></table></figure></p>
<p>urllib.request是用来请求的,os是操作文件目录常用的模块,re是python中正则表达式的模块,<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">url = 'https://movie.douban.com/subject/26260853/celebrities'</div><div class="line">r = urllib.request.urlopen(url)</div><div class="line">html = r.read().decode('utf-8')</div></pre></td></tr></table></figure></p>
<p>第一行很明显是本次爬虫的网页, r = urllib.request.urlopen(url)用来打开网页, r.read()是读取网页内容,decode(‘utf-8’)是用utf-8编码对字符串str进行解码,以获取unicode。</p>
<p>Next, let's find the image addresses. Open the full-cast page of Fast &amp; Furious 8 in Chrome, press F12, and a little analysis shows that each photo's address is img1 or img3 .doubanio.com/img/celebrity/medium/ followed by a few digits and .jpg</p>
<p><img src="http://upload-images.jianshu.io/upload_images/4701426-a5412a6886373783.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>我们使用正则表达式来匹配一下这些图片地址,1或者3部分用\d匹配,末尾数字部分用.*来匹配即可。<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">result = re.findall(r'https://img\d.doubanio.com/img/celebrity/medium/.*.jpg',html)</div></pre></td></tr></table></figure></p>
<p>Now we have the image addresses, but we still need to scrape the people's names before we can pair them with the files. Analyzing the page once more: every name appears after title=, so we regex-match on that to collect all the names into a list.<br><img src="http://upload-images.jianshu.io/upload_images/4701426-409edabe4ba6433c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br><pre><code>result2 = re.findall(r'(?<=title=").\S+', html)
result2.pop()
result3 = sorted(set(result2), key=result2.index)
result3.pop(-3)
</code></pre></p>
<p>In the first line, re.findall(r'(?<=title=").\S+', html) matches the names that follow title=" in the screenshot.<br>In the second line, pop() drops the last element, because the matched list contains one element that is not a person's name.<br>In the third line, sorted(set(result2), key=result2.index) does two jobs: set() removes the duplicates, and sorted() with key=result2.index reorders the survivors by their original position in result2. Every photo must stay paired with its name, and a bare set() would deduplicate but scramble the order, so sorting by the original indices keeps the pairing intact.<br>result3.pop(-3) deletes the third element from the end: 克里斯·摩根 (Chris Morgan) has no photo, so I dropped him.</p>
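<p>The dedupe-but-keep-order idiom is worth isolating; a tiny self-contained demo of what sorted(set(...), key=list.index) does:<br><pre><code>names = ['Vin', 'Dwayne', 'Vin', 'Jason', 'Dwayne']

# set() removes duplicates but forgets the order; sorting by each item's
# first index in the original list restores it
print(sorted(set(names), key=names.index))  # ['Vin', 'Dwayne', 'Jason']
</code></pre></p>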
<p>Next, create a local folder to hold the images, which is where the os module comes in.<br><pre><code>if not os.path.exists('douban'):
    os.makedirs('douban')
</code></pre></p>
<p>Then download the photos, using the names scraped earlier to name and save each matching file.</p>
<pre><code># urlretrieve downloads the URL straight into the file, so no open() is needed
for name, link in zip(result3, result):
    filename = 'douban\\' + name + '.jpg'
    urllib.request.urlretrieve(link, filename)
</code></pre>
<p>Here is the complete code; give it a try if you like.<br><pre><code>import urllib.request
import os
import re

url = 'https://movie.douban.com/subject/26260853/celebrities'
r = urllib.request.urlopen(url)
html = r.read().decode('utf-8')
# escape the dots and use a non-greedy .*? so each match stops at its own .jpg
result = re.findall(r'https://img\d\.doubanio\.com/img/celebrity/medium/.*?\.jpg', html)
result2 = re.findall(r'(?<=title=").\S+', html)
result2.pop()                                      # drop the one non-name match
result3 = sorted(set(result2), key=result2.index)  # dedupe, keep original order
result3.pop(-3)                                    # 克里斯·摩根 has no photo
if not os.path.exists('douban'):
    os.makedirs('douban')
for name, link in zip(result3, result):
    urllib.request.urlretrieve(link, 'douban\\' + name + '.jpg')
</code></pre></p>
<h1 id="五、总结"><a href="#五、总结" class="headerlink" title="五、总结"></a>五、总结</h1><p>最后效果,图片都下载在我刚才指定的文件夹中了。<br><img src="http://upload-images.jianshu.io/upload_images/4701426-5c08e5da83f42dab.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" alt="Paste_Image.png"><br>第一次写文章,对很多东西不是很熟悉,如果有任何问题,请多多指教。</p>
]]></content>
<summary type="html">
<h1 id="一、前言"><a href="#一、前言" class="headerlink" title="一、前言"></a>一、前言</h1><p>这是我第一次写文章,作为一个非计算机,编程类专业的大二学生,我希望能够给像我这样的入门的朋友一些帮助,也同时激励自己努力写代码。好了废话不多说,今天我做的爬虫是豆瓣的一个电影——速度与激情8的全部影人页面,贴出网址:<a href="https://movie.douban.com/subject/26260853/celebrities" target="_blank" rel="external">速度与激情8 全部影人</a>。<br><strong>目标</strong>:爬取速度与激情8中全部影人的图片并且用图中人物的名字给图片文件命名,最后保存在电脑中。<br>
</summary>
<category term="python" scheme="http://bulolo.cn/tags/python/"/>
<category term="爬虫" scheme="http://bulolo.cn/tags/Spider/"/>
<category term="图片" scheme="http://bulolo.cn/tags/%E5%9B%BE%E7%89%87/"/>
</entry>
</feed>