高效爬虫框架使用

xiaotaomi

21-12-08 16:34 👁536

一直以来跟大家分享了很多爬虫的知识给大家，今天就换一种爬虫知识分享，一些较为高效的Python爬虫框架，分享给大家。

Scrapy是一个为了爬取网站数据，提取结构性数据而编写的应用框架。可以应用在包括数据挖掘，信息处理或存储历史数据等一系列的程序中。用这个框架来访问一些比较难采集的网站是很好的选择。

pyspider 是一个用python实现的功能强大的网络爬虫系统，能在浏览器界面上进行脚本的编写，功能的调度和爬取结果的实时查看，后端使用常用的数据库进行爬取结果的存储，还能定时设置任务与任务优先级等。还有Crawley、Portia、Newspaper、Beautiful Soup、Grab等高效的爬虫框架。

今天就主要跟大家交流下Scrapy这个框架在爬虫中的使用，简单的爬虫示例如下：

#! -*- encoding:utf-8 -*-
        import base64            
        import sys
        import random

        PY3 = sys.version_info[0] >= 3

        def base64ify(bytes_or_str):
            if PY3 and isinstance(bytes_or_str, str):
                input_bytes = bytes_or_str.encode('utf8')
            else:
                input_bytes = bytes_or_str

            output_bytes = base64.urlsafe_b64encode(input_bytes)
            if PY3:
                return output_bytes.decode('ascii')
            else:
                return output_bytes

        class ProxyMiddleware(object):                
            def process_request(self, request, spider):
                # 代理服务器(产品官网 www.16yun.cn)
                proxyHost = "t.16yun.cn"
                proxyPort = "31111"

                # 代理验证信息
                proxyUser = "username"
                proxyPass = "password"

                request.meta['proxy'] = "http://{0}:{1}".format(proxyHost,proxyPort)

                # 添加验证头
                encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass                    

                # 设置IP切换头(根据需求)
                tunnel = random.randint(1,10000)
                request.headers['Proxy-Tunnel'] = str(tunnel)

这是一段加了代理使用方式的示例，代理是有亿牛云代理提供，还有其他的语言示例，需要的可以去了解下https://www.16yun.cn/help/ss_demo/#1python。不知道大家平时都是使用的哪个框架呢？实现爬虫技术的编程环境有很多种，Java、Python、C++等都可以用来爬虫。但很多人选择Python来写爬虫，因为Python确实很适合做爬虫，丰富的第三方库十分强大，简单几行代码便可实现你想要的功能。更重要的，Python也是数据挖掘和分析的好能手。

高效爬虫框架使用

xiaotaomi

会员积分：6520

一直以来跟大家分享了很多爬虫的知识给大家，今天就换一种爬虫知识分享，一些较为高效的Python爬虫框架，分享给大家。

今天就主要跟大家交流下Scrapy这个框架在爬虫中的使用，简单的爬虫示例如下：

#! -*- encoding:utf-8 -*-
        import base64            
        import sys
        import random

        PY3 = sys.version_info[0] >= 3

        def base64ify(bytes_or_str):
            if PY3 and isinstance(bytes_or_str, str):
                input_bytes = bytes_or_str.encode('utf8')
            else:
                input_bytes = bytes_or_str

            output_bytes = base64.urlsafe_b64encode(input_bytes)
            if PY3:
                return output_bytes.decode('ascii')
            else:
                return output_bytes

        class ProxyMiddleware(object):                
            def process_request(self, request, spider):
                # 代理服务器(产品官网 www.16yun.cn)
                proxyHost = "t.16yun.cn"
                proxyPort = "31111"

                # 代理验证信息
                proxyUser = "username"
                proxyPass = "password"

                request.meta['proxy'] = "http://{0}:{1}".format(proxyHost,proxyPort)

                # 添加验证头
                encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass                    

                # 设置IP切换头(根据需求)
                tunnel = random.randint(1,10000)
                request.headers['Proxy-Tunnel'] = str(tunnel)

21-12-08 16:34

536

暂无评论

*论坛内容来自网络，不代表本网观点，发现侵权请联系客服删除！