Garbled Chinese text after scraping with Scrapy: fixing the cp1252 encoding problem on pages converted from Word to HTML

Approach 1

Brute-force the encoding by trying each candidate in a loop. This works, but Approach 3 is the better option.

def parse(self, response):
    print(response.text[:100])
    body = response.body  # raw bytes; response.text is the decoded str
    # Candidate encodings to try; response.encoding holds the encoding Scrapy detected
    encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
    for encoding in encodings:
        try:
            print(body.decode(encoding)[:100])  # decode() is only available on bytes
        except UnicodeDecodeError as e:
            print('decode {0}, error: {1}\n'.format(encoding, e))
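The loop above only prints each attempt. A minimal sketch of the same idea as a reusable helper is shown below; the function name decode_with_first_match and the sample bytes are illustrative assumptions, not part of the original post. Note that latin1 (the same codec as iso-8859-1) accepts every byte sequence, so it should come last in the candidate list.

```python
from typing import Iterable, Optional, Tuple


def decode_with_first_match(body: bytes, encodings: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (encoding, text) for the first candidate that decodes body without error.

    Hypothetical helper for illustration; returns None if every candidate fails.
    """
    for encoding in encodings:
        try:
            return encoding, body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are invalid for this codec
            # LookupError: the codec name is unknown to Python
            continue
    return None


# Sample bytes standing in for response.body (an assumption for this sketch)
sample = '中文乱码'.encode('gbk')
result = decode_with_first_match(sample, ['utf-8', 'gbk', 'latin1'])
print(result)
```

A clean decode does not prove the encoding is the right one, only that the bytes are valid in it, so the printed text still needs a visual check.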

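Approach 2
A downloader middleware has a process_response method, and the encoding can be corrected there. To do so, you build a new response yourself with the corrected charset; HtmlResponse handles this cleanly and fixes the garbled Chinese text. Remember to enable the downloader middleware in settings.py.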

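from scrapy.http import HtmlResponse

# Method of your downloader middleware class
def process_response(self, request, response, spider):
    # Rebuild the response as UTF-8 when Scrapy detected cp1252.
    # getattr() is used because non-text responses have no encoding attribute.
    if getattr(response, 'encoding', None) == 'cp1252':
        response = HtmlResponse(url=response.url, body=response.body, encoding='utf-8')
    return response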
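Enabling the middleware in settings.py might look like the fragment below. The module path myproject.middlewares.FixEncodingMiddleware is a hypothetical name chosen for this sketch, and 543 is an arbitrary order value; substitute the path of the class that actually holds process_response in your project.

```python
# settings.py (fragment)
# The dotted path below is a hypothetical example, not taken from the original post.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FixEncodingMiddleware': 543,
}
```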

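Approach 3

Chinese text comes out garbled when Scrapy scrapes a page encoded as gb2312.
In Python 3, use the chardet library to check the encoding: first encode the extracted str back into bytes, then decode those bytes into a str.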

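import chardet

# Encode the extracted text back into bytes using the encoding Scrapy assumed
txt_b = response.xpath('//title').extract()[0].encode(response.encoding)
detected = chardet.detect(txt_b)  # dict with 'encoding' and 'confidence' keys
print(detected)
# Decode with the detected encoding; decoding with response.encoding again would only reproduce the same text
txt_str = txt_b.decode(detected['encoding'] or response.encoding, errors='ignore')
print(txt_str)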
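The encode-then-decode round trip can be checked without Scrapy or chardet. The sketch below uses only the standard library and assumes, for illustration, a gb2312 page that was wrongly decoded as cp1252; the sample string is made up for the demonstration.

```python
original = '中文'

# Bytes as the server would send them for a gb2312 page
raw = original.encode('gb2312')

# What you get when those bytes are wrongly decoded as cp1252
garbled = raw.decode('cp1252')

# Encoding with the same wrong codec recovers the original bytes,
# which can then be decoded with the correct encoding
recovered = garbled.encode('cp1252').decode('gb2312')

print(garbled)
print(recovered)
```

The round trip is lossless here because every byte of this sample is defined in cp1252. Bytes that cp1252 leaves undefined (such as 0x81) would break it, which is one reason the errors='ignore' argument appears in the snippet above.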

Original article: http://www.cnblogs.com/kuba8/p/16918265.html
