Python Web Crawling and Information Extraction

Requests

Automatically fetch HTML pages and submit network requests.
The seven main methods of the Requests library (a short usage sketch follows this list):
requests.request(): constructs a request; the base method underlying all of the methods below;
requests.get(): the main method for fetching an HTML page, corresponding to HTTP GET;
requests.head(): fetches the headers of an HTML page, corresponding to HTTP HEAD;
requests.post(): submits a POST request to an HTML page, corresponding to HTTP POST;
requests.put(): submits a PUT request to an HTML page, corresponding to HTTP PUT;
requests.patch(): submits a partial-modification request, corresponding to HTTP PATCH;
requests.delete(): submits a delete request to an HTML page, corresponding to HTTP DELETE;
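A minimal sketch showing how these methods relate to one another; httpbin.org is used here only as an illustrative test endpoint and is not part of the original notes.

import requests

# requests.get() is shorthand for requests.request('GET', ...);
# the same pattern holds for head/post/put/patch/delete.
r1 = requests.get("https://httpbin.org/get")
r2 = requests.request("GET", "https://httpbin.org/get")    # equivalent call

h = requests.head("https://httpbin.org/get")                # headers only, no body
p = requests.post("https://httpbin.org/post", data={"key": "value"})

print(r1.status_code, h.headers.get("Content-Type"), p.status_code)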

The get method

r = requests.get(url)
Attributes of the Response object (a short example follows this list):
r.status_code: the HTTP status code of the request; 200 means success, 404 means failure;
r.text: the response body as a string, i.e. the page content at the url;
r.encoding: the response encoding guessed from the HTTP headers;
r.apparent_encoding: the encoding inferred from the content itself (a fallback encoding);
r.content: the response body in binary form;
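A short example that exercises these attributes on a single response; it reuses the Baidu URL that appears later in these notes.

import requests

r = requests.get("http://www.baidu.com")
print(r.status_code)             # 200 on success
print(r.encoding)                # encoding guessed from the HTTP headers
print(r.apparent_encoding)       # encoding inferred from the content itself
r.encoding = r.apparent_encoding
print(r.text[:200])              # first 200 characters of the decoded page
print(len(r.content))            # size of the raw binary body in bytes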

A general code framework for fetching web pages

import requests
def getHTMLText(url):
    try:
        r = requests.get(url,timeout=30)
        r.raise_for_status()  # raise an HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "an exception occurred"
        
if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

The HTTP protocol

Hypertext Transfer Protocol

robots.txt

The Robots Exclusion Standard

Scales of web crawlers

1. Small scale: small data volume, crawl speed not critical; for fetching and exploring individual pages; use the Requests library.
2. Medium scale: larger data volume, crawl speed matters; e.g. crawling a website or a series of websites; use the Scrapy library.
3. Large scale: crawl speed is critical, e.g. a search engine crawling the whole web; requires custom development.

Restrictions on web crawlers

Source vetting: restricting access by User-Agent

The site checks the User-Agent field of incoming HTTP request headers and responds only to browsers or friendly crawlers.

Published notice: the Robots protocol

Tells all crawlers the site's crawling policy and asks them to comply.

# '#' marks a comment, '*' matches all crawlers, '/' is the site root
User-agent: *
Disallow: /
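A minimal sketch of checking a site's robots.txt programmatically with the standard-library urllib.robotparser; the example.com URLs are placeholders.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()   # fetch and parse the robots.txt file

# Ask whether a given User-Agent may fetch a given path
print(rp.can_fetch("*", "https://www.example.com/some/page"))
print(rp.can_fetch("Mozilla/5.0", "https://www.example.com/"))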

Code for the first case (source vetting by User-Agent)

import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = { 'user-agent':'Mozilla/5.0'}
    r = requests.get(url,headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print("爬取失败”)

Full code for a Baidu search

import requests
keyword = "Python"
try:
    kv = {'wd':keyword}
    r = requests.get("http://www.baidu.com/s",params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("爬取失败")

Full code for downloading an image

import requests
import os
url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]   # file name taken from the last segment of the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)   # r.content is the binary image data
        print("file saved")
    else:
        print("file already exists")
except:
    print("download failed")

IP address lookup

import requests
url = "http://m.ip138.com/ip.asp?ip=?"
try:
    r = requests.get(url+'202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print("爬取失败")

Parsing HTML pages (information markup and extraction methods)

Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser')
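A slightly fuller sketch of the typical Beautiful Soup workflow, run on a small inline HTML fragment invented purely for illustration.

from bs4 import BeautifulSoup

html = '<html><body><p class="title">data</p><a href="/one">one</a><a href="/two">two</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.string)               # text of the first <p> tag -> 'data'
print(soup.p['class'])             # attribute access -> ['title']
for link in soup.find_all('a'):    # iterate over every <a> tag
    print(link.get('href'))        # -> /one, /two
print(soup.prettify())             # re-indented view of the parse tree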

Regular expressions in detail: extracting key information from a page

Re
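A brief sketch of the re calls used below; the sample string mimics the view_price / raw_title fields that the Taobao parser extracts, but is invented here for illustration.

import re

sample = '"view_price":"128.00","raw_title":"backpack"'

# re.findall returns every non-overlapping match as a list of strings
prices = re.findall(r'"view_price":"[\d.]*"', sample)
titles = re.findall(r'"raw_title":".*?"', sample)     # .*? is a non-greedy match
print(prices)   # ['"view_price":"128.00"']
print(titles)   # ['"raw_title":"backpack"']

# re.search finds the first match and exposes capture groups
m = re.search(r'"view_price":"([\d.]*)"', sample)
print(m.group(1))   # '128.00'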

#CrowTaobaoPrice.py
import requests
import re
 
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
     
def parsePage(ilt, html):
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])   # eval() strips the surrounding quotes
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")
 
def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))
         
def main():
    goods = '书包'   # search keyword ("backpack")
    depth = 3
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)
     
main()

Scrapy

An introduction to how web crawlers work and to Scrapy, a professional crawler framework.
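A minimal sketch of what a Scrapy spider looks like, assuming a project created with scrapy startproject; the spider name, start URL, and CSS selector are placeholders.

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                                  # name used by `scrapy crawl demo`
    start_urls = ["https://www.example.com/"]      # placeholder start page

    def parse(self, response):
        # Extract the text of every link on the page (selector is illustrative)
        for text in response.css("a::text").getall():
            yield {"link_text": text}

Run it from inside the project directory with `scrapy crawl demo -o links.json` to write the scraped items to a JSON file.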