(Building websites with HTML and Python) (Making web pages with Python)

Python: Scraping Data from HTML Pages

Software Environment

Mac 10.13.1 (17B1003)

Python 2.7.10

VSCode 1.18.1

Summary

This article is a practice demo. It uses Beautiful Soup to scrape data from a web page.

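Introduction to Beautiful Soup

Beautiful Soup provides a small set of simple, Pythonic functions for navigating, searching, and modifying a parse tree.

Beautiful Soup official Chinese documentation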
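To make the three operations concrete, here is a minimal sketch. The HTML fragment and its class names are made up for illustration and are not taken from the page scraped later in this article.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, used only for illustration.
html = ('<div class="row">'
        '<p class="title">Demo project</p>'
        '<p class="subtitle">No. 001</p>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

# Navigating: step from the document to the first <div>, then to its first <p>.
first_p = soup.div.p

# Searching: look a tag up by its class attribute.
subtitle = soup.find(attrs={'class': 'subtitle'})

# Modifying: replace the text inside the first <p>.
first_p.string = 'Renamed project'

print(first_p.string)
print(subtitle.string)
print(soup.div['class'])  # class is multi-valued, so this is a list
```

Attribute access such as soup.div.p only returns the first matching tag, so find or find_all is usually the safer choice on real pages.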

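Features

Simple: it is a toolkit that parses a document and hands the user the data they want to extract.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.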
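The encoding behavior can be checked with a small sketch. The sample bytes are an assumption chosen for illustration: a Latin-1 encoded fragment, with the source encoding passed explicitly through from_encoding so that no guessing is involved.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding (Latin-1), made up for illustration.
raw = u'<p>caf\xe9</p>'.encode('latin-1')

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')

# Inside the tree, text is Unicode.
text = soup.p.string

# Serializing the document with encode() produces UTF-8 bytes.
encoded = soup.encode('utf-8')

print(repr(text))
print(repr(encoded))
```

When from_encoding is omitted, Beautiful Soup tries to detect the encoding itself, which usually works but can guess wrong on short documents.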

Installing Beautiful Soup

Install pip (if needed): sudo easy_install pip

Install Beautiful Soup: sudo pip install beautifulsoup4

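Example

This example scrapes the investment list page of a reputable, transparent internet finance company [click to visit the page]. The page is shown in the figure below:

Determining the data to extract

This example retrieves the project list. Open Chrome's developer tools and locate the corresponding element, as shown in the figure below: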

Imports

import sys
import json
import urllib2 as HttpUtils
import urllib as UrlUtils
from bs4 import BeautifulSoup

Fetching page data (paginated)

def gethtml(page):
    'Fetch the HTML of the given page number'
    url = 'https://box.jimu.com/Project/List'
    values = {
        'category': '',
        'rate': '',
        'range': '',
        'page': page
    }
    data = UrlUtils.urlencode(values)
    # Enable debug logging for HTTP and HTTPS requests
    httphandler = HttpUtils.HTTPHandler(debuglevel=1)
    httpshandler = HttpUtils.HTTPSHandler(debuglevel=1)
    opener = HttpUtils.build_opener(httphandler, httpshandler)
    HttpUtils.install_opener(opener)
    request = HttpUtils.Request(url + '?' + data)
    request.get_method = lambda: 'GET'
    try:
        response = HttpUtils.urlopen(request, timeout=10)
    except HttpUtils.URLError as err:
        # HTTPError carries a status code; a plain URLError only has a reason
        if hasattr(err, 'code'):
            print err.code
        if hasattr(err, 'reason'):
            print err.reason
        return None
    else:
        print '====== Http request OK ======'
        return response.read().decode('utf-8')
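Since gethtml takes a page number, a natural next step is to walk through several pages. The following is only a sketch: collect_pages, its max_pages limit, and the rule that an empty page means there is no more data are all assumptions, not part of the original demo. The fetch and parse callables are passed in so the loop can be tried without touching the network.

```python
def collect_pages(fetch, parse, max_pages=5):
    """Fetch pages 1..max_pages and gather the parsed items.

    fetch(page) should return the page's HTML, or None on failure
    (the convention used by gethtml). parse(html) should return a list.
    """
    results = []
    for page in range(1, max_pages + 1):
        html = fetch(page)
        if html is None:   # network error: stop instead of retrying
            break
        items = parse(html)
        if not items:      # assumption: an empty page means no more data
            break
        results.extend(items)
    return results


# Dry run with stand-ins for the real fetcher and parser.
fake_pages = {1: 'a,b', 2: 'c'}
collected = collect_pages(lambda page: fake_pages.get(page),
                          lambda html: html.split(','))
print(collected)
```

In the real script, gethtml would be passed as fetch and the parsing loop shown later would be wrapped in a function and passed as parse.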

TIPS

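urlopen(url, data, timeout)

url: the URL to request

data: the data to send with the request

timeout: the timeout, in seconds

HttpUtils.build_opener(httphandler, httpshandler)

Turns on logging: network request logs are printed to the debug console, which makes debugging easier

A try/except block is necessary so that network errors can be caught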
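The query string that gethtml appends to the URL can be inspected on its own. This sketch assumes the same parameter names as the function above; build_list_url is a hypothetical helper, and the parameters are given as a list of pairs so that their order in the URL is fixed. The import fallback lets the snippet run on Python 2 and Python 3.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

BASE_URL = 'https://box.jimu.com/Project/List'


def build_list_url(page):
    """Return the list-page URL for the given page number."""
    # A list of pairs keeps the parameter order stable across versions.
    params = [('category', ''), ('rate', ''), ('range', ''), ('page', page)]
    return BASE_URL + '?' + urlencode(params)


print(build_list_url(2))
```

Because the parameters travel in the URL, urlopen is called without its data argument and the request stays a GET.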

Parsing the fetched data

Create a BeautifulSoup object

soup = BeautifulSoup(html, 'html.parser')
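The object created here is the root of a tree whose nodes are Tag, NavigableString, and Comment objects. A quick way to see the four kinds side by side, using a made-up one-line document:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A made-up document containing a tag, some text, and a comment.
doc = BeautifulSoup('<p class="title">Demo<!-- note --></p>', 'html.parser')

tag = doc.p                   # Tag: an element of the document
text, comment = tag.contents  # its two children: text, then the comment

print(type(doc).__name__)
print(type(tag).__name__)
print(type(text).__name__)
print(type(comment).__name__)
```

Comment is a subclass of NavigableString, so code that only wants visible text may need to skip comments explicitly.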

Get the object to iterate over

# items is a <listiterator object at 0x10a4b9950>, not a list, but it can still be looped over to visit every child node.

items = soup.find(attrs={'class':'row'}).children
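The children of a tag include the whitespace between elements, which is why the loop in the next step has to skip '\n' entries. A small sketch on a made-up fragment shows this, along with an alternative filter that keeps only Tag objects:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A made-up fragment with newlines between the child elements.
html = ('<div class="row">\n'
        '<div class="item">A</div>\n'
        '<div class="item">B</div>\n'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

children = list(soup.find(attrs={'class': 'row'}).children)

# Keep only element nodes; newline strings are NavigableString objects.
tags = [child for child in children if isinstance(child, Tag)]

print(len(children))
print([tag.string for tag in tags])
```

Filtering by type also skips whitespace that is not exactly a single newline, which a comparison against '\n' would miss.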

Iterate over the child nodes, parse them, and extract the required fields

projectList = []
for item in items:
    # Skip the newline strings that sit between the child tags
    if item == '\n':
        continue
    # Extract the required fields
    title = item.find(attrs={'class': 'title'}).string.strip()
    projectId = item.find(attrs={'class': 'subtitle'}).string.strip()
    projectType = item.find(attrs={'class': 'invest-item-subtitle'}).span.string
    percent = item.find(attrs={'class': 'percent'})
    state = 'Open'
    if percent is None:  # Funding is complete
        percent = '100%'
        state = 'Finished'
        totalAmount = item.find(attrs={'class': 'project-info'}).span.string.strip()
        investedAmount = totalAmount
    else:
        percent = percent.string.strip()
        state = 'Open'
        decimalList = item.find(attrs={'class': 'decimal-wrap'}).find_all(attrs={'class': 'decimal'})
        totalAmount = decimalList[0].string
        investedAmount = decimalList[1].string
    investState = item.find(attrs={'class': 'invest-item-type'})
    if investState is not None:
        state = investState.string
    profitSpan = item.find(attrs={'class': 'invest-item-rate'}).find(attrs={'class': 'invest-item-profit'})
    # The rate is split between the span's leading text and its <em> child
    profit1 = profitSpan.next_element.strip()
    profit2 = profitSpan.em.string.strip()
    profit = profit1 + profit2
    term = item.find(attrs={'class': 'invest-item-maturity'}).find(attrs={'class': 'invest-item-profit'}).string.strip()
    project = {
        'title': title,
        'projectId': projectId,
        'type': projectType,
        'percent': percent,
        'totalAmount': totalAmount,
        'investedAmount': investedAmount,
        'profit': profit,
        'term': term,
        'state': state
    }
    projectList.append(project)
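The loop above chains find(...).string.strip(), which raises an error if a tag is missing, or if .string is None because the tag has more than one child. One possible way to make this more forgiving is a small helper. text_of is a hypothetical name, and the fragment below is made up for illustration.

```python
from bs4 import BeautifulSoup


def text_of(node, default=''):
    """Return the stripped text of a node, or default if the node is missing."""
    if node is None:
        return default
    # get_text() joins the text of all descendants, unlike .string
    return node.get_text().strip()


# A made-up fragment: the span has both text and an <em> child.
html = '<div><span class="title"> Demo <em>7.5</em>% </span></div>'
soup = BeautifulSoup(html, 'html.parser')

title = soup.find(attrs={'class': 'title'})
missing = soup.find(attrs={'class': 'no-such-class'})

print(title.string)             # None, because the span has several children
print(text_of(title))
print(text_of(missing, 'n/a'))
```

Whether a missing field should fall back to a default or abort the run depends on how the data is used afterwards.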

The parsed output looks like this:

TIPS

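Parsing the HTML mainly relies on Beautiful Soup's four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. See the official Beautiful Soup Chinese documentation for details.

The parsed data can be persisted and then used to build a program that reminds you when to bid, so that no return slips by ^_^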
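As one possible way to persist the result, the list of project dictionaries can be written to a JSON file and read back later. This is only a sketch: save_projects and load_projects are hypothetical helpers, the sample record is made up, and a temporary file stands in for a real storage location.

```python
import json
import os
import tempfile


def save_projects(projects, path):
    """Write the list of project dicts to a JSON file."""
    with open(path, 'w') as fh:
        # Default ensure_ascii=True escapes non-ASCII text, so the file is plain ASCII
        json.dump(projects, fh, indent=2, sort_keys=True)


def load_projects(path):
    """Read the list of project dicts back from a JSON file."""
    with open(path) as fh:
        return json.load(fh)


# Round trip with a made-up record shaped like the dicts built above.
sample = [{'title': 'Demo project', 'percent': '80%', 'state': 'Open'}]
path = os.path.join(tempfile.mkdtemp(), 'projects.json')
save_projects(sample, path)
restored = load_projects(path)
open_projects = [p for p in restored if p['state'] == 'Open']
print(open_projects)
```

A reminder program could compare the newly scraped list with the saved one and report the projects whose state is still Open.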

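Disclaimer: all works on 我要去上班 (text, images, audio, and video) are uploaded and shared by users and are intended only for study and discussion. Copyright belongs to the original author, python技术开发; see the original source. If your rights have been infringed, please contact us for removal.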

Article title: (Building websites with HTML and Python) (Making web pages with Python)
Article link: https://www.51qsb.cn/article/m8mub.html
