
Windows + Anaconda + 4050 6G + ChatGLM Local Deployment, Part 5

This post walks through, and verifies, how to use web-scraping tools to capture the text we need.

Data Collection

At first I downloaded the novel, but it turned out to differ quite a bit from the TV adaptation I know, and its wording is not very colloquial. I then considered subtitles, but subtitles drop the most important element: who is speaking. Feed them to a large model and it has no idea who said which line, so it cannot connect the dialogue into relationships. In the end I settled on per-episode plot summaries as the data source: simpler and more to the point.

Since I need to scrape each episode's plot summary from a website, scraping tools are unavoidable.

Extracting Content from a Non-Copyable Code Block on a Web Page

# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup

url = "https://www.zhihu.com/question/467685925"  # replace with the URL you want to scrape
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, "html.parser")
# class_ matches the block you located in the page source by keyword search
class_element = soup.find('div', class_='highlight')
class_text = class_element.text
print(class_text)

Here is the result:
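One caveat worth adding (my own hardening, not part of the original snippet): requests.get without headers can be blocked or served an error page on some sites, and find() returns None when the class is missing or renamed. A minimal hardened variant, assuming a generic browser-like User-Agent string:

# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup

url = "https://www.zhihu.com/question/467685925"
# Assumption: a browser-like User-Agent; many sites reject bare requests.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page

soup = BeautifulSoup(response.content, "html.parser")
class_element = soup.find('div', class_='highlight')
if class_element is None:  # the block may be missing or renamed
    raise SystemExit("target block not found -- check the class name in the page source")
print(class_element.get_text())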

Extracting Content from Pages with "Next Page" Pagination

Observing the site reveals its URL naming pattern:

'''
https://www.tvmao.com/drama/YicsIy8=/episode
https://www.tvmao.com/drama/YicsIy8=/episode/0-2
https://www.tvmao.com/drama/YicsIy8=/episode/0-3
https://www.tvmao.com/drama/YicsIy8=/episode/1-4
https://www.tvmao.com/drama/YicsIy8=/episode/1-5
https://www.tvmao.com/drama/YicsIy8=/episode/1-6
https://www.tvmao.com/drama/YicsIy8=/episode/20-63
https://www.tvmao.com/drama/YicsIy8=/episode/21-64
https://www.tvmao.com/drama/YicsIy8=/episode/25-76
'''
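From these sample URLs, each pagination page appears to hold three episodes, so the page prefix for episode n is (n - 1) // 3, with episode 1 living at the bare /episode URL. A minimal sketch of that pattern (episode_url is my own helper name, and the asserts spot-check it against the URLs observed above):

BASE = "https://www.tvmao.com/drama/YicsIy8=/episode"

def episode_url(index: int) -> str:
    """Episode 1 has no suffix; episode n lives on pagination page (n-1)//3."""
    if index == 1:
        return BASE
    return "{}/{}-{}".format(BASE, (index - 1) // 3, index)

# Spot-check against the observed URLs.
assert episode_url(3) == BASE + "/0-3"
assert episode_url(64) == BASE + "/21-64"
assert episode_url(76) == BASE + "/25-76"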

So the core idea is to iterate over every page and, on each one, extract the target block based on a key identifier.

Two things to watch out for here:

1. When searching the page source for your anchor, use text near the end of the target block. Text from the start of the block may appear several times in the source, so it is easy to latch onto the wrong element.
2. Inspect the extracted text for irrelevant content and clean it out as early as possible, so the data stays tidy.

The full script is as follows:

import csv
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


def get_html(url):
    """
    Fetch the page source.
    :param url: URL
    :return: str
    """
    headers = {'User-Agent': UserAgent().random}
    response = requests.get(url, headers=headers)
    return response.text


def parse_html(html):
    """
    Parse the page source and extract the episode synopsis.
    :param html: page source
    :return: str
    """
    soup = BeautifulSoup(html, 'html.parser')
    class_element = soup.find('article', class_='clear epi_c')
    description = class_element.text
    # Data cleaning: strip the site's boilerplate notices (the literals below
    # are the exact fragments that appear in every scraped page).
    description = description.replace('(后宫·甄嬛传剧情系电视猫原创,未经许可,请勿转载!转载许可)', '').replace('上传剧照图片', '').strip() + '\n'
    print(description)
    return description


def write2csv(filename, rows):
    """
    Save rows to a CSV file.
    :param filename:
    :param rows:
    :return:
    """
    with open(filename, 'a+', encoding='utf-8', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(rows)


def write2txt(filename, contents):
    fh = open(filename, 'a+', encoding='utf-8')
    fh.write(contents)
    fh.close()


def main():
    """
    Entry point.
    :return:
    """
    filename = './甄嬛传剧情.txt'

    '''
    https://www.tvmao.com/drama/YicsIy8=/episode
    https://www.tvmao.com/drama/YicsIy8=/episode/0-2
    https://www.tvmao.com/drama/YicsIy8=/episode/0-3
    https://www.tvmao.com/drama/YicsIy8=/episode/1-4
    https://www.tvmao.com/drama/YicsIy8=/episode/1-5
    https://www.tvmao.com/drama/YicsIy8=/episode/1-6
    https://www.tvmao.com/drama/YicsIy8=/episode/20-63
    https://www.tvmao.com/drama/YicsIy8=/episode/21-64
    https://www.tvmao.com/drama/YicsIy8=/episode/25-76
    '''
    for index in range(1, 77):
        # Each pagination page holds three episodes; derive the page prefix.
        shang = index // 3
        rest = index % 3
        if rest == 0:
            pageNo = shang - 1
        else:
            pageNo = shang
        if index == 1:
            # Episode 1 lives at the bare /episode URL with no suffix.
            url = 'https://www.tvmao.com/drama/YicsIy8=/episode'
        else:
            url = 'https://www.tvmao.com/drama/YicsIy8=/episode/{}-{}'.format(pageNo, index)

        page_source = get_html(url)
        print(url)
        content = parse_html(page_source)
        write2txt(filename, content)


if __name__ == '__main__':
    main()
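One practical tweak, my own addition rather than part of the original script: the loop fires 76 requests back-to-back, which can trip rate limiting. A short random pause between requests is a common courtesy:

import random
import time

# Inside the for-loop in main(), after write2txt(filename, content):
time.sleep(random.uniform(1, 3))  # pause 1-3 s so the site isn't hammered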

The final result:

The extracted content, saved to file.