Getting Started with Scrapy

2024-11-07

Scrapy is an application framework written in Python for crawling websites and extracting structured data from them.

Install Scrapy

pip install scrapy==2.5.0
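After installing, it can help to confirm which version actually landed in the current environment. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name installed_version is made up for this illustration and is not part of Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version that pip installed, or None if Scrapy is not installed.
print("scrapy:", installed_version("scrapy"))
```

If this prints None, the pip command probably ran against a different Python environment than the one running the script.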

1. Create a new Scrapy project

scrapy startproject mySpider  # the project is named mySpider

2. Go to the spiders directory

cd mySpider/mySpider/spiders

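3. Create the spider

scrapy genspider dgcuAI ai.dgcu.edu.cn  # the spider is named dgcuAI and is restricted to the domain ai.dgcu.edu.cn

4. Write the spider

After creating the spider, open the generated dgcuAI.py file.

Import Selector: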

from scrapy.selector import Selector

Modify start_urls:

start_urls = ['http://ai.dgcu.edu.cn/front/category/2.html']

Modify the parse function:

def parse(self, response):
    print(response.url)
    selector = Selector(response)
    # Extract the fields with XPath expressions:
    # Title: //div[@class="pageList"]/ul/li/a/div[@class="major-content1"]/text()
    # Link:  //div[@class="pageList"]/ul/li/a/@href
    # Date:  //div[@class="pageList"]/ul/li/a/div[@class="major-content2"]/text()
    node_list = selector.xpath("//div[@class='pageList']/ul/li")
    for node in node_list:
        # Article title (the leading ./ keeps the query relative to this li node)
        title = node.xpath('./a[1]/div[@class="major-content1"]/text()').extract_first()
        # Article link
        url = node.xpath('./a[1]/@href').extract_first()
        # Date
        date = node.xpath('./a[1]/div[@class="major-content2"]/text()').extract_first()
        print("Title:", title)
        print("Link:", url)
        print("Date:", date)
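Inside the loop, each expression should be evaluated relative to the current li node, which is why it starts with ./ rather than //. The sketch below illustrates that "select the rows, then query each row" pattern without needing Scrapy. It swaps in the standard library's xml.etree.ElementTree as a stand-in, which supports only a subset of XPath (no text() or @href steps, so .text and .get() are used instead). The sample markup and its href values are assumptions made up to mirror the class names above; the real page may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating the list page structure; not the real page.
SAMPLE = """
<html><body>
  <div class="pageList"><ul>
    <li><a href="/front/article/1.html">
      <div class="major-content1">First post</div>
      <div class="major-content2">2024-11-01</div>
    </a></li>
    <li><a href="/front/article/2.html">
      <div class="major-content1">Second post</div>
      <div class="major-content2">2024-11-02</div>
    </a></li>
  </ul></div>
</body></html>
"""


def extract_rows(markup):
    """Return (title, href, date) tuples, one per li under div.pageList."""
    root = ET.fromstring(markup.strip())
    rows = []
    # First select every row, then query each row with paths relative to it.
    for node in root.findall(".//div[@class='pageList']/ul/li"):
        link = node.find("./a[1]")
        title = link.find("./div[@class='major-content1']").text
        date = link.find("./div[@class='major-content2']").text
        rows.append((title, link.get("href"), date))
    return rows


rows = extract_rows(SAMPLE)
for title, href, date in rows:
    print(title, href, date)
```

Had the inner queries started with // instead, each one would have searched the whole document again, so in Scrapy every row would report the first article's values.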

5. Run the spider

Create a run.py file in the mySpider/mySpider/ folder and run it:

from scrapy import cmdline

cmdline.execute("scrapy crawl dgcuAI -s LOG_ENABLED=False".split())

The "-s LOG_ENABLED=False" option suppresses log output. If the code fails, change it to "-s LOG_ENABLED=True" so that the error messages appear in the console.
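Since cmdline.execute receives a plain list of arguments, switching logging on and off only means changing one string in that list. A small sketch of how run.py might build it; the helper name build_crawl_argv is hypothetical and not part of Scrapy.

```python
def build_crawl_argv(spider_name, log_enabled=False):
    """Build the argument list for `scrapy crawl`, toggling LOG_ENABLED."""
    return ["scrapy", "crawl", spider_name, "-s", f"LOG_ENABLED={log_enabled}"]


# Turn logging on while debugging; leave the default (False) for quiet runs.
argv = build_crawl_argv("dgcuAI", log_enabled=True)
print(argv)
# In run.py this list would then be handed to Scrapy: cmdline.execute(argv)
```

With the default log_enabled=False, the helper yields the same list as the split() call in run.py.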

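Result: the console shows the requested URL, followed by the title, link, and date of each article in the list.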
