NScrapy 是基于 .NET Core 异步编程框架和 Redis 内存存储的开源分布式爬虫框架。NScrapy 的整体思想源于知名的 Python 爬虫框架 Scrapy,整体写法也接近于 Scrapy,支持分布式部署、丰富的中间件扩展以及流式(Fluent)API。
NScrapy is a distributed spider framework based on .NET Core async programming and Redis. Inspired by the famous Python framework Scrapy, NScrapy offers a similar development experience with support for distributed deployment, middleware extensions, and a Fluent API.
git clone https://github.com/xboxeer/NScrapy.git
cd NScrapy
dotnet build ./NScrapy.Cli/NScrapy.Cli.csproj
dotnet tool install --global --add-path ./NScrapy.Cli/bin/Debug/net10.0/publish创建新的爬虫项目模板:
nscrapy new <SpiderName> [options]| 参数/选项 | 说明 |
|---|---|
<SpiderName> |
爬虫名称(PascalCase,必填) |
-t, --type <TYPE> |
模板类型:basic(默认)或 distributed |
-o, --output <PATH> |
输出目录(默认 .) |
--force |
覆盖已有文件 |
示例:
# 创建基础爬虫
nscrapy new MySpider
# 在指定目录创建
nscrapy new MySpider -o ./spiders
# 创建分布式爬虫
nscrapy new MyDistributedSpider -t distributed
# 覆盖已存在的项目
nscrapy new MySpider --force运行爬虫程序:
nscrapy run [spider-name] [options]| 选项 | 说明 |
|---|---|
--role <ROLE> |
运行角色:single(默认)、spider、downloader |
--distributed |
启用分布式模式 |
--redis <HOST:PORT> |
Redis 连接地址 |
--redis-password <PWD> |
Redis 密码 |
--redis-ssl |
使用 SSL 连接 Redis |
--receiver-queue <NAME> |
请求队列名称 |
--response-queue <NAME> |
响应队列名称 |
--concurrency <N> |
并发请求数 |
--delay <MS> |
请求间隔(毫秒) |
-c, --config <PATH> |
配置文件路径 |
示例:
# 本地单节点运行
nscrapy run MySpider
# 分布式模式运行
nscrapy run MySpider --role spider --distributed --redis localhost:6379
# 带 Redis 认证和 SSL
nscrapy run MySpider --redis localhost:6380 --redis-password secret --redis-ssl
# 调整并发和请求间隔
nscrapy run MySpider --concurrency 10 --delay 100
# 运行下载器节点
nscrapy run --role downloader --redis localhost:6379
# 使用自定义配置
nscrapy run MySpider -c ./config.json配置按以下优先级生效(高 → 低):
- CLI 参数 — 命令行直接指定
- 环境变量 —
NSCRAPY_*前缀(如NSCRAPY_REDIS_HOST) - 配置文件 — JSON 配置文件
nscrapy new 生成的结构:
MySpider/
├── Spiders/
│ └── MySpiderSpider.cs # 爬虫逻辑
├── Items/
│ └── MySpiderItem.cs # 数据模型
├── Pipelines/
│ └── MySpiderPipeline.cs # 数据处理管道
├── Program.cs # 入口
├── MySpider.csproj
└── appsettings.json # 配置
nscrapy new MySpider
cd MySpider
nscrapy run MySpiderdotnet add package NScrapy.Infra
dotnet add package NScrapy.Scheduler使用 Fluent API 编写爬虫:
NScrapy.CreateSpider("MySpider")
.StartUrl("https://example.com")
.OnResponse(r => {
Console.WriteLine(r.CssSelector("title::text").Extract());
})
.AddPipeline<MyPipeline>()
.Configure(o => o.Concurrency = 5)
.Run();💡 更多 API 用法参见下方「编程指南」章节。
NScrapy 支持通过 Docker Compose 进行容器化部署,提供三种模式:
NScrapy supports containerized deployment via Docker Compose with three modes:
| 模式 / Mode | 适用场景 / Use Case | Redis |
|---|---|---|
| local-only | 单节点,无需 Redis / Single-node, no Redis | 无 / None (InMemoryScheduler) |
| local-redis | 同主机多工作节点 / Multi-worker on same host | 本地 Redis 容器 / Local Redis container |
| managed-redis | 云端/生产部署 / Cloud / production deploy | 外部 Redis / External Redis |
- Docker & Docker Compose v2+
- .NET 10.0(用于本地开发构建 / for local development/builds)
1. 克隆项目 / Clone the project
git clone https://github.com/your-org/NScrapy.git
cd NScrapy2. 创建 .env 配置文件 / Create a .env config file
# 托管 Redis 模式示例(如 Azure Redis Cache)
# Example for managed-redis mode (Azure Redis Cache)
cat > .env << 'EOF'
REDIS_ENABLED=true
REDIS_HOST=your-redis.redis.cache.windows.net
REDIS_PORT=6380
REDIS_PASSWORD=your-access-key-here
REDIS_USESSL=true
EOF3. 选择模式并启动 / Choose your mode and start
# 模式 1:本地仅限(无 Redis)
# Mode 1: Local only (no Redis)
docker-compose up -d
# 模式 2:本地 + 本地 Redis 容器
# Mode 2: Local + local Redis container
docker-compose --profile local-redis up -d
# 模式 3:托管 Redis(需先取消 .env 中 Redis 主机注释)
# Mode 3: Managed Redis (uncomment Redis host in .env first)
docker-compose up -d4. 扩缩容 Spider 和 Downloader / Scale Spider and Downloader workers
# 运行 3 个 Spider 实例和 4 个 Downloader 工作进程
# Run 3 spider instances and 4 downloader workers
docker-compose up -d --scale spider=3 --scale downloader=4所有设置通过环境变量控制。JSON 配置文件(appsettings.spider.json、appsettings.downloader.json)作为默认值,运行时会由环境变量覆盖。
All settings are controlled via environment variables. The JSON config files serve as defaults; environment variables override them at runtime.
| 变量 / Variable | 说明 / Description | 默认值 / Default |
|---|---|---|
REDIS_ENABLED |
启用 Redis 调度器 / Enable Redis scheduler | false |
REDIS_HOST |
Redis 主机名 / Redis hostname | redis |
REDIS_PORT |
Redis 端口 / Redis port | 6379 |
REDIS_PASSWORD |
Redis 密码 / Redis password | (empty) |
REDIS_USESSL |
使用 TLS/SSL | false |
SCHEDULER_TYPE |
调度器类名 / Scheduler class name | InMemoryScheduler |
NScrapy/
├── NScrapy.Cli/ # CLI 工具(nscrapy new / nscrapy run)
├── NScrapy.Core/ # 核心抽象层
├── NScrapy.Engine/ # 爬虫引擎
├── NScrapy.Infra/ # 基础设施(HTTP、解析器等)
├── NScrapy.Scheduler/ # 调度器(InMemory / Redis)
├── NScrapy.Spider/ # Spider 抽象
├── NScrapy.Downloader/ # Downloader 抽象
├── docker-compose.yml # Docker 部署配置(3 种模式)
├── Dockerfile.Spider # Spider 镜像构建
├── Dockerfile.Downloader # Downloader 镜像构建
├── appsettings.spider.json # Spider 默认配置
└── appsettings.downloader.json # Downloader 默认配置
# 构建 Spider 镜像
docker build -f Dockerfile.Spider -t nscrapy-spider .
# 构建 Downloader 镜像
docker build -f Dockerfile.Downloader -t nscrapy-downloader .
# 运行 spider 节点
docker run nscrapy-spider run MySpider --role spider --distributed
# 运行 downloader 节点
docker run nscrapy-downloader run --role downloader ┌─────────────────┐
│ Spider(s) │ ← 调度请求 / Schedules requests
│ (Crawler logic) │
└────────┬────────┘
│ Redis queues
┌─────────▼────────┐
│ Downloader(s) │ ← HTTP 工作器 / HTTP workers
│ (Workers) │
└─────────────────┘
NScrapy 提供流式 API,让爬虫编写更加简洁直观。通过 NScrapy.CreateSpider() 可以链式调用配置爬虫行为,支持本地运行和分布式运行两种模式。
NScrapy provides a Fluent API for writing spiders in a concise and intuitive chain-call style. Use NScrapy.CreateSpider() to configure and run spiders, supporting both local and distributed modes.
// Local spider / 本地爬虫
NScrapy.CreateSpider("MySpider")
.StartUrl("https://example.com")
.OnResponse(r => {
Console.WriteLine(r.CssSelector("title::text").Extract());
})
.AddPipeline<MyPipeline>()
.Configure(o => o.Concurrency = 5)
.Run();// Distributed spider / 分布式爬虫
NScrapy.CreateSpider("MySpider")
.StartUrl("https://example.com")
.OnResponse(r => { /* parse / 解析 */ })
.Distributed(d => d
.UseRedis("localhost:6379")
.ReceiverQueue("nscrapy:requests")
.ResponseQueue("nscrapy:responses")
)
.Run();| 选项 / Option | 类型 / Type | 说明 / Description |
|---|---|---|
Concurrency |
int |
并发请求数(默认 1)/ Concurrent requests, default 1 |
DelayMs |
int |
请求间隔毫秒(默认 0)/ Delay between requests (ms), default 0 |
MaxRetries |
int |
最大重试次数(默认 3)/ Max retry count, default 3 |
TimeoutMs |
int |
请求超时毫秒(默认 30000)/ Request timeout (ms), default 30000 |
UserAgents |
List<string> |
User-Agent 列表,随机选用 / User-Agent list, randomly selected |
NScrapy.CreateSpider("MySpider")
.StartUrl("https://example.com")
.OnResponse(r => { /* parse / 解析 */ })
.Configure(o => {
o.Concurrency = 5;
o.DelayMs = 1000;
o.MaxRetries = 3;
o.TimeoutMs = 15000;
o.UserAgents = new List<string> {
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
};
})
.Run();启用 JavaScript 渲染支持(用于抓取 SPA 页面的内容):
Enable JavaScript rendering support (for crawling SPA pages):
NScrapy.CreateSpider("MySpider")
.StartUrl("https://example.com")
.OnResponse(r => { /* parse / 解析 */ })
.AddDownloaderMiddleware<JsRenderMiddleware>()
.Run();
⚠️ 注意:JsRenderMiddleware目前为存根实现,功能开发中。⚠️ Note:JsRenderMiddlewareis currently a stub implementation; full functionality is under development.
修改项目中的 appsettings.json:
Modify appsettings.json in your project:
{
"Scheduler": {
"SchedulerType": "NScrapy.Scheduler.RedisExt.RedisScheduler"
},
"Scheduler.RedisExt": {
"RedisServer": "192.168.0.106",
"RedisPort": "6379",
"ReceiverQueue": "NScrapy.Downloader",
"ResponseQueue": "NScrapy.ResponseQueue"
}
}修改 NScrapy.DownloaderShell.dll 同层目录中的 appsettings.json:
Modify appsettings.json under NScrapy.DownloaderShell.dll directory:
{
"Scheduler": {
"SchedulerType": "NScrapy.Scheduler.RedisExt.RedisScheduler"
},
"Scheduler.RedisExt": {
"RedisServer": "192.168.0.106",
"RedisPort": "6379",
"ReceiverQueue": "NScrapy.Downloader",
"ResponseQueue": "NScrapy.ResponseQueue"
}
}单独运行 DownloaderShell: Run DownloaderShell individually:
dotnet /path/to/NScrapy.DownloaderShell.dll如果需要将 Downloader 状态更新到 Redis(用于监控),可以在 appsettings.json 中添加:
If you want to update Downloader status to Redis (for monitoring), add to appsettings.json:
"DownloaderMiddlewares": [
{ "Middleware": "NScrapy.DownloaderShell.StatusUpdaterMiddleware" }
]💡 NScrapyWebConsole 会从 Redis 中读取 Downloader 状态数据。 💡 NScrapyWebConsole reads Downloader status from Redis.
将爬取的数据存储到 MongoDB:
Save crawled data to MongoDB:
public class MongoItemPipeline : IPipeline<JobItem>
{
private MongoClient client = new MongoClient("mongodb://localhost:27017");
public async void ProcessItem(JobItem item, ISpider spider)
{
var db = client.GetDatabase("NScrapy");
var collection = db.GetCollection<JobItem>("JobItem");
await collection.InsertOneAsync(item);
}
}添加到 appsettings.json:
Add to appsettings.json:
"Pipelines": [
{ "Pipeline": "NScrapy.Project.MongoItemPipeline" }
]将数据保存为 CSV 文件:
Save data to CSV file:
public class CSVItemPipeline : IPipeline<JobItem>
{
private string startTime = DateTime.Now.ToString("yyyyMMddhhmm");
public void ProcessItem(JobItem item, ISpider spider)
{
var info = $"{item.Title},{item.Firm},{item.Salary},{item.Time},{System.Environment.NewLine}";
Console.WriteLine(info);
File.AppendAllText($"output-{startTime}.csv", info, Encoding.UTF8);
}
}添加到 appsettings.json:
Add to appsettings.json:
"Pipelines": [
{ "Pipeline": "NScrapy.Project.MongoItemPipeline" },
{ "Pipeline": "NScrapy.Project.CSVItemPipeline" }
]📄 旧版类继承 API 文档已移至 LEGACY.md。 For legacy class-inheritance API documentation, see LEGACY.md.