An LLM tool for crawling and analyzing AI-related papers in the field of fusion research. Currently supports paper crawling from the journal "Nuclear Fusion".
- Automatically crawl papers from specified years and months
- Use Qwen 2.5 model to intelligently identify AI-related papers
- Generate Chinese paper summaries automatically
- Extract publication dates and institutional information
- Export results to Excel format
- Support breakpoint resume and deduplication
- Python 3.8+ (tested on 3.12)
- Chrome browser (if using Selenium mode)
-
Clone the repository:
git clone https://github.com/ZhenghaoYang19/LiteratureResearch.git cd LiteratureResearch
-
Install dependencies:
pip install -r requirements.txt
-
Modify the time range in
main.py
:scraper = PaperScraper("Nuclear Fusion") scraper.scrape_papers(2020, 2021, 1, 12) # Crawl all issues from 2020-2021
-
Run
main.py
:python main.py
-
Hardware Requirements:
- Running Qwen 2.5 model requires at least 16GB GPU VRAM
- Consider using a smaller model if GPU memory is insufficient
-
Network Requirements:
- Stable network connection required
- Automatic retry on network errors
- Uses modelscope in China, consider switching to transformers outside China
-
Storage Space:
- PDFs will be downloaded to the
Papers
folder - Ensure sufficient storage space
- PDFs will be downloaded to the
-
Runtime:
- First run requires downloading the model (14GB), ensure stable network
- AI analysis takes time for each paper
- Tested on RTX 4090 with gigabit network: ~1s per paper analysis, 30-60s for PDF download
- Recommended to crawl in batches
-
Results:
- Results saved to
ai_fusion_papers.xlsx
- Supports resume from breakpoint
- Automatic deduplication
- Results saved to
Excel file contains the following fields:
- title: Paper title
- journal: Journal name
- volume_issue: Volume and issue number
- doi: Paper DOI
- first_institution: First author's institution
- second_institution: Second author's institution
- summary: Chinese content summary
- pub_date: Publication date
-
Model loading errors:
- Check GPU memory
- Verify CUDA installation
- Check modelscope installation
-
Network errors:
- Check network connection
- Consider using proxy
- Program will retry automatically
-
Selenium mode:
- Selenium mode is experimental
- Set
use_selenium=True
inPaperScraper
initialization - Ensure Chrome browser is installed
- Ensure matching ChromeDriver version
Copyright 2024 Harry Young
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.