Hacker News Scraper

Overview

This project is a web crawler developed in Python that extracts information from the Hacker News webpage using scraping techniques. It retrieves the title, order number, number of comments, and points of each entry on the page. It also applies two filtering operations to the extracted entries based on specific criteria.

Features

Fetches webpage content and extracts relevant information (titles, order numbers, number of comments, and points).
Handles up to the first 30 entries on the web page.
Offers two filtering options:
- Entries with more than 5 words in the title, ordered by number of comments (fewer comments first)
- Entries with 5 or fewer words in the title, ordered by number of points (fewer points first)
Outputs the data in JSON format via stdout with the following properties:
- all_entries: A list of all extracted entries.
- more_than_five_words_in_title_by_comments: Entries with more than five words in the title, ordered by number of comments.
- less_than_five_words_in_title_by_points: Entries with five or fewer words in the title, ordered by points.
Each entry is displayed as an object with the following properties:
- order (int): The rank of the entry as displayed on the Hacker News page, which may differ from its actual position in the list.
- title (string): The title of the entry.
- points (int): The number of points given to the entry. Takes the value null if the entry is a job offer.
- comments_count (int): The number of comments on the entry. Takes the value null if the entry is a job offer.
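
The two filters described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `count_words` and `split_and_sort` are hypothetical, words are counted by splitting on whitespace, and a null `points` or `comments_count` (job offers) is treated as 0 when sorting. All three choices are assumptions.

```python
def count_words(title: str) -> int:
    """Count whitespace-separated words (an assumed definition of "word")."""
    return len(title.split())


def split_and_sort(entries: list[dict]) -> dict:
    """Build the three output lists from already-extracted entries."""
    long_titles = [e for e in entries if count_words(e["title"]) > 5]
    short_titles = [e for e in entries if count_words(e["title"]) <= 5]
    return {
        "all_entries": entries,
        # Assumption: a null value (job offer) sorts as if it were 0.
        "more_than_five_words_in_title_by_comments": sorted(
            long_titles, key=lambda e: e["comments_count"] or 0
        ),
        "less_than_five_words_in_title_by_points": sorted(
            short_titles, key=lambda e: e["points"] or 0
        ),
    }


# Made-up sample entries, including one job offer with null values.
sample = [
    {"order": 1, "title": "A fairly long title with many words", "points": 10, "comments_count": 7},
    {"order": 2, "title": "Short one", "points": 50, "comments_count": 3},
    {"order": 3, "title": "Another title that has six words", "points": 5, "comments_count": 2},
    {"order": 4, "title": "Hiring engineers", "points": None, "comments_count": None},
]
result = split_and_sort(sample)
```

Mapping null to 0 keeps the sort from failing on `None`. The real script may place job offers differently.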

JSON Schema example

The JSON output returned by the script follows this schema:

{
  "all_entries": [
    {
      "order": "1",
      "title": "Sample Title",
      "points": 100,
      "comments_count": 50
    },
    {
      "order": "2",
      "title": "Another Title",
      "points": 80,
      "comments_count": 30
    },
    ...
  ],
  "more_than_five_words_in_title_by_comments": [
    {
      "order": "3",
      "title": "Title with more than five words",
      "points": 120,
      "comments_count": 70
    },
    ...
  ],
  "less_than_five_words_in_title_by_points": [
    {
      "order": "4",
      "title": "Short Title",
      "points": 150,
      "comments_count": 40
    },
    ...
  ]
}

Dependencies

Python 3.10 (other versions should work, but only 3.10 has been tested)
requests library (for fetching webpage content)
beautifulsoup4 library (for HTML parsing and data extraction)
pytest library (for writing and running tests)

The specific versions and other indirect dependencies can be checked in the requirements.txt file.

Installation

Clone the repository:

git clone git@github.com:mariomantilla/hacker-news-scraper.git

Navigate to the project directory:

cd hacker-news-scraper

Create a virtual environment (optional but recommended):

python -m venv venv

Activate the virtual environment (if you created it):

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Usage

Run the main script to extract entries from the Hacker News webpage:
```
python main.py
```
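This will fetch entries from https://news.ycombinator.com/ and display them in the terminal. Alternatively, you can write the data to a JSON file (using `>` rather than `>>`, so that repeated runs overwrite the file instead of appending a second JSON document to it):
```
python main.py > data.json
```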
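
Because the output is plain JSON, another script can consume it directly. The sketch below assumes only the three top-level properties listed under Features. The `summarize` helper is hypothetical and not part of the repository, and the inline string stands in for the contents of data.json.

```python
import json


def summarize(raw: str) -> dict:
    """Return the number of entries in each top-level list of the output."""
    data = json.loads(raw)
    return {key: len(entries) for key, entries in data.items()}


# Inline stand-in for the contents of data.json (made-up values).
entry = {"order": 1, "title": "Sample Title", "points": 100, "comments_count": 50}
raw_output = json.dumps({
    "all_entries": [entry],
    "more_than_five_words_in_title_by_comments": [],
    "less_than_five_words_in_title_by_points": [entry],
})
counts = summarize(raw_output)
print(counts)
```

With a real file, the same helper could be called on the text read from data.json.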

Testing

Run the test suite to ensure correct behaviour:

pytest -v

This command will run all test cases in verbose mode, providing detailed output.

Contributing

Contributions are welcome! If you encounter any bugs or have suggestions for improvements, please open an issue or submit a pull request.
