🗃️ Datasets 🎯

Where data dreams come true! A collection of datasets and data generation scripts for your AI adventures.

🌟 Overview

Welcome to the datasets repository! This is my personal treasure trove of datasets and data generation scripts that I've created, curated, and collected throughout my AI journey, and that I'm now in the process of open-sourcing. Whether you're looking for financial data from the FMP API, text corpora for NLP, or utilities to wrangle your own datasets into submission, you'll find it all neatly organized here.

📂 Repository Structure

datasets/
├── financial/       # Financial datasets and generation scripts
├── nlp/             # Natural language processing datasets
├── vision/          # Image and video datasets
├── multimodal/      # Datasets combining multiple modalities
├── utils/           # Utility scripts for data processing
└── templates/       # Templates for dataset documentation

🔍 Dataset Categories

💰 Financial

The financial/ directory contains datasets and scripts related to financial data, including:

FMP API: Scripts to pull and process data from the Financial Modeling Prep (FMP) API
Market Data: Historical stock prices, indices, and related financial instruments
Financial Reports: Structured data from quarterly and annual reports

🔤 NLP

The nlp/ directory houses text-based datasets and generation scripts, including:

Text Corpora: Collections of text for training language models
Embeddings: Pre-computed embeddings for various text datasets
Conversational Data: Dialogue datasets for training conversational AI

👁️ Vision

The vision/ directory contains image and video datasets, including:

Image Collections: Categorized images for computer vision tasks
Video Snippets: Short video clips for motion analysis
Annotations: Bounding boxes, segmentation masks, and other annotations

🔮 Multimodal

The multimodal/ directory contains datasets that span multiple modalities, perfect for training models that can understand both text and images, or other combinations.

🛠️ Utilities

The utils/ directory contains scripts to help you work with the datasets:

Data Cleaning: Scripts to sanitize and normalize data
Format Conversion: Tools to convert between different data formats
Metadata Generation: Utilities to generate and validate dataset metadata
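
As a rough illustration of the kind of format conversion described above, here is a minimal sketch that turns CSV text into JSON Lines using only the Python standard library. The function name csv_to_jsonl and the sample row are made up for this example and are not taken from the scripts in utils/.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines text.

    Hypothetical helper for illustration; not part of utils/.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # One JSON object per input row. Values stay as strings because
    # CSV carries no type information.
    return "\n".join(json.dumps(row) for row in reader)


# Made-up sample input, only to show the shape of the output.
sample = "symbol,close\nXYZ,10.5\n"
print(csv_to_jsonl(sample))  # {"symbol": "XYZ", "close": "10.5"}
```

Keeping values as strings is a deliberate choice here: casting numbers or dates is dataset-specific and is better handled in a separate cleaning step.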

📋 Using the Templates

The templates/ directory contains standardized templates for:

Dataset metadata in JSON format
Dataset README files
Script headers with standardized documentation

🚀 Getting Started

To use these datasets and scripts:

Clone this repository:

git clone https://github.com/Kris-Nale314/datasets.git
cd datasets

Explore the directories to find datasets or scripts you're interested in.
Each dataset comes with its own README providing specific usage instructions.

📊 Example: Using the FMP API Script

The FMP API script allows you to pull financial data and store it in a structured JSON format for use in LLMs or RAG systems:

from financial.fmp_api.scripts.fetcher import FMPDataFetcher

# Initialize the fetcher with your API key
fetcher = FMPDataFetcher(api_key="your_api_key_here")

# Fetch financial data for a specific company
apple_data = fetcher.get_company_profile("AAPL")

# Save the data to a JSON file
fetcher.save_to_json(apple_data, "apple_profile.json")
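
Once the data is saved as JSON, one possible way to prepare it for an LLM or RAG pipeline is to flatten each record into plain "key: value" lines. The sketch below assumes the saved file holds a JSON object or a list of objects; the helper names flatten and load_records, and the record used in the demo, are hypothetical and not part of FMPDataFetcher.

```python
import json


def flatten(record: dict, prefix: str = "") -> list:
    """Flatten a (possibly nested) JSON object into 'key: value' lines."""
    lines = []
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, using dotted key paths.
            lines.extend(flatten(value, prefix=name + "."))
        else:
            lines.append(f"{name}: {value}")
    return lines


def load_records(path: str) -> list:
    """Load a JSON file and always return a list of records."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data if isinstance(data, list) else [data]


# Made-up record, only to show the output shape.
demo = {"symbol": "XYZ", "profile": {"sector": "Example"}}
print("\n".join(flatten(demo)))
```

With a real file, something like `"\n".join(flatten(load_records("apple_profile.json")[0]))` would give one text chunk per company, assuming the file has the structure described above.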

🤝 Contributing

While this is primarily a personal repository, contributions, suggestions, and issues are welcome! If you have ideas for improving the organization or want to add your own datasets/scripts, please:

Fork the repository
Create a new branch
Make your changes
Submit a pull request

📝 Dataset Documentation Standard

Each dataset in this repository follows a standard documentation format:

Name: What the dataset is called
Description: What the dataset contains
Source: Where the data came from
Size: How big the dataset is
Format: What format the data is in
License: How the dataset can be used
Last Updated: When the dataset was last modified
Usage Examples: Code snippets showing how to use the dataset
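
The fields above lend themselves to a simple completeness check. The following is a minimal sketch, assuming the metadata is stored as a JSON object with snake_case keys; the key names and the missing_fields helper are assumptions made for illustration, so the actual templates in templates/ may use different names.

```python
# Assumed snake_case keys mirroring the documentation standard above.
REQUIRED_FIELDS = [
    "name",
    "description",
    "source",
    "size",
    "format",
    "license",
    "last_updated",
    "usage_examples",
]


def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Made-up metadata record with two fields left out on purpose.
example = {
    "name": "example-dataset",
    "description": "Placeholder description",
    "source": "Placeholder source",
    "format": "JSON",
    "license": "MIT",
    "last_updated": "2024-01-01",
}
print(missing_fields(example))  # ['size', 'usage_examples']
```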

📜 License

This repository and its contents are available under the MIT License - see the LICENSE file for details.

Happy data wrangling! 🎉

"Data is the new oil, but unlike oil, it becomes more valuable the more it's refined." - Unknown
