# Week 2: Publish a Hub Dataset Create and share high-quality datasets on the Hub. Good data is the foundation of good modelsβ€”help the community by contributing datasets others can train on. ## Why This Matters The best open source models are built on openly available datasets. By publishing well-documented, properly structured datasets, you're directly enabling the next generation of model development. Quality matters more than quantity. ## The Skill Use `hf_dataset_creator/` for this quest. Key capabilities: - Initialize dataset repos with proper structure - Multi-format support: chat, classification, QA, completion, tabular - Template-based validation for data quality - Streaming uploads without downloading entire datasets ```bash # Quick setup with a template python hf_dataset_creator/scripts/dataset_manager.py quick_setup \ --repo_id "your-username/dataset-name" --template chat ``` ## XP Tiers ### 🐒 Starter β€” 50 XP **Upload a small, clean dataset with a complete dataset card.** 1. Create a dataset with ≀1,000 rows 2. Write a dataset card covering: license, splits, and data provenance 3. Upload to the Hub under the hackathon organization (or your own account) **What counts:** Clean data, clear documentation, proper licensing. ```bash python hf_dataset_creator/scripts/dataset_manager.py init \ --repo_id "hf-skills/your-dataset-name" python hf_dataset_creator/scripts/dataset_manager.py add_rows \ --repo_id "hf-skills/your-dataset-name" \ --template classification \ --rows_json "$(cat your_data.json)" ``` ### πŸ• Standard β€” 100 XP **Publish a conversational dataset with a complete dataset card.** 1. Create a dataset with ≀1,000 rows 2. Write a dataset card covering: license and splits. 3. Upload to the Hub under the hackathon organization. **What counts:** Clean data, clear documentation, proper licensing. ### 🦁 Advanced β€” 200 XP **Translate a dataset into multiple languages and publish it on the Hub.** 1. Find a dataset on the Hub 2. Translate the dataset into multiple languages 3. Publish the translated datasets on the Hub under the hackathon organization **What counts:** Translated datasets and merged PRs. ## Resources - [SKILL.md](../hf_dataset_creator/SKILL.md) β€” Full skill documentation - [Templates](../hf_dataset_creator/templates/) β€” JSON templates for each format - [Examples](../hf_dataset_creator/examples/) β€” Sample data and system prompts --- **Next Quest:** [Supervised Fine-Tuning](04_sft-finetune-hub.md)