Update README.md
README.md
CHANGED
@@ -9,3 +9,72 @@ license: apache-2.0
short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
---

# Synthetic Data SDK by MOSTLY AI Demo

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

The **Synthetic Data SDK** is a Python toolkit for generating high-fidelity, privacy-safe **synthetic data**.

- **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform and trains and generates synthetic data there.
- Generators that were trained locally can be easily imported to a platform for further sharing.

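As a rough illustration of the two modes, the sketch below picks the constructor arguments for each. The `MostlyAI` class name appears in the API reference links of this README. The `mostlyai.sdk` import path, the `local`, `api_key`, and `base_url` keyword names, and the helpers `client_kwargs` and `make_client` are assumptions made for illustration, so check them against the SDK documentation before relying on them.

```python
from typing import Optional


def client_kwargs(api_key: Optional[str] = None,
                  base_url: str = "https://app.mostly.ai") -> dict:
    """Choose LOCAL mode when no API key is given, CLIENT mode otherwise.

    The keyword names (local, api_key, base_url) are assumptions.
    """
    if api_key is None:
        # LOCAL mode: train and generate on your own compute resources.
        return {"local": True}
    # CLIENT mode: train and generate on a remote MOSTLY AI platform.
    return {"api_key": api_key, "base_url": base_url}


def make_client(api_key: Optional[str] = None):
    """Create the SDK client. Requires the `mostlyai` package to be installed."""
    # Imported lazily so that client_kwargs() can be used without the SDK.
    from mostlyai.sdk import MostlyAI  # assumed import path
    return MostlyAI(**client_kwargs(api_key))
```

Calling `make_client()` without arguments would then start in LOCAL mode, while `make_client("<your-api-key>")` would target the remote platform.
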
## Overview

The SDK allows you to programmatically create, browse, and manage three key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create as many synthetic samples as you need
3. **Connectors** - Connect to any data source within your organization, for reading and writing data

| Intent                                        | Primitive                         | API Reference                                                                                                 |
|-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |

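The primitives in the table can be chained into a train, probe, generate sequence. The following is a minimal sketch of that sequence, not the SDK's reference usage. The method names `train`, `probe`, and `generate` come from the table above. The config keys (`name`, `tables`, `data`, `tabular_model_configuration`, `max_training_time`), the `size` keyword, the `.data()` accessor, and the helper names are assumptions for illustration.

```python
from typing import Any


def train_config(name: str, table_name: str, data: Any, max_minutes: int = 1) -> dict:
    """Build a single-table generator config (key names are assumptions)."""
    return {
        "name": name,
        "tables": [
            {
                "name": table_name,
                "data": data,  # e.g. a pandas DataFrame holding the original records
                # Cap the training time (assumed to be in minutes) to keep a demo short.
                "tabular_model_configuration": {"max_training_time": max_minutes},
            }
        ],
    }


def train_probe_generate(mostly: Any, data: Any, n_probe: int = 10, n_generate: int = 1000):
    """Run train -> probe -> generate on an SDK client object passed in as `mostly`."""
    g = mostly.train(config=train_config("demo generator", "demo", data))
    samples = mostly.probe(g, size=n_probe)    # a few samples on demand
    sd = mostly.generate(g, size=n_generate)   # a Synthetic Dataset resource
    return g, samples, sd.data()               # .data() is assumed to return the records
```

Passing the client in as an argument keeps the sketch independent of LOCAL or CLIENT mode: the same function would work with either kind of client.
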
https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Key Features

- **Broad Data Support**
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- **Multiple Model Types**
  - State-of-the-art performance via TabularARGN
  - Fine-tune Hugging Face hosted language models
  - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- **Automated Quality Assurance**
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- **Flexible Sampling**
  - Up-sample to any data volume
  - Conditional simulations based on any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- **Seamless Integration**
  - Connect to external data sources (DBs, cloud storage)
  - Fully permissive open-source license

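To make one of the sampling options concrete, the sketch below builds a generation config that re-balances an underrepresented segment. It only constructs a plain dictionary. The key names (`tables`, `configuration`, `sample_size`, `rebalancing`, `column`, `probabilities`), the helper `rebalance_config`, and the `census`/`income` example values are assumptions for illustration and are not confirmed by this README.

```python
def rebalance_config(table: str, column: str, probabilities: dict, sample_size: int) -> dict:
    """Build a generation config that shifts category shares of one column.

    `probabilities` maps category values to target shares; categories that are
    not listed are assumed to share the remaining probability mass.
    """
    shares = list(probabilities.values())
    if any(p < 0.0 or p > 1.0 for p in shares) or sum(shares) > 1.0:
        raise ValueError("probabilities must lie in [0, 1] and sum to at most 1")
    return {
        "tables": [
            {
                "name": table,
                "configuration": {
                    "sample_size": sample_size,  # number of synthetic records to create
                    "rebalancing": {"column": column, "probabilities": probabilities},
                },
            }
        ]
    }
```

Under these assumptions, a call such as `mostly.generate(g, config=rebalance_config("census", "income", {">50K": 0.5}, 10_000))` would request a dataset in which half of the records fall into the `>50K` segment.
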
## Citation

Please consider citing our project if you find it useful:

```bibtex
@misc{mostlyai,
      title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
      author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
      year={2025},
      eprint={2508.00718},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00718},
}
```