---
title: README
emoji: π
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---

## Simply detect, segment, label, and separate speakers in any language

<div align="center">
<a href="https://github.com/pyannote/pyannote-audio"><img alt="Github" src="https://img.shields.io/badge/Open%20source%20toolkit-059669?style=flat&logo=github&logoColor=FFFFFF"></a>
<a href="https://hf.co/pyannote"><img alt="Hugging Face" src="https://img.shields.io/badge/Open%20models-059669?style=flat&logo=huggingface&logoColor=FFFFFF"></a>
<a href="https://discord.gg/vux8UH9QmV"><img alt="Discord" src="https://img.shields.io/badge/Discord-059669?style=flat&logo=discord&logoColor=white"></a>
<a href="https://www.linkedin.com/company/pyannoteai/"><img alt="LinkedIn" src="https://img.shields.io/badge/LinkedIn-059669?style=flat&logo=linkedin&logoColor=white"></a>
<a href="https://x.com/pyannoteAI"><img alt="X" src="https://img.shields.io/badge/X-059669?style=flat&logo=x&logoColor=white"></a><br/>
<a href="https://dashboard.pyannote.ai/"><img alt="Playground" src="https://img.shields.io/badge/-Playground-059669?style=flat"></a>
<a href="https://docs.pyannote.ai/"><img alt="Documentation" src="https://img.shields.io/badge/Documentation-059669?style=flat"></a>
</div>

[pyannoteAI](https://www.pyannote.ai/) makes it easier to understand speakers and conversation context. We focus on identifying speakers and conversation metadata under conditions that reflect real conversations rather than controlled recordings.

### What is speaker diarization?

**Speaker diarization** is the process of automatically partitioning the audio recording of a conversation into segments and labeling them by speaker, answering the question **"who spoke when?"** As the **foundational layer of conversational AI**, speaker diarization provides high-level insights into human-human and human-machine conversations, and unlocks a wide range of downstream applications, including meeting transcription, call center analytics, voice agents, and video dubbing.

### Getting started

Install the latest release of [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) from PyPI with either `uv` (recommended) or `pip`:

```bash
# with uv (recommended)
$ uv add pyannote.audio
# or with pip
$ pip install pyannote.audio
```

Enjoy state-of-the-art speaker diarization:

```python
# download the pretrained pipeline from Hugging Face
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization-community-1', token="HUGGINGFACE_TOKEN")

# perform speaker diarization locally
output = pipeline('/path/to/audio.wav')

# print who speaks when
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start}s and t={turn.end}s")
```

Read the [`community-1` model card](https://hf.co/pyannote/speaker-diarization-community-1) to make the most of it.
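
Once the speaker turns are available, they can be aggregated in plain Python. The sketch below sums speaking time per speaker from `(start, end, speaker)` tuples. The `speaking_time` helper and the sample turns are hypothetical; with `pyannote.audio`, such tuples could be built from `turn.start`, `turn.end`, and `speaker` in the loop above.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum turn durations (in seconds) per speaker.

    `turns` is an iterable of (start, end, speaker) tuples.
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical turns, in seconds (not real pipeline output).
turns = [
    (0.0, 4.5, "SPEAKER_00"),
    (4.5, 7.0, "SPEAKER_01"),
    (7.0, 12.0, "SPEAKER_00"),
]
totals = speaking_time(turns)
for speaker, seconds in totals.items():
    print(f"{speaker} spoke for {seconds:.1f}s")
```
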
### State-of-the-art models

The [`pyannoteAI`](https://www.pyannote.ai/) research team trains cutting-edge speaker diarization models, thanks to the [**Jean Zay**](http://www.idris.fr/eng/jean-zay/) 🇫🇷 supercomputer managed by [**GENCI**](https://www.genci.fr/). They come in two flavors:

* [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) open models available on [Hugging Face](https://hf.co/pyannote) and used by 140k+ developers worldwide;
* premium models available on [`pyannoteAI` cloud](https://dashboard.pyannote.ai) (and on-premise for enterprise customers) that provide state-of-the-art speaker diarization as well as additional enterprise features.

| Benchmark (last updated in 2025-09) | <a href="https://hf.co/pyannote/speaker-diarization-3.1">`legacy` (3.1)</a> | <a href="https://hf.co/pyannote/speaker-diarization-community-1">`community-1`</a> | <a href="https://docs.pyannote.ai">`precision-2`</a> |
| --- | --- | --- | --- |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 12.2 | 11.7 | 11.4 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 24.5 | 20.3 | 15.2 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.8 | 17.0 | 12.9 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 22.7 | 19.9 | 15.6 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.7 | 44.6 | 37.1 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 28.5 | 26.7 | 16.6 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 21.4 | 20.2 | 14.7 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 51.2 | 46.8 | 39.0 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 25.4 | 22.8 | 17.3 |
| [RAMC](https://www.openslr.org/123/) | 22.2 | 20.8 | 10.5 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase2) | 7.9 | 8.9 | 7.4 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.2 | 8.5 |

__[Diarization error rate](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %, lower is better)__

Our models achieve competitive performance across multiple public diarization datasets. Explore the pyannoteAI performance benchmark at [https://www.pyannote.ai/benchmark](https://www.pyannote.ai/benchmark).
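
As a rough illustration of how a diarization error rate is obtained, the sketch below applies the usual definition: the sum of false alarm, missed detection, and speaker confusion durations, divided by the total duration of reference speech. The function name and the durations are made up for illustration and are not taken from the benchmark above. A real evaluation would typically rely on a dedicated tool such as `pyannote.metrics`.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total):
    """Return the diarization error rate in percent.

    All arguments are durations in seconds; `total` is the total
    duration of reference speech.
    """
    if total <= 0:
        raise ValueError("total reference speech duration must be positive")
    return 100.0 * (false_alarm + missed_detection + confusion) / total


# Hypothetical error durations for a recording with 600 s of reference speech.
der = diarization_error_rate(
    false_alarm=20.0, missed_detection=35.0, confusion=17.0, total=600.0
)
print(f"DER = {der:.1f}%")  # DER = 12.0%
```
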
### Going further, better, and faster

The [`precision-2`](https://www.pyannote.ai/blog/precision-2) premium model further improves accuracy and processing speed, and brings additional features.

| Features | <a href="https://hf.co/pyannote/speaker-diarization-community-1">`community-1`</a> | <a href="https://docs.pyannote.ai">`precision-2`</a> |
| -------------- | ----------- | ----------- |
| Set exact/min/max number of speakers | ✅ | ✅ |
| Exclusive speaker diarization (for transcription) | ✅ | ✅ |
| Segmentation confidence scores | ❌ | ✅ |
| Speaker confidence scores | ❌ | ✅ |
| Voiceprinting | ❌ | ✅ |
| Speaker identification | ❌ | ✅ |
| STT Orchestration | ❌ | ✅ |
| Time to process 1h of audio (on H100) | 37s | 14s |

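
To put the processing times in the table above in perspective, the short calculation below converts them into a speed relative to real time. It only uses the two figures reported in the table (37 s and 14 s per hour of audio on an H100); the variable names are illustrative.

```python
AUDIO_SECONDS = 3600  # one hour of audio

# Processing times (in seconds) reported in the table above, on an H100.
processing_seconds = {"community-1": 37, "precision-2": 14}

# Speed relative to real time: audio duration divided by processing time.
speed_vs_real_time = {
    model: AUDIO_SECONDS / seconds for model, seconds in processing_seconds.items()
}

for model, factor in speed_vs_real_time.items():
    print(f"{model}: about {factor:.0f}x faster than real time")

# Relative speed-up of precision-2 over community-1.
speedup = processing_seconds["community-1"] / processing_seconds["precision-2"]
print(f"precision-2 is about {speedup:.1f}x faster than community-1")
```
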

Create a [`pyannoteAI`](https://dashboard.pyannote.ai) account, change one line of code, and enjoy free cloud credits to try [`precision-2`](https://pyannote.ai/blog/precision-2) premium diarization:

```python
# perform premium speaker diarization on pyannoteAI cloud
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization-precision-2', token="PYANNOTEAI_API_KEY")
better_output = pipeline('/path/to/audio.wav')
```

### Get speaker-attributed transcripts

We host open-source transcription models such as [**Nvidia Parakeet-tdt-0.6b-v3**](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) and [**OpenAI whisper-large-v3-turbo**](https://huggingface.co/dropbox-dash/faster-whisper-large-v3-turbo), combined with specialized STT + diarization reconciliation logic, to produce speaker-attributed transcripts.

STT orchestration combines pyannoteAI `Precision-2` diarization with transcription services. Instead of running diarization and transcription separately and then reconciling the outputs manually, you make one API call and receive speaker-attributed transcripts.
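
To give a sense of what reconciling the outputs manually involves, here is a minimal sketch that labels each transcribed word with the speaker whose turn overlaps it the most. This is a simple heuristic assumed for illustration, not pyannoteAI's reconciliation logic; the `overlap` and `assign_speakers` helpers and the sample data are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Duration (in seconds) shared by two intervals, 0.0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    Assumes `turns` is non-empty. Words that overlap no turn get None.
    """
    labeled = []
    for word in words:
        best = max(
            turns,
            key=lambda t: overlap(word["start"], word["end"], t["start"], t["end"]),
        )
        shared = overlap(word["start"], word["end"], best["start"], best["end"])
        labeled.append({**word, "speaker": best["speaker"] if shared > 0 else None})
    return labeled


# Hypothetical transcription and diarization outputs, in seconds.
words = [
    {"start": 0.2, "end": 0.6, "text": "hello"},
    {"start": 1.9, "end": 2.3, "text": "hi"},
]
turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 4.0, "speaker": "SPEAKER_01"},
]
labeled = assign_speakers(words, turns)
for word in labeled:
    print(word["start"], word["end"], word["speaker"], word["text"])
```

With STT orchestration, this reconciliation step happens on the service side, so none of this code is needed.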

To use this feature, make a request to the diarize API endpoint with the `transcription:true` flag.

```python
# pip install pyannoteai-sdk
from pyannoteai.sdk import Client

client = Client("your-api-key")
# submit a diarization job with transcription enabled
job_id = client.diarize(
    "https://www.example/audio.wav",
    transcription=True)
# retrieve the job output
job_output = client.retrieve(job_id)
# word-level transcript with speaker labels
for word in job_output['output']['wordLevelTranscription']:
    print(word['start'], word['end'], word['speaker'], word['text'])
# turn-level transcript with speaker labels
for turn in job_output['output']['turnLevelTranscription']:
    print(turn['start'], turn['end'], turn['speaker'], turn['text'])