Open to Collab

Nima Nooshiri

nimanzik

AI & ML interests

None yet

Recent Activity

upvoted an article 18 minutes ago

upvoted an article 1 day ago

LateOn-Code & ColGrep: LightOn unveils state-of-the-art code retrieval models and code search tooling

reacted to Kseniase's post with 👍 2 days ago

15 types of attention mechanisms

Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail and now it's time to summarize other existing types of attention.

Here is a list of 15 types of attention mechanisms used in AI models:

1. Soft attention (Deterministic attention) -> https://huggingface.co/papers/1409.0473
Assigns a continuous weight distribution over all parts of the input. It produces a weighted sum of the input using attention weights that sum to 1.

2. Hard attention (Stochastic attention) -> https://huggingface.co/papers/1508.04025
Makes a discrete selection of some part of the input to focus on at each step, rather than attending to everything.

3. Self-attention -> https://huggingface.co/papers/1706.03762
Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.

4. Cross-Attention (Encoder-Decoder attention) -> https://huggingface.co/papers/2104.08771
The queries come from one sequence and the keys/values come from another sequence. It allows a model to combine information from two different sources.

5. Multi-Head Attention (MHA) -> https://huggingface.co/papers/1706.03762
Multiple attention “heads” are run in parallel. The model computes several attention distributions (heads), each with its own set of learned projections of queries, keys, and values.

6. Multi-Head Latent Attention (MLA) -> https://huggingface.co/papers/2405.04434
Extends MHA by incorporating a latent space where attention heads can dynamically learn different latent factors or representations.

7. Memory-Based attention -> https://huggingface.co/papers/1503.08895
Involves an external memory and uses attention to read from and write to this memory.

See other types in the comments 👇
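To make the soft, self-, and cross-attention descriptions in the post more concrete, here is a minimal pure-Python sketch. It is not taken from the post or the linked papers' code. It assumes scaled dot-product scoring (one common choice) and leaves out the learned query/key/value projections that a real model would apply. The names `softmax`, `attention`, `seq_a`, and `seq_b` are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Soft attention with scaled dot-product scores (an assumption of this sketch).

    Each query yields one weight per key, and the output for that query
    is the weighted sum of the value vectors.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy sequences of 2-dimensional vectors (hypothetical data).
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[0.5, 0.5], [2.0, -1.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attention(seq_a, seq_a, seq_a)

# Cross-attention: queries from one sequence, keys/values from another.
cross_out, cross_w = attention(seq_a, seq_b, seq_b)
```

In this sketch the only difference between self-attention and cross-attention is which sequence supplies the keys and values. Multi-head attention would run several such computations in parallel, each on differently projected inputs.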

liked a Space 11 days ago

Distilling 100B+ Models 40x Faster with TRL

TRL distillation for 100B+ teachers, 40x faster

liked 3 Spaces about 1 month ago

The Synthetic Data Playbook: Generating Trillions of the Finest Tokens

Explore synthetic data experiments on a virtual bookshelf

The Smol Training Playbook

The secrets to building world-class LLMs

Evaluation Guidebook

Explore LLM benchmark trends over time

liked a Space 2 months ago

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

Who needs 1T parameters? Olympiad proofs with a 4B model

liked a Space 8 months ago

The Ultra-Scale Playbook

The ultimate guide to training LLM on large GPU Clusters

liked 3 Spaces over 1 year ago

Can You Run It? LLM version

Calculate GPU needs for running LLMs on your hardware

Accelerate Presentation

Launch and train PyTorch models easily

Accelerate Examples

Generate code samples for model training and configuration