Mismatch between full attention layer mentioned in Dflash config for target context

#3
by RaghavvGoel - opened

Hi,

I was trying to run the DFlash draft model with the Qwen3.5-4B target model. However, I found that DFlash uses target hidden states from linear attention layers. Is this the intended behavior?
Have the authors run an ablation that uses target hidden states from full attention layers only?

Thanks!

Z Lab org

We didn't distinguish between linear attention layers and softmax attention layers; we just select the hidden states uniformly from the target model's layers. We did try taking all target hidden states from softmax attention layers in the Qwen3-Coder-Next-DFlash model, but the difference seems small.
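
For anyone comparing the two options, here is a minimal sketch of what "uniform over all layers" versus "uniform over softmax (full) attention layers only" could look like. It is not the DFlash code: the function names, the 32-layer depth, the choice of 4 selected layers, and the "every 4th layer is full attention" layout are all assumptions made for illustration, so check the actual target config before relying on any of the indices.

```python
def uniform_layer_ids(num_layers: int, num_selected: int) -> list[int]:
    """Pick num_selected indices spread evenly over range(num_layers).

    Hypothetical helper: takes the midpoint of each equal-width bucket.
    """
    if not 0 < num_selected <= num_layers:
        raise ValueError("num_selected must be in 1..num_layers")
    step = num_layers / num_selected
    return [int(step * i + step / 2) for i in range(num_selected)]


def uniform_full_attention_ids(layer_types: list[str], num_selected: int) -> list[int]:
    """Same uniform spacing, but restricted to full attention layers."""
    full = [i for i, kind in enumerate(layer_types) if kind == "full_attention"]
    return [full[j] for j in uniform_layer_ids(len(full), num_selected)]


# Assumed hybrid layout: 32 layers, every 4th one uses full attention.
NUM_LAYERS = 32
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

all_layers = uniform_layer_ids(NUM_LAYERS, 4)
full_only = uniform_full_attention_ids(layer_types, 4)

print("uniform over all layers:", all_layers)
print("their layer types:      ", [layer_types[i] for i in all_layers])
print("full attention only:    ", full_only)
```

With this assumed layout, uniform selection happens to land only on linear attention layers, which matches the behavior described in the question; the second function shows the alternative that was reported to make little difference.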
