Does GPT use the encoder or decoder part of the Transformer?
GPT-style LLMs use a decoder-only Transformer. More precisely: stacked Transformer decoder blocks with causal (masked) self-attention, so each token can attend only to itself and to earlier tokens.

Why is causal masking essential in GPT?
Because GPT is trained to predict the next token. If the token at position t could see future tokens, the task would leak the answer and the model would not learn proper autoregressive generation.

How does Mamba differ from a Transformer?
A Transformer uses attention to compare tokens with other tokens. Mamba uses a selective state-space mechanism: it updates a hidden state and learns what information to keep, forget, or pass forward.

Why can Mamba scale better to long sequences than attention-based Transformers?
Standard attention compares every pair of tokens, so its cost grows roughly with the square of the sequence length. Mamba processes the sequence through state updates on a fixed-size hidden state, so its sequence-length scaling is closer to linear.

What does the KV cache do during LLM generation?
It stores the keys and values of previous tokens so the model does not recompute them at every step; each new token computes only its own query, key, and value and attends over the cache. This speeds up decoding, but generation is still sequential: the model must produce token 1 before token 2.
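
The causal-masking and KV-cache answers above can be made concrete with a small sketch. It is a toy, single-head illustration in plain Python, not how any real GPT implementation is written: the function names (`attend`, `causal_attention`, `cached_decode`) and the numbers in `Q`, `K`, `V` are made up for this example, and the causal mask is expressed by slicing off future positions rather than by adding negative infinity to the attention scores, which should give the same result.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    # Scaled dot-product attention for one query over the given keys/values.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def causal_attention(queries, keys, values):
    # Full-sequence pass: position t sees only positions 0..t.
    # Slicing to t + 1 plays the role of the causal mask here.
    return [attend(q, keys[: t + 1], values[: t + 1])
            for t, q in enumerate(queries)]

def cached_decode(queries, keys, values):
    # Token-by-token pass: each step appends one key/value to the cache
    # and attends over everything cached so far. Steps run in order.
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(queries, keys, values):
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs

# Hypothetical 4-token sequence with 2-dimensional q/k/v vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[1.0, 0.5], [0.2, 1.0], [-1.0, 0.3], [0.7, 0.7]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = causal_attention(Q, K, V)
step = cached_decode(Q, K, V)

# Changing only the last token's value must not affect earlier outputs.
V_changed = V[:3] + [[100.0, -100.0]]
full_changed = causal_attention(Q, K, V_changed)
```

In this sketch `full` and `step` hold the same outputs, which is the point of the cache: it reuses earlier keys and values without changing the result. `full_changed[:3]` matches `full[:3]` because no position attends to anything after itself.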
