---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---
# Query Rewrite Intrinsic
## Model Summary
We are releasing a query rewrite intrinsic, implemented as a family of adapters that are fine-tuned specifically for the following task:
Given a multi-turn conversation between a user and an AI assistant, rewrite the last
user utterance (query) by transforming it (only if necessary) into an equivalent version that
is standalone and can be understood by itself (without the conversation).
While the intrinsic is general purpose, one of its main use cases is in RAG settings, where the ability to rewrite a user query into a standalone version directly improves retriever performance, which in turn improves answer generation performance. Outside of RAG, other conversational use cases also require rewriting a user query, for example, before accessing a database or before routing to other APIs or tools. The intrinsic does not need any RAG documents (which may be present in the context in a RAG setting); it uses only the dialog turns, i.e., what is said between the user and the assistant. We provide experimental results in a RAG setting as well as in a specialized enterprise (non-RAG) setup, showing in both cases that the intrinsic's performance is significantly higher than that obtained by prompting out-of-the-box models, including open-source models such as gpt-oss as well as frontier models such as gpt-4o.
The adapters released here work with both the IBM granite-3.3 family of models (2b and 8b) and gpt-oss-20b.
- **Developer:** IBM Research
- **Model type:** LoRA adapter for [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) and [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Intended use
The intrinsic gives the ability to rewrite the last user query in a multi-turn conversation. Typically, the rewrite is a form of expansion that inlines into the query any implicit references to entities, concepts, or even parts of the conversation that occur in the previous turns (either by the user or the AI assistant). Such expansion can include coreference resolution (i.e., replacement of pronouns with the actual entities) and handling of ellipsis, the common linguistic phenomenon where parts of a sentence or phrase are omitted by the user but can be understood from the context (e.g., for whom, of what, with respect to something discussed above, etc.).
As a result of the expansion, the query becomes standalone while remaining equivalent in meaning to what the user asked in the last turn. The rewritten query can be sent to downstream tasks (e.g., to a retriever in a RAG setting) as a better replacement for the original user query, without the need to pass along the (potentially very long) conversational context.
> [!TIP]
> Note: While you can invoke the query rewrite intrinsic directly, it is strongly recommended to call it through [granite-common](https://github.com/ibm-granite/granite-common), which wraps the model with a tailored I/O processor, enabling a friendlier development interface. The I/O processor takes care of several data transformation and validation tasks that would otherwise be required (including sending the appropriate instruction for the intrinsic) as well as validating the intrinsic's output. We next describe the input/output of the query rewrite intrinsic when invoked through granite-common.
**Intrinsic input**: The input to the query rewrite intrinsic is an OpenAI-compatible chat completion request, containing a list of conversation turns that can alternate between the `user` and `assistant` role, and ending with a `user` turn. The last `user` turn in the list is assumed to be the query that needs to be rewritten (if not already standalone).
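For illustration, an abbreviated request of this form might look as follows (the conversation is a shortened, slightly modified variant of the full, runnable example in the Quickstart section below):
```json
{
  "messages": [
    {"role": "user", "content": "I have two pets, a dog named Rex and a cat named Lucy."},
    {"role": "assistant", "content": "Great, what would you like to share about them?"},
    {"role": "user", "content": "Does he need a yearly checkup?"}
  ]
}
```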
**Intrinsic output**: The output of the query rewrite intrinsic is the result of the original chat completion request, with the message content formatted as a JSON object as follows:
```json
{
"rewritten_question": <Rewritten last user question>
}
```
Please see the code snippets in the Quickstart Example section below for examples that illustrate the intrinsic's input/output.
## Quickstart Example Using Granite-Common
To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM, follow the steps below. We recommend using Python 3.11 or higher.
1. Install the granite-common library:
```
pip install "granite-common[nltk]"
```
2. Install the Hugging Face CLI:
```
pip install -U "huggingface_hub[cli]"
```
3. Install vLLM:
```
pip install vllm
```
4. Download the intrinsics library:
```
hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
cd rag-intrinsics-lib
```
5. If needed, edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh` using your favorite editor:
Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the base model on which the desired adapter has been trained. Optionally, edit the constant `PORT` to change the port on which vLLM will run. Save the modified file and exit the editor.
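For example, after editing, the relevant constants might look like the following (illustrative values only, assuming the granite-3.3-8b-instruct base model and the default port used by the client code in step 8; check the script itself for the exact variable layout):
```
BASE_MODEL_ORG=ibm-granite                 # Hugging Face organization of the base model
BASE_MODEL_NAME=granite-3.3-8b-instruct    # base model on which the adapter was trained
PORT=55555                                 # must match the port used by the client code in step 8
```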
6. [Optional] If you are on a cluster, allocate a node with a GPU to be used as the inference node (on a CPU, things will be slower). Edit the LSF command below as needed (among other things, you will need to change `-q gpu` to reflect the actual GPU queue that you have access to). Also, before submitting the job, make sure you activate the same Python environment in which you installed granite-common and vLLM in the steps above.
```
bsub -q gpu -gpu "num=1" -hl -n 1 -W 12:0 -Is bash
hostname -f # get the node name
```
7. Start vLLM through the startup script. If you are on a cluster and have allocated a GPU node, do so from a shell on that node (e.g., the one started with the `bsub` job above). The first time you run the script, you may have to change its permissions to allow execution:
```
chmod u+x ./run_vllm.sh
./run_vllm.sh &
```
8. Run the following Python code snippet. It does not have to run on the same node as the vLLM server; in particular, it can run on a regular CPU node.
```python
import json
import openai
import granite_common

intrinsic_name = "query_rewrite"

# Change the following constant as needed to select a different base model
base_model_name = "granite-3.3-8b-instruct"

# Change the following constants as needed to reflect the location of the vLLM server.
# For example, if you are on a cluster and you have allocated the GPU node as above, use
# the node name obtained above (with `hostname -f`) instead of localhost.
# The selected port should be identical to the one you specified in the vLLM startup script.
openai_base_url = "http://localhost:55555/v1"
openai_api_key = "rag_intrinsics_1234"

# Fetch IO configuration file from Hugging Face Hub
io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
    intrinsic_name, base_model_name
)

# Instantiate input/output processors
rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

# Sample request
request_json = {
    "messages": [
        {
            "role": "assistant",
            "content": "Welcome to pet questions!"
        },
        {
            "role": "user",
            "content": "I have two pets, a dog named Rex and a cat named Lucy."
        },
        {
            "role": "assistant",
            "content": "Great, what would you like to share about them?"
        },
        {
            "role": "user",
            "content": "Rex spends a lot of time in the backyard and outdoors, and Lucy is always inside."
        },
        {
            "role": "assistant",
            "content": "Sounds good! Rex must love exploring outside, while Lucy probably enjoys her cozy indoor life."
        },
        {
            "role": "user",
            "content": "But is he more likely to get fleas because of that?"
        }
    ]
}

# Add other parameters
request_json["model"] = intrinsic_name
request_json["temperature"] = 0.0

# Apply input processor
rewritten_request = rewriter.transform(request_json)

# Run inference
client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

# Apply output processor
processed_chat_completion = result_processor.transform(
    chat_completion, rewritten_request
)

# Verify that the content of the completion is valid JSON and pretty-print the JSON.
parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
print("JSON output:")
print(json.dumps(parsed_contents, indent=2))
```
The post-processed JSON output for the above example should look similar to the following:
```json
{
"rewritten_question": "Is Rex, my outdoor dog, more likely to get fleas because of his time spent in the backyard and outdoors?"
}
```
## Training Details
The training data contains two kinds of examples: 1) standalone examples, which teach the intrinsic to refrain from rewriting user questions that are already standalone, and 2) non-standalone examples, covering a diversity of patterns, which teach the intrinsic to expand the user turn so that it becomes standalone.
### Training Data
The training data uses the publicly available Cloud corpus of technical documentation pages from [MT-RAG](https://arxiv.org/abs/2501.03468). Based on this corpus of documents, we constructed a dataset consisting of high-quality, human-created conversations, where the last turn of the conversation comes in two versions: the original user query as asked in context, and an equivalent standalone rewrite.