Hi everyone,
I’m working on a small experimental project related to automotive software testing.
My goal is to combine these two sources (requirements + test cases) into a structured dataset that I can use for either:
Fine-tuning an LLM (e.g. LLaMA) to generate test cases based on requirements, or
Building a RAG (Retrieval-Augmented Generation) pipeline where the model retrieves relevant requirements before generating the test.
I’m not sure what the best approach is for the data structure:
Should I join all related requirements into the prompt and use the test case as the completion?
Or should I keep them separate and build a retrieval index for RAG?
What would be a good JSON or text format for this small experiment to test fine-tuning or retrieval quality?
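For context, here is roughly the shape I had in mind for the fine-tuning variant, written as one JSONL record per test case (the requirement texts and test steps below are invented placeholders, not our real data):

```python
import json

# One JSONL record per test case; all related requirements are joined into the prompt.
# REQ IDs, texts and steps are made-up examples, only the structure matters here.
record = {
    "prompt": (
        "Requirements:\n"
        "REQ-101: The wiper shall start within 500 ms after rain is detected.\n"
        "REQ-102: The wiper speed shall follow the configured interval.\n\n"
        "Write a test case that verifies these requirements."
    ),
    "completion": (
        "Step 1: Simulate the rain-sensor signal.\n"
        "Step 2: Measure the time until wiper activation.\n"
        "Step 3: Check that the delay is below 500 ms."
    ),
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For the RAG variant I would instead keep each requirement as its own small document (with its ID as metadata) and only treat the test case as the generation target.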
I’m doing this mainly to validate if the concept works before scaling up.
Any advice, example dataset structures, or references to similar projects would be very helpful!
I forgot to mention that the requirements and test cases I work with are internal to the company and are not searchable on the internet, and I do not want to send this data to models outside the company. My goal is to run a local model inside our company that will be able to write tests from requirements.
For embedding models, many people use TEI; for LLMs, TGI, vLLM, Ollama, etc. are popular. All are fast.
You could simply use Python libraries.
However, setting up a local server makes it easier to replace models later if you need more powerful ones, and the framework handles load balancing. For larger scales, a local server is likely more convenient.
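If you start with the plain-Python route, a minimal retrieval sketch could look like this (assuming sentence-transformers is installed; the model name is just one example, and the requirement texts are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Any local sentence-transformers embedding model works; this one is just an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

requirements = [
    "REQ-101: The wiper shall start within 500 ms after rain is detected.",
    "REQ-102: The wiper speed shall follow the configured interval.",
]
req_embeddings = model.encode(requirements, convert_to_tensor=True)

query = "wiper activation delay after rain detection"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find the requirements most similar to the query.
hits = util.semantic_search(query_embedding, req_embeddings, top_k=2)[0]
for hit in hits:
    print(requirements[hit["corpus_id"]], round(hit["score"], 3))
```

The retrieved requirements then go into the prompt of whatever local LLM you serve (vLLM, Ollama, TGI, ...).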
I really appreciate the responses on this forum. I like it.
We already use a web platform that enables traceability from requirements to tests.
On this platform we can write test cases and other important notes, link them to each other, and I can use its web API to fetch test cases and requirements, but the requirements have a terrible structure and are sometimes difficult to understand even for humans.
What do you think: should I come up with a new structure for the requirements, or could we write a new requirement that describes several existing ones and use that new requirement as the prompt for the model, either for fine-tuning or RAG?
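To be concrete, what I mean by a "new requirement that describes several requirements" would be something like this (IDs and text are invented, only the structure is the point):

```python
# A "consolidated" requirement that wraps several raw ones, keeping the original
# IDs so traceability back to the web platform is preserved.
consolidated = {
    "id": "CREQ-001",
    "covers": ["REQ-101", "REQ-102"],
    "text": (
        "When rain is detected, the wiper shall start within 500 ms "
        "and then run at the configured interval speed."
    ),
}
```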
The tests we use to verify the software were invented by us, and we have created Python libraries for them in the background; in other words, these tests are not available anywhere on the internet and must have an exact, specific form. A tester writes, for example, a string in a step, and we convert that string into an executable test file via Python, so the model must strictly follow the rules when generating. How would you solve this problem?
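To illustrate what I mean by a "strict form", here is a rough, made-up example of the kind of step string and the check involved (our real syntax and Python library are internal, so the pattern below is purely illustrative):

```python
import re

# Made-up step syntax, e.g. "SET signal=RainSensor value=1"; the real one is internal.
STEP_PATTERN = re.compile(
    r"^(SET|CHECK|WAIT)\s+signal=\w+\s+(value=\S+|timeout=\d+ms)$"
)

def invalid_steps(steps: list[str]) -> list[str]:
    """Return the generated steps that do NOT match the required form."""
    return [s for s in steps if not STEP_PATTERN.match(s)]

generated = [
    "SET signal=RainSensor value=1",
    "WAIT signal=WiperMotor timeout=500ms",
    "please check the wiper",            # free text like this must be rejected
]
print(invalid_steps(generated))          # -> ['please check the wiper']
```

Anything the model produces has to pass a check like this before it can be turned into an executable test file.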