# Google Colab Version: [Open this notebook in Google Colab](https://colab.research.google.com/github/starfishdata/starfish/blob/main/examples/data_factory.ipynb)

#### Dependencies 

In [11]:
%pip install starfish-core

Note: you may need to restart the kernel to use updated packages.


In [1]:
## Fix for Jupyter Notebook only. Do NOT use in production.
## Enables async code execution in notebooks, but may cause problems when mixing sync and async code
## For production, please run in standard .py files without this workaround
## See: https://github.com/erdewit/nest_asyncio for more details
import nest_asyncio
nest_asyncio.apply()

from starfish import StructuredLLM, data_factory
from starfish.llm.utils import merge_structured_outputs

from starfish.common.env_loader import load_env_file ## Load environment variables from .env file
load_env_file()

In [2]:
# Set your OpenAI API key if it is not already set
# import os
# os.environ["OPENAI_API_KEY"] = "your_key_here"

# If you don't have an API key, please see the local model section

In [3]:
## Helper function: mock LLM call
# When developing data pipelines with LLMs, making thousands of real API calls
# can be expensive. Using mock LLM calls lets you test your pipeline's reliability,
# failure handling, and recovery without spending money on API calls.
from starfish.data_factory.utils.mock import mock_llm_call

#### 1. Your First Data Factory: Simple Scaling

The @data_factory decorator transforms any async function into a scalable data processing pipeline.
It handles:
- Parallel execution 
- Automatic batching
- Error handling & retries
- Progress tracking

Let's start with a single LLM call and then show how easy it is to scale it.


In [4]:
# First, create a StructuredLLM instance for generating facts about cities
json_llm = StructuredLLM(
    model_name = "openai/gpt-4o-mini",
    prompt = "Funny facts about city {{city_name}}.",
    output_schema = [{'name': 'fact', 'type': 'str'}],
    model_kwargs = {"temperature": 0.7},
)

json_llm_response = await json_llm.run(city_name='New York')
json_llm_response.data

[{'fact': 'New Yorkers consume around 1,000,000 slices of pizza every day, which means if you laid them all in a line, they would stretch from the Statue of Liberty to the Eiffel Tower... and back!'}]

In [5]:
# Now, scale to multiple cities using data_factory
# Just add the @data_factory decorator to process many cities in parallel

from datetime import datetime
@data_factory(max_concurrency=10)
async def process_json_llm(city_name: str):
    ## Adding a print statement to indicate the start of the processing
    print(f"Processing {city_name} at {datetime.now()}")
    json_llm_response = await json_llm.run(city_name=city_name)
    return json_llm_response.data

# This is all it takes to scale from one city to many cities!
process_json_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"])

2025-04-25 10:16:32 | INFO     | [JOB START] Master Job ID: 8c926411-63e7-4dc6-98c9-861c3489fb8b | Logging progress every 3 seconds
2025-04-25 10:16:32 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
Processing New York at 2025-04-25 10:16:32.524033
Processing London at 2025-04-25 10:16:32.524286
Processing Tokyo at 2025-04-25 10:16:32.524979
Processing Paris at 2025-04-25 10:16:32.525535
Processing Sydney at 2025-04-25 10:16:32.526729
2025-04-25 10:16:34 | INFO     | Sending telemetry event, TELEMETRY_ENABLED=true
2025-04-25 10:16:34 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


[{'fact': "In Tokyo, there's a train station called 'Shinjuku' that handles more passengers each day than the entire population of the United States!"},
 {'fact': "London has a 'secret' underground city known as the 'London Stone', which is said to have magical powers, making it one of the city's most famous and quirky legends!"},
 {'fact': 'In Paris, you can legally marry a dead person! This quirky law allows for posthumous marriages, as long as you can prove that the deceased had intended to marry you before their untimely demise.'},
 {'fact': 'In New York City, there are more than 25,000 licensed taxis, but only about 1,200 of them are actually yellow. The rest are a rainbow of colors, including pink, blue, and even animal print!'},
 {'fact': 'Sydney has a beach where you can surf, swim, and even watch a film – all in one day! Just don’t forget your sunscreen and popcorn!'}]

#### 2. Works with any async function

Data Factory works with any async function, not just LLM calls. You can build complex pipelines involving multiple LLMs, data processing, and more.

Here is an example that chains two structured LLMs:

In [6]:
# Example of a more complex function that chains multiple LLM calls
# Taken from the StructuredLLM examples

@data_factory(max_concurrency=10)
async def complex_process_cities(topic: str):
    ## topic → generator_llm → rater_llm → merged results
    # First LLM to generate question/answer pairs
    generator_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt="Generate question/answer pairs about {{topic}}.",
        output_schema=[
            {"name": "question", "type": "str"},
            {"name": "answer", "type": "str"}
        ],
    )

    # Second LLM to rate the generated pairs
    rater_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt='''Rate the following Q&A pairs based on accuracy and clarity (1-10).
        Pairs: {{generated_pairs}}''',
        output_schema=[
            {"name": "accuracy_rating", "type": "int"},
            {"name": "clarity_rating", "type": "int"}
        ],
        model_kwargs={"temperature": 0.5}
    )

    generation_response = await generator_llm.run(topic=topic, num_records=5)
    rating_response = await rater_llm.run(generated_pairs=generation_response.data)
    
    # Merge the results
    return merge_structured_outputs(generation_response.data, rating_response.data)


### To save tokens, we use only 3 topics in this example
complex_process_cities_data = complex_process_cities.run(topic=['Science', 'History', 'Technology'])

2025-04-25 10:16:34 | INFO     | [JOB START] Master Job ID: 466fca03-85a2-46de-b135-629cd76738f7 | Logging progress every 3 seconds
2025-04-25 10:16:34 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:37 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:40 | INFO     | [JOB PROGRESS] Completed: 0/3 | Running: 3 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:43 | INFO     | [JOB PROGRESS] Completed: 2/3 | Running: 1...

In [7]:
### Each topic has 5 question/answer pairs, so 3 topics yield 15 pairs!
print(len(complex_process_cities_data))
print(complex_process_cities_data)

15
[{'question': 'What is the primary function of a CPU in a computer?', 'answer': 'The CPU, or Central Processing Unit, is responsible for executing instructions and processing data in a computer system.', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What does IoT stand for and what is its significance?', 'answer': 'IoT stands for Internet of Things, which refers to the interconnection of everyday devices to the internet, allowing them to send and receive data, thereby enhancing efficiency and convenience.', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the difference between RAM and ROM?', 'answer': 'RAM (Random Access Memory) is volatile memory that temporarily stores data and applications currently in use, while ROM (Read-Only Memory) is non-volatile memory that permanently stores firmware and system software.', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is cloud computing?', 'answer': 'Cloud computing is the delivery of co

#### 3. Working with Different Input Formats


Data Factory is flexible with how you provide inputs. Let's demonstrate different ways to pass parameters to data_factory functions.

`data` is a reserved keyword that expects a list or tuple of dicts. This design makes it easy to pass large datasets and to work with Hugging Face datasets and pandas DataFrames.
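The input styles shown in this section (several equal-length lists, a list plus a single value, and a `data` list of dicts) all describe the same thing: one dict of arguments per task. The sketch below illustrates that equivalence in plain Python. `normalize_inputs` is a hypothetical helper written for this explanation, not a starfish function, and the real library may validate or expand inputs differently.

```python
def normalize_inputs(data=None, **kwargs):
    """Hypothetical helper: expand the three input styles into one list of task dicts."""
    if data is not None:
        return [dict(record) for record in data]  # already one dict per task
    lengths = {len(v) for v in kwargs.values() if isinstance(v, (list, tuple))}
    if len(lengths) > 1:
        raise ValueError("all list inputs must have the same length")
    n = lengths.pop() if lengths else 1
    return [
        # list values are indexed per task; single values are repeated for every task
        {k: (v[i] if isinstance(v, (list, tuple)) else v) for k, v in kwargs.items()}
        for i in range(n)
    ]


sample_cities = ["New York", "London", "Tokyo"]
zipped = normalize_inputs(city_name=sample_cities, num_records_per_city=[2, 1, 1])
broadcast = normalize_inputs(city_name=sample_cities, num_records_per_city=1)
records = normalize_inputs(data=[{"city_name": "New York", "num_records_per_city": 2}])
print(zipped)
print(broadcast)
print(records)
```

The list-of-dicts shape is also what `DataFrame.to_dict(orient="records")` produces in pandas, which is one way tabular data can be brought into that form.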

In [8]:
## We will use the mock LLM call for the rest of the examples to save tokens
## The mock LLM call simulates an LLM API call with random delays (controlled by sleep_time) and occasional failures (controlled by fail_rate)
await mock_llm_call(city_name="New York", num_records_per_city=3)

[{'answer': 'New York_5'}, {'answer': 'New York_2'}, {'answer': 'New York_3'}]

In [9]:
@data_factory(max_concurrency=100)
async def input_format_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.01)

In [10]:
# Format 1: Multiple lists that get zipped together
input_format_data1 = input_format_mock_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"], num_records_per_city=[2, 1, 1, 1, 1])

2025-04-25 10:16:49 | INFO     | [JOB START] Master Job ID: 05c84608-fec3-4010-8876-e59eed12bb6a | Logging progress every 3 seconds
2025-04-25 10:16:49 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:50 | INFO     | Sending telemetry event, TELEMETRY_ENABLED=true
2025-04-25 10:16:50 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


In [11]:
# Format 2: List + single value (the single value is broadcast to every item)
input_format_data2 = input_format_mock_llm.run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"], num_records_per_city=1)

2025-04-25 10:16:50 | INFO     | [JOB START] Master Job ID: fedb98e5-c408-4bc8-9479-6087f4a298b7 | Logging progress every 3 seconds
2025-04-25 10:16:50 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:51 | INFO     | Sending telemetry event, TELEMETRY_ENABLED=true
2025-04-25 10:16:51 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


In [12]:
# Format 3: Special 'data' parameter
# 'data' is a reserved keyword expecting list(dict) or tuple(dict)
# Makes integration with various data sources easier
input_format_data3 = input_format_mock_llm.run(data=[{"city_name": "New York", "num_records_per_city": 2}, {"city_name": "London", "num_records_per_city": 1}, {"city_name": "Tokyo", "num_records_per_city": 1}, {"city_name": "Paris", "num_records_per_city": 1}, {"city_name": "Sydney", "num_records_per_city": 1}])

2025-04-25 10:16:51 | INFO     | [JOB START] Master Job ID: 2f5cb7cc-83c9-4b7e-9ebb-386cd66bdd42 | Logging progress every 3 seconds
2025-04-25 10:16:51 | INFO     | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:52 | INFO     | Sending telemetry event, TELEMETRY_ENABLED=true
2025-04-25 10:16:52 | INFO     | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0)


#### 4. Resilient error retry
Data Factory automatically handles errors and retries, making your pipelines robust.
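As a rough picture of what "handles errors and retries" means, the sketch below retries a flaky coroutine a few times and drops tasks that never succeed, so one bad task does not stop the rest. Everything here is an assumption made for illustration: `flaky_call` and `run_with_retries` are hypothetical names, and the limit of 5 attempts is arbitrary rather than a documented starfish default.

```python
import asyncio
import random


async def flaky_call(city_name: str, fail_rate: float, rng: random.Random):
    """Hypothetical stand-in for a mock LLM call that sometimes raises."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if rng.random() < fail_rate:
        raise RuntimeError(f"failed to process city: {city_name}")
    return [{"answer": city_name}]


async def run_with_retries(city_name: str, rng: random.Random, max_attempts: int = 5):
    """Retry one task up to max_attempts times; return None if it never succeeds."""
    for _ in range(max_attempts):
        try:
            return await flaky_call(city_name, fail_rate=0.3, rng=rng)
        except RuntimeError:
            continue  # swallow the error and try again
    return None  # give up on this task, but let the others continue


async def main():
    rng = random.Random(0)  # seeded so repeated runs behave the same way
    sample_cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 5
    results = await asyncio.gather(*(run_with_retries(c, rng) for c in sample_cities))
    return [r for r in results if r is not None]  # keep successful tasks only


completed = asyncio.run(main())
print(f"{len(completed)} of 25 tasks succeeded")
```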

Let's demonstrate with a high failure rate example.

In [13]:
@data_factory(max_concurrency=100)
async def high_error_rate_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3) # Hardcode to 30% chance of failure

# Process all cities - some will fail, but data_factory keeps going
cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 5  # 25 cities
high_error_rate_mock_llm_data = high_error_rate_mock_llm.run(city_name=cities, num_records_per_city=1)

print(f"\nSuccessfully completed {len(high_error_rate_mock_llm_data)} out of {len(cities)} tasks")
print("Data Factory automatically handled the failures and continued processing")
print("The results only include successful tasks")

2025-04-25 10:16:56 | INFO     | [JOB START] Master Job ID: 38c50ab6-f24b-4cba-a2c5-070130ab420e | Logging progress every 3 seconds
2025-04-25 10:16:56 | INFO     | [JOB PROGRESS] Completed: 0/25 | Running: 25 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:59 | INFO     | [JOB PROGRESS] Completed: 16/25 | Running: 9 | Attempted: 16    (Completed: 16, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:16:59 | ERROR    | Error running task: Mock LLM failed to process city: Tokyo
2025-04-25 10:16:59 | ERROR    | Error running task: Mock LLM failed to process city: London
2025-04-25 10:16:59 | ERROR    | Error running task: Mo...

#### 5. Resume

This is essential for long-running jobs with thousands of tasks.

If a job is interrupted, you can pick up where you left off using one of two resume methods:


1. **Same Session Resume**: If you're still in the same session where the job was interrupted, simply call `.resume()` on the same instance.

2. **Cross-Session Resume**: If you've closed your notebook or lost your session, you can resume using the job ID:
   ```python
   from starfish import DataFactory
   # Resume using the master job ID from a previous run
   # (named resumed_factory so it does not shadow the data_factory decorator)
   resumed_factory = DataFactory.resume_from_checkpoint(job_id="your_job_id")
   ```

The key difference:
- `resume()` uses the same DataFactory instance you defined
- `resume_from_checkpoint()` reconstructs your DataFactory from persistent storage where tasks and progress are saved
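
The idea behind resuming from persistent storage can be sketched in a few lines: record each finished task somewhere durable, and on the next run skip whatever is already recorded. The example below is only an illustration under stated assumptions. `run_job`, the JSON checkpoint file, and the `stop_after` parameter (used to fake an interruption) are all hypothetical and say nothing about how starfish actually stores its state.

```python
import json
import tempfile
from pathlib import Path


def run_job(tasks, checkpoint: Path, stop_after=None):
    """Process tasks in order, recording finished results in a JSON checkpoint file."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    processed_now = 0
    for index, city in enumerate(tasks):
        key = str(index)  # JSON object keys are strings
        if key in done:
            continue  # finished in an earlier run, so skip it
        if stop_after is not None and processed_now >= stop_after:
            break  # simulate an interruption
        done[key] = {"answer": city}
        checkpoint.write_text(json.dumps(done))  # persist progress after every task
        processed_now += 1
    return done


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "job.json"
    tasks = ["New York", "London", "Tokyo", "Paris", "Sydney"]
    first = run_job(tasks, checkpoint, stop_after=2)  # interrupted after 2 tasks
    second = run_job(tasks, checkpoint)               # picks up the remaining 3
    print(len(first), len(second))  # prints "2 5"
```

Because the second call reads the checkpoint from disk rather than from memory, it would work equally well from a fresh process, which is the property the cross-session method relies on.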

> **Note**: Google Colab users may experience issues with `resume_from_checkpoint()` due to how Colab works

We're simulating an interruption here. In a real scenario, this might happen if your notebook errors out, is manually interrupted with a keyboard command, encounters API rate limits, or experiences any other issues that halt execution.

In [14]:
@data_factory(max_concurrency=10)
async def re_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 20  # 100 cities
re_run_mock_llm_data_1 = re_run_mock_llm.run(city_name=cities, num_records_per_city=1)

2025-04-25 10:17:12 | INFO     | [JOB START] Master Job ID: b2a400b3-32e7-45ee-b8e8-c2bc7afe9f11 | Logging progress every 3 seconds
2025-04-25 10:17:12 | INFO     | [JOB PROGRESS] Completed: 0/100 | Running: 10 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:17:15 | INFO     | [JOB PROGRESS] Completed: 17/100 | Running: 10 | Attempted: 17    (Completed: 17, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:17:15 | INFO     | Sending telemetry event, TELEMETRY_ENABLED=true
2025-04-25 10:17:15 | ERROR    | Error occurred: KeyboardInterrupt
2025-04-25 10:17:15 | INFO     | [RESUME INFO] 🚨 Job stopped unexpectedly. You can resume the job by calli...

In [15]:
print("When a job is interrupted, you'll see a message like:")
print("[RESUME INFO] 🚨 Job stopped unexpectedly. You can resume the job by calling .resume()")

print("\nTo resume an interrupted job, simply call:")
print("re_run_mock_llm.resume()")
print('')
print(f"For this example, {len(re_run_mock_llm_data_1)}/{len(cities)} records were generated before the job stopped!")

When a job is interrupted, you'll see a message like:
[RESUME INFO] 🚨 Job stopped unexpectedly. You can resume the job by calling .resume()

To resume an interrupted job, simply call:
re_run_mock_llm.resume()

For this example, 20/100 records were generated before the job stopped!


In [17]:
## Let's continue the rest of the run by calling resume() on the same instance
re_run_mock_llm_data_2 = re_run_mock_llm.resume()

2025-04-25 10:18:00 | INFO     | [JOB RESUME START] PICKING UP FROM WHERE THE JOB WAS LEFT OFF...
2025-04-25 10:18:00 | INFO     | [RESUME PROGRESS] STATUS AT THE TIME OF RESUME: Completed: 20 / 100 | Failed: 0 | Duplicate: 0 | Filtered: 0
2025-04-25 10:18:00 | INFO     | [JOB PROGRESS] Completed: 20/100 | Running: 10 | Attempted: 20    (Completed: 20, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:18:03 | INFO     | [JOB PROGRESS] Completed: 32/100 | Running: 10 | Attempted: 32    (Completed: 32, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:18:03 | ERROR    | Error running task: Mock LLM failed to process city: Paris
2025-04-25 10:18:03 | ...

In [18]:
print(f"Now we are still able to finish what was left! {len(re_run_mock_llm_data_2)} records generated!")

Now we are still able to finish what was left! 100 records generated!


#### 6. Dry run
Before running a large job, you can do a "dry run" to test your pipeline. This only processes a single item and doesn't save state to the database.
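
Conceptually, a dry run is a cheap smoke test: call the function once with a single set of arguments and inspect the result before committing to the full job. The sketch below shows that idea in plain `asyncio`. `fake_call` and this `dry_run` helper are hypothetical, and choosing the first element of each list is an assumption of the sketch; the text above only says that a single item is processed.

```python
import asyncio


async def fake_call(city_name: str, num_records: int):
    """Hypothetical stand-in for the real task function."""
    await asyncio.sleep(0)
    return [{"answer": f"{city_name}_{i}"} for i in range(num_records)]


async def dry_run(func, **list_kwargs):
    """Run func once, taking element 0 of each list argument and passing scalars as-is."""
    first = {k: (v[0] if isinstance(v, list) else v) for k, v in list_kwargs.items()}
    return await func(**first)


sample = asyncio.run(dry_run(fake_call, city_name=["New York", "London"] * 20, num_records=1))
print(sample)  # prints [{'answer': 'New York_0'}]
```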

In [19]:
@data_factory(max_concurrency=10)
async def dry_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

dry_run_mock_llm_data = dry_run_mock_llm.dry_run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"]*20, num_records_per_city=1)

2025-04-25 10:18:14 | INFO     | [JOB START] Master Job ID: None | Logging progress every 3 seconds
2025-04-25 10:18:14 | INFO     | [JOB PROGRESS] Completed: 0/1 | Running: 1 | Attempted: 0    (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0)
2025-04-25 10:18:15 | INFO     | Sending telemetry event, TELEMETRY_ENABLED=true
2025-04-25 10:18:15 | INFO     | [JOB FINISHED] Final Status: Completed: 1/0 | Attempted: 1 (Failed: 0, Filtered: 0, Duplicate: 0)


#### 7. Advanced Usage
Data Factory offers more advanced capabilities for complete pipeline customization, including hooks that execute at key stages and shareable state to coordinate between tasks. These powerful features enable complex workflows and fine-grained control. Our dedicated examples for advanced data_factory usage will be coming soon!