# Repository Guidelines

This repository contains the LLM-based Cancer Risk Assessment Assistant.

## Core Technologies

- **FastAPI** for the web framework
- **LangChain** for LLM orchestration
- **uv** for environment and dependency management
- **Hydra** for configuration management

## Development Setup

### Environment Setup

- Create the virtual environment (at `.venv`) with `uv sync`.
- Because the repository uses uv, run all commands through it, e.g., `uv run python ...`, NOT `python ...`.

### Running Commands

- **Streamlit Interface**: `uv run streamlit run apps/streamlit_ui/main.py`
- **CLI Demo**: `uv run python apps/cli/main.py`
- **Tests**: `uv run pytest`

## Coding Standards

### Coding Philosophy

- Write simple, explicit, modular code
- Prioritize clarity over cleverness
- Prefer small pure functions over large ones
- Return early instead of nesting deeply
- Favor functions over classes unless essential
- Favor simple replication over heavy abstraction
- Keep comments short and only where code isn't self-explanatory
- Avoid premature optimization or over-engineering

### Variable Naming

- **Avoid single-letter variable names** (x, y, i, j, e, t, f, m, c) in favor of descriptive names.
- **Avoid abbreviations** (fh, ct, w, h) in favor of full descriptive names.
- Use context-specific names for loop indices based on what you're iterating over:
  - `item_index` for general enumeration
  - `line_index` for text line iteration
  - `column_index` for table/array column iteration
  - `row_index` for table/array row iteration
- Use descriptive names for comprehensions and iterations:
  - `item` instead of `i` for general items
  - `element` instead of `e` for list elements
  - `key` instead of `k` for dictionary keys
  - `value` instead of `v` for dictionary values
- Use descriptive names for coordinates and positions:
  - `x_position`, `y_position` instead of `x`, `y`
  - `width`, `height` instead of `w`, `h`
- Use descriptive names for data structures:
  - `file_path` instead of `f` for file paths
  - `model` instead of `m` for model instances
  - `user` instead of `u` for user objects

**Examples from recent refactoring:**

- `for i, ref in enumerate(references)` → `for ref_index, ref in enumerate(references)`
- `for e in examples` → `for example in examples`
- `for m in models` → `for model in models`
- `x = pdf.get_x()` → `x_position = pdf.get_x()`
- `fh = family_history` → use `family_history` directly (avoid abbreviated aliases)
- `ct for ct in cancer_types` → `cancer_type for cancer_type in cancer_types`
- `f in MODELS_DIR.glob` → `file_path in MODELS_DIR.glob`
- `t in field_type.__args__` → `type_arg in field_type.__args__`

### Path Handling

- **Always use `pathlib.Path`** for all file I/O, joining, and globbing
- Accept `Path | str` at function boundaries; normalize to `Path` internally
- **Never use `os.path`** for path operations

Example:

```python
from pathlib import Path


def read_text(file: Path | str) -> str:
    path = Path(file)
    return path.read_text(encoding="utf-8")
```

### Type Hints and Modern Python

- **Use modern type hints**: `list`, `dict`, `tuple`, `set` (not `List`, `Dict`, etc.)
- **Use PEP 604 unions**: `A | B` (not `Union[A, B]` or `Optional[A]`)
- Import from `typing` only when necessary (`TypedDict`, `Literal`, `Annotated`, etc.)
- **Never use** `from __future__ import annotations`
- Add type hints to all public functions and methods
- Prefer precise types (`float`, `Path`, etc.) over generic ones
- If `Any` is required, isolate and document why

### Import Management

- **Place all imports at the top of the file**, never inside functions or classes
- Group imports in three sections with blank lines between:
  1. Standard library imports
  2. Third-party library imports
  3. Local/project imports
- This avoids repeated import lookups at call time and keeps a module's dependencies visible at a glance

### Error Handling and Logging

- **Use `try/except` only for I/O or external APIs**
- Catch specific exceptions only (never broad `except:`)
- Raise clear, actionable error messages
- **Use `loguru`** for logging, never `print()` in production code

Example:

```python
from pathlib import Path

from loguru import logger


def load_config_text(file_path: Path | str) -> str:
    # Only the file read is wrapped, and only the specific exception is caught.
    try:
        return Path(file_path).read_text(encoding="utf-8")
    except FileNotFoundError as error:
        logger.error(f"Configuration file not found: {file_path}")
        raise ValueError(f"Missing required config: {file_path}") from error
```

### Docstring Standards

- **Use Google-style docstrings** for all public functions and classes
- Do NOT include type hints in docstrings (they're in the signature)
- Describe behavior, invariants, side effects, and edge cases
- Include examples for complex functions
- Avoid verbose docstrings for simple, self-explanatory functions

## Testing

### Testing Philosophy

- Write meaningful tests that verify core functionality and prevent regressions
- Use `pytest` as the testing framework
- Tests go under `tests/` mirroring the source layout
- Test both valid and invalid input scenarios

### Test Types

- **Unit tests**: Small, deterministic, one concept per test
- **Integration tests**: Real workflows or reference comparisons with external systems
- Use `pytest.mark` to tag slow or manual tests

### Test Coverage Requirements

- Ensure comprehensive test coverage for all risk models
- **Ground Truth Validation**: Test against known
reference values
- **Input Validation**: Test that invalid inputs raise `ValueError`
- **Edge Cases**: Test boundary conditions
- **Inapplicable Cases**: Test when models should return "N/A"

### Running Tests

```bash
uv run pytest                          # Run all tests
uv run pytest -q                       # Quiet mode
uv run pytest -v                       # Verbose mode
uv run pytest tests/test_risk_models/  # Specific directory
```

### Pre-Submission Checklist

Before committing code, verify:

1. ✅ Run `uv run pytest -q` (all tests pass)
2. ✅ Run `pre-commit run --all-files` (all hooks pass)
3. ✅ No `print()` statements in production code
4. ✅ No broad `except:` blocks
5. ✅ All type hints present on public functions
6. ✅ File paths use `pathlib.Path`
7. ✅ Logging uses `loguru`

## Risk Models

### Implemented Models

The assistant currently includes the following built-in risk calculators:

- **Gail** - Breast cancer risk
- **Claus** - Breast cancer risk based on family history
- **Tyrer-Cuzick** - Breast cancer risk (IBIS model)
- **BOADICEA** - Breast and ovarian cancer risk (via CanRisk API)
- **PLCOm2012** - Lung cancer risk
- **LLPi** - Liverpool Lung Project improved model for lung cancer risk (8.7-year prediction)
- **CRC-PRO** - Colorectal cancer risk
- **PCPT** - Prostate cancer risk
- **Extended PBCG** - Prostate cancer risk (extended model)
- **Prostate Mortality** - Prostate cancer-specific mortality prediction
- **MRAT** - Melanoma risk (5-year prediction)
- **aMAP** - Hepatocellular carcinoma (liver cancer) risk
- **QCancer** - Multi-site cancer differential

Additional models should follow the interfaces under `src/sentinel/risk_models`.
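
As a rough illustration of how the four coverage categories above apply to a model that follows a `compute_score`-style interface, the sketch below uses hypothetical stand-ins only: `ToyUser`, `ToyRiskModel`, and `ToyAgeModel` are not part of the repository, the real `RiskModel` base class may differ, and the risk formula is made up and has no clinical meaning.

```python
import math
import unittest
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class Sex(str, Enum):
    # Stand-in for the centralized enum; values are assumptions.
    FEMALE = "female"
    MALE = "male"


@dataclass
class ToyUser:
    # Hypothetical, flattened stand-in for the real user profile.
    age_years: int
    sex: Sex


class ToyRiskModel(ABC):
    # Hypothetical stand-in for the project's base class.
    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def compute_score(self, user: ToyUser) -> str:
        """Return a risk percentage string or an N/A message."""


class ToyAgeModel(ToyRiskModel):
    def __init__(self) -> None:
        super().__init__("toy_age")

    def compute_score(self, user: ToyUser) -> str:
        # Invalid inputs raise ValueError before any scoring happens.
        if not 18 <= user.age_years <= 100:
            raise ValueError(f"Invalid inputs for {self.name}: age_years must be 18-100")
        # Inapplicable profiles return an N/A message instead of raising.
        if user.sex != Sex.FEMALE:
            return "N/A: Model is only applicable to female patients."
        # Made-up linear formula, used only to produce a deterministic number.
        risk_percent = 0.05 * (user.age_years - 18)
        return f"{risk_percent:.2f}"


def run_checks() -> None:
    model = ToyAgeModel()
    checker = unittest.TestCase()

    # Ground truth validation: compare against a known reference value.
    reference_score = float(model.compute_score(ToyUser(age_years=40, sex=Sex.FEMALE)))
    assert math.isclose(reference_score, 1.10, abs_tol=0.01)

    # Input validation: out-of-range age must raise ValueError.
    with checker.assertRaises(ValueError):
        model.compute_score(ToyUser(age_years=17, sex=Sex.FEMALE))

    # Edge cases: both boundaries of the accepted age range are valid.
    assert model.compute_score(ToyUser(age_years=18, sex=Sex.FEMALE)) == "0.00"
    assert model.compute_score(ToyUser(age_years=100, sex=Sex.FEMALE)) == "4.10"

    # Inapplicable cases: the model answers with an N/A message.
    assert model.compute_score(ToyUser(age_years=40, sex=Sex.MALE)).startswith("N/A")


run_checks()
```

In the repository these checks would be written as `pytest` tests under `tests/`; `unittest.TestCase().assertRaises` is used here only so the sketch runs with the standard library alone.
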
### Risk Model Implementation Guide

#### Base Architecture

All risk models must inherit from `RiskModel` in `src/sentinel/risk_models/base.py`:

```python
from sentinel.risk_models.base import RiskModel


class YourRiskModel(RiskModel):
    def __init__(self) -> None:
        super().__init__("your_model_name")
```

#### Required Methods

Every risk model must implement these abstract methods:

```python
def compute_score(self, user: UserInput) -> str:
    """Compute the risk score for a given user profile.

    Args:
        user: The user profile containing demographics, medical history, etc.

    Returns:
        Risk percentage as a string, or an N/A message if inapplicable.

    Raises:
        ValueError: If required inputs are missing or invalid.
    """

def cancer_type(self) -> str:
    """Return the cancer type this model assesses."""
    return "breast"  # or "lung", "prostate", etc.

def description(self) -> str:
    """Return a detailed description of the model."""

def interpretation(self) -> str:
    """Return guidance on how to interpret the results."""

def references(self) -> list[str]:
    """Return list of reference citations."""
```

#### UserInput Structure

**All risk models must use the centralized `UserInput` structure** - this is the single source of truth for all data types and enums.

The `UserInput` class follows a hierarchical structure:

```
UserInput
├── demographics: Demographics
│   ├── age_years: int
│   ├── sex: Sex (enum)
│   ├── ethnicity: Ethnicity | None
│   └── anthropometrics: Anthropometrics
│       ├── height_cm: float | None
│       └── weight_kg: float | None
├── lifestyle: Lifestyle
│   ├── smoking: SmokingHistory
│   └── alcohol: AlcoholConsumption
├── personal_medical_history: PersonalMedicalHistory
│   ├── chronic_conditions: list[ChronicCondition]
│   ├── previous_cancers: list[CancerType]
│   ├── genetic_mutations: list[GeneticMutation]
│   └── tyrer_cuzick_polygenic_risk_score: float | None
├── female_specific: FemaleSpecific | None
│   ├── menstrual: MenstrualHistory
│   ├── parity: ParityHistory
│   └── breast_health: BreastHealthHistory
├── symptoms: list[SymptomEntry]
└── family_history: list[FamilyMemberCancer]
```

#### REQUIRED_INPUTS Specification

Every risk model must define a `REQUIRED_INPUTS` class attribute that maps dotted `UserInput` paths to a `(type, required)` pair, using `Annotated` types with Pydantic `Field` constraints where a value range applies:

```python
from typing import Annotated

from pydantic import Field

from sentinel.user_input import Ethnicity, Sex

REQUIRED_INPUTS: dict[str, tuple[type, bool]] = {
    "demographics.age_years": (Annotated[int, Field(ge=18, le=100)], True),
    "demographics.sex": (Sex, True),
    "demographics.ethnicity": (Ethnicity | None, False),
    "family_history": (list, False),  # list[FamilyMemberCancer]
    "symptoms": (list, False),  # list[SymptomEntry]
}
```

#### Input Validation

Every `compute_score` method must start with input validation:

```python
def compute_score(self, user: UserInput) -> str:
    """Compute the risk score for a given user profile."""
    # Validate inputs first
    is_valid, errors = self.validate_inputs(user)
    if not is_valid:
        raise ValueError(f"Invalid inputs for {self.name}: {'; '.join(errors)}")

    # Model-specific validation
    if user.demographics.sex != Sex.FEMALE:
        return "N/A: Model is only applicable to female patients."

    # Continue with model-specific logic...
```

#### Data Access Patterns

```python
# Demographics
age = user.demographics.age_years
sex = user.demographics.sex
ethnicity = user.demographics.ethnicity

# Female-specific data
if user.female_specific is not None:
    menarche_age = user.female_specific.menstrual.age_at_menarche
    num_births = user.female_specific.parity.num_live_births

# Family history
for member in user.family_history:
    if member.cancer_type == CancerType.BREAST:
        relation = member.relation
        age_at_diagnosis = member.age_at_diagnosis
```

#### Enum Usage

**Always use enums from `sentinel.user_input`, never string literals or custom enums:**

```python
# ✅ Correct - using UserInput enums
if user.demographics.sex == Sex.FEMALE: ...
if member.cancer_type == CancerType.BREAST: ...
if member.relation == FamilyRelation.MOTHER: ...

# ❌ Incorrect - string literals
if user.demographics.sex == "female": ...
if member.cancer_type == "breast": ...

# ❌ Incorrect - custom enums
if user.demographics.sex == MyCustomSex.FEMALE: ...
```

**Important**: All risk models must use the same centralized enums from `UserInput`. If a required enum doesn't exist in `UserInput`, you must:

1. **Extend UserInput** by adding the new enum to `src/sentinel/user_input.py`
2. **Never create model-specific enums** - this prevents divergence between models
3. **Update all models** to use the new centralized enum

This ensures all risk models share the same data structure and prevents fragmentation.

#### Extending UserInput

When a risk model needs fields or enums that don't exist in `UserInput`:

1. **Add to UserInput**: Extend `src/sentinel/user_input.py` with new fields/enums
2. **Update all models**: Ensure all existing models can handle the new fields (use `| None` for optional fields)
3. **Never create model-specific structures**: This prevents divergence and fragmentation
4. **Test thoroughly**: Add tests for new fields in `tests/test_user_input.py`

Example of extending UserInput:

```python
# In src/sentinel/user_input.py
class ChronicCondition(str, Enum):
    # ... existing values
    NEW_CONDITION = "new_condition"  # Add new enum value


class PersonalMedicalHistory(StrictBaseModel):
    # ... existing fields
    new_field: float | None = Field(None, description="New field description")
```

#### Testing Requirements

Create comprehensive test files with:

- **Ground Truth Validation**: Test against known reference values
- **Input Validation**: Test that invalid inputs raise `ValueError`
- **Edge Cases**: Test boundary conditions and edge cases
- **Inapplicable Cases**: Test cases where the model should return "N/A"

Example test structure:

```python
import pytest

from sentinel.risk_models import YourRiskModel
from sentinel.user_input import Demographics, Sex, UserInput

GROUND_TRUTH_CASES = [
    {
        "name": "test_case_name",
        "input": UserInput(
            demographics=Demographics(
                age_years=40,
                sex=Sex.FEMALE,
                # ... other fields
            ),
            # ... rest of input
        ),
        "expected": 1.5,  # Expected risk percentage
    },
    # ... more test cases
]


class TestYourRiskModel:
    def setup_method(self) -> None:
        # Create a fresh model instance for every test.
        self.model = YourRiskModel()

    @pytest.mark.parametrize("case", GROUND_TRUTH_CASES, ids=lambda case: case["name"])
    def test_ground_truth_validation(self, case):
        """Test against ground truth results."""
        user_input = case["input"]
        expected_risk = case["expected"]
        actual_risk_str = self.model.compute_score(user_input)
        actual_risk = float(actual_risk_str)
        assert actual_risk == pytest.approx(expected_risk, abs=0.01)
```

#### Migration Checklist

When adapting an existing risk model to the new structure:

- [ ] Update imports to use new `user_input` module
- [ ] Add `REQUIRED_INPUTS` with Pydantic validation
- [ ] Refactor `compute_score` to use new `UserInput` structure
- [ ] Replace string literals with enums
- [ ] Update parameter extraction logic
- [ ] Add input validation at start of `compute_score`
- [ ] Update all test cases to use new `UserInput` structure
- [ ] Run full test suite to ensure 100% pass rate
- [ ] Run pre-commit hooks to ensure code quality

## LLM and Code Assistant Guidelines

When generating or modifying code, AI assistants MUST:

### Mandatory Rules

- Follow ALL guidelines in this document without exception
- Never use forbidden constructs (`os.path`, `Optional[]`, `List[]`, `print()`, broad `except:`)
- Never add decorative comment banners or unnecessary formatting
- Always generate clean, modular, statically typed code

### Code Generation Standards

- Prefer clarity and simplicity over cleverness
- Use modern Python type hints exclusively
- Include comprehensive docstrings for non-trivial functions
- Ensure all examples compile, type-check, and pass linting

### Verification

All generated code must:

- Pass `ruff format` and `ruff check`
- Include proper type hints
- Use `pathlib.Path` for all file operations
- Use `loguru` for logging
- Follow the Variable Naming guidelines

## Important Note for Developers

When making changes to the project, ensure that the following files are updated to reflect the changes:

- `README.md`
- `AGENTS.md`
- `GEMINI.md`

For additional implementation details, refer to the existing risk model implementations in `src/sentinel/risk_models/`.