yasserrmd committed on
Commit
4f2aac1
·
verified ·
1 Parent(s): 0d664d9

Update README.md

Files changed (1)
  1. README.md +169 -37
README.md CHANGED
@@ -93,65 +93,197 @@ The training demonstrated strong stability and smooth convergence towards sub-0.

  ---

- ## 🧰 Usage Example

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_id = "yasserrmd/LFM2-350M-Extract-TOON"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

  schema = """
- $schema: http://json-schema.org/draft-07/schema#
  type: object
  properties:
-   users:
-     type: array
-     items:
-       type: object
-       properties:
-         id:
-           type: integer
-         name:
-           type: string
-         role:
-           type: string
-           enum: admin, user
-       required: id, name, role
- required: users
  """
  text = """
- [2025-11-11 10:34] New account created -> Alice | role: admin | id#1
- [2025-11-11 10:37] User joined -> Bob | regular user | id#2
  """

- system = (
-     "You are an intelligent model specialized in converting natural language text "
-     "into valid TOON (Token-Oriented Object Notation) format. "
-     "Always follow the given schema strictly, emit the correct header "
-     "in the form <label>[1]{fields}: followed by exactly one values row. "
-     "Do not include explanations or additional commentary."
  )
- user = f'Use schema: {schema}\nText: "{text}"'

  messages = [
-     {"role": "system", "content": system},
-     {"role": "user", "content": user}
  ]

- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=80, temperature=0)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  **Expected Output:**

  ```
- users[2]{id,name,role}:
- 1,Alice,admin
- 2,Bob,user
  ```

  ---
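
Note on the removed example: the tabular form it shows (`users[2]{id,name,role}:` followed by one comma-separated row per record) is simple enough to read back into Python objects. The following is a minimal sketch only, not part of the README or of any TOON library. The helper name `parse_toon_table` is hypothetical, and the parser assumes a single flat table whose values contain no quoted commas.

```python
import re

# Header shape taken from the README example: label[count]{field1,field2,...}:
HEADER_RE = re.compile(r"^(?P<label>\w+)\[(?P<count>\d+)\]\{(?P<fields>[^}]*)\}:$")


def parse_toon_table(block: str) -> dict:
    """Parse one flat tabular TOON block into {label: [row dicts]}.

    Assumption: values are split on bare commas and kept as strings;
    quoted values that contain commas are not handled.
    """
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    match = HEADER_RE.match(lines[0])
    if match is None:
        raise ValueError(f"not a tabular TOON header: {lines[0]!r}")
    fields = match["fields"].split(",")
    rows = [dict(zip(fields, line.split(","))) for line in lines[1:]]
    declared = int(match["count"])
    # The header declares how many rows follow, so a mismatch signals a bad output.
    if len(rows) != declared:
        raise ValueError(f"header declares {declared} rows, found {len(rows)}")
    return {match["label"]: rows}


example = """
users[2]{id,name,role}:
1,Alice,admin
2,Bob,user
"""
parsed = parse_toon_table(example)
print(parsed)
```

Checking the declared row count against the rows actually emitted is a cheap way to catch truncated or malformed generations before they reach downstream code.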
@@ -167,7 +299,7 @@ users[2]{id,name,role}:

  ---

- ## 🚀 Intended Use

  * **Structured data extraction** from unstructured text.
  * **Compact schema-based representations** for LLM pipelines.
@@ -205,7 +337,7 @@ users[2]{id,name,role}:

  ---

- ## 🙏 Acknowledgements

  * **Base model:** LiquidAI team for LFM2-350M-Extract
  * **Fine-tuning framework:** Unsloth AI
@@ -214,7 +346,7 @@ users[2]{id,name,role}:

  ---

- ## 📜 Version History

  | Version | Date | Changes |
  | ------- | ---------- | ---------------------------------------- |
 

  ---

+ ## Usage Example

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
+ from transformers import TextStreamer

  model_id = "yasserrmd/LFM2-350M-Extract-TOON"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

  schema = """
+ "$schema": "http://json-schema.org/draft-07/schema#"
  type: object
  properties:
+   id:
+     type: string
+     pattern: "^(\\d+\\.\\d+) disturbing$"
+     description: Dot-separated integers representing the unique ID of each element in the hierarchy
+   title:
+     type: string
+     description: Descriptive title of the section or element
+   level:
+     type: integer
+     minimum: 0
+     maximum: 9
+     description: "Hierarchical level (0 - ROOT, 1 - SECTION, 2 - SUBSECTION, 3+ - DETAIL_N)"
+   level_type:
+     type: string
+     enum[4]: ROOT,SECTION,SUBSECTION,DETAIL_N
+     description: Type of the hierarchical element
+   component:
+     type: array
+     items:
+       type: object
+       properties:
+         idc:
+           type: integer
+           description: Component ID
+         component_type:
+           type: string
+           enum[4]: PARAGRAPH,TABLE,CALCULATION,CHECKBOX
+           description: Type of component
+         metadata:
+           type: string
+           description: "Additional metadata (e.g., title, note, or overview)"
+         properties:
+           type: object
+           properties:
+             variables:
+               type: array
+               items:
+                 type: object
+                 properties:
+                   idx:
+                     type: string
+                     description: Unique row-column identifier (X.Y format)
+                   name:
+                     type: string
+                     description: Attribute name
+                   value:
+                     type: string
+                     description: Attribute value
+                   unit:
+                     type[2]: string,"null"
+                     description: Optional unit for the value
+                   metrics:
+                     type: boolean
+                     description: Boolean flag indicating if the attribute is a metric
+                   formula:
+                     type: boolean
+                     description: Boolean flag indicating if the attribute is a formula
+             content:
+               type: array
+               items:
+                 type[2]: string,"null"
+               description: Text content
+   children:
+     type: array
+     items:
+       "$ref": #
+ required[6]: id,title,level,level_type,component,children
  """
  text = """
+ SUBSECTION component[1]: - idc: 1 component_type: PARAGRAPH metadata: "<note>Note: Specific to debtor risk.</note>" properties: variables[0]: content[1]: The risk of debtors failing to make payments on time. - id: "2.2" title: Liquidity Risk level: 2 level_type: SUBSECTION component[1]: - idc: 1 component_type: PARAGRAPH metadata: "<note>Note: Specific to liquidity risk.</note>" properties: variables[0]: content[1]: Liquidity risk is related to the difficulty in selling assets quickly without a significant loss.
+
+ The document begins with an inclusive overview, elucidating the purpose of the report and its objective to assess risks and propose mitigations for financial operations, such as compliance, fraud detection, and performance metrics. The overall framework is meticulously divided into several sections and subsections reflecting detailed and structured analysis.
+
+ This report is intended to provide a comprehensive understanding of risk exposure within financial operations. We will now delve into the first section of the report, which covers a vast array of compliance regulations critical for maintaining financial accountability.
+
+ Firstly, let’s examine the **Compliance Section**. The section’s primary aim is to highlight the key compliance regulations applicable to financial operations. Notably, this includes the **Anti-Money Laundering (AML) Regulation (RC.1)** and the **Data Privacy Act (RC.2)**. Highlighting the significance of these regulations, the Subsection on Anti-Money Laundering identifies several gaps within the current system. These gaps need to be addressed to ensure robust compliance. The analysis suggests the presence of several risk points where the current practices might fall short of regulatory standards.
+
+ Next, we have a **Detailed Risk Analysis** for the Anti-Money Laundering Regulation. This component outlines the specific risks and potential impacts on financial operations. In the document, a table detailing the risk assessment is provided outlining two primary risks, **Fraudulent Transactions (RA.1)**, and **Non-Compliance with AML (RA.2)**, each with a brief description of the risk and its possible consequences. Addressing these risks requires a systematic approach, ensuring all preventive measures are in place to mitigate financial risks effectively.
+
+ Moreover, a **Checklist** is included to assess the current status concerning the Anti-Money Laundering Regulation. The Checklist requires the selection of the best option that describes the current status as either **Option 1 (true)** or **Option 2 (false)**. This selection is pivotal in making informed decisions about regulatory compliance and operational adjustments.
+
+ In parallel, the **Data Privacy Act** (RC.2) Subsection identifies several issues in handling personal data. These issues need to be corrected to fully comply with the Data Privacy Act. The **Fraud Detection Section** and its **Subsections on Misrepresentation and Theft of Data** follow a similar structure, detailing the critical risks associated with these vulnerabilities and emphasizing the necessity for mitigation strategies.
+
+ In the **Fraud Detection Section**, we have a table outlining two major cases of fraud: **Misrepresentation (FC.1)** and **Theft of Data (FC.2)**. These cases are significant due to their impact on financial integrity and operational continuity. The analysis of these cases includes detailed descriptions of the nature and extent of the fraud, highlighting the importance of robust fraud detection mechanisms.
+
+ Each regulatory and fraud-related section is equipped with thorough analysis and checks, ensuring that every risk is identified and addressed. While the sections provide detailed tables and checklists, they also reflect the broader context of financial operations and the mitigation strategies required to ensure compliance and prevent fraud.
+
+ By providing these detailed sections and sub-sections, the report aims to equip stakeholders with the necessary information to assess and improve the risk management framework. This ensures that all financial operations are conducted in a compliant, transparent, and secure manner, thereby safeguarding the interests of all stakeholders involved.
+
  """

+ system_instruction = (
+     "You are an intelligent model specialized in converting natural language text "
+     "into valid TOON (Token-Oriented Object Notation) format. "
+     "Always follow the given schema strictly, emit the correct header "
+     "in the form <label>[1]{fields}: followed by exactly one values row. "
+     "Do not include explanations or additional commentary."
+ )
+
+
+ user_prompt = (
+     f'Generate TOON format using the schema {schema} '
+     f'for the below text "{text}".'
  )
+

  messages = [
+     {"role": "system", "content": system_instruction},
+     {"role": "user", "content": user_prompt}
  ]

+
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     add_generation_prompt = True, # Must add for generation
+     return_tensors = "pt",
+     tokenize = True,
+     return_dict = True,
+ ).to(model.device) # Follow device_map="auto" instead of assuming CUDA
+
+
+ _ = model.generate(
+     **inputs,
+     max_new_tokens = 2046, # Increase for longer outputs!
+     # Recommended Liquid settings! Sampling must be enabled for them to take effect.
+     do_sample = True, temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
+     streamer = TextStreamer(tokenizer, skip_prompt = True),
+ )
  ```

  **Expected Output:**

  ```
+ id: "0.0"
+ title: Financial Risk Assessment Report
+ level: 0
+ description: Overview of financial risks and mitigation strategies.
+ level_type: ROOT
+ component[1]:
+   - idc: 1
+     component_type: PARAGRAPH
+     metadata: <note>Note: Specific to debtor risk.</note>"
+     properties:
+       variables[0]:
+       content[1]: The risk of debtors failing to make payments on time.
+ children[1]:
+   - id: "1.0"
+     title: Compliance Section
+     level: 1
+     level_type: SECTION
+     component[1]:
+       - idc: 1
+         component_type: PARAGRAPH
+         metadata: <note>Note: Specific to liquidity risk.</note>"
+         properties:
+           variables[0]:
+           content[1]: The risk of liquidity risk is related to the difficulty in selling assets quickly without a significant loss.
+     children[1]:
+       - id: "1.1"
+         title: Detailed Risk Analysis
+         level: 2
+         level_type: SUBSECTION
+         component[1]:
+           - idc: 1
+             component_type: TABLE
+             metadata: <note>Table of Risks</note>"
+             properties:
+               variables[2]{idx,name,value,unit,metrics}:
+                 "0.0",Risk Assessment,false,null,false
+                 "0.1",Risks,Fraudulent Transactions,null,false
+               content[1]: Fraudulent Transactions (RA.1), Non-Compliance with AML,null,false
+           - idc: 2
+             component_type: CHECKBOX
+             metadata: <note>Checklist for compliance</note>
+             properties:
+               variables[0]:
+               content[1]: Option 1 (true),Option 2 (false)<|im_end|>
  ```

  ---
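
Note on the sample output: its last line ends with the literal `<|im_end|>` marker, which is consistent with the streamer printing special tokens as they are generated. If the output is to be parsed, that marker needs to be dropped first. A minimal sketch, assuming the generated text has been captured as a plain string (the helper name `strip_end_of_turn` is hypothetical and not part of the README):

```python
END_OF_TURN = "<|im_end|>"  # marker visible at the end of the sample output


def strip_end_of_turn(generated: str, marker: str = END_OF_TURN) -> str:
    """Drop everything from the first end-of-turn marker onward."""
    head, _, _ = generated.partition(marker)
    return head.rstrip()


raw = "content[1]: Option 1 (true),Option 2 (false)<|im_end|>"
cleaned = strip_end_of_turn(raw)
print(cleaned)
```

Passing `skip_special_tokens=True` to `TextStreamer` may achieve the same effect at streaming time, since the streamer forwards extra keyword arguments to the tokenizer's decode call; this has not been checked against this model.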
 

  ---

+ ## Intended Use

  * **Structured data extraction** from unstructured text.
  * **Compact schema-based representations** for LLM pipelines.
 

  ---

+ ## Acknowledgements

  * **Base model:** LiquidAI team for LFM2-350M-Extract
  * **Fine-tuning framework:** Unsloth AI
 

  ---

+ ## Version History

  | Version | Date | Changes |
  | ------- | ---------- | ---------------------------------------- |