ArkTS-CodeSearch: A Open-Source ArkTS Dataset for Code Retrieval
Paper • 2602.05550 • Published • 1
How to use hreyulog/embedinggemma_arkts with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("hreyulog/embedinggemma_arkts")
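sentences = [
# Chinese docstring, roughly: "Load the favorite merchants data when the component is about to appear"
"组件即将出现时加载收藏商家数据",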
"static async delete(key: string, preferenceName: string = defaultPreferenceName) {\n let preferences = await this.getPreferences(preferenceName)\n return await preferences.delete(key)\n }",
"async aboutToAppear(): Promise<void> {\n await this.loadFavoriteMerchants();\n }",
"Copyright (c) 2022 Huawei Device Co., Ltd.\nLicensed under the Apache License,Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
This is a sentence-transformers model fine-tuned from google/embeddinggemma-300m. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(4): Normalize()
)
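The module list above can be read as a pipeline: token vectors are mean-pooled, passed through two bias-free linear layers with identity activation (768 to 3072, then back to 768), and finally L2-normalized. The following is a minimal pure-Python sketch of that data flow on toy dimensions (2 to 3 to 2). The helper names, weights, and inputs are invented for illustration; they are not the model's parameters or the library's implementation.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention mask is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [t / max(count, 1) for t in total]


def dense_no_bias(x, weight):
    """Linear layer without bias or activation: y = W x, with W given as rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def l2_normalize(x):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)


# Toy input: two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

# Hypothetical weights standing in for the 768 -> 3072 and 3072 -> 768 layers.
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 2 -> 3
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # 3 -> 2

pooled = mean_pool(tokens, mask)
hidden = dense_no_bias(pooled, w_up)
projected = dense_no_bias(hidden, w_down)
embedding = l2_normalize(projected)
print(pooled, embedding)
```

Since the final vectors have unit length, comparing two embeddings by cosine similarity is equivalent to taking their dot product.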
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("hreyulog/embedinggemma_arkts")
# Run inference
queries = [
"Transform an array of points with all matrices. VERY IMPORTANT: Keep\nmatrix order \"value-touch-offset\" when transforming.\n\n@param pts",
]
documents = [
"public pointValuesToPixel(pts: number[]) {\n this.mMatrixValueToPx.mapPoints(pts);\n this.mViewPortHandler.getMatrixTouch().mapPoints(pts);\n this.mMatrixOffset.mapPoints(pts);\n }",
'makeNode(uiContext: UIContext): FrameNode {\n this.rootNode = new FrameNode(uiContext);\n if (this.rootNode !== null) {\n this.rootRenderNode = this.rootNode.getRenderNode();\n }\n return this.rootNode;\n }',
'export interface OnlineLunarYear {\n year: number;\n zodiac: string;\n ganzhi: string;\n leapMonth: number;\n isLeapYear: boolean;\n leapMonthDays?: number;\n solarTerms: SolarTermInfo[];\n festivals: LunarFestival[];\n}',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 768] [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.8923, 0.0264, -0.0212]])
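For retrieval, the row of similarity scores is typically turned into a ranking of the documents. A minimal sketch, using the scores printed above; `rank_documents` is a hypothetical helper written for this example and is not part of sentence-transformers.

```python
def rank_documents(scores, top_k=None):
    """Return document indices ordered from most to least similar to the query."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order if top_k is None else order[:top_k]


# Scores copied from the printed similarity tensor (one query, three documents).
scores = [0.8923, 0.0264, -0.0212]

ranking = rank_documents(scores)
best = ranking[0]
print(ranking, best)
```

Here the first document (`pointValuesToPixel`) is ranked first, which matches the query's docstring.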
Evaluation results on the test split of the arkts-code-docstring dataset:
| Model | Params | MRR | NDCG@5 | Recall@1 | Recall@5 |
|---|---|---|---|---|---|
| embedinggemma_arkts | 308M | 0.7788 | 0.8034 | 0.7142 | 0.8769 |
| Qwen3-Embedding-0.6B | 596M | 0.6776 | 0.7015 | 0.6141 | 0.7723 |
| embeddinggemma-300m | 308M | 0.6399 | 0.6654 | 0.5740 | 0.7416 |
| BGE-M3 | 567M | 0.5283 | 0.5603 | 0.4464 | 0.6558 |
| BGE-base-zh-v1.5 | 110M | 0.3598 | 0.3903 | 0.2841 | 0.4816 |
| BGE-base-en-v1.5 | 110M | 0.3439 | 0.3637 | 0.2935 | 0.4227 |
| E5-base-v2 | 110M | 0.3073 | 0.3261 | 0.2596 | 0.3823 |
| BM25 (jieba) | – | 0.2043 | 0.2204 | 0.1643 | 0.2690 |
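For reference, the metrics in the table can all be computed from the rank at which each query's correct code snippet is retrieved. The sketch below assumes exactly one relevant document per query, which is natural for docstring-code pairs but is an assumption about the evaluation protocol rather than something stated here. The function names and example ranks are illustrative only.

```python
import math


def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based positions of the correct document."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct document appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """NDCG@k with a single relevant document, so the ideal DCG is 1."""
    gains = (1.0 / math.log2(r + 1) for r in ranks if r <= k)
    return sum(gains) / len(ranks)


# Hypothetical ranks of the correct snippet for four queries.
ranks = [1, 3, 2, 10]

print(f"MRR      = {mrr(ranks):.4f}")
print(f"NDCG@5   = {ndcg_at_k(ranks, 5):.4f}")
print(f"Recall@1 = {recall_at_k(ranks, 1):.4f}")
print(f"Recall@5 = {recall_at_k(ranks, 5):.4f}")
```

Under this single-relevant-document assumption, Recall@1 is a lower bound on MRR and NDCG@5 never exceeds Recall@5, which is consistent with every row of the table.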
Dataset: hreyulog/arkts-code-docstring
Columns: sentence_0 and sentence_1
| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |
| sentence_0 | sentence_1 |
|---|---|
| 移除登录状态监听 (Remove the login state listener) | public removeLoginStateListener(listener: (isLoggedIn: boolean) => void) {\n const index = this.loginStateListeners.indexOf(listener);\n if (index !== -1) {\n this.loginStateListeners.splice(index, 1);\n }\n } |
| PUT请求 (PUT request) | static put(url: string, data?: Object, config: RequestConfig = {}): Promise> { |
| Transform an array of points with all matrices. VERY IMPORTANT: Keep\nmatrix order "value-touch-offset" when transforming.\n\n@param pts | public pointValuesToPixel(pts: number[]) {\n this.mMatrixValueToPx.mapPoints(pts);\n this.mViewPortHandler.getMatrixTouch().mapPoints(pts);\n this.mMatrixOffset.mapPoints(pts);\n } |
Loss: MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false
}
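To make the parameters concrete: a multiple-negatives ranking loss treats, for each anchor in a batch, its paired positive as the correct class and every other positive in the batch as a negative, then applies cross-entropy over the scaled similarities. The following pure-Python sketch illustrates that idea with cosine similarity and `scale = 20.0`. It is a simplified illustration, not the sentence-transformers implementation, and the function names and toy vectors are assumptions.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch of candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each anchor matches its own positive exactly.
aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the positives swapped, so every pairing is wrong.
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

With this loss, larger batches supply more in-batch negatives per anchor, so the batch size of 32 used here also determines how many negatives each docstring is contrasted against.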
Non-default hyperparameters:
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
num_train_epochs: 2
multi_dataset_batch_sampler: round_robin
All hyperparameters:
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: None
warmup_ratio: None
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
enable_jit_checkpoint: False
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
use_cpu: False
seed: 42
data_seed: None
bf16: False
fp16: False
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: -1
ddp_backend: None
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
group_by_length: False
length_column_name: length
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_for_metrics: []
eval_do_concat_batches: True
auto_find_batch_size: False
full_determinism: False
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
use_cache: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}
| Epoch | Step | Training Loss |
|---|---|---|
| 0.4088 | 500 | 0.3798 |
| 0.8177 | 1000 | 0.2489 |
| 1.2265 | 1500 | 0.1308 |
| 1.6353 | 2000 | 0.0877 |
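The training log, read together with the listed hyperparameters (batch size 32, gradient accumulation 1, 2 epochs, linear schedule, no warmup, peak learning rate 5e-05), allows a rough back-of-the-envelope reconstruction of the run. The sketch below estimates the steps per epoch from the first logged row and evaluates a linear decay to zero at the logged steps. The single-device assumption, and therefore the estimate of training pairs, is not stated in the source.

```python
PEAK_LR = 5e-05      # learning_rate from the hyperparameters
BATCH_SIZE = 32      # per_device_train_batch_size
NUM_EPOCHS = 2       # num_train_epochs

# First row of the training log: step 500 corresponds to epoch 0.4088.
logged_step, logged_epoch = 500, 0.4088

steps_per_epoch = round(logged_step / logged_epoch)
total_steps = steps_per_epoch * NUM_EPOCHS
# Assumption: one device, so each step consumes BATCH_SIZE pairs.
approx_pairs = steps_per_epoch * BATCH_SIZE


def linear_lr(step, total, peak=PEAK_LR):
    """Linear decay from the peak learning rate to zero, with no warmup."""
    return peak * max(0.0, 1.0 - step / total)


print(steps_per_epoch, total_steps, approx_pairs)
for step in (500, 1000, 1500, 2000):
    print(step, f"{linear_lr(step, total_steps):.2e}")
```

The estimate of 1223 steps per epoch reproduces the other logged epoch values (1000, 1500, and 2000 steps map to about 0.8177, 1.2265, and 1.6353 epochs), which suggests the reading of the log is consistent.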
@misc{he2026arktscodesearchopensourcearktsdataset,
title={ArkTS-CodeSearch: A Open-Source ArkTS Dataset for Code Retrieval},
author={Yulong He and Artem Ermakov and Sergey Kovalchuk and Artem Aliev and Dmitry Shalymov},
year={2026},
eprint={2602.05550},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2602.05550},
}
Base model: google/embeddinggemma-300m