Inference cannot be performed on a real iOS device. #26

@yatoooon

Description

Running on a physical iPhone with an Apple A12 GPU, the model file loads, but Metal backend initialization fails while compiling kernels and the app crashes with an assertion in llm.cpp. Full console output:

FlutterView implements focusItemsInRect: - caching for linear focus movement is limited as long as this view is on screen.
flutter: The Dart VM service is listening on http://127.0.0.1:50570/Sif_eIL0TQU=/
Failed to associate thumbnails for picked URL file:///private/var/mobile/Containers/Data/Application/EA977386-3E11-46CD-AC80-5AB86E488188/Documents/qwen2.5-0.5b-instruct-q4_k_m.gguf with the Inbox copy file:///private/var/mobile/Containers/Data/Application/3A05364A-3C99-4DA2-8D2D-F7845D900EB8/tmp/com.example.lcpp-Inbox/qwen2.5-0.5b-instruct-q4_k_m.gguf: Error Domain=QLThumbnailErrorDomain Code=102 "(null)" UserInfo={NSUnderlyingError=0x303208c60 {Error Domain=GSLibraryErrorDomain Code=3 "Generation not found" UserInfo={NSDescription=Generation not found}}}
Can't find or decode reasons
Failed to get or decode unavailable reasons
Can't find or decode disabled use cases
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple A12 GPU)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
llama_model_load_from_file_impl: using device Metal (Apple A12 GPU) - 1967 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /private/var/mobile/Containers/Data/Application/3A05364A-3C99-4DA2-8D2D-F7845D900EB8/tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 15
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 133 tensors
llama_model_loader: - type q8_0: 13 tensors
llama_model_loader: - type q4_K: 12 tensors
llama_model_loader: - type q6_K: 12 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 462.96 MiB (6.16 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 896
print_info: n_layer = 24
print_info: n_head = 14
print_info: n_head_kv = 2
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 128
print_info: n_embd_v_gqa = 128
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 4864
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 630.17 M
print_info: general.name = qwen2.5-0.5b-instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
make_cpu_buft_list: disabling extra buffer types (i.e. repacking) since a GPU device is available
load_tensors: layer 0 assigned to device Metal, is_swa = 0
load_tensors: layer 1 assigned to device Metal, is_swa = 0
load_tensors: layer 2 assigned to device Metal, is_swa = 0
load_tensors: layer 3 assigned to device Metal, is_swa = 0
load_tensors: layer 4 assigned to device Metal, is_swa = 0
load_tensors: layer 5 assigned to device Metal, is_swa = 0
load_tensors: layer 6 assigned to device Metal, is_swa = 0
load_tensors: layer 7 assigned to device Metal, is_swa = 0
load_tensors: layer 8 assigned to device Metal, is_swa = 0
load_tensors: layer 9 assigned to device Metal, is_swa = 0
load_tensors: layer 10 assigned to device Metal, is_swa = 0
load_tensors: layer 11 assigned to device Metal, is_swa = 0
load_tensors: layer 12 assigned to device Metal, is_swa = 0
load_tensors: layer 13 assigned to device Metal, is_swa = 0
load_tensors: layer 14 assigned to device Metal, is_swa = 0
load_tensors: layer 15 assigned to device Metal, is_swa = 0
load_tensors: layer 16 assigned to device Metal, is_swa = 0
load_tensors: layer 17 assigned to device Metal, is_swa = 0
load_tensors: layer 18 assigned to device Metal, is_swa = 0
load_tensors: layer 19 assigned to device Metal, is_swa = 0
load_tensors: layer 20 assigned to device Metal, is_swa = 0
load_tensors: layer 21 assigned to device Metal, is_swa = 0
load_tensors: layer 22 assigned to device Metal, is_swa = 0
load_tensors: layer 23 assigned to device Metal, is_swa = 0
load_tensors: layer 24 assigned to device Metal, is_swa = 0
load_tensors: tensor 'output.weight' (q8_0) (and 168 others) cannot be used with preferred buffer type Metal, using CPU instead
ggml_backend_metal_log_allocated_size: allocated buffer, size = 235.78 MiB, ( 316.64 / 2048.02)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 462.96 MiB
load_tensors: Metal_Mapped model buffer size = 235.78 MiB
.....................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A12 GPU
ggml_metal_load_library: loading '/var/containers/Bundle/Application/3DE56B4D-3076-45F6-A5F7-30068A5F3403/Runner.app/Frameworks/lcpp.framework/default.metallib'
ggml_metal_init: GPU name: Apple A12 GPU
ggml_metal_init: GPU family: MTLGPUFamilyApple5 (1005)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction = false
ggml_metal_init: simdgroup matrix mul. = false
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = false
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 2147.50 MB
ggml_metal_init: loaded kernel_add 0x3018844e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x301881b00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sub 0x301884a20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sub_row 0x3018850e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x301885aa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x3018819e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_div 0x30188f7e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_div_row 0x30188e4c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_repeat_f32 0x30188fea0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_repeat_f16 0x301898060 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_repeat_i32 0x3018986c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_repeat_i16 0x301886460 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x301886e20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale_4 0x301887480 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_clamp 0x3018874e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_tanh 0x301887540 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x301898780 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sigmoid 0x301898de0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x301898e40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu_4 0x301898ea0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu_quick 0x301898f00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu_quick_4 0x301898f60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x301898fc0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu_4 0x301899020 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_elu 0x3018875a0 | th_max = 1024 | th_width = 32
ggml_metal_init: skipping kernel_soft_max_f16 (not supported)
ggml_metal_init: skipping kernel_soft_max_f16_4 (not supported)
ggml_metal_init: skipping kernel_soft_max_f32 (not supported)
ggml_metal_init: skipping kernel_soft_max_f32_4 (not supported)
ggml_metal_init: loaded kernel_diag_mask_inf 0x301887600 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x301899080 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x3018990e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x301887660 | th_max = 1024 | th_width = 32
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: loaded kernel_get_rows_q4_0 0x3018877e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x30189c060 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_0 0x30189c3c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_1 0x30189c720 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x30189c7e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x30189cb40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x30189cea0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x30189d200 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x30189d560 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x30189d8c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq2_xxs 0x30189dc20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq2_xs 0x301899860 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq3_xxs 0x3018997a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq3_s 0x301899980 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq2_s 0x30189df80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq1_s 0x30189e2e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq1_m 0x30189e640 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq4_nl 0x30189e9a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_iq4_xs 0x301899ce0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_i32 0x30189a040 | th_max = 1024 | th_width = 32
ggml_metal_init: skipping kernel_rms_norm (not supported)
ggml_metal_init: skipping kernel_l2_norm (not supported)
ggml_metal_init: skipping kernel_group_norm (not supported)
Compiler failed to build request
ggml_metal_init: loaded kernel_norm 0x0 | th_max = 0 | th_width = 0
ggml_metal_init: error: load pipeline error: Error Domain=AGXMetalA12 Code=3 "Encountered unlowered function call to air.simd_sum.f32" UserInfo={NSLocalizedDescription=Encountered unlowered function call to air.simd_sum.f32}
ggml_backend_metal_device_init: error: failed to allocate context
llama_init_from_model: failed to initialize the context: failed to initialize Metal backend
Assertion failed: (ctx != nullptr), function llama_prompt, file llm.cpp, line 111.
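
What the log shows: the Metal backend reports `simdgroup reduction = false` on the A12, skips the soft_max/rms_norm kernels, and then fails to build `kernel_norm` ("Encountered unlowered function call to air.simd_sum.f32"). `ggml_backend_metal_device_init` therefore fails, `llama_init_from_model` returns a null context, and `llama_prompt` in llm.cpp:111 asserts on that null pointer. Until the Metal kernels build on this GPU family, one possible workaround is to keep all layers on the CPU so the Metal backend is never initialized. The sketch below is against the upstream llama.cpp C API (the same symbols that appear in this log), not the lcpp Dart API; whether the plugin exposes a way to pass `n_gpu_layers` through is an assumption here, and the model path is illustrative.

```cpp
// Hedged sketch: force CPU-only inference so the failing Metal pipeline on the
// A12 is never touched. Uses the upstream llama.cpp C API; lcpp wraps this
// differently, so treat the path and the surrounding setup as placeholders.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;  // keep every layer on the CPU -> no ggml_metal_init

    // Illustrative path; on iOS this would be the GGUF copied into the app sandbox.
    llama_model * model = llama_model_load_from_file(
        "qwen2.5-0.5b-instruct-q4_k_m.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "model load failed\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;  // matches the n_ctx shown in the log above

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        // This is the case that currently trips the assert in llm.cpp:111 --
        // surface it as an error instead of aborting.
        fprintf(stderr, "context init failed\n");
        llama_model_free(model);
        return 1;
    }

    // ... run inference ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

Independently of the CPU fallback, it would help if lcpp checked the context for null and reported a recoverable error rather than asserting, so the app could retry without the GPU at runtime.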
