42532afff4
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests
- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.
This mirrors the Qwen2 fix (commit 0d049d6), adapted to Qwen3.5's regex. It keeps Unicode tokenization robust on long inputs and prevents std::regex stack overflows.
Closes #21919.
* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
* cont : remove trailing whitespace
---------
Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
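
For readers unfamiliar with the approach, the following is a minimal sketch of how a non-backtracking scan for `[\p{L}\p{M}]+` can work. It is not the code added to src/unicode.cpp. The names `is_letter_or_mark` and `split_letter_mark_runs` are hypothetical, and the category check is a tiny stand-in (ASCII letters, Latin-1 letters, and the combining diacritical marks U+0300 to U+036F) rather than a real Unicode table lookup. The point is only that a single forward loop over code points needs no recursion, so stack use stays constant no matter how long the input is.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, deliberately tiny stand-in for a Unicode category lookup.
// Assumption: a real handler would consult full \p{L} and \p{M} tables.
static bool is_letter_or_mark(uint32_t cp) {
    const bool ascii_letter  = (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A);
    const bool latin1_letter = cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
    const bool combining     = cp >= 0x0300 && cp <= 0x036F;
    return ascii_letter || latin1_letter || combining;
}

// Returns (offset, length) pairs for every maximal run of letters and marks.
// The scan only moves forward, so it runs in linear time and never recurses.
static std::vector<std::pair<size_t, size_t>> split_letter_mark_runs(
        const std::vector<uint32_t> & cpts) {
    std::vector<std::pair<size_t, size_t>> runs;
    size_t pos = 0;
    while (pos < cpts.size()) {
        if (!is_letter_or_mark(cpts[pos])) {
            ++pos; // not part of a run, skip it
            continue;
        }
        const size_t start = pos;
        while (pos < cpts.size() && is_letter_or_mark(cpts[pos])) {
            ++pos; // extend the current run greedily
        }
        runs.emplace_back(start, pos - start);
    }
    return runs;
}
```

Under these assumptions, a long input made of a single repeated letter, such as the row of `à` characters in the test cases, yields exactly one run, and the loop does the same constant amount of work per code point.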
121 lines
2.6 KiB
Plaintext
ied 4 ½ months
|
|
__ggml_vocab_test__
|
|
Äpfel
|
|
__ggml_vocab_test__
|
|
|
|
__ggml_vocab_test__
|
|
|
|
__ggml_vocab_test__
|
|
|
|
__ggml_vocab_test__
|
|
|
|
__ggml_vocab_test__
|
|
|
|
__ggml_vocab_test__
|
|
|
|
|
|
__ggml_vocab_test__
|
|
|
|
|
|
|
|
__ggml_vocab_test__
|
|
|
|
|
|
|
|
|
|
__ggml_vocab_test__
|
|
|
|
|
|
__ggml_vocab_test__
|
|
Hello world
|
|
__ggml_vocab_test__
|
|
Hello world
|
|
__ggml_vocab_test__
|
|
Hello World
|
|
__ggml_vocab_test__
|
|
Hello World
|
|
__ggml_vocab_test__
|
|
Hello World!
|
|
__ggml_vocab_test__
|
|
Hello, world!
|
|
__ggml_vocab_test__
|
|
Hello, world!
|
|
__ggml_vocab_test__
|
|
this is 🦙.cpp
|
|
__ggml_vocab_test__
|
|
w048 7tuijk dsdfhu
|
|
__ggml_vocab_test__
|
|
нещо на Български
|
|
__ggml_vocab_test__
|
|
កាន់តែពិសេសអាចខលចេញ
|
|
__ggml_vocab_test__
|
|
🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)
|
|
__ggml_vocab_test__
|
|
Hello
|
|
__ggml_vocab_test__
|
|
Hello
|
|
__ggml_vocab_test__
|
|
Hello
|
|
__ggml_vocab_test__
|
|
Hello
|
|
__ggml_vocab_test__
|
|
Hello
|
|
__ggml_vocab_test__
|
|
Hello
|
|
Hello
|
|
__ggml_vocab_test__
|
|
(
|
|
__ggml_vocab_test__
|
|
|
|
=
|
|
__ggml_vocab_test__
|
|
' era
|
|
__ggml_vocab_test__
|
|
Hello, y'all! How are you 😁 ?我想在apple工作1314151天~
|
|
__ggml_vocab_test__
|
|
!!!!!!
|
|
__ggml_vocab_test__
|
|
3
|
|
__ggml_vocab_test__
|
|
33
|
|
__ggml_vocab_test__
|
|
333
|
|
__ggml_vocab_test__
|
|
3333
|
|
__ggml_vocab_test__
|
|
33333
|
|
__ggml_vocab_test__
|
|
333333
|
|
__ggml_vocab_test__
|
|
3333333
|
|
__ggml_vocab_test__
|
|
33333333
|
|
__ggml_vocab_test__
|
|
333333333
|
|
__ggml_vocab_test__
|
|
Cửa Việt
|
|
__ggml_vocab_test__
|
|
discards
|
|
__ggml_vocab_test__
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
|
|
__ggml_vocab_test__
|
|
é
|
|
__ggml_vocab_test__
|
|
résumé
|
|
__ggml_vocab_test__
|
|
àààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààà
|
|
__ggml_vocab_test__
|
|
Vieết Nam
|
|
__ggml_vocab_test__
|