• 7 Posts
  • 22 Comments
Joined 1 year ago
Cake day: June 14th, 2023

  • This model doesn’t have the new llama.cpp k-quants, and there are better models that have come out since Pygmalion 13B, like Guanaco (https://huggingface.co/TheBloke/guanaco-13B-GGML) and Manticore (https://huggingface.co/TheBloke/manticore-13b-chat-pyg-GGML). For the best ratio of quality to speed, I’d go with either q3_K_L quantization (for lower spec hardware) or any of the qN_K_M quantization schemes, since those use higher-bit quantization on the attention layers and perform better than either the qN_K_S models or the plain qN models.
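
    To make the naming concrete (these file names follow TheBloke’s usual GGML naming convention, so treat them as illustrative rather than exact), in the Manticore repo you’d be picking between files roughly like these:

    manticore-13b-chat-pyg.ggmlv3.q3_K_L.bin   (smallest of the k-quants listed here, for lower spec hardware)
    manticore-13b-chat-pyg.ggmlv3.q4_K_M.bin   (good quality/speed balance)
    manticore-13b-chat-pyg.ggmlv3.q5_K_M.bin   (better quality, needs more RAM and runs slower)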

    GGML also supports GPU acceleration via CUDA now, so if you have an Nvidia GPU, add the flag “--n-gpu-layers N”, where N is the number of layers offloaded to the GPU; you want this as high as possible without crashing.
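
    For example, a llama.cpp run with most of a 13B model offloaded might look like the line below (the model path and layer count are just placeholders; how many layers actually fit depends on your VRAM):

    ./main -m ./models/manticore-13b-chat-pyg.ggmlv3.q4_K_M.bin --n-gpu-layers 40 -p "Your prompt here"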

    You should also add the flag “--threads N”, where N is the number of CPU threads you want to use, otherwise it’ll run single threaded and the performance will be ass. N should be the number of physical CPU cores you have, maybe 1-2 higher; I’d play around with it a bit. Be careful though: if you’re able to fit all/most of the model in the GPU, --threads should be 1 (for 100% in GPU) or close to it (80%+ in GPU).
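
    Putting the two flags together (again, the file name and the numbers are just examples to tune for your machine), a CPU-only run on an 8-core box versus a mostly-GPU run might look like:

    ./main -m ./models/manticore-13b-chat-pyg.ggmlv3.q4_K_M.bin --threads 8 --n-gpu-layers 0 -p "Your prompt here"
    ./main -m ./models/manticore-13b-chat-pyg.ggmlv3.q4_K_M.bin --threads 1 --n-gpu-layers 40 -p "Your prompt here"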

    These numbers aren’t exact and will depend on your hardware config, so tinker a bit. Also, if you’re getting speeds faster than you need on the 13B parameter models, I’d try out a 33B parameter model (use at least a q3 one though; q2_K 33B quality is a bit worse than some of the higher-bit quantized 13B models imo, and it isn’t that much faster than a q3 33B model, so it’s just not worth it). The quality of the output is much better if you have good enough hardware or can tolerate the speed penalty.

    Edit: Also, Manticore was trained on a cleaned version of the Pygmalion dataset, plus a bunch of other stuff, so it works as a drop-in replacement for Pygmalion if you have an application that depends on a model using Pygmalion prompt formats.