Running LLaMA Language Models on Windows
Foreword
On February 24, 2023, Meta AI released LLaMA, a family of large language models with up to 65 billion parameters. On March 3, its weights were "leaked" online; they can also be obtained through Meta's official request form (requests are generally approved). In this post we will run llama-int8 as well as GPTQ-for-LLaMA, a 4-bit quantized model with a web UI.
Overview
- The models come in four parameter sizes: 7B, 13B, 30B, and 65B.
- Meta claims that LLaMA-13B already outperforms GPT-3.
- Memory and VRAM usage while the models run: see the rough estimate after the variant list below.
Model variants:
- Meta AI original model: https://github.com/facebookresearch/llama
- llama-int8 (8-bit quantized): https://github.com/tloen/llama-int8
- GPTQ-for-LLaMA (4-bit quantized): https://github.com/qwopqwop200/GPTQ-for-LLaMa
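As a rough rule of thumb, the weights alone take about 2 bytes per parameter in fp16, 1 byte in int8, and half a byte at 4 bits. A minimal sketch of that arithmetic (my own illustration, not measured numbers; real usage is higher because of activations, the KV cache, and CUDA overhead):

```python
# Back-of-the-envelope VRAM estimate for the weights alone.
# Real usage is higher (activations, KV cache, CUDA context overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, dtype: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

for size in (7, 13, 30, 65):
    row = {d: round(weight_vram_gb(size, d), 1) for d in BYTES_PER_PARAM}
    print(f"{size}B: {row}")
```

By this estimate, the 7B model fits a 24 GB RTX 3090 even in fp16, while 13B needs int8 and 30B needs 4-bit quantization.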
Environment
- Windows 10 Professional 64 Bit
- NVIDIA RTX 3090
- CUDA 11.6
- cuDNN 8.8.1
Create a conda environment
```
conda create -n textgen python=3.10.9
conda activate textgen
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
```
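Once the environment is ready, it is worth confirming that the CUDA build of PyTorch actually sees the GPU; a quick check:

```python
import torch

print(torch.__version__)              # expect 1.13.1+cu116
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3090"
```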
Quick install
Here is the requirements.txt of the conda virtual environment I created:
```
accelerate==0.18.0
aiofiles==23.1.0
aiohttp==3.8.4
aiosignal==1.3.1
altair==4.2.2
anyio==3.6.2
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
async-timeout==4.0.2
attrs==22.2.0
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bitsandbytes @ git+https://github.com/Keith-Hon/bitsandbytes-windows.git@85ff11a7f04af73bc83cbe6ed0eb1a77ade0697b
certifi @ file:///C:/b/abs_85o_6fm0se/croot/certifi_1671487778835/work/certifi
charset-normalizer==3.1.0
click==8.1.3
colorama @ file:///C:/b/abs_a9ozq0l032/croot/colorama_1672387194846/work
comm @ file:///C:/b/abs_1419earm7u/croot/comm_1671231131638/work
contourpy==1.0.7
cycler==0.11.0
datasets==2.10.1
debugpy @ file:///C:/ci_310/debugpy_1642079916595/work
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
dill==0.3.6
entrypoints==0.4
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
fairscale==0.4.13
fastapi==0.95.0
ffmpy==0.3.0
filelock==3.11.0
fire==0.5.0
fonttools==4.39.2
frozenlist==1.3.3
fsspec==2023.3.0
gradio==3.24.1
gradio_client==0.0.5
h11==0.14.0
httpcore==0.16.3
httpx==0.23.3
huggingface-hub==0.13.3
idna==3.4
ipykernel @ file:///C:/b/abs_b4f07tbsyd/croot/ipykernel_1672767104060/work
ipython @ file:///C:/b/abs_d1yx5tjhli/croot/ipython_1680701887259/work
jedi @ file:///C:/ci/jedi_1644315428305/work
Jinja2==3.1.2
jsonschema==4.17.3
jupyter_client @ file:///C:/b/abs_059idvdagk/croot/jupyter_client_1680171872444/work
jupyter_core @ file:///C:/b/abs_9d0ttho3bs/croot/jupyter_core_1679906581955/work
kiwisolver==1.4.4
linkify-it-py==2.0.0
Markdown==3.4.3
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline @ file:///C:/ci/matplotlib-inline_1661934094726/work
mdit-py-plugins==0.3.3
mdurl==0.1.2
multidict==6.0.4
multiprocess==0.70.14
nest-asyncio @ file:///C:/b/abs_3a_4jsjlqu/croot/nest-asyncio_1672387322800/work
ninja==1.11.1
numpy==1.24.2
orjson==3.8.9
packaging @ file:///C:/b/abs_ed_kb9w6g4/croot/packaging_1678965418855/work
pandas==1.5.3
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
peft==0.2.0
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.4.0
platformdirs @ file:///C:/b/abs_73cc5cz_1u/croots/recipe/platformdirs_1662711386458/work
prompt-toolkit @ file:///C:/b/abs_6coz5_9f2s/croot/prompt-toolkit_1672387908312/work
psutil==5.9.4
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
pyarrow==11.0.0
pydantic==1.10.7
pydub==0.25.1
Pygments @ file:///opt/conda/conda-bld/pygments_1644249106324/work
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-multipart==0.0.6
pytz==2023.3
pywin32==305.1
PyYAML==6.0
pyzmq @ file:///C:/ci/pyzmq_1657616000714/work
quant-cuda @ file:///E:/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
regex==2023.3.23
requests==2.28.2
responses==0.18.0
rfc3986==1.5.0
semantic-version==2.10.0
sentencepiece==0.1.97
six @ file:///tmp/build/80754af9/six_1644875935023/work
sniffio==1.3.0
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
starlette==0.26.1
termcolor==2.2.0
tokenizers==0.13.2
toolz==0.12.0
torch==1.13.1+cu116
torchaudio==0.13.1+cu116
torchvision==0.14.1+cu116
tornado @ file:///C:/ci/tornado_1662476985533/work
tqdm==4.65.0
traitlets @ file:///C:/b/abs_e5m_xjjl94/croot/traitlets_1671143896266/work
transformers @ git+https://github.com/huggingface/transformers@4c01231e67f0d699e0236c11178c956fb9753a17
typing_extensions==4.5.0
uc-micro-py==1.0.1
urllib3==1.26.15
uvicorn==0.21.1
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
websockets==10.4
wincertstore==0.2
xxhash==3.2.0
yarl==1.8.2
```
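Note that the `@ file:///...` entries point at local conda build artifacts and at the quant-cuda wheel built later in this post, so a fresh machine cannot install this file verbatim; treat it as a version reference rather than a one-shot `pip install -r requirements.txt`.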
Running the model with a web UI
```
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd GPTQ-for-LLaMa
pip install ninja
conda install -c conda-forge cudatoolkit-dev
python setup_cuda.py install
```
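If the build succeeds, the compiled extension should import from Python. A quick sanity check (quant_cuda is the module name the wheel above ships; adjust if your build names it differently):

```python
# If setup_cuda.py (or the prebuilt wheel) installed correctly,
# the compiled CUDA kernel module imports cleanly.
import quant_cuda
print("quant_cuda extension loaded")
```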
- If setup_cuda.py fails to build, download the prebuilt .whl file and install it with pip install quant_cuda-0.0.0-cp310-cp310-win_amd64.whl.
- At the time of writing, transformers has only just added the LLaMA model, so the main branch must be installed from source; see the Hugging Face LLaMA documentation for details.
- Loading a large model normally takes a lot of VRAM. Hugging Face's bitsandbytes lowers the memory needed to load the model with only a small impact on quality; for details, read "A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes". Windows users need to install [bitsandbytes-windows](https://github.com/fa0311/bitsandbytes-windows) from source. A hedged loading example follows this list.
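For illustration, 8-bit loading through transformers looks roughly like this. This is a minimal sketch, assuming the Hugging Face format weights already sit in models/llama-13b as described in the next section; it is not the exact code path the web UI runs:

```python
# Minimal sketch of int8 loading with transformers + bitsandbytes.
# Assumes converted HF weights in models/llama-13b (see next section).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-13b")
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b",
    device_map="auto",   # let accelerate place layers on the GPU
    load_in_8bit=True,   # quantize weights to int8 at load time
)
```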
Downloading the model weights
It is best to create a models folder in the project root first:
- Meta AI original weights

```
models
├── llama-7b
│   ├── consolidated.00.pth
│   ├── params.json
│   └── checklist.chk
└── tokenizer.model
```
Torrent file for the weights: Safe-LLaMA-HF (3-26-23).zip
Magnet link: magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA
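After the download finishes, the files can be checked against checklist.chk. A hedged sketch, assuming the file holds md5sum-style "hash filename" lines as in the original release:

```python
# Hedged sketch: verify downloaded weights against checklist.chk.
# Assumption: each non-empty line is "md5hash  filename" (md5sum format).
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):  # stream: .pth files are huge
            h.update(block)
    return h.hexdigest()

root = Path("models/llama-7b")
for line in (root / "checklist.chk").read_text().splitlines():
    if not line.strip():
        continue
    expected, name = line.split()
    actual = md5_of(root / name)
    print(name, "OK" if actual == expected else f"MISMATCH ({actual})")
```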
- Weights converted to Hugging Face format

Create a models folder in the root of text-generation-webui. For example, to use the 13B weights, git clone decapoda-research/llama-13b-hf, copy the weights into models, and rename the folder to llama-13b:
```
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/decapoda-research/llama-13b-hf

# if you want to clone without large files - just their pointers,
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
```
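The copy-and-rename step can also be done in one call; a trivial sketch (paths are illustrative):

```python
# Hedged helper: move the cloned HF folder into models/ under the
# name text-generation-webui expects. Paths are illustrative.
import shutil

shutil.move("llama-13b-hf", "models/llama-13b")
```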
I have also uploaded both kinds of weight files to Baidu Netdisk; extraction code: 1234.
Run

```
# run the GPTQ-for-LLaMA web client
python server.py --cai-chat --model llama-7b --no-stream

# run llama-int8
python example.py --ckpt_dir [TARGET_DIR]/7b --tokenizer_path [TARGET_DIR]/tokenizer.model --max_batch_size=1
```
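For reference, in the web UI version current when this was written, --cai-chat starts the chat-style interface, --model selects the folder name under models, and --no-stream disables token streaming.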
Here is the model running with the web UI:

Port forwarding from the LAN
Since I am on a campus network, the service has to be tunneled to the public internet before it can be reached from outside. After trying several tools, I am using ngrok as a temporary solution for now.
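One way to script the tunnel is the third-party pyngrok package; a hedged sketch, not part of the setup above (install with pip install pyngrok first), assuming the web UI listens on Gradio's default port 7860:

```python
# Hedged sketch: expose the local web UI via ngrok using pyngrok.
# Assumptions: pyngrok is installed and the UI listens on port 7860.
from pyngrok import ngrok

public_url = ngrok.connect(7860)
print("web UI exposed at:", public_url)
```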