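Project URL: https://github.com/VikParuchuri/marker
According to the official description, Marker converts PDF, EPUB, and MOBI files to Markdown and is 10x faster than nougat.
The official instructions only cover Linux and macOS. By following a reference guide, I managed to install it on Windows.
The installation steps are below:
1. Install Visual Studio 2022 (its C++ compiler is needed to build detectron2 in step 5)
2. Install the NVIDIA CUDA Toolkit (the PyTorch command in step 3 targets CUDA 12.1)
3. Install PyTorch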
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
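Before moving on, it can help to confirm that the installed PyTorch wheel really is a CUDA build and can see the GPU, since detectron2 is compiled against it later. The following is a minimal sketch, not part of the official instructions; the helper name cuda_report is hypothetical.

```python
import importlib.util


def cuda_report():
    """Return a small dict describing the local PyTorch/CUDA setup."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the script still runs without PyTorch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built for; None for CPU-only wheels.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_for_cuda prints None, the CPU-only wheel was probably installed and the pip command above should be re-run with the cu121 index URL.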
4. Install wheel
pip install wheel
5. Install detectron2. It has to be built locally. If the build fails, see: #issuecomment-651560907
The steps are:
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2/
# Run cmd as administrator
python setup.py install
Edit detectron2\layers\csrc\nms_rotated\nms_rotated_cuda.cu
so that the includes at the top of the file read as follows:
// Copyright (c) Facebook, Inc. and its affiliates.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <ATen/cuda/CUDAApplyUtils.cuh>
/*#ifdef WITH_CUDA
#include "../box_iou_rotated/box_iou_rotated_utils.h"
#endif
// TODO avoid this when pytorch supports "same directory" hipification
#ifdef WITH_HIP
#include "box_iou_rotated/box_iou_rotated_utils.h"
#endif*/
#include "box_iou_rotated/box_iou_rotated_utils.h"
After making the change, run the install command again:
python setup.py install
6. Next, install the Windows versions of Tesseract and Ghostscript
Tesseract:
tesseract-ocr-w64-setup-5.3.3.20231005.exe
Ghostscript:
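Once both installers have finished, a quick check that the executables are reachable on PATH can save a confusing failure later. This is a minimal sketch under assumptions: the executable names tesseract and gswin64c (the usual 64-bit Ghostscript console binary on Windows) are my guesses at what gets looked up, and find_tools is a hypothetical helper.

```python
import shutil


def find_tools(names=("tesseract", "gswin64c")):
    """Map each executable name to its full path, or None if not on PATH.

    The default names are assumptions for a 64-bit Windows install.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either one is reported as not found, its install directory may need to be added to the PATH environment variable by hand.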
7. Install VikParuchuri/marker
git clone https://github.com/VikParuchuri/marker.git
Remove detectron2 from VikParuchuri/marker/requirements.txt. It was already installed manually in step 5, so pip should not try to install it again.
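Deleting the line by hand is enough. If you want to script it, the edit could be sketched as below. The helper drop_requirement is hypothetical, and it only handles lines that start with the package name (for example detectron2, detectron2==x.y or detectron2 @ git+...), not a bare git URL.

```python
import re


def drop_requirement(text, package):
    """Return requirements text without the lines that name `package`."""
    # Match the package name at the start of a line, followed by the end of
    # the line or by a version/extras/URL/marker separator.
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*($|[=<>!~\[;@ ])", re.IGNORECASE
    )
    kept = [line for line in text.splitlines() if not pattern.match(line)]
    return "\n".join(kept) + "\n"


def rewrite_requirements(path, package="detectron2"):
    """Rewrite the requirements file at `path` in place without `package`."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(drop_requirement(original, package))
```

Calling rewrite_requirements("requirements.txt") from inside the cloned marker directory would apply it before the pip install step.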
Install the remaining dependencies:
pip install -r requirements.txt
pip install ftfy
# The import name is spellchecker, but the PyPI package to install is pyspellchecker
pip install pyspellchecker
pip install ocrmypdf
pip install nltk
pip install thefuzz
pip uninstall python-magic
pip install python-magic-bin
pip install ray==2.7.1
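Several of these packages are imported under a different name from the one used with pip, so a failed import is easy to misread. The sketch below checks which of them are still missing without actually importing them. The pip-to-module mapping is my own reading of the commands above, and missing_packages is a hypothetical helper.

```python
import importlib.util

# pip name -> import name (these differ for some packages).
EXPECTED = {
    "ftfy": "ftfy",
    "pyspellchecker": "spellchecker",
    "ocrmypdf": "ocrmypdf",
    "nltk": "nltk",
    "thefuzz": "thefuzz",
    "python-magic-bin": "magic",
    "ray": "ray",
}


def missing_packages(expected=EXPECTED):
    """Return the pip names whose top-level module cannot be found."""
    return sorted(
        pip_name
        for pip_name, module in expected.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```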
8. Install nougat
# Installing it this way fails
pip install nougat-ocr
# Install it from the repository instead
pip install git+https://github.com/facebookresearch/nougat
Run: python convert_single.py "Vim 101 Hacks.pdf" vim.md --parallel_factor 5
Errors:
1. The following message appears:
python convert_single.py "Vim 101 Hacks.pdf" vim.md --parallel_factor 5
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
C:\Users\lca\AppData\Roaming\Python\Python311\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3527.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
This is a UserWarning rather than a fatal error. To get rid of it:
Open the file C:\Users\lca\AppData\Roaming\Python\Python311\site-packages\torch\functional.py
and change return _VF.meshgrid(tensors, **kwargs)
to return _VF.meshgrid(tensors, **kwargs, indexing='ij')
That is all.
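Editing a file inside site-packages is undone by the next PyTorch upgrade, and hard-coding indexing there may raise a duplicate-keyword error if some other caller already passes indexing. A less invasive option is to filter this one warning with Python's standard warnings module. This is a sketch of that alternative, not what the steps above do. The helper name silence_meshgrid_warning is hypothetical, and calling it near the top of convert_single.py is an assumption about where it fits.

```python
import warnings


def silence_meshgrid_warning():
    """Ignore only the torch.meshgrid 'indexing argument' UserWarning."""
    warnings.filterwarnings(
        "ignore",
        # Regex matched against the start of the warning message.
        message=r"torch\.meshgrid",
        category=UserWarning,
    )


if __name__ == "__main__":
    silence_meshgrid_warning()
    # Filtered, so nothing is printed for this one:
    warnings.warn("torch.meshgrid: in an upcoming release, ...", UserWarning)
    # Unrelated warnings are still shown:
    warnings.warn("some other warning", UserWarning)
```

The filter has to be installed before the code that triggers the warning runs.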