
Document Download Tutorial

2022-12-26 18:43 Author: SciTechSports

A downloader for multiple document types

https://github.com/rty813/doc_downloader

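The simple method

  1. Download docDownloader.zip (https://github.com/rty813/doc_downloader/releases/) and unzip it.

  2. Run docDownloader.exe.

  3. Enter the URL of the document to start the download. The downloaded document is saved in the output subfolder.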

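The advanced method

  1. Download all files of doc_downloader-master (for example with the GitZip for github Chrome extension) and unzip them.

  2. Install Python or Anaconda. Taking Anaconda as an example: open the Start menu, find Anaconda3 (64-bit), and run Anaconda Powershell Prompt (anaconda3) as administrator to open a terminal. Then change to the unzipped folder. In this example the archive was unzipped to D:\Download\doc_downloader-master, so enter in the terminal:

    D: (press Enter)

    cd D:\Download\doc_downloader-master\doc_downloader-master (press Enter)

  3. In the terminal, enter pip install -r requirements.txt (press Enter) to install the required packages. If the tool reports an error when you run it, first check whether the chromedriver version is compatible with your Chrome version. If it is not, replace chromedriver.exe in the folder with a compatible version. See the [chromedriver download page](https://chromedriver.chromium.org/downloads).

  4. In the terminal, enter python docDownloader.py (press Enter), then enter the URL of the document to start the download. The downloaded document is saved in the output subfolder.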
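The compatibility check in step 3 usually comes down to comparing major version numbers. The following is a minimal sketch, assuming you have already read the two version strings yourself (for example from chrome://version and from running chromedriver.exe --version). The function names and the sample version strings are illustrative and are not part of doc_downloader.

```python
import re


def major_version(version_text: str) -> int:
    """Return the leading number of the first dotted version found in the text."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def is_compatible(chrome_version: str, chromedriver_version: str) -> bool:
    """Rule of thumb (an assumption here): the two major versions should match."""
    return major_version(chrome_version) == major_version(chromedriver_version)


if __name__ == "__main__":
    # Sample strings for illustration only; substitute your own.
    print(is_compatible("108.0.5359.125", "ChromeDriver 108.0.5359.71"))
    print(is_compatible("108.0.5359.125", "ChromeDriver 107.0.5304.62"))
```

If the function returns False, that suggests downloading a chromedriver build whose major version equals that of your Chrome installation.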

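The PDFs downloaded with the methods above store each page as an image. To be able to copy text from them, run OCR (optical character recognition) on the PDF.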

Installing OCRmyPDF on Windows

https://ocrmypdf.readthedocs.io/en/latest/installation.html#native-windows

You must install the following for Windows:

  • Python 3.8 (64-bit) or later

  • Tesseract 4.1.1 or later

  • Ghostscript 9.50 or later

Using the Chocolatey (https://chocolatey.org/) package manager, install the following when running in an Administrator command prompt:

  • choco install python3

  • choco install --pre tesseract

  • choco install ghostscript

  • choco install pngquant (optional)

The commands above will install Python 3.x (latest version), Tesseract, Ghostscript and pngquant. Chocolatey may also need to install the Windows Visual C++ Runtime DLLs or other Windows patches, and may require a reboot.

You may then use pip to install ocrmypdf. (This can be performed by a user or Administrator.):

  • pip install ocrmypdf

Chocolatey automatically selects appropriate versions of these applications. If you are installing them manually, please install 64-bit versions of all applications for 64-bit Windows, or 32-bit versions of all applications for 32-bit Windows. Mixing the “bitness” of these programs will lead to errors.
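If you install the tools by hand, it may help to confirm the bitness of the Python interpreter first, so that Tesseract and Ghostscript can be chosen to match. This is a small sketch using only the standard library; the function name is illustrative.

```python
import platform
import struct


def python_bitness() -> int:
    """Return 32 or 64, derived from the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(f"Python is {python_bitness()}-bit on a {platform.machine()} machine")
```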

OCRmyPDF will check the Windows Registry and standard locations in your Program Files for third party software it needs (specifically, Tesseract and Ghostscript). To override the versions OCRmyPDF selects, you can modify the PATH environment variable. Follow these directions to change the PATH.
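To see what is currently reachable through PATH, a quick check with Python's shutil.which can help. This is only a sketch: the executable names below ("tesseract" and "gswin64c") are assumptions about a typical 64-bit Windows install, and a tool missing from PATH may still be found by OCRmyPDF through the Registry or Program Files.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def find_missing(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tool names that the lookup function cannot locate."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    missing = find_missing(["tesseract", "gswin64c"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools were found on PATH.")
```

The lookup function is passed in as a parameter so the check can be exercised without the tools being installed.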

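Open the Anaconda terminal and enter

cd D:\Download\docDownloader\docDownloader\output (press Enter)

Name the PDF to be OCRed pic.pdf and the output file text.pdf. For a Chinese document, enter

ocrmypdf --force-ocr -l chi_sim pic.pdf text.pdf

to start OCR. The resulting text.pdf is written to the same folder.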
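The same command can be wrapped in a short Python script, which is convenient when several PDFs need processing. This is a sketch under stated assumptions: ocrmypdf is on PATH, the chi_sim language data is installed for Tesseract, and the helper names build_ocr_command and run_ocr are made up for illustration.

```python
import subprocess
from typing import List


def build_ocr_command(
    input_pdf: str, output_pdf: str, language: str = "chi_sim"
) -> List[str]:
    """Build the same ocrmypdf command line that is typed in the terminal."""
    return ["ocrmypdf", "--force-ocr", "-l", language, input_pdf, output_pdf]


def run_ocr(input_pdf: str, output_pdf: str, language: str = "chi_sim") -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, language), check=True)


# Example call (not executed here), run from the output folder:
# run_ocr("pic.pdf", "text.pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.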





