Python Series: Installing Crawler Software

  • Post author:
  • Post category:python




pycurl



Introduction



Installation

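# Install pycurl
pip install pycurl

# Install PhantomJS
1) Download the macOS build from the official site (http://phantomjs.org/download.html).
2) Unzip it and move the extracted phantomjs-2.1.1-macosx folder to any directory you like.


# Configure environment variables
# Add the bin directory inside the extracted folder to PATH, then check that the command resolves:
phantomjs --version


# Install pyspider
pip install pyspider

# Verify the pyspider installation
pyspider all

# Check whether pyspider has started
lsof -i:25555

# Kill the process (14211 is the PID reported by lsof in this example)
kill -9 14211

Additional notes:

Installation reference: https://www.jianshu.com/p/e37603bc70c7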
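
The lsof -i:25555 check can also be done from Python. The following is a minimal sketch, assuming the service listens on localhost; is_port_open is a hypothetical helper name for illustration and is not part of pyspider.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host all mean "not open".
        return False


if __name__ == "__main__":
    # 25555 is the port checked with lsof in the notes above.
    print("port 25555 open:", is_port_open("127.0.0.1", 25555))
```

If this prints False right after running pyspider all, the process most likely failed during startup and the terminal output is the place to look.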



Common problems



Problem 1: SyntaxError: invalid syntax

Traceback (most recent call last):
  File "/usr/local/bin/pyspider", line 5, in <module>
    from pyspider.run import main
  File "/usr/local/lib/python3.7/site-packages/pyspider/run.py", line 231
    async=True, get_object=False, no_input=False):
        ^
SyntaxError: invalid syntax

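Analysis

The pyspider source uses async as a variable name, but async has been a reserved keyword since Python 3.7, so the file fails to parse. References on this parameter-name conflict: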

https://blog.csdn.net/qq_26261381/article/details/86514138

https://www.jianshu.com/p/a0042a636229

Solution

Files to modify (rename every use of async as an identifier to a name that is not a keyword):

/usr/local/lib/python3.7/site-packages/pyspider/run.py
/usr/local/lib/python3.7/site-packages/pyspider/webui/app.py
/usr/local/lib/python3.7/site-packages/pyspider/fetcher/tornado_fetcher.py
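
The rename can be sketched as a whole-word text substitution. This is only an illustration under stated assumptions: rename_identifier and the replacement name async_mode are hypothetical choices, and a plain regex also touches matches inside strings and comments, so review the result before saving it over the installed files.

```python
import re


def rename_identifier(source: str, old: str = "async", new: str = "async_mode") -> str:
    """Replace whole-word occurrences of `old` with `new` in source text."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    return pattern.sub(new, source)


# Small stand-in for the failing signature; not the real pyspider code.
sample = "def run(async=True, get_object=False):\n    import asyncio\n    return async\n"
patched = rename_identifier(sample)
print(patched)
```

The word boundary keeps names such as asyncio intact, and running the function a second time changes nothing because async_mode no longer matches the pattern.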



Problem 2: libcurl link-time version (7.64.1) is older than compile-time version

ImportError: pycurl: libcurl link-time version (7.64.1) is older than compile-time version (7.65.3)

Analysis

https://www.cjjjs.com/article/201841813540391

Check the curl version (only the relevant output is shown):

curl -V
curl 7.65.3 (x86_64-apple-darwin13.4.0) libcurl/7.65.3 OpenSSL/1.1.1d zlib/1.2.11 libssh2/1.8.2

Find the libcurl.* files on the system:

/usr/lib/libcurl.dylib
/usr/lib/libcurl.4.dylib
/usr/lib/libcurl.3.dylib
/Users/apple/opt/anaconda3/lib/libcurl.dylib
/Users/apple/opt/anaconda3/pkgs/libcurl-7.65.3-h051b688_0/lib/libcurl.dylib
/System/Volumes/Data/Users/apple/opt/anaconda3/lib/libcurl.dylib
/System/Volumes/Data/Users/apple/opt/anaconda3/pkgs/libcurl-7.65.3-h051b688_0/lib/libcurl.dylib
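
A listing like this can be produced from Python as well. The sketch below is an assumption about how one might do it, not the command the list above came from; find_libcurl is a hypothetical helper and it only looks directly inside the directories it is given.

```python
from pathlib import Path


def find_libcurl(roots):
    """Return sorted paths of files named libcurl.* directly inside the given directories."""
    found = []
    for root in roots:
        directory = Path(root)
        if directory.is_dir():  # silently skip directories that do not exist
            found.extend(str(path) for path in directory.glob("libcurl.*"))
    return sorted(found)


if __name__ == "__main__":
    # Directories taken from the listing above.
    for path in find_libcurl(["/usr/lib", "/Users/apple/opt/anaconda3/lib"]):
        print(path)
```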



Solution 1: uninstall and reinstall pycurl
# First confirm which Python interpreter runs the script, then use that interpreter's pip to uninstall and reinstall.
/usr/bin/python -m pip list
/usr/bin/python -m pip uninstall pycurl
/usr/bin/python -m pip install pycurl
# or
pip uninstall pycurl
pip install pycurl

Start pyspider again to check the result:

# Start pyspider
pyspider all

The error persists, so Solution 1 did not work. Starting pyspider again reports:

ImportError: pycurl: libcurl link-time version (7.64.1) is older than compile-time version (7.65.3)



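Solution 2: rebuild and reinstall pycurl (recommended)
# Rebuild and install; --no-cache-dir keeps pip from reusing a previously cached build
pip3 install pycurl --compile --no-cache-dir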



Verify which library files Python loads

After deleting libcurl.4.dylib from /usr/lib, the error becomes:

import pycurl # type: ignore
ImportError: dlopen(/usr/local/lib/python3.7/site-packages/pycurl.cpython-37m-darwin.so, 2): Library not loaded: @rpath/libcurl.4.dylib
  Referenced from: /usr/local/lib/python3.7/site-packages/pycurl.cpython-37m-darwin.so
  Reason: image not found

# Import pycurl in the Python interpreter

>>> import pycurl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pycurl'

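Analysis:

The error is global: the file that imports this module is a shared library file, so one failure affects many places. It is raised while importing pycurl, and the message says that the libcurl version found at link time (7.64.1) is older than the version pycurl was compiled against (7.65.3). I had compiled and installed a newer curl from source, while the libcurl that Python picks up is an older one installed earlier. The mismatch therefore comes from a conflict between the newly built curl and the older libcurl that Python loads.

Why does this message appear? The curl that came with the Python setup was a prebuilt binary installed as-is, whereas the new one was compiled from source. pycurl was built against the newer version but is linked against the older library when it loads, so the two versions disagree. That makes the scenario easy to identify and narrows the problem down to this library.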
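
To make the two versions in the message explicit, the error text can be parsed and compared. This is a small sketch that assumes the message keeps the wording shown in this article; parse_libcurl_mismatch is a hypothetical helper, not a pycurl function.

```python
import re


def parse_libcurl_mismatch(message: str):
    """Extract (link_time, compile_time) version tuples from pycurl's ImportError text."""
    match = re.search(
        r"link-time version \((\d+(?:\.\d+)*)\).*compile-time version \((\d+(?:\.\d+)*)\)",
        message,
        re.DOTALL,  # tolerate a message that was wrapped across lines
    )
    if match is None:
        return None

    def to_tuple(text):
        return tuple(int(part) for part in text.split("."))

    return to_tuple(match.group(1)), to_tuple(match.group(2))


message = ("ImportError: pycurl: libcurl link-time version (7.64.1) "
           "is older than compile-time version (7.65.3)")
link_time, compile_time = parse_libcurl_mismatch(message)
print(link_time, compile_time, link_time < compile_time)  # (7, 64, 1) (7, 65, 3) True
```

Comparing tuples of integers avoids the string-comparison trap where "7.9.0" would sort after "7.65.3".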

Approach

  1. Reinstall pycurl: failed.
  2. Recompile pycurl.cpython-37m-darwin.so.

Conclusion:

The library files that Python imports are indeed the ones under the site-packages directory.
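
One way to confirm where a module would be loaded from, without actually importing it (which would trigger the libcurl error), is importlib.util.find_spec from the standard library. A minimal sketch; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name: str):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Prints a path under site-packages when pycurl is installed for this interpreter.
    print("pycurl ->", module_origin("pycurl"))
```

Run it with the same interpreter that launches pyspider; a different python on PATH may report a different location or None.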



tesserocr



Introduction



Installation

# Install ImageMagick

brew install imagemagick

Result of a successful installation

==> Caveats (warnings)

==> libffi
libffi is keg-only, which means it was not symlinked into /usr/local,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

For compilers to find libffi you may need to set:
  export LDFLAGS="-L/usr/local/opt/libffi/lib"
  export CPPFLAGS="-I/usr/local/opt/libffi/include"

==> python@3.8
Python has been installed as
  /usr/local/opt/python@3.8/bin/python3

You can install Python packages with
  /usr/local/opt/python@3.8/bin/pip3 install <package>
They will install into the site-package directory
  /usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages

See: https://docs.brew.sh/Homebrew-and-Python

python@3.8 is keg-only, which means it was not symlinked into /usr/local,
because this is an alternate version of another formula.

If you need to have python@3.8 first in your PATH run:
  echo 'export PATH="/usr/local/opt/python@3.8/bin:$PATH"' >> ~/.zshrc

For compilers to find python@3.8 you may need to set:
  export LDFLAGS="-L/usr/local/opt/python@3.8/lib"

==> glib
Bash completion has been installed to:
  /usr/local/etc/bash_completion.d

==> docbook
To use the DocBook package in your XML toolchain,
you need to add the following to your ~/.bashrc:

export XML_CATALOG_FILES="/usr/local/etc/xml/catalog"

==> gnu-getopt
gnu-getopt is keg-only, which means it was not symlinked into /usr/local,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

If you need to have gnu-getopt first in your PATH run:
  echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.zshrc

Bash completion has been installed to:
  /usr/local/opt/gnu-getopt/etc/bash_completion.d

==> libtool
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.

# Install tesseract
brew install tesseract-lang

# Install tesserocr and Pillow
pip install tesserocr pillow

# Verify the installation

## Test image URL
https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png

tesseract image.png result -l eng && cat result.txt

# Show details, dependencies, the current version, and caveats
brew info tesseract

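Output

Tesseract Open Source OCR Engine v4.1.1 with Leptonica

Python3WebSpider

Parameters:

First argument: the image file name

Second argument: the base name of the output file (tesseract appends .txt)

Third argument, -l: the language pack to use; eng means English

cat: prints the result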
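
The same command line can be assembled and run from Python. This is a sketch under assumptions: build_tesseract_command and run_ocr are hypothetical helper names, and run_ocr simply returns None when the tesseract executable is not on PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


def run_ocr(image, output_base="result", lang="eng"):
    """Run tesseract and return the recognized text, or None if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # tesseract writes its text output to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    print(build_tesseract_command("image.png", "result"))
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.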

Common problems

On macOS, installing tesseract with brew reports invalid: --all-languages

https://blog.csdn.net/weixin_40368256/article/details/100624099

brew install tesseract --all-languages (failed)



RedisDump



Introduction



Installation

# Install command
# Prerequisite: install Ruby first
sudo gem install redis-dump

# Verify the installation
redis-dump
redis-load



Flask



Introduction



Installing the Flask web library

# Install command
pip install flask

Verify the installation

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
	return 'Hello World!'

if __name__ == '__main__':
	app.run()



Tornado



Introduction



Installing the Tornado web library

# Install command
pip install tornado

Verify the installation

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
	def get(self):
		self.write("Hello, world")

def make_app():
	return tornado.web.Application([
		(r"/", MainHandler)
	])

if __name__ == "__main__":
	app = make_app()
	app.listen(8888)
	tornado.ioloop.IOLoop.current().start()



mitmproxy



Introduction



Installing mitmproxy for app scraping

# Install command
pip install mitmproxy

Verify the installation
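
One possible check, not taken from the original notes, is to ask the current interpreter which of the pip-installed packages it can see. The sketch assumes Python 3.8 or later (for importlib.metadata); installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["flask", "tornado", "mitmproxy"]).items():
        print(name, "->", version or "not installed")
```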



node



Introduction



Installing Node

# Install command
brew install node

Result of a successful installation

==> Caveats

==> icu4c
icu4c is keg-only, which means it was not symlinked into /usr/local,
because macOS provides libicucore.dylib (but nothing else).

If you need to have icu4c first in your PATH run:
  echo 'export PATH="/usr/local/opt/icu4c/bin:$PATH"' >> ~/.zshrc
  echo 'export PATH="/usr/local/opt/icu4c/sbin:$PATH"' >> ~/.zshrc

For compilers to find icu4c you may need to set:
  export LDFLAGS="-L/usr/local/opt/icu4c/lib"
  export CPPFLAGS="-I/usr/local/opt/icu4c/include"

==> node
Bash completion has been installed to:
  /usr/local/etc/bash_completion.d

Verify the installation

node -v 

npm -v
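
The version checks used throughout these notes can be gathered into one script. A minimal sketch, assuming each tool accepts a --version flag; tool_version is a hypothetical helper, and a tool that is not on PATH is reported as None.

```python
import shutil
import subprocess


def tool_version(command, flag="--version"):
    """Return the first line a tool prints for `flag`, or None if the tool is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run([path, flag], capture_output=True, text=True)
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    for tool in ("node", "npm", "phantomjs", "tesseract"):
        print(tool, "->", tool_version(tool))
```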



appium



Introduction



Installing Appium for app scraping

npm install -g appium



Install command

pip install mitmproxy



Verify the installation



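Copyright notice: this is an original article by qq_23306647, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.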