Neoman / spiderkit

forked from 立冬 / spiderkit 
spiderkit

Build steps:

Install dependencies

CentOS:
sudo yum -y install gcc gcc-c++ make flex bison gperf ruby \
      openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel \
        libpng-devel libjpeg-devel
sudo yum install git
sudo yum install automake autoconf autogen libtool
Download and install Maven manually.

Ubuntu:
sudo apt-get install build-essential g++ flex bison gperf ruby perl \
      libsqlite3-dev libfontconfig1-dev libicu-dev libfreetype6 libssl-dev \
        libpng-dev libjpeg-dev
sudo apt-get install git
sudo apt-get install automake autoconf autogen libtool
sudo apt-get install maven

Download spiderkit

git clone https://git.oschina.net/wangsihong/spiderkit.git

Build

Go to the spiderkit/script directory and run the build script:
cd spiderkit/script
./compile.sh

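Start spiderkit

A ZooKeeper cluster must be running before a spiderkit cluster is started.

Edit spiderkit.conf. A template is provided in the script directory; copy it to the project root.

Configuration file:
    "StaticCoreCount"  : number of static cores to start (a static core is a core with JavaScript parsing disabled)
    "DynamicCoreCount" : number of dynamic cores to start
    "GroupName"        : group name (every client request must specify a group; a group can represent a network environment or a data center, or be used to distribute requests)
    "IsDebug"          : whether to start in debug mode, in which the log prints debug information
    "ZookeeperHost"    : host of the ZooKeeper cluster
    "ExitWorkCount"    : number of pages a core loads and renders before it requests a restart; this works around the qwebkit memory leak
    "DefaultSocketPort": service port, 21225 by default
    "enableProxy"      : whether to use a proxy
    "proxyType"        : proxy type
    "proxyAuthUser"    : proxy user name
    "proxyAuthPass"    : proxy password
    "proxyHost"        : proxy IP
    "proxyPort"        : proxy port
    "enableProxyPool"  : false/true; when enabled, the cluster loads proxies from a proxy pool through ZooKeeper
    "enableProxyPath"  : ZooKeeper node that stores the proxy pool's proxy information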

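The options above can be sanity-checked before the cluster is started. The sketch below is only an illustration: it assumes spiderkit.conf is a JSON object (the quoted keys suggest this, but the template in the script directory is the authority), and the sample values, the choice of required keys, and the helper names `parse_zookeeper_hosts` and `check_conf` are all hypothetical. Only the key names, the sample ZooKeeper address, and the default port 21225 come from this README.

```python
import json

# Key names come from the option list above. The JSON layout and every value
# except the default port are assumptions made for illustration.
SAMPLE_CONF = """
{
    "StaticCoreCount": 2,
    "DynamicCoreCount": 2,
    "GroupName": "test",
    "IsDebug": false,
    "ZookeeperHost": "10.58.222.103:2181",
    "ExitWorkCount": 1000,
    "DefaultSocketPort": 21225,
    "enableProxy": false,
    "enableProxyPool": false
}
"""

# Which keys are mandatory is an assumption, not something spiderkit documents.
REQUIRED_KEYS = ("StaticCoreCount", "DynamicCoreCount", "GroupName", "ZookeeperHost")


def parse_zookeeper_hosts(value, default_port=2181):
    """Split a connect string such as "h1:2181,h2" into (host, port) pairs."""
    hosts = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.partition(":")
        hosts.append((host, int(port) if sep else default_port))
    return hosts


def check_conf(text):
    """Parse the config text and return (config, list_of_problems)."""
    conf = json.loads(text)
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in conf]
    if conf.get("StaticCoreCount", 0) + conf.get("DynamicCoreCount", 0) <= 0:
        problems.append("at least one static or dynamic core is needed")
    if "ZookeeperHost" in conf and not parse_zookeeper_hosts(conf["ZookeeperHost"]):
        problems.append("ZookeeperHost is empty")
    return conf, problems


conf, problems = check_conf(SAMPLE_CONF)
print(problems)
print(parse_zookeeper_hosts(conf["ZookeeperHost"]))
```

An empty problem list means the file at least parses and names a ZooKeeper host; it does not prove that spiderkit itself accepts the file.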

Run the start script:
    ./spiderkit-start.sh

Stop script:
    ./spiderkit-stop.sh

Python client

CentOS:
    sudo yum install python-devel
Ubuntu:
    sudo apt-get install python-dev

Install easy_install:
    wget --no-check-certificate https://bootstrap.pypa.io/ez_setup.py -O - | sudo python
Install pyzmq:
    sudo easy_install pyzmq
Install protobuf for Python:
    Go to src/thrid/
    Unpack and install python-gflags-2.0.tar.gz and google-apputils-0.4.0.tar.gz
    Then go to script/source/protobuf/python/ and install protobuf for Python:
    sudo python setup.py install
Install the ZooKeeper Python bindings:
    Download the zkpython package
    Download URL: https://pypi.python.org/packages/source/z/zkpython/zkpython-0.4.2.tar.gz

Install the spiderkit Python client:
    Go to src/python-client
    sudo python setup.py install


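Example: rendering Baidu with the Python client:

    from os import _exit

    # caller and webkit are provided by the spiderkit Python client
    gconfig = caller.GlobalConfig()
    gconfig.init("10.58.222.103:2181")

    wk = webkit.WebKit("test")
    page = wk.getWebPage("http://www.baidu.com/", 30000, 40000)

    if page is None: # get page failed
        wk.release()
        _exit(0)

    print(page.getTitle())

    # free the page first, then the WebKit handle
    page.destroy()
    wk.release()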
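
The example above has to call `destroy()` and `release()` on every exit path. One way to make that harder to forget is a context manager. This is a minimal sketch, not part of the spiderkit client: `rendered_page` is a hypothetical helper, and `FakeWebKit` and `FakePage` are stand-ins that mimic only the four methods used above, so the block runs without a cluster.

```python
from contextlib import contextmanager


class FakePage(object):
    """Stand-in for the page object returned by getWebPage (hypothetical)."""

    def __init__(self, title):
        self.title = title
        self.destroyed = False

    def getTitle(self):
        return self.title

    def destroy(self):
        self.destroyed = True


class FakeWebKit(object):
    """Stand-in for webkit.WebKit so the sketch runs without a cluster."""

    def __init__(self, group):
        self.group = group
        self.released = False

    def getWebPage(self, url, *args):
        # The real client renders the URL remotely; here a title is faked.
        return FakePage("title of " + url)

    def release(self):
        self.released = True


@contextmanager
def rendered_page(wk, url, *args):
    """Yield the rendered page (or None) and always clean up afterwards."""
    page = None
    try:
        page = wk.getWebPage(url, *args)
        yield page
    finally:
        if page is not None:
            page.destroy()
        wk.release()


wk = FakeWebKit("test")
with rendered_page(wk, "http://www.baidu.com/", 30000, 40000) as page:
    title = page.getTitle() if page is not None else None
print(title)
```

With the real client, `FakeWebKit("test")` would be replaced by `webkit.WebKit("test")`; the helper itself relies only on `getWebPage`, `destroy`, and `release`.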

Java client

The Java client code is under spiderkit/src/java/src/skit-client.
Maven dependency for the Java client:
    <dependency>
        <groupId>com.gome</groupId>
        <artifactId>skit-client</artifactId>
        <version>0.0.1</version>
    </dependency>

Example: rendering Baidu with the Java client:

    String zkhost = "10.58.222.103:2181";
    GlobalConfig config = GlobalConfig.getInstance();
    config.connect(zkhost);

    WebKit webkit = new WebKit("test");

    WebPage page = webkit.get("http://www.baidu.com/");

    if (page == null) { // get page failed
        webkit.release();
        return;
    }

    System.out.println(page.getTitle());

    page.destory();
    webkit.release();

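spiderkit-schedule crawler framework

spiderkit schedule is a plugin-based crawler framework. A plugin extends the classes in spider-plugin to implement a crawler's link extraction and its data extraction and storage, and the framework's scheduler starts or stops crawler tasks.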
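
The division of labor described above (the plugin extracts links and data, the framework schedules) can be sketched in a few lines. The real spider-plugin base classes are Java and are not shown in this README, so everything below is hypothetical and illustrates only the pattern: the names `SpiderPlugin`, `TitlePlugin`, and `run_task`, the regex-based extraction, and the in-memory `PAGES` table are all assumptions.

```python
import re
from collections import deque


class SpiderPlugin(object):
    """Hypothetical base class: a plugin supplies extraction and storage."""

    def extract_links(self, url, html):
        raise NotImplementedError

    def extract_data(self, url, html):
        raise NotImplementedError

    def store(self, record):
        raise NotImplementedError


class TitlePlugin(SpiderPlugin):
    """Example plugin: follows every href and records each page title."""

    def __init__(self):
        self.records = []

    def extract_links(self, url, html):
        return re.findall(r'href="([^"]+)"', html)

    def extract_data(self, url, html):
        match = re.search(r"<title>(.*?)</title>", html)
        return {"url": url, "title": match.group(1) if match else None}

    def store(self, record):
        self.records.append(record)


def run_task(plugin, fetch, seeds, max_pages=10):
    """Breadth-first crawl: the framework schedules, the plugin extracts."""
    queue, seen, done = deque(seeds), set(seeds), 0
    while queue and done < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        plugin.store(plugin.extract_data(url, html))
        done += 1
        for link in plugin.extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return done


# In-memory pages stand in for rendered output from the cluster.
PAGES = {
    "http://example.com/": '<title>Home</title><a href="http://example.com/a">a</a>',
    "http://example.com/a": '<title>A</title><a href="http://example.com/">home</a>',
}
plugin = TitlePlugin()
count = run_task(plugin, PAGES.get, ["http://example.com/"])
print(count, [record["title"] for record in plugin.records])
```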

Related source directories: spiderkit/src/java/src/spider-schedule --- spiderkit-schedule
                            spiderkit/src/java/src/spider-plugin --- spider-plugin

Maven dependency for spider-plugin:
<dependency>
         <groupId>com.gome</groupId>
         <artifactId>spider-plugin</artifactId>
         <version>0.0.1</version>
</dependency>

Start:
    ./app-start.sh

Control panel:
    http://your_ip:8089/spiderkit/

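flowcrawl crawler plugin

flowcrawl is a crawler plugin built into spiderkit schedule. Through the spiderkit schedule web page you can configure a vertical crawler that crawls a site step by step and saves the rendered page source together with the relationships between pages.
By configuring XPath expressions or writing JavaScript scripts, you can extract the information you need.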
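
To illustrate the XPath side of that description, the sketch below maps field names to XPath expressions and applies them to a saved page. It is not flowcrawl's configuration format, which this README does not show: the `RULES` mapping, the `extract` helper, and the sample markup are hypothetical. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# A small, well-formed stand-in for a rendered page (hypothetical content).
PAGE = """
<html>
  <body>
    <div class="item"><a href="/p/1">First</a><span class="price">10</span></div>
    <div class="item"><a href="/p/2">Second</a><span class="price">20</span></div>
  </body>
</html>
"""

# Field name -> XPath expression; this layout is an assumption for illustration.
RULES = {
    "name": ".//div[@class='item']/a",
    "price": ".//div[@class='item']/span[@class='price']",
}


def extract(xml_text, rules):
    """Apply each XPath rule and collect the text of the matching nodes."""
    root = ET.fromstring(xml_text)
    return {
        field: [node.text for node in root.findall(path)]
        for field, path in rules.items()
    }


result = extract(PAGE, RULES)
print(result)
```

Real-world HTML is often not well-formed XML, so a production extractor would need an HTML-aware parser; the rule-to-field mapping is the part this sketch is meant to show.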

Copyright (c) 2015, 北漂立冬
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of the {organization} nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Introduction

A WebKit rendering cluster based on a modified phantomjs. It supports remote calls and provides Java/C++/Python interfaces.