3 Star 0 Fork 0

Gitee 极速下载 / jedi-crawler

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
此仓库是为了提升国内下载速度的镜像仓库,每日同步一次。 原始仓库: https://github.com/spacenick/jedi-crawler
该仓库未声明开源许可证文件(LICENSE),使用请关注具体项目描述及其代码上游依赖。
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README

JEDI CRAWLER

Da fuq?

JEDI CRAWLER is a Node/PhantomJS crawler made to scrape pretty much anything from Node, with a really simple syntax. Work in progress ladies

npm install jedi-crawler

How does it work

Register padawans to the jedi crawler, that have a pattern to match a URL, and jQuery-style selectors. You can also post-process the data if you need to do some treatment (number conversion, etc)

wikipedia.js:

module.exports = function(jedi) {

  jedi.registerPadawan({
    // Pattern to match URL
    pattern: /en.wikipedia.org\/wiki\//,
    // Selectors to be executed
    selectors:{
      title:{
        sel: "#firstHeading span",
        type: "text"
      },
      firstParagraph:{
        sel: "#toc ~ p:first",
        type: "text"
      }
    },
    // You can choose to process the data AFTER being crawled.
    postProcessing: function(data) {
      /// Do your custom processing on the data processed
      data.title = data.title.toUpperCase();
      return data;
    }
  });

};

For now only two types of selectors are supported : "text" and "src"

I find having one file per padawan (crawler) pretty cool for code clarity and also padawans need to learn by themselve and be alone

npm install jedi-crawlers

You can then give your padawans to the Jedi by doing

var jedi = require('jedi-crawler');
require('./padawans/wikipedia')(jedi);

And then you can do

jedi.crawl('http://en.wikipedia.org/whatever', function(err, result){
  console.log(err);
  console.log(result);
});

As the jedi will figure out what padawan to use given on the URL and of the pattern you set

Special features

Crawlers only start to scrape the page as soon as $(document).ready is fired. Our own version of jQuery is injected into the page, but then we also give back the $ to its owner in case they're executing 3rd party libraries to modify the DOM or w/e

If your selectors matches severals DOM elements, then an array of every value is returned

Right now, PhantomJS is instantiated with "--load-images=no" option so the page loads faster

Test it now

Pull that bad boy Make sure you have PhantomJS installed Run node main.js

空文件

简介

暂无描述 展开 收起
JavaScript
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
1
https://gitee.com/mirrors/jedi-crawler.git
git@gitee.com:mirrors/jedi-crawler.git
mirrors
jedi-crawler
jedi-crawler
master

搜索帮助