黄亿华 / webmagic (Java, Apache-2.0)

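webmagic is a crawler framework that needs no configuration and is easy to extend. It provides a simple, flexible API, so a crawler can be written with only a small amount of code.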


Readme in Chinese

A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.
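To make the four lifecycle stages concrete, here is a toy, in-memory sketch. It is not webmagic code: the class name, the fake "web" map, and the `href` pattern are all assumptions made for illustration, and a map lookup stands in for the HTTP download.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of the crawl lifecycle; none of these names are webmagic APIs.
class CrawlLifecycleSketch {

    // Matches href="..." attributes in the fake pages below.
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    // Crawls a fake "web" (URL -> HTML) and returns the processed URLs in visit order.
    static List<String> crawl(Map<String, String> web, String startUrl) {
        Queue<String> queue = new ArrayDeque<>();   // URL management: pending URLs
        Set<String> seen = new HashSet<>();         // URL management: de-duplication
        List<String> stored = new ArrayList<>();    // persistence: stand-in for a file or database

        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = web.get(url);             // downloading: a map lookup replaces the HTTP request
            if (html == null) {
                continue;                           // unknown page, nothing to extract
            }
            Matcher m = LINK.matcher(html);         // content extraction: pull out the links
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) {               // add() returns false for URLs already seen
                    queue.add(link);
                }
            }
            stored.add(url);                        // persistence: record the processed page
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
                "/a", "<a href=\"/b\">b</a> <a href=\"/c\">c</a>",
                "/b", "<a href=\"/a\">back</a>",
                "/c", "no links here");
        System.out.println(crawl(web, "/a")); // prints [/a, /b, /c]
    }
}
```

The de-duplicating set is what keeps the `/b` to `/a` back-link from causing an endless loop; a real crawler separates these stages into replaceable components.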


  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • POJO annotations to customize a crawler, with no configuration needed.
  • Multi-threading and distribution support.
  • Easy to integrate.


Add dependencies to your pom.xml:


WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use your own slf4j implementation, exclude slf4j-log4j12.
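As a sketch of what the dependency section might look like with the default logging binding excluded: the artifacts are published under the `us.codecraft` group as `webmagic-core` and `webmagic-extension`, while `${webmagic.version}` is a placeholder property assumed here, which you would define yourself with the release you want.

```xml
<!-- ${webmagic.version} is a placeholder; define it in <properties> -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
    <exclusions>
        <!-- Only needed if you supply your own slf4j implementation -->
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

If you keep the default log4j binding, drop the `<exclusions>` elements.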


Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information.

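import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Crawl settings; Site.me() creates a Site with default values.
    private Site site = Site.me();

    @Override
    public void process(Page page) {
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Pass the start URL to addUrl(...); it is left empty here.
        Spider.create(new GithubRepoPageProcessor()).addUrl("").thread(5).run();
    }
}
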
  • page.addTargetRequests(links)

    Adds URLs to the crawl queue.
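The two regex-driven steps above, pulling the author out of the page URL and choosing which links to hand to `page.addTargetRequests(links)`, can be tried out with plain `java.util.regex`. This is a minimal sketch, not webmagic's selector implementation: `LinkFilterSketch` and the repository-link pattern are assumptions made for illustration, while the author pattern is the one used in the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFilterSketch {

    // Same pattern as the "author" field in the example: group 1 is the account name.
    private static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Assumed pattern for repository pages: https://github.com/<owner>/<repo>
    private static final Pattern REPO_LINK = Pattern.compile("https://github\\.com/\\w+/\\w+");

    // Returns the first capture group, or null if the URL does not match.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    // Keeps only the links that look like repository pages.
    static List<String> targetLinks(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            if (REPO_LINK.matcher(link).matches()) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(author("https://github.com/code4craft/webmagic")); // prints code4craft
        System.out.println(targetLinks(List.of(
                "https://github.com/code4craft/webmagic",
                "https://github.com/about"))); // prints [https://github.com/code4craft/webmagic]
    }
}
```

Inside a PageProcessor, the filtered list would typically come from something like `page.getHtml().links().regex(...).all()` and be passed straight to `page.addTargetRequests(...)`.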

You can also use the annotation-based approach:

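import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;

public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me(), new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("").thread(5).run(); // start URL left empty here
    }
}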
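To show the idea behind annotation-driven extraction, here is a small reflection sketch using only the JDK. It is not how webmagic implements `@ExtractBy`: the `@FromUrl` annotation, the `Repo` model, and the `extract` helper are hypothetical names defined purely for this illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnnotationExtractionSketch {

    // Hypothetical annotation for this sketch only; not a webmagic annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface FromUrl {
        String value(); // regex whose first capture group becomes the field value
    }

    // Plain model class: the extraction rules live on the fields.
    static class Repo {
        @FromUrl("https://github\\.com/(\\w+)/.*")
        String author;

        @FromUrl("https://github\\.com/\\w+/(\\w+)")
        String name;
    }

    // Creates a model object and fills every annotated field from the URL.
    static <T> T extract(Class<T> type, String url) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                FromUrl rule = field.getAnnotation(FromUrl.class);
                if (rule == null) {
                    continue; // field is not managed by the extractor
                }
                Matcher m = Pattern.compile(rule.value()).matcher(url);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Repo repo = extract(Repo.class, "https://github.com/code4craft/webmagic");
        System.out.println(repo.author + " / " + repo.name); // prints code4craft / webmagic
    }
}
```

The benefit of this style is that the model class carries its own extraction rules, so adding a field does not require touching the crawling logic.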

Docs and samples:


The architecture of webmagic (modeled on Scrapy)


There are more examples in the webmagic-samples package.


Licensed under the Apache 2.0 license


To write webmagic, I referred to the projects below:


QQ Group: 373225642 542327088

Related Project

  • Gather Platform

    A web console based on WebMagic for Spider configuration and management.
