风的旋轮 / craw

This repository doesn't specify license. Without author's permission, this code is only for learning and cannot be used for other purposes.

## Documents

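PHP crawler
Crawls the articles, titles, and images of a website.

It targets one specific site only (Sogou WeChat search) and was written for personal learning, not for commercial use.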

http://weixin.sogou.com/weixin?query=%s&_sug_type_=&s_from=input&_sug_=n&type=2

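The first `%s` in the link is the keyword and the second is the page number; both are filled in with printf-style formatting. (The URL as shown carries only the keyword placeholder.)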
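As a minimal sketch of that substitution: the snippet below fills the keyword and page number into the search URL with `sprintf`. The function name `buildSearchUrl` and the trailing `&page=%d` parameter are assumptions made for illustration; the README's URL does not show how the page number is passed, and `craw_client.php` may do it differently.

```php
<?php
// Search URL template taken from the README. The trailing "&page=%d" is an
// ASSUMPTION added here to carry the page number; it is not in the original URL.
const SEARCH_URL = 'http://weixin.sogou.com/weixin?query=%s&_sug_type_=&s_from=input&_sug_=n&type=2&page=%d';

// Hypothetical helper: returns the search URL for one keyword and one page.
function buildSearchUrl(string $keyword, int $page): string
{
    // urlencode() keeps non-ASCII keywords and reserved characters valid in the query string.
    return sprintf(SEARCH_URL, urlencode($keyword), $page);
}

echo buildSearchUrl('php', 2), PHP_EOL;
```

`sprintf` is used rather than `printf` because the URL is needed as a string to pass on to the HTTP request, not printed directly.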

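  • First, create your own keyword table (id, name, is_done).
  • Watch out for the absolute paths set in the class properties.
  • Note that PHP PDO is used to connect to MySQL, and that table names and similar settings are hard-coded.
  • If null comes back, the anti-crawler security check has been triggered; limit how many pages you crawl per hour.
  • Crawling 100 pages in 30 minutes was found to work without problems.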
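The rate limit in the notes above can be sketched as follows: spread the requests evenly over the time window and stop as soon as a fetch returns null. This is a sketch under stated assumptions, not the code in `craw_client.php`; `delayBetweenRequests`, `crawlThrottled`, and the `$fetch` callable are hypothetical names.

```php
<?php
// Seconds to wait between requests so that $pages requests fit into $windowSeconds.
function delayBetweenRequests(int $pages, int $windowSeconds): float
{
    if ($pages <= 0) {
        throw new InvalidArgumentException('pages must be positive');
    }
    return $windowSeconds / $pages;
}

// Fetch each URL with a pause in between and stop at the first null result,
// which the notes above associate with the anti-crawler check.
// $fetch is a HYPOTHETICAL callable: it takes a URL and returns the body or null.
// $sleep can be replaced (for example in tests) to avoid really waiting.
function crawlThrottled(array $urls, callable $fetch, float $delay, ?callable $sleep = null): array
{
    $sleep = $sleep ?? function (float $seconds): void {
        usleep((int) round($seconds * 1000000));
    };
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first) {
            $sleep($delay); // pause before every request except the first
        }
        $first = false;
        $body = $fetch($url);
        if ($body === null) {
            break; // blocked: stop and retry later
        }
        $pages[$url] = $body;
    }
    return $pages;
}

// 100 pages in 30 minutes: 1800 s / 100 = 18 s between requests.
$delay = delayBetweenRequests(100, 30 * 60);
```

Stopping on the first null, rather than retrying immediately, avoids sending more requests while the check is active.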

CREATE TABLE `star_cate` (
  `id` tinyint(3) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `wx_code` varchar(255) NOT NULL,
  `url` varchar(255) NOT NULL,
  `is_done` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=82 DEFAULT CHARSET=utf8mb4;


--------
CREATE TABLE `star_article` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `cate_id` int(11) NOT NULL DEFAULT '0',
  `title` varchar(255) NOT NULL DEFAULT '',
  `body` text NOT NULL,
  `img_md5` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
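
One way the two tables might be used together through PDO is sketched below: pick the next keyword whose `is_done` is 0, store each crawled article, then mark the keyword as done. The function names are hypothetical, the connection details in the usage comment are placeholders, and storing a single MD5 string in `img_md5` is an assumption; the actual logic in `craw_client.php` may differ.

```php
<?php
// Return the next keyword row that has not been crawled yet, or null when none is left.
function nextPendingKeyword(PDO $pdo): ?array
{
    $stmt = $pdo->query('SELECT id, name FROM star_cate WHERE is_done = 0 ORDER BY id LIMIT 1');
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

// Store one crawled article. Prepared statements keep titles and bodies from breaking the SQL.
// ASSUMPTION: $imgMd5 is a single MD5 string; the README does not describe the column's format.
function saveArticle(PDO $pdo, int $cateId, string $title, string $body, string $imgMd5): void
{
    $stmt = $pdo->prepare('INSERT INTO star_article (cate_id, title, body, img_md5) VALUES (?, ?, ?, ?)');
    $stmt->execute([$cateId, $title, $body, $imgMd5]);
}

// Flag a keyword as finished so that it is skipped on the next run.
function markKeywordDone(PDO $pdo, int $cateId): void
{
    $stmt = $pdo->prepare('UPDATE star_cate SET is_done = 1 WHERE id = ?');
    $stmt->execute([$cateId]);
}

// Usage (placeholder DSN and credentials, adjust to your setup):
// $pdo = new PDO('mysql:host=localhost;dbname=craw;charset=utf8mb4', 'user', 'password');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// while (($keyword = nextPendingKeyword($pdo)) !== null) {
//     // ... crawl $keyword['name'] and call saveArticle() for each article ...
//     markKeywordDone($pdo, (int) $keyword['id']);
// }
```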

The latest version of the code is craw_client.php.

Further optimization is ongoing.
