Creating your first task: wget

Sample code

About wget

wget reads HTTP/HTTPS URLs from stdin, fetches each page and prints its content to stdout. It also writes the HTTP headers of the request and the response to stderr.
For convenience, wget exits on Ctrl-C, but only after making sure that all resources have been released.

Creating and starting an HTTP task

WFHttpTask *task = WFTaskFactory::create_http_task(url, REDIRECT_MAX, RETRY_MAX, wget_callback);
protocol::HttpRequest *req = task->get_req();
req->add_header_pair("Accept", "*/*");
req->add_header_pair("User-Agent", "Wget/1.14 (gnu-linux)");
req->add_header_pair("Connection", "close");
task->start();
pause();

WFTaskFactory::create_http_task() generates an HTTP task. In WFTaskFactory.h, the prototype is defined as follows:

WFHttpTask *create_http_task(const std::string& url,
                             int redirect_max, int retry_max,
                             http_callback_t callback);

The first few parameters are self-explanatory. http_callback_t is the callback of an HTTP task, which is defined below:

using http_callback_t = std::function<void (WFHttpTask *)>;

To put it simply, it's a function that takes the task as its only parameter and returns nothing. You can pass NULL as the callback, which means there is no callback. The callbacks of all task types follow the same rule.
Note that factory functions never fail, so even if the URL is invalid you do not need to worry about getting a null task pointer. All errors are handled in the callback.
Use task->get_req() to get the request of the task. By default the method is GET and the protocol is HTTP/1.1 over a persistent (keep-alive) connection. The framework automatically sets request_uri, Host and other fields, and before sending the request it adds any other header fields that are required, such as Content-Length or Connection. You can also add your own headers with add_header_pair(). For more interfaces on HTTP messages, see HttpMessage.h.
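To picture what the request looks like once the defaults and the added headers are combined, here is a minimal sketch in plain C++. It does not use Workflow at all: build_request_head() and sample_wget_head() are hypothetical helpers, the URL http://example.com/index.html is made up, and the header order is only illustrative, since the framework decides the actual order.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using HeaderList = std::vector<std::pair<std::string, std::string>>;

// Hypothetical helper, not part of Workflow: renders the head of an
// HTTP/1.x request as "METHOD URI VERSION" followed by the header lines.
std::string build_request_head(const std::string& method,
                               const std::string& request_uri,
                               const std::string& http_version,
                               const HeaderList& headers)
{
    assert(!method.empty() && !request_uri.empty());

    std::string head = method + " " + request_uri + " " + http_version + "\r\n";
    for (const auto& header : headers)
        head += header.first + ": " + header.second + "\r\n";
    head += "\r\n";  // an empty line ends the head
    return head;
}

// The head for a made-up URL, http://example.com/index.html, carrying the
// three headers the tutorial adds plus an assumed Host field.
std::string sample_wget_head()
{
    HeaderList headers = {
        {"Host", "example.com"},
        {"Accept", "*/*"},
        {"User-Agent", "Wget/1.14 (gnu-linux)"},
        {"Connection", "close"},
    };
    return build_request_head("GET", "/index.html", "HTTP/1.1", headers);
}
```

Calling sample_wget_head() yields a request line followed by four header lines and a blank line, which is the same shape the callback later prints to stderr.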
task->start() starts the task. It is non-blocking and never fails; the callback of the task will be called afterwards. Because the task runs asynchronously, you must not use the task pointer after calling start().
To keep the example as simple as possible, it calls pause() after start() to prevent the program from exiting. Press Ctrl-C to end the program.

Handling crawled HTTP results

This example handles the result with an ordinary function. std::function also accepts other callables, such as lambdas and bound member functions.
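The following sketch shows the kinds of callable that a std::function with the same shape as http_callback_t can hold. It is plain C++ and builds without Workflow: FakeTask, fake_callback_t, Collector and finish() are hypothetical stand-ins introduced only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for WFHttpTask, so the sketch builds without Workflow.
struct FakeTask { std::string url; };

// Same shape as http_callback_t: takes a task pointer, returns nothing.
using fake_callback_t = std::function<void (FakeTask *)>;

// 1. An ordinary function, like wget_callback in this tutorial.
int plain_calls = 0;
void plain_callback(FakeTask *task) { (void)task; ++plain_calls; }

// 2. An object whose member function handles the result.
struct Collector
{
    std::string last_url;
    void on_done(FakeTask *task) { last_url = task->url; }
};

// Hypothetical helper: invokes the callback only if one was supplied,
// mirroring the rule that a NULL callback means "no callback".
bool finish(FakeTask *task, const fake_callback_t& callback)
{
    if (!callback)
        return false;
    callback(task);
    return true;
}

// Tries each kind of callable and returns how many callbacks actually ran.
int run_callback_forms()
{
    FakeTask task{"http://example.com/"};
    Collector collector;
    int lambda_calls = 0;
    int ran = 0;

    ran += finish(&task, plain_callback);
    ran += finish(&task, [&lambda_calls](FakeTask *) { ++lambda_calls; });
    ran += finish(&task, std::bind(&Collector::on_done, &collector,
                                   std::placeholders::_1));
    ran += finish(&task, nullptr);  // empty std::function: skipped

    assert(lambda_calls == 1);
    assert(collector.last_url == task.url);
    return ran;
}
```

A capturing lambda or a bound member function is handy when the callback needs context, because the callback signature itself only carries the task pointer.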

void wget_callback(WFHttpTask *task)
{
    protocol::HttpRequest *req = task->get_req();
    protocol::HttpResponse *resp = task->get_resp();
    int state = task->get_state();
    int error = task->get_error();

    // handle error states
    ...

    std::string name;
    std::string value;
    // print request to stderr
    // request line order: method, request URI, HTTP version
    fprintf(stderr, "%s %s %s\r\n", req->get_method(), req->get_request_uri(), req->get_http_version());
    protocol::HttpHeaderCursor req_cursor(req);
    while (req_cursor.next(name, value))
        fprintf(stderr, "%s: %s\r\n", name.c_str(), value.c_str());
    fprintf(stderr, "\r\n");
    
    // print response header to stderr
    ...

    // print response body to stdout
    void *body;
    size_t body_len;
    resp->get_parsed_body(&body, &body_len); // always succeeds.
    fwrite(body, 1, body_len, stdout);
    fflush(stdout);
}

The task passed to this callback is the one that was generated by the factory.
Use task->get_state() and task->get_error() to obtain the running state and the error code of the task, respectively. Error handling is skipped for now.
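Although the tutorial skips error handling, its general shape is easy to sketch: inspect the state first, report the error code, and return before touching the response. The code below is plain C++ under stated assumptions. The FAKE_STATE_* constants are hypothetical stand-ins for the framework's state values, and treating the error code of a system error as an errno value is an assumption made for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical stand-ins for the framework's task states.
enum FakeState
{
    FAKE_STATE_SUCCESS,
    FAKE_STATE_SYS_ERROR,
    FAKE_STATE_OTHER_ERROR,
};

// Returns an empty string on success, otherwise a message for stderr.
std::string describe_failure(int state, int error)
{
    switch (state)
    {
    case FAKE_STATE_SUCCESS:
        return "";
    case FAKE_STATE_SYS_ERROR:
        // Assumption: for system errors the error code is an errno value.
        return std::string("system error: ") + std::strerror(error);
    default:
        return "task failed: state=" + std::to_string(state) +
               " error=" + std::to_string(error);
    }
}

// The early-return decision a callback would make before printing
// any headers or body.
bool should_print_response(int state, int error, std::string& message)
{
    message = describe_failure(state, error);
    return message.empty();
}

// Usage: only a successful task reaches the printing code.
void error_handling_demo()
{
    std::string message;
    assert(should_print_response(FAKE_STATE_SUCCESS, 0, message));
    assert(!should_print_response(FAKE_STATE_SYS_ERROR, ECONNREFUSED, message));
}
```

In a real callback the two integers would come from task->get_state() and task->get_error().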
Use task->get_resp() to get the response of the task. It differs little from the request, since both are derived from HttpMessage.
Then use HttpHeaderCursor to scan the headers of the request and the response. The cursor is defined in HttpUtil.h:

class HttpHeaderCursor
{
public:
    HttpHeaderCursor(const HttpMessage *message);
    ...
    void rewind();
    ...
    bool next(std::string& name, std::string& value);
    bool find(const std::string& name, std::string& value);
    ...
};

Using this cursor is straightforward.
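To make the interplay of next(), find() and rewind() concrete, here is a simplified stand-in for the cursor in plain C++. FakeHeaderCursor is hypothetical and is not Workflow's implementation. It assumes that find() searches forward from the current position, matches names case-insensitively, and leaves the cursor just past the match.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using HeaderList = std::vector<std::pair<std::string, std::string>>;

// Hypothetical stand-in for protocol::HttpHeaderCursor.
class FakeHeaderCursor
{
public:
    explicit FakeHeaderCursor(const HeaderList& headers) :
        headers_(headers), pos_(0) { }

    // Move back to the first header.
    void rewind() { pos_ = 0; }

    // Copy out the header at the current position, then advance.
    bool next(std::string& name, std::string& value)
    {
        if (pos_ >= headers_.size())
            return false;
        name = headers_[pos_].first;
        value = headers_[pos_].second;
        ++pos_;
        return true;
    }

    // Assumed behaviour: search forward from the current position and
    // leave the cursor just past the match. On a miss nothing moves.
    bool find(const std::string& name, std::string& value)
    {
        for (std::size_t i = pos_; i < headers_.size(); ++i)
        {
            if (equal_ignore_case(headers_[i].first, name))
            {
                value = headers_[i].second;
                pos_ = i + 1;
                return true;
            }
        }
        return false;
    }

private:
    static bool equal_ignore_case(const std::string& a, const std::string& b)
    {
        if (a.size() != b.size())
            return false;
        for (std::size_t i = 0; i < a.size(); ++i)
        {
            if (std::tolower(static_cast<unsigned char>(a[i])) !=
                std::tolower(static_cast<unsigned char>(b[i])))
                return false;
        }
        return true;
    }

    HeaderList headers_;  // private copy, so temporaries are safe to pass
    std::size_t pos_;
};

// Counts the headers still ahead of the cursor (and consumes them).
std::size_t count_remaining(FakeHeaderCursor& cursor)
{
    std::string name, value;
    std::size_t n = 0;
    while (cursor.next(name, value))
        ++n;
    return n;
}

// Usage: after find() only the tail is left; rewind() restores a full scan.
void cursor_demo()
{
    FakeHeaderCursor cursor({{"Host", "example.com"},
                             {"Content-Type", "text/html"},
                             {"Connection", "close"}});
    std::string value;
    assert(cursor.find("content-type", value) && value == "text/html");
    assert(count_remaining(cursor) == 1);
    cursor.rewind();
    assert(count_remaining(cursor) == 3);
}
```

The demo shows why a loop over next() placed after a find() would silently miss the earlier headers unless rewind() is called first.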
The next line, resp->get_parsed_body(), obtains the HTTP body of the response. The call always returns true when the task has succeeded, and body then points to the data area.
The call returns the raw HTTP body and does not decode chunked encoding. To decode the chunks, use HttpChunkCursor in HttpUtil.h. Also note that find() moves the position inside the header cursor. To iterate over the headers after calling find(), call rewind() first to return to the beginning of the headers.
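To show what "decoding the chunks" means, the sketch below decodes a chunked body by hand in plain C++. It is not HttpChunkCursor and does not use Workflow. decode_chunked() is a hypothetical helper that follows the standard HTTP/1.1 chunked format (a hexadecimal size line, the data, a CRLF, and a final zero-size chunk), and for brevity it ignores chunk extensions and trailer fields.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes an HTTP/1.1 chunked body held in memory.
// Returns false if the input is malformed or truncated.
bool decode_chunked(const std::string& raw, std::string& out)
{
    std::size_t pos = 0;
    out.clear();

    for (;;)
    {
        // Each chunk starts with a size line terminated by CRLF.
        std::size_t eol = raw.find("\r\n", pos);
        if (eol == std::string::npos)
            return false;

        // Parse the leading hex digits; anything after them on the
        // line (a chunk extension) is ignored in this sketch.
        std::size_t size = 0;
        std::size_t i = pos;
        while (i < eol && std::isxdigit(static_cast<unsigned char>(raw[i])))
        {
            unsigned char c = static_cast<unsigned char>(raw[i]);
            int digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
            size = size * 16 + static_cast<std::size_t>(digit);
            ++i;
        }
        if (i == pos)
            return false;  // no hex digits at all

        pos = eol + 2;
        if (size == 0)
            return true;   // last chunk; trailer fields are not parsed

        // The chunk data must be followed by its own CRLF.
        std::size_t available = raw.size() - pos;
        if (available < size || available - size < 2)
            return false;
        out.append(raw, pos, size);
        pos += size;
        if (raw.compare(pos, 2, "\r\n") != 0)
            return false;
        pos += 2;
    }
}

// Usage: two chunks, "Wiki" and "pedia", followed by the last chunk.
void chunk_demo()
{
    std::string decoded;
    assert(decode_chunked("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n", decoded));
    assert(decoded == "Wikipedia");
}
```

When the response is not chunked, the raw body is already the payload and no such step is needed.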
