Trial Experience of Archer Crawler

I saw ads for the Archer crawler on several websites. It is fairly easy to get started with, but extending its functionality is not as comfortable as writing a crawler in Python.

Steps

  1. Register
  2. Select or develop a spider in the spider market
  3. Set the related parameters
  4. Test and start the spider
  5. Export the data

Archer crawlers are written in JavaScript, and content is parsed with XPath, regular expressions, or JsonPath. The platform integrates IP proxies, JS rendering, and other features, and the official documentation is fairly detailed. The flow chart of a crawler run is as follows:

—update—
The image has expired.
—update—

The flow of a crawler run is easy to follow.

The basic structure of a crawler is as follows:

var configs = {
    domains: ["xxx"],
    scanUrls: ["xxxx"],
    contentUrlRegexes: ["xxx"],
    helperUrlRegexes: ["xxx"],
    fields: [
        {
            name: "title",
            selector: 'xxx',
            required: true
        },
        {
            name: "content",
            selector: 'xxx',
            repeated: true
        },
        {
            name: "article_author",
            selector: 'xxx'
        }
    ]
};

var crawler = new Crawler(configs);
crawler.start();

The crawler's parameters are defined in configs; a new Crawler object is then created from them and started.

domains filters out unrelated URLs; scanUrls lists the crawler's entry pages; contentUrlRegexes holds the regex rules that match content pages; helperUrlRegexes holds the regex rules that match list pages; fields describes the data to extract and save from content pages. selector defaults to an XPath rule but can also be specified as a regular expression; required means the field must not be empty; repeated means the field matches more than one element. If you check an XPath with XPath Helper it may highlight several elements, but the crawler only captures the first match, so in that case you need to set repeated: true.
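For example, a minimal config against a hypothetical site (example.com, with made-up URLs and XPaths, only for illustration) showing the difference a repeated field makes might look like this:

var configs = {
    domains: ["example.com"],                        // hypothetical site
    scanUrls: ["http://example.com/list_1.htm"],     // hypothetical entry page
    contentUrlRegexes: ["http://example\\.com/article/\\d+"],
    helperUrlRegexes: ["http://example\\.com/list_\\d+\\.htm"],
    fields: [
        {
            name: "title",
            selector: '//h1',      // single match: only the first hit is saved
            required: true
        },
        {
            name: "paragraph",
            selector: '//div[@class="article"]//p//text()',
            repeated: true         // save every matching text node, not just the first
        }
    ]
};

var crawler = new Crawler(configs);
crawler.start();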

Here is a crawler for the Acfun article section, written this morning:

var configs = {
    domains: ["www.acfun.tv"],
    scanUrls: ["http://www.acfun.tv/v/list110/index_1.htm"],
    contentUrlRegexes: ["http://www\\.acfun\\.tv/a/ac\\d{7}"],
    helperUrlRegexes: ["http://www\\.acfun\\.tv/v/list110/index_\\d+\\.htm"],
    enableJS: true,    // view count and upvote count are rendered by JS
    fields: [
        {
            name: "title",
            selector: '//*[@id="title_1"]/span[2]',
            required: true
        },
        {
            // //p//text() copes with content nested in extra divs or mixed span/p markup
            name: "content",
            selector: '//*[contains(@id,"area-player")]//p//text()',
            required: true,
            repeated: true
        },
        {
            name: "article_author",
            selector: '//*[@id="block-info-bottom"]/div[2]/div/span[1]/a/nobr',
            required: false
        },
        {
            name: "article_view_count",
            selector: '//*[@id="txt-info-title_1"]/span[1]',
            required: false
        },
        {
            name: "article_agree_count",
            selector: '//*[@id="txt-info-title_1"]/span[5]',
            required: false
        }
    ]
};

var crawler = new Crawler(configs);
crawler.start();

enableJS is turned on because the view count and upvote count are loaded by JavaScript. The content selector uses //p//text() because some pages wrap the body in an extra layer of div, and some mix the content between span and p elements.
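A quick way to sanity-check such an XPath before putting it into the config is to run it in the browser console on an article page; this uses the standard DOM XPath API, nothing specific to the Archer platform:

// Run in the browser console on an Acfun article page to see what the
// content selector actually matches before using it in the crawler config.
var xpath = '//*[contains(@id,"area-player")]//p//text()';
var result = document.evaluate(
    xpath,
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_RESULT,
    null
);
for (var i = 0; i < result.snapshotLength; i++) {
    // each item is a text node; print its text
    console.log(result.snapshotItem(i).textContent);
}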

More usage can be found in the official documentation.

Experience

Writing crawlers in Python is still more comfortable.