Trial Experience with the Archer Crawler
I saw advertisements for the Archer crawler on several websites. It is easy to get started with, but extending its functionality is not as comfortable as writing crawlers in Python.
Steps
- Register
- Select or develop a spider in the spider market
- Set the related parameters
- Test and start the spider
- Export the data
Archer crawlers are written in JavaScript, and content is parsed with XPath, regular expressions, and JsonPath. IP proxies, JS loading, and other features are built in, and the official documentation is fairly detailed. The flow chart of crawler operation is as follows:
(Update: the flow-chart image has expired.)

The process of running a spider is clear.
The structure of a crawler is as follows:
```javascript
var configs = {
    // ...
};
```
The crawler's parameters are defined in `configs`, and a new `Crawler` object is then created and started. `domains` filters out unrelated URLs; `scanUrls` lists the crawler's entry pages; `contentUrlRegexes` holds the regular-expression filters for content pages; `helperUrlRegexes` holds the regular-expression filters for list pages; and `fields` describes the data to be parsed out of content pages and saved. A field's `selector` defaults to an XPath rule but can also be specified as a regular expression; `required` means the item cannot be empty; `repeated` means the item occurs more than once. If you use XPath Helper to inspect elements matched by an XPath rule, multiple elements may be displayed, but the crawler will only grab the first match, so you need to specify `repeated = true` to capture them all.
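
Putting these options together, here is a minimal sketch of what a complete `configs` object can look like. The concrete domains, URLs, regexes, and XPath rules below are hypothetical placeholders, not values from the original listing:

```javascript
// A minimal sketch of an Archer crawler config; all concrete values
// (domains, URLs, regexes, XPath rules) are hypothetical placeholders.
var configs = {
    domains: ["www.example.com"],                  // only URLs under these domains are followed
    scanUrls: ["http://www.example.com/list/1"],   // entry page(s) the crawl starts from
    contentUrlRegexes: ["http://www\\.example\\.com/article/\\d+"], // content pages to extract from
    helperUrlRegexes: ["http://www\\.example\\.com/list/\\d+"],     // list pages to scan for links
    fields: [
        {
            name: "title",
            selector: "//h1",        // selector defaults to an XPath rule
            required: true           // this item must not be empty
        },
        {
            name: "content",
            selector: "//div[@class='body']//p//text()",
            repeated: true           // the XPath matches more than one node
        }
    ]
};

// Start the crawl by constructing a Crawler from the configs.
var crawler = new Crawler(configs);
crawler.start();
```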
Here is a crawler for the Acfun article section that I wrote this morning:
```javascript
var configs = {
    // ...
};
```
`enableJS` is turned on because the view and upvote counts are loaded by JavaScript. The `content` field uses `//p//text()`, because some pages wrap the body in an extra layer of `div`, and the content is mixed across `span` and `p` tags.
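
The original listing did not survive, so here is a hedged reconstruction of the parts described above. Only `enableJS` and the `//p//text()` selector come from the notes; the Acfun URLs, regexes, and the `title` selector are assumptions:

```javascript
// Hedged reconstruction of the Acfun crawler described above.
// The entry URL, regexes, and title selector are assumptions; only
// enableJS and the //p//text() content selector come from the notes.
var configs = {
    domains: ["www.acfun.cn"],
    scanUrls: ["http://www.acfun.cn/v/list110/1.htm"],            // hypothetical article-list entry page
    contentUrlRegexes: ["http://www\\.acfun\\.cn/a/ac\\d+"],       // hypothetical content-page pattern
    helperUrlRegexes: ["http://www\\.acfun\\.cn/v/list110/\\d+\\.htm"], // hypothetical list-page pattern
    enableJS: true,   // view and upvote counts are rendered by JavaScript
    fields: [
        {
            name: "title",
            selector: "//h1",         // hypothetical title selector
            required: true
        },
        {
            name: "content",
            selector: "//p//text()",  // pulls text even when wrapped in extra div/span layers
            repeated: true
        }
    ]
};

var crawler = new Crawler(configs);
crawler.start();
```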
More usage can be found in the official documentation.
Experience
Writing spiders in Python is still more comfortable.