Spider Trip of HeiBanKe

Python Challenge is a classic level-by-level puzzle site that I played a long time ago. I recently came across the HeiBanKe (黑板客) crawler challenge, so this post records how I cleared its levels.

Walkthrough Record

Mission One

Website links are as follows:

http://www.heibanke.com/lesson/crawler_ex00/

The contents are as follows:

这里是黑板客爬虫闯关的第一关(This is the first level of the HeiBanKe crawler challenge.)

你需要在网址后输入数字49163(You need to enter the number 49163 after the address.)

Entering the number after the URL leads to a page with similar content, so the approach is clear: request the link, extract the number, and use it to visit the next link. Inspecting the page source shows that the number sits in body -> h3. The page is fetched with requests and parsed with Beautiful Soup, and the digits are then extracted with a regular expression. The code is as follows:

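#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re

import requests
from bs4 import BeautifulSoup as bs

BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'
pat = re.compile(r'\d')
url_number = ''

while True:
    # Request the current page; the first request uses the bare base URL.
    r = requests.get(BASE_URL + url_number)
    soup = bs(r.text, "lxml")
    # The hint that contains the next number is in the first <h3> under <body>.
    url_line = soup.body.h3.text
    print(url_line)
    # Join the digits into the next number; the final page has no digits,
    # which ends the loop.
    url_number = ''.join(pat.findall(url_line))
    if url_number == "":
        break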
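
The extraction step can be checked offline, without requesting the site. The following is a minimal sketch, assuming the hint text looks like the sample lines in the log of this post; `next_number` is a hypothetical helper name that does not appear in the original script.

```python
import re

NUMBER_PAT = re.compile(r'\d+')


def next_number(hint_text):
    """Return the first run of digits in the hint text, or '' if there is none."""
    match = NUMBER_PAT.search(hint_text)
    return match.group(0) if match else ''


# Sample hint strings copied from the log in this post.
print(next_number('下一个你需要输入的数字是26470.'))
print(next_number('恭喜你,你找到了答案.继续你的爬虫之旅吧'))
```

The first call prints the number to append to the URL, and the second prints an empty string, which is the condition the loop uses to stop.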

Part of the running log is as follows (intermediate repeats omitted):

你需要在网址后输入数字49163(You need to enter the number 49163 after the address.)

下一个你需要输入的数字是26470.(The next number you need to input is 26470.)

…………

下一个你需要输入的数字是83105. 还有一大波数字马上就要到来…(The next number you need to enter is 83105. A big wave of numbers is about to arrive…)

…………

下一个你需要输入的数字是43396. 还有一大波数字马上就要到来…(The next number you need to enter is 43396. A big wave of numbers is about to arrive…)

下一个你需要输入的数字是39642. 老实告诉你吧, 这样的数字还有上百个(The next number you need to enter is 39642. To tell you the truth, there are hundreds more numbers like this.)

…………

下一个你需要输入的数字是72996. 老实告诉你吧, 这样的数字还有上百个(The next number you need to enter is 72996. To tell you the truth, there are hundreds more numbers like this.)

恭喜你,你找到了答案.继续你的爬虫之旅吧(Congratulations, you found the answer. Go on with your crawler trip.)

After entering 72996, you can see the link to the second level.

Mission Two

The second level asks for any username and a numeric password no larger than 30, submitted by clicking a button. Clearly, the first level uses GET and the second uses POST.
Inspect the submitted URL and the form fields in the Network panel of the browser developer tools (F12), and then construct the POST request.
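
To make the "form content" concrete, here is a small sketch of what the URL-encoded body of such a POST looks like. It assumes the form has only the two fields used in this post (`username` and `password`); the values are just examples, and the standard library is used so the snippet runs without network access.

```python
from urllib.parse import urlencode

# Field names are the ones used in this post; the values are examples.
form = {'username': 'a', 'password': 20}

# This is the body sent with the content type
# application/x-www-form-urlencoded.
body = urlencode(form)
print(body)  # username=a&password=20
```

Passing the same dictionary as `data=` to `requests.post` produces this kind of body, so it does not need to be encoded by hand.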

The code is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import requests
from bs4 import BeautifulSoup as bs

URL = 'http://www.heibanke.com/lesson/crawler_ex01/'

# The password is a number no larger than 30, so try every value from 0 to 30.
for password_number in range(31):
    data = {
        'username': 'a',  # any username works
        'password': password_number,
    }

    r = requests.post(URL, data=data)
    soup = bs(r.text, "lxml")
    # The result message is in the <h3> inside the first two nested <div>s.
    url_line = soup.body.div.div.h3.text
    print("password is " + str(password_number) + " and " + url_line)
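
The script above keeps trying every value even after the correct password has been found. A possible variation is to stop at the first success. The sketch below is only an illustration under assumptions: `find_password` and `fake_submit` are hypothetical names, the success check relies on the word 恭喜 ("congratulations") seen in the log of this post, and `fake_submit` stands in for the real POST request so that the logic can be run offline.

```python
def find_password(submit, candidates, success_marker='恭喜'):
    """Try each candidate and return the first one whose response
    message contains the success marker, or None if none does."""
    for candidate in candidates:
        message = submit(candidate)
        if success_marker in message:
            return candidate
    return None


def fake_submit(password):
    # Stand-in for the real POST request (an assumption for offline use);
    # it returns the two messages seen in the log of this post.
    if password == 20:
        return '恭喜! 用户a成功闯关, 继续你的爬虫之旅吧'
    return '您输入的密码错误, 请重新输入'


print(find_password(fake_submit, range(31)))  # 20
```

In a real run, `submit` would wrap the `requests.post` call and return the text of the `<h3>` element, as in the script above.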

Some of the running logs are as follows:
Password incorrect log:

password is 16 and 您输入的密码错误, 请重新输入(The password you entered is wrong. Please retype it.)

password is 17 and 您输入的密码错误, 请重新输入(The password you entered is wrong. Please retype it.)

Password correct log:

password is 20 and 恭喜! 用户a成功闯关, 继续你的爬虫之旅吧(Congratulations! User a cleared the level. Continue your crawler journey.)

So enter any username with 20 as the password to move on to Mission Three.
Note that Mission Three requires you to log in first; otherwise you are redirected to the login page and, from there, to the registration page.
Registration does not verify the email address, so you can sign up without providing real information.