A Crawler for a Reply-Number Giveaway on V2EX

Foreword

Last night I saw a book giveaway thread on V2EX: the first three people whose replied number is closest to the last two digits of the Shanghai Composite Index win the book.

跟帖回复任意一个两位数(例如 37 )。取 2016 年 10 月 20 日当日收盘时的上证指数的十位和个位数字(比如,如果是 3789 ,那就是“ 89 ”),最接近的前三位同学,将获得《 Python Web 开发实战》一本。

which translates roughly to:

Reply with any two-digit number (for example, 37). Take the tens and ones digits of the Shanghai Composite Index at the close of trading on October 20, 2016 (for example, if it closes at 3789, that is “89”). The first three people with the closest number will each receive a copy of 《 Python Web 开发实战》.
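The rule can be sketched in a few lines. This is only an illustration: the function names `last_two_digits` and `closest_three` are hypothetical, and breaking ties by reply order is my assumption, since the announcement does not say how ties are resolved.

```python
def last_two_digits(close):
    """Return the tens and ones digits of the closing index, ignoring decimals."""
    return int(close) % 100


def closest_three(guesses, target):
    """Return the three (user, guess) pairs nearest to target.

    guesses is a list of (user, guess) pairs in reply order. sorted() is
    stable, so earlier replies win ties (an assumption, not a stated rule).
    """
    return sorted(guesses, key=lambda pair: abs(pair[1] - target))[:3]


target = last_two_digits(3789)
# Made-up replies, only to exercise the function
sample = [("a", 37), ("b", 90), ("c", 89), ("d", 88), ("e", 12)]
winners = closest_three(sample, target)
print(target, winners)
```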

But by that time the thread already had more than 900 replies. If a number has already been picked by three or more people, picking it again is pointless. So I came up with the idea of writing a crawler to count the replies.

Effect

Below are the results, sorted first by number of occurrences and then by the guessed number.

Counter({66: 35,
23: 32,
77: 22,
42: 20,
18: 19,
24: 19,
21: 18,
33: 18,
56: 18,
13: 17,
27: 16,
28: 15,
55: 15,
69: 15,
12: 14,
25: 14,
32: 14,
52: 14,
78: 14,
17: 13,
51: 13,
63: 13,
88: 13,
89: 13,
62: 12,
64: 12,
11: 11,
16: 11,
19: 11,
22: 11,
47: 11,
57: 11,
99: 11,
15: 10,
29: 10,
31: 10,
37: 10,
39: 10,
44: 10,
45: 10,
53: 10,
54: 10,
65: 10,
74: 10,
10: 9,
26: 9,
35: 9,
50: 9,
67: 9,
86: 9,
34: 8,
36: 8,
41: 8,
43: 8,
48: 8,
73: 8,
76: 8,
87: 8,
0: 7,
7: 7,
38: 7,
46: 7,
71: 7,
75: 7,
81: 7,
20: 6,
49: 6,
68: 6,
85: 6,
98: 6,
1: 5,
58: 5,
61: 5,
80: 5,
83: 5,
93: 5,
3: 4,
30: 4,
40: 4,
60: 4,
79: 4,
5: 3,
14: 3,
59: 3,
70: 3,
72: 3,
90: 3,
92: 3,
95: 3,
2: 2,
8: 2,
9: 2,
82: 2,
84: 2,
94: 2,
97: 2,
4: 1,
6: 1,
91: 1,
96: 1})
{0: 7,
1: 5,
2: 2,
3: 4,
4: 1,
5: 3,
6: 1,
7: 7,
8: 2,
9: 2,
10: 9,
11: 11,
12: 14,
13: 17,
14: 3,
15: 10,
16: 11,
17: 13,
18: 19,
19: 11,
20: 6,
21: 18,
22: 11,
23: 32,
24: 19,
25: 14,
26: 9,
27: 16,
28: 15,
29: 10,
30: 4,
31: 10,
32: 14,
33: 18,
34: 8,
35: 9,
36: 8,
37: 10,
38: 7,
39: 10,
40: 4,
41: 8,
42: 20,
43: 8,
44: 10,
45: 10,
46: 7,
47: 11,
48: 8,
49: 6,
50: 9,
51: 13,
52: 14,
53: 10,
54: 10,
55: 15,
56: 18,
57: 11,
58: 5,
59: 3,
60: 4,
61: 5,
62: 12,
63: 13,
64: 12,
65: 10,
66: 35,
67: 9,
68: 6,
69: 15,
70: 3,
71: 7,
72: 3,
73: 8,
74: 10,
75: 7,
76: 8,
77: 22,
78: 14,
79: 4,
80: 5,
81: 7,
82: 2,
83: 5,
84: 2,
85: 6,
86: 9,
87: 8,
88: 13,
89: 13,
90: 3,
91: 1,
92: 3,
93: 5,
94: 2,
95: 3,
96: 1,
97: 2,
98: 6,
99: 11}
[Finished in 0.3s]

Result

It can be seen that only the numbers 2, 4, 6, 8, 9, 82, 84, 91, 94, 96 and 97 have been mentioned fewer than 3 times. In other words, if I join the draw now, choosing any other number is pointless.
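A list like this can be read off the counts directly. A minimal sketch, assuming the counts are held in a `collections.Counter`; the helper name `still_open` and the toy data are hypothetical, not the real thread:

```python
from collections import Counter


def still_open(counts, limit=3):
    """Return the numbers 0-99 that fewer than `limit` users have picked."""
    # A Counter returns 0 for keys it has never seen, so unpicked numbers qualify.
    return [n for n in range(100) if counts[n] < limit]


# Toy data: 66 picked four times, 23 three times, 82 twice.
toy = Counter([66, 66, 66, 66, 23, 23, 23, 82, 82])
open_numbers = still_open(toy)
print(66 in open_numbers, 23 in open_numbers, 82 in open_numbers)
```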

Principle

The giveaway page can be viewed without logging in. Each page of comments is fetched with a GET request of the form url?p=number.

v2ex_pythonbook_1to7.png

Because only each person's first numeric reply counts, the username and content of every reply need to be saved so that duplicates can be removed.

Except on a few floors, the first number that appears in a reply is the number the user guessed. So I use a regular expression with findall and take the first match.

So the process is: first crawl all the pages, then use XPath to pick out the usernames and reply contents, then match the numbers with the regular expression, and finally count them.
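The deduplication idea can be tried on its own, without any crawling. The sketch below is an illustration under assumptions: the helper name `first_guesses` and the sample replies are made up, and `re.search` is used in place of findall since only the first match is needed.

```python
import re

NUM_RE = re.compile(r"[0-9]+")


def first_guesses(replies):
    """Map each user to the first valid guess (0-99) among their replies.

    replies is a list of (user, text) pairs in floor order.
    """
    guesses = {}
    for user, text in replies:
        match = NUM_RE.search(text)
        if match is None:
            continue  # no number in this reply
        value = int(match.group())
        if value > 99 or user in guesses:
            continue  # not a two-digit guess, or this user already counted
        guesses[user] = value
    return guesses


sample = [
    ("alice", "37, thanks!"),
    ("bob", "666666"),
    ("alice", "changed my mind, 42"),
    ("bob", "I pick 8"),
    ("carol", "good luck everyone"),
]
print(first_guesses(sample))
```

Here a reply such as "666666" is skipped without marking the user as counted, so a later valid guess from the same user is still recorded.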

Code

First, download the pages to local files so that they are easy to debug.

# Download the pages
import requests

s = requests.session()
for num in range(1, 11):
    url = 'https://www.v2ex.com/t/313225?p=' + str(num)
    r = s.get(url)
    # Write as UTF-8 so the Chinese replies are saved correctly on any platform
    with open('page' + str(num) + '.html', 'w', encoding='utf-8') as f:
        f.write(r.text)
    print('Download page', num, 'successful!')

Then, extract the usernames and comments from the pages, deduplicate by username, and apply the regular expression.

import re
from lxml import etree

name_list = []  # usernames that have already been counted
num_list = []   # the first guess from each user
# Use [0-9]+ instead of [0-9]{2} because some people omit the leading 0 for numbers below 10
num_re = re.compile('[0-9]+')

for num in range(1, 11):
    with open('page' + str(num) + '.html', 'rb') as f:
        r = f.read()
    page = etree.HTML(r)
    # The XPath copied from Chrome cannot be used directly; it has to be adjusted by hand.
    # The tbody in the path must be left out here, but it has to be added when debugging with XPath Helper.
    user_name = page.xpath(u'//*[@id="Main"]/div[4]//table/tr/td[3]/strong/a')
    user_comment = page.xpath(u'//*[@id="Main"]/div[4]//table/tr/td[3]/div[4]')

    for (name, comment) in zip(user_name, user_comment):
        # comment.text is None when the reply starts with a child element
        comment_num = num_re.findall(comment.text or '')
        # An empty list is falsy in Python, so this checks whether anything matched
        if comment_num:
            comment_num = comment_num[0]
        else:
            # 999 is just a marker meaning "no number found"
            comment_num = 999
        # Skip users already counted, and anything of 100 or more:
        # the 999 marker, replies like "666666", and 100 itself (not a two-digit guess)
        if name.text in name_list or int(comment_num) >= 100:
            pass
        else:
            name_list.append(name.text)
            # print(name.text, '+', int(comment_num))
            num_list.append(int(comment_num))

Then sort the results. There are two sorts here.

import pprint
from collections import Counter

# Sort by number of occurrences
result = Counter(num_list)
pprint.pprint(result)

# Sort by guessed number
result = [0 for x in range(0, 100)]
for i in num_list:
    result[i] += 1
pprint.pprint(dict(zip(range(0, 100), result)))
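As an aside, both orderings can also be derived from a single `Counter`. This is an alternative sketch, not the code used for the output above, and the `num_list` here is toy data standing in for the crawled guesses:

```python
from collections import Counter

num_list = [66, 23, 66, 7, 23, 66]  # toy data, not the real thread

counts = Counter(num_list)
by_count = counts.most_common()     # highest count first
by_number = sorted(counts.items())  # smallest guess first
print(by_count)
print(by_number)
```

One difference from the list-based version: numbers that nobody guessed do not appear at all, instead of showing up with a count of 0.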

See here for the complete code

Conclusion

After looking at the Shanghai Composite Index over the past week, I found that its last two digits stayed around 50±15. Among the numbers still worth picking, the closest to that range was 82, so I chose 82.
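That choice can be checked with a short calculation. The sketch assumes that 50±15 means the last two digits fall between 35 and 65; the helper name `distance_to_range` is hypothetical.

```python
def distance_to_range(n, low, high):
    """Return 0 if n lies in [low, high], else the gap to the nearer bound."""
    if n < low:
        return low - n
    if n > high:
        return n - high
    return 0


# Numbers with fewer than three mentions, taken from the Result section.
candidates = [2, 4, 6, 8, 9, 82, 84, 91, 94, 96, 97]
low, high = 50 - 15, 50 + 15
best = min(candidates, key=lambda n: distance_to_range(n, low, high))
print(best, distance_to_range(best, low, high))
```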

———————————update———————————

The Shanghai Composite Index closed with 84 as its last two digits. A near miss…