Decoding Weibo CAPTCHA in Python
The Problem
Also this
They seem to be hard
Not really... let's break it
Wait, before that
Let's review some common techniques
Steps
- Removing Noise
- Separating Characters
- Extracting Features
- Classifying Features
An example
After cleaning
After splitting
After classifying
8452
It looks simple
But it isn't...
... until I explain the details
Cleaning Noise
[(30, 0), (21, 14), (7, 15), (1, 17), (1, 19), (33, 22), (3, 23), (1, 25),
(2, 26), (2, 28), (1, 29), (5, 30), (4, 32), (1, 34), (3, 36), (15, 38), (10, 39), ...,
(1, 245), (2, 246), (5, 247), (4, 249), (4, 251), (5, 253), (2190, 255)]
>>> # 255-> White, 0-> Black
>>> # If we remove all the "whiter" colors
>>> im = im.point(lambda x: 255 if x>128 else x)
>>> # see how this policy works
>>> im.show()
↓
>>> # the new color distribution?
>>> im.getcolors()
...
>>> # new attempts
...
...
After a few rounds of guess and try
...
We got this
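def clean(im):
    im = im.convert('L')  # grayscale: 0 is black, 255 is white
    # turn light pixels (> 128) and pure black pixels (0) white
    im = im.point(lambda x: 255 if x > 128 or x == 0 else x)
    # turn every remaining non-white pixel black
    im = im.point(lambda x: 0 if x < 255 else 255)
    return im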
It's surprisingly simple, isn't it?
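To see what the two `point` calls do, here is a minimal sketch that runs the same function on a tiny synthetic image. The 4x1 test image and its pixel values are made up for illustration; judging from the code, pure black (0) is treated as noise along with the light pixels, and only the dark-gray range survives as ink.

```python
from PIL import Image

def clean(im):
    """Binarize a CAPTCHA image: kept pixels become black, the rest white."""
    im = im.convert('L')                                      # grayscale, 0..255
    im = im.point(lambda x: 255 if x > 128 or x == 0 else x)  # light and pure-black pixels -> white
    im = im.point(lambda x: 0 if x < 255 else 255)            # whatever is left -> black
    return im

# Hypothetical 4x1 test image: pure black, dark gray, light gray, white.
im = Image.frombytes('L', (4, 1), bytes([0, 100, 200, 255]))

cleaned = clean(im)
pixels = list(cleaned.tobytes())   # one byte per pixel in 'L' mode
print(pixels)  # [255, 0, 255, 255]
```

Only the dark-gray pixel (100) ends up black; everything else is white.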
→
Extracting Characters
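The slides show no code for this step. One common approach, assumed here and not necessarily the one used in the talk, is a vertical projection: count the ink pixels in each column and cut wherever a column is empty. The sketch below works on a plain list of 0/1 rows (1 = ink); `split_columns` and the sample bitmap are hypothetical.

```python
def split_columns(bitmap):
    """Return (start, end) column ranges that contain ink, end exclusive.

    bitmap is a list of rows; each cell is 1 for ink and 0 for background.
    """
    width = len(bitmap[0])
    # Vertical projection: number of ink pixels in each column.
    ink = [sum(row[x] for row in bitmap) for x in range(width)]

    spans, start = [], None
    for x, count in enumerate(ink):
        if count and start is None:
            start = x                    # a character begins
        elif not count and start is not None:
            spans.append((start, x))     # a blank column ends it
            start = None
    if start is not None:                # character touching the right edge
        spans.append((start, width))
    return spans

bitmap = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 0],
]
spans = split_columns(bitmap)
print(spans)  # [(1, 3), (5, 7)]
```

This only works when characters are separated by at least one blank column; touching or overlapping characters need something smarter.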
Extracting Features
- Does it contain the word "sex"?
- What about "buy"?
- Does it have links in it?
- How many words are in it?
- ...
- How many pixels in the image are black?
- How many white areas are in the image?
- Are there curves in the image, and roughly where?
- ...
def im2array(im):
    # 1 for every non-white pixel, 0 for white; bytearray yields integers in Python 2 and 3
    return [int(x != 255) for x in bytearray(im.tobytes())]
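A short Python 3 sketch of the same idea on a tiny synthetic image, together with one of the simple features from the list above (the number of black pixels). The 3x2 image is made up for illustration.

```python
from PIL import Image

def im2array(im):
    """Flatten a cleaned 'L' image into 0/1 values: 1 = ink, 0 = white."""
    # In Python 3, iterating over bytes yields integers, so compare with 255.
    return [int(b != 255) for b in im.tobytes()]

# Hypothetical 3x2 cleaned image (0 = black, 255 = white).
im = Image.frombytes('L', (3, 2), bytes([0, 255, 0,
                                         255, 255, 0]))
vector = im2array(im)
black_pixels = sum(vector)   # "how many pixels in the image are black?"

print(vector)        # [1, 0, 1, 0, 0, 1]
print(black_pixels)  # 3
```

The flat 0/1 list can serve directly as the feature vector for a classifier.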
Classifying
- We have features:
  - a list of feature vectors
  - (vec1, vec2, vec3, ...)
- We have targets:
  - a list of target values
  - (tar1, tar2, tar3, ...)
- We want to predict:
  - given a new feature vector
  - which target is it most like?
  - predict(vector)
>>> # let's train an XOR operator
>>> import sklearn.svm
>>> clf = sklearn.svm.SVC()
>>> data = [(1, 1),
... (1, 0),
... (0, 1),
... (0, 0)]
>>> targets = [0, 1, 1, 0]
>>> clf.fit(data, targets)
>>> clf.predict([(0, 0)])[0]
0
- Training:
  - data: integer arrays
  - target: array of integers 0-35 (representing [0-9A-Z])
  - clf.fit(data, target)
- Predicting:
  - array = preprocess the character image into an array
  - code = clf.predict(array)
  - char = look up code in [0-9A-Z]
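The training and predicting steps above can be sketched end to end as follows. The three 6-bit "character" patterns are invented toy data, and `ALPHABET` and `decode` are hypothetical names; a linear kernel is chosen here only to keep the toy example predictable, not because the talk specifies it.

```python
import string
from sklearn.svm import SVC

ALPHABET = string.digits + string.ascii_uppercase   # index 0-35 -> '0'-'9', 'A'-'Z'

# Hypothetical training set: one 6-bit pattern per character.
samples = {
    '0': [1, 1, 1, 1, 1, 1],
    '7': [1, 1, 0, 0, 0, 0],
    'A': [0, 0, 0, 0, 1, 1],
}
data = list(samples.values())
target = [ALPHABET.index(ch) for ch in samples]      # [0, 7, 10]

clf = SVC(kernel='linear')
clf.fit(data, target)

def decode(array):
    """Predict the character for one feature vector."""
    code = clf.predict([array])[0]   # predict expects a list of vectors
    return ALPHABET[int(code)]

print(decode([1, 1, 0, 0, 0, 0]))  # 7
```

In a real run, `data` would hold the arrays produced from labeled character images, one row per image.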
- Bayes
- Decision Tree
- SVM
- kNN
- MLP (NN)
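Of the classifiers listed, kNN is the easiest to write by hand. Below is a minimal sketch of the 1-nearest-neighbor case using Hamming distance on 0/1 vectors. The function names and the toy training data are hypothetical, and the distance metric is an assumption; the slides do not say which one was used.

```python
def hamming(a, b):
    """Number of positions where two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(train, vector):
    """Return the label of the training vector closest to `vector`."""
    _, label = min(train, key=lambda pair: hamming(pair[0], vector))
    return label

# Hypothetical training data: (feature vector, character) pairs.
train = [
    ([1, 1, 1, 1, 1, 1], '0'),
    ([1, 1, 0, 0, 0, 0], '7'),
    ([0, 0, 0, 0, 1, 1], 'A'),
]
noisy_seven = [1, 1, 0, 1, 0, 0]   # a '7' with one flipped pixel
print(predict_1nn(train, noisy_seven))  # 7
```

The noisy vector differs from the stored '7' in one position, from '0' in three, and from 'A' in five, so '7' wins.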
Q&A?
By jingchaohu
Explains how to decode CAPTCHAs using Python