Abstract

We explore the robustness and usability of moving-image object recognition (video) captchas, designing and implementing automated attacks based on computer vision techniques including object tracking and identification. The attack approach is suitable for broad classes of moving-image captchas involving rigid objects. We first present an attack that defeats instances of such a captcha (NuCaptcha) representing the commercial state-of-the-art, involving dynamic text strings. We then consider simple captcha design modifications to reduce attack efficacy (e.g., longer text strings, characters more closely overlapped, semi-transparent characters). We implement the modified captchas and tune design parameters to ranges allowing us to test whether designs modified for greater robustness maintain usability. We thoroughly test the modified captcha designs for usability in a lab-based user study allowing direct observation of participants. We find the modified captchas fail to offer viable usability, even when captcha strength is reduced below acceptable targets, signaling that the modified designs are not viable. We further implement and test another variant of moving text strings using the known emergent images idea. In contrast, we find it is resilient to our attacks and of similar usability to commercial NuCaptcha captchas on performance measures. We explain why fundamental elements of the emergent images design resist our current attack approach. Our work suggests the emergent image variation as a promising direction for future exploration, being based on an underlying AI problem considered hard by the computer vision community.

Paper

Yi Xu, Gerardo Reynaga, Sonia Chiasson, Jan-Michael Frahm, Fabian Monrose, and Paul Van Oorschot. Security and Usability Challenges of Moving-Object CAPTCHAs: Decoding Codewords in Motion. In Proceedings of the 21st USENIX Security Symposium, 2012. [pdf]

Studied Countermeasures

To highlight some of the tensions that exist between the security and usability of moving-image object recognition (MIOR) captchas, we explore a series of possible mitigations to our attacks. To do so, we generate video captchas that closely mimic those from NuCaptcha. In particular, we built a framework for generating videos in which characters move across a background scene with constant velocity in the horizontal direction and oscillate harmonically up and down. As in NuCaptcha, the characters of the codeword also rotate. Our framework is tunable, and all parameters are set to defaults calculated from the original NuCaptcha videos.
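As a concrete illustration, the motion model above can be sketched as follows. The parameter names and default values here are illustrative placeholders, not the actual defaults calibrated from NuCaptcha's videos:

```python
import numpy as np

# Sketch of the character motion model: constant horizontal velocity,
# harmonic vertical oscillation, and harmonic rotation. All parameter
# values below are illustrative, not NuCaptcha's calibrated defaults.
def character_pose(t, x0=0.0, vx=2.0, y0=60.0, amp=8.0, freq=0.5, rot_amp=15.0):
    """Return (x, y, angle_deg) of one codeword character at frame t (30 fps)."""
    phase = 2 * np.pi * freq * t / 30.0
    x = x0 + vx * t                                  # constant horizontal velocity
    y = y0 + amp * np.sin(phase)                     # harmonic up/down motion
    angle = rot_amp * np.sin(phase + np.pi / 4)      # harmonic rotation
    return x, y, angle

# Example: pose of a character 15 frames (half a second) into the video.
x, y, angle = character_pose(15)
```

Each character gets its own phase offset in practice, so the codeword letters bob and rotate out of sync.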

Standard Captcha:

[Video demo: Adobe Flash content no longer available]

Given this framework, we explore the following defenses:

1. Extended: the codeword consists of m > 3 random characters moving across a dynamic scene.

Static:

[Video demo: Adobe Flash content no longer available]

Moving:

[Video demo: Adobe Flash content no longer available]

2. Overlapping: identical to the Standard case (i.e., m = 3), except that the characters are more closely overlapped.

[Video demo: Adobe Flash content no longer available]

3. Semi-Transparent: identical to the Standard case, except that the characters are semi-transparent.

[Video demos: Adobe Flash content no longer available]

4. Emerging objects: a different MIOR captcha in which the codeword is 3 characters, created using the "Emerging Images" concept.

[Video demos: Adobe Flash content no longer available]

 

Summary

We analyzed our attack on a random sample of 500 captchas. To determine an appropriate training set size, we varied the number of videos as well as the number of extracted frames and examined the recognition rate. The results (not shown) indicate that while accuracy steadily increased with more training videos (e.g., 100 versus 50), recognition rates increased only marginally beyond 1500 samples. We therefore used 300 video sequences for training (i.e., 900 codeword characters), and for each detected character we selected 2 frames containing it (yielding 1800 training patches in total). We used dense SIFT descriptors as the features for each patch (i.e., a SIFT descriptor is extracted at each pixel of the patch, and the descriptors are concatenated to form a feature vector). These feature vectors were used to train the neural network. For testing, we chose a different set of 200 captchas, distributed almost evenly among the 19 backgrounds. The accuracy of the attack is given below.

           Single Letter Accuracy   3-Letter Accuracy
Accuracy   90.3% (542/600)          77.0% (154/200)
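To make the feature step concrete, the sketch below computes a dense per-pixel descriptor over a character patch and concatenates the results into one feature vector, as described above. An 8-bin gradient-orientation histogram stands in for the 128-D SIFT descriptor to keep the example short; it is not the descriptor used in the attack:

```python
import numpy as np

# Stand-in for the dense feature extraction step: compute a small
# gradient-orientation histogram at every pixel of a character patch and
# concatenate them into one long feature vector. (The attack uses 128-D
# dense SIFT descriptors; an 8-bin histogram keeps this sketch short.)
def dense_features(patch, bins=8):
    gy, gx = np.gradient(patch.astype(float))        # per-pixel image gradients
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)      # orientation in [0, 2*pi)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    h, w = patch.shape
    feats = np.zeros((h, w, bins))
    # Vote each pixel's gradient magnitude into its orientation bin.
    feats[np.arange(h)[:, None], np.arange(w)[None, :], idx] = mag
    return feats.reshape(-1)                         # concatenated feature vector

# A 16x16 patch yields a 16*16*8 = 2048-dimensional vector; vectors like
# this are what would be fed to the neural-network classifier.
patch = np.random.rand(16, 16)
v = dense_features(patch)
```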

These results indicate that the robustness of these MIOR captchas is far weaker than one would hope. In particular, our automated attacks successfully decode the captchas more than three quarters of the time.

In the course of our pilot studies, it became clear that if the parameters for the Extended, Overlapping, and Semi-Transparent countermeasures are set too stringently (i.e., to defeat automated attacks 99% of the time), the resulting MIOR captchas become exceedingly difficult for humans to solve. Therefore, to better measure the tension between usability and security, we set the parameters for the videos in the subsequent user study to values at which our attacks have a 5% success rate, even though that rate is intolerably high for practical security. Any captcha that proves unusable even at this parametrization is thus entirely unviable. Of the countermeasures studied above, our algorithm breaks the first three with a 5% success rate, while we cannot break the last method (emerging objects) at this point.

Supplemental Information

Not presented in the USENIX Security paper

1. Inverse Background Captcha

We also tested a defense similar in spirit to the semi-transparent codeword letters. It aims to prevent foreground segmentation by choosing, for each pixel of the codeword, the mean value of the background mode (from our two-mode Gaussian mixture background model) that is not currently present. Using the second mode of the learned background model causes our method to assign each such pixel a high likelihood of being background, so these pixels are not detected as foreground. Using the second mode also ensures a minimal separation from the first mode, and hence provides at least minimal contrast to the user (the more diverse the background's appearance, the higher the contrast). We tested two methods of learning the background modes. The first learns a global background model using all frames of the sequence. An example is shown below.

[Video demo: Adobe Flash content no longer available]
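A minimal sketch of the per-pixel selection rule described above: given the two learned mode means at a pixel and the current observation, the codeword pixel is rendered with the mean of the mode that is not currently present (the grayscale values below are illustrative, not values learned from a real video):

```python
# Sketch of the per-pixel rule for the inverse-background defense: render
# the codeword pixel with the mean of the background mode that is NOT
# currently present, so a background model scores it as background while
# still guaranteeing |mu1 - mu2| of contrast. The values used here are
# illustrative grayscale means.
def inverse_background_color(obs, mu1, mu2):
    """obs: current background observation; mu1, mu2: the two learned mode means."""
    # The "present" mode is the one closer to the observation;
    # the codeword pixel takes the other mode's mean.
    return mu2 if abs(obs - mu1) <= abs(obs - mu2) else mu1

# Example: the observation is near mode 1, so the letter pixel is drawn
# with mode 2's mean.
c = inverse_background_color(0.12, mu1=0.1, mu2=0.6)
```

As the background switches between modes, the selected letter color switches too, which is exactly why the letters' appearance varies continuously in the global-model variant.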

In all tested cases this led to video captchas that were not usable, because the letters' appearance varies continuously, preventing the user from perceiving the letters well (user study pending). To overcome this difficulty, we evaluated a second background-learning method, which uses frames close in time to learn the background model, shown below:

[Video demo: Adobe Flash content no longer available]

 

This method leads to more usable captchas when the background video exhibits significant color change, as in the above video. In those cases, however, the resulting video captcha can be attacked by standard methods, as seen in the image below.

 

Otherwise, the resulting video captchas are as unusable as those generated with the globally learned background model.

 

2. Random Fonts

Recently, NuCaptcha proposed using a variety of fonts for the codeword to prevent attacks. Accordingly, we evaluated a scheme using ten different fonts. This scheme has not yet been evaluated in a user study, but we generally expect its usability to be on par with the current NuCaptcha scheme. To create the test data for the experimental evaluation, we selected ten different fonts as possible fonts for each codeword letter (see the figure below for an example letter rendered in each of the fonts).

 

We then assemble the test captchas using the synthesis framework described above, randomly selecting the font for each letter of the codeword. We modify our attack by training one classifier per font, yielding 10 classifiers for codeword-letter recognition. All classifiers are embedded into the attack by running each of them on every letter and selecting the highest response across symbols (the 20 possible symbols) and fonts (the ten fonts possible per letter) as the recognized codeword letter. We tested the modified attack on 100 test video sequences (distinct from the training videos), each containing a 3-letter codeword with fonts chosen at random from the ten. An example frame of the resulting captcha is shown below.

[Video demo: Adobe Flash content no longer available]
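The per-letter decision rule above can be sketched as follows: run every font-specific classifier on the letter and take the (symbol, font) pair with the highest response. The symbol set and the score matrix below are assumed stand-ins; in the attack, each row of scores would come from a trained per-font neural-network classifier:

```python
import numpy as np

# Sketch of the multi-font decision rule: run all font-specific classifiers
# on a letter and pick the (symbol, font) pair with the highest response.
# The 20-symbol alphabet is an assumed stand-in, and the score matrix is a
# dummy; real responses would come from the trained per-font classifiers.
SYMBOLS = list("ABCDEFGHJKLMNPRSTUVW")   # assumed 20-symbol alphabet

def classify_letter(scores):
    """scores: (n_fonts, n_symbols) array of classifier responses."""
    font_idx, sym_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return SYMBOLS[sym_idx], font_idx

# Dummy responses: the classifier for font 3 fires strongly on symbol index 7.
scores = np.zeros((10, 20))
scores[3, 7] = 1.0
letter, font = classify_letter(scores)
```

Taking the maximum over both axes means the attacker never needs to know which font was used; the font estimate falls out of the same argmax.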

The results of our test are summarized in the table below. Using different fonts only slightly lowers the percentage of successful attacks and clearly does not prevent our attack from succeeding.

           Single Letter Accuracy   3-Letter Accuracy
Accuracy   85% (255/300)            70% (70/100)

 

3. Multiple colors

To analyze the behavior of our attack against captchas whose codeword letters have different colors or textures, we evaluated its performance when the letters are given random colors or a texture consisting of multiple colors. Two such test videos are shown here:

[Video demos (left and right): Adobe Flash content no longer available]

In the left captcha, each letter has a different color; in the right one, each letter is textured with rainbow-like colors. In both cases our attack remains successful without any modification, because the optical-flow-based foreground detection still provides a valid segmentation of the letters. We tested both color variations on 100 samples, obtaining the following results:

Type of Defense                      Single Letter Accuracy   3-Letter Accuracy
Different color per letter (left)    87.67% (263/300)         74% (74/100)
Rainbow-like pattern (right)         87.67% (263/300)         69% (69/100)
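The optical-flow-based foreground detection mentioned above can be sketched as a simple motion-magnitude threshold, which is why letter color is irrelevant to the segmentation. The flow field and threshold below are toy values, not those of the actual attack:

```python
import numpy as np

# Sketch of motion-based foreground detection: pixels whose optical-flow
# magnitude exceeds a threshold are marked as letter candidates, regardless
# of their color. The flow field and threshold are toy values; the attack
# estimates dense optical flow between consecutive frames.
def motion_mask(flow, thresh=1.0):
    """flow: (H, W, 2) per-pixel displacement; returns a boolean foreground mask."""
    return np.linalg.norm(flow, axis=2) > thresh

# Toy scene: a 4x4 "letter" region moving 2 px right inside a static background.
flow = np.zeros((10, 10, 2))
flow[3:7, 3:7, 0] = 2.0
mask = motion_mask(flow)   # True exactly on the moving region
```

Because the mask depends only on displacement, recoloring or texturing the letters leaves the segmentation unchanged.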

 

In summary, we find that none of the additional strategies for curtailing our attacks (i.e., the inverted-background, textured/colored, or varied-font extensions) offers a viable path forward.