Author(s): Daksh Trehan
Machine Learning, Cybersecurity
And how to crack CAPTCHA using Machine Learning!
Technology never stops amazing me. Sometimes it hooks me on weird-yet-interesting short videos; other times it asks me to prove, ‘I’m a human!’
You book flight tickets, you face a CAPTCHA. You create an account, you face a CAPTCHA. You check your article for plagiarism, CAPTCHA again!
Sometimes, I want to yell, YES! I am a robot. (Well, obviously I am a human.)
Other times, I wonder who gets all the mountains/bikes/fire hydrants/cycles right on the first pass?
What is CAPTCHA? Why do we use it? And is it getting harder?
CAPTCHA stands for Completely Automated Public Turing Test to Tell Computers & Humans Apart.
In the early 21st century, when Yahoo! was booming, the company feared that a day would come when users wrote code to create millions of fake accounts for spamming. To stop the spammers, it needed a mechanism to tell human users apart from automated scripts.
The mechanism had to be a test that computers couldn’t pass but could still grade. I told you technology is weird-yet-interesting.
At that time, with weaker hardware and far less mature Machine Learning tools, computers were poor at recognizing text. We humans, on the other hand, were experts at it, since reading text is what we do all day long.
Luis von Ahn developed CAPTCHA: the computer picks a random piece of text, so it already knows the answer, and shows it as a warped image, making it difficult for other computers to read.
Photo by Marija Zaric on Unsplash
The test helped to differentiate between humans and bots. But it didn’t last long: computers soon learned to read the warped text and kept getting better at it.
The same problem arose again: computers were now smart enough to bypass the test, and with traffic increasing, a more robust mechanism was required.
The answer was reCAPTCHA. It was very similar to CAPTCHA, but instead of one piece of text, it showed two words.
For the first word, the computer knew the answer, but the second word was pulled randomly from an article or book. The assumption was that if a human got the first word right, there was a high chance the second word would be right too!
To check the second word, the same CAPTCHA was sent to many users and the majority answer was taken as correct. But this method soon ran out of steam too, and computers were once again able to crack reCAPTCHA.
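The two-word idea above can be sketched in a few lines of Python. This is only an illustration, not how reCAPTCHA is actually implemented: the function names, the `min_votes` and `min_agreement` thresholds, and the sample answers are all assumptions made up for this example.

```python
from collections import Counter


def is_probably_human(control_answer, known_answer):
    """The control word has a known answer, so it can be graded directly."""
    return control_answer.strip().lower() == known_answer.strip().lower()


def resolve_unknown_word(votes, min_votes=3, min_agreement=0.6):
    """Return the majority transcription of the unknown word, or None.

    `votes` holds the answers typed by users who passed the control word.
    Both thresholds are illustrative assumptions, not real reCAPTCHA values.
    """
    cleaned = [v.strip().lower() for v in votes if v.strip()]
    if len(cleaned) < min_votes:
        return None  # not enough evidence yet
    word, count = Counter(cleaned).most_common(1)[0]
    return word if count / len(cleaned) >= min_agreement else None


# Hypothetical submissions: (answer to the control word, answer to the unknown word)
submissions = [
    ("morning", "overlooks"),
    ("morning", "overlooks"),
    ("mornlng", "overbooks"),  # failed the control word, so this vote is ignored
    ("Morning", "Overlooks"),
    ("morning", "overlocks"),
]

# Only users who got the control word right get a say on the unknown word.
trusted = [unknown for control, unknown in submissions
           if is_probably_human(control, "morning")]
print(resolve_unknown_word(trusted))
```

The control word filters out obvious bots, and the majority vote turns many human guesses into one trusted answer for the word the computer couldn’t read.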
Computers beat this method so thoroughly that, according to a test conducted by Google, humans solved reCAPTCHA only 33% of the time, while AI did it with an accuracy of 99.8%.
The next attempt took a different approach: this time, humans were expected to teach machines about real-world entities.
Photo by dedy kurniawan on Unsplash
We all remember the fire hydrant, bus, cycle, and bike tests, right?
When we pick the correct images, we are teaching the machine what a real-world entity looks like. Our input is recorded and used to help self-driving cars better understand these entities.
But, guess what? AI is getting better at it too!
By this point, humans had lost both hope and temper trying to create a robust test.
Now, we are starting to verify users based on their behavior. This is a kind of invisible test that users are unaware of. It runs quietly behind your web pages to determine whether you’re a human or a bot.
Privacy is a myth, for sure!
The test can track your clicks, your typing speed, and your workflow, and it judges you based on them. If you show unusual behavior, such as typing hundreds of words in a second or clicking very frequently, it will fall back to reCAPTCHA (v2) and ask you to verify yourself.
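To make the idea concrete, here is a toy version of such a behavioral check. It is a sketch under loud assumptions: the `Session` fields, the two thresholds, and the sample sessions are invented for this example, and a real invisible test uses many more signals and a learned risk score rather than two hand-picked rules.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """A hypothetical summary of what the page observed about one visitor."""
    chars_typed: int
    typing_seconds: float
    clicks: int
    session_seconds: float


# Illustrative thresholds only; these numbers are assumptions, not real limits.
MAX_CHARS_PER_SECOND = 15.0
MAX_CLICKS_PER_SECOND = 5.0


def looks_automated(s: Session) -> bool:
    """Flag sessions that type or click faster than a person plausibly could."""
    typing_rate = s.chars_typed / max(s.typing_seconds, 1e-6)
    click_rate = s.clicks / max(s.session_seconds, 1e-6)
    return typing_rate > MAX_CHARS_PER_SECOND or click_rate > MAX_CLICKS_PER_SECOND


def next_step(s: Session) -> str:
    """Decide whether to fall back to a visible challenge."""
    return "show image challenge" if looks_automated(s) else "let the user through"


human = Session(chars_typed=240, typing_seconds=60, clicks=12, session_seconds=90)
bot = Session(chars_typed=600, typing_seconds=1, clicks=40, session_seconds=2)

print(next_step(human))
print(next_step(bot))
```

The point is the shape of the decision: normal behavior passes silently, and only suspicious behavior triggers the visible challenge.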
How did Machine Learning crack CAPTCHA?
By now, you must have understood that cracking CAPTCHA with Machine Learning isn’t a biggie. All you need to do is build a simple OCR model with the required data.
The training data can be found on GitHub.
The dataset consists of 1040 images.
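Before an OCR model can learn anything, the text labels have to be turned into numbers. Here is a minimal sketch of that step. It assumes each image file is named after the text it shows (for example, `2b827.png`); that naming scheme, the sample file names, and the helper names `encode` and `decode` are assumptions for illustration, not necessarily what the linked code does.

```python
from pathlib import Path

# Hypothetical file names; assumes each image is named after the text it shows.
file_names = ["2b827.png", "6n6gg.png", "p5g5m.png"]

# The label is simply the file name without its extension.
labels = [Path(name).stem for name in file_names]

# Vocabulary: every distinct character seen in the labels, in a stable order.
characters = sorted(set("".join(labels)))
char_to_num = {ch: i for i, ch in enumerate(characters)}
num_to_char = {i: ch for ch, i in char_to_num.items()}


def encode(label):
    """Map a text label to the list of integer ids a model is trained on."""
    return [char_to_num[ch] for ch in label]


def decode(indices):
    """Map predicted integer ids back to readable text."""
    return "".join(num_to_char[i] for i in indices)


encoded = encode(labels[0])
print(encoded)
print(decode(encoded))
```

The model then learns to predict these ids from the image, and `decode` turns its predictions back into the CAPTCHA text.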
Visualizing the data
Training our model
The code can be found at: Solving CAPTCHA using ML
Hopefully, this article has given you some insight into CAPTCHAs.
This work was created as an academic/fun project and is not intended to be used for harmful/malicious purposes.
CAPTCHAs vs. MACHINES: A Bitter Rivalry? was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI