ThePhish: an automated phishing email analysis tool

Phishing emails are today one of the most widely used infection vectors, so the number of email-related alerts that need to be analyzed keeps growing. The problem is that analyzing an email is a complex and tedious process in which analysts can waste most of their time on repetitive tasks. The obvious solution is to automate this process. To that end, we developed ThePhish, an open-source automated phishing email analysis tool. It is based on three open-source platforms, namely TheHive, Cortex and MISP.

Why automation?

Everyone knows what an email is, but not everyone knows what an email really contains. We are used to seeing the body of the email, which is the content displayed by the mail client. If we stopped there, however, we would miss what is arguably the most important part of the email for analysis purposes: the header. It is composed of dozens of header fields, and knowing where to look among them is crucial. We are usually looking for observables such as IP addresses, email addresses, domains, URLs and file attachments, which can be found in the header, in the body, or in both. It is important to identify the relevant ones and analyze them with many different tools and services. However, doing that by hand can take hours! That’s why we developed ThePhish.
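To give an idea of what extracting observables from the header and the body involves, here is a minimal sketch that uses only the Python standard library. It is not ThePhish's actual extraction code: the function name, the sample message and the deliberately simplified regular expressions are assumptions made for illustration, and the dictionary keys merely follow the observable type names used in TheHive (ip, mail, url, domain).

```python
import re
from email import message_from_string, policy
from urllib.parse import urlsplit

# Simplified patterns for illustration only; real-world extraction needs
# stricter validation (e.g. octet ranges, defanged URLs, IPv6).
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_observables(raw_email: str) -> dict:
    """Collect candidate observables from all header fields and the text body."""
    msg = message_from_string(raw_email, policy=policy.default)
    # Every header field is searched, not only From/To/Subject.
    header_text = "\n".join(f"{name}: {value}" for name, value in msg.items())
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body_text = body_part.get_content() if body_part is not None else ""
    text = f"{header_text}\n{body_text}"
    urls = set(URL_RE.findall(text))
    return {
        "ip": set(IP_RE.findall(text)),
        "mail": set(MAIL_RE.findall(text)),
        "url": urls,
        # Domains are derived from the URLs found above.
        "domain": {urlsplit(u).hostname for u in urls if urlsplit(u).hostname},
    }


# Hypothetical sample message using reserved example domains and addresses.
SAMPLE = """\
Received: from mail.example.net (mail.example.net [203.0.113.7])
 by mx.example.org; Mon, 1 Jan 2024 10:00:00 +0000
From: support@example.net
To: victim@example.org
Subject: Verify your account
Content-Type: text/plain; charset="utf-8"

Please log in at http://login.example.net/verify?id=1 within 24 hours.
"""

observables = extract_observables(SAMPLE)
for data_type, values in observables.items():
    print(data_type, sorted(values))
```

Even this toy version shows why the header matters: the sender's IP address appears only in the Received field, which a mail client normally hides.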

The underlying tools: TheHive, Cortex and MISP

TheHive

TheHive is a scalable, open-source and free Security Incident Response Platform (SIRP) created by TheHive Project. It allows managing alerts related to security events coming from a multitude of sources. In addition, specialized programs called alert feeders can be built to consume a security event, parse it and create an alert in TheHive. Alerts can be ignored, marked as read, previewed and imported. When an alert is imported, it becomes a case that needs to be investigated. The core construct of TheHive is in fact the case, which is also the core construct of most security investigations. A case can contain observables, such as IP addresses, email addresses, URLs, domains, files and many more. They can be tagged, analyzed and even flagged as Indicators of Compromise (IoCs). All the functionality provided by TheHive can be accessed both through a web interface and through a REST API. Moreover, a Python API client called TheHive4py is available.

Cortex

Cortex is a powerful open-source observable analysis and active response engine created by TheHive Project that allows analyzing observables at scale by querying a single tool instead of several. It provides a web interface from which observables can be analyzed one by one or in bulk mode, but it can also be used to automate these operations and submit large sets of observables from TheHive or through a REST API. Moreover, a Python API client called Cortex4py is available. Cortex is built around neurons, which are autonomous applications managed by and run through the Cortex core engine. There are two types of neurons:

Analyzers: They allow analyzing different types of observables by automating the interaction with a service or a tool so as to speed up the analysis and make it possible to contain threats before it is too late. Cortex comes with more than a hundred analyzers for popular services such as VirusTotal, emailrep.io, urlscan.io, AbuseIPDB and PhishTank.
Responders: They are installed along with the analyzers. However, they are only useful when Cortex is used in conjunction with TheHive, because they perform actions on various elements of TheHive.

MISP

Malware Information Sharing Platform (MISP) is free and open-source software that supports the sharing of threat intelligence, including cyber security indicators. Its aim is to help improve the countermeasures used against targeted attacks. It makes it possible to store IoCs in a structured manner, and thus to benefit from correlation, automated exports and synchronization with other MISP servers. The basic building block of MISP is the event. Each event is made up of a list of attributes, which are atomic pieces of data that could be IoCs. MISP also makes it easier both to share IoCs with and to receive them from trusted partners and trust groups, so as to enable fast and effective detection of attacks. MISP provides an intuitive web interface, but also a REST API that can be used for automation and for feeding devices.
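The relationship between an event and its attributes can be pictured with a small sketch. The dictionary below loosely follows the JSON shape used by MISP (an Event with an info field and a list of Attribute entries, each with a type, a category, a value and a to_ids flag). The helper name and the mapping from TheHive-style observable types to MISP attribute types are assumptions chosen for illustration, not the mapping that TheHive or ThePhish actually applies.

```python
import json

# Hypothetical mapping from observable types to (MISP type, MISP category).
TYPE_MAP = {
    "ip": ("ip-src", "Network activity"),
    "domain": ("domain", "Network activity"),
    "url": ("url", "Network activity"),
    "mail": ("email-src", "Payload delivery"),
}


def build_event(info: str, iocs: list) -> dict:
    """Build an event whose attributes are the given (data_type, value) IoCs."""
    attributes = []
    for data_type, value in iocs:
        misp_type, category = TYPE_MAP[data_type]
        attributes.append(
            {
                "type": misp_type,
                "category": category,
                "value": value,
                "to_ids": True,  # flag the attribute as usable for detection
            }
        )
    return {"Event": {"info": info, "Attribute": attributes}}


event = build_event(
    "Phishing email reported by a user",
    [("url", "http://login.example.net/verify?id=1")],
)
print(json.dumps(event, indent=2))
```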

Putting it all together

The real advantage of using TheHive, Cortex and MISP is obtained when they are used together. The following picture summarizes the possible interactions, which are described below:

TheHive, Cortex and MISP

Events coming from SIEMs, emails, IDSes and other platforms are consumed by alert feeders, which can use the TheHive4py library to create alerts in TheHive. Alerts can also be generated thanks to TheHive polling events from MISP.
An analyst can create a case from that alert (or from scratch) and decide to analyze the observables that it contains through Cortex. It is also possible to interact with MISP through Cortex by using a special analyzer (the MISP Search analyzer), which makes it possible to search for observables within a MISP instance. The observables can also be analyzed by invoking MISP expansion modules. Moreover, it is possible to launch responders to automate certain actions, like sending a notification email.
The analyst can then decide whether to export the case, along with its observables marked as IoCs, from TheHive to MISP.
It is also possible to enrich MISP events by calling Cortex analyzers from within MISP.

How ThePhish can help

ThePhish is a web application that automates the entire analysis process. It extracts the observables from the header and the body of the email and produces a verdict, which is final in most cases. In addition, the analyst can intervene in the analysis process and obtain further details on the email being analyzed if necessary. Here is how ThePhish works at a high level:

ThePhish overview

An attacker starts a phishing campaign and sends a phishing email to a user.
A user who receives such an email can send it as an attachment in EML format to the mailbox used by ThePhish.
The analyst interacts with ThePhish and selects the email to analyze.
ThePhish extracts all the observables from the email and creates a case on TheHive. The observables are analyzed thanks to Cortex and its analyzers.
ThePhish calculates a verdict based on the verdicts of the analyzers.
If the verdict is final, ThePhish closes the case and notifies the user. In addition, if it is a malicious email, it exports the case to MISP.
If the verdict is not final, ThePhish requires the analyst’s intervention. He must review the case on TheHive along with the results given by the various analyzers to formulate a verdict. Then, he can send the notification to the user, optionally export the case to MISP, and close the case.
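The last three steps can be sketched as follows. Cortex analyzers report a level for each finding (info, safe, suspicious or malicious), but the aggregation rule shown here is only an assumption made for illustration: the function names and the threshold parameter are hypothetical, and ThePhish's real logic can be fine-tuned through its configuration files. The one thing taken from the workflow above is that "safe" and "malicious" are final verdicts, while "suspicious" requires the analyst.

```python
from collections import Counter

FINAL_VERDICTS = ("safe", "malicious")


def overall_verdict(analyzer_levels: list, min_malicious: int = 1) -> str:
    """Combine per-analyzer levels into one verdict (illustrative rule only)."""
    counts = Counter(analyzer_levels)
    if counts["malicious"] >= min_malicious:
        return "malicious"
    if counts["suspicious"] > 0:
        return "suspicious"
    # Only "info" and "safe" levels were reported.
    return "safe"


def is_final(verdict: str) -> bool:
    """A final verdict lets the case be closed without the analyst."""
    return verdict in FINAL_VERDICTS


levels = ["info", "safe", "suspicious", "safe"]
verdict = overall_verdict(levels)
print(verdict, "final" if is_final(verdict) else "needs analyst review")
```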

ThePhish relieves the analyst of manually extracting all the observables from the email and adding them one by one to a case on TheHive. Moreover, he doesn’t need to start the various analyzers on each observable, send notifications to users, or interact with MISP. Even when his intervention is required, the majority of the work will have already been performed. This means that he can focus only on what matters to reach a final verdict.

ThePhish example usage

This example aims to demonstrate two aspects:

How a user can send an email to ThePhish
How an analyst can actually analyze that email using ThePhish

A user sends an email to ThePhish

A user can send an email to the email address that ThePhish uses to fetch the emails to analyze. He must forward the email as an attachment in EML format, which prevents the email header from being contaminated. In this example, the mail client is Mozilla Thunderbird and the email address is a Gmail address.
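The reason why forwarding as an attachment preserves the header can be seen in a short sketch based on Python's standard email package. The original message travels intact as a message/rfc822 part inside a new wrapper message, so its header fields can be recovered exactly as they were received. The function names and the addresses are hypothetical, and this is not necessarily how ThePhish unpacks the attachment.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Optional


def wrap_as_attachment(original: EmailMessage, sender: str, mailbox: str) -> EmailMessage:
    """Simulate 'forward as attachment': embed the original message untouched."""
    wrapper = EmailMessage()
    wrapper["From"] = sender
    wrapper["To"] = mailbox
    wrapper["Subject"] = "Suspicious email to analyze"
    wrapper.set_content("Please analyze the attached email.")
    # Attaching a message object produces a message/rfc822 part.
    wrapper.add_attachment(original, filename="suspicious.eml")
    return wrapper


def unwrap_attachment(raw: bytes) -> Optional[EmailMessage]:
    """Return the first attached email found in the raw wrapper message."""
    wrapper = message_from_bytes(raw, policy=policy.default)
    for part in wrapper.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            return part.get_content()
    return None


original = EmailMessage()
original["From"] = "attacker@example.net"
original["To"] = "user@example.org"
original["Subject"] = "Verify your account"
original.set_content("Click http://login.example.net/verify?id=1")

raw = wrap_as_attachment(original, "user@example.org", "thephish@example.org").as_bytes()
inner = unwrap_attachment(raw)
print(inner["From"], "|", inner["Subject"])
```

With an inline forward, by contrast, the mail client writes a new header and only quotes parts of the old one in the body, so fields such as Received would be lost.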

ThePhish forward email

The analyst analyzes the email

The analyst clicks on the “List emails” button to obtain the list of emails to analyze.

ThePhish list emails

When the analyst clicks on one of the “Analyze” buttons, the analysis starts. Its progress is shown on the web interface.

ThePhish start analysis

In the meantime, ThePhish extracts the observables from the email and interacts with TheHive to create the case.

ThePhish case outside

ThePhish creates three tasks inside the case.

ThePhish case tasks

Then, ThePhish starts adding the extracted observables to the case.

ThePhish observables

At this point, ThePhish uses the Mailer responder to notify the user via email that the analysis has started.

ThePhish notification

The description of the first task allows the Mailer responder to send the notification via email.

ThePhish notification task

ThePhish starts the analyzers on the observables. The analysis progress is shown on the web interface while the analyzers are started.

ThePhish analysis

The analysis progress can also be viewed on TheHive, thanks to its live stream.

ThePhish live feed

Once all the analyzers have finished, ThePhish calculates the verdict. Since the verdict is “malicious”, ThePhish marks all the observables that it finds to be malicious as IoCs.

ThePhish ioc

The case is then exported to MISP as an event, with a single attribute represented by the observable mentioned above.

ThePhish event MISP

ThePhish MISP attribute

Then, ThePhish sends the verdict via email to the user thanks to the Mailer responder.

ThePhish result malicious

Finally, ThePhish closes the case. The description of the third task allows the Mailer responder to send the verdict via email. Moreover, the case was closed after five minutes and resolved as “True Positive” with “No Impact”. This means that ThePhish detected the attack before it could do any damage.

ThePhish task result

Once the case is closed, the analyst can view the verdict on the web interface. He can also view the entire log of the analysis progress.

ThePhish malicious verdict

At this point, the analyst can go back and analyze another email. The case shown above involved a phishing email, but a similar workflow applies when the analyzed email is classified as “safe”: ThePhish closes the case and sends the verdict to the user.

ThePhish result safe

Then, the analyst can view the verdict on the web interface.

ThePhish safe verdict

On the other hand, when an email is classified as “suspicious”, the verdict is only displayed to the analyst.

ThePhish suspicious verdict

Now the analyst needs to use the buttons on the left to open TheHive, Cortex and MISP for further analysis, because the analysis has not been completed yet: the last task and the case have been left open. They need to be closed by the analyst himself once he reaches a final verdict.

ThePhish task left open

The analyst can view the reports of all the analyzers on TheHive and Cortex. If this is not enough, he can also download the EML file of the email and analyze it manually.

ThePhish EML file

When the analyst completes the analysis, he fills in the description of the last task, which will contain the body of the email to send to the user. Then he starts the Mailer responder, exports the case to MISP if necessary and closes the case.

Implementation

The following picture is worth a thousand words:

ThePhish implementation

ThePhish is a web application written in Python 3. The web server is implemented using Flask, while the front end is implemented using Bootstrap. Apart from the web server module, the back-end logic consists of three Python modules that encapsulate the application logic and a Python class that supports logging over the WebSocket protocol. Moreover, ThePhish uses several files to configure the following aspects:

The logging functionality used to show the analysis progress
The connection with the underlying instances of TheHive, Cortex and MISP
The fine tuning of the results given by the Cortex analyzers
The observables to whitelist
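As an example of the last item, a whitelist can be thought of as a set of rules that remove known-good observables before they are added to the case. The sketch below is only an assumption about how such filtering could work: the structure of the whitelist (exact values plus regular expressions per observable type), the function names and the sample entries are hypothetical and do not reflect the format of ThePhish's actual configuration file.

```python
import re

# Hypothetical whitelist: exact values and regular expressions per observable type.
WHITELIST = {
    "exact": {
        "domain": {"example.org"},
        "mail": {"thephish@example.org"},
    },
    "regex": {
        # Matches example.org and its subdomains, but not look-alike hosts.
        "url": [re.compile(r"^https?://([\w-]+\.)*example\.org(/|$)")],
    },
}


def is_whitelisted(data_type: str, value: str, whitelist: dict = WHITELIST) -> bool:
    """Return True if the observable matches an exact entry or a pattern."""
    if value.lower() in whitelist["exact"].get(data_type, set()):
        return True
    return any(p.search(value) for p in whitelist["regex"].get(data_type, []))


def filter_observables(observables: list) -> list:
    """Keep only the (data_type, value) pairs that are not whitelisted."""
    return [(t, v) for t, v in observables if not is_whitelisted(t, v)]


found = [
    ("domain", "example.org"),
    ("url", "https://mail.example.org/inbox"),
    ("url", "http://example.org.evil.example.net/"),
    ("url", "http://login.example.net/verify?id=1"),
]
print(filter_observables(found))
```

Anchoring the pattern on the host name matters: a naive substring match on "example.org" would also whitelist the look-alike URL in the third entry.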

When the analyst navigates to the base URL of the application, the browser establishes a bi-directional connection with the server. This is done by using the Socket.IO JavaScript library. For this to work, the server application uses the Flask-SocketIO Python library, which provides a Socket.IO integration for Flask applications. ThePhish uses this connection to display the progress of the analysis on the web interface.
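One way to picture how log messages can reach the browser is a logging handler that forwards each formatted record to an emit function. The sketch below is an assumption about the general technique, not ThePhish's actual class: the class name, the event name and the room identifier are hypothetical. In a Flask-SocketIO application the callable passed in could be the emit method of the SocketIO object; a stand-in function is used here so that the example runs with the standard library alone.

```python
import logging


class EmitHandler(logging.Handler):
    """Forward formatted log records to a callable such as a Socket.IO emit."""

    def __init__(self, emit_func, event: str = "log", room=None):
        super().__init__()
        self._emit_func = emit_func
        self._event = event
        self._room = room  # identifies the client that started the analysis

    def emit(self, record: logging.LogRecord) -> None:
        try:
            payload = {"message": self.format(record)}
            self._emit_func(self._event, payload, to=self._room)
        except Exception:
            self.handleError(record)


# Stand-in for a real emit function: it just records what would be sent.
sent = []


def fake_emit(event, data, to=None):
    sent.append((event, data, to))


logger = logging.getLogger("analysis-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = EmitHandler(fake_emit, room="client-1")
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.info("Case created")
logger.info("Analyzers started")
print(sent)
```

Because the handler plugs into the standard logging machinery, the analysis code can keep calling the logger as usual while the progress messages are pushed to the web interface.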

Every time the analyst performs an action on the web interface, an AJAX request is sent to the server. This allows the analyst both to view the list of emails to analyze and to start the analysis.

ThePhish interacts with TheHive and Cortex thanks to TheHive4py and Cortex4py. Moreover, it interacts with an IMAP server to retrieve the emails to analyze.
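Retrieving the list of emails from an IMAP server can be sketched with the standard imaplib and email modules. The function name, the folder and the returned fields are assumptions made for illustration, and the source does not say which IMAP library ThePhish uses. The function takes an already authenticated connection, so it can be tried without a real server by passing any object that offers the same select, search and fetch methods.

```python
import imaplib
from email import message_from_bytes, policy


def list_emails(conn, folder: str = "INBOX") -> list:
    """Return id, sender and subject for every message in the folder."""
    conn.select(folder, readonly=True)  # read-only: do not mark messages as seen
    status, data = conn.search(None, "ALL")
    if status != "OK":
        return []
    emails = []
    for num in data[0].split():
        status, fetched = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        # fetched[0] is a (response line, raw message bytes) tuple.
        msg = message_from_bytes(fetched[0][1], policy=policy.default)
        emails.append(
            {
                "id": num.decode(),
                "from": str(msg.get("From", "")),
                "subject": str(msg.get("Subject", "")),
            }
        )
    return emails


# Example usage against a real server (host and credentials are placeholders):
# with imaplib.IMAP4_SSL("imap.example.org") as conn:
#     conn.login("thephish@example.org", "app-password")
#     for entry in list_emails(conn):
#         print(entry)
```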

Installation

You can install ThePhish in one of two possible ways:

From scratch, if you have up-and-running instances of TheHive, Cortex and MISP
By using Docker and Docker Compose

It’s also on GitHub!

ThePhish has been made available on GitHub as an open-source project under the AGPL license at this repository. Refer to that repository for a complete installation and configuration guide.

Conclusions

In this post we have presented a tool that automates the entire analysis process and produces a verdict. It requires the analyst’s intervention only when necessary, and even then the most tedious and mechanical tasks have already been performed. The tool is available on GitHub so that anyone can contribute to improving it over time. The development of ThePhish will not stop here, as there is plenty of room for improvement. We will add new functionality to ThePhish in the future, support new features introduced by TheHive and Cortex as well as new analyzers, and fix bugs that might be present in the current release or in any future release.
