Using Data Science to disrupt human trafficking

Tuesday the 11th of January is Human Trafficking Awareness Day. In honour of this, we wanted to share the story of Le Wagon graduate Lena, whose journey into Data Science grew out of a desire to simplify the data collection and analysis process for research into human trafficking activity.


Human Trafficking is a multi-billion dollar industry with an estimated 40 million victims worldwide (ILO). It is a hugely underreported crime and, where case data does exist, it is rarely granular enough to be useful to targeted disruption efforts.

There are, however, thousands of survivor testimonies published online - collected by human rights organisations or found in news articles - which are rich in the kinds of information that could be used to disrupt human trafficking activity. This might include the industry in which a person was exploited, sub-country location information, the grooming, control and transport methods used by traffickers, and the transaction amounts involved in the buying and selling of a person. Essentially, this means any information that could be reported to law enforcement or financial institutions and lead to the identification of a trafficker, or any information that could be communicated to communities and individuals to help them better understand potential trafficking risks.

Put yourself in the shoes of an analyst faced with thousands of written survivor testimonies to process and make sense of. It would take years to go through each case, manually label the information and put it into a dataframe before any form of quantitative analysis could begin. Quite apart from the time it would take, the emotional toll of reading survivor testimonies can be high.*

Could data science tools accelerate and simplify the data collection and analysis process for research into human trafficking?

In 2019, as an analyst at STOP THE TRAFFIK (STT), I was lucky enough to take part in one of DataKind UK’s DataDives - a hackathon that aims to help NGOs and charities make the best possible use of their data. The DataDive showed the STT team how much could be achieved using the preprocessed data already available, and concluded with some exploratory natural language processing (NLP) work on a collection of survivor testimonies from the Nottingham Rights Lab.

Inspired by the DataDive and DataKind’s incredible data scientist volunteers, I took the plunge two years later and completed a Data Science bootcamp at Le Wagon. Together with a hugely talented team of fellow Le Wagon students on the project - Paloma Aragón Trujillo, Sofia Giordano and Andrea Lampugnani - I brought to life the ideas for NLP exploration of free-text survivor testimonies that were born during the DataDive. We were able to automate the extraction of key intelligence from the texts and demonstrate the power of each and every survivor testimony through the use of NLP and machine learning.

To give one example of the intelligence we were able to extract from the survivor testimonies using data science tools, our team found that the most common price paid for a human being was $10,000. If we were working with a financial institution, we could share this intelligence to aid the detection of trafficker bank accounts.

The image shows a wordcloud generated by applying a regex to the survivor testimonies. The regex was designed to extract phrases relating to the buying and selling of a human being.

To identify relevant transaction amounts, our team created a regex which identified phrases relating to the buying and selling of a human being, combined with a monetary amount - signified by a series of digits followed or preceded by a currency name or symbol.
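The regex described above might look something like the following minimal sketch. It is not the team's actual pattern: the verb list, the currency names and symbols, the 60-character window and the function names `extract_amounts` and `most_common_amount` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed vocabulary: verbs suggesting a person was bought or sold.
TRADE = r"(?:sold|bought|paid|purchased)"

# A monetary amount: digits preceded by a currency symbol,
# or digits followed by a currency name (both lists are assumptions).
NUMBER = r"\d[\d,]*(?:\.\d+)?"
AMOUNT = (
    r"(?:[$£€]\s?(?P<after_symbol>" + NUMBER + r")"
    r"|(?P<before_name>" + NUMBER + r")\s?(?:dollars|pounds|euros|USD|GBP|EUR))"
)

# A trade verb followed by an amount in the same sentence,
# with at most 60 characters between them.
PATTERN = re.compile(TRADE + r"\b[^.!?]{0,60}?" + AMOUNT, re.IGNORECASE)


def extract_amounts(text):
    """Return the monetary amounts that follow a trade verb in `text`."""
    amounts = []
    for match in PATTERN.finditer(text):
        raw = match.group("after_symbol") or match.group("before_name")
        amounts.append(float(raw.replace(",", "")))
    return amounts


def most_common_amount(texts):
    """Return the amount seen most often across `texts`, or None if none found."""
    counts = Counter(a for t in texts for a in extract_amounts(t))
    return counts.most_common(1)[0][0] if counts else None
```

A pattern like this will produce false positives (for example "pounds" used as a weight, or payments unrelated to a person), so the matches would still need reviewing by an analyst before being treated as intelligence.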

The image above highlights two clusters generated by the DistilBERT multilingual cased model when applied to the survivor testimonies. Cluster 1 relates to a conflict theme and cluster 4 to debt.

One of the project’s aims was to show how data science can aid data interpretation, for example by helping to identify patterns or themes in the texts without an analyst needing to read each text individually. The DistilBERT multilingual cased model was able to identify distinct themes, including the conflict and debt-related topics shown in the cluster chart - both of which are known vulnerabilities to trafficking - without any human guidance. Moreover, when the team looked into testimonies involving Sudan, we found a group of testimonies from Eritrean trafficking survivors who had been abducted in either Sudan or Egypt and forced into slavery while making their way to Israel, where they had hoped to seek asylum. It was amazing to see an unsupervised machine learning model bring this group of testimonies to light.

This Sankey diagram shows some of the most common routes from a recruitment location to an exploitation location in the survivor testimonies.

After extracting location information from the survivor testimonies we initially plotted the main routes in a Sankey diagram. Contrary to what some might expect when talking about human trafficking, many people had been trafficked within one country, as opposed to transnationally.
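As a rough illustration of the step between extracting locations and plotting routes, the sketch below counts recruitment-to-exploitation pairs and measures how many stay within one country. The field names `recruited_in` and `exploited_in`, the function names and the record layout are hypothetical and not taken from the project.

```python
from collections import Counter


def route_counts(cases):
    """Count (recruitment location, exploitation location) pairs.

    Each case is assumed to be a dict with the hypothetical keys
    'recruited_in' and 'exploited_in'.
    """
    return Counter((c["recruited_in"], c["exploited_in"]) for c in cases)


def domestic_share(counts):
    """Fraction of cases whose recruitment and exploitation locations match."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    domestic = sum(n for (src, dst), n in counts.items() if src == dst)
    return domestic / total
```

Each entry in the resulting counter is a (source, target) pair with a count, which is the general shape of input that Sankey plotting tools take for their links.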

Further insights and interactive charts produced as a result of the project can be found on the project website, where we combined data extracted from the survivor testimonies with data from the Counter Trafficking Data Collaborative.

Thanks to the talented team and volunteers at DataKind UK who helped STT during the DataDive, and thank you to all the amazing teachers and alumni who helped us at Le Wagon.

*For anyone exposed to distressing content online or in any other day to day setting, this resource from the British Medical Association may be helpful.