Headshot of Tim Jackson wearing a pink jacket, open-necked white shirt and glasses. He is smiling and looking directly at the camera.

Tim Jackson MSc.

A solution-focused Data Scientist and experienced leader.

The Challenges of Unstructured Data

A report written during my studies for an MSc in Data Science.


Abstract:

Unstructured data now accounts for up to 80% of all data held by organisations. Processing this data is currently time-consuming and complex, because traditional analysis methods are designed for structured data. However, unstructured data holds the potential to answer many questions. This paper reviews the use of Machine Learning, including Deep Learning, and NLP with unstructured datasets, focusing on word embeddings, word2vec and BERT, with particular attention to medical and financial data. It summarises the state of the art and outlines the open research questions facing the field: privacy, the quality and availability of datasets, and the generalisability of study outcomes. Where solutions to these questions have been proposed in the literature, they are highlighted.

Document Downloads: