Content search in large text corpuses using natural language processing

FFI-Report 2021
This abstract and publication is only available in Norwegian

About the publication

ISBN

9788246433769

Size

602.7 KB

Language

English

Bernt Ivar Utstøl Nødland, Hallvar Gisnås, Henrik Gråtrud, Vidar Benjamin Skretting
Analysts and researchers face an ever-increasing amount of information. Methods for identifying relevant information on fuzzy topics and concepts can therefore speed up their work. We investigate deep learning for semantic content search in a large text corpus. We test several state-of-the-art models, including ULMFiT and transformer-based models. Deep learning models leverage large public corpora, through tasks such as next-word prediction, to acquire a broad understanding of language that aids their prediction of relevance. We compare them against a keyword-search baseline on a test case of approximately 50 000 articles from the Jordan Times, in which we try to identify articles about jihadist terror plots. The best deep learning models outperform keyword search, indicating that these techniques could provide a useful tool for the analyst. However, they require effort to set up properly and are much more complex than the baseline. We recommend further testing of these methods, both in English and in other languages.
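To make the comparison concrete, the following is a minimal sketch of what a keyword-search baseline and its evaluation could look like. It is not the report's implementation: the keyword list, the four toy articles, and the function names `keyword_match` and `precision_recall_f1` are all invented for illustration, and the report's actual keywords and metrics may differ.

```python
import re

# Hypothetical keyword list, chosen for illustration only.
KEYWORDS = {"terror", "plot", "attack", "jihadist"}


def keyword_match(text, keywords=KEYWORDS, min_hits=1):
    """Return True if at least `min_hits` distinct keywords occur in `text`."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & keywords) >= min_hits


def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 from two parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented toy articles as (text, is_relevant) pairs; not data from the report.
ARTICLES = [
    ("Police foil jihadist plot targeting a shopping centre", True),
    ("Security forces arrest cell planning bombing", True),
    ("The film has a gripping plot and strong acting", False),
    ("Ministry announces new water project", False),
]

predicted = [keyword_match(text) for text, _ in ARTICLES]
actual = [relevant for _, relevant in ARTICLES]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this toy data the baseline flags the film review because it contains "plot" and misses the second article because none of its words are in the keyword list. These are the two kinds of error that a model with some grasp of meaning might plausibly reduce. A learned relevance classifier could be scored with the same `precision_recall_f1` function to put both approaches on equal footing.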
