Mining Associations Between Two Categories Using Unstructured Text Data in Cloud

Yanqing Ji (a), Yun Tian (b), Fangyang Shen (d), John Tran (c)

(a) Electrical & Computer Engineering
(b) Eastern Washington University
(c) Frontier Behavioral Health
(d) New York City College of Technology

Research Output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Finding associations between itemsets within two categories (e.g., drugs and adverse effects, genes and diseases) is very important in many domains. However, these association mining tasks often involve computation-intensive algorithms and large amounts of data. This paper investigates how to leverage MapReduce to effectively mine the associations between itemsets within two categories using a large set of unstructured data. While existing MapReduce-based association mining algorithms focus on frequent itemset mining (i.e., finding itemsets whose frequencies are higher than a threshold), we propose a MapReduce algorithm that can be used to compute all the interestingness measures defined on the basis of a 2 × 2 contingency table. The algorithm was applied to mine the associations between drugs and diseases using 33,959 full-text biomedical articles on the Amazon Elastic MapReduce (EMR) platform. Experimental results indicate that the proposed algorithm exhibits linear scalability.
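To make the idea of a 2 × 2 contingency table concrete, the following is a minimal single-process sketch in Python of how map-style and reduce-style steps could produce the four cell counts for a drug and disease pair. It is not the authors' algorithm: the function names (`map_document`, `reduce_counts`, `contingency`, `measures`), the key layout, and the toy documents are all assumptions made for illustration, and the three measures shown (support, confidence, lift) are only examples of measures that can be derived from such a table.

```python
from collections import Counter
from itertools import product


def map_document(drugs, diseases):
    """Map step (hypothetical): emit (key, 1) for the document itself,
    each distinct drug, each distinct disease, and each drug-disease pair."""
    drugs, diseases = set(drugs), set(diseases)
    yield ("N",), 1
    for drug in drugs:
        yield ("drug", drug), 1
    for disease in diseases:
        yield ("disease", disease), 1
    for drug, disease in product(drugs, diseases):
        yield ("pair", drug, disease), 1


def reduce_counts(emitted):
    """Reduce step (hypothetical): sum the values emitted for each key."""
    counts = Counter()
    for key, value in emitted:
        counts[key] += value
    return counts


def contingency(counts, drug, disease):
    """Build the 2 x 2 table as (n11, n10, n01, n00) in document counts."""
    n = counts[("N",)]
    n11 = counts[("pair", drug, disease)]      # drug and disease together
    n10 = counts[("drug", drug)] - n11         # drug without disease
    n01 = counts[("disease", disease)] - n11   # disease without drug
    n00 = n - n11 - n10 - n01                  # neither
    return n11, n10, n01, n00


def measures(n11, n10, n01, n00):
    """Example measures; assumes the drug and the disease each occur
    in at least one document, so no denominator is zero."""
    n = n11 + n10 + n01 + n00
    support = n11 / n
    confidence = n11 / (n11 + n10)
    lift = confidence / ((n11 + n01) / n)
    return {"support": support, "confidence": confidence, "lift": lift}


# Toy corpus (made up): each document is (drugs mentioned, diseases mentioned).
docs = [
    ({"aspirin"}, {"headache"}),
    ({"aspirin"}, {"headache", "fever"}),
    ({"aspirin"}, set()),
    ({"ibuprofen"}, {"fever"}),
]

emitted = [kv for drugs, diseases in docs for kv in map_document(drugs, diseases)]
counts = reduce_counts(emitted)
table = contingency(counts, "aspirin", "headache")
result = measures(*table)
```

Because only the pair count, the two marginal counts, and the total number of documents are needed, the remaining cells follow by subtraction; in this sketch that is why the map step never has to emit anything for documents in which a drug or disease is absent.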

Access to documents

10.1007/978-3-319-77028-4_70
