Parallelization of backward deleted distance calculation in graph based features using Hadoop

dc.contributor.authorPillamari, Jayachandran
dc.date.accessioned2013-04-01T14:45:23Z
dc.date.available2013-04-01T14:45:23Z
dc.date.graduationmonthMayen_US
dc.date.issued2013-04-01
dc.date.published2013en_US
dc.description.abstractThis project presents an approach to parallelizing the calculation of Backward Deleted Distance (BDD) in Graph-Based Features (GBF) computation using Hadoop. The issues involved in calculating BDD are identified, and parallel computing technology, specifically Hadoop, is applied to solve them. The aim of the project is to parallelize the calculation of BDD and thereby reduce GBF computation time.

GBF is one of the feature sets used in ontology classifiers. In this project, GBF features are used to predict the friendship relationship between users whose direct link has been deleted. The computation involves calculating the BDD between all pairs of users: the BDD for a user pair is the length of the shortest path between them when their direct link is deleted, that is, the shortest distance between them other than the direct path. The project uses train and test data sets consisting of positive and negative instances; positive instances are user pairs that have a friendship link between them, whereas negative instances have no direct link.

Since the BDD calculation involves computing the shortest paths between all pairs of given users, it can be viewed as an All Pairs Shortest Path (APSP) problem. The process of BDD calculation was examined to identify the key places where it could be parallelized, and the internal structure and implementation of the Hadoop MapReduce framework were studied and applied to the APSP problem. The project introduces a new algorithm to parallelize the APSP computation in BDD calculation using Hadoop MapReduce, implemented in Java on Hadoop. Apache Hadoop is an emerging technology for scalable, distributed computing across clusters of computers; its MapReduce framework supports applications that process large amounts of data in parallel on large clusters. (An illustrative sketch of the MapReduce shortest-path pattern follows this record.)

The project was developed and implemented successfully and was tested for reliability and performance using different data sets, chosen to cover various factors and typical graph representations. The test results were analyzed to characterize the behavior of the system and show a significant speedup, reducing processing time from roughly 10 hours to about 20 minutes.en_US
dc.description.advisorDaniel A. Andresenen_US
dc.description.degreeMaster of Scienceen_US
dc.description.departmentDepartment of Computing & Information Sciencesen_US
dc.description.levelMastersen_US
dc.identifier.urihttp://hdl.handle.net/2097/15433
dc.language.isoen_USen_US
dc.publisherKansas State Universityen
dc.subjectParallelization of APSPen_US
dc.subjectHadoopen_US
dc.subjectMap Reduceen_US
dc.subject.umiComputer Science (0984)en_US
dc.subject.umiInformation Technology (0489)en_US
dc.titleParallelization of backward deleted distance calculation in graph based features using Hadoopen_US
dc.typeReporten_US
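
The abstract above reduces BDD computation to repeated shortest-path searches parallelized with Hadoop MapReduce. As an illustration only, not the report's actual code, the following minimal Java sketch shows one iteration of the standard parallel breadth-first-search pattern on Hadoop MapReduce for an unweighted friendship graph. Here BDD(u, v) is the shortest-path distance between u and v in the graph with the direct edge (u, v) removed, so the adjacency lists are assumed to have that edge deleted before the search starts. The class name BfsIteration and the tab-separated "node, distance, adjacency-list" record layout are assumptions made for this example.

```java
// One iteration of parallel BFS on Hadoop MapReduce (illustrative sketch,
// not the report's code). Input lines are assumed to look like:
//   nodeId <TAB> distance <TAB> n1,n2,n3
// where distance is the current best distance from the source, or "INF".
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BfsIteration {

    public static class BfsMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String node = parts[0];
            String dist = parts[1];
            String adj  = parts.length > 2 ? parts[2] : "";

            // Re-emit the node's own record so its adjacency list survives.
            ctx.write(new Text(node), new Text("NODE\t" + dist + "\t" + adj));

            // If this node is reachable, offer distance+1 to each neighbor.
            if (!"INF".equals(dist) && !adj.isEmpty()) {
                int d = Integer.parseInt(dist);
                for (String nbr : adj.split(",")) {
                    ctx.write(new Text(nbr), new Text("DIST\t" + (d + 1)));
                }
            }
        }
    }

    public static class BfsReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text node, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            int best = Integer.MAX_VALUE;
            String adj = "";
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                if ("NODE".equals(parts[0])) {
                    adj = parts.length > 2 ? parts[2] : "";
                    if (!"INF".equals(parts[1])) {
                        best = Math.min(best, Integer.parseInt(parts[1]));
                    }
                } else { // "DIST": a distance candidate offered by a neighbor
                    best = Math.min(best, Integer.parseInt(parts[1]));
                }
            }
            String dist = (best == Integer.MAX_VALUE) ? "INF"
                                                      : String.valueOf(best);
            // Output has the same layout as the input, ready for the next pass.
            ctx.write(node, new Text(dist + "\t" + adj));
        }
    }
}
```

A driver (omitted here) would run this mapper/reducer pair repeatedly, one pass per BFS frontier level, until no distance changes; the converged distance at the target node is then the BDD for that source. Repeating the search over every query pair yields the all-pairs (APSP) view described in the abstract.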

Files

Original bundle
Name: JayachandranPillamari2013.pdf
Size: 1.44 MB
Format: Adobe Portable Document Format
Description: Masters Report

License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission