Abstract |
: |
The rapid growth in data volumes and the need to integrate data from heterogeneous sources bring to the fore the challenge of efficiently detecting duplicate records in databases. Because data sources are autonomous and follow their own conventions, integrating data from different sources often introduces erroneous redundancy. To ensure high data quality, a database must validate and filter the incoming data from external sources; in this regard, data normalization has become essential for maintaining the quality of the stored data. The process of identifying record pairs that represent the same entity is commonly known as duplicate record detection, and it is one of the most important tasks in data cleansing. The proposed work suggests an approach to improve the accuracy of duplicate record detection which, when used in combination with text similarity and edit distance, yields well-filtered data. The approach was trialled on Scholarship Portal data developed for various organizations, where identifying such duplicate records to the greatest possible extent was a pressing need, since the portal applies various reservation/quota mechanisms and genuine students must not be debarred from receiving scholarships. |
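
As an illustrative sketch only (not the paper's implementation), the snippet below shows one way text similarity and edit distance might be combined to flag candidate duplicate records; the field names, helper names, and thresholds are assumptions introduced for the example.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]


def text_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_duplicate(rec1: dict, rec2: dict,
                          max_edits: int = 2, min_sim: float = 0.85) -> bool:
    """Flag two records as likely duplicates when the names are both
    close in edit distance and highly similar as strings.
    The 'name' field and the thresholds are illustrative assumptions."""
    name1 = rec1["name"].strip().lower()
    name2 = rec2["name"].strip().lower()
    close_enough = edit_distance(name1, name2) <= max_edits
    similar_enough = text_similarity(name1, name2) >= min_sim
    return close_enough and similar_enough


if __name__ == "__main__":
    r1 = {"name": "Ramesh Kumar", "roll_no": "SCH-1021"}
    r2 = {"name": "Ramesh Kumaar", "roll_no": "SCH-1021"}
    print(is_probable_duplicate(r1, r2))  # True: one-character spelling variation
```

In practice the two measures complement each other: the edit-distance bound catches small typographical variations, while the similarity ratio guards against short strings where a couple of edits would still represent a large relative change.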