Abstract |
: |
Machine translation from one natural language to another is a challenging task. One method of machine translation is the interlingua-based approach, in which the source-language text is represented in an intermediate form that is then translated into the target language. Generating natural-language sentences combines knowledge about the language and the application domain to produce a correct translation, so it is important to prepare a domain-specific corpus. Equally important for machine translation of a document is the semantic hierarchy among the sets of domain words, since this hierarchy provides semantic links and ontological information for words. Ontologies define concepts and their interrelationships in order to provide a shared view of a given application domain. One of the main problems is the difficulty of identifying and defining the relevant concepts in the domain. This paper aims to extract knowledge from Tamil Nadu university websites in order to identify domain-specific words for educational sites. It proposes a method that identifies domain-specific words by using the hierarchical structure of web directories node by node. The method produces a list of high-frequency, domain-dependent words. |