Monday, January 27, 2020

VDEC Based Data Extraction and Clustering Approach

This chapter describes in detail the proposed VDEC approach. It discusses the two phases of the VDEC process for data extraction and clustering. Experimental performance evaluation results, comparing the GDS and SDS datasets, are shown in the last section.

INTRODUCTION

Extracting data records from the response pages returned by web databases or search engines is a challenge in information retrieval. Traditional web crawlers focus only on the surface web, while the deep web keeps expanding behind the scenes. Vision-based data extraction offers a way to extract information from dynamic web pages through page segmentation, which creates the data region, followed by data record and item extraction. Vision-based web information extraction systems, however, have become more complex and time-consuming, and detecting the data region remains a significant problem for information extraction from a web page.

This chapter discusses an approach to vision-based deep web data extraction and web document clustering. The proposed approach comprises two phases: (1) vision-based web data extraction and (2) web document clustering. In phase 1, the web page information is segmented into various chunks, from which surplus noise and duplicate chunks are removed using three parameters: hyperlink percentage, noise score and cosine similarity. Finally, the extracted keywords are subjected to web document clustering using Fuzzy c-means clustering (FCM).

VDEC APPROACH

The VDEC approach is designed to extract visual data automatically from deep web pages, as shown in the block diagram in Figure 5.1.

Figure 5.1 – VDEC Approach Block diagram

In most web pages, more than one data object is tied together in a data region, which makes it difficult to search the attributes for each page. Because the unprocessed source of the web page that represents the objects is non-contiguous, the problem becomes more complicated.
In existing applications, what users need from complex web pages is the description of the individual data objects derived by partitioning the data region. VDEC captures data from deep web pages in two phases, as discussed in the following sections.

Phase-1 Vision Based Web Data Extraction

In Phase-1, the VDEC approach performs data extraction. A measure is introduced to evaluate the importance of each leaf chunk in the tree, which in turn helps to eliminate noise in a deep web page. Surplus noise and duplicate chunks are removed using three parameters: hyperlink percentage, noise score and cosine similarity. The main chunks are then extracted using three parameters (title word relevancy, keyword frequency based chunk selection, and position features), and a set of keywords is extracted from those main chunks.

Phase-2 Web Document Clustering

In Phase-2, VDEC performs web document clustering using Fuzzy c-means clustering (FCM): the sets of keywords are clustered for all deep web pages.

Both phases of VDEC help to extract the visual features of web pages and support web page clustering for improving information retrieval. The process activities are briefly described in the following section.

DEFINITIONS OF TERMS USED IN VDEC APPROACH

Definition (Chunk): Consider a deep web page that is segmented into blocks. Each of these blocks is known as a chunk. A web page is thus represented as a set of chunks, one of which is the main chunk.

Definition (Hyperlink): A hyperlink has an anchor, which is the location within a document from which the hyperlink can be followed; the document containing a hyperlink is called its source document. The hyperlink percentage is

$$HP = \frac{N_l}{N_k}$$

where $N_k$ is the number of keywords in a chunk and $N_l$ is the number of link keywords in a chunk.

Definition (Noise score): Noise score is defined as the ratio of the number of images in a chunk to the total number of images.
Noise score:

$$NS = \frac{N_i}{N_t}$$

where $N_i$ is the number of images in a chunk and $N_t$ is the total number of images.

Definition (Cosine similarity): Cosine similarity measures the similarity of two chunks. The inner product of the two vectors, i.e., the sum of the pairwise multiplied elements, is divided by the product of their vector lengths:

$$\cos(C_a, C_b) = \frac{\sum_k w_{ak}\, w_{bk}}{\sqrt{\sum_k w_{ak}^2}\ \sqrt{\sum_k w_{bk}^2}}$$

where $w_{ak}$ and $w_{bk}$ are the weights of keyword $k$ in chunks $C_a$ and $C_b$.

Definition (Position feature): Position features (PFs) indicate the location of the data region on a deep web page. To compute the position feature score, the ratio of the size of the data region to the size of the whole page is computed, and Eq. (4) is then used to find the score for the chunk.

Definition (Title word relevancy): A web page title is the name or heading of a web site or a web page. If a block contains more title words, the block is of more importance. Title word relevancy is defined in terms of the number of title keywords and the frequency of each title keyword in a chunk.

Definition (Keyword frequency): Keyword frequency is the number of times a keyword phrase appears in a deep web page chunk relative to the total number of words on the deep web page. Keyword frequency based chunk selection is defined in terms of the frequency of the top ten keywords, the number of keywords, and the number of top-K keywords.

PHASE-1 – VISION BASED DEEP WEB DATA EXTRACTION

In a web page, there are numerous components that are immaterial to the descriptions of data objects, such as advertisement bars, product categories, search panels, navigation bars and copyright statements. Generally, a web page is specified by a triple $\Omega = (O, \Phi, \delta)$. $O$ is a finite set of objects or sub-web-pages; these objects do not overlap. Each object can be recursively viewed as a sub-web-page and has a subsidiary content structure. $\Phi$ is a finite set of visual separators, such as horizontal and vertical separators.
Every separator has a weight representing its visibility, and all the separators in the same $\Phi$ have the same weight. $\delta$ is the relationship between every two blocks in $O$, which is represented as $\delta = O \times O \rightarrow \Phi \cup \{\mathrm{NULL}\}$. In several web pages, more than one data object is normally entwined together in a data region, which makes it complex to find the attributes for each page.

Deep Web Page Extraction

The deep web is usually defined as the content on the web that is not accessible through a search on general search engines. This content is sometimes also referred to as the hidden or invisible web. The web is a complex entity that contains information from a variety of source types and includes an evolving mix of different file types and media. It is much more than static, self-contained web pages. In our work, the deep web pages are collected from Complete Planet (www.completeplanet.com), which is currently the largest deep web repository, with more than 70,000 entries of web databases.

Chunk Segmentation

Web pages contain not only main content, such as product information in the shopping domain or job information in the job domain, but also advertisement bars and static content such as navigation panels and copyright sections. In many web pages, the main content is in the middle chunk, and the rest of the page contains advertisements, navigation links and privacy statements as noisy data. Removing this noise helps improve web mining; the operation is called the Chunk Segmenting Operation and is shown in Figure 5.2.

Figure 5.2 Chunk Segmenting Operation

To assign importance to a region in a web page, we first need to segment the web page into a set of chunks. The approach extracts main content information and performs deep web clustering in a way that is both fast and accurate. The two stages and their sub-steps are as follows.
Stage 1: Vision-based deep web data identification
- Deep web page extraction
- Chunk segmentation
- Noisy chunk removal
- Extraction of main chunk using chunk weightage

Stage 2: Web document clustering
- Clustering process using FCM

Normally, a tag is divided into many sub-tags based on the content of the deep web page. If a sub-tag contains no further tag, the last tag is considered a leaf node. The Chunk Splitting Process aims at cleaning local noise by considering only the main content of a web page enclosed in div tags. The main content is segmented into various chunks. The result of this process can be represented as

$$C = \{C_1, C_2, \ldots, C_n\}$$

where $C$ is the set of chunks in the deep web page and $n$ is the number of chunks in the deep web page.

In Figure 5.1, we have taken an example of a tree sample which consists of main chunks and sub-chunks. The main chunks are segmented into chunks C1, C2 and C3 using the Chunk Splitting Operation, and the sub-chunks are obtained by segmenting these further.

Noisy Chunk Removal

A deep web page usually contains main content chunks and noise chunks. Only the main content chunks represent the informative part that most users are interested in. Although other chunks help enrich functionality and guide browsing, they negatively affect web mining tasks such as web page clustering and classification by reducing both the accuracy of the mined results and the speed of processing. These chunks are therefore called noise chunks. To remove them, we concentrate on two significant parameters: hyperlink percentage and noise score. The main objective of removing noise from a web page is to improve the performance of the search engine. Each parameter is described as follows.

Hyperlink Keyword – A hyperlink has an anchor, which is the location within a document from which the hyperlink can be followed; the document containing a hyperlink is known as its source document.
Hyperlink keywords are the keywords in a chunk that link to another page. If a chunk contains more links, the chunk has less importance. The parameter hyperlink word percentage is the proportion of hyperlink keywords among all keywords in a chunk:

$$HP = \frac{N_l}{N_k}$$

where $N_k$ is the number of keywords in a chunk and $N_l$ is the number of link keywords in a chunk.

Noise score – The information on a web page consists of both text and images (static pictures, flash, video, etc.). Many Internet sites draw income from third-party advertisements, usually in the form of images sprinkled throughout the site’s pages. In our work, the parameter noise score is the proportion of all images that are present in a chunk:

$$NS = \frac{N_i}{N_t}$$

where $N_i$ is the number of images in a chunk and $N_t$ is the total number of images.

Duplicate Chunk Removal Using Cosine Similarity

Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [7] and in clustering [8]. Here, duplicate detection among chunks is done with the help of cosine similarity. Given two chunks $C_a$ and $C_b$, their cosine similarity is

$$\cos(C_a, C_b) = \frac{\sum_k w_{ak}\, w_{bk}}{\sqrt{\sum_k w_{ak}^2}\ \sqrt{\sum_k w_{bk}^2}}$$

where $w_{ak}$ and $w_{bk}$ are the weights of keyword $k$ in $C_a$ and $C_b$.

Extraction of Main Block

Chunk Weightage for Sub-Chunk: In the previous step, we obtained a set of chunks after removing the noise chunks and duplicate chunks present in a deep web page. Web page designers tend to organize their content in a reasonable way, giving prominence to important things and de-emphasizing unimportant parts with features such as position, size, color, word, image and link. A chunk importance model is a function that maps features to importance for each chunk, and can be formalized as: ⟨chunk features⟩ → chunk importance.
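
The noise and duplicate filters described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the chunk dictionary layout, the function names and all three thresholds (`max_hp`, `max_ns`, `dup_sim`) are assumptions made for illustration, and raw keyword counts stand in for the keyword weights, which the text does not specify.

```python
import math
from collections import Counter


def hyperlink_percentage(keywords, link_keywords):
    """Share of a chunk's keywords that are hyperlink keywords."""
    return len(link_keywords) / len(keywords) if keywords else 0.0


def noise_score(images_in_chunk, total_images):
    """Share of the page's images that fall inside this chunk."""
    return images_in_chunk / total_images if total_images else 0.0


def cosine_similarity(keywords_a, keywords_b):
    """Cosine similarity of two chunks, using raw term counts as weights."""
    wa, wb = Counter(keywords_a), Counter(keywords_b)
    dot = sum(wa[k] * wb[k] for k in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(chunks, total_images, max_hp=0.5, max_ns=0.5, dup_sim=0.9):
    """Drop noisy chunks, then drop near-duplicates of chunks already kept.

    The three thresholds are assumed values, not taken from the chapter.
    """
    kept = []
    for chunk in chunks:
        hp = hyperlink_percentage(chunk["keywords"], chunk["link_keywords"])
        ns = noise_score(chunk["images"], total_images)
        if hp > max_hp or ns > max_ns:
            continue  # treated as a noise chunk
        if any(cosine_similarity(chunk["keywords"], k["keywords"]) >= dup_sim
               for k in kept):
            continue  # treated as a duplicate chunk
        kept.append(chunk)
    return kept


# Hypothetical page: a content chunk, a navigation chunk, and a repeat.
chunks = [
    {"keywords": ["camera", "lens", "price", "zoom"],
     "link_keywords": [], "images": 1},
    {"keywords": ["home", "about", "contact", "login"],
     "link_keywords": ["home", "about", "contact", "login"], "images": 0},
    {"keywords": ["camera", "lens", "price", "zoom"],
     "link_keywords": [], "images": 0},
]
kept = filter_chunks(chunks, total_images=4)
```

In this made-up page, the navigation chunk is rejected because every keyword is a link, and the third chunk is rejected because it repeats the first; only the first chunk survives to the weighting step.
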
The preprocessing for this computation is to extract the essential keywords for the calculation of chunk importance. Many researchers have given importance to different information inside a web page, for instance location, position, occupied area and content. In this research work, we concentrate on three significant parameters: title word relevancy, keyword frequency based chunk selection, and position features. Each parameter has its own significance in calculating the sub-chunk weightage. Eq. (1) computes the sub-chunk weightage of all noiseless chunks from these three parameters together with constants. For each noiseless chunk, we therefore have to calculate the three unknown parameters. Each parameter is described as follows.

Title Keyword – A web page title is the name or title of a web site or a web page. If a block contains more title words, the block is of more importance. The parameter title word relevancy measures the proportion of title keywords present in a block and is computed using Eq. (2), whose terms are the number of title keywords and the frequency of each title keyword in a chunk.

Keyword frequency based chunk selection: Keyword frequency is the number of times a keyword phrase appears in a deep web page chunk relative to the total number of words on the deep web page. In our work, the top-K keywords of every chunk are selected and their frequencies are calculated. The parameter keyword frequency based chunk selection is calculated for all sub-chunks and is computed using Eq. (3).
Keyword frequency based chunk selection, Eq. (3), is defined in terms of the frequency of the top ten keywords and the number of top-K keywords.

Position features (PFs): Generally, data regions are centered horizontally, and the calculation uses the ratio of the size of the data region to the size of the whole deep web page rather than the actual size. In our experiments, the threshold of the ratio is set at 0.7; that is, if the ratio of the horizontally centered region is greater than or equal to 0.7, the region is recognized as the data region. The parameter position features identifies the important sub-chunk among all sub-chunks and is computed using Eq. (4).

Thus, we obtain the values of title word relevancy, keyword frequency based chunk selection and position features from Eqs. (2) to (4). By substituting these values into Eq. (1), we obtain the sub-chunk weightage.

Chunk Weightage for Main Chunk: The process above gives the sub-chunk weightage of all noiseless chunks. The main chunk weightage is then obtained from Eq. (5), whose terms are the sub-chunk weightages of the main chunk and a constant. Finally, we obtain a set of important chunks and extract the keywords from them for effective web document clustering.

Algorithm-1 : Clustering Approach

PHASE-2 – DEEP WEB DOCUMENT CLUSTERING USING FCM

Let DB be a dataset of web documents with an associated set of keywords. Let X = {x1, x2, …, xN} be the set of N web documents, where xi = {xi1, xi2, …, xin}. Each xij (i = 1, …, N; j = 1, …, n) corresponds to the frequency of keyword j in web document xi. Fuzzy c-means [29] partitions the set of web documents in n-dimensional space into c fuzzy clusters with cluster centers, or centroids, z1, …, zc.
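
The partitioning that FCM performs can be sketched with the standard fuzzy c-means iteration. This is a minimal sketch under stated assumptions, not the chapter's Java implementation: the function name `fuzzy_c_means`, the random initialisation, the default `m = 2.0`, the stopping tolerance and the toy keyword-frequency vectors are all illustrative choices, and the membership update is the textbook FCM rule rather than something quoted from this chapter.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def fuzzy_c_means(docs, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Cluster keyword-frequency vectors with standard fuzzy c-means.

    Returns (centroids, memberships); memberships[i][j] is the degree to
    which document i belongs to cluster j. All defaults are assumptions.
    """
    rng = random.Random(seed)
    n_docs, dim = len(docs), len(docs[0])

    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centroids = []
    for _ in range(max_iter):
        # Centroid update: membership-weighted mean of the documents.
        centroids = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n_docs)]
            total = sum(weights)
            centroids.append([
                sum(w * docs[i][k] for i, w in enumerate(weights)) / total
                for k in range(dim)
            ])

        # Membership update from the distances to every centroid.
        exponent = 2.0 / (m - 1.0)
        new_u = []
        for x in docs:
            dists = [euclidean(x, z) for z in centroids]
            if any(d == 0.0 for d in dists):
                # Document lies exactly on a centroid: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([h / sum(hits) for h in hits])
                continue
            new_u.append([
                1.0 / sum((dists[j] / dists[k]) ** exponent for k in range(c))
                for j in range(c)
            ])

        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(n_docs) for j in range(c))
        u = new_u
        if change < tol:
            break

    return centroids, u


# Toy data: four documents described by the frequencies of two keywords.
docs = [[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]]
centroids, memberships = fuzzy_c_means(docs, c=2)
strongest = [row.index(max(row)) for row in memberships]
```

Each row of `memberships` sums to one, so a document can belong partly to several clusters; taking the largest entry per row, as `strongest` does, is one simple way to turn the fuzzy result into a hard assignment.
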
The fuzzy clustering of keywords is described by a fuzzy matrix $\mu$ with n rows and c columns, in which n is the number of keywords and c is the number of clusters. $\mu_{ij}$, the element in the $i$-th row and $j$-th column of $\mu$, indicates the degree of association, or membership, of the $i$-th object with the $j$-th cluster. The properties of $\mu$ are as follows:

$$\mu_{ij} \in [0, 1], \quad \forall i = 1, \ldots, n;\ \forall j = 1, \ldots, c \tag{6}$$

$$\sum_{j=1}^{c} \mu_{ij} = 1, \quad \forall i = 1, \ldots, n \tag{7}$$

$$0 < \sum_{i=1}^{n} \mu_{ij} < n, \quad \forall j = 1, \ldots, c \tag{8}$$

The objective of the FCM algorithm is to minimize Eq. (9):

$$J_m = \sum_{j=1}^{c} \sum_{i=1}^{n} \mu_{ij}^{m}\, d_{ij}^{2} \tag{9}$$

where

$$d_{ij} = \lVert x_i - z_j \rVert \tag{10}$$

in which m (m > 1) is a scalar termed the weighting exponent, which controls the fuzziness of the resulting clusters, and $d_{ij}$ is the Euclidean distance from keyword $x_i$ to the cluster center $z_j$. The centroid $z_j$ of the $j$-th cluster is obtained using Eq. (11):

$$z_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{m}\, x_i}{\sum_{i=1}^{n} \mu_{ij}^{m}} \tag{11}$$

The FCM algorithm is iterative and can be stated as in Algorithm-2.

Algorithm-2 : Fuzzy c-means Approach

Experimental Setup

The experimental results of the proposed method for vision-based deep web data extraction for web document clustering are presented in this section. The proposed approach has been implemented in Java (JDK 1.6), and the experiments were performed on a 3.0 GHz Pentium PC with 2 GB of main memory. For the experiments, we took many deep web pages that contained all kinds of noise, such as navigation bars, panels and frames, page headers and footers, copyright and privacy notices, advertisements and other uninteresting data. The proposed method is then applied to these pages to remove the different kinds of noise. The removal of noise blocks and the extraction of useful content chunks are explained in this sub-section. Finally, extracting the useful con
