Monday, January 27, 2020

VDEC Based Data Extraction and Clustering Approach

This chapter describes the proposed VDEC approach in detail. It discusses the two phases of the VDEC process for data extraction and clustering. Experimental performance evaluation results comparing the GDS and SDS datasets are presented in the last section.

INTRODUCTION

Extracting data records from the response pages returned by web databases or search engines is a challenge in information retrieval. Traditional web crawlers focus only on the surface web, while the deep web keeps expanding behind the scenes. Vision-based data extraction provides a solution: it extracts information from dynamic web pages by segmenting a page into data regions and then extracting data records and items. Vision-based web information extraction systems have, however, become complex and time-consuming, and detection of the data region is a significant problem for information extraction from a web page. This chapter discusses an approach to vision-based deep web data extraction and web document clustering. The proposed approach comprises two phases: (1) vision-based web data extraction, and (2) web document clustering. In phase 1, the web page is segmented into chunks, from which surplus noise and duplicate chunks are removed using three parameters: hyperlink percentage, noise score, and cosine similarity. Finally, the extracted keywords are subjected to web document clustering using Fuzzy c-means clustering (FCM).

VDEC APPROACH

The VDEC approach is designed to extract visual data automatically from deep web pages, as shown in the block diagram in Figure 5.1.

Figure 5.1 – VDEC Approach block diagram

In most web pages, more than one data object is tied together in a data region, which makes it difficult to identify the attributes of each object; and because the raw source of a web page represents its objects non-contiguously, the problem becomes more complicated. In real applications, what users require from complex web pages is the description of each individual data object, derived by partitioning the data region. VDEC captures data from deep web pages in two phases, discussed in the following sections.

Phase-1: Vision-Based Web Data Extraction

In Phase-1, VDEC performs data extraction. A measure is introduced to evaluate the importance of each leaf chunk in the tree, which in turn helps eliminate noise in a deep web page. Under this measure, surplus noise and duplicate chunks are removed using three parameters: hyperlink percentage, noise score, and cosine similarity. The main chunks are then extracted using three further parameters: title word relevancy, keyword-frequency-based chunk selection, and position features, and a set of keywords is extracted from those main chunks.

Phase-2: Web Document Clustering

In Phase-2, VDEC performs web document clustering using Fuzzy c-means clustering (FCM): the sets of keywords extracted from all deep web pages are clustered. Together, the two phases extract the visual features of web pages and support web page clustering, improving information retrieval. The process activities are briefly described in the following sections.

DEFINITIONS OF TERMS USED IN VDEC APPROACH

Definition (chunk): A deep web page is segmented into blocks; each block is known as a chunk. A web page P can thus be represented as P = {C_1, C_2, ..., C_n}, where each C_i is a chunk of P.
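To make the chunk notation concrete, the following minimal Java sketch models a deep web page as a set of chunks, each carrying the counts the later definitions rely on. It is an illustration only: the thesis does not publish its data structures, and every class and field name here is ours.

import java.util.List;

// Illustrative model of P = {C1, C2, ..., Cn}; not the thesis code.
public class DeepWebPage {
    private final List<Chunk> chunks;

    public DeepWebPage(List<Chunk> chunks) { this.chunks = chunks; }
    public List<Chunk> getChunks() { return chunks; }

    public static class Chunk {
        int keywordCount;       // number of keywords in the chunk
        int linkKeywordCount;   // keywords that are anchor text of hyperlinks
        int imageCount;         // images contained in the chunk
        List<Chunk> subChunks;  // a chunk may split further (leaf = no sub-chunks)
    }
}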
Definition (Hyperlink): A hyperlink has an anchor, which is the location within a document from which the hyperlink can be followed; the document containing the hyperlink is called its source document. The hyperlink percentage of a chunk is

Hyperlink percentage, H(C) = n_link / n_key

where n_key is the number of keywords in the chunk and n_link is the number of link keywords in the chunk.

Definition (Noise score): The noise score of a chunk is the ratio of the number of images in the chunk to the total number of images on the page:

Noise score, NS(C) = n_img(C) / N_img

where n_img(C) is the number of images in the chunk and N_img is the total number of images.

Definition (Cosine similarity): Cosine similarity measures the similarity of two chunks: the inner product of their keyword-weight vectors, i.e., the sum of the pairwise multiplied elements, is divided by the product of the vector lengths.

Cosine similarity, cos(C_1, C_2) = Σ_i w_1i·w_2i / (√(Σ_i w_1i²) · √(Σ_i w_2i²))

where w_1i and w_2i are the weights of keyword i in C_1 and C_2 respectively.

Definition (Position feature): Position features (PFs) indicate the location of the data region on a deep web page. To compute the position feature score, the ratio r of the size of the horizontally centered region to the size of the whole page is computed, and the score of a chunk is

PF(C) = 1 if r ≥ 0.7, and 0 otherwise    (4)

Definition (Title word relevancy): A web page title is the name or heading of a web site or a web page. If a certain block contains a larger number of title words, that block is of greater importance. Title word relevancy is

Title word relevancy, T(C) = (Σ_{i=1..n_t} f(t_i)) / n_t

where n_t is the number of title keywords and f(t_i) is the frequency of title keyword t_i in the chunk.

Definition (Keyword frequency): Keyword frequency is the number of times a keyword phrase appears in a deep web page chunk relative to the total number of words on the page. Keyword-frequency-based chunk selection is

KF(C) = (Σ_{i=1..K} f_i) / (K · n_key)

where f_i is the frequency of the i-th top keyword (the top ten in our experiments), n_key is the number of keywords, and K is the number of top-K keywords.
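Read as the plain ratios given above, these measures translate directly into code. The following Java sketch reconstructs five of them under those readings (cosine similarity is sketched later, alongside duplicate removal); the exact normalizations in the original work may differ, and all method and parameter names are ours.

// Hedged reconstruction of the chunk-scoring definitions; an illustration,
// not reference code. Ratio readings follow the "where" clauses above.
public final class ChunkScores {

    // Hyperlink percentage: link keywords / all keywords in the chunk.
    public static double hyperlinkPercentage(int linkKeywords, int totalKeywords) {
        return totalKeywords == 0 ? 0.0 : (double) linkKeywords / totalKeywords;
    }

    // Noise score: images in the chunk / total images on the page.
    public static double noiseScore(int imagesInChunk, int imagesOnPage) {
        return imagesOnPage == 0 ? 0.0 : (double) imagesInChunk / imagesOnPage;
    }

    // Title word relevancy: in-chunk frequencies of the page-title keywords,
    // averaged over the number of title keywords.
    public static double titleWordRelevancy(int[] titleKeywordFrequencies) {
        if (titleKeywordFrequencies.length == 0) return 0.0;
        double sum = 0;
        for (int f : titleKeywordFrequencies) sum += f;
        return sum / titleKeywordFrequencies.length;
    }

    // Keyword-frequency score: summed frequency of the top-K keywords
    // relative to K and to the number of keywords in the chunk.
    public static double keywordFrequencyScore(int[] topKFrequencies, int totalKeywords) {
        if (totalKeywords == 0) return 0.0;
        double sum = 0;
        for (int f : topKFrequencies) sum += f;
        return sum / (totalKeywords * (double) Math.max(1, topKFrequencies.length));
    }

    // Position feature: 1 if the chunk's region ratio clears the threshold
    // (0.7 in the experiments reported later), else 0.
    public static double positionFeature(double regionRatio) {
        return regionRatio >= 0.7 ? 1.0 : 0.0;
    }
}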
PHASE-1 – VISION BASED DEEP WEB DATA EXTRACTION

In a web page there are numerous immaterial components related to the descriptions of data objects, such as advertisement bars, product categories, search panels, navigation bars, and copyright statements. Generally, a web page can be specified by a triple P = (O, Φ, δ). O = {O_1, O_2, ..., O_n} is a finite set of objects or sub-web-pages; these objects do not overlap, and each can be recursively viewed as a sub-web-page with a subsidiary content structure. Φ = {φ_1, φ_2, ..., φ_m} is a finite set of visual separators, such as horizontal and vertical separators; every separator has a weight representing its visibility, and all separators in the same Φ have the same weight. δ is the relationship of every two blocks in O, represented as δ ⊆ O × O. In several web pages there is normally more than one data object entwined together in a data region, which makes it complex to find the attributes of each page.

Deep Web Page Extraction

The deep web is usually defined as the content on the Web that is not accessible through a search on general search engines; this content is sometimes also referred to as the hidden or invisible web. The Web is a complex entity that contains information from a variety of source types and includes an evolving mix of file types and media; it is much more than static, self-contained web pages. In our work, the deep web pages are collected from Complete Planet (www.completeplanet.com), currently the largest deep web repository with more than 70,000 entries of web databases.

Chunk Segmentation

Web pages contain not only main content information, such as product information in a shopping domain or job information in a job domain, but also advertisement bars and static content such as navigation panels and copyright sections. In many web pages the main content information lies in the middle chunk, and the rest of the page carries advertisements, navigation links, and privacy statements as noisy data. Removing these noises helps improve the mining of the web; the operation is called the Chunk Segmenting Operation, shown in Figure 5.2.

Figure 5.2 – Chunk Segmenting Operation

To assign importance to a region of a web page, we first need to segment the page into a set of chunks. The operation extracts the main content information and supports deep web clustering that is both fast and accurate. The two stages and their sub-steps are as follows.

Stage 1: Vision-based deep web data identification
(a) Deep web page extraction
(b) Chunk segmentation
(c) Noisy chunk removal
(d) Extraction of main chunk using chunk weightage

Stage 2: Web document clustering
(a) Clustering process using FCM

Normally, a tag is separated into many sub-tags based on the content of the deep web page; if a sub-tag contains no further tag, the last tag is considered a leaf node. The Chunk Splitting Process aims at cleaning the local noises by considering only the main content of a web page enclosed in div tags. The main contents are segmented into chunks, and the result of this process can be represented as

P = {C_1, C_2, ..., C_n}

where {C_1, ..., C_n} is the set of chunks in the deep web page and n is the number of chunks. Figure 5.2 shows an example tree consisting of main chunks and sub-chunks: the main content is segmented into chunks C1, C2, and C3 by the Chunk Splitting Operation, and each main chunk is further segmented into its sub-chunks.

Noisy Chunk Removal

A deep web page usually contains main content chunks and noise chunks. Only the main content chunks represent the informative part that most users are interested in. Although the other chunks are helpful in enriching functionality and guiding browsing, they negatively affect web mining tasks such as web page clustering and classification by reducing both the accuracy of mined results and the speed of processing; these chunks are therefore called noise chunks. To remove them, this work concentrates on two significant parameters, hyperlink percentage and noise score. The main objective of removing noise from a web page is to improve the performance of the search engine. The parameters are as follows.

Hyperlink Keyword – A hyperlink has an anchor, the location within a document from which the hyperlink can be followed. Hyperlink keywords are the keywords in a chunk that direct to another page; if a particular chunk contains many links, the chunk has less importance. The parameter calculates the percentage of hyperlink keywords in a chunk:

Hyperlink percentage, H(C) = n_link / n_key

where n_key is the number of keywords in the chunk and n_link is the number of link keywords.
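Noisy-chunk removal then amounts to thresholding the hyperlink percentage together with the noise score defined earlier (and elaborated just below). A possible filtering pass follows; the 0.5 cut-offs are placeholders of ours, since the chapter does not fix threshold values.

import java.util.ArrayList;
import java.util.List;

// Illustrative noisy-chunk filter: drops chunks dominated by links or images.
// Threshold values are assumptions, not taken from the thesis.
public final class NoiseFilter {

    public static class Chunk {
        int keywordCount, linkKeywordCount, imageCount;
    }

    public static List<Chunk> removeNoisyChunks(List<Chunk> chunks, int imagesOnPage) {
        List<Chunk> clean = new ArrayList<>();
        for (Chunk c : chunks) {
            double hyperlinkPct = c.keywordCount == 0
                    ? 0.0 : (double) c.linkKeywordCount / c.keywordCount;
            double noiseScore = imagesOnPage == 0
                    ? 0.0 : (double) c.imageCount / imagesOnPage;
            // Keep the chunk only if it is neither link-heavy nor image-heavy.
            if (hyperlinkPct < 0.5 && noiseScore < 0.5) {
                clean.add(c);
            }
        }
        return clean;
    }
}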
Noise score – The information on a web page consists of both text and images (static pictures, flash, video, etc.). Many Internet sites draw income from third-party advertisements, usually in the form of images sprinkled throughout the site's pages. The parameter noise score calculates the percentage of images in a chunk:

Noise score, NS(C) = n_img(C) / N_img

where n_img(C) is the number of images in the chunk and N_img is the total number of images.

Duplicate Chunk Removal Using Cosine Similarity

Cosine similarity is one of the most popular similarity measures for text documents, applied in numerous information retrieval applications [7] and in clustering [8]. Here, duplication among chunks is detected with the help of cosine similarity. Given two chunks C_1 and C_2, their cosine similarity is

cos(C_1, C_2) = Σ_i w_1i·w_2i / (√(Σ_i w_1i²) · √(Σ_i w_2i²))

where w_1i and w_2i are the weights of keyword i in C_1 and C_2.

Extraction of Main Block

Chunk weightage for sub-chunks: The previous step yields a set of chunks with the noise chunks and duplicate chunks removed. Web page designers tend to organize content in a reasonable way: giving prominence to important things and de-emphasizing the unimportant parts with proper features such as position, size, color, word, image, and link. A chunk importance model is a function that maps features to an importance value for each chunk. The preprocessing for this computation is to extract the essential keywords needed to calculate chunk importance. Researchers have given importance to different information inside a web page, for instance location, position, occupied area, and content; this work concentrates on three significant parameters: title word relevancy, keyword-frequency-based chunk selection, and position features. Each parameter has its own significance for calculating sub-chunk weightage, and the sub-chunk weightage of every noiseless chunk is computed as

W_sub(C) = α·T(C) + β·KF(C) + γ·PF(C)    (1)

where α, β, and γ are constants. For each noiseless chunk, the three parameters are calculated as follows.

Title Keyword – A web page title is the name or title of a web site or a web page. If a particular block contains a larger number of title words, that block is of greater importance. This parameter calculates the percentage of title keywords present in a block:

Title word relevancy, T(C) = (Σ_{i=1..n_t} f(t_i)) / n_t    (2)

where n_t is the number of title keywords and f(t_i) is the frequency of title keyword t_i in the chunk.

Keyword-frequency-based chunk selection – Keyword frequency is the number of times a keyword phrase appears in a deep web page chunk relative to the total number of words on the page. In our work, the top-K keywords of every chunk are selected and their frequencies calculated. The parameter is computed for all sub-chunks as

KF(C) = (Σ_{i=1..K} f_i) / (K · n_key)    (3)

where f_i is the frequency of the i-th top keyword (the top ten in our experiments), n_key is the number of keywords in the chunk, and K is the number of top-K keywords.
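Duplicate removal and the weightage of Eq. (1) can be sketched as follows, assuming each chunk is represented by a map from keywords to weights. The constants alpha, beta, gamma and the duplicate threshold are stand-ins of ours for the unspecified values.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of duplicate detection (cosine similarity over keyword weights) and
// sub-chunk weightage (Eq. 1). Constants are placeholders, not thesis values.
public final class MainChunkExtraction {

    // cos(C1, C2) = sum(w1i * w2i) / (|w1| * |w2|)
    public static double cosineSimilarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Drop later chunks that are near-duplicates of an earlier one,
    // e.g. removeDuplicates(chunks, 0.9) with an illustrative 0.9 threshold.
    public static List<Map<String, Double>> removeDuplicates(
            List<Map<String, Double>> chunks, double threshold) {
        List<Map<String, Double>> unique = new ArrayList<>();
        for (Map<String, Double> c : chunks) {
            boolean duplicate = false;
            for (Map<String, Double> u : unique) {
                if (cosineSimilarity(c, u) >= threshold) { duplicate = true; break; }
            }
            if (!duplicate) unique.add(c);
        }
        return unique;
    }

    // Eq. (1): linear combination of the three importance parameters.
    public static double subChunkWeightage(double titleRelevancy,
                                           double keywordFreqScore,
                                           double positionFeature) {
        final double alpha = 0.4, beta = 0.4, gamma = 0.2; // illustrative constants
        return alpha * titleRelevancy + beta * keywordFreqScore + gamma * positionFeature;
    }
}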
Position features (PFs) – Data regions are generally centered horizontally, and for the calculation we need the ratio of the size of the data region to the size of the whole deep web page rather than the actual sizes. In our experiments, the threshold of the ratio is set at 0.7: if the ratio r of the horizontally centered region is greater than or equal to 0.7, the region is recognized as the data region. The parameter selects the important sub-chunks from all sub-chunks:

PF(C) = 1 if r ≥ 0.7, and 0 otherwise    (4)

Having obtained the values of T(C), KF(C), and PF(C) from the above equations and substituted them into Eq. (1), we obtain the sub-chunk weightage.

Chunk weightage for main chunks: From the above process we have the sub-chunk weightage of all noiseless chunks. The weightage of a main chunk M is then selected as

W_main(M) = η · Σ_{C ∈ M} W_sub(C)    (5)

where W_sub(C) is the sub-chunk weightage of each sub-chunk of M and η is a constant. Thus we finally obtain a set of important chunks, and we extract the keywords from these important chunks for effective web document clustering.

Algorithm-1: Clustering Approach

PHASE-2 – DEEP WEB DOCUMENT CLUSTERING USING FCM

Let DB be a dataset of web documents and X = {x_1, x_2, ..., x_N} the set of N web documents, where x_i = {x_i1, x_i2, ..., x_in}. Each x_ij (i = 1, ..., N; j = 1, ..., n) corresponds to the frequency of keyword j in web document x_i. Fuzzy c-means [29] partitions the set of web documents in n-dimensional space into c fuzzy clusters with cluster centers, or centroids. The fuzzy clustering of keywords is described by a fuzzy matrix U with n rows and c columns, in which n is the number of keywords and c is the number of clusters. u_ij, the element in the i-th row and j-th column of U, indicates the degree of association, or membership, of object i with cluster j. The characteristics of U are as follows:

u_ij ∈ [0, 1], for i = 1, ..., n and j = 1, ..., c    (6)
Σ_{j=1..c} u_ij = 1, for i = 1, ..., n    (7)
0 < Σ_{i=1..n} u_ij < n, for j = 1, ..., c    (8)

The objective of the FCM algorithm is to minimize

J(U, Z) = Σ_{j=1..c} Σ_{i=1..n} u_ij^m · d_ij²    (9)

where

d_ij = ||x_i − z_j||    (10)

in which m (m > 1) is a scalar termed the weighting exponent that controls the fuzziness of the resulting clusters, and d_ij is the Euclidean distance from keyword x_i to the cluster center z_j. The centroid z_j of the j-th cluster is obtained using

z_j = (Σ_{i=1..n} u_ij^m · x_i) / (Σ_{i=1..n} u_ij^m)    (11)

The FCM algorithm is iterative and is stated in Algorithm-2; a generic implementation sketch is given at the end of this section.

Algorithm-2: Fuzzy c-means Approach

Experimental Setup

The experimental results of the proposed vision-based deep web data extraction and web document clustering method are presented in this section. The proposed approach was implemented in Java (JDK 1.6), and the experiments were performed on a 3.0 GHz Pentium PC with 2 GB of main memory. For experimentation, we took many deep web pages that contained all the noise types, such as navigation bars, panels and frames, page headers and footers, copyright and privacy notices, advertisements, and other uninteresting data. These pages were then given to the proposed method for removing the different noises. The removal of noise blocks and the extraction of useful content chunks are explained in this sub-section, and the extracted content chunks are then used for the clustering evaluation.
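Because Eqs. (6)–(11) are the standard fuzzy c-means updates, a compact generic implementation can illustrate the iteration behind Algorithm-2. This sketch is not the thesis code: the weighting exponent m, the cluster count c, and the convergence tolerance are caller-supplied, and no sparse-vector optimizations are attempted.

import java.util.Random;

// Generic fuzzy c-means over keyword-frequency vectors, following Eqs. (6)-(11).
// A didactic sketch: no empty-cluster guards or performance tuning.
public final class FuzzyCMeans {

    public static double[][] cluster(double[][] x, int c, double m,
                                     int maxIter, double tol) {
        int n = x.length, dim = x[0].length;
        double[][] u = new double[n][c];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {          // random memberships, rows sum to 1
            double rowSum = 0;
            for (int j = 0; j < c; j++) { u[i][j] = rnd.nextDouble(); rowSum += u[i][j]; }
            for (int j = 0; j < c; j++) u[i][j] /= rowSum;
        }
        double[][] z = new double[c][dim];
        for (int iter = 0; iter < maxIter; iter++) {
            // Eq. (11): centroids as membership-weighted means.
            for (int j = 0; j < c; j++) {
                double denom = 0;
                double[] num = new double[dim];
                for (int i = 0; i < n; i++) {
                    double w = Math.pow(u[i][j], m);
                    denom += w;
                    for (int d = 0; d < dim; d++) num[d] += w * x[i][d];
                }
                for (int d = 0; d < dim; d++) z[j][d] = num[d] / denom;
            }
            // Eqs. (9)-(10): update memberships from Euclidean distances.
            double maxChange = 0;
            for (int i = 0; i < n; i++) {
                double[] dist = new double[c];
                for (int j = 0; j < c; j++) {
                    double s = 0;
                    for (int d = 0; d < dim; d++) {
                        double diff = x[i][d] - z[j][d];
                        s += diff * diff;
                    }
                    dist[j] = Math.sqrt(s) + 1e-12;   // avoid division by zero
                }
                for (int j = 0; j < c; j++) {
                    double sum = 0;
                    for (int k = 0; k < c; k++)
                        sum += Math.pow(dist[j] / dist[k], 2.0 / (m - 1));
                    double newU = 1.0 / sum;
                    maxChange = Math.max(maxChange, Math.abs(newU - u[i][j]));
                    u[i][j] = newU;
                }
            }
            if (maxChange < tol) break;               // converged
        }
        return u;   // fuzzy membership matrix U
    }
}

A call such as cluster(x, 3, 2.0, 100, 1e-4) would partition the keyword-frequency vectors x into three fuzzy clusters using the common choice m = 2.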

Sunday, January 19, 2020

Hobson's Choice - How did Hobson lose control? Essay

Henry Horatio Hobson is one of the principal characters of the play, and his conflict with his daughters, particularly Maggie, provides the basis of the story line. Hobson is a 55-year-old middle-class man with very old-fashioned values. This causes the reader to instantly dislike Hobson, thanks to the language Brighouse uses when exposing Hobson's mannerisms to the audience for the first time. He has been a 'single parent' since his wife's death, and although in a different situation this could have been seen as quite heroic, he is instead shown to be quite the opposite, in the way that he constantly reminds his daughters that he considers them to be uppish and that they have "grown bumptious at a time when they lack a mother's hand."

Throughout the play Hobson is portrayed as a character who wants to be dominant. From as early as act one, Hobson can be seen addressing his daughters' so-called "uppishness": "I'm talking now, and you're listening.... Girls grow bumptious, and must have someone to rule, but I tell you this, you'll none rule me." This shows that Hobson thinks he understands his daughters' actions and believes them to be normal, but the reality is that his daughters are tired of Hobson's ways and want him to allow them some independence. Hobson is portrayed as his daughters' oppressor in the way he describes how Alice and Vickey, avid followers of fashion, dress: "It's immodest." "To hell with the fashion." Hobson shows a lack of understanding or care for his daughters' feelings and is clearly not worried about offending them. His lack of warmth and inability to empathize contribute towards his downfall.

Despite Hobson's many imperfections, he remains in control of his daughters until Maggie sets her mind on marrying Hobson's most skilled worker: the working-class, uneducated son of a "workhouse brat", Willie Mossop. Hobson initially laughs at the idea of the marriage, claiming that he will choose whom his daughters marry: "Didn't you hear me say that I'm doing the choosing when it comes to husbands?" The fact that Maggie goes on to marry Will demonstrates the eventual shift of power in the play. Brighouse is very clever in choosing Hobson's words: rather than having Hobson disagree with the idea in an ordinary way, he demonstrates Hobson's arrogance by having him question Maggie's ability to listen.

Hobson's actions in act three cause the reader to feel a strong sense of irony when Hobson is diagnosed with alcoholism towards the end of act four. In the middle of act three, Hobson can be found warning his daughters never to come home: "Don't you imagine there'll be room for you when you come home crying and tired of your fine husbands. I'm rid of ye and it's a lasting riddance."

In conclusion, the main cause of Hobson's loss of control was that he underestimated his daughters, particularly Maggie. Throughout the play Brighouse uses Hobson as a representation of a middle-class and proud stereotype. Hobson's loss of control is underlined at the end of the play when he is forced to give Will half of his shop and agrees to have no say in the shop's affairs. Brighouse uses irony in the form of the "son of a workhouse brat", Will Mossop.

Saturday, January 11, 2020

International Trade and Trade Restrictions

International trade increases the number of goods that domestic consumers can choose from, decreases the cost of those goods through increased competition, and allows domestic industries to ship their products abroad. While all of these seem beneficial, free trade is not widely accepted as completely beneficial to all parties, and trade restrictions are applied. Trade restrictions can take the form of tariffs, which are taxes on imports; quotas, which are limits on the quantity of a particular good that can be imported or exported; or other trade restrictions. International trade efficiencies, trade restrictions, and the consequences of these restrictions are discussed further below.

World trade offers many advantages to the trading countries: access to markets around the world, lower cost through economies of scale, the opportunity to utilize abundant resources, better access to information about markets and technology, improved quality honed by competitive pressure, and lower prices for consumers (McEachern, 2012, p. 733). Comparative advantage, specialization, and trade allow people to use their scarce resources most efficiently to satisfy their unlimited wants.

Comparative advantage is the ability to make something at a lower opportunity cost than other producers face (McEachern, 2012, p. 32). The ability to make a good at a lower opportunity cost gives that individual, firm, region, or country a comparative advantage. Even if a country has an absolute advantage in all goods, it should specialize in producing the goods in which it has a comparative advantage. If each country specializes and trades according to the law of comparative advantage, everyone benefits from greater consumption possibilities.

McEachern provides three reasons for international specialization: countries have different resource endowments, greater economies of scale can be achieved when firms participate in international trade, and tastes differ from country to country (McEachern, 2012, pp. 719-720). Every country has a comparative advantage in the production of some products. This means that the labor and capital resources available in the region are more productive when focused on a particular industry and are thus able to produce that product better. In the case of the textile industry, Pakistan enjoys a comparative advantage: it has many cotton fields, giving it direct access to the industry's raw material, and it has operated in that industry long enough to develop a large trained workforce. Therefore, the law of comparative advantage dictates that it should produce textile materials.

The World Trade Organization (WTO) is the only global international organization dealing with the rules of trade between nations (WTO, 2012). Its goal is to help producers of goods and services, exporters, and importers conduct their business. The WTO and agreements such as NAFTA open up free trade, allowing goods to move freely and thereby aiding consumers in various countries in terms of prices and quality. Free trade also spawns healthy competition in local industries. Trade restrictions, by contrast, can lead to a breakdown in competition and to adverse effects in the local and international markets.
Restrictions can benefit certain domestic producers that lobby their government for benefits. Congress tends to support the group that fights back, so trade restrictions often persist despite the clear and widespread gains from freer trade. For example, U.S. growers of sugar cane have been protected from imports, which results in an increase in U.S. sugar prices. Higher prices hurt domestic consumers, but they are usually unaware; as McEachern states, "Consumers remain largely oblivious."

Who is responsible for trade ethics? Government might be the initial answer, but all constituents involved should be aware and transparent. An example is Nike, Inc., which provides a statement on understanding how to change the way an industry views its labor force. That change does not happen by monitoring factories alone: monitoring reveals the issues, issues that in turn are locked into a complex web of root causes, and the ability to address these root causes should be shared by many, owned by no single constituent (Nike, Inc., 2013, p. 1). One of Nike's strategies is to transform working relationships with its contracted factories to incentivize change that will benefit workers.

Are trade restrictions effective? Trade protection can foster inefficiencies. The immediate cost of such restrictions includes not only the welfare loss from higher domestic prices but also the cost of resources used by domestic producer groups to secure the favored protection (McEachern, 2012, p. 732). These costs may become permanent if the industry never realizes the economies of scale and never becomes competitive. Protecting one stage of production usually requires protecting downstream stages of production as well. The biggest problem with imposing trade restrictions is that other countries usually retaliate, which shrinks the gains from trade. Some experts believe the costs of protecting the jobs of workers in vulnerable industries, ultimately borne by taxpayers or consumers, far exceed the potential cost of retraining and finding new jobs for those workers (Globalization 101, 2013, para. 1). In addition, protection may not push firms and industries to make the changes necessary to challenge foreign competition and find efficiencies, leaving them even more dependent on government protection.

As international trade has increased, conflicts over trade have also increased, and trade restrictions may continue to be very political in nature. The more companies like Nike and consumers become aware of ethical behavior around international trade, the more everyone will benefit. The U.S. government does take responsibility for workers who lose their jobs to international trade and has programs established to provide training and support to re-employ those workers. As countries specialize and trade according to the law of comparative advantage, consumers should also benefit from efficient production and cheaper prices. Advances in technology may increase the speed at which international trade and its efficiencies happen.

References

Globalization 101 (2013). The Levin Institute. Consequences of trade restrictions. http://www.globalization101.org/consequences-of-trade-restrictions/
McEachern, W. A. (2012). Economics, 9e (9th ed.). Mason, OH: South-Western.
Nike, Inc. (2013). Responsibility: Targets and performance. http://www.nikeresponsibility.com/report/content/chapter/targets-and-performance#Labor
World Trade Organization (2013). What is the WTO? http://www.wto.org/english/thewto_e/whatis_e/whatis_e.htm

Friday, January 3, 2020

A Basic Guide to the NCAA for Your Children

If you're the parent of a student-athlete, you've probably heard the term NCAA. The NCAA, or National Collegiate Athletic Association, is the governing body that oversees 23 different sports and athletic championships at 1,200 colleges and universities in the United States. It stresses a well-rounded student who excels at sports as well as academics and campus life.

Recruitment for the NCAA

The point at which parents and the NCAA usually intersect is during college recruitment. High school athletes who want to play college ball (or track, swimming, etc.) at a Division I, II or III school must register with the NCAA through its online eligibility center. If your child is interested in playing sports at the college level, his counselor and coach can help him navigate that path.

Divisions I, II, and III

Schools that are part of the NCAA are divided into Division I, II and III schools. Each of these divisions reflects the relative priority of sports and academics. Division I schools generally have the largest student bodies, as well as the largest budgets and scholarships for sports; 350 schools are classified as Division I, and 6,000 teams belong to those schools. Division II schools strive to provide student-athletes with a high level of athletic competition while also maintaining high grades and a well-rounded campus experience. Division III schools also provide opportunities for student-athletes to compete and participate athletically, but the primary focus is on academic achievement; this is the largest division in both total participants and number of schools.

NCAA Sports by Season

Fall Sports

The NCAA offers six different sports for the fall season. Arguably the most popular collegiate sport overall is football, which takes place during the fall. Even so, the fall season offers the fewest sports of the three seasons, as more sports take place during both the winter and spring. The six sports offered by the National Collegiate Athletic Association for the fall season are:

Men's and women's cross-country
Field hockey
Football
Men's and women's soccer
Women's volleyball
Men's water polo

Winter Sports

Winter is the busiest season in college sports. The NCAA offers ten different sports during the winter season:

Men's and women's basketball
Bowling
Fencing
Men's and women's gymnastics
Men's and women's ice hockey
Men's, women's and mixed rifle
Men's, women's and mixed skiing
Men's and women's swimming and diving
Men's and women's indoor track and field
Wrestling

Spring Sports

Eight separate sports are offered during the spring season; out of those eight, seven are available to both men and women, with the spring season offering baseball for men as well as softball for women. The eight sports offered by the National Collegiate Athletic Association for the spring season are:

Baseball and softball
Men's and women's golf
Men's and women's lacrosse
Rowing
Men's and women's tennis
Men's and women's outdoor track and field
Men's volleyball
Women's water polo