ASDM Assignment: Data Mining Using SAS and R (Assignment Sample)

Table of Contents

1. Introduction - ASDM Assignment Data Mining Using SAS and R
Background:
Research Questions
Justification
2. Aim and Objective of the task
3. Literature Review
4. Explanation and preparation of the dataset
Dataset description
Dependent and independent variables
Data pre-processing steps
5. Task: Classification/ Association Rules/ Clustering Task/ Text Mining
R_Code
SAS Results
6. Results Analysis and Discussion
Result Comparison between R and SAS and Critical Findings
7. Conclusion
8. References

1. Introduction - ASDM Assignment Data Mining Using SAS and R

SAS Enterprise Miner organises analysis work into projects and process flow diagrams. A project holds the process flow diagrams for an application, and each diagram contains its own analysis. A single diagram captures the analysis process for one data set only. To create a project in SAS Enterprise Miner, go to File, then New, then Project. Then set the project name and the location where the project will be saved, and the project file is created.

Background:

Data mining is the extraction of useful information or analysis reports from large amounts of data. With data volumes growing rapidly, data analysis and data management have become crucial. Data mining techniques are used to discover previously unknown patterns, and the resulting analysis reports are used in business, government, and research. Data analysis using different algorithms is now part of many prediction systems. Techniques such as the K-nearest neighbour algorithm, clustering, text mining, association rules, and sentiment analysis are commonly used for data mining (Ccsc, 2021). In addition, the applications of data mining have changed over the years. The finance industry, the retail industry, and telecommunications are key areas where data mining techniques are implemented. Data mining techniques also face uncertainty when data are integrated (Ccsc, 2021). Uncertainty in data is represented by the difference between the real data and the recorded data, so it needs to be calculated for better data analysis. All these aspects are crucial for data analysis, which is why this topic has been chosen for the research.

Research Questions

Research questions for the report can be described as:

What is the need for implementing data analysis models?
What is the significance of data visualization?
Why is the visualization of analysis results important?
What are the significant outcomes of critical analysis of results derived from R programming and SAS Enterprise Miner?

Justification

The topic and the datasets have been chosen based on the need for data analysis. The report includes the implementation of classification models such as K-nearest neighbours, clustering, association rules, and finally sentiment analysis. Hence, a tourist accommodation review dataset and a London cases dataset have been chosen. The tourist data has been chosen for the sentiment analysis, because the sentiment of the reviews needs to be identified. The other dataset has been chosen for the implementation of the other data mining models: KNN, clustering, and association rules. These datasets were the best fit for meeting these criteria.

2. Aim and Objective of the task

Aim

The aim of this report is to gain a deep understanding of the R programming language and SAS Enterprise Miner while implementing classification models, association rules, clustering, and sentiment analysis.

Objective

Objectives of this report are:

To gain a deep understanding of different data analysis models
To visualize data
To visualize analysis results
To critically analyse the results

3. Literature Review

The application is implemented using the following four algorithms:

K-Nearest Neighbour- This is one of the simplest algorithms used in R and SAS, and it belongs to supervised learning. It assumes that similar data points lie close to each other and classifies a new point according to its nearest neighbours. According to Patro et al. (2020), recommendation systems have been widely used in various areas of commerce, and tremendous growth has followed their implementation. Recommendation systems rely on the successful use of various machine learning algorithms and data mining techniques, and recommendations are made on the basis of the analysis. Sentiment analysis is a crucial part of recommendation systems because it helps to identify the positive and negative aspects of products and helps the system decide whether a particular product should be recommended further (Patro et al., 2020). The paper used the K-nearest neighbour algorithm and hybrid filtering to build a recommendation system with higher accuracy. The proposed recommender system is evaluated experimentally using measures such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). According to the paper, the user behaviour matrix depends on the available user behaviour data. Hence, identifying customer preferences matters greatly for the accuracy of the recommender system (Patro et al., 2020), and the paper suggests multi-dimensionality in recommendation systems.
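The nearest-neighbour idea described above can be illustrated with a minimal Python sketch. It is not the code used in the report: the data points, labels, and the function name knn_predict are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in two dimensions
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
```

The choice of k is a tuning decision: a small k follows local noise, while a large k smooths over class boundaries.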

Association rules mining- Association rules mining is used to find associations and relations between different sets of data. It can also show the item sets that occur together in transactions. The paper by Rekik et al. (2018) starts from the observation that society is deeply affected by social content. It uses text mining and association rules to identify website quality criteria. Website quality assessment involves multiple-criteria decision-making issues and requires a reduction phase (Rekik et al., 2018). The paper applies the Apriori algorithm to generate association rules from the text-mining output and builds a network graph from the output of the Apriori algorithm. The system highlights the main findings, a systematic literature review, and an assessment-based analysis of website quality. The paper focuses on the key objective of exploratory studies in the domain while filtering the essential areas in the data. Hence, the paper extracts only the data needed for the research and uses these data to apply association rules. Moreover, the paper raises some fundamental questions about data interpretation and analysis, such as why data analysis is essential, which aspects can be assessed, and how a website can be assessed. Hence, this paper is very relevant for analysis purposes (Rekik et al., 2018).
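Association rules are usually ranked by support and confidence. The following Python sketch shows how these two measures (and lift) can be computed for a single rule. The transactions and function names are hypothetical, and the sketch does not reproduce the Apriori search itself, only the measures it relies on.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs): support of the whole rule over support of the left-hand side."""
    # Assumes the left-hand side occurs at least once (otherwise this divides by zero)
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence relative to how often the right-hand side occurs overall."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# Toy transactions (hypothetical)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 2/3, about 0.667
```

A lift below 1, as in this toy rule, means the left-hand side makes the right-hand side slightly less likely than its overall frequency would suggest.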

K-Means Clustering- K-means clustering is an unsupervised learning method that groups a set of objects into different clusters. K-means is a clustering algorithm that follows simple iterations. It uses a distance metric to divide the given dataset into K classes and calculates the mean of each class (Yuan and Yang, 2019). An advantage of K-means is that it is unsupervised: the procedure finds groups within the dataset without labelled data.

Among the many clustering algorithms, K-means is one of the most widely used, and it is often recommended because it gives results with high accuracy. However, the choice of the K value directly affects the convergence result. Yuan and Yang (2019) address this issue and examine four K-value selection methods: the Elbow method, the Gap Statistic, Canopy, and the Silhouette Coefficient method. The paper then verifies the evaluated results and discusses the advantages and disadvantages of each method for selecting the K value. It presents the clustering output of each method and reports the results as a table of execution time and accuracy rate. The paper aims to use real-world multidimensional data in further research.
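The iteration that K-means follows can be sketched in a few lines of Python. This is an illustrative version of the standard assignment and update loop with hypothetical data and fixed starting centroids. It is not the implementation evaluated by Yuan and Yang (2019) or used in this report. The wcss function computes the within-cluster sum of squares, the quantity that the Elbow method plots against K.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate the assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point
        assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

def wcss(points, centroids, assignment):
    """Within-cluster sum of squared distances, used by the Elbow method."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))

# Toy data (hypothetical): two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, centroids=[(0, 0), (10, 10)])
print(assignment)
print(wcss(points, centroids, assignment))
```

Running kmeans for several values of K and plotting wcss against K gives the curve whose bend (the "elbow") suggests a reasonable K.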

Sentiment Analysis- Sentiment analysis is the contextual mining of text to identify subjective information that a business needs in order to grow. According to Zvarevashe and Olugbara (2018), sentiment analysis is a crucial form of data analysis today. In every sector, and especially in business, sentiment analysis is widely used to recommend products to customers and to increase sales. The paper's methodology includes building an intuition-based model and building a sentiment polarity-based model. The intuition-based model (IBM) is based on the collection of feedback and responses from customers and guests who have stayed in hotels (Zvarevashe and Olugbara, 2018). The paper applied data transformation techniques and filtering, which, according to the research, makes the data much easier to use with classification models and makes the training and testing data easier to feed into the final model. The paper also implements sentiment polarity-based models, beginning with the elicitation of opinion. According to the paper, the research showed that the IBM relies heavily on human intervention and is consistent in nature. Hence, the paper presents a sentiment analysis system that learns automatically. The proposed system tries to keep the sentences that are correctly labelled by filtering out false information. The experiments in the study were carried out using the Naive Bayes algorithm, which gave better results than the other classification models used in the implementation (Zvarevashe and Olugbara, 2018). Hence, according to this paper, Naive Bayes can be a good choice for this kind of data analysis.
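To make the Naive Bayes approach mentioned above concrete, the following Python sketch trains a small multinomial Naive Bayes classifier with add-one smoothing on a handful of invented hotel reviews. The reviews, labels, and function names are hypothetical. It is a sketch of the general technique, not of the system in Zvarevashe and Olugbara (2018).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (text, label) pairs. Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training reviews (hypothetical)
reviews = [
    ("clean room friendly staff", "positive"),
    ("great location friendly staff", "positive"),
    ("dirty room rude staff", "negative"),
    ("noisy room bad service", "negative"),
]
model = train_naive_bayes(reviews)
print(predict(model, "friendly staff great room"))
```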

4. Explanation and preparation of the dataset

Dataset description

A dataset is a collection of data, that is, the values of a set of variables, such as the height and weight of an object. A dataset may consist of several files. The tourist accommodation review dataset and the London cases dataset have been chosen for implementing the data mining techniques. The tourist review dataset includes columns for the hotel location, hotel name, review date, and ID, together with the review text, on which the sentiment analysis has been based. The other dataset, phe_cases_age_london.csv, includes columns for the area name, area code, date, number of cases, rolling_rate, age_band, age_lower, age_higher, and population. Among these, cases and population have been the key columns for all the implementations. KNN, clustering, and association rules have been implemented on this dataset.

Dependent and independent variables

An independent variable is one whose value does not depend on the other variables used in the application. A dependent variable is one whose value changes in response to other variables. In the tourist dataset, the review text is the dependent variable: it depends on the hotel name and its services, and then on the location. In the other dataset, population depends only on the location and is otherwise independent, whereas the number of cases depends on both the area and the population. These are the dependencies in the two datasets.

Data pre-processing steps

The following steps are used to pre-process the data:

The data needs to be handled properly before it is imported.
Missing data must be handled.
Categorical data is encoded so that it can be used to produce the output.
The dataset is split into training and test sets.
Feature scaling is applied to the dataset.
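The pre-processing steps listed above can be sketched in Python as follows. The helper names and the sample values are hypothetical. Only the column names cases and age_band come from the dataset description, and mean imputation, label encoding, and min-max scaling are assumed choices, since the report does not state which variants were used.

```python
import random
import statistics

def impute_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.fmean(known)
    return [mean if v is None else v for v in values]

def label_encode(categories):
    """Map each distinct category to an integer code."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

def min_max_scale(values):
    """Rescale values to the range [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]

# Toy columns (hypothetical values)
cases = [10, None, 30, 20]
age_band = ["0_4", "5_9", "0_4", "10_14"]

imputed = impute_mean(cases)              # the missing value becomes 20.0
codes, mapping = label_encode(age_band)   # categories become integer codes
scaled = min_max_scale(imputed)           # values rescaled to [0, 1]
train, test = split_train_test(list(zip(scaled, codes)))
print(len(train), len(test))
```

Scaling matters most for distance-based methods such as KNN and K-means, where a column with a large range (for example population) would otherwise dominate the distance.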

5. Task: Classification/ Association Rules/ Clustering Task/ Text Mining

R_Code

Explanation about the methods and their results

Task 3: K Means Clustering

Figure 7: K means Clustering with 4 clusters

The above figure shows the clustering with 4 clusters. The cases and population columns of the dataset have been used for this clustering.

Figure 8: Aggregate Values

After the clustering process is done, the aggregate values for each cluster are shown. The mean value has been calculated per cluster using the aggregate function. The cbind function in R combines vectors, data frames, or matrices by columns. In the above figure, the result of the cbind function has been stored in the dd variable and the result is shown by cluster.
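The same idea, attaching a cluster label to each row and then averaging each column within a cluster, can be sketched in Python. The rows and cluster labels below are hypothetical and the function name mean_by_cluster is an illustration, not part of the report's R code.

```python
from collections import defaultdict
from statistics import fmean

def mean_by_cluster(rows, cluster_ids):
    """Group rows by cluster label and return the column means of each group."""
    groups = defaultdict(list)
    # Pairing each row with its label plays the role of combining the columns
    for row, cluster_id in zip(rows, cluster_ids):
        groups[cluster_id].append(row)
    return {
        cluster_id: tuple(fmean(column) for column in zip(*groups[cluster_id]))
        for cluster_id in sorted(groups)
    }

# Toy rows of (cases, population) with assumed cluster labels
rows = [(10, 1000), (20, 3000), (100, 9000), (300, 11000)]
cluster_ids = [1, 1, 2, 2]

cluster_means = mean_by_cluster(rows, cluster_ids)
print(cluster_means)
```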

Figure 9: Clustering Output

The above figure represents the plotting based on cluster output. The population and cases have been plotted on the y and x-axis respectively.

Task 4: Sentiment Analysis

Figure 10: Frequency of most frequent words

The above figure shows the frequencies of the most frequent words in the review text. The five most frequently used words have been printed with their respective frequencies.
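Counting the most frequent words can be sketched in Python as below. The reviews and the stop-word list are hypothetical and much smaller than anything a real text-mining package would use.

```python
import re
from collections import Counter

# A tiny, assumed stop-word list; real analyses use a much longer one
STOP_WORDS = {"the", "a", "and", "was", "were", "is", "in", "of", "to", "but"}

def top_words(reviews, n=5):
    """Return the n most frequent non-stop-words with their counts."""
    words = []
    for review in reviews:
        tokens = re.findall(r"[a-z']+", review.lower())
        words.extend(w for w in tokens if w not in STOP_WORDS)
    return Counter(words).most_common(n)

# Toy reviews (hypothetical)
reviews = [
    "The room was clean and the staff were friendly",
    "Friendly staff, clean room, great location",
    "The location was great but the room was small",
]

for word, count in top_words(reviews):
    print(word, count)
```

The same word counts are what a bar plot of frequent words or a word cloud is drawn from.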

Figure 11: Plotting Most Frequent Words

The above figure shows the plotting for the top 5 most used words based on their frequencies. This plot is done for the visualization of data.

Figure 12: Word Cloud using analysis

The above figure shows the word cloud based on the frequent words. This is another form of data visualization. The word cloud has been created to give a better idea of the words used beyond the top five only.

Figure 13: Sentiment Analysis Result

The above figure shows the sentiment analysis result for the dataset. It is a bar plot of the sentiments anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, with each emotion shown in a different colour. According to the analysis, the words in the dataset express three types of emotion: joy, sadness, and trust. Of these three, trust and joy occur most often, and sadness has a very low count on the graph. Hence, the results indicate that the text is predominantly positive, with very little negativity.
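Emotion counts of this kind are typically obtained by looking each word up in an emotion lexicon and tallying the emotions it is tagged with. The Python sketch below illustrates that idea with a tiny, invented lexicon. The lexicon entries and reviews are assumptions and do not come from the report or from any published lexicon.

```python
from collections import Counter

# Hypothetical mini-lexicon mapping words to the emotions they express
EMOTION_LEXICON = {
    "friendly": ["joy", "trust"],
    "helpful": ["trust"],
    "lovely": ["joy"],
    "dirty": ["disgust"],
    "disappointing": ["sadness"],
}

def emotion_counts(texts):
    """Tally how often each emotion is triggered by the words in `texts`."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            counts.update(EMOTION_LEXICON.get(word, []))
    return counts

# Toy reviews (hypothetical)
texts = ["friendly and helpful staff", "lovely view", "disappointing breakfast"]
counts = emotion_counts(texts)
print(counts)
```

In this toy example joy and trust each occur twice and sadness once, which mirrors the pattern of a mostly positive text described above.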

SAS Results

Task 1: K-Nearest Neighbours

Figure 14: Score Distribution Chart Plot

The above figure shows the score distribution chart that has been created as a result of implementing K-Nearest Neighbours in SAS Enterprise Miner.

Figure 15: Train and Validate Statistics

The above figure represents the train and validation statistics produced by K-Nearest Neighbours in SAS Enterprise Miner. The graph plots both statistics against depth, depending on their availability in the dataset.

Figure 16: Statistics for the Training, Validation, and Test Dataset

The above figure shows the full statistics of the training, validation, and test datasets, which are also part of the output of K-Nearest Neighbours in SAS Enterprise Miner. The figure shows the values for the train, validation, and test records in the form of a table.

Figure 17: Mean Values

The above figure shows the mean values of the K-Nearest Neighbours algorithm application. Values for depth, number of observations, target, and predicted values have been calculated and shown through the above table.

Figure 18: Assessment Score Distribution

The above figure for assessment score distribution shows the values for the range of prediction, target mean values, predicted mean values, number of observations, and model score. For the analysis, data scores have been set to train values, and variables have been set to cases.

The above figure shows the model score distribution values for K-Nearest Neighbours in SAS Enterprise Miner. Values for the range of prediction, target mean values, predicted mean values, the number of observations, and model score are shown in this figure. For the analysis, the data role has been set to validate, the variable has been set to cases, and the target level has been set to Nil.

Task 2: Association Rules Mining

Figure 20: Statistics Plot

The above figure represents the statistics plot for the association rules mining technique. The plot has been created from the confidence and support values.

Figure 21: Statistics Line Plot

The above figure shows the statistics line plot based on the confidence and support frequency calculation.

Figure 22: Rule Matrix

The above figure shows the rule matrix that has been created as a result of applying association rules to the dataset. The x-axis of this matrix represents the right-hand side of the rule, and the y-axis shows the left-hand side of the rule.

Figure 23: Rule Statistics

The above figure shows the rule statistics based on mean values. The statistics show values for variable, level, minimum, maximum, and mean.

Figure 24: Sequence Report

The above figure shows the sequence report using association rules. The Report represents values for frequency, percent, cumulative frequency, and cumulative percent.

Task 3: K Means Clustering

Figure 25: Segment Plot For Each Variable

The above figure shows the segment plot for variables that have been used for K means clustering or simply clustering.

Shiny dashboard

Figure 26: Shiny dashboard

The above figure shows the Shiny dashboard in RStudio, where the population attribute has been represented graphically. Changing the number of observations changes the graphical representation.

Figure 27: Segment Size

The above figure shows the pie chart for segment size plotting. This plot has been created based on the size of each segment that has been identified and used in the clustering.

Figure 28: Mean Statistics

Figure 29: Mean Statistics (contd.)

Figure 30: Mean Statistics (contd.)

Figure 31: Mean Statistics (contd.)

Figure 32: Mean Statistics (contd.)

Figure 33: Mean Statistics (contd.)

The above figures, from Figure 28 to Figure 33, show the mean statistics for each variable used for clustering.

Figure 34: Optimum Number of Clusters

The above figure represents the optimum number of clusters. The matrix includes values for the number of clusters and values for the cubic clustering criterion.

The SAS output for K-means clustering includes the segment plot, segment size, mean statistics, and the optimum number of clusters. The optimum values have come out as 27.42 and 31.32. In addition, the maximum variable importance has come out as 1.0. Hence, there is a difference between the R programming output and the SAS implementation.

Figure 35: Variable Importance

Figure 35 shows the importance of each variable listed in the variable name column. For each variable, the number of splitting rules, the number of surrogate rules, and the importance are shown.

Task 4: Sentiment Analysis

Figure 36: Cluster Frequencies

The above figure shows the plotting for frequencies of clusters or words based on the application of sentiment analysis.

Figure 37: Distance Between Clusters

The above figure represents the plots for the distances between clusters.

Figure 38: Cluster Frequency By RMS

The above figure shows the cluster frequency based on the results of the implementation of RMS.

Figure 39: Text Cluster Diagram: Sentiment Analysis by Unsupervised Learning

The above figure shows the text cluster diagram that has been based on sentiment analysis. An unsupervised learning algorithm has been used for this analysis. Based on the analysis result, cluster, frequency, percent, cumulative frequency, and cumulative percent have been shown in the above figure.

6. Results Analysis and Discussion

Result Comparison between R and SAS and Critical Findings

Results from both analyses include charts and numbers. In the SAS output, results are presented as charts, graphs, and numbers. In the R programming output, KNN gave an accuracy score of approximately 26%. The second algorithm produced its results in R as plots of all the points. In SAS, the output showed the cumulative percentage and count. In both cases, sentiment analysis produced graphs and charts.

When the K-Nearest Neighbour algorithm was implemented in R, the result was reported as an accuracy score of 26.15%, which is not a very impressive accuracy. In SAS, the results were presented as charts for distributions, training data, and many other aspects. In addition, the model score chart presents different measures such as mean values and prediction range. Hence, the model score is the final output, and there is no single output value in the case of the SAS application.
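An accuracy score such as the 26.15 quoted above is the percentage of test records whose predicted class equals the actual class. A minimal Python sketch, using made-up labels rather than the report's data:

```python
def accuracy(actual, predicted):
    """Percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100 * correct / len(actual)

# Toy labels (hypothetical): one of four predictions is correct
actual = ["low", "high", "medium", "high"]
predicted = ["low", "low", "high", "medium"]

print(accuracy(actual, predicted))  # 25.0
```

With several classes, a low accuracy should be compared against the share of the largest class, which is the score a trivial classifier would achieve.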

When applying association rules, R programming produced the result as a plot. In R, the association rules were implemented using the Apriori algorithm. That is not the case in SAS, where a dedicated tool implements association rules and presents the results as a rule matrix, a statistics plot, and a statistics line plot. In the rule statistics, the values for mean, minimum, and maximum are shown. The sequence report, which is the final result in SAS, shows frequency, percent, cumulative frequency, and cumulative percent. The maximum frequency is 198, its percent value is 99.00, and the cumulative frequency and cumulative percent are 200 and 100 respectively.
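The four columns of such a report are related by simple arithmetic, which the Python sketch below works through. The split into counts of 198 and 2 is an assumption chosen to be consistent with the totals quoted above, and the category names are hypothetical.

```python
def frequency_table(counts):
    """Build frequency, percent, cumulative frequency, and cumulative percent rows."""
    total = sum(counts.values())
    rows, running = [], 0
    for value, freq in counts.items():
        running += freq
        rows.append({
            "value": value,
            "frequency": freq,
            "percent": 100 * freq / total,
            "cumulative_frequency": running,
            "cumulative_percent": 100 * running / total,
        })
    return rows

# Assumed counts: 198 out of 200 records fall in the first category
table = frequency_table({"category_1": 198, "category_2": 2})
for row in table:
    print(row)
```

The first row gives a percent of 99.0, and the last row reaches a cumulative frequency of 200 and a cumulative percent of 100.0, matching the pattern in the report.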

Clustering is implemented differently in R programming and SAS. In R, the result is presented as a cluster plot, and the cluster values for cases and population are numeric. The cases cluster values include three negative values and one positive floating-point value, while the population cluster values include two negative and two positive values. The SAS output for K-means clustering includes the segment plot, segment size, mean statistics, and the optimum number of clusters. The optimum values have come out as 27.42 and 31.32. In addition, the maximum variable importance has come out as 1.0. Hence, there is a difference between the R programming output and the SAS implementation.

The last implementation, sentiment analysis, also gives different results in the two implementations. In R, the sentiment analysis result consists of a plot of the most frequently used words and the sentiment of the sentences. Emotions such as anger, joy, sadness, and trust have been plotted using R. According to the result, joy and trust have the highest values, and sadness has a very low score. Hence, most of the data is positive according to the analysis.

These are the results and critical analyses of the output of both R programming and SAS. The outputs and the models differ between the two: results range from numeric values to plots and charts, and each implementation provides a different representation of the values. The critical analysis has been based on these outputs.

7. Conclusion

The report focuses on the implementation of different algorithms on two different datasets using R programming and SAS Enterprise Miner. A deep understanding of R and SAS Enterprise Miner has been developed while implementing and working on these platforms. KNN, K-means clustering, sentiment analysis, and association rules have been implemented on both platforms and the results have been analysed. The R results are mostly numeric values, whereas the SAS results are mostly graphs and charts, so the two differ in both results and visualization. The report helped to understand the different methods in detail, and the use of clustering, KNN, association rules, and sentiment analysis on datasets has been researched in depth. Because the report required deep research into previous work, research papers from the last five years have been used in its development. Hence, the report is the result of deep research and the implementation of different algorithms using R programming on the RStudio platform and SAS Enterprise Miner.

8. References

Ccsc. 2021. Data mining. https://www.ccsc.org/southcentral/E-Journal/2010/Papers/Yihao%20final%20paper%20CCSC%20for%20submission.pdf
Patro, S.G.K., Mishra, B.K., Panda, S.K., Kumar, R., Long, H.V., Taniar, D. and Priyadarshini, I., 2020. A hybrid action-related K-nearest neighbour (HAR-KNN) approach for recommendation systems. IEEE Access, 8, pp.90978-90991.
Rekik, R., Kallel, I., Casillas, J. and Alimi, A.M., 2018. Assessing web sites quality: A systematic literature review by text and association rules mining. International journal of information management, 38(1), pp.201-216.
Yuan, C. and Yang, H., 2019. Research on K-value selection method of K-means clustering algorithm. J, 2(2), pp.226-235.
Zvarevashe, K. and Olugbara, O.O., 2018, March. A framework for sentiment analysis with opinion mining of hotel reviews. In 2018 Conference on information communications technology and society (ICTAS) (pp. 1-4). IEEE.