Report On Applying The Data Science Lifecycle Assignment




Introduction


A data science lifecycle is a repeatable sequence of data science activities used to complete a task or investigation, and most data science projects follow the same fundamental stages. For this report, two UK counties, Norfolk and Suffolk, have been selected in order to find places of residence. The main objective is to carry out a statistical analysis that supports a recommendation of local authority districts within these two counties. The recommendations are based primarily on the following criteria: housing prices, crime levels in the region, and the quality of nearby schools. The districts of the two counties are listed below.

The comparisons and attribute trade-offs between the districts are set out in detail throughout the report, following the five stages of the data science lifecycle: “obtain, scrub, explore, model, and interpret”.

Obtain

The datasets have been obtained from the websites of UK central government and public bodies. Data for the two counties, Norfolk and Suffolk, have been taken in order to compare the districts within them. The Norfolk dataset consists of 12 columns, each a variable describing the situation in the districts of Norfolk, and 5926 rows of records (Engin & Treleaven, 2019). The Suffolk dataset has the same 12 columns and 4637 rows. The main variables used to find the best district in each county are the LSOA code and the Crime ID. The other variables are the month in which the crime occurred, the body that reported the crime, the force within whose area the crime falls, the longitude and latitude, the LSOA code identifying the geographical area together with the corresponding names of the districts, the type of crime, the category of the last outcome, and finally the context, which is an empty field in each dataset.

Every dataset has been collected from “https://data.police.uk/data/”. The site provides CSV files containing crime, outcome, and stop-and-search information supplied by the police forces, with locations reported by “Lower-layer Super Output Area (LSOA)”. The crime data for Norfolk and Suffolk cover December 2021. The crime data help to show how safe each district is as a place of residence.
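As an illustration of the obtain stage, the following is a minimal sketch of reading one crime file and checking its dimensions. It is written in Python with the standard library only, not the R used in the report. The two sample records, the LSOA codes, and the function names load_crimes and shape are hypothetical, and the header row is an assumption based on the street-level crime files published on data.police.uk.

```python
import csv
import io

# Hypothetical two-record sample; the header is assumed to match the
# street-level crime CSV layout (12 columns, the last one empty).
SAMPLE_CSV = """Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Context
abc123,2021-12,Norfolk Constabulary,Norfolk Constabulary,1.29,52.63,On or near Example Street,LSOA_EXAMPLE_1,Norwich 001A,Burglary,Under investigation,
def456,2021-12,Norfolk Constabulary,Norfolk Constabulary,1.30,52.64,On or near Sample Road,LSOA_EXAMPLE_2,Norwich 002B,Vehicle crime,Under investigation,
"""


def load_crimes(text):
    """Parse CSV text into a list of dicts, one per crime record."""
    return list(csv.DictReader(io.StringIO(text)))


def shape(rows):
    """Return (row count, column count), similar to dim() in R."""
    return (len(rows), len(rows[0]) if rows else 0)


rows = load_crimes(SAMPLE_CSV)
print(shape(rows))
```

For a downloaded file, the same parser could be fed the file contents instead of the sample string; the row and column counts can then be compared with the figures quoted in the Obtain section.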

Scrub

Data scrubbing, also called data cleaning or data cleansing, is the practice of going through the information in a dataset and removing or correcting records that are missing, wrong, incorrectly structured, redundant, or unnecessary (Kaab et al., 2019). Data cleansing often involves tidying up information that has accumulated in a single location. Although it may include removing records, it is mainly concerned with updating, amending, and combining information so that systems are as efficient as possible (Farley, Dawson, Goring, & Williams, 2018). The cleansing procedure is generally completed in a single pass and can take a long time if the data have been accumulating over years. Data cleaning is crucial because it improves data quality and, as a result, overall effectiveness. Once obsolete or erroneous records have been removed, only the higher-quality data remain.

The datasets collected from the UK police forces site were first imported into RStudio. While continuing with the other tasks, it became clear that the data were not yet suited to statistical analysis (Matheus, Janssen, & Maheshwari, 2020). Both datasets had issues that caused errors in RStudio. There were numerous empty cells and duplicate cells, which were cleaned manually in Microsoft Excel. Empty cells contribute no records to the statistical analysis, so as the first stage of scrubbing the empty rows were removed from the Excel files.
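The report carried out this step by hand in Excel. As a hedged alternative, the same two operations (dropping rows with empty key cells and dropping duplicate rows) can be sketched programmatically. The function name scrub, the choice of required columns, and the sample records are assumptions made for illustration only.

```python
def scrub(rows, required=("Crime ID", "LSOA code")):
    """Drop rows with an empty required field, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip records where any required field is missing or blank.
        if any(not (row.get(col) or "").strip() for col in required):
            continue
        # The tuple of all values (in a fixed column order) identifies
        # an exact duplicate of an earlier record.
        key = tuple(row.get(col, "") for col in sorted(row))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical records: one has an empty cell and one is a duplicate.
raw = [
    {"Crime ID": "a1", "LSOA code": "L1"},
    {"Crime ID": "", "LSOA code": "L1"},
    {"Crime ID": "a1", "LSOA code": "L1"},
    {"Crime ID": "b2", "LSOA code": "L2"},
]
print(scrub(raw))
```

Doing the cleaning in code rather than by hand has the advantage that the same steps can be re-run unchanged on the Suffolk file.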

Two variables need to be extracted from each dataset for the statistical analysis: the LSOA code and the Crime ID (Nabavi-Pelesaraei et al., 2018). While the model was being fitted, various errors arose in the output because the string values of the LSOA code and Crime ID variables were not recognized in either the Norfolk or the Suffolk dataset.

To resolve these errors, a further round of data scrubbing was carried out (Crawford et al., 2018), which completed the data pre-processing. The analysis tool could not handle string values of length 70. Hence, the Crime ID values were replaced with consecutive integers. The LSOA code values were likewise replaced with arbitrary integers, with the same integer assigned to every record that shares a district name.
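The recoding described above can be sketched as follows. This is an illustration under assumptions, not the procedure used in the report: the function names are hypothetical, and the integer codes are assigned in order of first appearance rather than at random, which still gives every repeated district name the same integer.

```python
def encode_consecutive(values):
    """Replace each value with its 1-based position (1, 2, 3, ...)."""
    return list(range(1, len(values) + 1))


def encode_labels(values):
    """Map each distinct label to an integer, reusing it for repeats."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # First time this label is seen: give it the next integer.
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded


# Hypothetical inputs standing in for the Crime ID and district columns.
crime_ids = ["9f3a", "17bc", "e2d0", "51aa"]
districts = ["District A", "District B", "District A", "District C"]

print(encode_consecutive(crime_ids))
print(encode_labels(districts))
```

Note that integers assigned this way carry no ordering or magnitude of their own, which should be kept in mind when they are later used as a numeric predictor.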

Explore

The datasets are now ready for analysing the statistics of the districts of Norfolk and Suffolk using the relevant variables from each scrubbed dataset (Zou, Xu, Sanjayan, & Wang, 2018). Statistical analysis of this kind can be done in Python, in R, or in a combination of the two. In this report, the analysis has been done in R. R is a programming language and software environment for statistical data analysis, graphics, and reporting (Attwood et al., 2019). R is freely available under the "GNU General Public License", and pre-compiled binary versions are provided for several operating systems, including Linux, Windows, and macOS. It can be used to clean, analyse, and visualize data, and it is widely used by scientists in many fields to evaluate and present results, as well as by lecturers in statistics and research methods.

Linear regression has been used for the statistical analysis in R. Linear regression is a statistical model that examines the relationship between a dependent variable (the y variable) and one or more independent variables (the x variables). No single method can solve all problems (Smetana, Schmitt, & Mathys, 2019), and linear regression assumes that the relationship between the dependent variable and the explanatory variables is linear, which means that a straight line can be drawn to relate them. In this scenario, a relationship is modelled between two variables of the datasets, the Crime ID and the LSOA code. Hence, the Crime ID for a district can be estimated if the LSOA code of that district is known.

Crime ID = a + LSOA code * b

Here, ‘a’ denotes the intercept and ‘b’ denotes the slope. The intercept ‘a’ is the predicted Crime ID when the LSOA code is zero, and the slope ‘b’ is the change in the predicted Crime ID for each unit increase in the LSOA code.
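To make the roles of ‘a’ and ‘b’ concrete, the following minimal sketch estimates both by ordinary least squares. It is written in Python rather than R, the function name fit_line is hypothetical, and the four data points are made up so that the answer is easy to check by hand; they are not taken from the Norfolk or Suffolk data.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Made-up points lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)
```

The slope is the ratio of the co-variation of x and y to the variation of x, and the intercept is then chosen so that the line passes through the point of means.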

RStudio has been used to carry out the statistical analysis. The aim of RStudio is to provide an open environment for data analysis, scientific research, and technical communication (Davis, 2020). This is intended to improve the creation and use of knowledge by everyone, regardless of financial resources, and to enable collaboration through reproducible research, all of which are vital to the integrity and effectiveness of work in science, education, government, and business. First, the readxl library was loaded, which allows Excel files to be read. The data were then attached using the attach function. Finally, the linear regression was fitted with the ‘lm’ command by defining the model. [Refer to Appendix 1]

Model = lm (target_variable ~ predictor_variable)

The summary function then displays detailed information about the performance and coefficients of the model.

Model 1: Norfolk

Model 2: Suffolk

The summary output reports the residuals, coefficients, residual standard error, R-squared values, F-statistic, and p-value. Since the linear regression could only be run on the numeric variables prepared in the dataset, only the relationship between the Crime ID and the LSOA code has been modelled (Tambe, Cappelli, & Yakubovich, 2019). Other variables could also have been used and would give similar results, but this would have required much more extensive cleaning, because all the other variables store their values as character strings (Sahin & Türeci, 2018). Plots for the different models could not be created, as this would require further scrubbing of the data; moreover, the plot of the variables and the plot of the model would be the same because each model is fitted using a single predictor variable. [Refer to Appendix 2]
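Three of the quantities named in the summary output can be computed directly for a one-predictor model, as the sketch below shows. It is an illustration in Python, not a reproduction of R's summary function: the function names are hypothetical, the five data points are made up, and the p-value is left out because it needs the F distribution, which the Python standard library does not provide.

```python
import math


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def summarize(xs, ys):
    """Return R-squared, residual standard error, and F-statistic."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three observations")
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / n
    # Residual sum of squares and total sum of squares.
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    if sst == 0:
        raise ValueError("y values are all identical")
    dof = n - 2  # residual degrees of freedom with one predictor
    f_stat = (sst - ssr) / (ssr / dof) if ssr > 0 else float("inf")
    return {
        "r_squared": 1 - ssr / sst,
        "residual_std_error": math.sqrt(ssr / dof),
        "f_statistic": f_stat,
    }


# Made-up data for illustration only.
stats = summarize([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(stats)
```

R-squared is the share of the variation in the target that the line explains, the residual standard error is the typical size of a residual, and the F-statistic compares the explained variation with the unexplained variation per degree of freedom.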

Model

Linear regression has been carried out on the data collected for the two counties, Norfolk and Suffolk (Jablonka, Ongari, Moosavi, & Smit, 2020). Linear regression analysis serves two purposes here. Description: it helps to determine the strength of the association between the outcome variable and the predictor variables (Qian et al., 2018). Predictor variables: it helps to identify the significant variables that influence the outcome. This part of the report describes the statistical models, covering the fitting of each model, its coefficients, its confidence levels, and its analysis of variance (ANOVA). [Refer to Appendix 3]

Model 1

Recommendation System

A recommendation system has been created for a smaller range of the data, 0-10, and it shows a worse fit than the main model. [Refer to Appendices 5-8]

Model 3: Recommendation System for Norfolk

The different models can now be compared. The R-squared values of the first and second models are close to each other, at approximately 0.99 (Nwodo & Anumba, 2019), which means that both fit about equally well. From the plots, however, it can be seen that model 1, the model for Norfolk, fits best, as its fitted line lies closer to the scatter plot points than that of model 2 for Suffolk. On this basis, the districts of Norfolk are taken to be preferable to reside in. The R-squared values of the recommendation system are lower than those of the main models because it covers a smaller range of data. The best district to reside in within Norfolk is identified as Allerdale, since the fitted line touches the scatter plot point at (1,1).
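The effect of fitting on a smaller range of data can be illustrated with a short sketch. The data below are synthetic (a straight line with a small alternating disturbance), not the Norfolk or Suffolk data, and reading the 0-10 range as a restriction on the predictor values is an assumption. The function name r_squared is hypothetical.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # With a single predictor, R-squared is the squared correlation.
    return (sxy * sxy) / (sxx * syy)


# Synthetic data: y = 2x plus a disturbance alternating between +1 and -1.
xs = list(range(0, 21))
ys = [2 * x + (1 if x % 2 == 0 else -1) for x in xs]

# Keep only the observations whose predictor lies in the range 0-10.
subset = [(x, y) for x, y in zip(xs, ys) if x <= 10]
sub_xs = [x for x, _ in subset]
sub_ys = [y for _, y in subset]

full_r2 = r_squared(xs, ys)
subset_r2 = r_squared(sub_xs, sub_ys)
print(round(full_r2, 4), round(subset_r2, 4))
```

In this synthetic case the disturbance is the same size in both fits, but the narrower range leaves less variation for the line to explain, so R-squared falls. This is consistent with the lower values reported for the recommendation system, although it does not by itself show which model is more useful.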

Interpret

A legal issue, also known as an issue of law, is a legal question that forms the basis of a case and requires a court's decision. It can refer to a situation in which the facts are not in dispute and the outcome is determined by the court's application of the law (Vassakis, Petrakis, & Kopanakis, 2018). An ethical dilemma is a conflict between two ethically right courses of action, in which values or principles clash. The dilemma is that right and wrong are done at the same time, and choosing one right option rules out the other. The data have been collected from the portal of the UK police forces. The data raise a possible legal issue because no permission was sought from the police forces or other government bodies before the data were collected from the site (Cucurachi, Scherer, Guinée, & Tukker, 2019). The data may also be regarded as sensitive, since district officials may not want their crime rates to be publicized over time, as this could create a negative impression among people planning to reside in those districts. Hence, the districts could take legal action, which could harm the career of the researcher.

There are also ethical issues with the data collected from the UK police forces site. The first of the two ethically right actions is to rely on the crime figures in the dataset, which are correct and significant because they are maintained and updated by government bodies, namely the police forces (Campos-Guzmán, García-Cáscales, Espinosa, & Urbina, 2019). The second is to analyse where a new resident should live. New residents need to investigate every part of a district, the advantages and disadvantages they would face after moving there, and how they would benefit from the facilities available. They also need to consider whether their belongings might be stolen or whether they would face other crime if they lived there. Investigating the safety of a district in this way requires its sensitive data to be examined thoroughly, which could in turn raise ethical issues.

To conclude, the full data science lifecycle has been carried out step by step, comparing the districts in order to help new residents choose where to live in the counties of Norfolk and Suffolk. Every step has been described in detail: obtaining the datasets, creating and fitting the models, diagnostic plotting, creating and plotting the recommendation system, and finally comparing the resulting models. The recommendation system produces a better-looking plot, but its R-squared value is lower than those of the main Norfolk and Suffolk models. The main models are therefore the best fit, since the recommendation model covers a much smaller range of data than the main models. Hence, it is concluded that the Allerdale district of Norfolk is the best option to reside in. In the future, more models could be fitted using other variables of the dataset to improve the statistical analysis. Python and R could also be used together to get better results.

Reference list

Journals

Engin, Z., & Treleaven, P. (2019). Algorithmic government: Automating public services and supporting civil servants in using data science technologies. The Computer Journal, 62(3), 448-460. Retrieved from: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8852885 [Retrieved on:05.02.22]

Kaab, A., Sharifi, M., Mobli, H., Nabavi-Pelesaraei, A., & Chau, K. W. (2019). Combined life cycle assessment and artificial intelligence for prediction of output energy and environmental impacts of sugarcane production. Science of the Total Environment, 664, 1005-1019. Retrieved from: https://d1wqtxts1xzle7.cloudfront.net/58399669/Sugarcane-ANN_ANFIS-with-cover-page-v2.pdf [Retrieved on:05.02.22]

Farley, S. S., Dawson, A., Goring, S. J., & Williams, J. W. (2018). Situating ecology as a big-data science: Current advances, challenges, and solutions. BioScience, 68(8), 563-576. Retrieved from: https://watermark.silverchair.com/biy068.pdf [Retrieved on:05.02.22]

Matheus, R., Janssen, M., & Maheshwari, D. (2020). Data science empowering the public: Data-driven dashboards for transparent and accountable decision-making in smart cities. Government Information Quarterly, 37(3), 101284. Retrieved from: https://d1wqtxts1xzle7.cloudfront.net/56076684/Paddy-ANN_ANFIS-with-cover-page-v2.pdf [Retrieved on:05.02.22]

Nabavi-Pelesaraei, A., Rafiee, S., Mohtasebi, S. S., Hosseinzadeh-Bandbafha, H., & Chau, K. W. (2018). Integration of artificial intelligence methods and life cycle assessment to predict energy output and environmental impacts of paddy production. Science of the Total Environment, 631, 1279-1294. Retrieved from: https://minerva-access.unimelb.edu.au/bitstream/handle/11343/194165/Revised%20manuscript%20JCLEPRO-D-17-05860.pdf [Retrieved on:05.02.22]

Crawford, R. H., Bontinck, P. A., Stephan, A., Wiedmann, T., & Yu, M. (2018). Hybrid life cycle inventory methods–A review. Journal of Cleaner Production, 172, 1273-1288. Retrieved from: https://www.researchga.te.net/profile/Peter-Cappelli/publication/328798021.pdf [Retrieved on:05.02.22]

Zou, P. X., Xu, X., Sanjayan, J., & Wang, J. (2018). Review of 10 years research on building energy performance gap: Life-cycle and stakeholder perspectives. Energy and Buildings, 178, 165-181. Retrieved from: https://www.research.manchester.ac.uk/portal/files/158247167/Gallego_and_Tarpani_2019_.pdf [Retrieved on:05.02.22]

Attwood, T. K., Blackford, S., Brazas, M. D., Davies, A., & Schneider, M. V. (2019). A global perspective on evolving bioinformatics and data science training needs. Briefings in Bioinformatics, 20(2), 398-404. Retrieved from: https://pdf.sciencedirectassets.com/271439/1-s2.0-S0166361519X00068.pdf and https://ec.europa.eu/jrc/communities/sites/default/files/publ046_tkde_2020_paper_earlyaccess.pdf [Retrieved on:05.02.22]

Smetana, S., Schmitt, E., & Mathys, A. (2019). Sustainable use of Hermetia illucens insect biomass for feed and food: Attributional and consequential life cycle assessment. Resources, Conservation and Recycling, 144, 285-296. Retrieved from: https://d1wqtxts1xzle7.cloudfront.net/57651140/BDA_chapter_publ-with-cover-page-v2.pdf [Retrieved on:05.02.22]

Davis, M. T. (2020). Applying technical communication theory to the design of online education. In Online Education (pp. 15-29). Routledge. Retrieved from: https://escholarship.org/content/qt0b9635gm/qt0b9635gm.pdf [Retrieved on:05.02.22]

Tambe, P., Cappelli, P., & Yakubovich, V. (2019). Artificial intelligence in human resources management: Challenges and a path forward. California Management Review, 61(4), 15-42. Retrieved from: https://d1wqtxts1xzle7.cloudfront.net/60230503/Energy220190807-55124-7wgpoo-with-cover-page-v2.pdf [Retrieved on:05.02.22]

Jablonka, K. M., Ongari, D., Moosavi, S. M., & Smit, B. (2020). Big-data science in porous materials: materials genomics and machine learning. Chemical Reviews, 120(16), 8066-8129. Retrieved from: https://arxiv.org/pdf/1709.07493.pdf [Retrieved on:05.02.22]

Gallego-Schmid, A., & Tarpani, R. R. Z. (2019). Life cycle assessment of wastewater treatment in developing countries: a review. Water Research, 153, 63-79. Retrieved from: https://idus.us.es/bitstream/handle/11441/93904/A%20Method%20to%20Improve%20the%20Early%20Stages.pdf?sequence=1&isAllowed=y [Retrieved on:05.02.22]

Nwodo, M. N., & Anumba, C. J. (2019). A review of life cycle assessment of buildings using a systematic approach. Building and Environment, 162, 106290. Retrieved from: https://fardapaper.ir/mohavaha/uploads/2020/06/Fardapaper-Industrial-blockchain-based-framework-for-product-lifecycle-management-in-industry-4.0.pdf [Retrieved on:05.02.22]

Vassakis, K., Petrakis, E., & Kopanakis, I. (2018). Big data analytics: applications, prospects and challenges. In Mobile big data (pp. 3-20). Springer, Cham. Retrieved from: https://iopscience.iop.org/article/10.1088/1748-9326/ab89d7/pdf [Retrieved on:05.02.22]

Cucurachi, S., Scherer, L., Guinée, J., & Tukker, A. (2019). Life cycle assessment of food systems. One Earth, 1(3), 292-297. Retrieved from: https://www.osti.gov/servlets/purl/1798578.pdf [Retrieved on:05.02.22]

Campos-Guzmán, V., García-Cáscales, M. S., Espinosa, N., & Urbina, A. (2019). Life Cycle Analysis with Multi-Criteria Decision Making: A review of approaches for the sustainability evaluation of renewable energy technologies. Renewable and Sustainable Energy Reviews, 104, 343-366. Retrieved from: https://www.sciencedirect.com/science/article/abs/pii/S1364032119300413 [Retrieved on:05.02.22]

Sahin, U., & Türeci, Ö. (2018). Personalized vaccines for cancer immunotherapy. Science, 359(6382), 1355-1360. Retrieved from: https://digital.csic.es/bitstream/10261/192407/1/Graphene_%20in_%20supercapacitors_Cossutta.pdf [Retrieved on:05.02.22]

Qian, Y., Jiang, Y., Chen, J., Zhang, Y., Song, J., Zhou, M., & Pustišek, M. (2018). Towards decentralized IoT security enhancement: A blockchain approach. Computers & Electrical Engineering, 72, 266-273. Retrieved from: https://www.researchgate.net/profile/Pai-Zheng-5/publication/337363063_A_state-of-the art_survey_of_Digital_Twin_techniques_engineering_product_lifecycle_management_and_business_innovation_perspectives/links/5f09c0e9299bf18816129dd2/A-state-of-the-art-survey-of-Digital-Twin-techniques-engineering-product-lifecycle-management-and-business-innovation-perspectives.pdf [Retrieved on:05.02.22]