An empirical approach to comparing two prominent languages used in data science and analytics.

LinhHoang8997/Python-vs-R

PROJECT: Python vs R

INTRODUCTION

This project compares the leading languages for data science and data analytics to help new scientists, analysts, and students decide which language to invest in. The project combines secondary research with empirical methods: besides exploring the industry landscape surrounding the two languages, it examines both languages through complete examples of common tasks within the data analytics process. The following stages are in focus:

  • Extraction: web scraping
  • Data preparation: cleaning and wrangling
  • Model development: prototyping with Naive Bayes, clustering, and random forest models

Each task will be conducted in both R and Python. At the end of the project, a presentation will compare relevant snippets and outputs line by line, highlighting differences in syntax and paradigm.

WHY THIS PROJECT IS DIFFERENT FROM OTHER COMPARISONS

As the project aims to support new learners, the team assumes the "student persona" in writing and evaluating code samples in Python and R. This means we avoid the topics of performance and paradigms and focus on learning curve and intuitiveness.

Each example will be "standalone": any future reader will gain an adequate understanding of the datasets used and the code needed to replicate the task or activity in full. The project team in the Information Systems Integration class (MIS4596) could not demonstrate our examples in detail, but we will include scorecards and notes to summarize our findings.

PRELIMINARY COMPARISON

This visualization was made in Tableau Desktop. Here is a link to the visualization.

Infograph-1-cr

I used the Kaggle survey to control for the broader use cases of Python, a popular general-purpose language, because the survey covers only data science and machine learning professionals.

Between Python and R:

  • Nearly half of professionals use both languages in their day-to-day work.
  • Python has a greater percentage of exclusive users (those who primarily use only Python or only R).
  • Python users are most concentrated in computer science/technology fields (43% of Python users). At least 19% of Python users are software engineers.
  • In comparison to Python users, R users are more evenly distributed across disciplines and fields. R also sees a greater concentration of Data Scientists (30%) and Data Analysts (16%).
  • R generally sees far fewer commits and contributions on GitHub repos than Python. This may be explained by the difference in size between the Python and R communities. ggplot2 is noteworthy for having as large a presence on Stack Overflow as a major Python package.

EMPIRICAL COMPARISON

I was responsible for most of the comparative samples for this project except the Web-scraping module and a section within Data Wrangling.

Below are my observations on the difference between the two languages, based on work on my modules:

  • Exploring features/fields in R is more intuitive thanks to the str() function in base R. Python has pandas's info(), but it provides less information.
  • readr in R parses columns from CSV files into strict column types, which makes later steps easier for new users. Python, with pandas, codes the data type of non-numeric variables as object, obscuring columns with mixed data and errors.
  • Python runs statistical and machine learning models much faster than R does. For instance, RandomForest takes 6 times longer to run in R.
  • Python's object-oriented structure allows intuitive method chaining and greater readability.
  • R packages for statistical models* provide greater model transparency through built-in methods for explaining models and exploring feature importance. Getting the same information in Python, even with the package eli5, requires more skill from beginners.
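
The pandas points above (non-numeric columns hiding mixed data, and method chaining) can be illustrated with a minimal sketch. It assumes a made-up three-row CSV; the column names and values are hypothetical and are not taken from the project's datasets. The exact dtype label pandas reports for text columns (typically object) may also depend on the pandas version.

```python
import io

import pandas as pd

# Hypothetical CSV (illustrative only): "age" should be numeric,
# but one row contains the text "unknown".
raw = io.StringIO(
    "name,age,city\n"
    "Ann,34,Philadelphia\n"
    "Bob,unknown,Boston\n"
    "Cy,29,Chicago\n"
)

df = pd.read_csv(raw)

# Because of the single bad entry, pandas cannot treat "age" as numeric.
# The whole column is stored with a generic text dtype, so the problem
# is easy to miss when skimming df.info() or df.dtypes.
print(df.dtypes)
age_is_numeric = pd.api.types.is_numeric_dtype(df["age"])

# Coercing to numeric turns unparseable entries into NaN,
# which makes the bad rows countable.
bad_rows = int(pd.to_numeric(df["age"], errors="coerce").isna().sum())

# Method chaining: each step returns a new DataFrame, so the cleaning
# pipeline reads top to bottom.
cleaned = (
    df.assign(age=lambda d: pd.to_numeric(d["age"], errors="coerce"))
    .dropna(subset=["age"])
    .sort_values("age")
    .reset_index(drop=True)
)
print(cleaned)
```

In this sketch the coercion step is explicit, which is one way a beginner might surface the bad row; it is not necessarily how the project's own modules handled mixed columns.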
