Beyond the Blink: Decoding the Data Deluge with Skewness and Kurtosis (Part 1)

Imagine trying to download an entire movie library in the blink of an eye. This once-unthinkable feat is now a reality thanks to the groundbreaking work of Asbjørn Arvad Jørgensen and his team at the Technical University of Denmark.  They’ve harnessed the power of light to create a revolutionary photonic chip capable of transmitting information at a mind-boggling 1.84 petabits per second.

But what exactly does that mean? A petabit is one quadrillion (one thousand trillion) bits. To put that in perspective, this chip could download the entire Library of Congress a staggering 2,700 times in a single second! This technological marvel shatters the limitations of even the most advanced internet connections available today, which typically operate in the realm of gigabits (billions of bits) per second. Just a few decades ago, common internet speeds measured in kilobits (thousands of bits) per second were a sign of progress. Today, thanks to the ingenuity behind this photonic chip, we stand at the threshold of a data transfer revolution.

Jørgensen’s team achieved this remarkable feat by splitting a single stream of data into thousands of separate channels, each carried by a unique sliver of light. This technique allows them to transmit this massive amount of information simultaneously over a standard fiber-optic cable, stretching an impressive 7.9 kilometers (4.9 miles).  In essence, the chip accomplishes the unthinkable – it sends more data in a single second than currently travels through the entire internet’s backbone network!

While this engineering marvel is awe-inspiring, I’d like to shift our focus to the world within our data sets. Imagine a giant warehouse filled with boxes, each containing information. Some boxes might hold details about online purchases, while others contain social media posts or weather recordings. Just like any good warehouse, understanding how these boxes are arranged is crucial to finding what you need quickly and efficiently.

In this realm, statisticians grapple with the nuances of data distributions, and two crucial concepts emerge: skewness and kurtosis. Skewness reveals how lopsided the arrangement of boxes in our warehouse is – whether they lean more to one side or the other. Understanding this bias is critical for sorting through the boxes, identifying outliers (boxes that are much bigger or smaller than the others), and potentially rearranging them for better analysis. On the other hand, kurtosis quantifies how many boxes are stacked extremely high or low in the warehouse. Does the data have a lot of extreme values, or are most of the boxes piled around a similar height? These statistical measures allow us to create a more organized warehouse, ultimately unlocking hidden patterns and generating valuable insights from the information they contain.

As we embark on this journey into the heart of data analysis, let’s dive deeper into the peculiarities of skewness and the eccentricities of kurtosis. Within these concepts lie the keys to understanding our data’s true nature and shaping its future.

Data analysis, much like any scientific endeavor, thrives on order and structure. But what happens when the boxes in our data warehouse become jumbled, or the heights of the stacks vary erratically? This is where skewness and kurtosis enter the scene, acting as vital guides, influencing the interpretation and analysis of our data.

Part One: Interpreting Skewness in Data Analysis: Understanding Distribution Biases

Let’s return to our data warehouse analogy. Imagine we’re tasked with analyzing boxes containing information on daily login durations for a social media platform with millions of users. Ideally, the boxes would be stacked neatly around a central height, with the stacks falling away evenly on both sides. In this scenario, the data distribution would be symmetrical: about as many users would log in for much shorter periods than usual as for much longer ones.

However, the world of data rarely aligns perfectly with such a neat arrangement. This is where skewness comes in. Skewness is a measure of how asymmetrical the data distribution is, revealing whether the boxes in our warehouse are leaning to the left or right.

There are two main types of skewness:

Positive Skewness: In this scenario, the tail of the distribution with the higher values extends further, like a stack of boxes tilting to the right. Most boxes sit at the lower values, while a small number of unusually high values (often outliers) stretch the tail to the right. Going back to our social media example, positive skewness might suggest that while most users log in for short bursts, a small group of highly engaged users pulls the average towards longer durations.

Negative Skewness: Here, the opposite occurs. The tail of the distribution with lower values extends further, resembling a stack of boxes leaning to the left. This signifies that most boxes sit at the higher values, while a small number of unusually low values stretch the tail to the left. For instance, imagine a dataset containing house prices in a particular city. The distribution is typically positively skewed due to a larger number of modestly priced homes compared to a smaller pool of luxury mansions. However, it’s also possible to encounter a negatively skewed distribution in housing prices, perhaps in an area where most homes are expensive and only a handful sell for far less.

Recognizing skewness is key to identifying potential issues within your data warehouse. By understanding the lean of the boxes, you can make informed decisions about how to organize and analyze the information.
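To make the two directions of lean concrete, here is a minimal sketch in Python that measures skewness with the common moment-based formula (third central moment divided by the variance raised to the power 1.5). The login durations are invented numbers, the left-leaning list is simply an artificial mirror image of the right-leaning one, and the function name skewness is our own choice rather than anything from a particular library.

```python
from statistics import fmean, median


def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical daily login durations in minutes (invented numbers).
right_leaning = [2, 3, 3, 4, 4, 5, 5, 6, 45, 120]
# Artificial mirror image of the same values, so the long tail points left.
left_leaning = [125 - x for x in right_leaning]
# Evenly spread values with no lean at all.
balanced = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for name, data in [
    ("right", right_leaning),
    ("left", left_leaning),
    ("balanced", balanced),
]:
    print(
        f"{name:9s} skewness={skewness(data):+.2f} "
        f"mean={fmean(data):.1f} median={median(data)}"
    )
```

A positive result means the tail stretches to the right, a negative result means it stretches to the left, and a value near zero means the stacks are roughly balanced. The printout also shows the mean being dragged toward the long tail while the median stays put.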

Let’s dive deeper into the practical implications of skewness in data analysis.

Example: Making Sense of Social Media Engagement

Imagine a social media platform boasting millions of users. If you analyze their daily login durations and encounter a positive skew, it could mean that while most users log in for short bursts (think scrolling through their feed during a coffee break), a small group of highly engaged users (think power users or social media influencers) significantly skew the average towards longer durations. This insight can be invaluable for product development teams, as it directs attention to features that might keep users engaged for extended periods. For instance, the platform might consider implementing features that encourage content creation or foster a stronger sense of community, potentially attracting and retaining more high-engagement users.

Another Example: Understanding Real Estate Markets

Consider a dataset of house prices in a particular city. The distribution is typically positively skewed due to a larger number of modestly priced homes compared to a smaller pool of luxury mansions. A high positive skewness indicates that a few exceptionally expensive houses significantly impact the average price, highlighting the importance of considering this bias when making informed decisions about real estate investments. Simply averaging the house prices in this scenario could mislead you into thinking the typical house is much more expensive than it really is. Real estate agents or potential buyers who rely solely on the average price might miss out on good deals or overpay for properties.
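The gap between the average and the typical house is easy to see with a few numbers. The sketch below uses ten invented sale prices (in thousands of dollars), one of which is a mansion; none of these figures come from a real market.

```python
from statistics import fmean, median

# Hypothetical sale prices in thousands of dollars (invented for illustration).
prices = [180, 195, 210, 220, 235, 250, 265, 280, 300, 2500]

mean_price = fmean(prices)     # pulled upward by the single mansion
median_price = median(prices)  # middle of the sorted list
below_mean = sum(1 for p in prices if p < mean_price)

print(f"mean:   {mean_price:.1f}")
print(f"median: {median_price:.1f}")
print(f"{below_mean} of {len(prices)} homes sell for less than the mean")
```

In this made-up sample, nine of the ten homes cost less than the mean, so the median is the more honest description of a typical house.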

Addressing Skewness for Better Analysis

One way to address skewness and make the data distribution more manageable is to log-transform the data, that is, to replace each value with its logarithm (which requires every value to be greater than zero). Imagine reorganizing the boxes in our warehouse based on their size (length, width, and height) rather than just piling them on top of each other. Log transformation accomplishes something similar. It condenses the spread of the larger values (the high-end houses or the long login durations) and stretches the range of the smaller values, bringing the distribution closer to a symmetrical form, which suits many statistical analyses.

This transformation not only helps stabilize the variance of the data and reduce the influence of outliers, but also prepares the data for further analysis, such as regression modeling. Making the distribution more symmetrical is particularly useful when developing predictive models to understand house price determinants, as it helps mitigate the impact of outliers and can improve the model’s performance in predicting future prices. Keep in mind that a model fitted on log prices predicts log prices, so its outputs must be converted back with the exponential function before they are read as prices.
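As a rough illustration of the effect, the following sketch simulates strictly positive prices and compares their skewness before and after taking the natural logarithm. The data are generated with Python’s random.lognormvariate, and the parameters (12.5 and 0.8), the seed, and the sample size are arbitrary assumptions rather than real market figures.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = math.fsum(values) / n
    m2 = math.fsum((x - mean) ** 2 for x in values) / n
    m3 = math.fsum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


random.seed(42)  # fixed seed so the run is repeatable

# Simulated prices: many modest values and a thin tail of very large ones.
prices = [random.lognormvariate(12.5, 0.8) for _ in range(2000)]

# The log is only defined for positive numbers, so check before transforming.
assert all(p > 0 for p in prices)
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of raw prices: {skew_raw:+.2f}")
print(f"skewness of log prices: {skew_log:+.2f}")

# Results computed on the log scale are converted back with exp().
recovered = math.exp(log_prices[0])
print(f"first price {prices[0]:.0f} -> log -> exp -> {recovered:.0f}")
```

Simulated data of this kind becomes nearly symmetrical after the transformation by construction. Real datasets will usually improve less cleanly, which is one reason to check the result rather than assume it.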

When Skewness Matters Most

When dealing with significant skewness, it is crucial to consider the type of analysis you intend to conduct. Some statistical tests assume that the data are approximately normally distributed, and summaries such as the average house price only represent most houses when the distribution is roughly symmetrical; in both cases it is important to address skewness first. Various data transformation methods, such as log, square root, or Box-Cox transformations, can help mitigate skewness and bring the data in your warehouse closer to a normal distribution. This adjustment supports more valid results from statistical tests and enhances the reliability of the models built on the data.
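The three transformations named above belong to one family. The Box-Cox transform is (x^lambda - 1) / lambda for a chosen lambda, with the natural log as the special case lambda = 0; lambda = 0.5 behaves like a square root, and lambda = 1 only shifts the data. The sketch below tries a handful of lambda values on simulated login durations and keeps the one whose result is least skewed. Two assumptions are worth flagging: the data are simulated, and choosing lambda by smallest absolute skewness is a simplification for illustration (statistical packages normally estimate lambda by maximum likelihood).

```python
import math
import random


def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = math.fsum(values) / n
    m2 = math.fsum((x - mean) ** 2 for x in values) / n
    m3 = math.fsum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def box_cox(values, lam):
    """Box-Cox power transform; defined only for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


random.seed(7)  # fixed seed so the run is repeatable

# Simulated login durations in minutes (arbitrary log-normal parameters).
durations = [random.lognormvariate(2.0, 0.8) for _ in range(2000)]

# 1.0 = shift only, 0.5 = square-root-like, 0.0 = log, -0.5 = stronger than log.
candidates = [-0.5, 0.0, 0.5, 1.0]
skew_by_lambda = {lam: skewness(box_cox(durations, lam)) for lam in candidates}

# Simplified selection rule (an assumption of this sketch, not the usual
# maximum-likelihood estimate): keep the lambda with the smallest |skewness|.
best_lambda = min(candidates, key=lambda lam: abs(skew_by_lambda[lam]))

for lam in candidates:
    print(f"lambda={lam:+.1f}  skewness={skew_by_lambda[lam]:+.2f}")
print(f"least skewed result at lambda={best_lambda}")
```

Smaller lambda values pull in the right tail more strongly, so it is possible to overcorrect and end up with a left-leaning distribution, as the negative lambda shows here.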

Additionally, it is always advisable to visually examine your data both before and after transformation to ensure the appropriateness and effectiveness of the transformation. Tools like histograms (think of a bar chart that shows the number of boxes at each height) or Q-Q plots (a visual comparison of how closely your data resembles a normal distribution) can provide insights into how well the transformation has normalized the distribution of boxes in your warehouse (or the data itself).
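Both checks can be approximated without a plotting library. The sketch below prints a text histogram and computes the points of a Q-Q plot by pairing each sorted observation with the matching quantile of a standard normal distribution (via NormalDist.inv_cdf from Python’s statistics module). As a numeric stand-in for "how straight is the Q-Q line", it reports the correlation of those points. The data are simulated, and the helper names qq_pairs, correlation, and text_histogram are our own.

```python
import math
import random
from statistics import NormalDist


def qq_pairs(values):
    """Pair each sorted observation with the matching standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def correlation(pairs):
    """Pearson correlation of the Q-Q points; near 1 means a nearly straight line."""
    n = len(pairs)
    mean_x = math.fsum(x for x, _ in pairs) / n
    mean_y = math.fsum(y for _, y in pairs) / n
    sxy = math.fsum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = math.fsum((x - mean_x) ** 2 for x, _ in pairs)
    syy = math.fsum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def text_histogram(values, bins=8, width=40):
    """Count values per equal-width bin and render one bar of '#' per bin."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / step), bins - 1)] += 1
    peak = max(counts)
    lines = [
        f"{lo + i * step:12.2f} | {'#' * round(width * c / peak)}"
        for i, c in enumerate(counts)
    ]
    return counts, lines


random.seed(3)  # fixed seed so the run is repeatable
prices = [random.lognormvariate(12.5, 0.8) for _ in range(2000)]  # simulated
log_prices = [math.log(p) for p in prices]

for label, data in [("before (raw)", prices), ("after (log)", log_prices)]:
    counts, lines = text_histogram(data)
    print(label)
    print("\n".join(lines))

r_raw = correlation(qq_pairs(prices))
r_log = correlation(qq_pairs(log_prices))
print(f"Q-Q straightness before: {r_raw:.3f}, after: {r_log:.3f}")
```

If the points returned by qq_pairs are plotted, data close to a normal distribution fall along a straight line, while skewed data bend away from it at one end.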

“Bringing Data to Life and Life to Data”

About the Author:

Dr. Joe Perez,
Team Lead / Senior Systems Analyst,
NC Department of Health and Human Services

Dr. Joe Perez (Dr. Joe) is also the Chief Technology Officer at CogniMind.

Dr. Joe Perez was selected as the 2023 Gartner Peer Community Ambassador of the Year.

Dr. Joe Perez is a truly exceptional professional who has left an indelible mark on the IT, health and human services, and higher education sectors. His journey began in the field of education, where he laid the foundation for his career. With advanced degrees in education and a doctorate that included a double minor in computers and theology, Joe embarked on a path that ultimately led him to the dynamic world of data-driven Information Technology.

In the early 1990s, he transitioned into IT, starting as a Computer Consultant at NC State University. Over the years, his dedication and expertise led to a series of well-deserved promotions, culminating in his role as Business Intelligence Specialist that capped his 25 successful years at NC State. Not one to rest on his laurels, Dr. Perez embarked on a new challenge in the fall of 2017, when he was recruited to take on the role of Senior Business Analyst at the NC Department of Health & Human Services (DHHS). His impressive journey continued with promotions to Senior Systems Analyst and Team Leader, showcasing his versatility and leadership capabilities.

In addition to his full-time responsibilities at DHHS, Joe assumed the role of fractional Chief Technology Officer at a North Carolina corporation in October 2020. A top-ranked published author with over 17,000 followers on LinkedIn and numerous professional certifications, he is a highly sought-after international keynote speaker, a recognized expert in data analytics and visualization, and a specialist in efficiency and process improvement.

Dr. Perez’s contributions have not gone unnoticed. He is a recipient of the IOT Industry Insights 2021 Thought Leader of the Year award and has been acknowledged as a LinkedIn Top Voice in multiple topics. He holds memberships in prestigious Thought Leader communities at Gartner, Coruzant Technologies, DataManagementU, Engatica, the Global AI Hub, and Thinkers360 (where he achieved overall Top 20 Thought Leader 2023 ranking in both Analytics and Big Data). His reach extends to more than twenty countries worldwide, where he impacts thousands through his speaking engagements.

Beyond his professional achievements, Joe’s passion for teaching remains undiminished. Whether as a speaker, workshop facilitator, podcast guest, conference emcee, or team leader, he continually inspires individuals to strive for excellence. He treasures his time with his family and is a gifted musician, singer, pianist, and composer. Joe also dedicates his skills as a speaker, interpreter, and music director to his church’s Hispanic ministry. He manages the publication of a widely recognized monthly military newsletter, The Patriot News, and is deeply committed to his community.

To maintain a balanced life, Perez is a regular at the gym, and he finds relaxation in watching Star Trek reruns. He lives by the philosophy that innovation is the key to progress, and he approaches each day with boundless energy and an unwavering commitment to excellence. His journey is a testament to the remarkable achievements of a truly exceptional individual.

Dr. Joe Perez has received the following honors and awards:

https://www.linkedin.com/in/jwperez/details/honors/

Dr. Joe Perez holds the following licenses, certifications, and badges:

https://www.linkedin.com/in/jwperez/details/certifications/

https://www.thinkers360.com/tl/badge/19985/2764

Dr. Joe Perez volunteers with the following international industry associations and institutions:

https://www.linkedin.com/in/jwperez/details/volunteering-experiences/
