Graduation Project -- Duolingo Analysis¶
Business Problem Overview¶
Enhancing User Retention and Learning Effectiveness on Duolingo¶
Duolingo, a leading language-learning platform, empowers millions of users worldwide to acquire new languages through engaging and interactive lessons. Its gamified structure, encompassing daily streaks, leaderboards, and rewards, makes learning enjoyable and accessible. Catering to learners of diverse backgrounds, Duolingo plays a crucial role in breaking down language barriers and fostering global communication.
However, the platform's long-term success hinges on user retention, engagement, and learning outcomes. Despite its innovative approach, some users struggle to maintain consistent learning habits, achieve proficiency, or stay active on the platform. High drop-off rates and reduced session activity can diminish individual learning journeys and impact the platform’s ability to meet its mission of fostering education globally.
This raises several critical questions:
- Why do some users excel while others disengage early in their learning journey?
- What patterns or factors contribute to higher retention, lesson accuracy, and session frequency?
- How do user behaviors differ across languages, lesson difficulty levels, or demographics?
By analyzing key metrics such as session accuracy, engagement trends, and historical learning data, Duolingo can uncover actionable insights to address these challenges. These insights can enable the platform to:
- Personalize learning pathways for diverse user needs.
- Introduce adaptive features to re-engage at-risk users.
- Optimize lesson structures for improved comprehension and retention.
With a deeper understanding of user behavior, Duolingo can create a more tailored and effective learning experience, ensuring users remain motivated, achieve their language goals, and continue to thrive on the platform.
Objective¶
In this project, we aim to analyze Duolingo user activity data to:
- Understand user behavior: Identify trends in session frequency, lesson accuracy, and engagement across different languages and user demographics.
- Recognize disengagement signals: Detect key indicators that suggest users are losing interest or struggling with their learning journey.
- Optimize learning strategies: Provide actionable insights to enhance user retention, personalize learning experiences, and improve overall platform effectiveness.
Business Impact¶
This analysis will empower Duolingo to:
- Enhance user retention by identifying and addressing factors that contribute to user disengagement.
- Improve learning outcomes by personalizing lessons to align with individual user needs and preferences.
- Drive platform growth by fostering consistent user engagement and satisfaction, leading to positive word-of-mouth and higher subscription rates.
By addressing user disengagement and optimizing learning strategies, this project aligns with Duolingo's mission to make education accessible, engaging, and effective for learners worldwide. Let’s dive into the data to unlock these insights!
Dataset Overview¶
- Dataset Name: Duolingo Analytics Dataset
- Number of Rows: 3,795,780
- Number of Columns: 12
- Description: The dataset captures language learning session details, including user behavior, engagement metrics, and learning progress. It records recall probability, session performance, and lexeme interactions, enabling analysis of learning outcomes and activity patterns over days.
Column Definitions¶
- p_recall (Proportion of Recall Accuracy): The proportion of exercises in this lesson where the word (or lexeme) was correctly recalled by the student.
- timestamp (Time of the Lesson): The timestamp indicating when the current lesson or practice took place.
- delta (Time Gap): The time (in seconds) since the last lesson or practice where this specific word (lexeme) was encountered.
- user_id (Student ID): An anonymized ID representing the student who completed the lesson or practice.
- learning_language (Language Being Learned): The target language that the student is learning.
- ui_language (User Interface Language): The language of the app’s user interface, which is usually the student's native language.
- lexeme_id (Lexeme Tag ID): A system-generated unique ID for the word or lexeme being practiced.
- lexeme_string (Lexeme Tag): A detailed grammar tag describing the lexeme (word), including its properties like tense, gender, and plurality.
- history_seen (Times Seen Before): The total number of times the student has encountered this word (lexeme) in lessons or practice sessions before this one.
- history_correct (Times Correct Before): The total number of times the student correctly recalled this word (lexeme) in previous lessons or practice sessions.
- session_seen (Times the Word/Lexeme Was Seen in the Current Session): This column indicates how many times the student encountered the specific word or lexeme during the current lesson or practice session.
- session_correct (Times the Word/Lexeme Was Correctly Recalled in the Current Session): This column indicates how many times the student correctly recalled or answered the specific word or lexeme during the current lesson or practice session.
Analysis & Visualization¶
1. Importing and Cleaning Data¶
Importing Necessary Libraries¶
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computations
import matplotlib.pyplot as plt # For plotting and visualization
import seaborn as sns # For advanced visualizations
Loading the Dataset from Google Drive¶
dl = pd.read_csv("reduced_data_400mb (1).csv") # Loading the Data
dl
p_recall | timestamp | delta | user_id | learning_language | ui_language | lexeme_id | lexeme_string | history_seen | history_correct | session_seen | session_correct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2013-03-03 17:13:47 | 1825254 | 5C7 | fr | en | 3712581f1a9fbc0894e22664992663e9 | sur/sur<pr> | 2 | 1 | 2 | 2 |
1 | 1.0 | 2013-03-04 18:30:50 | 367 | fWSx | en | es | 0371d118c042c6b44ababe667bed2760 | police/police<n><pl> | 6 | 5 | 2 | 2 |
2 | 0.0 | 2013-03-03 18:35:44 | 1329 | hL-s | de | en | 5fa1f0fcc3b5d93b8617169e59884367 | hat/haben<vbhaver><pri><p3><sg> | 10 | 10 | 1 | 0 |
3 | 1.0 | 2013-03-07 17:56:03 | 156 | h2_R | es | en | 4d77de913dc3d65f1c9fac9d1c349684 | en/en<pr> | 111 | 99 | 4 | 4 |
4 | 1.0 | 2013-03-05 21:41:22 | 257 | eON | es | en | 35f14d06d95a34607d6abb0e52fc6d2b | caballo/caballo<n><m><sg> | 3 | 3 | 3 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3795775 | 1.0 | 2013-03-06 23:06:48 | 4792 | iZ7d | es | en | 84e18e86c58e8e61d687dfa06b3aaa36 | soy/ser<vbser><pri><p1><sg> | 6 | 5 | 2 | 2 |
3795776 | 1.0 | 2013-03-07 22:49:23 | 1369 | hxJr | fr | en | f5b66d188d15ccb5d7777a59756e33ad | chiens/chien<n><m><pl> | 3 | 3 | 2 | 2 |
3795777 | 1.0 | 2013-03-06 21:20:18 | 615997 | fZeR | it | en | 91a6ab09aa0d2b944525a387cc509090 | voi/voi<prn><tn><p2><mf><pl> | 25 | 22 | 2 | 2 |
3795778 | 1.0 | 2013-03-07 07:54:24 | 289 | g_D3 | en | es | a617ed646a251e339738ce62b84e61ce | are/be<vbser><pres> | 32 | 30 | 2 | 2 |
3795779 | 1.0 | 2013-03-06 21:12:07 | 191 | iiN7 | pt | en | 4a93acdbafaa061fd69226cf686d7a2b | café/café<n><m><sg> | 3 | 3 | 1 | 1 |
3795780 rows × 12 columns
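The cell above reads the CSV from a local path. If the file is actually stored in Google Drive, as the section title suggests, it can be mounted first when working in Colab. A minimal sketch, assuming a Colab environment and a hypothetical Drive path:
# Optional: mount Google Drive in Colab and load the file from a Drive path.
# The path below is hypothetical; adjust it to wherever the CSV actually lives.
from google.colab import drive
drive.mount('/content/drive')
dl = pd.read_csv('/content/drive/MyDrive/reduced_data_400mb (1).csv')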
Displaying Dataset Information¶
print("Dataset Information:")
dl.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 3795780 entries, 0 to 3795779 Data columns (total 12 columns): # Column Dtype --- ------ ----- 0 p_recall float64 1 timestamp object 2 delta int64 3 user_id object 4 learning_language object 5 ui_language object 6 lexeme_id object 7 lexeme_string object 8 history_seen int64 9 history_correct int64 10 session_seen int64 11 session_correct int64 dtypes: float64(1), int64(5), object(6) memory usage: 347.5+ MB
- dl.info() gives a concise summary of the columns, their data types, and their non-null counts.
Displaying Column Names¶
dl.columns
Index(['p_recall', 'timestamp', 'delta', 'user_id', 'learning_language', 'ui_language', 'lexeme_id', 'lexeme_string', 'history_seen', 'history_correct', 'session_seen', 'session_correct'], dtype='object')
- Displays the Columns in the DataFrame
Describing Dataset Information¶
dl.describe()
p_recall | delta | history_seen | history_correct | session_seen | session_correct | |
---|---|---|---|---|---|---|
count | 3.795780e+06 | 3.795780e+06 | 3.795780e+06 | 3.795780e+06 | 3.795780e+06 | 3.795780e+06 |
mean | 8.964675e-01 | 7.055116e+05 | 2.197719e+01 | 1.949662e+01 | 1.808655e+00 | 1.636209e+00 |
std | 2.711188e-01 | 2.211979e+06 | 1.283616e+02 | 1.136178e+02 | 1.350644e+00 | 1.309628e+00 |
min | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 |
25% | 1.000000e+00 | 5.180000e+02 | 3.000000e+00 | 3.000000e+00 | 1.000000e+00 | 1.000000e+00 |
50% | 1.000000e+00 | 7.609500e+04 | 6.000000e+00 | 6.000000e+00 | 1.000000e+00 | 1.000000e+00 |
75% | 1.000000e+00 | 4.346412e+05 | 1.500000e+01 | 1.300000e+01 | 2.000000e+00 | 2.000000e+00 |
max | 1.000000e+00 | 3.964973e+07 | 1.344200e+04 | 1.281600e+04 | 2.000000e+01 | 2.000000e+01 |
- dl.describe() summarizes the numerical columns of the DataFrame.
- It displays the count, mean, std, min, 25%, 50%, 75%, and max, which helps identify potential outliers or data issues.
- It helps in understanding the numerical data columns.
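As a quick sanity check on the column definitions, p_recall should match session_correct divided by session_seen on every row; this is an assumption based on the definitions above, not something the dataset documents explicitly. A small sketch to verify it on the loaded DataFrame:
# Sanity check (assumption): p_recall should equal session_correct / session_seen per row.
ratio = dl["session_correct"] / dl["session_seen"]
mismatches = (ratio - dl["p_recall"]).abs() > 1e-6
print(f"Rows where p_recall deviates from session_correct / session_seen: {mismatches.sum()}")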
Displaying Column Data Types¶
print("The Data Types of all Columns:")
dl.dtypes
The Data Types of all Columns:
p_recall float64 timestamp object delta int64 user_id object learning_language object ui_language object lexeme_id object lexeme_string object history_seen int64 history_correct int64 session_seen int64 session_correct int64 dtype: object
- Using dl.dtypes, the data type of each column has been displayed.
Checking the Shape of the Dataset¶
rows, columns = dl.shape
print(f"The dataset contains {rows} rows and {columns} columns.")
The dataset contains 3795780 rows and 12 columns.
- dl.shape has been used to check the size of the dataset.
Checking the unique values in the Dataset¶
for column in dl.columns.tolist():
print(f"No. of unique values in {column}:")
print(dl[column].nunique())
No. of unique values in p_recall: 66 No. of unique values in timestamp: 334848 No. of unique values in delta: 808491 No. of unique values in user_id: 79694 No. of unique values in learning_language: 6 No. of unique values in ui_language: 4 No. of unique values in lexeme_id: 16244 No. of unique values in lexeme_string: 15864 No. of unique values in history_seen: 3784 No. of unique values in history_correct: 3308 No. of unique values in session_seen: 20 No. of unique values in session_correct: 21
- dl.columns.tolist() converts the column names into a Python list, which the loop iterates over to cover every column in the DataFrame dl.
- dl[column].nunique() returns the number of unique (distinct) values in each column.
Checking for the Value Counts in the Dataset¶
value_counts_dict = {}
for column in dl.columns:
value_counts_dict[column] = dl[column].value_counts()
print(value_counts_dict[column])
1.000000 3187760 0.000000 266909 0.500000 132930 0.666667 84204 0.750000 49116 ... 0.111111 1 0.545455 1 0.272727 1 0.684211 1 0.375000 1 Name: p_recall, Length: 66, dtype: int64 2013-03-05 21:09:36 105 2013-03-07 22:36:13 97 2013-03-01 20:07:14 94 2013-03-06 14:35:04 88 2013-03-06 19:41:47 87 ... 2013-03-07 14:20:21 1 2013-03-01 11:54:28 1 2013-03-02 03:39:04 1 2013-03-05 02:57:48 1 2013-03-05 19:53:28 1 Name: timestamp, Length: 334848, dtype: int64 165 3900 153 3854 164 3837 169 3771 160 3751 ... 4843625 1 579467 1 3626644 1 15992794 1 615997 1 Name: delta, Length: 808491, dtype: int64 bcH_ 4202 h8n2 2527 cpBu 2417 ht1n 2396 g2Ev 2383 ... iy7m 1 gGtD 1 g9rg 1 h6Lr 1 iTuP 1 Name: user_id, Length: 79694, dtype: int64 en 1479930 es 1007687 fr 552707 de 425433 it 237967 pt 92056 Name: learning_language, dtype: int64 en 2315850 es 1073889 pt 282884 it 123157 Name: ui_language, dtype: int64 827a8ecb89f9b59ac5c29b620a5d3ed6 36115 97e922f780d628eac638bea7a02bf496 28848 928787744a962cd4ec55c1b22cedc913 27224 b968b069e4e2c04848e9f8924e34c031 21842 a617ed646a251e339738ce62b84e61ce 20331 ... 4a8be4af945dbf28ff775b9ff933a5df 1 baae0bd8fddd341c208cb6aa04c237e8 1 f1f34a08001125f2337ed0a73f3a9f22 1 00e6f18dedcf8e59b7bf570beca9e80e 1 e370294ee020bc9db24865cac83a840e 1 Name: lexeme_id, Length: 16244, dtype: int64 a/a<det><ind><sg> 36115 is/be<vbser><pri><p3><sg> 28848 eats/eat<vblex><pri><p3><sg> 27224 we/prpers<prn><subj><p1><mf><pl> 21842 are/be<vbser><pres> 20331 ... telefonnummer/telefon<n>+nummer<n><f><sg><acc> 1 <*sf>/rumo<n><m><*numb> 1 kanada/kanada<np><nt><sg><dat> 1 ausstattung/ausstattung<n><f><sg><nom> 1 interior/interior<n><m><sg> 1 Name: lexeme_string, Length: 15864, dtype: int64 3 498066 4 373188 2 339224 5 287130 6 240742 ... 3816 1 1399 1 3897 1 2550 1 3517 1 Name: history_seen, Length: 3784, dtype: int64 3 511967 2 416078 4 366865 5 278302 1 249961 ... 3602 1 3730 1 3675 1 1587 1 1855 1 Name: history_correct, Length: 3308, dtype: int64 1 2245672 2 781463 3 397726 4 191528 5 93618 6 41790 7 19199 8 9297 9 5441 10 3653 11 1895 12 1106 13 926 16 859 14 796 15 482 17 118 19 115 18 57 20 39 Name: session_seen, dtype: int64 1 2120333 2 737118 3 363098 0 266909 4 166656 5 75562 6 32588 7 14441 8 7370 9 4239 10 2604 11 1484 12 1029 13 786 14 602 15 434 16 332 17 90 18 51 19 40 20 14 Name: session_correct, dtype: int64
- value_counts_dict = {} initializes an empty dictionary.
- dl[column].value_counts() returns the counts of unique values in each column.
2. Data Preparation¶
Checking for Missing/Null Values¶
missing_value_count = dl.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count
Missing Values in Each Column:
p_recall 0 timestamp 0 delta 0 user_id 0 learning_language 0 ui_language 0 lexeme_id 0 lexeme_string 0 history_seen 0 history_correct 0 session_seen 0 session_correct 0 dtype: int64
- dl.isnull() indicates whether each value in the data is null.
- dl.isnull().sum() provides the count of missing values for each column in the DataFrame dl.
Checking for Duplicate Values in the Dataset¶
duplicates = dl[dl.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 22
- The dl.duplicated() function is used to identify duplicate rows in the DataFrame dl.
Dropping the Duplicate Values in the Dataset¶
dl= dl.drop_duplicates()
dl
p_recall | timestamp | delta | user_id | learning_language | ui_language | lexeme_id | lexeme_string | history_seen | history_correct | session_seen | session_correct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2013-03-03 17:13:47 | 1825254 | 5C7 | fr | en | 3712581f1a9fbc0894e22664992663e9 | sur/sur<pr> | 2 | 1 | 2 | 2 |
1 | 1.0 | 2013-03-04 18:30:50 | 367 | fWSx | en | es | 0371d118c042c6b44ababe667bed2760 | police/police<n><pl> | 6 | 5 | 2 | 2 |
2 | 0.0 | 2013-03-03 18:35:44 | 1329 | hL-s | de | en | 5fa1f0fcc3b5d93b8617169e59884367 | hat/haben<vbhaver><pri><p3><sg> | 10 | 10 | 1 | 0 |
3 | 1.0 | 2013-03-07 17:56:03 | 156 | h2_R | es | en | 4d77de913dc3d65f1c9fac9d1c349684 | en/en<pr> | 111 | 99 | 4 | 4 |
4 | 1.0 | 2013-03-05 21:41:22 | 257 | eON | es | en | 35f14d06d95a34607d6abb0e52fc6d2b | caballo/caballo<n><m><sg> | 3 | 3 | 3 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3795775 | 1.0 | 2013-03-06 23:06:48 | 4792 | iZ7d | es | en | 84e18e86c58e8e61d687dfa06b3aaa36 | soy/ser<vbser><pri><p1><sg> | 6 | 5 | 2 | 2 |
3795776 | 1.0 | 2013-03-07 22:49:23 | 1369 | hxJr | fr | en | f5b66d188d15ccb5d7777a59756e33ad | chiens/chien<n><m><pl> | 3 | 3 | 2 | 2 |
3795777 | 1.0 | 2013-03-06 21:20:18 | 615997 | fZeR | it | en | 91a6ab09aa0d2b944525a387cc509090 | voi/voi<prn><tn><p2><mf><pl> | 25 | 22 | 2 | 2 |
3795778 | 1.0 | 2013-03-07 07:54:24 | 289 | g_D3 | en | es | a617ed646a251e339738ce62b84e61ce | are/be<vbser><pres> | 32 | 30 | 2 | 2 |
3795779 | 1.0 | 2013-03-06 21:12:07 | 191 | iiN7 | pt | en | 4a93acdbafaa061fd69226cf686d7a2b | café/café<n><m><sg> | 3 | 3 | 1 | 1 |
3795758 rows × 12 columns
- Duplicate rows can inflate statistics and mislead the analysis.
- Therefore, duplicates have been dropped using the dl.drop_duplicates() function.
Checking the shape to ensure the drop of duplicates.¶
dl.shape
(3795758, 12)
- Checking dl.shape confirms that the duplicates have been dropped.
Describing Non-Duplicated data.¶
dl.describe()
p_recall | delta | history_seen | history_correct | session_seen | session_correct | |
---|---|---|---|---|---|---|
count | 3.795758e+06 | 3.795758e+06 | 3.795758e+06 | 3.795758e+06 | 3.795758e+06 | 3.795758e+06 |
mean | 8.964686e-01 | 7.055136e+05 | 2.197721e+01 | 1.949666e+01 | 1.808658e+00 | 1.636214e+00 |
std | 2.711172e-01 | 2.211985e+06 | 1.283619e+02 | 1.136181e+02 | 1.350646e+00 | 1.309630e+00 |
min | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 |
25% | 1.000000e+00 | 5.180000e+02 | 3.000000e+00 | 3.000000e+00 | 1.000000e+00 | 1.000000e+00 |
50% | 1.000000e+00 | 7.609500e+04 | 6.000000e+00 | 6.000000e+00 | 1.000000e+00 | 1.000000e+00 |
75% | 1.000000e+00 | 4.346410e+05 | 1.500000e+01 | 1.300000e+01 | 2.000000e+00 | 2.000000e+00 |
max | 1.000000e+00 | 3.964973e+07 | 1.344200e+04 | 1.281600e+04 | 2.000000e+01 | 2.000000e+01 |
- dl.describe() is employed again to obtain statistics that are not inflated by duplicate rows.
Checking Data Column types for Well Structured Analysis.¶
dl.dtypes
p_recall float64 timestamp object delta int64 user_id object learning_language object ui_language object lexeme_id object lexeme_string object history_seen int64 history_correct int64 session_seen int64 session_correct int64 dtype: object
- For further analysis, every column should have an appropriate data type.
- Here, the timestamp column is stored as 'object' but should be 'datetime'.
Changing Datatype of Inappropriate Columns.¶
pd.options.mode.chained_assignment = None
dl["timestamp"]= pd.to_datetime(dl["timestamp"], format = '%Y-%m-%d %H:%M:%S')
- Pandas might issue a chained-assignment warning; setting pd.options.mode.chained_assignment = None suppresses it.
- pd.to_datetime() converts the dl["timestamp"] column to datetime format.
Segregating Numerical Columns for Detecting Outliers¶
# Select columns with numeric types (either float64 or int64)
numerical_columns = dl.select_dtypes(include=['float64', 'int64']).columns.tolist()
print(numerical_columns)
['p_recall', 'delta', 'history_seen', 'history_correct', 'session_seen', 'session_correct']
- dl.select_dtypes(include=['float64', 'int64']) filters the DataFrame dl down to the numerical columns (those of type float64 or int64).
- .columns.tolist() converts the selected column names into a list.
Detecting Outliers in the DataFrame¶
outliers = {}
- Initializing an empty dictionary with outliers = {}.
for column in numerical_columns:
Q1 = dl[column].quantile(0.10)
Q3 = dl[column].quantile(0.90)
IQR = Q3-Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
outliers[column] = dl[column][(dl[column]<lower_bound) | (dl[column]>upper_bound)]
print(outliers[column])
Series([], Name: p_recall, dtype: float64) 21 7688640 32 6915874 74 5092318 82 4855994 112 4664462 ... 3795641 9571218 3795644 4390588 3795693 4674316 3795718 4538280 3795763 13364296 Name: delta, Length: 149508, dtype: int64 3 111 20 535 73 2756 95 138 99 111 ... 3795548 2125 3795557 105 3795655 2019 3795686 95 3795701 359 Name: history_seen, Length: 132129, dtype: int64 3 99 20 510 73 2571 95 130 99 102 ... 3795548 1303 3795557 100 3795655 1855 3795686 86 3795701 333 Name: history_correct, Length: 130485, dtype: int64 113 7 157 9 196 9 232 9 263 14 .. 3795563 7 3795597 7 3795655 13 3795731 8 3795766 20 Name: session_seen, Length: 43983, dtype: int64 113 7 157 9 196 9 232 8 263 12 .. 3795391 8 3795597 7 3795655 13 3795731 7 3795766 20 Name: session_correct, Length: 33516, dtype: int64
- The code loops through all numeric columns (numerical_columns) in the DataFrame dl.
- For each column, it calculates the 10th and 90th percentiles to define the spread; note this is wider than the conventional interquartile range, which uses the 25th and 75th percentiles.
- lower_bound = Q1 - 1.5*IQR and upper_bound = Q3 + 1.5*IQR calculate the lower and upper bounds.
- outliers[column] = dl[column][(dl[column]<lower_bound) | (dl[column]>upper_bound)] stores the outliers, i.e. any data points outside this range.
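For comparison, the conventional IQR rule uses the 25th and 75th percentiles, which typically flags more points than the wider 10th/90th spread used above. A sketch of that variant on the same columns, shown here only as an alternative, not as the notebook's method:
# Conventional IQR rule (sketch): Q1/Q3 at the 25th/75th percentiles instead of 10th/90th.
iqr_outlier_counts = {}
for column in numerical_columns:
    q1 = dl[column].quantile(0.25)
    q3 = dl[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    iqr_outlier_counts[column] = int(((dl[column] < lower) | (dl[column] > upper)).sum())
print(iqr_outlier_counts)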
Handling the Outliers.¶
for col, data in outliers.items():
print(f"{col}: {len(data)}")
p_recall: 0 delta: 149508 history_seen: 132129 history_correct: 130485 session_seen: 43983 session_correct: 33516
- The loop iterates over outliers.items().
- len(data) gives the number of outliers detected in each column.
Visualizing Data Distributions and Outliers¶
for columns in numerical_columns:
plt.figure(figsize=(10, 6))
sns.boxplot(data=dl[columns])
plt.title('Boxplot for Numerical Columns')
plt.xlabel(columns)
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.show()
- The above code generates a separate boxplot for each column in the numerical_columns list, allowing visual inspection of the distribution and potential outliers in each column.
Insights on Keeping Outliers For Our Analysis.¶
- Real-World Phenomena: Outliers often capture extreme but genuine behaviors or events, which may be valuable for business or learning insights.
- Completeness: Removing rows reduces the size of the dataset and may introduce bias, especially if outliers represent unique user segments.
- Statistical Relevance: Outliers can help test robustness in models and identify areas for further improvement or intervention.
- Strategic Decisions: Insights from outliers may influence critical decisions, such as identifying highly engaged users, optimizing content, or addressing user retention issues.
Creating New columns for better Analysis¶
pd.options.mode.chained_assignment = None
language_mapping = {
'en': 'English',
'fr': 'French',
'es': 'Spanish',
'it': 'Italian',
'de': 'German',
'pt': 'Portuguese',
}
- Pandas might issue a chained-assignment warning; setting pd.options.mode.chained_assignment = None suppresses it.
- The language_mapping dictionary converts the two-letter codes into full language names, making the data more readable for analysis and visualization.
Create the learning_language_Abb Column¶
# Create a new column 'learning_language_Abb' based on the mapping
dl.loc[:, 'learning_language_Abb'] = dl['learning_language'].map(language_mapping)
# Display the updated DataFrame
print(dl[['learning_language', 'learning_language_Abb']])
learning_language learning_language_Abb 0 fr French 1 en English 2 de German 3 es Spanish 4 es Spanish ... ... ... 3795775 es Spanish 3795776 fr French 3795777 it Italian 3795778 en English 3795779 pt Portuguese [3795758 rows x 2 columns]
- Assigning via dl.loc[:, 'learning_language_Abb'] ensures that the column is created explicitly on the DataFrame.
- The .map() function applies the language_mapping dictionary to the learning_language column.
Create the ui_language_Abb Column¶
# Create a new column 'ui_language_Abb' based on the mapping
dl.loc[:, 'ui_language_Abb'] = dl['ui_language'].map(language_mapping)
# Display the updated DataFrame
print(dl[['ui_language', 'ui_language_Abb']])
ui_language ui_language_Abb 0 en English 1 es Spanish 2 en English 3 en English 4 en English ... ... ... 3795775 en English 3795776 en English 3795777 en English 3795778 es Spanish 3795779 en English [3795758 rows x 2 columns]
- Assigning via dl.loc[:, 'ui_language_Abb'] ensures that the column is created explicitly on the DataFrame.
- The .map() function applies the language_mapping dictionary to the ui_language column.
Extracting Column lexeme_base from lexeme_string¶
pd.options.mode.chained_assignment = None
dl['lexeme_base'] = dl['lexeme_string'].str.split('<', expand=True)[0]
dl['lexeme_base']
0 sur/sur 1 police/police 2 hat/haben 3 en/en 4 caballo/caballo ... 3795775 soy/ser 3795776 chiens/chien 3795777 voi/voi 3795778 are/be 3795779 café/café Name: lexeme_base, Length: 3795758, dtype: object
- lexeme_base is extracted from lexeme_string to capture the exact word being learned.
- Setting pd.options.mode.chained_assignment = None disables the chained-assignment warning.
- .str.split('<', expand=True)[0] splits each string in the lexeme_string column at the '<' character and keeps the portion before the first occurrence, i.e. the word and its lemma (e.g. hat/haben).
Extracting New Column grammar_tag from lexeme_string¶
import re
# Updated grammar_tags for more precise matching patterns
grammar_categories = {
'Pronouns and Related': ['pr', 'prn', 'preadv', 'predet', 'np', 'rel'],
'Verbs': ['vbhaver', 'vbdo', 'vbser', 'vblex', 'vbmod', 'vaux', 'ord'],
'Nouns': ['n', 'gen', 'apos'],
'Determiners and Adjectives': ['det', 'adj', 'predet'],
'Adverbs': ['adv'],
'Interjections': ['ij', '@ij:'],
'Conjunctions': ['cnjcoo', 'cnjadv', 'cnjsub', '@cnj:'],
'Numbers and Quantifiers': ['num', 'ord'],
'Negation': ['neg', '@neg:', '@common_phrases:', '@itg:'],
'Other': ['pprep', '@adv:', '@pr:', 'apos']
}
- The re module is imported for pattern-based string operations on the lexeme tags.
- The grammar_categories dictionary is initialized to map tag prefixes to broader grammar categories for more precise matching.
# Reverse the dictionary to map tags to their respective grammar categories
tag_to_category_map = {tag: category for category, tags in grammar_categories.items() for tag in tags}
- The above code reverses the grammar_categories dictionary to create a new dictionary, tag_to_category_map, where each individual tag is a key and its grammar category is the corresponding value.
# Function to determine the grammar category based on prefixes
def grammar_tags(lexeme_string):
# Extract tags inside '<>' and check if they start with any of the provided prefixes
tags_in_string = re.findall(r'<(.*?)>', lexeme_string) # Extract tags inside '<>'
for tag in tags_in_string:
for prefix in tag_to_category_map:
if tag.startswith(prefix): # Match based on prefix
return tag_to_category_map[prefix]
return 'Nan' # Default if no match is found
# Applying the function to the DataFrame
dl['grammar_tag'] = dl['lexeme_string'].apply(lambda x: grammar_tags(x))
print(dl.head())
p_recall timestamp delta user_id learning_language \ 0 1.0 2013-03-03 17:13:47 1825254 5C7 fr 1 1.0 2013-03-04 18:30:50 367 fWSx en 2 0.0 2013-03-03 18:35:44 1329 hL-s de 3 1.0 2013-03-07 17:56:03 156 h2_R es 4 1.0 2013-03-05 21:41:22 257 eON es ui_language lexeme_id \ 0 en 3712581f1a9fbc0894e22664992663e9 1 es 0371d118c042c6b44ababe667bed2760 2 en 5fa1f0fcc3b5d93b8617169e59884367 3 en 4d77de913dc3d65f1c9fac9d1c349684 4 en 35f14d06d95a34607d6abb0e52fc6d2b lexeme_string history_seen history_correct \ 0 sur/sur<pr> 2 1 1 police/police<n><pl> 6 5 2 hat/haben<vbhaver><pri><p3><sg> 10 10 3 en/en<pr> 111 99 4 caballo/caballo<n><m><sg> 3 3 session_seen session_correct learning_language_Abb ui_language_Abb \ 0 2 2 French English 1 2 2 English Spanish 2 1 0 German English 3 4 4 Spanish English 4 3 3 Spanish English lexeme_base grammar_tag 0 sur/sur Pronouns and Related 1 police/police Nouns 2 hat/haben Verbs 3 en/en Pronouns and Related 4 caballo/caballo Nouns
- re.findall(r'<(.*?)>', lexeme_string) extracts all substrings inside angle brackets (<>) from lexeme_string.
- Each extracted tag is checked to see whether it starts with any prefix in tag_to_category_map. If a match is found, the corresponding category is returned; if no match is found, the literal string 'Nan' is returned as a placeholder (a string, not np.nan).
- The grammar_tags function is applied to the lexeme_string column to create a new column named grammar_tag.
- .head() validates the output by printing the first few rows of the DataFrame dl, confirming that the new column has been added correctly.
dl['grammar_tag'].value_counts()
Nouns 1637545 Verbs 856416 Determiners and Adjectives 606273 Pronouns and Related 435685 Adverbs 133575 Conjunctions 66891 Interjections 45956 Other 6821 Negation 6585 Nan 9 Numbers and Quantifiers 2 Name: grammar_tag, dtype: int64
- value_counts() calculates the frequency distribution of unique values in the grammar_tag column of the DataFrame dl.
Extracting New Column gender_tag from lexeme_string¶
gender_tags = {
"Masculine": "m",
"Feminine": "f",
"Neuter": "nt",
"Masculine or Feminine (common gender)": "mf"
}
- Initializing the gender_tags dictionary, which maps each gender category to its lexeme tag.
def get_gender_tag_key(lexeme_string):
for key, tags in gender_tags.items():
if any(f"<{tag}>" in lexeme_string for tag in tags):
return key
return np.nan # Return NaN if no tags are found
# Apply the function to create the new column
dl['gender_tag'] = dl['lexeme_string'].apply(get_gender_tag_key)
- The get_gender_tag_key function takes lexeme_string as input and returns the gender category whose tag appears in the string.
- Because the dictionary values are plain strings, the inner loop iterates over their individual characters, and any() returns True as soon as one of those single-character tags (wrapped as f"<{tag}>") appears in lexeme_string; note that this also lets the noun tag <n> match the 'Neuter' entry.
- The function is then applied to the lexeme_string column, and the result is stored in a new column named gender_tag.
dl['gender_tag'].value_counts()
Neuter 783971 Masculine 696207 Feminine 546620 Name: gender_tag, dtype: int64
- value_counts() calculates the frequency distribution of unique values in the gender_tag column of the DataFrame dl.
Extracting New Column plurality_tag from lexeme_string¶
plurality_tags = {
"Singular":{"sg": "Singular"},
"Plural":{"pl": "Plural"}
}
- Initializing the plurality_tags dictionary, which maps each plurality category to its lexeme tag.
def get_plurality_tags_key(lexeme_string):
for key, tags in plurality_tags.items():
if any(f"<{tag}>" in lexeme_string for tag in tags):
return key
return np.nan # Return NaN if no tags are found
# Apply the function to create the new column
dl['plurality_tag'] = dl['lexeme_string'].apply(get_plurality_tags_key)
- The get_plurality_tags_key function takes lexeme_string as input and returns the plurality category whose tag appears in the string.
- Iterating over each inner dictionary yields its tag keys ("sg", "pl"), and any() checks whether f"<{tag}>" occurs in lexeme_string; as soon as one tag matches, the category key is returned.
- The function is then applied to the lexeme_string column, and the result is stored in a new column named plurality_tag.
dl['plurality_tag'].value_counts()
Singular 2233949 Plural 667294 Name: plurality_tag, dtype: int64
- value_counts() calculates the frequency distribution of unique values in the plurality_tag column of the DataFrame dl.
Initializing New Column delta_days from the delta Column¶
pd.options.mode.chained_assignment = None
dl['delta_days'] = dl['delta'] / 86400
dl['delta_days'] = dl['delta_days'].round(3)
dl['delta_days']
0 21.126 1 0.004 2 0.015 3 0.002 4 0.003 ... 3795775 0.055 3795776 0.016 3795777 7.130 3795778 0.003 3795779 0.002 Name: delta_days, Length: 3795758, dtype: float64
- Pandas might issue a chained-assignment warning; setting pd.options.mode.chained_assignment = None suppresses it.
- As stated earlier, delta is the time (in seconds) since the last lesson or practice; dividing by 86,400 converts it to days since the last session, which is easier to read in the analysis.
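An equivalent way to express the same conversion is with pandas timedeltas, which avoids the hard-coded 86,400; a small sketch (the _alt column name is only illustrative):
# Equivalent conversion (sketch): express delta as fractional days via timedeltas.
dl['delta_days_alt'] = (pd.to_timedelta(dl['delta'], unit='s') / pd.Timedelta(days=1)).round(3)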
Extracting time from the timestamp Column¶
dl['time'] = dl['timestamp'].dt.time
dl['time']
0 17:13:47 1 18:30:50 2 18:35:44 3 17:56:03 4 21:41:22 ... 3795775 23:06:48 3795776 22:49:23 3795777 21:20:18 3795778 07:54:24 3795779 21:12:07 Name: time, Length: 3795758, dtype: object
- dl['timestamp'].dt.time extracts the time portion (hour, minute, second) from the timestamp column.
Categorizing delta_days into a Column delta_days_category¶
# defining a function to apply to the column
def categorize_delta_days(value):
if value == 0:
return 'Zero'
elif value > 0 and value <= 1:
return 'Less than a day'
elif value > 1 and value <= 7:
return 'Within a week'
elif value > 7 and value <= 30:
return 'Within a month'
else:
return 'Over a month'
- The categorize_delta_days function is used to categorize delta_days into predefined categories based on the time difference.
dl['delta_days_category'] = [categorize_delta_days(x) for x in dl['delta_days']]
- This is a list comprehension that loops through each value x in the delta_days column of the dl DataFrame.
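A vectorized alternative to the row-by-row list comprehension is pd.cut with the same boundaries; a sketch that should produce the same labels (as a categorical column rather than plain strings, under the assumption that the bin edges below match the function above):
# Vectorized alternative (sketch): same boundaries expressed as right-closed pd.cut bins.
bins = [-float('inf'), 0, 1, 7, 30, float('inf')]
labels = ['Zero', 'Less than a day', 'Within a week', 'Within a month', 'Over a month']
dl['delta_days_category_alt'] = pd.cut(dl['delta_days'], bins=bins, labels=labels)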
Deriving success_rate_history from Two Columns¶
dl['success_rate_history'] = dl['history_correct'] / dl['history_seen']
dl['success_rate_history']
0 0.500000 1 0.833333 2 1.000000 3 0.891892 4 1.000000 ... 3795775 0.833333 3795776 1.000000 3795777 0.880000 3795778 0.937500 3795779 1.000000 Name: success_rate_history, Length: 3795758, dtype: float64
- success_rate_history calculates the success rate for each record by dividing history_correct by history_seen.
Extracting Column hour from the time Column¶
dl['time'] = pd.to_datetime(dl['time'], errors= 'coerce', format='%H:%M:%S').dt.time
dl['time']
0 17:13:47 1 18:30:50 2 18:35:44 3 17:56:03 4 21:41:22 ... 3795775 23:06:48 3795776 22:49:23 3795777 21:20:18 3795778 07:54:24 3795779 21:12:07 Name: time, Length: 3795758, dtype: object
- The above code re-parses the time column of the DataFrame dl with pd.to_datetime (coercing parsing errors to NaT), expecting values in HH:MM:SS format, and keeps only the time portion via .dt.time.
dl['time_d'] = pd.to_datetime(dl['time'], errors= 'coerce', format='%H:%M:%S')
dl['hour'] = dl['time_d'].dt.hour
dl['hour']
0 17 1 18 2 18 3 17 4 21 .. 3795775 23 3795776 22 3795777 21 3795778 7 3795779 21 Name: hour, Length: 3795758, dtype: int64
- dl['time_d'].dt.hour creates a new column hour that stores the hour (from 0 to 23) of the datetime values in the intermediate time_d column.
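Since timestamp is already a datetime column, the same hour values can also be obtained directly from it, without the intermediate time and time_d columns; a simpler equivalent sketch:
# Simpler equivalent (sketch): read the hour straight from the datetime timestamp column.
dl['hour'] = dl['timestamp'].dt.hour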
DataFrame dl¶
dl
p_recall | timestamp | delta | user_id | learning_language | ui_language | lexeme_id | lexeme_string | history_seen | history_correct | ... | lexeme_base | grammar_tag | gender_tag | plurality_tag | delta_days | time | delta_days_category | success_rate_history | time_d | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2013-03-03 17:13:47 | 1825254 | 5C7 | fr | en | 3712581f1a9fbc0894e22664992663e9 | sur/sur<pr> | 2 | 1 | ... | sur/sur | Pronouns and Related | NaN | NaN | 21.126 | 17:13:47 | Within a month | 0.500000 | 1900-01-01 17:13:47 | 17 |
1 | 1.0 | 2013-03-04 18:30:50 | 367 | fWSx | en | es | 0371d118c042c6b44ababe667bed2760 | police/police<n><pl> | 6 | 5 | ... | police/police | Nouns | Neuter | Plural | 0.004 | 18:30:50 | Less than a day | 0.833333 | 1900-01-01 18:30:50 | 18 |
2 | 0.0 | 2013-03-03 18:35:44 | 1329 | hL-s | de | en | 5fa1f0fcc3b5d93b8617169e59884367 | hat/haben<vbhaver><pri><p3><sg> | 10 | 10 | ... | hat/haben | Verbs | NaN | Singular | 0.015 | 18:35:44 | Less than a day | 1.000000 | 1900-01-01 18:35:44 | 18 |
3 | 1.0 | 2013-03-07 17:56:03 | 156 | h2_R | es | en | 4d77de913dc3d65f1c9fac9d1c349684 | en/en<pr> | 111 | 99 | ... | en/en | Pronouns and Related | NaN | NaN | 0.002 | 17:56:03 | Less than a day | 0.891892 | 1900-01-01 17:56:03 | 17 |
4 | 1.0 | 2013-03-05 21:41:22 | 257 | eON | es | en | 35f14d06d95a34607d6abb0e52fc6d2b | caballo/caballo<n><m><sg> | 3 | 3 | ... | caballo/caballo | Nouns | Masculine | Singular | 0.003 | 21:41:22 | Less than a day | 1.000000 | 1900-01-01 21:41:22 | 21 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3795775 | 1.0 | 2013-03-06 23:06:48 | 4792 | iZ7d | es | en | 84e18e86c58e8e61d687dfa06b3aaa36 | soy/ser<vbser><pri><p1><sg> | 6 | 5 | ... | soy/ser | Verbs | NaN | Singular | 0.055 | 23:06:48 | Less than a day | 0.833333 | 1900-01-01 23:06:48 | 23 |
3795776 | 1.0 | 2013-03-07 22:49:23 | 1369 | hxJr | fr | en | f5b66d188d15ccb5d7777a59756e33ad | chiens/chien<n><m><pl> | 3 | 3 | ... | chiens/chien | Nouns | Masculine | Plural | 0.016 | 22:49:23 | Less than a day | 1.000000 | 1900-01-01 22:49:23 | 22 |
3795777 | 1.0 | 2013-03-06 21:20:18 | 615997 | fZeR | it | en | 91a6ab09aa0d2b944525a387cc509090 | voi/voi<prn><tn><p2><mf><pl> | 25 | 22 | ... | voi/voi | Pronouns and Related | NaN | Plural | 7.130 | 21:20:18 | Within a month | 0.880000 | 1900-01-01 21:20:18 | 21 |
3795778 | 1.0 | 2013-03-07 07:54:24 | 289 | g_D3 | en | es | a617ed646a251e339738ce62b84e61ce | are/be<vbser><pres> | 32 | 30 | ... | are/be | Verbs | NaN | NaN | 0.003 | 07:54:24 | Less than a day | 0.937500 | 1900-01-01 07:54:24 | 7 |
3795779 | 1.0 | 2013-03-06 21:12:07 | 191 | iiN7 | pt | en | 4a93acdbafaa061fd69226cf686d7a2b | café/café<n><m><sg> | 3 | 3 | ... | café/café | Nouns | Masculine | Singular | 0.002 | 21:12:07 | Less than a day | 1.000000 | 1900-01-01 21:12:07 | 21 |
3795758 rows × 24 columns
- The DataFrame dl above is now well organized and ready for EDA.
3. Exploratory Data Analysis (EDA)¶
Performing Statistical Analysis for some of the Derived columns¶
Statistical Analysis of success_rate_history¶
mean_success_rate_history = dl['success_rate_history'].mean()
median_success_rate_history = dl['success_rate_history'].median()
max_success_rate_history = dl['success_rate_history'].max()
min_success_rate_history = dl['success_rate_history'].min()
print(f"Mean Success Rate for History: {mean_success_rate_history:.3f}")
print(f"Median Success Rate for History: {median_success_rate_history:.3f}")
print(f"Max Success Rate for History: {max_success_rate_history}")
print(f"Min Success Rate for History: {min_success_rate_history}")
Mean Success Rate for History: 0.901 Median Success Rate for History: 0.963 Max Success Rate for History: 1.0 Min Success Rate for History: 0.05
- The above code aims to analyze performance trends by summarizing key statistics of the success_rate_history column.
Statistical Analysis of p_recall¶
mean_recall_rate_session = dl['p_recall'].mean()
median_recall_rate_session = dl['p_recall'].median()
max_recall_rate_session = dl['p_recall'].max()
min_recall_rate_session = dl['p_recall'].min()
print(f"Mean Recall Rate for Session: {mean_recall_rate_session:.3f}")
print(f"Median Recall Rate for Session: {median_recall_rate_session:.3f}")
print(f"Max Recall Rate for Session: {max_recall_rate_session}")
print(f"Min Recall Rate for Session: {min_recall_rate_session}")
Mean Recall Rate for Session: 0.896 Median Recall Rate for Session: 1.000 Max Recall Rate for Session: 1.0 Min Recall Rate for Session: 0.0
- The above code aims to analyze performance trends by summarizing key statistics of the p_recall column.
Range of Columns¶
p_recall_range = dl["p_recall"].max()-dl["p_recall"].min()
delta_range = dl["delta"].max()-dl["delta"].min()
history_seen_range = dl["history_seen"].max()-dl["history_seen"].min()
history_correct_range = dl["history_correct"].max()-dl["history_correct"].min()
session_seen_range = dl["session_seen"].max()-dl["session_seen"].min()
session_correct_range = dl["session_correct"].max()-dl["session_correct"].min()
print(f"The Range of p_recall: {p_recall_range:.3f}")
print(f"The Range of delta: {delta_range:.3f}")
print(f"The Range of history_seen: {history_seen_range}")
print(f"The Range of history_correct: {history_correct_range}")
print(f"The Range of session_seen: {session_seen_range}")
print(f"The Range of session_correct: {session_correct_range}")
The Range of p_recall: 1.000 The Range of delta: 39649729.000 The Range of history_seen: 13441 The Range of history_correct: 12815 The Range of session_seen: 19 The Range of session_correct: 20
Obtain the unique lexeme_ids of all learning_language¶
# Group by the 'learning_language_Abb' column and get unique lexeme_ids for each language
unique_lexemes_per_lang = dl.groupby('learning_language_Abb')['lexeme_id'].nunique().reset_index()
# Rename the column for better readability
unique_lexemes_per_lang.rename(columns={'lexeme_id': 'Unique Lexeme IDs'}, inplace=True)
# Display the result
print(unique_lexemes_per_lang)
learning_language_Abb Unique Lexeme IDs 0 English 2740 1 French 3429 2 German 3218 3 Italian 1750 4 Portuguese 2055 5 Spanish 3052
- dl.groupby('learning_language_Abb')['lexeme_id'].nunique().reset_index() groups the dataset dl by the learning_language_Abb column, where each unique value forms a group; the nunique() method then calculates the number of unique lexeme_id values, giving the count of distinct lexemes associated with each language.
- In unique_lexemes_per_lang, the column that stores the count of unique lexeme IDs is renamed from 'lexeme_id' to 'Unique Lexeme IDs' to make the DataFrame easier to understand.
Obtain the unique lexeme_strings of all learning_language¶
# Group by the 'learning_language_Abb' column and get unique lexeme_strings for each language
unique_words_per_lang = dl.groupby('learning_language_Abb')['lexeme_string'].nunique().reset_index()
# Rename the column for better readability
unique_words_per_lang.rename(columns={'lexeme_string': 'Unique words'}, inplace=True)
# Display the result
print(unique_words_per_lang)
learning_language_Abb Unique words 0 English 2740 1 French 3429 2 German 3218 3 Italian 1750 4 Portuguese 2055 5 Spanish 3052
- dl.groupby('learning_language_Abb')['lexeme_string'].nunique().reset_index() groups the dataset dl by the learning_language_Abb column, where each unique value forms a group; the nunique() method then calculates the number of unique lexeme_string values, giving the count of distinct lexeme strings associated with each language.
- In unique_words_per_lang, the column that stores the count of unique lexeme strings is renamed from 'lexeme_string' to 'Unique words' to make the DataFrame easier to understand.
Hypothesis 1:- Analyzing user engagement for each user aims to identify patterns, expecting higher engagement levels among a subset of users reflecting varying learning behaviors.¶
Computing User engagement¶
user_engagement = dl.groupby("user_id")["history_seen"].sum().reset_index()
user_engagement = user_engagement.sort_values(by=["history_seen"], ascending=[ False])
user_engagement
user_id | history_seen | |
---|---|---|
3328 | bcH_ | 3787808 |
27016 | goA | 3479726 |
5998 | cpBu | 3263073 |
7448 | dOig | 1266491 |
1223 | NPs | 903094 |
... | ... | ... |
17912 | fkZR | 1 |
318 | 6XJ | 1 |
40813 | hj3 | 1 |
25380 | g_k- | 1 |
29632 | h2Xz | 1 |
79694 rows × 2 columns
dl.groupby("user_id")["history_seen"].sum().reset_index()
function aggregates thehistory_seen
values for eachuser_id
which sums up all thehistory_seen
values for each unique user.- The resultant is sorted by Descending values.
Computing User Engagement on Recall¶
# Correlation between user engagement (history_seen) and learning success (p_recall)
user_engagement_recall = dl.groupby('user_id')['p_recall'].mean().reset_index()
user_engagement_recall = user_engagement_recall.sort_values(by=['p_recall'], ascending=[ False])
user_engagement_recall
user_id | p_recall | |
---|---|---|
12841 | exUs | 1.0 |
17986 | flbK | 1.0 |
37145 | hUni | 1.0 |
17985 | flax | 1.0 |
71594 | ijPr | 1.0 |
... | ... | ... |
60390 | iRTg | 0.0 |
14364 | fAzQ | 0.0 |
48288 | iB2r | 0.0 |
48361 | iBFg | 0.0 |
78581 | k0S | 0.0 |
79694 rows × 2 columns
- dl.groupby('user_id')['p_recall'].mean().reset_index() aggregates the p_recall values for each user_id, taking the mean of p_recall for each unique user.
- The result is sorted in descending order.
Correlation between user engagement and learning success¶
engagement_success_corr = user_engagement.merge(user_engagement_recall, on='user_id')
engagement_success_corr_corr = engagement_success_corr['history_seen'].corr(engagement_success_corr['p_recall'])
print(f"Correlation between user engagement and learning success: {engagement_success_corr_corr:.2f}")
Correlation between user engagement and learning success: -0.01
- The method .corr() is used to compute the Pearson correlation coefficient between two columns in the engagement_success_corr DataFrame.
- The Pearson correlation measures the linear relationship between two variables:
- A value close to 1 indicates a strong positive correlation (as one increases, the other also increases).
- A value close to -1 indicates a strong negative correlation (as one increases, the other decreases).
- A value close to 0 indicates no linear correlation.
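If a significance check is also desired alongside the coefficient, scipy offers one; a sketch, assuming scipy is installed (it is not used elsewhere in this notebook):
# Optional significance check (assumes scipy is available).
from scipy.stats import pearsonr
r, p_value = pearsonr(engagement_success_corr['history_seen'],
                      engagement_success_corr['p_recall'])
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")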
Most Engaged User in the recorded time¶
# Find the user with the highest engagement
highest_engagement_user = user_engagement.loc[user_engagement["history_seen"].idxmax()]
highest_engagement_user
user_id bcH_ history_seen 3787808 Name: 3328, dtype: object
# Filter the dataset for the highest engaged user
user_data = dl[dl["user_id"] == highest_engagement_user["user_id"]]
# Calculate and display the mean and median of p_recall
mean_p_recall = user_data["p_recall"].mean()
median_p_recall = user_data["p_recall"].median()
print(f"Mean p_recall for highest engaged user: {mean_p_recall}, Median p_recall for highest engaged user: {median_p_recall}")
Mean p_recall for highest engaged user: 0.46417843355016153, Median p_recall for highest engaged user: 0.5
Analysis:¶
Correlation Insight:
- The correlation between user engagement (history_seen) and learning success (p_recall) is -0.01, indicating virtually no linear relationship between the two. This suggests that higher engagement doesn't necessarily translate to better recall rates, and vice versa.
User with Highest Engagement:
- The user with the highest engagement is bcH_, with 3,787,808 history_seen. This user's mean p_recall is only 0.464 (median 0.5), well below the overall mean of roughly 0.9, suggesting that the sheer volume of exposure did not translate into strong recall.
Engagement and Success Variability:
- While some users exhibit high engagement, their recall rates vary significantly. Similarly, users with perfect recall (p_recall = 1.0) may not necessarily have high engagement, highlighting potential inconsistencies in the relationship between the quantity of interaction and learning outcomes.
Key Insights:¶
Engagement Alone is Not Enough: High engagement does not guarantee learning success, as seen from the weak correlation.
Targeted Improvement Required: Users with moderate or low recall but high engagement could benefit from tailored learning content to improve their effectiveness.
Outlier Behavior: Certain users with high recall rates but low engagement indicate the potential for efficient learning strategies, which could be analyzed further to design better learning paths.
Recommendations:¶
Personalized Feedback: Focus on providing feedback to highly engaged users with low recall rates to enhance their learning efficiency (a sketch for flagging such users follows this list).
Optimized Content Delivery: Investigate the methods of users with high recall but low engagement to replicate and design efficient learning strategies for others.
Behavioral Analysis: Conduct a deeper analysis of the user with the highest engagement (
bcH_
) to understand whether their engagement is productive or repetitive and optimize their learning experience.Incentivize Quality Engagement: Create programs or gamified experiences encouraging not just time spent but effective learning practices to balance engagement and success.
Hypothesis 2:- The analysis aims to identify patterns in user engagement and recall, expecting higher engagement with easier lexemes. This reflects varying learning behaviors and the impact of different materials on user success.¶
Extract lexeme_performance¶
# Group by lexeme_id and calculate mean recall rate
lexeme_performance = dl.groupby('lexeme_id')['p_recall'].agg(['mean', 'median']).round(2)
lexeme_performance
mean | median | |
---|---|---|
lexeme_id | ||
00022efc4121667defd065c88569e748 | 0.90 | 1.0 |
00064bf8c1c3cefa80b193acf7b9fe1d | 0.90 | 1.0 |
000d2eb6a5658fa17f828c5fb0c66c11 | 0.95 | 1.0 |
000f3063358c188d171d903ec5a7855c | 0.85 | 1.0 |
001635bcb24496cf2b27731c3708dbfa | 0.94 | 1.0 |
... | ... | ... |
ffeacb268a19c068cd8171938e5280a8 | 1.00 | 1.0 |
ffedebe922588f522094dd8eac320071 | 1.00 | 1.0 |
ffee4a0570d4eacc9c08f339c2bb11a7 | 1.00 | 1.0 |
fff70e9352d896105563156d7023d878 | 0.81 | 1.0 |
fff799d4d95e416db9dc07ca717b4ef9 | 0.91 | 1.0 |
16244 rows × 2 columns
- dl.groupby('lexeme_id')['p_recall'].agg(['mean', 'median']).round(2) groups the data in the DataFrame dl by the lexeme_id column and computes the mean and median of p_recall for each group.
- Each group corresponds to all rows in the dataset associated with a specific lexeme.
Extract lexeme_success¶
# Group by lexeme_id and calculate success rate
lexeme_success = dl.groupby('lexeme_id')['success_rate_history'].agg(['mean', 'median']).round(2)
lexeme_success
mean | median | |
---|---|---|
lexeme_id | ||
00022efc4121667defd065c88569e748 | 1.00 | 1.00 |
00064bf8c1c3cefa80b193acf7b9fe1d | 0.86 | 0.94 |
000d2eb6a5658fa17f828c5fb0c66c11 | 0.98 | 1.00 |
000f3063358c188d171d903ec5a7855c | 0.85 | 0.88 |
001635bcb24496cf2b27731c3708dbfa | 0.93 | 1.00 |
... | ... | ... |
ffeacb268a19c068cd8171938e5280a8 | 1.00 | 1.00 |
ffedebe922588f522094dd8eac320071 | 1.00 | 1.00 |
ffee4a0570d4eacc9c08f339c2bb11a7 | 1.00 | 1.00 |
fff70e9352d896105563156d7023d878 | 0.92 | 1.00 |
fff799d4d95e416db9dc07ca717b4ef9 | 0.94 | 1.00 |
16244 rows × 2 columns
- dl.groupby('lexeme_id')['success_rate_history'].agg(['mean', 'median']).round(2) groups the data in the DataFrame dl by the lexeme_id column and computes the mean and median of success_rate_history for each group.
- Each group corresponds to all rows in the dataset associated with a specific lexeme.
Lexeme Difficulty (Based on Recall)¶
# Calculate the mean recall rate for each lexeme_id
lexeme_difficulty_corr = dl.groupby('lexeme_id').agg({'p_recall': 'mean'}).reset_index()
lexeme_difficulty_corr
lexeme_id | p_recall | |
---|---|---|
0 | 00022efc4121667defd065c88569e748 | 0.904762 |
1 | 00064bf8c1c3cefa80b193acf7b9fe1d | 0.900000 |
2 | 000d2eb6a5658fa17f828c5fb0c66c11 | 0.952830 |
3 | 000f3063358c188d171d903ec5a7855c | 0.849481 |
4 | 001635bcb24496cf2b27731c3708dbfa | 0.940810 |
... | ... | ... |
16239 | ffeacb268a19c068cd8171938e5280a8 | 1.000000 |
16240 | ffedebe922588f522094dd8eac320071 | 1.000000 |
16241 | ffee4a0570d4eacc9c08f339c2bb11a7 | 1.000000 |
16242 | fff70e9352d896105563156d7023d878 | 0.808757 |
16243 | fff799d4d95e416db9dc07ca717b4ef9 | 0.910078 |
16244 rows × 2 columns
- dl.groupby('lexeme_id').agg({'p_recall': 'mean'}).reset_index() aggregates the mean of p_recall grouped by 'lexeme_id' in the DataFrame dl.
- Each group corresponds to all rows in the dataset associated with a specific lexeme.
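To make the difficulty ranking concrete, the per-lexeme means can be sorted after filtering out rarely seen lexemes, whose averages are noisy; the minimum-row threshold below is an arbitrary illustrative choice:
# Sketch: rank lexemes by mean recall, keeping only lexemes with a reasonable number of rows.
lexeme_counts = dl['lexeme_id'].value_counts()
frequent_ids = lexeme_counts[lexeme_counts >= 100].index
hardest_lexemes = (lexeme_difficulty_corr[lexeme_difficulty_corr['lexeme_id'].isin(frequent_ids)]
                   .sort_values('p_recall')
                   .head(10))
print(hardest_lexemes)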
Analysis¶
Recall Rates Across Lexemes:
- The average recall rates for lexemes vary significantly, with some lexemes achieving perfect recall (1.0), while others have lower averages.
- Median recall rates for most lexemes are high, often 1.0, suggesting that learners tend to perform well on individual lexemes when averaged over all users.
Success Rates Across Lexemes:
- Similar to recall, success rates for lexemes also show variability. Some lexemes have perfect success rates (1.0 mean and median), while others are lower, indicating that certain lexemes are harder for users to master.
Correlation Insights:
- The lexeme_difficulty_corr calculation indicates variability in recall performance across lexemes, reflecting differences in difficulty or contextual use.
Key Insights¶
Lexeme Difficulty Varies:
- Some lexemes are inherently harder for users to recall and master, as evidenced by lower mean recall and success rates.
High Success Rates with Variability:
- While the median success rate for many lexemes is 1.0, the variability in the mean suggests some users face challenges with specific lexemes.
Consistent Performance in Easy Lexemes:
- Lexemes with high recall and success rates likely represent concepts or words that are easier or more frequently encountered in practice.
Recommendations¶
Target Difficult Lexemes:
- Identify lexemes with lower mean recall and success rates and introduce additional practice or review sessions specifically focused on these lexemes.
Contextual Learning:
- Provide contextual examples or mnemonics for harder lexemes to enhance retention and recall rates.
Adaptive Learning:
- Use the data on lexeme difficulty to adapt learning paths, offering more repetition and practice for harder lexemes while reducing redundancy for easier ones.
Track Lexeme Mastery Over Time:
- Continuously monitor lexeme performance to understand trends and adjust teaching methods dynamically, ensuring improvement in both recall and success rates.
Hypothesis 3:- The analysis explores how UI languages influence both recall rates and success rates, hypothesizing that certain languages may lead to better or worse performance in learning. It expects differences in recall and success rates based on the user's preferred UI language.¶
Aggregating Mean and Median on success_rate_history for ui_language_Abb¶
ui_language_stats_history_success_rate = dl.groupby('ui_language_Abb')['success_rate_history'].agg(['mean', 'median'])
ui_language_stats_history_success_rate = ui_language_stats_history_success_rate.reset_index()
ui_language_stats_history_success_rate
ui_language_Abb | mean | median | |
---|---|---|---|
0 | English | 0.898451 | 0.988764 |
1 | Italian | 0.910783 | 0.974359 |
2 | Portuguese | 0.905714 | 0.954545 |
3 | Spanish | 0.903697 | 0.948718 |
- The code computes the mean and median success rates for the success_rate_history column, grouped by each language in the ui_language_Abb column.
- This provides insights into how learners with different UI languages perform on average and at the median level. It also creates a clean DataFrame for further analysis or visualization.
Aggregating Mean and Median on p_recall for ui_language_Abb¶
ui_language_stats_recall_rate = dl.groupby('ui_language_Abb')['p_recall'].agg(['mean', 'median'])
ui_language_stats_recall_rate = ui_language_stats_recall_rate.reset_index()
ui_language_stats_recall_rate
ui_language_Abb | mean | median | |
---|---|---|---|
0 | English | 0.894915 | 1.0 |
1 | Italian | 0.908159 | 1.0 |
2 | Portuguese | 0.897696 | 1.0 |
3 | Spanish | 0.898156 | 1.0 |
- The code computes the mean and median recall rates for the p_recall column, grouped by each language in the ui_language_Abb column.
- This provides insights into how learners with different UI languages perform on average and at the median level. It also creates a clean DataFrame for further analysis or visualization.
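Before drawing conclusions, the two summaries can be merged for a side-by-side view per UI language; a small sketch using the DataFrames built above:
# Sketch: combine the success-rate and recall summaries on ui_language_Abb.
ui_language_comparison = ui_language_stats_history_success_rate.merge(
    ui_language_stats_recall_rate, on='ui_language_Abb', suffixes=('_success', '_recall'))
print(ui_language_comparison)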
Analysis¶
Success Rate by UI Language:
- The mean success rate across UI languages is consistently high, ranging between 0.898 (English) and 0.911 (Italian).
- Median success rates are even higher, with most UI languages achieving close to or above 0.95, indicating that users tend to achieve high success rates during their learning sessions regardless of the UI language.
Recall Rate by UI Language:
- The mean recall rates also show a similar trend, ranging from 0.895 (English) to 0.908 (Italian).
- The median recall rate is perfect (1.0) across all UI languages, suggesting that users often recall individual lexemes correctly when the data is aggregated.
Comparison Across Languages:
- Italian outperforms other UI languages slightly in both success and recall rates, with the highest mean values in both metrics.
- Other languages (English, Portuguese, and Spanish) follow closely, with minimal variation, suggesting uniform learning effectiveness.
Key Insights¶
Uniform Performance Across Languages:
- Users perform consistently across different UI languages, with only slight differences in mean success and recall rates.
Italian Leads in Engagement:
- Italian users achieve the highest mean recall and success rates, indicating a potentially more engaging or user-friendly interface, or a more motivated user base.
Perfect Median Recall Rates:
- The perfect median recall rate (1.0) across all UI languages highlights effective learning mechanics, ensuring most users can recall lexemes correctly.
Recommendations¶
Deep Dive into Italian Performance:
- Investigate why Italian users are achieving slightly higher performance. This could inform UI/UX improvements for other languages.
Standardize UI Features:
- Apply any positive features or feedback from high-performing languages (like Italian) to other UI languages to further enhance overall performance.
Leverage Median Performance:
- Since most users achieve a perfect recall rate (median), focus on uplifting the mean by targeting specific user segments or lexemes with lower success rates.
Localized Support and Resources:
- Offer more tailored support or resources for users of specific UI languages, especially if certain groups exhibit higher variability in their learning outcomes.
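As a starting point for the last recommendation, a minimal sketch of how the variability of success rates per UI language could be measured with the existing dl DataFrame (the iqr helper is a hypothetical name added here for illustration):
# Spread of success rates within each UI language; a larger spread suggests
# groups that may benefit from localized support
def iqr(s):
    # Interquartile range, a robust measure of spread
    return s.quantile(0.75) - s.quantile(0.25)

ui_language_variability = (
    dl.groupby('ui_language_Abb')['success_rate_history']
    .agg(['mean', 'std', iqr])
    .sort_values('std', ascending=False)
)
print(ui_language_variability)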
Hypothesis 4:- Gender and grammatical features, such as plurality, may influence learning performance. It suggests that gender and grammatical differences (e.g., singular vs plural) will show varying impacts on recall and success rates across different groups.¶
Deriving mean and median on p_recall for gender_tag¶
# Group by gender_tag and calculate mean recall rate for each gender
gender_performance = dl.groupby('gender_tag')['p_recall'].agg(['mean', 'median'])
gender_performance
mean | median | |
---|---|---|
gender_tag | ||
Feminine | 0.896025 | 1.0 |
Masculine | 0.899669 | 1.0 |
Neuter | 0.907303 | 1.0 |
- dl.groupby('gender_tag')['p_recall'].agg(['mean', 'median']) computes the mean and median of the p_recall column, grouped by each gender in the gender_tag column.
Deriving mean and median on success_rate_history for gender_tag¶
# Group by gender_tag and calculate success rate
gender_success = dl.groupby('gender_tag')['success_rate_history'].agg(['mean', 'median'])
gender_success
mean | median | |
---|---|---|
gender_tag | ||
Feminine | 0.897845 | 1.0 |
Masculine | 0.899576 | 1.0 |
Neuter | 0.911196 | 1.0 |
- dl.groupby('gender_tag')['success_rate_history'].agg(['mean', 'median']) computes the mean and median of the success_rate_history column, grouped by each gender in the gender_tag column.
Aggregating p_recall across gender_tag and plurality_tag¶
# Compare performance across grammatical features (plurality_tag) and gender
gender_plurality_comparison = dl.groupby(['gender_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median'])
gender_plurality_comparison
mean | median | ||
---|---|---|---|
gender_tag | plurality_tag | ||
Feminine | Plural | 0.893904 | 1.0 |
Singular | 0.896369 | 1.0 | |
Masculine | Plural | 0.902440 | 1.0 |
Singular | 0.899193 | 1.0 | |
Neuter | Plural | 0.901168 | 1.0 |
Singular | 0.908342 | 1.0 |
- The code computes the mean and median of the p_recall column, grouped by each combination of gender_tag and plurality_tag.
Analysis¶
1. Recall Rate by Gender:¶
- Mean recall rates are very close across genders: Feminine (0.896), Masculine (0.900), and Neuter (0.907).
- The median recall rate is 1.0 for all genders, indicating that most users recall items accurately regardless of gender-tagged lexemes.
2. Success Rate by Gender:¶
- Mean success rates follow a similar pattern: Feminine (0.898), Masculine (0.900), and Neuter (0.911).
- The median success rate is also 1.0 across all genders, highlighting consistent high performance for all gender categories.
3. Gender and Plurality Performance:¶
- Neuter Singular lexemes show the highest mean recall rate (0.908) among all gender and plurality combinations.
- Feminine Plural lexemes have the lowest mean recall rate (0.894) but still maintain a perfect median recall rate.
- Plural lexemes generally have slightly lower mean recall rates compared to their singular counterparts across all genders.
Key Insights¶
Consistency Across Genders:
- Performance in terms of recall and success rates is uniform across gender tags, with only minor differences.
Neuter Gender Leads:
- Lexemes tagged as Neuter show slightly higher performance in both recall and success rates, especially in singular form.
Plurality Impact:
- Plural lexemes tend to have marginally lower mean recall rates than singular lexemes, suggesting users may find plural forms slightly more challenging.
High Median Performance:
- The median recall and success rates of 1.0 across all gender and plurality combinations highlight effective learning for most users.
Recommendations¶
Address Plural Lexemes:
- Develop targeted exercises or materials to improve the recall of plural lexemes, especially those tagged as Feminine Plural.
Leverage Neuter Lexeme Strengths:
- Analyze why Neuter Singular lexemes perform better to replicate this success in other categories.
Monitor Low Variance Segments:
- While overall performance is high, identify specific lexemes or users with lower scores in the Feminine and Plural categories for tailored support.
Balanced Lexeme Practice:
- Ensure learning paths include an equal mix of gender and plurality-tagged lexemes to avoid bias or under-representation of any category.
Advanced Insights:
- Investigate if cultural or linguistic biases in learning gendered lexemes influence performance, and adapt the content accordingly.
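For the last recommendation, a small exploratory sketch (using the dl columns already introduced) of whether performance on gendered lexemes differs by learning language, which would be one observable symptom of language-specific effects:
# Mean recall for each gender tag, broken down by learning language
gender_by_language = (
    dl.groupby(['learning_language_Abb', 'gender_tag'])['p_recall']
    .mean()
    .unstack('gender_tag')
    .round(3)
)
print(gender_by_language)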
Hypothesis 5 :- Longer median delta days indicate lower consistency, suggesting sporadic learning behavior. Shorter median delta days reflect higher consistency, implying sustained engagement and better performance.¶
Deriving Median Delta Days for each user¶
user_id_consistency = dl.groupby('user_id')['delta_days'].median().reset_index()
user_id_consistency = user_id_consistency.sort_values(by = 'delta_days', ascending = False)
user_id_consistency.rename(columns={'user_id': 'User ID', 'delta_days': 'Median Delta Days'}, inplace=True)
print(user_id_consistency)
User ID Median Delta Days 792 GB 458.9090 1576 TX 444.9910 1842 _2 423.2820 2439 bEO 411.3305 4197 bzm 409.2480 ... ... ... 24529 gVkB 0.0000 59174 iQCu 0.0000 45411 i1sv 0.0000 17451 fefQ 0.0000 17571 fg7z 0.0000 [79694 rows x 2 columns]
- dl.groupby('user_id')['delta_days'].median().reset_index() computes the median of delta_days for each user_id.
Setting a Threshold for Consistency¶
user_id_consistency['Median Delta Days'].describe()
count 79694.000000 mean 12.856487 std 34.171674 min 0.000000 25% 0.039000 50% 2.026000 75% 8.973000 max 458.909000 Name: Median Delta Days, dtype: float64
quantile_90 = user_id_consistency['Median Delta Days'].quantile(0.90)
# Display the 90th quantile
print(f"The 90th percentile (quantile) for Median Delta Days is: {quantile_90:.2f}")
The 90th percentile (quantile) for Median Delta Days is: 30.13
- The .describe() output shows how the values of Median Delta Days are distributed across the data.
- Because the distribution is heavily skewed between the 75th percentile and the maximum, the 90th percentile is a better cut-off for identifying less consistent users.
- A user with a Median Delta Days greater than 30 (roughly more than a month between sessions) is therefore considered less consistent.
- Consistency is considered moderate when Median Delta Days falls between 8 and 30, based on the 75th percentile.
- A user is considered most consistent when Median Delta Days is less than 8.
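A minimal sketch of how the three bands described above could be encoded, assuming the user_id_consistency DataFrame built earlier; the band labels are illustrative names:
import pandas as pd

# Cut-offs of 8 and 30 days follow the 75th/90th percentile reasoning above
bins = [-0.001, 8, 30, float('inf')]
labels = ['Most consistent', 'Moderately consistent', 'Less consistent']
consistency_band = pd.cut(user_id_consistency['Median Delta Days'], bins=bins, labels=labels)
print(consistency_band.value_counts())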
Less_consistent_user¶
# Filter rows where 'Median Delta Days' is greater than 30
Less_consistent_user = user_id_consistency[user_id_consistency['Median Delta Days'] > 30]
# Display the filtered DataFrame
print(Less_consistent_user)
User ID Median Delta Days 792 GB 458.9090 1576 TX 444.9910 1842 _2 423.2820 2439 bEO 411.3305 4197 bzm 409.2480 ... ... ... 39727 hdRb 30.0040 13056 f-hA 30.0040 11184 eT8k 30.0030 18695 ftXa 30.0020 41123 hkY5 30.0010 [8025 rows x 2 columns]
- user_id_consistency[user_id_consistency['Median Delta Days'] > 30] filters the user_id_consistency DataFrame to keep only rows where 'Median Delta Days' is greater than 30.
Most_consistent_user¶
# Filter rows where 'Median Delta Days' is less than 8
most_consistent_user = user_id_consistency[user_id_consistency['Median Delta Days'] < 8]
# Display the filtered DataFrame
print(most_consistent_user)
User ID Median Delta Days 47936 iAJZ 7.9995 52441 iI4W 7.9990 38545 hZT8 7.9990 51604 iGoN 7.9990 6511 d2OP 7.9990 ... ... ... 24529 gVkB 0.0000 59174 iQCu 0.0000 45411 i1sv 0.0000 17451 fefQ 0.0000 17571 fg7z 0.0000 [58361 rows x 2 columns]
- user_id_consistency[user_id_consistency['Median Delta Days'] < 8] filters the user_id_consistency DataFrame to keep only rows where 'Median Delta Days' is less than 8.
Less Consistent Users' Data¶
# Step 1: Get the user_ids from Less_consistent_user
less_consistent_user_ids = Less_consistent_user['User ID']
# Step 2: Filter the original dataset (dl) for those user_ids
filtered_dl_l = dl[dl['user_id'].isin(less_consistent_user_ids)]
# Step 3: Group by 'user_id' and calculate required statistics
less_consistent_user_language_stats = (
filtered_dl_l.groupby(['user_id', 'learning_language_Abb',])[['delta_days','p_recall', 'success_rate_history']]
.median()
.reset_index()
.rename(columns={'p_recall': 'Median p_recall', 'delta_days': 'Median_delta_days','success_rate_history': 'Median Success Rate'})
)
less_consistent_user_language_stats = less_consistent_user_language_stats.sort_values(by = 'Median_delta_days', ascending = False)
# Display the resulting DataFrame
print(less_consistent_user_language_stats)
user_id learning_language_Abb Median_delta_days Median p_recall \ 355 GB German 458.9090 1.000000 699 TX Spanish 444.9910 0.666667 805 _2 Spanish 423.2820 1.000000 1033 bEO Spanish 411.3305 1.000000 1753 bzm Spanish 409.2480 1.000000 ... ... ... ... ... 3541 e1x_ French 0.0020 1.000000 3673 eApr Spanish 0.0020 1.000000 2829 dK31 English 0.0020 1.000000 4751 f8Bu Portuguese 0.0020 0.750000 2851 dLoW Portuguese 0.0020 0.900000 Median Success Rate 355 0.833333 699 0.666667 805 0.833333 1033 0.857143 1753 0.894444 ... ... 3541 0.800000 3673 1.000000 2829 1.000000 4751 0.857143 2851 1.000000 [8193 rows x 5 columns]
- The User ID column is extracted from the Less_consistent_user DataFrame and stored in less_consistent_user_ids; these IDs represent users with low consistency (high median delta days).
- dl[dl['user_id'].isin(less_consistent_user_ids)] filters the original dl dataset to only the rows whose user_id appears in less_consistent_user_ids; the result is stored in filtered_dl_l.
- filtered_dl_l.groupby() groups the filtered data by user_id and learning_language_Abb, then calculates the median of delta_days, p_recall, and success_rate_history for each group.
- .rename() gives the columns clearer labels, and .sort_values() sorts the DataFrame by Median_delta_days in descending order, so the least consistent users appear at the top.
Most Consistent Users' Data¶
# Step 1: Filter most consistent user_ids
most_consistent_user_ids = most_consistent_user['User ID']
# Step 2: Filter the original dataset (dl) for these user_ids
filtered_most_consistent_dl = dl[dl['user_id'].isin(most_consistent_user_ids)]
# Step 3: Group by 'user_id' and calculate required statistics
most_consistent_user_stats = (
filtered_most_consistent_dl.groupby(['user_id'])
.agg({
'delta_days': 'median',
'p_recall': 'median',
'success_rate_history': 'median'
})
.reset_index()
.rename(columns={
'delta_days': 'Median Delta Days',
'p_recall': 'Median p_recall',
'success_rate_history': 'Median Success Rate'
})
)
most_consistent_user_stats = most_consistent_user_stats.sort_values(by = 'Median Delta Days', ascending = False)
# Step 4: Merge with the language information
most_consistent_user_stats = most_consistent_user_stats.merge(
dl[['user_id', 'learning_language_Abb']].drop_duplicates(),
on='user_id',
how='left'
)
# Display the resulting DataFrame
print(most_consistent_user_stats)
user_id Median Delta Days Median p_recall Median Success Rate \ 0 iAJZ 7.9995 1.000000 0.875000 1 iGoN 7.9990 1.000000 1.000000 2 iI4W 7.9990 1.000000 0.750000 3 d2OP 7.9990 1.000000 1.000000 4 hZT8 7.9990 1.000000 1.000000 ... ... ... ... ... 59968 d52D 0.0000 0.833333 0.833333 59969 i5iF 0.0000 0.708333 0.708333 59970 iDPq 0.0000 1.000000 1.000000 59971 i5BG 0.0000 1.000000 0.878788 59972 iQCu 0.0000 1.000000 1.000000 learning_language_Abb 0 English 1 English 2 German 3 Spanish 4 Portuguese ... ... 59968 English 59969 Spanish 59970 Spanish 59971 Spanish 59972 English [59973 rows x 5 columns]
- The User ID column is extracted from the most_consistent_user DataFrame and stored in most_consistent_user_ids; these IDs represent users with high consistency (low median delta days).
- dl[dl['user_id'].isin(most_consistent_user_ids)] filters the original dl dataset to only the rows whose user_id appears in most_consistent_user_ids; the result is stored in filtered_most_consistent_dl.
- filtered_most_consistent_dl.groupby() groups the filtered data by user_id, then calculates the median of delta_days, p_recall, and success_rate_history for each user; the learning language is merged in afterwards from dl.
- .rename() gives the columns clearer labels, and .sort_values() sorts the DataFrame by 'Median Delta Days' in descending order.
Analysis¶
1. Overview of User Consistency¶
The Median Delta Days metric reveals how consistently users engage with the platform.
- Overall Distribution:
- Mean: 12.86 days
- Median: 2.03 days
- 90th Percentile: 30.13 days
- Max: 458.91 days
- A significant proportion of users (~75%) have a Median Delta Days of less than 9 days, indicating regular engagement.
- Overall Distribution:
Users were split into:
- Less Consistent Users: Median Delta Days > 30 (e.g., User ID GB with 458.91 days).
- Most Consistent Users: Median Delta Days < 8 (e.g., User ID iAJZ with 7.99 days).
2. Less Consistent Users¶
- Language Engagement:
- Top inconsistent users primarily engage with German and Spanish.
- Median Recall Rate: Ranges from 0.67 to 1.00.
- Median Success Rate: Some users show low success rates (~0.66), indicating challenges in retention or understanding.
- Notable Outliers:
- Users like GB (German) and TX (Spanish) show very high delta days (>400 days), yet they maintain high recall rates (~1.0). This suggests sporadic but focused engagement.
3. Most Consistent Users¶
- Language Engagement:
- These users are consistent in practicing languages like English, German, and Spanish.
- Median Recall Rate: Consistently high (mostly 1.00).
- Median Success Rate: Majority maintain excellent success rates (~0.75–1.00).
- Insights:
- Consistent users display better retention and recall compared to less consistent users.
- The consistency in engagement likely correlates with high success and recall rates, reflecting effective learning habits.
Key Insights¶
Consistency Drives Performance:
- Users with regular engagement (lower delta days) consistently outperform less consistent users in recall and success rates.
Language-Specific Challenges:
- Inconsistent users engaging in languages like Spanish and German often exhibit lower success rates, indicating possible difficulties with these languages.
Outliers Among Inconsistent Users:
- Some inconsistent users maintain high recall rates despite long gaps between sessions, suggesting a preference for intensive, focused learning.
Median Delta Days Threshold:
- A Median Delta Days of 30 serves as a threshold for identifying inconsistent users, who may need targeted interventions.
Recommendations¶
Encourage Regular Engagement:
- Design reminders or streak incentives to prompt users with high delta days to engage more frequently.
- Offer micro-lessons or bite-sized activities for users who find it hard to commit regularly.
Target Support for Inconsistent Users:
- Focus on improving success rates for inconsistent users of languages like Spanish and German through:
- Personalized exercises.
- Gamified learning strategies to increase motivation.
- Focus on improving success rates for inconsistent users of languages like Spanish and German through:
Reward Consistency:
- Recognize and reward consistent users to reinforce positive habits. Introduce badges, leaderboards, or personalized achievements.
Analyze High Recall Outliers:
- Study inconsistent users with high recall rates to identify factors contributing to their performance. Insights can guide strategies for other inconsistent learners.
Data-Driven Personalization:
- Use the consistency data to dynamically adjust lesson difficulty and suggest tailored practice schedules for both consistent and inconsistent users.
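A rule-based sketch of what such personalization could look like, reusing the Median Delta Days thresholds from this hypothesis; the suggested actions themselves are assumptions for demonstration, not values present in the data:
def suggested_action(median_delta_days):
    # Map the consistency thresholds (8 and 30 days) to an illustrative nudge
    if median_delta_days > 30:
        return 'Re-engagement campaign with micro-lessons'
    if median_delta_days >= 8:
        return 'Streak reminders two to three times per week'
    return 'Maintain current schedule and surface harder lexemes'

nudge_plan = user_id_consistency.assign(
    Suggested_Action=user_id_consistency['Median Delta Days'].apply(suggested_action)
)
print(nudge_plan.head())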
Hypothesis 6 :- More frequent learning sessions (shorter delta_days) lead to better recall and higher success rates, while long breaks between sessions can negatively affect learning outcomes.¶
Counting number of members in each language in less consistent users¶
less_consistent_user_language_stats['learning_language_Abb'].value_counts()
English 2566 Spanish 2278 French 1573 German 1388 Italian 219 Portuguese 169 Name: learning_language_Abb, dtype: int64
- .value_counts() returns the count of how many times each language abbreviation appears in the learning_language_Abb column for the users identified as "less consistent".
Counting number of members in each language for more consistent users¶
most_consistent_user_stats['learning_language_Abb'].value_counts()
English 23055 Spanish 15740 French 9218 German 6905 Italian 3631 Portuguese 1424 Name: learning_language_Abb, dtype: int64
- .value_counts() returns the count of how many times each language abbreviation appears in the learning_language_Abb column for the users identified as "more consistent".
Correlation of Median Delta Days with p_recall and Success Rate for Less Consistent Users¶
# Calculate the correlation between 'Median_delta_days' and 'Median p_recall'
correlation_p_recall = less_consistent_user_language_stats['Median_delta_days'].corr(
less_consistent_user_language_stats['Median p_recall']
)
# Calculate the correlation between 'Median_delta_days' and 'Median Success Rate'
correlation_success_rate = less_consistent_user_language_stats['Median_delta_days'].corr(
less_consistent_user_language_stats['Median Success Rate']
)
# Print the results
print(f"Correlation between Median_delta_days and Median p_recall: {correlation_p_recall:.2f}")
print(f"Correlation between Median_delta_days and Median Success Rate: {correlation_success_rate:.2f}")
Correlation between Median_delta_days and Median p_recall: -0.05 Correlation between Median_delta_days and Median Success Rate: 0.04
- The .corr() method calculates the Pearson correlation coefficient between Median_delta_days (a measure of engagement consistency) and each performance metric, Median p_recall and Median Success Rate, for the less consistent users.
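Pearson only measures linear association; as a quick robustness check, one could also compute Spearman's rank correlation on the same columns (a sketch using the less_consistent_user_language_stats DataFrame built above):
spearman_p_recall = less_consistent_user_language_stats['Median_delta_days'].corr(
    less_consistent_user_language_stats['Median p_recall'], method='spearman'
)
print(f"Spearman correlation (Median_delta_days vs Median p_recall): {spearman_p_recall:.2f}")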
Correlation of Median Delta Days with p_recall and Success Rate for Most Consistent Users¶
# Correlation between 'Median_delta_days' and 'Median p_recall'
correlation_p_recall_most = most_consistent_user_stats['Median Delta Days'].corr(
most_consistent_user_stats['Median p_recall']
)
# Correlation between 'Median_delta_days' and 'Median Success Rate'
correlation_success_rate_most = most_consistent_user_stats['Median Delta Days'].corr(
most_consistent_user_stats['Median Success Rate']
)
# Print the results
print(f"Correlation between Median_delta_days and Median p_recall (Most Consistent): {correlation_p_recall_most:.2f}")
print(f"Correlation between Median_delta_days and Median Success Rate (Most Consistent): {correlation_success_rate_most:.2f}")
Correlation between Median_delta_days and Median p_recall (Most Consistent): -0.03 Correlation between Median_delta_days and Median Success Rate (Most Consistent): 0.02
- .corr() computes the same correlations for the most consistent users: Median Delta Days against Median p_recall and Median Success Rate.
Analysis of Language Engagement and Correlations¶
1. Learning Language Distribution¶
Less Consistent Users:
- English: 2566 learners (~31.5% of less consistent group).
- Spanish: 2278 learners (~28%).
- French: 1573 learners (~19%).
- German: 1388 learners (~17%).
- Italian: 219 learners (~2.7%).
- Portuguese: 169 learners (~2%).
Most Consistent Users:
- English: 23055 learners (~33% of total).
- Spanish: 15740 learners (~22.5%).
- French: 9218 learners (~13%).
- German: 6905 learners (~10%).
- Italian: 3631 learners (~5%).
- Portuguese: 1424 learners (~2%).
2. Correlations¶
For Less Consistent Users:
- Median Delta Days and Median p_recall: -0.05 (weak negative correlation).
- Median Delta Days and Median Success Rate: 0.04 (weak positive correlation).
For Most Consistent Users:
- Median Delta Days and Median p_recall: -0.03 (very weak negative correlation).
- Median Delta Days and Median Success Rate: 0.02 (very weak positive correlation).
Key Insights¶
Distribution Imbalance:
- English and Spanish dominate across both consistent and less consistent user groups, but less consistent users have a slightly higher proportion studying German and Spanish.
- Italian and Portuguese have a smaller share of learners, with Italian learners being more consistent overall.
Weak Correlations:
- Minimal relationship between how frequently users interact (Median Delta Days) and their performance metrics (Median p_recall or Median Success Rate). This suggests that other factors, such as study habits or course difficulty, might play a larger role in determining success.
Language-Specific Trends:
- French and German learners are slightly overrepresented among less consistent users. These languages may have content or structure that makes it harder to maintain engagement.
Recommendations¶
Focus on English and Spanish:
- Invest in personalized reminders and motivational tools for less consistent English and Spanish learners to encourage consistent engagement.
- Introduce goal-setting features tailored to these languages, such as streak rewards.
Enhance French and German Content:
- Investigate potential challenges with French and German courses, such as lesson complexity or perceived difficulty.
- Provide step-by-step learning paths or extra practice materials for commonly challenging topics.
Leverage Consistency in Italian and Portuguese:
- Promote the high success and consistency rates for Italian and Portuguese to attract new learners.
- Offer cultural immersion content (e.g., travel-related lessons, conversational phrases).
Analyze External Factors:
- Conduct user surveys or focus groups to identify external factors influencing engagement and performance (e.g., motivation, user interface, content quality).
By addressing these areas, the platform can improve learner engagement, especially for less consistent users, while maintaining high retention rates for the most consistent groups.
Hypothesis 7 :- Users who have learned a higher number of unique lexemes are likely to show greater engagement and proficiency in the language, indicating a stronger commitment to the learning process.¶
Unique Lexemes Learned for each Users¶
# Step 1: Group by 'user_id' and count unique 'lexeme_id' learned
user_lexeme_counts = (
dl.groupby(['user_id','learning_language_Abb'])['lexeme_id']
.nunique()
.reset_index()
.rename(columns={'lexeme_id': 'Unique Lexemes Learned'})
)
# Step 2: Sort by the count of unique lexemes in descending order
user_lexeme_counts = user_lexeme_counts.sort_values(by='Unique Lexemes Learned', ascending=False)
# Display the resulting DataFrame
print(user_lexeme_counts)
user_id learning_language_Abb Unique Lexemes Learned 81262 tJs Spanish 941 8867 dlpG German 899 13039 erXf English 867 25901 gZJc English 862 20467 g2Ev English 848 ... ... ... ... 40815 hdFN German 1 61977 iRVR French 1 46195 i17y French 1 75951 ipO1 English 1 48151 i4P7 French 1 [81711 rows x 3 columns]
- dl.groupby(['user_id','learning_language_Abb'])['lexeme_id'].nunique() counts the unique lexeme_id values for each combination of user_id and learning_language_Abb, and .rename() gives the count column a clearer name.
- user_lexeme_counts.sort_values(by='Unique Lexemes Learned', ascending=False) sorts the DataFrame by the unique lexeme count in descending order.
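One way to probe this hypothesis further is to relate the number of unique lexemes a user has learned to that user's typical success rate. A minimal sketch using the DataFrames built in this notebook (the user_success and lexemes_vs_success names are introduced here for illustration):
# Median success rate per user and learning language
user_success = (
    dl.groupby(['user_id', 'learning_language_Abb'])['success_rate_history']
    .median()
    .reset_index()
    .rename(columns={'success_rate_history': 'Median Success Rate'})
)
# Join onto the unique-lexeme counts and correlate the two quantities
lexemes_vs_success = user_lexeme_counts.merge(
    user_success, on=['user_id', 'learning_language_Abb'], how='left'
)
corr = lexemes_vs_success['Unique Lexemes Learned'].corr(lexemes_vs_success['Median Success Rate'])
print(f"Correlation between unique lexemes learned and median success rate: {corr:.2f}")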
Analysis of Lexeme Learning Across Users¶
1. Key Findings¶
Top Performers:
- The user
tJs
studying Spanish learned the most unique lexemes (941), followed bydlpG
in German with 899 anderXf
in English with 867. - The top learners predominantly study Spanish, German, and English.
- The user
Lower Engagement:
- A significant portion of users have learned only one unique lexeme across various languages, indicating very low engagement or limited initial interaction.
Language Insights:
- Users learning English dominate the top rankings, followed by Spanish and German, reflecting their popularity or user preference on the platform.
- Users in less popular languages (e.g., French) generally have fewer high performers compared to English and Spanish learners.
Key Insights¶
Engagement Gap:
- The wide range in unique lexemes learned (from 941 to 1) highlights a gap between highly engaged users and those who disengage early.
Language Popularity and Complexity:
- Spanish and English lead in lexeme acquisition, which might be due to user familiarity, the structure of the language courses, or motivation.
- German learners are relatively high-performing but could face challenges in lexical complexity, making their achievements notable.
Low Performers:
- The significant number of users with minimal lexeme acquisition suggests an opportunity to improve initial user retention through engaging onboarding experiences or targeted support.
Recommendations¶
Enhance Early Engagement:
- Introduce gamified incentives for new learners (e.g., badges for learning 10+ lexemes in the first week).
- Simplify onboarding lessons with immediate rewards for completing early milestones.
Support High Performers:
- Offer advanced modules or personalized challenges for users like
tJs
,dlpG
, anderXf
to keep them engaged at higher levels. - Use top learners as case studies or ambassadors to inspire others.
- Offer advanced modules or personalized challenges for users like
Target Low Performers:
- Identify users with minimal lexeme counts and provide personalized follow-ups, such as reminders or suggestions for easier, beginner-friendly lessons.
- Deploy exit surveys or feedback requests to understand why users disengage early.
Language-Specific Enhancements:
- For Spanish and English learners, emphasize conversational and practical vocabulary to capitalize on interest.
- For German and French learners, focus on breaking down lexical complexity and providing mnemonic aids to simplify learning.
Data-Driven Content Iteration:
- Analyze content patterns in languages with both high and low lexeme counts to refine lesson structure and difficulty progression.
By addressing these insights, the platform can foster balanced engagement across different learner profiles while improving retention and satisfaction.
Hypothesis 8 :- Grammar tag categorization has a significant impact on both success rate history and session recall rate, with certain grammar tags showing higher success and recall rates than others.¶
Grouped by grammar_tag on success_rate_history with mean and median¶
grammar_stats_history = dl.groupby('grammar_tag')['success_rate_history'].agg(['mean', 'median']).round(2)
grammar_stats_history = grammar_stats_history.reset_index()
grammar_stats_history
grammar_tag | mean | median | |
---|---|---|---|
0 | Adverbs | 0.90 | 0.98 |
1 | Conjunctions | 0.88 | 0.91 |
2 | Determiners and Adjectives | 0.90 | 0.94 |
3 | Interjections | 0.95 | 1.00 |
4 | Nan | 0.78 | 1.00 |
5 | Negation | 0.89 | 1.00 |
6 | Nouns | 0.91 | 1.00 |
7 | Numbers and Quantifiers | 1.00 | 1.00 |
8 | Other | 0.89 | 1.00 |
9 | Pronouns and Related | 0.89 | 0.92 |
10 | Verbs | 0.90 | 0.95 |
- The resulting DataFrame, grammar_stats_history, summarizes the mean and median success rate for each grammar tag.
Grouped by grammar_tag on p_recall with mean and median¶
grammar_stats_session = dl.groupby('grammar_tag')['p_recall'].agg(['mean', 'median']).round(2)
grammar_stats_session = grammar_stats_session.reset_index()
grammar_stats_session
grammar_tag | mean | median | |
---|---|---|---|
0 | Adverbs | 0.89 | 1.0 |
1 | Conjunctions | 0.87 | 1.0 |
2 | Determiners and Adjectives | 0.89 | 1.0 |
3 | Interjections | 0.95 | 1.0 |
4 | Nan | 0.94 | 1.0 |
5 | Negation | 0.89 | 1.0 |
6 | Nouns | 0.90 | 1.0 |
7 | Numbers and Quantifiers | 0.50 | 0.5 |
8 | Other | 0.90 | 1.0 |
9 | Pronouns and Related | 0.88 | 1.0 |
10 | Verbs | 0.89 | 1.0 |
- The resulting DataFrame, grammar_stats_session, summarizes the mean and median recall rate (p_recall) for each grammar tag.
Deriving p_recall by grammar_tag and plurality_tag¶
# Compare recall rate and success rate across grammatical categories
tag_diff_comparison = dl.groupby(['grammar_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median']).round(2)
tag_diff_comparison
mean | median | ||
---|---|---|---|
grammar_tag | plurality_tag | ||
Determiners and Adjectives | Plural | 0.88 | 1.0 |
Singular | 0.89 | 1.0 | |
Nouns | Plural | 0.90 | 1.0 |
Singular | 0.90 | 1.0 | |
Pronouns and Related | Plural | 0.89 | 1.0 |
Singular | 0.90 | 1.0 | |
Verbs | Plural | 0.89 | 1.0 |
Singular | 0.90 | 1.0 |
- dl.groupby(['grammar_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median']).round(2) aggregates the mean and median of p_recall for each grammar_tag, broken down by plurality_tag.
Analysis of Grammar-Tag-Specific Performance¶
1. Success Rate Trends by Grammar Tag¶
Top Performing Tags:
- Numbers and Quantifiers: Achieve the highest mean (1.00) and median (1.00) success rates, indicating consistency and ease in mastering this category.
- Interjections: Similarly, with a mean (0.95) and median (1.00), these appear straightforward for users.
- Nouns: High mean (0.91) and perfect median (1.00) success rates indicate strong user retention in this foundational area.
Low Performing Tags:
- Conjunctions: With a mean success rate of 0.88, learners face challenges in this grammatical category.
- Nan (Undefined): The mean (0.78) suggests issues with ambiguous or improperly tagged content, despite a perfect median (1.00).
2. Recall Rate Trends by Grammar Tag¶
Top Performing Tags:
- Interjections: Exhibit high recall rates with a mean (0.95) and median (1.00).
- Nouns: Maintain consistent performance, with a mean (0.90) and median (1.00), reflecting their foundational role in language structure.
Low Performing Tags:
- Numbers and Quantifiers: Recall rate struggles, with a mean (0.50) and median (0.50), likely due to numerical complexity or irregularity in lesson structure.
- Conjunctions and Pronouns: Mean values of 0.87 and 0.88 highlight slight challenges.
3. Grammar Tag Performance by Plurality¶
- Singular vs. Plural:
- Across Determiners and Adjectives, Nouns, Pronouns, and Verbs, singular forms exhibit slightly higher recall rates (mean 0.90) compared to plural forms (mean 0.89).
- This suggests that learners are more comfortable with singular constructs, likely due to their prevalence in basic language training.
Key Insights¶
Consistency in Key Categories:
- Grammar tags like Nouns, Interjections, and Numbers are performing well in both success and recall metrics.
- Numbers and Quantifiers, despite a perfect success rate, show a recall rate gap, highlighting a possible discrepancy between practice and retention.
Ambiguous or Undefined Content (Nan):
- Poor mean success rates for the Nan category suggest potential tagging errors or unclear instructional design.
Plurality Challenges:
- Across multiple grammar tags, plural constructs slightly underperform compared to singular constructs, indicating a learning curve in pluralization rules.
Specific Challenges:
- Conjunctions and Pronouns lag in both recall and success rates, hinting at their inherent complexity or insufficient emphasis in lessons.
Recommendations¶
Targeted Improvements for Low-Performing Categories:
- Conjunctions and Pronouns: Introduce focused lessons with simplified rules, visual aids, and relatable examples to enhance comprehension.
- Numbers and Quantifiers: Revise lesson structure to emphasize repetition, mnemonics, and contextual applications for numerical recall.
Optimize Undefined (Nan) Content:
- Audit and refine ambiguous or poorly defined grammar tags to ensure clarity and better instructional value.
Address Pluralization Challenges:
- Include explicit lessons on pluralization rules, especially for Determiners and Adjectives, Pronouns, and Verbs. Practice-based approaches (e.g., fill-in-the-blank exercises) can help learners bridge the gap.
Leverage Strengths:
- Build on successful categories like Nouns and Interjections by integrating them into more complex exercises to maintain engagement.
Data-Driven Curriculum Adjustments:
- Use metrics like recall and success rates to refine lesson plans, prioritizing low-performing areas while maintaining the strengths of well-performing tags.
By addressing these points, the platform can create a more balanced and effective learning experience, improving both user retention and mastery of grammatical nuances.
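As one concrete input to such data-driven adjustments, a sketch that quantifies the practice-versus-retention gap noted above (historical success minus session recall) per grammar tag, reusing the grammar_stats_history and grammar_stats_session DataFrames:
grammar_gap = grammar_stats_history.merge(
    grammar_stats_session, on='grammar_tag', suffixes=('_success', '_recall')
)
# Positive gap: learners succeeded historically but recall less in the session
grammar_gap['mean_gap'] = grammar_gap['mean_success'] - grammar_gap['mean_recall']
print(grammar_gap[['grammar_tag', 'mean_success', 'mean_recall', 'mean_gap']]
      .sort_values('mean_gap', ascending=False))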
Hypothesis 9 :- The popularity of a language has a significant impact on the number of unique lexeme IDs in that language.¶
Defining the Function value_counts¶
def value_counts(column_name):
value_counts = dl[column_name].value_counts()
value_counts_dl = pd.DataFrame(value_counts)
return value_counts_dl
- value_counts(column_name) computes the number of occurrences of each unique value in the specified column of the DataFrame dl.
value_counts('learning_language_Abb')
learning_language_Abb | |
---|---|
English | 1479926 |
Spanish | 1007678 |
French | 552704 |
German | 425433 |
Italian | 237961 |
Portuguese | 92056 |
- The function value_counts(column_name) is applied to the learning_language_Abb column.
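The learner counts and unique lexeme figures used in the scatter plot below are hard-coded; a minimal sketch of how the same quantities could be derived directly from dl (the learner_records and unique_lexemes column names are introduced here for illustration):
language_summary = (
    dl.groupby('learning_language_Abb')
    .agg(learner_records=('user_id', 'size'),        # number of practice records
         unique_lexemes=('lexeme_id', 'nunique'))    # distinct lexemes per language
    .sort_values('learner_records', ascending=False)
)
print(language_summary)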
Employing Scatter plot for Count of Learning Language vs Unique Lexeme words¶
# Data
learning_language_Abb = ['English', 'Spanish', 'French', 'German', 'Italian', 'Portuguese']
count = [1479926, 1007678, 552704, 425433, 237961, 92056] # Count of learning_language_full
unique_lexeme_ids = [2740, 3052, 3429, 3218, 1750, 2055] # Unique Lexeme IDs
colors = ['blue', 'orange', 'green', 'red', 'purple', 'brown'] # Different colors for languages
# Normalize counts for the x-axis
scaled_count = [x / 100000 for x in count]
# Create a figure
plt.figure(figsize=(12, 8))
# Scatter plot with unique colors for each language
plt.scatter(scaled_count, unique_lexeme_ids, color=colors, s=200, alpha=0.8, edgecolors='black')
# Annotate points with language names
for i, lang in enumerate(learning_language_Abb):
plt.text(scaled_count[i], unique_lexeme_ids[i] + 50, lang, fontsize=10, ha='center')
# Add titles and labels
plt.title('Learning Language Count (Normalized) vs. Unique Lexeme IDs', fontsize=16)
plt.xlabel('Normalized Count of Learning Language Learners', fontsize=14)
plt.ylabel('Unique Lexeme IDs', fontsize=14)
plt.grid(alpha=0.5)
# Show the plot
plt.tight_layout()
plt.show()
- The code visualizes the relationship between the count of language learners and the number of unique lexemes for each language. Each point represents a language, annotated with its name, and uses distinct colors for clear differentiation.
Analysis¶
Correlation Analysis:
- The scatter plot allows us to visualize relationships between the normalized count of learners and the number of unique lexeme IDs across different languages.
- Strong Positive Observations:
- English and Spanish have the highest learner counts and correspondingly high unique lexeme IDs, indicating they attract a large learner base and have rich content diversity.
- French and German, despite lower learner counts, show high unique lexeme IDs, suggesting they offer substantial vocabulary depth.
- Weaker Relationships:
- Italian and Portuguese, with lower learner counts and unique lexeme IDs, indicate less engagement and possibly less extensive content compared to other languages.
Key Relationships Interpretation:
- Learner Engagement: English and Spanish's high learner counts suggest these languages are most popular, likely due to their global utility and demand.
- Content Richness: High unique lexeme IDs for French and German imply these languages offer a robust learning experience, potentially attracting learners seeking comprehensive knowledge.
- Underrepresented Languages: Italian and Portuguese lag behind in both metrics, highlighting opportunities to boost their content and learner engagement.
Recommendations¶
Enhance Content for Underrepresented Languages:
- Develop and promote more engaging content for Italian and Portuguese to attract more learners and increase the diversity of lexemes.
- Offer incentives and highlight cultural and practical benefits of learning these languages.
Leverage Popular Languages:
- Utilize the large learner base of English and Spanish to introduce advanced courses, interactive sessions, and community-building activities to maintain engagement.
- Expand marketing efforts to capitalize on their popularity.
Invest in Advanced Courses for Rich Content Languages:
- Given the high lexeme diversity in French and German, focus on developing advanced and specialized courses to cater to learners looking for in-depth knowledge.
- Highlight the unique features and advanced content available in these languages.
By implementing these strategies, you can balance the learning experience across all languages, cater to diverse learner needs, and potentially increase engagement across the board.
Hypothesis 10:- Language characteristics or learner preferences may influence performance metrics across learning languages.¶
Aggregating Mean and Median on success_rate_history for learning_language_Abb¶
language_stats_history_success_rate = dl.groupby('learning_language_Abb')['success_rate_history'].agg(['mean', 'median']).round(2)
language_stats_history_success_rate = language_stats_history_success_rate.reset_index()
language_stats_history_success_rate
learning_language_Abb | mean | median | |
---|---|---|---|
0 | English | 0.90 | 0.95 |
1 | French | 0.89 | 0.94 |
2 | German | 0.90 | 1.00 |
3 | Italian | 0.90 | 1.00 |
4 | Portuguese | 0.91 | 1.00 |
5 | Spanish | 0.90 | 1.00 |
- The code computes the mean and median of the success_rate_history column, grouped by each learning language in the learning_language_Abb column.
- This shows how learners of different languages perform on average and at the median, and it produces a clean DataFrame for further analysis or visualization.
Aggregating Mean and Median on p_recall for learning_language_Abb¶
language_stats_recall_rate = dl.groupby('learning_language_Abb')['p_recall'].agg(['mean', 'median']).round(2)
language_stats_recall_rate = language_stats_recall_rate.reset_index()
language_stats_recall_rate
learning_language_Abb | mean | median | |
---|---|---|---|
0 | English | 0.90 | 1.0 |
1 | French | 0.88 | 1.0 |
2 | German | 0.89 | 1.0 |
3 | Italian | 0.91 | 1.0 |
4 | Portuguese | 0.90 | 1.0 |
5 | Spanish | 0.90 | 1.0 |
- The code computes the mean and median of the p_recall column, grouped by each learning language in the learning_language_Abb column.
- This shows how learners of different languages recall content on average and at the median, and it produces a clean DataFrame for further analysis or visualization.
Merging the above datasets for Further Analysis¶
# Merge the two datasets
merged_data = pd.merge(language_stats_history_success_rate,
language_stats_recall_rate,
on='learning_language_Abb',
suffixes=('_success_rate', '_recall_rate'))
# Restructure the data to long format for easier plotting
long_data = pd.melt(
merged_data,
id_vars=['learning_language_Abb'],
value_vars=['mean_success_rate', 'median_success_rate', 'mean_recall_rate', 'median_recall_rate'],
var_name='Metric',
value_name='Value'
)
# Define a mapping for better labels
metric_labels = {
'mean_success_rate': 'Mean Success Rate',
'median_success_rate': 'Median Success Rate',
'mean_recall_rate': 'Mean Recall Rate',
'median_recall_rate': 'Median Recall Rate'
}
long_data['Metric'] = long_data['Metric'].map(metric_labels)
- pd.merge() combines the two datasets that contain different performance metrics (success rate and recall rate) for each learning language.
- pd.melt() reshapes the merged data into a long format that is easier to plot and analyze.
- .map() then replaces the raw metric column names with more readable labels in the long_data DataFrame.
Deriving a Line Chart for success_rate and p_recall (mean and median)¶
# Plot the data
plt.figure(figsize=(12, 6))
for metric in long_data['Metric'].unique():
subset = long_data[long_data['Metric'] == metric]
plt.plot(subset['learning_language_Abb'], subset['Value'], label=metric)
# Customize the chart
plt.title('Mean and Median for Success Rate and Recall Rate by Learning Language', fontsize=14)
plt.xlabel('Learning Language', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(title='Metrics', fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
# Show the chart
plt.show()
- The for loop iterates over each metric (e.g., Mean Success Rate, Median Success Rate).
- subset = long_data[long_data['Metric'] == metric] creates, for each metric, a subset of the long_data DataFrame containing only the rows that match the current metric.
- Each subset is then plotted as its own line on the chart.
Analysis¶
Overall Performance:
Mean Success Rate: Most languages have a mean success rate around 0.90, with slight variations. Portuguese stands out with a mean of 0.91.
Median Success Rate: For several languages (German, Italian, Portuguese, Spanish), the median success rate is 1.00, indicating that at least half of the learners achieved a perfect success rate in their historical data.
Mean Recall Rate: The mean recall rate is also around 0.90 for most languages, with slight variations. Italian leads with 0.91.
Median Recall Rate: The median recall rate is 1.00 for all languages, implying that at least half of the learners achieved a perfect recall rate.
Consistency and Outliers:
- The median values being 1.00 for many languages suggest high consistency and potentially a skewed distribution towards high performance.
- Mean values provide a more nuanced view, showing slight differences between languages that the median values do not capture.
Language-Specific Trends:
- Portuguese and Italian: These languages have the highest mean success and recall rates, indicating strong learner performance.
- French: While the median values are strong, French has a slightly lower mean success and recall rate compared to other languages, suggesting some variability in learner performance.
Recommendations¶
Leverage High Performance in Italian and Portuguese:
- Promote these languages more aggressively, showcasing the high success and recall rates to attract new learners.
- Use the high performance as a case study to develop best practices and strategies for other languages.
Address Variability in French:
- Investigate the factors contributing to the slightly lower mean success and recall rates in French.
- Provide targeted support and resources for learners struggling with French to enhance overall performance.
Enhance Learning Experience for All Languages:
- Maintain consistency in teaching methods and materials to ensure learners continue to achieve high success and recall rates.
- Use the insights from high-performing languages to improve the content and engagement strategies for all languages.
By focusing on these recommendations, you can enhance learner performance across different languages, address any variability, and attract more learners to your platform.
Hypothesis 11 :- The distribution of learning and UI languages will reveal varying user preferences and engagement levels across different languages.¶
Chart count on learning_language and ui_language¶
def analyze_lan_distribution(dl, column_name, plot_color='skyblue'):
# Perform value counts
value_counts = dl[column_name].value_counts()
value_counts_dl = pd.DataFrame(value_counts, columns=['Count'])
value_counts_dl.index.name = column_name
# Print the DataFrame
print(f"\nDistribution of {column_name}:")
print(value_counts_dl)
# Plot the distribution
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color=plot_color)
plt.title(f'Distribution of {column_name}', fontsize=16)
plt.xlabel(column_name, fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.6)
plt.tight_layout()
plt.show()
# Return the DataFrame
return value_counts_dl
ui_language_count = analyze_lan_distribution(dl, 'learning_language_Abb', plot_color='yellow')
ui_language_count = analyze_lan_distribution(dl, 'ui_language_Abb', plot_color='orange')
Distribution of learning_language_Abb: Empty DataFrame Columns: [Count] Index: []
Distribution of ui_language_Abb: Empty DataFrame Columns: [Count] Index: []
- def analyze_lan_distribution(dl, column_name, plot_color='skyblue') defines a function that prints the value counts of the chosen column and visualizes them as a bar chart.
- The printed DataFrames appear empty because pd.DataFrame(value_counts, columns=['Count']) does not match the name of the value-counts Series; building the frame with value_counts.to_frame(name='Count') would print the counts correctly. The bar charts are plotted from value_counts directly, so the distributions are still displayed.
- analyze_lan_distribution(dl, 'learning_language_Abb', plot_color='yellow') produces the visualization for learning_language_Abb; the same is then done for ui_language_Abb.
Analysis¶
Distribution of Learning Languages (learning_language_Abb):
- The distribution reveals the count of learners for each language. From this, we can deduce which languages are most and least popular among users.
- Key Observations:
- English has the highest count, indicating it is the most popular learning language.
- Spanish, French, and German follow with substantial learner counts, while Italian and Portuguese have far fewer learners.
Distribution of UI Languages (ui_language_Abb):
- This distribution shows the count of users for each user interface language. It provides insights into user preferences and localization needs.
- Key Observations:
- English is the most used UI language, with a count exceeding 2 million.
- Spanish follows, with a count slightly above 1 million.
- Portuguese and Italian have significantly lower counts, with Portuguese being higher than Italian.
Recommendations¶
Localization and Language Support:
- Given the high counts of English and Spanish as UI languages, ensure these languages have robust and fully localized interfaces.
- Enhance support for Portuguese and Italian UI users, potentially increasing their counts by making these languages more accessible and user-friendly.
Content Development:
- Since English is the most popular learning language, focus on expanding and improving English learning materials to cater to the high demand.
- For other popular learning languages like Spanish, Portuguese, and Italian, develop engaging and comprehensive learning content to retain and attract more learners.
Marketing and Outreach:
- Utilize the data on UI language distribution to target marketing campaigns effectively. For instance, promoting the platform more aggressively in English and Spanish-speaking regions.
- Highlight the benefits of learning less popular languages to diversify user engagement and learner base.
By applying these recommendations, you can enhance user experience, cater to diverse language needs, and potentially increase user engagement across various languages.
Hypothesis 12 :- The process aims to examine how recall probabilities vary across different learning languages and return time categories, providing insights into patterns of retention and engagement.¶
Creating a Pivot Table¶
# Pivot table to calculate average p_recall
pivot_table = dl.pivot_table(
values='p_recall',
index='learning_language_Abb',
columns='delta_days_category',
aggfunc='mean'
)
# Display the pivot table
print(pivot_table)
learning_language_Abb | Less than a day | Over a month | Within a month | Within a week | Zero
---|---|---|---|---|---
English | 0.907246 | 0.874983 | 0.884448 | 0.892298 | 0.930387
French | 0.891713 | 0.848622 | 0.872736 | 0.878085 | 0.918171
German | 0.907395 | 0.853919 | 0.875697 | 0.886278 | 0.940054
Italian | 0.915333 | 0.887580 | 0.890401 | 0.899162 | 0.924242
Portuguese | 0.914628 | 0.869350 | 0.887903 | 0.894537 | 0.907361
Spanish | 0.909945 | 0.873059 | 0.886486 | 0.894120 | 0.926428
- dl.pivot_table() creates a pivot table of the average p_recall, with learning_language_Abb as the index and delta_days_category as the columns.
Create a heatmap of pivot_table¶
# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Heatmap of Recall Probabilities (p_recall) by Language and Time Categories')
plt.xlabel('Delta Days Category')
plt.ylabel('Learning Language')
plt.show()
- This code creates a heatmap to visually represent the recall probabilities (p_recall) based on the learning language and time (delta days) categories.
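A small follow-up sketch on the same pivot_table: the recall drop between immediate review ("Zero") and long gaps ("Over a month"), computed per language, which summarizes the decay the heatmap displays:
recall_decay = (pivot_table['Zero'] - pivot_table['Over a month']).sort_values(ascending=False)
print(recall_decay.round(3))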
Analysis:¶
Highest Recall Probabilities:
- The highest recall probabilities are observed in the "Zero" time category for all languages, with German having the highest value at 0.94.
- Italian also shows a high recall probability in the "Less than a day" category at 0.92.
Lowest Recall Probabilities:
- The lowest recall probabilities are generally found in the "Over a month" category, with French and German both having the lowest value at 0.85.
Consistency Across Time Categories:
- English, German, and Spanish show relatively consistent recall probabilities across different time categories, with values generally ranging between 0.85 and 0.94.
- French shows more variability, with a noticeable dip in the "Over a month" category.
Recommendations:¶
Focus on High Recall Time Categories:
- Since the "Zero" time category consistently shows the highest recall probabilities, it may be beneficial to focus on immediate recall techniques for language learning.
Address Low Recall Time Categories:
- Special attention should be given to the "Over a month" category, especially for French and German, to improve long-term retention strategies.
Tailored Learning Strategies:
- Different languages may benefit from tailored learning strategies. For example, French learners might need more frequent reviews to maintain high recall probabilities over longer periods.
Balanced Approach:
- A balanced approach that combines immediate recall techniques with periodic reviews could help maintain high recall probabilities across all time categories.
By analyzing the heatmap, educators and learners can better understand the effectiveness of different recall strategies and tailor their learning plans accordingly.
Hypothesis 13 :- Analysing how the time since the last learning session influences recall probabilities and success rates, expecting a potential decline in these metrics as the time gap increases.¶
Correlation between delta_days and p_recall¶
dl['delta_days'] = pd.to_numeric(dl['delta_days'], errors='coerce')
dl['p_recall'] = pd.to_numeric(dl['p_recall'], errors='coerce')
correlation = dl['delta_days'].corr(dl['p_recall'])
print(f"Correlation between delta_days and p_recall: {correlation:.2f}")
Correlation between delta_days and p_recall: -0.03
- pd.to_numeric() converts the delta_days and p_recall columns to numeric, coercing any non-numeric values to NaN, and .corr() then computes the correlation coefficient between them.
- The correlation value indicates how strongly the two variables are linearly related: close to 1 means a strong positive correlation, close to -1 a strong negative correlation, and close to 0 little to no linear relationship.
Correlation between delta_days and success_rate_history¶
dl['delta_days'] = pd.to_numeric(dl['delta_days'], errors='coerce')
dl['success_rate_history'] = pd.to_numeric(dl['success_rate_history'], errors='coerce')
correlation = dl['delta_days'].corr(dl['success_rate_history'])
print(f"Correlation between delta_days and success_rate_history: {correlation:.2f}")
Correlation between delta_days and success_rate_history: 0.02
- The same approach computes the correlation coefficient between delta_days and success_rate_history.
3D Scatter Plot: History vs. Session vs. Delta Days¶
# Initialize a 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
# Scatter plot
sc = ax.scatter(
dl['history_correct'],
dl['session_correct'],
dl['delta_days'],
c=dl['delta_days'], # Color based on delta_days
cmap='viridis', # Color map
alpha=0.8 # Transparency
)
# Add labels
ax.set_title('3D Scatter Plot: History vs. Session vs. Delta Days')
ax.set_xlabel('History Correct')
ax.set_ylabel('Session Correct')
ax.set_zlabel('Delta Days')
# Add color bar
cbar = plt.colorbar(sc, pad=0.1)
cbar.set_label('Delta Days')
# Show plot
plt.show()
- The scatter plot visualizes the relationship between three variables: history_correct, session_correct, and delta_days.
- The X-axis represents history_correct, the Y-axis represents session_correct, and the Z-axis represents delta_days.
- Each data point is colored by its delta_days value, with a color scale (the viridis color map) indicating the range of delta days; the color bar provides a reference for interpreting the colors.
- This 3D scatter plot is useful for understanding how the three variables interact and how they might be correlated.
Correlation Analysis:¶
Delta Days vs. p_recall:
- The correlation between
delta_days
andp_recall
is -0.03, indicating a very weak negative relationship. This suggests that the time elapsed between study sessions has a minimal impact on recall probability.
- The correlation between
Delta Days vs. Success Rate History:
- The correlation between
delta_days
andsuccess_rate_history
is 0.02, indicating a very weak positive relationship. This implies that the time elapsed between study sessions has a negligible effect on historical success rates.
- The correlation between
3D Scatter Plot Insights:¶
The 3D scatter plot visualizes the relationship between the number of correct answers in history (history_correct
), session correctness (session_correct
), and the number of days between sessions (delta_days
).
Key Observations:
- History Correct vs. Session Correct: There's a noticeable clustering of points where higher values of
history_correct
correlate with highersession_correct
values. This suggests that learners who perform well historically tend to perform well in current sessions too. - Color Gradient of Delta Days: The color gradient from purple to yellow represents the
delta_days
values. While there is no clear trend indicating the impact ofdelta_days
, points are spread across the range, indicating varied intervals between sessions.
Recommendations:¶
Focus on Consistency:
- Encourage regular study habits since
delta_days
shows minimal impact on recall and success rates. Consistent engagement is key to maintaining performance.
- Encourage regular study habits since
Utilize Historical Performance:
- Leverage the strong relationship between
history_correct
andsession_correct
. Personalized feedback and adaptive learning paths based on historical performance can help enhance learning outcomes.
- Leverage the strong relationship between
Balance Study Intervals:
- While the weak correlations suggest flexibility in study intervals, maintaining a balanced approach with periodic reviews can support long-term retention.
By leveraging these insights and recommendations, you can enhance learner performance, maintain engagement, and achieve better outcomes in language learning.
Hypothesis 14 :- Analyzing p_recall by delta_days_category aims to explore how recall probabilities vary across time intervals, hypothesizing a potential decline in recall with longer time gaps.¶
Aggregating Mean and Median on p_recall for delta_days_category¶
p_recall_days_category = dl.groupby("delta_days_category")["p_recall"].agg(["mean", "median"])
p_recall_days_category
mean | median | |
---|---|---|
delta_days_category | ||
Less than a day | 0.906413 | 1.0 |
Over a month | 0.868080 | 1.0 |
Within a month | 0.882737 | 1.0 |
Within a week | 0.890425 | 1.0 |
Zero | 0.927669 | 1.0 |
- dl.groupby("delta_days_category")["p_recall"].agg(["mean", "median"]) groups the data by delta_days_category and calculates the mean and median of p_recall within each category.
Create a Chart of `p_recall` vs `delta_days_category`¶
# Extract data for plotting
categories = p_recall_days_category.index # Delta days categories
mean_values = p_recall_days_category['mean'] # Mean of p_recall
median_values = p_recall_days_category['median'] # Median of p_recall
# Create the figure
plt.figure(figsize=(12, 6))
# Bar chart for mean values
plt.bar(categories, mean_values, color='skyblue', label='Mean', alpha=0.7)
# Line chart for median values
plt.plot(categories, median_values, color='red', marker='o', label='Median', linewidth=2)
# Add labels, title, and legend
plt.title('P_Recall by Delta Days Category', fontsize=16)
plt.xlabel('Delta Days Category', fontsize=14)
plt.ylabel('P_Recall', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.ylim(0.85, 1.02) # Adjust the y-axis to focus on the range of p_recall
plt.legend(title='Metrics', fontsize=12)
plt.grid(alpha=0.5)
# Show the plot
plt.tight_layout()
plt.show()
- This code creates a plot showing the mean and median values of `p_recall` for each `delta_days_category`.
- The bar chart represents the mean `p_recall` values for each category.
- The line chart represents the median `p_recall` values for each category.
- The x-axis shows the different `delta_days_category` values, and the y-axis represents the `p_recall` values.
Analysis¶
Mean and Median P_Recall:
- The mean P_Recall values vary across the different delta days categories. The "Zero" category has the highest mean value at 0.93, indicating that recall probability is highest when there's no gap between learning sessions.
- The "Over a month" category has the lowest mean P_Recall value at 0.87, suggesting that recall probability decreases with longer intervals between sessions.
- The median P_Recall values are consistently at 1.0 for all categories, indicating that at least half of the learners achieve perfect recall across all time intervals.
Impact of Time Intervals:
- Immediate recall (Zero days) leads to the highest recall probability, emphasizing the effectiveness of frequent and consistent learning sessions.
- Recall probabilities remain relatively high for intervals "Less than a day," "Within a week," and "Within a month," but show a noticeable decline for "Over a month," highlighting the need for reinforced learning strategies over longer periods.
Recommendations¶
Frequent Learning Sessions:
- Encourage learners to engage in frequent learning sessions to maintain high recall probabilities. Implementing spaced repetition techniques can be beneficial.
Reinforced Learning for Long Intervals:
- Develop reinforced learning strategies for learners with longer intervals between sessions. This could include periodic reviews and refresher modules to improve long-term retention.
Personalized Learning Plans:
- Create personalized learning plans that adapt to the individual needs of learners. For those who cannot engage frequently, provide tailored content to maximize recall during longer gaps.
Interactive and Engaging Content:
- Enhance learning materials with interactive and engaging content to keep learners motivated and reduce the likelihood of long gaps between sessions.
By following these recommendations, you can enhance recall probabilities across different time intervals, improve long-term retention, and provide a more effective learning experience for all learners.
Hypothesis 15 :- Gender-specific differences in recall probability (p_recall) may vary across different learning languages, suggesting that certain languages might exhibit higher or lower recall rates based on gender categorization.¶
Aggregating mean `p_recall` by `learning_language_Abb` and `gender_tag`¶
# Mean p_recall for each (learning language, gender tag) pair
p_recall_gender_language = dl.groupby(["learning_language_Abb", "gender_tag"])["p_recall"].mean().reset_index(name="mean_p_recall")
# Pivot so languages form the rows and gender tags the columns
p_recall_gender_language = p_recall_gender_language.pivot(index='learning_language_Abb',
                                                          columns='gender_tag', values='mean_p_recall')
p_recall_gender_language
gender_tag | Feminine | Masculine | Neuter |
---|---|---|---|
learning_language_Abb | |||
English | 0.915208 | 0.892346 | 0.908488 |
French | 0.883118 | 0.889324 | 0.885532 |
German | 0.890919 | 0.890983 | 0.900045 |
Italian | 0.913691 | 0.901427 | 0.899050 |
Portuguese | 0.902538 | 0.911269 | 0.861442 |
Spanish | 0.898183 | 0.906447 | 0.903529 |
- Groups the data by `learning_language_Abb` and `gender_tag` and calculates the mean `p_recall` for each group.
- `p_recall_gender_language.pivot()` reshapes the result into a pivot table for easier analysis.
Heat Map Using pivot table¶
plt.figure(figsize=(10, 8))
sns.heatmap(p_recall_gender_language, annot=True, cmap='cividis', fmt='.2f', linewidths=0.5, cbar_kws={'label': 'Mean Recall Rate'})
plt.title('Mean p_recall by Learning Language and Gender', fontsize=16)
plt.xlabel('Gender Tag', fontsize=14)
plt.ylabel('Learning Language', fontsize=14)
plt.tight_layout()
plt.show()
- A heatmap is created to show the relationship between `learning_language_Abb` (on the y-axis) and `gender_tag` (on the x-axis), with the color representing the mean recall rate (`p_recall`).
Analysis¶
Mean Recall Rates by Gender and Language:
- English: Feminine (0.92), Masculine (0.89), Neuter (0.91)
- French: Feminine (0.88), Masculine (0.89), Neuter (0.89)
- German: Feminine (0.89), Masculine (0.89), Neuter (0.90)
- Italian: Feminine (0.91), Masculine (0.90), Neuter (0.90)
- Portuguese: Feminine (0.90), Masculine (0.91), Neuter (0.86)
- Spanish: Feminine (0.90), Masculine (0.91), Neuter (0.90)
Key Observations:
- English shows the highest recall rate for Feminine gender at 0.92, but slightly lower for Masculine at 0.89.
- German exhibits relatively consistent recall rates across all genders, with Neuter slightly higher at 0.90.
- Portuguese has the lowest Neuter recall rate at 0.86, indicating potential variability in recall based on gender.
- Italian and Spanish demonstrate high recall rates across all genders, indicating strong performance and retention.
Recommendations¶
Tailored Learning Strategies:
- Recognize and address the slight variations in recall rates across different genders. This could involve creating more personalized learning content and support for specific gender groups.
- For languages like Portuguese with lower Neuter recall rates, explore the reasons behind the disparity and implement targeted interventions to improve recall.
Focus on Consistency:
- Languages with consistent recall rates across genders, like German, serve as good models for balancing content and teaching methods.
- Utilize insights from these languages to develop best practices that can be applied to other languages.
Enhanced Engagement:
- Promote interactive and engaging content that caters to diverse learner needs, aiming to improve recall rates across all gender groups.
- Encourage feedback from learners to continuously refine and adapt content, ensuring it meets the needs of all users effectively.
By focusing on these recommendations, you can enhance the learning experience, address variability in recall rates, and provide more effective and personalized support for learners across different languages and gender groups.
Hypothesis 16 :- The hypothesis is that the time of day (hour) may have a significant impact on the success rate of history recall, with certain hours showing higher or lower success rates.¶
Aggregating mean `success_rate_history` by `hour`¶
hourly_success_rate = dl.groupby('hour')['success_rate_history'].mean().reset_index()
hourly_success_rate = hourly_success_rate.sort_values('hour')
hourly_success_rate
hour | success_rate_history | |
---|---|---|
0 | 0 | 0.899899 |
1 | 1 | 0.901693 |
2 | 2 | 0.901119 |
3 | 3 | 0.901233 |
4 | 4 | 0.898319 |
5 | 5 | 0.897596 |
6 | 6 | 0.900563 |
7 | 7 | 0.898600 |
8 | 8 | 0.902225 |
9 | 9 | 0.901200 |
10 | 10 | 0.900115 |
11 | 11 | 0.898766 |
12 | 12 | 0.901242 |
13 | 13 | 0.902201 |
14 | 14 | 0.900948 |
15 | 15 | 0.902254 |
16 | 16 | 0.901973 |
17 | 17 | 0.901663 |
18 | 18 | 0.899930 |
19 | 19 | 0.902748 |
20 | 20 | 0.902178 |
21 | 21 | 0.901185 |
22 | 22 | 0.899891 |
23 | 23 | 0.899842 |
- The above code produces the DataFrame `hourly_success_rate`, which shows the average success rate for each `hour` of the day, ordered from the earliest to the latest hour.
Chart on `hourly_success_rate`¶
# Line plot for success rate by hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_success_rate, x='hour', y='success_rate_history', marker='o', color='blue')
plt.title('Success Rate by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Success Rate')
plt.xticks(range(0, 24)) # Ensure all hours are shown
plt.grid(True)
plt.show()
- A line plot is used to depict `hourly_success_rate`.
Analysis¶
The line plot shows the success rate for each hour of the day, providing insights into how learner performance varies throughout the 24-hour period.
Key Observations:
- Overall Performance:
- The success rates are fairly consistent across different hours, ranging from approximately 0.898 to 0.903, indicating stable performance regardless of the time.
- Peak Performance Hours:
- Hour 8: One of the peaks in the success rate is observed around 8 AM (0.902). This might suggest that learners perform well in the morning.
- Hour 15 and 19: Other notable peaks are around 3 PM and 7 PM, with success rates around 0.902 and 0.903 respectively. These could be optimal study times for learners.
- Dip in Performance:
- Hour 4 and 5: A slight dip in success rates is observed around 4 AM and 5 AM (0.898), possibly indicating that early morning hours might not be as effective for learning.
Recommendations¶
Optimal Study Times:
- Encourage learners to engage in study sessions during peak performance hours, such as around 8 AM, 3 PM, and 7 PM, to maximize their success rates.
- Consider scheduling live classes, webinars, or interactive sessions during these optimal hours to leverage higher performance levels.
Targeted Support:
- Provide additional support and resources during hours with lower success rates, such as 4 AM and 5 AM. This could include offering motivational content, shorter study sessions, or interactive learning activities to keep learners engaged.
Personalized Learning Plans:
- Develop personalized learning plans that take into account the individual learner's optimal performance times. Encouraging learners to study during their most productive hours can enhance overall success rates.
By following these recommendations, you can help learners optimize their study schedules, improve their performance, and achieve better outcomes in their language learning journey.
Aggregating mean `p_recall` by `hour`¶
hourly_recall_rate = dl.groupby('hour')['p_recall'].mean().reset_index()
hourly_recall_rate = hourly_recall_rate.sort_values('hour')
hourly_recall_rate
hour | p_recall | |
---|---|---|
0 | 0 | 0.896413 |
1 | 1 | 0.898376 |
2 | 2 | 0.898629 |
3 | 3 | 0.898826 |
4 | 4 | 0.895443 |
5 | 5 | 0.891778 |
6 | 6 | 0.897451 |
7 | 7 | 0.893905 |
8 | 8 | 0.897472 |
9 | 9 | 0.896174 |
10 | 10 | 0.893066 |
11 | 11 | 0.893905 |
12 | 12 | 0.895460 |
13 | 13 | 0.897065 |
14 | 14 | 0.894525 |
15 | 15 | 0.898139 |
16 | 16 | 0.898184 |
17 | 17 | 0.897501 |
18 | 18 | 0.895011 |
19 | 19 | 0.897626 |
20 | 20 | 0.897241 |
21 | 21 | 0.897527 |
22 | 22 | 0.894974 |
23 | 23 | 0.894339 |
- The above code produces the DataFrame `hourly_recall_rate`, which shows the average recall rate for each `hour` of the day, ordered from the earliest to the latest hour.
Chart on `hourly_recall_rate`¶
# Line plot for Recall rate by hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_recall_rate, x='hour', y='p_recall', marker='o', color='blue')
plt.title('Recall Rate by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Recall Rate')
plt.xticks(range(0, 24)) # Ensure all hours are shown
plt.grid(True)
plt.show()
- A line plot is used to depict the average `p_recall` by hour.
Analysis:¶
Overall Recall Rate:
- The recall rates fluctuate throughout the day, staying within a narrow range (approximately 0.89 to 0.90), indicating relatively consistent performance.
Peak Recall Hours:
- The highest recall rates are observed in the early hours, around 3 AM (0.8988), 2 AM (0.8986), and 1 AM (0.8984).
- Another, smaller peak occurs in the afternoon around 3 PM - 4 PM (≈0.898).
Lowest Recall Hours:
- The lowest recall rate is observed around 5 AM (0.8918).
- There are also dips around 10 AM (0.8931) and 7 AM (0.8939).
Recommendations:¶
Optimize Study Sessions:
- Schedule critical learning sessions during peak recall hours to maximize retention.
- Avoid scheduling important sessions during low recall hours, such as early morning (4 AM - 5 AM).
Balanced Study Plan:
- Distribute learning activities evenly throughout the day to maintain a balanced recall rate.
- Incorporate short, frequent review sessions during high recall periods to reinforce learning.
Personalized Learning Schedules:
- Tailor learning schedules to individual preferences and peak performance times, leveraging data on recall rates.
- Encourage learners to identify their personal peak hours and align their study plans accordingly.
By implementing these strategies, you can enhance learning efficiency, improve retention, and create more effective study schedules.
Peak and lowest learning hour with success rate¶
# Find the hour with the highest success rate
peak_hour = hourly_success_rate.loc[hourly_success_rate['success_rate_history'].idxmax()]
print(f"Peak learning hour: {peak_hour['hour']} with success rate: {peak_hour['success_rate_history']:.2f}")
# Find the hour with the lowest success rate
lowest_hour = hourly_success_rate.loc[hourly_success_rate['success_rate_history'].idxmin()]
print(f"Lowest learning hour: {lowest_hour['hour']} with success rate: {lowest_hour['success_rate_history']:.2f}")
Peak learning hour: 19.0 with success rate: 0.90 Lowest learning hour: 5.0 with success rate: 0.90
- The code searches for the peak learning hour (hour with the highest success rate) and displays it along with the success rate.
- Similarly, it identifies the lowest learning hour (hour with the lowest success rate) and displays it along with the success rate.
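The peak and lowest recall hours discussed in the earlier recall analysis can be located the same way; this is a small sketch reusing the `hourly_recall_rate` DataFrame computed above.
# Sketch: find the hours with the highest and lowest average recall rate
peak_recall_hour = hourly_recall_rate.loc[hourly_recall_rate['p_recall'].idxmax()]
lowest_recall_hour = hourly_recall_rate.loc[hourly_recall_rate['p_recall'].idxmin()]
print(f"Peak recall hour: {int(peak_recall_hour['hour'])} with recall rate: {peak_recall_hour['p_recall']:.3f}")
print(f"Lowest recall hour: {int(lowest_recall_hour['hour'])} with recall rate: {lowest_recall_hour['p_recall']:.3f}")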
Hypothesis 17 :- User engagement varies by hour of the day, with certain hours exhibiting higher or lower engagement levels.¶
Getting the Hourly User Count¶
hourly_count = dl['hour'].value_counts().sort_index()
print(hourly_count)
0 188715 1 189609 2 188554 3 172621 4 145875 5 111427 6 80684 7 75318 8 69965 9 76519 10 89185 11 96127 12 114489 13 135089 14 153733 15 175107 16 199630 17 207284 18 220926 19 220805 20 230840 21 234093 22 221128 23 198035 Name: hour, dtype: int64
- This code shows how many times each hour of the day appears in the dataset, sorted in order of the hour (from 0 to 23).
Chart on Hourly User Count¶
plt.figure(figsize=(10, 6))
plt.fill_between(hourly_count.index, hourly_count.values, color='lightgreen', alpha=0.4)
plt.plot(hourly_count.index, hourly_count.values, color='forestgreen', linewidth=2)
plt.title('Hourly Count Area Chart', fontsize=16)
plt.xlabel('Hour of the Day', fontsize=14)
plt.ylabel('Number of Learning Sessions', fontsize=14)
plt.grid(alpha=0.5)
plt.show()
- An area chart (a filled line plot) is used to depict `hourly_count`.
Analysis:¶
The area chart visualizes the number of learning sessions throughout the day, highlighting periods of high and low activity.
Key Observations:
Early Morning Dip:
- The number of learning sessions starts high at midnight and gradually decreases to its lowest point around 8 AM (roughly 70,000 sessions).
- This suggests that early morning hours are less popular for learning activities.
Steady Increase and Peak:
- After 8 AM, the number of learning sessions steadily increases, peaking around 9 PM.
- This indicates that learners are more active in the evening hours.
Late Night Activity:
- There's a slight decline after 9 PM, but the activity remains relatively high until midnight.
- This shows that many learners prefer studying late at night.
Recommendations:¶
Optimize Learning Content Delivery:
- Schedule important lessons, webinars, and interactive sessions during peak hours (evening and late night) to maximize engagement.
- Utilize early morning hours for light, refresher content or motivational materials to gradually engage learners.
Targeted Engagement Strategies:
- Implement targeted engagement strategies during low-activity hours (early morning). This could include push notifications, reminders, or gamified learning to encourage learners to study during these times.
Personalized Learning Schedules:
- Encourage learners to identify their personal optimal study times based on their performance and engagement patterns, and create personalized learning schedules accordingly.
By implementing these strategies, you can enhance learner engagement, optimize content delivery, and support effective learning habits across different times of the day.
Hypothesis 18 :- Engagement and recall rates may vary by hour, indicating potential patterns in user activity and learning effectiveness throughout the day.¶
Getting Hourly Engagement¶
# Group by hour and calculate total history seen for engagement by hour
hourly_engagement = dl.groupby('hour')['history_seen'].sum().reset_index()
hourly_engagement
hour | history_seen | |
---|---|---|
0 | 0 | 4294604 |
1 | 1 | 3979716 |
2 | 2 | 4460718 |
3 | 3 | 5276017 |
4 | 4 | 4282354 |
5 | 5 | 3367891 |
6 | 6 | 1255401 |
7 | 7 | 1212662 |
8 | 8 | 1141606 |
9 | 9 | 1504221 |
10 | 10 | 2279598 |
11 | 11 | 2133283 |
12 | 12 | 2619809 |
13 | 13 | 3025397 |
14 | 14 | 3197393 |
15 | 15 | 3309489 |
16 | 16 | 3964395 |
17 | 17 | 4441441 |
18 | 18 | 4695238 |
19 | 19 | 4845972 |
20 | 20 | 5361548 |
21 | 21 | 4408670 |
22 | 22 | 4393974 |
23 | 23 | 3968790 |
- The above code produces the DataFrame `hourly_engagement`, which shows the sum of `history_seen` for each `hour` of the day, ordered from the earliest to the latest hour.
Plot on Hourly Engagement¶
# Plot hourly engagement (total history seen) by hour of the day
plt.figure(figsize=(12, 6))
# Plot total history seen (engagement)
plt.plot(hourly_engagement['hour'], hourly_engagement['history_seen'], marker='o', label='Total Engagement (History Seen)', color='blue')
# Add titles, labels, and grid
plt.title('Hourly Engagement')
plt.xlabel('Hour')
plt.ylabel('Value')
plt.grid(True)
plt.legend()
# Set x-ticks for each hour (0 to 23)
plt.xticks(range(24))
# Show the plot
plt.show()
- `plt.plot()` is used to create a line plot.
- `hourly_engagement['hour']` is the x-axis of the plot, representing the hour of the day (0-23).
- `hourly_engagement['history_seen']` is the y-axis of the plot, representing the total engagement (history seen) for each hour.
- `marker='o'` adds circular markers at each data point along the line, helping to visualize each data point more clearly.
Analysis:¶
The line plot of hourly engagement (total history seen) provides insights into user activity throughout the day.
Key Observations:
- Early Morning Peak:
- There is a significant peak around 3 AM with a total engagement of over 5 million, indicating a high level of activity during late-night hours.
- Morning Dip:
- A noticeable dip in engagement is observed between 6 AM and 9 AM, with total engagement falling below 1.3 million and bottoming out around 8 AM. This could reflect a time when users are less active, likely due to sleep.
- Evening Peak:
- Another peak is observed around 8 PM with total engagement exceeding 5 million, indicating high user activity during the evening hours.
- Consistent Engagement:
- Engagement remains relatively high and consistent throughout the day, particularly from the late afternoon through the evening.
Recommendations:¶
Optimal Content Delivery Times:
- Schedule important lessons, live sessions, or interactive activities during peak hours (late night and evening) to maximize user engagement.
- Utilize the early morning dip for lighter content, reminders, or motivational messages to gradually re-engage users.
Targeted Notifications:
- Send targeted notifications or reminders during periods of lower engagement (early morning) to encourage users to return to the platform and maintain consistent activity.
Personalized Learning Plans:
- Develop personalized learning schedules that align with individual users' peak activity times, promoting effective and efficient study habits.
Interactive and Engaging Content:
- Enhance learning materials with interactive and engaging elements, particularly during high activity periods, to retain user interest and motivation.
By implementing these strategies, you can enhance user engagement, optimize content delivery, and support effective learning habits across different times of the day.
Getting the `hourly_engagement_recall` DataFrame¶
hourly_engagement_recall = dl.groupby('hour')['p_recall'].mean().reset_index()
hourly_engagement_recall
hour | p_recall | |
---|---|---|
0 | 0 | 0.896413 |
1 | 1 | 0.898376 |
2 | 2 | 0.898629 |
3 | 3 | 0.898826 |
4 | 4 | 0.895443 |
5 | 5 | 0.891778 |
6 | 6 | 0.897451 |
7 | 7 | 0.893905 |
8 | 8 | 0.897472 |
9 | 9 | 0.896174 |
10 | 10 | 0.893066 |
11 | 11 | 0.893905 |
12 | 12 | 0.895460 |
13 | 13 | 0.897065 |
14 | 14 | 0.894525 |
15 | 15 | 0.898139 |
16 | 16 | 0.898184 |
17 | 17 | 0.897501 |
18 | 18 | 0.895011 |
19 | 19 | 0.897626 |
20 | 20 | 0.897241 |
21 | 21 | 0.897527 |
22 | 22 | 0.894974 |
23 | 23 | 0.894339 |
- `dl.groupby('hour')['p_recall'].mean().reset_index()` derives the mean `p_recall` for each `hour` of the day, ordered from the earliest to the latest hour.
Plot on `hourly_engagement_recall`¶
# Plot the average recall rate (p_recall) by hour of the day
plt.figure(figsize=(12, 6))
# Plot mean p_recall for each hour
plt.plot(hourly_engagement_recall['hour'], hourly_engagement_recall['p_recall'], marker='o', label='Mean Recall Rate (p_recall)', color='blue')
# Add titles, labels, and grid
plt.title('hourly_engagement_recall')
plt.xlabel('Hour')
plt.ylabel('p_recall')
plt.grid(True)
plt.legend()
# Set x-ticks for each hour (0 to 23)
plt.xticks(range(24))
# Show the plot
plt.show()
- Visualizing the line plot for `hourly_engagement_recall`.
Analysis:¶
The line graph titled "hourly_engagement_recall" visualizes the average recall probability (p_recall) for each hour of the day. Here are the insights derived from the data and visualization:
Consistent Recall Performance:
- The recall probabilities (p_recall) show relatively consistent performance throughout the day, with values ranging between approximately 0.89 and 0.90.
Early Morning Dip:
- There is a noticeable dip in recall probability around 4 AM and 5 AM, with the lowest value at 0.891778 around 5 AM. This suggests that early morning hours may not be optimal for recall performance.
Peak Recall Hours:
- The highest recall probabilities are observed around 3 AM (0.898826) and 3 PM (0.898139). These peak hours indicate times when learners tend to have better recall performance.
Recommendations:¶
Optimize Study Sessions:
- Schedule critical learning activities, reviews, and quizzes during peak recall hours (e.g., 3 AM and 3 PM) to maximize retention and recall performance.
Avoid Low Recall Hours:
- Avoid scheduling essential learning activities during low recall hours (e.g., 4 AM and 5 AM) to ensure learners are engaging with the material when their recall performance is optimal.
Balanced Study Plan:
- Encourage learners to adopt a balanced study plan that incorporates both peak and non-peak hours, ensuring consistent engagement and minimizing the impact of low recall periods.
Targeted Interventions:
- Implement targeted interventions during early morning dips to help learners improve recall during these times. This could include shorter, interactive sessions or gamified learning activities to boost engagement.
By following these recommendations, you can enhance learning outcomes, improve retention, and create more effective study schedules for learners.
Deriving the Correlation Between `hourly_engagement` and `hourly_engagement_recall`¶
hourly_engagement_corr = hourly_engagement.merge(hourly_engagement_recall, on='hour')
hourly_engagement_corr_corr = hourly_engagement_corr['history_seen'].corr(hourly_engagement_corr['p_recall'])
print(f"Correlation between hourly engagement and learning success: {hourly_engagement_corr_corr:.2f}")
Correlation between hourly engagement and learning success: 0.32
- `.merge()` joins the two DataFrames on the shared column `hour`, which represents the hour of the day.
- The `.corr()` method computes the Pearson correlation coefficient between two columns of the merged `hourly_engagement_corr` DataFrame (the formula is shown after this list).
- The Pearson correlation measures the linear relationship between two variables:
  - A value close to 1 indicates a strong positive correlation (as one increases, the other also increases).
  - A value close to -1 indicates a strong negative correlation (as one increases, the other decreases).
  - A value close to 0 indicates no linear correlation.
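For reference, the Pearson correlation coefficient computed by `.corr()` between two series $x$ and $y$ of length $n$ is:
$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
where $\bar{x}$ and $\bar{y}$ are the sample means.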
Analysis of Hourly Engagement and Learning Success Correlation¶
- Correlation Overview:
  - The correlation between hourly engagement (history seen) and learning success (measured by `p_recall`) is 0.32, indicating a weak positive relationship.
Key Insights¶
Weak Positive Correlation:
- A 0.32 correlation suggests that as users engage more frequently, there is a slight increase in their learning success. However, this correlation is relatively weak, meaning that engagement alone does not fully explain the variability in learning outcomes.
Diminishing Returns on Engagement:
- The weak correlation implies that increased engagement might not always lead to significant improvements in recall rates. There may be diminishing returns from engagement, where other factors like content quality, learning strategies, or learner characteristics could play a larger role in success.
Potential for More Insight:
- The correlation value indicates that while engagement might have some influence, it's not the dominant factor. Additional data or features (such as learner behavior patterns, content difficulty, or feedback quality) could help identify why engagement doesn't lead to stronger improvements in success rates.
Recommendations¶
Increase Engagement Quality:
- Focus not just on quantity of engagement, but on making each engagement more impactful. For instance, introduce interactive or adaptive learning techniques that align more closely with learners' progress.
Personalized Learning:
- Incorporate data on user learning patterns (e.g., preferred learning times, pace, difficulty levels) to offer more personalized learning experiences. This could boost engagement quality, and in turn, success rates.
Explore Other Influencing Factors:
- Investigate other variables that could be contributing more strongly to learning success (e.g., time spent per session, content type, or difficulty level) to build a more comprehensive model of learner success.
Track Engagement Duration:
- Explore the relationship between time spent on learning rather than just the number of sessions or history seen, as longer, focused sessions might correlate more strongly with success.
In summary, while engagement does contribute to learning success, it should be complemented with other strategies and factors to optimize overall performance.
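As a starting point for the "Explore Other Influencing Factors" recommendation above, a quick hedged sketch is to inspect how a few existing columns in `dl` correlate with `p_recall`; the column list below is illustrative, not exhaustive.
# Sketch: correlation of a few candidate predictors with p_recall
candidate_cols = ['history_seen', 'history_correct', 'delta_days', 'success_rate_history', 'p_recall']
print(dl[candidate_cols].corr()['p_recall'].round(2))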
Hypothesis 19 :- The frequency of learning sessions across different languages varies significantly based on the delta days category, with certain time intervals showing higher engagement for specific languages.¶
Aggregating the Count of each `delta_days_category` per `learning_language_Abb`¶
count_days_return_language = dl.groupby(["learning_language_Abb", "delta_days_category"]).size().reset_index(name="count")
heatmap_data = count_days_return_language.pivot(
index='learning_language_Abb',
columns='delta_days_category',
values='count'
)
heatmap_data
delta_days_category | Less than a day | Over a month | Within a month | Within a week | Zero |
---|---|---|---|---|---|
learning_language_Abb | |||||
English | 850154 | 92163 | 183263 | 351786 | 2560 |
French | 291639 | 35138 | 73636 | 151327 | 964 |
German | 206231 | 32629 | 65061 | 120992 | 520 |
Italian | 142301 | 6463 | 26592 | 62506 | 99 |
Portuguese | 50524 | 4774 | 11566 | 25072 | 120 |
Spanish | 478871 | 75805 | 164709 | 287114 | 1179 |
- `dl.groupby(["learning_language_Abb", "delta_days_category"]).size()` creates groups for each unique combination of language and delta days category and counts the number of records in each group.
- `.pivot()` restructures the counts into a pivot table.
Creating the Heatmap from `heatmap_data`¶
# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='Blues', linewidths=0.5, cbar_kws={'label': 'Count of Sessions'})
plt.title('Heatmap of Learning Sessions by Language and Return Days', fontsize=16)
plt.xlabel('Delta Days Category', fontsize=14)
plt.ylabel('Learning Language', fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
- The heatmap visually represents the count of learning sessions for each language and delta days category.
Analysis:¶
Engagement Patterns:
- The highest engagement ("Less than a day" category) is observed for English (850,154 sessions), followed by Spanish (478,871 sessions), indicating these languages have the most frequent sessions.
- The "Over a month" category shows significantly lower counts, with the highest for English (92,163 sessions) and the lowest for Italian (6,463 sessions), suggesting that long breaks between sessions are less common.
Return Time Distributions:
- The "Within a week" category has high engagement for English (351,786 sessions) and Spanish (287,114 sessions), reflecting frequent returns within short intervals.
- The "Zero" category has the lowest counts across all languages, indicating that same-day returns are rare.
Language-Specific Trends:
- French and German also show notable engagement in the "Less than a day" and "Within a week" categories, suggesting consistent study habits.
- Portuguese and Italian have lower overall engagement across all categories, highlighting a need for strategies to increase learner engagement.
Recommendations:¶
Encourage Regular Study Habits:
- Promote consistent study routines, particularly for less engaged languages like Portuguese and Italian, by highlighting the benefits of frequent practice.
Targeted Interventions:
- Implement targeted interventions for learners with long breaks (Over a month) to re-engage them and prevent extended absences.
- Use reminders, notifications, and incentives to encourage regular returns within shorter intervals (Less than a day, Within a week).
Enhance Engagement for Popular Languages:
- For high-engagement languages like English and Spanish, develop advanced courses and interactive sessions to maintain learner interest and motivation.
- Leverage the frequent engagement patterns to introduce community activities, challenges, and gamified content.
By following these recommendations, you can enhance learner engagement, reduce long absences, and support effective learning habits across all languages.
Hypothesis 20 :- The distribution of learning languages is influenced by the UI language preference, with certain learning languages being more frequently paired with specific UI languages.¶
Count of `learning_language` by `ui_language`¶
learning_language_ui_language_count = dl.groupby(['learning_language_Abb', 'ui_language_Abb']).size().reset_index(name='count')
learning_language_ui_language_count
learning_language_Abb | ui_language_Abb | count | |
---|---|---|---|
0 | English | Italian | 123157 |
1 | English | Portuguese | 282884 |
2 | English | Spanish | 1073885 |
3 | French | English | 552704 |
4 | German | English | 425433 |
5 | Italian | English | 237961 |
6 | Portuguese | English | 92056 |
7 | Spanish | English | 1007678 |
- Understand the relationship between the languages users are learning and the languages they use in the interface.
Visualization of `learning_language` vs `ui_language`¶
# Create a pivot table to count occurrences of 'learning_language_Abb' for each 'ui_language_Abb'
count_data = dl.groupby(['learning_language_Abb', 'ui_language_Abb']).size().unstack(fill_value=0)
# Plot the stacked bar chart
count_data.plot(kind='bar', stacked=True, figsize=(12, 8), colormap='tab20')
plt.title('Count of Learning Languages for Each UI Language', fontsize=16)
plt.xlabel('Learning Language', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
- Groups the dataset `dl` by the columns `learning_language_Abb` and `ui_language_Abb`.
- Uses `.size()` to count the number of occurrences for each unique combination of learning language and UI language.
- `.unstack(fill_value=0)` converts the grouped data into a pivot table.
- Creates the stacked bar chart using `.plot()` with the title and axis labels shown above.
Analysis:¶
The stacked bar chart displays the count of people learning different languages categorized by the user interface (UI) language they use. Here are the insights:
English Learning Dominance:
- English is the most learned language, with a significant number of learners using the Spanish UI (over 1 million), followed by Portuguese UI (around 282,884) and Italian UI (around 123,157). This indicates a widespread interest in learning English across different UI languages.
Strong Presence of English UI:
- Other languages such as French, German, Italian, Portuguese, and Spanish are primarily learned by users using the English UI. This suggests that English-speaking learners are interested in expanding their language skills to these languages.
Spanish as a Popular Learning Language:
- Spanish is the second most learned language, predominantly by users with the English UI (around 1 million). This reflects a high interest among English-speaking users to learn Spanish.
Recommendations:¶
Leverage English UI Popularity:
- Since many learners use the English UI to learn various languages, ensure the English UI is user-friendly, well-localized, and rich in features to support diverse learning experiences.
- Enhance the English UI with additional tools, resources, and interactive content to keep learners engaged and motivated.
Promote English Learning:
- Utilize the popularity of the Spanish UI to promote English learning. Highlight the benefits of learning English, such as improved career opportunities, and offer specialized English courses tailored to Spanish speakers.
Expand Language Offerings for Non-English UIs:
- Develop and promote learning content for other languages in non-English UIs (e.g., Italian, Portuguese). This can help attract more learners from different linguistic backgrounds and increase engagement.
Interactive and Cultural Content:
- Enhance learning materials with interactive and cultural content to provide a more immersive learning experience. For example, incorporating cultural insights, language games, and real-world scenarios can make learning more engaging and effective.
By implementing these recommendations, you can optimize the learning experience, cater to diverse user preferences, and attract more learners across various languages and UI settings.
dl
p_recall | timestamp | delta | user_id | learning_language | ui_language | lexeme_id | lexeme_string | history_seen | history_correct | ... | lexeme_base | grammar_tag | gender_tag | plurality_tag | delta_days | time | delta_days_category | success_rate_history | time_d | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2013-03-03 17:13:47 | 1825254 | 5C7 | fr | en | 3712581f1a9fbc0894e22664992663e9 | sur/sur<pr> | 2 | 1 | ... | sur/sur | Pronouns and Related | NaN | NaN | 21.126 | 17:13:47 | Within a month | 0.500000 | 1900-01-01 17:13:47 | 17 |
1 | 1.0 | 2013-03-04 18:30:50 | 367 | fWSx | en | es | 0371d118c042c6b44ababe667bed2760 | police/police<n><pl> | 6 | 5 | ... | police/police | Nouns | Neuter | Plural | 0.004 | 18:30:50 | Less than a day | 0.833333 | 1900-01-01 18:30:50 | 18 |
2 | 0.0 | 2013-03-03 18:35:44 | 1329 | hL-s | de | en | 5fa1f0fcc3b5d93b8617169e59884367 | hat/haben<vbhaver><pri><p3><sg> | 10 | 10 | ... | hat/haben | Verbs | NaN | Singular | 0.015 | 18:35:44 | Less than a day | 1.000000 | 1900-01-01 18:35:44 | 18 |
3 | 1.0 | 2013-03-07 17:56:03 | 156 | h2_R | es | en | 4d77de913dc3d65f1c9fac9d1c349684 | en/en<pr> | 111 | 99 | ... | en/en | Pronouns and Related | NaN | NaN | 0.002 | 17:56:03 | Less than a day | 0.891892 | 1900-01-01 17:56:03 | 17 |
4 | 1.0 | 2013-03-05 21:41:22 | 257 | eON | es | en | 35f14d06d95a34607d6abb0e52fc6d2b | caballo/caballo<n><m><sg> | 3 | 3 | ... | caballo/caballo | Nouns | Masculine | Singular | 0.003 | 21:41:22 | Less than a day | 1.000000 | 1900-01-01 21:41:22 | 21 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3795775 | 1.0 | 2013-03-06 23:06:48 | 4792 | iZ7d | es | en | 84e18e86c58e8e61d687dfa06b3aaa36 | soy/ser<vbser><pri><p1><sg> | 6 | 5 | ... | soy/ser | Verbs | NaN | Singular | 0.055 | 23:06:48 | Less than a day | 0.833333 | 1900-01-01 23:06:48 | 23 |
3795776 | 1.0 | 2013-03-07 22:49:23 | 1369 | hxJr | fr | en | f5b66d188d15ccb5d7777a59756e33ad | chiens/chien<n><m><pl> | 3 | 3 | ... | chiens/chien | Nouns | Masculine | Plural | 0.016 | 22:49:23 | Less than a day | 1.000000 | 1900-01-01 22:49:23 | 22 |
3795777 | 1.0 | 2013-03-06 21:20:18 | 615997 | fZeR | it | en | 91a6ab09aa0d2b944525a387cc509090 | voi/voi<prn><tn><p2><mf><pl> | 25 | 22 | ... | voi/voi | Pronouns and Related | NaN | Plural | 7.130 | 21:20:18 | Within a month | 0.880000 | 1900-01-01 21:20:18 | 21 |
3795778 | 1.0 | 2013-03-07 07:54:24 | 289 | g_D3 | en | es | a617ed646a251e339738ce62b84e61ce | are/be<vbser><pres> | 32 | 30 | ... | are/be | Verbs | NaN | NaN | 0.003 | 07:54:24 | Less than a day | 0.937500 | 1900-01-01 07:54:24 | 7 |
3795779 | 1.0 | 2013-03-06 21:12:07 | 191 | iiN7 | pt | en | 4a93acdbafaa061fd69226cf686d7a2b | café/café<n><m><sg> | 3 | 3 | ... | café/café | Nouns | Masculine | Singular | 0.002 | 21:12:07 | Less than a day | 1.000000 | 1900-01-01 21:12:07 | 21 |
3795758 rows × 24 columns