Graduation Project -- Duolingo Analysis¶

Business Problem Overview¶

Enhancing User Retention and Learning Effectiveness on Duolingo¶

Duolingo, a leading language-learning platform, empowers millions of users worldwide to learn new languages through engaging and interactive lessons. Offering diverse languages and skill levels, the platform provides a gamified experience designed to make learning fun and accessible to everyone.

However, the effectiveness and success of Duolingo rely heavily on user retention and learning engagement. With a vast user base and daily learning activities, identifying factors that drive users to stay active and progress through lessons—or conversely, to abandon the platform—is critical. High drop-off rates can hinder individual learning goals, reduce user satisfaction, and ultimately impact Duolingo’s long-term growth and revenue potential.

Key questions arise: Why do some users achieve consistent learning milestones while others disengage? What influences a user's ability to retain and correctly apply learned content? By analyzing user behavior, such as session frequency, accuracy rates, and engagement trends, Duolingo can gain actionable insights to optimize its learning strategies, personalize user experiences, and enhance overall retention and satisfaction.

Enhancing User Retention and Learning Effectiveness on Duolingo¶

Duolingo, a leading language-learning platform, empowers millions of users worldwide to acquire new languages through engaging and interactive lessons. Its gamified structure, encompassing daily streaks, leaderboards, and rewards, makes learning enjoyable and accessible. Catering to learners of diverse backgrounds, Duolingo plays a crucial role in breaking down language barriers and fostering global communication.

However, the platform's long-term success hinges on user retention, engagement, and learning outcomes. Despite its innovative approach, some users struggle to maintain consistent learning habits, achieve proficiency, or stay active on the platform. High drop-off rates and reduced session activity can diminish individual learning journeys and impact the platform’s ability to meet its mission of fostering education globally.

This raises several critical questions:

  • Why do some users excel while others disengage early in their learning journey?
  • What patterns or factors contribute to higher retention, lesson accuracy, and session frequency?
  • How do user behaviors differ across languages, lesson difficulty levels, or demographics?

By analyzing key metrics such as session accuracy, engagement trends, and historical learning data, Duolingo can uncover actionable insights to address these challenges. These insights can enable the platform to:

  1. Personalize learning pathways for diverse user needs.
  2. Introduce adaptive features to re-engage at-risk users.
  3. Optimize lesson structures for improved comprehension and retention.

With a deeper understanding of user behavior, Duolingo can create a more tailored and effective learning experience, ensuring users remain motivated, achieve their language goals, and continue to thrive on the platform.

Objective¶

In this project, we aim to analyze Duolingo user activity data to:

  1. Understand user behavior: Identify trends in session frequency, lesson accuracy, and engagement across different languages and user demographics.
  2. Recognize disengagement signals: Detect key indicators that suggest users are losing interest or struggling with their learning journey.
  3. Optimize learning strategies: Provide actionable insights to enhance user retention, personalize learning experiences, and improve overall platform effectiveness.

Business Impact¶

This analysis will empower Duolingo to:

  • Enhance user retention by identifying and addressing factors that contribute to user disengagement.
  • Improve learning outcomes by personalizing lessons to align with individual user needs and preferences.
  • Drive platform growth by fostering consistent user engagement and satisfaction, leading to positive word-of-mouth and higher subscription rates.

By addressing user disengagement and optimizing learning strategies, this project aligns with Duolingo's mission to make education accessible, engaging, and effective for learners worldwide. Let’s dive into the data to unlock these insights!

Dataset Overview¶

Dataset Overview¶

  • Dataset Name: Duolingo Analytics Dataset
  • Number of Rows: 3,795,780
  • Number of Columns: 12
  • Description: The dataset captures language learning session details, including user behavior, engagement metrics, and learning progress. It records recall probability, session performance, and lexeme interactions, enabling analysis of learning outcomes and activity patterns over days.

Column Definitions¶

  1. p_recall (Proportion of Recall Accuracy): The proportion of exercises in this lesson where the word (or lexeme) was correctly recalled by the student.
  2. timestamp (Time of the Lesson): The timestamp indicating when the current lesson or practice took place.
  3. delta (Time Gap): The time (in seconds) since the last lesson or practice where this specific word (lexeme) was encountered.
  4. user_id (Student ID): An anonymized ID representing the student who completed the lesson or practice.
  5. learning_language (Language Being Learned): The target language that the student is learning.
  6. ui_language (User Interface Language): The language of the app’s user interface, which is usually the student's native language.
  7. lexeme_id (Lexeme Tag ID): A system-generated unique ID for the word or lexeme being practiced.
  8. lexeme_string (Lexeme Tag): A detailed grammar tag describing the lexeme (word), including its properties like tense, gender, and plurality.
  9. history_seen (Times Seen Before): The total number of times the student has encountered this word (lexeme) in lessons or practice sessions before this one.
  10. history_correct (Times Correct Before): The total number of times the student correctly recalled this word (lexeme) in previous lessons or practice sessions.
  11. session_seen (Times the Word/Lexeme Was Seen in the Current Session): This column indicates how many times the student encountered the specific word or lexeme during the current lesson or practice session.
  12. session_correct (Times the Word/Lexeme Was Correctly Recalled in the Current Session): This column indicates how many times the student correctly recalled or answered the specific word or lexeme during the current lesson or practice session.

Analysis & Visualization¶

1. Importing and Cleaning Data¶

Importing Necessary Libraries¶

In [1]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

Loading the Dataset from google drive¶

In [2]:
dl = pd.read_csv("reduced_data_400mb (1).csv") # Loading the Data
In [3]:
dl
Out[3]:
p_recall timestamp delta user_id learning_language ui_language lexeme_id lexeme_string history_seen history_correct session_seen session_correct
0 1.0 2013-03-03 17:13:47 1825254 5C7 fr en 3712581f1a9fbc0894e22664992663e9 sur/sur<pr> 2 1 2 2
1 1.0 2013-03-04 18:30:50 367 fWSx en es 0371d118c042c6b44ababe667bed2760 police/police<n><pl> 6 5 2 2
2 0.0 2013-03-03 18:35:44 1329 hL-s de en 5fa1f0fcc3b5d93b8617169e59884367 hat/haben<vbhaver><pri><p3><sg> 10 10 1 0
3 1.0 2013-03-07 17:56:03 156 h2_R es en 4d77de913dc3d65f1c9fac9d1c349684 en/en<pr> 111 99 4 4
4 1.0 2013-03-05 21:41:22 257 eON es en 35f14d06d95a34607d6abb0e52fc6d2b caballo/caballo<n><m><sg> 3 3 3 3
... ... ... ... ... ... ... ... ... ... ... ... ...
3795775 1.0 2013-03-06 23:06:48 4792 iZ7d es en 84e18e86c58e8e61d687dfa06b3aaa36 soy/ser<vbser><pri><p1><sg> 6 5 2 2
3795776 1.0 2013-03-07 22:49:23 1369 hxJr fr en f5b66d188d15ccb5d7777a59756e33ad chiens/chien<n><m><pl> 3 3 2 2
3795777 1.0 2013-03-06 21:20:18 615997 fZeR it en 91a6ab09aa0d2b944525a387cc509090 voi/voi<prn><tn><p2><mf><pl> 25 22 2 2
3795778 1.0 2013-03-07 07:54:24 289 g_D3 en es a617ed646a251e339738ce62b84e61ce are/be<vbser><pres> 32 30 2 2
3795779 1.0 2013-03-06 21:12:07 191 iiN7 pt en 4a93acdbafaa061fd69226cf686d7a2b café/café<n><m><sg> 3 3 1 1

3795780 rows × 12 columns

Displaying Dataset Information¶

In [4]:
print("Dataset Information:")
dl.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3795780 entries, 0 to 3795779
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   p_recall           float64
 1   timestamp          object 
 2   delta              int64  
 3   user_id            object 
 4   learning_language  object 
 5   ui_language        object 
 6   lexeme_id          object 
 7   lexeme_string      object 
 8   history_seen       int64  
 9   history_correct    int64  
 10  session_seen       int64  
 11  session_correct    int64  
dtypes: float64(1), int64(5), object(6)
memory usage: 347.5+ MB
  • dl.info() gives concise information about the columns and its data types with their count.

Displaying Column Names¶

In [5]:
dl.columns
Out[5]:
Index(['p_recall', 'timestamp', 'delta', 'user_id', 'learning_language',
       'ui_language', 'lexeme_id', 'lexeme_string', 'history_seen',
       'history_correct', 'session_seen', 'session_correct'],
      dtype='object')
  • Displays the Columns in the DataFrame

Describing Dataset Information¶

In [6]:
dl.describe()
Out[6]:
p_recall delta history_seen history_correct session_seen session_correct
count 3.795780e+06 3.795780e+06 3.795780e+06 3.795780e+06 3.795780e+06 3.795780e+06
mean 8.964675e-01 7.055116e+05 2.197719e+01 1.949662e+01 1.808655e+00 1.636209e+00
std 2.711188e-01 2.211979e+06 1.283616e+02 1.136178e+02 1.350644e+00 1.309628e+00
min 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00
25% 1.000000e+00 5.180000e+02 3.000000e+00 3.000000e+00 1.000000e+00 1.000000e+00
50% 1.000000e+00 7.609500e+04 6.000000e+00 6.000000e+00 1.000000e+00 1.000000e+00
75% 1.000000e+00 4.346412e+05 1.500000e+01 1.300000e+01 2.000000e+00 2.000000e+00
max 1.000000e+00 3.964973e+07 1.344200e+04 1.281600e+04 2.000000e+01 2.000000e+01
  • dl.describe() function describes the numerical columns from the DataFrame.
  • It Displays the count, mean, std, min, 25%, 50%, 75%, max which identifies potential outliers or data issues.
  • It Helps to understand the Numerical Data Columns.

Displaying Column Data Types¶

In [7]:
print("The Data Types of all Columns:")
dl.dtypes
The Data Types of all Columns:
Out[7]:
p_recall             float64
timestamp             object
delta                  int64
user_id               object
learning_language     object
ui_language           object
lexeme_id             object
lexeme_string         object
history_seen           int64
history_correct        int64
session_seen           int64
session_correct        int64
dtype: object
  • Using dl.dtypes, the data types for each column has been Displayed.

Checking the Shape of the Dataset¶

In [8]:
rows, columns = dl.shape
print(f"The dataset contains {rows} rows and {columns} columns.")
The dataset contains 3795780 rows and 12 columns.
  • dl.shape has been used for checking the size of the dataset

Checking the unique values in the Dataset¶

In [9]:
for column in dl.columns.tolist():
    print(f"No. of unique values in {column}:")
    print(dl[column].nunique())
No. of unique values in p_recall:
66
No. of unique values in timestamp:
334848
No. of unique values in delta:
808491
No. of unique values in user_id:
79694
No. of unique values in learning_language:
6
No. of unique values in ui_language:
4
No. of unique values in lexeme_id:
16244
No. of unique values in lexeme_string:
15864
No. of unique values in history_seen:
3784
No. of unique values in history_correct:
3308
No. of unique values in session_seen:
20
No. of unique values in session_correct:
21
  • dl.columns.tolist() converts that list of column names into a Python list which loops and iterates over all the columns in the DataFrame dl.
  • dl[column].nunique() returns the number of unique (distinct) values in the column.

Checking for the Value Counts in the Dataset¶

In [10]:
value_counts_dict = {}

for column in dl.columns:
    value_counts_dict[column] = dl[column].value_counts()
    print(value_counts_dict[column])
1.000000    3187760
0.000000     266909
0.500000     132930
0.666667      84204
0.750000      49116
             ...   
0.111111          1
0.545455          1
0.272727          1
0.684211          1
0.375000          1
Name: p_recall, Length: 66, dtype: int64
2013-03-05 21:09:36    105
2013-03-07 22:36:13     97
2013-03-01 20:07:14     94
2013-03-06 14:35:04     88
2013-03-06 19:41:47     87
                      ... 
2013-03-07 14:20:21      1
2013-03-01 11:54:28      1
2013-03-02 03:39:04      1
2013-03-05 02:57:48      1
2013-03-05 19:53:28      1
Name: timestamp, Length: 334848, dtype: int64
165         3900
153         3854
164         3837
169         3771
160         3751
            ... 
4843625        1
579467         1
3626644        1
15992794       1
615997         1
Name: delta, Length: 808491, dtype: int64
bcH_    4202
h8n2    2527
cpBu    2417
ht1n    2396
g2Ev    2383
        ... 
iy7m       1
gGtD       1
g9rg       1
h6Lr       1
iTuP       1
Name: user_id, Length: 79694, dtype: int64
en    1479930
es    1007687
fr     552707
de     425433
it     237967
pt      92056
Name: learning_language, dtype: int64
en    2315850
es    1073889
pt     282884
it     123157
Name: ui_language, dtype: int64
827a8ecb89f9b59ac5c29b620a5d3ed6    36115
97e922f780d628eac638bea7a02bf496    28848
928787744a962cd4ec55c1b22cedc913    27224
b968b069e4e2c04848e9f8924e34c031    21842
a617ed646a251e339738ce62b84e61ce    20331
                                    ...  
4a8be4af945dbf28ff775b9ff933a5df        1
baae0bd8fddd341c208cb6aa04c237e8        1
f1f34a08001125f2337ed0a73f3a9f22        1
00e6f18dedcf8e59b7bf570beca9e80e        1
e370294ee020bc9db24865cac83a840e        1
Name: lexeme_id, Length: 16244, dtype: int64
a/a<det><ind><sg>                                 36115
is/be<vbser><pri><p3><sg>                         28848
eats/eat<vblex><pri><p3><sg>                      27224
we/prpers<prn><subj><p1><mf><pl>                  21842
are/be<vbser><pres>                               20331
                                                  ...  
telefonnummer/telefon<n>+nummer<n><f><sg><acc>        1
<*sf>/rumo<n><m><*numb>                               1
kanada/kanada<np><nt><sg><dat>                        1
ausstattung/ausstattung<n><f><sg><nom>                1
interior/interior<n><m><sg>                           1
Name: lexeme_string, Length: 15864, dtype: int64
3       498066
4       373188
2       339224
5       287130
6       240742
         ...  
3816         1
1399         1
3897         1
2550         1
3517         1
Name: history_seen, Length: 3784, dtype: int64
3       511967
2       416078
4       366865
5       278302
1       249961
         ...  
3602         1
3730         1
3675         1
1587         1
1855         1
Name: history_correct, Length: 3308, dtype: int64
1     2245672
2      781463
3      397726
4      191528
5       93618
6       41790
7       19199
8        9297
9        5441
10       3653
11       1895
12       1106
13        926
16        859
14        796
15        482
17        118
19        115
18         57
20         39
Name: session_seen, dtype: int64
1     2120333
2      737118
3      363098
0      266909
4      166656
5       75562
6       32588
7       14441
8        7370
9        4239
10       2604
11       1484
12       1029
13        786
14        602
15        434
16        332
17         90
18         51
19         40
20         14
Name: session_correct, dtype: int64
  • value_counts_dict = {} initializes an empty dictionary.
  • dl[column].value_counts() returns the counts of unique values in the column.

2. Data Preparation¶

Checking for Missing/Null Values¶

In [11]:
missing_value_count = dl.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count
Missing Values in Each Column:
Out[11]:
p_recall             0
timestamp            0
delta                0
user_id              0
learning_language    0
ui_language          0
lexeme_id            0
lexeme_string        0
history_seen         0
history_correct      0
session_seen         0
session_correct      0
dtype: int64
  • dl.isnull() represents whether the data contains null or not.
  • dl.isnull().sum() Provides the count of missing values for each column in the DataFrame dl.

Checking for Duplicate Values in the Dataset¶

In [12]:
duplicates = dl[dl.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 22
  • The dl.duplicated() function is used to identify duplicate rows in a DataFrame dl.

Dropping the Duplicate Values in the Dataset¶

In [13]:
dl= dl.drop_duplicates()
dl
Out[13]:
p_recall timestamp delta user_id learning_language ui_language lexeme_id lexeme_string history_seen history_correct session_seen session_correct
0 1.0 2013-03-03 17:13:47 1825254 5C7 fr en 3712581f1a9fbc0894e22664992663e9 sur/sur<pr> 2 1 2 2
1 1.0 2013-03-04 18:30:50 367 fWSx en es 0371d118c042c6b44ababe667bed2760 police/police<n><pl> 6 5 2 2
2 0.0 2013-03-03 18:35:44 1329 hL-s de en 5fa1f0fcc3b5d93b8617169e59884367 hat/haben<vbhaver><pri><p3><sg> 10 10 1 0
3 1.0 2013-03-07 17:56:03 156 h2_R es en 4d77de913dc3d65f1c9fac9d1c349684 en/en<pr> 111 99 4 4
4 1.0 2013-03-05 21:41:22 257 eON es en 35f14d06d95a34607d6abb0e52fc6d2b caballo/caballo<n><m><sg> 3 3 3 3
... ... ... ... ... ... ... ... ... ... ... ... ...
3795775 1.0 2013-03-06 23:06:48 4792 iZ7d es en 84e18e86c58e8e61d687dfa06b3aaa36 soy/ser<vbser><pri><p1><sg> 6 5 2 2
3795776 1.0 2013-03-07 22:49:23 1369 hxJr fr en f5b66d188d15ccb5d7777a59756e33ad chiens/chien<n><m><pl> 3 3 2 2
3795777 1.0 2013-03-06 21:20:18 615997 fZeR it en 91a6ab09aa0d2b944525a387cc509090 voi/voi<prn><tn><p2><mf><pl> 25 22 2 2
3795778 1.0 2013-03-07 07:54:24 289 g_D3 en es a617ed646a251e339738ce62b84e61ce are/be<vbser><pres> 32 30 2 2
3795779 1.0 2013-03-06 21:12:07 191 iiN7 pt en 4a93acdbafaa061fd69226cf686d7a2b café/café<n><m><sg> 3 3 1 1

3795758 rows × 12 columns

  • Duplicates rows can inflate statistics, mislead analysis.
  • So Duplicates have been dropped using dl.drop_duplicates() function

Checking the shape to ensure the drop of duplicates.¶

In [14]:
dl.shape
Out[14]:
(3795758, 12)
  • By dl.shape function, it ensures that duplicates hav been dropped.

Describing Non-Duplicated data.¶

In [15]:
dl.describe()
Out[15]:
p_recall delta history_seen history_correct session_seen session_correct
count 3.795758e+06 3.795758e+06 3.795758e+06 3.795758e+06 3.795758e+06 3.795758e+06
mean 8.964686e-01 7.055136e+05 2.197721e+01 1.949666e+01 1.808658e+00 1.636214e+00
std 2.711172e-01 2.211985e+06 1.283619e+02 1.136181e+02 1.350646e+00 1.309630e+00
min 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00
25% 1.000000e+00 5.180000e+02 3.000000e+00 3.000000e+00 1.000000e+00 1.000000e+00
50% 1.000000e+00 7.609500e+04 6.000000e+00 6.000000e+00 1.000000e+00 1.000000e+00
75% 1.000000e+00 4.346410e+05 1.500000e+01 1.300000e+01 2.000000e+00 2.000000e+00
max 1.000000e+00 3.964973e+07 1.344200e+04 1.281600e+04 2.000000e+01 2.000000e+01
  • dl.describe() function employed again for an non inflated statistics.

Checking Data Column types for Well Structured Analysis.¶

In [16]:
dl.dtypes
Out[16]:
p_recall             float64
timestamp             object
delta                  int64
user_id               object
learning_language     object
ui_language           object
lexeme_id             object
lexeme_string         object
history_seen           int64
history_correct        int64
session_seen           int64
session_correct        int64
dtype: object
  • For the further analysis, the data types for all the column should be appropriate for a better directed analysis.
  • Here its noticed that data type of timestamp is not appropriate as its states 'object' but it has to be in 'datetime'.

Changing Datatype of Inappropriate Columns.¶

In [17]:
pd.options.mode.chained_assignment = None
dl["timestamp"]= pd.to_datetime(dl["timestamp"], format = '%Y-%m-%d %H:%M:%S')
  • Pandas might issue a warning. Setting pd.options.mode.chained_assignment = None stops that warning from appearing.
  • pd.to_datetime() function that converts column dl["timestamp"] to datetime format.

Segregating Numerical Columns for Devicing Outliers.¶

In [18]:
# Select columns with numeric types (either float64 or int64)
numerical_columns = dl.select_dtypes(include=['float64', 'int64']).columns.tolist()
print(numerical_columns)
['p_recall', 'delta', 'history_seen', 'history_correct', 'session_seen', 'session_correct']
  • dl.select_dtypes(include=['float64', 'int64']) filters the DataFrame dl to select only the columns that are of type float64 or int64 which is numerical column.
  • .columns.tolist() converts the filtered columns into a list.

Devicing outliers from the DataFrame¶

In [19]:
outliers = {}
  • Initializing a Dictionary using outliers = {}.
In [20]:
for column in numerical_columns:
    Q1 = dl[column].quantile(0.10)
    Q3 = dl[column].quantile(0.90)
    IQR = Q3-Q1
    
    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q3 + 1.5*IQR
    
    outliers[column] = dl[column][(dl[column]<lower_bound) | (dl[column]>upper_bound)]
    print(outliers[column])
Series([], Name: p_recall, dtype: float64)
21          7688640
32          6915874
74          5092318
82          4855994
112         4664462
             ...   
3795641     9571218
3795644     4390588
3795693     4674316
3795718     4538280
3795763    13364296
Name: delta, Length: 149508, dtype: int64
3           111
20          535
73         2756
95          138
99          111
           ... 
3795548    2125
3795557     105
3795655    2019
3795686      95
3795701     359
Name: history_seen, Length: 132129, dtype: int64
3            99
20          510
73         2571
95          130
99          102
           ... 
3795548    1303
3795557     100
3795655    1855
3795686      86
3795701     333
Name: history_correct, Length: 130485, dtype: int64
113         7
157         9
196         9
232         9
263        14
           ..
3795563     7
3795597     7
3795655    13
3795731     8
3795766    20
Name: session_seen, Length: 43983, dtype: int64
113         7
157         9
196         9
232         8
263        12
           ..
3795391     8
3795597     7
3795655    13
3795731     7
3795766    20
Name: session_correct, Length: 33516, dtype: int64
  • The code loops through all numeric columns (numerical_columns) in the DataFrame dl.
  • For each column, it calculates the 10th and 90th percentiles to determine the Inter Quartile Range.
  • lower_bound = Q1 - 1.5*IQR,upper_bound = Q3 + 1.5*IQR calculates the lower and upper bounds.
  • Outliers[column] = dl[column][(dl[column]<lower_bound) | (dl[column]>upper_bound)] identifies the outliers which is any data points outside this range.

Handling the Outliers.¶

In [21]:
for col, data in outliers.items():
    print(f"{col}: {len(data)}")
p_recall: 0
delta: 149508
history_seen: 132129
history_correct: 130485
session_seen: 43983
session_correct: 33516
  • Using the looping through conditions through outliers.items().
  • len(data) employs the length of the outliers.

Data Distribution in Outliers.¶

In [22]:
for columns in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=dl[columns])
    plt.title('Boxplot for Numerical Columns')
    plt.xlabel(columns)
    plt.ylabel('Values')
    plt.xticks(rotation=45)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  • The above code is generating separate boxplots for each of the numerical columns in the numerical_columns list as it allows visually inspect the distribution and potential outliers in each column.

Insights on Keeping Outliers For Our Analysis.¶

  • Real-World Phenomena: Outliers often capture extreme but genuine behaviors or events, which may be valuable for business or learning insights.
  • Completeness: Removing rows reduces the size of the dataset and may introduce bias, especially if outliers represent unique user segments.
  • Statistical Relevance: Outliers can help test robustness in models and identify areas for further improvement or intervention.
  • Strategic Decisions: Insights from outliers may influence critical decisions, such as identifying highly engaged users, optimizing content, or addressing user retention issues.

Creating New columns for better Analysis¶

In [23]:
pd.options.mode.chained_assignment = None
language_mapping = {
    'en': 'English',
    'fr': 'French',
    'es': 'Spanish',
    'it': 'Italian',
    'de': 'German',
    'pt': 'Portuguese',
}
  • Pandas might issue a warning. Setting pd.options.mode.chained_assignment = None stops that warning from appearing.
  • This mapping allows you to convert codes to full names easily for more readable for analysis and visualization.

Create Learning_Language_Abb column¶

In [24]:
# Create a new column 'learning_language_full' based on the mapping
dl.loc[:, 'learning_language_Abb'] = dl['learning_language'].map(language_mapping)

# Display the updated DataFrame
print(dl[['learning_language', 'learning_language_Abb']])
        learning_language learning_language_Abb
0                      fr                French
1                      en               English
2                      de                German
3                      es               Spanish
4                      es               Spanish
...                   ...                   ...
3795775                es               Spanish
3795776                fr                French
3795777                it               Italian
3795778                en               English
3795779                pt            Portuguese

[3795758 rows x 2 columns]
  • dl.loc[:, 'learning_language_Abb'] method ensures that the column is created explicitly.
  • The .map() function applies the language_mapping dictionary to the learning_language column.

Create ui_Language_Abb column¶

In [25]:
# Create a new column 'learning_language_full' based on the mapping
dl.loc[:, 'ui_language_Abb'] = dl['ui_language'].map(language_mapping)

# Display the updated DataFrame
print(dl[['ui_language', 'ui_language_Abb']])
        ui_language ui_language_Abb
0                en         English
1                es         Spanish
2                en         English
3                en         English
4                en         English
...             ...             ...
3795775          en         English
3795776          en         English
3795777          en         English
3795778          es         Spanish
3795779          en         English

[3795758 rows x 2 columns]
  • dl.loc[:, 'ui_language_Abb'] method ensures that the column is created explicitly.
  • The .map() function applies the language_mapping dictionary to the ui_language column.

Extracting Column lexeme_base from lexeme_string¶

In [26]:
pd.options.mode.chained_assignment = None
dl['lexeme_base'] = dl['lexeme_string'].str.split('<', expand=True)[0]
dl['lexeme_base']
Out[26]:
0                  sur/sur
1            police/police
2                hat/haben
3                    en/en
4          caballo/caballo
                ...       
3795775            soy/ser
3795776       chiens/chien
3795777            voi/voi
3795778             are/be
3795779          café/café
Name: lexeme_base, Length: 3795758, dtype: object
  • lexeme_base is the column extracted from lexeme_string to know the exact word learned.
  • Setting pd.options.mode.chained_assignment = None disables the warning.
  • .str.split('<', expand=True)[0] splits each string in the lexeme_string column wherever the < character appears on the first occurence.

Extracting New column grammar_tag from lexeme_string¶

In [27]:
import re

# Updated grammar_tags for more precise matching patterns
grammar_categories = {
    'Pronouns and Related': ['pr', 'prn', 'preadv', 'predet', 'np', 'rel'],
    'Verbs': ['vbhaver', 'vbdo', 'vbser', 'vblex', 'vbmod', 'vaux', 'ord'],
    'Nouns': ['n', 'gen', 'apos'],
    'Determiners and Adjectives': ['det', 'adj', 'predet'],
    'Adverbs': ['adv'],
    'Interjections': ['ij', '@ij:'],  
    'Conjunctions': ['cnjcoo', 'cnjadv', 'cnjsub', '@cnj:'],  
    'Numbers and Quantifiers': ['num', 'ord'],
    'Negation': ['neg', '@neg:', '@common_phrases:', '@itg:'],  
    'Other': ['pprep', '@adv:', '@pr:', 'apos']  
}
  • import re module is essential for working with pattern-based string operations. By importing re, you unlock powerful tools to handle text data efficiently.
  • Initializing grammar_categories dictionary for more precise matching patterns.
In [28]:
# Reverse the dictionary to map tags to their respective grammar categories
tag_to_category_map = {tag: category for category, tags in grammar_categories.items() for tag in tags}
  • The above code reverse the dictionary grammar_categories to create a new dictionary, tag_to_category_map where the keys are individual tags with the corresponding values.
In [29]:
# Function to determine the grammar category based on prefixes
def grammar_tags(lexeme_string):
    # Extract tags inside '<>' and check if they start with any of the provided prefixes
    tags_in_string = re.findall(r'<(.*?)>', lexeme_string)  # Extract tags inside '<>'
    for tag in tags_in_string:
        for prefix in tag_to_category_map:
            if tag.startswith(prefix):  # Match based on prefix
                return tag_to_category_map[prefix]
    return 'Nan'  # Default if no match is found

# Applying the function to the DataFrame
dl['grammar_tag'] = dl['lexeme_string'].apply(lambda x: grammar_tags(x))

print(dl.head())
   p_recall           timestamp    delta user_id learning_language  \
0       1.0 2013-03-03 17:13:47  1825254     5C7                fr   
1       1.0 2013-03-04 18:30:50      367    fWSx                en   
2       0.0 2013-03-03 18:35:44     1329    hL-s                de   
3       1.0 2013-03-07 17:56:03      156    h2_R                es   
4       1.0 2013-03-05 21:41:22      257     eON                es   

  ui_language                         lexeme_id  \
0          en  3712581f1a9fbc0894e22664992663e9   
1          es  0371d118c042c6b44ababe667bed2760   
2          en  5fa1f0fcc3b5d93b8617169e59884367   
3          en  4d77de913dc3d65f1c9fac9d1c349684   
4          en  35f14d06d95a34607d6abb0e52fc6d2b   

                     lexeme_string  history_seen  history_correct  \
0                      sur/sur<pr>             2                1   
1             police/police<n><pl>             6                5   
2  hat/haben<vbhaver><pri><p3><sg>            10               10   
3                        en/en<pr>           111               99   
4        caballo/caballo<n><m><sg>             3                3   

   session_seen  session_correct learning_language_Abb ui_language_Abb  \
0             2                2                French         English   
1             2                2               English         Spanish   
2             1                0                German         English   
3             4                4               Spanish         English   
4             3                3               Spanish         English   

       lexeme_base           grammar_tag  
0          sur/sur  Pronouns and Related  
1    police/police                 Nouns  
2        hat/haben                 Verbs  
3            en/en  Pronouns and Related  
4  caballo/caballo                 Nouns  
  • re.findall(r'<(.*?)>', lexeme_string) extracts all substrings inside angle brackets (<>) from lexeme_string.
  • Each extracted tag is checked to see if it starts with any prefix in tag_to_category_map. If a match is found, return the corresponding category from tag_to_category_map and if no match is found, return 'Nan'.
  • Applying the function grammar_tags to the lexeme_stringcolumn to extract a new column named grammar_tag.
  • .head() validate the output by printing the first few rows of the DataFrame dl to confirm the new column has been added correctly.
In [30]:
dl['grammar_tag'].value_counts()
Out[30]:
Nouns                         1637545
Verbs                          856416
Determiners and Adjectives     606273
Pronouns and Related           435685
Adverbs                        133575
Conjunctions                    66891
Interjections                   45956
Other                            6821
Negation                         6585
Nan                                 9
Numbers and Quantifiers             2
Name: grammar_tag, dtype: int64
  • value_counts() calculates the frequency distribution of unique values in the column grammar_tag from the DataFrame dl.

Extracting New column gender_tag from lexeme_string¶

In [31]:
gender_tags = {
         "Masculine": "m",
         "Feminine": "f",
         "Neuter": "nt",
         "Masculine or Feminine (common gender)": "mf" 
    }
  • Initializing gender_tags dictionary for more precise matching patterns.
In [32]:
def get_gender_tag_key(lexeme_string):
    for key, tags in gender_tags.items():
        if any(f"<{tag}>" in lexeme_string for tag in tags):
            return key
    return np.nan  # Return NaN if no tags are found

# Apply the function to create the new column
dl['gender_tag'] = dl['lexeme_string'].apply(get_gender_tag_key)
  • get_gender_tag_key function takes lexeme_string as input and returns the key which is a matching tag in the string.
  • The function checks if f"<{tag}>" ie., any tag in the tags exists in lexeme_string. The any() function ensures that even if one tag matches, it returns key.
  • Then applying the function to the lexeme_string column while returning in the new column named gender_tag.
In [33]:
dl['gender_tag'].value_counts()
Out[33]:
Neuter       783971
Masculine    696207
Feminine     546620
Name: gender_tag, dtype: int64
  • value_counts() calculates the frequency distribution of unique values in the column gender_tag from the DataFrame dl.

Extracting New column plurality_tag from lexeme_string¶

In [34]:
plurality_tags = {
    "Singular":{"sg": "Singular"},
    "Plural":{"pl": "Plural"}
}
  • Initializing gender_tags dictionary for more precise matching patterns.
In [35]:
def get_plurality_tags_key(lexeme_string):
    for key, tags in plurality_tags.items():
        if any(f"<{tag}>" in lexeme_string for tag in tags):
            return key
    return np.nan  # Return NaN if no tags are found

# Apply the function to create the new column
dl['plurality_tag'] = dl['lexeme_string'].apply(get_plurality_tags_key)
  • get_plurality_tag_key function takes lexeme_string as input and returns the key which is a matching tag in the string.
  • The function checks if f"<{tag}>" ie., any tag in the tags exists in lexeme_string. The any() function ensures that even if one tag matches, it returns key.
  • Then applying the function to the lexeme_string column while returning in the new column named plurality_tag.
In [36]:
dl['plurality_tag'].value_counts()
Out[36]:
Singular    2233949
Plural       667294
Name: plurality_tag, dtype: int64
  • value_counts() calculates the frequency distribution of unique values in the column gender_tag from the DataFrame dl.

Initializing New Column delta_days from delta column¶

In [37]:
pd.options.mode.chained_assignment = None
dl['delta_days'] = dl['delta'] / 86400
dl['delta_days'] = dl['delta_days'].round(3)
dl['delta_days']
Out[37]:
0          21.126
1           0.004
2           0.015
3           0.002
4           0.003
            ...  
3795775     0.055
3795776     0.016
3795777     7.130
3795778     0.003
3795779     0.002
Name: delta_days, Length: 3795758, dtype: float64
  • Pandas might issue a warning. Setting pd.options.mode.chained_assignment = None stops that warning from appearing.
  • As it already stated that delta is time (in seconds) since the last lesson or practice and it is converted to days since the last session as its looks clean for the analysis.

Extracting Time from Timestamp column¶

In [38]:
dl['time'] = dl['timestamp'].dt.time
dl['time']
Out[38]:
0          17:13:47
1          18:30:50
2          18:35:44
3          17:56:03
4          21:41:22
             ...   
3795775    23:06:48
3795776    22:49:23
3795777    21:20:18
3795778    07:54:24
3795779    21:12:07
Name: time, Length: 3795758, dtype: object
  • dl['timestamp'].dt.time extracts the time portion (hour, minute, second) from the timestamp column.

Categorizing the delta_ways into a column delta_days_category¶

In [39]:
#defing a function to apply for the column
def categorize_delta_days(value):
    if value == 0:
        return 'Zero'
    elif value > 0 and value <= 1:
        return 'Less than a day'
    elif value > 1 and value <= 7:
        return 'Within a week'
    elif value > 7 and value <= 30:
        return 'Within a month'
    else:
        return 'Over a month'
  • The categorize_delta_days function is used to categorize delta_days into predefined categories based on the time difference.
In [40]:
dl['delta_days_category'] = [categorize_delta_days(x) for x in dl['delta_days']]
  • This is a list comprehension that loops through each value x in the delta_days column of the dl DataFrame.

Deriving success_rate_history from two columns¶

In [41]:
dl['success_rate_history'] = dl['history_correct'] / dl['history_seen']
dl['success_rate_history']
Out[41]:
0          0.500000
1          0.833333
2          1.000000
3          0.891892
4          1.000000
             ...   
3795775    0.833333
3795776    1.000000
3795777    0.880000
3795778    0.937500
3795779    1.000000
Name: success_rate_history, Length: 3795758, dtype: float64
  • success_rate_history calculates the success rate for each record by dividing history_correct by history_seen.

Extracting column hour from time column¶

In [42]:
dl['time'] = pd.to_datetime(dl['time'], errors= 'coerce', format='%H:%M:%S').dt.time
dl['time']
Out[42]:
0          17:13:47
1          18:30:50
2          18:35:44
3          17:56:03
4          21:41:22
             ...   
3795775    23:06:48
3795776    22:49:23
3795777    21:20:18
3795778    07:54:24
3795779    21:12:07
Name: time, Length: 3795758, dtype: object
  • The above code converts the time column in the DataFrame dl into a proper datetime format, handling errors by coercing. It expects the time values in the format HH:MM:SS.
In [43]:
dl['time_d'] = pd.to_datetime(dl['time'], errors= 'coerce', format='%H:%M:%S')
dl['hour'] = dl['time_d'].dt.hour
dl['hour']
Out[43]:
0          17
1          18
2          18
3          17
4          21
           ..
3795775    23
3795776    22
3795777    21
3795778     7
3795779    21
Name: hour, Length: 3795758, dtype: int64
  • dl['time'].dt.hour creates a new column hour that stores the hour (from 0 to 23) of the datetime values in the time column.

DataFrame dl¶

In [44]:
dl
Out[44]:
p_recall timestamp delta user_id learning_language ui_language lexeme_id lexeme_string history_seen history_correct ... lexeme_base grammar_tag gender_tag plurality_tag delta_days time delta_days_category success_rate_history time_d hour
0 1.0 2013-03-03 17:13:47 1825254 5C7 fr en 3712581f1a9fbc0894e22664992663e9 sur/sur<pr> 2 1 ... sur/sur Pronouns and Related NaN NaN 21.126 17:13:47 Within a month 0.500000 1900-01-01 17:13:47 17
1 1.0 2013-03-04 18:30:50 367 fWSx en es 0371d118c042c6b44ababe667bed2760 police/police<n><pl> 6 5 ... police/police Nouns Neuter Plural 0.004 18:30:50 Less than a day 0.833333 1900-01-01 18:30:50 18
2 0.0 2013-03-03 18:35:44 1329 hL-s de en 5fa1f0fcc3b5d93b8617169e59884367 hat/haben<vbhaver><pri><p3><sg> 10 10 ... hat/haben Verbs NaN Singular 0.015 18:35:44 Less than a day 1.000000 1900-01-01 18:35:44 18
3 1.0 2013-03-07 17:56:03 156 h2_R es en 4d77de913dc3d65f1c9fac9d1c349684 en/en<pr> 111 99 ... en/en Pronouns and Related NaN NaN 0.002 17:56:03 Less than a day 0.891892 1900-01-01 17:56:03 17
4 1.0 2013-03-05 21:41:22 257 eON es en 35f14d06d95a34607d6abb0e52fc6d2b caballo/caballo<n><m><sg> 3 3 ... caballo/caballo Nouns Masculine Singular 0.003 21:41:22 Less than a day 1.000000 1900-01-01 21:41:22 21
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3795775 1.0 2013-03-06 23:06:48 4792 iZ7d es en 84e18e86c58e8e61d687dfa06b3aaa36 soy/ser<vbser><pri><p1><sg> 6 5 ... soy/ser Verbs NaN Singular 0.055 23:06:48 Less than a day 0.833333 1900-01-01 23:06:48 23
3795776 1.0 2013-03-07 22:49:23 1369 hxJr fr en f5b66d188d15ccb5d7777a59756e33ad chiens/chien<n><m><pl> 3 3 ... chiens/chien Nouns Masculine Plural 0.016 22:49:23 Less than a day 1.000000 1900-01-01 22:49:23 22
3795777 1.0 2013-03-06 21:20:18 615997 fZeR it en 91a6ab09aa0d2b944525a387cc509090 voi/voi<prn><tn><p2><mf><pl> 25 22 ... voi/voi Pronouns and Related NaN Plural 7.130 21:20:18 Within a month 0.880000 1900-01-01 21:20:18 21
3795778 1.0 2013-03-07 07:54:24 289 g_D3 en es a617ed646a251e339738ce62b84e61ce are/be<vbser><pres> 32 30 ... are/be Verbs NaN NaN 0.003 07:54:24 Less than a day 0.937500 1900-01-01 07:54:24 7
3795779 1.0 2013-03-06 21:12:07 191 iiN7 pt en 4a93acdbafaa061fd69226cf686d7a2b café/café<n><m><sg> 3 3 ... café/café Nouns Masculine Singular 0.002 21:12:07 Less than a day 1.000000 1900-01-01 21:12:07 21

3795758 rows × 24 columns

  • Above DataFrame dl is the well organized and ready to Analyse Data for EDA

3. Exploratory Data Analysis (EDA)¶

Performing Statistical Analysis for some of the Derived columns¶

Statistical Analysis of success_rate_history¶

In [45]:
mean_success_rate_history = dl['success_rate_history'].mean()
median_success_rate_history = dl['success_rate_history'].median()
max_success_rate_history = dl['success_rate_history'].max()
min_success_rate_history = dl['success_rate_history'].min()

print(f"Mean Success Rate for History: {mean_success_rate_history:.3f}")
print(f"Median Success Rate for History: {median_success_rate_history:.3f}")
print(f"Max Success Rate for History: {max_success_rate_history}")
print(f"Min Success Rate for History: {min_success_rate_history}")
Mean Success Rate for History: 0.901
Median Success Rate for History: 0.963
Max Success Rate for History: 1.0
Min Success Rate for History: 0.05
  • The above code aims to analyze the performance trends by summarizing key statistics of the success_rate_history column.

Statistical Analysis of p_recall¶

In [46]:
mean_recall_rate_session = dl['p_recall'].mean()
median_recall_rate_session = dl['p_recall'].median()
max_recall_rate_session = dl['p_recall'].max()
min_recall_rate_session = dl['p_recall'].min()

print(f"Mean Recall Rate for Session: {mean_recall_rate_session:.3f}")
print(f"Median Recall Rate for Session: {median_recall_rate_session:.3f}")
print(f"Max Recall Rate for Session: {max_recall_rate_session}")
print(f"Min Recall Rate for Session: {min_recall_rate_session}")
Mean Recall Rate for Session: 0.896
Median Recall Rate for Session: 1.000
Max Recall Rate for Session: 1.0
Min Recall Rate for Session: 0.0
  • The above code aims to analyze the performance trends by summarizing key statistics of the p_recall column.

Range of Columns¶

In [47]:
p_recall_range = dl["p_recall"].max()-dl["p_recall"].min()
delta_range = dl["delta"].max()-dl["delta"].min()
history_seen_range = dl["history_seen"].max()-dl["history_seen"].min()
history_correct_range = dl["history_correct"].max()-dl["history_correct"].min()
session_seen_range = dl["session_seen"].max()-dl["session_seen"].min()
session_correct_range = dl["session_correct"].max()-dl["session_correct"].min()

print(f"The Range of p_recall: {p_recall_range:.3f}")
print(f"The Range of delta: {delta_range:.3f}")
print(f"The Range of history_seen: {history_seen_range}")
print(f"The Range of history_correct: {history_correct_range}")
print(f"The Range of session_seen: {session_seen_range}")
print(f"The Range of session_correct: {session_correct_range}")
The Range of p_recall: 1.000
The Range of delta: 39649729.000
The Range of history_seen: 13441
The Range of history_correct: 12815
The Range of session_seen: 19
The Range of session_correct: 20

Obtain the unique lexeme ids of all learning_language¶

In [48]:
# Group by the 'learning_language' column and get unique lexeme_ids for each language
unique_lexemes_per_lang = dl.groupby('learning_language_Abb')['lexeme_id'].nunique().reset_index()

# Rename the column for better readability
unique_lexemes_per_lang.rename(columns={'lexeme_id': 'Unique Lexeme IDs'}, inplace=True)

# Display the result
print(unique_lexemes_per_lang)
  learning_language_Abb  Unique Lexeme IDs
0               English               2740
1                French               3429
2                German               3218
3               Italian               1750
4            Portuguese               2055
5               Spanish               3052
  • dl.groupby('learning_language_Abb')['lexeme_id'].nunique().reset_index() groups the dataset (dl) by the values in the learning_language_Abb column where each unique value in this column forms a group. Moreover, the nunique() method calculates the number of unique lexeme_id values. This gives the count of distinct lexemes associated with each language.
  • unique_lexemes_per_lang column that stores the count of unique lexeme IDs is renamed from 'lexeme_id' to 'Unique Lexeme IDs' to make the DataFrame easier to understand.

Obtain the unique lexeme strings of all learning_language¶

In [49]:
# Group by the 'learning_language' column and get unique lexeme_ids for each language
unique_words_per_lang = dl.groupby('learning_language_Abb')['lexeme_string'].nunique().reset_index()

# Rename the column for better readability
unique_words_per_lang.rename(columns={'lexeme_base': 'Unique words'}, inplace=True)

# Display the result
print(unique_words_per_lang)
  learning_language_Abb  lexeme_string
0               English           2740
1                French           3429
2                German           3218
3               Italian           1750
4            Portuguese           2055
5               Spanish           3052
  • dl.groupby('learning_language_Abb')['lexeme_string'].nunique().reset_index() groups the dataset (dl) by the values in the learning_language_Abb column where each unique value in this column forms a group. Moreover, the nunique() method calculates the number of unique lexeme_strings values. This gives the count of distinct lexemes associated with each language.
  • unique_words_per_lang column that stores the count of unique lexeme IDs is renamed from 'lexeme_base' to 'Unique words' to make the DataFrame easier to understand.

Hypothesis 1:- Analyzing user engagement for each user aims to identify patterns, expecting higher engagement levels among a subset of users reflecting varying learning behaviors.¶

Computing User engagement¶

In [50]:
user_engagement = dl.groupby("user_id")["history_seen"].sum().reset_index()
user_engagement = user_engagement.sort_values(by=["history_seen"], ascending=[ False])
user_engagement
Out[50]:
user_id history_seen
3328 bcH_ 3787808
27016 goA 3479726
5998 cpBu 3263073
7448 dOig 1266491
1223 NPs 903094
... ... ...
17912 fkZR 1
318 6XJ 1
40813 hj3 1
25380 g_k- 1
29632 h2Xz 1

79694 rows × 2 columns

  • dl.groupby("user_id")["history_seen"].sum().reset_index() function aggregates the history_seen values for each user_id which sums up all the history_seen values for each unique user.
  • The resultant is sorted by Descending values.

Computing User Engagement on Recall¶

In [51]:
# Correlation between user engagement (history_seen) and learning success (p_recall)
user_engagement_recall = dl.groupby('user_id')['p_recall'].mean().reset_index()
user_engagement_recall = user_engagement_recall.sort_values(by=['p_recall'], ascending=[ False])
user_engagement_recall
Out[51]:
user_id p_recall
12841 exUs 1.0
17986 flbK 1.0
37145 hUni 1.0
17985 flax 1.0
71594 ijPr 1.0
... ... ...
60390 iRTg 0.0
14364 fAzQ 0.0
48288 iB2r 0.0
48361 iBFg 0.0
78581 k0S 0.0

79694 rows × 2 columns

  • dl.groupby('user_id')['p_recall'].mean().reset_index() function aggregates the p_recall values for each user_id which performs the mean of p_recall values for each unique user.
  • The resultant is sorted by Descending values.

Correlation between user engagement and learning success¶

In [52]:
engagement_success_corr = user_engagement.merge(user_engagement_recall, on='user_id')
engagement_success_corr_corr = engagement_success_corr['history_seen'].corr(engagement_success_corr['p_recall'])
print(f"Correlation between user engagement and learning success: {engagement_success_corr_corr:.2f}")
Correlation between user engagement and learning success: -0.01
  • The method .corr() is used to compute the Pearson correlation coefficient between two columns in the engagement_success_corr DataFrame.
  • The Pearson correlation measures the linear relationship between two variables:
  • A value close to 1 indicates a strong positive correlation (as one increases, the other also increases).
  • A value close to -1 indicates a strong negative correlation (as one increases, the other decreases).
  • A value close to 0 indicates no linear correlation.

Most Engaged User in the recorded time¶

In [53]:
# Find the user with the highest engagement
highest_engagement_user = user_engagement.loc[user_engagement["history_seen"].idxmax()]

highest_engagement_user 
Out[53]:
user_id            bcH_
history_seen    3787808
Name: 3328, dtype: object
In [113]:
# Filter the dataset for the highest engaged user
user_data = dl[dl["user_id"] == highest_engagement_user["user_id"]]

# Calculate and display the mean and median of p_recall
mean_p_recall = user_data["p_recall"].mean()
median_p_recall = user_data["p_recall"].median()

print(f"Mean p_recall for highest engaged user: {mean_p_recall}, Median p_recall for highest engaged user: {median_p_recall}")
Mean p_recall for highest engaged user: 0.46417843355016153, Median p_recall for highest engaged user: 0.5

Analysis:¶

  1. Correlation Insight:

    • The correlation between user engagement (history_seen) and learning success (p_recall) is -0.01, indicating virtually no linear relationship between the two. This suggests that higher engagement doesn't necessarily translate to better recall rates, and vice versa.
  2. User with Highest Engagement:

    • The user with the highest engagement is bcH_, with 3,787,808 history_seen. However, the recall performance of this user is not explicitly shown, which could provide further insights into the quality of their engagement.
  3. Engagement and Success Variability:

    • While some users exhibit high engagement, their recall rates vary significantly. Similarly, users with perfect recall (p_recall = 1.0) may not necessarily have high engagement, highlighting potential inconsistencies in the relationship between the quantity of interaction and learning outcomes.

Key Insights:¶

  1. Engagement Alone is Not Enough: High engagement does not guarantee learning success, as seen from the weak correlation.

  2. Targeted Improvement Required: Users with moderate or low recall but high engagement could benefit from tailored learning content to improve their effectiveness.

  3. Outlier Behavior: Certain users with high recall rates but low engagement indicate the potential for efficient learning strategies, which could be analyzed further to design better learning paths.


Recommendations:¶

  1. Personalized Feedback: Focus on providing feedback to highly engaged users with low recall rates to enhance their learning efficiency.

  2. Optimized Content Delivery: Investigate the methods of users with high recall but low engagement to replicate and design efficient learning strategies for others.

  3. Behavioral Analysis: Conduct a deeper analysis of the user with the highest engagement (bcH_) to understand whether their engagement is productive or repetitive and optimize their learning experience.

  4. Incentivize Quality Engagement: Create programs or gamified experiences encouraging not just time spent but effective learning practices to balance engagement and success.

Hypothesis 2:- The analysis aims to identify patterns in user engagement and recall, expecting higher engagement with easier lexemes. This reflects varying learning behaviors and the impact of different materials on user success.¶

Extract lexeme_performance¶

In [55]:
# Group by lexeme_id and calculate mean recall rate
lexeme_performance = dl.groupby('lexeme_id')['p_recall'].agg(['mean', 'median']).round(2)
lexeme_performance
Out[55]:
mean median
lexeme_id
00022efc4121667defd065c88569e748 0.90 1.0
00064bf8c1c3cefa80b193acf7b9fe1d 0.90 1.0
000d2eb6a5658fa17f828c5fb0c66c11 0.95 1.0
000f3063358c188d171d903ec5a7855c 0.85 1.0
001635bcb24496cf2b27731c3708dbfa 0.94 1.0
... ... ...
ffeacb268a19c068cd8171938e5280a8 1.00 1.0
ffedebe922588f522094dd8eac320071 1.00 1.0
ffee4a0570d4eacc9c08f339c2bb11a7 1.00 1.0
fff70e9352d896105563156d7023d878 0.81 1.0
fff799d4d95e416db9dc07ca717b4ef9 0.91 1.0

16244 rows × 2 columns

  • dl.groupby('lexeme_id')['p_recall'].agg(['mean', 'median']).round(2) represents the mean and median for p_recall which groups the data in the DataFrame dl by the lexeme_id column.
  • Each group corresponds to all rows in the dataset associated with a specific lexeme.

Extract lexeme_success¶

In [56]:
# Group by lexeme_id and calculate success rate
lexeme_success = dl.groupby('lexeme_id')['success_rate_history'].agg(['mean', 'median']).round(2)
lexeme_success
Out[56]:
mean median
lexeme_id
00022efc4121667defd065c88569e748 1.00 1.00
00064bf8c1c3cefa80b193acf7b9fe1d 0.86 0.94
000d2eb6a5658fa17f828c5fb0c66c11 0.98 1.00
000f3063358c188d171d903ec5a7855c 0.85 0.88
001635bcb24496cf2b27731c3708dbfa 0.93 1.00
... ... ...
ffeacb268a19c068cd8171938e5280a8 1.00 1.00
ffedebe922588f522094dd8eac320071 1.00 1.00
ffee4a0570d4eacc9c08f339c2bb11a7 1.00 1.00
fff70e9352d896105563156d7023d878 0.92 1.00
fff799d4d95e416db9dc07ca717b4ef9 0.94 1.00

16244 rows × 2 columns

  • dl.groupby('lexeme_id')['success_rate_history'].agg(['mean', 'median']).round(2) represents the mean and median for success_rate_history which groups the data in the DataFrame dl by the lexeme_id column.
  • Each group corresponds to all rows in the dataset associated with a specific lexeme.

Lexeme Difficulty (Based on Recall)¶

In [57]:
# Calculate correlation between lexeme_id and recall rate
lexeme_difficulty_corr = dl.groupby('lexeme_id').agg({'p_recall': 'mean'}).reset_index()
lexeme_difficulty_corr
Out[57]:
lexeme_id p_recall
0 00022efc4121667defd065c88569e748 0.904762
1 00064bf8c1c3cefa80b193acf7b9fe1d 0.900000
2 000d2eb6a5658fa17f828c5fb0c66c11 0.952830
3 000f3063358c188d171d903ec5a7855c 0.849481
4 001635bcb24496cf2b27731c3708dbfa 0.940810
... ... ...
16239 ffeacb268a19c068cd8171938e5280a8 1.000000
16240 ffedebe922588f522094dd8eac320071 1.000000
16241 ffee4a0570d4eacc9c08f339c2bb11a7 1.000000
16242 fff70e9352d896105563156d7023d878 0.808757
16243 fff799d4d95e416db9dc07ca717b4ef9 0.910078

16244 rows × 2 columns

  • dl.groupby('lexeme_id').agg({'p_recall': 'mean'}).reset_index() aggregates the mean of p_recall which grouped by 'lexeme_id' in the DataFrame dl.
  • Each group corresponds to all rows in the dataset associated with a specific lexeme.

Analysis¶

  1. Recall Rates Across Lexemes:

    • The average recall rates for lexemes vary significantly, with some lexemes achieving perfect recall (1.0), while others have lower averages.
    • Median recall rates for most lexemes are high, often 1.0, suggesting that learners tend to perform well on individual lexemes when averaged over all users.
  2. Success Rates Across Lexemes:

    • Similar to recall, success rates for lexemes also show variability. Some lexemes have perfect success rates (1.0 mean and median), while others are lower, indicating that certain lexemes are harder for users to master.
  3. Correlation Insights:

    • The lexeme_difficulty_corr calculation indicates variability in recall performance across lexemes, reflecting differences in difficulty or contextual use.

Key Insights¶

  1. Lexeme Difficulty Varies:

    • Some lexemes are inherently harder for users to recall and master, as evidenced by lower mean recall and success rates.
  2. High Success Rates with Variability:

    • While the median success rate for many lexemes is 1.0, the variability in the mean suggests some users face challenges with specific lexemes.
  3. Consistent Performance in Easy Lexemes:

    • Lexemes with high recall and success rates likely represent concepts or words that are easier or more frequently encountered in practice.

Recommendations¶

  1. Target Difficult Lexemes:

    • Identify lexemes with lower mean recall and success rates and introduce additional practice or review sessions specifically focused on these lexemes.
  2. Contextual Learning:

    • Provide contextual examples or mnemonics for harder lexemes to enhance retention and recall rates.
  3. Adaptive Learning:

    • Use the data on lexeme difficulty to adapt learning paths, offering more repetition and practice for harder lexemes while reducing redundancy for easier ones.
  4. Track Lexeme Mastery Over Time:

    • Continuously monitor lexeme performance to understand trends and adjust teaching methods dynamically, ensuring improvement in both recall and success rates.

Hypothesis 3:- The analysis explores how UI languages influence both recall rates and success rates, hypothesizing that certain languages may lead to better or worse performance in learning. It expects differences in recall and success rates based on the user's preferred UI language.¶

Aggregating Mean and Median on success_rate_history for ui_language_Abb¶

In [58]:
ui_language_stats_history_success_rate = dl.groupby('ui_language_Abb')['success_rate_history'].agg(['mean', 'median'])
ui_language_stats_history_success_rate = ui_language_stats_history_success_rate.reset_index()
ui_language_stats_history_success_rate
Out[58]:
ui_language_Abb mean median
0 English 0.898451 0.988764
1 Italian 0.910783 0.974359
2 Portuguese 0.905714 0.954545
3 Spanish 0.903697 0.948718
  • The code computes the mean and median success rates for the success_rate_history column, grouped by each language in the ui_language_Abb column.
  • This provides insights into how learners in different languages perform on average and at the median level. It also creates a clean DataFrame for further analysis or visualization.

Aggregating Mean and Median on p_recall for ui_language_Abb¶

In [59]:
ui_language_stats_recall_rate = dl.groupby('ui_language_Abb')['p_recall'].agg(['mean', 'median'])
ui_language_stats_recall_rate = ui_language_stats_recall_rate.reset_index()
ui_language_stats_recall_rate
Out[59]:
ui_language_Abb mean median
0 English 0.894915 1.0
1 Italian 0.908159 1.0
2 Portuguese 0.897696 1.0
3 Spanish 0.898156 1.0
  • The code computes the mean and median success rates for the p_recall column, grouped by each language in the ui_language_Abb column.
  • This provides insights into how learners in different languages perform on average and at the median level. It also creates a clean DataFrame for further analysis or visualization.

Analysis¶

  1. Success Rate by UI Language:

    • The mean success rate across UI languages is consistently high, ranging between 0.898 (English) and 0.911 (Italian).
    • Median success rates are even higher, with most UI languages achieving close to or above 0.95, indicating that users tend to achieve high success rates during their learning sessions regardless of the UI language.
  2. Recall Rate by UI Language:

    • The mean recall rates also show a similar trend, ranging from 0.895 (English) to 0.908 (Italian).
    • The median recall rate is perfect (1.0) across all UI languages, suggesting that users often recall individual lexemes correctly when the data is aggregated.
  3. Comparison Across Languages:

    • Italian outperforms other UI languages slightly in both success and recall rates, with the highest mean values in both metrics.
    • Other languages (English, Portuguese, and Spanish) follow closely, with minimal variation, suggesting uniform learning effectiveness.

Key Insights¶

  1. Uniform Performance Across Languages:

    • Users perform consistently across different UI languages, with only slight differences in mean success and recall rates.
  2. Italian Leads in Engagement:

    • Italian users achieve the highest mean recall and success rates, indicating a potentially more engaging or user-friendly interface, or a more motivated user base.
  3. Perfect Median Recall Rates:

    • The perfect median recall rate (1.0) across all UI languages highlights effective learning mechanics, ensuring most users can recall lexemes correctly.

Recommendations¶

  1. Deep Dive into Italian Performance:

    • Investigate why Italian users are achieving slightly higher performance. This could inform UI/UX improvements for other languages.
  2. Standardize UI Features:

    • Apply any positive features or feedback from high-performing languages (like Italian) to other UI languages to further enhance overall performance.
  3. Leverage Median Performance:

    • Since most users achieve a perfect recall rate (median), focus on uplifting the mean by targeting specific user segments or lexemes with lower success rates.
  4. Localized Support and Resources:

    • Offer more tailored support or resources for users of specific UI languages, especially if certain groups exhibit higher variability in their learning outcomes.

Hypothesis 4:- Gender and grammatical features, such as plurality, may influence learning performance. It suggests that gender and grammatical differences (e.g., singular vs plural) will show varying impacts on recall and success rates across different groups.¶

Deriving mean and median on p_recall for gender_tag¶

In [60]:
# Group by gender_tag and calculate mean recall rate for each gender
gender_performance = dl.groupby('gender_tag')['p_recall'].agg(['mean', 'median'])
gender_performance
Out[60]:
mean median
gender_tag
Feminine 0.896025 1.0
Masculine 0.899669 1.0
Neuter 0.907303 1.0
  • dl.groupby('gender_tag')['p_recall'].agg(['mean', 'median']) computes the mean and median success rates for the p_recall column, grouped by each language in the gender_tag column.

Deriving mean and median on success_rate_history for gender_tag¶

In [61]:
# Group by gender_tag and calculate success rate
gender_success = dl.groupby('gender_tag')['success_rate_history'].agg(['mean', 'median'])
gender_success
Out[61]:
mean median
gender_tag
Feminine 0.897845 1.0
Masculine 0.899576 1.0
Neuter 0.911196 1.0
  • dl.groupby('gender_tag')['success_rate_history'].agg(['mean', 'median']) computes the mean and median success rates for the success_rate_history column, grouped by each language in the gender_tag column.

Aggregation on p_recall revolves through 'gender_tag' & 'plurality_tag'¶

In [62]:
# Compare performance across grammatical features (plurality_tag) and gender
gender_plurality_comparison = dl.groupby(['gender_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median'])
gender_plurality_comparison
Out[62]:
mean median
gender_tag plurality_tag
Feminine Plural 0.893904 1.0
Singular 0.896369 1.0
Masculine Plural 0.902440 1.0
Singular 0.899193 1.0
Neuter Plural 0.901168 1.0
Singular 0.908342 1.0
  • The code computes the mean and median success rates for the p_recall column, grouped by each language in the gender_tag & plurality_tag column.

Analysis¶

1. Recall Rate by Gender:¶

  • Mean recall rates are very close across genders: Feminine (0.896), Masculine (0.900), and Neuter (0.907).
  • The median recall rate is 1.0 for all genders, indicating that most users recall items accurately regardless of gender-tagged lexemes.

2. Success Rate by Gender:¶

  • Mean success rates follow a similar pattern: Feminine (0.898), Masculine (0.900), and Neuter (0.911).
  • The median success rate is also 1.0 across all genders, highlighting consistent high performance for all gender categories.

3. Gender and Plurality Performance:¶

  • Neuter Singular lexemes show the highest mean recall rate (0.908) among all gender and plurality combinations.
  • Feminine Plural lexemes have the lowest mean recall rate (0.894) but still maintain a perfect median recall rate.
  • Plural lexemes generally have slightly lower mean recall rates compared to their singular counterparts across all genders.

Key Insights¶

  1. Consistency Across Genders:

    • Performance in terms of recall and success rates is uniform across gender tags, with only minor differences.
  2. Neuter Gender Leads:

    • Lexemes tagged as Neuter show slightly higher performance in both recall and success rates, especially in singular form.
  3. Plurality Impact:

    • Plural lexemes tend to have marginally lower mean recall rates than singular lexemes, suggesting users may find plural forms slightly more challenging.
  4. High Median Performance:

    • The median recall and success rates of 1.0 across all gender and plurality combinations highlight effective learning for most users.

Recommendations¶

  1. Address Plural Lexemes:

    • Develop targeted exercises or materials to improve the recall of plural lexemes, especially those tagged as Feminine Plural.
  2. Leverage Neuter Lexeme Strengths:

    • Analyze why Neuter Singular lexemes perform better to replicate this success in other categories.
  3. Monitor Low Variance Segments:

    • While overall performance is high, identify specific lexemes or users with lower scores in the Feminine and Plural categories for tailored support.
  4. Balanced Lexeme Practice:

    • Ensure learning paths include an equal mix of gender and plurality-tagged lexemes to avoid bias or under-representation of any category.
  5. Advanced Insights:

    • Investigate if cultural or linguistic biases in learning gendered lexemes influence performance, and adapt the content accordingly.

Hypothesis 5 :- Longer median delta days indicate lower consistency, suggesting sporadic learning behavior. Shorter median delta days reflect higher consistency, implying sustained engagement and better performance.¶

Deriving Median Delta Days for each user¶

In [63]:
user_id_consistency =  dl.groupby('user_id')['delta_days'].median().reset_index()
user_id_consistency = user_id_consistency.sort_values(by = 'delta_days', ascending = False)

user_id_consistency.rename(columns={'user_id': 'User ID', 'delta_days': 'Median Delta Days'}, inplace=True)

print(user_id_consistency)
      User ID  Median Delta Days
792        GB           458.9090
1576       TX           444.9910
1842       _2           423.2820
2439      bEO           411.3305
4197      bzm           409.2480
...       ...                ...
24529    gVkB             0.0000
59174    iQCu             0.0000
45411    i1sv             0.0000
17451    fefQ             0.0000
17571    fg7z             0.0000

[79694 rows x 2 columns]

-dl.groupby('user_id')['delta_days'].median().reset_index() delves the median of delta_days which groups over user_id.

Keeping Threshold for consistency¶

In [64]:
user_id_consistency['Median Delta Days'].describe()
Out[64]:
count    79694.000000
mean        12.856487
std         34.171674
min          0.000000
25%          0.039000
50%          2.026000
75%          8.973000
max        458.909000
Name: Median Delta Days, dtype: float64
In [65]:
quantile_90 = user_id_consistency['Median Delta Days'].quantile(0.90)

# Display the 90th quantile
print(f"The 90th percentile (quantile) for Median Delta Days is: {quantile_90:.2f}")
The 90th percentile (quantile) for Median Delta Days is: 30.13
  • By using .describe() function the viewer can understand the how the values of Median Delta Days are distributed through out the data.
  • Since the data is highly deviated from 75% to max. It can be better to use the the 90% quantile to determine the level for Less consistent user.
  • So it is derived that if Median Delta Days greater than 30 i.e., more than a month or odd the user is considered as the less consistent.
  • The consistency is considered as moderate if Median Delta Days is between 8 to 30 based on 75%.
  • The most consistent user if Median Delta Days is lesser than 8.

Less_consistent_user¶

In [66]:
# Filter rows where 'Median Delta Days' is greater than 100
Less_consistent_user = user_id_consistency[user_id_consistency['Median Delta Days'] > 30]

# Display the filtered DataFrame
print(Less_consistent_user)
      User ID  Median Delta Days
792        GB           458.9090
1576       TX           444.9910
1842       _2           423.2820
2439      bEO           411.3305
4197      bzm           409.2480
...       ...                ...
39727    hdRb            30.0040
13056    f-hA            30.0040
11184    eT8k            30.0030
18695    ftXa            30.0020
41123    hkY5            30.0010

[8025 rows x 2 columns]
  • user_id_consistency[user_id_consistency['Median Delta Days'] > 30] filters the user_id_consistency DataFrame to display only rows where the value in the 'Median Delta Days' column is greater than 30.

Most_consistent_user¶

In [67]:
# Filter rows where 'Median Delta Days' is greater than 100
most_consistent_user = user_id_consistency[user_id_consistency['Median Delta Days'] < 8]

# Display the filtered DataFrame
print(most_consistent_user)
      User ID  Median Delta Days
47936    iAJZ             7.9995
52441    iI4W             7.9990
38545    hZT8             7.9990
51604    iGoN             7.9990
6511     d2OP             7.9990
...       ...                ...
24529    gVkB             0.0000
59174    iQCu             0.0000
45411    i1sv             0.0000
17451    fefQ             0.0000
17571    fg7z             0.0000

[58361 rows x 2 columns]
  • user_id_consistency[user_id_consistency['Median Delta Days'] < 8] filters the user_id_consistency DataFrame to display only rows where the value in the 'Median Delta Days' column is lesser than 8.

Less Consistent User's data¶

In [68]:
# Step 1: Get the user_ids from Less_consistent_user
less_consistent_user_ids = Less_consistent_user['User ID']

# Step 2: Filter the original dataset (dl) for those user_ids
filtered_dl_l = dl[dl['user_id'].isin(less_consistent_user_ids)]

# Step 3: Group by 'user_id' and calculate required statistics
less_consistent_user_language_stats = (
    filtered_dl_l.groupby(['user_id', 'learning_language_Abb',])[['delta_days','p_recall', 'success_rate_history']]
    .median()
    .reset_index()
    .rename(columns={'p_recall': 'Median p_recall', 'delta_days': 'Median_delta_days','success_rate_history': 'Median Success Rate'})
)
less_consistent_user_language_stats = less_consistent_user_language_stats.sort_values(by = 'Median_delta_days', ascending = False)

# Display the resulting DataFrame
print(less_consistent_user_language_stats)
     user_id learning_language_Abb  Median_delta_days  Median p_recall  \
355       GB                German           458.9090         1.000000   
699       TX               Spanish           444.9910         0.666667   
805       _2               Spanish           423.2820         1.000000   
1033     bEO               Spanish           411.3305         1.000000   
1753     bzm               Spanish           409.2480         1.000000   
...      ...                   ...                ...              ...   
3541    e1x_                French             0.0020         1.000000   
3673    eApr               Spanish             0.0020         1.000000   
2829    dK31               English             0.0020         1.000000   
4751    f8Bu            Portuguese             0.0020         0.750000   
2851    dLoW            Portuguese             0.0020         0.900000   

      Median Success Rate  
355              0.833333  
699              0.666667  
805              0.833333  
1033             0.857143  
1753             0.894444  
...                   ...  
3541             0.800000  
3673             1.000000  
2829             1.000000  
4751             0.857143  
2851             1.000000  

[8193 rows x 5 columns]
  • dl[dl['user_id'].isin(less_consistent_user_ids)] After extracting the User ID column from the Less_consistent_user DataFrame and stores it in the variable less_consistent_user_ids. These IDs represent users with less consistency in their behavior (e.g., high median delta days).The code filter the original dl dataset to only include rows where the user_id is in the list less_consistent_user_ids. The filtered data is stored in filtered_dl_l, which contains the records of users with less consistency.
  • filtered_dl_l.groupby() groups the filtered data (filtered_dl_l) by user_id and learning_language_Abb, then calculates the median for the columns delta_days, p_recall, and success_rate_history for each group.
  • .rename() The column names are renamed to provide clearer labels.
  • .sort_values() This sorts the DataFrame by Median_delta_days in descending order, so that users with the highest median delta days (less consistent) appear at the top.

Most Consistent User's data¶

In [69]:
# Step 1: Filter most consistent user_ids
most_consistent_user_ids = most_consistent_user['User ID']

# Step 2: Filter the original dataset (dl) for these user_ids
filtered_most_consistent_dl = dl[dl['user_id'].isin(most_consistent_user_ids)]

# Step 3: Group by 'user_id' and calculate required statistics
most_consistent_user_stats = (
    filtered_most_consistent_dl.groupby(['user_id'])
    .agg({
        'delta_days': 'median',
        'p_recall': 'median',
        'success_rate_history': 'median'
    })
    .reset_index()
    .rename(columns={
        'delta_days': 'Median Delta Days',
        'p_recall': 'Median p_recall',
        'success_rate_history': 'Median Success Rate'
    })
)

most_consistent_user_stats = most_consistent_user_stats.sort_values(by = 'Median Delta Days', ascending = False)
# Step 4: Merge with the language information
most_consistent_user_stats = most_consistent_user_stats.merge(
    dl[['user_id', 'learning_language_Abb']].drop_duplicates(),
    on='user_id',
    how='left'
)

# Display the resulting DataFrame
print(most_consistent_user_stats)
      user_id  Median Delta Days  Median p_recall  Median Success Rate  \
0        iAJZ             7.9995         1.000000             0.875000   
1        iGoN             7.9990         1.000000             1.000000   
2        iI4W             7.9990         1.000000             0.750000   
3        d2OP             7.9990         1.000000             1.000000   
4        hZT8             7.9990         1.000000             1.000000   
...       ...                ...              ...                  ...   
59968    d52D             0.0000         0.833333             0.833333   
59969    i5iF             0.0000         0.708333             0.708333   
59970    iDPq             0.0000         1.000000             1.000000   
59971    i5BG             0.0000         1.000000             0.878788   
59972    iQCu             0.0000         1.000000             1.000000   

      learning_language_Abb  
0                   English  
1                   English  
2                    German  
3                   Spanish  
4                Portuguese  
...                     ...  
59968               English  
59969               Spanish  
59970               Spanish  
59971               Spanish  
59972               English  

[59973 rows x 5 columns]
  • dl[dl['user_id'].isin(most_consistent_user_ids)] After extracting the User ID column from the most_consistent_user DataFrame and stores it in the variable less_consistent_user_ids. These IDs represent users with less consistency in their behavior (e.g., high median delta days).The code filter the original dl dataset to only include rows where the user_id is in the list most_consistent_user_ids. The filtered data is stored in ffiltered_most_consistent_dl, which contains the records of users with less consistency.
  • filtered_most_consistent_dl.groupby() groups the filtered data (filtered_most_consistent_dl) by user_id and learning_language_Abb, then calculates the median for the columns delta_days, p_recall, and success_rate_history for each group.
  • .rename() The column names are renamed to provide clearer labels.
  • .sort_values() This sorts the DataFrame by Median_delta_days in descending order, so that users with the highest median delta days (most consistent) appear at the top.

Analysis¶

1. Overview of User Consistency¶

  • The Median Delta Days metric reveals how consistently users engage with the platform.

    • Overall Distribution:
      • Mean: 12.86 days
      • Median: 2.03 days
      • 90th Percentile: 30.13 days
      • Max: 458.91 days
    • A significant proportion of users (~75%) have a Median Delta Days of less than 9 days, indicating regular engagement.
  • Users were split into:

    • Less Consistent Users: Median Delta Days > 30 (e.g., User ID GB with 458.91 days).
    • Most Consistent Users: Median Delta Days < 8 (e.g., User ID iAJZ with 7.99 days).

2. Less Consistent Users¶

  • Language Engagement:
    • Top inconsistent users primarily engage with German and Spanish.
    • Median Recall Rate: Ranges from 0.67 to 1.00.
    • Median Success Rate: Some users show low success rates (~0.66), indicating challenges in retention or understanding.
  • Notable Outliers:
    • Users like GB (German) and TX (Spanish) show very high delta days (>400 days), yet they maintain high recall rates (~1.0). This suggests sporadic but focused engagement.

3. Most Consistent Users¶

  • Language Engagement:
    • These users are consistent in practicing languages like English, German, and Spanish.
    • Median Recall Rate: Consistently high (mostly 1.00).
    • Median Success Rate: Majority maintain excellent success rates (~0.75–1.00).
  • Insights:
    • Consistent users display better retention and recall compared to less consistent users.
    • The consistency in engagement likely correlates with high success and recall rates, reflecting effective learning habits.

Key Insights¶

  1. Consistency Drives Performance:

    • Users with regular engagement (lower delta days) consistently outperform less consistent users in recall and success rates.
  2. Language-Specific Challenges:

    • Inconsistent users engaging in languages like Spanish and German often exhibit lower success rates, indicating possible difficulties with these languages.
  3. Outliers Among Inconsistent Users:

    • Some inconsistent users maintain high recall rates despite long gaps between sessions, suggesting a preference for intensive, focused learning.
  4. Median Delta Days Threshold:

    • A Median Delta Days of 30 serves as a threshold for identifying inconsistent users, who may need targeted interventions.

Recommendations¶

  1. Encourage Regular Engagement:

    • Design reminders or streak incentives to prompt users with high delta days to engage more frequently.
    • Offer micro-lessons or bite-sized activities for users who find it hard to commit regularly.
  2. Target Support for Inconsistent Users:

    • Focus on improving success rates for inconsistent users of languages like Spanish and German through:
      • Personalized exercises.
      • Gamified learning strategies to increase motivation.
  3. Reward Consistency:

    • Recognize and reward consistent users to reinforce positive habits. Introduce badges, leaderboards, or personalized achievements.
  4. Analyze High Recall Outliers:

    • Study inconsistent users with high recall rates to identify factors contributing to their performance. Insights can guide strategies for other inconsistent learners.
  5. Data-Driven Personalization:

    • Use the consistency data to dynamically adjust lesson difficulty and suggest tailored practice schedules for both consistent and inconsistent users.

Hypotheses 6 :- The more frequent learning sessions (shorter Delta_Days) lead to better recall and higher success rates, while long breaks between sessions can negatively affect learning outcomes¶

Counting number of members in each language in less consistent users¶

In [70]:
less_consistent_user_language_stats['learning_language_Abb'].value_counts()
Out[70]:
English       2566
Spanish       2278
French        1573
German        1388
Italian        219
Portuguese     169
Name: learning_language_Abb, dtype: int64
  • .value_counts() The code returns the count of how many times each language abbreviation appears in the learning_language_Abb column for the users identified as "less consistent".

Counting number of members in each language for more consistent users¶

In [71]:
most_consistent_user_stats['learning_language_Abb'].value_counts()
Out[71]:
English       23055
Spanish       15740
French         9218
German         6905
Italian        3631
Portuguese     1424
Name: learning_language_Abb, dtype: int64
  • .value_counts() The code returns the count of how many times each language abbreviation appears in the learning_language_Abb column for the users identified as "more consistent".

Getting Correlation on Delta_Days of Each user Vs P_recall and Success Rate¶

In [72]:
# Calculate the correlation between 'Median_delta_days' and 'Median p_recall'
correlation_p_recall = less_consistent_user_language_stats['Median_delta_days'].corr(
    less_consistent_user_language_stats['Median p_recall']
)

# Calculate the correlation between 'Median_delta_days' and 'Median Success Rate'
correlation_success_rate = less_consistent_user_language_stats['Median_delta_days'].corr(
    less_consistent_user_language_stats['Median Success Rate']
)

# Print the results
print(f"Correlation between Median_delta_days and Median p_recall: {correlation_p_recall:.2f}")
print(f"Correlation between Median_delta_days and Median Success Rate: {correlation_success_rate:.2f}")
Correlation between Median_delta_days and Median p_recall: -0.05
Correlation between Median_delta_days and Median Success Rate: 0.04
  • The .corr() method is used to calculate the Pearson correlation coefficient between two columns: Median_delta_days (representing user engagement or consistency) and Median p_recall (the median recall rate of users).

Devicing Correlation on Delta_Days over P_recall and Success_rate¶

In [73]:
# Correlation between 'Median_delta_days' and 'Median p_recall'
correlation_p_recall_most = most_consistent_user_stats['Median Delta Days'].corr(
    most_consistent_user_stats['Median p_recall']
)

# Correlation between 'Median_delta_days' and 'Median Success Rate'
correlation_success_rate_most = most_consistent_user_stats['Median Delta Days'].corr(
    most_consistent_user_stats['Median Success Rate']
)

# Print the results
print(f"Correlation between Median_delta_days and Median p_recall (Most Consistent): {correlation_p_recall_most:.2f}")
print(f"Correlation between Median_delta_days and Median Success Rate (Most Consistent): {correlation_success_rate_most:.2f}")
Correlation between Median_delta_days and Median p_recall (Most Consistent): -0.03
Correlation between Median_delta_days and Median Success Rate (Most Consistent): 0.02
  • .corr() devices correlation between Median_delta_days over Median p_recall and Median Success Rate

Analysis of Language Engagement and Correlations¶

1. Learning Language Distribution¶

  • Less Consistent Users:

    • English: 2566 learners (~31.5% of less consistent group).
    • Spanish: 2278 learners (~28%).
    • French: 1573 learners (~19%).
    • German: 1388 learners (~17%).
    • Italian: 219 learners (~2.7%).
    • Portuguese: 169 learners (~2%).
  • Most Consistent Users:

    • English: 23055 learners (~33% of total).
    • Spanish: 15740 learners (~22.5%).
    • French: 9218 learners (~13%).
    • German: 6905 learners (~10%).
    • Italian: 3631 learners (~5%).
    • Portuguese: 1424 learners (~2%).

2. Correlations¶

  • For Less Consistent Users:

    • Median Delta Days and Median p_recall: -0.05 (weak negative correlation).
    • Median Delta Days and Median Success Rate: 0.04 (weak positive correlation).
  • For Most Consistent Users:

    • Median Delta Days and Median p_recall: -0.03 (very weak negative correlation).
    • Median Delta Days and Median Success Rate: 0.02 (very weak positive correlation).

Key Insights¶

  1. Distribution Imbalance:

    • English and Spanish dominate across both consistent and less consistent user groups, but less consistent users have a slightly higher proportion studying German and Spanish.
    • Italian and Portuguese have a smaller share of learners, with Italian learners being more consistent overall.
  2. Weak Correlations:

    • Minimal relationship between how frequently users interact (Median Delta Days) and their performance metrics (Median p_recall or Median Success Rate). This suggests that other factors, such as study habits or course difficulty, might play a larger role in determining success.
  3. Language-Specific Trends:

    • French and German learners are slightly overrepresented among less consistent users. These languages may have content or structure that makes it harder to maintain engagement.

Recommendations¶

  1. Focus on English and Spanish:

    • Invest in personalized reminders and motivational tools for less consistent English and Spanish learners to encourage consistent engagement.
    • Introduce goal-setting features tailored to these languages, such as streak rewards.
  2. Enhance French and German Content:

    • Investigate potential challenges with French and German courses, such as lesson complexity or perceived difficulty.
    • Provide step-by-step learning paths or extra practice materials for commonly challenging topics.
  3. Leverage Consistency in Italian and Portuguese:

    • Promote the high success and consistency rates for Italian and Portuguese to attract new learners.
    • Offer cultural immersion content (e.g., travel-related lessons, conversational phrases).
  4. Analyze External Factors:

    • Conduct user surveys or focus groups to identify external factors influencing engagement and performance (e.g., motivation, user interface, content quality).

By addressing these areas, the platform can improve learner engagement, especially for less consistent users, while maintaining high retention rates for the most consistent groups.

Hypotheses 7 :-Users with a higher number of unique lexemes learned are likely to show greater engagement and proficiency in the language, indicating a stronger commitment to the learning process.¶

Unique Lexemes Learned for each Users¶

In [74]:
# Step 1: Group by 'user_id' and count unique 'lexeme_id' learned
user_lexeme_counts = (
    dl.groupby(['user_id','learning_language_Abb'])['lexeme_id']
    .nunique()
    .reset_index()
    .rename(columns={'lexeme_id': 'Unique Lexemes Learned'})
)

# Step 2: Sort by the count of unique lexemes in descending order
user_lexeme_counts = user_lexeme_counts.sort_values(by='Unique Lexemes Learned', ascending=False)

# Display the resulting DataFrame
print(user_lexeme_counts)
      user_id learning_language_Abb  Unique Lexemes Learned
81262     tJs               Spanish                     941
8867     dlpG                German                     899
13039    erXf               English                     867
25901    gZJc               English                     862
20467    g2Ev               English                     848
...       ...                   ...                     ...
40815    hdFN                German                       1
61977    iRVR                French                       1
46195    i17y                French                       1
75951    ipO1               English                       1
48151    i4P7                French                       1

[81711 rows x 3 columns]
  • dl.groupby(['user_id','learning_language_Abb'])['lexeme_id'].nunique().reset_index().rename(columns={'lexeme_id': 'Unique Lexemes Learned'}) this groups the count on unique lexeme id which groups over user_id and learning_language_Abb and rename the column names for more convinience
  • user_lexeme_counts.sort_values(by='Unique Lexemes Learned', ascending=False) sorts the dataframe by unique lexeme id in descending order.

Analysis of Lexeme Learning Across Users¶

1. Key Findings¶

  • Top Performers:

    • The user tJs studying Spanish learned the most unique lexemes (941), followed by dlpG in German with 899 and erXf in English with 867.
    • The top learners predominantly study Spanish, German, and English.
  • Lower Engagement:

    • A significant portion of users have learned only one unique lexeme across various languages, indicating very low engagement or limited initial interaction.
  • Language Insights:

    • Users learning English dominate the top rankings, followed by Spanish and German, reflecting their popularity or user preference on the platform.
    • Users in less popular languages (e.g., French) generally have fewer high performers compared to English and Spanish learners.

Key Insights¶

  1. Engagement Gap:

    • The wide range in unique lexemes learned (from 941 to 1) highlights a gap between highly engaged users and those who disengage early.
  2. Language Popularity and Complexity:

    • Spanish and English lead in lexeme acquisition, which might be due to user familiarity, the structure of the language courses, or motivation.
    • German learners are relatively high-performing but could face challenges in lexical complexity, making their achievements notable.
  3. Low Performers:

    • The significant number of users with minimal lexeme acquisition suggests an opportunity to improve initial user retention through engaging onboarding experiences or targeted support.

Recommendations¶

  1. Enhance Early Engagement:

    • Introduce gamified incentives for new learners (e.g., badges for learning 10+ lexemes in the first week).
    • Simplify onboarding lessons with immediate rewards for completing early milestones.
  2. Support High Performers:

    • Offer advanced modules or personalized challenges for users like tJs, dlpG, and erXf to keep them engaged at higher levels.
    • Use top learners as case studies or ambassadors to inspire others.
  3. Target Low Performers:

    • Identify users with minimal lexeme counts and provide personalized follow-ups, such as reminders or suggestions for easier, beginner-friendly lessons.
    • Deploy exit surveys or feedback requests to understand why users disengage early.
  4. Language-Specific Enhancements:

    • For Spanish and English learners, emphasize conversational and practical vocabulary to capitalize on interest.
    • For German and French learners, focus on breaking down lexical complexity and providing mnemonic aids to simplify learning.
  5. Data-Driven Content Iteration:

    • Analyze content patterns in languages with both high and low lexeme counts to refine lesson structure and difficulty progression.

By addressing these insights, the platform can foster balanced engagement across different learner profiles while improving retention and satisfaction.

Hypothesis 8 :- Grammar tag categorization has a significant impact on both success rate history and session recall rate, with certain grammar tags showing higher success and recall rates than others.¶

Grouped by grammar_tag on success_rate_history with mean and median¶

In [75]:
grammar_stats_history = dl.groupby('grammar_tag')['success_rate_history'].agg(['mean', 'median']).round(2)
grammar_stats_history = grammar_stats_history.reset_index()
grammar_stats_history
Out[75]:
grammar_tag mean median
0 Adverbs 0.90 0.98
1 Conjunctions 0.88 0.91
2 Determiners and Adjectives 0.90 0.94
3 Interjections 0.95 1.00
4 Nan 0.78 1.00
5 Negation 0.89 1.00
6 Nouns 0.91 1.00
7 Numbers and Quantifiers 1.00 1.00
8 Other 0.89 1.00
9 Pronouns and Related 0.89 0.92
10 Verbs 0.90 0.95
  • The resulting dataframe, grammar_stats_history, summarizes success rate statistics for each grammar tag in mean and median.

Grouped by grammar_tag on p_recall with mean and median¶

In [76]:
grammar_stats_session = dl.groupby('grammar_tag')['p_recall'].agg(['mean', 'median']).round(2)
grammar_stats_session = grammar_stats_session.reset_index()
grammar_stats_session
Out[76]:
grammar_tag mean median
0 Adverbs 0.89 1.0
1 Conjunctions 0.87 1.0
2 Determiners and Adjectives 0.89 1.0
3 Interjections 0.95 1.0
4 Nan 0.94 1.0
5 Negation 0.89 1.0
6 Nouns 0.90 1.0
7 Numbers and Quantifiers 0.50 0.5
8 Other 0.90 1.0
9 Pronouns and Related 0.88 1.0
10 Verbs 0.89 1.0
  • The resulting dataframe, grammar_stats_session, summarizes success rate statistics for each grammar tag in mean and median.grammar_stats_session

Derive grammar_tag on plural categories¶

In [77]:
# Compare recall rate and success rate across grammatical categories
tag_diff_comparison = dl.groupby(['grammar_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median']).round(2)
tag_diff_comparison
Out[77]:
mean median
grammar_tag plurality_tag
Determiners and Adjectives Plural 0.88 1.0
Singular 0.89 1.0
Nouns Plural 0.90 1.0
Singular 0.90 1.0
Pronouns and Related Plural 0.89 1.0
Singular 0.90 1.0
Verbs Plural 0.89 1.0
Singular 0.90 1.0
  • dl.groupby(['grammar_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median']).round(2) aggregates the mean and median of p_recall that revolves over grammar_tag which categorized by plurality_tag.

Analysis of Grammar-Tag-Specific Performance¶

1. Success Rate Trends by Grammar Tag¶

  • Top Performing Tags:

    • Numbers and Quantifiers: Achieve the highest mean (1.00) and median (1.00) success rates, indicating consistency and ease in mastering this category.
    • Interjections: Similarly, with a mean (0.95) and median (1.00), these appear straightforward for users.
    • Nouns: High mean (0.91) and perfect median (1.00) success rates indicate strong user retention in this foundational area.
  • Low Performing Tags:

    • Conjunctions: With a mean success rate of 0.88, learners face challenges in this grammatical category.
    • Nan (Undefined): The mean (0.78) suggests issues with ambiguous or improperly tagged content, despite a perfect median (1.00).

2. Recall Rate Trends by Grammar Tag¶

  • Top Performing Tags:

    • Interjections: Exhibit high recall rates with a mean (0.95) and median (1.00).
    • Nouns: Maintain consistent performance, with a mean (0.90) and median (1.00), reflecting their foundational role in language structure.
  • Low Performing Tags:

    • Numbers and Quantifiers: Recall rate struggles, with a mean (0.50) and median (0.50), likely due to numerical complexity or irregularity in lesson structure.
    • Conjunctions and Pronouns: Mean values of 0.87 and 0.88 highlight slight challenges.

3. Grammar Tag Performance by Plurality¶

  • Singular vs. Plural:
    • Across Determiners and Adjectives, Nouns, Pronouns, and Verbs, singular forms exhibit slightly higher recall rates (mean 0.90) compared to plural forms (mean 0.89).
    • This suggests that learners are more comfortable with singular constructs, likely due to their prevalence in basic language training.

Key Insights¶

  1. Consistency in Key Categories:

    • Grammar tags like Nouns, Interjections, and Numbers are performing well in both success and recall metrics.
    • Numbers and Quantifiers, despite a perfect success rate, show a recall rate gap, highlighting a possible discrepancy between practice and retention.
  2. Ambiguous or Undefined Content (Nan):

    • Poor mean success rates for the Nan category suggest potential tagging errors or unclear instructional design.
  3. Plurality Challenges:

    • Across multiple grammar tags, plural constructs slightly underperform compared to singular constructs, indicating a learning curve in pluralization rules.
  4. Specific Challenges:

    • Conjunctions and Pronouns lag in both recall and success rates, hinting at their inherent complexity or insufficient emphasis in lessons.

Recommendations¶

  1. Targeted Improvements for Low-Performing Categories:

    • Conjunctions and Pronouns: Introduce focused lessons with simplified rules, visual aids, and relatable examples to enhance comprehension.
    • Numbers and Quantifiers: Revise lesson structure to emphasize repetition, mnemonics, and contextual applications for numerical recall.
  2. Optimize Undefined (Nan) Content:

    • Audit and refine ambiguous or poorly defined grammar tags to ensure clarity and better instructional value.
  3. Address Pluralization Challenges:

    • Include explicit lessons on pluralization rules, especially for Determiners and Adjectives, Pronouns, and Verbs. Practice-based approaches (e.g., fill-in-the-blank exercises) can help learners bridge the gap.
  4. Leverage Strengths:

    • Build on successful categories like Nouns and Interjections by integrating them into more complex exercises to maintain engagement.
  5. Data-Driven Curriculum Adjustments:

    • Use metrics like recall and success rates to refine lesson plans, prioritizing low-performing areas while maintaining the strengths of well-performing tags.

By addressing these points, the platform can create a more balanced and effective learning experience, improving both user retention and mastery of grammatical nuances.

Hypothesis 9 :- The significant impact of the popularity of a language on the number of unique lexeme IDs in that language.¶

Defining the Function Value_counts¶

In [78]:
def value_counts(column_name):
    value_counts = dl[column_name].value_counts()
    value_counts_dl = pd.DataFrame(value_counts)

    return value_counts_dl
  • value_counts(column_name) computes the number of occurrences of each unique value in the specified column of the DataFrame dl.
In [79]:
value_counts('learning_language_Abb')
Out[79]:
learning_language_Abb
English 1479926
Spanish 1007678
French 552704
German 425433
Italian 237961
Portuguese 92056
  • The Function value_counts(column_name) ise employed for learning_language_Abb column.

Employing Scatter plot for Count of Learning Language vs Unique Lexeme words¶

In [80]:
# Data
learning_language_Abb = ['English', 'Spanish', 'French', 'German', 'Italian', 'Portuguese']
count = [1479926, 1007678, 552704, 425433, 237961, 92056]  # Count of learning_language_full
unique_lexeme_ids = [2740, 3052, 3429, 3218, 1750, 2055]  # Unique Lexeme IDs
colors = ['blue', 'orange', 'green', 'red', 'purple', 'brown']  # Different colors for languages

# Normalize counts for the x-axis
scaled_count = [x / 100000 for x in count]

# Create a figure
plt.figure(figsize=(12, 8))

# Scatter plot with unique colors for each language
plt.scatter(scaled_count, unique_lexeme_ids, color=colors, s=200, alpha=0.8, edgecolors='black')

# Annotate points with language names
for i, lang in enumerate(learning_language_Abb):
    plt.text(scaled_count[i], unique_lexeme_ids[i] + 50, lang, fontsize=10, ha='center')

# Add titles and labels
plt.title('Learning Language Count (Normalized) vs. Unique Lexeme IDs', fontsize=16)
plt.xlabel('Normalized Count of Learning Language Learners', fontsize=14)
plt.ylabel('Unique Lexeme IDs', fontsize=14)
plt.grid(alpha=0.5)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image
  • The code visualizes the relationship between the count of language learners and the number of unique lexemes for each language. Each point represents a language, annotated with its name, and uses distinct colors for clear differentiation.

Analysis¶

  1. Correlation Analysis:

    • The scatter plot allows us to visualize relationships between the normalized count of learners and the number of unique lexeme IDs across different languages.
    • Strong Positive Observations:
      • English and Spanish have the highest learner counts and correspondingly high unique lexeme IDs, indicating they attract a large learner base and have rich content diversity.
      • French and German, despite lower learner counts, show high unique lexeme IDs, suggesting they offer substantial vocabulary depth.
    • Weaker Relationships:
      • Italian and Portuguese, with lower learner counts and unique lexeme IDs, indicate less engagement and possibly less extensive content compared to other languages.
  2. Key Relationships Interpretation:

    • Learner Engagement: English and Spanish's high learner counts suggest these languages are most popular, likely due to their global utility and demand.
    • Content Richness: High unique lexeme IDs for French and German imply these languages offer a robust learning experience, potentially attracting learners seeking comprehensive knowledge.
    • Underrepresented Languages: Italian and Portuguese lag behind in both metrics, highlighting opportunities to boost their content and learner engagement.

Recommendations¶

  1. Enhance Content for Underrepresented Languages:

    • Develop and promote more engaging content for Italian and Portuguese to attract more learners and increase the diversity of lexemes.
    • Offer incentives and highlight cultural and practical benefits of learning these languages.
  2. Leverage Popular Languages:

    • Utilize the large learner base of English and Spanish to introduce advanced courses, interactive sessions, and community-building activities to maintain engagement.
    • Expand marketing efforts to capitalize on their popularity.
  3. Invest in Advanced Courses for Rich Content Languages:

    • Given the high lexeme diversity in French and German, focus on developing advanced and specialized courses to cater to learners looking for in-depth knowledge.
    • Highlight the unique features and advanced content available in these languages.

By implementing these strategies, you can balance the learning experience across all languages, cater to diverse learner needs, and potentially increase engagement across the board.

Hypothesis 10:- Aiming to understand the influence of language characteristics or learner preferences on performance metrics.¶

Aggregating Mean and Median on success_rate_history for learning_language_Abb¶

In [81]:
language_stats_history_success_rate = dl.groupby('learning_language_Abb')['success_rate_history'].agg(['mean', 'median']).round(2)
language_stats_history_success_rate = language_stats_history_success_rate.reset_index()
language_stats_history_success_rate
Out[81]:
learning_language_Abb mean median
0 English 0.90 0.95
1 French 0.89 0.94
2 German 0.90 1.00
3 Italian 0.90 1.00
4 Portuguese 0.91 1.00
5 Spanish 0.90 1.00
  • The code computes the mean and median success rates for the success_rate_history column, grouped by each language in the learning_language_Abb column. -This provides insights into how learners in different languages perform on average and at the median level. It also creates a clean DataFrame for further analysis or visualization.

Aggregating Mean and Median on p_recall for learning_language_Abb¶

In [82]:
language_stats_recall_rate = dl.groupby('learning_language_Abb')['p_recall'].agg(['mean', 'median']).round(2)
language_stats_recall_rate = language_stats_recall_rate.reset_index()
language_stats_recall_rate
Out[82]:
learning_language_Abb mean median
0 English 0.90 1.0
1 French 0.88 1.0
2 German 0.89 1.0
3 Italian 0.91 1.0
4 Portuguese 0.90 1.0
5 Spanish 0.90 1.0
  • The code computes the mean and median success rates for the p_recall column, grouped by each language in the learning_language_Abb column. -This provides insights into how learners in different languages perform on average and at the median level. It also creates a clean DataFrame for further analysis or visualization.

Merging the above datasets for Further Analysis¶

In [83]:
# Merge the two datasets
merged_data = pd.merge(language_stats_history_success_rate, 
                       language_stats_recall_rate, 
                       on='learning_language_Abb', 
                       suffixes=('_success_rate', '_recall_rate'))

# Restructure the data to long format for easier plotting
long_data = pd.melt(
    merged_data,
    id_vars=['learning_language_Abb'],
    value_vars=['mean_success_rate', 'median_success_rate', 'mean_recall_rate', 'median_recall_rate'],
    var_name='Metric',
    value_name='Value'
)

# Define a mapping for better labels
metric_labels = {
    'mean_success_rate': 'Mean Success Rate',
    'median_success_rate': 'Median Success Rate',
    'mean_recall_rate': 'Mean Recall Rate',
    'median_recall_rate': 'Median Recall Rate'
}
long_data['Metric'] = long_data['Metric'].map(metric_labels)
  • pd.merge()Merges two datasets that contain different performance metrics (success rate and recall rate) for different learning languages.
  • pd.melt() Reshapes the data into a long format to make it easier to plot and analyze.
  • After setting the name for all data columns using .map() maps the names to the long_data structure.

Deriving Line Chart for success_rate and p_recall through mean and median¶

In [84]:
# Plot the data
plt.figure(figsize=(12, 6))
for metric in long_data['Metric'].unique():
    subset = long_data[long_data['Metric'] == metric]
    plt.plot(subset['learning_language_Abb'], subset['Value'], label=metric)

# Customize the chart
plt.title('Mean and Median for Success Rate and Recall Rate by Learning Language', fontsize=14)
plt.xlabel('Learning Language', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(title='Metrics', fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Show the chart
plt.show()
No description has been provided for this image
  • The for loop code loops through each metric (e.g., Mean Success Rate, Median Success Rate).
  • subset = long_data[long_data['Metric'] == metric] Using the code for each metric, a subset of the long_data DataFrame is created, containing only the rows that match the current metric.
  • Using the Loop, the plot is carriedout.

Analysis¶

  1. Overall Performance:

    • Mean Success Rate: Most languages have a mean success rate around 0.90, with slight variations. Portuguese stands out with a mean of 0.91.

    • Median Success Rate: For several languages (German, Italian, Portuguese, Spanish), the median success rate is 1.00, indicating that at least half of the learners achieved a perfect success rate in their historical data.

    • Mean Recall Rate: The mean recall rate is also around 0.90 for most languages, with slight variations. Italian leads with 0.91.

    • Median Recall Rate: The median recall rate is 1.00 for all languages, implying that at least half of the learners achieved a perfect recall rate.

  2. Consistency and Outliers:

    • The median values being 1.00 for many languages suggest high consistency and potentially a skewed distribution towards high performance.
    • Mean values provide a more nuanced view, showing slight differences between languages that the median values do not capture.
  3. Language-Specific Trends:

    • Portuguese and Italian: These languages have the highest mean success and recall rates, indicating strong learner performance.
    • French: While the median values are strong, French has a slightly lower mean success and recall rate compared to other languages, suggesting some variability in learner performance.

Recommendations¶

  1. Leverage High Performance in Italian and Portuguese:

    • Promote these languages more aggressively, showcasing the high success and recall rates to attract new learners.
    • Use the high performance as a case study to develop best practices and strategies for other languages.
  2. Address Variability in French:

    • Investigate the factors contributing to the slightly lower mean success and recall rates in French.
    • Provide targeted support and resources for learners struggling with French to enhance overall performance.
  3. Enhance Learning Experience for All Languages:

    • Maintain consistency in teaching methods and materials to ensure learners continue to achieve high success and recall rates.
    • Use the insights from high-performing languages to improve the content and engagement strategies for all languages.

By focusing on these recommendations, you can enhance learner performance across different languages, address any variability, and attract more learners to your platform.

Hypothesis 11 :- The distribution of learning and UI languages will reveal varying user preferences and engagement levels across different languages.¶

Chart count on learning_language and ui_language¶

In [112]:
def analyze_lan_distribution(dl, column_name, plot_color='skyblue'):
  
    # Perform value counts
    value_counts = dl[column_name].value_counts()
    value_counts_dl = pd.DataFrame(value_counts, columns=['Count'])
    value_counts_dl.index.name = column_name

    # Print the DataFrame
    print(f"\nDistribution of {column_name}:")
    print(value_counts_dl)

    # Plot the distribution
    plt.figure(figsize=(10, 6))
    value_counts.plot(kind='bar', color=plot_color)
    plt.title(f'Distribution of {column_name}', fontsize=16)
    plt.xlabel(column_name, fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.6)
    plt.tight_layout()
    plt.show()

    # Return the DataFrame
    return value_counts_dl

ui_language_count = analyze_lan_distribution(dl, 'learning_language_Abb', plot_color='yellow')
ui_language_count = analyze_lan_distribution(dl, 'ui_language_Abb', plot_color='orange')
Distribution of learning_language_Abb:
Empty DataFrame
Columns: [Count]
Index: []
No description has been provided for this image
Distribution of ui_language_Abb:
Empty DataFrame
Columns: [Count]
Index: []
No description has been provided for this image
  • def analyze_lan_distribution(dl, column_name, plot_color='skyblue') defines a function on visualization on box plot exihibiting the number of users in the desired column.
  • analyze_lan_distribution(dl, 'learning_language_Abb', plot_color='yellow') defines the visuakization for learning_language_Abb.. Similarly doing it for ui_language_Abb

Analysis¶

  1. Distribution of Learning Languages (learning_language_Abb):

    • The distribution reveals the count of learners for each language. From this, we can deduce which languages are most and least popular among users.
    • Key Observations:
      • English has the highest count, indicating it is the most popular learning language.
      • Spanish, Portuguese, and Italian also have significant learner counts, but less than English.
  2. Distribution of UI Languages (ui_language_Abb):

    • This distribution shows the count of users for each user interface language. It provides insights into user preferences and localization needs.
    • Key Observations:
      • English is the most used UI language, with a count exceeding 2 million.
      • Spanish follows, with a count slightly above 1 million.
      • Portuguese and Italian have significantly lower counts, with Portuguese being higher than Italian.

Recommendations¶

  1. Localization and Language Support:

    • Given the high counts of English and Spanish as UI languages, ensure these languages have robust and fully localized interfaces.
    • Enhance support for Portuguese and Italian UI users, potentially increasing their counts by making these languages more accessible and user-friendly.
  2. Content Development:

    • Since English is the most popular learning language, focus on expanding and improving English learning materials to cater to the high demand.
    • For other popular learning languages like Spanish, Portuguese, and Italian, develop engaging and comprehensive learning content to retain and attract more learners.
  3. Marketing and Outreach:

    • Utilize the data on UI language distribution to target marketing campaigns effectively. For instance, promoting the platform more aggressively in English and Spanish-speaking regions.
    • Highlight the benefits of learning less popular languages to diversify user engagement and learner base.

By applying these recommendations, you can enhance user experience, cater to diverse language needs, and potentially increase user engagement across various languages.

Hypothesis 12 :- The process aims to examine how recall probabilities vary across different learning languages and return time categories, providing insights into patterns of retention and engagement.¶

Creating a Pivot Table¶

In [86]:
# Pivot table to calculate average p_recall
pivot_table = dl.pivot_table(
    values='p_recall',
    index='learning_language_Abb',
    columns='delta_days_category',
    aggfunc='mean'
)

# Display the pivot table
print(pivot_table)
delta_days_category    Less than a day  Over a month  Within a month  \
learning_language_Abb                                                  
English                       0.907246      0.874983        0.884448   
French                        0.891713      0.848622        0.872736   
German                        0.907395      0.853919        0.875697   
Italian                       0.915333      0.887580        0.890401   
Portuguese                    0.914628      0.869350        0.887903   
Spanish                       0.909945      0.873059        0.886486   

delta_days_category    Within a week      Zero  
learning_language_Abb                           
English                     0.892298  0.930387  
French                      0.878085  0.918171  
German                      0.886278  0.940054  
Italian                     0.899162  0.924242  
Portuguese                  0.894537  0.907361  
Spanish                     0.894120  0.926428  
  • Creating a pivot table by dl.pivot_table() which ranging over the average p_recall restricted on delta_days_category and learning_language_Abb as columns and index respectively.

Create a heatmap of pivot_table¶

In [87]:
# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Heatmap of Recall Probabilities (p_recall) by Language and Time Categories')
plt.xlabel('Delta Days Category')
plt.ylabel('Learning Language')
plt.show()
No description has been provided for this image
  • This code creates a heatmap to visually represent the recall probabilities (p_recall) based on the learning language and time (delta days) categories.

Analysis:¶

  1. Highest Recall Probabilities:

    • The highest recall probabilities are observed in the "Zero" time category for all languages, with German having the highest value at 0.94.
    • Italian also shows a high recall probability in the "Less than a day" category at 0.92.
  2. Lowest Recall Probabilities:

    • The lowest recall probabilities are generally found in the "Over a month" category, with French and German both having the lowest value at 0.85.
  3. Consistency Across Time Categories:

    • English, German, and Spanish show relatively consistent recall probabilities across different time categories, with values generally ranging between 0.85 and 0.94.
    • French shows more variability, with a noticeable dip in the "Over a month" category.

Recommendations:¶

  1. Focus on High Recall Time Categories:

    • Since the "Zero" time category consistently shows the highest recall probabilities, it may be beneficial to focus on immediate recall techniques for language learning.
  2. Address Low Recall Time Categories:

    • Special attention should be given to the "Over a month" category, especially for French and German, to improve long-term retention strategies.
  3. Tailored Learning Strategies:

    • Different languages may benefit from tailored learning strategies. For example, French learners might need more frequent reviews to maintain high recall probabilities over longer periods.
  4. Balanced Approach:

    • A balanced approach that combines immediate recall techniques with periodic reviews could help maintain high recall probabilities across all time categories.

By analyzing the heatmap, educators and learners can better understand the effectiveness of different recall strategies and tailor their learning plans accordingly.

Hypothesis 13 :- Analysing time since learning sessions influences recall probabilities and success rates, expecting a potential decline in these metrics with increasing time gaps.¶

Correlation between delta_days and p_recall¶

In [88]:
dl['delta_days'] = pd.to_numeric(dl['delta_days'], errors='coerce')
dl['p_recall'] = pd.to_numeric(dl['p_recall'], errors='coerce')
correlation = dl['delta_days'].corr(dl['p_recall'])
print(f"Correlation between delta_days and p_recall: {correlation:.2f}") 
Correlation between delta_days and p_recall: -0.03
  • pd.to_numeric() code takes the delta_days and p_recall columns, converts any non-numeric data to NaN, and then computes the correlation coefficient between them.
  • The correlation value helps in understanding how strongly these two variables are related. If the value is close to 1, there is a strong positive correlation; if close to -1, a strong negative correlation; and if close to 0, there is little to no linear relationship between them.

Correlation between delta_days and success_rate_history¶

In [89]:
dl['delta_days'] = pd.to_numeric(dl['delta_days'], errors='coerce')
dl['success_rate_history'] = pd.to_numeric(dl['success_rate_history'], errors='coerce')
correlation = dl['delta_days'].corr(dl['success_rate_history'])
print(f"Correlation between delta_days and success_rate_history: {correlation:.2f}") 
Correlation between delta_days and success_rate_history: 0.02
  • the correlation coefficient computed between delta_days and success_rate_history.

3D Scatter Plot: History vs. Session vs. Delta Days¶

In [90]:
# Initialize a 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot
sc = ax.scatter(
    dl['history_correct'], 
    dl['session_correct'], 
    dl['delta_days'], 
    c=dl['delta_days'],  # Color based on delta_days
    cmap='viridis',      # Color map
    alpha=0.8            # Transparency
)

# Add labels
ax.set_title('3D Scatter Plot: History vs. Session vs. Delta Days')
ax.set_xlabel('History Correct')
ax.set_ylabel('Session Correct')
ax.set_zlabel('Delta Days')

# Add color bar
cbar = plt.colorbar(sc, pad=0.1)
cbar.set_label('Delta Days')

# Show plot
plt.show()
No description has been provided for this image
  • The scatter plot visualizes the relationship between three variables: history_correct, session_correct, and delta_days.
  • The X-axis represents the range of history_correct, the Y-axis represents the range of session_correct, and the Z-axis represents the delta_days`.
  • Each data point is colored based on its delta_days value, with a color scale (using the viridis color map) indicating the range of delta days.
  • The color bar provides a reference for interpreting the colors in the scatter plot.
  • This 3D scatter plot is useful for understanding how the three variables interact with each other and how they might be correlated.

Correlation Analysis:¶

  1. Delta Days vs. p_recall:

    • The correlation between delta_days and p_recall is -0.03, indicating a very weak negative relationship. This suggests that the time elapsed between study sessions has a minimal impact on recall probability.
  2. Delta Days vs. Success Rate History:

    • The correlation between delta_days and success_rate_history is 0.02, indicating a very weak positive relationship. This implies that the time elapsed between study sessions has a negligible effect on historical success rates.

3D Scatter Plot Insights:¶

The 3D scatter plot visualizes the relationship between the number of correct answers in history (history_correct), session correctness (session_correct), and the number of days between sessions (delta_days).

Key Observations:

  • History Correct vs. Session Correct: There's a noticeable clustering of points where higher values of history_correct correlate with higher session_correct values. This suggests that learners who perform well historically tend to perform well in current sessions too.
  • Color Gradient of Delta Days: The color gradient from purple to yellow represents the delta_days values. While there is no clear trend indicating the impact of delta_days, points are spread across the range, indicating varied intervals between sessions.

Recommendations:¶

  1. Focus on Consistency:

    • Encourage regular study habits since delta_days shows minimal impact on recall and success rates. Consistent engagement is key to maintaining performance.
  2. Utilize Historical Performance:

    • Leverage the strong relationship between history_correct and session_correct. Personalized feedback and adaptive learning paths based on historical performance can help enhance learning outcomes.
  3. Balance Study Intervals:

    • While the weak correlations suggest flexibility in study intervals, maintaining a balanced approach with periodic reviews can support long-term retention.

By leveraging these insights and recommendations, you can enhance learner performance, maintain engagement, and achieve better outcomes in language learning.

Hypothesis 14 :- Analyzing p_recall by delta_days_category aims to explore how recall probabilities vary across time intervals, hypothesizing a potential decline in recall with longer time gaps.¶

Aggregating Mean and Median on p_recall for learning_language_Abb¶

In [91]:
p_recall_days_category = dl.groupby("delta_days_category")["p_recall"].agg(["mean", "median"])
p_recall_days_category
Out[91]:
mean median
delta_days_category
Less than a day 0.906413 1.0
Over a month 0.868080 1.0
Within a month 0.882737 1.0
Within a week 0.890425 1.0
Zero 0.927669 1.0
  • dl.groupby("delta_days_category")["p_recall"].agg(["mean", "median"]) code groups the data by the delta_days_category and calculates two aggregate values (mean and median) for the p_recall values within each category.

Create a Chart on p_recall vs delta_days_category¶

In [92]:
# Extract data for plotting
categories = p_recall_days_category.index  # Delta days categories
mean_values = p_recall_days_category['mean']  # Mean of p_recall
median_values = p_recall_days_category['median']  # Median of p_recall

# Create the figure
plt.figure(figsize=(12, 6))

# Bar chart for mean values
plt.bar(categories, mean_values, color='skyblue', label='Mean', alpha=0.7)

# Line chart for median values
plt.plot(categories, median_values, color='red', marker='o', label='Median', linewidth=2)

# Add labels, title, and legend
plt.title('P_Recall by Delta Days Category', fontsize=16)
plt.xlabel('Delta Days Category', fontsize=14)
plt.ylabel('P_Recall', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.ylim(0.85, 1.02)  # Adjust the y-axis to focus on the range of p_recall
plt.legend(title='Metrics', fontsize=12)
plt.grid(alpha=0.5)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image
  • This code creates a plot showing the mean and median values of p_recall for each delta_days_category.
  • Bar chart represents the mean p_recall values for each category.
  • Line chart represents the median p_recall values for each category.
  • The x-axis shows the different delta_days_category, and the y-axis represents the p_recall values.

Analysis¶

  1. Mean and Median P_Recall:

    • The mean P_Recall values vary across the different delta days categories. The "Zero" category has the highest mean value at 0.93, indicating that recall probability is highest when there's no gap between learning sessions.
    • The "Over a month" category has the lowest mean P_Recall value at 0.87, suggesting that recall probability decreases with longer intervals between sessions.
    • The median P_Recall values are consistently at 1.0 for all categories, indicating that at least half of the learners achieve perfect recall across all time intervals.
  2. Impact of Time Intervals:

    • Immediate recall (Zero days) leads to the highest recall probability, emphasizing the effectiveness of frequent and consistent learning sessions.
    • Recall probabilities remain relatively high for intervals "Less than a day," "Within a week," and "Within a month," but show a noticeable decline for "Over a month," highlighting the need for reinforced learning strategies over longer periods.

Recommendations¶

  1. Frequent Learning Sessions:

    • Encourage learners to engage in frequent learning sessions to maintain high recall probabilities. Implementing spaced repetition techniques can be beneficial.
  2. Reinforced Learning for Long Intervals:

    • Develop reinforced learning strategies for learners with longer intervals between sessions. This could include periodic reviews and refresher modules to improve long-term retention.
  3. Personalized Learning Plans:

    • Create personalized learning plans that adapt to the individual needs of learners. For those who cannot engage frequently, provide tailored content to maximize recall during longer gaps.
  4. Interactive and Engaging Content:

    • Enhance learning materials with interactive and engaging content to keep learners motivated and reduce the likelihood of long gaps between sessions.

By following these recommendations, you can enhance recall probabilities across different time intervals, improve long-term retention, and provide a more effective learning experience for all learners.

Hypothesis 15 :- Might gender-specific differences in recall probability (p_recall) may vary across different learning languages, suggesting that certain languages might exhibit higher or lower recall rates based on gender categorization.¶

Aggregating mean on p_recall constrained on learning_language_Abb with gender_tag¶

In [93]:
p_recall_gender_language = dl.groupby(["learning_language_Abb", "gender_tag"])["p_recall"].mean().reset_index(name="count")
p_recall_gender_language = p_recall_gender_language.pivot(index='learning_language_Abb', 
    columns='gender_tag', values='count')
p_recall_gender_language
Out[93]:
gender_tag Feminine Masculine Neuter
learning_language_Abb
English 0.915208 0.892346 0.908488
French 0.883118 0.889324 0.885532
German 0.890919 0.890983 0.900045
Italian 0.913691 0.901427 0.899050
Portuguese 0.902538 0.911269 0.861442
Spanish 0.898183 0.906447 0.903529
  • Groups the data by learning_language_Abb and gender_tag and calculates the mean p_recall for each group.
  • Using p_recall_gender_language.pivot() creates a pivot table for the data for an better analysis.

Heat Map Using pivot table¶

In [94]:
plt.figure(figsize=(10, 8))
sns.heatmap(p_recall_gender_language, annot=True, cmap='cividis', fmt='.2f', linewidths=0.5, cbar_kws={'label': 'Mean Recall Rate'})
plt.title('Mean p_recall by Learning Language and Gender', fontsize=16)
plt.xlabel('Gender Tag', fontsize=14)
plt.ylabel('Learning Language', fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image
  • A heatmap is created to show the relationship between learning_language_full (on the y-axis) and gender_tag (on the x-axis) with the color representing the mean recall rate (p_recall).

Analysis¶

  1. Mean Recall Rates by Gender and Language:

    • English: Feminine (0.92), Masculine (0.89), Neuter (0.91)
    • French: Feminine (0.88), Masculine (0.89), Neuter (0.89)
    • German: Feminine (0.89), Masculine (0.89), Neuter (0.90)
    • Italian: Feminine (0.91), Masculine (0.90), Neuter (0.90)
    • Portuguese: Feminine (0.90), Masculine (0.91), Neuter (0.86)
    • Spanish: Feminine (0.90), Masculine (0.91), Neuter (0.90)
  2. Key Observations:

    • English shows the highest recall rate for Feminine gender at 0.92, but slightly lower for Masculine at 0.89.
    • German exhibits relatively consistent recall rates across all genders, with Neuter slightly higher at 0.90.
    • Portuguese has the lowest Neuter recall rate at 0.86, indicating potential variability in recall based on gender.
    • Italian and Spanish demonstrate high recall rates across all genders, indicating strong performance and retention.

Recommendations¶

  1. Tailored Learning Strategies:

    • Recognize and address the slight variations in recall rates across different genders. This could involve creating more personalized learning content and support for specific gender groups.
    • For languages like Portuguese with lower Neuter recall rates, explore the reasons behind the disparity and implement targeted interventions to improve recall.
  2. Focus on Consistency:

    • Languages with consistent recall rates across genders, like German, serve as good models for balancing content and teaching methods.
    • Utilize insights from these languages to develop best practices that can be applied to other languages.
  3. Enhanced Engagement:

    • Promote interactive and engaging content that caters to diverse learner needs, aiming to improve recall rates across all gender groups.
    • Encourage feedback from learners to continuously refine and adapt content, ensuring it meets the needs of all users effectively.

By focusing on these recommendations, you can enhance the learning experience, address variability in recall rates, and provide more effective and personalized support for learners across different languages and gender groups.

Hypothesis 16 :- The hypothesis is that the time of day (hour) may have a significant impact on the success rate of history recall, with certain hours showing higher or lower success rates.¶

Aggregates mean on success_rate_history restricted on hour¶

In [95]:
hourly_success_rate = dl.groupby('hour')['success_rate_history'].mean().reset_index()
hourly_success_rate = hourly_success_rate.sort_values('hour')
hourly_success_rate
Out[95]:
hour success_rate_history
0 0 0.899899
1 1 0.901693
2 2 0.901119
3 3 0.901233
4 4 0.898319
5 5 0.897596
6 6 0.900563
7 7 0.898600
8 8 0.902225
9 9 0.901200
10 10 0.900115
11 11 0.898766
12 12 0.901242
13 13 0.902201
14 14 0.900948
15 15 0.902254
16 16 0.901973
17 17 0.901663
18 18 0.899930
19 19 0.902748
20 20 0.902178
21 21 0.901185
22 22 0.899891
23 23 0.899842
  • The above code provides DataFrame hourly_success_rate that shows the average success rate for each hour of the day, ordered from the earliest to the latest hour.

Chart on hourly_success_rate¶

In [96]:
# Line plot for success rate by hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_success_rate, x='hour', y='success_rate_history', marker='o', color='blue')
plt.title('Success Rate by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Success Rate')
plt.xticks(range(0, 24))  # Ensure all hours are shown
plt.grid(True)
plt.show()
No description has been provided for this image
  • The line plot has been employed for depicted hourly_success_rate.

Analysis¶

The line plot shows the success rate for each hour of the day, providing insights into how learner performance varies throughout the 24-hour period.

Key Observations:

  1. Overall Performance:
    • The success rates are fairly consistent across different hours, ranging from approximately 0.90 to 0.90, indicating stable performance regardless of the time.
  2. Peak Performance Hours:
    • Hour 8: One of the peaks in the success rate is observed around 8 AM (0.902). This might suggest that learners perform well in the morning.
    • Hour 15 and 19: Other notable peaks are around 3 PM and 7 PM, with success rates around 0.902 and 0.903 respectively. These could be optimal study times for learners.
  3. Dip in Performance:
    • Hour 4 and 5: A slight dip in success rates is observed around 4 AM and 5 AM (0.898), possibly indicating that early morning hours might not be as effective for learning.

Recommendations¶

  1. Optimal Study Times:

    • Encourage learners to engage in study sessions during peak performance hours, such as around 8 AM, 3 PM, and 7 PM, to maximize their success rates.
    • Consider scheduling live classes, webinars, or interactive sessions during these optimal hours to leverage higher performance levels.
  2. Targeted Support:

    • Provide additional support and resources during hours with lower success rates, such as 4 AM and 5 AM. This could include offering motivational content, shorter study sessions, or interactive learning activities to keep learners engaged.
  3. Personalized Learning Plans:

    • Develop personalized learning plans that take into account the individual learner's optimal performance times. Encouraging learners to study during their most productive hours can enhance overall success rates.

By following these recommendations, you can help learners optimize their study schedules, improve their performance, and achieve better outcomes in their language learning journey.

Aggregates mean on p_recall restricted on hour¶

In [97]:
hourly_recall_rate = dl.groupby('hour')['p_recall'].mean().reset_index()
hourly_recall_rate = hourly_recall_rate.sort_values('hour')
hourly_recall_rate
Out[97]:
hour p_recall
0 0 0.896413
1 1 0.898376
2 2 0.898629
3 3 0.898826
4 4 0.895443
5 5 0.891778
6 6 0.897451
7 7 0.893905
8 8 0.897472
9 9 0.896174
10 10 0.893066
11 11 0.893905
12 12 0.895460
13 13 0.897065
14 14 0.894525
15 15 0.898139
16 16 0.898184
17 17 0.897501
18 18 0.895011
19 19 0.897626
20 20 0.897241
21 21 0.897527
22 22 0.894974
23 23 0.894339
  • The above code provides DataFrame hourly_recall_rate that shows the average success rate for each hour of the day, ordered from the earliest to the latest hour.

Chart on hourly_recall_rate¶

In [98]:
# Line plot for Recall rate by hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_recall_rate, x='hour', y='p_recall', marker='o', color='blue')
plt.title('Recall Rate by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Recall Rate')
plt.xticks(range(0, 24))  # Ensure all hours are shown
plt.grid(True)
plt.show()
No description has been provided for this image
  • The line plot has been employed for depicted p_recall.

Analysis:¶

  1. Overall Recall Rate:

    • The recall rates fluctuate throughout the day, staying within a narrow range (approximately 0.89 to 0.90), indicating relatively consistent performance.
  2. Peak Recall Hours:

    • The highest recall rates are observed around 3 AM (0.898826) and 1 AM (0.898376).
    • Another peak is around 3 PM (0.898139).
  3. Lowest Recall Hours:

    • The lowest recall rates are observed around 5 AM (0.891778) and 4 AM (0.895443).
    • There's also a dip around 10 AM (0.893066) and 7 AM (0.893905).

Recommendations:¶

  1. Optimize Study Sessions:

    • Schedule critical learning sessions during peak recall hours to maximize retention.
    • Avoid scheduling important sessions during low recall hours, such as early morning (4 AM - 5 AM).
  2. Balanced Study Plan:

    • Distribute learning activities evenly throughout the day to maintain a balanced recall rate.
    • Incorporate short, frequent review sessions during high recall periods to reinforce learning.
  3. Personalized Learning Schedules:

    • Tailor learning schedules to individual preferences and peak performance times, leveraging data on recall rates.
    • Encourage learners to identify their personal peak hours and align their study plans accordingly.

By implementing these strategies, you can enhance learning efficiency, improve retention, and create more effective study schedules.

Peak and lowest learning hour with success rate¶

In [99]:
# Find the hour with the highest success rate
peak_hour = hourly_success_rate.loc[hourly_success_rate['success_rate_history'].idxmax()]
print(f"Peak learning hour: {peak_hour['hour']} with success rate: {peak_hour['success_rate_history']:.2f}")

# Find the hour with the lowest success rate
lowest_hour = hourly_success_rate.loc[hourly_success_rate['success_rate_history'].idxmin()]
print(f"Lowest learning hour: {lowest_hour['hour']} with success rate: {lowest_hour['success_rate_history']:.2f}")
Peak learning hour: 19.0 with success rate: 0.90
Lowest learning hour: 5.0 with success rate: 0.90
  • The code searches for the peak learning hour (hour with the highest success rate) and displays it along with the success rate.
  • Similarly, it identifies the lowest learning hour (hour with the lowest success rate) and displays it along with the success rate.

Hypothesis 17 :- The user engagement varies by hour of the day, with certain hours exhibiting higher or lower engagement levels.¶

Getting on Hourly Users Count¶

In [100]:
hourly_count = dl['hour'].value_counts().sort_index()
print(hourly_count)
0     188715
1     189609
2     188554
3     172621
4     145875
5     111427
6      80684
7      75318
8      69965
9      76519
10     89185
11     96127
12    114489
13    135089
14    153733
15    175107
16    199630
17    207284
18    220926
19    220805
20    230840
21    234093
22    221128
23    198035
Name: hour, dtype: int64
  • This code shows how many times each hour of the day appears in the dataset, sorted in order of the hour (from 0 to 23).

Chart on Hourly Users Count¶

In [101]:
plt.figure(figsize=(10, 6))
plt.fill_between(hourly_count.index, hourly_count.values, color='lightgreen', alpha=0.4)
plt.plot(hourly_count.index, hourly_count.values, color='forestgreen', linewidth=2)
plt.title('Hourly Count Area Chart', fontsize=16)
plt.xlabel('Hour of the Day', fontsize=14)
plt.ylabel('Number of Learning Sessions', fontsize=14)
plt.grid(alpha=0.5)
plt.show()
No description has been provided for this image
  • The line plot has been employed for depicting hourly_count.

Analysis:¶

The area chart visualizes the number of learning sessions throughout the day, highlighting periods of high and low activity.

Key Observations:

  1. Early Morning Dip:

    • The number of learning sessions starts high at midnight and gradually decreases to its lowest point around 7 AM.
    • This suggests that early morning hours are less popular for learning activities.
  2. Steady Increase and Peak:

    • After 7 AM, the number of learning sessions steadily increases, peaking around 9 PM.
    • This indicates that learners are more active in the evening hours.
  3. Late Night Activity:

    • There's a slight decline after 9 PM, but the activity remains relatively high until midnight.
    • This shows that many learners prefer studying late at night.

Recommendations:¶

  1. Optimize Learning Content Delivery:

    • Schedule important lessons, webinars, and interactive sessions during peak hours (evening and late night) to maximize engagement.
    • Utilize early morning hours for light, refresher content or motivational materials to gradually engage learners.
  2. Targeted Engagement Strategies:

    • Implement targeted engagement strategies during low-activity hours (early morning). This could include push notifications, reminders, or gamified learning to encourage learners to study during these times.
  3. Personalized Learning Schedules:

    • Encourage learners to identify their personal optimal study times based on their performance and engagement patterns, and create personalized learning schedules accordingly.

By implementing these strategies, you can enhance learner engagement, optimize content delivery, and support effective learning habits across different times of the day.

Hypothesis 18 :- Engagement and recall rates may vary by hour, indicating potential patterns in user activity and learning effectiveness throughout the day.¶

Getting on Hour engagement¶

In [102]:
# Group by hour and calculate total history seen for engagement by hour
hourly_engagement = dl.groupby('hour')['history_seen'].sum().reset_index()
hourly_engagement
Out[102]:
hour history_seen
0 0 4294604
1 1 3979716
2 2 4460718
3 3 5276017
4 4 4282354
5 5 3367891
6 6 1255401
7 7 1212662
8 8 1141606
9 9 1504221
10 10 2279598
11 11 2133283
12 12 2619809
13 13 3025397
14 14 3197393
15 15 3309489
16 16 3964395
17 17 4441441
18 18 4695238
19 19 4845972
20 20 5361548
21 21 4408670
22 22 4393974
23 23 3968790
  • The above code provides DataFrame hourly_engagement that shows the sum of history_seen for each hour of the day, ordered from the earliest to the latest hour.

Plot on Hourly Engagement¶

In [103]:
# Plot for both hourly engagement and hourly recall on the same chart
plt.figure(figsize=(12, 6))

# Plot total history seen (engagement)
plt.plot(hourly_engagement['hour'], hourly_engagement['history_seen'], marker='o', label='Total Engagement (History Seen)', color='blue')

# Add titles, labels, and grid
plt.title('Hourly Engagement')
plt.xlabel('Hour')
plt.ylabel('Value')
plt.grid(True)
plt.legend()

# Set x-ticks for each hour (0 to 23)
plt.xticks(range(24))

# Show the plot
plt.show()
No description has been provided for this image
  • plt.plot() used to create a line plot.
  • hourly_engagement['hour'] This is the x-axis of the plot, representing the hour of the day (0-23).
  • hourly_engagement['history_seen'] This is the y-axis of the plot, representing the total engagement (history seen) for each hour.
  • marker='o' adds circular markers at each data point along the line, helping to visualize each data point more clearly.

Analysis:¶

The line plot of hourly engagement (total history seen) provides insights into user activity throughout the day.

Key Observations:

  1. Early Morning Peak:
    • There is a significant peak around 3 AM with a total engagement of over 5 million, indicating a high level of activity during late-night hours.
  2. Morning Dip:
    • A noticeable dip in engagement is observed around 6 AM, with total engagement dropping below 2 million. This could reflect a time when users are less active, likely due to sleep.
  3. Evening Peak:
    • Another peak is observed around 8 PM with total engagement exceeding 5 million, indicating high user activity during the evening hours.
  4. Consistent Engagement:
    • Engagement remains relatively high and consistent throughout the day, particularly from the late afternoon through the evening.

Recommendations:¶

  1. Optimal Content Delivery Times:

    • Schedule important lessons, live sessions, or interactive activities during peak hours (late night and evening) to maximize user engagement.
    • Utilize the early morning dip for lighter content, reminders, or motivational messages to gradually re-engage users.
  2. Targeted Notifications:

    • Send targeted notifications or reminders during periods of lower engagement (early morning) to encourage users to return to the platform and maintain consistent activity.
  3. Personalized Learning Plans:

    • Develop personalized learning schedules that align with individual users' peak activity times, promoting effective and efficient study habits.
  4. Interactive and Engaging Content:

    • Enhance learning materials with interactive and engaging elements, particularly during high activity periods, to retain user interest and motivation.

By implementing these strategies, you can enhance user engagement, optimize content delivery, and support effective learning habits across different times of the day.

Getting hourly_engagement_recall Column¶

In [104]:
hourly_engagement_recall = dl.groupby('hour')['p_recall'].mean().reset_index()
hourly_engagement_recall
Out[104]:
hour p_recall
0 0 0.896413
1 1 0.898376
2 2 0.898629
3 3 0.898826
4 4 0.895443
5 5 0.891778
6 6 0.897451
7 7 0.893905
8 8 0.897472
9 9 0.896174
10 10 0.893066
11 11 0.893905
12 12 0.895460
13 13 0.897065
14 14 0.894525
15 15 0.898139
16 16 0.898184
17 17 0.897501
18 18 0.895011
19 19 0.897626
20 20 0.897241
21 21 0.897527
22 22 0.894974
23 23 0.894339
  • dl.groupby('hour')['p_recall'].mean().reset_index() derives the mean on p_recall for each hour of the day, ordered from the earliest to the latest hour.

Plot on hourly_engagement_recall¶

In [105]:
# Plot for both hourly engagement and hourly recall on the same chart
plt.figure(figsize=(12, 6))

# Plot total history seen (engagement)
plt.plot(hourly_engagement_recall['hour'], hourly_engagement_recall['p_recall'], marker='o', label='Total Engagement on Recall', color='blue')

# Add titles, labels, and grid
plt.title('hourly_engagement_recall')
plt.xlabel('Hour')
plt.ylabel('p_recall')
plt.grid(True)
plt.legend()

# Set x-ticks for each hour (0 to 23)
plt.xticks(range(24))

# Show the plot
plt.show()
No description has been provided for this image
  • Visualizing the line plot for the column hourly_engagement_recall.

Analysis:¶

The line graph titled "hourly_engagement_recall" visualizes the average recall probability (p_recall) for each hour of the day. Here are the insights derived from the data and visualization:

  1. Consistent Recall Performance:

    • The recall probabilities (p_recall) show relatively consistent performance throughout the day, with values ranging between approximately 0.89 and 0.90.
  2. Early Morning Dip:

    • There is a noticeable dip in recall probability around 4 AM and 5 AM, with the lowest value at 0.891778 around 5 AM. This suggests that early morning hours may not be optimal for recall performance.
  3. Peak Recall Hours:

    • The highest recall probabilities are observed around 3 AM (0.898826) and 3 PM (0.898139). These peak hours indicate times when learners tend to have better recall performance.

Recommendations:¶

  1. Optimize Study Sessions:

    • Schedule critical learning activities, reviews, and quizzes during peak recall hours (e.g., 3 AM and 3 PM) to maximize retention and recall performance.
  2. Avoid Low Recall Hours:

    • Avoid scheduling essential learning activities during low recall hours (e.g., 4 AM and 5 AM) to ensure learners are engaging with the material when their recall performance is optimal.
  3. Balanced Study Plan:

    • Encourage learners to adopt a balanced study plan that incorporates both peak and non-peak hours, ensuring consistent engagement and minimizing the impact of low recall periods.
  4. Targeted Interventions:

    • Implement targeted interventions during early morning dips to help learners improve recall during these times. This could include shorter, interactive sessions or gamified learning activities to boost engagement.

By following these recommendations, you can enhance learning outcomes, improve retention, and create more effective study schedules for learners.

Deriving Correlation Between hourly_engagement and hourly_engagement_recall¶

In [106]:
hourly_engagement_corr = hourly_engagement.merge(hourly_engagement_recall, on='hour')
hourly_engagement_corr_corr = hourly_engagement_corr['history_seen'].corr(hourly_engagement_corr['p_recall'])
print(f"Correlation between hourly engagement and learning success: {hourly_engagement_corr_corr:.2f}")
Correlation between hourly engagement and learning success: 0.32
  • .merge() joins the two DataFrames based on the shared column hour, which represents the hour of the day.
  • The method .corr() is used to compute the Pearson correlation coefficient between two columns in the engagement_success_corr DataFrame.
  • The Pearson correlation measures the linear relationship between two variables:
  • A value close to 1 indicates a strong positive correlation (as one increases, the other also increases).
  • A value close to -1 indicates a strong negative correlation (as one increases, the other decreases).
  • A value close to 0 indicates no linear correlation.

Analysis of Hourly Engagement and Learning Success Correlation¶

  • Correlation Overview:
    • The correlation between hourly engagement (history seen) and learning success (measured by p_recall) is 0.32, indicating a weak positive relationship.

Key Insights¶

  1. Weak Positive Correlation:

    • A 0.32 correlation suggests that as users engage more frequently, there is a slight increase in their learning success. However, this correlation is relatively weak, meaning that engagement alone does not fully explain the variability in learning outcomes.
  2. Diminishing Returns on Engagement:

    • The weak correlation implies that increased engagement might not always lead to significant improvements in recall rates. There may be diminishing returns from engagement, where other factors like content quality, learning strategies, or learner characteristics could play a larger role in success.
  3. Potential for More Insight:

    • The correlation value indicates that while engagement might have some influence, it's not the dominant factor. Additional data or features (such as learner behavior patterns, content difficulty, or feedback quality) could help identify why engagement doesn't lead to stronger improvements in success rates.

Recommendations¶

  1. Increase Engagement Quality:

    • Focus not just on quantity of engagement, but on making each engagement more impactful. For instance, introduce interactive or adaptive learning techniques that align more closely with learners' progress.
  2. Personalized Learning:

    • Incorporate data on user learning patterns (e.g., preferred learning times, pace, difficulty levels) to offer more personalized learning experiences. This could boost engagement quality, and in turn, success rates.
  3. Explore Other Influencing Factors:

    • Investigate other variables that could be contributing more strongly to learning success (e.g., time spent per session, content type, or difficulty level) to build a more comprehensive model of learner success.
  4. Track Engagement Duration:

    • Explore the relationship between time spent on learning rather than just the number of sessions or history seen, as longer, focused sessions might correlate more strongly with success.

In summary, while engagement does contribute to learning success, it should be complemented with other strategies and factors to optimize overall performance.

Hypothesis 19 :- The frequency of learning sessions across different languages varies significantly based on the delta days category, with certain time intervals showing higher engagement for specific languages.¶

Aggregating the Count on each delta_days_category considering learning_language_Abb¶

In [107]:
count_days_return_language = dl.groupby(["learning_language_Abb", "delta_days_category"]).size().reset_index(name="count")
heatmap_data = count_days_return_language.pivot(
    index='learning_language_Abb', 
    columns='delta_days_category', 
    values='count'
)
heatmap_data
Out[107]:
delta_days_category Less than a day Over a month Within a month Within a week Zero
learning_language_Abb
English 850154 92163 183263 351786 2560
French 291639 35138 73636 151327 964
German 206231 32629 65061 120992 520
Italian 142301 6463 26592 62506 99
Portuguese 50524 4774 11566 25072 120
Spanish 478871 75805 164709 287114 1179
  • dl.groupby(["learning_language_Abb", "delta_days_category"]) creates groups for each unique combination of language and delta days category which counts the number of records in each group.
  • Using .pivot() the data is structured into pivot table.

Creating the Heatmap on heatmap_data¶

In [108]:
# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='Blues', linewidths=0.5, cbar_kws={'label': 'Count of Sessions'})
plt.title('Heatmap of Learning Sessions by Language and Return Days', fontsize=16)
plt.xlabel('Delta Days Category', fontsize=14)
plt.ylabel('Learning Language', fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
No description has been provided for this image
  • The heatmap visually represents the count of learning sessions for each language and delta days category.

Analysis:¶

  1. Engagement Patterns:

    • The highest engagement ("Less than a day" category) is observed for English (850,154 sessions), followed by Spanish (478,871 sessions), indicating these languages have the most frequent sessions.
    • The "Over a month" category shows significantly lower counts, with the highest for English (92,163 sessions) and the lowest for Italian (6,463 sessions), suggesting that long breaks between sessions are less common.
  2. Return Time Distributions:

    • The "Within a week" category has high engagement for English (351,786 sessions) and Spanish (287,114 sessions), reflecting frequent returns within short intervals.
    • The "Zero" category has the lowest counts across all languages, indicating that same-day returns are rare.
  3. Language-Specific Trends:

    • French and German also show notable engagement in the "Less than a day" and "Within a week" categories, suggesting consistent study habits.
    • Portuguese and Italian have lower overall engagement across all categories, highlighting a need for strategies to increase learner engagement.

Recommendations:¶

  1. Encourage Regular Study Habits:

    • Promote consistent study routines, particularly for less engaged languages like Portuguese and Italian, by highlighting the benefits of frequent practice.
  2. Targeted Interventions:

    • Implement targeted interventions for learners with long breaks (Over a month) to re-engage them and prevent extended absences.
    • Use reminders, notifications, and incentives to encourage regular returns within shorter intervals (Less than a day, Within a week).
  3. Enhance Engagement for Popular Languages:

    • For high-engagement languages like English and Spanish, develop advanced courses and interactive sessions to maintain learner interest and motivation.
    • Leverage the frequent engagement patterns to introduce community activities, challenges, and gamified content.

By following these recommendations, you can enhance learner engagement, reduce long absences, and support effective learning habits across all languages.

Hypothesis 20 :- The distribution of learning languages is influenced by the UI language preference, with certain learning languages being more frequently paired with specific UI languages.¶

Count on learning_language with ui_language¶

In [109]:
learning_language_ui_language_count = dl.groupby(['learning_language_Abb', 'ui_language_Abb']).size().reset_index(name='count')
learning_language_ui_language_count
Out[109]:
learning_language_Abb ui_language_Abb count
0 English Italian 123157
1 English Portuguese 282884
2 English Spanish 1073885
3 French English 552704
4 German English 425433
5 Italian English 237961
6 Portuguese English 92056
7 Spanish English 1007678
  • Understand the relationship between the languages users are learning and the languages they use in the interface.

Visualization on Learning_language vs ui_language¶

In [110]:
# Create a pivot table to count occurrences of 'learning_language_full' for each 'ui_language_full'
count_data = dl.groupby(['learning_language_Abb', 'ui_language_Abb']).size().unstack(fill_value=0)

# Plot the stacked bar chart
count_data.plot(kind='bar', stacked=True, figsize=(12, 8), colormap='tab20')

plt.title('Count of Learning Languages for Each UI Language', fontsize=16)
plt.xlabel('Learning Language', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
No description has been provided for this image
  • Groups the dataset dl by the columns learning_language_Abb and ui_language_Abb.
  • Uses .size() to count the number of occurrences for each unique combination of learning language and UI language.
  • .unstack(fill_value=0) converts the grouped data into a pivot table.
  • Create a plot using .plot() with using appropriate title and labels as mentioned.
  • dl.groupby(['grammar_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median']).round(2) aggregates the mean and median of p_recall that revolves over grammar_tag which categorized by plurality_tag.

Analysis:¶

The stacked bar chart displays the count of people learning different languages categorized by the user interface (UI) language they use. Here are the insights:

  1. English Learning Dominance:

    • English is the most learned language, with a significant number of learners using the Spanish UI (over 1 million), followed by Portuguese UI (around 282,884) and Italian UI (around 123,157). This indicates a widespread interest in learning English across different UI languages.
  2. Strong Presence of English UI:

    • Other languages such as French, German, Italian, Portuguese, and Spanish are primarily learned by users using the English UI. This suggests that English-speaking learners are interested in expanding their language skills to these languages.
  3. Spanish as a Popular Learning Language:

    • Spanish is the second most learned language, predominantly by users with the English UI (around 1 million). This reflects a high interest among English-speaking users to learn Spanish.

Recommendations:¶

  1. Leverage English UI Popularity:

    • Since many learners use the English UI to learn various languages, ensure the English UI is user-friendly, well-localized, and rich in features to support diverse learning experiences.
    • Enhance the English UI with additional tools, resources, and interactive content to keep learners engaged and motivated.
  2. Promote English Learning:

    • Utilize the popularity of the Spanish UI to promote English learning. Highlight the benefits of learning English, such as improved career opportunities, and offer specialized English courses tailored to Spanish speakers.
  3. Expand Language Offerings for Non-English UIs:

    • Develop and promote learning content for other languages in non-English UIs (e.g., Italian, Portuguese). This can help attract more learners from different linguistic backgrounds and increase engagement.
  4. Interactive and Cultural Content:

    • Enhance learning materials with interactive and cultural content to provide a more immersive learning experience. For example, incorporating cultural insights, language games, and real-world scenarios can make learning more engaging and effective.

By implementing these recommendations, you can optimize the learning experience, cater to diverse user preferences, and attract more learners across various languages and UI settings.

In [111]:
dl
Out[111]:
p_recall timestamp delta user_id learning_language ui_language lexeme_id lexeme_string history_seen history_correct ... lexeme_base grammar_tag gender_tag plurality_tag delta_days time delta_days_category success_rate_history time_d hour
0 1.0 2013-03-03 17:13:47 1825254 5C7 fr en 3712581f1a9fbc0894e22664992663e9 sur/sur<pr> 2 1 ... sur/sur Pronouns and Related NaN NaN 21.126 17:13:47 Within a month 0.500000 1900-01-01 17:13:47 17
1 1.0 2013-03-04 18:30:50 367 fWSx en es 0371d118c042c6b44ababe667bed2760 police/police<n><pl> 6 5 ... police/police Nouns Neuter Plural 0.004 18:30:50 Less than a day 0.833333 1900-01-01 18:30:50 18
2 0.0 2013-03-03 18:35:44 1329 hL-s de en 5fa1f0fcc3b5d93b8617169e59884367 hat/haben<vbhaver><pri><p3><sg> 10 10 ... hat/haben Verbs NaN Singular 0.015 18:35:44 Less than a day 1.000000 1900-01-01 18:35:44 18
3 1.0 2013-03-07 17:56:03 156 h2_R es en 4d77de913dc3d65f1c9fac9d1c349684 en/en<pr> 111 99 ... en/en Pronouns and Related NaN NaN 0.002 17:56:03 Less than a day 0.891892 1900-01-01 17:56:03 17
4 1.0 2013-03-05 21:41:22 257 eON es en 35f14d06d95a34607d6abb0e52fc6d2b caballo/caballo<n><m><sg> 3 3 ... caballo/caballo Nouns Masculine Singular 0.003 21:41:22 Less than a day 1.000000 1900-01-01 21:41:22 21
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3795775 1.0 2013-03-06 23:06:48 4792 iZ7d es en 84e18e86c58e8e61d687dfa06b3aaa36 soy/ser<vbser><pri><p1><sg> 6 5 ... soy/ser Verbs NaN Singular 0.055 23:06:48 Less than a day 0.833333 1900-01-01 23:06:48 23
3795776 1.0 2013-03-07 22:49:23 1369 hxJr fr en f5b66d188d15ccb5d7777a59756e33ad chiens/chien<n><m><pl> 3 3 ... chiens/chien Nouns Masculine Plural 0.016 22:49:23 Less than a day 1.000000 1900-01-01 22:49:23 22
3795777 1.0 2013-03-06 21:20:18 615997 fZeR it en 91a6ab09aa0d2b944525a387cc509090 voi/voi<prn><tn><p2><mf><pl> 25 22 ... voi/voi Pronouns and Related NaN Plural 7.130 21:20:18 Within a month 0.880000 1900-01-01 21:20:18 21
3795778 1.0 2013-03-07 07:54:24 289 g_D3 en es a617ed646a251e339738ce62b84e61ce are/be<vbser><pres> 32 30 ... are/be Verbs NaN NaN 0.003 07:54:24 Less than a day 0.937500 1900-01-01 07:54:24 7
3795779 1.0 2013-03-06 21:12:07 191 iiN7 pt en 4a93acdbafaa061fd69226cf686d7a2b café/café<n><m><sg> 3 3 ... café/café Nouns Masculine Singular 0.002 21:12:07 Less than a day 1.000000 1900-01-01 21:12:07 21

3795758 rows × 24 columns

In [ ]: