import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

dl = pd.read_csv("reduced_data_400mb (1).csv") # Loading the Data

dl

print("Dataset Information:")
dl.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3795780 entries, 0 to 3795779
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   p_recall           float64
 1   timestamp          object 
 2   delta              int64  
 3   user_id            object 
 4   learning_language  object 
 5   ui_language        object 
 6   lexeme_id          object 
 7   lexeme_string      object 
 8   history_seen       int64  
 9   history_correct    int64  
 10  session_seen       int64  
 11  session_correct    int64  
dtypes: float64(1), int64(5), object(6)
memory usage: 347.5+ MB

dl.columns

Index(['p_recall', 'timestamp', 'delta', 'user_id', 'learning_language',
       'ui_language', 'lexeme_id', 'lexeme_string', 'history_seen',
       'history_correct', 'session_seen', 'session_correct'],
      dtype='object')

dl.describe()

print("The Data Types of all Columns:")
dl.dtypes

The Data Types of all Columns:

p_recall             float64
timestamp             object
delta                  int64
user_id               object
learning_language     object
ui_language           object
lexeme_id             object
lexeme_string         object
history_seen           int64
history_correct        int64
session_seen           int64
session_correct        int64
dtype: object

rows, columns = dl.shape
print(f"The dataset contains {rows} rows and {columns} columns.")

The dataset contains 3795780 rows and 12 columns.

for column in dl.columns.tolist():
    print(f"No. of unique values in {column}:")
    print(dl[column].nunique())

No. of unique values in p_recall:
66
No. of unique values in timestamp:
334848
No. of unique values in delta:
808491
No. of unique values in user_id:
79694
No. of unique values in learning_language:
6
No. of unique values in ui_language:
4
No. of unique values in lexeme_id:
16244
No. of unique values in lexeme_string:
15864
No. of unique values in history_seen:
3784
No. of unique values in history_correct:
3308
No. of unique values in session_seen:
20
No. of unique values in session_correct:
21

value_counts_dict = {}

for column in dl.columns:
    value_counts_dict[column] = dl[column].value_counts()
    print(value_counts_dict[column])

1.000000    3187760
0.000000     266909
0.500000     132930
0.666667      84204
0.750000      49116
             ...   
0.111111          1
0.545455          1
0.272727          1
0.684211          1
0.375000          1
Name: p_recall, Length: 66, dtype: int64
2013-03-05 21:09:36    105
2013-03-07 22:36:13     97
2013-03-01 20:07:14     94
2013-03-06 14:35:04     88
2013-03-06 19:41:47     87
                      ... 
2013-03-07 14:20:21      1
2013-03-01 11:54:28      1
2013-03-02 03:39:04      1
2013-03-05 02:57:48      1
2013-03-05 19:53:28      1
Name: timestamp, Length: 334848, dtype: int64
165         3900
153         3854
164         3837
169         3771
160         3751
            ... 
4843625        1
579467         1
3626644        1
15992794       1
615997         1
Name: delta, Length: 808491, dtype: int64
bcH_    4202
h8n2    2527
cpBu    2417
ht1n    2396
g2Ev    2383
        ... 
iy7m       1
gGtD       1
g9rg       1
h6Lr       1
iTuP       1
Name: user_id, Length: 79694, dtype: int64
en    1479930
es    1007687
fr     552707
de     425433
it     237967
pt      92056
Name: learning_language, dtype: int64
en    2315850
es    1073889
pt     282884
it     123157
Name: ui_language, dtype: int64
827a8ecb89f9b59ac5c29b620a5d3ed6    36115
97e922f780d628eac638bea7a02bf496    28848
928787744a962cd4ec55c1b22cedc913    27224
b968b069e4e2c04848e9f8924e34c031    21842
a617ed646a251e339738ce62b84e61ce    20331
                                    ...  
4a8be4af945dbf28ff775b9ff933a5df        1
baae0bd8fddd341c208cb6aa04c237e8        1
f1f34a08001125f2337ed0a73f3a9f22        1
00e6f18dedcf8e59b7bf570beca9e80e        1
e370294ee020bc9db24865cac83a840e        1
Name: lexeme_id, Length: 16244, dtype: int64
a/a<det><ind><sg>                                 36115
is/be<vbser><pri><p3><sg>                         28848
eats/eat<vblex><pri><p3><sg>                      27224
we/prpers<prn><subj><p1><mf><pl>                  21842
are/be<vbser><pres>                               20331
                                                  ...  
telefonnummer/telefon<n>+nummer<n><f><sg><acc>        1
<*sf>/rumo<n><m><*numb>                               1
kanada/kanada<np><nt><sg><dat>                        1
ausstattung/ausstattung<n><f><sg><nom>                1
interior/interior<n><m><sg>                           1
Name: lexeme_string, Length: 15864, dtype: int64
3       498066
4       373188
2       339224
5       287130
6       240742
         ...  
3816         1
1399         1
3897         1
2550         1
3517         1
Name: history_seen, Length: 3784, dtype: int64
3       511967
2       416078
4       366865
5       278302
1       249961
         ...  
3602         1
3730         1
3675         1
1587         1
1855         1
Name: history_correct, Length: 3308, dtype: int64
1     2245672
2      781463
3      397726
4      191528
5       93618
6       41790
7       19199
8        9297
9        5441
10       3653
11       1895
12       1106
13        926
16        859
14        796
15        482
17        118
19        115
18         57
20         39
Name: session_seen, dtype: int64
1     2120333
2      737118
3      363098
0      266909
4      166656
5       75562
6       32588
7       14441
8        7370
9        4239
10       2604
11       1484
12       1029
13        786
14        602
15        434
16        332
17         90
18         51
19         40
20         14
Name: session_correct, dtype: int64

missing_value_count = dl.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count

Missing Values in Each Column:

p_recall             0
timestamp            0
delta                0
user_id              0
learning_language    0
ui_language          0
lexeme_id            0
lexeme_string        0
history_seen         0
history_correct      0
session_seen         0
session_correct      0
dtype: int64

duplicates = dl[dl.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")

Number of Duplicate Rows in the Dataset: 22

dl= dl.drop_duplicates()
dl

dl.shape

(3795758, 12)

dl.describe()

dl.dtypes

p_recall             float64
timestamp             object
delta                  int64
user_id               object
learning_language     object
ui_language           object
lexeme_id             object
lexeme_string         object
history_seen           int64
history_correct        int64
session_seen           int64
session_correct        int64
dtype: object

pd.options.mode.chained_assignment = None
dl["timestamp"]= pd.to_datetime(dl["timestamp"], format = '%Y-%m-%d %H:%M:%S')

# Select columns with numeric types (either float64 or int64)
numerical_columns = dl.select_dtypes(include=['float64', 'int64']).columns.tolist()
print(numerical_columns)

['p_recall', 'delta', 'history_seen', 'history_correct', 'session_seen', 'session_correct']

outliers = {}

for column in numerical_columns:
    Q1 = dl[column].quantile(0.10)
    Q3 = dl[column].quantile(0.90)
    IQR = Q3-Q1
    
    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q3 + 1.5*IQR
    
    outliers[column] = dl[column][(dl[column]<lower_bound) | (dl[column]>upper_bound)]
    print(outliers[column])

Series([], Name: p_recall, dtype: float64)
21          7688640
32          6915874
74          5092318
82          4855994
112         4664462
             ...   
3795641     9571218
3795644     4390588
3795693     4674316
3795718     4538280
3795763    13364296
Name: delta, Length: 149508, dtype: int64
3           111
20          535
73         2756
95          138
99          111
           ... 
3795548    2125
3795557     105
3795655    2019
3795686      95
3795701     359
Name: history_seen, Length: 132129, dtype: int64
3            99
20          510
73         2571
95          130
99          102
           ... 
3795548    1303
3795557     100
3795655    1855
3795686      86
3795701     333
Name: history_correct, Length: 130485, dtype: int64
113         7
157         9
196         9
232         9
263        14
           ..
3795563     7
3795597     7
3795655    13
3795731     8
3795766    20
Name: session_seen, Length: 43983, dtype: int64
113         7
157         9
196         9
232         8
263        12
           ..
3795391     8
3795597     7
3795655    13
3795731     7
3795766    20
Name: session_correct, Length: 33516, dtype: int64

for col, data in outliers.items():
    print(f"{col}: {len(data)}")

p_recall: 0
delta: 149508
history_seen: 132129
history_correct: 130485
session_seen: 43983
session_correct: 33516

for columns in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=dl[columns])
    plt.title('Boxplot for Numerical Columns')
    plt.xlabel(columns)
    plt.ylabel('Values')
    plt.xticks(rotation=45)
    plt.show()

pd.options.mode.chained_assignment = None
language_mapping = {
    'en': 'English',
    'fr': 'French',
    'es': 'Spanish',
    'it': 'Italian',
    'de': 'German',
    'pt': 'Portuguese',
}

# Create a new column 'learning_language_full' based on the mapping
dl.loc[:, 'learning_language_Abb'] = dl['learning_language'].map(language_mapping)

# Display the updated DataFrame
print(dl[['learning_language', 'learning_language_Abb']])

        learning_language learning_language_Abb
0                      fr                French
1                      en               English
2                      de                German
3                      es               Spanish
4                      es               Spanish
...                   ...                   ...
3795775                es               Spanish
3795776                fr                French
3795777                it               Italian
3795778                en               English
3795779                pt            Portuguese

[3795758 rows x 2 columns]

# Create a new column 'learning_language_full' based on the mapping
dl.loc[:, 'ui_language_Abb'] = dl['ui_language'].map(language_mapping)

# Display the updated DataFrame
print(dl[['ui_language', 'ui_language_Abb']])

        ui_language ui_language_Abb
0                en         English
1                es         Spanish
2                en         English
3                en         English
4                en         English
...             ...             ...
3795775          en         English
3795776          en         English
3795777          en         English
3795778          es         Spanish
3795779          en         English

[3795758 rows x 2 columns]

pd.options.mode.chained_assignment = None
dl['lexeme_base'] = dl['lexeme_string'].str.split('<', expand=True)[0]
dl['lexeme_base']

0                  sur/sur
1            police/police
2                hat/haben
3                    en/en
4          caballo/caballo
                ...       
3795775            soy/ser
3795776       chiens/chien
3795777            voi/voi
3795778             are/be
3795779          café/café
Name: lexeme_base, Length: 3795758, dtype: object

import re

# Updated grammar_tags for more precise matching patterns
grammar_categories = {
    'Pronouns and Related': ['pr', 'prn', 'preadv', 'predet', 'np', 'rel'],
    'Verbs': ['vbhaver', 'vbdo', 'vbser', 'vblex', 'vbmod', 'vaux', 'ord'],
    'Nouns': ['n', 'gen', 'apos'],
    'Determiners and Adjectives': ['det', 'adj', 'predet'],
    'Adverbs': ['adv'],
    'Interjections': ['ij', '@ij:'],  
    'Conjunctions': ['cnjcoo', 'cnjadv', 'cnjsub', '@cnj:'],  
    'Numbers and Quantifiers': ['num', 'ord'],
    'Negation': ['neg', '@neg:', '@common_phrases:', '@itg:'],  
    'Other': ['pprep', '@adv:', '@pr:', 'apos']  
}

# Reverse the dictionary to map tags to their respective grammar categories
tag_to_category_map = {tag: category for category, tags in grammar_categories.items() for tag in tags}

# Function to determine the grammar category based on prefixes
def grammar_tags(lexeme_string):
    # Extract tags inside '<>' and check if they start with any of the provided prefixes
    tags_in_string = re.findall(r'<(.*?)>', lexeme_string)  # Extract tags inside '<>'
    for tag in tags_in_string:
        for prefix in tag_to_category_map:
            if tag.startswith(prefix):  # Match based on prefix
                return tag_to_category_map[prefix]
    return 'Nan'  # Default if no match is found

# Applying the function to the DataFrame
dl['grammar_tag'] = dl['lexeme_string'].apply(lambda x: grammar_tags(x))

print(dl.head())

   p_recall           timestamp    delta user_id learning_language  \
0       1.0 2013-03-03 17:13:47  1825254     5C7                fr   
1       1.0 2013-03-04 18:30:50      367    fWSx                en   
2       0.0 2013-03-03 18:35:44     1329    hL-s                de   
3       1.0 2013-03-07 17:56:03      156    h2_R                es   
4       1.0 2013-03-05 21:41:22      257     eON                es   

  ui_language                         lexeme_id  \
0          en  3712581f1a9fbc0894e22664992663e9   
1          es  0371d118c042c6b44ababe667bed2760   
2          en  5fa1f0fcc3b5d93b8617169e59884367   
3          en  4d77de913dc3d65f1c9fac9d1c349684   
4          en  35f14d06d95a34607d6abb0e52fc6d2b   

                     lexeme_string  history_seen  history_correct  \
0                      sur/sur<pr>             2                1   
1             police/police<n><pl>             6                5   
2  hat/haben<vbhaver><pri><p3><sg>            10               10   
3                        en/en<pr>           111               99   
4        caballo/caballo<n><m><sg>             3                3   

   session_seen  session_correct learning_language_Abb ui_language_Abb  \
0             2                2                French         English   
1             2                2               English         Spanish   
2             1                0                German         English   
3             4                4               Spanish         English   
4             3                3               Spanish         English   

       lexeme_base           grammar_tag  
0          sur/sur  Pronouns and Related  
1    police/police                 Nouns  
2        hat/haben                 Verbs  
3            en/en  Pronouns and Related  
4  caballo/caballo                 Nouns

dl['grammar_tag'].value_counts()

Nouns                         1637545
Verbs                          856416
Determiners and Adjectives     606273
Pronouns and Related           435685
Adverbs                        133575
Conjunctions                    66891
Interjections                   45956
Other                            6821
Negation                         6585
Nan                                 9
Numbers and Quantifiers             2
Name: grammar_tag, dtype: int64

gender_tags = {
         "Masculine": "m",
         "Feminine": "f",
         "Neuter": "nt",
         "Masculine or Feminine (common gender)": "mf" 
    }

def get_gender_tag_key(lexeme_string):
    for key, tags in gender_tags.items():
        if any(f"<{tag}>" in lexeme_string for tag in tags):
            return key
    return np.nan  # Return NaN if no tags are found

# Apply the function to create the new column
dl['gender_tag'] = dl['lexeme_string'].apply(get_gender_tag_key)

dl['gender_tag'].value_counts()

Neuter       783971
Masculine    696207
Feminine     546620
Name: gender_tag, dtype: int64

plurality_tags = {
    "Singular":{"sg": "Singular"},
    "Plural":{"pl": "Plural"}
}

def get_plurality_tags_key(lexeme_string):
    for key, tags in plurality_tags.items():
        if any(f"<{tag}>" in lexeme_string for tag in tags):
            return key
    return np.nan  # Return NaN if no tags are found

# Apply the function to create the new column
dl['plurality_tag'] = dl['lexeme_string'].apply(get_plurality_tags_key)

dl['plurality_tag'].value_counts()

Singular    2233949
Plural       667294
Name: plurality_tag, dtype: int64

pd.options.mode.chained_assignment = None
dl['delta_days'] = dl['delta'] / 86400
dl['delta_days'] = dl['delta_days'].round(3)
dl['delta_days']

0          21.126
1           0.004
2           0.015
3           0.002
4           0.003
            ...  
3795775     0.055
3795776     0.016
3795777     7.130
3795778     0.003
3795779     0.002
Name: delta_days, Length: 3795758, dtype: float64

dl['time'] = dl['timestamp'].dt.time
dl['time']

0          17:13:47
1          18:30:50
2          18:35:44
3          17:56:03
4          21:41:22
             ...   
3795775    23:06:48
3795776    22:49:23
3795777    21:20:18
3795778    07:54:24
3795779    21:12:07
Name: time, Length: 3795758, dtype: object

#defing a function to apply for the column
def categorize_delta_days(value):
    if value == 0:
        return 'Zero'
    elif value > 0 and value <= 1:
        return 'Less than a day'
    elif value > 1 and value <= 7:
        return 'Within a week'
    elif value > 7 and value <= 30:
        return 'Within a month'
    else:
        return 'Over a month'

dl['delta_days_category'] = [categorize_delta_days(x) for x in dl['delta_days']]

dl['success_rate_history'] = dl['history_correct'] / dl['history_seen']
dl['success_rate_history']

0          0.500000
1          0.833333
2          1.000000
3          0.891892
4          1.000000
             ...   
3795775    0.833333
3795776    1.000000
3795777    0.880000
3795778    0.937500
3795779    1.000000
Name: success_rate_history, Length: 3795758, dtype: float64

dl['time'] = pd.to_datetime(dl['time'], errors= 'coerce', format='%H:%M:%S').dt.time
dl['time']

0          17:13:47
1          18:30:50
2          18:35:44
3          17:56:03
4          21:41:22
             ...   
3795775    23:06:48
3795776    22:49:23
3795777    21:20:18
3795778    07:54:24
3795779    21:12:07
Name: time, Length: 3795758, dtype: object

dl['time_d'] = pd.to_datetime(dl['time'], errors= 'coerce', format='%H:%M:%S')
dl['hour'] = dl['time_d'].dt.hour
dl['hour']

0          17
1          18
2          18
3          17
4          21
           ..
3795775    23
3795776    22
3795777    21
3795778     7
3795779    21
Name: hour, Length: 3795758, dtype: int64

dl

mean_success_rate_history = dl['success_rate_history'].mean()
median_success_rate_history = dl['success_rate_history'].median()
max_success_rate_history = dl['success_rate_history'].max()
min_success_rate_history = dl['success_rate_history'].min()

print(f"Mean Success Rate for History: {mean_success_rate_history:.3f}")
print(f"Median Success Rate for History: {median_success_rate_history:.3f}")
print(f"Max Success Rate for History: {max_success_rate_history}")
print(f"Min Success Rate for History: {min_success_rate_history}")

Mean Success Rate for History: 0.901
Median Success Rate for History: 0.963
Max Success Rate for History: 1.0
Min Success Rate for History: 0.05

mean_recall_rate_session = dl['p_recall'].mean()
median_recall_rate_session = dl['p_recall'].median()
max_recall_rate_session = dl['p_recall'].max()
min_recall_rate_session = dl['p_recall'].min()

print(f"Mean Recall Rate for Session: {mean_recall_rate_session:.3f}")
print(f"Median Recall Rate for Session: {median_recall_rate_session:.3f}")
print(f"Max Recall Rate for Session: {max_recall_rate_session}")
print(f"Min Recall Rate for Session: {min_recall_rate_session}")

Mean Recall Rate for Session: 0.896
Median Recall Rate for Session: 1.000
Max Recall Rate for Session: 1.0
Min Recall Rate for Session: 0.0

p_recall_range = dl["p_recall"].max()-dl["p_recall"].min()
delta_range = dl["delta"].max()-dl["delta"].min()
history_seen_range = dl["history_seen"].max()-dl["history_seen"].min()
history_correct_range = dl["history_correct"].max()-dl["history_correct"].min()
session_seen_range = dl["session_seen"].max()-dl["session_seen"].min()
session_correct_range = dl["session_correct"].max()-dl["session_correct"].min()

print(f"The Range of p_recall: {p_recall_range:.3f}")
print(f"The Range of delta: {delta_range:.3f}")
print(f"The Range of history_seen: {history_seen_range}")
print(f"The Range of history_correct: {history_correct_range}")
print(f"The Range of session_seen: {session_seen_range}")
print(f"The Range of session_correct: {session_correct_range}")

The Range of p_recall: 1.000
The Range of delta: 39649729.000
The Range of history_seen: 13441
The Range of history_correct: 12815
The Range of session_seen: 19
The Range of session_correct: 20

# Group by the 'learning_language' column and get unique lexeme_ids for each language
unique_lexemes_per_lang = dl.groupby('learning_language_Abb')['lexeme_id'].nunique().reset_index()

# Rename the column for better readability
unique_lexemes_per_lang.rename(columns={'lexeme_id': 'Unique Lexeme IDs'}, inplace=True)

# Display the result
print(unique_lexemes_per_lang)

  learning_language_Abb  Unique Lexeme IDs
0               English               2740
1                French               3429
2                German               3218
3               Italian               1750
4            Portuguese               2055
5               Spanish               3052

# Group by the 'learning_language' column and get unique lexeme_ids for each language
unique_words_per_lang = dl.groupby('learning_language_Abb')['lexeme_string'].nunique().reset_index()

# Rename the column for better readability
unique_words_per_lang.rename(columns={'lexeme_base': 'Unique words'}, inplace=True)

# Display the result
print(unique_words_per_lang)

  learning_language_Abb  lexeme_string
0               English           2740
1                French           3429
2                German           3218
3               Italian           1750
4            Portuguese           2055
5               Spanish           3052

user_engagement = dl.groupby("user_id")["history_seen"].sum().reset_index()
user_engagement = user_engagement.sort_values(by=["history_seen"], ascending=[ False])
user_engagement

# Correlation between user engagement (history_seen) and learning success (p_recall)
user_engagement_recall = dl.groupby('user_id')['p_recall'].mean().reset_index()
user_engagement_recall = user_engagement_recall.sort_values(by=['p_recall'], ascending=[ False])
user_engagement_recall

engagement_success_corr = user_engagement.merge(user_engagement_recall, on='user_id')
engagement_success_corr_corr = engagement_success_corr['history_seen'].corr(engagement_success_corr['p_recall'])
print(f"Correlation between user engagement and learning success: {engagement_success_corr_corr:.2f}")

Correlation between user engagement and learning success: -0.01

# Find the user with the highest engagement
highest_engagement_user = user_engagement.loc[user_engagement["history_seen"].idxmax()]

highest_engagement_user

user_id            bcH_
history_seen    3787808
Name: 3328, dtype: object

# Filter the dataset for the highest engaged user
user_data = dl[dl["user_id"] == highest_engagement_user["user_id"]]

# Calculate and display the mean and median of p_recall
mean_p_recall = user_data["p_recall"].mean()
median_p_recall = user_data["p_recall"].median()

print(f"Mean p_recall for highest engaged user: {mean_p_recall}, Median p_recall for highest engaged user: {median_p_recall}")

Mean p_recall for highest engaged user: 0.46417843355016153, Median p_recall for highest engaged user: 0.5

# Group by lexeme_id and calculate mean recall rate
lexeme_performance = dl.groupby('lexeme_id')['p_recall'].agg(['mean', 'median']).round(2)
lexeme_performance

# Group by lexeme_id and calculate success rate
lexeme_success = dl.groupby('lexeme_id')['success_rate_history'].agg(['mean', 'median']).round(2)
lexeme_success

# Calculate correlation between lexeme_id and recall rate
lexeme_difficulty_corr = dl.groupby('lexeme_id').agg({'p_recall': 'mean'}).reset_index()
lexeme_difficulty_corr

ui_language_stats_history_success_rate = dl.groupby('ui_language_Abb')['success_rate_history'].agg(['mean', 'median'])
ui_language_stats_history_success_rate = ui_language_stats_history_success_rate.reset_index()
ui_language_stats_history_success_rate

ui_language_stats_recall_rate = dl.groupby('ui_language_Abb')['p_recall'].agg(['mean', 'median'])
ui_language_stats_recall_rate = ui_language_stats_recall_rate.reset_index()
ui_language_stats_recall_rate

# Group by gender_tag and calculate mean recall rate for each gender
gender_performance = dl.groupby('gender_tag')['p_recall'].agg(['mean', 'median'])
gender_performance

# Group by gender_tag and calculate success rate
gender_success = dl.groupby('gender_tag')['success_rate_history'].agg(['mean', 'median'])
gender_success

# Compare performance across grammatical features (plurality_tag) and gender
gender_plurality_comparison = dl.groupby(['gender_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median'])
gender_plurality_comparison

user_id_consistency =  dl.groupby('user_id')['delta_days'].median().reset_index()
user_id_consistency = user_id_consistency.sort_values(by = 'delta_days', ascending = False)

user_id_consistency.rename(columns={'user_id': 'User ID', 'delta_days': 'Median Delta Days'}, inplace=True)

print(user_id_consistency)

      User ID  Median Delta Days
792        GB           458.9090
1576       TX           444.9910
1842       _2           423.2820
2439      bEO           411.3305
4197      bzm           409.2480
...       ...                ...
24529    gVkB             0.0000
59174    iQCu             0.0000
45411    i1sv             0.0000
17451    fefQ             0.0000
17571    fg7z             0.0000

[79694 rows x 2 columns]

user_id_consistency['Median Delta Days'].describe()

count    79694.000000
mean        12.856487
std         34.171674
min          0.000000
25%          0.039000
50%          2.026000
75%          8.973000
max        458.909000
Name: Median Delta Days, dtype: float64

quantile_90 = user_id_consistency['Median Delta Days'].quantile(0.90)

# Display the 90th quantile
print(f"The 90th percentile (quantile) for Median Delta Days is: {quantile_90:.2f}")

The 90th percentile (quantile) for Median Delta Days is: 30.13

# Filter rows where 'Median Delta Days' is greater than 100
Less_consistent_user = user_id_consistency[user_id_consistency['Median Delta Days'] > 30]

# Display the filtered DataFrame
print(Less_consistent_user)

      User ID  Median Delta Days
792        GB           458.9090
1576       TX           444.9910
1842       _2           423.2820
2439      bEO           411.3305
4197      bzm           409.2480
...       ...                ...
39727    hdRb            30.0040
13056    f-hA            30.0040
11184    eT8k            30.0030
18695    ftXa            30.0020
41123    hkY5            30.0010

[8025 rows x 2 columns]

# Filter rows where 'Median Delta Days' is greater than 100
most_consistent_user = user_id_consistency[user_id_consistency['Median Delta Days'] < 8]

# Display the filtered DataFrame
print(most_consistent_user)

      User ID  Median Delta Days
47936    iAJZ             7.9995
52441    iI4W             7.9990
38545    hZT8             7.9990
51604    iGoN             7.9990
6511     d2OP             7.9990
...       ...                ...
24529    gVkB             0.0000
59174    iQCu             0.0000
45411    i1sv             0.0000
17451    fefQ             0.0000
17571    fg7z             0.0000

[58361 rows x 2 columns]

# Step 1: Get the user_ids from Less_consistent_user
less_consistent_user_ids = Less_consistent_user['User ID']

# Step 2: Filter the original dataset (dl) for those user_ids
filtered_dl_l = dl[dl['user_id'].isin(less_consistent_user_ids)]

# Step 3: Group by 'user_id' and calculate required statistics
less_consistent_user_language_stats = (
    filtered_dl_l.groupby(['user_id', 'learning_language_Abb',])[['delta_days','p_recall', 'success_rate_history']]
    .median()
    .reset_index()
    .rename(columns={'p_recall': 'Median p_recall', 'delta_days': 'Median_delta_days','success_rate_history': 'Median Success Rate'})
)
less_consistent_user_language_stats = less_consistent_user_language_stats.sort_values(by = 'Median_delta_days', ascending = False)

# Display the resulting DataFrame
print(less_consistent_user_language_stats)

     user_id learning_language_Abb  Median_delta_days  Median p_recall  \
355       GB                German           458.9090         1.000000   
699       TX               Spanish           444.9910         0.666667   
805       _2               Spanish           423.2820         1.000000   
1033     bEO               Spanish           411.3305         1.000000   
1753     bzm               Spanish           409.2480         1.000000   
...      ...                   ...                ...              ...   
3541    e1x_                French             0.0020         1.000000   
3673    eApr               Spanish             0.0020         1.000000   
2829    dK31               English             0.0020         1.000000   
4751    f8Bu            Portuguese             0.0020         0.750000   
2851    dLoW            Portuguese             0.0020         0.900000   

      Median Success Rate  
355              0.833333  
699              0.666667  
805              0.833333  
1033             0.857143  
1753             0.894444  
...                   ...  
3541             0.800000  
3673             1.000000  
2829             1.000000  
4751             0.857143  
2851             1.000000  

[8193 rows x 5 columns]

# Step 1: Filter most consistent user_ids
most_consistent_user_ids = most_consistent_user['User ID']

# Step 2: Filter the original dataset (dl) for these user_ids
filtered_most_consistent_dl = dl[dl['user_id'].isin(most_consistent_user_ids)]

# Step 3: Group by 'user_id' and calculate required statistics
most_consistent_user_stats = (
    filtered_most_consistent_dl.groupby(['user_id'])
    .agg({
        'delta_days': 'median',
        'p_recall': 'median',
        'success_rate_history': 'median'
    })
    .reset_index()
    .rename(columns={
        'delta_days': 'Median Delta Days',
        'p_recall': 'Median p_recall',
        'success_rate_history': 'Median Success Rate'
    })
)

most_consistent_user_stats = most_consistent_user_stats.sort_values(by = 'Median Delta Days', ascending = False)
# Step 4: Merge with the language information
most_consistent_user_stats = most_consistent_user_stats.merge(
    dl[['user_id', 'learning_language_Abb']].drop_duplicates(),
    on='user_id',
    how='left'
)

# Display the resulting DataFrame
print(most_consistent_user_stats)

      user_id  Median Delta Days  Median p_recall  Median Success Rate  \
0        iAJZ             7.9995         1.000000             0.875000   
1        iGoN             7.9990         1.000000             1.000000   
2        iI4W             7.9990         1.000000             0.750000   
3        d2OP             7.9990         1.000000             1.000000   
4        hZT8             7.9990         1.000000             1.000000   
...       ...                ...              ...                  ...   
59968    d52D             0.0000         0.833333             0.833333   
59969    i5iF             0.0000         0.708333             0.708333   
59970    iDPq             0.0000         1.000000             1.000000   
59971    i5BG             0.0000         1.000000             0.878788   
59972    iQCu             0.0000         1.000000             1.000000   

      learning_language_Abb  
0                   English  
1                   English  
2                    German  
3                   Spanish  
4                Portuguese  
...                     ...  
59968               English  
59969               Spanish  
59970               Spanish  
59971               Spanish  
59972               English  

[59973 rows x 5 columns]

less_consistent_user_language_stats['learning_language_Abb'].value_counts()

English       2566
Spanish       2278
French        1573
German        1388
Italian        219
Portuguese     169
Name: learning_language_Abb, dtype: int64

most_consistent_user_stats['learning_language_Abb'].value_counts()

English       23055
Spanish       15740
French         9218
German         6905
Italian        3631
Portuguese     1424
Name: learning_language_Abb, dtype: int64

# Calculate the correlation between 'Median_delta_days' and 'Median p_recall'
correlation_p_recall = less_consistent_user_language_stats['Median_delta_days'].corr(
    less_consistent_user_language_stats['Median p_recall']
)

# Calculate the correlation between 'Median_delta_days' and 'Median Success Rate'
correlation_success_rate = less_consistent_user_language_stats['Median_delta_days'].corr(
    less_consistent_user_language_stats['Median Success Rate']
)

# Print the results
print(f"Correlation between Median_delta_days and Median p_recall: {correlation_p_recall:.2f}")
print(f"Correlation between Median_delta_days and Median Success Rate: {correlation_success_rate:.2f}")

Correlation between Median_delta_days and Median p_recall: -0.05
Correlation between Median_delta_days and Median Success Rate: 0.04

# Correlation between 'Median_delta_days' and 'Median p_recall'
correlation_p_recall_most = most_consistent_user_stats['Median Delta Days'].corr(
    most_consistent_user_stats['Median p_recall']
)

# Correlation between 'Median_delta_days' and 'Median Success Rate'
correlation_success_rate_most = most_consistent_user_stats['Median Delta Days'].corr(
    most_consistent_user_stats['Median Success Rate']
)

# Print the results
print(f"Correlation between Median_delta_days and Median p_recall (Most Consistent): {correlation_p_recall_most:.2f}")
print(f"Correlation between Median_delta_days and Median Success Rate (Most Consistent): {correlation_success_rate_most:.2f}")

Correlation between Median_delta_days and Median p_recall (Most Consistent): -0.03
Correlation between Median_delta_days and Median Success Rate (Most Consistent): 0.02

# Step 1: Group by 'user_id' and count unique 'lexeme_id' learned
user_lexeme_counts = (
    dl.groupby(['user_id','learning_language_Abb'])['lexeme_id']
    .nunique()
    .reset_index()
    .rename(columns={'lexeme_id': 'Unique Lexemes Learned'})
)

# Step 2: Sort by the count of unique lexemes in descending order
user_lexeme_counts = user_lexeme_counts.sort_values(by='Unique Lexemes Learned', ascending=False)

# Display the resulting DataFrame
print(user_lexeme_counts)

      user_id learning_language_Abb  Unique Lexemes Learned
81262     tJs               Spanish                     941
8867     dlpG                German                     899
13039    erXf               English                     867
25901    gZJc               English                     862
20467    g2Ev               English                     848
...       ...                   ...                     ...
40815    hdFN                German                       1
61977    iRVR                French                       1
46195    i17y                French                       1
75951    ipO1               English                       1
48151    i4P7                French                       1

[81711 rows x 3 columns]

grammar_stats_history = dl.groupby('grammar_tag')['success_rate_history'].agg(['mean', 'median']).round(2)
grammar_stats_history = grammar_stats_history.reset_index()
grammar_stats_history

grammar_stats_session = dl.groupby('grammar_tag')['p_recall'].agg(['mean', 'median']).round(2)
grammar_stats_session = grammar_stats_session.reset_index()
grammar_stats_session

# Compare recall rate and success rate across grammatical categories
tag_diff_comparison = dl.groupby(['grammar_tag', 'plurality_tag'])['p_recall'].agg(['mean', 'median']).round(2)
tag_diff_comparison

def value_counts(column_name):
    value_counts = dl[column_name].value_counts()
    value_counts_dl = pd.DataFrame(value_counts)

    return value_counts_dl

value_counts('learning_language_Abb')

# Data
learning_language_Abb = ['English', 'Spanish', 'French', 'German', 'Italian', 'Portuguese']
count = [1479926, 1007678, 552704, 425433, 237961, 92056]  # Count of learning_language_full
unique_lexeme_ids = [2740, 3052, 3429, 3218, 1750, 2055]  # Unique Lexeme IDs
colors = ['blue', 'orange', 'green', 'red', 'purple', 'brown']  # Different colors for languages

# Normalize counts for the x-axis
scaled_count = [x / 100000 for x in count]

# Create a figure
plt.figure(figsize=(12, 8))

# Scatter plot with unique colors for each language
plt.scatter(scaled_count, unique_lexeme_ids, color=colors, s=200, alpha=0.8, edgecolors='black')

# Annotate points with language names
for i, lang in enumerate(learning_language_Abb):
    plt.text(scaled_count[i], unique_lexeme_ids[i] + 50, lang, fontsize=10, ha='center')

# Add titles and labels
plt.title('Learning Language Count (Normalized) vs. Unique Lexeme IDs', fontsize=16)
plt.xlabel('Normalized Count of Learning Language Learners', fontsize=14)
plt.ylabel('Unique Lexeme IDs', fontsize=14)
plt.grid(alpha=0.5)

# Show the plot
plt.tight_layout()
plt.show()

language_stats_history_success_rate = dl.groupby('learning_language_Abb')['success_rate_history'].agg(['mean', 'median']).round(2)
language_stats_history_success_rate = language_stats_history_success_rate.reset_index()
language_stats_history_success_rate

language_stats_recall_rate = dl.groupby('learning_language_Abb')['p_recall'].agg(['mean', 'median']).round(2)
language_stats_recall_rate = language_stats_recall_rate.reset_index()
language_stats_recall_rate

# Merge the two datasets
merged_data = pd.merge(language_stats_history_success_rate, 
                       language_stats_recall_rate, 
                       on='learning_language_Abb', 
                       suffixes=('_success_rate', '_recall_rate'))

# Restructure the data to long format for easier plotting
long_data = pd.melt(
    merged_data,
    id_vars=['learning_language_Abb'],
    value_vars=['mean_success_rate', 'median_success_rate', 'mean_recall_rate', 'median_recall_rate'],
    var_name='Metric',
    value_name='Value'
)

# Define a mapping for better labels
metric_labels = {
    'mean_success_rate': 'Mean Success Rate',
    'median_success_rate': 'Median Success Rate',
    'mean_recall_rate': 'Mean Recall Rate',
    'median_recall_rate': 'Median Recall Rate'
}
long_data['Metric'] = long_data['Metric'].map(metric_labels)

# Plot the data
plt.figure(figsize=(12, 6))
for metric in long_data['Metric'].unique():
    subset = long_data[long_data['Metric'] == metric]
    plt.plot(subset['learning_language_Abb'], subset['Value'], label=metric)

# Customize the chart
plt.title('Mean and Median for Success Rate and Recall Rate by Learning Language', fontsize=14)
plt.xlabel('Learning Language', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(title='Metrics', fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Show the chart
plt.show()

def analyze_lan_distribution(dl, column_name, plot_color='skyblue'):
  
    # Perform value counts
    value_counts = dl[column_name].value_counts()
    value_counts_dl = pd.DataFrame(value_counts, columns=['Count'])
    value_counts_dl.index.name = column_name

    # Print the DataFrame
    print(f"\nDistribution of {column_name}:")
    print(value_counts_dl)

    # Plot the distribution
    plt.figure(figsize=(10, 6))
    value_counts.plot(kind='bar', color=plot_color)
    plt.title(f'Distribution of {column_name}', fontsize=16)
    plt.xlabel(column_name, fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.6)
    plt.tight_layout()
    plt.show()

    # Return the DataFrame
    return value_counts_dl

ui_language_count = analyze_lan_distribution(dl, 'learning_language_Abb', plot_color='yellow')
ui_language_count = analyze_lan_distribution(dl, 'ui_language_Abb', plot_color='orange')

Distribution of learning_language_Abb:
Empty DataFrame
Columns: [Count]
Index: []

Distribution of ui_language_Abb:
Empty DataFrame
Columns: [Count]
Index: []

# Pivot table to calculate average p_recall
pivot_table = dl.pivot_table(
    values='p_recall',
    index='learning_language_Abb',
    columns='delta_days_category',
    aggfunc='mean'
)

# Display the pivot table
print(pivot_table)

delta_days_category    Less than a day  Over a month  Within a month  \
learning_language_Abb                                                  
English                       0.907246      0.874983        0.884448   
French                        0.891713      0.848622        0.872736   
German                        0.907395      0.853919        0.875697   
Italian                       0.915333      0.887580        0.890401   
Portuguese                    0.914628      0.869350        0.887903   
Spanish                       0.909945      0.873059        0.886486   

delta_days_category    Within a week      Zero  
learning_language_Abb                           
English                     0.892298  0.930387  
French                      0.878085  0.918171  
German                      0.886278  0.940054  
Italian                     0.899162  0.924242  
Portuguese                  0.894537  0.907361  
Spanish                     0.894120  0.926428

# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Heatmap of Recall Probabilities (p_recall) by Language and Time Categories')
plt.xlabel('Delta Days Category')
plt.ylabel('Learning Language')
plt.show()

dl['delta_days'] = pd.to_numeric(dl['delta_days'], errors='coerce')
dl['p_recall'] = pd.to_numeric(dl['p_recall'], errors='coerce')
correlation = dl['delta_days'].corr(dl['p_recall'])
print(f"Correlation between delta_days and p_recall: {correlation:.2f}")

Correlation between delta_days and p_recall: -0.03

dl['delta_days'] = pd.to_numeric(dl['delta_days'], errors='coerce')
dl['success_rate_history'] = pd.to_numeric(dl['success_rate_history'], errors='coerce')
correlation = dl['delta_days'].corr(dl['success_rate_history'])
print(f"Correlation between delta_days and success_rate_history: {correlation:.2f}")

Correlation between delta_days and success_rate_history: 0.02

# Initialize a 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot
sc = ax.scatter(
    dl['history_correct'], 
    dl['session_correct'], 
    dl['delta_days'], 
    c=dl['delta_days'],  # Color based on delta_days
    cmap='viridis',      # Color map
    alpha=0.8            # Transparency
)

# Add labels
ax.set_title('3D Scatter Plot: History vs. Session vs. Delta Days')
ax.set_xlabel('History Correct')
ax.set_ylabel('Session Correct')
ax.set_zlabel('Delta Days')

# Add color bar
cbar = plt.colorbar(sc, pad=0.1)
cbar.set_label('Delta Days')

# Show plot
plt.show()

p_recall_days_category = dl.groupby("delta_days_category")["p_recall"].agg(["mean", "median"])
p_recall_days_category

# Extract data for plotting
categories = p_recall_days_category.index  # Delta days categories
mean_values = p_recall_days_category['mean']  # Mean of p_recall
median_values = p_recall_days_category['median']  # Median of p_recall

# Create the figure
plt.figure(figsize=(12, 6))

# Bar chart for mean values
plt.bar(categories, mean_values, color='skyblue', label='Mean', alpha=0.7)

# Line chart for median values
plt.plot(categories, median_values, color='red', marker='o', label='Median', linewidth=2)

# Add labels, title, and legend
plt.title('P_Recall by Delta Days Category', fontsize=16)
plt.xlabel('Delta Days Category', fontsize=14)
plt.ylabel('P_Recall', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.ylim(0.85, 1.02)  # Adjust the y-axis to focus on the range of p_recall
plt.legend(title='Metrics', fontsize=12)
plt.grid(alpha=0.5)

# Show the plot
plt.tight_layout()
plt.show()

p_recall_gender_language = dl.groupby(["learning_language_Abb", "gender_tag"])["p_recall"].mean().reset_index(name="count")
p_recall_gender_language = p_recall_gender_language.pivot(index='learning_language_Abb', 
    columns='gender_tag', values='count')
p_recall_gender_language

plt.figure(figsize=(10, 8))
sns.heatmap(p_recall_gender_language, annot=True, cmap='cividis', fmt='.2f', linewidths=0.5, cbar_kws={'label': 'Mean Recall Rate'})
plt.title('Mean p_recall by Learning Language and Gender', fontsize=16)
plt.xlabel('Gender Tag', fontsize=14)
plt.ylabel('Learning Language', fontsize=14)
plt.tight_layout()
plt.show()

hourly_success_rate = dl.groupby('hour')['success_rate_history'].mean().reset_index()
hourly_success_rate = hourly_success_rate.sort_values('hour')
hourly_success_rate

# Line plot for success rate by hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_success_rate, x='hour', y='success_rate_history', marker='o', color='blue')
plt.title('Success Rate by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Success Rate')
plt.xticks(range(0, 24))  # Ensure all hours are shown
plt.grid(True)
plt.show()

hourly_recall_rate = dl.groupby('hour')['p_recall'].mean().reset_index()
hourly_recall_rate = hourly_recall_rate.sort_values('hour')
hourly_recall_rate

# Line plot for Recall rate by hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_recall_rate, x='hour', y='p_recall', marker='o', color='blue')
plt.title('Recall Rate by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Recall Rate')
plt.xticks(range(0, 24))  # Ensure all hours are shown
plt.grid(True)
plt.show()

# Find the hour with the highest success rate
peak_hour = hourly_success_rate.loc[hourly_success_rate['success_rate_history'].idxmax()]
print(f"Peak learning hour: {peak_hour['hour']} with success rate: {peak_hour['success_rate_history']:.2f}")

# Find the hour with the lowest success rate
lowest_hour = hourly_success_rate.loc[hourly_success_rate['success_rate_history'].idxmin()]
print(f"Lowest learning hour: {lowest_hour['hour']} with success rate: {lowest_hour['success_rate_history']:.2f}")

Peak learning hour: 19.0 with success rate: 0.90
Lowest learning hour: 5.0 with success rate: 0.90

hourly_count = dl['hour'].value_counts().sort_index()
print(hourly_count)

0     188715
1     189609
2     188554
3     172621
4     145875
5     111427
6      80684
7      75318
8      69965
9      76519
10     89185
11     96127
12    114489
13    135089
14    153733
15    175107
16    199630
17    207284
18    220926
19    220805
20    230840
21    234093
22    221128
23    198035
Name: hour, dtype: int64

plt.figure(figsize=(10, 6))
plt.fill_between(hourly_count.index, hourly_count.values, color='lightgreen', alpha=0.4)
plt.plot(hourly_count.index, hourly_count.values, color='forestgreen', linewidth=2)
plt.title('Hourly Count Area Chart', fontsize=16)
plt.xlabel('Hour of the Day', fontsize=14)
plt.ylabel('Number of Learning Sessions', fontsize=14)
plt.grid(alpha=0.5)
plt.show()

# Group by hour and calculate total history seen for engagement by hour
hourly_engagement = dl.groupby('hour')['history_seen'].sum().reset_index()
hourly_engagement

# Plot for both hourly engagement and hourly recall on the same chart
plt.figure(figsize=(12, 6))

# Plot total history seen (engagement)
plt.plot(hourly_engagement['hour'], hourly_engagement['history_seen'], marker='o', label='Total Engagement (History Seen)', color='blue')

# Add titles, labels, and grid
plt.title('Hourly Engagement')
plt.xlabel('Hour')
plt.ylabel('Value')
plt.grid(True)
plt.legend()

# Set x-ticks for each hour (0 to 23)
plt.xticks(range(24))

# Show the plot
plt.show()

hourly_engagement_recall = dl.groupby('hour')['p_recall'].mean().reset_index()
hourly_engagement_recall

# Plot for both hourly engagement and hourly recall on the same chart
plt.figure(figsize=(12, 6))

# Plot total history seen (engagement)
plt.plot(hourly_engagement_recall['hour'], hourly_engagement_recall['p_recall'], marker='o', label='Total Engagement on Recall', color='blue')

# Add titles, labels, and grid
plt.title('hourly_engagement_recall')
plt.xlabel('Hour')
plt.ylabel('p_recall')
plt.grid(True)
plt.legend()

# Set x-ticks for each hour (0 to 23)
plt.xticks(range(24))

# Show the plot
plt.show()

hourly_engagement_corr = hourly_engagement.merge(hourly_engagement_recall, on='hour')
hourly_engagement_corr_corr = hourly_engagement_corr['history_seen'].corr(hourly_engagement_corr['p_recall'])
print(f"Correlation between hourly engagement and learning success: {hourly_engagement_corr_corr:.2f}")

Correlation between hourly engagement and learning success: 0.32

count_days_return_language = dl.groupby(["learning_language_Abb", "delta_days_category"]).size().reset_index(name="count")
heatmap_data = count_days_return_language.pivot(
    index='learning_language_Abb', 
    columns='delta_days_category', 
    values='count'
)
heatmap_data

# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='Blues', linewidths=0.5, cbar_kws={'label': 'Count of Sessions'})
plt.title('Heatmap of Learning Sessions by Language and Return Days', fontsize=16)
plt.xlabel('Delta Days Category', fontsize=14)
plt.ylabel('Learning Language', fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

learning_language_ui_language_count = dl.groupby(['learning_language_Abb', 'ui_language_Abb']).size().reset_index(name='count')
learning_language_ui_language_count

# Create a pivot table to count occurrences of 'learning_language_full' for each 'ui_language_full'
count_data = dl.groupby(['learning_language_Abb', 'ui_language_Abb']).size().unstack(fill_value=0)

# Plot the stacked bar chart
count_data.plot(kind='bar', stacked=True, figsize=(12, 8), colormap='tab20')

plt.title('Count of Learning Languages for Each UI Language', fontsize=16)
plt.xlabel('Learning Language', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

dl

	p_recall	delta	history_seen	history_correct	session_seen	session_correct
count	3.795780e+06	3.795780e+06	3.795780e+06	3.795780e+06	3.795780e+06	3.795780e+06
mean	8.964675e-01	7.055116e+05	2.197719e+01	1.949662e+01	1.808655e+00	1.636209e+00
std	2.711188e-01	2.211979e+06	1.283616e+02	1.136178e+02	1.350644e+00	1.309628e+00
min	0.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	0.000000e+00
25%	1.000000e+00	5.180000e+02	3.000000e+00	3.000000e+00	1.000000e+00	1.000000e+00
50%	1.000000e+00	7.609500e+04	6.000000e+00	6.000000e+00	1.000000e+00	1.000000e+00
75%	1.000000e+00	4.346412e+05	1.500000e+01	1.300000e+01	2.000000e+00	2.000000e+00
max	1.000000e+00	3.964973e+07	1.344200e+04	1.281600e+04	2.000000e+01	2.000000e+01

	p_recall	delta	history_seen	history_correct	session_seen	session_correct
count	3.795758e+06	3.795758e+06	3.795758e+06	3.795758e+06	3.795758e+06	3.795758e+06
mean	8.964686e-01	7.055136e+05	2.197721e+01	1.949666e+01	1.808658e+00	1.636214e+00
std	2.711172e-01	2.211985e+06	1.283619e+02	1.136181e+02	1.350646e+00	1.309630e+00
min	0.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	0.000000e+00
25%	1.000000e+00	5.180000e+02	3.000000e+00	3.000000e+00	1.000000e+00	1.000000e+00
50%	1.000000e+00	7.609500e+04	6.000000e+00	6.000000e+00	1.000000e+00	1.000000e+00
75%	1.000000e+00	4.346410e+05	1.500000e+01	1.300000e+01	2.000000e+00	2.000000e+00
max	1.000000e+00	3.964973e+07	1.344200e+04	1.281600e+04	2.000000e+01	2.000000e+01

	user_id	history_seen
3328	bcH_	3787808
27016	goA	3479726
5998	cpBu	3263073
7448	dOig	1266491
1223	NPs	903094
...	...	...
17912	fkZR	1
318	6XJ	1
40813	hj3	1
25380	g_k-	1
29632	h2Xz	1

	user_id	p_recall
12841	exUs	1.0
17986	flbK	1.0
37145	hUni	1.0
17985	flax	1.0
71594	ijPr	1.0
...	...	...
60390	iRTg	0.0
14364	fAzQ	0.0
48288	iB2r	0.0
48361	iBFg	0.0
78581	k0S	0.0

	lexeme_id	p_recall
0	00022efc4121667defd065c88569e748	0.904762
1	00064bf8c1c3cefa80b193acf7b9fe1d	0.900000
2	000d2eb6a5658fa17f828c5fb0c66c11	0.952830
3	000f3063358c188d171d903ec5a7855c	0.849481
4	001635bcb24496cf2b27731c3708dbfa	0.940810
...	...	...
16239	ffeacb268a19c068cd8171938e5280a8	1.000000
16240	ffedebe922588f522094dd8eac320071	1.000000
16241	ffee4a0570d4eacc9c08f339c2bb11a7	1.000000
16242	fff70e9352d896105563156d7023d878	0.808757
16243	fff799d4d95e416db9dc07ca717b4ef9	0.910078

	p_recall	timestamp	delta	user_id	learning_language	ui_language	lexeme_id	lexeme_string	history_seen	history_correct	session_seen	session_correct
0	1.0	2013-03-03 17:13:47	1825254	5C7	fr	en	3712581f1a9fbc0894e22664992663e9	sur/sur<pr>	2	1	2	2
1	1.0	2013-03-04 18:30:50	367	fWSx	en	es	0371d118c042c6b44ababe667bed2760	police/police<n><pl>	6	5	2	2
2	0.0	2013-03-03 18:35:44	1329	hL-s	de	en	5fa1f0fcc3b5d93b8617169e59884367	hat/haben<vbhaver><pri><p3><sg>	10	10	1	0
3	1.0	2013-03-07 17:56:03	156	h2_R	es	en	4d77de913dc3d65f1c9fac9d1c349684	en/en<pr>	111	99	4	4
4	1.0	2013-03-05 21:41:22	257	eON	es	en	35f14d06d95a34607d6abb0e52fc6d2b	caballo/caballo<n><m><sg>	3	3	3	3
...	...	...	...	...	...	...	...	...	...	...	...	...
3795775	1.0	2013-03-06 23:06:48	4792	iZ7d	es	en	84e18e86c58e8e61d687dfa06b3aaa36	soy/ser<vbser><pri><p1><sg>	6	5	2	2
3795776	1.0	2013-03-07 22:49:23	1369	hxJr	fr	en	f5b66d188d15ccb5d7777a59756e33ad	chiens/chien<n><m><pl>	3	3	2	2
3795777	1.0	2013-03-06 21:20:18	615997	fZeR	it	en	91a6ab09aa0d2b944525a387cc509090	voi/voi<prn><tn><p2><mf><pl>	25	22	2	2
3795778	1.0	2013-03-07 07:54:24	289	g_D3	en	es	a617ed646a251e339738ce62b84e61ce	are/be<vbser><pres>	32	30	2	2
3795779	1.0	2013-03-06 21:12:07	191	iiN7	pt	en	4a93acdbafaa061fd69226cf686d7a2b	café/café<n><m><sg>	3	3	1	1

	mean	median
lexeme_id
00022efc4121667defd065c88569e748	0.90	1.0
00064bf8c1c3cefa80b193acf7b9fe1d	0.90	1.0
000d2eb6a5658fa17f828c5fb0c66c11	0.95	1.0
000f3063358c188d171d903ec5a7855c	0.85	1.0
001635bcb24496cf2b27731c3708dbfa	0.94	1.0
...	...	...
ffeacb268a19c068cd8171938e5280a8	1.00	1.0
ffedebe922588f522094dd8eac320071	1.00	1.0
ffee4a0570d4eacc9c08f339c2bb11a7	1.00	1.0
fff70e9352d896105563156d7023d878	0.81	1.0
fff799d4d95e416db9dc07ca717b4ef9	0.91	1.0

	ui_language_Abb	mean	median
0	English	0.898451	0.988764
1	Italian	0.910783	0.974359
2	Portuguese	0.905714	0.954545
3	Spanish	0.903697	0.948718

	ui_language_Abb	mean	median
0	English	0.894915	1.0
1	Italian	0.908159	1.0
2	Portuguese	0.897696	1.0
3	Spanish	0.898156	1.0

	mean	median
gender_tag
Feminine	0.896025	1.0
Masculine	0.899669	1.0
Neuter	0.907303	1.0

	mean	median
gender_tag
Feminine	0.897845	1.0
Masculine	0.899576	1.0
Neuter	0.911196	1.0

		mean	median
gender_tag	plurality_tag
Feminine	Plural	0.893904	1.0
Feminine	Singular	0.896369	1.0
Masculine	Plural	0.902440	1.0
Masculine	Singular	0.899193	1.0
Neuter	Plural	0.901168	1.0
Neuter	Singular	0.908342	1.0

	grammar_tag	mean	median
0	Adverbs	0.90	0.98
1	Conjunctions	0.88	0.91
2	Determiners and Adjectives	0.90	0.94
3	Interjections	0.95	1.00
4	Nan	0.78	1.00
5	Negation	0.89	1.00
6	Nouns	0.91	1.00
7	Numbers and Quantifiers	1.00	1.00
8	Other	0.89	1.00
9	Pronouns and Related	0.89	0.92
10	Verbs	0.90	0.95

		mean	median
grammar_tag	plurality_tag
Determiners and Adjectives	Plural	0.88	1.0
Determiners and Adjectives	Singular	0.89	1.0
Nouns	Plural	0.90	1.0
Nouns	Singular	0.90	1.0
Pronouns and Related	Plural	0.89	1.0
Pronouns and Related	Singular	0.90	1.0
Verbs	Plural	0.89	1.0
Verbs	Singular	0.90	1.0

	learning_language_Abb
English	1479926
Spanish	1007678
French	552704
German	425433
Italian	237961
Portuguese	92056

gender_tag	Feminine	Masculine	Neuter
learning_language_Abb
English	0.915208	0.892346	0.908488
French	0.883118	0.889324	0.885532
German	0.890919	0.890983	0.900045
Italian	0.913691	0.901427	0.899050
Portuguese	0.902538	0.911269	0.861442
Spanish	0.898183	0.906447	0.903529

	hour	success_rate_history
0	0	0.899899
1	1	0.901693
2	2	0.901119
3	3	0.901233
4	4	0.898319
5	5	0.897596
6	6	0.900563
7	7	0.898600
8	8	0.902225
9	9	0.901200
10	10	0.900115
11	11	0.898766
12	12	0.901242
13	13	0.902201
14	14	0.900948
15	15	0.902254
16	16	0.901973
17	17	0.901663
18	18	0.899930
19	19	0.902748
20	20	0.902178
21	21	0.901185
22	22	0.899891
23	23	0.899842

	hour	p_recall
0	0	0.896413
1	1	0.898376
2	2	0.898629
3	3	0.898826
4	4	0.895443
5	5	0.891778
6	6	0.897451
7	7	0.893905
8	8	0.897472
9	9	0.896174
10	10	0.893066
11	11	0.893905
12	12	0.895460
13	13	0.897065
14	14	0.894525
15	15	0.898139
16	16	0.898184
17	17	0.897501
18	18	0.895011
19	19	0.897626
20	20	0.897241
21	21	0.897527
22	22	0.894974
23	23	0.894339

	hour	history_seen
0	0	4294604
1	1	3979716
2	2	4460718
3	3	5276017
4	4	4282354
5	5	3367891
6	6	1255401
7	7	1212662
8	8	1141606
9	9	1504221
10	10	2279598
11	11	2133283
12	12	2619809
13	13	3025397
14	14	3197393
15	15	3309489
16	16	3964395
17	17	4441441
18	18	4695238
19	19	4845972
20	20	5361548
21	21	4408670
22	22	4393974
23	23	3968790

	mean	median
delta_days_category
Less than a day	0.906413	1.0
Over a month	0.868080	1.0
Within a month	0.882737	1.0
Within a week	0.890425	1.0
Zero	0.927669	1.0

delta_days_category	Less than a day	Over a month	Within a month	Within a week	Zero
learning_language_Abb
English	850154	92163	183263	351786	2560
French	291639	35138	73636	151327	964
German	206231	32629	65061	120992	520
Italian	142301	6463	26592	62506	99
Portuguese	50524	4774	11566	25072	120
Spanish	478871	75805	164709	287114	1179

	learning_language_Abb	ui_language_Abb	count
0	English	Italian	123157
1	English	Portuguese	282884
2	English	Spanish	1073885
3	French	English	552704
4	German	English	425433
5	Italian	English	237961
6	Portuguese	English	92056
7	Spanish	English	1007678

Graduation Project -- Duolingo Analysis¶

Business Problem Overview¶

Enhancing User Retention and Learning Effectiveness on Duolingo¶

Enhancing User Retention and Learning Effectiveness on Duolingo¶

Objective¶

Business Impact¶

Dataset Overview¶

Dataset Overview¶

Column Definitions¶

Analysis & Visualization¶

1. Importing and Cleaning Data¶

Importing Necessary Libraries¶

Loading the Dataset from google drive¶

Displaying Dataset Information¶

Displaying Column Names¶

Describing Dataset Information¶

Displaying Column Data Types¶

Checking the Shape of the Dataset¶

Checking the unique values in the Dataset¶

Checking for the Value Counts in the Dataset¶

2. Data Preparation¶

Checking for Missing/Null Values¶

Checking for Duplicate Values in the Dataset¶

Dropping the Duplicate Values in the Dataset¶

Checking the shape to ensure the drop of duplicates.¶

Describing Non-Duplicated data.¶

Checking Data Column types for Well Structured Analysis.¶

Changing Datatype of Inappropriate Columns.¶

Segregating Numerical Columns for Devicing Outliers.¶

Devicing outliers from the DataFrame¶

Handling the Outliers.¶

Data Distribution in Outliers.¶

Insights on Keeping Outliers For Our Analysis.¶

Creating New columns for better Analysis¶

Create Learning_Language_Abb column¶

Create ui_Language_Abb column¶

Extracting Column lexeme_base from lexeme_string¶

Extracting New column grammar_tag from lexeme_string¶

Extracting New column gender_tag from lexeme_string¶

Extracting New column plurality_tag from lexeme_string¶

Initializing New Column delta_days from delta column¶

Extracting Time from Timestamp column¶

Categorizing the delta_ways into a column delta_days_category¶

Deriving success_rate_history from two columns¶

Extracting column hour from time column¶

DataFrame dl¶

3. Exploratory Data Analysis (EDA)¶

Performing Statistical Analysis for some of the Derived columns¶

Statistical Analysis of success_rate_history¶

Statistical Analysis of p_recall¶

Range of Columns¶

Obtain the unique lexeme ids of all learning_language¶

Obtain the unique lexeme strings of all learning_language¶

Hypothesis 1:- Analyzing user engagement for each user aims to identify patterns, expecting higher engagement levels among a subset of users reflecting varying learning behaviors.¶

Computing User engagement¶

Computing User Engagement on Recall¶

Correlation between user engagement and learning success¶

Most Engaged User in the recorded time¶

Analysis:¶

Key Insights:¶

Recommendations:¶

Hypothesis 2:- The analysis aims to identify patterns in user engagement and recall, expecting higher engagement with easier lexemes. This reflects varying learning behaviors and the impact of different materials on user success.¶

Extract lexeme_performance¶

Extract lexeme_success¶

Lexeme Difficulty (Based on Recall)¶

Analysis¶

Key Insights¶

Recommendations¶

Hypothesis 3:- The analysis explores how UI languages influence both recall rates and success rates, hypothesizing that certain languages may lead to better or worse performance in learning. It expects differences in recall and success rates based on the user's preferred UI language.¶

Aggregating Mean and Median on success_rate_history for ui_language_Abb¶

Aggregating Mean and Median on p_recall for ui_language_Abb¶

Analysis¶

Key Insights¶

Recommendations¶

Hypothesis 4:- Gender and grammatical features, such as plurality, may influence learning performance. It suggests that gender and grammatical differences (e.g., singular vs plural) will show varying impacts on recall and success rates across different groups.¶

Deriving mean and median on p_recall for gender_tag¶

Deriving mean and median on success_rate_history for gender_tag¶

Aggregation on p_recall revolves through 'gender_tag' & 'plurality_tag'¶

Analysis¶

1. Recall Rate by Gender:¶

Create `Learning_Language_Abb` column¶

Create `ui_Language_Abb` column¶

Extracting Column `lexeme_base` from `lexeme_string`¶

Extracting New column `grammar_tag` from `lexeme_string`¶

Extracting New column `gender_tag` from `lexeme_string`¶

Extracting New column `plurality_tag` from `lexeme_string`¶

Initializing New Column `delta_days` from `delta` column¶

Extracting `Time` from `Timestamp` column¶

Categorizing the `delta_ways` into a column `delta_days_category`¶

Deriving `success_rate_history` from two columns¶

Extracting column `hour` from `time` column¶

DataFrame `dl`¶

Statistical Analysis of `success_rate_history`¶

Statistical Analysis of `p_recall`¶

Obtain the `unique lexeme ids` of all `learning_language`¶

Obtain the `unique lexeme strings` of all `learning_language`¶

Extract `lexeme_performance`¶

Extract `lexeme_success`¶

Deriving mean and median on `p_recall` for `gender_tag`¶

Deriving mean and median on `success_rate_history` for `gender_tag`¶

Aggregation on `p_recall` revolves through 'gender_tag' & 'plurality_tag'¶

Deriving `Median Delta Days` for each user¶

Grouped by `grammar_tag` on `success_rate_history` with mean and median¶

Grouped by `grammar_tag` on `p_recall` with mean and median¶

Derive `grammar_tag` on plural categories¶

Defining the Function `Value_counts`¶

Aggregating Mean and Median on `success_rate_history` for `learning_language_Abb`¶

Aggregating Mean and Median on `p_recall` for `learning_language_Abb`¶

Deriving Line Chart for `success_rate` and `p_recall` through mean and median¶

Correlation between `delta_days` and `p_recall`¶

Correlation between `delta_days` and `success_rate_history`¶