The case study includes the following key components:
- Problem statement
- Exploratory data analysis and preprocessing
- Training machine learning models
- GUI (graphical user interface) development
Problem Statement
The case study attempts to predict the approval outcome of project proposals submitted by teachers on DonorsChoose.org. By analyzing the text of the project descriptions and incorporating metadata about the project, the teacher, and the school, the study aims to build a predictive model. This model would let DonorsChoose.org efficiently identify projects that may require additional review before approval, thereby improving the vetting process with machine learning.
Exploratory Data Analysis and Preprocessing
a) Data analysis
The dataset for this analysis consists of two CSV files: "train_data.csv" and "resources.csv". To begin, we examine the contents of "train_data.csv". It contains 109,248 rows and 16 columns covering various features/attributes.
The dataset includes a wide range of data spanning the following categories:
Categorical features:
'id', 'teacher_id', 'teacher_prefix', 'school_state',
'project_submitted_datetime', 'project_grade_category',
'project_subject_categories', 'project_subject_subcategories', 'project_is_approved' (the target)
Text features:
'project_essay_1', 'project_essay_2',
'project_essay_3', 'project_essay_4', 'project_resource_summary'
The dataset contains missing (null) values in several fields.
Now let's shift our attention to the other file, resources.csv. This dataset contains 1,541,272 rows and 4 features/columns, comprising a mix of numerical, text, and identifier data.
There are no null values in this dataset.
b) Handling null values
Checking null values in teacher_prefix
According to the analysis, there are no duplicate rows based on the teacher_id feature, and since the dataset is large (109,248 rows), dropping the 3 rows with a missing prefix has no noticeable effect.
# Dropping rows with a null teacher_prefix
train_data_df.dropna(subset=['teacher_prefix'], inplace=True)
train_data_df.shape

# Checking null values
train_data_df.isnull().sum()
c) Data preprocessing
1. Handling project_subject_categories
Let's inspect the feature:
# Working on project_subject_categories
projct_subject_category = train_data_df["project_subject_categories"]
projct_subject_category.head(20)
The data is formatted as follows:
1) Literacy & Language to literacy_language
2) History & Civics, Health & Sports to history_civics health_sports
# Remove ampersands, split the strings by commas
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories'].str.replace("&", "").str.split(",")

# Remove leading/trailing spaces, convert to lowercase, replace spaces with underscores
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories_updated'].apply(lambda words: [word.strip().lower().replace(" ", "_") for word in words])

# Concatenate the categories into a single string
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories_updated'].apply(lambda words: " ".join(words))

# Replace double underscores with single underscores
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories_updated'].str.replace("__", "_")
2. Handling project_subject_subcategories
The data is formatted as follows:
1) ESL, Literacy to esl literacy
2) Civics & Government, Team Sports to civics_government team_sports
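The same string cleaning used for the main categories can be applied here. A minimal self-contained sketch (the sample rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative rows; the real column lives in train_data_df
df = pd.DataFrame({"project_subject_subcategories": [
    "ESL, Literacy",
    "Civics & Government, Team Sports",
]})

# Remove ampersands, split on commas, normalize each token, rejoin
df["project_subject_subcategories_updated"] = (
    df["project_subject_subcategories"]
    .str.replace("&", "", regex=False)
    .str.split(",")
    .apply(lambda words: " ".join(w.strip().lower().replace(" ", "_") for w in words))
    .str.replace("__", "_", regex=False)
)
print(df["project_subject_subcategories_updated"].tolist())
# ['esl literacy', 'civics_government team_sports']
```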
3. Handling project_grade_category
The data is formatted as follows:
1) Grades PreK-2 to grades_prek_2
2) Grades 3-5 to grades_3_5
3) Grades 6-8 to grades_6_8
4) Grades 9-12 to grades_9_12
# Split project_grade_category on spaces
train_data_df['project_grade_category_updated'] = train_data_df['project_grade_category'].str.split(" ")

# Lowercase each token, replace hyphens with underscores, and join with underscores
train_data_df['project_grade_category_updated'] = train_data_df['project_grade_category_updated'].apply(lambda words: "_".join(word.strip().lower().replace("-", "_") for word in words))

train_data_df.head()
4. Handling teacher_prefix
The data is formatted as follows:
Mrs. to mrs
Ms. to ms
Mr. to mr
Teacher to teacher
Dr. to dr
# Replacing Mrs., Ms., etc. with mrs, ms, etc.
train_data_df['teacher_prefix_upgrade'] = train_data_df['teacher_prefix'].str.replace(r"\.", "", regex=True).str.lower()
train_data_df.head()
5. Handling school_state
Converting the state codes to lowercase.
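A minimal sketch of this step; the column names follow the case study and the sample states are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"school_state": ["CA", "NY", "TX"]})
df["school_state_upgrade"] = df["school_state"].str.lower()
print(df["school_state_upgrade"].tolist())  # ['ca', 'ny', 'tx']
```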
6. Handling the essay features
Since the full essay is split across four features, project_essay_1 through project_essay_4, they need to be merged into a single feature.
The following essays are taken from id=p182444.

essay1:
\r\n\"True champions aren't always the ones that win, but those with the most guts.\" By Mia Hamm This quote best describes how the students at Cholla Middle School approach playing sports, especially for the girls and boys soccer teams. The teams are made up of 7th and 8th grade students, and most of them have not had the opportunity to play in an organized sport due to family financial difficulties. \r\nI teach at a Title One middle school in an urban neighborhood. 74% of our students qualify for free and reduced lunch and many come from very activity/ sport opportunity-poor homes. My students love to participate in sports to learn new skills and be apart of team atmosphere. My school lacks the funding to meet my students' needs and I am concerned that their lack of exposure will not prepare them for the participating in sports and teams in high school. By the end of the school year, the goal is to provide our students with an opportunity to learn a variety of soccer skills, and positive qualities of a person who actively participates on a team.

essay2:
The students on the campus come to school knowing they face an uphill battle when it comes to participating in organized sports. The players would thrive on the field, with the confidence from having the appropriate soccer equipment to play soccer to the best of their abilities. The students will experience how to be a helpful person by being part of team that teaches them to be positive, supportive, and encouraging to others. \r\nMy students will be using the soccer equipment during practice and games on a daily basis to learn and practice the necessary skills to develop a strong soccer team. This experience will create the opportunity for students to learn about being part of a team, and how to be a positive contribution for their teammates. The students will get the opportunity to learn and practice a variety of soccer skills, and how to use those skills during a game. Access to this type of experience is nearly impossible without soccer equipment for the students/ players to utilize during practice and games.

essay3: NaN
essay4: NaN

# Merging the essay columns into a single text column
train_data_df["essay"] = train_data_df["project_essay_1"].map(str) + \
    train_data_df["project_essay_2"].map(str) + \
    train_data_df["project_essay_3"].map(str) + \
    train_data_df["project_essay_4"].map(str)

# Inspecting the merged essay for one project
train_data_df[train_data_df["id"]=="p182444"]["essay"].values[0]

The merged value concatenates all four essays verbatim, escape sequences included, and ends with "nannan" from the two missing essays:

\\r\\n\\"True champions aren\'t always the ones that win, but those with the most guts.\\" By Mia Hamm [...] Access to this type of experience is nearly impossible without soccer equipment for the students/ players to utilize during practice and games .nannan
Since the merged essay contains unwanted character sequences (\\r, \\n, \\, nan, etc.), further processing is required.
# Cleaning the merged essay for the entire dataset
train_data_df['essay_updated'] = train_data_df['essay'].str.replace(r'\\r|\\n|\\|nan', "", regex=True)

The processed data looks as follows:

True champions aren't always the ones that win, but those with the most guts. " By Mia Hamm This quote best describes how the students at Cholla Middle School approach playing sports, especially for the girls and boys soccer teams. The teams are made up of 7th and 8th grade students, and most of them have not had the opportunity to play in an organized sport due to family ficial difficulties. [...] The students will get the opportunity to learn and practice a variety of soccer skills, and how to use those skills during a game. Access to this type of experience is nearly impossible without soccer equipment for the students/ players to utilize during practice and games

Note that the naive "nan" alternative in the pattern also removes "nan" inside real words: "financial" above has become "ficial". A stricter pattern, or filling the missing essays with empty strings before merging, would avoid this.
7. Handling project_title and project_resource_summary
The same cleaning filter is applied to project_title and project_resource_summary.
# Cleaning project_title
train_data_df['project_title'] = train_data_df['project_title'].str.replace(r'\\r|\\n|\\|nan', "", regex=True)

# Cleaning project_resource_summary
train_data_df['project_resource_summary'] = train_data_df['project_resource_summary'].str.replace(r'\\r|\\n|\\|nan', "", regex=True)
Creating a new dataframe
# Generating the main dataset
train_df = train_data_df[['id', 'teacher_prefix_upgrade', 'school_state_upgrade', 'project_title',
                          'project_resource_summary', 'teacher_number_of_previously_posted_projects',
                          'project_is_approved', 'project_subject_categories_updated',
                          'project_subject_subcategories_updated', 'project_grade_category_updated',
                          'essay_updated']]
train_df.head()
8. Creating new features from resources.csv
We aim to generate additional features by aggregating the price and quantity columns of "resources.csv". However, since the "description" feature already exists in summarized form as the "project_resource_summary" column of train_df, we will not include it in our selection.
# Aggregating total price and quantity per project id
resource_price_data = resource_df.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index()
resource_price_data.head()
Let's merge train_df and resource_price_data.
# Merging train_df and resource_price_data
final_df = pd.merge(train_df, resource_price_data, on='id', how='left')
final_df.head()
Finally, rename the updated columns back to their original names and drop the id column, since it is not needed for the modeling task.

# Carrying the merged dataframe forward under the name used in the rest of the analysis
preprocessed_data_df = final_df

preprocessed_data_df.rename(columns={'teacher_prefix_upgrade': "teacher_prefix",
                                     'school_state_upgrade': "school_state",
                                     'project_subject_categories_updated': "project_subject_categories",
                                     'project_subject_subcategories_updated': "project_subject_subcategories",
                                     'project_grade_category_updated': "project_grade_category",
                                     'essay_updated': "essay"}, inplace=True)
preprocessed_data_df.drop(columns=["id"], inplace=True)
d) Graphical analysis and hypothesis testing
Analysis of categorical variables with hypothesis testing
In this section we analyze the categorical features and assess their significance with respect to the dependent variable using hypothesis testing.
1. Analysis of project_subject_categories
from scipy.stats import chi2_contingency

''' Analyzing project_subject_categories '''

# Create a contingency table for project_subject_categories (with 'All' totals)
contingency_table_subject_category = pd.crosstab(preprocessed_data_df["project_subject_categories"],
                                                 preprocessed_data_df["project_is_approved"],
                                                 margins=True)

# Sort the contingency table by total count in descending order
contingency_table_subject_category_sorted = contingency_table_subject_category.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_table_subject_category_sorted
# Select the top 10 categories by approved count (column 1)
top_10_values = contingency_table_subject_category.nlargest(10, columns=1)

# Dropping the 'All' totals row
top_10_values.drop(index="All", inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size (width, height in inches)
fig, ax = plt.subplots(figsize=(10, 10))

# Create a horizontal bar plot with stacked bars
top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)
plt.show()

# Select the top 10 categories for the heatmap
top_10_categories = contingency_table_subject_category.nlargest(10, columns=1).index
print(top_10_categories)

# Filter the contingency table for the top 10 categories
contingency_table_top_10 = contingency_table_subject_category.loc[top_10_categories]

# Create the heatmap with annotations
plt.figure(figsize=(8, 8))
sns.heatmap(contingency_table_top_10, cmap="YlGnBu", annot=True, cbar=False, fmt='d')
plt.show()
According to the analysis, literacy_language, math_science, and health_sports are the categories most strongly represented among approved projects.
Hypothesis testing
# Hypothesis testing
'''
Null hypothesis (H0): there is no association between project_subject_categories
and the target variable project_is_approved.
Alternative hypothesis (HA): there is an association between project_subject_categories
and project_is_approved.
'''

# Perform the chi-square test on the table without the 'All' margins
chi2, p_value, dof, expected = chi2_contingency(contingency_table_subject_category.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
# The critical region (note: this import shadows the chi2 statistic above)
from scipy.stats import chi2
cr = chi2.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)
Since the p-value is less than alpha and the chi-square statistic falls beyond the critical value, the null hypothesis is rejected.
In other words, project approval is associated with project_subject_categories.
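The same decision rule can be exercised on a toy contingency table; the counts below are illustrative, not taken from the DonorsChoose data:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy 2x2 table: rows are categories, columns are the 0/1 target counts
table = pd.DataFrame({0: [30, 10], 1: [70, 90]},
                     index=["category_a", "category_b"])

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(dof)              # 1 degree of freedom for a 2x2 table
print(p_value < 0.05)   # True -> reject H0 at the 95% confidence level
```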
2. Analysis of project_subject_subcategories
''' Analyzing project_subject_subcategories '''

# Create a contingency table for project_subject_subcategories (with 'All' totals)
contingency_table_project_subject_subcategories = pd.crosstab(preprocessed_data_df["project_subject_subcategories"],
                                                              preprocessed_data_df["project_is_approved"],
                                                              margins=True)

# Sort the contingency table by total count in descending order
contingency_table_project_subject_subcategories_sorted = contingency_table_project_subject_subcategories.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_table_project_subject_subcategories_sorted.head(10)
# Select the top 10 subcategory combinations by approved count (column 1)
top_10_values = contingency_table_project_subject_subcategories.nlargest(10, columns=1, keep='first')

# Dropping the 'All' totals row
top_10_values.drop(index="All", inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size (width, height in inches)
fig, ax = plt.subplots(figsize=(10, 10))

# Create a horizontal bar plot with stacked bars
top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)
plt.show()

# Select the top 10 combinations for the heatmap
top_10_categories = contingency_table_project_subject_subcategories.nlargest(10, columns=1).index

# Filter the contingency table
contingency_table_top_10 = contingency_table_project_subject_subcategories.loc[top_10_categories]

# Create the heatmap with annotations
plt.figure(figsize=(8, 8))
sns.heatmap(contingency_table_top_10, cmap="YlGnBu", annot=True, cbar=False, fmt='d')
plt.show()
According to the analysis, the subcategory combinations literacy; literacy mathematics; literature_writing mathematics; literacy literature_writing; mathematics; literature_writing; and special_needs carry the largest volumes of data.
Hypothesis testing
# Hypothesis testing
'''
Null hypothesis (H0): there is no association between project_subject_subcategories
and the target variable project_is_approved.
Alternative hypothesis (HA): there is an association between project_subject_subcategories
and project_is_approved.
'''

# Perform the chi-square test on the table without the 'All' margins
chi2, p_value, dof, expected = chi2_contingency(contingency_table_project_subject_subcategories.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)

# The critical region (note: this import shadows the chi2 statistic above)
from scipy.stats import chi2
cr = chi2.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

Since the p-value is less than alpha and the chi-square statistic falls beyond the critical value, we reject the null hypothesis: project approval is associated with project_subject_subcategories.
3. Analysis of the teacher_prefix data
''' Analyzing teacher_prefix '''

# Create a contingency table for teacher_prefix (with 'All' totals)
contingency_teacher_prefix = pd.crosstab(preprocessed_data_df["teacher_prefix"],
                                         preprocessed_data_df["project_is_approved"],
                                         margins=True)

# Sort the contingency table by total count in descending order
contingency_teacher_prefix_sorted = contingency_teacher_prefix.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_teacher_prefix_sorted
# Plotting a bar plot of the prefix categories
top_10_values = contingency_teacher_prefix.nlargest(10, columns=1, keep='first')

# Dropping the 'All' totals row
top_10_values.drop(index="All", inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size (width, height in inches)
fig, ax = plt.subplots(figsize=(10, 10))

# Create a horizontal bar plot with stacked bars
top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)
plt.show()

# Plotting a heatmap of the prefix categories
top_10_categories = contingency_teacher_prefix.nlargest(10, columns=1).index

# Filter the contingency table
contingency_table_top_10 = contingency_teacher_prefix.loc[top_10_categories]

# Create the heatmap with annotations
plt.figure(figsize=(8, 8))
sns.heatmap(contingency_table_top_10, cmap="YlGnBu", annot=True, cbar=False, fmt='d')
plt.show()
According to the analysis, mrs and ms account for the large majority of approved projects.
Hypothesis testing
# Hypothesis testing
'''
Null hypothesis (H0): there is no association between teacher_prefix
and the target variable project_is_approved.
Alternative hypothesis (HA): there is an association between teacher_prefix
and project_is_approved.
'''

# Perform the chi-square test on the table without the 'All' margins
chi2, p_value, dof, expected = chi2_contingency(contingency_teacher_prefix.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)

# The critical region (note: this import shadows the chi2 statistic above)
from scipy.stats import chi2
cr = chi2.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

Since the p-value is less than alpha and the chi-square statistic falls beyond the critical value, we reject the null hypothesis: project approval is associated with teacher_prefix.
4. Analysis of the school_state feature
''' Analyzing school_state '''

# Create a contingency table for school_state (with 'All' totals)
contingency_school_state_category = pd.crosstab(preprocessed_data_df["school_state"],
                                                preprocessed_data_df["project_is_approved"],
                                                margins=True)

# Sort the contingency table by total count in descending order
contingency_school_state_category_sorted = contingency_school_state_category.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_school_state_category_sorted
# Barplot analysis of the top 10 school states
top_10_values = contingency_school_state_category.nlargest(10, columns=1, keep='first')

# Dropping the 'All' totals row
top_10_values.drop(index="All", inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size (width, height in inches)
fig, ax = plt.subplots(figsize=(10, 10))

# Create a horizontal bar plot with stacked bars
top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)
plt.show()

# Heatmap analysis of the top 10 school states
top_10_categories = contingency_school_state_category.nlargest(10, columns=1).index

# Filter the contingency table for the top 10 states
contingency_table_top_10 = contingency_school_state_category.loc[top_10_categories]

# Create the heatmap with annotations
plt.figure(figsize=(8, 8))
sns.heatmap(contingency_table_top_10, cmap="YlGnBu", annot=True, cbar=False, fmt='d')
plt.show()
According to the analysis, ca and ny have the highest counts of approved projects.
Hypothesis testing
# Hypothesis testing
'''
Null hypothesis (H0): there is no association between school_state
and the target variable project_is_approved.
Alternative hypothesis (HA): there is an association between school_state
and project_is_approved.
'''

# Perform the chi-square test on the table without the 'All' margins
chi2, p_value, dof, expected = chi2_contingency(contingency_school_state_category.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)

# The critical region (note: this import shadows the chi2 statistic above)
from scipy.stats import chi2
cr = chi2.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

Since the p-value is less than alpha and the chi-square statistic falls beyond the critical value, we reject the null hypothesis: project approval is associated with school_state.
5. Analysis of project_grade_category
''' Analyzing project_grade_category '''

# Create a contingency table for project_grade_category (with 'All' totals)
contingency_project_grade_category = pd.crosstab(preprocessed_data_df["project_grade_category"],
                                                 preprocessed_data_df["project_is_approved"],
                                                 margins=True)

# Sort the contingency table by total count in descending order
contingency_project_grade_category_sorted = contingency_project_grade_category.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_project_grade_category_sorted
# Barplot analysis of project grade
top_10_values = contingency_project_grade_category.nlargest(10, columns=1, keep='first')

# Dropping the 'All' totals row
top_10_values.drop(index="All", inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size (width, height in inches)
fig, ax = plt.subplots(figsize=(8, 8))

# Create a horizontal bar plot with stacked bars
top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)
plt.show()

# Heatmap analysis of project grade
top_10_categories = contingency_project_grade_category.nlargest(10, columns=1).index

# Filter the contingency table
contingency_table_top_10 = contingency_project_grade_category.loc[top_10_categories]

# Create the heatmap with annotations
plt.figure(figsize=(8, 8))
sns.heatmap(contingency_table_top_10, cmap="YlGnBu", annot=True, cbar=False, fmt='d')
plt.show()
According to the analysis, grades_prek_2 and grades_3_5 are the most strongly represented among approved projects.
Hypothesis testing
# Hypothesis testing
'''
Null hypothesis (H0): there is no association between project_grade_category
and the target variable project_is_approved.
Alternative hypothesis (HA): there is an association between project_grade_category
and project_is_approved.
'''

# Perform the chi-square test on the table without the 'All' margins
chi2, p_value, dof, expected = chi2_contingency(contingency_project_grade_category.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)

# The critical region (note: this import shadows the chi2 statistic above)
from scipy.stats import chi2
cr = chi2.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

Since the p-value is less than alpha and the chi-square statistic falls beyond the critical value, we reject the null hypothesis: project approval is associated with project_grade_category.
6. Count plot analysis of project_is_approved
# Countplot for project_is_approved
plt.figure(figsize=(10, 8))
sns.countplot(data=preprocessed_data_df, y='project_is_approved')
# Checking value counts
preprocessed_data_df["project_is_approved"].value_counts()
According to the analysis, the ratio of 1s to 0s is roughly 80:20, so the target classes are imbalanced.
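A quick way to see the imbalance directly is normalized value counts; here on an illustrative label series rather than the real column:

```python
import pandas as pd

labels = pd.Series([1, 1, 1, 1, 0])  # illustrative 80:20 split
ratio = labels.value_counts(normalize=True)
print(ratio[1], ratio[0])  # 0.8 0.2
```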
Analysis of numerical variables and hypothesis testing
A) Boxplot and KDE-plot analysis of the quantity feature
# Boxplot analysis of the quantity feature
df_box = preprocessed_data_df[preprocessed_data_df['project_is_approved'].isin([0, 1])]
plt.figure(figsize=(15, 8))
sns.boxplot(x="project_is_approved", y="quantity", data=df_box)
plt.xticks(ticks=[0, 1], labels=["0", "1"])
plt.show()
According to the analysis, there are outliers in both the 0 and 1 groups of the quantity feature.

# Calculate percentiles
percentiles = np.arange(1, 101)
values = np.percentile(preprocessed_data_df['quantity'], percentiles)

# Plotting using Seaborn
sns.lineplot(x=percentiles, y=values)
plt.ylabel('Value')
plt.title('Percentile Plot')
plt.grid(True)
plt.show()
According to the analysis, there are outliers at the extreme percentiles (roughly from the 99th percentile upward), so they need to be removed.
Hypothesis testing
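Percentile-based trimming can be sketched as below; the 99th-percentile cutoff mirrors the observation above, and the same recipe applies later to teacher_number_of_previously_posted_projects and price. The sample values are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 4, 500])  # illustrative quantities with one outlier
cutoff = np.percentile(s, 99)            # keep everything up to the 99th percentile
filtered = s[s <= cutoff]
print(len(s), len(filtered))  # 7 6
```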
# Hypothesis testing
'''
The objective of the t-test here is to determine whether there is a statistically
significant difference between the means of quantity for the two binary target classes.
Null hypothesis (H0): no difference between the group means.
Alternative hypothesis (H1): there is a difference between the group means.
Parameters: confidence level = 95%, alpha = 0.05
'''

# Import library
import scipy.stats as stats

t_value, p_value = stats.ttest_ind(
    preprocessed_data_df[preprocessed_data_df["project_is_approved"] == 1]["quantity"].values,
    preprocessed_data_df[preprocessed_data_df["project_is_approved"] == 0]["quantity"].values)
print("t_value =", t_value)
print("p_value =", p_value)
print("alpha =", 0.05)

Since the p-value is less than alpha (p_value < alpha), we reject the null hypothesis: per the test, the mean quantity for approved and not-approved projects is not equal.

# Overlapping KDE plot for the quantity data
plt.figure(figsize=(8, 10))
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"] == 1]["quantity"].values, shade=True, color="r")
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"] == 0]["quantity"].values, shade=True, color="b")

Visually, however, the difference between the 1 and 0 groups is slight and barely noticeable, and the spread of the two groups is almost the same.
B) Boxplot and KDE-plot analysis of teacher_number_of_previously_posted_projects
# Boxplot analysis of teacher_number_of_previously_posted_projects
df_box = preprocessed_data_df[preprocessed_data_df['project_is_approved'].isin([0, 1])]
plt.figure(figsize=(15, 8))
sns.boxplot(x="project_is_approved", y="teacher_number_of_previously_posted_projects", data=df_box)
plt.xticks(ticks=[0, 1], labels=["0", "1"])
plt.show()
According to the analysis, there are outliers in both the 0 and 1 groups of the teacher_number_of_previously_posted_projects feature.

# Calculate percentiles
percentiles = np.arange(1, 101)
values = np.percentile(preprocessed_data_df['teacher_number_of_previously_posted_projects'], percentiles)

# Plotting using Seaborn
sns.lineplot(x=percentiles, y=values)
plt.ylabel('Value')
plt.title('Percentile Plot')
plt.grid(True)
plt.show()
According to the analysis, there are outliers at the extreme percentiles (roughly from the 98th percentile upward), so they need to be removed.
Hypothesis testing
''' The objective of the t-test in this context is to determine if there is a statistically significant difference between the means of the numerical data(teacher_number_of_previously_posted_projects) for two binary target classes. Null hypothesis H0: no difference between the group means. Alternate hypothesis H1: difference between the group means. Considering the following parameters Confidence leve= 95% aplha=0.05 ''' # Import library import scipy.stats as stats t_value,p_value=stats.ttest_ind(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["teacher_number_of_previously_posted_projects"].values,preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["teacher_number_of_previously_posted_projects"].values) print("t_value =",t_value) print("p_value =",p_value) print("aplha value= ",0.05)
As the p-value is less than alpha (p_value < alpha), we reject the null hypothesis. Hence, per the hypothesis test, the teacher_number_of_previously_posted_projects distributions for the approved and not-approved projects are not equal.

# Overlap graph for the teacher_number_of_previously_posted_projects data
plt.figure(figsize=(8, 10))
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["teacher_number_of_previously_posted_projects"].values, shade=True, color="r")
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["teacher_number_of_previously_posted_projects"].values, shade=True, color="b")

As per the analysis, the difference between classes 1 and 0 is barely noticeable, and the spread of both groups is almost the same.
C) Boxplot analysis and KDE plot analysis of price
# Boxplot analysis of price
df_box = preprocessed_data_df[preprocessed_data_df['project_is_approved'].isin([0, 1])]
plt.figure(figsize=(15, 8))
sns.boxplot(x="project_is_approved", y="price", data=df_box)
plt.xticks(ticks=[0, 1], labels=["0", "1"])
plt.show()
As per the analysis, there are outliers in both the 0 and 1 classes of the price feature.

# Calculate percentiles
percentiles = np.arange(1, 101)
values = np.percentile(preprocessed_data_df['price'], percentiles)
# Plotting using Seaborn
sns.lineplot(x=percentiles, y=values)
plt.ylabel('Value')
plt.title('Percentile Plot')
plt.grid(True)
plt.show()
As per the analysis, there are outliers at the extreme percentiles (97 to 100), so they need to be removed.
Hypothesis testing
'''
The objective of the t-test in this context is to determine whether there is a
statistically significant difference between the means of the numerical data
(price) for the two binary target classes.
Null hypothesis H0: no difference between the group means.
Alternate hypothesis H1: a difference between the group means.
Parameters: confidence level = 95%, alpha = 0.05.
'''
# Import library
import scipy.stats as stats
t_value, p_value = stats.ttest_ind(
    preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["price"].values,
    preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["price"].values)
print("t_value =", t_value)
print("p_value =", p_value)
print("alpha value =", 0.05)
As the p-value is less than alpha (p_value < alpha), we reject the null hypothesis. Hence, per the hypothesis test, the price distributions for the approved and not-approved projects are not equal.

# Overlap graph for the price data
plt.figure(figsize=(8, 10))
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["price"].values, shade=True, color="r")
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["price"].values, shade=True, color="b")

As per the analysis, the difference between classes 1 and 0 is barely noticeable, and the spread of both groups is almost the same.
Text data analysis
# Libraries for text processing
import re, nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

def clean_tokenized_sentence(s):
    """Performs basic cleaning of a tokenized sentence."""
    cleaned_s = ""  # Accumulator for the processed sentence.
    words = nltk.word_tokenize(s)
    for word in words:
        # Convert to lowercase
        c_word = word.lower()
        # Remove punctuation
        c_word = re.sub(r'[^\w\s]', '', c_word)
        # Remove stopwords and empty tokens
        if c_word != '' and c_word not in stopwords.words('english'):
            cleaned_s = cleaned_s + " " + c_word  # Append the processed word.
    return cleaned_s.strip()

# Creating a new dataframe for analysis
df_text = preprocessed_data_df[['project_title', 'essay', 'project_is_approved', 'project_resource_summary']]
df_text.head(2)
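For a quick sanity check without the NLTK downloads, the same cleaning steps (lowercase, strip punctuation, drop stopwords) can be sketched with the standard library alone; the tiny stopword set here is an illustrative stand-in for NLTK's English list, and the function name is hypothetical:

```python
import re

STOPWORDS = {"the", "a", "an", "to", "for", "and", "of"}  # illustrative subset

def clean_sentence_simple(s):
    """Lowercase, strip punctuation, and drop stopwords (stdlib-only sketch)."""
    words = re.findall(r"[A-Za-z']+", s.lower())        # crude tokenizer
    kept = [re.sub(r"[^\w\s]", "", w) for w in words]   # strip punctuation
    return " ".join(w for w in kept if w and w not in STOPWORDS)

print(clean_sentence_simple("The students need a new set of books!"))
# → students need new set books
```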
- Essay feature analysis
# Top words from the essays of approved projects
# Selecting approved essays
approved = df_text[df_text["project_is_approved"]==1]["essay_message"]
# Joining all text data
approved = " ".join(approved)
# Splitting the entire text into words
approved = approved.split()
# Finding the top 50 words most used in approved essays
from collections import Counter
counter_approved = Counter(approved).most_common(50)
counter_approved
# Extracting words and frequencies for plotting
top_words = [word for word, freq in counter_approved]
word_freq = [freq for word, freq in counter_approved]
# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Approved Essays')
plt.tight_layout()
plt.show()
# Analysing top words from the essays of not-approved projects
# Selecting not-approved essays
not_approved = df_text[df_text["project_is_approved"]==0]["essay_message"]
# Joining all text data
not_approved = " ".join(not_approved)
# Splitting the entire text into words
not_approved = not_approved.split()
# Finding the top 50 words most used in not-approved essays
from collections import Counter
counter_not_approved = Counter(not_approved).most_common(50)
counter_not_approved
# Extracting words and frequencies for plotting
top_words = [word for word, freq in counter_not_approved]
word_freq = [freq for word, freq in counter_not_approved]
# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Not-Approved Essays')
plt.tight_layout()
plt.show()
# Calculating the common top-50 words from the approved and not-approved essays
top50_approved = [x[0] for x in counter_approved]
top50_notapproved = [x[0] for x in counter_not_approved]
intersection_words = list(set(top50_approved).intersection(top50_notapproved))
print(intersection_words)
print("len of intersection_words =", len(intersection_words))

['use', 'like', 'classroom', 'one', 'many', 'come', 'project', 'learn', 'make', 'help', 'different', 'year', 'get', 'work', 'also', 'provide', 'learning', 'time', 'books', 'day', 'need', 'class', 'reading', 'teach', 'children', 'math', 'want', 'best', 'students', 'create', 'needs', 'new', 'materials', 'technology', 'world', 'every', 'love', 'able', 'school', 'student', 'grade', 'would', 'way', 'skills', 'allow']
len of intersection_words = 45

As per the analysis, there are 45 words common between approved and not-approved essays.
2. Project title data analysis
df_text["project_title_message"] = df_text["project_title"].apply(clean_tokenized_sentence)
df_text.head(2)

# Analysing top words from the titles of approved projects
# Selecting approved project titles
approved_title = df_text[df_text["project_is_approved"]==1]["project_title_message"]
# Joining all text data
approved_title = " ".join(approved_title)
# Splitting the entire text into words
approved_title = approved_title.split()
# Finding the top 50 words most used in approved titles
from collections import Counter
counter_approved_title = Counter(approved_title).most_common(50)
counter_approved_title
# Extracting words and frequencies for plotting
top_words = [word for word, freq in counter_approved_title]
word_freq = [freq for word, freq in counter_approved_title]
# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Approved Project Titles')
plt.tight_layout()
plt.show()
# Analysing top words from the project_title of not-approved projects
# Selecting not-approved project titles
not_approved_title = df_text[df_text["project_is_approved"]==0]["project_title_message"]
# Joining all text data
not_approved_title = " ".join(not_approved_title)
# Splitting the entire text into words
not_approved_title = not_approved_title.split()
# Finding the top 50 words most used in not-approved titles
from collections import Counter
counter_not_approved_title = Counter(not_approved_title).most_common(50)
counter_not_approved_title
# Extracting words and frequencies for plotting
top_words = [word for word, freq in counter_not_approved_title]
word_freq = [freq for word, freq in counter_not_approved_title]
# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Not-Approved Project Titles')
plt.tight_layout()
plt.show()
# Calculating the common top-50 words from the approved and not-approved titles
top50_approved_title = [x[0] for x in counter_approved_title]
top50_not_approved_title = [x[0] for x in counter_not_approved_title]
intersection_words_title = list(set(top50_approved_title).intersection(top50_not_approved_title))
print(intersection_words_title)
print("len of intersection_words_title =", len(intersection_words_title))

As per the analysis, many of the top words are common between approved and not-approved titles.
3. project_resource_summary data analysis
df_text["project_resource_summary_message"] = df_text["project_resource_summary"].apply(clean_tokenized_sentence)
df_text["project_resource_summary"] = df_text["project_resource_summary_message"]
df_text.head(2)

# Analysing top words from the project_resource_summary of approved projects
# Selecting approved summaries
approved_project_resource_summary = df_text[df_text["project_is_approved"]==1]["project_resource_summary"]
# Joining all text data
approved_project_resource_summary = " ".join(approved_project_resource_summary)
# Splitting the entire text into words
approved_project_resource_summary = approved_project_resource_summary.split()
# Finding the top 50 words most used in approved summaries
from collections import Counter
counter_approved_project_resource_summary = Counter(approved_project_resource_summary).most_common(50)
counter_approved_project_resource_summary
# Extracting words and frequencies for plotting
top_words = [word for word, freq in counter_approved_project_resource_summary]
word_freq = [freq for word, freq in counter_approved_project_resource_summary]
# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Approved project_resource_summary')
plt.tight_layout()
plt.show()
# Analysing top words from the project_resource_summary of not-approved projects
# Selecting not-approved summaries
not_approved_project_resource_summary = df_text[df_text["project_is_approved"]==0]["project_resource_summary"]
# Joining all text data
not_approved_project_resource_summary = " ".join(not_approved_project_resource_summary)
# Splitting the entire text into words
not_approved_project_resource_summary = not_approved_project_resource_summary.split()
# Finding the top 50 words most used in not-approved summaries
from collections import Counter
counter_not_approved_project_resource_summary = Counter(not_approved_project_resource_summary).most_common(50)
counter_not_approved_project_resource_summary
# Extracting words and frequencies for plotting
top_words = [word for word, freq in counter_not_approved_project_resource_summary]
word_freq = [freq for word, freq in counter_not_approved_project_resource_summary]
# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Not-Approved project_resource_summary')
plt.tight_layout()
plt.show()
# Calculating the common top-50 words from the approved and not-approved summaries
top50_approved_project_resource_summary = [x[0] for x in counter_approved_project_resource_summary]
top50_not_approved_project_resource_summary = [x[0] for x in counter_not_approved_project_resource_summary]
intersection_words_summary = list(set(top50_approved_project_resource_summary).intersection(top50_not_approved_project_resource_summary))
print(intersection_words_summary)
print("len of intersection_words =", len(intersection_words_summary))

['class', 'time', 'skills', 'keep', 'students', 'school', 'new', 'science', 'access', 'use', 'writing', 'need', 'make', 'technology', 'enhance', 'learning', 'chairs', 'learn', 'work', 'supplies', 'materials', 'center', 'seating', 'projects', 'read', 'order', 'classroom', 'activities', 'books', 'allow', 'ipad', 'art', 'help', 'paper', 'create', 'able', 'flexible', 'reading', 'math']
len of intersection_words = 39

As per the analysis, there are 39 words common between approved and not-approved resource summaries.
f) Splitting the data into training, test, and validation subsets
from sklearn.model_selection import train_test_split

X, x_val, Y, y_val = train_test_split(
    new_df, new_df['project_is_approved'],
    test_size=0.2, random_state=42,
    stratify=preprocessed_data_df[['project_is_approved']])
print("x_train: ", X.shape)
print("x_val  : ", x_val.shape)
print("y_train: ", Y.shape)
print("y_val  : ", y_val.shape)

# Splitting the remaining data into train and test splits
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, stratify=Y)
print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)
print("x_val : ", x_val.shape)
print("y_val : ", y_val.shape)
print("x_test : ", x_test.shape)
print("y_test : ", y_test.shape)

print("X_TRAIN-------------------------")
x_train_y_value_counts = x_train['project_is_approved'].value_counts()
print("Number of projects that are approved for funding ", x_train_y_value_counts[1], " -> ", round(x_train_y_value_counts[1]/(x_train_y_value_counts[1]+x_train_y_value_counts[0])*100, 2), "%")
print("Number of projects that are not approved for funding ", x_train_y_value_counts[0], " -> ", round(x_train_y_value_counts[0]/(x_train_y_value_counts[1]+x_train_y_value_counts[0])*100, 2), "%")
print("\n")

print("X_TEST--------------------------")
x_test_y_value_counts = x_test['project_is_approved'].value_counts()
print("Number of projects that are approved for funding ", x_test_y_value_counts[1], " -> ", round(x_test_y_value_counts[1]/(x_test_y_value_counts[1]+x_test_y_value_counts[0])*100, 2), "%")
print("Number of projects that are not approved for funding ", x_test_y_value_counts[0], " -> ", round(x_test_y_value_counts[0]/(x_test_y_value_counts[1]+x_test_y_value_counts[0])*100, 2), "%")
print("\n")

print("X_VAL---------------------------")
x_val_y_value_counts = x_val['project_is_approved'].value_counts()
print("Number of projects that are approved for funding ", x_val_y_value_counts[1], " -> ", round(x_val_y_value_counts[1]/(x_val_y_value_counts[1]+x_val_y_value_counts[0])*100, 2), "%")
print("Number of projects that are not approved for funding ", x_val_y_value_counts[0], " -> ", round(x_val_y_value_counts[0]/(x_val_y_value_counts[1]+x_val_y_value_counts[0])*100, 2), "%")
print("\n")
# Vectorizing project subject categories
# We use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_sub = CountVectorizer(lowercase=False, binary=True)
vectorizer_sub.fit(x_train['project_subject_categories'].values)
x_train_project_subject_categories_one_hot = vectorizer_sub.transform(x_train['project_subject_categories'].values)
x_test_project_subject_categories_one_hot = vectorizer_sub.transform(x_test['project_subject_categories'].values)
x_val_project_subject_categories_one_hot = vectorizer_sub.transform(x_val['project_subject_categories'].values)
# x_train['project_subject_categories_encoded'] = x_train_categories_one_hot.toarray()
# x_test['project_subject_categories_encoded'] = x_test_categories_one_hot.toarray()
print(vectorizer_sub.get_feature_names_out())
print("Shape of matrix after one hot encoding -> categories: x_train: ", x_train_project_subject_categories_one_hot.shape)
print("Shape of matrix after one hot encoding -> categories: x_val : ", x_val_project_subject_categories_one_hot.shape)
print("Shape of matrix after one hot encoding -> categories: x_test : ", x_test_project_subject_categories_one_hot.shape)

# Vectorizing project_resource_summary
vectorizer_sub = CountVectorizer(lowercase=False, binary=True)
vectorizer_sub.fit(x_train['project_resource_summary'].values)
x_train_project_resource_summary_one_hot = vectorizer_sub.transform(x_train['project_resource_summary'].values)
x_test_project_resource_summary_one_hot = vectorizer_sub.transform(x_test['project_resource_summary'].values)
x_val_project_resource_summary_one_hot = vectorizer_sub.transform(x_val['project_resource_summary'].values)
print(vectorizer_sub.get_feature_names_out())
print("Shape of matrix after one hot encoding -> summary: x_train: ", x_train_project_resource_summary_one_hot.shape)
print("Shape of matrix after one hot encoding -> summary: x_val : ", x_val_project_resource_summary_one_hot.shape)
print("Shape of matrix after one hot encoding -> summary: x_test : ", x_test_project_resource_summary_one_hot.shape)

# Vectorizing project subject sub-categories
vectorizer_sub_sub_category = CountVectorizer(lowercase=False, binary=True)
vectorizer_sub_sub_category.fit(x_train['project_subject_subcategories'].values)
x_train_project_subject_subcategories_one_hot = vectorizer_sub_sub_category.transform(x_train['project_subject_subcategories'].values)
x_test_project_subject_subcategories_one_hot = vectorizer_sub_sub_category.transform(x_test['project_subject_subcategories'].values)
x_val_project_subject_subcategories_one_hot = vectorizer_sub_sub_category.transform(x_val['project_subject_subcategories'].values)
print(vectorizer_sub_sub_category.get_feature_names_out())
print("Shape of matrix after one hot encoding -> subcategories: x_train: ", x_train_project_subject_subcategories_one_hot.shape)
print("Shape of matrix after one hot encoding -> subcategories: x_val : ", x_val_project_subject_subcategories_one_hot.shape)
print("Shape of matrix after one hot encoding -> subcategories: x_test : ", x_test_project_subject_subcategories_one_hot.shape)

# Vectorizing teacher_prefix
vectorizer_teacher_prefix = CountVectorizer(lowercase=False, binary=True)
vectorizer_teacher_prefix.fit(x_train['teacher_prefix'].values)
x_train_teacher_prefix_one_hot = vectorizer_teacher_prefix.transform(x_train['teacher_prefix'].values)
x_test_teacher_prefix_one_hot = vectorizer_teacher_prefix.transform(x_test['teacher_prefix'].values)
x_val_teacher_prefix_one_hot = vectorizer_teacher_prefix.transform(x_val['teacher_prefix'].values)
print(vectorizer_teacher_prefix.get_feature_names_out())
print("Shape of matrix after one hot encoding -> teacher_prefix: x_train: ", x_train_teacher_prefix_one_hot.shape)
print("Shape of matrix after one hot encoding -> teacher_prefix: x_val : ", x_val_teacher_prefix_one_hot.shape)
print("Shape of matrix after one hot encoding -> teacher_prefix: x_test : ", x_test_teacher_prefix_one_hot.shape)

# Vectorizing school_state
vectorizer_school_state = CountVectorizer(lowercase=False, binary=True)
vectorizer_school_state.fit(x_train['school_state'].values)
x_train_school_state_one_hot = vectorizer_school_state.transform(x_train['school_state'].values)
x_test_school_state_one_hot = vectorizer_school_state.transform(x_test['school_state'].values)
x_val_school_state_one_hot = vectorizer_school_state.transform(x_val['school_state'].values)
print(vectorizer_school_state.get_feature_names_out())
print("Shape of matrix after one hot encoding -> school_state: x_train: ", x_train_school_state_one_hot.shape)
print("Shape of matrix after one hot encoding -> school_state: x_val : ", x_val_school_state_one_hot.shape)
print("Shape of matrix after one hot encoding -> school_state: x_test : ", x_test_school_state_one_hot.shape)

# Vectorizing project_grade_category
vectorizer_project_grade_category = CountVectorizer(lowercase=False, binary=True)
vectorizer_project_grade_category.fit(x_train['project_grade_category'].values)
x_train_project_grade_category_one_hot = vectorizer_project_grade_category.transform(x_train['project_grade_category'].values)
x_test_project_grade_category_one_hot = vectorizer_project_grade_category.transform(x_test['project_grade_category'].values)
x_val_project_grade_category_one_hot = vectorizer_project_grade_category.transform(x_val['project_grade_category'].values)
print(vectorizer_project_grade_category.get_feature_names_out())
print("Shape of matrix after one hot encoding -> grade_category: x_train: ", x_train_project_grade_category_one_hot.shape)
print("Shape of matrix after one hot encoding -> grade_category: x_val : ", x_val_project_grade_category_one_hot.shape)
print("Shape of matrix after one hot encoding -> grade_category: x_test : ", x_test_project_grade_category_one_hot.shape)

# Vectorizing project_title (TF-IDF alternative left commented out)
# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer_project_title_tfidf = TfidfVectorizer(min_df=10)
vectorizer_project_title_category = CountVectorizer(min_df=10, lowercase=False, binary=True)
vectorizer_project_title_category.fit(x_train['project_title'])
x_train_project_titles_tfidf = vectorizer_project_title_category.transform(x_train['project_title'])
x_val_project_titles_tfidf = vectorizer_project_title_category.transform(x_val['project_title'])
x_test_project_titles_tfidf = vectorizer_project_title_category.transform(x_test['project_title'])
print("Shape of matrix after vectorizing -> Title: x_train: ", x_train_project_titles_tfidf.shape)
print("Shape of matrix after vectorizing -> Title: x_val : ", x_val_project_titles_tfidf.shape)
print("Shape of matrix after vectorizing -> Title: x_test : ", x_test_project_titles_tfidf.shape)

# Vectorizing essay (TF-IDF alternative left commented out)
# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer_essay_tfidf = TfidfVectorizer(min_df=10)
# vectorizer_essay_tfidf.fit(x_train['essay'])
vectorizer_project_essay_category = CountVectorizer(min_df=10, lowercase=False, binary=True)
vectorizer_project_essay_category.fit(x_train['essay'])
x_train_essay_tfidf = vectorizer_project_essay_category.transform(x_train['essay'])
x_val_essay_tfidf = vectorizer_project_essay_category.transform(x_val['essay'])
x_test_essay_tfidf = vectorizer_project_essay_category.transform(x_test['essay'])
print("Shape of matrix after vectorizing -> Essay: x_train: ", x_train_essay_tfidf.shape)
print("Shape of matrix after vectorizing -> Essay: x_val : ", x_val_essay_tfidf.shape)
print("Shape of matrix after vectorizing -> Essay: x_test : ", x_test_essay_tfidf.shape)

df = pd.DataFrame()
# Scaling the numerical columns (StandardScaler alternative left commented out)
from sklearn.preprocessing import StandardScaler
scaler_transform_numerical_value = StandardScaler()
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
# scaler_transform_numerical_value.fit(x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
normalizer.fit(x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
df[["x", "y", "z"]] = normalizer.transform(x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']] = normalizer.transform(x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
x_test[['teacher_number_of_previously_posted_projects', 'price', 'quantity']] = normalizer.transform(x_test[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
x_val[['teacher_number_of_previously_posted_projects', 'price', 'quantity']] = normalizer.transform(x_val[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])

from scipy.sparse import hstack
# Merging train values
x_train_onehot = hstack((x_train_project_subject_categories_one_hot,
                         x_train_project_resource_summary_one_hot,
                         x_train_project_subject_subcategories_one_hot,
                         x_train_teacher_prefix_one_hot,
                         x_train_school_state_one_hot,
                         x_train_project_grade_category_one_hot,
                         x_train_project_titles_tfidf,
                         x_train_essay_tfidf,
                         x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']]))
print(x_train_onehot.shape)
# Merging test values
x_test_onehot = hstack((x_test_project_subject_categories_one_hot,
                        x_test_project_resource_summary_one_hot,
                        x_test_project_subject_subcategories_one_hot,
                        x_test_teacher_prefix_one_hot,
                        x_test_school_state_one_hot,
                        x_test_project_grade_category_one_hot,
                        x_test_project_titles_tfidf,
                        x_test_essay_tfidf,
                        x_test[['teacher_number_of_previously_posted_projects', 'price', 'quantity']]))
# Merging val values
x_val_onehot = hstack((x_val_project_subject_categories_one_hot,
                       x_val_project_resource_summary_one_hot,
                       x_val_project_subject_subcategories_one_hot,
                       x_val_teacher_prefix_one_hot,
                       x_val_school_state_one_hot,
                       x_val_project_grade_category_one_hot,
                       x_val_project_titles_tfidf,
                       x_val_essay_tfidf,
                       x_val[['teacher_number_of_previously_posted_projects', 'price', 'quantity']]))
print(x_val_onehot.shape)

Note: both TF-IDF and CountVectorizer techniques are provided in the code above (TF-IDF left commented out). As per the training analysis, CountVectorizer gave better results than TF-IDF.
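The two vectorization options mentioned in the note above differ only in cell values, not in vocabulary: CountVectorizer with binary=True yields 0/1 indicators, while TfidfVectorizer yields normalized weights. A toy sketch (the corpus here is illustrative, not the DonorsChoose data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "students need books for reading",
    "new technology for the classroom",
    "students love math and science",
    "help our classroom get supplies",
]

# Binary one-hot style features, as in the case study
count_vec = CountVectorizer(lowercase=False, binary=True)
X_count = count_vec.fit_transform(corpus)

# TF-IDF weighted features (the commented-out alternative)
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)

# Same vocabulary size either way; only the cell values differ
print(X_count.shape, X_tfidf.shape)
```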
Training the machine learning models
Note: 1) The following models were trained on a personal computer. 2) The complete code is provided at the GitHub link.
1) Random Forest Classifier
# The following parameters were chosen after performing hyperparameter tuning
classifier = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced_subsample')
classifier.fit(x_train_onehot, y_train)

# Prediction on test data
pred_xgboost = classifier.predict(x_test_onehot)
pred_xgboost_train = classifier.predict(x_train_onehot)

# Checking different metrics for the Random Forest model
print('Checking different metrics for the Random Forest model:\n')
print("Training accuracy: ", classifier.score(x_train_onehot, y_train))
acc_score = accuracy_score(y_test, pred_xgboost)
print('Testing accuracy: ', acc_score)
conf_mat = confusion_matrix(y_test, pred_xgboost)
print('Confusion Matrix: \n', conf_mat)
roc_auc = roc_auc_score(y_test, pred_xgboost)
print('ROC AUC score: ', roc_auc)
class_rep_xgboost = classification_report(y_test, pred_xgboost)
print('Classification Report: \n', class_rep_xgboost)
As per the analysis, the ROC AUC score is 0.5, which means the model is not able to distinguish between classes 1 and 0. Also, since the training accuracy is greater than the testing accuracy, the model seems to be overfit.
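One caveat worth noting: the ROC AUC above is computed from hard predict() labels, which can understate a classifier's ranking ability; computing it from predict_proba scores is usually more informative. A sketch on toy imbalanced data (the dataset here is synthetic, not the DonorsChoose features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data, roughly mimicking the ~85/15 class split
X, y = make_classification(n_samples=500, weights=[0.15, 0.85], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

auc_labels = roc_auc_score(y_te, clf.predict(X_te))             # from hard 0/1 labels
auc_probs = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # from scores
print(round(auc_labels, 3), round(auc_probs, 3))
```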
2) XGBoost Classifier
from xgboost import XGBClassifier

# The following parameters were chosen after performing hyperparameter tuning
classifier = XGBClassifier(scale_pos_weight=0.17999, n_estimators=1000, learning_rate=0.01)
# Fit the classifier to the preprocessed training data
classifier.fit(x_train_onehot, y_train)
pred_xgboost = classifier.predict(x_test_onehot)
pred_xgboost_train = classifier.predict(x_train_onehot)

print()
print("Testing Score")
# Checking different metrics on the test split
print('Checking different metrics for the XGBoost model:\n')
print("Training accuracy: ", classifier.score(x_train_onehot, y_train))
acc_score = accuracy_score(y_test, pred_xgboost)
print('Testing accuracy: ', acc_score)
conf_mat = confusion_matrix(y_test, pred_xgboost)
print('Confusion Matrix: \n', conf_mat)
roc_auc = roc_auc_score(y_test, pred_xgboost)
print('ROC AUC score: ', roc_auc)
class_rep_xgboost = classification_report(y_test, pred_xgboost)
print('Classification Report: \n', class_rep_xgboost)

print()
print("Validation Score")
pred_xgboost_val = classifier.predict(x_val_onehot)
# Checking different metrics on the validation split
print('Checking different metrics for the XGBoost model:\n')
print("Training accuracy: ", classifier.score(x_train_onehot, y_train))
acc_score = accuracy_score(y_val, pred_xgboost_val)
print('Validation accuracy: ', acc_score)
conf_mat = confusion_matrix(y_val, pred_xgboost_val)
print('Confusion Matrix: \n', conf_mat)
roc_auc = roc_auc_score(y_val, pred_xgboost_val)
print('ROC AUC score: ', roc_auc)
class_rep_xgboost = classification_report(y_val, pred_xgboost_val)
print('Classification Report: \n', class_rep_xgboost)
As the training accuracy is greater than the testing accuracy, the model appears to overfit.
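The `scale_pos_weight=0.17999` above is not arbitrary: XGBoost's documented recommendation is the ratio of negative to positive examples, and the DonorsChoose labels are heavily skewed toward approval (label 1), so the ratio falls below 1. A back-of-the-envelope sketch with illustrative counts (not the exact dataset figures):

```python
# Sketch: deriving scale_pos_weight = n_negative / n_positive.
# The counts below are assumed for illustration only; they sum to the
# 109,248 rows mentioned earlier but are not the exact class split.
n_pos = 92_706   # assumed approved projects (label 1)
n_neg = 16_542   # assumed rejected projects (label 0)

scale_pos_weight = n_neg / n_pos
print(f"scale_pos_weight = {scale_pos_weight:.5f}")
```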
3) Naive Bayes
The Naive Bayes model was trained before hyperparameter tuning and achieved a ROC AUC score of 0.70. The training accuracy is also approximately equal to the testing accuracy, so the model seems decent, but it can be improved.

# Training Naive Bayes along with hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# mnb_bow = MultinomialNB(class_prior=[0.5, 0.5])
mnb_bow = MultinomialNB(class_prior=[0.8, 0.2])
parameters = {'alpha': [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005,
                        0.01, 0.05, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 5, 10,
                        50, 100, 500, 1000, 2500, 5000, 10000]}
clf = GridSearchCV(mnb_bow, parameters, cv=10, scoring='roc_auc',
                   verbose=1, return_train_score=True)
# clf.fit(x_cv_onehot_bow, y_cv)
clf.fit(x_train_onehot, y_train)

train_auc = clf.cv_results_['mean_train_score']
train_auc_std = clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std = clf.cv_results_['std_test_score']

bestAlpha_1 = clf.best_params_['alpha']
bestScore_1 = clf.best_score_
print("BEST ALPHA: ", bestAlpha_1, " BEST SCORE: ", bestScore_1)

# class_prior is used for balancing the classes
mnb_bow_testModel = MultinomialNB(alpha=bestAlpha_1, class_prior=[0.5, 0.5])
mnb_bow_testModel.fit(x_train_onehot, y_train)
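The `alpha` grid searched above controls additive (Laplace/Lidstone) smoothing in `MultinomialNB`, which is why it spans many orders of magnitude: a tiny `alpha` lets the observed word counts dominate, while a huge one washes the per-class word distributions toward uniform. A toy sketch on a hand-made bag-of-words matrix (illustrative counts, not the case-study features):

```python
# Sketch: effect of MultinomialNB's alpha (additive smoothing).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Two words, four documents, balanced classes; word 0 only occurs in class 1.
X = np.array([[3, 0], [4, 0], [0, 2], [0, 3]])
y = np.array([1, 1, 0, 0])
x_new = [[2, 0]]  # a new document using only the "class 1" word

p_small = MultinomialNB(alpha=0.001).fit(X, y).predict_proba(x_new)[0, 1]
p_large = MultinomialNB(alpha=1000).fit(X, y).predict_proba(x_new)[0, 1]
print(f"alpha=0.001 -> P(class 1) = {p_small:.3f}")  # counts dominate
print(f"alpha=1000  -> P(class 1) = {p_large:.3f}")  # smoothed toward 0.5
```

Grid search over this range then picks the smoothing strength that balances the two regimes on cross-validated ROC AUC.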
Based on the analysis, the Naive Bayes model demonstrates promising performance on both the test and validation sets. With a ROC AUC score of 0.80, the model discriminates effectively between the values 1 and 0. Furthermore, the model is well fitted, as indicated by the nearly equal training and testing accuracies.
GUI Development
Please check the Github link.
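The Gradio app itself lives in the repository; as a hedged sketch of the wiring, the pattern is a prediction function that Gradio turns into a web form. The toy model, essay texts, and function name below are illustrative stand-ins for the trained Naive Bayes pipeline from the notebook:

```python
# Sketch: a text-in, text-out prediction function in the shape Gradio expects.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real app would load the fitted
# vectorizer and Naive Bayes model from the case-study notebook.
essays = ["my students need new books for reading",
          "we are requesting books for our classroom library",
          "requesting a personal laptop",
          "funds for an unrelated personal purchase"]
labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.5))
model.fit(essays, labels)

def predict_approval(essay: str) -> str:
    """Map one project essay to an approval-probability string."""
    p = model.predict_proba([essay])[0, 1]
    return f"Approval probability: {p:.2f}"

# To serve it as a web form (requires `pip install gradio`):
#   import gradio as gr
#   gr.Interface(fn=predict_approval, inputs="text", outputs="text").launch()
```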
Conclusion. In conclusion, the provided dataset exhibits a high degree of class imbalance, along with several factors common to both the approved and non-approved groups, as explained in the case study. Several machine learning models were introduced to achieve the desired prediction. Among them, the Naive Bayes algorithm demonstrated decent scores, and the model turned out to be well trained, yielding satisfactory training and testing accuracies. Additionally, a GUI built with Gradio was included in the case study to improve user interaction and accessibility.
Github link: https://github.com/atharvnagrikar/DonorChooseFinalScaler
Jupyter Notebook: https://github.com/atharvnagrikar/DonorChooseFinalScaler/tree/main/JupyterNotebook
Further work. The case study could employ deep learning techniques to address the stated problem. Additionally, using a BERT tokenizer could help convert the text data into effective vector representations.
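To make the BERT-tokenizer suggestion concrete: BERT uses WordPiece subword tokenization, which splits an out-of-vocabulary word into known pieces instead of discarding it, as a fixed bag-of-words vocabulary would. A toy greedy longest-match sketch with a hand-made vocabulary (a real setup would use `transformers.BertTokenizer`):

```python
# Sketch: WordPiece-style greedy subword tokenization on a toy vocabulary.
# "##" marks a continuation piece, as in BERT's vocabulary files.
VOCAB = {"class", "##room", "books", "read", "##ing", "[UNK]"}

def wordpiece(word: str) -> list:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Greedily take the longest vocabulary match starting at `start`.
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no piece matched -> whole word is unknown
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("classroom"))  # -> ['class', '##room']
print(wordpiece("reading"))    # -> ['read', '##ing']
```

Downstream, these subword IDs feed a pretrained encoder, giving dense contextual vectors in place of sparse one-hot bag-of-words features.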
References:
- https://towardsdatascience.com/donorschoose-extensive-exploratory-data-analysis-eda-9e7879464d0d
- https://medium.com/@manturdipa/application-screening-donorschoose-dataset-d6e5f1827327
- https://www.kaggle.com/code/nikhilparmar9/naive-bayes-donorschoose-dataset#-%3E-8.1:-SET-1--Applying-Naive-Bayes-on-BOW
- https://www.scaler.com/
- https://gradio.app/guides/quickstart