The case study comprises the following key components:

  1. Problem statement
  2. Exploratory data analysis and preprocessing
  3. Training machine learning models
  4. Developing a GUI (graphical user interface)

Problem Statement

The case study attempts to predict the approval outcome of project proposals submitted by teachers on DonorsChoose.org. By analyzing the text of the project descriptions and incorporating relevant metadata about the project, the teacher, and the school, the study aims to build a predictive model. Such a model would let DonorsChoose.org efficiently flag projects that may need additional review before approval, improving the screening process with machine learning.

Exploratory Data Analysis and Preprocessing

a) Data analysis

The dataset for this analysis consists of two CSV files: "train_data.csv" and "resources.csv". We begin with the contents of "train_data.csv". It contains 109,248 rows and 16 columns covering a variety of features/attributes.

The dataset spans the following categories of features:

Categorical features:

'id', 'teacher_id', 'teacher_prefix', 'school_state',
'project_submitted_datetime', 'project_grade_category',
'project_subject_categories', 'project_subject_subcategories', 'project_is_approved'

Text features:

'project_essay_1', 'project_essay_2',
'project_essay_3', 'project_essay_4', 'project_resource_summary'

The dataset contains missing (null) values in several fields.

Now let's turn to the other file, resources.csv. This dataset contains 1,541,272 rows and 4 features/columns, a mix of numeric, text, and identifier data.

This dataset has no null values.
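The loading step itself is not shown in the notebook; a minimal sketch (the file paths and the helper name are assumptions) would be:

```python
import pandas as pd

def load_donorschoose(train_path, resources_path):
    """Load the two case-study CSVs and report shapes and total null counts.
    Paths are parameters because the notebook never shows the load step."""
    train_df = pd.read_csv(train_path)
    resources = pd.read_csv(resources_path)
    print("train:", train_df.shape, "nulls:", int(train_df.isnull().sum().sum()))
    print("resources:", resources.shape, "nulls:", int(resources.isnull().sum().sum()))
    return train_df, resources

# train_data_df, resource_df = load_donorschoose("train_data.csv", "resources.csv")
```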

b) Handling null values

Checking null values in teacher_prefix

According to the analysis there are no duplicates on the teacher_id feature, and since the dataset is large (109,248 rows), dropping the 3 rows with a missing teacher_prefix has no practical effect.

# Imports used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Dropping rows with a null teacher_prefix
train_data_df.dropna(subset=['teacher_prefix'], inplace=True)
train_data_df.shape

# checking null values
train_data_df.isnull().sum()

c) Data preprocessing

  1. Processing project_subject_categories

Let's inspect the feature:

# Working on project_subject_categories
project_subject_category = train_data_df["project_subject_categories"]
project_subject_category.head(20)

The data is formatted as follows:
1) Literacy & Language → literacy_language
2) History & Civics, Health & Sports → history_civics health_sports

# Remove ampersands, split the strings by commas
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories'].str.replace("&", "").str.split(",")

# Remove leading/trailing spaces, convert to lowercase, replace spaces with underscores
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories_updated'].apply(lambda words: [word.strip().lower().replace(" ", "_") for word in words])

# Concatenate the categories into a single string
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories_updated'].apply(lambda words: " ".join(words))
# Replace double underscores with single underscores
train_data_df['project_subject_categories_updated'] = train_data_df['project_subject_categories_updated'].str.replace("__", "_")

2. Processing project_subject_subcategories

The data is formatted as follows:
1) ESL, Literacy → esl literacy
2) Civics & Government, Team Sports → civics_government team_sports
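The code for this step is not shown in the notebook; reusing the same recipe as for project_subject_categories, a sketch (the helper name is mine) would be:

```python
import pandas as pd

def clean_subject_column(s: pd.Series) -> pd.Series:
    """Remove ampersands, split on commas, normalize each token to
    lowercase snake_case, and rejoin with spaces -- the same recipe
    applied to project_subject_categories above."""
    return (
        s.str.replace("&", "")
         .str.split(",")
         .apply(lambda words: " ".join(w.strip().lower().replace(" ", "_") for w in words))
         .str.replace("__", "_")
    )

# Example usage on the subcategories column:
# train_data_df['project_subject_subcategories_updated'] = \
#     clean_subject_column(train_data_df['project_subject_subcategories'])
```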

3. Processing project_grade_category

The data is formatted as follows:
1) Grades PreK-2 → grades_prek_2
2) Grades 3-5 → grades_3_5
3) Grades 6-8 → grades_6_8
4) Grades 9-12 → grades_9_12

# Split the strings on spaces
train_data_df['project_grade_category_updated'] = train_data_df['project_grade_category'].str.split(" ")


# Concatenate the tokens into a single lowercase snake_case string
train_data_df['project_grade_category_updated'] = train_data_df['project_grade_category_updated'].apply(lambda words: "_".join(word.strip().lower().replace("-", "_") for word in words))
train_data_df.head()

4. Processing teacher_prefix

The data is formatted as follows:
Mrs. → mrs
Ms. → ms
Mr. → mr
Teacher → teacher
Dr. → dr

# Strip the trailing period and lowercase: Mrs. -> mrs, Ms. -> ms, etc.
train_data_df['teacher_prefix_upgrade'] = train_data_df['teacher_prefix'].str.replace(r"\.", "", regex=True).str.lower()
train_data_df.head()

5. Processing school_state

Converting the state codes to lowercase.
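No code is shown for this step; a one-line sketch in the same style (the column name school_state_upgrade is taken from the later column selection, the helper name is mine) would be:

```python
import pandas as pd

def lowercase_state(s: pd.Series) -> pd.Series:
    """Lowercase the two-letter state codes, e.g. "CA" -> "ca"."""
    return s.str.lower()

# train_data_df['school_state_upgrade'] = lowercase_state(train_data_df['school_state'])
```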

6. Processing the essay features

Since the full essay is split across four features, project_essay_1 through project_essay_4, they need to be merged into a single feature.

The following essays are taken from id=p182444:

essay1:-
\r\n\"True champions aren't always the ones that win, but those with the most
 guts.\" By Mia Hamm This quote best describes how the students at Cholla Middle
 School approach playing sports, especially for the girls and boys soccer teams
. The teams are made up of 7th and 8th grade students, and most of them have 
not had the opportunity to play in an organized sport due to family financial
 difficulties. \r\nI teach at a Title One middle school in an urban neighborhood.
 74% of our students qualify for free and reduced lunch and many come from very 
activity/ sport opportunity-poor homes. My students love to participate in sports
 to learn new skills and be apart of team atmosphere. My school lacks the funding 
to meet my students' needs and I am concerned that their lack of exposure will not
 prepare them for the participating in sports and teams in high school. By the end 
of the school year, the goal is to provide our students with an opportunity to learn
 a variety of soccer skills, and positive qualities of a person who actively participates 
on a team.
essay2:-
The students on the campus come to school knowing they face an uphill battle 
when it comes to participating in organized sports. The players would thrive 
on the field, with the confidence from having the appropriate soccer equipment 
to play soccer to the best of their abilities. The students will experience 
how to be a helpful person by being part of team that teaches them to be 
positive, supportive, and encouraging to others. \r\nMy students will be using
 the soccer equipment during practice and games on a daily basis to learn and 
practice the necessary skills to develop a strong soccer team. This experience
 will create the opportunity for students to learn about being part of a team,
 and how to be a positive contribution for their teammates. The students will 
get the opportunity to learn and practice a variety of soccer skills, and how 
to use those skills during a game.  Access to this type of experience is nearly
 impossible without soccer equipment for the students/ players to utilize 
during practice and games .
essay3:-
Nan
essay4:-
Nan

# Merging the four essay columns into a single text column
train_data_df["essay"] = train_data_df["project_essay_1"].map(str) +\
                        train_data_df["project_essay_2"].map(str) + \
                        train_data_df["project_essay_3"].map(str) + \
                        train_data_df["project_essay_4"].map(str)
# Working essay column
# 1) Initial data
train_data_df[train_data_df["id"]=="p182444"]["essay"].values[0]


\\r\\n\\"True champions aren\'t always the ones that win, but those with the
 most guts.\\" By Mia Hamm This quote best describes how the students at 
Cholla Middle School approach playing sports, especially for the girls and 
boys soccer teams. The teams are made up of 7th and 8th grade students, and 
most of them have not had the opportunity to play in an organized sport due 
to family financial difficulties. \\r\\nI teach at a Title One middle school
 in an urban neighborhood. 74% of our students qualify for free and reduced 
lunch and many come from very activity/ sport opportunity-poor homes. My 
students love to participate in sports to learn new skills and be apart 
of team atmosphere. My school lacks the funding to meet my students’ needs
 and I am concerned that their lack of exposure will not prepare them for 
the participating in sports and teams in high school. By the end of the school
 year, the goal is to provide our students with an opportunity to learn a 
variety of soccer skills, and positive qualities of a person who actively 
participates on a team.The students on the campus come to school knowing
 they face an uphill battle when it comes to participating in organized
 sports. The players would thrive on the field, with the confidence from 
having the appropriate soccer equipment to play soccer to the best of their
 abilities. The students will experience how to be a helpful person by being
 part of team that teaches them to be positive, supportive, and encouraging 
to others. \\r\\nMy students will be using the soccer equipment during 
practice and games on a daily basis to learn and practice the necessary
 skills to develop a strong soccer team. This experience will create the 
opportunity for students to learn about being part of a team, and how to be
 a positive contribution for their teammates. The students will get the 
opportunity to learn and practice a variety of soccer skills, and how to use 
those skills during a game.  Access to this type of experience is nearly 
impossible without soccer equipment for the students/ players to utilize 
during practice and games .nannan

Since the merged essay contains unwanted character sequences, i.e. \r, \n, nan, etc., further processing is required. Note that stripping the literal substring "nan" also mangles words that contain it: "financial" becomes "ficial" in the processed output below.

# performing action on entire dataset
train_data_df['essay_updated'] = train_data_df['essay'].str.replace(r'\\r|\\n|\\|nan',"", regex=True)

Processed data:-

True champions aren\'t always the ones that win, but those with the most guts.
" By Mia Hamm This quote best describes how the students at Cholla Middle
 School approach playing sports, especially for the girls and boys soccer 
teams. The teams are made up of 7th and 8th grade students, and most of them
 have not had the opportunity to play in an organized sport due to family 
ficial difficulties. I teach at a Title One middle school in an urban 
neighborhood. 74% of our students qualify for free and reduced lunch and 
many come from very activity/ sport opportunity-poor homes. My students love 
to participate in sports to learn new skills and be apart of team atmosphere.
 My school lacks the funding to meet my students’ needs and I am concerned 
that their lack of exposure will not prepare them for the participating in 
sports and teams in high school. By the end of the school year, the goal is
 to provide our students with an opportunity to learn a variety of soccer 
skills, and positive qualities of a person who actively participates on a team.
The students on the campus come to school knowing they face an uphill battle
 when it comes to participating in organized sports. The players would thrive 
on the field, with the confidence from having the appropriate soccer equipment
 to play soccer to the best of their abilities. The students will experience 
how to be a helpful person by being part of team that teaches them to be 
positive, supportive, and encouraging to others. My students will be using the 
soccer equipment during practice and games on a daily basis to learn and 
practice the necessary skills to develop a strong soccer team. This experience
 will create the opportunity for students to learn about being part of a team,
 and how to be a positive contribution for their teammates. The students will
 get the opportunity to learn and practice a variety of soccer skills, and
 how to use those skills during a game.  Access to this type of experience is
 nearly impossible without soccer equipment for the students/ players to 
utilize during practice and games 
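The "nan" artifacts can be avoided entirely by filling missing essays with empty strings before concatenation. A sketch of that alternative (not the notebook's approach; the helper name is mine):

```python
import pandas as pd

def merge_essays(df: pd.DataFrame) -> pd.Series:
    """Concatenate the four essay columns, treating missing essays as
    empty strings so no literal 'nan' ever enters the merged text."""
    essay_cols = ["project_essay_1", "project_essay_2",
                  "project_essay_3", "project_essay_4"]
    return df[essay_cols].fillna("").agg("".join, axis=1)

# train_data_df["essay"] = merge_essays(train_data_df)
```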

7. Processing project_title and project_resource_summary

The same filter is applied to project_title and project_resource_summary.

# performing action on project_title
train_data_df['project_title'] = train_data_df['project_title'].str.replace(r'\\r|\\n|\\|nan',"", regex=True)
# performing action on project_resource_summary
train_data_df['project_resource_summary'] = train_data_df['project_resource_summary'].str.replace(r'\\r|\\n|\\|nan',"", regex=True)

Creating a new dataframe

# generating the main dataset
train_df=train_data_df[['id', 'teacher_prefix_upgrade', 'school_state_upgrade','project_title','project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'project_subject_categories_updated',
       'project_subject_subcategories_updated',
       'project_grade_category_updated','essay_updated']]

train_df.head()

8. Generating new features from resources.csv

We generate additional features by aggregating the price and quantity columns from "resources.csv". The "description" feature already exists in summarized form as the "project_resource_summary" column of the training data, so we do not include it in our selection.

resource_price_data = resource_df.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
resource_price_data.head()

Let's merge train_df and resource_price_data.

# merging train_df and resource_price_data
final_df = pd.merge(train_df, resource_price_data, on='id', how='left')
final_df.head()

Rename the columns back to shorter names and drop the id column, since it is not needed for the modeling task.

# The merged frame becomes the preprocessed dataset (the notebook switches names here)
preprocessed_data_df = final_df.copy()
preprocessed_data_df.rename(columns={'teacher_prefix_upgrade':"teacher_prefix", 'school_state_upgrade':"school_state",
       'project_subject_categories_updated':"project_subject_categories",
       'project_subject_subcategories_updated':"project_subject_subcategories",
       'project_grade_category_updated':"project_grade_category", 'essay_updated':"essay"},inplace=True)
preprocessed_data_df.drop(columns=["id"],inplace=True)

d) Graphical analysis and hypothesis testing

Analysis of categorical variables with hypothesis testing

In this section we analyze the categorical features and assess their significance with respect to the dependent variable via hypothesis testing.

  1. Analysis of project_subject_categories

from scipy.stats import chi2_contingency
'''
Analyzing project subject categories
'''
# Create a contingency table for project_subject_categories (with 'All' margins)
contingency_table_subject_category = pd.crosstab(preprocessed_data_df["project_subject_categories"], preprocessed_data_df["project_is_approved"], margins=True)

# Sort the contingency table in descending order of total count
contingency_table_subject_category_sorted = contingency_table_subject_category.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_table_subject_category_sorted

# Select the 10 largest rows by count of approved projects (column 1)
top_10_values = contingency_table_subject_category.nlargest(10, columns=1)

# dropping 'All'
top_10_values.drop(index="All", inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))  # Width, height in inches

# Create a bar plot with stacked bars
barplot = top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)

# Show the plot
plt.show()
# Select top 10 categories
top_10_categories = contingency_table_subject_category.nlargest(10, columns=1).index
print(top_10_categories)

# Filter contingency table for top 10 categories
contingency_table_top_10 = contingency_table_subject_category.loc[top_10_categories]

# Set the figure size
plt.figure(figsize=(8,8))

# Create the heatmap with annotations
sns.heatmap(contingency_table_top_10, cmap="YlGnBu",
            annot=True, cbar=False, fmt='d')

# Show the plot
plt.show()

According to the analysis, literacy_language, math_science, and health_sports account for the bulk of approved projects.

Hypothesis testing

# Hypothesis testing
'''
Null hypothesis (H0): There is no association between the category column(project_subject_categories) and the target variable(project_is_approved).
Alternative hypothesis (HA): There is an association between the category column(project_subject_categories) and the target variable(project_is_approved).
'''

# Perform chi-square test (excluding the 'All' margins, which would distort the test)
chi2, p_value, dof, expected = chi2_contingency(contingency_table_subject_category.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
# print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# the critical region (alias the chi2 distribution to avoid shadowing the statistic above)
from scipy.stats import chi2 as chi2_dist
cr = chi2_dist.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

Since the p-value is less than alpha and the chi-square statistic falls beyond the critical region, we reject the null hypothesis.

Hence, per the test, project approval depends on project_subject_categories.
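The decision rule used throughout these tests can be made explicit in a small helper (a sketch; the function name is mine, alpha = 0.05 as in the notebook, and the 'All' margins produced by crosstab(..., margins=True) must be excluded before testing):

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats import chi2 as chi2_dist

def chi_square_decision(table, alpha=0.05):
    """Run a chi-square independence test on a contingency table
    (without margin rows/columns) and report the decision."""
    stat, p_value, dof, _ = chi2_contingency(table)
    critical = chi2_dist.ppf(1 - alpha, df=dof)
    reject = bool(p_value < alpha)   # equivalent to stat > critical
    return {"chi2": stat, "p_value": p_value, "dof": dof,
            "critical": critical, "reject_H0": reject}
```

Applied to the tables in this section, this reproduces the reject/fail-to-reject decision without manually comparing the printed numbers.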

2. Analysis of project_subject_subcategories

'''
Analyzing project_subject_subcategories
'''
# Create a contingency table for project_subject_categories
contingency_table_project_subject_subcategories = pd.crosstab(preprocessed_data_df["project_subject_subcategories"], preprocessed_data_df["project_is_approved"],margins=True)
# Sort the contingency table in descending order
contingency_table_project_subject_subcategories_sorted = contingency_table_project_subject_subcategories.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_table_project_subject_subcategories_sorted.head(10)

# Select top 10 values
top_10_values = contingency_table_project_subject_subcategories.nlargest(10, columns=1)
# dropping 'All'
top_10_values.drop(index="All",inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))  # Width, height in inches

# Create a bar plot with stacked bars
barplot = top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)

# Show the plot
plt.show()
# Select top 10 categories
top_10_categories = contingency_table_project_subject_subcategories.nlargest(10, columns=1).index

# Filter contingency table for top 10 categories
contingency_table_top_10 = contingency_table_project_subject_subcategories.loc[top_10_categories]

# Set the figure size
plt.figure(figsize=(8,8))

# Create the heatmap with annotations
sns.heatmap(contingency_table_top_10, cmap="YlGnBu",
            annot=True, cbar=False, fmt='d')

# Show the plot
plt.show()

According to the analysis, the subcategories literacy, literacy mathematics, literature_writing mathematics, literacy literature_writing, mathematics, literature_writing, and special_needs have the largest data volumes.

Hypothesis testing

# Hypothesis testing
'''
Null hypothesis (H0): There is no association between the category column (project_subject_subcategories) and the target variable (project_is_approved).
Alternative hypothesis (HA): There is an association between the category column (project_subject_subcategories) and the target variable (project_is_approved).
'''
# Perform chi-square test (excluding the 'All' margins)
chi2, p_value, dof, expected = chi2_contingency(contingency_table_project_subject_subcategories.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
# print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# the critical region (alias the chi2 distribution to avoid shadowing the statistic above)
from scipy.stats import chi2 as chi2_dist
cr = chi2_dist.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

'''
As the p-value is less than alpha and the chi-square statistic falls beyond the critical region, we reject the null hypothesis.
Hence project approval depends on project_subject_subcategories.
'''

3. Analysis of teacher_prefix

'''
Analyzing teacher prefix
'''
# Create a contingency table for teacher_prefix
contingency_teacher_prefix = pd.crosstab(preprocessed_data_df["teacher_prefix"], preprocessed_data_df["project_is_approved"],margins=True)
# Sort the contingency table in descending order
contingency_teacher_prefix_sorted = contingency_teacher_prefix.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_teacher_prefix_sorted

# Plotting barplot of the top categories
# Select top 10 values
top_10_values = contingency_teacher_prefix.nlargest(10, columns=1)
# dropping 'All'
top_10_values.drop(index="All",inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))  # Width, height in inches

# Create a bar plot with stacked bars
barplot = top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)

# Show the plot
plt.show()
# Plotting heatmap of the top categories
# Select top 10 categories
top_10_categories = contingency_teacher_prefix.nlargest(10, columns=1).index

# Filter contingency table for top 10 categories
contingency_table_top_10 = contingency_teacher_prefix.loc[top_10_categories]

# Set the figure size
plt.figure(figsize=(8,8))

# Create the heatmap with annotations
sns.heatmap(contingency_table_top_10, cmap="YlGnBu",
            annot=True, cbar=False, fmt='d')

# Show the plot
plt.show()

According to the analysis, mrs and ms account for the majority of approved projects.

Hypothesis testing

# hypothesis testing
'''
Null hypothesis (H0): There is no association between the category column(teacher_prefix) and the target variable(project_is_approved).
Alternative hypothesis (HA): There is an association between the category column(teacher_prefix) and the target variable(project_is_approved).
'''

# Perform chi-square test (excluding the 'All' margins)
chi2, p_value, dof, expected = chi2_contingency(contingency_teacher_prefix.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
# print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# the critical region (alias the chi2 distribution to avoid shadowing the statistic above)
from scipy.stats import chi2 as chi2_dist
cr = chi2_dist.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

As the p-value is less than alpha and the chi-square statistic falls beyond the critical region, we reject the null hypothesis.
Hence project approval depends on teacher_prefix.

4. Analysis of school_state

'''
Analyzing school_state
'''
# Create a contingency table for school_state
contingency_school_state_category = pd.crosstab(preprocessed_data_df["school_state"], preprocessed_data_df["project_is_approved"],margins=True)
# Sort the contingency table in descending order
contingency_school_state_category_sorted = contingency_school_state_category.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_school_state_category_sorted

# Barplot analysis of the top 10 school states

# Select top 10 values
top_10_values = contingency_school_state_category.nlargest(10, columns=1)
# dropping 'All'
top_10_values.drop(index="All",inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))  # Width, height in inches

# Create a bar plot with stacked bars
barplot = top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)

# Show the plot
plt.show()

# Heatmap analysis of the top 10 school states
# Select top 10 categories
top_10_categories = contingency_school_state_category.nlargest(10, columns=1).index

# Filter contingency table for top 10 categories
contingency_table_top_10 = contingency_school_state_category.loc[top_10_categories]

# Set the figure size
plt.figure(figsize=(8,8))

# Create the heatmap with annotations
sns.heatmap(contingency_table_top_10, cmap="YlGnBu",
            annot=True, cbar=False, fmt='d')

# Show the plot
plt.show()

According to the analysis, ca and ny have the highest counts of approved projects.

Hypothesis testing

# hypothesis testing
'''
Null hypothesis (H0): There is no association between the category column(school_state) and the target variable(project_is_approved).
Alternative hypothesis (HA): There is an association between the category column(school_state) and the target variable(project_is_approved).
'''

# Perform chi-square test (excluding the 'All' margins)
chi2, p_value, dof, expected = chi2_contingency(contingency_school_state_category.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
# print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# the critical region (alias the chi2 distribution to avoid shadowing the statistic above)
from scipy.stats import chi2 as chi2_dist
cr = chi2_dist.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

As the p-value is less than alpha and the chi-square statistic falls beyond the critical region, we reject the null hypothesis.
Hence project approval depends on school_state.

5. Analysis of project_grade_category

'''
Analyzing project_grade_category
'''
# Create a contingency table for project_grade_category
contingency_project_grade_category = pd.crosstab(preprocessed_data_df["project_grade_category"], preprocessed_data_df["project_is_approved"],margins=True)
# Sort the contingency table in descending order
contingency_project_grade_category_sorted = contingency_project_grade_category.sort_values(by="All", ascending=False)

# Display the contingency table
contingency_project_grade_category_sorted

# Barplot analysis of project grade
top_10_values = contingency_project_grade_category.nlargest(10, columns=1)
# dropping 'All'
top_10_values.drop(index="All",inplace=True)

# Reverse the order to show top values first
top_10_values = top_10_values.iloc[::-1]

# Set the figure size
fig, ax = plt.subplots(figsize=(8,8))  # Width, height in inches

# Create a bar plot with stacked bars
barplot = top_10_values.plot(kind='barh', stacked=True, rot=0, ax=ax)

# Show the plot
plt.show()
# Heatmap analysis of project grade
top_10_categories = contingency_project_grade_category.nlargest(10, columns=1).index

# Filter contingency table for top 10 categories
contingency_table_top_10 = contingency_project_grade_category.loc[top_10_categories]

# Set the figure size
plt.figure(figsize=(8,8))

# Create the heatmap with annotations
sns.heatmap(contingency_table_top_10, cmap="YlGnBu",
            annot=True, cbar=False, fmt='d')

# Show the plot
plt.show()

According to the analysis, grades_prek_2 and grades_3_5 account for most approved projects.

Hypothesis testing

# hypothesis testing
'''
Null hypothesis (H0): There is no association between the category column(project_grade_category) and the target variable(project_is_approved).
Alternative hypothesis (HA): There is an association between the category column(project_grade_category) and the target variable(project_is_approved).
'''

# Perform chi-square test (excluding the 'All' margins)
chi2, p_value, dof, expected = chi2_contingency(contingency_project_grade_category.drop(index="All", columns="All"))

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
# print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# the critical region (alias the chi2 distribution to avoid shadowing the statistic above)
from scipy.stats import chi2 as chi2_dist
cr = chi2_dist.ppf(q=0.95, df=dof)
print("Critical region =", cr)
alpha = 1 - 0.95
print("alpha =", alpha)

As the p-value is less than alpha and the chi-square statistic falls beyond the critical region, we reject the null hypothesis.
Hence project approval depends on project_grade_category.

6. Countplot analysis of project_is_approved

# Countplot for the project_is_approved
plt.figure(figsize = (10,8))
sns.countplot(data=preprocessed_data_df,
          y='project_is_approved')

# Checking value counts
preprocessed_data_df["project_is_approved"].value_counts()

According to the analysis, the ratio of 1s to 0s is roughly 80:20, i.e. the classes are imbalanced.

Analysis of numerical variables and hypothesis testing

A) Boxplot and KDE analysis of the quantity feature

# Boxplot analysis of quantity feature
df_box = preprocessed_data_df[preprocessed_data_df['project_is_approved'].isin([0, 1])]
plt.figure(figsize=(15, 8))
sns.boxplot(x="project_is_approved", y="quantity", data=df_box)
plt.xticks(ticks=[0, 1], labels=["0", "1"])
plt.show()

As per the analysis, there are outliers in both the 0 and 1 groups of the quantity feature.

# Calculate percentiles
percentiles = np.arange(1, 101)
values = np.percentile(preprocessed_data_df['quantity'], percentiles)

# Plotting using Seaborn
sns.lineplot(x=percentiles, y=values)
plt.ylabel('Value')
plt.title('Percentile Plot')
plt.grid(True)
plt.show()

According to the analysis there are outliers at the extreme percentiles, roughly from the 99th percentile to the 100th, so they need to be removed.
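The removal step itself is not shown in the notebook; one minimal sketch (the function name and the 99th-percentile threshold are assumptions based on the percentile plot) is:

```python
import pandas as pd

def remove_upper_outliers(df: pd.DataFrame, column: str, q: float = 0.99) -> pd.DataFrame:
    """Drop rows whose value in `column` exceeds the q-th quantile.
    q=0.99 follows the percentile plot, which shows outliers above ~99%."""
    threshold = df[column].quantile(q)
    return df[df[column] <= threshold]

# preprocessed_data_df = remove_upper_outliers(preprocessed_data_df, "quantity")
```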

Hypothesis testing

# Hypothesis testing
'''
The objective of the t-test here is to determine whether there is a statistically
significant difference between the means of the numerical feature (quantity)
for the two binary target classes.
Null hypothesis H0: no difference between the group means.
Alternative hypothesis H1: a difference between the group means.
Parameters:
Confidence level = 95%
alpha = 0.05
'''
# Import library
import scipy.stats as stats
t_value,p_value=stats.ttest_ind(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["quantity"].values,preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["quantity"].values)
print("t_value =",t_value)
print("p_value =",p_value)
print("alpha_value= ",0.05)

As the p-value is less than alpha, we reject the null hypothesis.
Hence, per the test, the mean quantity differs between the approved and not-approved groups.

# Overlap (KDE) plot for the quantity data
plt.figure(figsize=(8, 10))
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["quantity"].values, fill=True, color="r")
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["quantity"].values, fill=True, color="b")

According to the analysis, the difference between the 1 and 0 groups is slight and barely noticeable; the spread of both groups is almost the same.

B) Boxplot and KDE analysis of teacher_number_of_previously_posted_projects

df_box = preprocessed_data_df[preprocessed_data_df['project_is_approved'].isin([0, 1])]
plt.figure(figsize=(15, 8))
sns.boxplot(x="project_is_approved", y="teacher_number_of_previously_posted_projects", data=df_box)
plt.xticks(ticks=[0, 1], labels=["0", "1"])
plt.show()

As per the analysis, there are outliers in both the 0 and 1 groups of the teacher_number_of_previously_posted_projects feature.

# Calculate percentiles
percentiles = np.arange(1, 101)
values = np.percentile(preprocessed_data_df['teacher_number_of_previously_posted_projects'], percentiles)

# Plotting using Seaborn
sns.lineplot(x=percentiles, y=values)
plt.ylabel('Value')
plt.title('Percentile Plot')
plt.grid(True)
plt.show()

According to the analysis there are outliers at the extreme percentiles, roughly from the 98th percentile to the 100th, so they need to be removed.

Hypothesis testing

'''
The objective of the t-test here is to determine whether there is a statistically
significant difference between the means of the numerical feature
(teacher_number_of_previously_posted_projects) for the two binary target classes.
Null hypothesis H0: no difference between the group means.
Alternative hypothesis H1: a difference between the group means.
Parameters:
Confidence level = 95%
alpha = 0.05
'''
# Import library
import scipy.stats as stats
t_value,p_value=stats.ttest_ind(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["teacher_number_of_previously_posted_projects"].values,preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["teacher_number_of_previously_posted_projects"].values)
print("t_value =",t_value)
print("p_value =",p_value)
print("alpha value= ",0.05)

As the p_value is less than alpha (p_value < alpha), we reject the null hypothesis.
Hence, per the hypothesis test, the mean of teacher_number_of_previously_posted_projects differs between approved and not approved projects.
# Overlapping KDE plot for the teacher_number_of_previously_posted_projects data
plt.figure(figsize = (8,10))
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["teacher_number_of_previously_posted_projects"].values, fill=True, color="r")  # fill replaces the deprecated shade argument
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["teacher_number_of_previously_posted_projects"].values, fill=True, color="b")

Per the KDE analysis, the difference between class 1 and class 0 is barely noticeable, and the spread of both groups is almost the same.

C) Boxplot and KDE plot analysis for the price data

# boxplot analysis of price
df_box = preprocessed_data_df[preprocessed_data_df['project_is_approved'].isin([0, 1])]
plt.figure(figsize=(15, 8))
sns.boxplot(x="project_is_approved", y="price", data=df_box)
plt.xticks(ticks=[0, 1], labels=["0", "1"])
plt.show()

Per the boxplot analysis, there are outliers in both the 0 and 1 classes of the price feature.
# Calculate percentiles
percentiles = np.arange(1, 101)
values = np.percentile(preprocessed_data_df['price'], percentiles)

# Plotting using Seaborn
sns.lineplot(x=percentiles, y=values)
plt.ylabel('Value')
plt.title('Percentile Plot')
plt.grid(True)
plt.show()

Per the percentile analysis, there are outliers at the extreme points (roughly the 97th to 100th percentiles), so they need to be removed.

Hypothesis testing

'''
The objective of the t-test in this context is to determine whether there is a statistically significant difference between the means of the numerical data (price) for the two binary target classes.
Null hypothesis H0: no difference between the group means.
Alternate hypothesis H1: difference between the group means.
Considering the following parameters:
Confidence level = 95%
alpha = 0.05
'''
# Import library
import scipy.stats as stats
t_value,p_value=stats.ttest_ind(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["price"].values,preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["price"].values)
print("t_value =",t_value)
print("p_value =",p_value)
print("alpha value= ",0.05)

As the p_value is less than alpha (p_value < alpha), we reject the null hypothesis.
Hence, per the hypothesis test, the mean price differs between approved and not approved projects.
# Overlapping KDE plot for the price data
plt.figure(figsize = (8,10))
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==1]["price"].values, fill=True, color="r")  # fill replaces the deprecated shade argument
sns.kdeplot(preprocessed_data_df[preprocessed_data_df["project_is_approved"]==0]["price"].values, fill=True, color="b")

Per the KDE analysis, the difference between class 1 and class 0 is barely noticeable, and the spread of both groups is almost the same.

Text data analysis

# Libraries for text processing
import re, nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

STOPWORDS_EN = set(stopwords.words('english'))  # build the set once instead of rebuilding it per word

def clean_tokenized_sentence(s):
    """Lowercases a sentence, strips punctuation and removes English stopwords."""
    cleaned_words = []
    for word in nltk.word_tokenize(s):
        c_word = re.sub(r'[^\w\s]', '', word.lower())  # lowercase and remove punctuation
        if c_word and c_word not in STOPWORDS_EN:      # skip empty strings and stopwords
            cleaned_words.append(c_word)
    return " ".join(cleaned_words)
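To make the behaviour of `clean_tokenized_sentence` concrete, here is a stdlib-only stand-in performing the same three steps (lowercase, strip punctuation, drop stopwords). The tiny `STOPWORDS` set and the whitespace tokenizer are illustrative simplifications of nltk's English stopword list and `word_tokenize`:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "and", "to", "for"}  # toy stand-in for nltk's list

def clean_sentence(s):
    kept = []
    for word in s.split():                        # simple whitespace tokenizer
        w = re.sub(r"[^\w\s]", "", word.lower())  # lowercase + strip punctuation
        if w and w not in STOPWORDS:              # drop empties and stopwords
            kept.append(w)
    return " ".join(kept)

print(clean_sentence("The students are excited to read!"))  # → students excited read
```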
# creating new dataframe for analysis
df_text=preprocessed_data_df[['project_title','essay','project_is_approved','project_resource_summary']]
df_text.head(2)
  1. Essay feature analysis
# creating the cleaned essay text used for this analysis
df_text["essay_message"] = df_text["essay"].apply(clean_tokenized_sentence)
# the top words from essays in approved projects
# selecting approved essays
approved=df_text[df_text["project_is_approved"]==1]["essay_message"]
# combining all the text data
approved = " ".join(approved)
# splitting the entire text into words
approved=approved.split()
# finding the top 50 words most used in approved essays
from collections import Counter
counter_approved = Counter(approved).most_common(50)
counter_approved

# Extracting word and frequency for plotting
top_words = [word for word, freq in counter_approved]
word_freq = [freq for word, freq in counter_approved]

# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Approved Essays')
plt.tight_layout()
plt.show()

# Analysing top words from essays in not approved projects
# selecting not approved essays
not_approved=df_text[df_text["project_is_approved"]==0]["essay_message"]
# combining all the text data
not_approved = " ".join(not_approved)
# splitting the entire text into words
not_approved=not_approved.split()
# finding the top 50 words most used in not approved essays
from collections import Counter
counter_not_approved = Counter(not_approved).most_common(50)
counter_not_approved

# Extracting word and frequency for plotting
top_words = [word for word, freq in counter_not_approved]
word_freq = [freq for word, freq in counter_not_approved]

# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in not Approved Essays')
plt.tight_layout()
plt.show()

# calculating the common top 50 words from the approved and not approved essays
top50_approved=[x[0] for x in counter_approved]
top50_notapproved=[x[0] for x in counter_not_approved]

intersection_words = list(set(top50_approved).intersection(top50_notapproved))

print(intersection_words)
print("len of intersection_words= ",len(intersection_words))
['use', 'like', 'classroom', 'one', 'many', 'come', 'project', 'learn', 
'make', 'help', 'different', 'year', 'get', 'work', 'also', 'provide', 
'learning', 'time', 'books', 'day', 'need', 'class', 'reading', 'teach', 
'children', 'math', 'want', 'best', 'students', 'create', 'needs', 'new',
 'materials', 'technology', 'world', 'every', 'love', 'able', 'school',
 'student', 'grade', 'would', 'way', 'skills', 'allow']

len of intersection_words=  45

Per the analysis, there are 45 words common between the approved and not approved essays.
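The top-words intersection computed above follows a simple pattern: count words per class, keep the top N, and intersect the two sets. A toy sketch with invented mini-corpora:

```python
from collections import Counter

approved_words = "students need books books learning help help help pencils".split()
rejected_words = "students want tablets help help learning fun".split()

# top-5 words per class (Counter breaks count ties in first-seen order)
top_approved = {w for w, _ in Counter(approved_words).most_common(5)}
top_rejected = {w for w, _ in Counter(rejected_words).most_common(5)}

common = top_approved & top_rejected
print(sorted(common))  # → ['help', 'learning', 'students']
```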

2. Project title data analysis

df_text["project_title_message"] = df_text["project_title"].apply(clean_tokenized_sentence)
df_text.head(2)
# Analysing top words from project titles in approved projects
# selecting approved project titles
approved_title=df_text[df_text["project_is_approved"]==1]["project_title_message"]
# combining all the text data
approved_title = " ".join(approved_title)
# splitting the entire text into words
approved_title=approved_title.split()
# finding the top 50 words most used in approved project titles
from collections import Counter
counter_approved_title = Counter(approved_title).most_common(50)
counter_approved_title

# Extracting word and frequency for plotting
top_words = [word for word, freq in counter_approved_title]
word_freq = [freq for word, freq in counter_approved_title]

# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Approved Project Titles')
plt.tight_layout()
plt.show()

# Analysing top words from project titles in not approved projects
# selecting not approved project titles
not_approved_title=df_text[df_text["project_is_approved"]==0]["project_title_message"]
# combining all the text data
not_approved_title = " ".join(not_approved_title)
# splitting the entire text into words
not_approved_title=not_approved_title.split()
# finding the top 50 words most used in not approved project titles
from collections import Counter
counter_not_approved_title = Counter(not_approved_title).most_common(50)
counter_not_approved_title

# Extracting word and frequency for plotting
top_words = [word for word, freq in counter_not_approved_title]
word_freq = [freq for word, freq in counter_not_approved_title]

# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in not Approved Project Titles')
plt.tight_layout()
plt.show()

# calculating the common top 50 words from the approved and not approved titles
top50_approved_title=[x[0] for x in counter_approved_title]
top50_not_approved_title=[x[0] for x in counter_not_approved_title]
intersection_words_title = list(set(top50_approved_title).intersection(top50_not_approved_title))

print(intersection_words_title)
print("len of intersection_words= ",len(intersection_words_title))
['use', 'like', 'classroom', 'one', 'many', 'come', 'project', 
'learn', 'make', 'help', 'different', 'year', 'get', 'work', 
'also', 'provide', 'learning', 'time', 'books', 'day', 'need', 
'class', 'reading', 'teach', 'children', 'math', 'want', 'best', 
'students', 'create', 'needs', 'new', 'materials', 'technology', 
'world', 'every', 'love', 'able', 'school', 'student', 'grade', 
'would', 'way', 'skills', 'allow']

len of intersection_words=  45

Per the analysis, there are 45 words common between the approved and not approved titles.

3. project_resource_summary data analysis

df_text["project_resource_summary_message"] = df_text["project_resource_summary"].apply(clean_tokenized_sentence)

df_text["project_resource_summary"]=df_text["project_resource_summary_message"]
df_text.head(2)
# Analysing top words from project_resource_summary in approved projects
# selecting approved summaries
approved_project_resource_summary=df_text[df_text["project_is_approved"]==1]["project_resource_summary"]
# combining all the text data
approved_project_resource_summary = " ".join(approved_project_resource_summary)
# splitting the entire text into words
approved_project_resource_summary=approved_project_resource_summary.split()
# finding the top 50 words most used in approved summaries
from collections import Counter
counter_approved_project_resource_summary = Counter(approved_project_resource_summary).most_common(50)
counter_approved_project_resource_summary

# Extracting word and frequency for plotting
top_words = [word for word, freq in counter_approved_project_resource_summary]
word_freq = [freq for word, freq in counter_approved_project_resource_summary]

# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in Approved project_resource_summary')
plt.tight_layout()
plt.show()

# Analysing top words from project_resource_summary in not approved projects
# selecting not approved summaries
not_approved_project_resource_summary=df_text[df_text["project_is_approved"]==0]["project_resource_summary"]
# combining all the text data
not_approved_project_resource_summary = " ".join(not_approved_project_resource_summary)
# splitting the entire text into words
not_approved_project_resource_summary=not_approved_project_resource_summary.split()
# finding the top 50 words most used in not approved summaries
from collections import Counter
counter_not_approved_project_resource_summary = Counter(not_approved_project_resource_summary).most_common(50)
counter_not_approved_project_resource_summary

# Extracting word and frequency for plotting
top_words = [word for word, freq in counter_not_approved_project_resource_summary]
word_freq = [freq for word, freq in counter_not_approved_project_resource_summary]

# Plotting the top words
plt.figure(figsize=(10, 6))
plt.bar(top_words, word_freq)
plt.xticks(rotation=90)
plt.xlabel('Top Words')
plt.ylabel('Frequency')
plt.title('Top Words in not Approved project_resource_summary')
plt.tight_layout()
plt.show()

# calculating the common top 50 words from the approved and not approved summaries

top50_approved_project_resource_summary=[x[0] for x in counter_approved_project_resource_summary]
top50_not_approved_project_resource_summary=[x[0] for x in counter_not_approved_project_resource_summary]

intersection_words_summary = list(set(top50_approved_project_resource_summary).intersection(top50_not_approved_project_resource_summary))

print(intersection_words_summary)
print("len of intersection_words= ",len(intersection_words_summary))
['class', 'time', 'skills', 'keep', 'students', 'school', 
'new', 'science', 'access', 'use', 'writing', 'need', 'make',
 'technology', 'enhance', 'learning', 'chairs', 'learn', 'work', 
'supplies', 'materials', 'center', 'seating', 'projects', 'read', 
'order', 'classroom', 'activities', 'books', 'allow', 'ipad', 'art',
 'help', 'paper', 'create', 'able', 'flexible', 'reading', 'math']

len of intersection_words=  39

Per the analysis, there are 39 words common between the approved and not approved resource summaries.

f) Splitting the data into training, test and validation subsets

from sklearn.model_selection import train_test_split
X,x_val,Y,y_val=train_test_split(
    new_df,
    new_df['project_is_approved'],
    test_size=0.2,
    random_state=42,
    stratify=new_df['project_is_approved'])


print("X    : ",X.shape)
print("x_val: ",x_val.shape)
print("Y    : ",Y.shape)
print("y_val: ",y_val.shape)

# splitting the remaining data into train and test splits
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42, stratify=Y)

print("x_train: ",x_train.shape)
print("y_train: ",y_train.shape)
print("x_val   : ",x_val.shape)
print("y_val   : ",y_val.shape)
print("x_test : ",x_test.shape)
print("y_test : ",y_test.shape)
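For reference, the two-stage split above yields roughly a 60/20/20 partition. The sketch below reproduces the sizes, assuming scikit-learn's convention of rounding the test share up (`ceil`) and giving the remainder to the train side; 109,248 is the row count stated earlier in the case study:

```python
import math

def split_sizes(n, test_size):
    """Mimic train_test_split's sizing: the held-out side gets ceil(n * test_size)."""
    n_test = math.ceil(n * test_size)
    return n - n_test, n_test

n_total = 109_248
n_rest, n_val = split_sizes(n_total, 0.2)    # first split: 20% held out for validation
n_train, n_test = split_sizes(n_rest, 0.25)  # second split: 25% of the rest for test

print(n_train, n_val, n_test)  # → 65548 21850 21850
```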


print("X_TRAIN-------------------------")
x_train_y_value_counts = x_train['project_is_approved'].value_counts()
print("Number of projects that are approved for funding    ", x_train_y_value_counts[1]," -> ",round(x_train_y_value_counts[1]/(x_train_y_value_counts[1]+x_train_y_value_counts[0])*100,2),"%")
print("Number of projects that are not approved for funding ",x_train_y_value_counts[0]," -> ",round(x_train_y_value_counts[0]/(x_train_y_value_counts[1]+x_train_y_value_counts[0])*100,2),"%")
print("\n")
print("X_TEST--------------------------")
x_test_y_value_counts = x_test['project_is_approved'].value_counts()
print("Number of projects that are approved for funding    ", x_test_y_value_counts[1]," -> ",round(x_test_y_value_counts[1]/(x_test_y_value_counts[1]+x_test_y_value_counts[0])*100,2),"%")
print("Number of projects that are not approved for funding ",x_test_y_value_counts[0]," -> ",round(x_test_y_value_counts[0]/(x_test_y_value_counts[1]+x_test_y_value_counts[0])*100,2),"%")
print("\n")

print("X_Val--------------------------")
x_val_y_value_counts = x_val['project_is_approved'].value_counts()
print("Number of projects that are approved for funding    ", x_val_y_value_counts[1]," -> ",round(x_val_y_value_counts[1]/(x_val_y_value_counts[1]+x_val_y_value_counts[0])*100,2),"%")
print("Number of projects that are not approved for funding ",x_val_y_value_counts[0]," -> ",round(x_val_y_value_counts[0]/(x_val_y_value_counts[1]+x_val_y_value_counts[0])*100,2),"%")
print("\n")

# Vectorizing project subject categories
# we use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_sub = CountVectorizer(lowercase=False, binary=True)

vectorizer_sub.fit(x_train['project_subject_categories'].values)

x_train_project_subject_categories_one_hot = vectorizer_sub.transform(x_train['project_subject_categories'].values)

x_test_project_subject_categories_one_hot  = vectorizer_sub.transform(x_test['project_subject_categories'].values)

x_val_project_subject_categories_one_hot  = vectorizer_sub.transform(x_val['project_subject_categories'].values)


print(vectorizer_sub.get_feature_names_out())

print("Shape of matrix after one hot encoding -> categories: x_train: ",x_train_project_subject_categories_one_hot.shape)
print("Shape of matrix after one hot encoding -> categories: x_val   : ",x_val_project_subject_categories_one_hot.shape)
print("Shape of matrix after one hot encoding -> categories: x_test : ",x_test_project_subject_categories_one_hot.shape)




# Vectorizing project_resource_summary
# we use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_summary = CountVectorizer(lowercase=False, binary=True)

vectorizer_summary.fit(x_train['project_resource_summary'].values)

x_train_project_resource_summary_one_hot = vectorizer_summary.transform(x_train['project_resource_summary'].values)

x_test_project_resource_summary_one_hot  = vectorizer_summary.transform(x_test['project_resource_summary'].values)

x_val_project_resource_summary_one_hot  = vectorizer_summary.transform(x_val['project_resource_summary'].values)



print("Shape of matrix after one hot encoding -> resource summary: x_train: ",x_train_project_resource_summary_one_hot.shape)
print("Shape of matrix after one hot encoding -> resource summary: x_val  : ",x_val_project_resource_summary_one_hot.shape)
print("Shape of matrix after one hot encoding -> resource summary: x_test : ",x_test_project_resource_summary_one_hot.shape)







# Vectorizing project subject sub-categories
# we use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_sub_sub_category = CountVectorizer(lowercase=False, binary=True)

vectorizer_sub_sub_category.fit(x_train['project_subject_subcategories'].values)

x_train_project_subject_subcategories_one_hot = vectorizer_sub_sub_category.transform(x_train['project_subject_subcategories'].values)

x_test_project_subject_subcategories_one_hot  = vectorizer_sub_sub_category.transform(x_test['project_subject_subcategories'].values)

x_val_project_subject_subcategories_one_hot  = vectorizer_sub_sub_category.transform(x_val['project_subject_subcategories'].values)


print(vectorizer_sub_sub_category.get_feature_names_out())

print("Shape of matrix after one hot encoding -> sub-categories: x_train: ",x_train_project_subject_subcategories_one_hot.shape)
print("Shape of matrix after one hot encoding -> sub-categories: x_val  : ",x_val_project_subject_subcategories_one_hot.shape)
print("Shape of matrix after one hot encoding -> sub-categories: x_test : ",x_test_project_subject_subcategories_one_hot.shape)



# Vectorizing teacher_prefix
# we use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_teacher_prefix = CountVectorizer(lowercase=False, binary=True)

vectorizer_teacher_prefix.fit(x_train['teacher_prefix'].values)

x_train_teacher_prefix_one_hot = vectorizer_teacher_prefix.transform(x_train['teacher_prefix'].values)

x_test_teacher_prefix_one_hot  = vectorizer_teacher_prefix.transform(x_test['teacher_prefix'].values)

x_val_teacher_prefix_one_hot  = vectorizer_teacher_prefix.transform(x_val['teacher_prefix'].values)


print(vectorizer_teacher_prefix.get_feature_names_out())

print("Shape of matrix after one hot encoding -> teacher_prefix: x_train: ",x_train_teacher_prefix_one_hot.shape)
print("Shape of matrix after one hot encoding -> teacher_prefix: x_val  : ",x_val_teacher_prefix_one_hot.shape)
print("Shape of matrix after one hot encoding -> teacher_prefix: x_test : ",x_test_teacher_prefix_one_hot.shape)

# Vectorizing school_state
# we use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_school_state = CountVectorizer(lowercase=False, binary=True)

vectorizer_school_state.fit(x_train['school_state'].values)

x_train_school_state_one_hot = vectorizer_school_state.transform(x_train['school_state'].values)

x_test_school_state_one_hot  = vectorizer_school_state.transform(x_test['school_state'].values)

x_val_school_state_one_hot  = vectorizer_school_state.transform(x_val['school_state'].values)


print(vectorizer_school_state.get_feature_names_out())

print("Shape of matrix after one hot encoding -> school_state: x_train: ",x_train_school_state_one_hot.shape)
print("Shape of matrix after one hot encoding -> school_state: x_val  : ",x_val_school_state_one_hot.shape)
print("Shape of matrix after one hot encoding -> school_state: x_test : ",x_test_school_state_one_hot.shape)


# Vectorizing project_grade_category
# we use CountVectorizer to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_project_grade_category = CountVectorizer(lowercase=False, binary=True)

vectorizer_project_grade_category.fit(x_train['project_grade_category'].values)

x_train_project_grade_category_one_hot = vectorizer_project_grade_category.transform(x_train['project_grade_category'].values)

x_test_project_grade_category_one_hot  = vectorizer_project_grade_category.transform(x_test['project_grade_category'].values)

x_val_project_grade_category_one_hot  = vectorizer_project_grade_category.transform(x_val['project_grade_category'].values)


print(vectorizer_project_grade_category.get_feature_names_out())

print("Shape of matrix after one hot encoding -> project_grade_category: x_train: ",x_train_project_grade_category_one_hot.shape)
print("Shape of matrix after one hot encoding -> project_grade_category: x_val  : ",x_val_project_grade_category_one_hot.shape)
print("Shape of matrix after one hot encoding -> project_grade_category: x_test : ",x_test_project_grade_category_one_hot.shape)


# Vectorizing project_title
# a TF-IDF variant was also tried (left commented below); CountVectorizer is used here

# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer_project_title_tfidf = TfidfVectorizer(min_df=10)

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_project_title_category = CountVectorizer(min_df=10, lowercase=False, binary=True)

vectorizer_project_title_category.fit(x_train['project_title'])
x_train_project_titles_tfidf = vectorizer_project_title_category.transform(x_train['project_title'])
x_val_project_titles_tfidf    = vectorizer_project_title_category.transform(x_val['project_title'])
x_test_project_titles_tfidf  = vectorizer_project_title_category.transform(x_test['project_title'])

print("Shape of matrix after vectorization -> Title: x_train: ",x_train_project_titles_tfidf.shape)
print("Shape of matrix after vectorization -> Title: x_val  : ",x_val_project_titles_tfidf.shape)
print("Shape of matrix after vectorization -> Title: x_test : ",x_test_project_titles_tfidf.shape)


# Vectorizing essay
# a TF-IDF variant was also tried (left commented below); CountVectorizer is used here

# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer_essay_tfidf = TfidfVectorizer(min_df=10)
# vectorizer_essay_tfidf.fit(x_train['essay'])

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_project_essay_category = CountVectorizer(min_df=10, lowercase=False, binary=True)
vectorizer_project_essay_category.fit(x_train['essay'])

x_train_essay_tfidf = vectorizer_project_essay_category.transform(x_train['essay'])
x_val_essay_tfidf = vectorizer_project_essay_category.transform(x_val['essay'])

x_test_essay_tfidf  = vectorizer_project_essay_category.transform(x_test['essay'])

print("Shape of matrix after vectorization -> Essay: x_train: ",x_train_essay_tfidf.shape)
print("Shape of matrix after vectorization -> Essay: x_val  : ",x_val_essay_tfidf.shape)
print("Shape of matrix after vectorization -> Essay: x_test : ",x_test_essay_tfidf.shape)


# applying a Normalizer to the numerical columns
# (a StandardScaler variant was also tried; the Normalizer version is kept)
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()

normalizer.fit(x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])

x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']]=normalizer.transform(x_train[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
x_test[['teacher_number_of_previously_posted_projects', 'price', 'quantity']]=normalizer.transform(x_test[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])
x_val[['teacher_number_of_previously_posted_projects', 'price', 'quantity']]=normalizer.transform(x_val[['teacher_number_of_previously_posted_projects', 'price', 'quantity']])



from scipy.sparse import hstack
# merging train values

x_train_onehot = hstack((x_train_project_subject_categories_one_hot,
                         x_train_project_resource_summary_one_hot,
                         x_train_project_subject_subcategories_one_hot   ,
                         x_train_teacher_prefix_one_hot    ,
                         x_train_school_state_one_hot  ,
                         x_train_project_grade_category_one_hot  ,
                         x_train_project_titles_tfidf  ,
                         x_train_essay_tfidf,
                         x_train[['teacher_number_of_previously_posted_projects', 'price',  'quantity']]))


print(x_train_onehot.shape)

# merging test value
x_test_onehot = hstack((x_test_project_subject_categories_one_hot,
                         x_test_project_resource_summary_one_hot,
                         x_test_project_subject_subcategories_one_hot   ,
                         x_test_teacher_prefix_one_hot    ,
                         x_test_school_state_one_hot  ,
                         x_test_project_grade_category_one_hot  ,
                         x_test_project_titles_tfidf  ,
                         x_test_essay_tfidf,
                         x_test[['teacher_number_of_previously_posted_projects', 'price',  'quantity']]))

# merging val value
x_val_onehot = hstack((x_val_project_subject_categories_one_hot,
                         x_val_project_resource_summary_one_hot,
                         x_val_project_subject_subcategories_one_hot   ,
                         x_val_teacher_prefix_one_hot    ,
                         x_val_school_state_one_hot  ,
                         x_val_project_grade_category_one_hot  ,
                         x_val_project_titles_tfidf  ,
                         x_val_essay_tfidf,
                         x_val[['teacher_number_of_previously_posted_projects', 'price',  'quantity']]))


print(x_val_onehot.shape)
Note: both the TF-IDF and CountVectorizer techniques are provided in the code above. Per the training analysis, CountVectorizer gave better results than TF-IDF, so it is the one used for the final feature matrices.
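To illustrate what `CountVectorizer(binary=True)` produces in the cells above: each document becomes a 0/1 vector over the vocabulary learned from the training split, and words unseen at fit time are dropped. A stdlib-only stand-in (the real pipeline uses sklearn):

```python
def fit_vocab(docs):
    """Learn a sorted word -> column-index vocabulary from training docs."""
    vocab = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(vocab)}

def transform_binary(docs, vocab):
    """One 0/1 row per document; words outside the vocabulary are ignored."""
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.split():
            if w in vocab:
                row[vocab[w]] = 1
        rows.append(row)
    return rows

vocab = fit_vocab(["math books", "science books books"])
print(vocab)                                      # → {'books': 0, 'math': 1, 'science': 2}
print(transform_binary(["books robots"], vocab))  # → [[1, 0, 0]]
```

Note that with `binary=True` the repeated "books" still maps to 1, which is why this behaves like one-hot encoding rather than counting.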

Machine learning model training

Note: 1) The following models were trained on a personal computer.
      2) The complete code is provided in the GitHub link.

1) Random Forest Classifier

# the following parameters were chosen after hyperparameter tuning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report

classifier = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced_subsample')
classifier.fit(x_train_onehot, y_train)

# Prediction on test data
pred_rf = classifier.predict(x_test_onehot)
pred_rf_train = classifier.predict(x_train_onehot)

# Checking different metrics for the random forest model
print('Checking different metrics for the random forest model:\n')
print("Training accuracy: ",classifier.score(x_train_onehot,y_train))
acc_score = accuracy_score(y_test, pred_rf)
print('Testing accuracy: ',acc_score)
conf_mat = confusion_matrix(y_test, pred_rf)
print('Confusion Matrix: \n',conf_mat)
roc_auc = roc_auc_score(y_test,pred_rf)
print('ROC AUC score: ',roc_auc)
class_rep_rf = classification_report(y_test,pred_rf)
print('Classification Report: \n',class_rep_rf)

As per the analysis, the ROC AUC score is 0.5, which means the model
is not able to distinguish between classes 1 and 0. Also, since the training
accuracy is greater than the testing accuracy, the model appears to overfit.
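The chance-level AUC noted above is exactly what a classifier that collapses to a single class produces; a quick self-contained check with toy labels:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 0, 1, 1, 0]
# Constant predictions carry no ranking information,
# so ROC AUC falls to the chance level of 0.5
constant_preds = [1, 1, 1, 1, 1, 1]
print(roc_auc_score(y_true, constant_preds))  # 0.5
```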

2) XGBoost Classifier

from xgboost import XGBClassifier

# following parameters were chosen after performing hyperparameter tuning
classifier = XGBClassifier(scale_pos_weight=0.17999, n_estimators=1000, learning_rate=0.01)
# Fit the classifier to the preprocessed training data
classifier.fit(x_train_onehot, y_train)

# Prediction on test and train data
pred_xgboost = classifier.predict(x_test_onehot)
pred_xgboost_train = classifier.predict(x_train_onehot)
print()
print("Testing Score")

# Checking different metrics for the XGBoost model on the test set
print('Checking different metrics for the XGBoost model on the test set:\n')
print("Training accuracy: ",classifier.score(x_train_onehot,y_train))
acc_score = accuracy_score(y_test, pred_xgboost)
print('Testing accuracy: ',acc_score)
conf_mat = confusion_matrix(y_test, pred_xgboost)
print('Confusion Matrix: \n',conf_mat)
roc_auc = roc_auc_score(y_test,pred_xgboost)
print('ROC AUC score: ',roc_auc)
class_rep_xgboost= classification_report(y_test,pred_xgboost)
print('Classification Report: \n',class_rep_xgboost)

print()
print("Validation Score")
pred_xgboost_val = classifier.predict(x_val_onehot)

# Checking different metrics for the XGBoost model on the validation set
print('Checking different metrics for the XGBoost model on the validation set:\n')
print("Training accuracy: ", classifier.score(x_train_onehot, y_train))
acc_score = accuracy_score(y_val, pred_xgboost_val)
print('Validation accuracy: ', acc_score)
conf_mat = confusion_matrix(y_val, pred_xgboost_val)
print('Confusion Matrix: \n', conf_mat)
roc_auc = roc_auc_score(y_val, pred_xgboost_val)
print('ROC AUC score: ', roc_auc)
class_rep_xgboost_val = classification_report(y_val, pred_xgboost_val)
print('Classification Report: \n', class_rep_xgboost_val)

As the training accuracy is greater than the testing accuracy, the model
appears to overfit.
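As a side note, the scale_pos_weight=0.17999 used for XGBoost above most likely comes from the negative-to-positive label ratio (roughly 15% rejected vs 85% approved projects); the labels below are toy stand-ins, and this convention is an assumption, not stated in the original:

```python
import numpy as np

# Toy labels mimicking the dataset's imbalance (~85% approved)
y_toy = np.array([1] * 85 + [0] * 15)

# xgboost's scale_pos_weight is conventionally n_negative / n_positive
scale_pos_weight = (y_toy == 0).sum() / (y_toy == 1).sum()
print(round(scale_pos_weight, 3))  # 0.176
```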

3) Naive Bayes

The Naive Bayes model was trained before hyperparameter tuning
and achieved a ROC AUC score of 0.70. Also, the training accuracy
is approximately equal to the testing accuracy, so the model seems
decent, but can be improved.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Training Naive Bayes with hyperparameter tuning
# mnb_bow = MultinomialNB(class_prior=[0.5, 0.5])
mnb_bow = MultinomialNB(class_prior=[0.8, 0.2])

parameters = {'alpha':[0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.5,0.6,0.7,0.8,0.9, 1, 5, 10, 50, 100, 500, 1000, 2500, 5000, 10000]}
clf = GridSearchCV(mnb_bow, parameters, cv= 10, scoring='roc_auc',verbose=1,return_train_score=True)
# clf.fit(x_cv_onehot_bow, y_cv)
clf.fit(x_train_onehot,y_train)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std= clf.cv_results_['std_test_score']
bestAlpha_1=clf.best_params_['alpha']
bestScore_1=clf.best_score_
print("BEST ALPHA: ",clf.best_params_['alpha']," BEST SCORE: ",clf.best_score_)
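The per-alpha arrays collected above (train_auc, cv_auc) also make it easy to pick the best alpha by hand; a small sketch with synthetic score arrays standing in for the GridSearchCV results:

```python
import numpy as np

# Synthetic stand-ins for the mean_train_score / mean_test_score arrays
alphas = np.array([0.001, 0.01, 0.1, 1.0, 10.0, 100.0])
train_auc = np.array([0.95, 0.93, 0.90, 0.85, 0.75, 0.60])
cv_auc = np.array([0.70, 0.74, 0.78, 0.76, 0.70, 0.58])

# The best alpha maximizes cross-validated AUC,
# mirroring clf.best_params_['alpha'] from GridSearchCV
best_alpha = alphas[np.argmax(cv_auc)]
print(best_alpha)  # 0.1
```

Plotting train_auc against cv_auc over log(alpha) is the usual way to spot where regularization starts to hurt.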

# class_prior is used to balance the classes
mnb_bow_testModel = MultinomialNB(alpha=bestAlpha_1, class_prior=[0.5, 0.5])
mnb_bow_testModel.fit(x_train_onehot, y_train)

Based on the analysis, the Naive Bayes model demonstrates promising performance
in both test and validation scores. With a ROC AUC score of 0.80, the model
discriminates effectively between classes 1 and 0. Furthermore, the model is
well fitted, as indicated by the nearly equal training and testing accuracies.
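One detail worth noting: ROC AUC is a ranking metric, so it is best computed from predict_proba scores rather than hard predict labels. A minimal, self-contained sketch (synthetic count data, not the project's real features):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
# Synthetic non-negative "count" features standing in for the real BOW matrix
X = rng.integers(0, 5, size=(200, 10))
y = (X[:, 0] + X[:, 1] > 4).astype(int)  # toy target, not the real labels

model = MultinomialNB(alpha=0.5).fit(X, y)
# ROC AUC measures ranking quality, so it is computed from
# class-1 probabilities, not from hard 0/1 predictions
scores = model.predict_proba(X)[:, 1]
auc = roc_auc_score(y, scores)
print(round(auc, 2))
```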

GUI Development

Please check the GitHub link.

Conclusion. In conclusion, the provided dataset exhibits a high degree of class imbalance, along with several common factors identified between the approved and non-approved groups in the data, as explained in the case study. Several machine learning models were introduced to achieve the desired prediction. Among these models, the Naive Bayes algorithm demonstrated decent scores, and the model proved to be well trained, yielding satisfactory training and testing accuracies. In addition, a GUI built with Gradio was included in the case study to improve user interaction and accessibility.

GitHub link: https://github.com/atharvnagrikar/DonorChooseFinalScaler

Jupyter Notebook: https://github.com/atharvnagrikar/DonorChooseFinalScaler/tree/main/JupyterNotebook

Further work. The case study could employ deep learning techniques to address the problem. In addition, using the BERT tokenizer could help convert the textual data into effective vector representations.

References:

  1. https://towardsdatascience.com/donorschoose-extensive-exploratory-data-analysis-eda-9e7879464d0d
  2. https://medium.com/@manturdipa/application-screening-donorschoose-dataset-d6e5f1827327
  3. https://www.kaggle.com/code/nikhilparmar9/naive-bayes-donorschoose-dataset#-%3E-8.1:-SET-1--Applying-Naive-Bayes-on-BOW
  4. https://www.scaler.com/
  5. https://gradio.app/guides/quickstart