391 compare student speech to slides text #400
Closed
19 commits
cd92984 add text normalization func (arhihihipov)
8ad9442 rename criterion (arhihihipov)
a7f96f9 fix imports (arhihihipov)
5b4dbaf add new test criteria pack (arhihihipov)
bd544d9 Merge branch 'master' into new_criteria_matching_slides_to_spoken_words (arhihihipov)
9abf00c add criteria evaluation (arhihihipov)
cf28b7a fix criteria (arhihihipov)
6e216c7 Merge branch 'master' into new_criteria_matching_slides_to_spoken_words (arhihihipov)
e600ad5 add logger, fixes (arhihihipov)
305a1fa remove debug logging (arhihihipov)
803f59c russian stopWords -> utils (arhihihipov)
4e519ab add ignore slides (arhihihipov)
78d0a9e improve nltk download (HadronCollider)
f927741 fixes (arhihihipov)
a2d538d add tf-idf vectorizer and verdict (arhihihipov)
88e4e83 Merge branch 'master' into new_criteria_matching_slides_to_spoken_words (arhihihipov)
61781c1 delete unusable params (arhihihipov)
ce30af1 remove comment (arhihihipov)
4aaea8b remove unused func (arhihihipov)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
from bson import ObjectId | ||
|
||
from app.root_logger import get_root_logger | ||
from app.localisation import * | ||
from ..criterion_base import BaseCriterion | ||
from ..criterion_result import CriterionResult | ||
from app.audio import Audio | ||
from app.presentation import Presentation | ||
from app.utils import normalize_text, delete_punctuation | ||
from ..text_comparison import SlidesSimilarityEvaluator | ||
|
||
logger = get_root_logger('web') | ||
|
||
|
||
# Criterion that evaluates how closely the slide text matches what the student says while that slide is shown | ||
class ComparisonSpeechSlidesCriterion(BaseCriterion): | ||
PARAMETERS = dict( | ||
skip_slides=list.__name__, | ||
) | ||
|
||
def __init__(self, parameters, dependent_criteria, name=''): | ||
super().__init__( | ||
name=name, | ||
parameters=parameters, | ||
dependent_criteria=dependent_criteria, | ||
) | ||
self.evaluator = SlidesSimilarityEvaluator() | ||
|
||
@property | ||
def description(self): | ||
return { | ||
"Критерий": t(self.name), | ||
"Описание": t( | ||
"Проверяет, что текст слайда соответствует словам, которые произносит студент во время демонстрации " | ||
"этого слайда"), | ||
"Оценка": t("1, если среднее значение соответствия речи содержимому слайдов равно или превосходит 0.125, " | ||
"иначе 8 * r, где r - среднее значение соответствия речи демонстрируемым слайдам") | ||
} | ||
|
||
def skip_slide(self, current_slide_text: str) -> bool: | ||
for skip_slide in self.parameters['skip_slides']: | ||
if skip_slide.lower() in delete_punctuation(current_slide_text).lower(): | ||
return True | ||
return False | ||
|
||
def apply(self, audio: Audio, presentation: Presentation, training_id: ObjectId, | ||
criteria_results: dict) -> CriterionResult: | ||
# Text comparison results, keyed by slide number | ||
results = {} | ||
|
||
slides_to_process = [] | ||
|
||
for current_slide_index in range(len(audio.audio_slides)): | ||
# Words spoken by the student on this slide: a list of RecognizedWord | ||
current_slide_speech = audio.audio_slides[current_slide_index].recognized_words | ||
# Drop the timestamps and probabilities, since only the words themselves are used | ||
current_slide_speech = list(map(lambda x: x.word.value, current_slide_speech)) | ||
# Normalize the speech text | ||
current_slide_speech = " ".join(normalize_text(current_slide_speech)) | ||
|
||
# If nothing was said on this slide, record a zero similarity and skip further processing | ||
if len(current_slide_speech.split()) == 0: | ||
results[current_slide_index + 1] = 0.000 | ||
continue | ||
|
||
# Words from the presentation slide | ||
current_slide_text = presentation.slides[current_slide_index].words | ||
# Check whether this slide is on the list of slides to skip | ||
if self.skip_slide(current_slide_text): | ||
logger.info(f"Слайд №{current_slide_index + 1} пропущен") | ||
continue | ||
|
||
# Normalize the slide text | ||
current_slide_text = " ".join(normalize_text(current_slide_text.split())) | ||
slides_to_process.append((current_slide_speech, current_slide_text, current_slide_index + 1)) | ||
|
||
self.evaluator.train_model([" ".join(speech for speech, _, _ in slides_to_process), " ".join(text for _, text, _ in slides_to_process)]) | ||
|
||
for speech, slide_text, slide_number in slides_to_process: | ||
results[slide_number] = self.evaluator.evaluate_semantic_similarity(speech, slide_text) | ||
|
||
results = dict(sorted(results.items())) | ||
|
||
score = 8 * sum(results.values()) / len(results) if results else 0  # avoid ZeroDivisionError when every slide was skipped | ||
|
||
return CriterionResult(1 if score >= 1 else score, "Отлично" if score >= 1 else "Следует уделить внимание " | ||
"соответствию речи на слайдах " | ||
"{}".format(",\n".join([f"№{n} - {results[n]}" for n in dict(filter(lambda item: item[1] < 0.125, results.items()))]))) |
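The scoring rule stated in the criterion description can be checked in isolation. The sketch below is a standalone restatement of that rule, not code from the pull request: the names `score_from_similarities` and `slides_below_threshold` are hypothetical, and returning 0.0 when no slide was evaluated is an assumption rather than behaviour confirmed by the source.

```python
THRESHOLD = 0.125  # mean similarity at which the criterion awards the full score


def score_from_similarities(results: dict) -> float:
    """Map per-slide similarities to a score in [0, 1]: min(1, mean / THRESHOLD)."""
    if not results:
        # Assumption: with no evaluated slides there is nothing to reward.
        return 0.0
    mean = sum(results.values()) / len(results)
    return min(1.0, mean / THRESHOLD)


def slides_below_threshold(results: dict) -> list:
    """Slide numbers whose similarity falls below the threshold, in ascending order."""
    return sorted(number for number, value in results.items() if value < THRESHOLD)


results = {1: 0.2, 2: 0.05, 3: 0.0}
print(score_from_similarities(results))
print(slides_below_threshold(results))
```

Dividing the mean by 0.125 is the same as multiplying it by 8, which is how the diff expresses it; the cap at 1 corresponds to the `1 if score >= 1 else score` branch.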
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
from sklearn.feature_extraction.text import TfidfVectorizer | ||
from sklearn.metrics.pairwise import cosine_similarity | ||
|
||
|
||
class SlidesSimilarityEvaluator: | ||
def __init__(self): | ||
self.vectorizer = TfidfVectorizer(ngram_range=(1, 1)) | ||
|
||
def train_model(self, corpus: list): | ||
self.vectorizer.fit(corpus) | ||
|
||
def evaluate_semantic_similarity(self, text1: str, text2: str) -> float: | ||
vector1 = self.vectorizer.transform([text1]) | ||
vector2 = self.vectorizer.transform([text2]) | ||
similarity = cosine_similarity(vector1, vector2)[0][0] | ||
|
||
return round(similarity, 3) |
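To make clearer what the evaluator computes, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is an illustration, not the scikit-learn implementation: the helper names `fit_idf`, `tfidf_vector` and `cosine` are hypothetical, tokenization is a plain whitespace split (a simplification of the vectorizer's default tokenizer), and the IDF formula `ln((1 + n) / (1 + df)) + 1` follows scikit-learn's documented default smoothing. The sample texts are invented.

```python
import math
from collections import Counter


def fit_idf(corpus: list) -> dict:
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for document in corpus:
        doc_freq.update(set(document.split()))
    return {token: math.log((1 + n_docs) / (1 + df)) + 1 for token, df in doc_freq.items()}


def tfidf_vector(text: str, idf: dict) -> dict:
    """Term counts weighted by IDF; tokens unseen during fitting are ignored."""
    counts = Counter(token for token in text.split() if token in idf)
    return {token: count * idf[token] for token, count in counts.items()}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


speech = "neural network training data"
slide = "neural network architecture"
idf = fit_idf([speech, slide])
similarity = cosine(tfidf_vector(speech, idf), tfidf_vector(slide, idf))
print(round(similarity, 3))
```

Because the measure depends only on shared tokens and their weights, it captures lexical overlap rather than meaning, which is why the criterion lemmatizes both texts before comparing them.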
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,5 @@ | ||
import fitz | ||
import pymorphy2 | ||
import nltk | ||
nltk.download('stopwords') | ||
from nltk.corpus import stopwords | ||
|
||
import os | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,6 @@ | ||
import os | ||
import string | ||
import re | ||
import tempfile | ||
from distutils.util import strtobool | ||
from threading import Timer | ||
|
@@ -7,6 +9,8 @@ | |
from bson import ObjectId | ||
from flask import json | ||
import magic | ||
import pymorphy2 | ||
from nltk.corpus import stopwords | ||
from pydub import AudioSegment | ||
import subprocess | ||
|
||
|
@@ -16,11 +20,11 @@ | |
SECONDS_PER_MINUTE = 60 | ||
BYTES_PER_MEGABYTE = 1024 * 1024 | ||
ALLOWED_MIMETYPES = { | ||
'pdf': ['application/pdf'], | ||
'ppt': ['application/vnd.ms-powerpoint'], | ||
'odp': ['application/vnd.oasis.opendocument.presentation'], | ||
'pptx': ['application/vnd.openxmlformats-officedocument.presentationml.presentation', 'application/zip'] | ||
} | ||
'pdf': ['application/pdf'], | ||
'ppt': ['application/vnd.ms-powerpoint'], | ||
'odp': ['application/vnd.oasis.opendocument.presentation'], | ||
'pptx': ['application/vnd.openxmlformats-officedocument.presentationml.presentation', 'application/zip'] | ||
} | ||
CONVERTIBLE_EXTENSIONS = ('ppt', 'pptx', 'odp') | ||
ALLOWED_EXTENSIONS = set(ALLOWED_MIMETYPES.keys()) | ||
DEFAULT_EXTENSION = 'pdf' | ||
|
@@ -74,7 +78,7 @@ def convert_to_pdf(presentation_file): | |
temp_file.write(presentation_file.read()) | ||
temp_file.close() | ||
presentation_file.seek(0) | ||
|
||
converted_file = None | ||
convert_cmd = f"soffice --headless --convert-to pdf --outdir {os.path.dirname(temp_file.name)} {temp_file.name}" | ||
if run_process(convert_cmd).returncode == 0: | ||
|
@@ -136,9 +140,9 @@ def check_argument_is_convertible_to_object_id(arg): | |
return {'message': '{} cannot be converted to ObjectId. {}: {}'.format(arg, e1.__class__, e1)}, 404 | ||
except Exception as e2: | ||
return { | ||
'message': 'Some arguments cannot be converted to ObjectId or to str. {}: {}.' | ||
.format(e2.__class__, e2) | ||
}, 404 | ||
'message': 'Some arguments cannot be converted to ObjectId or to str. {}: {}.' | ||
.format(e2.__class__, e2) | ||
}, 404 | ||
|
||
|
||
def check_arguments_are_convertible_to_object_id(f): | ||
|
@@ -182,6 +186,29 @@ def check_dict_keys(dictionary, keys): | |
return f"{msg}\n{dictionary}" if msg else '' | ||
|
||
|
||
# Text normalization function | ||
def normalize_text(text: list) -> list: | ||
table = str.maketrans("", "", string.punctuation) | ||
morph = pymorphy2.MorphAnalyzer() | ||
|
||
# Strip punctuation, convert to lower case, and trim surrounding whitespace | ||
text = list(map(lambda x: x.translate(table).lower().strip(), text)) | ||
# Remove digits and non-Cyrillic characters; purely non-Russian words become empty strings | ||
text = list(map(lambda x: re.sub(r'[^А-яёЁ\s]', '', x), text)) | ||
# Drop empty strings | ||
text = list(filter(lambda x: x.isalpha(), text)) | ||
# Reduce each word to its normal (dictionary) form | ||
text = list(map(lambda x: morph.normal_forms(x)[0], text)) | ||
# Remove stop words | ||
text = list(filter(lambda x: x not in RussianStopwords().words, text)) | ||
return text | ||
|
||
|
||
# Remove punctuation and whitespace control characters (tabs, newlines) from the text | ||
def delete_punctuation(text: str) -> str: | ||
return text.translate(str.maketrans('', '', string.punctuation + "\t\n\r\v\f")) | ||
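One consequence of `delete_punctuation` is worth illustrating. The third argument of `str.maketrans` lists characters to delete, so newlines and tabs are removed rather than replaced with spaces, and words on adjacent lines are joined together. The sketch below reproduces the function from this diff and contrasts it with a hypothetical variant, `replace_punctuation_with_spaces`, which is not part of the pull request; the sample slide text and skip phrase are invented for illustration.

```python
import string

WHITESPACE_CONTROLS = "\t\n\r\v\f"


def delete_punctuation(text: str) -> str:
    # Same logic as in the diff: every listed character is deleted outright.
    return text.translate(str.maketrans("", "", string.punctuation + WHITESPACE_CONTROLS))


def replace_punctuation_with_spaces(text: str) -> str:
    # Hypothetical alternative: map each listed character to a space,
    # then collapse runs of whitespace into single spaces.
    chars = string.punctuation + WHITESPACE_CONTROLS
    table = str.maketrans(chars, " " * len(chars))
    return " ".join(text.translate(table).split())


slide_text = "Thank you\nfor your attention!"
skip_phrase = "thank you for your attention"

print(skip_phrase in delete_punctuation(slide_text).lower())
print(skip_phrase in replace_punctuation_with_spaces(slide_text).lower())
```

With this sample the first check prints `False`, because the line break is deleted and "you" and "for" are fused, while the second prints `True`. Whether this matters for `skip_slides` depends on how the slide text is extracted, which the diff does not show.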
|
||
|
||
class RepeatedTimer: | ||
""" | ||
Utility class to call a function with a given interval between the end and the beginning of consecutive calls | ||
|
@@ -210,3 +237,18 @@ def start(self): | |
def stop(self): | ||
self._timer.cancel() | ||
self.is_running = False | ||
|
||
|
||
class Singleton(type): | ||
_instances = {} | ||
|
||
def __call__(cls, *args, **kwargs): | ||
if cls not in cls._instances: | ||
cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) | ||
return cls._instances[cls] | ||
|
||
|
||
class RussianStopwords(metaclass=Singleton): | ||
Review comment: Fixed the handling of nltk.download: moved it into the startup module and set up a shared volume between the containers that use nltk, so that each container does not download the required dictionaries separately. |
||
|
||
def __init__(self): | ||
self.words = stopwords.words('russian') |
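The `Singleton` metaclass means the stop-word list is loaded once per process, however many times `RussianStopwords()` is called. A minimal sketch of that behaviour follows. The `WordList` class and its hard-coded words are hypothetical stand-ins for `RussianStopwords` and `stopwords.words('russian')`, chosen so the example runs without NLTK data; storing the words in a `frozenset` is likewise a suggestion, not something the pull request does.

```python
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance on first use and return the cached one afterwards.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class WordList(metaclass=Singleton):
    load_count = 0  # how many times __init__ actually ran

    def __init__(self):
        WordList.load_count += 1
        # Stand-in for an expensive load; a frozenset gives O(1) membership tests.
        self.words = frozenset({"and", "the", "of"})


first = WordList()
second = WordList()
print(first is second, WordList.load_count)
```

Both names refer to the same object and `__init__` runs only once. This simple form is not guarded by a lock, so concurrent first calls from several threads could in principle construct more than one instance.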
Further down in the method, the operation
" ".join(x)
is repeated very often for current_slide_speech/current_slide_text; perhaps it is worth doing this once at the start?