Deep Learning Structure for Thai Vowel Pronunciation Recognition

Please use this identifier to cite or link to this item: http://ithesis-ir.su.ac.th/dspace/handle/123456789/4412

Title:	Deep Learning Structure for Thai Vowel Pronunciation Recognition
Authors:	Niyada RUKWONG; Sunee PONGPINIGPINYO (pongpinigpinyo_s@silpakorn.edu); Silpakorn University
Keywords:	Thai vowels; speech recognition; deep learning; Convolutional Neural Networks
Issue Date:	4
Publisher:	Silpakorn University
Abstract:	Effective, standards-based pronunciation practice is essential for pronouncing words correctly, because mispronouncing a vowel changes the meaning of a word. Non-native speakers find it difficult to practice Thai vowel pronunciation on their own and need guidance from experts. With online learning now widespread, pronunciation training technology can improve language teaching and learning. It can support Thai vowel practice for non-native learners, non-standard Thai speakers, and persons with speech disabilities, and it addresses both the shortage of teaching specialists and the complexity of teaching vowel pronunciation.
This research studies deep learning structures for recognizing the 18 standard Thai monophthong vowel sounds and presents a deep learning model that serves as a core component of Computer-Assisted Pronunciation Training (CAPT). Identifying vowels correctly when they are spoken in real situations is a major challenge in Thai vowel recognition. The study compares Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and combined CNN and LSTM (CNN_LSTM) models, each with Mel spectrogram (MS) and Mel-Frequency Cepstral Coefficient (MFCC) acoustic features. A new Thai vowel dataset was designed, collected, and verified by linguists; it varies along several dimensions, such as speaker gender, age, and recording environment. The ambient noise level in the recordings is approximately 30 to 50 dB. The CNN model combined with the MS acoustic feature was the most suitable model for Thai vowel recognition in this research, with an accuracy of 98.61%.
The work also applies Gradient-weighted Class Activation Mapping (Grad-CAM) to the CNN model to explain which regions were important when the model predicted each of the 18 Thai vowel sounds. The results show that Grad-CAM highlights both high and low frequencies for every vowel. This work confirms that the clarity and transparency of the CNN model's predictions can help CAPT systems for Thai vowel recognition become more accurate and efficient. The system combines computing techniques with linguistics and lets learners practice vowel pronunciation in real time, as if experts, Thai language teachers, and linguists were continuously advising them on correct vowel pronunciation, which suits today's need for online learning.
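
The abstract names the Mel spectrogram (MS) as the acoustic feature that worked best with the CNN. As an illustration only, the following NumPy sketch shows how a log-Mel spectrogram can be computed from a waveform. It is not the thesis's implementation: the frame length, hop size, number of Mel bands, the HTK-style Mel formula, and the synthetic two-tone signal standing in for a recorded vowel are all assumptions chosen for the example.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert frequency in Hz to the Mel scale (HTK-style formula, an assumption)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Build triangular filters whose centers are evenly spaced on the Mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge points: filter m spans edges[m] .. edges[m + 2].
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(signal, sr, n_fft=512, hop=128, n_mels=40):
    """Return a (n_mels, n_frames) log-Mel spectrogram in dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # (n_frames, n_fft // 2 + 1)
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
    return 10.0 * np.log10(mel_power + 1e-10)                # small offset avoids log(0)


# Synthetic stand-in for a vowel recording: two sinusoids at hypothetical
# formant-like frequencies (300 Hz and 2300 Hz), 0.5 s at 16 kHz.
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)

spec = log_mel_spectrogram(signal, sr)
print(spec.shape)  # (n_mels, n_frames)
```

The resulting two-dimensional array (Mel bands by time frames) can be treated like a single-channel image, which is what makes it a natural input for a CNN.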
URI:	http://ithesis-ir.su.ac.th/dspace/handle/123456789/4412
Appears in Collections:	Science

Files in This Item:

File	Description	Size	Format
60309801.pdf		6.8 MB	Adobe PDF
