Using machine learning for Thai defamatory text classification on public Facebook

Please use this identifier to cite or link to this item: http://ithesis-ir.su.ac.th/dspace/handle/123456789/5258

Title:	Using machine learning for Thai defamatory text classification on public Facebook
Authors:	Patipan WATJANAPRON; Orawan Chaowalit (ochaowalit@hotmail.com); Silpakorn University
Keywords:	Defamatory; Deep learning; Text classification; Social media; Machine learning; Convolutional Neural Network; Judgement
Issue Date:	28
Publisher:	Silpakorn University
Abstract:	This research classifies Thai texts or sentences on Facebook that have defamatory characteristics, using the opinions of legal experts as the reference. The goal is a tool for screening messages when considering lawsuits or legal proceedings for defamation under Thai law; it can also help social media users screen their posts before publishing. The study applies deep learning to comments posted under photos or articles about individuals mentioned on Facebook, with input data consisting of the text together with features extracted from it. We developed five deep learning models to classify defamatory messages: 1) Long Short-Term Memory (LSTM), 2) Bidirectional Long Short-Term Memory (Bi-LSTM), 3) Convolutional Neural Networks (CNN), 4) WangchanBERTa, and 5) PhayaThaiBERT. The feature extraction methods were word embedding with thai2fit, term frequency of judges' vocabulary (vocabulary drawn from court judgments), part-of-speech (POS) tagging, and named entity tagging. In the experiments, PhayaThaiBERT performed best when word embedding with PhayaThaiBERT and term frequency of judges' vocabulary were used for feature extraction. The study used base model configurations, and the results suggest that tuning model parameters and tokenization methods could further improve performance.
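To illustrate the feature combination the abstract describes, the following is a minimal sketch of computing a term-frequency vector over a fixed vocabulary and appending it to a text embedding. It is not the thesis's implementation. The function names, the placeholder tokens and vocabulary, the use of relative (rather than raw) frequency, and the simple concatenation are all assumptions made for illustration; the input is assumed to be already tokenized, and the embedding is assumed to come from some upstream model.

```python
from collections import Counter
from typing import List, Sequence


def vocab_term_frequency(tokens: Sequence[str], vocab: Sequence[str]) -> List[float]:
    """Relative frequency of each vocabulary term within one tokenized text.

    Assumption: relative frequency (count / number of tokens). The thesis may
    have used raw counts or another weighting.
    """
    total = len(tokens)
    if total == 0:
        # An empty text has zero frequency for every vocabulary term.
        return [0.0] * len(vocab)
    counts = Counter(tokens)
    return [counts[term] / total for term in vocab]


def build_features(embedding: Sequence[float],
                   tokens: Sequence[str],
                   vocab: Sequence[str]) -> List[float]:
    """Append the term-frequency vector to an existing text embedding.

    Assumption: plain concatenation; how the thesis fuses the two feature
    types is not stated in the abstract.
    """
    return list(embedding) + vocab_term_frequency(tokens, vocab)


# Hypothetical placeholders, not data from the thesis.
example_vocab = ["a", "c", "z"]          # stands in for the judges' vocabulary
example_tokens = ["a", "b", "a", "c"]    # stands in for a tokenized comment
example_embedding = [0.1, 0.2]           # stands in for a model embedding

features = build_features(example_embedding, example_tokens, example_vocab)
print(features)
```

The resulting vector has one entry per embedding dimension followed by one entry per vocabulary term, so a downstream classifier sees both the learned representation and the explicit vocabulary signal.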
URI:	http://ithesis-ir.su.ac.th/dspace/handle/123456789/5258
Appears in Collections:	Science

Files in This Item:

File	Description	Size	Format
620720028.pdf		4.41 MB	Adobe PDF
