Parallel Processing of Linked Data on Many-Core Architecture

โชคสุชาติ, ชิดชนก; Choksuchat, Chidchanok

Please use this identifier to cite or link to this item: http://ithesis-ir.su.ac.th/dspace/handle/123456789/209

Title:	การประมวลผลแบบขนานของลิงค์เดต้าบนสถาปัตยกรรมแบบหลายแกน
Other Titles:	PARALLEL PROCESSING OF LINKED DATA ON MANY-CORE ARCHITECTURE
Authors:	โชคสุชาติ, ชิดชนก Choksuchat, Chidchanok
Keywords:	parallel processing; linked data; many-core architecture
Issue Date:	6-Aug-2016 (B.E. 2559)
Publisher:	Silpakorn University
Abstract:	The Semantic Web currently uses the Resource Description Framework (RDF) data format, defined by the W3C, as metadata for storing, publishing, and interlinking data with Linked Data technology. RDF comes in several forms depending on its use. This research focuses on semantic search over RDF in triple form, that is, sets of subject, predicate, and object. Millions of triples are stored as text files, which makes the files large. Querying such large files with a single-threaded program on a personal computer is not fast enough, so large multi-core servers are usually used for parallel processing, which is costly. This research therefore proposes RDF query processing on a many-core architecture using graphics processing units (GPUs) in a personal computer, which is inexpensive and returns answers quickly. Because GPUs have a small onboard memory, the RDF data must be transformed to reduce its size so that more of it fits in memory and each data load is as worthwhile as possible for searching. The researcher therefore proposes a transformation to the Combined BitMap (CBM) representation, which reduces 1.7 GB of triple-format data by up to 93%. When queried on a Tesla K40c GPU, search was 13,000-27,000 times faster than sequential processing. Because RDF data used for search changes rarely and is mostly read-only, the several hours needed for the size-reducing transformation can be handled by scheduling the transformation in advance to prepare the data for search. Experiments also cover direct search by brute-force string comparison over triple files, run sequentially and multithreaded on the CPU and on the GPU, using pinned (host) memory, global memory, shared memory, and stream operations on multiple GPUs. The largest file tested is 400 GB, or 2.86 × 10^9 triples. In addition, parallel processing is applied to a key-value database, to Java, and to web data extraction with MapReduce.
Resource Description Framework (RDF) is the commonly used format for Semantic Web data. Nowadays, huge amounts of RDF data on the Internet are used by search engines to answer user queries. Querying big data requires suitable search methods backed by very high processing power, because traditional sequential keyword matching on a semantic web server may take a prohibitively long time. In this research, we aim to accelerate search in big RDF data by exploiting modern many-core architectures based on Graphics Processing Units (GPUs). While GPUs have become inexpensive processors that can be used for general-purpose computing, the restrictions of the GPU memory hierarchy prevent a direct use of GPUs for processing big data: 1) data transfers between GPU (device) memory and CPU (host) memory may incur a high run-time overhead, and 2) GPU memory is not large enough to hold big amounts of data. Hence, we propose a software engineering solution for the search and for managing the related data transfers, taking into account the constrained memory of GPU architectures. Our method is general and can be applied to search any large text data, but we focus specifically on parallel search of RDF data sets, because RDF is currently the most common format for semantic web applications. We present two representation frameworks. First, a method with preprocessing transforms the RDF data set into the Combined BitMap (CBM) representation for compact data and convenient search. Since GPUs have limited memory, uncompacted RDF data may not fit entirely in GPU memory; the CBM structure lets us put more RDF data in GPU memory. Since GPUs contain many processing elements, using them concurrently speeds up RDF query processing. The experimental results show that the proposed representation reduces the size of the original 1.7 GB RDF data by 93 percent.
When querying on an NVIDIA Tesla K40c, search is accelerated by around 13,000-27,000 times compared with the serial search version. Because the preprocessing is time-consuming and the RDF data set is rarely updated, a scheduled batch job can be developed in the future to prepare the compact data before searching on GPUs. Second, we develop several implementations of RDF search for many-core architectures using two programming approaches: OpenMP for systems with CPUs, and CUDA for systems comprising CPUs and GPUs. We implement direct search on multiple GPUs with a brute-force string matching method, using global memory (device), shared memory (device), pinned memory (host), and stream operations (device). The largest RDF data set is 400 GB, or 2.68 × 10^9 triples. In addition, we experiment with searching key-value storage with a parallel engine, with a Java library for multithreaded parallelization (Parallel Java), and with extracting web sites in the Hua Hin health tourism domain using Java Concurrency and MapReduce.
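The brute-force search over a triple file described in the abstract can be illustrated with a minimal sketch. This is not the thesis's OpenMP or CUDA code: it is a Python stand-in that assumes text with one triple per line, and the helper names `match_chunk` and `parallel_search` as well as the sample triples are hypothetical. Python's `in` operator stands in for the string comparison, and the worker threads only illustrate how the data is partitioned; they do not reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def match_chunk(lines, start, pattern):
    """Scan one chunk and return (line_number, line) for each line containing pattern."""
    return [(start + i, line) for i, line in enumerate(lines) if pattern in line]


def parallel_search(lines, pattern, workers=4):
    """Split lines into contiguous chunks and scan each chunk in a worker thread."""
    if not lines:
        return []
    chunk = (len(lines) + workers - 1) // workers  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(match_chunk, lines[s:s + chunk], s, pattern)
            for s in range(0, len(lines), chunk)
        ]
        # Futures are read in submission order, so hits stay in file order.
        return [hit for f in futures for hit in f.result()]


# Made-up example triples (assumption: not taken from the thesis data set).
TRIPLES = [
    "<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .",
    "<http://example.org/bob> <http://example.org/knows> <http://example.org/carol> .",
    "<http://example.org/carol> <http://example.org/likes> <http://example.org/alice> .",
]

if __name__ == "__main__":
    query = "<http://example.org/knows>"
    for number, triple in parallel_search(TRIPLES, query, workers=2):
        print(number, triple)
```

Each worker receives a contiguous slice together with its starting offset, so matches can be reported with their original line numbers. The same partitioning idea would apply when the chunks are handed to CPU threads or GPU thread blocks instead.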
Description:	54307801; Computer and Information Science -- Chidchanok Choksuchat
URI:	http://ithesis-ir.su.ac.th/dspace/handle/123456789/209
Appears in Collections:	Science

Files in This Item:

File	Description	Size	Format
54307801.pdf		7.04 MB	Adobe PDF
