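Title (Chinese) 從非結構化文件中學習資訊粹取法則之技術 (A Technique for Learning Information Extraction Rules from Unstructured Documents)
Title (English) LIEF: An Algorithm for Learning Information Extraction Rules from Unstructured Documents
Graduate student 潘智仁 (Master's degree)
Advisor 魏志平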
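Abstract (translated from Chinese) With the arrival of the Internet era, more and more information is stored in digital form, including all kinds of digitized documents, and these documents often contain a great deal of valuable information. However, because most digital documents exist in unstructured form, quickly obtaining useful information from large volumes of such documents has become a very important problem. The traditional approach is to formulate information extraction rules and then apply them through an information extraction system. Producing these rules manually, however, has many problems, such as being very time-consuming, so it is desirable to generate the rules automatically. The learning strategies adopted by existing automatic rule-generation methods have some blind spots, and they rarely achieve good results, particularly when processing unstructured documents. This research therefore proposes a new learning strategy, learning from past errors, to address the problems that existing strategies encounter. In addition, this research builds a prototype system that adopts this learning strategy and compares its effectiveness against a benchmark technique. The evaluation results show that the proposed learning strategy is indeed clearly effective.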

In the past, information was stored in more or less well-structured form in databases. Nowadays, much information is presented in unstructured formats. Managing and retrieving from such vast amounts of textual information has been a challenging issue for organizations and individuals. Information extraction is the process of extracting relevant data from semi-structured or unstructured documents and transforming them into structured representations. Many information extraction learning techniques have been proposed. However, they are ineffective on unstructured documents. Thus, in this research, we propose a new information extraction learning algorithm, called LIEF, that enhances existing information extraction learning techniques. According to empirical evaluations on news documents, which are unstructured, the proposed LIEF algorithm demonstrated its capability in terms of accuracy.

Thesis download etd-0802101-100356.pdf


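Copyright 2001 Department of Information Management, National Sun Yat-sen University (NSYSU). All rights reserved.
Reproduction is welcome; please respect intellectual property rights and cite the source.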