1. 數位人文——學科對話與融合的新領域 PDF 項潔 / 陳麗華 /

2. Discovering land transaction relations from land deeds of Taiwan PDF Shih-Pei Chen / Yu-Ming Huang / Jieh Hsiang / Hsieh-Chang Tu / Hou Ieong Ho / Ping-Yen Chen /

Land deeds were the only proof of ownership in pre-1900 Taiwan. They are indispensable for the studies of Taiwan’s social, anthropological, and economic evolution. We have built a full-text digital library that contains almost 40,000 land deeds. The deeds in our collection range over 250 years and are collected from over 100 sources. The unprecedented volume and diversity of the sources provide an exciting source of primary documents for historians. But they also pose an interesting challenge: how to tell if two land deeds are related. In this article, we describe an approach to discover two important relations: successive transactions and allotment agreements involving the same property. Our method enabled us to construct 6,035 such transaction pairs. We also introduce a notion of ‘land transitivity graph’ to capture the transitivity embedded in these transactions. We discovered 2,436 such graphs, the largest of which includes 104 deeds. Some of these graphs involve land behavior that had never been studied before.
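As an illustration of how transaction pairs could be grouped into larger structures, the sketch below treats each discovered pair as an undirected edge and collects connected groups of deeds. The abstract does not define the land transitivity graph precisely, so equating it with a connected component is an assumption, and the deed identifiers are hypothetical.

```python
from collections import defaultdict


def transitivity_graphs(pairs):
    """Group deeds into connected components, given pairs of related deeds."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for deed in list(parent):
        groups[find(deed)].add(deed)
    # Largest group first, ties broken by the sorted member list.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical deed identifiers, for illustration only.
pairs = [("deed_A", "deed_B"), ("deed_B", "deed_C"), ("deed_X", "deed_Y")]
graphs = transitivity_graphs(pairs)
```

Under this reading, the size of the largest group corresponds to the "largest graph" figure an analysis would report.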

3. Discovering Relationships from Imperial Court Documents of Qing Dynasty PDF J. Hsiang / S.P. Chen / H.I. Ho / H.C. Tu /

The Qing Imperial Court documents are a major source of primary research material for studying Qing-era China, since they provide the most direct, first-hand details of how national affairs were handled. However, the way the Qing court archived these documents makes it cumbersome to collect documents covering the same event and rebuild their original contexts. In this paper, we describe information technology that we have developed to discover two important and useful relations among these documents. The first is the citation relation among the Imperial Edicts and the Memorials. We discovered 6,801 pairs among the 37,831 Taiwan-related Imperial Court documents in the Taiwan History Digital Library (THDL) and produced 1,101 graphs of successive citations, which we call IE-M diagrams. The second is a template relation, which identifies groups of documents that were created following a specific format. Numerical data can also be tabulated from these documents and used for further analysis. Our studies show how information technology can be used to discover useful contexts from seemingly unrelated historical documents.

4. 臺灣契約文書的蒐集與分類(1898-2008) PDF 涂豐恩 /

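The collection and study of Taiwanese contract documents has attracted considerable attention in recent years. Thanks to efforts on many fronts, the number of contract documents unearthed so far has far exceeded earlier estimates. Faced with this unprecedented wealth of material, a taxonomy of contracts has naturally emerged as a concern. This article reviews how researchers have collected and classified Taiwanese contract documents over the past hundred-odd years, in the hope that past experience can offer a starting point for those who follow. We begin with the surveys of Taiwanese "old customs" (舊慣) conducted by the colonial authorities during the Japanese period, which marked the beginning of the collection of private contract documents in Taiwan and deeply shaped later understanding of old documents. Here we focus on the views of 岡松參太郎 (Okamatsu Santarō) on contracts and on the classification used in 《臺灣私法》. We then discuss, in chronological order, the successive postwar waves of collecting private documents, covering three periods: the early postwar years, the 1970s-1980s, and the period after 1990. Finally, we analyze the classification systems adopted by postwar compilers. The article points out that most compilers were not concerned with general principles of classification; they adjusted flexibly to the material in hand. Such bottom-up classification is pragmatic, but it rarely yields universally applicable principles. Moreover, when the objects to be classified jump from hundreds or thousands to tens of thousands, the internal complexity grows accordingly and sorting them out becomes much harder, so the earlier classifications lose their general applicability. A taxonomy of Taiwanese contracts still awaits collective effort.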

5. 導論——數位人文的變與不變 PDF 項潔 / 翁稷安 /

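Taking the historiographical notion that "history has no fixed method" (史無定法) as an example, this article shows that the traditional humanities have always kept an open attitude toward research methods; in today's digital age, working with information technology to open up new digital humanities research is progress in keeping with the humanities tradition. Given today's disciplinary specialization, the best way for the digital and the humanities to cooperate at this stage may be to build a communicative, open team. We argue that digital humanities is a new mode and option for humanities research in the digital age, a help and not a harm: it shares the unchanging commitments of the humanities and also embodies their openness to, and innovation in, method. Striking and holding this balance between what changes and what does not will be the challenge for scholars who commit themselves to this field.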

6. 多重脈絡--數位檔案之問題與挑戰 PDF 項潔 / 翁稷安 /

7. 導論——什麼是數位人文 PDF 項潔 / 涂豐恩 /

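Research in the digital humanities has developed rapidly in recent years, both in Taiwan and abroad. The first international conference on "Digital Archives and Digital Humanities" (「數位典藏與數位人文」國際研討會), held at the end of 2009 by National Taiwan University's 數位典藏研究發展中心, was a concrete expression of this trend in Taiwan's academic community. This article offers observations and discussion around the question "What is digital humanities?" We start with recent trends, briefly describing how the digital humanities emerged in academia worldwide and developed quickly, including the founding of institutions, journals, and conferences. We then discuss the background to this development, namely the rich results that digitization has accumulated in many places over the past decade or so; these results pose challenges for humanities research but are also an opportunity for a new wave of development. Meeting these challenges and turning them into opportunities depends on new tools and new research methods. Finally, taking a wider view, digital technology is changing not only the methods of humanities scholars but the academic environment as a whole: the ways we transmit, communicate, and exchange knowledge, including teaching and learning, the presentation of research, and the shape of scholarly communities.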

8. 導論——關於數位人文的思考:理論與方法 PDF 項潔 / 翁稷安 /

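Digital humanities is gradually gaining recognition. While its possibilities continue to be explored, we must also think further about the norms it should have as an academic discipline. This article reviews the history of the digital humanities and gives particular attention to the discussion of the field by 金觀濤 (Jin Guantao). Jin approaches the methodology of digital humanities as knowledge from a philosophical angle; his insight deserves serious attention, but it also faces doubts that cannot be avoided. We argue that for a discipline so strongly grounded in practice, understanding its methodology from an abstract angle alone is not enough; the methodology must also be drawn from practice. We point out the key role that a research-oriented system plays in the use of digital data: only with such a powerful system can users freely observe the sources according to their own research needs, establish and uncover the contexts among sources, and develop their own arguments. And only on the basis of a research-oriented system will the relationship between researchers and data truly change. This is the starting point for our thinking about the methodology of digital humanities and its future development.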

9. 同位詞夾子:主題式分類詞庫萃取演算法 PDF 謝育平 /

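In information retrieval, natural language processing, digital archives, and many other fields, text processing is always the first problem researchers face. For Chinese data, automatic word segmentation, named entity extraction, and automatic term classification are the key preprocessing tasks, and traditional research has focused on the precision and recall of each automated task. This paper proposes a semi-automatic algorithm for extracting topical lexicons, named the "appositive term clip" (同位詞夾子). It relies on human judgment to guarantee precision and on machine speed to make up recall, aiming for very high precision and recall as high as possible. Terms of the same class are highly appositive (同位性): for example, 「台北」 and 「高雄」 are highly appositive. By this we mean that almost everywhere 「台北」 appears in a document, 「高雄」 can be substituted and the sentence still reads smoothly, so we call 「台北」 and 「高雄」 appositive terms (同位詞) of each other. An appositive term clip consists of five parts (preceding context, prefix, infix, suffix, and following context); it describes the features of a term at a particular place in a document and is used to extract that term's appositive terms from documents. The algorithm first asks the user to provide seed examples of the class of terms. It scans the documents with the seeds to generate clips for that class, then scans the documents with the generated clips to pick out candidate terms, which are ranked by a numerical "appositiveness score" (同位性分數) so that the user can decide manually whether each belongs to the class. The accepted terms extend the seed set and the algorithm is restarted, and this interactive cycle continues until the user is satisfied. We successfully extracted personal names, place names, official titles, and event names from the Taiwan History Digital Library, and also extracted term classes outside traditional named-entity research from classical Chinese novels: weapon names in 《三國演義》, spell names in 《西遊記》, clothing and ornament names in 《紅樓夢》, and snack names in 《金瓶梅》. On average, a categorized lexicon can be extracted within two hours.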
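The seed-to-clip-to-candidate loop of the 詞夾子 approach can be sketched roughly as follows. This is a deliberately reduced illustration: it keeps only the preceding and following context of the five-part clip, and it uses a plain match count in place of the 同位性分數, whose formula the abstract does not give. The function names, the context width, and the sample text are all assumptions.

```python
import re
from collections import Counter


def make_clips(text, seeds, width=2):
    """Collect (preceding context, following context) pairs around each seed occurrence."""
    clips = set()
    for seed in seeds:
        for m in re.finditer(re.escape(seed), text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            if left and right:
                clips.add((left, right))
    return clips


def extract_candidates(text, clips, seeds, max_len=4):
    """Use each clip to pick out short strings sitting in the same context as the seeds."""
    scores = Counter()
    for left, right in clips:
        pattern = re.escape(left) + r"(.{1,%d}?)" % max_len + re.escape(right)
        for m in re.finditer(pattern, text):
            term = m.group(1)
            if term not in seeds:
                scores[term] += 1  # stand-in score: number of clip matches
    return scores.most_common()


text = "我住在台北市。他住在高雄市。她住在台南市。"
seeds = {"台北", "高雄"}
clips = make_clips(text, seeds)
candidates = extract_candidates(text, clips, seeds)
```

In the interactive procedure described above, a person would review the ranked candidates, add the accepted ones to the seed set, and run the two steps again.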

10. 協助解決Google難題的資訊萃取機制 PDF 謝育平 / 詹登淵 / 郭子文 /

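Search engines are widely used in daily life, yet they cannot satisfy all of a user's information needs: there remain many search problems that cannot be solved quickly with Google, which this paper calls Google-hard problems. To address them, we build an extraction mechanism. The mechanism extracts from zeroth-order and first-order web pages. It first uses the longest-word-first method to turn the text into sentences whose units are words, then applies an N-gram algorithm over the word sequence to perform Chinese word segmentation, and provides a list of terms weighted by relevance to give the user content hints and precision-oriented keyword recommendations. In addition, it exploits the properties and strengths of the Term-Clip Algorithm (詞夾子演算法) to mine the text in a learning mode, finding appositive terms that share the attributes of the sample terms, to give the user content hints and breadth-oriented keyword recommendations. The extraction mechanism speeds up the user's search and lets the user change the direction of the search by manipulating term sets, which is a new conceptual model of search operation.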
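The longest-word-first (長詞優先) step can be illustrated with a greedy forward maximum-matching segmenter. This is only a sketch of that first step: the lexicon, the maximum word length, and the single-character fallback are assumptions, and the later N-gram stage is not shown.

```python
def longest_first_segment(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation that always prefers the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)  # unknown single characters pass through as-is
                i += size
                break
    return tokens


# Hypothetical lexicon for illustration.
lexicon = {"搜尋", "引擎", "搜尋引擎", "使用者"}
tokens = longest_first_segment("搜尋引擎幫助使用者", lexicon)
```

Because the longest match wins, 「搜尋引擎」 is kept as one token even though 「搜尋」 and 「引擎」 are also in the lexicon.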

12. 史料整體分析工具之幕後—介紹臺灣歷史數位圖書館的資料前置處理程序 PDF 陳詩沛 / 杜協昌 / 項潔 /

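The Taiwan History Digital Library (臺灣歷史數位圖書館, THDL) is a large full-text digital library of historical sources built in recent years at National Taiwan University. It contains nearly 80,000 primary documents on the history of Taiwan, with a cumulative full text of about 150 million characters. This article explains the data preprocessing that THDL performs when new sources are ingested. An ordinary digital library usually does not assume significant relations among the items it ingests, so when a new batch arrives little preprocessing is needed beyond rebuilding the search index (re-indexing). To let researchers of Taiwanese history use and analyze the large body of sources in THDL more effectively, however, we have developed a series of analysis tools that must analyze the sources as a whole. THDL's data handling is therefore far more complex than that of an ordinary digital library: in particular, when a new batch is loaded it must be preprocessed together with the existing data, comprehensively and in fine detail, to build in advance the analytical information for the whole collection and the relations among documents, so that researchers can dig deep into a large volume of sources for analysis and research. This preprocessing procedure also reflects the principle behind THDL: a good digital library should not be a mere data warehouse but a seamless combination of data and analysis tools.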

13. On Building a Full-Text Digital Library of Land Deeds of Taiwan. PDF Jieh Hsiang / Szu-Pei Chen / Hsieh-Chang Tu /

14. 台灣古契約文書全文資料庫的建置 PDF 項潔 / 陳詩沛 / 杜協昌 /

15. 使用詞夾子建立中文典籍分析加值服務 PDF 謝育平 / 楊龍廉 / 趙建宏 / 黃銘立 / 古馮文 / 林郁智 /

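When we search a general-purpose search engine for an answer to a question, the number of results returned ranges from a couple of hundred to over a thousand, and it takes time to read and analyze them before we can steer the search toward the information we really want. An agent that automatically reads and analyzes search results is therefore quite important. The main reason general-purpose search engines do not yet offer this is that their search domain is too broad, which makes the results hard to analyze. This paper takes 432 Chinese classical texts and implements a search engine with an automatic reading-and-analysis agent for scholars of Chinese studies, as research infrastructure for the field of Chinese classics. The automatic analysis is presented mainly through lexicon frequency analysis: for 96 of the texts, we used word clips (詞夾子) to extract 139 categorized lexicons containing 21,538 terms in total. Processing one categorized lexicon took on average about 14 minutes of machine time and 35 minutes of human time, and added 64 terms on average.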


1. 互動式網際網路檢索:模型、度量及實驗 杜協昌 /

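Because of the enormous volume of data on the World Wide Web, interactive IR (interactive information retrieval) needs to be studied to improve retrieval effectiveness. The aim of this thesis is to build a model for discussing the roles that many current interactive IR services play in retrieval. Second, to evaluate a retrieval system we need suitable effectiveness measures; the thesis discusses some traditional measures and tries to propose more practical ways of evaluating retrieval effectiveness. Finally, we discuss how simple document attributes can be used to improve the reranking of retrieval results. The thesis has three main parts. 1. A conceptually simple operational model of interactive IR: we use several types of operators to define the behavior of interactive Web IR, so that different operator sequences specify different retrieval behaviors. We also propose the notion of spotlight cues to represent documents that the system suggests may interest the user, and derive from it an application that helps browsing, "multi-dimensional navigation". 2. We give five reasons why recall, a widely accepted effectiveness measure in traditional IR, is impractical for Web IR. We propose two more practical measures, the "DCV-precision plot" and the "relnum-precision plot", and discuss some of their properties. 3. We ran experiments that rerank the results returned by search engines using quite simple attributes of Web documents (for example keywords, anchors, and the number of "related terms"), and discuss how these simple attributes can improve retrieval effectiveness. In addition, the background review chapter discusses many issues in probabilistic IR and lists many interactive techniques currently available on the Web. The thesis closes with several conclusions and some directions for future research.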
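The abstract names a "DCV-precision plot" without defining it. Assuming DCV stands for document cutoff value in its usual IR sense, the points of such a plot could be computed as precision over the top k results for several values of k. The function name and the relevance judgments below are hypothetical.

```python
def precision_at_cutoffs(ranked_relevance, cutoffs):
    """Return (cutoff, precision) points for a ranked list of relevance judgments."""
    points = []
    for k in cutoffs:
        top = ranked_relevance[:k]
        # Precision at cutoff k: relevant documents among the first k returned.
        points.append((k, sum(top) / len(top) if top else 0.0))
    return points


# Hypothetical judgments for the first ten results of one query.
ranked = [True, False, True, True, False, False, True, False, False, False]
plot_points = precision_at_cutoffs(ranked, [1, 5, 10])
```

Unlike recall, this needs judgments only for the documents a user actually looks at, which fits the abstract's argument about why recall is impractical on the Web.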

2. 以日誌為基礎的照片管理系統 廖漢騰 /

Most photo management systems are built around albums and categories. This paper reviews the issues surrounding personal archive systems and instead builds on a blog system, using hyperlinks (or trails) and semantic markup to manage photos. The implementation adopts the XHTML standard so that the data can be kept permanently and does not become orphaned when applications are upgraded. It introduces one-to-many links, which serve as a way of organizing data and also meet the basic requirements of both old and new browsers, and it introduces semantic markup for photos so that the meaning of each hyperlink is clearer. Finally, the implementation converts the semantically annotated hyperlinks into RDF and renders them as graphs to show the relationships among resources.

3. 臺灣圖書館館藏目錄聯合檢索系統之研究與實作 何浩洋 /

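Different libraries have their own catalog search systems, each with its own user interface. Building a union catalog search system across these libraries saves readers a great deal of time and effort: they no longer need to learn each library's system, search each one for a book, and then filter out the bibliographic records they actually want. Two large systems of this kind already exist in Taiwan: NBINet, built by the National Central Library, and the 「國內圖書館圖書虛擬聯合目錄」 (a virtual union catalog of domestic libraries) of National Chung Cheng University. Libraries cooperating with NBINet periodically export their catalog data and upload it to NBINet for central storage, so a reader searching NBINet finds the records each library has uploaded. The Chung Cheng approach uses a program that, at query time, sends the user's search conditions to each library's Webpac system in real time and then presents the merged results from all libraries on one web page. NBINet, however, depends on the cooperation of each library, and its catalog and loan-status data cannot reflect the current situation in real time. The Chung Cheng system needs a separate query-issuing program for each library's Webpac system, and it merely places each library's results on the same page without any sorting or integration, which remains very inconvenient for users. The MetaCat system implemented in this thesis adopts the same distributed metasearch architecture as the Chung Cheng system: it forwards the user's search conditions to each library's Webpac system, then integrates and presents the results by title and by library. Besides taking the user's perspective as its main concern, the system reduces the handling of different Webpac systems to XML configuration, emphasizing reusability and flexibility of the software architecture and ease of maintenance and management.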

4. 以鏈結為基礎的網站行為研究 蔡雨利 /

隨著網際網路的快速發展以及網路使用者的年齡層不斷下降,網路上的不當資訊越來越容易在家長不注意的時候,對心智尚未成熟的使用者造成不良的影響;但網際網路跨越國界的特性、節點分佈的廣闊,也使得政府機構對於規範網路上流通的資訊使不上力,唯有依靠民間企業或團體來發展過濾的機制。 現今市面上的網路內容過濾軟體,其所用以判斷不當內容網站的機制多為關鍵字比對(Keyword Comparison)或者內容分析(Content Analysis),此兩種技術皆已發展多年,達到成熟的階段,但文字基礎的分析方法容易在不同文化背景下遭遇到障礙,是值得注意的。 本篇論文試著從另一個角度出發來作特定主題網站的判斷與收集,我們著眼於當文件在超鏈結環境(網際網路)中所展現出的新特質,藉由觀察特定主題網站的行為,利用並分析鏈結結構所帶給我們對於網站的資訊,嘗試發展一個適用於收集和判斷特定類型網站的演算法。 The prospering of World Wide Web has brought some unexpected social problems, one of which is the influx of material not suitable for children, such as pornography and hate groups. How to shield impressionable minds from such pollution has become a challenge for computer scientists. One common approach is to build a content filtering tool that block websites containing improper information from being transmitted to the browser. Most content filtering software use keyword comparison or content analysis to identify such websites. Although these methods are effective to some extent, there are still some drawbacks. For instance, same words may represent different concepts under different cultures could lead to misdetection. When applying a pure textual based mechanism on different cultural environments for developing web site analysis algorithms, blocking sites by mistake or fail to block intended sites is a critical and crucial issue. In this thesis, we propose a new approach to website analysis. Our method is based on the observation that related websites tend to refer to each other through hyperlinks. A graph-based algorithm that utilizes this property has been designed and implemented. We have shown that our algorithm is efficient and effective in finding related site by collecting porno-sites together as an example. Additional experiments conducted on butterfly-related websites and gun-related websites have also produced satisfactory results.

5. 可攜式文字集之研究與應用 吳政泓 /

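The missing-character problem (缺字問題) has existed for a long time and is still not fully solved. It does not trouble most people much, but for certain groups, such as teams digitizing Buddhist scriptures and units of the National Digital Archives Program, it is a heavy burden. During digitization, handling missing characters is quite tedious and causes considerable trouble for those entering the text; and once a document containing such characters has been created and distributed, viewing and using them is not simple or intuitive enough for ordinary users, which also makes the characters hard to circulate and reuse. Building on the concept of the Portable Word Resource [13], this thesis proposes an architecture called the Portable Word Set. It treats the glyph, pronunciation, and meaning information of a missing character as the character's ontological information and combines them into a network resource that the computer uses to interpret the character. A character identifier is defined as "character set name + internal code extension", and the character set name implies the network location of the character set resource. When a reader views a document produced with such a resource, the program automatically identifies where the resource is located, downloads it, and installs it, so the reader can view the document naturally. The approach has the property that "whatever has been viewed can be typed with an input method, whatever has been written can be distributed, and whatever has been distributed can be viewed", and it supports any application that supports fonts.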

6. 可攜式網站之研究與實作 蔡長恩 /

當今網路環境的成長和便捷豐富了人們的生活,隨著網路環境的擴展我們得以伸出了觸角到世界的每個角落。但儘管是身處於如此方便的網路時代,人們仍會有是在沒有網路或是低頻寬網路的時刻,在此受限的網路世界下欲瀏覽網站、得到知識是一件不方便的事情,所以在此環境之下就衍生出離線瀏覽問題。 目前處理離線瀏覽問題的方式主要是預先下載欲瀏覽的網頁,但此類方法僅僅能處理靜態內容網頁部份,對於較有互動性質的動態內容網頁部份,目前尚無一個較為便利的機制使得網站的使用者和建置者雙方面可以達到供給和需求上的平衡。 本篇論文試著以站在網站建置者的立場上來解決離線瀏覽問題,架構出將網站內容和架構網站所需元件分離成兩大支柱。並藉由將網站以包裝的形式,以及將所需元件與閱讀網站的所需的瀏覽器合而為一成為網站使用者的使用工具,嘗試發展出一個適用於低頻寬和無網路環境之下使用者和建置者方便的架構。 The World Wide Web has brought a wealth of information, in the form of websites, like never before. A major limitation of using a website, however, is that the user must have access to the Internet. Conventional solutions such as pre-downloading a Website for offline browsing are effective only for static web pages. It does not work if a Website allows internal queries and forms a webpage dynamically from query results. In this thesis, we describe an approach to using a Website in a network-less environment. Our method wraps an entire Website and un-wraps it when one wants to browse its content. Since the un-wrapping can be done in any environment, with or without the Internet, the Website becomes portable. We have implemented this portable Website system and have given some experimental evidence of its usefulness.

7. 詞夾子演算法在專有名詞辨識上的應用-以歷史文件為例 張尚斌 /

中文詞集是一個開放集合,現階段不存在任何一個詞典或方法可以盡列所 有的中文詞。當處理不同領域的文件時,領域相關的特殊詞彙或專有名詞,常常造成辨識錯誤的情況。 現今的作法大致上分為三種,一種是以人工撰寫規則的rule-based方法,一種是以建置詞庫為主的corpus-based方法,最後一種是利用學習方式machine-learning的方法。大部分的作法都是以詞庫為主,但是詞庫要建置完備並不容易。本論文的目的是提供一個不建立詞庫的方法,來做專有名詞辨識。 本論文提出詞夾子演算法來解決專有名詞辨識的處理,詞夾子是使用“前文”、“詞首”、“詞尾”、“後文”的組合。主要概念是利用文章寫作上的一些特定習性與字辭之間的耦合關係,來找出專有名詞。先給予樣本詞,然後找出和樣本詞相關的詞夾子,並利用這些詞夾子找出與樣本詞類似的候選詞出來,之後以迭代方式不斷的產生詞夾子和候選詞。我們以歷史文件(在明清檔按有33025個檔案古契書有21575個檔案)為實驗資料。明清擋案在人名辨識上,得到在77.1%的召回率下得到56.1%的精確度,而在地名辨識上,得到在87.9%的召回率下得到87.0%的精確度。古契書在人名辨識上,得到在72.9%的召回率下得到45.6%的精確度,而在地名辨識上,得到在80.3%的召回率下得到77.6%的精確度。 The Chinese characters may in principle be composed into a countless number of phrases, which no existing methods, including dictionaries, can completely enumerate. This leads to the problem of erroneous detections or misses when attempting to identify proper nouns (PN) in a document. In this thesis, we have proposed a method based on a notion of word-clip to identify proper nouns from documents in a specific domain. Methods for PN recognition can be classified into the following three categories: rule-based methods, corpus-based methods, and machine-learning methods. The corpus-based methods are the most widely used approach. However, they usually require the establishment of a large dictionary. This is where the bulk of work lies. The word-clip method has no need of establishing a dictionary, which makes our algorithm more efficient. The main concept of the word-clip method is to use some existing relationships between PNs and the whole phrase. For example, the abbreviation ""Mr."" is usually followed by the name of a person (with a few exceptions such as ""Mr. President""). A typical word-clip is thus formed by combining a ""leading phrase"", a ""PN prefix"", a ""PN postfix"", and an ""ending phrase."" Our algorithm uses a set of initial sample PNs plus a set of training documents to generate word-clips. These word-clips are then used to identify new PNs for the next training cycle. 
This process is iterated to generate candidate PNs. We have tested our method on two large sets of historical documents: a set of 33,025 court documents from the Ming and Qing Dynasties, and a set of 21,575 old land deeds. For the former, we generated 74,825 personal names with a precision of 56.1% and a recall of 77.1%, and 6,306 place names with a precision of 87.0% and a recall of 87.9%. For the latter, we generated 28,358 personal names with a precision of 45.6% and a recall of 72.9%, and 4,132 place names with a precision of 77.6% and a recall of 80.3%.

8. 標籤樹於文件檢索後分類與呈現之運用-以古文書為例 莊萬慶 /

本論文的目的在於幫助使用者更簡單、便利的抓住檢索結果的重點,修正檢索結果,找到有用的資訊。由於大多的文件檢索系統所提供的檢索結果呈現-條列出文件摘要,常常發生數量太多的情況,如數百筆的資料、數十頁的分頁,這並非使用者可以消化的資訊數量,造成花費大量的時間和精神在瀏覽一筆一筆的文件,令人灰心的是,仍不一定找到相關的文件。因此我們提出一種後分類及呈現檢索結果特徵的架構-標籤樹,來解決這個問題。 標籤樹組織文件中屬性明確的關鍵詞,如人名、地名、時間等。使用者透過標籤樹提供的資訊,可以簡單且直覺的判斷出檢索結果中的概要、重點、並透過不同主軸(面向)觀察,如以人名為主、以地名為主、以時間為主,瞭解重要特徵的關連性,降低使用者對系統所提供的資訊之誤解。並且透過超連結的幫助,使用者能夠便利的縮小文件範圍,修正檢索結果。 我們以實際的歷史資料-古文書,實做標籤樹,並進行範例研究與分析,具體的說明從標籤樹中,使用者能夠進行整體性、多種主軸、且重點式的檢索行為。此外,配合現有的歷史研究資料來互相分析,如歷史年表、歷史出版物等,標籤樹提供使用者或歷史研究者一種驗證的研究方式。 This thesis presents an approach to classify and present query results. General purpose retrieval systems such as Web search engines cannot utilize domain knowledge to arrange query results into more readable form. Consequently it is often difficult for the user to take full advantage of returned documents should the quantity be very large. In this thesis we propose a notion of tag trees to post-classify and post-process query results according to features from different viewpoints. We incorporated our method into a digital library of historical documents. Using ""name"", ""location"", and ""time"" as coordinates and pre-defined sets of keywords for each coordinate, we classify retrieved documents according to the number of documents in which each keyword appears. The frequency that a keyword appears in retrieved documents also renders important insight into further query refinements and related queries.
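The counting step behind a tag tree, in which retrieved documents are classified by the number of documents containing each predefined keyword along each coordinate, might look like the sketch below. The facet names follow the abstract ("name", "location", "time"); the keywords, documents, and function name are hypothetical, and plain substring matching is an assumption.

```python
from collections import Counter


def facet_counts(documents, facets):
    """For each facet, count how many retrieved documents mention each keyword."""
    counts = {name: Counter() for name in facets}
    for text in documents:
        for name, keywords in facets.items():
            for keyword in keywords:
                if keyword in text:
                    counts[name][keyword] += 1  # document frequency, not term frequency
    return counts


# Hypothetical query results and keyword sets.
documents = [
    "Lin sold land in Tamsui in 1850",
    "Chen leased land in Tamsui in 1860",
    "Lin and Chen divided land in Hsinchu in 1860",
]
facets = {
    "name": ["Lin", "Chen"],
    "location": ["Tamsui", "Hsinchu"],
    "time": ["1850", "1860"],
}
counts = facet_counts(documents, facets)
```

Sorting each counter by frequency gives one level of the tree; narrowing the document set to a chosen keyword and recounting gives the next level.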

9. 網路位置為基礎的網站數量評估機制-以色情網站為例 陳榮佐 /

隨著網路的蓬勃發展,色情網站的數量也與日劇增,人們也開始重視色情網站的問題,然而,想要獲得色情網站的數量並不是件簡單的事。本篇論文試著提出一個評估色情網站的方法,並擁有一定的信心值與誤差。 為了建立一套系統化且擁有較高信賴度的方法,我們採用網路位置(IP Address)取代以往以網域名稱(domain name)或網頁(webpage)來當作色情網站的單位,而我們使用關鍵字(keyword)、資料庫比對(database match)、鏈結分析(link analysis)來判斷是否為色情網站,再配合簡單隨機抽樣(Simple Random Sampling)來推得共有 69077個網路位置為色情網站,擁有95%信心值,誤差10%。 It is known that the number of pornographic websites increases as the Web expands. To estimate this number of pornographic websites online remains a big challenge. This paper proposes a method, based on statistical approaches, to estimate the actual number of pornographic websites within a certain confidence interval, and error range. In order to develop a more systematic and reliable method to estimate the number of pornographic websites, we have chosen to use IP address as our unit of measurement instead of the more commonly used domain name and webpage to describe pornographic website. We have used keywords, database matches, and link analysis to determine if a website contains pornographic content or not. Based on Simple Random Sampling statistics, we have concluded the number of pornographic websites up to date is 69077 with 95% confidence interval and within 10% error.
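How simple random sampling yields an estimate with a stated confidence level can be sketched with the normal approximation for a sample proportion. The population size, sample size, and positive count below are made-up numbers, not the thesis's data, and the sketch ignores the finite population correction.

```python
import math


def estimate_total(population, sample_size, positives, z=1.96):
    """Estimate how many units in the population are positive, with a confidence interval.

    z=1.96 corresponds to a 95% confidence level under the normal approximation.
    """
    p = positives / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    low = population * max(0.0, p - half_width)
    high = population * min(1.0, p + half_width)
    return population * p, low, high


# Hypothetical: 10,000 addresses sampled from 1,000,000, of which 400 were judged positive.
estimate, low, high = estimate_total(1_000_000, 10_000, 400)
relative_error = (high - estimate) / estimate
```

With these toy numbers the half-width of the interval is a little under 10% of the estimate; a larger sample would be needed to tighten it further.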

10. Free-DOM:萃取鬆散文件中的重要資訊並結構化之方法 王文廷 /

全球資訊網(WWW)(World Wide Web)上的資料,絕大多數皆以HTML(HyperText Markup Language)文件呈現;而全球資訊網上資料的加值應用,則須以此廣大的文件庫為基礎。又因為HTML文件是一種內容與排版呈現描述交雜在一起的文件,並沒有語意結構的描述,所以重要資訊的線索並不存在標籤(TAG)之中,因此HTML文件不論在語意上或者在結構上皆為鬆散的文件。所以在鬆散文件中的資料萃取及資料操控問題尤為重要。觀察深層網頁,可以假設同一個網站中的文章排版風格相近,同文章中的重要資訊也有相同的排版風格,Free-DOM主要應用在此類的文章之上。對鬆散文件的資料萃取而言,正規表達式提供一個豐富且精準的萃取機制。對資料操控來說,文章物件模型(Document Object Model)(DOM)提供了一個重要的機制來處理結構化的文章。Free-DOM係指使用正規表達式萃取鬆散文件(Free-Text)中的重要資料,然後使用文章物件模型的概念來結構化萃取後的資料。為了要做全球資訊網路資料的加值應用,本文設計Free-DOM來萃取結構化鬆散文件中的重要資訊以提供程式語言操控或是直接以XML(Extensible Markup Language)格式輸出結構化文件之後讓DOM操控以利於做全球資訊網路資料的加值應用。 Most documents available over the World Wide Web are written in or transformed into HTML. However, HTML is a loosely structured language that mixes presentational style with content. It is therefore important to design ways that can extract data from HTML documents. In this thesis we propose a method, Free-DOM (a Free-text Documents Object Model), for this purpose. Free-DOM is aimed at extracting data from HTML documents with a similar presentational format. It uses the regular expression to capture the structure of the format that it wants to extract, and the concept of DOM (Document Object Model) to manipulate the extracted data. Thus Free-DOM provides an extraction-and-manipulation language for free-text documents. Free-DOM supports programming languages (such as C++) as a library to pre-process and manipulate documents. It also works as a server-side script language to do value-added applications over the World Wide Web. We show the effectiveness of our method by several examples.
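The two halves of the idea, a regular expression that captures a repeated presentational pattern and a DOM-style tree that holds the extracted fields, can be sketched with Python's standard library. This is not the Free-DOM language itself; the HTML snippet, the pattern, and the element names are assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout: each article shows its title in <b> and its author in <i>.
PATTERN = re.compile(r"<b>(?P<title>[^<]+)</b>\s*<i>(?P<author>[^<]+)</i>")


def extract_to_xml(html):
    """Extract repeated title/author pairs and return them as an element tree."""
    root = ET.Element("articles")
    for match in PATTERN.finditer(html):
        item = ET.SubElement(root, "article")
        for field, value in match.groupdict().items():
            ET.SubElement(item, field).text = value.strip()
    return root


html = "<p><b>First post</b> <i>Alice</i></p><p><b>Second post</b> <i>Bob</i></p>"
root = extract_to_xml(html)
xml_text = ET.tostring(root, encoding="unicode")
```

Once the fields are in a tree, they can be queried or serialized to XML without touching the original loosely structured markup again.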

11. 官職表的模型與實作 張鈞韜 /

誰在什麼時候當了什麼官是研究當時歷史的重要資料。我們根據兩本描述清代台灣官職的歷史書籍:台灣地理及歷史卷九官師志第一冊文職表卷九[CR1]與台灣慣習記事[CR2]建立了本實驗系統,這兩本書包括了清代台灣各行政區域以及各官職的起始年代,歷任行政長官,我們希望這個系統能夠幫助使用者(尤指歷史學者)大大的縮短在尋找這方面資料的時間。 不僅僅只是建立參考文獻中所羅列的表格,我們設計專屬於這些資料的資料結構,讓這些資料的運用範圍能更廣範。書中的每一筆資料被我們分成數個tpules儲存,這些tulpes要儘量盡量的「侷限化」(localizaton) 使每個tuples之間不會互相影響,同時也要考量保存各tuples之間的關聯性,能夠回答使用者的各種問題。另一方面,因為參考文獻還是有所缺漏而且新的歷史研究不斷進行,所以能有效率的新增與修改資料的資料結構是絕對需要的。我們也將證明我們的資料結構是足夠精簡(minimal)的,當資料量變大時,不會讓系統不堪負荷。 本系統提供了三種查詢功能,依時間查詢、依官職查詢、依人物查詢。依官職查詢是參考文獻對於資料的排列方法,以官職為條目,列出了歷任任職者。數位資料應用上的方便,讓我們輕易提供了觀察資料的另外兩個面相:依人物查詢與依時間查詢。 依人物查詢提供每個人出仕的履歷,依時間查詢則繪出了某一年代台灣的行政區域與行政長官的樹狀階層圖。系統中在三個查詢結果中,(人、官)都附有連結,更大大縮短了歷史研究者尋找相關資料的時間。 Since it is important to learn the certain in-charge person at the certain time for history studying, we set up a program that assists users in getting information about local officials, by using an source book of that of Qing dynasty, the Listing of. 台灣地理及歷史卷九官師志第一冊文職表卷九[CR1] and 台灣慣習記事[CR2]. All the Taiwan administrative offices and the related occupying officials were recorded on the book, along with the specific dates of beginning and ending. Not a table but a data format was built to help users to fully utilize the information from the book. Since not every piece of information from the book is complete or correct, it is obvious that the smaller the tuple is, the easier it is to make local changes, and it is our main concept for the format. The tuples are required, however, to cover a wide range of fields to be connected to response new types of queries. It is proven that our data format is minimal while sufficient. There are currently three types of queries supported by our system: query by office, query by person, and query by year. Query by office, which is the same method the book adopted, chronologically lists the names of officials to a certain office. The remains, however, are the evidence that digital data offers more than the source book does. Query by name gives us every office held by a given name. 
Query by year produces an organization chart of the whole government for the given year, together with its offices and officials. Clicking an office on the chart reveals its internal structure and the officials attached to it, with their titles, in that year. In the same way, clicking a person on the chart issues a query by person.
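The three queries can be illustrated over a flat list of small tenure tuples. The abstract does not give the actual tuple layout, so the fields below are an assumption, the records are placeholders instead of historical data, and treating an unknown end year as still in office is a simplification.

```python
from typing import NamedTuple, Optional


class Tenure(NamedTuple):
    office: str
    person: str
    start: int
    end: Optional[int]  # None when the end year is not recorded


def by_office(records, office):
    """Chronological list of holders of one office."""
    return sorted((r for r in records if r.office == office), key=lambda r: r.start)


def by_person(records, person):
    """Chronological list of offices held by one person."""
    return sorted((r for r in records if r.person == person), key=lambda r: r.start)


def by_year(records, year):
    """All tenures active in a given year (open-ended tenures count as active)."""
    return [r for r in records if r.start <= year and (r.end is None or year <= r.end)]


# Placeholder records, not taken from the source books.
records = [
    Tenure("Office A", "Person X", 1800, 1803),
    Tenure("Office A", "Person Y", 1803, 1806),
    Tenure("Office B", "Person X", 1804, None),
]
active_1805 = by_year(records, 1805)
career_x = by_person(records, "Person X")
```

Keeping each tuple small means a correction to one tenure touches one record and leaves the others unchanged, which matches the locality goal the abstract describes.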

12. 淡新檔案21101案到22443案的訴訟關係與親屬關係資訊擷取及其應用 李承恩 /

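The Tan-Hsin Archives (淡新檔案) consist mainly of official documents of the Tamsui Subprefecture (淡水廳) and Hsinchu County (新竹縣) yamen from the 41st year of the Qianlong reign (1776) to the 21st year of the Guangxu reign (1895); the judicial files are full of litigation and kinship relations between people. Among the surviving Qing-era archives of Taiwan's provincial, prefectural, departmental, county, and subprefectural offices, the Tan-Hsin Archives are the largest, the most complete, and span the longest period. They are highly valuable primary sources for studying the administration, justice, economy, society, and agriculture of Qing-era Taiwan, an important basis for understanding the legal system and judicial practice of traditional China, and a world-renowned county-level archive of traditional China. The archives faithfully present the documents, disputes, and lawsuits that arose from the daily lives of ordinary people in Qing-era Taiwan and are rich in interpersonal relations, so historians studying them begin by sorting out these relations. In the past, experts drew complex diagrams of the flows of goods, people, and money while reading the documents. When the number of cases is large, recording these relations with pen and paper consumes considerable labor and is quite limited in terms of preservation, sharing, and querying. Using computer techniques to help organize these relations can therefore reduce the time historians spend organizing data and increase the flexibility with which the relational data can be used. This study began with extensive observation to identify the features of the relation diagrams, then used regular expressions, data mining, and SVG drawing to automatically extract from the civil documents of the Tan-Hsin Archives the cause of each lawsuit, the kinship relations of the litigants on both sides, and the actions taken by the authorities, and to automatically generate litigation relation diagrams and kinship relation diagrams. The diagrams are based on cases 21101 to 22443 of the civil section of the archives, 92 cases and 2,286 documents in all. Litigation relation diagrams take the case as the unit: the data for 92 diagrams were obtained automatically, and our own verification found an accuracy of 93%. Kinship relation diagrams take the whole family as the unit: the data were obtained automatically and, after our own verification, yielded 109 kinship relation diagrams. So that these data can be properly presented and corrected, we provide a graphical interface for querying and editing. This study can also serve as a reference aid for historians working on the Tan-Hsin Archives.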

13. 數位博物館的教材建立系統 陳思靜 /

近年來政府極力的推動典藏品數位化,建立了許多資料豐富珍貴的數位博物館,目的就是希望能將這些典藏品更廣泛的被推廣出去,而推廣人員的協助能讓推廣效果更好,如果我們的學習者是中小學學生,那對應的推廣人員就是中小學老師。 目前數位博物館大多是提供固定的資源給推廣人員進行推廣教育,像是針對部分教學主題所做的教案建議,或是一些經由專家所設計出的教材網頁,但是由於學習者種類眾多,要能針對不同學習族群及學習需求製作出符合的教材,則會有不同的教材內容與呈現方式。因此,有別於現有數位博物館提供內容已經固定的教材資訊,本論文的目的是希望在數位博物館的內容下,提供給推廣人員一套可以自由選擇內容的教材建立工具。 本論文的系統架構是從分析數位博物館內容物開始,從這些數位博物館具備的內容中,我們會解釋如何將博物館內容對應到教材所需要的資訊,接著藉由我們系統所提供的工具,讓推廣人員將這些已經存在的數位博物館資料,整理擷取成教材所要呈現的模式,最後完成的教材會是由多個有順序的頁面所組合而成,而這些教材可以在建立者同意公開的情況下,被視為教材的範本,提供給其他推廣人員作為製作教材時的參考資料,進一步豐富數位博物館的內容。 我們以「蝴蝶生態面面觀」為例,實作本論文所提出的教材建立系統。我們將管理教材、製作教材與呈現教材的過程整合在同一個介面上,方便推廣人員可以利用蝴蝶生態面面觀內容製作出蝴蝶的相關教材。 Digital museums have become popular in recent years. While some were built by real museums as a virtual alternative, many more do not have a physical correspondence. In addition to providing exhibitions that are open 24 hours a day, a digital museum also allows uses to gain access to the museum’s repository through a searchable digital archive that is often implemented as part of the system. This thesis explores the issue of using contents of a digital museum for education. Education is an important goal of digital museums. While some digital museums provide material for teachers to use in classrooms, they are usually static material such as web pages, which cannot be modified or re-arranged. We propose a tool for the users to extract content from a digital museum and the web and compose them into course material. We describe a mechanism to extract and organize information obtained from a digital museum. We also show how the course material created can be used in a classroom setting via serial display. If the author chooses to make the courses produced available, they can also be made public and accessed by other users through the web. To show the feasibility of our method, we built a prototype on an existing Digital Museum of Butterflies. We show how the system works by collect information, prepare lectures and present course materials about butterflies on the same platform.

14. 台灣古契書自動分類與依分類定義契書角色 盧家慶 /

台灣古契書是反映民間社會生活的第一手資料,同時也是研究臺灣歷史最重要的第一手資料。蒐集古契書並進行數位典藏除了可以保存契書資料外,也能讓我們透過蒐集的契書資料來瞭解清代臺灣地權轉移與開發史。 由臺灣大學資訊工程所數位典藏與自動推論實驗室和臺灣大學圖書館合作建置的臺灣歷史數位圖書館(Taiwan History Digital Library, THDL)是一個全文數位圖書館,在古契書方面目前已收集由國立台中圖書館及國立台灣大學圖書館所數位化的契書全文共21,399件,其中有21,121件契書具詮釋資料(metadata),其契書來源包括已刊印古契書、臺灣總督府檔案、岸裡大社、新竹北門鄭家、北市文獻會、台大南部古契書等資料群。面對如此龐大的契書資料需要一套好的分類方法讓使用者對整體契書資料能快速地瞭解,並能透過分類有效地使用契書資料。 本研究嘗試利用各數位化單位已經建置完成的詮釋資料來對各古契書資料群進行一致的自動分類。在各資料群詮釋資料中僅有描述契書性質的欄位而沒有精確的分類欄位,且描述性質的標準不一致。我們先參考各專家對古契書建議的分類方法決定了一個初始的分類架構,接著找出各詮釋資料中相當於”契書性質分類”的欄位、搭配每篇古契書的標題,將一篇篇古契書自動對應到上述分類架構中的某一分類。最後為特定分類重新賦予契書關係人物一致的角色。 將前述的自動分類方法與特定分類下角色賦予應用在THDL中21,121件具詮釋資料的契書上,可以將20,698件成功分類,而有423件契書需要經由人工處理分類。同時也發現到在原有14個分類外還可以新增租穀與契尾兩個類別。至於角色賦予由於成果不彰,需重新找尋適合的解決方法,比如說以詮釋資料搭配契書全文的方式。 Before the modernization of land administration by the Japanese during their occupation of Taiwan (between 1895 and 1945), hand-written land deeds are the only proof of the transaction or leasing of land. Land deeds are thus an important source of primary documents for studying Taiwanese society before 1895. Collaborating with the National Taiwan University Library, the Digital Archives Laboratory of the Department of Computer Science of NTU built a full-text digital library of primary historical documents, the Taiwan History Digital Library (THDL), which includes, among other things, 21,399 land deeds in searchable full-text. We believe that it is the largest data base of its kind in existence. In order to provide a better understanding of the contents and make them easier to use, we attempt, in this thesis, to categorize the collection. The difficulty arises from the fact that the land deeds in THDL came from different sources. Although most of them (21,121) also contain metadata, they were produced by different people using different standards. Thus, one cannot classify them easily using the descriptions provided in the metadata. 
We first studied existing classification schemes and chose the one that seemed most suitable for our purpose, which classifies land deeds into 14 categories. (To simplify the task, we considered only deeds with metadata.) We then designed an algorithm that takes each collection and reclassifies its contents according to the 14 categories. Our method successfully classified 20,698 of the land deeds; the remaining 423 required examination by experts. We also discovered that two more categories, zugu (租穀), rental charges in rice, and qiwei (契尾), official certification for a land transaction, could be added to better capture the nature of the land deeds.

15. 日治法院檔案系統及其後分類呈現 蕭屹灵 /

『日治法院檔案』為台灣高等法院與王泰升教授於西元2000年所發現整理並數位化之台灣日治時期司法相關文件的總集,自台北、新竹、台中、嘉義四個地方法院以及司訓所藏台中法院檔案,共5640冊。其內容包含民事與刑事案件判決檔案、公正證書、各類行政文書、工商名冊登記簿等法律相關文件,尤其以記錄詳細司法案件審理內容之民事、刑事判決原本與公正證書原本三類檔案所佔數量最多,共3049冊,因此我們除了為每冊檔案建制詮釋資料外,亦針對民事、刑事判決原本與公正證書原本三類檔案,為其書冊內各案件建制詮釋資料。 法院檔案記錄深刻影響台灣人生活的法院各種作為,記載了一段台灣人民與日本人民的共同歷史經驗,是了解台灣在日治五十年間,經濟、社會、文化等等各方面朝向 近代化演進的第一手史料;而其描述台灣殖民法制之運作,在日本近代法史的研究中亦具有非常重要的地位。因此對於不同領域的研究人員都可從『日治法院檔案中』 找到所關心的材料。 為了使得典藏檢索系統能發揮檔案的價值,我們使用『屬性標籤』的資料結構整合詮釋資料,並引入了多維度的後分類導覽方式;更在此基礎下,進一步為系統加入檢 索詞組控制介面,發展出擁有檢索詞組控制與後分類架構的整合式檢索介面系統之模型,以解決一般檢索系統之缺點,達成連貫檢索流程的目的。 利用此模型,依據檢索前後兩階段之目的,1.檢索詞組建議與組合、2.檢索詞組調整與篩選,實做『日治法院檔案』之二階段檢索詞組控制檢索系統,希望藉由系統的協助下,使用者能夠迅速掌握檢索資源,並擁有可彈性調整檢索策略的檢索操作模式,以達成完善的檢索目標。 In 2000, the Taiwan High Court and Professor Tay-Sheng Wang of the Department of Law of the National Taiwan University re-discovered the archive of judicial court records of the judicial courts of Taipei, Taichung, Jiayi and Hsinchu during the Japanese colonial occupation of Taiwan from 1895 to 1945. Digitization of the findings was carried out under the direction of Professor Wang with help from the National Taiwan University Library. After several years of work, the digitization effort is now near its completion. This thesis describes the design and implementation of the TCCRA, the Taiwan Colonial Court Record Archives. The court cases, 5640 volumes in total, gave a vivid account of the economic, cultural, and societal evolution of Taiwan during the Japanese colonization. It is not only invaluable for anyone interested in the development of Taiwan, but also for researchers of Japanese colonial laws. The digitization effort includes producing all the images (through digital camera) and metadata of each court case. Because of the sheer volume of the data, it is important to design a system that allows the users not only find what they want but also helps them discover the meaning and conduct further exploration. 
We start by utilizing an "attributive tag" data format to integrate the metadata and to provide the backbone for faceted browsing of query results. A sophisticated yet easy-to-use query interface is then designed to guide the user in refining queries and classifying query results in different ways. Our faceted retrieval system has two main additional features: query term suggestion and combination, and query adjustment and document selection. These features enable the user to analyze query results as a collection and refine them easily.

16. 臺灣舊照片資料庫重複照片比對研究 朱國延 /

臺灣舊照片資料庫(URL係臺大圖書館所收藏之豐富的日治時期出版品,其中包含大量臺灣相關書籍及期刊資料,將其中的照片影像做數位化為數位照片成主要內容的照片資料庫。資料庫總計照片與詮釋資料( metadata )共三萬八千餘筆,並提供完善的詮釋資料檢索機制作線上瀏覽,更能就學術合理使用範圍內下載詮釋資料與數位圖像。 但是照片的內容與詮釋資料會因為不同書籍的編輯描述造成不一致性,使得出現了重複照片但不容易以文字檢索能順利找到相同內容的重複照片同時也造成用相同的文字檢索會出現重複照片的冗餘情況。 所以,本研究目的是著眼在除了利用文字描述與詮釋資料的檢索外,還有利用影像內容的檢索(content based image retrieval,CBIR)的方法來應用,利用影像內容的檢索的方式擬定半自動化系統的方法流程為照片內容做相似度比對,蒐集高相似度的相似照片對,再以人工檢視的方式將重複照片對找出來。 最後我們在臺灣舊照片資料庫系統的資料庫中的38,653張照片做為相似照片的比對的實作,我們以預估Recall在有達到90%以上的程度去檢視確認相似的目標照片對共308,286對,然後共找到了3,270對確定為重複照片對,構成2,621組的重複照片組,以便給予系統維護的單位資料庫中的重複照片組集合,對系統內重複照片冗餘問題做進一步的處理。 In 2003, the National Taiwan University Library produced a digital collection of old photographs of Taiwan. They cover the period from 1895 to 1945, when Taiwan was occupied by Japan. The photos, 38,653 in total, were selected from over 2,000 books published by the Japanese Colonial Government during that time, and cover a wide range of subjects. They were made into a digital library, with images and metadata records, and is the most extensive database of its kind in existence. We observed that there are duplications of photos in the database. They were either because certain photos were included in different books, or because some books were scanned twice. The purpose of the research reported in this thesis is to find duplication of images in the database. We adopted methods in content-based image retrieval and developed a system to identify pairs that might have come from the same photo. The pairs were then checked manually to see if they are indeed duplicates. Among the photographs in the database, our system identified 308,286 pairs, of which 3,270 were duplicated photo pairs. Since some photos appeared more than twice (9 being the most), there are 2,621 photo groups altogether. We estimate that the recall rate is over 90%.

17. 臺灣古地契關係自動重建之研究 黃于鳴 /

古地契是研究臺灣歷史上土地開發及社會經濟活動的第一手資料,而同筆土地在不同時間關於土地權利的移轉、典賣、鬮分等行為使地契之間產生如上下手契、鬮分契多份的關係,這些地契之間的關係是利用古地契研究土地發展的重要依據。目前「台灣歷史數位圖書館」(THDL)共蒐集了30285件清代及日治時期的古契書,這些契書由不同的單位所數位化,並分布在34個大小不一的文件集,若想單靠人力來重建地契之間的關係是相當困難的。 因此本研究提出一個自動化的方法來幫助重建古地契之間上下手契、原契與契尾、鬮分契多份、契書內容相同四種關係。首先從契書的詮釋資料與全文中擷取契書特徵,利用已有的契書分類與關係人角色對應方法並加以修正。接著整理出每種契書關係須滿足的特徵條件,再根據所整理出的特徵條件,配合契書特性使用特徵模糊比對,兩兩比對THDL裡所有契書,並經人工檢查,最後找出上下手契2409對、原契與契尾92對、鬮分契多份878組、契書內容相同531組,其中包含許多跨文件集人力不易發現的契書關係。另外,我們也利用「神岡 : 筱雲呂玉慶堂典藏古文書集」所包含已經過人工整理較完整的契書關係來檢視重建方法的回收率。 將這些重建的契書關係都連結起來,可以幫助我們觀察土地發展的脈絡,有助於研究臺灣歷史上從土地關係所衍伸出的經濟社會等相關議題,而這些契書關係也都已加入THDL的檢索系統可供歷史研究者使用。 During the dominant time of the Ching Dynasty and Japan (1683-1945), the development and operation of lands was the main social and economic activity in Taiwan. Consequently, there are a large number of land deeds leaved, which are contracted by local resident in private. These land deeds in that time was the only proof of land ownership and today they become vital material to study the development history of Taiwan. The acquisition, transfer and division of lands over time have brought about the relationships among the land deeds. Using these relationships, we can better make use of the land deeds which are in big quantity. However, these land deeds scattered are collected and digitalized by different organizations into many corpuses. It’s very hard to reconstruct the relationships only depending on the manpower. So in this thesis, we propose an automatic method to reconstruct the relationships. We first extract features such as related person, contracted time, price, …etc, from metadata and full-text of land deeds and unify the category and person role of deeds. Second, we define conditions each relationship should meet based on the features and define fuzzy comparison methods of features. Finally, using the feature conditions, we design an algorithm to efficiently compare each pair of land deeds to find the relationships. 
As a result, among the 30,285 land deeds we found 2,409 pairs of “original deed and deed from the previous owner”, 92 pairs of “original deed and its government receipt of tax payment”, 878 sets of “allotment agreements”, and 531 sets of “same deeds”, including many relationships that span collections and would be hard to find by hand. The reconstructed relationships are accessible in THDL (Taiwan History Digital Library) to assist historians in research on Taiwanese land deeds.
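To make the idea of feature conditions with fuzzy comparison concrete, here is a minimal Python sketch for one relationship only. The conditions (later date, earlier buyer resembling the later seller, similar location), the field names, the thresholds, and the sample deeds are all assumptions made for illustration; the thesis's actual conditions and comparison methods are not given in the abstract.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """Fuzzy string match that tolerates variant characters and small differences."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def may_be_successive(earlier, later, name_t=0.6, place_t=0.75):
    """Assumed conditions for 'deed and deed from the previous owner'."""
    return (
        earlier["year"] < later["year"]
        and similar(earlier["buyer"], later["seller"], name_t)
        and similar(earlier["place"], later["place"], place_t)
    )


# Placeholder persons and places, not taken from any real deed.
deed_a = {"year": 1820, "seller": "張三", "buyer": "李四", "place": "某堡某某庄"}
deed_b = {"year": 1845, "seller": "李四", "buyer": "王五", "place": "某堡某某莊"}
deed_c = {"year": 1850, "seller": "張三", "buyer": "王五", "place": "某堡某某庄"}

candidate = may_be_successive(deed_a, deed_b)      # buyer of A sells in B
not_candidate = may_be_successive(deed_a, deed_c)  # seller does not match
```

Pairs that pass such a filter would still need the manual check described in the abstract.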

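18. Development and Construction of a Digital Archive System for Court Records of the Japanese Colonial Period (日治法院檔案數位典藏系統之研發與建置) 項潔 / 蕭屹灵 / 董家兒 /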

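19. Identifying Variant Names of the Same Place Arising from Similar Chinese Characters: The Old Land Deeds of the Taiwan History Digital Library as an Example (辨識中文字相似特性產生的同地異名-以台灣歷史數位圖書館古契書為例) 林韋翰 /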

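Old land deeds are valuable material for studying the history of land development and socio-economic activity in Taiwan. Because deeds were written at different times and by different hands, the same piece of land often appears under differently written names, which hampers retrieval and causes many inconveniences in related research. The Taiwan History Digital Library (THDL) holds a large number of deeds from the Qing and Japanese periods; the lands they record are spread over the whole of Taiwan, and after so much change over time it is impractical to verify each deed through fieldwork. This study therefore uses the phonetic and graphic similarity of Chinese characters to find place names that are written differently but refer to the same place. We first extract the relative positions at which place names occur in the full text of the deeds and use them to organize the names into a hierarchy. We then design fuzzy comparisons based on similarity of pronunciation and of character shape, and use the hierarchy to compare, pairwise, all place names that share the same parent. After manual checking, we found 844 pairs of differently written names that refer to the same place. The differences arise from commonly interchangeable characters, from different transliterations, from copying errors in the deeds, and from various factors introduced during digitization. The results help researchers collect the deeds related to a given area in THDL more efficiently, leaving them more time for their research. The method is also useful for other work that involves matching Chinese text; for example, the place-name features needed to find a deed and the deed from the previous owner (上下手契) could be compared with the same similar-name matching.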
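The pairwise comparison of names under the same parent can be sketched as follows. This is only an illustration: the small table of interchangeable characters stands in for full phonetic and graphic similarity data, the rule (equal length, every position equal or interchangeable) is an assumed simplification, and the sample names are for demonstration.

```python
from itertools import combinations

# Illustrative classes of characters treated as interchangeable (assumed, incomplete).
SIMILAR_CLASSES = [{"裡", "裏"}, {"庄", "莊"}]


def chars_match(a, b):
    """Two characters match if identical or listed in the same similarity class."""
    return a == b or any(a in cls and b in cls for cls in SIMILAR_CLASSES)


def is_variant(x, y):
    """Names are variant candidates if they differ only in interchangeable characters."""
    return x != y and len(x) == len(y) and all(chars_match(a, b) for a, b in zip(x, y))


def variant_pairs(names_by_parent):
    """Compare only names that share the same parent in the place-name hierarchy."""
    found = []
    for parent, names in names_by_parent.items():
        for x, y in combinations(sorted(set(names)), 2):
            if is_variant(x, y):
                found.append((parent, x, y))
    return found


sample = {"大肚堡": ["水裡港庄", "水裏港莊", "塗葛堀庄"]}
pairs_found = variant_pairs(sample)
```

Restricting comparisons to siblings keeps the number of pairs small and avoids matching unrelated places that merely look alike.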

20. Automatic Classification of Qing Administrative Documents of Taiwan into Historical Events (清代臺灣行政檔案文件自動分類至歷史事件) 陳嘉翔 /

Taiwan History Digital Library (THDL) is a full-text database built for researchers of Taiwanese history. Plotting the number of documents in THDL per year (horizontal axis: year A.D.; vertical axis: number of documents) shows that the peaks of the curve correspond closely to the times of major historical events. To explore which events lie behind each peak, the documents of the Qing administrative archive of Taiwan must first be classified automatically into historical events. After organizing and classifying the events recorded in two reference works, 「台灣小事典」 and 「臺灣歷史辭典」, and removing those not needed, forty-one Taiwan-related historical events of the Qing dynasty (from A.D. 1684 to A.D. 1895) were chosen as the experimental material. To classify the documents into the events, the “initial keywords” that represent each event were first selected manually. Second, the confidence of the association rule, confidence (t→q), was used as a threshold to decide which “feature keywords” should be selected from the candidate terms gathered from several name authority databases.
A document is classified into an event if it contains the event's “initial keywords” or the “selected feature keywords whose t→q is over the threshold” and was written in the years in which the event took place. In this way 11,826 documents (32% of the archive) were classified into historical events. The remaining 68% of the archive are routine administrative documents, for example appointments and dismissals of officials, price reports for crops, and tariff reports. To evaluate the performance of the automatic classification method, we selected the documents written in the years of three events: (1) the Tai Chao-chuen incident, (2) the Taiwan Expedition of 1874 (a.k.a. the Mudan incident), and (3) the First Sino-Japanese War. Each document was read manually and judged as belonging to the event or not, to serve as ground truth. The results of the automatic classification were then compared with this ground truth to calculate recall and precision. When the parameter t→q equals 0.2, the recall for the “Tai Chao-chuen incident”, “Taiwan Expedition of 1874” and “First Sino-Japanese War” is 0.7241, 0.9941 and 0.8928 respectively, and the precision is 0.6117, 0.6175 and 0.6735 respectively. As historians prefer to retrieve all the related documents first and then read them one by one (recall-oriented), an automatic classification method with an average recall over 80% and an average precision over 60% is acceptable.
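The quantities named in this abstract follow standard definitions, shown in the minimal Python sketch below. The confidence of a rule t→q is taken in its usual sense, the share of documents containing t that also contain q; how the thesis defines t and q in detail is not stated here, so the toy documents and function names are assumptions.

```python
def confidence(docs, t, q):
    """confidence(t -> q) = |docs containing t and q| / |docs containing t|."""
    with_t = [d for d in docs if t in d]
    if not with_t:
        return 0.0
    return sum(1 for d in with_t if q in d) / len(with_t)


def recall_precision(retrieved, relevant):
    """Recall and precision of a retrieved set against a ground-truth set."""
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)


# Toy documents represented as sets of terms.
toy_docs = [{"t", "q"}, {"t", "q"}, {"t"}, {"q"}]
conf_tq = confidence(toy_docs, "t", "q")  # 2 of the 3 documents with t also have q

# Averages of the figures reported in the abstract.
recalls = [0.7241, 0.9941, 0.8928]
precisions = [0.6117, 0.6175, 0.6735]
mean_recall = sum(recalls) / len(recalls)
mean_precision = sum(precisions) / len(precisions)
```

The two means come to roughly 0.87 and 0.63, consistent with the abstract's statement that average recall exceeds 80% and average precision exceeds 60%.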

21. Social Networks in Classical Chinese Vernacular Fiction: The Scholars as an Example (中國古典白話小說中的社會網路關係:以《儒林外史》為例) 廖儁凡 /

The Scholars (《儒林外史》) is a realistic vernacular chaptered novel with a large cast and a distinctive narrative structure, and it pioneered Chinese satirical literature. There are hundreds of characters, but none of them leads the story. Together with Dream of the Red Chamber (《紅樓夢》) it is among the best-known novels of the Qing dynasty, yet it has received far less attention than the latter, out of proportion to its importance, and research on it from the perspective of information science is lacking. Social network analysis, although widely applied in many fields, has rarely been applied to textual analysis. We therefore use information technology to construct and analyze the social network of The Scholars and to answer questions about that network. We found that character dialogue makes up a large proportion of the full text and is the most important source of relations between characters. We therefore designed a semi-automatic method that extracts data from the full text using an algorithm called “word clipper” (詞夾子), constructed the social network from the conversations between characters, and carried out a preliminary textual analysis based on the resulting quantitative data. Besides identifying the community to which each character belongs, we discovered the importance of a little-studied character, JIN Dong-ya (金東崖), and, from the amount each character speaks in each chapter, confirmed that the novel has a clear structure in which its leading characters succeed one another.
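Building a network from dialogue can be illustrated with a small Python sketch. It assumes the extraction step has already produced (speaker, listener) pairs; the pair format, the function names, and the placeholder characters A, B and C are hypothetical and are not the thesis's data or its “word clipper” algorithm.

```python
from collections import Counter, defaultdict


def build_network(dialogues):
    """Count conversations per unordered pair of characters (weighted edges)."""
    edges = Counter()
    for speaker, listener in dialogues:
        if speaker != listener:
            edges[frozenset((speaker, listener))] += 1
    return edges


def degree(edges):
    """Number of distinct conversation partners of each character."""
    neighbours = defaultdict(set)
    for pair in edges:
        a, b = tuple(pair)
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {name: len(partners) for name, partners in neighbours.items()}


# Placeholder characters: A talks with B twice and with C once.
toy_dialogues = [("A", "B"), ("B", "A"), ("A", "C")]
toy_edges = build_network(toy_dialogues)
toy_degree = degree(toy_edges)
```

Measures such as degree give a first indication of which characters connect otherwise separate groups.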

22. Design and Construction of a Retrieval System for the Chinese Recorder Index (Chinese Recorder Index檢索系統的設計與建置) 孔容偉 /

The Chinese Recorder (《教務雜誌》, abbreviated CR) is a journal published between 1867 and 1941. Written mainly in English, it was created by Western missionaries in China as an important platform for communicating and exchanging views. CR comprises 73 volumes, including its precursor, the Missionary Recorder, and about 50,600 pages in total. In addition to sharing experience of missionary work, CR contains descriptions and discussions of Chinese civilization and current events as seen by the missionaries. Because the period it covers overlaps with one of the most critical periods in Chinese history, CR is an important primary source not only for the study of the spread of Christianity in China but also for the study of modern Chinese history. The Chinese recorder index: a guide to Christian missions in Asia, 1867-1941 (CRI) is a two-volume index to CR. Compiled by Professor Kathleen L. Lodwick and published in 1985, CRI has three different indices and six tables of correspondences. It has 8,391 entries in the person index, 712 entries in the mission index, and 4,691 entries in the subject index. In addition to the usual page numbers that indicate where a keyword appears, CRI attaches tags to the page numbers that carry information about the original text, so that users can grasp its content more easily.
The comprehensiveness of CRI makes it an indispensable companion for scholars using CR. Its paper form, however, fixes the way the information is presented and makes it difficult to fully exploit the wealth of information it contains. In this thesis we analyze the structure of the indices and design a data structure that is better suited to information retrieval. This data structure lets us combine information from different entries and reveal findings that are difficult to observe in the printed CRI, while still preserving CRI's original presentation. The retrieval methods built on it are integrated into a retrieval model and implemented as a complete retrieval system.
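One way to picture “combining information in different entries” is to re-index the printed entries by the page they point to. The Python sketch below does this for hypothetical records of the form (index name, heading, volume, page); the record layout and the sample headings are assumptions, not the data structure designed in the thesis.

```python
from collections import defaultdict


def build_page_index(entries):
    """Group index entries by the (volume, page) they refer to."""
    by_page = defaultdict(set)
    for index_name, heading, volume, page in entries:
        by_page[(volume, page)].add((index_name, heading))
    return by_page


def co_occurring(by_page, heading):
    """Entries from any index that share a page with the given heading."""
    result = set()
    for items in by_page.values():
        if any(h == heading for _, h in items):
            result |= {item for item in items if item[1] != heading}
    return result


# Hypothetical entries; the headings do not come from the real CRI.
toy_entries = [
    ("person", "Person P", 12, 345),
    ("subject", "Subject S", 12, 345),
    ("subject", "Subject T", 13, 10),
]
toy_pages = build_page_index(toy_entries)
linked = co_occurring(toy_pages, "Person P")
```

A query that starts from a person can then surface the subjects and missions discussed on the same pages, which a reader of the printed index would have to cross-check by hand.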

23. A Preliminary Study of Textual Analysis and Date Tagging of Qingshilu (《清實錄》之文本分析與時間標記初探) 陳品諺 /

Qingshilu (清實錄), the Veritable Records of the Qing Dynasty, is an annalistic history that documents, day by day, the activities and deeds of the emperors over nearly three hundred years of Qing rule. It also records the achievements of certain ministers, laws and decrees, the civil service and examinations, population, relations with vassal states and foreign countries, culture and books, and military campaigns. Because of this richness and comprehensiveness, and because it is a complete, voluminous official record written under strict rules, Qingshilu is an indispensable source for research on the Qing dynasty and well suited to digital humanities research. There are two parts in this thesis. First, we introduce a process for reorganizing the digital text of Qingshilu into a format more suitable for research use: the text is split into entries according to its punctuation structure, and the date of each entry is extracted, corrected, and converted to the Western calendar to support time-based statistics. In total the text was divided into 317,630 entries covering 89,372 distinct dates, and we discovered 12 errors in the original date (ganzhi, 干支) records of the printed Qingshilu. In the second part, we load the processed text into the THDL (Taiwan History Digital Library) shell and build QSDL, the QingShilu Digital Library. THDL is a database system oriented towards humanities researchers; we describe its existing functions and the new functions we added. To illustrate the use of QSDL, Chapter 5 presents some observations about Qingshilu as line charts of frequency by year. As an example, we analyze all entries about baili dispatches (百里文書). From the reign of the Yongzheng emperor onward, imperial edicts and memorials could be relayed at a required speed of several hundred li per day, somewhat like today's express mail; we call such documents baili dispatches. Depending on the level of urgency, a message might be required to travel from 100 to 800 li each day.
(One li is roughly half a kilometer.) We found 15,520 imperial edicts that were sent as baili dispatches during the entire Qing reign; according to Qingshilu, the most urgent, 800-li dispatches were used only three times.
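The ganzhi (干支) day names mentioned above form a 60-term cycle of ten stems and twelve branches, which makes a simple consistency check possible. The Python sketch below converts a ganzhi name to its position in the cycle and flags entries whose day name seems to jump backwards. The check itself (a forward gap of more than 30 days between consecutive entries is suspicious) is an assumed heuristic for illustration, not the procedure used in the thesis.

```python
STEMS = "甲乙丙丁戊己庚辛壬癸"
BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


def ganzhi_index(name):
    """Position of a ganzhi pair in the 60-term cycle (甲子 = 0, 癸亥 = 59)."""
    s, b = STEMS.index(name[0]), BRANCHES.index(name[1])
    if (s - b) % 2:
        raise ValueError(f"{name} is not a valid ganzhi pair")
    # Solves i = s (mod 10) and i = b (mod 12).
    return (6 * s - 5 * b) % 60


def suspicious_entries(day_names, max_gap=30):
    """Indices of entries whose forward distance from the previous entry is too large."""
    idx = [ganzhi_index(n) for n in day_names]
    return [i for i in range(1, len(idx)) if (idx[i] - idx[i - 1]) % 60 > max_gap]


# The third entry repeats an earlier day name, which looks like a step backwards.
flags = suspicious_entries(["甲子", "乙丑", "甲子", "丙寅"])
```

Entries flagged this way would still need to be compared with the printed text before being counted as errors.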

24. Information Technology and the Analysis of Historical Documents (資訊技術與歷史文獻分析) 陳詩沛 /

This thesis examines how, in the era after digital archiving, when an unprecedented quantity of historical material has been digitized, information technology can take part in the research process and help historians make effective use of historical documents at scale; it proposes two IT methods to that end. The availability of a large quantity of historical documents that can be searched and retrieved has become a challenge for historians, since the traditional way of carefully going through a small number of documents is no longer sufficient. We first give an overview of THDL, the Taiwan History Digital Library, a full-text digital library of primary historical documents about Taiwan. The two corpora used, the Ming-Qing administrative archives of Taiwan (『明清臺灣行政檔案』) and the old land deeds (『古契約文書』), currently number 73,287 documents and over 54,000,000 characters and are the major experimental material of this thesis; we describe their content, provenance, and importance to the study of Taiwanese history. We then describe the retrieval tools that THDL developed for them and the idea of regarding query returns as a sub-collection, that is, as a meaningful whole, and show how tools such as post-query classification and term frequency analysis help historians discover connections hidden among the retrieved documents. Next we introduce the feature analysis method, which treats a large collection of historical documents as an environment for observing features (the persons, places, or topics a historian wishes to examine), studied collectively as opposed to treating them as individual documents.
Feature analysis takes as its input a sub-collection, meaning a set of documents related to the research topic the user is currently interested in, and analyzes the features shared by these documents. By calculating the support for each feature (the number of documents that are evidence of the feature's occurrence), the method discovers features that are highly related to the sub-collection. We have developed a mathematical model for this method and applied it to two of the corpora in THDL, obtaining unexpected and interesting observations. We then present several relation discovery methods that find relationships among historical documents in a large collection. We give three examples of relation discovery, carried out on the Imperial Court documents and Taiwanese land deeds: citation relations, land transaction relations, and the template relation. Through our methods we discovered 6,802 citation relations among the 37,836 Imperial Court documents selected from 280 sources, 3,910 transaction relations among the 35,451 land deeds from 117 sources, and 105 templates, each shared by a group of documents created following a specific format; the content-similarity analysis found 3,973 and 3,570 groups of similar documents in the two corpora respectively. We argue that relation discovery not only helps historians consider more angles while reading the documents but can also lead to new findings. The citation relations have been transformed into 1,101 successive citation graphs, each of which reveals how a historical event evolved through the correspondence between a Qing emperor and his officials. The transaction relations have likewise been transformed into 2,219 land transitivity graphs, some of which indicate land development activities that have never been studied before.
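The notion of support can be illustrated with a short Python sketch that counts, for each feature, the documents in which it occurs, and then compares a sub-collection with the whole collection. The ranking score used here (the share of the sub-collection divided by the share of the whole collection) is an assumed stand-in; the mathematical model developed in the thesis is not given in the abstract, and the toy features are hypothetical.

```python
from collections import Counter


def support(docs):
    """Number of documents in which each feature occurs."""
    counts = Counter()
    for features in docs:
        counts.update(set(features))
    return counts


def characteristic_features(collection, sub_ids, min_support=2):
    """Rank features of a sub-collection by how over-represented they are."""
    whole = support(collection.values())
    sub = support(collection[i] for i in sub_ids)
    n, m = len(collection), len(sub_ids)
    scored = [
        (f, s, (s / m) / (whole[f] / n))
        for f, s in sub.items()
        if s >= min_support
    ]
    return sorted(scored, key=lambda item: (-item[2], -item[1], item[0]))


toy_collection = {
    "d1": {"person:A", "place:X"},
    "d2": {"person:A", "place:X"},
    "d3": {"person:B", "place:X"},
    "d4": {"person:B", "place:Y"},
}
ranking = characteristic_features(toy_collection, ["d1", "d2"])
```

In the toy data, person:A ranks above place:X because it occurs only inside the sub-collection, whereas place:X is common everywhere.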

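25. A User-Oriented Historical Geographic Information System: Presenting Old Land Deeds and Statistical Data (使用者取向之歷史地理資訊系統–古契書與統計資料呈現) 歐仲翔 /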

By integrating digitized historical data with historical maps, a Geographic Information System (GIS) can present the spatial distribution of cultural, social, and economic sources and help historians observe and track historical phenomena that are not easily found from the data alone. However, GIS has a technical threshold, and for scholars without much knowledge of geography, mastering it can be a daunting task. There is therefore a barrier between humanities research and GIS. Our goal is to build a user-friendly and intuitive GIS tool with which scholars can upload their own data and make observations as their research requires. The work is based on the 1904 Taiwan Bao Tu (臺灣堡圖), whose many administrative areas we vectorized, and combines this map with sources from the Japanese period through Web GIS. In this thesis we present two WebGIS systems. The first is built on a geographically indexed database of land deeds. Land deeds are important primary documents used by historians in the study of land and social development in Qing-era Taiwan; the deeds used here are the private contracts transcribed and preserved when the Government-General of Taiwan began its land survey in 1898. Since they were recorded at about the time the Bao Tu was made, the place names they use can be matched to those on the map. We start by identifying the location (latitude and longitude) of the deeds with information technology, locating 12,502 of the 15,899 transcribed deeds, and build a retrieval system on the result. A user can find land deeds either by typing in words or by drawing an area, and the deeds retrieved are presented on the administrative areas of the map. Several layers of maps of different nature and time periods, including a present-day map (Google Map), can be overlaid. The second system uses the Bao Tu administrative area map as a base map and provides a convenient tool for researchers to generate thematic maps.
When a user uploads statistical data in Excel format, the system automatically generates a map in which the regions are colored according to density. Maps can be drawn at the different administrative levels of the Taiwan Bao Tu (臺灣堡圖), and the tool is written in HTML5 so that displaying and zooming the map is fast. The system is a very simple visual aid for observing geographical data.
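Coloring regions “according to density” requires mapping each region's value to a color class. The Python sketch below uses equal-interval classes; the choice of equal intervals, the number of classes, and the region names are assumptions made for illustration, since the abstract does not say how the classes are formed.

```python
def classify(values, n_classes=4):
    """Assign each region a class index from 0 (lowest) to n_classes - 1 (highest)."""
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / n_classes or 1  # avoid division by zero when all values are equal
    return {
        region: min(int((v - lo) / width), n_classes - 1)
        for region, v in values.items()
    }


# Hypothetical regions with a value per region, as might be read from an uploaded table.
toy_values = {"region-A": 0, "region-B": 50, "region-C": 100}
toy_classes = classify(toy_values)
```

Each class index would then be mapped to one shade of a color ramp when the map is drawn.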

26. A Bird's-Eye View of Taiwanese Gazetteers: A First Look at Local Produce and Officials (鳥瞰臺灣方志:以物產、職官為初探對象) 李鈺淳 /

A local gazetteer (方志) is a book that describes the government, economy, commodities, environment, people, and other aspects of a place. It is perhaps the most important form of reference book for a region in Chinese history. The contents of gazetteers are varied and richly categorized, and each one reflects the thinking and character of its editor. In this thesis we analyze the officials (職官) and local produce (物產) recorded in 23 Taiwanese gazetteers dating from 1685 to 1898. For the officials, we studied the patterns of the records and automatically extracted information such as the person's name, official title, dates of assuming and leaving the office, the position held before taking this office and the one after (if any), place of origin, and scholarly title held. For the produce, we analyzed the categories given in each book, the items in each category, and the relationships among the different books in their categories and contents. Besides preserving the original text, we store the commonly occurring information as metadata so that retrieval is more flexible and presentation more varied. In addition to presenting the original contents of the gazetteers, our work provides a different way to compare and analyze them, yielding results that were previously hard or time-consuming to obtain.

27. An Open Thematic Map System (開放式主題圖系統) 張嘉文 /

A geographic information system (GIS) can help historians observe and track historical phenomena that are not easily found from historical data alone. Until recently, however, the integration of history and GIS was limited by technology, equipment, and manpower to a few well-equipped institutes and highly skilled scholars. Web GIS lets users obtain maps and other geographic information through a web browser, but most researchers still find it hard to combine their historical data with GIS and present the spatial distribution of cultural, social, or economic themes. Our goal is therefore a user-friendly, open GIS tool with which scholars can generate their own thematic maps: users enter their data through an Excel file or the web user interface, without help from technical staff, and the system generates a map in which regions are colored according to density. The system is built on three base maps with seven administrative levels in total: 廳, 堡里, and 街庄 of the 1904 Taiwan Bao Tu (臺灣堡圖); counties/cities and townships of Taiwan's administrative divisions of 1982-2010; and provinces and prefectures of China's administrative divisions of 1820. Because base-map data are hard to obtain, the system lets researchers merge the administrative areas of a base map, within an acceptable margin of error, into the divisions they need. Statistics by administrative area play an important part in research and in policy making, but they are easily overlooked because raw numbers are hard to interpret intuitively: a rainfall table that lists place names against figures is far less direct than the same data shown on a map. Likewise, observing data on a map, combined with some geographic analysis, may reveal patterns that could not be seen before, and presenting those patterns as maps conveys the researcher's point more clearly. Written in HTML5 with a web interface, the system is a very simple visual aid for observing socio-economic data and their spatial distribution.

28. Automating the Extraction of Personal Names from Qingshilu (《清實錄》人名擷取自動化) 劉士綱 /

Among all the contents of historical sources, persons are always highly representative and of great research value, so tagging personal names correctly in the electronic text of historical material is especially important. Qingshilu (《清實錄》, the Veritable Records of the Qing Dynasty) runs from Taizu to Dezong in twelve parts and 4,484 juan (卷) in total, far too much for the names to be marked one by one by hand. Moreover, names in the Qing era are less regular than modern Chinese names. Almost all modern Chinese names can be extracted with the help of the Hundred Family Surnames (百家姓), whereas Qingshilu records, besides Han Chinese names, Manchu names, names of foreign missionaries, and names composed of numerals, which no single rule can handle. The purpose of this thesis is therefore to tag the personal names in Qingshilu in the metadata automatically, by program. We describe how to use pointwise mutual information (PMI) to segment the text of Qingshilu into terms and, together with rules, to find candidate names. Once the terms have been segmented, the next stage is to pick out those that are likely to be personal names from the large pool of two-character terms (bigrams).
To do so, we need to validate the names. The thesis then describes the whole automated procedure: segmentation is used first to raise recall, and name validation is then used to raise the precision of the candidate names. The method is applied to each reign recorded in Qingshilu (《清實錄》), and the resulting candidate names are given in the appendix.
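Pointwise mutual information compares how often two characters occur together with how often they would co-occur if they were independent: PMI(x, y) = log2( P(xy) / (P(x) P(y)) ). The Python sketch below computes it for adjacent character pairs using simple relative frequencies; the estimation choices and the toy string are assumptions, and the thesis's segmentation rules and thresholds are not reproduced here.

```python
import math
from collections import Counter


def bigram_pmi(text):
    """PMI of every adjacent character pair, estimated from relative frequencies."""
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars, n_bigrams = sum(chars.values()), sum(bigrams.values())
    return {
        bg: math.log2(
            (count / n_bigrams)
            / ((chars[bg[0]] / n_chars) * (chars[bg[1]] / n_chars))
        )
        for bg, count in bigrams.items()
    }


# Toy string: "AB" occurs twice among five bigrams; A occurs 3 times, B twice.
toy_pmi = bigram_pmi("ABABAC")
```

Pairs with high PMI tend to belong to the same term, so a segmenter can avoid cutting between them; pairs with low PMI are more plausible boundaries.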

29. 影片字幕檢索系統以臺大文學講座系列影片為例 Retrieval System for Video Subtitles with Videos of Literature Seminar at NTU 傅泓翊 /

When we watch video discs such as DVDs or VCDs, we can only navigate by chapter; we cannot search the content of the video to find the segment we want. We therefore built a retrieval system for video subtitles, using the NTU Literary Lecture Series as the material. The series, published by National Taiwan University Press, records lectures given at NTU by modern writers such as Pai Hsien-yung (白先勇), Wai-lim Yip (葉維廉), Yeh Chia-ying (葉嘉瑩), and Gao Xingjian (高行健), who speak mainly about their experience of literary creation and their ideas on literature and aesthetics. Most discs come with a lecture handbook, which people may read to scan the content of a video quickly; but when they find an interesting paragraph, they cannot easily jump to the corresponding part of the video. Since the videos are all lectures, searching the subtitles amounts to searching the video content. To create the subtitle files, we use esrXP to capture the pictures that contain subtitles, run the OCR function of Microsoft Office Document Imaging on them, and send the recognized text back to esrXP to obtain subtitle text with its timing. We then compute the similarity between each subtitle and the sentences of the handbook using the longest common subsequence, which gives the handbook sentence, and hence the speaker, for each subtitle. The videos are served through a website that uses the HTML5 video tag, so users can watch them without any plug-in in HTML5-capable browsers such as Google Chrome and Mozilla Firefox. While users search subtitles or watch a video, the handbook sentence that corresponds to the current subtitle is displayed below the player, giving them more information.
We also provide multi-faceted post-classification so that users can filter the retrieval results further.
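Matching an OCR-derived subtitle to a sentence of the printed text is a natural use of the longest common subsequence (LCS), because LCS tolerates misrecognized characters. The Python sketch below computes the LCS length with the standard dynamic program and picks the best-scoring sentence. The normalization (dividing by the longer of the two lengths) and the sample sentences are assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_match(subtitle, sentences):
    """Sentence with the highest normalized LCS similarity to the subtitle."""
    def score(sentence):
        return lcs_length(subtitle, sentence) / max(len(subtitle), len(sentence), 1)
    return max(sentences, key=score)


# Made-up sentences; the subtitle contains a simulated OCR error ("kincl").
toy_sentences = ["the novel begins in spring", "poetry is a kind of memory"]
matched = best_match("poetry is a kincl of memory", toy_sentences)
```

In practice a minimum score would also be needed, so that subtitles with no counterpart in the printed text are left unmatched.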

30. A Library That Extends Its Domain to the Web (將場域延伸至網路之圖書館) 何浩洋 /

With an unprecedented quantity of free scholarly material and services available on the Web, how should libraries face Web users and redefine their role, and how, with limited technical resources, can they adapt to a constantly changing Web and offer more online material and services? This thesis proposes a library model to help university libraries reposition themselves and use limited technical resources to serve patrons in a rapidly changing digital world. In chapter 1, I discuss why the growth of the Web has drawn patrons' attention away from libraries. The large quantity of digital material available on the Web and user-friendly Web 2.0 services have changed daily activities, including research: scholars now turn to the Web rather than the library for research resources, so libraries are losing patrons and their role as information gateway is gradually being taken over by the Web. The oligopoly and closed design of integrated library systems, and the lack of in-house technical support, make it hard for libraries to adjust quickly at the system level. The solutions proposed by the library community, reviewed in chapter 2, mainly focus on how libraries can compete with the Web, chiefly by enhancing integrated library systems. This thesis argues instead that the Web is not a competitor: libraries should merge themselves into the Web rather than compete with it. In chapters 3 and 4, I propose the Web Scalable Library model and the Adaptive Library Services Transformation Architecture to help libraries extend their reach from the physical world to the Web. The Web Scalable Library treats the Web as an extension of the physical world and takes the Web into account in the library's policy towards its four basic elements: collections, mediums (the library itself), patrons, and services.
I then explain how the Web Scalable Library can be achieved, based in theory on the concepts of Pull and Push, and in practice on Integration, Dis-Integration and Re-Integration. The Adaptive Library Services Transformation Architecture is a system architecture that avoids modifying the core library system, given the closed design of library systems and the oligopoly in their market. It is designed to adapt library services to different Web platforms with a faster, lightweight development cycle and affordable development costs. Finally, in chapter 5, I present the case of National Taiwan University Library (NTULIB) to show how, with limited technical resources, it was extended into a Web Scalable Library based on the model and architecture proposed in this thesis.

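31. Automatic Analysis of Historical Texts in Different Contexts: Zizhi Tongjian, Cefu Yuangui and the Standard Histories as Examples (不同脈絡中的歷史文本之自動分析 : 以《資治通鑑》、《冊府元龜》及《正史》為例) 彭維謙 /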

In the long course of Chinese history, the same event may be recorded in many different historical works. To establish the facts, a historian may need to spend a lifetime collecting all the relevant records on a topic of interest. This thesis proposes an automated method for quickly establishing relationships between passages in different texts, saving scholars the time spent finding and comparing material. The texts used in this thesis are Zizhi Tongjian (資治通鑑, Comprehensive Mirror to Aid in Government), Tongjian Jishi Benmo (通鑑紀事本末, Narratives from Beginning to End in Comprehensive Mirror for Aid in Government), Cefu Yuangui (冊府元龜, Grand Tortoise in the Imperial Treasury of Books) and Zhengshi (正史, Standard Histories). These four classics are among the most important achievements of Chinese historiography and represent four different genres: the biographical form organized around persons (紀傳體), the annalistic form organized by time (編年體), the topical form organized by events (紀事本末體), and the leishu form with its own knowledge structure (類書體). All four cover the same span, from the beginning of recorded Chinese history to the Wudai period (五代, 907-979), but because they are compiled in different formats they record the same events in different ways. As general histories covering a long period, they are suitable texts for this study. The method proposed in this thesis first extracts the keywords in the texts and then discovers the relationship between two passages through their common keywords. From the resulting degrees of relatedness among the books, we identify where the Mirror and the Grand Tortoise draw on the Standard Histories and where Tongjian Jishi Benmo draws on the Mirror, which offers a glimpse of the editorial methods of the three books.
We have built a retrieval system that presents both textual comparisons and graphical representations. Through this system we hope to give scholars a way to find relations among different historical classics, shortening the process of finding and comparing texts and yielding more complete material.
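Relating passages through common keywords can be sketched with a set-overlap measure. The Python example below uses the Jaccard coefficient and a fixed threshold; both are assumptions chosen for illustration, since the abstract does not name the measure actually used, and the passage identifiers and keywords are placeholders.

```python
def jaccard(a, b):
    """Share of keywords common to both passages among all their keywords."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def related_passages(book_a, book_b, threshold=0.5):
    """Pairs of passage ids whose keyword overlap reaches the threshold."""
    return [
        (i, j)
        for i, kw_a in book_a.items()
        for j, kw_b in book_b.items()
        if jaccard(kw_a, kw_b) >= threshold
    ]


# Placeholder passages, each represented by its extracted keywords.
toy_book_a = {"a1": {"k1", "k2", "k3"}}
toy_book_b = {"b1": {"k1", "k2"}, "b2": {"k9"}}
links = related_passages(toy_book_a, toy_book_b)
```

Comparing every passage with every other is quadratic in the number of passages, so a real system would first narrow the candidates, for example with an inverted index from keywords to passages.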

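32. Automatically Extracting Geographic Information to Combine Electronic Documents with WebGIS: Modern Travelogues as an Example (自動化擷取地理資訊以結合電子文件與WebGIS : 以現代旅遊遊記為例) 陳凱勛 /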

Whether for digitized historical texts or for the electronic documents found everywhere today, applying their spatial data to a Geographic Information System (GIS) presents the relationships and distribution of place names on a digital map and gives general users and humanities researchers another way to look at a text: phenomena that are not easily found from the text alone become visible. However, there is a technical threshold. For users without much knowledge of geography, mastering GIS can be a daunting task, and the results may still not meet their needs. Our goal is therefore a user-friendly, intuitive web platform between texts and GIS, on which users can build, step by step, the geographical maps that correspond to their documents. Taking travelogues as the example, we use a database of POI (point of interest) spatial data and present the results on Google Map. We present two WebGIS systems. The first is a place-name tagging system: it extracts geographic terms from an article using a pre-built gazetteer and lists, for each tagged term, the landmarks that may be the actual place it refers to. The user chooses which of them to show on the map, and the tagged places are finally connected by route planning or by straight lines. The second is a place-name concept map system: taking an old place name or administrative name (from the late Qing to the present) as a keyword, it collects the landmarks in the database whose names contain that keyword and uses an algorithm to draw the area that the set covers, recreating how people conceived of that place in the past. Both systems are simple visual aids for observing geographical data.
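Gazetteer-based tagging can be illustrated with a longest-match scan over the text. In the Python sketch below the gazetteer maps each name to a list of candidate landmark identifiers, from which a user would later choose; the longest-match strategy, the identifiers, and the sample sentence are assumptions for illustration rather than the system's actual design.

```python
def tag_places(text, gazetteer):
    """Scan the text left to right, preferring the longest gazetteer name at each position."""
    names = sorted(gazetteer, key=len, reverse=True)
    i, found = 0, []
    while i < len(text):
        for name in names:
            if text.startswith(name, i):
                found.append((i, name, gazetteer[name]))
                i += len(name)
                break
        else:
            i += 1  # no place name starts here
    return found


# Hypothetical gazetteer: each name maps to candidate landmark ids.
toy_gazetteer = {
    "Taipei": ["lm-1"],
    "Taipei Main Station": ["lm-2"],
    "Tamsui": ["lm-3", "lm-4"],
}
tags = tag_places("From Taipei Main Station we went to Tamsui.", toy_gazetteer)
```

Preferring the longest name keeps "Taipei Main Station" from being tagged as the city alone, while ambiguous names such as the two candidates for "Tamsui" are left for the user to resolve.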

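33. Automatic Analysis and Discussion of Changes in the Knowledge Classification of Leishu: Yiwenleiju and Taipingyulan as Examples (類書知識分類變化之自動分析與討論 : 以《藝文類聚》與《太平御覽》為例) 鍾嘉軒 /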

Leishu (類書) is a distinctive form of ancient Chinese reference book that seems to have no counterpart in Western civilization. Its original purpose was to provide quick reference to ancient texts, so a leishu usually contains two parts: a knowledge structure and a large number of texts quoted from older books. Among the most important leishu are Yiwenleiju (藝文類聚) and Taipingyulan (太平御覽), produced in 624 AD (early Tang Dynasty) and 977 AD (early Sung Dynasty) respectively; each was the largest leishu of its time. Since many of the sources quoted, more than 1,000 in each book, no longer exist, the two books have been widely used for collation and for recovering the texts of lost books, and most research on them has been of that kind. Their value goes further, however. The knowledge structure of a leishu reveals how the people of its time viewed the world, so by observing how the same quoted item is classified in leishu of different eras we can glimpse the evolution of a concept between the two eras. In this thesis we develop a procedure to process the quoted texts in a leishu and a way to compare, automatically, the similar quoted entries in the two books, finding the quotations they share and raising questions that manual work could not discover. Chapter 2 describes how the sources of the quotations are organized, chapter 3 the automatic comparison, and chapters 4 and 5 a search and retrieval system for observing the differences in both the knowledge structure and the texts of Yiwenleiju and Taipingyulan, together with related discussion.

34. 歷史文件自動地名標註 : 以《清實錄》為例 高欣愷 /

Thanks to advances in information technology, many historical documents have been digitized in recent years. By integrating the digital files with geographic information, a Geographic Information System (GIS) can display the context of history on maps, help historians observe phenomena that are not easily found from the texts alone, and give them a different perspective on historical research. However, there is a technical barrier. For users without sufficient knowledge of geography, mastering GIS can be a daunting task, and the results may not meet their needs, so most historians still cannot use GIS software as a research tool. The goal of this thesis, then, is to develop a GIS with an intuitive user interface, through which historians can upload their own data and construct the geographic maps that correspond to it. We use the “Veritable Records of the Qing Dynasty” (《清實錄》) as the historical text and, combining it with maps, develop a WebGIS tool. We use spatial databases and text mining techniques to annotate the place names in the text and to identify the geographic coordinates (landmarks) corresponding to those names. A user can correct the automatically generated annotations or landmarks through the UI. The landmarks are then displayed on digital maps, giving the user a way to observe the geographic information of the documents visually. We hope that our tool can reduce the technical difficulties that historians often encounter when using GIS and encourage them to make better use of geographic information in their research.

35. 《古今圖書集成》自動化內容建構與出處擷取 林易徵 /

Leishu (類書, books that assemble material by category) are a type of reference book developed in ancient China, with a history reaching back to the Three Kingdoms period. A leishu first develops a classification structure for the intended knowledge domain, then extracts segments from existing books and fits them into the proper categories so that they can be retrieved and used conveniently later. Gujin Tushu Jicheng (古今圖書集成, Complete Collection of Graphs and Writings of Ancient and Modern Times), compiled during the Kangxi and Yongzheng reigns of the Qing Dynasty and published in the 18th century, is the largest and most valuable leishu that survives. It contains approximately 170 million characters, taken from over 10,000 ancient classics and books. In this thesis, we develop information technologies to make this vast work easier to browse and search. The thesis has three main parts. In the first, we introduce the background and overall structure of Gujin Tushu Jicheng and design an automated procedure that splits its quoted passages into independent entries. We then build a retrieval system by loading the restructured content into the THDL (Taiwan History Digital Library) shell. In the second part, we identify the sources of the entries automatically and systematically, correcting errors and filling in omissions. In the last part, we give cross-tabulated statistics drawn from the results of the first two parts. We hope this work will serve as groundwork for future researchers of leishu and of Gujin Tushu Jicheng and shorten the time their research takes.

36. 資訊保存與自然語言處理的應用 Information preservation and its applications to natural language processing 陳瑞呈 /

In this dissertation, we motivate a mathematical concept called information preservation in the context of probabilistic modeling. Our approach provides a common ground for relating various optimization principles, such as maximum and minimum entropy methods. In this framework, we make explicit the assumption that model induction is a directed process toward some reference hypothesis. To verify this theory, we conducted extensive empirical studies on unsupervised word segmentation and static index pruning. In unsupervised word segmentation, our approach significantly boosted the segmentation accuracy of an ordinary compression-based method and achieved performance comparable to several state-of-the-art methods in terms of efficiency and effectiveness. For static index pruning, the proposed information-based measure achieved state-of-the-art performance, and did so more efficiently than the other methods. Our approach to model induction has also led to new discoveries, such as a new regularization method for cluster analysis. We expect that this deeper understanding of induction principles may produce new methodologies for probabilistic modeling and eventually lead to breakthroughs in natural language processing.

37. 具資料可擴充性的官職表之型與實作 孫若桓 /

Personnel information about officials is an extremely useful source for research in Chinese history. The conventional approach is to compile such information into a reference book. The linear organization of a book, however, makes it difficult to use the information fully. For instance, if the book is arranged chronologically, it is hard to find all the positions that one official held in his lifetime without going through the whole book. In a previous study, a query system, “Taiwan Officials during Qing Dynasty” (清代台灣文官官職表), was developed for the table of civil officials in 「台灣地理及歷史卷九官師志第一冊文職表」. That system fully integrated the content of the book and provided search by year, office title, and person name. However, it was built by “hardwiring” the content and had no back-end management platform, which left little room for extension. In this thesis we present a design with a flexible data model. In terms of functionality, the front-end search facility of the new system does not go beyond the previous one. The content of the book, however, is stored in a more flexible data model, so that the data can be extended and modified easily and changes to the back-end data are reflected in the front-end presentation automatically. The fields of the data structure are also flexible enough to be linked to one another, which supports a variety of queries. Separating the data model from the front-end presentation also makes it easier to apply our approach to other reference books of a similar nature. To demonstrate the effectiveness of our method, we completely rebuilt the “Taiwan official search engine” mentioned above, with displays that combine the dimensions of time, office title, and person in different ways. We have also built a prototype for the List of Officials of Qing Dynasty (清季職官表), edited by the Institute of Modern History of the Academia Sinica.

38. 呈現實體圖書館建置脈絡之數位瀏覽系統 張有為 /

A traditional (physical) library provides two functions: browsing and searching. While searching allows a patron to pinpoint the book that she wishes to find, browsing provides the pleasure of unexpected discovery. In the last 30 years, the old card catalogues have been completely replaced by WebPAC and search engines, which have provided unprecedented convenience. The browsing function, on the other hand, has largely been neglected. The problem is becoming more severe as more and more books are moved into remote repositories for lack of shelf space. Rarely used books are the least likely to stay on the shelves of the physical library, so patrons cannot stumble upon them by accident, which lowers their use even further. Nor can patrons find them through the library's classification schemes. In this thesis we propose a way to mimic the browsing environment of a library. We start by converting the MARC21 bibliographic records to XML and use the call numbers and the subject classifications provided by the library to develop a digital library browsing system. When a patron finds a book in the system, she can also explore the books with similar call numbers, just as in a physical library. Additional ways of browsing are obtained from the subject classifications and other searchable features in the MARC records.
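Browsing by similar call numbers can be illustrated as a lookup in a shelf-ordered list. This is a minimal sketch, not the system's implementation: the catalog records and the function name `shelf_neighbors` are hypothetical, and plain string comparison stands in for real call-number collation, which usually needs normalization first.

```python
from bisect import bisect_left
from typing import List, Tuple

Record = Tuple[str, str]  # (call number, title)

def shelf_neighbors(catalog: List[Record], call_number: str, width: int = 2) -> List[Record]:
    """Return the record at `call_number` plus up to `width` shelf neighbours on each side.

    Assumption: sorting call numbers as plain strings approximates shelf order.
    """
    shelf = sorted(catalog, key=lambda rec: rec[0])
    keys = [rec[0] for rec in shelf]
    pos = bisect_left(keys, call_number)  # position of the book on the virtual shelf
    start = max(0, pos - width)
    return shelf[start:pos + width + 1]

# Hypothetical records for illustration only.
catalog = [
    ("005.74 C44", "Database Design"),
    ("005.1 A11", "Software Engineering"),
    ("005.133 P99", "Python Basics"),
    ("025.04 D55", "Digital Libraries"),
    ("005.8 S33", "Security"),
]

nearby = shelf_neighbors(catalog, "005.74 C44", width=1)
for number, title in nearby:
    print(number, title)
```

Because the list is built from the catalog rather than from physical shelves, books held in remote storage appear next to their neighbours just like any other record.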

39. 進階閱讀與標註系統-以屏東縣志為例 張沛強 /

It is quite common for researchers to put their research results on the Internet for public use. The standard way is to convert a paper or book into a PDF (Portable Document Format) file for people to read online or download. This method, although convenient, keeps the author from using other references that can serve as reading aids, such as external links or maps. Furthermore, a historical article may contain terms whose meaning differs from modern usage, or old place names whose locations are unfamiliar to the general reader. In this study we present an annotation system through which an author can easily link his or her text with external sources. The system provides a platform on which one can annotate person names, place names, and special terms and link them with external sources such as maps and encyclopedias. To demonstrate its effectiveness, we use the draft Pingtung County Gazetteer (屏東縣志稿) as the target text. The external sources include the Encyclopedia Taiwania, Short Biographies of Ming and Qing Taiwanese, historical maps from the Academia Sinica, and Google Maps. Annotations made by researchers appear immediately in a browsing mode, so general readers can read the annotated text in a web browser. The system also gives researchers an interface for adding, editing, and downloading their research data. Overall, we hope the system can serve as a platform for dialogue between researchers and general readers, and help people better understand, and feel more attached to, the land they live on.

40. 中國古代法典及其事例之自動化整合——以乾隆朝《大清會典》為例 郭乃華 /

Daqing Huidian (大清會典, the Records of Laws and Systems of the Qing Dynasty) is an important record produced by the Qing government. It contains two parts: the statutes, which are the set of laws and rules of governance, and the precedents, which record how the laws were carried out and changed in actual governance. Both are important materials for research on the laws and governance of the Qing dynasty. In the earlier compilations, the statutes and precedents were kept in the same book. From the Qianlong reign onward, however, they were divided into two separate books because of the increase in size and the differences in their purposes and nature. As a result, researchers cannot consult the statute and the precedents of one institution together, and the different arrangements of the two books make it hard to find related content even with both books open. In this thesis we use information technologies to rebuild the relationships between statutes and precedents. We take the third edition of the Daqing Huidian, compiled during the reign of Qianlong, as a case study. It includes two books: QinDingDaQingHuiDian (欽定大清會典), a collection of the statutes, and QinDingDaQingHuiDianZeLi (欽定大清會典則例), which documents the precedents.
We first analyze the structures and compiling principles of these two books and reorganize the digital text into a format that is more suitable for research use. This includes splitting the contents into entries, classifying them by the departments and duties given in the books, and automatically extracting the time information of each precedent. We also discover a compiling principle that groups entries with similar content into clusters, each of which we call a topic. We then compare the two books at three levels, category, topic, and entry, and find that the differences between them are not only at the entry level. Many rules appear in only one of the two books, and even the categories that are explicitly defined in the books are divided differently. To analyze these differences, we design a method that combines the three levels of comparison to find corresponding statutes and precedents. With this method we discovered 13,890 relations between 3,695 statute entries and 25,150 precedent entries. Presented on web pages, these correspondences let researchers search and cross-reference the two books quickly.
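As a rough illustration of the entry-level comparison only, the sketch below scores a statute entry against precedent entries by the Jaccard similarity of their character bigrams. This measure is an assumption chosen for the example; the abstract does not name the similarity measure used, and the thesis combines it with category and topic information. The sample sentences are invented, not quotations from the Huidian.

```python
from typing import List, Set, Tuple

def bigrams(text: str) -> Set[str]:
    """Set of overlapping two-character substrings of `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the character-bigram sets of two strings."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def best_match(statute: str, precedents: List[str]) -> Tuple[str, float]:
    """Return the precedent entry most similar to the statute, with its score."""
    scored = [(p, jaccard(statute, p)) for p in precedents]
    return max(scored, key=lambda pair: pair[1])

# Invented example entries (hypothetical, for illustration only).
statute = "凡官員考核三年一次"
precedents = ["官員考核三年舉行一次", "各省錢糧按年奏銷"]

match, score = best_match(statute, precedents)
print(match, round(score, 3))
```

In practice a threshold on the score, together with agreement at the category and topic levels, would decide whether a pair is reported as a correspondence.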

41. 線上輕量型詮釋資料標註工具 黃綱政 /

Digital archives have become an important means of preserving and using artifacts. In addition to converting an (analogue) object into digital form through photography, scanning, audio-visual recording, or full-text entry, an important aspect of digitization is describing the object through metadata. Producing metadata is time-consuming and labor-intensive work. From a system point of view, there are usually two ways to create metadata for a collection. One is to use a stand-alone system designed for a specific project or for a collection of objects with the same metadata attributes, which usually makes it easy for managers to track progress. The other is to use a general-purpose tool such as Excel, which lets the annotator create attributes on the fly while viewing the material. The drawback of the former is a lack of flexibility when the attributes need to be modified, while in the latter it usually takes considerable effort to link the metadata records with the digital objects. In this study we present a system that attempts to combine the advantages of both approaches while making up for their shortcomings. Our system allows a user to design her own metadata attributes and provides a convenient editing mode and easy-to-use operating procedures. More importantly, it offers a natural way to link the metadata records with the images of the digital objects that they describe. We hope this system can provide the versatility and convenience that metadata annotators need.

42. 古籍影像與文本之對應-以《古今圖書集成》為例 陳冠仲 /

The Complete Collection of Graphs and Writings of Ancient and Modern Times (Gujintushujicheng, or Jicheng for short), completed in the early 18th century, is the largest leishu in existence. Containing over 100 million Chinese characters and almost 100,000 pages, and covering over 6,000 subjects, Jicheng is also difficult to use. During the past decade, several digital systems have been developed so that people can use Jicheng through full-text search. Most of them offer both text and page images, but they present results mainly as text and none attempts to match images with text, so a user who wants to find a keyword in a page image has to scan the image by eye. The difficulty arises partly because OCR is still not an effective technology for old Chinese books. In this thesis we develop a method that finds a direct correspondence between a page image of Jicheng and its associated text without resorting to OCR. We first calibrate the images by rotating and cropping them so that all 100,000 pages have the same size and format. We then analyze characteristics such as the layout, the number of lines, and the fixed size and position of graphs, so that each line in the typed text maps to a line of text, a blank line, or part of a graph in a page image. Once this is done, we compute the position of each character in the typed text and map it, through the corresponding coordinates, to a character in the page image. Our method is quite effective: the accuracy in mapping the entire content of Jicheng is 98.7%. The remaining errors are mainly due to typographic errors made when the full text was typed, which can easily be corrected by hand.
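Once every page image has the same format, the final step of going from a character's position in the text to a position in the image reduces to grid arithmetic. The sketch below assumes a uniform grid of vertical lines read right to left; the function name `char_box` and the page dimensions and grid sizes in the example are hypothetical values, not figures from the thesis.

```python
from typing import Tuple

def char_box(line_no: int, char_no: int,
             page_width: float, page_height: float,
             lines_per_page: int, chars_per_line: int) -> Tuple[float, float, float, float]:
    """Bounding box (left, top, right, bottom) of one character in a page image.

    Assumptions: text runs in vertical lines ordered right to left, every line
    has the same width, and every character cell has the same height.
    `line_no` and `char_no` are zero-based.
    """
    col_w = page_width / lines_per_page
    row_h = page_height / chars_per_line
    right = page_width - line_no * col_w  # line 0 is the rightmost column
    top = char_no * row_h
    return (right - col_w, top, right, top + row_h)

# Hypothetical page: 900 x 2000 pixels, 9 lines of 20 characters.
first = char_box(0, 0, 900, 2000, 9, 20)
later = char_box(2, 5, 900, 2000, 9, 20)
print(first, later)
```

Lines that the earlier analysis marks as blank or as part of a graph would simply be skipped when counting text lines, which is why the line-level mapping has to be settled first.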

44. 文本與地圖結合的檢索與閱覽系統實作-以三國志為例 The Implementation of Textual and Geographical Reading System - A Case Study of Records of the Three Kingdoms 周信廷 /

Historical documents often involve locations. With the advances in digital technology, it has become possible to incorporate historical geographic information into historical texts and so provide better visual aids for spatial information when studying these documents. A GIS (Geographic Information System) is a convenient way of achieving this. However, for many scholars without much knowledge of geography or computer systems, mastering GIS well enough to achieve the desired effect may be difficult. The goal of this thesis is to build a user-friendly and intuitive GIS tool that combines texts and maps, with which scholars can build their own georeferenced texts. We have also developed a browsing environment in which the general reader can view the resulting text. We use the “Records of the Three Kingdoms” (《三國志》) as an example. Built with a WebGIS tool and combined with a map of the Three Kingdoms era (produced by the GIS Center of the Academia Sinica), our system allows the user to read the text and view the associated locations on the map simultaneously. Search and retrieval functions, together with some statistical analysis (such as the frequencies of persons and locations in a given text), are also provided, as are some simple GIS functions, such as identifying the locations that fall within a polygon or showing the order in which locations appear. The system can also serve as an entry-level tool that helps historians get past the barrier to using GIS in their research.

45. 台灣歷史面量圖系統 Design and implementation of a Taiwan history choropleth map system 劉光哲 /

A geographic information system (GIS) draws on several disciplines, combining geography and cartography. In the broadest definition, any application system that processes or analyzes spatial data can be called a GIS. Employing GIS in historical research has become popular in recent years. However, GIS has a technical barrier that many humanists find difficult to overcome. In this study, we focus on presenting a user's own statistical data about Taiwan on historical maps as choropleth maps. The maps we use come from several sources, including 台灣歷史文化地圖, the China Historical GIS (中國歷史地理資訊系統), the GIS Center of the Academia Sinica, and the Institute of Transportation of the ROC government, 61 maps in total. Since the purpose is to produce choropleth maps, we emphasize providing the administrative regions of the different periods over the 400-year history of Taiwan. The system is a WebGIS built with JavaScript, AJAX, and PHP. It runs in a browser and does not require the installation of any additional software. A user only needs to upload the data, through the web UI or as an Excel file, and select a few options to produce a choropleth map, without help from technical staff. We hope that the system can reduce the technical barrier that historians often encounter when using GIS and encourage them to make better use of GIS in their research.

46. 旅遊網頁觀光目的地意象之內容分析工具研究 Destination Image Representation on the Web by Content Analysis 卓文福 /

With the rapid development of the Web, free Web resources have become the primary source of information for many people. When planning travel, for instance, one may consult the official websites of places, commercial websites devoted to specific destinations, or weblogs. Research that studies destination image representation on the Web through content analysis has attracted much attention in recent years. Most of this work obtains Web content manually and uses software to analyze phrase frequency and phrase clustering in order to derive the destination images. The tools currently available, however, are mainly designed for Western languages and are not suitable for Chinese content. We have also noticed that the existing methods need improvement in automatic Web content extraction, clustering, and the analysis and management of large amounts of content.
In this thesis we develop a system architecture that integrates these additional features into a management system, backed by a database, so that destination images can be analyzed more effectively. Our method can distinguish the temporal aspects of the extracted data and can compute on either whole texts or sentences, which supports analysis from multiple dimensions. We have developed two automatic Web content extraction mechanisms, whose purpose is to separate the meaningful content of a webpage from nonessential parts such as headers and advertisements. The first is designed for specific websites, and the second for blogs obtained through keyword search. Parsing and cleaning techniques extract the meaningful content from the webpages as plain text. By segmenting the text, we identify travel-related phrases together with their frequencies and co-occurrence relations, from which we obtain a semantic network. We have also adapted several clustering algorithms, based on neural networks, spanning tree clustering, and agglomerative clustering, to cluster the phrases and find the destination images. The system's functions include adding destinations, extracting text from websites and blogs, importing text for analysis, online text extraction, browsing analysis results, managing the vocabulary, and clustering destination images. To demonstrate the effectiveness of our method, we applied the system to a number of popular tourist destinations in Taiwan, including Alishan, Sun Moon Lake, Kenting, Tamsui, Pingxi, and Qingjing. We use the system to analyze the differences between the destination images conveyed by the official website and by the weblogs of each location. The results also show the similarities and differences in perception among the tourist locations. The system can be used to evaluate the effectiveness of official tourism websites, to identify subtle seasonal differences in tourism at the same location, and as a reference for the promotional strategies of the tourist industry.

47. 歷史文件中的時間資訊處理與系統呈現 Temporal Information Processing and a Chronological System for Chinese Historical Documents 蘇豐成 /

The chronicle is one of the oldest forms of historical writing. It records events in chronological order and gives later readers a way to follow history as it unfolded. In China, this form dates back to the Pre-Qin period (先秦時期): the Spring and Autumn Annals (春秋), Zuo Zhuan (左傳), and Bamboo Annals (竹書紀年) are all written year by year. During the Northern Song Dynasty (北宋), Sima Guang (司馬光) compiled Zizhi Tongjian (資治通鑑), a masterpiece of Chinese chronicles that set the standard for all later ones. Zizhi Tongjian covers Chinese history from 403 BC to 959 AD. In addition to historical records written as chronicles, other sources also carry temporal information; the basic annals (benji, 本紀) of emperors in the Official Histories (Zhengshi, 正史), for example, are usually written in chronological style. This research tries to build a timeline spanning several thousand years of Chinese history by combining chronicles such as Zizhi Tongjian with the chronological records in the Official Histories, and to create a method for processing temporal information in Chinese historical documents. We first use regular expressions to automatically annotate temporal expressions in the documents. We then build a chronological system that displays the processed records and aligns documents by time. Through our system, we hope to give users a more convenient way to read these historical documents and to find relationships between documents.
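A regular expression for one kind of temporal tag, an era name followed by a regnal year, might look like the sketch below. It is a simplified illustration, not the thesis's pattern set: only three Qing era names are included, only year numbers below 100 are parsed, and the names `ERA_START`, `parse_cn_number`, and `extract_years` are hypothetical.

```python
import re
from typing import List, Tuple

# Gregorian year in which each era's first regnal year (元年) began.
ERA_START = {"康熙": 1662, "雍正": 1723, "乾隆": 1736}

DIGITS = {c: i for i, c in enumerate("一二三四五六七八九", start=1)}

def parse_cn_number(s: str) -> int:
    """Parse a Chinese numeral below 100, e.g. 元, 三, 十, 十二, 二十, 六十一."""
    if s == "元":
        return 1
    if "十" in s:
        tens, _, ones = s.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    return DIGITS[s]

UNIT = "[一二三四五六七八九]"
PATTERN = re.compile(
    "(" + "|".join(ERA_START) + ")"          # era name
    + "(元|" + UNIT + "?十" + UNIT + "?|" + UNIT + ")年"  # regnal year
)

def extract_years(text: str) -> List[Tuple[str, int]]:
    """Return (matched expression, approximate Gregorian year) for each era date.

    The conversion ignores the offset between the lunar and Gregorian new year,
    so dates late in a lunar year may fall in the following Gregorian year.
    """
    return [
        (m.group(0), ERA_START[m.group(1)] + parse_cn_number(m.group(2)) - 1)
        for m in PATTERN.finditer(text)
    ]

found = extract_years("乾隆三年春,復議康熙六十一年舊案。")
print(found)
```

Normalizing every tag to a common year scale is what allows records from different documents to be placed on one timeline.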

48. 自動化資料豐富程序 宋浩 /

Metadata, known as "data about data", is an important way to describe and use digital objects in digital archives, digital libraries, and digital museums. Producing accurate, precise, and high-quality metadata is a critical task for digital databases, and it requires not only a large investment of labor but also domain know-how. Because metadata construction is labor-intensive, a model often employed in developing a large digital collection is to build the different archives separately and then construct a central portal (such as a union catalog) through which users browse, search, and explore the entire collection. Although this model is effective in terms of time, manpower, and resources, it has drawbacks. The main problem is inconsistency in the resulting metadata, which may be caused by misinterpretation of metadata attributes, different levels of detail when inputting data, or a metadata format that is inadequate for particular data sets. This in turn makes browsing, searching, and linking the data harder. In this thesis, we propose ADEPT (Automated Data Enrichment Processing Technology), a framework for pre-processing and enriching input data that conforms to the Dublin Core. ADEPT contains three primary modules: data verification, data normalization, and named-entity recognition. ADEPT aims to ensure data consistency and correctness while increasing data usability.
Furthermore, the enriched metadata is better suited to linked open data. By connecting related data, we can explore and share information and knowledge through the Web.

49. 跨語言線上百科連結 王昱鈞 /

Online encyclopedias such as Wikipedia are among the most widely used Internet services in the world. Although Wikipedia has many language editions, their coverage is imbalanced relative to the number of speakers of each language, both online and offline. Furthermore, large alternative online encyclopedias exist for some languages, such as Baidu Baike for Chinese. We could improve access to the knowledge in these sources by integrating multiple online encyclopedias into large multilingual knowledge bases. The main task in such a project is creating links between articles in different encyclopedias in different languages. Most research to date has focused on linking articles across the language editions of Wikipedia, and little work has been done on linking encyclopedias on other platforms. In this thesis, we define the problem of cross-language encyclopedia article linking (CLEAL) between encyclopedias on different platforms, and develop a method for it using English Wikipedia and Chinese Baidu Baike. We use an SVM model with bilingual topic model features and translation features to link articles between these two encyclopedias. To evaluate our approach, we compile datasets from Baidu Baike articles and their corresponding English Wikipedia articles. The evaluation results show that our approach achieves a mean reciprocal rank (MRR) of 0.8252, outperforming the baseline system by 0.1745 (+26.82%).
Our method does not depend heavily on specific platform formats or linguistic characteristics, so it could easily be extended to generate cross-language article links among other online encyclopedias in other languages and on other platforms.
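For readers unfamiliar with the metric, mean reciprocal rank averages 1/rank of the correct target over all queries, counting 0 when the target is not returned. The sketch below uses invented candidate lists; the function name and data are hypothetical. The last lines check the abstract's own arithmetic: a gain of 0.1745 to reach 0.8252 implies a baseline of about 0.6507, and 0.1745 / 0.6507 is about 26.82%.

```python
from typing import Dict, List

def mean_reciprocal_rank(ranked: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Average of 1/rank of the gold answer per query (0 if it is missing)."""
    total = 0.0
    for query, candidates in ranked.items():
        if gold[query] in candidates:
            total += 1.0 / (candidates.index(gold[query]) + 1)
    return total / len(ranked)

# Invented example: English article title -> ranked Chinese candidate entries.
ranked = {
    "Taipei": ["台北市", "新北市"],          # correct at rank 1
    "Tamsui River": ["基隆河", "淡水河"],     # correct at rank 2
    "Alishan": ["玉山", "合歡山"],           # correct entry not retrieved
}
gold = {"Taipei": "台北市", "Tamsui River": "淡水河", "Alishan": "阿里山"}

mrr = mean_reciprocal_rank(ranked, gold)
print(mrr)

# Consistency check on the figures reported in the abstract.
baseline = 0.8252 - 0.1745
relative_gain = round(0.1745 / baseline * 100, 2)
print(round(baseline, 4), relative_gain)
```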

50. 田野影像管理與標註平台研究 陳偉儀 /

Field research is one of the most important research methods in the social sciences. A researcher collects material through field work to deepen his or her research. Methods of field work include informal interviews, direct observation, and personal participation. Whatever is involved, a field trip invariably produces a collection of photographs and associated notes. If these are not organized and processed right away, one may lose or misinterpret the information hidden in the photographs. We noticed that few systems are designed for documenting field research. Indeed, a field worker may need several software packages to organize the photos and notes properly. In this thesis, we develop a system that allows a field researcher to organize and edit a large number of photographs. The design draws on repeated discussions and interviews with field workers and on first-hand participation in field work. Our system provides simple ways of organizing and editing the photos, so that the researcher can classify and annotate them quickly. Implicit information such as the time and location of the photos is extracted and recorded automatically. The user can highlight a portion of a photo and annotate it. Relations between two photographs can also be recorded, so that relationships among the photos in a collection can be explored later. Several search methods and display modes help users retrieve material and observe connections between images. We also provide an export function so that the photos and their metadata can be ported to other platforms, and customizable options so that the system can fit the needs of different users.

51. 文本標記格式的轉換與應用 (Conversion and Application of Text-Tagging Formats) 曹又霖 /

Tagging named entities in a text is often an essential part of preparing the text for digital humanities research. Although several text-tagging tools are available to researchers, each tool is designed for a specific purpose and excels at different kinds of terms, and the tagging formats they use often differ. Consequently, text tagged with one tool cannot be reused directly by another person with a different tool. In this thesis we propose an approach to integrating the tagging formats produced by different tools. We analyze what information text-tagging formats carry, classify that information, and define the Simple Text-Annotation Markup Language (STAML) to store it. STAML serves as an intermediary representation between tagging formats: through STAML, texts tagged in one format can be used in another tagging tool without disrupting the existing annotations. STAML and web-based conversion programs are implemented for several common tagging formats for Chinese-language texts, such as those used by MARKUS (a popular tagging tool), THDL, and TEI.

52. 中文家譜數位化研究 (A Study on the Digitization of Chinese Genealogies) 郭秀萍 /

Genealogy plays an important role in historical studies. There is a vast number of Chinese genealogical books, spanning over 500 years and all of China proper. A genealogy book usually includes family documents, the family tree and kinship relations, profiles of the clansmen, the family's ethical norms, and information on the compilation and origins of the book itself. These records are valuable material for the study of family history, demography, economics, local society, and so on. Computer science can be a powerful tool for studying genealogical records if the data are digitized and stored effectively, yet its application to Chinese genealogies remains limited. The idea of digitizing genealogies is not new, but the concepts and formats are still evolving; existing work pays too little attention to the characteristics of Chinese genealogical documents and seldom considers links across genealogies and families. The purpose of this study is to investigate an approach and a format for recording the information in existing genealogical books, so that users can transfer the data in paper Chinese genealogies into a portable and extensible format. This enhances the effective use of Chinese genealogical data and, moving from a single tree to a forest, gives researchers an environment in which to observe patterns across many genealogies and raise new questions. The format we propose in this thesis is called JiaPu Markup Language (JPML). Notable genealogical and biographical data structures such as GEDCOM and the China Biographical Database (CBDB) have been examined, and parts of them adopted, to improve the extensibility of JPML.
We have also designed interfaces for users to create and edit genealogical documents and person records in real time, and to record the lineage and kinship relations between people in existing genealogical books in a simple way. The family tree is drawn as person records are added, which makes browsing and proofreading easier. We also entered paper genealogies into the system as part of the study. All data in the system can be exported in Excel or JPML format.

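53. 利用使用者回饋尋找相關條目-以《清實錄》中臺灣相關資料為例 (Finding Relevant Entries through User Feedback: The Case of Taiwan-Related Material in the Veritable Records of Qing) 宋欣烜 /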

The Veritable Records of Qing (《清實錄》) is a vast historical work compiled in chronological form. Arranged by year, month, and day, it records the daily activities of the Qing emperors over more than three hundred years, including the appointments and memorials of important officials, imperial decrees, demographic information, cargo shipments, and military expeditions. Because it was compiled officially, by imperial order, and is rigorously structured, it is a valuable source for historians of the Qing dynasty. In a work of this size, however, the topic a scholar wishes to study may involve only a small portion of the entries, and extracting those relevant entries is an important problem. Once historical texts are digitized, scholars traditionally find relevant entries by keyword search, but a relevant entry that does not contain the chosen keywords cannot be found this way. This thesis proposes a method for finding relevant entries that computes the relevance between entries instead of relying on keyword search.
The researcher first selects a small number of entries of interest. The method computes the relevance between the selected entries and every remaining entry, and lists the remaining entries in descending order of relevance. The researcher browses the list, chooses further entries of interest, and sends them back as feedback; the method then recomputes relevance for the next iteration and surfaces entries that are even more likely to be of interest. Repeating this feedback loop gradually collects the relevant entries the researcher needs. In the 1990s, a group of scholars manually extracted from the Veritable Records of Qing the entries they judged to be related to Taiwan and compiled them into the Veritable Records of Qing: Taiwan Selection (《清實錄臺灣史資料專輯》). We use this selection together with the Veritable Records of Qing to test the performance of different relevance algorithms on historical documents. We then design an entry search method based on user feedback and implement a system that gives researchers an interface for finding relevant entries in a large historical corpus. Finally, starting from the entries in the Taiwan Selection, we find further Taiwan-related entries in the Veritable Records of Qing that were not chosen by the original compilers. The thesis is divided into two parts. The first part explains how identical entries in the digitized versions of the two works are matched, introduces several ways of computing relevance between entries, and benchmarks them on the two works using various performance measures. The second part presents the relevance feedback search algorithm and system built on the best-performing relevance measure. Historians tested the system to find more Taiwan-related entries in the Veritable Records of Qing, and the newly found entries are then briefly examined and analyzed statistically.
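
The feedback loop in this abstract can be illustrated with a minimal sketch. The abstract does not say which relevance measure performed best, so the sketch assumes TF-IDF weights over character bigrams with cosine similarity to the centroid of the selected entries. The function names, the `is_relevant` callback (standing in for the researcher's judgment), and the parameters `top_k` and `max_rounds` are all hypothetical.

```python
import math
from collections import Counter

def bigrams(text):
    """Character bigrams: a simple tokenization that needs no word segmentation."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per entry."""
    counts = [Counter(bigrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
            for c in counts]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_relevance(vectors, selected):
    """Rank unselected entries by similarity to the centroid of the selected ones."""
    centroid = Counter()
    for i in selected:
        for t, w in vectors[i].items():
            centroid[t] += w / len(selected)
    scores = [(cosine(vectors[i], centroid), i)
              for i in range(len(vectors)) if i not in selected]
    return [i for _, i in sorted(scores, reverse=True)]

def feedback_search(docs, seeds, is_relevant, top_k=2, max_rounds=5):
    """Show the top_k candidates each round and add the ones the user accepts."""
    vectors = tfidf_vectors(docs)
    selected = set(seeds)
    for _ in range(max_rounds):
        shown = rank_by_relevance(vectors, selected)[:top_k]
        accepted = {i for i in shown if is_relevant(i)}
        if not accepted:          # nothing new was accepted: stop iterating
            break
        selected |= accepted      # feedback: the centroid is recomputed next round
    return selected
```

In a real session `is_relevant` would be replaced by the researcher browsing the ranked list; the loop structure (rank, collect feedback, recompute) is the part that mirrors the abstract.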

54. 小說對話標註系統研究與實作 (Design and Implementation of a Conversation Tagging System for Novels) 黃家富 /

Conversation plays a crucial role in the novel as a genre. Speech is the most important activity of the characters, and conversations and their participants are often what drives the plot. Analyzing who speaks to whom, and what is said, lets us try to infer both the quality and the quantity of the relations among characters. This thesis improves an existing tagging method and presents a new tagging system that helps users tag conversations in the full text of a novel. Having noted the shortcomings of other systems for tagging conversation data, we pay particular attention to the practicality and convenience of the system's functions and workflow in order to improve tagging efficiency. We have also connected the tagging system to DocuSky, a DH platform on which scholars manage personal documents, so that tagged texts can be analyzed and presented with the other analysis and visualization tools DocuSky provides. We demonstrate the method by tagging the Battle of Red Cliff in the Romance of the Three Kingdoms; the outcome is displayed with a simple statistical charting tool and with the social network analysis tool Gephi.

55. 透過社群關係與個人行為進行新聞推薦 (News Recommendation Based on Social Relations and Personal Behavior) 謝于琳 /

The past decade has witnessed a tremendous surge in online news media. Traditional newspapers and magazines have moved online as well, and more and more people have turned to online news instead of relying on traditional media. With this explosive growth in the volume of news, it is difficult for readers to find, by their own searching, the articles that meet their individual needs, so filtering the news and automatically recommending articles that match a reader's interests has become an important problem. Because news changes quickly and readers' demands are largely implicit, an effective news recommendation system has to overcome several problems, including the multiple needs of a user, vagueness in a user's expectations, the short time span over which news stays relevant, and the vast number of articles to be considered. In this thesis, we propose a method for recommending news to a user. The method filters candidate articles to reduce the number of articles considered. It computes the timeliness of a news article automatically, with a separate timeliness decay parameter for each media outlet, to address the short life span of news. It also takes into account the behavior of the members of the reader's social network, to serve the reader's multiple needs, and the record of the reader's past activity in the system, in order to recommend articles that are interesting and valuable to the reader.
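
As a rough illustration of how timeliness, social signals, and personal interest might be combined, the sketch below scores articles with an exponential time decay whose rate depends on the media outlet. The exponential form, the multiplicative combination, the decay rates in `DECAY_PER_HOUR`, and all field and parameter names are assumptions made for this example; the abstract does not give the actual formula.

```python
import math
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    source: str          # media outlet type, used to look up the decay rate
    age_hours: float     # time since publication
    interest: float      # match with the reader's past activity, in [0, 1]
    friend_reads: int    # number of the reader's contacts who read the article

# Hypothetical per-outlet decay rates: fast-moving outlets lose timeliness sooner.
DECAY_PER_HOUR = {"wire": 0.10, "magazine": 0.01}

def timeliness(article):
    """Exponential decay of freshness, with a separate rate per outlet."""
    return math.exp(-DECAY_PER_HOUR[article.source] * article.age_hours)

def score(article, social_weight=0.1):
    """Combine personal interest, social signal, and timeliness."""
    social = 1.0 + social_weight * article.friend_reads
    return article.interest * social * timeliness(article)

def recommend(articles, k=2, min_timeliness=0.05):
    """Filter out stale candidates, then return the k best-scoring articles."""
    candidates = [a for a in articles if timeliness(a) >= min_timeliness]
    return sorted(candidates, key=score, reverse=True)[:k]
```

With these assumed rates, a day-old wire story retains about 9% of its timeliness while a day-old magazine piece retains about 79%, so the same age penalizes the two outlets very differently.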

56. 文本對讀系統—以《春秋》三傳為例 (A Parallel Reading System for Texts: The Case of the Three Commentaries of Chunqiu) 趙叡 /

Chunqiu (《春秋》, the Spring and Autumn Annals) is the earliest chronological history in China and one of its most important historical records. Compiled by Confucius, it records the history of the state of Lu from the first year of Duke Yin (722 BCE) to the fourteenth year of Duke Ai (481 BCE), covering twelve rulers and 242 years. To explain Confucius' very terse record, three annotations were written: Zuo Zhuan, Gongyang Zhuan, and Guliang Zhuan. Together they are called the Three Commentaries of Chunqiu. The three have different authors, and each presents and interprets Chunqiu in its own way, so their descriptions of some events differ. This research develops a system that allows a reader to read the three Commentaries side by side. It exploits the chronological nature of the records, treating each original entry in Chunqiu as a headline under which the corresponding passages of the three Commentaries are presented and compared. Where the accounts agree, the researcher gains confidence in their accuracy; where they differ, the researcher can compare the views of the different authors. Through this mode of reading, researchers can form their own conclusions. The system is built on DocuSky and its notion of a personal text database. Besides the Three Commentaries of Chunqiu, users can upload their own documents for parallel reading, and the system provides a user interface suited to reading and analysis, together with tools such as full-text search.

57. 文獻引用建置系統的設計與實作 (Design and Implementation of a Document Citation Linking System) 王景逸 /

Document citation linking connects a citation string with the document it refers to. The reference list of a paper, for example, contains many references in string form; treating each as a citation string, the goal is to find the corresponding document and link the two. These links are not easy to establish. Citation strings may contain errors caused by the writer's carelessness, by file encoding problems, and so on, so a citation string can differ from the document's record, which makes it hard to decide whether a link should be established. The amount of data to consider also grows as documents and citation strings accumulate. In this study we propose a citation linking system that uses an index to filter candidate documents or candidate citation strings, reducing the amount of data to consider to a reasonable range. The similarity of author, title, date, and source serves as the criterion for deciding whether to establish a link, and also as a measure of the strength of the link. Cases that the system finds hard to decide are stored so that they can be checked manually.
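
The pipeline described here (index-based candidate filtering, field-wise similarity, and a queue for manual review) might look roughly like the sketch below. It is only an illustration: the inverted index over title words, the fuzzy word matching via the standard library's `difflib.get_close_matches`, the field weights, and the `accept` and `review` thresholds are all assumptions, not the system's actual design.

```python
import difflib
from collections import defaultdict

def words(text):
    """Lowercased alphanumeric words of a string."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

def build_index(documents):
    """Inverted index: title word -> ids of documents whose title contains it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for w in words(doc["title"]):
            if len(w) > 3:            # skip short, uninformative words
                index[w].add(doc_id)
    return index

def candidates(citation, index):
    """Only documents sharing at least one indexed word are compared in detail."""
    found = set()
    for w in words(citation):
        found |= index.get(w, set())
    return found

def fuzzy_overlap(field, citation_words):
    """Fraction of the field's words with a close match in the citation (tolerates typos)."""
    field_words = words(field)
    if not field_words:
        return 0.0
    hits = sum(1 for w in field_words
               if difflib.get_close_matches(w, citation_words, n=1, cutoff=0.8))
    return hits / len(field_words)

def similarity(citation, doc):
    """Weighted similarity over title, author, year, and source (weights are assumed)."""
    cw = words(citation)
    return (0.5 * fuzzy_overlap(doc["title"], cw)
            + 0.2 * fuzzy_overlap(doc["author"], cw)
            + 0.2 * (1.0 if str(doc["year"]) in cw else 0.0)
            + 0.1 * fuzzy_overlap(doc["source"], cw))

def link(citation, documents, index, accept=0.8, review=0.5):
    """Return ('link' | 'review' | 'none', doc_id) for one citation string."""
    scored = ((similarity(citation, documents[i]), i) for i in candidates(citation, index))
    best_score, best_id = max(scored, default=(0.0, None))
    if best_score >= accept:
        return ("link", best_id)
    if best_score >= review:
        return ("review", best_id)    # uncertain: store for manual checking
    return ("none", None)
```

The similarity score doubles as a measure of link strength, and the middle band between the two thresholds corresponds to the cases set aside for human review.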

58. 對文本進行詮釋資料附加的研究與應用 (Research on and Application of Attaching Metadata to Texts) 陳琤 /

When humanities researchers use digital texts as research material, it is important that the metadata of the texts be easy to manage and also useful for searching and retrieving the texts. In this study we define a schema for metadata sheets, which serves as an intermediary between metadata and texts. Based on the schema, we implement a metadata import tool on DocuSky, a platform for building personal text databases. Researchers can manage the metadata of their texts in spreadsheet software and, whenever they need the metadata for searching or analyzing a text database, import the updated metadata into the database, enriching the texts and making them more usable.

59. 正史五行志自然災害的系統呈現 (A System for Presenting Natural Disasters Recorded in the Wuxingzhi of the Standard Histories) 謝弘庭 /

The Standard Histories, twenty-five in total, are the official histories of the Chinese dynasties and record several thousand years of Chinese history. Many of the Standard Histories contain volumes of Wuxingzhi (五行志), which record disasters that occurred during a reign. The Chronicles of the Emperors (Benji, 本紀) often contain such records as well. These records document the time, location, and severity of each occurrence, and thus serve as important evidence for modern studies of the history of disasters. In this study we focus on the natural disasters documented in the Benji and Wuxingzhi of the Standard Histories. We first define the scope of our study and what counts as a natural disaster, based on the classifications and rules given in the Wuxingzhi and Benji and on definitions used in other studies of the history of natural disasters. We then extract the records of natural disasters from the Wuxingzhi and Benji. The extracted records are further annotated with disaster events and metadata to meet the needs of disaster history research. A system is then developed to visualize the distribution of disasters over time, space, and type, and to provide post-classification functions. We expect that users will be able to observe the distribution of disasters from multiple perspectives and thereby identify issues that are interesting and worthy of study.

68. 介接維基文庫及DocuSky的文本加值工具 (A Text Enrichment Tool Bridging Wikisource and DocuSky) 李旭恩 /

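Obtaining reliable and rich digitized research resources, and processing and enriching them appropriately so that texts yield the greatest research value, is one of the strengths of digital humanities. Wikisource, with freedom and sharing as its motto, invites the public to upload and edit texts together and has thereby accumulated a rich and varied body of documents; researchers can quickly find the material they need through its search function. Ordinary document digitization, as in Wikisource, merely converts paper into electronic form for reading. Digital humanities goes further and explores how software tools can deepen texts so that research gains new perspectives. To this end, the DocuSky digital humanities research platform (hereafter the DocuSky platform) developed personal document databases: through its tools, users can set up their own way of managing documents and attach information to texts, such as metadata and tags on key terms. These features let users perform operations such as post-classification retrieval, term frequency statistics, and automatic term tagging, giving researchers different views of their texts. This study focuses on combining the strengths of the two, the rich holdings of Wikisource and the powerful text processing of the DocuSky platform, by developing Wiki2DocuXML. Earlier tools have addressed how to convert documents from other databases to the DocuSky platform. Such bridging tools concentrate on fast data access to very large databases and on conversion to the document format DocuSky accepts, but give users little control during the conversion. This thesis therefore introduces the concept of a Simple Workflow, uses the Wikisource API for access and data filtering, and explores good user interface design, so that researchers can not only obtain the documents they need smoothly during bridging but also make initial value-added use of the documents through the simple workflow, allowing the DocuSky platform to support their research further.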

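69. 基於爬蟲的跨資料庫二元關係呈現工具與家譜數位化的應用 (A Crawler-Based Tool for Presenting Binary Relations across Databases and Its Application to Genealogy Digitization) 康譽騰 /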

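《明清兩代嘉興的望族》 (Prominent Families of Jiaxing in the Ming and Qing Dynasties), by the Chinese sociologist Pan Guangdan (潘光旦), distills the essence of genealogical topics and has contributed considerably to the development of genealogical theory. Because of its age, however, its pages are damaged and its type is hard to read, so digitizing it is both urgent and important. Digitization can be considered at four levels: first, converting the original material into images; second, converting images into text; third, adding appropriate markup to the text; and fourth, visualizing the marked-up data. The third and fourth levels cost far more than the first two and often run into difficulties with extensibility, multi-user collaboration, data verification, and presentation. A wiki combined with a crawler tool can meet these needs, and in this thesis the data are built on the open-source BookStack wiki platform. The original text alone is not enough to present digitized content fully; third-party databases are needed for reference. Citing across databases is subject to technical limitations, however, so we design a crawler that integrates content across databases as a way around them. Searching across databases with a crawler requires rules for how it runs. The basic rule is breadth-first search, which is sufficient for the small amounts of data that users build themselves. The performance of rendering the graph of binary relations depends on the complexity of the search results, and rendering failures caused by exceeding the execution time limit will be a major challenge.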
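
The breadth-first rule mentioned in this abstract can be sketched as follows. The `fetch_relations` callback stands in for whatever lookup the crawler performs against a wiki page or third-party database, and the `max_nodes` budget is one assumed way to keep the resulting graph small enough to render in time; neither is taken from the thesis.

```python
from collections import deque

def crawl_relations(start, fetch_relations, max_nodes=100):
    """Breadth-first crawl that collects binary relations as (subject, relation, object).

    fetch_relations(entity) must return a list of (relation, other_entity) pairs.
    Once max_nodes entities have been discovered, relations leading to new
    entities are skipped, which bounds the size of the graph to be rendered.
    """
    seen = {start}
    queue = deque([start])
    edges = []
    while queue:
        entity = queue.popleft()          # FIFO order gives breadth-first traversal
        for relation, other in fetch_relations(entity):
            if other not in seen:
                if len(seen) >= max_nodes:
                    continue              # budget exhausted: do not expand further
                seen.add(other)
                queue.append(other)
            edges.append((entity, relation, other))
    return edges
```

For example, with a toy in-memory source `data = {"person_a": [("father_of", "person_b")]}` the call `crawl_relations("person_a", lambda e: data.get(e, []))` returns the single edge from `person_a` to `person_b`.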

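71. 《通志》傳記人物與正史記載比對系統建置研究 (Building a System for Comparing Biographies in the Tongzhi with the Accounts in the Standard Histories) 高正玥 /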

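Because a general history covers a long span of time within limited space, its compiler must decide which historical materials to add and which to omit. The Tongzhi (《通志》), an important ancient Chinese work, contains a general history from the pre-Qin period to the Sui dynasty. Its author drew on many different sources, so which sources he consulted, and what he added or deleted, is a question worth investigating. This study therefore takes the Tongzhi and the Standard Histories as its subject, compares them using information technology, and builds a comparison and analysis system, with the aim of giving researchers a systematic and effective way to explore the similarities and differences between the Tongzhi and the Standard Histories. In the research workflow, the Tongzhi and the Standard Histories are first divided into suitable comparison units, and the scope of comparison is delimited to make the comparison more precise and efficient. A string matching algorithm is then used to compare the similarity of these units. Finally, researchers can examine the comparison results from both macro and micro perspectives, and can enter their own texts to extend the scope of study. We hope that the system offers researchers more methods and resources, enables in-depth studies that were previously out of reach, and deepens the understanding and exploration of the Tongzhi.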
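
The unit-by-unit comparison could be sketched as below. The abstract does not name the string matching algorithm, so this example assumes Python's standard `difflib.SequenceMatcher`; the function names and the report layout are hypothetical. The ratio gives the macro-level similarity, and the kept, deleted, and added spans give the micro-level view of what the compiler retained, dropped, or introduced.

```python
from difflib import SequenceMatcher

def compare_units(source_unit, target_unit):
    """Compare a unit of the source history with the corresponding unit of the later work.

    Returns the similarity ratio plus the spans that were kept, deleted from the
    source, or added in the target.
    """
    matcher = SequenceMatcher(None, source_unit, target_unit, autojunk=False)
    report = {"ratio": matcher.ratio(), "kept": [], "deleted": [], "added": []}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            report["kept"].append(source_unit[i1:i2])
        if tag in ("delete", "replace"):
            report["deleted"].append(source_unit[i1:i2])
        if tag in ("insert", "replace"):
            report["added"].append(target_unit[j1:j2])
    return report

def most_similar(target_unit, source_units):
    """Index and ratio of the source unit most similar to target_unit (non-empty list)."""
    scored = [(SequenceMatcher(None, s, target_unit, autojunk=False).ratio(), i)
              for i, s in enumerate(source_units)]
    ratio, index = max(scored)
    return index, ratio
```

Here `most_similar` suggests which source unit a passage was probably drawn from, and `compare_units` then shows the additions and deletions against that source; restricting `source_units` to a delimited range corresponds to bounding the comparison scope for efficiency.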


