Alternative Title: A Prototype of a Unified Search Engine for Multisource MSDS
Thesis Advisor: 李晓霞
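Degree Grantor: Institute of Process Engineering, Chinese Academy of Sciences (中国科学院过程工程研究所)
Place of Conferral: Institute of Process Engineering (过程工程研究所)
Degree Discipline: Computer Chemistry (计算机化学) ★
Keywords: MSDS; search engine; web crawling; deep web retrieval; data extraction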
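Abstract: Chemical safety concerns not only the occupational safety of practitioners but also many aspects of everyday life, and raising awareness of the safe use of chemicals is an effective safeguard for healthy living. The MSDS, which provides chemical safety information, is a document of considerable commercial value; it matters both to manufacturers and suppliers of chemicals and to the researchers and other practitioners who handle or use them. Existing MSDS search tools can generally search only a single-source MSDS database, so a unified search engine that searches multiple scattered MSDS data sources with a single query would help people make better use of the free MSDS resources on the web. By surveying and analyzing the search status of existing MSDS data sources on the web and the characteristics of their search tools, this thesis designs the framework of a prototype unified MSDS search engine. In addition to implementing online real-time search of multisource MSDS, it proposes a strategy of building a local MSDS cache database to improve the response speed of the unified search engine. The implementation of the prototype covers discovering and automatically constructing query patterns, automatically retrieving result pages, obtaining compound identification information by data extraction, and building a compound index for MSDS. Data analysis and extraction techniques for text, web page, and PDF formats are applied together: MSDS files are obtained by combining surface-web link analysis with deep-web retrieval and data extraction, the MSDS results retrieved from each data source are cached, and a compound index is built, which achieves unified search of multisource MSDS. A series of experiments not only verified the feasibility of the method but also built the local MSDS cache database and greatly expanded the coverage of the compound index table baseIndex. During program development, a small general-purpose function library was built so that the crawling and extraction programs for different data sources can share common functionality, which both improved programming efficiency and reduced coupling between programs. The prototype supports searching multiple source databases with a single query and offers several search modes, including compound name, molecular formula, CAS number, and chemical structure. The MSDS data sources used to verify the feasibility of the method include Alfa Aesar, ILO, ChemExper, and FisherSCI; the MSDS data covered reaches 50,000 in scale, in two languages (Chinese and English) and three formats (Text, HTML, and PDF). Along with the retrieved MSDS results, the system can display the source, manufacturer, original URL, and descriptive information of each MSDS, so that users can compare data from different sources as needed. The thesis also builds an interface program for the prototype, integrating it with the ChemDB Portal system and extending the range of data types that ChemDB Portal covers. In summary, the prototype makes it much more convenient for MSDS users to exploit the free MSDS already available on the web, and its implementation approach can serve as a reference for similar systems in other fields.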
Other Abstract: The safety of chemical substances is not only a concern of professionals for their occupational safety but is also closely related to many aspects of human life. Awareness of chemical safety is essential for a healthy lifestyle. The MSDS is a document with commercial value that provides chemical safety information, which is needed both by manufacturers that produce or supply chemicals and by researchers and other people who use them. At present, the publicly available MSDS sources are scattered across the Internet, and searching is usually possible only within a single-source MSDS database. An MSDS search engine that can search multiple MSDS databases with a single query would therefore facilitate MSDS searching on the Internet. Through an analysis of the status of MSDS data sources and the features of existing search tools, this thesis designs a framework for a unified MSDS search engine. In addition to searching multisource MSDS online in real time, it proposes a strategy of establishing a local MSDS cache to improve the response speed of the unified MSDS search engine. The thesis establishes a prototype of the unified MSDS search engine. Its implementation includes automatically finding and generating retrieval patterns, automatically retrieving search result pages, and obtaining compound identification data by data extraction methods, which is necessary for indexing MSDS documents against chemicals. By integrating techniques for data analysis and extraction from documents in text, HTML, or PDF format, MSDS documents are retrieved through hyperlink analysis and deep web searching, then cached and indexed by compound. A series of experiments was carried out not only to verify the feasibility of the method but also to establish the local MSDS cache, which significantly expands the coverage of the compound index. During development, a small library of general-purpose functions was built so that functions can be shared by different extraction procedures. This not only improves programming efficiency but also reduces the coupling between procedures. The prototype can search different data sources with a single query by chemical name, formula, CAS number, or chemical structure. Several MSDS sources, including Alfa Aesar, ILO, ChemExper, and FisherSCI, are used in the experiments to verify the proposed approach. The cached MSDS data in the prototype system amounts to about 50,000 items, in both English and Chinese and in three formats (Text, HTML, PDF). The prototype shows users not only the retrieved MSDS but also related information such as the source, producer, original site URL, and description of each MSDS, so that users can compare MSDS from different sources. In addition, an interface has been created to integrate the prototype into the ChemDB Portal system, which expands ChemDB Portal's coverage of both data types and data. In short, this thesis establishes a prototype unified MSDS search engine that makes it much easier for people to find and use MSDS on the Internet, and the methods used to build it can be applied to other similar data.
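The cache-first strategy described in the abstract (answer from a local cache indexed by compound when possible, otherwise query every source and store the result) can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `MsdsRecord`, `Fetcher`, and `UnifiedMsdsSearch` are hypothetical, the index is assumed to be keyed by CAS number only, and the per-source fetchers are stand-ins for the crawling and extraction programs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MsdsRecord:
    """One MSDS hit (hypothetical structure, for illustration only)."""
    cas: str     # CAS number identifying the compound
    source: str  # data source name, e.g. "ILO"
    url: str     # original URL of the MSDS document
    fmt: str     # "Text", "HTML" or "PDF"


# A fetcher takes a CAS number and returns what one source holds for it.
Fetcher = Callable[[str], List[MsdsRecord]]


@dataclass
class UnifiedMsdsSearch:
    """Cache-first search over several sources (assumed design)."""
    fetchers: Dict[str, Fetcher]
    cache: Dict[str, List[MsdsRecord]] = field(default_factory=dict)

    def search(self, cas: str) -> List[MsdsRecord]:
        # Serve from the local cache when the compound is already indexed.
        if cas in self.cache:
            return self.cache[cas]
        results: List[MsdsRecord] = []
        for fetch in self.fetchers.values():
            try:
                results.extend(fetch(cas))
            except Exception:
                # One unreachable source should not fail the whole query.
                continue
        # Store the merged result so later queries avoid the network.
        self.cache[cas] = results
        return results
```

With stub fetchers in place of real crawlers, a second `search` call for the same CAS number returns the cached list without contacting any source. A real system would also need cache expiry and indexing by name, formula, and structure, which this sketch leaves out.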
Document Type: Thesis (学位论文)
GB/T 7714
李海波. MSDS统一检索引擎原型系统的设计与实现[D]. 过程工程研究所. 中国科学院过程工程研究所,2009.
Files in This Item:
File Name/Size: 10001_20062800413303 (2074KB); Access: Open Access; License: CC BY-NC-SA
