|Other Abstract||The safety of chemical substances concerns not only professionals, for whom it is a matter of occupational safety, but also many aspects of everyday life. Awareness of chemical safety is essential to a healthy lifestyle. A Material Safety Data Sheet (MSDS) is a commercially valuable document that provides chemical safety information; it is needed by manufacturers that produce or supply chemicals and by researchers and others who use them. At present, publicly available MSDS sources are scattered across the Internet, and searching is usually possible only within a single-source MSDS database. An MSDS search engine that can search multiple MSDS databases with a single query would therefore facilitate MSDS searching on the Internet.
Based on an analysis of the current state of MSDS data sources and the features of existing search tools, this paper designs a framework for a unified MSDS search engine. Alongside real-time online searching of multiple MSDS sources, it proposes establishing a local MSDS cache to improve the response speed of the unified search engine.
This paper establishes a prototype of the unified MSDS search engine. Its implementation includes automatically discovering and generating the retrieval mode, automatically retrieving search result pages, and using data extraction methods to obtain the compound identification data needed to index MSDS documents by chemical. By integrating techniques for analyzing and extracting data from documents in text, HTML, or PDF format, MSDS documents are retrieved, cached, and indexed by compound through hyperlink analysis and deep web searching. A series of experiments was carried out not only to verify the feasibility of the method but also to build the local MSDS cache, which significantly expands the coverage of the compound index. During development, a small library of general-purpose functions was built so that functions can be shared by different extraction procedures. This both improves programming efficiency and reduces the coupling between the procedures.
The prototype of the unified MSDS search engine established in this thesis can search different data sources through a single query in the form of a chemical name, formula, CAS number, or chemical structure. Several MSDS sources, including Alfa Aesar, ILO, ChemExper, and FisherSCI, were used in the experiments to verify the proposed approach. The cached MSDS data in the prototype system number about 50,000, in both English and Chinese and in three formats (text, HTML, PDF). The prototype shows users not only the retrieved MSDS but also related information, such as the source, producer, original site URL, and description of each MSDS, so that users can compare MSDS from different sources. In addition, an interface has been created to integrate the prototype into the ChemDB Portal system, which expands ChemDB Portal's coverage of both data types and data.
In short, this paper establishes a prototype of a unified MSDS search engine that makes it much easier to find MSDS on the Internet, and the methods used to build it can be applied to other similar data.|