The increasing availability of freely accessible data for chemical compounds and their associated properties and web links is driving a significant shift in the way research is carried out. Such data may be used to test a hypothesis in the laboratory or to build computational models, and gathering it traditionally involved scouring the peer reviewed literature, either online through paywalls or physically within the walls of a library, and in some cases perusing privately collected data on the subject. Today, the multitude of public databases, freely distributed vendor compound libraries and directly shared lab notebooks makes it possible for scientists to prospectively assemble a large knowledgebase. The current rise of open lab notebook techniques means that an increasing number of scientists make chemical information freely and openly available to the entire community, as a series of micropublications released shortly after the conclusion of each experiment.

Despite this major shift, there is an important caveat: many hosts of online data do not give proper consideration to what may well be the most important consumer of their data, namely software algorithms, especially at a time when the ongoing development of the semantic web depends so heavily on algorithms and mappings. The reuse of such data may also require data licensing, and we have suggested some rules that could be helpful. A scientific publication is typically downloaded and perused by hundreds or perhaps thousands of humans, but the number of people who carefully study its data content, by examining the constituent chemical structures, physical properties, reaction schemes, spectral assignments, etc., is usually just a handful. The inherently low scalability of scientists' time stands in stark contrast with the ever increasing ability of software algorithms to assimilate vast quantities of data and deliver meaningful insights that could not have been obtained by more traditional means. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods.

The ability of a well-designed informatics platform to productively use as much data as can be made available means that, in principle, every publicly available scientific data point that is relevant to a machine learning algorithm's domain should be injected into the training set. Were this ideal state of affairs to be achieved, every hard-won experimental result would have its chance to inform future experiments, rather than languishing in obscurity: chemists would benefit from all prior art within the field, and the quality of insights would improve over time as the volume of data increases and algorithms are refined. While there have been many efforts to extract such data from the literature, there are major flaws with the methods used for extraction. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine readable formats, and we propose that this trend be accompanied by a thorough examination of data sharing priorities.
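To make the idea of an "unambiguous machine readable format" concrete, here is a minimal sketch, assuming the open-source RDKit toolkit; the record shown is purely illustrative and not drawn from any particular dataset. The point is that a structure published as a SMILES or InChI string, unlike one published as a drawn image, can be parsed, validated, canonicalized, and converted into model-ready descriptors with no human in the loop.

```python
# Minimal sketch (assuming the open-source RDKit toolkit) of consuming a
# machine-readable compound record exactly as an algorithm would.
from rdkit import Chem
from rdkit.Chem import Descriptors

# Illustrative record, as it might appear in an open dataset.
record = {"name": "aspirin", "smiles": "CC(=O)Oc1ccccc1C(=O)O"}

# Parsing doubles as validation: a record a machine cannot parse is lost to it.
mol = Chem.MolFromSmiles(record["smiles"])
if mol is None:
    raise ValueError("SMILES failed to parse; record is ambiguous to a machine")

# Canonical identifiers allow deduplication and cross-linking between datasets.
print("InChI:", Chem.MolToInchi(mol))
print("Canonical SMILES:", Chem.MolToSmiles(mol))

# Computed descriptors can feed directly into a training set.
print("Molecular weight:", Descriptors.MolWt(mol))
```

Every step above fails loudly or succeeds completely, which is precisely the property that data published as images, PDFs, or prose lacks.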