Training ML Models on Published Chemical Data: Is It Ethical?

Is it unethical to gather published chemical data, build a dataset from it, and train algorithms on that dataset? How do you ensure that what you are doing complies with ethical standards? What should you do?

Last Friday, I was doing a practice talk for my thesis outline presentation. During the question-and-answer session, one student raised a very interesting (not to mention timely) concern. She asked whether it is ethical to gather my dataset and train ML models on it. For context, my thesis is about using machine learning to screen catalysts for the hydrogenation of esters into alcohols. My method involves manually collecting ester hydrogenation reaction data points from peer-reviewed journals.

I have read about this before. In essence, training ML models on chemical datasets published in peer-reviewed journals is not unethical per se. The rationale is that once the results of a scientific study are published, they become part of the scientific body of knowledge. Also, scientific facts are not copyrightable, only the way they are expressed. But I get where the student was coming from and her concerns about using AI/ML in a scientific context.

I told the student that I believe the process of taking data from journals and scientific articles and training ML models on them is not unethical. The issue only arises if I claim that the data I used for my study came from my own pursuits and research. Since the student was concerned about intellectual property rights, I also mentioned something about Fair Use. I posited that since I am building something from published results, I am not in danger of committing any ethical or IP rights violations.

Days later, it still had me thinking, so I decided to read more about it. Apparently, training an ML model on published results is not inherently unethical, as I suspected. However, whether it is ethical depends on how the data was obtained, whether intellectual property rights were respected, and how the resulting model is used.

How the data was obtained

The first question we must ask ourselves is how we got the data and how we processed it. If the data points we use to train ML models contain personal information (e.g. name, address, email), ethical issues may arise, especially if no informed consent was given. This is not just an ethical issue but also a potential breach of data privacy laws.

Furthermore, scientific facts are not copyrightable as I mentioned before. Experimental observations like percent yields, selectivities, enantiomeric excess, solvent systems, etc., are facts about the scientific world. Copyright laws protect creative expression, not scientific facts. That is the reason why we can extract numerical information from peer-reviewed journals, create a dataset, and train machine learning models on it.

In fact, I also believe that good-natured researchers want their data to be transformed, reanalyzed, criticized, aggregated, plotted, fed into models, built upon, and so on.

And speaking of creating a database out of reaction data, there is such a thing as database rights. A database right is a sui generis right, recognized in jurisdictions such as the EU and the UK, that protects the investment that goes into compiling a database out of data such as scientific results. This means that if someone put substantial time and effort into creating a database, for example by manually collecting data from many sources, then their database is protected by this right.

As long as you did not just copy an entire database from Reaxys or SciFinder and train your model on it, you are not infringing anyone’s database rights. You can, however, read papers, extract facts manually, and build your own dataset from scratch. That’s fully allowed. And since I collected data from hundreds of papers about ester hydrogenation, I think database rights would cover my own dataset as well.

If anything, it’s like I bought Lego pieces and built something from them.

Respecting intellectual property rights

Another important factor to consider is whether we respected intellectual property rights. Papers published in peer-reviewed journals are subject to licensing terms, such as Creative Commons licenses. Using data in violation of these terms is an ethical issue, and often a legal one.

In addition, using reaction data points without proper attribution to the authors who generated the data fails to respect their intellectual contribution. It is considered an act of plagiarism.

We can also consider whether the dataset contains sensitive information like names, emails, or addresses. Ethical issues may arise if machine learning models are trained on such a dataset without informed consent from the people concerned. Thankfully, chemical data is non-personal.

My take on this is that as long as we give proper credit to the authors who generated the data, we’re not crossing any ethical or legal lines. What gets researchers into trouble mostly includes:

  • Copy-pasting figures or tables without permission
  • Reproducing large chunks of text or diagrams
  • Redistributing the papers themselves
  • Claiming that someone else’s dataset is your own

The bigger takeaway here is to always respect the researchers whom you got the data from. Ethics in ML/AI for chemistry will always boil down to transparency, attribution, data handling, and responsible use.
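
One way to make attribution concrete is to store the source of every data point right next to the value itself. The following is a minimal Python sketch, not the actual pipeline used for the thesis: the ReactionRecord fields, the helper function names, the catalyst labels, the yields, and the DOIs are all hypothetical placeholders made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionRecord:
    """One manually extracted data point, tagged with where it came from."""
    substrate: str        # ester that was hydrogenated
    catalyst: str         # catalyst label (placeholder names below)
    yield_percent: float  # reported yield in percent (placeholder values below)
    source_doi: str       # DOI of the paper the value was read from


def records_missing_source(records):
    """Return records with no DOI so they can be fixed before training."""
    return [r for r in records if not r.source_doi.strip()]


def sources_to_cite(records):
    """Return the sorted, de-duplicated DOIs that the dataset draws on."""
    return sorted({r.source_doi.strip() for r in records if r.source_doi.strip()})


# Placeholder entries only: these are NOT real measurements or real DOIs.
records = [
    ReactionRecord("methyl benzoate", "Ru-A", 95.0, "10.0000/example.1"),
    ReactionRecord("ethyl acetate", "Ru-A", 88.0, "10.0000/example.1"),
    ReactionRecord("methyl hexanoate", "Mn-B", 72.0, "10.0000/example.2"),
    ReactionRecord("dimethyl oxalate", "Cu-C", 60.0, ""),  # source not recorded
]

if __name__ == "__main__":
    print("Sources to cite:", sources_to_cite(records))
    print("Records without a source:", records_missing_source(records))
```

Keeping the DOI in the same record as the number means the list of papers to credit can be generated from the dataset itself, and any data point without a source is flagged before it reaches a model.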

How the resulting model is used

Lastly, we must think about how we’re going to use the resulting model. In medical sciences, there’s this concept of beneficence and non-maleficence. Beneficence is the ethical principle of doing good and acting in the best interests of others, while non-maleficence is the ethical principle of “do no harm” that obligates professionals to avoid intentional harm towards others.

We must be careful that our model, trained on data from peer-reviewed journals, does not cause harm to any specific demographic. Our model must also be in the best interest of others. As long as our research can be used to develop new medicines, improve processes, or promote sustainability, it can be considered ethical.

Now, let’s go into Fair Use territory. Fair Use is a legal doctrine that allows limited use of copyrighted material without permission from the owner, often for purposes like criticism, comment, news reporting, teaching, scholarship, or research. I want to single out the word “research” because my thesis is part of an undergraduate research requirement. More than that, my goal with my research is to promote more efficient ways to screen catalysts for a reaction that is beneficial for the environment, medicine, and the chemical industry.

In Fair Use cases, four factors are often considered (the fifth one below is my own addition):

  • The purpose and character of your use: Have you taken material from another work and transformed it by adding new expression or meaning?
  • Nature of the copyrighted work: Again, scientific facts cannot be copyrighted. So, we have some leeway here, since copyright laws mostly protect creative works.
  • Amount and substantiality of the portion taken: Did you take the heart and soul of the work? The less you take, the more likely you’ll be fine.
  • Effect of the use upon the potential market: Did you take away from the original work by depriving its owners of potential income? If not, you’ll likely be okay.
  • Are you a good or a bad person: I added this one myself, but depending on your intentions in using someone else’s body of work, you may or may not be in trouble.

The bottom line here is that we can try to claim Fair Use on all of these factors. For one, we are transforming the original data into something that is completely different. Transformation is an important issue in deciding whether a use meets the first factor of the Fair Use test. Second, “ground truths” like reaction points are not copyrightable. Third, the heart of the copyrighted work may be the efficiency of the catalyst, not the ability to predict the performance of another. Fourth, the model we’re training is not a substitute for their work, so we’re not depriving them of income. Lastly, the model we produced must be beneficial in one way or another.

The student was right in asking that question. To be honest, I didn’t know the answer on the spot, but I was challenged and somewhat afraid of the possible lawsuits I might encounter. Reading up on it, I came to the realization that we must always be wary of the possible ethical and legal issues of our research. At a time when ethics and the humanities are looked down upon, it’s important for my fellow researchers to step back and see whether we’re violating any ethical principles in our research. The pursuit of knowledge must always be for the benefit of humanity and must not trample on someone else’s rights and freedom of expression.
