Machine Learning in the eTMF: Human-Aided Active Learning Processes

31 January 2020
By Jay Smith, Head of Product, Trial Interactive

Every day we get closer to being able to automate the document processing and administration that consumes so many hours and can distract from treatment and research. As we can all attest, clinical trials are among the largest administrative endeavors in the wide world of business operations, so we should have every reason to be enthusiastic about any technology that can reduce this effort while improving quality.

The trial master file (TMF) is a focus of at least some of this excitement. The TMF is one of several places where a lot of manual effort is required for the collection and classification of the documentation used to ensure GCP compliance. This effort is a time-consuming back-office requirement that is necessary for regulatory compliance during clinical research.

Manual Administration

During a study, documentation comes in from many clinical sites, usually in many countries and in many languages. This documentation is either classified to the site’s metadata specifications or not classified at all. The documents are often scans that carry handwritten notes and signatures. What we end up with is a mountain of documentation that must be identified and tagged with metadata for presentation to auditors during an agency inspection. The TMF Reference Model requires a very specific set of metadata and a filing structure, which is recommended by regulatory agencies as the accepted standard.

Opportunities for Automation

  • Classification

    Imagine a world where all of this administration could be automatic! This potential is one of the more exciting frontiers for machine learning. In a perfect world, machine learning technology could be used to collect, classify, verify, and archive the bulk of the documentation collected during the trial. This would not just save time, but more importantly, provide a higher level of repeatable quality and enable a more immediate inspection readiness. Sponsors could receive the agency auditors and find documentation on demand. 

    If we can automatically collect and classify this information and successfully extract all of the metadata from these documents, then we can run some fruitful analysis. We can identify rules of thumb and algorithms to verify we have the correct information, confirm nothing is missing, and automatically find anomalies and quality issues. For example, we can ensure that all sub-investigators are listed in the delegation logs; that we have collected CVs, licenses, and training for all document signers; and that the documents have been signed and dated in the proper order. Many of these handwritten documents are crucial for verifying that the clinical trial personnel followed the correct procedures. Unlocking these documents’ metadata is the key to verifying quality.

  • Anonymization

    Since many studies are Phase III and double-blind, the documentation that comes into the TMF must be redacted to hide personal health information, not to mention meet privacy regulations like GDPR. Automated detection of email addresses, bank account numbers, social security numbers, dates of birth, and other tell-tale identification markers can help ensure that personally identifiable information (PII), both health-related and otherwise, is automatically and permanently redacted before it is stored in the TMF.

  • Translation

    Another example would be to support the translation of documents into native languages in multicountry, multilingual studies. While machine learning cannot fully automate the process in cases where a certified translation is required, it can certainly simplify it, assisting the translator every step of the way and ensuring that keywords and phrases that have clinical significance are correctly translated or coded.
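To make the anonymization idea above a little more concrete, here is a minimal sketch of pattern-based redaction. The patterns, labels, and the `redact` function are our own illustrative assumptions, not a description of any product. Real PII detection needs locale-aware rules, OCR-tolerant matching, and human review.

```python
import re
from typing import Dict, Tuple

# Hypothetical patterns for illustration only. They cover one email format,
# US-style social security numbers, and ISO-style dates (any date, not only
# dates of birth).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count the hits."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn("[" + label + "]", text)
        counts[label] = hits
    return text, counts


sample = "Contact jane.doe@example.org, SSN 123-45-6789, born 1975-03-14."
redacted, counts = redact(sample)
print(redacted)
print(counts)
```

The hit counts are returned alongside the text so that a reviewer can spot-check documents where a surprising number of items (or none at all) were redacted.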


The challenges are immense, however. Have you ever been asked to verify you are a human in an online form using reCAPTCHA? What you might not know is that when you type in the letters and numbers pictured, you are teaching machine algorithms to identify those letters and numbers. The technology is being fed with crowd-sourced information to make it smarter, but it is still far from perfect. Similarly, have you ever had difficulty reading someone’s handwriting? In many cases, advanced AI is not able to decipher that content any better than you can.

What this all means is that real challenges exist with scan quality and handwriting legibility as they apply to extracting the metadata necessary for ML to do its job. These issues compound when factoring in different languages and the optical character recognition (OCR) necessary to extract text from a scan. When you are categorizing information inside the TMF, it is not enough to know that a document is classified a certain way. You also have to know the site location, the contacts involved, and the country it came from. You have to classify all this documentation in all these ways to be successful.


Classification algorithms can generally verify that documentation is not duplicated. They can also use statistical analysis to compare document images and measure, as a percentage, how similar they are. However, what is most interesting is the learning part of ML. The technology can be “trained” against a model, which means it can learn through processing training data. If a type of document has been classified in a specific way many times, those examples can be sent through the ML algorithm so that it starts to learn what distinguishes one document type from another. For established vendors like TransPerfect’s Trial Interactive, there is a lot of training data available: millions of classified documents can be sent through an ML algorithm to better train it. From there, one can start to apply predictive models: take the training data and statistical analysis for classification, compare a new document against them, and begin to predict what that document is and how to classify it within the eTMF.
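As a rough illustration of the duplicate and similarity checks described above, here is a minimal sketch. Note the swap: it compares extracted text rather than document images, because that can be shown with the Python standard library alone. The function names and sample documents are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text for exact-duplicate checks."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity_percent(a: str, b: str) -> float:
    """Rough textual similarity between two documents, from 0 to 100."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(docs: dict) -> list:
    """Return (earlier_id, later_id) pairs whose normalized text is identical."""
    seen = {}
    pairs = []
    for doc_id, text in docs.items():
        key = fingerprint(text)
        if key in seen:
            pairs.append((seen[key], doc_id))
        else:
            seen[key] = doc_id
    return pairs


# Hypothetical extracted text for three incoming documents.
docs = {
    "doc-1": "Curriculum Vitae  of Dr. A. Example",
    "doc-2": "curriculum vitae of dr. a. example",
    "doc-3": "Delegation of Authority Log, Site 101",
}
print(find_duplicates(docs))
print(round(similarity_percent(docs["doc-1"], docs["doc-3"]), 1))
```

Exact hashing only catches documents whose text is identical after normalization. Scans of the same page rarely are, which is why a percentage score with a reviewed cutoff is usually the more useful of the two checks.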

Once we identify where in the document we will find essential metadata, a variety of tools, including natural language processing (NLP) and zonal OCR, can extract the metadata to better classify the document. We can then run set comparisons to look at all the data collected from a document and compare it against what we know about an investigative site, allowing the identification of anomalies and possible issues.
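The set comparison mentioned above can be sketched in a few lines. This assumes the signer names have already been extracted from a document by OCR or NLP; the `SiteRecord` shape, its field names, and the sample names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """What is already known about an investigative site (hypothetical shape)."""
    site_id: str
    delegation_log: set = field(default_factory=set)  # names on the delegation log
    cvs_on_file: set = field(default_factory=set)     # names with a CV collected


def find_anomalies(signers: set, site: SiteRecord) -> dict:
    """Compare names extracted from a document against the site record."""
    return {
        "not_on_delegation_log": sorted(signers - site.delegation_log),
        "missing_cv": sorted(signers - site.cvs_on_file),
    }


site = SiteRecord(
    site_id="101",
    delegation_log={"A. Example", "B. Sample"},
    cvs_on_file={"A. Example"},
)
extracted_signers = {"A. Example", "B. Sample", "C. Unknown"}
report = find_anomalies(extracted_signers, site)
print(report)
```

In practice the hard part is upstream of this step: handwritten names need to be normalized (spelling variants, initials, titles) before a set difference means anything.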

It's Not Magic... Yet

As innovators and clinical team partners, we want operational leaders to be able to set realistic expectations for emerging technology. Artificial intelligence and machine learning, as they apply to clinical processes, and particularly the trial master file, have yet to overcome some real obstacles in their ability to reliably understand the information being fed into their algorithmic “brains.” Let’s face it, we all regularly see documentation that is borderline or sometimes completely unreadable, often laden with very idiosyncratic handwritten information. Bluntly speaking, ML is not ready to handle human nuance. We recommend an approach called human-aided active learning, in which humans QC machine-determined results and the ML model learns from each human decision. This allows the TMF to stay compliant while making the process much more efficient. Presently, we still need our human clinical professionals to break down documentation into patterns that algorithms will understand.
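Here is a minimal sketch of what a human-aided active learning loop can look like. Everything in it is an assumption made for illustration: the toy `KeywordModel`, the crude vote-share confidence, the 0.8 threshold, and the `reviewer` stand-in. A production system would use a properly trained classifier with calibrated confidence scores.

```python
from collections import Counter, defaultdict


class KeywordModel:
    """Toy classifier: scores each label by how often it has seen the document's words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count

    def learn(self, text, label):
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        """Return (best_label, confidence); confidence is the best label's share of the votes."""
        words = text.lower().split()
        scores = {
            label: sum(counts[word] for word in words)
            for label, counts in self.word_counts.items()
        }
        total = sum(scores.values())
        if total == 0:
            return None, 0.0  # nothing recognized, so no suggestion at all
        best = max(scores, key=scores.get)
        return best, scores[best] / total


def process(doc, model, ask_human, threshold=0.8):
    """File confident predictions automatically; send the rest to human QC and learn from the answer."""
    label, confidence = model.predict(doc)
    if confidence >= threshold:
        return label, "auto"
    label = ask_human(doc, label)  # the reviewer confirms or corrects the suggestion
    model.learn(doc, label)        # every human decision becomes new training data
    return label, "human"


model = KeywordModel()
model.learn("curriculum vitae education employment history", "CV")
model.learn("delegation of authority log site staff signatures", "Delegation Log")


def reviewer(doc, suggestion):
    # Stand-in for a real QC screen; this reviewer always answers "CV".
    return "CV"


first = process("resume dr example", model, reviewer)
second = process("resume dr example", model, reviewer)
print(first, second)
```

The first pass goes to the reviewer because the model has never seen any of the words; the second pass is filed automatically because the model learned from that decision. The toy also shows the risk: a vote-share score can be confidently wrong, so the threshold and regular quality reviews still matter.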

The Auto-Filing TMF

The electric car company Tesla® has generated a lot of press with its claims of self-driving capability. CEO Elon Musk used SAE International’s “Levels of Driving Automation” classifications, published in 2014, to measure his company’s success, with the original goal of achieving Level 5, “steering wheel optional,” by 2018. While this is a laudable goal, it’s important to be skeptical about these kinds of claims. Just for fun, here is a depiction of the classifications for a “Self-Filing TMF” based upon the SAE model:


0 - All manual processes. Teams of document classifiers. Lists of essential documents. Regular, internal quality review processes. Manual agency inspection.

1 - ("hands on"): Also exists now in Trial Interactive and other products. OCR and ML translations available. Document classification suggestions against essential documents. Some metadata extraction for critical documents such as the 1572 form.

2 - ("hands off"): The automated TMF can classify documentation by itself and can perform limited metadata extraction and verification steps. TMF actively anticipates documents through CTMS processes. Regulatory and QA must still monitor the TMF with regular quality reviews.

3 - ("eyes off"): The TMF self-processes documents and can handle situations that call for a response, like opening queries. However, the TMF does not really understand what essential documents are needed and cannot handle amendments and special situations very well. The TMF cannot audit itself yet.

4 - ("mind off"): Once the TMF is configured, the sponsor can safely turn their attention away from TMF tasks, i.e., they can focus on the clinical trial. Fully self-auditing, no attention is ever required for quality, i.e., the sponsor may safely ignore the TMF for normal trials. However, configuration is still required every time to properly set up the trial, train the models, etc.

5 - ("UI optional"): No human intervention is required at all. The TMF self-configures based on the protocol, self-files, self-processes, and self-audits.

I think we can agree that while Class 5 would be impressive, achieving Class 2 or Class 3 would provide most of the efficiencies while ensuring compliance and a high level of quality. Machine learning is not ready to take over document processing and, at this stage, should be seen as a helping hand for document specialists and TMF managers.
