Also, commenting on Lukas' question at 22:08: data in the biological world differ from NLP or CV data in various ways, just to name a few:
1. In biology, experimental data is only an estimate of the physical ground truth and is often inconsistent, whereas in many other domains the corpus a model is trained and tested on is essentially the same as what it sees in the real world. The intrinsic noise therefore caps how well a model can even be evaluated: since the data is not ground truth, there is a gap between model output and reality even if the model is perfect on the test data (see the sketch after this list).
2. The lack of data is real, partly because bio data is expensive. For CV, an annotator can label a dozen or even a hundred pictures per hour for less than $100. But in the bio world, a single row of data can cost $100-$1,000 on average, and over $10k or more for things like protein structures, and can take days or weeks to generate. It also requires high-level expertise to run these experiments, and replicates are often needed to characterize the intrinsic variance of the data.
3. The format of bio data is extremely diverse. For an LLM, text is all you need; add voice and video and you can train Sora. But in biology there are hundreds of tasks (structure, affinity, stability, toxicity...), and each task has many different experiment types.
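To make point 1 concrete, here is a minimal simulation sketch (Python with numpy/scipy; the noise level is an arbitrary illustration, not any real assay's) of how experimental noise alone caps the score that any model, even a perfect one, can reach against a noisy test set:

```python
# Illustrative simulation: even a model that predicts the true value perfectly
# cannot exceed the correlation ceiling set by assay noise in the test labels.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 5000
truth = rng.normal(size=n)                  # unobservable physical ground truth
assay_noise = 0.8                           # made-up noise level of the experiment
labels = truth + rng.normal(scale=assay_noise, size=n)  # what the test set records

perfect_model = truth                       # a hypothetically perfect predictor
rho, _ = spearmanr(perfect_model, labels)
print(f"Perfect model vs noisy labels: {rho:.2f}")       # well below 1.0

# Replicate-to-replicate agreement gives the practical ceiling for any model:
replicate = truth + rng.normal(scale=assay_noise, size=n)
ceiling, _ = spearmanr(labels, replicate)
print(f"Assay repeatability ceiling: {ceiling:.2f}")
```

Replicate-to-replicate agreement is the usual way to estimate this ceiling in practice, which is one reason the repeats mentioned in point 2 matter.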
Well, if you are interested in more about this, my Twitter is also NachuanShan. I work at BioMap as a data product manager, building protein language models.
As mentioned in a previous comment, a significant challenge in applying machine learning to drug discovery projects lies in the scarcity of robust and well-structured data. For instance, a major factor contributing to the failure of drug discovery endeavours is suboptimal ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties. The landscape could be transformed if we could develop models capable of predicting the outcomes of in vitro assays, allowing us to streamline the selection of well-optimized candidates for preclinical studies. However, the publicly available ADMET data is notably deficient in both quality and quantity, leading to models that lack robustness.
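To illustrate the kind of in vitro assay predictor described above, here is a minimal sketch assuming RDKit and scikit-learn are available; the SMILES strings and endpoint values are invented placeholders, not real ADMET data:

```python
# Hypothetical ADMET-style regressor: Morgan fingerprints -> random forest.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles):
    """Turn a SMILES string into a 2048-bit Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Placeholder molecules and a made-up in vitro endpoint (e.g. a stability readout).
train_smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCN(CC)CC"]
train_y = [0.91, 0.35, 0.60, 0.72]          # invented values for illustration

X = np.stack([featurize(s) for s in train_smiles])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, train_y)

# Score a new compound against the trained model.
print(model.predict(np.stack([featurize("CCOC(=O)C")])))
```

The model class here is almost beside the point; as noted above, it is the quality and quantity of the underlying assay data that limits robustness.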
One issue for LLMs in drug discovery is a kind of Heisenberg uncertainty principle: the measurement perturbs the system. For example, scRNA-seq generates big data, great for ML/LLMs, but the data doesn't reflect the intact tissue because you need to dissociate the cells to do the sequencing.
Excellent questions by Lucas. Insightful discussion.
.. becoming comfortable with being uncomfortable ❤️
Great conversation. Love this topic.
Great questions 👍
Very insightful and informative
The brain is not the mind
Demis gonna win #DeepMind #EZ
Where is the data going to come from? There are strict HIPAA regulations, especially in the USA.
Interviewer really irritating