Marker: This Open-Source Tool will make your PDFs LLM Ready

Prompt Engineering

Views: 66,040

Published on 22 Jan 2025

Comments • 139

@engineerprompt 7 months ago
If you are interested in learning more about how to build robust RAG applications, check out this course: prompt-s-site.thinkific.com/courses/rag
@ilianos 7 months ago ⁺⁸
🎯 Key points for quick navigation:
00:00 *📄 Introduction and Challenges with PDFs*
- Introduction to the video topic,
- Challenges of extracting data from PDFs for LLM applications,
- Different elements and structures in PDFs complicating extraction.
01:09 *🔧 Existing Approaches to PDF Conversion*
- Overview of methods to convert PDFs to plain text,
- Use of machine learning models and OCR for extraction,
- Comparison of PDFs to Markdown for ease of processing.
02:17 *🛠️ Introduction to Marker Tool*
- Introduction to the Marker tool for converting PDFs to Markdown,
- Comparison with other tools like Nougat,
- Performance and accuracy benefits of using Marker.
03:36 *📚 Features of Marker*
- Supported document types and languages,
- Removal of headers, footers, and artifacts,
- Formatting of tables and code blocks, image extraction,
- Limitations and operational capabilities on different systems.
05:00 *📝 Licensing and Limitations*
- Licensing terms based on organizational revenue,
- Limitations in converting equations and formatting tables,
- Discussion on practical limitations noticed in usage.
05:54 *💻 Setting Up and Installing Marker*
- Steps to create a virtual environment for Marker,
- Instructions for installing PyTorch based on OS,
- Detailed steps to install Marker and optional OCR package.
07:31 *🧪 Example Conversion Process*
- Steps to convert a single PDF file to Markdown,
- Explanation of command parameters and process flow,
- Initial example with a scientific paper.
10:10 *📊 Reviewing Conversion Output*
- Review of the output structure and accuracy,
- Metadata extraction and image handling,
- Preview of converted Markdown and comparison with the original PDF.
12:13 *📜 Additional Examples and Output Review*
- Example with Andrew Ng’s CV and another paper,
- Review of the extracted content and any noticed issues,
- Importance of secondary post-processing for accuracy.
13:34 *🎥 Conclusion and Future Content*
- Summary of Marker tool’s utility and performance,
- Announcement of future videos on related topics,
- Invitation to subscribe for more content.
Made with HARPA AI
@ernestuz 7 months ago ⁺²⁷
Man, a couple of weeks ago I was fighting this PDF chaos. Thanks for your video.
@engineerprompt 7 months ago
Glad it was helpful.
@rizkiananda352 7 months ago
How do you do it? I failed as a non-coder. How do I extract tables as well?
@whoami5955 2 months ago
@rizkiananda352 Why are you expecting to pass?
@gregsLyrics 7 months ago ⁺²
Brilliant vid - it is a godsend. OCRing a PDF is just not workable, period. I gave up on attempting to parse PDFs. This new information is amazing and I am once again excited.
@engineerprompt 7 months ago
Glad it was helpful!
@Alvaro-cs7zs 7 months ago ⁺⁴
Thanks for the video. I tried it, but it made a bit of a mess of the tables in my PDF. The rest of the text is converted properly, but only some of the tables come out nicely structured.
@greymooses 7 months ago ⁺³
If you do make a video about scraping data, please go over content that requires JavaScript to load. It’s been difficult to find a clear guide specifically for capturing this data for LLM usage. I loved this video, thank you!
@engineerprompt 7 months ago
I haven't looked into it before, so let me see what I can come up with.
@kineticraft6977 4 months ago ⁺²
If anyone is having trouble where it's running but not actually placing the new files in the output directory: if you followed the GitHub example command, the "--min_length 10000" option is what's doing it. It goes through the whole process and then decides the document is too short. Either reduce that number to a much lower number of characters or remove the option entirely. It took me 30 minutes of hunting through TMP folders for the files before I finally figured it out.
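A minimal sketch of the workaround described in the comment above, assuming the batch command has the shape quoted elsewhere in this thread (`marker <input_dir> <output_dir> --workers N`). The helper name `build_marker_batch_command` is hypothetical; it only assembles an argument list and leaves `--min_length` out unless you explicitly ask for it, so short PDFs are not silently skipped.

```python
from typing import List, Optional


def build_marker_batch_command(
    input_dir: str,
    output_dir: str,
    workers: int = 1,
    min_length: Optional[int] = None,
) -> List[str]:
    """Assemble a `marker` batch command as an argument list.

    Hypothetical helper: it uses only the flags quoted in this thread
    (`--workers`, `--min_length`) and does not run anything itself.
    """
    cmd = ["marker", input_dir, output_dir, "--workers", str(workers)]
    # Add --min_length only on request; per the comment above, a high
    # value means short documents produce no output files.
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    return cmd
```

The returned list can be passed to `subprocess.run` as-is; printing it joined with spaces gives the equivalent shell command.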
@ai-whisperer 7 months ago ⁺⁷
Thanks for covering Marker, this is brilliant!!
Would love to see batch processing of PDFs using Marker.
Also, for the web scraping projects, can we include one where we scrape apartment rental data (that keeps changing/evolving) from websites like Craigslist, store it persistently in a vector store or DB, and then run queries on that info?
@engineerprompt 7 months ago ⁺³
Let me see what I can do.
@anandu06 5 months ago ⁺¹
Have you tried scanned documents instead of digital PDFs? And handwritten text as well?
@PritiSurange 28 days ago
The Git repository shown in the video is not mentioned in the description box.
@Nick_With_A_Stick 7 months ago ⁺³
Marker only used 4 GB of VRAM out of an A6000. Can you increase the batch size and get some more speed gain? Or is it stuck at that speed regardless of the batch size? 100 seconds per page is a huge improvement over Nougat, but still very slow 😢
I love the video tho, I struggled with this one time for hours making a custom script to scrape this one PDF. Definitely gonna use Marker sometime soon.
@engineerprompt 7 months ago ⁺²
I think you want to run multiple files in a batch. That will give you the best performance. I also came across MegaParse (github.com/QuivrHQ/MegaParse), which is built on top of LlamaParse. That is not 100% local, though.
@Nick_With_A_Stick 7 months ago
@engineerprompt awesome!!!!! Thank you❤️❤️❤️❤️❤️!!!!!!!!!!!
@samarthmath2952 7 months ago ⁺¹
I am getting a float error. I have installed the CUDA version. Any suggestions?
@mariongully4087 7 months ago ⁺²
Very interesting. Once you have converted the PDF file, how can we give all this info to the vector database for RAG?
@engineerprompt 7 months ago ⁺²
You chunk the Markdown output the same way you would plain text files. I will put together a video on it.
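As a rough illustration of chunking Markdown output before embedding, here is a small sketch that splits on headings and then cuts long sections into fixed-size pieces. The function name `chunk_markdown` and the `max_chars` default are assumptions made for this example, not anything prescribed by Marker or by the video. Note that it treats any line starting with `#` as a heading, so `#` comments inside code blocks would be misread.

```python
import re
from typing import Dict, List, Tuple

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split Markdown into chunks, keeping each chunk's section heading."""
    sections: List[Tuple[str, str]] = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the section collected so far.
            if lines:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((title, "\n".join(lines).strip()))

    chunks: List[Dict[str, str]] = []
    for heading, body in sections:
        if not body:
            continue
        # Cut long sections into fixed-size character windows.
        for start in range(0, len(body), max_chars):
            chunks.append({"heading": heading, "text": body[start:start + max_chars]})
    return chunks
```

Each chunk keeps its heading as metadata, which can be stored next to the embedding in the vector database so that retrieved passages still carry their section context.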
@ritikeshchoube1748 7 months ago
I have PDFs of scanned documents, which are essentially images, so I am using Tesseract: I convert the pages to images and then read the text from them. But my PDFs also contain some tables, and when I generate embeddings from this text and pass them to an LLM, it is unable to answer the questions I ask about the tables...
@engineerprompt 7 months ago
My recommendation would be to run them through a multimodal model like Claude Haiku, if cost is not a big concern. You can use that directly to answer questions from scanned docs. Here is a video on how to do that. th-cam.com/video/a5OW5UAyC3E/w-d-xo.html
@drmetroyt 7 months ago ⁺²
This is really helpful for preparing PDFs before adding them to RAG. But is there any way to install this Marker application as a Docker container?
@engineerprompt 7 months ago
Yes, I am not sure about that.
@drmetroyt 7 months ago
@engineerprompt After some research I found a Docker image for Marker by dibz15 on Docker Hub, but I don't have any idea how to set up the container. A video on it would be helpful.
@drmetroyt 7 months ago
@engineerprompt There is an image by dibz15 for Marker on Docker Hub; could you make a video on installing it?
@drmetroyt 7 months ago
Docker version please?
@justthisguyyouknow666 2 months ago
Excellent explanation. Many thanks!
@volodymyrdonets4166 3 months ago
An amazing video. Many thanks, dude!
@tedp9146 7 months ago ⁺⁴
I actually tried it out today before seeing this video and sadly it produced quite messed up results for a not so complicated document. Some sections and tables were parsed perfectly, but if even some parts are scrambled, the results are useless :/
@MdAffan-ux2kf 7 months ago
def MODEL_DTYPE(self) -> torch.dtype:
AttributeError: module 'torch' has no attribute 'dtype'
I am getting this error while running the marker_single ... command.
Please help me resolve this.
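One possible cause of an error like this, offered as a guess and not a diagnosis, is that Python is importing something other than the real PyTorch package, for example a local file or folder named `torch`, or a broken install. The sketch below shows where a module would be loaded from; `locate_module` is a hypothetical helper name.

```python
import importlib.util
from typing import Optional


def locate_module(name: str) -> Optional[str]:
    """Return where Python would import `name` from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages (e.g. an empty local folder named like the module)
    # have no origin, so report their search locations instead.
    return str(list(spec.submodule_search_locations or []))
```

Calling `locate_module("torch")` inside the environment used for Marker should point into that environment's `site-packages`; a path inside your project folder would suggest a name clash.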
@gauravkumargupta9622 4 months ago
I have to mark the different regions in a scanned question-paper PDF (subjective or MCQ with subquestions). Can it do this accurately?
@Reality_Check_1984 3 months ago
I see how to run this from the terminal, but how do we import and run it in a Python file? I have had some issues.
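One way to drive the tool from a Python file without relying on its internal API is to call the command-line entry point through `subprocess`. This is a sketch under assumptions: it uses only the `marker_single` arguments quoted in this thread (`--max_pages`, `--langs`), the helper names are hypothetical, and the `runner` parameter exists only so the function can be exercised without Marker installed. A traceback quoted further down shows the CLI itself importing `convert_single_pdf` from `marker.convert`, so an in-process entry point appears to exist in that version, but its signature is not shown here.

```python
import subprocess
from typing import Callable, List, Optional


def build_marker_single_command(
    pdf_path: str,
    output_dir: str,
    max_pages: Optional[int] = None,
    langs: Optional[str] = None,
) -> List[str]:
    """Assemble a `marker_single` call using only flags quoted in this thread."""
    cmd = ["marker_single", pdf_path, output_dir]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # Passed through unchanged, e.g. "English".
        cmd += ["--langs", langs]
    return cmd


def convert_pdf(
    pdf_path: str,
    output_dir: str,
    max_pages: Optional[int] = None,
    langs: Optional[str] = None,
    runner: Callable = subprocess.run,
) -> int:
    """Run the conversion and return the process exit code (0 means success)."""
    cmd = build_marker_single_command(pdf_path, output_dir, max_pages, langs)
    result = runner(cmd, check=False)
    return result.returncode
```

With Marker installed in the active environment, `convert_pdf("paper.pdf", "OUTPUT", max_pages=2, langs="English")` would run the same command as typing it in the terminal.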
@synthclub 7 months ago
Amazing, can't wait to test it. Converting maths from PDF to LaTeX cost thousands of dollars; now it's free.
@chauyuhin5013 5 months ago
Are there ways I can also convert comments/annotations into a Markdown format?
@MalikZurkiyeh 7 months ago
When I try to convert an entire folder of PDFs using this command "marker /data/inputs /data/formatted_inputs --workers 3", I get this error: "ImportError: libGL.so.1: cannot open shared object file: No such file or directory". Any ideas on how to fix it?
@engineerprompt 7 months ago
Not sure, is the path correct?
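This kind of ImportError usually points to a missing system OpenGL library (often needed by an image-processing dependency on headless servers or slim containers) and not to a wrong input path. A quick check, as a sketch: `find_libgl` is a hypothetical helper built on the standard library's `ctypes.util.find_library`. On Debian or Ubuntu the library is typically provided by the `libgl1` package, but treat that package name as an assumption to verify for your distribution.

```python
import ctypes.util
from typing import Optional


def find_libgl() -> Optional[str]:
    """Return the name or path of the system OpenGL library, or None if it is missing."""
    return ctypes.util.find_library("GL")


def libgl_status() -> str:
    """Describe whether the OpenGL library could be located on this machine."""
    found = find_libgl()
    if found is None:
        return "libGL not found: install your distribution's OpenGL runtime package"
    return f"libGL found: {found}"
```

If `libgl_status()` reports the library as missing inside the environment where `marker` runs, installing the system package and rerunning the same command is the first thing to try.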
@baitfishing6374 5 months ago
Does Marker require a GPU to be installed on the system?
@семенантонов-ч7ф 7 months ago ⁺¹
Does it support PDF files with AMS-TeX / AMS-LaTeX math notation (amsmath)?
@engineerprompt 7 months ago ⁺¹
I am not sure. If you have a reference PDF, I can try it for you.
@семенантонов-ч7ф 7 months ago
@engineerprompt This PDF, for example, contains a huge amount of amsmath notation - www.kurims.kyoto-u.ac.jp/~motizuki/Inter-universal%20Teichmuller%20Theory%20III.pdf
@tetraocean 4 months ago
Can a chatbot send images with this data?
Normally only text is embedded, but how about images?
@tanmeshnm 7 months ago
The main challenge I faced when using nlm-ingestor was parsing SVG images from PDFs. I'm curious whether it will handle this case well.
@engineerprompt 7 months ago
Not sure, you might want to look into unstructuredio as well.
@chjpiu 7 months ago
Hi, do you know how much RAM is required for this application? I tried, but it said that it was out of memory. My laptop has 16 GB RAM w/o Nvidia GPU. Thanks a lot
@drmartinbartos 7 months ago
Around 7 minutes, having created a conda environment, you select pip not conda when installing PyTorch - any reason why? If there’s a working conda option doesn’t it make sense to keep using conda and only use pip when you absolutely have to? Just wondering.. (thanks for the video btw - had just been wondering about effective ways of making off content reliably available to RAG and the video is super-useful).
@engineerprompt 7 months ago
I usually use pip because it has most of the Python packages available. conda is somewhat limited in the Python packages it offers. conda will also work in this case, but it's more of my own habit at this point :)
@leomeza9396 6 months ago
Awesome! Thanks for sharing this!
@intellect5124 7 months ago
I would be interested to learn how to parse data from website URLs and also query the parsed data using open-source methods. We could call it a web article/new research tool.
@engineerprompt 7 months ago
Great idea. Will do that.
@jimlynch9390 7 months ago
It showed some promise, except it flaked out with an overflow error. On the pages it seemed to convert, it scrambled the data and lost some of it. These pages are primarily transactions in a table with columns separated by whitespace. The pages with plain text worked a bit better.
@engineerprompt 7 months ago
Interesting, I think it has some limitations. Hope the creator continues working on it. For tables, it might be good to use a multimodal model.
@stanTrX 4 months ago
Tabula-py or this? Which is better when it comes to extracting tables?
@danielpicassomunoz2752 7 months ago
Anything to convert to EPUB? Getting rid of headers and footers
@neurojitsu 4 months ago
Would this work for annotated PDFs? Would it be advantageous to use Marker for NotebookLM and Anthropic, or is it not necessary?
@engineerprompt 4 months ago ⁺¹
If you have PDFs, I would suggest sending them to Gemini Flash to convert them into Markdown and then feeding that to Anthropic.
@neurojitsu 4 months ago
@engineerprompt thank you
@themorethemerrier281 7 months ago ⁺¹
This sounds very interesting, but I will need to learn some Python environment basics before I can put this to the test. A solution like this could help me a lot!
@fabriciot4166 7 months ago ⁺¹
Great contribution, thank you very much!
@engineerprompt 7 months ago
Glad it's helpful.
@cristian_palau 7 months ago ⁺¹
Thank you for sharing this excellent tool!
@anuraglahon 7 months ago
What if we want to do it for many PDFs at once and then build a chatbot?
@engineerprompt 7 months ago ⁺³
Yes, there is a batch version. I am going to create an end-to-end tutorial on it.
@VenkatesanVenkat-fd4hg 7 months ago ⁺²
Great video, waiting for scraping video content....
@Naejbert 3 months ago
Is this suitable for over 250 1 MB PDFs?
@samcavalera9489 7 months ago
Thanks so much bro 🙏🙏
My question is: when your PDF has some images inside, and you want to embed the PDF for the purpose of RAG, how can you pass the information in the images to the vector DB? Is there any way to do multi-modal RAG? In the case of scientific papers, those images contain a significant amount of useful information.
Many thanks in advance 🙏
@engineerprompt 7 months ago ⁺¹
If you have images, you want to run them through a vision model (such as LLaVA) to generate their text descriptions and then embed those descriptions in the vector store along with the metadata. You can then use them directly with RAG.
@samcavalera9489 7 months ago
@engineerprompt thanks bro for your guidance! For us in academia, using RAG on scientific papers is fruitless without incorporating the figures, as most of the time, figures contain way more information than their mere descriptions in the paper. I will give your suggestion a try and see how I can resolve this problem. Thanks again bro!
@ignaciopincheira23 7 months ago
Could you add the description of each image to the text with the aim of having a single Markdown file, similar to the original PDF? This way, it would be possible to pass a file to a language model that is readable and maintains its content.
@engineerprompt 7 months ago
Yes, that is possible. I am going to create a video on multi-modal RAG which will cover this topic.
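A minimal sketch of the single-file idea discussed above: find each image reference in the converted Markdown and place a text description right after it. It assumes the converted file references extracted images with standard Markdown image syntax (`![alt](path)`), and `describe_image` is a hypothetical callback standing in for whatever vision model you use; neither is confirmed by the video.

```python
import re
from typing import Callable

# Standard Markdown image syntax: ![alt text](path/to/image.png)
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def inline_image_descriptions(
    markdown: str, describe_image: Callable[[str], str]
) -> str:
    """Append a quoted text description after every image reference."""

    def _replace(match: "re.Match[str]") -> str:
        path = match.group(2)
        # describe_image is supplied by the caller, e.g. a vision-model wrapper.
        description = describe_image(path).strip()
        return f"{match.group(0)}\n\n> Image description ({path}): {description}"

    return IMAGE_RE.sub(_replace, markdown)
```

The result is still one Markdown file, so it can be chunked and embedded like ordinary text while keeping the original image links for display.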
@maxlgemeinderat9202 7 months ago ⁺¹
Do you think this could be better than unstructuredIO?
@engineerprompt 7 months ago
I think this gives you some of the features that are in the premium version of unstructuredio.
@intellect5124 7 months ago
Very informative video. Could you try to build a system that can run on a large number of PDFs and further convert these to .md files for an LLM to query or generate specific prompts with a UI?
@engineerprompt 7 months ago
Yeah, I am thinking about it. Will post something.
@iqbalhonnur4451 3 months ago
Nice video!!! Can we use this in commercial applications?
@engineerprompt 3 months ago ⁺¹
Yes, if your revenue is less than $5M ARR. If it is more, you need to get in touch with the author for a license.
@paulmiller591 7 months ago
Perfect timing, thanks!
@engineerprompt 7 months ago
Glad it's helpful :)
@jamalnuh8565 7 months ago
I always like your content. Thank you bro
@engineerprompt 7 months ago
I appreciate that, thank you!
@anandgs 7 months ago
Thank you very much!!! I was looking for something like this for a long time. I work for a large bank but with a very small budget for my project. Due to the budget crunch we cannot afford to buy third-party tools. This sounds like a perfect fit, but since there is a limit of $5M we may not qualify to use it for free. Would you suggest going with Nougat, or do you have a better alternative for my use case? Really appreciate your content!
@engineerprompt 7 months ago
Nougat can be an option, or look into unstructuredio. I would also recommend looking into Claude or GPT-4o with vision if data privacy is not a big issue. Some of these proprietary tools have good data privacy based on their TOS.
@anandgs 7 months ago
@engineerprompt Thanks for the prompt response!!
@Lowlightu 7 months ago ⁺¹⁰
Is it better than Unstructured?
@engineerprompt 7 months ago ⁺³
It really depends on the use case and the ability to run this completely locally.
@AdarshMadrecha 7 months ago
Can you please share the GitHub URL of the solution you are talking about?
@nuluai 7 months ago
Thank you so much! Great job!!
@mohsenghafari7652 7 months ago
Hello. Thank you for your efforts and very good training. Does it work in other languages?
@engineerprompt 7 months ago
According to the repo creator, it should.
@Larsbor 7 months ago
I am uncertain about Marker. It is for scientific use, but it says it removes footers, which is where you normally put your sources and appendix links.. so?!
@navinlikenoother 7 months ago ⁺³
Hi, great video. Can you also explore ways to extract information from PowerPoint and MS Word documents? I'm asking because most corporate information is stored in these formats.
@engineerprompt 7 months ago ⁺³
Check out this library. It uses LlamaParse, but I think it will do what you are looking for. Will create a video on it if there is interest:
github.com/QuivrHQ/MegaParse
@manjula_1 7 months ago
This is very useful! Now, in the next video, show how to fine-tune any model (with some long context length like "Phi-3-mini-128k-instruct") with this Markdown data 😍😍
@engineerprompt 7 months ago
Let me see what I can do :)
@thunderwh 7 months ago
Fantastic, thanks!
@gorripotinikhileswar7087 7 months ago
Hey, can we use this offline?
@engineerprompt 7 months ago
Yes
@mzimmerman1988 7 months ago ⁺¹
thanks for sharing.
@hnb13686 7 months ago ⁺²⁶
This is not completely open source, so don't present it as such with the clarification only midway through the video.
@sobeck6900 7 months ago
What do you mean it's not completely open source?
@thowes 7 months ago
@sobeck6900 If there are restrictions on who can use the software (e.g., no commercial use), then it is not open source. Check the OSI definition of open source or the FSF definition of free software.
@anandgs 7 months ago
I had another question, are you also on Udemy?
@engineerprompt 7 months ago ⁺¹
I am not on Udemy, but I am just launching my RAG course here: prompt-s-site.thinkific.com/courses/rag
@DanielHomeImprovement 7 months ago
amazing video thx so much
@fortran57 7 months ago
Great content
@MrSuntask 7 months ago
Looks like a great tool
@someoneelse4195 7 months ago ⁺¹
Comparison with unstructured?
@drmetroyt 7 months ago
Docker version please
@Sri_Harsha_Electronics_Guthik 6 months ago
I have been dealing with this PDF garbage for 10 years. This is a good thing, but my only question is: is this better than Adobe Acrobat?
@MeinDeutschkurs 7 months ago
Yeah! 👏👏👏👏👏👏
@denijane89 7 months ago
The dumb part of Python tools is that, OK, you'll install Marker, but it will want python=3.10, while langchain and crewai work with python=3.11, and as a result you cannot automate the process because each tool resides in its own conda env. So yeah, I like what I saw, it really looks good, but I'll have to create the Markdown separately from all the other stuff I have and that's annoying.
@dezigns333 7 months ago
If you're going to use OCR then just use images of each page. Any LLM with vision can deal with it.
@ritikeshchoube1748 7 months ago
Did you find any solution?
@AaronALAI 7 months ago ⁺¹
Really amazing project, testing today!!
@engineerprompt 7 months ago
Would love to hear how your experience with it goes.
@puneetbajaj786 7 months ago ⁺¹
@engineerprompt Bro, it's not giving good output when there are 3 columns on a page. Can we do something about this?
@JanBadertscher 7 months ago
The real question is how it compares to the current SOTA "unstructured"
@christopherchilton-smith6482 7 months ago
I wonder how far away we are from arbitrarily high accuracy on tasks like this.
@engineerprompt 7 months ago ⁺¹
To be honest, when it comes to voice models, open source models are lagging behind!
@DavidJNowak 7 months ago
What I want is for LLMs to cook my next meal.
@Beetgrape 7 months ago
Dude, I wanna deploy this on Hugging Face as an API. Make a tutorial on this.
@engineerprompt 7 months ago
Deployment series is coming soon; it will give you an idea of how to do this.
@mohsenghafari7652 7 months ago
thanks
@publicsectordirect982 7 months ago ⁺²
A very tidy tool
@Jayden-qq1ei 7 months ago
Markdowns for PDF for LLM😁
@engineerprompt 7 months ago
:)
@prodigroup 7 months ago
👑
@supercker 6 months ago
"all languages" perhaps means the various languages we speak.
@poisonza 6 months ago
Cool
@Larsbor 7 months ago
OK, as usual the lack of a GUI destroys it for me..😢
@trusterzero6399 7 months ago
Grow out of that and a world will open up
@only_learn6095 7 months ago
GPL 3.0 No thanks.
@Sneakylamah 7 months ago
On my M1 Mac I have tried this out, installing
dependencies = [
    "torch>=2.3.0",
    "torchvision>=0.18.0",
    "torchaudio>=2.3.0",
    "marker-pdf>=0.2.13",
]
Then when I try out just a single PDF, it fails on a simple Python import.
marker_single 26572517.pdf OUTPUT --max_pages 2 --langs English
Traceback (most recent call last):
  File "marker/.venv/bin/marker_single", line 5, in <module>
    from convert_single import main
  File "marker/.venv/lib/python3.12/site-packages/convert_single.py", line 5, in <module>
    from marker.convert import convert_single_pdf
ModuleNotFoundError: No module named 'marker.convert'
Anyone getting the same?
Tried with Python 3.10 and 3.12
@engineerprompt 7 months ago
Are you using a virtual environment? Use this command:
python -m pip install marker-pdf
This will ensure it's installing the package in the current virtual env.
@Sneakylamah 7 months ago
Using rye, and yes it is there in my virtual env.
@Sneakylamah 7 months ago
The marker scripts are there to be called.
@Sneakylamah 7 months ago
@engineerprompt OK, the problem seems to be with the way Rye handled the imports, sorry about that. Creating the virtual env normally, I can run the commands. Thanks for the video, I have been looking for how to do this for a long time.
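When a package seems installed but imports still fail, it can help to confirm which interpreter is running and whether that interpreter actually sees the distribution. The sketch below uses only the standard library; the helper names are hypothetical, and `marker-pdf` is the distribution name taken from the install command in this thread. Note that `in_virtual_env` detects `venv`-style environments; a conda environment would report `False` here.

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtual_env() -> bool:
    """True when running inside a venv/virtualenv-style environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

Running `print(sys.executable, in_virtual_env(), installed_version("marker-pdf"))` with the same interpreter that launches `marker_single` shows quickly whether the package landed in a different environment.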
