Using the Chat Endpoint in the Ollama API

  • Published Sep 20, 2024
  • Be sure to sign up for my monthly newsletter at technovangelis...
    And if interested in supporting me, sign up for my Patreon at / technovangelist

Comments • 46

  • @nofoobar • 7 months ago

    Thanks for this awesome tutorial. I took it as a reference and built a user_id-based map that keeps chat history in an in-memory database, as sketched below.
    This helped me keep history for each user.
    {
    "user_1" : [{}],
    "user_2": [{}]
    }
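
    A minimal sketch of that idea, assuming a local Ollama server on the default port; the map and function names are just placeholders, not part of the Ollama API:

    import requests

    OLLAMA_URL = "http://localhost:11434/api/chat"
    histories = {}  # in-memory map: user_id -> list of chat messages

    def chat_as_user(user_id, text, model="llama3"):
        # Append the new user turn to that user's history.
        history = histories.setdefault(user_id, [])
        history.append({"role": "user", "content": text})
        resp = requests.post(OLLAMA_URL, json={
            "model": model,
            "messages": history,
            "stream": False,
        })
        resp.raise_for_status()
        reply = resp.json()["message"]
        history.append(reply)  # keep the assistant turn too
        return reply["content"]

    print(chat_as_user("user_1", "Why is the sky blue?"))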

  • @RanaMuhammadWaqas • 6 months ago +2

    That awkward silence at the end though :D

  • @Psychopatz • 9 months ago +1

    Thank you sir for adding this QoL. Super helpful indeed!

  • @dr.mikeybee • 9 months ago

    Thanks for building this terrific server and for the models and tools!

  • @aidan.halvey • 4 months ago

    What an absolute legend

  • @gears2525 • 4 months ago +2

    For some reason, I’m unable to hit the endpoint from another computer on the same network

    • @TridentHut-dr8dg • 1 month ago

      @@gears2525 Hey, why not put a small Flask API in front of your local Ollama and have the other PC talk to that instead?
      Maybe it'll work - something like the sketch below.
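
      A rough sketch of that Flask-proxy idea, assuming Flask and requests are installed and Ollama is listening on localhost:11434 on the machine running the proxy; route and variable names are illustrative:

      import requests
      from flask import Flask, jsonify, request

      app = Flask(__name__)
      OLLAMA_CHAT = "http://localhost:11434/api/chat"

      @app.route("/chat", methods=["POST"])
      def chat():
          # Forward the incoming JSON body to the local Ollama chat endpoint
          # (non-streaming requests only in this simple version).
          upstream = requests.post(OLLAMA_CHAT, json=request.get_json(), timeout=600)
          return jsonify(upstream.json()), upstream.status_code

      if __name__ == "__main__":
          # Bind to all interfaces so other machines on the LAN can reach the proxy.
          app.run(host="0.0.0.0", port=5000)

      Setting OLLAMA_HOST to 0.0.0.0 before starting the Ollama server is another way to make it reachable from other machines on the network.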

  • @HistoryIsAbsurd • 7 months ago

    I actually had no idea you were part of the ollama team too, that's super cool

    • @technovangelist • 7 months ago

      I was but not anymore. Focusing on videos

    • @HistoryIsAbsurd • 7 months ago

      Still super cool! Love the vids! Keep em comin! @@technovangelist

  • @me-cm8or • 22 days ago

    How can I do those requests from another PC connected to the same network?

  • @PrashantSaikia • 5 months ago

    Is there an example of a chat UI application that uses the ollama inference endpoint and is then deployed in the cloud (AWS, GCP, etc.)? I have managed to create the app and it is running locally, but I'm struggling to deploy it in the cloud - specifically, I'm stuck at creating an appropriate Dockerfile, as it seems there need to be two deployments, one for the ollama inference endpoint and one for the UI. An example showing how it's done would be awesome!

    • @technovangelist • 5 months ago

      I haven't built out any GUIs because it's hard to improve on the usability of the CLI in this case. Everywhere else I prefer a GUI, but here none of them are good enough to beat the CLI. If you want to deploy with Docker, there are plenty of tutorials online to get you up to speed. A friend, Bret Fisher, has a popular Docker course. But no need to start there. Just deploy it to the cloud without Docker: spin up an instance somewhere and run it. The bigger challenge there is securing your system. If you don't need to share it, look into Tailscale.

  • @Ramirola83 • 2 months ago

    When I use Ollama from the terminal with the llama3 model, it works very fast, almost instantly. However, when I try to make a request to localhost from the same machine using curl, it is incredibly slow. Why could this be?

    • @technovangelist • 2 months ago

      What is the command you are running? The CLI uses the same API as curl; there is no difference. The best thing to try is to go to the CLI, type ‘/set verbose’, and then ask the question. Then in the curl output look at the stats in the final JSON blob and compare the numbers - see the sketch below.
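
      On the curl side, one way to pull out those stats is a non-streaming request via Python's requests (the model name is just an example); the field names come from the final JSON blob the chat endpoint returns:

      import requests

      resp = requests.post("http://localhost:11434/api/chat", json={
          "model": "llama3",
          "messages": [{"role": "user", "content": "Why is the sky blue?"}],
          "stream": False,
      }).json()

      # Durations in the response are reported in nanoseconds.
      print("total_duration (s):", resp["total_duration"] / 1e9)
      print("load_duration (s): ", resp["load_duration"] / 1e9)
      print("eval_count:        ", resp["eval_count"])
      print("tokens/s:          ", resp["eval_count"] / (resp["eval_duration"] / 1e9))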

  • @niteshapte • 3 months ago

    If ollama itself is not multithreaded and async, how much can a wrapper like this help?
    I have 2 laptops, A and B. 'A' has a database with more than 10k records in a table and code written in Scala to fetch data from it. 'B' has ollama running with the llama3:latest model loaded. I fetch data from the database on 'A' and send it to the chat and generate endpoints of ollama on 'B'. At first, ollama responds to each chat or generate request within a few hundred milliseconds (100, 125, 200, 300). But that's just in the beginning: later the response time grows to around 10 minutes per request by the time 2k requests have been sent, with 8k still to go.
    From this behaviour, it doesn't look like ollama supports concurrency. I created a Python script with Flask to achieve async, but the behaviour from ollama remained the same. Do you know how to solve this problem? Or is there no problem and hence no solution? OLLAMA_NUM_PARALLEL was set to 4. A client-side sketch follows below.
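
    A client-side sketch of capping the number of in-flight requests (for example to match OLLAMA_NUM_PARALLEL) instead of queuing thousands at once; it uses only requests and the standard library, and the record contents are hypothetical stand-ins for the database rows:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    OLLAMA_URL = "http://localhost:11434/api/chat"

    def ask(record):
        # One non-streaming chat request per record.
        resp = requests.post(OLLAMA_URL, json={
            "model": "llama3",
            "messages": [{"role": "user", "content": f"Summarize: {record}"}],
            "stream": False,
        }, timeout=600)
        return resp.json()["message"]["content"]

    records = [f"row {i}" for i in range(100)]       # stand-in for the 10k rows
    with ThreadPoolExecutor(max_workers=4) as pool:  # 4 to match OLLAMA_NUM_PARALLEL
        for answer in pool.map(ask, records):
            print(answer[:80])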

    • @technovangelist • 3 months ago

      If you are seeing long times like that there is something wrong with your setup. You should ask on the discord and try to resolve the issue with your setup.

  • @chrisBruner • 9 months ago

    Very interesting, I had no idea. What possible roles are there besides "user", do we just make them up or is there a predefined set?

    • @technovangelist • 9 months ago

      The three possible roles for now are user, system, and assistant
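
      For example, a messages array can mix all three roles; this list goes in the "messages" field of a POST to /api/chat:

      messages = [
          {"role": "system", "content": "You are a terse assistant."},
          {"role": "user", "content": "Name a Greek philosopher."},
          {"role": "assistant", "content": "Socrates."},
          {"role": "user", "content": "Name another."},
      ]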

    • @chrisBruner • 9 months ago

      @@technovangelist Love these videos, keep them coming! I know you can produce embeddings from ollama and store them in a database, but I don't know how that is useful. Can you explain?

  • @harinaren1989 • 1 month ago

    Is there a way I can fetch the whole response in a single response object?

    • @technovangelist • 1 month ago

      Sure. Set streaming to false
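
      A minimal sketch with requests, assuming a local server and the llama3 model as an example; with "stream": false the chat endpoint returns one JSON object containing the whole reply:

      import requests

      resp = requests.post("http://localhost:11434/api/chat", json={
          "model": "llama3",
          "messages": [{"role": "user", "content": "Tell me a one-line joke."}],
          "stream": False,
      })
      print(resp.json()["message"]["content"])  # the complete reply in one object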

  • @GrecoFPV • 5 months ago

    Hello 👋 Can we connect our local ollama with a self-hosted n8n server?

  • @dextersantamaria7222 • 7 months ago

    Hi Matt! Thanks for the awesome work! Is there a way to include vLLM in Ollama?

    • @technovangelist • 7 months ago

      I don’t think so. They are alternative ways of doing the same thing

  • @claudioguendelman • 2 months ago

    Is there some way you can help me with my project? Thanks from Chile, Claudio

  • @ilteris • 9 months ago

    Separate question: what's that browser? It has a very nice interface. TIA

    • @technovangelist • 9 months ago +1

      That’s Arc. It’s great.

    • @ilteris • 9 months ago

      @@technovangelist thank you kindly sir! Please do more videos 🙏

  • @MavVRX • 8 months ago

    Why does the chat API respond all at once, even with streaming turned on?

    • @technovangelist • 8 months ago

      Not sure what you mean by that

    • @MavVRX • 8 months ago

      @@technovangelist I get a delayed response (while the AI is producing one); once the response comes back, all the chunks arrive in a single response rather than one at a time. In other words, the time taken to generate a complete response is the same, but the chat API lags and only sends a response after the complete response is generated, even with stream=True. This is vastly different from the generate API, where one character is sent back at a time.
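
      When that happens it is often the client buffering rather than the API itself; a sketch of reading the NDJSON stream line by line with requests, which prints each chunk as it arrives if the server is in fact streaming:

      import json
      import requests

      with requests.post("http://localhost:11434/api/chat", json={
          "model": "llama3",
          "messages": [{"role": "user", "content": "Write a haiku about rain."}],
          "stream": True,
      }, stream=True) as resp:
          for line in resp.iter_lines():
              if not line:
                  continue
              chunk = json.loads(line)
              print(chunk["message"]["content"], end="", flush=True)
              if chunk.get("done"):
                  break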

  • @kacemtoubal3580 • 1 month ago

    Are those models fine-tunable?

    • @technovangelist • 1 month ago

      Most of the models used in Ollama are fine-tunable.

  • @briann1233 • 5 months ago

    Do you have a GitHub repo for this video?

    • @technovangelist • 5 months ago

      github.com/ollama/ollama/examples

  • @PRAKASHWAGLE • 8 months ago

    Is there a link to the Discord channel?

    • @technovangelist • 8 months ago

      There is a link on the main website for ollama

  • @sampriti6026 • 8 months ago

    Hey, can you post a GitHub link to the code?

    • @technovangelist • 8 months ago

      That's a good point. I'll find it and post it.

    • @Star-rd9eg • 8 months ago

      @@technovangelist Did ya manage to get it? :D

    • @technovangelist • 8 months ago

      Oh, it's already in the ollama repo

    • @technovangelist • 8 months ago

      Yes, it's in the main ollama repo

    • @maritholtermann4009 • 7 months ago

      Hi! Thank you for a nice and descriptive video. Can you please link to the repo? Can't find it :) @@technovangelist