Thank you for this video. One question though: does this mean the data is being sent to OpenAI for it to process? From a privacy perspective this is really concerning.
Yes.
You could train it using dummy data, if you just want to get the queries to use elsewhere.
If you wanted the LLM to do what I did in the video then a local LLM would be best.
I didn't/don't have access to a decent computer to run one though.
Try: github.com/imartinez/privateGPT
Using LangChain with OpenAI and SQL is best kept to publicly available data and to studying/learning SQL.
See also: community.openai.com/t/creating-a-chatbot-using-the-data-stored-in-my-huge-database/245942/27
pgvector is another option if you use Postgres and are familiar with embeddings.
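For anyone curious, a minimal pgvector sketch in Python with psycopg2 might look like this (the connection string, table, and tiny 3-dimensional vectors are just placeholders for illustration, not anything from the video):

import psycopg2

conn = psycopg2.connect("postgresql://my_user:my_password@localhost:5432/my_database")
cur = conn.cursor()

# enable the extension and create a table with a vector column
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, content text, embedding vector(3));")

# insert an embedding (normally produced by an embedding model)
cur.execute("INSERT INTO docs (content, embedding) VALUES (%s, %s::vector);", ("hello world", "[0.1, 0.2, 0.3]"))

# nearest-neighbour search using the <-> (L2 distance) operator
cur.execute("SELECT content FROM docs ORDER BY embedding <-> %s::vector LIMIT 5;", ("[0.1, 0.2, 0.3]",))
print(cur.fetchall())
conn.commit()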
Hi there, thanks for this video!
Is it possible to use retrievers and condensers with the SQL connector? I know with the PDF document loaders you're able to find only the relevant text using a retriever, condense it to reduce the token count, and send that as context to the LLM. Perhaps that would save on cost?
Also, I'm guessing that using memory to store indexes of your data in something like Pinecone is useless with a database because the data is always changing. Is that correct?
I don't know the answer to Q1; maybe someone else can comment if they know. Re: Pinecone, vector databases are so fast that you would most likely never need to store indexes. I used Qdrant with Python and it's blazingly fast.
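For reference, a minimal Qdrant sketch with the qdrant-client library looks roughly like this (collection name, vector size, and payload are made up for illustration):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# in-memory instance for quick experiments; use host/port for a real server
client = QdrantClient(":memory:")

client.recreate_collection(
    collection_name="demo",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# store one vector with a small payload
client.upsert(
    collection_name="demo",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "example row"})],
)

# similarity search
hits = client.search(collection_name="demo", query_vector=[0.1, 0.2, 0.3, 0.4], limit=3)
print(hits)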
I get the sense that we are a long, long way from this being usable with schemas you'd find in the real world.
Yes, I just wanted to explore the basic idea. What do you see as the biggest issue with it? E.g. functions? Relationships? Subqueries?
@python360 In my case, mainly the last two. I would also include the size of the results of the query. How many records can I stick in your teachers table before the answer to the question "Show me the details of all teachers" exceeds the LLM's ability to summarize it? Input size: our production database has several hundred tables and views, some with dozens of columns, some with abbreviated column and table names (not my idea, so don't blame me), and many foreign key constraints. I cannot imagine LangChain SQL or any other NLIDB software being able to query it, at least not yet. Maybe someday.
Can you please tell me how to resolve the prompt length issue?
Was the database hosted on your device, and did you run the code on Colab or something like that? It'd be great if it connected with no issues to a local database.
It was a local database. I already had Postgres on the laptop, and I was running Python locally as a .ipynb in VS Code 👍
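Roughly, connecting is just a local URI, something like this sketch (credentials and database name are placeholders, and the import path can vary between LangChain versions):

from langchain.sql_database import SQLDatabase

db = SQLDatabase.from_uri(
    "postgresql+psycopg2://my_user:my_password@localhost:5432/my_database"
)
print(db.get_usable_table_names())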
The default schema is 'public'. If I want to connect to my custom schema like 'my_schema', what should I do? Thanks a lot!
In the context of SQL (Structured Query Language), a schema refers to a logical container that holds database objects such as tables, views, indexes, and other related structures. When using SQL commands, you often work within a specific schema to organize and structure your database objects.
However, the ability to directly pass a schema as a parameter in an SQL command can depend on the database system you are using and the programming language you are working with.
Here's a general explanation of how you might work with schemas in SQL commands:
Database System:
Different database management systems (DBMS) have varying support for schemas. For example, PostgreSQL, SQL Server, and Oracle all have their ways of handling schemas.
SQL Commands:
SQL commands like SELECT, INSERT, UPDATE, and DELETE typically operate on tables and other objects within a specific schema. You reference these objects using a qualified name, often in the format schema_name.object_name. This helps avoid ambiguity when multiple objects with the same name exist in different schemas.
Programming Languages:
When you're interacting with a database from a programming language (such as Python, Java, C#, etc.), you might use prepared statements or parameterized queries. In these cases, you can dynamically insert schema names into your SQL queries. However, the exact syntax can differ between programming languages and database systems.
For example, in Python using the sqlite3 library, you can build the qualified name into the query string (note that identifiers can't be bound as placeholders, so they are interpolated directly):
import sqlite3

conn = sqlite3.connect('my_database.db')
cursor = conn.cursor()

# In SQLite, a "schema" name refers to an attached database,
# so attach a second database file under the name 'my_schema' first
cursor.execute("ATTACH DATABASE 'my_schema.db' AS my_schema")

schema_name = 'my_schema'
table_name = 'my_table'

# only interpolate trusted identifier names into the query string
sql_query = f"SELECT * FROM {schema_name}.{table_name}"
cursor.execute(sql_query)
I asked chatGPT as I didn't know :o)
@python360 I think I've solved the problem; the code is probably:
db = SQLDatabase.from_uri(
    f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}/{db_name}?client_encoding=utf8",
    engine_args={
        "connect_args": {"options": "-c search_path=meta_collect,my_schema"},
    },
)
I'd like to ask why some logic is in a Python file and some in a Jupyter notebook?
Yes, good question! I develop in Jupyter to start with, and then export as a .py Python file when it's ready to use on a server. 👍
Will this work for a large amount of data as well? How will it take into account such a big context?
The size of the database doesn't make any difference here; rather, it's the size of the response that affects the number of tokens, where each four-letter word is approximately one token.
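If you want a rough feel for the numbers, here's a quick sketch with the tiktoken library (the model name and sample text are just examples):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
response_text = "id | name | subject\n1 | Alice | Maths\n2 | Bob | Physics"
print(len(enc.encode(response_text)), "tokens")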
When dealing with a large response size, what might be the best approach? We can limit the number of rows, but that doesn't seem foolproof.
Are you using LangChain "tools"? You could make a custom tool that uses only the tokens of the query.
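A rough sketch of that idea (not the exact setup from the video): wrap a plain SQL runner as a LangChain Tool so the full result set stays local and only a short summary goes back to the LLM. The connection string and summary format are made up for illustration.

from langchain.agents import Tool
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://my_user:my_password@localhost:5432/my_database")

def run_sql(query: str) -> str:
    # execute locally and return only a short summary, not every row
    with engine.connect() as conn:
        rows = conn.execute(text(query)).fetchall()
    return f"{len(rows)} rows; first 5: {rows[:5]}"

sql_tool = Tool(
    name="run_sql",
    func=run_sql,
    description="Runs a SQL query and returns a short summary of the rows.",
)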
@python360 Will try that out, thanks.
There are many things that work very well in Colab but don't when you take them to VS Code. Please could you implement this in VS Code? Thanks. By the way, the way you develop the topics within the video is excellent!!!
Noted. Thanks for the comment, I'll look to use VS Code in future videos 👍
How can I use open-source LLMs like privateGPT, GPT4All, or Vicuna?
Good suggestion. I used a sentence-transformer model from Hugging Face on my findthatbit project.
The model plus the Qdrant vector database were both open source, and the data could be kept private depending on the use case. 👍
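For anyone wanting to try it, the open-source embedding side looks roughly like this sketch with the sentence-transformers library (the model name is a common default, not necessarily the one used in the project):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["How many teachers are in the maths department?"])
print(embeddings.shape)  # (1, 384) vectors that could be stored in Qdrant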
@python360 Did the Hugging Face model work for querying your SQL DB using natural language?
Good question. Maybe try regex? E.g. pull the "Action"/SQL lines out of the response with re.findall. Maybe if anyone else reads this they can suggest another way?
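Something along these lines, as a rough sketch (the pattern and sample output are illustrative; the real agent output format may differ):

import re

agent_output = """Action: sql_db_query
Action Input: SELECT name FROM teachers LIMIT 5;
Observation: [('Alice',), ('Bob',)]"""

matches = re.findall(r"Action Input:\s*(SELECT.*?);", agent_output, re.IGNORECASE | re.DOTALL)
print(matches)  # ['SELECT name FROM teachers LIMIT 5']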
How can I get only the SQL query and not the summarized response that it provides?
I am also looking for this, did you find any answer?
I got this error: ValueError: Invalid header value b'Bearer API KEY.......'
The header value should be a string, not a bytes object... can you check it's a string with print(type(my_header_value))?
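Something like this quick check should show the problem (the variable name and value are just examples):

api_key = b"my-secret-key"            # example only; a bytes value triggers the error above
print(type(api_key))                  # <class 'bytes'> here, it needs to be str
if isinstance(api_key, bytes):
    api_key = api_key.decode("utf-8") # decode before building the Authorization header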
Great video!
I have a couple of questions, if that's OK.
1. What are the versions of langchain and openai that you are using? I am trying to run the SQL Agent on a SQLite database and I keep running into this error,
AuthenticationError:
2. At the beginning of the video you mentioned that the SQL Agent uses up a lot of OpenAI credits. Do you have an approximate value for how many credits it uses? Is it definitely more than one request per run?
Thank you, here's the requirements.txt - github.com/RGGH/OpenAI_SQL/blob/master/requirements.txt
The SQL code was using about 170 tokens, from memory.
Ah, if you're using SQLite have you set a password on the database? By default it likely won't have one.
Can it generate the visualizations as well?
Not that I know of; there might be a plug-in now that can, it's evolving so quickly...