Hadoop Certification - 05 Sqoop Import Incremental

  • Published on Jan 31, 2025

Comments • 61

  • @johndeer3627
    @johndeer3627 7 years ago

    I have tested this query without the '--append' option, and it worked fine.

  • @asrithaudamala9187
    @asrithaudamala9187 8 years ago +2

    Durga sir, below are my observations when practicing this video:
    we have to use --append with the --where clause, but with --incremental we need not specify --append separately.
    we can use a sqoop job, so that the job can read the --last-value and use it in the next iteration
    sqoop job --create ganeshaJob -- import --connect 'jdbc:mysql://quickstart.cloudera:3306/retail_db' --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/sqoop_import/departments -m 1 --last-value 7 --incremental append --check-column department_id --outdir java_files
    => we can inspect the configuration of a job with the show action
    sqoop job --show ganeshaJob
    => we can use sqoop job --exec ganeshaJob to execute the job, and it will always take the latest incremented value

    • @itversity
      @itversity  8 years ago +1

      Good to know. Thank you for sharing the information.

    • @asrithaudamala9187
      @asrithaudamala9187 8 years ago

      Thank you sir

  • @smohank
    @smohank 8 years ago

    Mr Gadiraju, your videos are a great inspiration, both for the way of teaching and for the scale of learning they enable. I would like to know how to do an incremental load when you have a condition based on multiple columns, like department_id > 10 and salary > 5000.

    • @itversity
      @itversity  8 years ago

      +Mohan K You can use the --query option or even the --where option to apply conditions.
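      A minimal sketch of the --where approach, assuming a hypothetical employees table with department_id and salary columns (the departments table used in the video has no salary column):
      sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table employees --where "department_id > 10 and salary > 5000" --target-dir /user/cloudera/sqoop_import/employees --append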

  • @simhadriisin
    @simhadriisin 9 years ago

    Hi Sir, thanks for the videos. They are wonderfully organized. I have been learning from your videos for a week now and I have a question. This might sound a little out of context for Sqoop import, but:
    when we are not explicitly creating a Hive table and want to create it as part of the import using --create-hive-table, I believe a Hive table is created with a schema and semantics that can best accommodate the data in the source table, right? Like an INT field for an integer column and STRING for a VARCHAR column, etc. Why doesn't Hive preserve some constraints such as the primary key? Because I saw that when we redo the import without overwrite it just re-appends the same data, which means we end up with two rows with department_id 7. Is there a way to specify a column as a primary key in Hive?

  • @prabhumuthiayan4665
    @prabhumuthiayan4665 8 years ago

    Hello Durga,
    I ran the below command and it is still throwing the error.
    sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com:3306/retail_db" --username retail_dba --password hadoop --table orders --target-dir /user/root/sqoop_import/orders
    Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/06/08 14:08:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1 16/06/08 14:08:52 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@41d1d49d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries. java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@41d1d49d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries
    Not sure if some old result set thread is still active and I need to kill it before running sqoop import again? Please help. This is happening only for the orders table; I was able to import all other tables without any issues.
    Thanks,
    Prabhu Muthaiyan

    • @prabhumuthiayan4665
      @prabhumuthiayan4665 8 years ago +1

      I added --driver com.mysql.jdbc.Driver to my sqoop command and that resolved the issue. Thanks
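      For reference, a minimal sketch of the same import with the driver class specified explicitly (hostnames and paths as in the original command above):
      sqoop import --driver com.mysql.jdbc.Driver --connect "jdbc:mysql://sandbox.hortonworks.com:3306/retail_db" --username retail_dba --password hadoop --table orders --target-dir /user/root/sqoop_import/orders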

    • @itversity
      @itversity  8 years ago

      It is due to a bug in some MySQL connector jars.

  • @abhisen2007
    @abhisen2007 8 years ago +1

    Hi Durga - is there any example where you have used Incremental Mode as Last Modified?

  • @itversity
    @itversity  8 years ago

    For any technical discussions or doubts, please use our forum - discuss.itversity.com
    For practicing on state of the art big data cluster, please sign up on - labs.itversity.com
    Lab is under free preview until 12/31/2016 and after that subscription
    charges are 14.99$ per 31 days, 34.99$ per 93 days and 54.99$ per 185 days

  • @Bedtime_storytime123
    @Bedtime_storytime123 8 years ago

    Hello Durga, first of all thank you for your awesome videos on Hadoop hands-on.
    I have one query. I ran an import with lastmodified using the command below, and got some strange logs (pasted in the next comment). It is not as per the documentation, which says "You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported". That means only the latest data should be imported, but it is importing the whole table.
    I had a few entries in the table initially and then added some more rows. Let's say the initial number of entries was 10, then I added 5 more in MySQL, for a total of 15 rows (with an additional timestamp column). But as per the logs it is appending the whole table, and the result is 25 rows in HDFS.

    • @Bedtime_storytime123
      @Bedtime_storytime123 8 years ago

      QUERY: [cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera --table departments_new --append --target-dir hdfs --check-column created_date --incremental lastmodified --last-value "2016-07-01 09:39:46" --split-by department_id
      Strange Logs: 16/07/01 10:28:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`department_id`), MAX(`department_id`) FROM `departments_new` WHERE ( `created_date` >= '2016-07-01 09:39:46' AND `created_date` < '2016-07-01 10:28:15.0' )
      Hence result:
      2,Fitness,2016-07-01 09:39:46.0
      3,Footwear,2016-07-01 09:39:46.0
      4,Apparel,2016-07-01 09:39:46.0
      5,Golf,2016-07-01 09:39:46.0
      6,Outdoors,2016-07-01 09:39:46.0
      7,Fan Shop,2016-07-01 09:39:46.0
      2,Fitness,2016-07-01 09:39:46.0
      3,Footwear,2016-07-01 09:39:46.0
      4,Apparel,2016-07-01 09:39:46.0
      5,Golf,2016-07-01 09:39:46.0
      6,Outdoors,2016-07-01 09:39:46.0
      7,Fan Shop,2016-07-01 09:39:46.0
      110,civil,2016-07-01 10:02:51.0
      111,Mechanical,2016-07-01 10:03:40.0
      112,Automobile,2016-07-01 10:04:03.0
      113,Pharma,2016-07-01 10:04:18.0
      114,Social Engineering,2016-07-01 10:04:52.0

    • @asrithaudamala9187
      @asrithaudamala9187 8 years ago

      Hi Sumit, I believe this is because of the ">=" condition for created_date in the generated WHERE clause. Could you please check by giving --last-value as "2016-07-01 09:39:47", and also delete the "--append" from the query?
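      A minimal sketch of another option, assuming Sqoop 1.4.x where --incremental lastmodified can be combined with --merge-key so that re-imported rows replace the old copies instead of being appended as duplicates (no --append in this form):
      sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera --table departments_new --target-dir hdfs --check-column created_date --incremental lastmodified --last-value "2016-07-01 09:39:46" --merge-key department_id --split-by department_id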

  • @muralirachakonda382
    @muralirachakonda382 8 years ago

    Hello Durga Sir, for the incremental load I got your point that we need to keep a logging table. Even if I get the last value by some means, how do I pass that variable at run time in order to get the incremental data?

    • @itversity
      @itversity  8 years ago

      You have to develop an integration platform by building the logic in Python, shell scripting, or some other programming language.

    • @bharathak
      @bharathak 8 years ago

      Sir, to automate the continuous passing of values to the "--last-value" argument, can't we use "sqoop job --create", where the sqoop metastore will store the last imported value and automatically submit it as the lower bound value for the next import when we run the sqoop job?
      Or is the above question by Murali Rachakonda a completely different use-case/scenario, which I interpreted wrongly?

    • @asrithaudamala9187
      @asrithaudamala9187 8 years ago

      I have tried this and we can do that:
      sqoop job --create ganeshaJob -- import --connect 'jdbc:mysql://quickstart.cloudera:3306/retail_db' --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/sqoop_import/departments -m 1 --last-value 7 --incremental append --check-column department_id --outdir java_files
      => we can inspect the configuration of a job with the show action
      sqoop job --show ganeshaJob
      => we can use sqoop job --exec ganeshaJob to execute the job, and it will always take the latest incremented value

  • @pratibhareshmi2120
    @pratibhareshmi2120 8 years ago

    Hi Durga sir, it would be a great help if you could show lastmodified as well, because this is one of the common interview questions.

  • @tejalanki9807
    @tejalanki9807 9 years ago

    Hi, I have been following your videos since last month; they are really helpful.
    For example, for an incremental load, if the "id" column values are not in sequence, then how should we specify the --last-value?

    • @itversity
      @itversity  9 years ago +1

      +teja lanki You need to develop a data integration framework. You need to keep track of what has been migrated by using log tables in a database, and query them using the eval command.
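      A minimal sketch of querying such a log table with sqoop eval (the sqoop_log table and its last_imported_id column are hypothetical):
      sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select max(last_imported_id) from sqoop_log"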

    • @tejalanki9807
      @tejalanki9807 9 years ago

      +itversity Thank you so much for your response!

  • @britishguts7311
    @britishguts7311 9 years ago

    I ran this query
    sqoop import \
    --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username=retail_dba \
    --password=cloudera \
    --table departments \
    --target-dir /user/cloudera/sqoop_import/departments \
    --append \
    --check-column "department_id" \
    --incremental append \
    --last-value 7 \
    --outdir java_files
    In my case hadoop is using one split even though it should use 4!
    15/12/19 21:46:14 INFO mapreduce.JobSubmitter: number of splits:1
    What could be the reason? I haven't modified any configuration files!

    • @swadeepmishra634
      @swadeepmishra634 9 years ago

      +british Guts It probably identified that it only has one row to import, thus one split. Do you see 4 splits when you do a regular import? I am not a big data expert but it's good to be in the discussion :)

  • @rodionvasiljev3020
    @rodionvasiljev3020 8 years ago

    Hi Durga,
    I have imported all tables into /apps/hive/warehouse/retail_stage.db/, but when I use the hive command line and run "show databases" I can't see retail_stage. Do you know what the matter is? I thought that if I import a DB into the /apps/hive/warehouse/ folder then I should see it from inside the hive command line.

  • @jasonli1420
    @jasonli1420 8 years ago

    Hi Gadiraju, thanks for the nice videos. When following this video I was able to successfully load the data.
    But the data does not show with the hive query "select * from departments"; all newly loaded data shows as NULL. However,
    if I use dfs -cat, all data shows as expected. What could be wrong? I noted that the source table id is auto-incremented; I am not sure if this causes the issue. Thanks again.

    • @itversity
      @itversity  8 years ago

      +Jason Li What is the create table command you ran? Most likely it is a delimiter issue. You need to specify the data delimiter using - row format delimited fields terminated by '' - Hive's default field delimiter is the non-printable \001 character, not the comma that Sqoop writes by default.
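      A minimal sketch of a matching Hive DDL, assuming the files were imported with Sqoop's default comma delimiter (table and column names are illustrative):
      CREATE TABLE departments (department_id INT, department_name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;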

  • @pradeepp2009
    @pradeepp2009 6 years ago

    Hi all, I have a query about incremental import. Initially I had 10 records which were dumped into HDFS/Hive using incremental import. After that import, another 5 records were added to the table and records 1 and 2 were updated. Now how can I append records 10-15 into Hive/HDFS and also update records 1 and 2 which were initially in HDFS/Hive? I am using Hive 0.12 (I have seen this can be overcome with ACID transactions, which were introduced in Hive 0.14). Please provide me some solution.

  • @priyankashekhawat6174
    @priyankashekhawat6174 6 years ago

    Hi Sir, I want to know whether we can do an incremental import for all the tables in a database, and if yes, then how?

  • @abhiganta
    @abhiganta 9 years ago

    Hi Durga Garu,
    I have one doubt: along with sqoop import we use --hive-import to import directly into Hive. Similarly, can we import directly into Hive using --incremental append?

    • @itversity
      @itversity  9 years ago

      +abhi ganta Yes, sqoop incremental import works with Hive tables as well.
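      A minimal sketch of that combination, assuming the same retail_db departments table as in the video (append mode is the one generally combined with --hive-import; lastmodified has known limitations there):
      sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --hive-import --hive-table departments --incremental append --check-column department_id --last-value 7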

    • @abhiganta
      @abhiganta 9 years ago

      +itversity Ok Sir, thank you. Any suggestions or ideas about incremental updates (row level)? As of now I am following this: hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
      Just wanted to know if there is any better procedure.

    • @itversity
      @itversity  9 years ago +1

      It seems to be the same.

  • @maharajnarayanan1060
    @maharajnarayanan1060 9 years ago

    Hi Durga, the incremental load here is more or less hard coding: with --last-value 7 it loads ids greater than 7, and the same goes for the --where condition. For the next load we need to change the value, right? Is there any way we could use --query to get max(id) from the table and substitute it into the --last-value or --where parameter?

    • @itversity
      @itversity  9 years ago

      +maharaj narayanan You have to develop that as part of a data integration framework. You can use a programming language of your choice or Oozie to do so. You need to maintain log tables to check what was migrated as part of the last run.
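      A minimal sketch of that idea in shell, assuming a hypothetical sqoop_control log table that records the highest department_id loaded so far (the parsing of the eval output is simplified and would need hardening):
      LAST_VAL=$(sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select max(last_department_id) from sqoop_control" | grep -o '[0-9]\+' | tail -1)
      sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/sqoop_import/departments --incremental append --check-column department_id --last-value "$LAST_VAL"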

    • @maharajnarayanan1060
      @maharajnarayanan1060 9 years ago

      You mean like a control table?

    • @itversity
      @itversity  9 years ago

      Yes.

    • @bharathak
      @bharathak 8 years ago

      Sir, to automate the continuous passing of values to the "--last-value" argument, can't we use "sqoop job --create", where the sqoop metastore will store the last imported value and automatically submit it as the lower bound value for the next import when we run the sqoop job?
      Or is the above question by Maharaj Narayan a completely different case/scenario which I interpreted wrongly?

  • @kapilsharma20
    @kapilsharma20 8 years ago

    Hi Durga Sir,
    How do we deal with duplicate-value checks while importing, or handle redundancy if the values are already available in the Hive tables?

    • @itversity
      @itversity  8 years ago

      You need to stage the data and then perform a join to address duplicates. There is no out-of-the-box solution for it.
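      A minimal sketch of that staging-plus-join approach in HiveQL, assuming a hypothetical departments_stage table holding the newly imported rows and a departments_merged table for the result (new rows win on conflict):
      INSERT OVERWRITE TABLE departments_merged
      SELECT COALESCE(s.department_id, d.department_id) AS department_id,
             COALESCE(s.department_name, d.department_name) AS department_name
      FROM departments d
      FULL OUTER JOIN departments_stage s
        ON d.department_id = s.department_id;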

  • @prabhumuthiayan4665
    @prabhumuthiayan4665 8 years ago

    Hi Durga, in order to get old commands you are using /-cat and pressing some keys; what keys are they?

    • @itversity
      @itversity  8 years ago +1

      th-cam.com/video/Op6DS0VlaIM/w-d-xo.html

  • @Vikky-qf4hn
    @Vikky-qf4hn 8 years ago

    Hello Sir, should we use the same 'eval' command to fetch the last value in the exam, if asked?

    • @itversity
      @itversity  8 years ago

      You can use it. But most likely they will not ask those questions.
      For further clarifications use discuss.itversity.com

    • @vikramnaidu736
      @vikramnaidu736 6 years ago

      What is the URL and where will I get the scripts? Can you please provide that URL?

  • @9980923847
    @9980923847 8 years ago

    Hi Sir,
    I wanted to know what the use of fields-terminated-by and input-fields-terminated-by is; in other words, what is the difference between them?

    • @itversity
      @itversity  8 years ago

      fields-terminated-by is for import, input-fields-terminated-by is typically for export.
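      A minimal sketch showing where each option sits (paths and the pipe delimiter are illustrative):
      # on import, --fields-terminated-by controls how Sqoop writes the files to HDFS
      sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/sqoop_import/departments --fields-terminated-by '|'
      # on export, --input-fields-terminated-by tells Sqoop how to parse the files it reads from HDFS
      sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --export-dir /user/cloudera/sqoop_import/departments --input-fields-terminated-by '|'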

  • @karunanithishanmugam4366
    @karunanithishanmugam4366 8 years ago

    Hello Durga Sir, are all these needed for HDPCD no Java?

    • @itversity
      @itversity  8 years ago +1

      Yes, people have acknowledged that they have cleared the certification.

    • @karunanithishanmugam4366
      @karunanithishanmugam4366 8 years ago

      +itversity Thanks for your reply, Durga sir. I was asking because Sqoop has only import to Hive/HDFS and export as per the HDPCD syllabus, but your sqoop videos cover various other concepts and state that they are for the Cloudera certification. So my question is: do we have to know all these concepts for HDPCD too?

  • @ashhb2000
    @ashhb2000 7 years ago

    The video doesn't cover how to capture incremental data and load it into a Hive table. Is there a sqoop option that can be used while loading into a Hive table?

    • @itversity
      @itversity  7 years ago

      Please post technical questions on discuss.itversity.com
      However, I will answer this question. You can use the sqoop merge command to load incremental data that requires updating existing rows.
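      A minimal sketch of sqoop merge, assuming the record class and jar generated by a previous import of the departments table, and illustrative directory names for the old and newly imported data:
      sqoop merge --new-data /user/cloudera/sqoop_import/departments_delta --onto /user/cloudera/sqoop_import/departments --target-dir /user/cloudera/sqoop_import/departments_merged --jar-file departments.jar --class-name departments --merge-key department_id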

  • @lakshmithiagarajan4821
    @lakshmithiagarajan4821 8 years ago

    Very nice videos, brief and crisp.
    Sqoop job create sqoop_job doesn't work; it starts erroring where I say --import. Please comment. Here is my 'job create' command.
    sqoop job --create sqoop_job1 --import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username=retail_dba --password=cloudera --table departments --target-dir /user/cloudera/sqoop_import/departments --append --check-column "department_id" --incremental append --last-value 8 --outdir java_files
    And the errors start like this,
    17/01/25 13:10:21 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
    17/01/25 13:10:21 ERROR tool.BaseSqoopTool: Error parsing arguments for job:
    17/01/25 13:10:21 ERROR tool.BaseSqoopTool: Unrecognized argument: --import
    17/01/25 13:10:21 ERROR tool.BaseSqoopTool: Unrecognized argument: --connect

    • @yrkleo
      @yrkleo 7 years ago

      Hi Durga... I am also facing the same issue. Can you let me know what needs to be changed to run this command?

    • @lakshmithiagarajan4821
      @lakshmithiagarajan4821 7 years ago

      Here is where the little space makes the error go away:
      sqoop job --create sqoop_job1 -- import
      please notice the space between the -- and the import clause; now your command runs without an error

    • @lakshmithiagarajan4821
      @lakshmithiagarajan4821 7 years ago

      In this command
      sqoop job --create sqoop_job1 -- import
      you may want to notice the space between the -- and the import clause

  • @vishmehta3113
    @vishmehta3113 8 years ago

    I just started with Sqoop hands-on. I have a question: let's say I have 300 tables in a database and I want to perform an incremental load on those tables. I understand I can do incremental imports with either append mode or lastmodified.
    But do I have to create 300 jobs, if the only things in the job which vary are the table name, the CDC column, and the last value/updated value?
    Has anyone tried using the same job and passing the above things as parameters that can be read from a text file in a loop, executing the same job for all the tables in parallel?
    What is the industry standard and what are the recommendations?
    Also, is there a way to truncate and re-load the Hadoop tables which are very small, instead of performing CDC and merging the tables later?

    • @itversity
      @itversity  8 years ago

      Yes, you have to create 300 jobs. There is no easy workaround for it.
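      A minimal sketch of scripting that job creation in shell, assuming a hypothetical tables.txt with one table name per line and that every table shares an id check column (in practice the check column usually differs per table):
      while read TABLE; do
        sqoop job --create "incr_${TABLE}" -- import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table "$TABLE" --target-dir "/user/cloudera/sqoop_import/${TABLE}" --incremental append --check-column id --last-value 0
      done < tables.txt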

  • @bashful88
    @bashful88 8 years ago

    Hi Sir,
    Can I give some conditions if I am using sqoop import with --query?
    sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select * from departments where department_id < 6" --target-dir /user/cloudera/sqoop_import/departments --split-by department_id
    Because it is accepting only a command like this:
    sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select * from departments where \$CONDITIONS" --target-dir /user/cloudera/sqoop_import/departments --split-by department_id
    Instead of $CONDITIONS, can we give something like department_id < 6?
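    A minimal sketch of the usual answer: in a free-form query the fixed predicate is simply ANDed with the \$CONDITIONS placeholder, which Sqoop still needs for splitting:
    sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select * from departments where department_id < 6 and \$CONDITIONS" --target-dir /user/cloudera/sqoop_import/departments --split-by department_id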