Tuesday Tech Tip - Accelerating ZFS Workloads with NVMe Storage

  • Published Sep 30, 2024

Comments • 46

  • @midnightwatchman1 • 1 year ago • +7

    Document management servers actually do frequently have over 100k files in one directory. Massive workloads like that do exist; a human may not create them directly, but applications frequently do.

    • @TheChadXperience909 • 1 year ago • +2

      I'm sure it would benefit email servers, as well.

    • @n8c • 1 year ago

      Temperature tracking software (food transport industry) does this as well.
      Devs do crap like this all the time, where the data clearly belongs in a DB or something 😅

  • @HoshPak • several months ago

    Would a SLOG and a cache device still make sense when there's already a special vdev in the pool?
    I'm building a compact storage server that fits 64 GB of RAM, 2 NVMe drives and 4 HDDs. I could imagine partitioning the NVMe drives equally so I have everything mirrored plus a striped cache. Would that be useful? What is a good way to measure this?
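
    A rough sketch of how that split could be attached, assuming a hypothetical pool named "tank" and two NVMe drives each carved into three partitions (p1 special, p2 SLOG, p3 L2ARC); the device names are placeholders, and whether the SLOG and L2ARC actually help depends on the workload (sync writes, ARC miss rate):

      # mirrored special vdev for metadata
      zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
      # mirrored SLOG (only matters for synchronous writes)
      zpool add tank log mirror /dev/nvme0n1p2 /dev/nvme1n1p2
      # striped L2ARC read cache (no redundancy needed)
      zpool add tank cache /dev/nvme0n1p3 /dev/nvme1n1p3

    Comparing arc_summary / arcstat output under a realistic workload before and after is one rough way to see whether the cache devices are actually earning their keep.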

  • @chrisparkin4989 • 1 year ago • +2

    Great vid, but won't 'hot' metadata live in your ARC (RAM) anyway, and isn't that surely the fastest place to have it?

    • @TheExard3k • 1 year ago

      It would. But ARC evicts stuff all the time, so your desired metadata may not stay there. Tuning parameters can help with this. But having metadata on SSD/NVMe guarantees fast access. And the vdev increases pool capacity, so it's not "wasted" space. Worth considering if you have spare capacity for 2xSSD/NVMe. And you really need it on very large pools or when handing out a lot of zvols (block storage).
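
      To check how much of this is already covered by RAM on a given box, the stock OpenZFS reporting tools give a rough picture; a quick sketch (the exact field names vary a little between versions, so treat the grep as illustrative):

        # ARC breakdown, including how much of the ARC holds metadata and its hit rate
        arc_summary | grep -iA2 metadata
        # live per-second view of ARC hits and misses
        arcstat 1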

  • @Solkre82 • 5 months ago

    If you add a metadata vdev to a pool, is it safe to remove later? Is this a cache, or does metadata no longer go to the data disks at all?

    • @45Drives • 5 months ago • +1

      The metadata vdev houses the data about the data: things like properties, indexes, etc., essentially pointers to where the data lives in the pool/dataset.
      If you remove it, the data has nothing tying it to specific blocks in the pool, rendering all of it inaccessible.
      So no, not safe to remove.

  • @TheChadXperience909 • 1 year ago • +2

    In my experience, it really accelerates file transfers, especially when doing large backups of entire drives and file systems.

    • @steveo6023 • 1 year ago

      How can this improve transfer speed when only metadata is on the NVMe?

    • @TheChadXperience909 • 1 year ago • +2

      @steveo6023 It speeds up because flash storage is faster at small random IOPS than HDDs. Even though they are small reads/writes, they add up over time. Also, it prevents the read/write head inside the HDD from thrashing around as much, which reduces seek latency and can also benefit drive longevity.

    • @steveo6023 • 1 year ago

      @@TheChadXperience909 but metadata is cached in the arc anyway

    • @TheChadXperience909 • 1 year ago • +1

      @steveo6023 That applies only to reads, and it always depends.

    • @shittubes • 1 year ago

      @steveo6023 If spinning drives can spend 99% of their time on sequential writes, they will be very fast. If, e.g., 50% of the time is spent on random writes for metadata, the transfer speed will be halved. If the NVMe metadata handling doesn't add other unexpected delays (which I don't know; I'm only wondering whether that's the case), this could be completely predictable in this linear way.
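
      That linear reasoning is easy to sanity-check with made-up numbers; a tiny sketch assuming a 200 MB/s sequential HDD rate and varying the share of time left for sequential I/O:

        # effective throughput = sequential rate x fraction of time spent on sequential I/O
        for pct in 99 90 50; do
          echo "$pct% sequential: $(echo "200 * $pct / 100" | bc) MB/s effective"
        done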

  • @n8c • 1 year ago

    Do you usually run some performance metrics on your customers' machines once they have been built out?
    It feels like you could easily let the same tools run in the background to generate some exemplary "load at 10 am might look like this", which should easily show the differences.
    For StarWind vSAN I used DiskSPD, which seems to have a Linux-port Git repo (YT doesn't like links; it's the first result on Google).
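
    fio is a commonly used Linux stand-in for DiskSPD and can generate that kind of repeatable background load; a minimal sketch (the path, size and runtime are placeholders, and a 4k random-read job only roughly approximates a metadata-heavy pattern):

      fio --name=smallrand --directory=/tank/bench --size=2G \
          --bs=4k --rw=randread --iodepth=16 --numjobs=4 \
          --ioengine=libaio --runtime=60 --time_based --group_reporting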

  • @StephenCunningham1 • 1 year ago

    Stinks that you lose the whole pool if the mirror dies. I'd want to run the special vdev as Z2 as well.

  • @---tr9qg • 1 year ago • +1

    🔥🔥🔥

  • @89tsupra • 1 year ago • +1

    Thank you for the explanation. You mentioned that the metadata is stored on the disks and that an NVMe will help speed that up. Would you recommend adding one for an all-flash storage pool?

    • @steveo6023 • 1 year ago

      As he said, it will keep that load off the storage disks (or flash). Depending on the workload, it could also improve performance for an all-flash pool.

    • @TheChadXperience909 • 1 year ago

      M.2 NVMe drives have lower latencies than drives connected via SATA, and often have faster read/write throughput. It would accelerate such an array, but to a lesser extent. When comparing, you should look at their IOPS.

    • @shittubes • 1 year ago

      It can create better fairness between multiple applications with different access patterns, so that a high-throughput sequential write load won't affect another workload that does mostly very small I/O, whether that is just metadata or small block sizes (handled by the special device).

    • @ati4280 • 1 year ago

      It depends on the SSD types. If you add an NVMe drive to an all-flash SATA pool, the benefit will not be as noticeable as when accelerating an HDD-only pool; the IOPS difference between NVMe and SATA drives is not that significant. The 4k performance of an SSD is not only related to its interface: the NVMe controller model, NAND type, and cache speed and size also play a big role in the final performance of the drive.

  • @pivot3india • 1 year ago

    Is it good to have a metadata disk even if we use ZFS primarily as a virtualisation target?

  • @cyberpunk9487 • 1 year ago • +1

    I'm curious whether this benefits iSCSI LUNs and VM disks. Say I want to use TrueNAS as an iSCSI storage target for Windows VMs, and I would also like to use an SR (storage repo) for VM disks to live on.

    • @shittubes • 1 year ago

      It's only useful for datasets, not usable for zvols

    • @TheChadXperience909 • 1 year ago

      Metadata doesn't (only) mean file metadata, in this case. Zvols also consist of metadata nodes and data nodes, and the metadata nodes do get stored on the special vdev, as well. However, you'll likely see acceleration to a lesser degree than with regular datasets. Though, I read somewhere that you may be able to use file based extents for iSCSI, which means dataset rules would apply.

    • @cyberpunk9487 • 1 year ago

      @TheChadXperience909 From what I remember you can use file extents for iSCSI on TrueNAS, but I vaguely recall hearing that some of the iSCSI benefits are lost when not using zvols.

    • @TheChadXperience909 • 1 year ago

      @cyberpunk9487 Makes sense.

    • @mitcHELLOworld • 1 year ago

      @shittubes We actually don't use zvols for iSCSI LUNs, for a few reasons. We have found much better success deploying fileIO-based LUNs that we create within a ZFS dataset. I believe one of my videos here goes over this, but perhaps it's time for a good refresher on ZFS iSCSI.

  • @zparihar • 8 months ago

    Great demo. What are the risks? Do we need to mirror the special device? What happens if it dies?

    • @meaga • 7 months ago

      You would lose the data in your pool. So yes, I'd recommend mirroring your metadata device. Also make sure that it is sized correctly relative to the size of your data vdevs.
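
      If the pool currently has a single special device, a second one can usually be attached to turn it into a mirror; a sketch assuming a hypothetical pool named "tank" (check zpool status first for the exact name of the device backing the special vdev):

        # attach nvme1n1 as a mirror of the existing special device nvme0n1
        zpool attach tank nvme0n1 nvme1n1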

  • @teagancollyer • 1 year ago

    Hi, what capacity HDDs and NVMe were used for the video? I'm terrible at reading Linux's storage capacity counters. I'm trying to work out a good NVMe capacity for my 32TB (raw) pool; is 500GB a good amount?

    • @45Drives • 1 year ago • +3

      We used 16TB HDDs and a 1.8TB NVMe.
      How much metadata is stored will vary with how many files are in the pool, not only how big it is: 32TB of tiny files will use more metadata space than 32TB of larger files. So it's not always straightforward to pick the size of special vdev needed.
      Okay, so where to go from here?
      The rule of thumb seems to be about 0.3% of the pool size for a typical workload. This is from Wendell at Level1Techs, a very trusted ZFS guru. See this as a reference: forum.level1techs.com/t/zfs-metadata-special-device-z/159954
      So, in your case, 0.3% of 32TB would be about 96GB; therefore a 512GB NVMe will work (see the sketch below). Remember, you will want to at least 2x mirror this drive and buy enterprise NVMe, as you will want power loss protection.
      If you already have data on the pool, you can get the total amount of metadata currently in use with a tool called 'zdb'. Check out this thread as a reference: old.reddit.com/r/zfs/comments/sbxnkw/how_does_one_query_the_metadata_size_of_an/
      You can follow the steps in the above thread, or use a script we put together inspired by it: scripts.45drives.com/get_zpool_metadata.sh
      Usage: "bash get_zpool_metadata.sh poolname"
      Thanks for the question, hope this helps!
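
      A minimal shell sketch of the sizing arithmetic and the zdb route, assuming a pool named "tank"; the zdb flags follow the linked thread, and its output layout varies between OpenZFS versions, so treat it as illustrative rather than a drop-in command:

        # rule of thumb: 0.3% of 32 TB raw is roughly 96 GB of metadata
        echo "32 * 1000 * 0.003" | bc    # prints 96.000, i.e. ~96 GB
        # on an existing pool, zdb's block statistics break space usage down by type
        zdb -Lbbbs tank | less
        # or use the helper script mentioned above
        bash get_zpool_metadata.sh tank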

    • @teagancollyer • 1 year ago

      @45Drives Thanks for the reply. All of that info will be very useful, and I'll be reading those threads you linked in a minute.

  • @steveo6023 • 1 year ago

    Unfortunately it will add a single point of failure when using only one NVMe device, as all data will be gone if the metadata SSD dies.

    • @TheChadXperience909 • 1 year ago • +1

      That's why you should always add it in mirrors, which also has the effect of nearly doubling read speeds, since it can read from both sides of the mirror. The presenter is using mirrors in his example.

    • @n8c • 1 year ago • +2

      This is a lab env, as stated.
      You wouldn't run this exact setup in prod, for various reasons 😅

  • @shittubes • 1 year ago • +1

    I'm honestly quite disappointed that the speedup from the NVMe special device is quite a lot smaller in the larger folders:
    500k: 18/11 ≈ 1.64x
    1k: 119/21 ≈ 5.67x
    The first examples were nice, a 6x speedup, why not.
    But a 2x speedup is not so impressive any more, considering that NVMe should normally be 10x faster even at the largest block sizes.
    In the iostat output I also see the NVMe being read at often just 5 MB/s. Why is it so low?!

    • @TheChadXperience909 • 1 year ago

      The law of diminishing returns.

    • @mitcHELLOworld • 1 year ago

      5 MB/s isn't what matters here. The rated IOPS of a drive is what tells you how fast the storage media will run. For example, if your storage I/O pipeline used a 1 KB block size (it doesn't, but just as an example), then your storage media would need to do 5,000 IOPS just to hit 5 MB/s, whereas with a 1 MB block size, 5,000 IOPS would be 5 GB/s. An HDD is capable of somewhere in the neighborhood of 400 IOPS total (and that's being generous), so an HDD couldn't even hit 450 KB/s at a 1 KB block size.
      As for the special vdev: it's what we consider a "support vdev" and is best used in conjunction with the ARC. It isn't meant to serve ALL metadata requests. However, to easily show the difference between no NVMe and NVMe for the special vdev, he had to considerably handicap the ZFS ARC, because during this test there were no real-world workloads happening, and if he had kept the ARC fully sized you wouldn't have seen a difference between the two anyway, since the ARC would have held everything.
      In a production setting, a large subset of the pool's metadata will live in RAM in the ARC, and the special vdev will be there for any metadata lookups that miss the ARC. When ZFS has a cache miss and there isn't a dedicated special vdev, that can cause quite a bit of latency and slowdown. Adding the special vdev accelerates those metadata lookups by a huge factor.
      The special vdev can also be put to use for small block allocation, which is really cool and can really improve performance of the overall pool. Perhaps we will cover this in more detail in a follow-up video. But in the meantime, Brett and I did discuss this in our "ZFS Architecture and Build" video from a few months ago!
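
      For the small-block part, the OpenZFS dataset property involved is special_small_blocks; a quick sketch assuming a hypothetical dataset tank/vms (keep the threshold below the dataset's recordsize, otherwise everything lands on the special vdev):

        # send blocks of 32K and smaller to the special vdev for this dataset
        zfs set special_small_blocks=32K tank/vms
        zfs get special_small_blocks,recordsize tank/vms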

    • @shittubes • 1 year ago

      @mitcHELLOworld Do I understand correctly that you didn't start with an empty ARC? I can confirm that I often see something around 60-95% ARC hits for metadata in production here, even with a small ARC. That would indeed seem a good enough explanation for why the ratios between HDD and NVMe times aren't higher.

    • @shittubes • 1 year ago

      @mitcHELLOworld What was your recordsize?
      Not sure what block size is best for metadata alone in such an edge case. It would be fun to dig deeper here and check the size of the actual read()s coming back.
      I agree it's better to look at IOPS, old habits :D
      So revisiting the 1000k scenario and concentrating just on the IOPS:
      the special device IOPS seem to peak near the beginning at around 19K, but average ~7K IOPS.
      Meanwhile, the HDDs (all together) don't do much worse, averaging 5-7K IOPS.
      I feel like something else must be the bottleneck, not the actual IOPS capacity of the NVMe drives. Or do you consider 7K good? :P