EKB PhD
  • 144
  • 44 168
Matrix of Characters vs. Vector of Strings in Julia lang
Is it faster to iterate over a matrix of characters or a vector of strings in Julia? That is the research question for this video.
Here's Andre's code:
github.com/abieler/AdventOfCodes.jl/blob/main/2024/src/4/4.jl
Here's the Advent of Code:
adventofcode.com/2024/day/4
#julialang #adventofcode
Views: 248

Videos

If-else vs. try-except in Python 3.13 and Julia 1.11
423 views, 14 days ago
Which is faster: if-else or try-except? I test this question when solving a word search from the Advent of Code 2024, Day 04. And is the difference between these two approaches the same in Python and in Julia? Here's the Advent of Code: adventofcode.com/2024 #pythonprogramming #julialang #mojolang #adventofcode
Mojo v. Python v. Julia: Advent of Code 2024, Day 04
377 views, 14 days ago
I test the speed of Mojo versus Python versus Julia when completing the word search coding challenge of the Advent of Code 2024, Day 04. Here's the Advent of Code: adventofcode.com/ Here's my Mojo script: github.com/ekbrown/scripting_for_linguists/blob/main/2024-12-04.mojo Here's my Python script: github.com/ekbrown/scripting_for_linguists/blob/main/2024-12-04.py Here's my Julia script: github....
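
For readers who have not seen the puzzle, here is a minimal Python sketch of the kind of word search that Advent of Code 2024, Day 04 (part one) asks for: counting a word in a grid in all eight directions. It is not the script from the video; the function name `count_word` and the toy grid are made up for illustration.

```python
# Hypothetical sketch, not the author's script: count a word in a grid
# of equal-length strings, looking in all eight directions.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def count_word(grid: list[str], word: str = "XMAS") -> int:
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    last = len(word) - 1
    total = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] != word[0]:
                continue
            for dr, dc in DIRECTIONS:
                end_r, end_c = r + dr * last, c + dc * last
                # Explicit bounds check: in Python a negative index wraps
                # around silently, so an IndexError alone would not catch it.
                if not (0 <= end_r < n_rows and 0 <= end_c < n_cols):
                    continue
                if all(grid[r + dr * i][c + dc * i] == word[i] for i in range(len(word))):
                    total += 1
    return total


toy_grid = [
    "XMAS",
    "M..A",
    "A..M",
    "SAMX",
]
print(count_word(toy_grid))
```

The grid is kept as a list of strings here; swapping in a list of lists of characters only changes how `grid` is built, because the indexing `grid[r][c]` stays the same.
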
Make Your Own Corpus in SketchEngine
47 views, 1 month ago
I show how to make your own corpus in SketchEngine, both by webscraping and by uploading your own files. 1:25 webscrape a corpus 6:05 upload your own files Here's SketchEngine: www.sketchengine.eu/ #corpuslinguistics #sketchengine
Python Counter is fast vs. Pandas and Polars
535 views, 1 month ago
Python's collections.Counter object is fast, sometimes faster than the base Python dict.get method, Pandas, and Polars. Here's my script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_counter_v_other_freqs.py #corpuslinguistics #pythonprogramming
#LancsBox X Metadata
47 views, 1 month ago
I demonstrate how to include metadata in XML files for the corpus linguistics toolkit LancsBox X. Here's LancsBox: lancsbox.lancs.ac.uk/ #corpuslinguistics #LancsBoxX
How much faster is Julia than Python with Keyness Analysis?
218 views, 1 month ago
When performing keyness analysis, which is faster: Python 3.13 or Julia 1.11? That is the question I answer in this video. Here's my Julia code: github.com/ekbrown/scripting_for_linguists/blob/main/get_keywords.jl Here's my Python code: github.com/ekbrown/scripting_for_linguists/blob/main/get_keywords.py #pythonprogramming #julialang #corpuslinguistics
Python 3.13 vs. Julia 1.11 with Word Frequencies
2.3K views, 2 months ago
Are Python 3.13 and Julia 1.11 faster than previous versions of themselves? Among those two, which is faster? I answer these questions in the context of calculating word frequencies of millions of words. #pythonprogramming #julialang #corpuslinguistics
Polars vs. Pandas vs. Tidyverse vs. data.table for Left Join of Data Frames
612 views, 3 months ago
When performing a left join, which is faster: Polars in Python, Pandas in Python, Tidyverse in R, or data.table in R? I pit four data science packages against each other when performing a left join of data frames. Here's the Polars module: pola.rs/ Here's the Pandas module: pandas.pydata.org/ Here's Tidyverse: dplyr.tidyverse.org/reference/mutate-joins.html Here's data.table: github.com/Rdatata...
R Data Structures
63 views, 3 months ago
I present some of the data structures of the R programming language. Here's the lesson plan I use in the video: ekbrown.github.io/ling_data_analysis/lessons/data_structures.html #rlanguage
Basics of R Programming Language
86 views, 3 months ago
Here's the lesson plan I use in the video: ekbrown.github.io/ling_data_analysis/lessons/programming_basics.html
Basic searching in english-corpora.org
94 views, 3 months ago
Here's the slide in the video: docs.google.com/presentation/d/1pqeslieZGt0-X88Vp-SFIdXa0_AftAKme4hQFb8Ccng/edit#slide=id.p3
Quickest way to access items in Python dictionary
182 views, 4 months ago
In Python, which is faster when accessing items in a dictionary: the .items() method, the .keys() method, or iterating over the dict directly? That is the question I answer in this video. Here's my Python script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_quickest_access_dict.py #pythonprogramming #corpuslinguistics
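
The three access patterns named in this description can be compared with a short sketch like the one below. It is an illustration only, not the linked script; the dictionary `freqs` and the function names are invented, and the timings it prints will vary by machine and Python version.

```python
import timeit

# Made-up data: 100,000 keys with integer values.
freqs = {f"word{i}": i for i in range(100_000)}


def via_items(d):
    # Key and value arrive together, so no second lookup is needed.
    total = 0
    for _key, value in d.items():
        total += value
    return total


def via_keys(d):
    # Iterate over the keys view, then look each value up by key.
    total = 0
    for key in d.keys():
        total += d[key]
    return total


def via_dict(d):
    # Iterating over the dict itself also yields keys.
    total = 0
    for key in d:
        total += d[key]
    return total


if __name__ == "__main__":
    for fn in (via_items, via_keys, via_dict):
        seconds = timeit.timeit(lambda: fn(freqs), number=10)
        print(f"{fn.__name__}: {seconds:.4f} s for 10 passes")
```

All three functions compute the same sum, so any difference in the printed times comes from how the items are reached rather than from the work done with them.
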
Strings vs. bytes in Julia when calculating lexical diversity (MATTR)
135 views, 4 months ago
In Julia, is it worth it to convert from string to byte before calculating lexical diversity (MATTR)? How 'bout using @view? Here's my Julia script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_str_v_bytes.jl #julialang #corpuslinguistics
Wait, what?! Python is quicker than Rust when calculating MATTR lexical diversity
227 views, 4 months ago
I benchmark Rust by itself, Rust when called from Python with PyO3, and Python by itself (as well as Mojo and Julia) when calculating the MATTR lexical diversity measure (MATTR = Moving Average Type to Token Ratio). Here's my Rust function: github.com/ekbrown/scripting_for_linguists/blob/main/main_mattr.rs Here's my Python script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_matt...
Python tops the podium against Rust, Julia, and Mojo when calculating lexical diversity (MATTR)
549 views, 5 months ago
Using PyO3, Rust helps Python to calculate lexical diversity
265 views, 5 months ago
Does Rust work quicker than Python on native Python data structures?
364 views, 5 months ago
Is it worth it to call Rust from Python with PyO3?
2.9K views, 5 months ago
How much faster is Dictionaries.jl than Julia's Base Dict?
359 views, 6 months ago
Python vs. Julia with deeply nested dictionaries
395 views, 6 months ago
How much faster has Mojo's dictionary gotten?
5K views, 6 months ago
Is retrieval from a Python dictionary quicker than insertion?
357 views, 7 months ago
Does Python's dictionary get slower as it gets bigger?
1.3K views, 7 months ago
#LancsBoxX keyword analysis (aka. keyness analysis)
181 views, 7 months ago
How much faster is Rust than Python when finding neighboring words?
1.1K views, 7 months ago
How big is a "small" dictionary in Mojo lang?
1.7K views, 8 months ago
Julia lang is (mostly) getting quicker
1.9K views, 8 months ago
Prep18 Advanced speech apps
43 views, 9 months ago
Text-to-Speech, lecture 1 (Prep15)
133 views, 10 months ago

Comments

  • @andrebieler7906
    @andrebieler7906 1 day ago

    Good stuff, thanks for the feature :) So interesting to see that results seem to differ from hardware to hardware. On my AMD Ryzen 9 5950X w/ Julia 1.11.2 I can't reproduce the improvements for the base vec implementation, but glad you can squeeze the lemon even more xD.

    • @ekbphd3200
      @ekbphd3200 1 day ago

      Yeah, I was surprised by the improvement with a base vector. Thanks again for your comment!

  • @dustinhess6798
    @dustinhess6798 3 days ago

    That is awesome! I am really glad we could collaborate and learn things. Sorry I couldn't post my solution. Got caught up with holiday travel. Maybe I can post soon. It's really quite fun to optimize these little problems. Again, I really enjoy your videos!

    • @ekbphd3200
      @ekbphd3200 2 days ago

      I appreciate your comment and the collaboration!

  • @andrebieler7906
    @andrebieler7906 6 days ago

    Thanks again for yet another very nice and informative video. It is always a good excuse for me to play around with Julia when I see it come in dead last in a benchmark :) I've created a script for part one on GitHub which achieves around a 280x speed improvement on my machine (26 ms -> 0.095 ms). Sadly, I cannot post a link here that leads to my repository. I'll try to post the code in a reply to this comment.

    Core changes from the original version:
    - Use a Matrix of characters instead of a Vector of Strings.
    - Use CartesianIndices to navigate the matrix. This makes sure it iterates "the right way to be fast" (col-major vs. row-major memory layout).
    - Limit memory allocations in the hot loop (contrary to popular belief, they are not "basically free"). The new version allocates 16 bytes where the original one allocated > 6 MB.
    - Use StaticArrays for faster array processing.

    • @ekbphd3200
      @ekbphd3200 6 days ago

      Do you mind if I make a video using your script? Woah! Yeah, that's faster. I found your script on your github: github.com/abieler/AdventOfCodes.jl/blob/main/2024/src/4/4.jl Obviously, I'd cite my source (you).

    • @andrebieler7906
      @andrebieler7906 6 days ago

      @ekbphd3200 I would be honored to be featured in a video.

  • @bart2019
    @bart2019 11 days ago

    This approach just makes me cringe. It's completely against the rules of the game to abuse try/catch for plain flow control.

    • @Robdawger
      @Robdawger 11 days ago

      Against the rules, eh? I hope the code-writing police don't come after him!

    • @ekbphd3200
      @ekbphd3200 11 days ago

      What's most surprising to me is the big difference between Julia and Python. In Julia, it seems that it's always best to use if-else rather than try-catch, but in Python it depends on whether you expect to hit the except branch much. Thanks for the comment!
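
The point that the Python result depends on how often the except branch is hit can be probed with a small sketch like this one. It is a hedged illustration, not the script from the video: the helper names (`get_with_if`, `get_with_try`, `run`) and the data are invented, and the sketch prints timings rather than assuming a winner.

```python
import timeit

values = list(range(1_000))


def get_with_if(seq, i, default=None):
    # "Look before you leap": check the bounds first.
    if 0 <= i < len(seq):
        return seq[i]
    return default


def get_with_try(seq, i, default=None):
    # "Ask forgiveness": index first, handle the failure afterwards.
    # Only valid here for non-negative i, since negative indices wrap around.
    try:
        return seq[i]
    except IndexError:
        return default


def run(getter, indices):
    return [getter(values, i) for i in indices]


hits = [i % 1_000 for i in range(10_000)]    # never out of range
misses = [1_000 + i for i in range(10_000)]  # always out of range

if __name__ == "__main__":
    for label, indices in (("all hits", hits), ("all misses", misses)):
        for getter in (get_with_if, get_with_try):
            seconds = timeit.timeit(lambda: run(getter, indices), number=20)
            print(f"{label:10s} {getter.__name__}: {seconds:.4f} s")
```

Mixing `hits` and `misses` in different proportions gives a rough picture of where the two approaches cross over on a given machine.
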

  • @arturdankovsky8293
    @arturdankovsky8293 12 days ago

    Hi! I'm trying to follow your commands in my LancsBox tool. I am wondering whether it doesn't work with, for example, [pos="N.*"] because I have version 5.0.3 and am running it on Windows.

    • @ekbphd3200
      @ekbphd3200 12 days ago

      Hard to know without more info. Did you use an English part-of-speech tagger? Which spaCy model did you have LancsBox X use? Does it use the Penn Treebank tag set, which would allow the POS search that you specified?

  • @SchmidiAUT
    @SchmidiAUT 14 days ago

    Cool, thank you for picking up my comment. 🙂

    • @ekbphd3200
      @ekbphd3200 13 days ago

      Thanks for the comment! I love learning!

  • @User_3584
    @User_3584 18 days ago

    Again, thank you, Dr., for the information.

    • @ekbphd3200
      @ekbphd3200 17 days ago

      Always welcome

  • @SchmidiAUT
    @SchmidiAUT 18 days ago

    You aren't supposed to use try-catch as part of your logic. While on the happy path it comes without overhead, on the error path it is horribly slow. Julia should perform much better if you don't use try-catch.

    • @ekbphd3200
      @ekbphd3200 18 days ago

      Wow! Thanks for that feedback. I didn't know that. I'll try that, or rather I'll if-else that. 😁

    • @dustinhess6798
      @dustinhess6798 17 days ago

      @ekbphd3200 I also tried it on my laptop, and Python and Julia performed very close to each other, at around 15 microseconds. I did not see the gap you got in your results; I can publish mine on git. I also took a stab at part one and wrote it my own way, and in part one, using a benchmark tool, I get a mean of 1.38 microseconds compared to the 11 or 12 from the published script. I can publish what I did if you are interested. Also, all this is to help you write better Julia code. I really like the content, so keep it up.

    • @ekbphd3200
      @ekbphd3200 16 days ago

      @dustinhess6798 Yes, I'd appreciate seeing your code. I'm always interested in learning how to improve my code, as I'm not a trained computer scientist.

    • @dustinhess6798
      @dustinhess6798 15 days ago

      @ekbphd3200 I tried to reply with a GitHub link like 3 times yesterday, and YouTube keeps rejecting my comment for some reason. I will try to repost today, if this comment takes. Also, one thing I want to note, if you aren't aware, is that Julia is a column-major language.

    • @ekbphd3200
      @ekbphd3200 14 days ago

      @SchmidiAUT Thanks again for the feedback. I've tested if-else and try-catch in Julia. Take a watch on my video: th-cam.com/video/-kjFKnEsZRU/w-d-xo.html

  • @grandlagging0zero175
    @grandlagging0zero175 18 days ago

    maybe add haskell? elixir? f#? rust? golang? maybe it's worth testing them in task 13?

    • @ekbphd3200
      @ekbphd3200 18 days ago

      Maybe I'll try rust. I haven't yet looked at those other languages.

    • @grandlagging0zero175
      @grandlagging0zero175 18 days ago

      @ekbphd3200 Why not? Use AI to generate code...

  • @leandrodeleite
    @leandrodeleite 19 days ago

    Interesting comparison. Learned a lot from it, thanks!

    • @ekbphd3200
      @ekbphd3200 19 days ago

      Glad you liked it!

  • @achimwasp
    @achimwasp 22 days ago

    No need for Python here, just cd to the folder with your pdf files in Terminal (you can just type "cd " and then drag the folder from the Finder into the Terminal window). Then execute ".../pdftotext *.pdf". This will convert all your pdf files. There are lots of things that can be done in the Terminal with the programs that are already installed (e.g. "wc" for word or line count, "sed" for replacing text, "awk" for all kinds of text manipulation and frequency counts).

    • @ekbphd3200
      @ekbphd3200 22 days ago

      Thanks!

  • @GabriellaGabriella-n7e
    @GabriellaGabriella-n7e 25 days ago

    I can't find 'PRAAT' menu in the praat objects even though I have downloaded the most recent one. Could you share how we can go on with it?

    • @ekbphd3200
      @ekbphd3200 19 days ago

      Look in the Program Files on Windows or Applications on Mac.

  • @dildarahmad869
    @dildarahmad869 27 days ago

    Dr. EKB, you are amazing; it is very informative. May I please request a guide or manual for regex analysis?

    • @ekbphd3200
      @ekbphd3200 25 days ago

      Thank you! There are quite a few tutorials about regular expressions (aka. regexes) online. For example, this one looks good: www.geeksforgeeks.org/write-regular-expressions/

  • @davidmurphy563
    @davidmurphy563 28 days ago

    That was very cool. perf_counter() is more accurate for measuring performance, btw.

    • @ekbphd3200
      @ekbphd3200 27 days ago

      Thanks! I’ve started using it rather than time(). Thanks for the feedback.

    • @davidmurphy563
      @davidmurphy563 27 days ago

      @ekbphd3200 Ah, it's just one of those "good practice" things devs tell other devs that probably doesn't change very much! :)
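
As a small illustration of the suggestion in this thread, the sketch below times a call with `time.perf_counter()`, which is a monotonic clock with the highest available resolution, whereas `time.time()` reports wall-clock time that the system can adjust. The helper `time_call` and the workload are made up for the example.

```python
import time


def time_call(func, *args, repeats=5):
    """Run func(*args) several times; return its result and the best elapsed time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best


# Hypothetical workload: sort a reversed list of 100,000 integers.
result, seconds = time_call(sorted, list(range(100_000, 0, -1)))
print(f"best of 5 runs: {seconds:.6f} s")
```

Taking the best of several runs is one common way to reduce noise from other processes; the standard-library `timeit` module automates the same idea.
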

  • @User_3584
    @User_3584 1 month ago

    Thanks for the amazing tutorial for us.

    • @ekbphd3200
      @ekbphd3200 1 month ago

      You're very welcome!

  • @123456crapface
    @123456crapface 1 month ago

    Line by line wins if you also consider memory usage. You could easily grab N lines at a time or parallelize the process.

    • @ekbphd3200
      @ekbphd3200 1 month ago

      That's a great point!
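
The idea of grabbing N lines at a time could look roughly like the sketch below. It assumes the task is counting word frequencies, which this thread does not state, and the function name `count_words_in_chunks` is invented for the example.

```python
import io
from collections import Counter
from itertools import islice


def count_words_in_chunks(lines, chunk_size=1_000):
    """Count whitespace-separated words, holding at most chunk_size lines in memory."""
    freqs = Counter()
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        for line in chunk:
            freqs.update(line.split())
    return freqs


# An in-memory stand-in for an open text file.
sample = io.StringIO("the cat sat\nthe dog sat\nthe end\n")
print(count_words_in_chunks(sample, chunk_size=2))
```

Because each chunk is an independent list of lines, the chunks could also be handed to worker processes and the resulting counters added together, which is one way to read the suggestion to parallelize.
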

  • @2broke2code
    @2broke2code 1 month ago

    Underrated content. Subscribed!

    • @ekbphd3200
      @ekbphd3200 1 month ago

      Thank you for subscribing!

  • @saitaro
    @saitaro 1 month ago

    For this task you could either make use of pandas:

        >>> string = "foo bar baz foo".split()
        >>> import pandas as pd
        >>> s = pd.Series(data=string)
        >>> s.value_counts().to_dict()
        {'foo': 2, 'bar': 1, 'baz': 1}

    ...or the Counter from the stdlib, which is typically chosen for this kind of task and is a subclass of dict:

        >>> from collections import Counter
        >>> Counter(string)
        Counter({'foo': 2, 'bar': 1, 'baz': 1})

    The Counter is faster as it employs the counting function implemented in C, _collections__count_elements_impl from the CPython source code. For 84000 words ("Tilda BAR baz THE geez THE THREE geez Marta Medelin".split() * 8400), the pandas approach took 3.72 ms and Counter took 2.87 ms on a MacBook Pro M2 Max (~3.7 GHz).
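
To try the comparison from this comment without installing pandas, a stdlib-only sketch might look like this. It reuses the 84000-word list from the comment; the function names are invented, and the dict.get loop stands in for the "base Python" approach. No timings are claimed here, since they depend on the machine.

```python
import timeit
from collections import Counter

# The 84000-word list from the comment above.
words = "Tilda BAR baz THE geez THE THREE geez Marta Medelin".split() * 8400


def count_with_get(tokens):
    # Plain dict with dict.get supplying the default of 0.
    freqs = {}
    for w in tokens:
        freqs[w] = freqs.get(w, 0) + 1
    return freqs


def count_with_counter(tokens):
    # Counter consumes the iterable in one call.
    return Counter(tokens)


if __name__ == "__main__":
    for fn in (count_with_get, count_with_counter):
        seconds = timeit.timeit(lambda: fn(words), number=20)
        print(f"{fn.__name__}: {seconds / 20 * 1000:.2f} ms per pass")
```

Both functions return the same counts, so the sketch isolates the cost of the counting loop itself.
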

  • @chrismen83240
    @chrismen83240 1 month ago

    1.11 makes Array a native Julia type instead of a C wrapper. I think Dictionary now also relies on the new Memory type, so that made it even better. 1.12 should allow the compiler to actually use it to decide whether a vector can be stack-allocated, etc., so maybe another 1.5x there? Not sure at all.

    • @ekbphd3200
      @ekbphd3200 1 month ago

      Sounds cool!

  • @exxzxxe
    @exxzxxe 1 month ago

    So, a question for you: I have some "test feasibility" code in Python that, when implemented, will be large linked lists of dictionaries of dictionaries, so to speak. What language should I program "commercial grade" code in? My initial strategy was to test code in Python, then write the "commercial code" in Mojo.

    • @ekbphd3200
      @ekbphd3200 1 month ago

      I’m not sure I know. Mojo is still in active development, so things might change in the future that might require you to rewrite your code.

  • @DataPastor
    @DataPastor 1 month ago

    That is not even an order of magnitude difference… and the Python code is not even optimized for speed. Well done Python! 🎉

    • @ekbphd3200
      @ekbphd3200 1 month ago

      Very true!

    • @gillesreyna1272
      @gillesreyna1272 22 days ago

      nor is the Julia code

    • @ekbphd3200
      @ekbphd3200 19 days ago

      @gillesreyna1272 How can I make my Julia script faster?

    • @gillesreyna1272
      @gillesreyna1272 15 days ago

      @ekbphd3200 Can you drop the current script somewhere?

    • @georgerogers1166
      @georgerogers1166 5 days ago

      Startup time

  • @AyeshaKhan-616
    @AyeshaKhan-616 1 month ago

    Can you please explain the log likelihood?

    • @ekbphd3200
      @ekbphd3200 1 month ago

      Here's a new video in which I show the mathematical formula for log likelihood: th-cam.com/video/e20SeAc4ygc/w-d-xo.html
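
For readers who want the formula in code form, here is a sketch of the two-term log-likelihood (G2) statistic that is commonly used for keyness in corpus linguistics: G2 = 2 * sum(O * ln(O / E)), with the expected frequency E of each corpus proportional to its size. Whether the linked video presents exactly this variant is an assumption, and the function and parameter names are invented.

```python
import math


def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Two-term log-likelihood (G2) for one word in a target vs. a reference corpus.

    freq_*: observed frequency of the word in each corpus.
    size_*: total number of words in each corpus.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were spread evenly across both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:  # a term with zero observations contributes 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Made-up counts: 10 hits in a 1,000-word target vs. 5 hits in a 1,000-word reference.
print(round(log_likelihood(10, 5, 1_000, 1_000), 3))
```

A word that occurs at the same relative frequency in both corpora scores 0, and the score grows as the two relative frequencies move apart.
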

  • @micaiahm1
    @micaiahm1 2 months ago

    This is awesome, thanks for taking the time to share this

    • @ekbphd3200
      @ekbphd3200 1 month ago

      You're welcome!

  • @DataPastor
    @DataPastor 2 months ago

    I will touch Mojo as soon as they eliminate the competitive clause from its license.

    • @ekbphd3200
      @ekbphd3200 2 months ago

      Understandable

    • @andrewshorts1198
      @andrewshorts1198 1 month ago

      Out of the loop on this

  • @sounkoumahamanetoure4607
    @sounkoumahamanetoure4607 2 months ago

    What would the same task in R look like, given the native aggregation functions?

    • @ekbphd3200
      @ekbphd3200 2 months ago

      I think you’re referring to the table function in base R. Yeah, you could load up all words in a vector and then pass that vector into the table function and then use the names function to get the words out of the table result (as the table result itself holds the numbers).

    • @juvencus_
      @juvencus_ 2 months ago

      @ekbphd3200 And speed-wise?

    • @ekbphd3200
      @ekbphd3200 1 month ago

      I haven’t tested it, but I assume it would be slower than data.table and tidyverse.

  • @paulmairo
    @paulmairo 2 months ago

    Nice video, this comes in handy as I was indeed asking myself what use cases warrant reaching to PyO3. I am wondering though, if we convert the call `out_dict.get(w, 0)` to a "dummy" `if w in out_dict` won't it be faster? Something I also find missing here in the video is the memory and CPU (cores) usage. Not that I think Python would do better there, but it would be interesting to check.

    • @ekbphd3200
      @ekbphd3200 2 months ago

      Great ideas! I’ll try these ideas in a future video.
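
The question raised in this thread, `out_dict.get(w, 0)` versus an explicit `if w in out_dict` test, can be checked with a sketch like the following. The token list and function names are made up; only `out_dict` and `w` come from the comment. The sketch prints timings without assuming which variant wins.

```python
import timeit

# Made-up tokens: a small vocabulary repeated many times.
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 20_000


def count_with_get(tokens):
    out_dict = {}
    for w in tokens:
        out_dict[w] = out_dict.get(w, 0) + 1
    return out_dict


def count_with_membership(tokens):
    out_dict = {}
    for w in tokens:
        if w in out_dict:
            out_dict[w] += 1
        else:
            out_dict[w] = 1
    return out_dict


if __name__ == "__main__":
    for fn in (count_with_get, count_with_membership):
        seconds = timeit.timeit(lambda: fn(tokens), number=10)
        print(f"{fn.__name__}: {seconds:.4f} s for 10 passes")
```

With a small vocabulary almost every lookup is a hit, so it is worth rerunning with mostly unique tokens as well, since the balance between the two branches changes.
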

  • @sampathnkn1418
    @sampathnkn1418 2 months ago

    Great job, keep it up!

    • @ekbphd3200
      @ekbphd3200 2 months ago

      Thank you much!

  • @dustinhess6798
    @dustinhess6798 2 months ago

    Well, I played around a bit, and what I found was that if you just use mean() from the Statistics package instead of the homegrown, straightforward for-loop implementation, you get an improvement even over the bytes method. Below is how I modified the function; it makes it a bit simpler and more readable, as well as giving a performance boost without the fancy byte stuff. (Nothing wrong with fancy byte stuff; that was a good catch.)

        function get_mattr(word_list::Vector{String}, window_span::Int = 50)
            n_words = length(word_list)
            effective_window_span = min(window_span, n_words)
            n_windows = n_words - effective_window_span + 1
            if n_windows <= 0
                return get_ttr(word_list)
            end
            mean_ttr = mean(get_ttr(word_list[i:(i + effective_window_span - 1)]) for i in 1:n_windows)
            return mean_ttr
        end

    Here is a link to the data and the output graph I generated: drive.google.com/drive/folders/1-AelwjZZtAPGKf_bLkhkLOC0ZZUWBuTf?usp=sharing
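
For comparison, the same moving-window idea can be sketched in Python. This is an illustrative translation, not code from the video or from the commenter: it assumes `get_ttr` is the number of unique tokens divided by the number of tokens, which the comment does not show.

```python
from statistics import mean


def get_ttr(tokens):
    # Assumed definition: type-token ratio = unique tokens / all tokens.
    return len(set(tokens)) / len(tokens)


def get_mattr(tokens, window_span=50):
    """Moving Average Type-Token Ratio: the mean TTR over every window of window_span tokens."""
    n_words = len(tokens)
    if n_words == 0:
        raise ValueError("tokens must not be empty")
    # If the text is shorter than the window, use one window covering the whole text.
    span = min(window_span, n_words)
    n_windows = n_words - span + 1
    return mean(get_ttr(tokens[i:i + span]) for i in range(n_windows))


sample_tokens = "a b a b c".split()
print(get_mattr(sample_tokens, window_span=3))
```

Each slice `tokens[i:i + span]` copies the window, which mirrors the slicing in the Julia function above; avoiding that copy is the kind of change the @view discussion elsewhere on this page is about.
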

  • @dustinhess6798
    @dustinhess6798 2 months ago

    Hey, nice vid. I am a physicist. I work for a photonics quantum computing company and use Julia for modeling in my work. One thing you may consider is using BenchmarkTools for Julia. I am not sure, but the tail at the beginning of your graph might be due to the JIT compiler optimizing. If this is something you do often, you could precompile the Julia code once it is optimized, and that would negate the JIT startup time. I will play around a little bit and get back to you. I like the attitude of always being willing to learn something from someone else. There is so much out there to learn if you just listen and don't jump to conclusions.

  • @StupidInternetPeople1
    @StupidInternetPeople1 3 months ago

    Amazing doucheFace thumbnail! Congrats you look like every unimaginative, lazy creator on YT. Clearly intelligent people choose stupid face thumbnails because looking like an idiot is a huge indicator that your content must be amazing! 😂

  • @iraqi2015
    @iraqi2015 3 months ago

    When I run it on Linux and click on Start, it crashes and closes. I don't know why!

    • @ekbphd3200
      @ekbphd3200 3 months ago

      Darn. Double check that you have the latest version and perhaps ask for help on their discussion board: www.laurenceanthony.net/software/antconc/

  • @ahmedal-attar3478
    @ahmedal-attar3478 3 months ago

    Probably worth noting: Polars is quicker because it's multi-threaded and uses all the cores on the machine, whereas Pandas is single-threaded.

    • @ekbphd3200
      @ekbphd3200 3 months ago

      Thank you for pointing that out! I appreciate it.

    • @paulselormey7862
      @paulselormey7862 3 months ago

      Nice take; benchmarks must go beyond speed. How many resources (CPU, memory) are used to achieve the apparently faster speed?

    • @ekbphd3200
      @ekbphd3200 1 month ago

      I’m not sure. I’ll have to analyze that next.

  • @gardnmi
    @gardnmi 3 months ago

    pandas has a join method. It's supposedly faster. You just have to set the join columns as the index before calling.

    • @ekbphd3200
      @ekbphd3200 3 months ago

      Thanks for the comment. However, I can't get join() to be faster than merge(), in fact join() is 4x slower than merge() in my code. In the pandas section of my code here: github.com/ekbrown/scripting_for_linguists/blob/main/Script_polars_pandas_left_join.py when I comment out my merge() line and uncomment the two set_index() lines and the join() line, it is 4x slower. If you get set_index() + join() to be quicker than merge(), please leave a reply with how. Thanks!

  • @xoruporu310
    @xoruporu310 3 months ago

    What is the difference between the original and the X version?

    • @ekbphd3200
      @ekbphd3200 3 months ago

      As I understand it, the X version works better than the original version with bigger corpora and with XML. It has other features. Take a watch on Vaclav's (the lead on LancsBox and LancsBox X) webinar here, if you'd like: th-cam.com/video/ji5S_xm8_N0/w-d-xo.htmlsi=luso4bFa4gl7UP79

    • @xoruporu310
      @xoruporu310 3 months ago

      @ekbphd3200 Thank you!!

  • @TheBIMCoordinator
    @TheBIMCoordinator 3 months ago

    I have been learning rust so looking forward to watching this video!

    • @ekbphd3200
      @ekbphd3200 3 months ago

      Awesome!

  • @TheBIMCoordinator
    @TheBIMCoordinator 4 months ago

    Great vid!

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Thanks!

  • @TheBIMCoordinator
    @TheBIMCoordinator 4 months ago

    I really enjoy this channel! I have been picking up rust from python trying to solve bottlenecks from speed

    • @ekbphd3200
      @ekbphd3200 4 months ago

      I’m so glad!

  • @hellolk77
    @hellolk77 4 months ago

    Blue & pink lines are roughly linear. Roughly at 70k, there could be some memory allocation or something else that has dropped the performance. More like a one time thing within each test above 70k. With this I tend to think that it's linear for normal hash reading ( blue & pink ).

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Good points!

  • @gardnmi
    @gardnmi 4 months ago

    I would assume the fastest way to access values is using dict.values() :)

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Right!

  • @hellolk77
    @hellolk77 4 months ago

    You didn't access them the way they are meant to be accessed. You need to iterate over the keys and access each value using its key. Otherwise, hashing is not required to get the map elements; a simple array can do that.

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Thanks for the comment. I've created a video testing your idea (if I understand correctly your idea). Take a watch: th-cam.com/video/okPofYRLkRk/w-d-xo.html

  • @bratwurst_addict
    @bratwurst_addict 4 months ago

    This is just what I was looking for. I am working on business logic and there are a lot of SQL statements. Python takes about 9 seconds to have all 22k entries in a dict, a C++ program I had ChatGPT just write up based on the Python code (it took like 20 iterations of me telling ChatGPT well now I'm getting THIS error) was 10x faster. I won't learn c/++, I'll do some Rust!

  • @josecantu8195
    @josecantu8195 4 months ago

    Thanks Professor! I'm learning on the job on my own how to implement python & rust together given my interests in software development, data science & biomedical science so this is an interesting series you made!

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Great to hear it!

  • @AndyQuinteroM
    @AndyQuinteroM 4 months ago

    Great video, but the result is interesting. Mind if I get the full main.rs file and dataset? I would love to run the tests myself and perhaps improve upon it.

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Sure thing! Any feedback that you have is welcome! I'm trying to improve my ability in Rust, so anything you see that could be done better, please let me know. Here's the main file: github.com/ekbrown/scripting_for_linguists/blob/main/main_mattr_native_rust.rs And here's the text file from the Spotify Podcast dataset that I used: github.com/ekbrown/scripting_for_linguists/blob/main/0a0HuaT4Vm7FoYvccyRRQj.txt

  • @pyajudeme9245
    @pyajudeme9245 4 months ago

    Awesome, I was waiting for that video! I thought that Python's GIL blocking in your last video had a much stronger effect. I guess strings are in all programming languages pretty horrible, because utf-8 doesn't have a fixed byte size, so all programming languages have to use the slow techniques that python uses for all data types. Python is pretty good compared to other languages when talking about dicts and strings. The rest is very slow, but thanks God, it is very simple to speed it up if you need it.

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Yeah, I guess so. Python continues to impress.

  • @andrebieler7906
    @andrebieler7906 4 months ago

    FWIW, I got about a 30% speed increase for Julia when working on bytes directly (vs. strings) and passing a @view of byte vectors:

        wds = split(txt)
        bwds = [Vector{UInt8}(word) for word in wds]

    and then passing the @view of bwds instead of wds into the individual functions. Note: I also dropped all the println() and I/O operations in my code, as I was mostly curious about the speed of Julia and not I/O or print (but fair play if it is included).

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Thanks for this! I'll give it a try.

    • @andrebieler7906
      @andrebieler7906 4 months ago

      @ekbphd3200 Very interesting benchmarks and results. I personally have never had anything to do with heavy string manipulation and hence am by no means an expert in that area. For all my use cases, Julia is always orders of magnitude faster than Python. Had fun digging around in your examples. <3

    • @ekbphd3200
      @ekbphd3200 4 months ago

      Thanks for your comments! I tried the advice in your previous comment and it works for me too! Thanks for pointing this out. Sounds like another video! I'll be sure to acknowledge my source (you).

    • @andrebieler7906
      @andrebieler7906 4 months ago

      @ekbphd3200 Oh, very cool, definitely did not expect this to trigger a new video :) I'm sorry, I could have been a bit more helpful with my comment about the @view macro... For it to show an effect, one also needs to add @view in the `get_mattr` function like so:

          numerator += get_ttr(@view in_list[i : (i+window_span-1)])

      Apologies for not pointing that out in my comment. Anyway, the vast majority of the speed gain is from the bytes vector, but maybe something to consider if you want to give @view another shot in the future. Thanks a lot for the mention in the video <3

  • @pyajudeme9245
    @pyajudeme9245 5 months ago

    Great video, I like your benchmark videos, but from Rust's perspective it's a little unfair. It's not representative when Rust is a slave of Python's GIL. However, I think the speed difference between the languages is not that big, because UTF-8 chars don't have a fixed byte size. It would be nice to see a comparison between UTF-8 and byte strings (fixed size). You could also add Zig to the benchmarks. I used it a week ago for the first time (RGB search in a picture). It was between 2 and 10(!) times faster than C, but I haven't tested it with strings yet.

    • @ekbphd3200
      @ekbphd3200 5 months ago

      Yeah. Good point. Perhaps I'll time just the inside of my Rust function, after the Python list is converted into a Rust vector. Yeah, Zig looks interesting.

  • @Noodlezoup
    @Noodlezoup 5 months ago

    Thank you for sharing this!

    • @ekbphd3200
      @ekbphd3200 5 months ago

      My pleasure!

  • @RealLexable
    @RealLexable 5 months ago

    But only as long as Mojo isn't out there to push Python to its coming new standard limits, even faster than C++. The future is going to be fast as hell, bro 🎉

    • @ekbphd3200
      @ekbphd3200 5 months ago

      Awesome!

  • @nandoflorestan
    @nandoflorestan 5 months ago

    Bah, Mojo is not even open source. That's repugnant.

    • @ekbphd3200
      @ekbphd3200 5 months ago

      Maybe I'm misunderstanding, but I thought this means it's open-source: github.com/modularml/mojo/tree/main/stdlib for example, I can see the source code of the List object here: github.com/modularml/mojo/blob/main/stdlib/src/collections/list.mojo Perhaps, I'm not sure what you're saying.

    • @Navhkrin
      @Navhkrin 5 months ago

      It will be made open source, though. Chris clearly mentioned that making a language that has not reached v1.0 open source significantly slows down progress, because open-source projects that are led by committees move slowly. They want to finalize the Mojo spec and features before making it open source. That being said, they have already started making it open source: the std lib and documents are currently open source. Mojo is designed around pushing as many features to libs as possible, so making the std lib open source is already huge.

    • @ekbphd3200
      @ekbphd3200 2 months ago

      @Navhkrin Thanks for that info!

  • @marvinakuffo4096
    @marvinakuffo4096 5 months ago

    How about using glob as a generator? Would that reduce the gap between os.walk and glob? Because in your code, glob() loads everything into memory, and that may adversely impact the runtime.

    • @ekbphd3200
      @ekbphd3200 5 months ago

      I'll have to try that at some point in the future.
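
The generator suggestion in this thread corresponds to `glob.iglob`, which returns an iterator instead of building the full list the way `glob.glob` does. The sketch below sets up a throwaway directory to show both approaches finding the same files; the function names and the `.txt` filter are assumptions for illustration, not taken from the video's code.

```python
import glob
import os
import tempfile


def txt_files_with_walk(root):
    # os.walk is itself a generator: it yields one directory at a time.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                yield os.path.join(dirpath, name)


def txt_files_with_iglob(root):
    # iglob yields matching paths lazily; "**" with recursive=True descends into subfolders.
    pattern = os.path.join(glob.escape(root), "**", "*.txt")
    yield from glob.iglob(pattern, recursive=True)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt"), "c.csv"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("x")
    found_walk = sorted(txt_files_with_walk(root))
    found_iglob = sorted(txt_files_with_iglob(root))
    print(found_walk == found_iglob)
```

Wrapping either generator in `timeit` on a real corpus folder would show whether the lazy version narrows the gap mentioned in the comment.
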