- Videos: 138
- Views: 36,566
EKB PhD
United States
Joined 24 Nov 2021
I make videos about computer scripting and software tools for linguistic analysis.
How much faster is Julia than Python with Keyness Analysis?
When performing keyness analysis, which is faster: Python 3.13 or Julia 1.11? That is the question I answer in this video.
Here's my Julia code:
github.com/ekbrown/scripting_for_linguists/blob/main/get_keywords.jl
Here's my Python code:
github.com/ekbrown/scripting_for_linguists/blob/main/get_keywords.py
#pythonprogramming #julialang #corpuslinguistics
Views: 91
Videos
Python 3.13 vs. Julia 1.11 with Word Frequencies
Views: 859, 28 days ago
Are Python 3.13 and Julia 1.11 faster than previous versions of themselves? Among those two, which is faster? I answer these questions in the context of calculating word frequencies of millions of words. #pythonprogramming #julialang #corpuslinguistics
Polars vs. Pandas vs. Tidyverse vs. data.table for Left Join of Data Frames
Views: 431, 1 month ago
When performing a left join, which is faster: Polars in Python, Pandas in Python, Tidyverse in R, or data.table in R? I pit four data science packages against each other when performing a left join of data frames. Here's the Polars module: pola.rs/ Here's the Pandas module: pandas.pydata.org/ Here's Tidyverse: dplyr.tidyverse.org/reference/mutate-joins.html Here's data.table: github.com/Rdatata...
R Data Structures
Views: 52, 1 month ago
I present some of the data structures of the R programming language. Here's the lesson plan I use in the video: ekbrown.github.io/ling_data_analysis/lessons/data_structures.html #rlanguage
Basics of R Programming Language
Views: 75, 2 months ago
Here's the lesson plan I use in the video: ekbrown.github.io/ling_data_analysis/lessons/programming_basics.html
Basic searching in english-corpora.org
Views: 67, 2 months ago
Here's the slide in the video: docs.google.com/presentation/d/1pqeslieZGt0-X88Vp-SFIdXa0_AftAKme4hQFb8Ccng/edit#slide=id.p3
Quickest way to access items in Python dictionary
Views: 128, 2 months ago
In Python, which is faster when accessing items in a dictionary: the .items() method, the .keys() method, or iterating over the dict? That is the question I answer in this video. Here's my Python script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_quickest_access_dict.py #pythonprogramming #corpuslinguistics
Strings vs. bytes in Julia when calculating lexical diversity (MATTR)
Views: 102, 2 months ago
In Julia, is it worth it to convert from string to byte before calculating lexical diversity (MATTR)? How 'bout using @view? Here's my Julia script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_str_v_bytes.jl #julialang #corpuslinguistics
Wait, what?! Python is quicker than Rust when calculating MATTR lexical diversity
Views: 184, 3 months ago
I benchmark Rust by itself, Rust when called from Python with PyO3, and Python by itself (as well as Mojo and Julia) when calculating the MATTR lexical diversity measure (MATTR = Moving Average Type to Token Ratio). Here's my Rust function: github.com/ekbrown/scripting_for_linguists/blob/main/main_mattr.rs Here's my Python script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_matt...
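For readers new to the measure, here is a minimal Python sketch of MATTR: an illustration only, not the code benchmarked in the video, and the window size is an arbitrary choice.

def mattr(words, window_span=50):
    """Moving Average Type-to-Token Ratio: the mean TTR over all sliding windows."""
    if len(words) <= window_span:
        return len(set(words)) / len(words)  # too short for a window: plain TTR
    ttrs = []
    for i in range(len(words) - window_span + 1):
        window = words[i:i + window_span]
        ttrs.append(len(set(window)) / window_span)
    return sum(ttrs) / len(ttrs)

print(mattr("the cat sat on the mat with the hat".split(), window_span=5))  # 0.88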
Python tops the podium against Rust, Julia, and Mojo when calculating lexical diversity (MATTR)
Views: 483, 3 months ago
It's a battle royale, with Python, Rust, Julia, and Mojo competing to calculate lexical diversity, specifically the Moving Average Type-to-Token Ratio (MATTR) measure. Here's my Python code: github.com/ekbrown/scripting_for_linguists/blob/main/Script_mattr_py_rs.py Here's my Rust code: github.com/ekbrown/scripting_for_linguists/blob/main/lib_mattr.rs Here's my Julia code: github.com/ekbrown/script...
Using PyO3, Rust helps Python to calculate lexical diversity
Views: 244, 3 months ago
I use PyO3 to write Rust code that I call from within Python to calculate lexical diversity, specifically the MTLD_wrap algorithm (MTLD = Measure of Textual Lexical Diversity). Thanks to Scott Jarvis for his Python code, which greatly informed my Python code here: github.com/ekbrown/scripting_for_linguists/blob/main/Script_MTLD_wrap.py Here's my Rust code: github.com/ekbrown/scripting_for_linguist...
Does Rust work quicker than Python on native Python data structures?
Views: 312, 3 months ago
I use the PyO3 Rust crate to have Rust loop over a native-Python PyList data structure and insert into a native-Python PyDict. Is it faster than Python? That's the question I answer in this video. Here's the PyO3 docs on native-Python data types in Rust: pyo3.rs/v0.22.2/conversions/tables Here's my Rust code (thanks to my son Seth for help with it): github.com/ekbrown/scripting_for_linguists/bl...
Is it worth it to call Rust from Python with PyO3?
Views: 1.3K, 4 months ago
Is it worth it to call Rust from Python when inserting into a dictionary? That is the question I explore in this video. Here's my Rust code (shout-out to my son Seth for help with the Rust code): github.com/ekbrown/scripting_for_linguists/blob/main/lib_hashmap.rs Here is my Python code: github.com/ekbrown/scripting_for_linguists/blob/main/Script_rs_helps_py_hashmap.py Here is the PyO3 crate (ak...
How much faster is Dictionaries.jl than Julia's Base Dict?
Views: 281, 4 months ago
I demonstrate how much faster the Dictionaries.jl dictionary object is than the Base Julia dictionary object when populating a dictionary with the frequencies of words in a large file. Here's my Julia script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_freqs_Dictionaries.jl Here's the Dictionaries.jl package: github.com/andyferris/Dictionaries.jl #julialang
Python vs. Julia with deeply nested dictionaries
Views: 350, 4 months ago
When creating a deeply nested dictionary, which is faster: Python or Julia? That's the question I seek to answer in this video, within the context of implementing the MERGE-multidimensional algorithm proposed here: journals.openedition.org/lexis/6231 Here's my Python script: github.com/ekbrown/scripting_for_linguists/blob/main/Script_nested_dictionaries.py Here's my Julia script: github.com/ekb...
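For context, a common Python idiom for building arbitrarily deep nested dictionaries is a recursive defaultdict; a minimal sketch, not the MERGE implementation itself:

from collections import defaultdict

def tree():
    # missing keys spring into existence as further nested dicts
    return defaultdict(tree)

counts = tree()
counts["the"]["quick"]["fox"] = 1  # intermediate levels are created on the fly
counts["the"]["quick"]["dog"] = 2
print(counts["the"]["quick"]["dog"])  # 2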
How much faster has Mojo's dictionary gotten?
Views: 5K, 5 months ago
Is retrieval from a Python dictionary quicker than insertion?
Views: 332, 5 months ago
Does Python's dictionary get slower as it gets bigger?
Views: 1.3K, 5 months ago
#LancsBoxX keyword analysis (aka. keyness analysis)
Views: 136, 6 months ago
How much faster is Rust than Python when finding neighboring words?
Views: 1K, 6 months ago
How big is a "small" dictionary in Mojo lang?
Views: 1.7K, 6 months ago
Julia lang is (mostly) getting quicker
Views: 1.8K, 7 months ago
Mojo hand-written hash map dictionary versus Python and Julia
Views: 1.5K, 8 months ago
Julia vs. Python when calculating lexical diversity
Views: 477, 9 months ago
Why is Mojo's dictionary slower (!) than Python's?
Views: 4.4K, 9 months ago
That is not even an order of magnitude difference… and the Python code is not even optimized for speed. Well done Python! 🎉
Very true!
Can you please explain the log likelihood?
Here's a new video in which I show the mathematical formula for log likelihood: ruclips.net/video/e20SeAc4ygc/видео.html
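For reference, here is a minimal Python sketch of the two-corpus log-likelihood statistic commonly used in keyness analysis (the Rayson & Garside formulation); the counts below are made up:

from math import log

def log_likelihood(a, b, c, d):
    # a, b: the word's frequency in the study and reference corpus
    # c, d: total size of the study and reference corpus
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * log(a / e1)
    if b > 0:
        ll += b * log(b / e2)
    return 2 * ll

print(log_likelihood(150, 800, 1_000_000, 10_000_000))  # hypothetical counts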
This is awesome, thanks for taking the time to share this
You're welcome!
I will touch Mojo as soon as they eliminate the competitive clause from its license.
Understandable
Out of the loop on this
What would the same task look like in R, given the native aggregation functions?
I think you’re referring to the table function in base R. Yeah, you could load up all the words in a vector, pass that vector into the table function, and then use the names function to get the words out of the table result (as the table result itself holds the numbers).
@@ekbphd3200 And speed-wise?
I haven’t tested it, but I assume it would be slower than data.table and tidyverse.
Nice video, this comes in handy, as I was indeed asking myself what use cases warrant reaching for PyO3. I am wondering, though: if we convert the call `out_dict.get(w, 0)` to a "dummy" `if w in out_dict`, won't it be faster? Something I also find missing in the video is the memory and CPU (cores) usage. Not that I think Python would do better there, but it would be interesting to check.
Great ideas! I’ll try these ideas in a future video.
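One way the commenter's comparison might be set up with timeit; the word list is synthetic stand-in data, not the video's corpus:

import random
import timeit

words = [f"w{random.randrange(10_000)}" for _ in range(1_000_000)]

def with_get():
    out_dict = {}
    for w in words:
        out_dict[w] = out_dict.get(w, 0) + 1  # the .get() idiom from the video
    return out_dict

def with_membership():
    out_dict = {}
    for w in words:
        if w in out_dict:  # the commenter's suggested membership test
            out_dict[w] += 1
        else:
            out_dict[w] = 1
    return out_dict

print("get():      ", timeit.timeit(with_get, number=5))
print("membership: ", timeit.timeit(with_membership, number=5))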
Great job, keep it up!
Thank you much!
Well, I did play around a bit, and what I found was that if you just use mean() from the Statistics package instead of the home-grown, straightforward for-loop implementation, you get an improvement even over the bytes method. Below is how I modified the function; it makes it a bit simpler and more readable, as well as giving a performance boost without the fancy byte stuff. (Nothing wrong with fancy byte stuff; that was a good catch.)

function get_mattr(word_list::Vector{String}, window_span::Int = 50)
    n_words = length(word_list)
    effective_window_span = min(window_span, n_words)
    n_windows = n_words - effective_window_span + 1
    if n_windows <= 0
        return get_ttr(word_list)
    end
    mean_ttr = mean(get_ttr(word_list[i:(i + effective_window_span - 1)]) for i in 1:n_windows)
    return mean_ttr
end

Here is a link to the data and the output graph I generated: drive.google.com/drive/folders/1-AelwjZZtAPGKf_bLkhkLOC0ZZUWBuTf?usp=sharing
Hey, nice vid. I am a physicist; I work for a photonics quantum computing company and use Julia for modeling in my work. One thing you may consider is using BenchmarkTools for Julia. I am not sure, but the tail at the beginning of your graph might be due to the JIT compiler optimizing. If this is something you do often, you could precompile the Julia code once it's optimized, and that would negate the JIT start-up time. I will play around a little bit and get back to you. I like the attitude of always being willing to learn something from someone else. There is so much out there to learn if we just listen and don't jump to conclusions.
Amazing doucheFace thumbnail! Congrats you look like every unimaginative, lazy creator on YT. Clearly intelligent people choose stupid face thumbnails because looking like an idiot is a huge indicator that your content must be amazing! 😂
When I run it on Linux and click on start, it crashes and closes. I don't know why!
Darn. Double check that you have the latest version and perhaps ask for help on their discussion board: www.laurenceanthony.net/software/antconc/
Probably worth noting: Polars is quicker because it's multi-threaded and uses all the cores on the machine, whereas Pandas is single-threaded.
Thank you for pointing that out! I appreciate it.
Nice take, benchmark must go beyond speed. How much resources are used (CPU, memory) to achieve the apparent faster speed?
I’m not sure. I’ll have to analyze that next.
pandas has a join method. It's supposedly faster. You just have to set the join columns as the index before calling.
Thanks for the comment. However, I can't get join() to be faster than merge(); in fact, join() is 4x slower than merge() in my code. In the pandas section of my code here: github.com/ekbrown/scripting_for_linguists/blob/main/Script_polars_pandas_left_join.py when I comment out my merge() line and uncomment the two set_index() lines and the join() line, it is 4x slower. If you can get set_index() + join() to be quicker than merge(), please leave a reply with how. Thanks!
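For anyone wanting to poke at the same comparison, a minimal sketch of the two approaches on made-up frames (the column names are hypothetical, not the ones in the linked script):

import pandas as pd

left = pd.DataFrame({"word": ["a", "b", "c"], "freq": [10, 20, 30]})
right = pd.DataFrame({"word": ["a", "c"], "keyness": [1.5, 2.5]})

# merge(): joins on a regular column
merged = left.merge(right, on="word", how="left")

# join(): joins on the index, so the key must be set as the index first
joined = left.set_index("word").join(right.set_index("word"), how="left")

print(merged)
print(joined)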
What is the difference between the original and the X version?
As I understand it, the X version works better than the original version with bigger corpora and with XML, and it has other features. Take a watch of Vaclav's webinar (he's the lead on LancsBox and LancsBox X) here, if you'd like: ruclips.net/video/ji5S_xm8_N0/видео.htmlsi=luso4bFa4gl7UP79
@@ekbphd3200 thank you!!
I have been learning rust so looking forward to watching this video!
Awesome!
Great vid!
Thanks!
I really enjoy this channel! I have been picking up rust from python trying to solve bottlenecks from speed
I’m so glad!
The blue & pink lines are roughly linear. At roughly 70k, there could be some memory allocation or something else that dropped the performance; more like a one-time thing within each test above 70k. With this, I tend to think that it's linear for normal hash reading (blue & pink).
Good points!
I would assume the fastest way to access values is using dict.values() :)
Right!
You didn’t access them the way they were meant to be accessed. You need to iterate over the keys and access each value using its key. Otherwise, hashing is not required to get the map elements; a simple array can do that.
Thanks for the comment. I've created a video testing your idea (if I understand correctly your idea). Take a watch: ruclips.net/video/okPofYRLkRk/видео.html
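For anyone who wants to try it at home, a rough sketch of the three access patterns under discussion, with a throwaway dictionary standing in for real data:

import timeit

d = {f"w{i}": i for i in range(100_000)}

def via_items():
    return sum(v for _, v in d.items())  # keys and values come out together

def via_keys():
    return sum(d[k] for k in d)  # one hash lookup per key

def via_values():
    return sum(d.values())  # no keys, no hashing

for fn in (via_items, via_keys, via_values):
    print(fn.__name__, timeit.timeit(fn, number=100))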
This is just what I was looking for. I am working on business logic and there are a lot of SQL statements. Python takes about 9 seconds to get all 22k entries into a dict; a C++ program that I had ChatGPT write up based on the Python code (it took like 20 iterations of me telling ChatGPT "well, now I'm getting THIS error") was 10x faster. I won't learn C/C++; I'll do some Rust!
Thanks, Professor! I'm learning on the job, on my own, how to implement Python & Rust together, given my interests in software development, data science & biomedical science, so this is an interesting series you made!
Great to hear it!
Great video, but the result is interesting. Mind if I get the full main.rs file and dataset? I would love to run the tests myself and perhaps improve upon it.
Sure thing! Any feedback that you have is welcome! I'm trying to improve my ability in Rust, so anything you see that could be done better, please let me know. Here's the main file: github.com/ekbrown/scripting_for_linguists/blob/main/main_mattr_native_rust.rs And here's the text file from the Spotify Podcast dataset that I used: github.com/ekbrown/scripting_for_linguists/blob/main/0a0HuaT4Vm7FoYvccyRRQj.txt
Awesome, I was waiting for that video! I thought that Python's GIL blocking in your last video had a much stronger effect. I guess strings are pretty horrible in all programming languages, because UTF-8 doesn't have a fixed byte size, so all programming languages have to use the slow techniques that Python uses for all data types. Python is pretty good compared to other languages when talking about dicts and strings. The rest is very slow, but thank God, it is very simple to speed it up if you need it.
Yeah, I guess so. Python continues to impress.
FWIW, I got about a 30% speed increase for Julia when working on bytes directly (vs. strings) and passing a @view of byte vectors:

wds = split(txt)
bwds = [Vector{UInt8}(word) for word in wds]

and then passing the @view of bwds instead of wds into the individual functions. Note: I also dropped all the println() and I/O operations in my code, as I was mostly curious about the speed of Julia and not I/O or printing (but fair play if it is included).
Thanks for this! I'll give it a try.
@@ekbphd3200 Very interesting benchmarks and results. I personally have never done anything involving heavy string manipulation and hence am by no means an expert in that area. For all my use cases, Julia is always orders of magnitude faster than Python. Had fun digging around in your examples. <3
Thanks for your comments! I tried the advice in your previous comment and it works for me too! Thanks for pointing this out. Sounds like another video! I'll be sure to acknowledge my source (you).
@@ekbphd3200 Oh very cool, I definitely did not expect this to trigger a new video :) I'm sorry, I could have been a bit more helpful with my comment about the @view macro... For it to show an effect, one also needs to add @view in the `get_mattr` function, like so:

numerator += get_ttr(@view in_list[i : (i+window_span-1)])

Apologies for not pointing that out in my comment. Anyway, the vast majority of the speed gain is from the bytes vector, but maybe something to consider if you want to give @view another shot in the future. Thanks a lot for the mention in the video <3
Great video, I like your benchmark videos, but from Rust's perspective it's a little unfair: it's not representative when Rust is a slave to Python's GIL. However, I think the speed difference between the languages is not that big, because UTF-8 chars don't have a fixed byte size. It would be nice to see a comparison between UTF-8 and byte strings (fixed size). You could also add Zig to the benchmarks; I used it a week ago for the first time (RGB search in a picture). It was between 2 and 10(!) times faster than C, but I haven't tested it with strings yet.
Yeah. Good point. Perhaps I'll time just the inside of my Rust function, after the Python list is converted into a Rust vector. Yeah, Zig looks interesting.
Thank you for sharing this!
My pleasure!
But only as long as Mojo isn't out there, pushing Python to its coming new standard limits, even faster than C++. The future is going to be fast as hell, bro 🎉
Awesome!
Bah, Mojo is not even open source. That's repugnant.
Maybe I'm misunderstanding, but I thought this means it's open source: github.com/modularml/mojo/tree/main/stdlib For example, I can see the source code of the List object here: github.com/modularml/mojo/blob/main/stdlib/src/collections/list.mojo Perhaps I'm just not sure what you're saying.
It will be made open source, though. Chris clearly mentioned that making a language open source before it has reached v1.0 significantly slows down progress, because open-source projects that are led by committees move slowly. They want to finalize the Mojo spec and features before making it open source. That being said, they have already started: the standard library and the documentation are currently open source. Mojo is designed around pushing as many features into libraries as possible, so making the standard library open source is already huge.
@@Navhkrin Thanks for that info!
How about using glob as a generator? Would that reduce the gap between os.walk and glob? Because in your code, glob() loads everything into memory, and that may adversely impact the runtime.
I'll have to try that at some point in the future.
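The generator variant the commenter describes is glob.iglob, which yields matches lazily instead of building the whole list; a minimal sketch with a placeholder directory and pattern:

import glob
import os

# glob.glob materializes every match in memory up front
paths = glob.glob("corpus/**/*.txt", recursive=True)
print(len(paths))

# glob.iglob yields one match at a time, closer to how os.walk behaves
total_bytes = 0
for path in glob.iglob("corpus/**/*.txt", recursive=True):
    total_bytes += os.path.getsize(path)
print(total_bytes)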
I thought about this a bit more, and I think the MTLD_wrap algorithm has a time complexity of O(n^2). It might be interesting to try to fit a quadratic to the scatter plot instead of a line!
Good idea. Is that different from the LOESS line?
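They do differ: LOESS fits many small local regressions, while the commenter's suggestion is a single global quadratic. A minimal NumPy sketch with made-up timing data:

import numpy as np

# hypothetical (n_words, seconds) measurements standing in for the plotted data
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
t = np.array([0.01, 0.04, 0.17, 0.70, 2.90])

a, b, c = np.polyfit(n, t, deg=2)  # least-squares fit of t ~ a*n**2 + b*n + c
print(f"t ~ {a:.3e}*n**2 + {b:.3e}*n + {c:.3e}")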
I tried PyO3 and its actually a really good library. BTW great content.
Yeah, it seems to be well written and well documented. Thanks! I'm glad you enjoy my videos!
This is fantastic! Thank you for sharing!!
You're very welcome!
If you convert the text into an array of integers, where each integer is an index into an array (or tree) of unique words from the text, you could possibly speed things up a lot, depending on how long it takes to set up the arrays/tree.
Very good idea! I'll have to try this.
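The commenter's idea in miniature: intern each word as an integer once, then do all later comparisons on cheap integers (the names here are illustrative):

def encode(words):
    # map each unique word to a small integer; return the ids plus the vocabulary
    vocab = {}
    ids = []
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)
        ids.append(vocab[w])
    return ids, vocab

ids, vocab = encode("to be or not to be".split())
print(ids)    # [0, 1, 2, 3, 0, 1]
print(vocab)  # {'to': 0, 'be': 1, 'or': 2, 'not': 3}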
Nice! I would love to see the same thing, but with Numpy arrays instead of dicts in Rust.
Good idea! I'll have to try that at some point in the future.
What I learned over the last couple of years is that Python is the best language for working with strings. It might not be the fastest, but the difference from the compiled languages is not that big (sometimes it's even faster: Python regex > C++ regex), unlike with numeric data types. Practically all languages have bad performance when dealing with strings. However, working in Python's interactive mode, doing string operations is priceless. Guido van Rossum said in a recent interview with Lex Fridman that Perl is still the fastest when talking about strings (or regex; I don't remember exactly). It would be nice to see a comparison between Perl and other languages.
Yeah, good points!
That's a really cool project! Thanks for bringing it to our attention!
No, no, thank you!
This was great, thanks! Had no idea this was available. Going to implement it into my python ebook reader
You're welcome! Best of luck!
Nice video; the speed boost from Rust is almost 2x. Can you do a Polars vs. Pandas performance comparison?
Thanks! Yeah, Rust wins again. Ah, interesting idea. I'll have to try that comparison at some point in the future.
Also try mojo.
Yeah, I need to try Mojo too. I'm finding that Mojo isn't yet good at text processing. I hope and assume that it will get better as it is developed more and more.
Instead of split(" "), I suggest split(). Omitting the argument splits on each run of consecutive whitespace.
I'll try that!
@@ekbphd3200 I learned that earlier this year from a certain Mr GPT, after years of stuff like:

while '  ' in someString:  # two spaces
    someString = someString.replace('  ', ' ')

I think it was actually laughing at me.
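The difference in a couple of lines:

s = "the  quick\tbrown\n fox"
print(s.split(" "))  # ['the', '', 'quick\tbrown\n', 'fox']: empty strings, tabs kept
print(s.split())     # ['the', 'quick', 'brown', 'fox']: any whitespace run is one separator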
Have you tried different sizes of datasets, to see whether there is some underlying system cause? 230 million words is a lot, way more than I know... and it's all in one file... and not very parallel...
Yeah, I'm enjoying experimenting with Mojo after each release.
what about Go ?
I haven't yet ventured into Go for text processing.
So Rust essentially takes half the time of Python... nice, but I thought Rust would be a lot faster.
Yeah. Nearly twice as fast.
It could be a lot faster; it all depends on what's being done and how efficient the code is. Not always, though.
I can imagine the 2x is just a function of the complexity times the sample size, which makes me wonder about the curve at scale. There are also some unrelated but important measures, such as speed of development and the effect of higher-level abstractions vs. lower-level optimisation.
So at the end is Julia vs. C
I guess. I don't know how Python's for loops compare to C's, but I guess the dictionary itself is implemented in C.
This is the second time I have viewed this video. Thank you for performing the benchmark-testing work I would have had to do- saved me quite a bit of time. Now a question: Do you believe Mojo will progress to point where its dictionary performance will equal or exceed Python's?
You're very welcome! I'm glad that you enjoyed it. I hope and assume Mojo's native dictionary will get faster with future releases. In the changelog for Mojo v24.4 the creators say: "Significant performance improvements when inserting into a Dict. Performance on this metric is still not where we'd like it to be, but it is much improved." docs.modular.com/mojo/changelog#v244-2024-06-07 With the "still not where we'd like it to be", I assume that they will continue to work on the native dictionary.
That's a significant speed improvement: 3x faster in the newer version. However, it still doesn't explain why Mojo code is still slower than identical Python code, given that Mojo was going for machine-code compilation (not bytecode) with Python's syntax and ease. From the little documentation I have read, the Mojo team explained that Mojo is not Python, but Python will be Mojo, in the sense that Python will instead be an interpreted subset of the compiled Mojo; features in Python not yet implemented in Mojo will dynamically switch to run in an included actual Python runtime, but in the future, Mojo will be self-contained and run all Python code on the Mojo runtime.
Cool! Thanks for looking that up.
How come you know that dunder methods provide high-level sugar in Python, but you call the methods directly in Mojo? You don't need to call object.__len__(), object.__setitem__(), etc. directly in Mojo; they work pretty much the same way they do in Python.
I'll have to try the sugar way in the future. Thanks for pointing this out.
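For reference, in Python the sugar and the dunder are the same call under the hood; a minimal illustration:

class Bag:
    def __init__(self):
        self._items = {}
    def __len__(self):
        return len(self._items)
    def __setitem__(self, key, value):
        self._items[key] = value

b = Bag()
b["word"] = 3  # sugar for b.__setitem__("word", 3)
print(len(b))  # sugar for b.__len__(): prints 1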
And one could also try out the Dictionaries.jl package in Julia, which is much more performant and efficient than the base Julia Dict type.
Thanks for the idea. I just tried Dictionaries.jl to get the frequencies of words across 40k files with 230m words, and it was only slightly faster than Base.Dict (47s vs. 51s). I'll have to implement Dictionaries.jl with a deeply nested dictionary and see how it does.
@@ekbphd3200 Thanks for the nice videos and the work! Besides Dictionaries.jl, Julia offers several other options from DataStructures.jl, such as SwissDict, and other data structures that are claimed to be faster. What I appreciate about Julia is its diverse range of options, often re-implemented within the language itself without needing to track/tune C implementations of basic operations. Therefore, while comparing base types between languages provides valuable insights, it doesn't fully capture the extent of Julia's capabilities. PS: I am a Python user and fan too.
Here's a quick comparison with a simple frequency dictionary: ruclips.net/video/ROgQASMN_lI/видео.html
Thank you so much for this video (and the others). Really interesting.
Glad you enjoyed it!