The thing I would really enjoy is the troubleshooting process which led to this solution.
you'll want to check out next week's video then :)
i work at datadog and it's so cool seeing you use it and visualize everything nicely!!
It always feels good to see our product being used in the wild, even more so by major companies. Great job guys, amazing product
How do you locate the place in the code where optimization is possible? Did you learn about gc.freeze() somewhere else first and then realize it could be used in the project? Or did you notice high memory usage in the services and then actively look for potential solutions and encounter gc.freeze()?
it depends on the framework and how things are set up. usually you want it as late in the parent process before forking as possible.
I've known about this particular function for a while (even made a video on it a year or so ago). I'm currently trying to upgrade python and was hunting for a memory leak and decided to try this out for fun (and profit). had some success with this and similar approaches at previous employers
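For anyone curious, here is a minimal stdlib sketch of what gc.freeze() actually does (the cache dict is just a made-up stand-in for a warmed-up application):

```python
import gc

# stand-in for a warmed-up application: lots of long-lived tracked objects
cache = {i: [i] for i in range(10_000)}

gc.collect()                    # clear out existing garbage first
gc.freeze()                     # move all surviving tracked objects into a
                                # "permanent generation" future GC passes skip
frozen = gc.get_freeze_count()  # how many objects are now exempt

gc.unfreeze()                   # demo only: put them back in the oldest gen
```

Because the frozen objects are never visited by the collector again, their GC bookkeeping fields are never written to, which is what keeps their memory pages shared after a fork.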
For me, locating a problem is usually a mix of debugging, experience (checking known bottlenecks for your application, for example: disk access, API interactions, parsing of big data sources, DB queries), and benchmarking: running operations with different data to evaluate response times. You follow the data step by step until you hit a performance drop in a specific function (rarely is your whole chain of calls equally slow in all parts).
The whole optimization process usually goes like this: optimization is needed for a certain piece of code because it is too slow or resource-consuming; we analyze the code to try to understand the cause of the issue (e.g. inefficient algorithm, too much memory used, slow operation because of too many API/database requests). We first try to just make the code better. If that's not sufficient, we try to apply known but maybe more complex optimization methods (if appropriate) like caching or optimizing external interactions. If we're still not satisfied, we try to find new solutions by studying existing libraries, checking whether we need new tools or libraries, or even restructuring part of the code/infrastructure.
It is a set of skills that you acquire with study (knowing the industry way to do something) and by knowing the tools at your disposal through reading the documentation of your libraries; with time you build a set of solutions, at least for many common problems.
Isn't that why you generally avoid fork and use threads instead? All threads live in the same process, sharing the heap while having their own stacks.
But Python can't achieve true parallelism when you use threads. Maybe the new subinterpreters might deliver a solution
@@JohnZakaria I would say that's a design flaw in the language. Just another reason to hate on python 😂
Python was designed at a time when single-core CPUs were the norm.
Yeah, it might be a problem now.
Yes, they could release Python 4 and break everything to make that work, but that's painful for everyone
@@JohnZakaria Wasn't there some news that they are going to remove the GIL?
You're right, I forgot about PEP 703.
I think it was more for library devs.
The PEP by itself wouldn't speed up code.
If I remember correctly it would slow down regular code
Talk about some great numbers to add to the resume!
In my work I also noticed this at 9:25: the block allocation algorithm is tuned for small objects. However, I need to optimize for storing bigger objects: bytes and str objects with sizes up to 5-10 MB (to be precise, thousands of incoming and outgoing HTML responses), which, as we know, are immutable and require a large contiguous block to store.
As a result I end up in a strange situation where the process has, for example, 50 MB of free RAM already allocated to it, but since there is no free contiguous block of 5 MB, the process asks the OS to allocate more RAM. So I quickly run out of RAM while having a lot of free memory I can't use efficiently (all inside a single process).
Where or how can I get more detailed info about this? And in what direction should I look?
try jemalloc perhaps?
@@anthonywritescode Thank you for the advice. I will try that
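For context on why those big objects behave differently: CPython's small-object allocator (pymalloc) only serves requests up to 512 bytes, so multi-megabyte bytes/str objects go straight to the C-level allocator, and the fragmentation happens at the malloc/libc layer rather than inside pymalloc arenas (which is why swapping in an allocator like jemalloc can help). A tiny illustration:

```python
import sys

small = b"x" * 100               # fits in a pymalloc pool
big = b"x" * (5 * 1024 * 1024)   # over the 512-byte cutoff: handed to malloc

# getsizeof shows the payload plus the object header overhead
print(sys.getsizeof(small), sys.getsizeof(big))
```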
Could we use this in any Django project that uses Celery, or is it specific to Sentry?
should be pretty universally useful, yeah
4:40 Oh this is cool, I really need to learn more about the C implementation underlying Python.
edit: now I wonder how a circular garbage collector works...
Generational algorithm
I don't know for sure how it's implemented in Python, but in general a GC works not by deleting the stuff that needs to be deleted, but by attempting to find everything that is referenced and keeping that (by just traversing the object graph and keeping everything that is reachable)
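A toy sketch of that mark-from-the-roots idea (the graph and names here are made up; CPython itself actually uses reference counting plus a generational cycle detector):

```python
def reachable(roots, refs):
    """Collect everything reachable from the roots by walking references."""
    live = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        stack.extend(refs.get(obj, ()))
    return live

# a toy object graph: 'c' and 'd' form a cycle nothing else points to
refs = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
live = reachable({"a"}, refs)
garbage = set(refs) - live   # the unreachable cycle gets collected
```

Note that the cycle between 'c' and 'd' never gets marked, so a tracing collector frees it without ever needing to "detect" the cycle explicitly.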
It's not the way Python does it, but Floyd’s Cycle Finding Algorithm is a pretty interesting way of finding circular references.
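It's indeed not how CPython does it, but Floyd's algorithm is fun to sketch on a linked list (the class and names here are my own):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def has_cycle(head):
    # Floyd's tortoise and hare: the fast pointer advances two steps per
    # iteration; if there is a cycle, it eventually laps the slow pointer.
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False

a, b, c = Node(1), Node(2), Node(3)
a.next, b.next = b, c
no_cycle = has_cycle(a)   # a -> b -> c -> None
c.next = a                # close the loop: a -> b -> c -> a
cycle = has_cycle(a)
```

It finds a cycle in O(n) time with O(1) extra space, which is what makes it interesting compared to keeping a visited set.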
I know nothing about any code or programming but I keep getting this video and still have no idea what's being said or how the solution worked
you've got to be so proud of yourself jesus
This is a great video. Could you mention whether you saw a visible change in CPU usage and task latency?
We implemented this at work and we did see a decrease in memory consumption, but CPU usage increased quite a bit, which also shows up as some tasks taking twice as long.
our CPU didn't change noticeably, if anything it improved a tiny bit (which is what I expect)
Hey Anthony - just found your last few videos and they have been great - I've been using memray, cProfile, and pystack a lot over the last year and it's good to see how other folks are using them.
One question on gc.freeze() --- I've tried to recreate the standard Python CoW-and-fork behavior with a basic example (load a handful of modules, fork, do some minor calculations, force gc.collect()). Examining the shared and unique memory sets on Debian, I don't seem to be able to recreate the issue in trivial cases.
it's impossible to tell without seeing your setup
If you disable the GC at this point before the fork, doesn't that make your program never free memory at any point after the fork? Do you ever re-enable the GC?
gc freeze does not disable the gc
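A quick sketch showing that: collection stays enabled after gc.freeze(); only the objects alive at freeze time are exempted (the Pair class here is just for the demo):

```python
import gc

gc.freeze()
assert gc.isenabled()        # freeze() does not switch collection off

class Pair:
    pass

a, b = Pair(), Pair()
a.other, b.other = b, a      # reference cycle created *after* the freeze
del a, b
collected = gc.collect()     # new garbage is still found and freed
```

So anything the workers allocate after the fork is still collected normally; the frozen parent objects are simply skipped.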
I know some Python, but not that in-depth; I can barely understand what you are showing in CPython.
How would one learn this stuff?
That would be because the CPython internals are C code, not Python. And much of that code is macros (the lines beginning with a #), which, to simplify, is code that runs before the rest is compiled. Mostly it's checking what compiler and system the code is going to be built with.
__GNUC__ indicates GCC and __clang__ the Clang C compiler, respectively. __STDC_VERSION__ is the version of the C language standard being used. _MSC_VER is the version of Microsoft's Visual C compiler.
how did he open paint when he's on ubuntu?
Can you make a guide on how to use Celery with Flask and Django? Especially how you create Celery workers and wait on them in Flask.
personally I would not recommend using celery. the architectural decision to use it at work predates me and is almost too big to change at this point
@@anthonywritescode what are the alternatives?
any work queue really
I think you can also do this trick with gunicorn
yep! or really any prefork framework
Exactly why I came to the comments. Wondering if anyone has tried this on Gunicorn and saw the results.
Hey how do you use these windows apps directly on your linux desktop?
VM
@@drz1 I know that, but how does he make the individual apps appear directly on the Linux desktop? I have seen it multiple times, e.g. Paint in this video
@@rkdeshdeepak4131 This is not a Linux desktop. It's Windows with a Linux VM in fullscreen mode, so he can simply tab out to other Windows apps
not even full screen either but yes -- I crop the obs scene to just the Linux vm
Had to think a bit to understand. To put it in other words: he does not have a Windows VM in Linux, but a Linux VM in Windows. OBS is running on Windows and is cropped to the area of the Linux VM. When he moves a Windows window on top of the Linux VM window, it is not in the VM but on top of it.
great work
Neat trick. Instead of using Celery prefork, why not use the solo worker, which is single-process, and let k8s scale the workers? This works well for our application and uses far fewer resources. The health probes and pod termination are tricky with long-running tasks, but possible by touching a file periodically. This way k8s handles hung tasks, and you scale up with more pods, not more worker processes.
in theory that's better. practically though there are memory leaks and significant (unused) overhead of just getting the django app initialized. so single worker would be pretty wasteful (that prefork had such an impact is kind of a testament to that)
if each worker were a separate service that had very specific dependencies it would probably make sense? though that would involve tons of work since we have hundreds of different tasks
i am sorry if i missed it, but what does "paging into those objects" mean?
without going into too much detail memory is segmented into chunks which are called pages. when paged in they become resident (copied from the parent process)
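A rough Linux-only sketch of watching residency grow as pages get touched (it reads /proc/self/statm, so it won't work on other platforms; the helper name is my own):

```python
import os

def resident_kib():
    # /proc/self/statm (Linux): the second field is resident pages
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") // 1024

before = resident_kib()
blob = bytearray(8 * 1024 * 1024)           # 8 MiB allocation
step = 4096
blob[::step] = b"\x01" * len(blob[::step])  # write to every page -> resident
after = resident_kib()
```

The copy-on-write twist in CPython is that even read-only access to an object bumps its refcount, which is a write to the page holding the object header, so pages get copied from the parent far more eagerly than you'd expect.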
Great video
What a good engineer! This is why some guys rake in more dough than others.
You know how to make programs more efficient
I know how to use Paint more efficiently
We are not the same
Jeeze what type of server has 6+ terabytes of ram 😮
not a single server, a kubernetes cluster
@@anthonywritescode Thanks, that makes sense.
Hope you get a raise or a bonus for this! ;)
Is that an ubuntu vm on windows?
dang python sucks at copy on write!!
hmm, tbh i would never run something this big in python. maybe rather nodejs? but maybe that's another can of worms... still, the severe performance problems I keep running into with python would strongly disincentivize investing that deeply in it on a high-performance server...
NodeJS has big performance problems too. Something native like Rust would be better
this just reinforces my belief that garbage collection based memory management is evil
a bit naive don't you think
@@squishy-tomato Projection much?
Yeah, just throw hardware at the problem. Cloud vendors must love you.
I didn't understand a damn thing, but the video is interesting. Thanks, Anthony.
Well, what didn't you understand? They told the garbage collector not to track references, so its structures stopped being copied into the child processes.
Imagine someone tries to learn Python and they start on their merry way, learning the basics, building their first hello world. And then you run in, Dumbledore style, and ask them calmly: "HARRY! Did you waste a terabyte of RAM using garbage collection?!?!"
Huh?
Do you think that running gc.freeze() after gc.collect() would improve memory usage even more?
    def _create_worker_process(self, i):
        worker_before_create_process.send(sender=self)
        gc.collect()  # Issue #2927
        return super()._create_worker_process(i)
I put that signal just before collect, and that's why this came to mind.
collect will likely make it worse because it will make more holes in arenas
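worth noting: the CPython docs for gc.freeze() recommend a specific ordering for exactly this reason: disable the GC early in the parent (so collections don't punch holes in memory pages), freeze right before fork, and re-enable in the children. roughly:

```python
import gc
import os

gc.disable()   # early in the parent: collections stop punching holes in pages

# ... application warm-up (imports, caches, app setup) would happen here ...

gc.freeze()    # right before fork(): exempt everything allocated so far

pid = os.fork()
if pid == 0:           # child process
    gc.enable()        # the child collects only what it allocates itself
    os._exit(0)
os.waitpid(pid, 0)     # parent
```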