
Updated 7 months ago

Advanced memory management

At a glance

The post discusses how marimo, a tool, manages memory. Marimo uses a "globals" dictionary to store everything that is defined, and the DAG (Directed Acyclic Graph) determines the order in which to run cells. This could lead to memory build-up, but marimo removes and collects variables on cell invalidation.

The comments discuss an experimental "strict mode" feature in marimo that actively manages the exposed globals to the cell, creating cell-specific "global" environments and performing additional active cleanup. This prevents cross-cell memory mutation, but may have copy overhead. The community members also discuss various approaches to managing memory, such as using mo.drop, restructuring the code, and using a persistent cache feature.

There is no explicitly marked answer in the provided information.

Continuing the conversation with @evandertoorn

Memory Management


How does marimo keep things in memory? marimo internally manages a "globals" dict shared between all cells, and everything that is defined is put into this dictionary. To determine the order in which to run cells, the DAG relies primarily on static code analysis, without regard to what has already been defined. Since the globals dict persists for the whole session, it could potentially lead to memory build-up. In practice, though, marimo removes and collects a cell's variables when that cell is invalidated.
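To make the idea concrete, here is a toy model of a shared globals dict with cleanup on cell invalidation. It is only a sketch: `ToyKernel`, `run`, and `defs` are hypothetical names, and this is not marimo's actual implementation (which tracks definitions through static analysis rather than by diffing the dict).

```python
import gc


class ToyKernel:
    """Toy model only: one shared globals dict plus per-cell definition tracking."""

    def __init__(self):
        self.globals = {}  # shared between all cells
        self.defs = {}     # cell id -> names that cell defined on its last run

    def run(self, cell_id, code):
        # Invalidate first: drop whatever this cell defined on its previous run.
        for name in self.defs.pop(cell_id, set()):
            self.globals.pop(name, None)
        gc.collect()

        before = set(self.globals)
        exec(code, self.globals)
        # exec() injects "__builtins__" into the dict, so exclude it.
        self.defs[cell_id] = set(self.globals) - before - {"__builtins__"}


kernel = ToyKernel()
kernel.run("a", "big = list(range(1000))")
kernel.run("a", "small = 1")  # cell "a" was edited: "big" is dropped before the rerun
```

After the second run, `big` is no longer in `kernel.globals`, so the list it referred to can be collected.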

Can we do even better?


Maybe. One of the current experimental features of marimo is "strict mode" enabled with:

Plain Text
[experimental]
execution_type = "strict"


This mode actively manages the globals exposed to each cell, creating cell-specific "global" environments, and performs additional active cleanup. To prevent cross-cell memory mutation (which is possible but discouraged in marimo's normal mode), strict mode implicitly copies variables between cells (you can wrap variables with zero_copy in this mode to disable this behavior). One advantage of strict mode is that hidden state does not build up, at the cost of copy overhead. One of the edge cases normal-mode marimo does not catch is the following (maybe this is actually a bug, @Akshay?):

_my_var = 1

Then remove the reference to _my_var from the cell, and it will still remain secretly in memory. marimo doesn't clean this up since it has no context with respect to the rest of the graph. Since strict mode accounts for all references, private or not, it removes _my_var if it determines it is no longer needed.
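The copy-between-cells behavior described for strict mode can be illustrated with a small sketch. Everything here is an assumption for illustration: `ZeroCopy` and `build_cell_env` are hypothetical stand-ins, not marimo's `zero_copy` or its internals. The point is only that a cell receives copies of the names it references, unless a value is explicitly marked as shared.

```python
import copy


class ZeroCopy:
    """Hypothetical marker: wrapped values are shared with the cell, not copied."""

    def __init__(self, value):
        self.value = value


def build_cell_env(shared, refs):
    """Build a cell-specific environment containing only the names in `refs`."""
    env = {}
    for name in refs:
        value = shared[name]
        if isinstance(value, ZeroCopy):
            env[name] = value.value           # same object: mutations leak out
        else:
            env[name] = copy.deepcopy(value)  # isolated copy: costs memory
    return env


shared = {"data": [1, 2, 3], "raw": ZeroCopy([1, 2, 3]), "_unused": "x"}
env = build_cell_env(shared, refs=["data", "raw"])

env["data"].append(4)  # only the cell's copy changes
env["raw"].append(4)   # the shared object changes too
```

In this sketch the mutation of `data` stays inside the cell's environment, the mutation of `raw` is visible in `shared`, and `_unused` never reaches the cell. This is the trade-off discussed here: isolation costs a copy, and opting out of the copy gives up the mutation protection.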

Is strict mode worth it?

I think it depends on your use case. You can try it out and, worst case, disable it. It's experimental for a reason, but the more feedback it gets the better. If you frequently prototype with various private variables, strict mode will prevent this variable build-up, but potentially at the cost of the "copy" in other cases. You can fight against this with "zero_copy", but you lose some of the mutation protections.

Best case, you barely notice strict mode and get a possible memory improvement from the active gc; worst case, there's a performance issue.
Example use case:
Plain Text
# import cell
import polars as pl
import seaborn as sns
# raw parsing cell
huge_df = pl.read_parquet("huge.parquet")
# plot 1.
sns.histplot(huge_df, ...)
# plot 2.
sns.boxplot(huge_df, ...)
# For further exploration, we actually only need a subset
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)
# Potentially, I'd be fine with indicating that this is where the need for it to exist stops
mo.drop(huge_df)
# ... further analysis
The DAG could then infer that the variable is no longer usable in later cells.
With respect to strict mode, copying large dataframes (i.e. 60% of RAM) between cells would not be feasible.
Can you add the cell divisions or is this all in one cell?
If in a single cell

Plain Text
# For further exploration, we actually only need a subset
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)
del huge_df


Works in both modes

Plain Text
# cell
huge_df = pl.read_parquet("huge.parquet")

# cell
# For further exploration, we actually only need a subset
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)
required_del_ref = None # Trick marimo to always run this cell first

# cell
required_del_ref # included to ensure correct run order
del globals()["huge_df"]


Not recommended but possible. Won't work in strict mode

Plain Text
# cell
required_del_ref # included to ensure correct run order
huge_df.drop(huge_df.index, inplace=True)


Still not recommended, and particular to dataframes with an in-place API (this is the pandas idiom; polars DataFrames have no .index). Will not work in strict mode

Plain Text
# cell
huge_df = zero_copy(pl.read_parquet("huge.parquet"))
# cell
required_del_ref # included to ensure correct run order
huge_df.drop(huge_df.index, inplace=True)


Will work in strict mode (not recommended)

---

mo.drop is not easily possible, since static analysis primarily works on variable names.
You could just restructure your code though:

Plain Text
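# cell
# assumes `import matplotlib.pyplot as plt` in the import cell
huge_df = pl.read_parquet("huge.parquet")
# plt.subplots() returns a (figure, axes) pair; plt.figure() returns only the figure
plot_fig1, _plot_ax1 = plt.subplots()
plot_fig2, _plot_ax2 = plt.subplots()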
# Build diagrams without displaying
sns.histplot(huge_df, ..., ax=_plot_ax1)
sns.boxplot(huge_df, ..., ax=_plot_ax2)

# Export partial views
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)

del huge_df

# cell
plot_fig1

# cell
plot_fig2


That way huge_df is confined to a single cell. But this is annoying, because any change to partial_df or partial_df2 requires a rerun of that whole cell.
Here's another suggestion with the proposed persistent_cache feature:

Plain Text
# cell
import functools

@functools.cache
def load_huge():
    return pl.read_parquet("huge.parquet")

# cell
with mo.persistent_cache(name="figures") as figures:
    # plt.subplots() returns a (figure, axes) pair
    plot_fig1, _plot_ax1 = plt.subplots()
    plot_fig2, _plot_ax2 = plt.subplots()
    # Build diagrams without displaying
    sns.histplot(load_huge(), ..., ax=_plot_ax1)
    sns.boxplot(load_huge(), ..., ax=_plot_ax2)

# cell
with mo.persistent_cache(name="partial_df"):
  partial_df = load_huge().filter(...)

# cell
with mo.persistent_cache(name="partial_df2"):
  partial_df2 = load_huge().group_by(...).agg(...)

# cell
figures, partial_df, partial_df2 # Ensure the above cells run first
load_huge.cache_clear()  # functools.cache exposes cache_clear(), which drops the cached dataframe


In theory, on a secondary run, load_huge will never have to be called, and the cells will automatically rerun and reload huge_df if the code changes.
I guess you can do this now without the persistent cache blocks, actually. Persistence would just potentially make things faster on secondary runs / restarted kernels.
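The version without persistent cache blocks can be sketched in plain Python. This is a minimal illustration, not marimo-specific: a list stands in for the parquet load, and the `calls` list exists only to show that the loader runs once. `functools.cache`, `cache_info()`, and `cache_clear()` are standard library.

```python
import functools

calls = []  # illustration only: records how often the loader actually runs


@functools.cache
def load_huge():
    calls.append(1)
    return list(range(100_000))  # stand-in for pl.read_parquet("huge.parquet")


# Downstream "cells" derive what they need from the cached object.
partial = load_huge()[:10]
partial2 = sum(load_huge())

info_before = load_huge.cache_info()
load_huge.cache_clear()  # the cache no longer holds a reference to the large object
info_after = load_huge.cache_info()
```

The loader runs once even though two consumers call it, and after `cache_clear()` the large object can be collected as long as nothing else still references it.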
now this looks absolutely delightful