Get help from the marimo community

Updated 5 months ago

`DataLoader` with `num_workers=1` crashes?

At a glance

The community member is learning about LLMs and encountered an issue when trying to run a Jupyter notebook in Marimo. The problem is that a DataLoader fails when its workers try to use a notebook-local implementation of a Dataset. The suggested solution is to move the ToyDataset class into a separate .py file, but the community member is unsure if this is the expected behavior.

The community member tried moving the ToyDataset class into its own file and importing it, which allowed them to move forward. However, they found it to be a "pain" to set up the pyproject.toml and the editable install just for that. The community member also mentioned that this approach does not seem to be compatible with Marimo's sandbox feature.

The community member suggested that it would be helpful to have the ability to scaffold out a Python directory with a template that puts users on a "happy path" for what they will need in the future, similar to what is available in the JavaScript community with npm init react-app my-app.

The community member provided a GitHub repository and branch with their attempt to resolve the issue, and another community member confirmed that the setup works with marimo's recently added support for uv-sources, noting that paths in the script metadata are resolved relative to the current working directory rather than the notebook file.

I'm learning about LLMs and was working through implementing this Jupyter notebook into Marimo and ran into a problem with a DataLoader trying to run in workers when there is a notebook local implementation of a Dataset. I suppose a solution is to move the ToyDataset class into a separate .py file, but is this the expected behavior? Depending on external files also means that my project needs to have modules fully setup and functional.
26 comments
Sorry, I don't follow. What problem did you run into?
This is the exception thrown when running from the command line. Same exception happens in the web notebook.
Plain Text
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'ToyDataset' on <module '__mp_main__' from 'listing_A_part_2.py'>
Traceback (most recent call last):
  File "listing_A_part_2.py", line 212, in <module>
    app.run()
  File ".venv\Lib\site-packages\marimo\_ast\app.py", line 298, in run
    outputs, glbls = AppScriptRunner(InternalApp(self)).run()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv\Lib\site-packages\marimo\_runtime\app\script_runner.py", line 111, in run
    raise e.__cause__ from None  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv\Lib\site-packages\marimo\_runtime\executor.py", line 170, in execute_cell
    exec(cell.body, glbls)
  File "listing_A_part_2.py", line 189, in <module>
    for _batch_idx, (_features, _labels) in enumerate(train_loader):
                                            ^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1327, in _next_data
    idx, data = self._get_data()
                ^^^^^^^^^^^^^^^^
  File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1293, in _get_data
    success, data = self._try_get_data()
                    ^^^^^^^^^^^^^^^^^^^^
  File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1144, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 22260) exited unexpectedly
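For context, a cell along these lines reproduces the failure; the ToyDataset here is a rough reconstruction of the book's example, not the exact notebook code:
Plain Text
# listing_A_part_2.py (simplified sketch)
import marimo

app = marimo.App()


@app.cell
def _():
    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        # Defined inside the cell function, so it never exists as a
        # top-level attribute of the module.
        def __init__(self, X, y):
            self.features = X
            self.labels = y

        def __getitem__(self, index):
            return self.features[index], self.labels[index]

        def __len__(self):
            return self.labels.shape[0]

    train_ds = ToyDataset(torch.rand(6, 2), torch.tensor([0, 0, 0, 1, 1, 1]))

    # num_workers=1 starts a worker process. On Windows the worker is
    # created with the "spawn" start method, re-imports this file as
    # __mp_main__, and fails to unpickle the dataset because ToyDataset
    # is not defined at module level -- hence the AttributeError above.
    train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=1)
    for _batch_idx, (_features, _labels) in enumerate(train_loader):
        print(_batch_idx, _features, _labels)
    return


if __name__ == "__main__":
    app.run()
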
I got my project directory to be self-installed as an editable module and moved the ToyDataset class into its own file and imported it from there, so I can move on. It is kind of a pain to set up the pyproject.toml and the editable install just for that.
This also doesn't seem to be compatible with Marimo's sandbox feature?
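A minimal sketch of that workaround, assuming the class is moved into a hypothetical llm_from_scratch/toy_dataset.py and the project is installed as an editable package (file and module names here are illustrative, not the exact ones in the repo):
Plain Text
# llm_from_scratch/toy_dataset.py
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    # Lives at module level, so spawned DataLoader workers can import
    # and unpickle it by name.
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]


# In the notebook cell, after an editable install (e.g. `uv pip install -e .`):
# from llm_from_scratch.toy_dataset import ToyDataset
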
What would help is the ability to scaffold out a Python directory with a template that puts users on some sort of happy path for what they will need in the future. This would be especially helpful because there are so many options around what can be done in Python with hacking the path, etc. I've seen this sort of thing in the JavaScript community with npm init react-app my-app, but can't say I've seen it in Python.
Are you running as a script, with marimo edit, or something else? Exact instructions on how to reproduce would help, starting by linking to a marimo notebook. I can't reproduce from a Jupyter notebook.
Now I'm trying to figure out uv and what the right setup is to keep it compatible with marimo edit --sandbox.
I was running with marimo edit but the error wasn't copyable. There was something in the middle that interrupted the selection. So that's why I pasted the error from the command line.
Also, I think I've hit a dead-end on getting it to run with marimo edit --sandbox. I'm getting the following error:

Plain Text
> marimo edit --sandbox .\llm_from_scratch\appx_a\listing_A_part_2.py
Running in a sandbox: uv run --isolated --no-project --with-requirements C:\Users\USER\AppData\Local\Temp\tmpl00x1659.txt marimo edit .\llm_from_scratch\appx_a\listing_A_part_2.py
  × No solution found when resolving `--with` dependencies:
  ╰─▶ Because llm-from-scratch was not found in the package registry and you require llm-from-scratch, we can conclude
      that your requirements are unsatisfiable.


To me this says the target file is copied somewhere by itself so there's no possibility of using shared code in the project folder. You can see this attempt here: https://github.com/ngbrown/build-llm-from-scratch/commit/8fccd4a9d422bfe1085d2f2f7bcb57c69cfee989
I had run uv add --script .\llm_from_scratch\appx_a\listing_A_part_2.py ./ and uv add --script .\llm_from_scratch\appx_a\listing_A_part_2.py torch==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121 to populate the /// script header of the .py file, and then had to manually add the extra-index-url option.
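The resulting inline script metadata ends up looking roughly like this (a reconstruction; the exact requires-python pin and [tool.uv] keys in the repo may differ):
Plain Text
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "llm-from-scratch",
#     "torch==2.4.1+cu121",
# ]
#
# [tool.uv]
# extra-index-url = ["https://download.pytorch.org/whl/cu121"]
# ///

The llm-from-scratch entry is the local project itself, which is what the sandbox resolver could not find on the package registry.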
@Akshay With the GitHub repo and mentioned branches, this should be reproducible. Any thoughts?
To me this says the target file is copied somewhere by itself so there's no possibility of using shared code in the project folder.

That's interesting, thanks for letting me know. This is a bug we should fix. llm_from_scratch shouldn't be added as a dependency
I will make a GitHub issue to track
Oh wait, you shouldn't be adding a local file as a dependency
I see now that you manually added that
llm_from_scratch shouldn't be added as a dependency

Because DataLoader spawns separate processes that can't access a Dataset defined inside a Marimo notebook cell function, I need to move that class into its own file, and somehow the notebook needs to import modules from the local directory. As far as I know, Python doesn't have a way to import a bare file (the directory needs the __init__.py marker file to make its contents importable as a package). Is there another preferred way?
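For reference, the kind of layout this implies (a sketch; the actual file names in the repo may differ):
Plain Text
build-llm-from-scratch/
├── pyproject.toml              # declares the llm-from-scratch package
└── llm_from_scratch/
    ├── __init__.py             # marker file so the directory is importable
    ├── toy_dataset.py          # module-level ToyDataset lives here
    └── appx_a/
        └── listing_A_part_2.py # marimo notebook; imports ToyDataset from the package
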
Mm I see. Sorry for the short messages, a little busy today.
We recently implemented support for uv-sources. I just cloned your repo and tried it, and it works (using marimo 0.9.10).
Disregard my previous message about not adding a local file; since you are populating uv-sources, your use case makes sense.
Let me know if it works for you now?
I got it to work.

The complication is that the paths in the Marimo .py file header are relative to the directory the command is run from, not the file itself. So for this:
Plain Text
# [tool.uv.sources]
# llm-from-scratch = { path = "../../" }

This does not work:
Plain Text
> marimo edit --sandbox .\llm_from_scratch\appx_a\listing_A_part_2.py

But this does:
Plain Text
> cd .\llm_from_scratch\appx_a\
llm_from_scratch\appx_a> marimo edit --sandbox listing_A_part_2.py

I was expecting the paths to be relative to the file itself, not the current directory of the command running it. Is there a spec on the behavior?
I was expecting the paths to be relative to the file itself, not the current directory of the command running it. Is there a spec on the behavior?

We match the behavior of running Python scripts: with python my_directory/my_script.py, the current working directory is the directory the command was run from, not the script's directory.

We can clarify in our docs, perhaps in the FAQ.

We do have a utility for constructing paths relative to the notebook directory (mo.notebook_dir()), but I guess that won't help for the script metadata.
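A quick sketch of that utility, assuming it returns a pathlib.Path as in recent marimo versions (the file name here is hypothetical):
Plain Text
import marimo as mo

# Resolve a file relative to the notebook's own directory rather than
# the directory the command was launched from.
data_path = mo.notebook_dir() / "data" / "example.txt"
print(data_path.exists())
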
For the particular case of script metadata/sandbox, likely uv also matches the Python CLI's behavior when determining the current working directory.