James Mullenbach

What I Wish I'd Known Before Starting My First ML Research Project

26 Dec 2017

I recently submitted my first long paper to an academic conference, as first author. Here are some of the many things I would tell myself before I started, to make my time spent working on it easier and more productive. In practice, this may mostly function as a list of mistakes I found myself making, and may be specific to the project I worked on, but hopefully I’ve distilled what I learned to a form general enough that others can take something away from this as well.

1. Know your prior work.

Knowing the scope of what research is already out there is vital! I fell into the trap of wanting to get started immediately. It’s way more fun to code up a model than it is to painstakingly go through endless pages of Google Scholar, refine search terms, read abstracts, and perform the lit review slog, but that must be the first step.

Also, be thorough. Don’t just find a few examples of similar work; try to find all recent work with some relevance to your approach. This is especially important in machine learning, since you’ll be expected to compare against the strongest previous results.

It’s also important to keep up-to-date on any new papers on the subject that may come up. Although I believe many venues will consider papers that appear within a certain time frame of the deadline to be contemporaneous, and thus comparison won’t be required, it makes a stronger case to at least know of the most recent attempts at the problem.

2. Know your metrics.

One of the issues I faced that was mostly avoidable involved evaluation metrics. Do not write your own metrics if you don’t absolutely have to! Before you do, make sure a reputable library doesn’t already provide an implementation. No, really, really make sure. If you really must implement some metric yourself, treat it as the most important part of the code, which it is. Check it 5000x. Better yet, write a unit test!
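For example, here’s a minimal sketch of what that unit test can look like, assuming scikit-learn is available and using a hypothetical hand-rolled micro-F1 called my_micro_f1 (not code from my project):

```python
import numpy as np
from sklearn.metrics import f1_score

def my_micro_f1(y_true, y_pred):
    # Hypothetical hand-rolled metric: micro-averaged F1 over binary label matrices.
    tp = np.logical_and(y_true == 1, y_pred == 1).sum()
    fp = np.logical_and(y_true == 0, y_pred == 1).sum()
    fn = np.logical_and(y_true == 1, y_pred == 0).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

def test_my_micro_f1_matches_sklearn():
    # Random multilabel matrices should score identically to the library implementation.
    rng = np.random.RandomState(0)
    y_true = rng.randint(0, 2, size=(100, 10))
    y_pred = rng.randint(0, 2, size=(100, 10))
    assert np.isclose(my_micro_f1(y_true, y_pred),
                      f1_score(y_true, y_pred, average='micro'))
```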

This also ties in to number 1 – you should know which metrics you’re evaluating against early on.

3. Keep your changes in order.

You should keep a well-managed pipeline of experimental changes, large or small, that you make to your approach. It might look like this:

  1. (Optional, depending on the size of the change) Create a Git branch.
  2. Document the change in writing. Write a short blurb about why you think it will help, what the effects actually were, and whether you will keep or discard the change. I initially used a physical notebook to keep track of this, but later preferred keeping a set of Markdown notes in the Git repo, where others can also see them.
  3. Find out why the change helped or didn't help. Don't be afraid to dig into the model specifics (e.g. neural network weights, in my case) to understand what's going on. Diving deep like this could also have helped me discover a few bugs during development much faster.
  4. Keep track of the results in a Google Sheet (or a simple log like the one sketched below). Even if they are much worse, you will want to reference them in the future.
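To make steps 2 and 4 a bit more concrete, here’s a rough sketch of a logging helper (the file name and fields are arbitrary assumptions, not any standard) that appends one row per experiment along with the current Git commit, so every number stays tied to the exact code that produced it:

```python
import csv
import datetime
import subprocess

def log_experiment(description, metrics, path="experiments.csv"):
    """Append one row describing an experimental change and its results."""
    # Record the commit hash so each result can be traced back to the code that produced it.
    commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()
    row = {"date": datetime.date.today().isoformat(),
           "commit": commit,
           "description": description,
           **metrics}
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if f.tell() == 0:          # brand-new file: write the header first
            writer.writeheader()
        writer.writerow(row)

# e.g. log_experiment("pretrained embeddings", {"micro_f1": 0.52, "macro_f1": 0.31})
```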

Also, change only one thing at a time! This is not just better scientific practice, but also helps avoid bad situations where you might attribute a performance boost to the smart and perhaps novel change you made, which was in actuality due to the simple data processing step which you also changed. (Not based on a real example…)

Finally, if you find a bug in your code that could invalidate some prior results, make a note of it in the Google Sheet that holds your results, and hide or move the old numbers.

4. Exercise good code sense.

Use Jupyter notebooks for all data processing. I initially wrote a myriad of short scripts to perform each task, like text processing, combining files, and building vocabulary, but this quickly becomes unwieldy. You have to keep track of inputs and outputs very carefully, and there’s no written record of the steps performed by default. Switching to notebooks made changes to data processing 1000% easier, though I kept some more involved steps in scripts to keep the notebooks neat.
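For a sense of the scale I mean, here’s a generic sketch of the kind of step that fits comfortably in a single notebook cell (the file name and whitespace tokenization are just assumptions):

```python
from collections import Counter

# Build a vocabulary from a whitespace-tokenized text file, keeping words seen at least 5 times.
counts = Counter()
with open("train.txt") as f:
    for line in f:
        counts.update(line.lower().split())

kept = [word for word, count in counts.most_common() if count >= 5]
vocab = {word: idx for idx, word in enumerate(kept)}
print(len(vocab), "words kept out of", len(counts))
```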

For model development, I eventually learned to make heavy use of IPython and pdb, which I sorely needed from the beginning. Even with only a very shallow knowledge of both tools, I now know that dropping pdb.set_trace() into problematic code makes debugging much easier, and IPython magic commands like %save can be extremely handy.
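For instance, a tiny sketch (the model API here is made up purely for illustration) of dropping a breakpoint at the exact moment something looks wrong:

```python
import pdb

def train_step(model, batch):
    loss = model.loss(batch)   # hypothetical model API, just for illustration
    if loss != loss:           # NaN is the only value not equal to itself
        # Drops into an interactive prompt right here. Useful commands:
        # p batch, pp vars(model), u / d to move up and down the stack.
        pdb.set_trace()
    return loss
```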

Other tips, in list form:

  1. Implement early stopping when you write your training loop (see the sketch after this list)
  2. Saving predictions is more important than saving the model
  3. Save the exact command used to generate the results
  4. Don't be afraid to delete old code - that's what Git is for!
  5. Data batching is important, and it should be fast. Batch size must be constant.
  6. As mentioned before, sometimes it's necessary to get into the nitty-gritty of your model weights to get a sense of what's happening. For neural nets specifically, there is a wealth of resources on getting details like initialization and activations right, so I won't repeat what I've learned from them, but I will note that keeping a checklist of those details is useful.
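To make items 1 through 3 concrete, here’s a rough sketch of a training loop with patience-based early stopping that also writes out the best dev predictions and the exact command that was run (every name here is a placeholder, not code from my project):

```python
import json
import sys

def train(model, train_batches, dev_batches, metric_fn, max_epochs=100, patience=5):
    best_f1, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        for batch in train_batches:
            model.train_step(batch)                # placeholder training update
        dev_preds = [model.predict(b) for b in dev_batches]
        f1 = metric_fn(dev_preds, dev_batches)     # e.g. a micro-F1 function passed in
        if f1 > best_f1:
            best_f1, bad_epochs = f1, 0
            # Predictions are often more useful later than the model weights themselves.
            with open("best_dev_preds.json", "w") as f:
                json.dump(dev_preds, f)
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                              # early stopping: dev metric stopped improving
    # Save the exact command that produced these results, next to the results.
    with open("command.txt", "w") as f:
        f.write(" ".join(sys.argv) + "\n")
    return best_f1
```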

5. Document everything in an organized, shared space.

Your work will more than likely have at least one co-author, so, of course, it’s important that everyone is familiar with the technical details.

Make figures visually describing your model(s) early on, for easy reference.

Start a paper draft early on, if only as a place to TeX up equations that were scratched out on paper or whiteboard during meetings.


That’s it for now! Let me know if this has helped you or if you have other tips to add. It would be cool to grow this into a real resource.