Organizing Data Through the Lens of Deduplication

Our home file server has been running since 2008, and over the last 12 years, it has accumulated more than 4 TB of data. The storage is shared between four people, and it tends to get disorganized over time. We also had a problem with duplicated data (over 500 GB of wasted space), an issue that is intertwined with disorganization. I wanted to solve both of these problems at once, and without losing any of our data. Existing tools didn’t work the way I wanted, so I wrote Periscope to help me clean up our file server.

Periscope works differently from most other duplicate file finders. It’s designed to be used interactively to explore the filesystem, understand which files are duplicated and where duplicates live, and safely delete duplicates, all without losing any data. Periscope enables exploring the filesystem with standard tools — the shell, and commands like cd, ls, tree, and so on — while providing additional duplicate-aware commands that mirror core filesystem utilities. For example, psc ls gives a directory listing that highlights duplicates, and psc rm deletes files only if a duplicate copy exists. Here is Periscope in action on a demo dataset:
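Apart from the demo, the rule behind a duplicate-aware delete can be sketched in a few lines of Python. This is a hypothetical illustration, not Periscope's actual implementation: the function names (file_digest, find_duplicates, safe_rm) and the choice of SHA-256 content hashing are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def regular_files(root: Path) -> List[Path]:
    """List regular files under root, skipping symlinks."""
    return [p for p in sorted(root.rglob("*"))
            if p.is_file() and not p.is_symlink()]


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by content; keep groups with 2+ members."""
    by_digest = defaultdict(list)
    for p in regular_files(root):
        by_digest[file_digest(p)].append(p)
    return {d: ps for d, ps in by_digest.items() if len(ps) > 1}


def safe_rm(path: Path, root: Path) -> bool:
    """Delete path only if a different file under root has the same content.

    Returns True if the file was deleted, False if it was left in place.
    Hard links to the same inode are deliberately not counted as copies.
    """
    target = file_digest(path)
    for other in regular_files(root):
        if other.samefile(path):
            continue  # the file itself (or a hard link to it)
        if file_digest(other) == target:
            path.unlink()
            return True
    return False
```

A real tool would likely cache digests rather than rehash the tree on every call; the sketch favors clarity over speed. The key property is that safe_rm refuses to delete the last remaining copy of any content.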

Experiments in Constraint-based Graphic Design

Standard GUI-based graphic design tools only support a limited “snap to guides” style of positioning, have a basic object grouping system, and implement primitive functionality for aligning or distributing objects. They don’t have a way of remembering constraints and relationships between objects, and they don’t have ways of defining and reusing abstractions. I’ve been dissatisfied with existing tools for design, in particular for creating figures and diagrams, so I’ve been working on a new system called Basalt that matches the way I think: in terms of relationships and abstractions.

Basalt is implemented as a domain-specific language (DSL), and it’s quite different from GUI-based design tools like Illustrator and Keynote. It’s also pretty different from libraries/languages like D3.js, TikZ, and diagrams. At its core, Basalt is based on constraints: the designer specifies figures in terms of relationships, which compile down to constraints that are solved automatically using an SMT solver to produce the final output. This allows the designer to specify drawings in terms of relationships like “these objects are distributed horizontally, with a 1:2:3 ratio of space between them.” Constraints are also a key aspect of how Basalt supports abstraction, because constraints compose nicely.
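To make the "1:2:3 ratio of space" relationship concrete, here is a small Python sketch of what such a constraint pins down. It is not Basalt code and does not use an SMT solver: it solves the single linear equation by hand with exact fractions. The function name distribute and its parameters are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction


def distribute(widths, total_width, ratios):
    """Place objects left to right so the gaps between them follow `ratios`.

    Constraint being solved, with one unknown gap unit u:
        sum(widths) + u * sum(ratios) == total_width
    Returns the x coordinate of each object's left edge.
    """
    if len(ratios) != len(widths) - 1:
        raise ValueError("need exactly one ratio per gap between objects")
    ratio_sum = sum(Fraction(r) for r in ratios)
    if ratio_sum <= 0:
        raise ValueError("ratios must sum to a positive value")
    free = Fraction(total_width) - sum(Fraction(w) for w in widths)
    if free < 0:
        raise ValueError("objects do not fit in the available width")

    unit = free / ratio_sum  # size of one "share" of the free space
    xs = []
    x = Fraction(0)
    for i, w in enumerate(widths):
        xs.append(x)
        x += Fraction(w)
        if i < len(ratios):
            x += unit * Fraction(ratios[i])  # gap after object i
    return xs


# Four 10-unit-wide objects in a 100-unit row, with gaps in a 1:2:3 ratio.
positions = distribute([10, 10, 10, 10], 100, [1, 2, 3])
print([int(p) for p in positions])  # [0, 20, 50, 90]
```

The designer states only the ratio and the total width; the gap sizes (10, 20, and 30 here) fall out of the constraint. A general solver matters once many such relationships interact, which is where a hand-derived formula like this one stops scaling.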

I’ve been experimenting with this concept, off and on, for the last couple of years. Basalt is far from complete, but the exploration has already yielded some interesting results. The prototype is usable enough that I made all the figures in my latest research paper and presentation with it.

Gemini: A Modern LaTeX Poster Theme

Programs like PowerPoint, Keynote, and Adobe Illustrator are common tools for designing posters, but they have a number of disadvantages, including no separation of content from presentation and no programmatic control over the output. Designing posters with these programs can require countless hours of calculating element positions by hand, manually laying out content, and manually propagating style changes, with these tasks repeated over and over during the iterative process of poster design.

The idea of using a document preparation system like LaTeX to implement a poster using code sounds fantastic, and indeed, there are a number of LaTeX templates and packages for making posters, such as a0poster, sciposter, and beamerposter. However, I didn’t like the look of the existing themes and templates — they all looked 20 years old — and this is what kept me from using LaTeX for making posters, even though I had been using the software for years for authoring documents.

I finally bit the bullet and spent some time designing a clean, stylish, and minimal poster theme for LaTeX, building on top of the beamerposter package. The result has been open-sourced as Gemini, and it makes it really easy to design posters that look like this:

[Figure: poster example]