Paperless-ngx – Open source document management system (nerdyarticles.com)
627 points by thunderbong on Oct 7, 2023 | 182 comments



This is nifty, but seems to lack the one thing that keeps me coming back to DEVONthink: a learning classifier.

With DT, say you’ve scanned or saved 20 docs to your inbox and you want to sort them to their long-term homes. DT will suggest folders based on how closely the new file matches the contents of those folders. It has the UI equivalent of “this looks like 2023 state taxes. Is it? This looks like kid #2’s school stuff. Is it? This looks like the older dog’s veterinarian records. Is it?”

That’s so, so nice.

Lately, as an experiment, I’ve been playing with organizing my docs with Johnny Decimal, then using the Hazel app to sort known docs with fixed structures (think bank statements and the like) into the right folders. My ScanSnap scanner’s software does OCR, so by the time docs land in the inbox folder, they’re ready for automated processing. It’s working pretty well so far, and I may stick with it.

But if I were to go back to an app, it would be DEVONthink or something with most of its features. That classifier is too darn nice, plus its smart rules, plus its scriptability, plus multi-device sync, plus Markdown notes with wiki links to stored docs, plus a thousand other niceties.


Paperless uses tags and will auto tag based on previous scans. IME it works very well (as long as you have a decently sized library of tagged documents) and seldom do I have to add my own tags. It’s not perfect, though, and sometimes I have to go in and fix some of the tags.

https://docs.paperless-ngx.com/advanced_usage/


Oh! Looks like I was wrong. Nice!

I’d still miss DT’s zillion other things I’ve used over the years, but that one would have been a dealbreaker.


From what I can tell, DT is only on Mac, and not open source. If the company goes under, good luck.


You can always export the files, and you could also access them directly in the application's document database if needed.


Previous conversation also here: https://news.ycombinator.com/item?id=37521492


Paperless has this—when I upload a new file it will attempt to categorize it automatically using my existing tags. The more items I put in each tag the better it gets at categorizing them, so it definitely seems to be learning somehow, though I'm not sure on the details of how it works.

I've never used DT, so it's possible that their system is substantially better in some way.
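For intuition only, here is a toy sketch of how suggesting a tag from previously tagged documents could work. It is not how Paperless-ngx or DEVONthink actually implement their classifiers (nothing in this thread says how they do it); the function names, the sample library, and the bag-of-words cosine similarity are all assumptions picked for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_tag(new_text, tagged_docs):
    """Return the tag whose existing documents look most like new_text.

    tagged_docs maps a tag name to a list of document texts already
    filed under that tag. Returns None when no words overlap at all.
    """
    new_vec = Counter(tokenize(new_text))
    best_tag, best_score = None, 0.0
    for tag, texts in tagged_docs.items():
        # Pool all words seen under this tag into one profile vector.
        profile = Counter()
        for text in texts:
            profile.update(tokenize(text))
        score = cosine(new_vec, profile)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical, hand-made library of already-tagged OCR text.
library = {
    "taxes": ["state income tax return 2022", "tax refund notice"],
    "vet": ["veterinarian invoice for dog vaccination"],
}
print(suggest_tag("2023 state tax return", library))
```

This also shows why a bigger tagged library helps: each tag's profile gets richer, so new documents have more words to match against.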


I have used DEVONthink in the past, and paperless more recently, and only more recently still discovered a real need for this in my life for personal reasons. Since I am at the point of picking a route, I am curious how you find DEVONthink and its sync needs relative to something like a web app. I don't find myself with an always-on Mac these days, switching my desk between my personal and work laptop, and I would prefer to always have access to the files, as well as sharing that access with my wife. I see DEVONthink has a companion portable app, but I'm curious if you use it and what you think of it, though honestly I'm unsure the "scan anything to an SMB share" pattern is unbeatable by features elsewhere.


DEVONthink supports multiple sync methods for the same databases. You could use CloudKit or Dropbox to keep all your devices in sync, and then Bonjour and/or Dropbox to sync between your and your wife's devices.


I don’t share the docs with my wife. I mean, I would if she wanted to, but she’d rather I play house librarian. I currently use DT’s iCloud sync and it works well across devices. Really, I have nothing but good to say about the experience. It just works.

But I’m still experimenting with just plain files in iCloud. 99% of my mobile file access is to pull up scans of vaccination cards and things like that. It’s basically read-only from my iPhone (and iPad while I was using it). My phone’s Files app is fully capable of handling that use case for me.


I thought I wanted this originally when I first started going paperless but I quickly realized that as long as I OCR everything and throw it in a pile I can easily grep for "state taxes" and 2023.
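A minimal sketch of that "search the pile" idea, assuming the OCR text of each scan is available as a plain `.txt` file somewhere under one folder (that layout and the function name are my assumptions; for PDFs with a text layer, a tool like pdfgrep does the same job directly):

```python
from pathlib import Path


def find_documents(root, terms):
    """Return sorted paths of .txt files under root that contain every term.

    Matching is case-insensitive. Bytes that are not valid UTF-8 are
    replaced rather than causing the file to be skipped.
    """
    wanted = [term.lower() for term in terms]
    hits = []
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        if all(term in text for term in wanted):
            hits.append(path)
    return sorted(hits)


# Example call (the folder name is hypothetical):
# find_documents("scans", ["state taxes", "2023"])
```

It is a plain substring match, so it only works as well as the OCR did.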


One thing that I’ve done that makes my paper handling process much easier is have my printer/scanner point to a write-only Samba share. Most HP printers support this. I wrote a short script that looks for new files in there (with inotify), runs OCR on them with OCRmyPDF and moves them to a different file share. It means that my non-technical family members can just stick the paper in the document feeder, and 20 seconds later, an OCRed copy ends up on the family file share. You don’t get the fancy tagging and search that this provides, but file shares integrate natively into all OSs, which is a huge perk.
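For anyone wanting to try something similar, here is a rough sketch of such a watcher. It is not the script described above: it polls the folder instead of using inotify to stay within the standard library, the function and folder names are made up, and it does not check whether the scanner has finished writing a file. The only external piece is the OCRmyPDF command line in its basic `ocrmypdf INPUT OUTPUT` form.

```python
import subprocess
import time
from pathlib import Path


def pending_scans(inbox, seen):
    """Return PDFs in inbox that have not been handled yet, sorted by name."""
    return sorted(p for p in Path(inbox).glob("*.pdf") if p.name not in seen)


def ocr_command(src, dest):
    """Build the OCRmyPDF command line: ocrmypdf INPUT OUTPUT."""
    return ["ocrmypdf", str(src), str(dest)]


def process_once(inbox, outbox, seen, run=subprocess.run):
    """OCR every new PDF from inbox into outbox; delete originals on success."""
    done = []
    for src in pending_scans(inbox, seen):
        dest = Path(outbox) / src.name
        result = run(ocr_command(src, dest))
        # Remember the file even on failure so a bad scan is not retried forever.
        seen.add(src.name)
        if result.returncode == 0:
            src.unlink()
            done.append(dest)
    return done


def watch(inbox, outbox, interval=5.0):
    """Poll the inbox forever; an inotify-based watcher would avoid the sleep."""
    seen = set()
    while True:
        process_once(inbox, outbox, seen)
        time.sleep(interval)


# Example (paths are hypothetical): watch("/srv/scan-inbox", "/srv/family-share")
```

The `run` parameter exists only so the OCR step can be swapped out when testing without OCRmyPDF installed.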


I've done something similar although I had to jump through a few hoops to get it to work.

I have a Fujitsu ScanSnap which is one of those feed-through scanners. I have it hooked up to a Raspberry Pi which listens for the button press on the scanner. You press the button, the paper feeds through the scanner and once it has finished the scan a script runs to collate everything into a PDF and drops the result onto a Samba share that's running on the box where paperless-ngx is.

It's pretty neat and feels seamless. The worst part was dealing with SANE and finding linux drivers for my scanner.


I don’t understand the Pi and button part. I also have a Fujitsu ScanSnap and just configure it to save to a Samba share.

What does "listen for button press" mean, and how is it done?


I'm not sure how I would do that on my model (ScanSnap S1300i). It has no touchscreen/control interface, network port, or wifi capability; you have to connect it to a computer via USB.

This works fine on say, a Mac, with the official Fujitsu ScanSnap software, and I'm guessing _that_ supports saving to a samba share, but I wanted a solution that's

1. completely headless, i.e. no desktop machine required and experience needs to be friction free as the headless part means the only way to interact with the scanning function is to press 1 button

2. linux compatible, as I wanted to connect it to a Pi. I had to dig for the drivers, Fujitsu didn't have the right ones for my model on their website!

I couldn't find any official software from Fujitsu, but I found the drivers eventually, so I ended up connecting the scanner to the Pi over USB and gluing the bits together to drop the PDFs onto the Samba share.

The button is located on the scanner, and I run "scanbd" [1] to listen for the button press, this is what coordinates the scan function (feeding the paper through) and then post-scan -> running a script to collate + create PDFs

[1] https://wiki.archlinux.org/title/Scanner_Button_Daemon


I have a ScanSnap N1800, which makes a big difference and eliminates these complications.


I've got this setup with a Brother ADS-1700W scanner, which can write directly to a network share over wifi. Paperless-ngx is running on my NAS which hosts the share as well.


Are all your scans straight or do you also get skewed scans regularly? It's one of my disappointments with the ADS-1700W; a Xerox MFD scanner outputs all straight scans.


If you have any notes on this, I’ve been wanting to set this up for ages and I’d be incredibly grateful!


My solution was pretty much the same as what this guy did, although he had a slightly different model of scanner to me, but it's a very similar setup

https://chrisschuld.com/2020/01/network-scanner-with-scansna...


I started with a script similar to the one you're using (though hand-crafted) with my ScanSnap S1500 (though I have mine run the PDF conversion in the background so I can immediately scan another document without having to wait - this is easy to do now with scanpdf). I've been doing this for about 12 years now, originally manually sorting into directories and using "pdfgrep" to find stuff but more recently I've put everything into a paperless-ngx instance (gradually tagging all the old documents).

I've switched my hand-crafted scripts recently to use scanpdf[1] which seems to give better results (once I tweaked it to be a little less eager to downconvert to B+W). I experimented with using OpenCV models for cropping and straightening (based on examples in a stackoverflow thread at [2]) but I found results were worse than scanpdf so far.

1. http://badge.fury.io/py/scanpdf

2. https://stackoverflow.com/questions/28935983/preprocessing-i...


Paperless-ngx supports a folder on disk that you can drop files into and have them ingested. Throw in a samba container pointed at the same directory in your docker-compose and you’ve replicated the same setup.


Do you have any other info on how to do this? I've looked for this but cannot find how to do it.


I wanted to read the article, but it was incredibly twitchy on my iPhone.

I scan into a Samba share that paperless-ngx picks up automatically, OCRs, tags, and deletes.

A web application is pretty cross platform too, at this point.

Plus I can get to them on my phones with less trouble than a share.


Yeah, I was looking at the docs for this and it looks like a somewhat more featureful version of what I’ve stuck together.

How does it handle when you have digital documents you want to store (a la google drive or similar)?


If you are looking to quickly set up Paperless-NGX check out my little side project https://github.com/jdoss/ppngx. It will set up everything you need to run Paperless-NGX (PostgreSQL, Redis, Tika, Gotenberg, PaperlessNGX, and SFTPGo) inside a Podman Pod on a Linux-based system. You can optionally set it up to start on boot via systemd.

I run this locally on my workstation and send PDFs many times a week from my Brother ADS2800w scanner via SFTP. Paperless NGX has reduced my home office paper piles to almost zero. It is a fantastic open source project and I am very thankful it exists.


Why would you want to use this over one of the official docker compose setups? https://github.com/paperless-ngx/paperless-ngx/blob/main/doc...

They will also automatically launch if you have docker running at boot. Is it just because you prefer redhat/IBM's docker equivalent stack to the much more common and cross platform docker install?


I would want this over docker and docker-compose any day.

I've been using docker compose in production for a couple of years now and it adds another layer on top of systemd that is a continuous source of headache, especially during updates.

Podman gets it right: no central daemon, can automatically generate systemd services for a whole pod. Updates are seamless.

This by itself is enough of a reason to me.


Seconded on the things Podman gets right. Also the isolation of all of the containers in their own network name space makes port management on my workstation super easy. I run many things like Paperless NGX using the same pattern in the start.sh file of my little project. I then use Traefik to route traffic to the right pod. It works great.


I don't use Docker at all on any of my infra or workstations. That's why I made this.


Alright, but you've sort of re-invented docker compose there, but as a shell script. These days docker compose even works with podman if you really prefer IBM's docker implementation to the original.


I am well aware that docker compose mostly works with Podman. I prefer to use Podman with systemd over it. Have you even tried giving this a shot? Maybe try Podman + systemd, following my example in Bash, instead of Docker + Docker Compose and you will see why.


Well... Maybe re-inventing was part of the fun or a learning experience. If you want, there is even this: https://github.com/Mitigram/docker-compose-build


> everything you need to run Paperless-NGX (PostgreSQL, Redis, Tika, Gotenberg, PaperlessNGX, and SFTPGo)

That is a lot of dependency. How stable is Paperless with all those applications making uncoordinated changes on their own schedules?


The only hard dependencies are Redis and Postgres. The official stance is to run them from the provided docker compose file, and the container for paperless-ngx itself is kept updated and working with the stable containers of Redis and Postgres.

Tika and Gotenberg are additional features for scanning and converting MS Office documents to PDF. They're not necessary and I don't use them in my setup at all. Same with SFTPGo. I'm not sure of its use case, but paperless doesn't directly depend on it in any way.


Postgres isn't actually a requirement either. SQLite is supported and is the default unless you provide configuration for connecting to a database host.


I've used the one-click TrueNAS app for a while and it's been solid. I had less luck with the TrueCharts variant.

Entirely possible the instability was my fault but the error messages didn't make it obvious what was wrong.


It's in this script so I can SFTP PDFs from my scanner over the network. Push button, scan paper and then it is SFTP'ed to a shared volume between SFTPGo and Paperless so it is detected and ingested into Paperless NGX automatically.


Even if you use :latest it doesn't update by itself.


Why does this need so many services? Is it meant to scale to enough users where you'd want postgres and Redis? Seems like everything but the machine learning on this could be pure Python if they wanted.

I'd probably still use it since there's awesome tools like this to set it all up, but it seems like a confusing architecture.


https://apps.apple.com/app/id6464425056

Just recently started working on an iOS/macOS app for it. Hope you like it!


Hey, this looks pretty cool.

One question though, my paperless-ngx is behind an SSO login (I use Authelia) with 2FA. Would it be possible to make your app work with that?


I'll have to look into what the flow with Authelia looks like exactly. I guess the forward auth flows of Caddy and Traefik shouldn't be too hard to cover. And that's something I'd also want personally: I'm using a similar forward auth mechanism with Traefik, which I currently have turned off because of the app. I haven't looked into the docs for Authelia's other integrations yet.


Good to hear! I'll keep an eye on your app, hopefully you figure this out.

If you'd like some help testing this out let me know. Email is on my profile.


Nice, looks like you're headed in a good direction with this!


Thank you!


This is great, nice work.


Thank you!


How would you compare this to something like DevonThink, out of curiosity?


I'm not in the position to compare them, as I have never used DevonThink before. You'd also need to compare DevonThink to Paperless-ngx, instead of comparing my iOS app, that is not yet feature complete and which is just one of many clients to access Paperless-ngx.


I've been looking for a suitable document management system for a while. There is one feature I would like to have that I haven't found anywhere, except maybe in $$$ enterprise systems:

I want to add custom metadata to documents by categories/tags/folders, for example like this:

  Invoice { issued: date, invoiceNumber: string, amount: number, due: date }
  Contract { validFrom: date, renewsAt: date, autoRenew: boolean }

When adding a tag like this, it should either automatically fetch this information from the document content (probably very hard) or give you a manual workflow to type it into a form, while showing the document next to it. Maybe just by selecting the text from the PDF.

In the folder list and in the search you would be able to add this metadata as columns, sort by value, or run queries (tag:invoice AND invoice.amount > 1000).
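To make the idea concrete, here is a small sketch of per-tag metadata and a query over it. It is purely illustrative: the `Document` class, the `query` helper, and the sample data are made up, and none of it reflects an existing Paperless-ngx API.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    title: str
    tags: set = field(default_factory=set)
    # Per-tag metadata, e.g. {"invoice": {"amount": 1250.0, "due": date(...)}}
    meta: dict = field(default_factory=dict)


def query(docs, tag, key, predicate):
    """Return docs carrying `tag` whose metadata value for `key` passes predicate."""
    results = []
    for doc in docs:
        if tag not in doc.tags:
            continue
        value = doc.meta.get(tag, {}).get(key)
        if value is not None and predicate(value):
            results.append(doc)
    return results


# Hypothetical sample data following the Invoice / Contract shapes above.
docs = [
    Document("ACME invoice", {"invoice"},
             {"invoice": {"invoiceNumber": "A-17", "amount": 1250.0,
                          "issued": date(2023, 9, 1), "due": date(2023, 10, 1)}}),
    Document("Coffee invoice", {"invoice"},
             {"invoice": {"invoiceNumber": "C-3", "amount": 12.5,
                          "issued": date(2023, 9, 2), "due": date(2023, 9, 16)}}),
    Document("Gym contract", {"contract"},
             {"contract": {"validFrom": date(2023, 1, 1),
                           "renewsAt": date(2024, 1, 1), "autoRenew": True}}),
]

# Equivalent of: tag:invoice AND invoice.amount > 1000
expensive = query(docs, "invoice", "amount", lambda amount: amount > 1000)
print([doc.title for doc in expensive])
```

Sorting by a metadata column would then just be `sorted(...)` with a key that reads the same nested value.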

Edit: this feature seems to be one of the most upvoted feature requests for paperless https://github.com/paperless-ngx/paperless-ngx/discussions/1...


I'm working on my own SaaS document management system that is easy to use, affordable and fully automated. Basically a black hole: throw a scan in or wait for emails to come in, and it will name, tag and categorize it. It will also attempt to retrieve the most important data, such as invoice amounts and customer numbers, so that you can easily distinguish and find the documents you're looking for. It comes with a chat feature so that you can ask things such as "what was my liability insurance number?" and it'll answer from the knowledge of your documents. I find this pretty useful: recently I was at an airport and forgot my flight number. I just asked what my flight number was and it retrieved that information from my recent documents easily. Integration with third party APIs and agnostic backend configuration for LLM and OCR is in progress. It works with Google Cloud Vision OCR and OpenAI at the moment.


Where can I sign up to track progress? This sounds like exactly the future I envisioned. I take great care manicuring my paperless instance such that when the day arrives, the LLM integration can work its magic best.

That said, open source is absolutely table stakes in this, to me. From the documents I have in the system one could trivially impersonate me. Perhaps even as good as clone me. So sending all that off to random internet corporations, no can do.


That's unfortunately why I think Microsoft and Google are going to be the first ones to actually achieve this future. They're the only organizations well known enough that enterprise might trust them with this kind of thing.



I keep this site updated when something changes.

https://turtledev.net/projects/refind-ai


We may want to get in touch with each other. We have an Open Core document management platform that runs in AWS; I'm not sure about your roadmap, but there may be something there that's of use: https://github.com/formkiq/formkiq-core


Cool, I mean - that's a LOT of AWS services right there.

But yeah, let's connect. Take a look at my project as well! https://turtledev.net/projects/refind-ai


Looking through the setup, this seems like an insane way to package an application for users to install: https://docs.paperless-ngx.com/setup

The documentation itself is so full of implementation details that, as someone who is interested in the concept of this, I'm scared off even trying to set up and use this.

The project would be much more approachable if there was a simple native installer. My parents could also benefit from this but there's no way they would ever even understand how to install this, much less troubleshoot docker things.


It doesn’t look like the project goals include being installable by your parents

It looks to sit in the self hosted space that has an admin manage all the sysadmin tasks. They’ve provided docker which is a pretty good step.

There are desktop apps aimed at single or less experienced users, which might be more suitable.


You might want Recoll[1]. Similar if less powerful capabilities, cross-platform, open source, has Windows and macOS installers.

Still an overly complex FOSS user interface for a tech-unsavvy target with lots of digging around to configure it (OCR setup, for instance[2]), but at least you don't need to know what Docker is to install it.

1: https://www.lesbonscomptes.com/recoll/

2: https://www.lesbonscomptes.com/recoll/usermanual/webhelp/doc...


It's a bit rich calling it insane just because it's not immediately approachable for you. Not every project is aiming for mass-adoption and if you want to lower the barriers, that's on you to make happen.

Like a sibling commenter is already doing, for example: https://news.ycombinator.com/item?id=37802337

Consider pitching in time/money there (if welcome) instead of complaining when not everything is served to you on a silver platter.

A sqlite backend would be another thing that could reduce the complexity for a minimal OoB setup, I guess.

Truth be told, I wouldn't find it unfair to call some architectures/setups "insane". This is not one of them. If you've done any meaningful self-hosting in the past decade, it's as straightforward as it can be.


Sqlite is already supported and the default.


Self-hosting services usually entails more technical knowledge than just installing an app, and I don't think a document management system would necessarily work well as a native application. For starters, there's the backup issue: you wouldn't want non-technical people to store important documents that only live on a local drive. Remote web access is also a very useful feature when travelling, and that wouldn't be easy to set up for a local install.

I've been using it for over a year and am very happy with it, though I intend on moving it from my home Pi docker swarm onto a free Oracle cloud instance to improve the performance and uptime (I've got my Pis auto updating and rebooting, so services get shunted around fairly often).


> The project would be much more approachable if there was a simple native installer

Actually the very first example on https://docs.paperless-ngx.com/setup lists an interactive installer which asks the user some questions and eventually arrives at a working docker-compose setup.

    $ bash -c "$(curl -L https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/install-paperless-ngx.sh)"

If you ask me, this is already pretty user friendly. Although I agree that if your needs are more involved, there is some reading you'll have to do.

I am currently in the process of migrating from mayan-edms to paperless-ngx and it feels pretty approachable to me if you know your way around docker (compose).


It is designed to be a server application, so it'd be very difficult to offer a desktop-like app experience, that's easier to install.


If you own a Synology NAS I recommend having a look at synOCR:

https://github.com/geimist/synOCR/wiki

English translation: https://github-com.translate.goog/geimist/synOCR?_x_tr_sl=au...

I've been using this for several years and it works great.


This looks neat. Wish I'd known about it when I started with the original Paperless a couple of years ago (then NG and now NGX). Might give it a try if I ever need to change.


Paperless-NGX doesn't have document version history, unfortunately.

Right now I am looking at OpenProDoc [1] and bitfarm-archiv [2] as document management possibilities.

[1] http://jhierrot.github.io/openprodoc/Spec_EN.html

[2] https://www.bitfarm-archiv.com/document-management/features....


I am just rcloning my paperless-ngx document volume to s3 deep glacier every night for this.

It's a bit "scary" since even documents I delete in paperless-ngx are thus preserved forever, but it may come in handy someday.


Genuine question: for simple needs, why use this or DevonThink over macOS' built-in features? macOS now does OCR (Live Text), has tagging, and spotlight search is fast (but sometimes presents too many results to be useful). I even stopped splitting PDFs into separate documents and organizing them into folders. I just search.


Does auto OCR work on iCloud files? For example: I ScanSnap a huge collection of documents to a folder that is on iCloud (synced with my desktop). It works great because it is so simple. However, if I have, say, a PDF document, will the Mac OCR functionality perform the OCR if the doc is on iCloud, and will I then be able to search for the text in that doc via Spotlight / Finder? I tested this a few years ago and the search on content inside scanned PDFs did not work. I had looked at Paperless but decided to stay on the macOS file system.


Are you talking about iCloud Drive? As far as I can tell, files in there are just normal files, so Live Text works. You can easily put a screenshot / pdf in there and see.


This is more designed for a self hosted server, so if you want multi-device web access then it's a great solution. I can download a PDF on my android phone and upload it to my paperless-ngx instance in a couple of clicks and easily edit the tags as necessary. It's great for travelling as you're not reliant on having a locally installed application on your chosen device with you, and of course it would still be available if you lost your main device and only had your phone on you.


Makes sense, but how about Dropbox / iCloud Drive as alternatives? PDFs / images are somewhat small (at least relative to videos). I just stuff all my PDFs in Dropbox. I'm almost completely paperless and I don't seem to accumulate that many scanned docs to fill up even the free tier storage space.


Yeah, it depends on what you want from a document management system. If you've got a bunch of searchable PDFs, then storing them in a cloud service might well be sufficient. Paperless-NGX adds OCR to the mix (probably more useful for scanned paper images) and also tags. When I add a document, it fills in the best guess for correspondent, document type and appropriate tags, which tends to be accurate for common documents (e.g. payslips, statements) and usually only needs me to change them if it's from someone new.

What I find most useful is grouping together holiday documents such as travel insurance, holiday booking details, passports etc. and assigning a suitable tag, so I can easily find the relevant info. You could easily replicate that by copying those documents into a separate folder for easy access, but with Paperless-NGX it does most of the organisation for you and the search is more flexible as you can specify what kind of document you're looking for and who it came from.


I used to be the target audience and really enjoyed having my system just right, sorting and tagging everything, etc. But over the years I realized that I wasn’t really benefiting much, and gave SwiftScan on my iPhone + dumping into an iCloud folder a try. For my needs, this has worked fine. It is rare I even need to refer to the scans, and the macOS OCR + automatic dates usually let me find the doc quickly. In the worst case I browse thumbnails.


Yeah. I had a DEVONthink-based setup but after one too many database corruptions I threw in the towel. Now I just OCR scan everything into a few macOS folders and search using HoudahSpot (Spotlight, I found, was not suitable for fine-grained search). I’m very happy with the setup.


HoudahSpot looks cool! What kind of query do you use it for that Spotlight can't find for you?


Obvious answer: because, contrary to popular belief, not everyone uses macOS.


I self host a couple things, but if I had to choose only one, it’d be this. So far the project strikes a great balance of stability (zero issues over two years now) and new features (ownership concept already available, allowing for multiple accounts in a pretty intuitive way).

I’ve killed my instance twice now and had to restore from backup, which is also surprisingly pleasant to do. Their document exporter makes that possible. Having everything in a single JSON and otherwise just the raw PDFs makes a ton of sense and has me confident my documents are “just there” and moving to a different system would be feasible.


>the project strikes a great balance of stability (zero issues over two years now)....

>I’ve killed my instance twice now and had to restore from backup, which is also surprisingly pleasant to do

Stable, but murderable?


Yep, it's not undying, but the murder happened at no fault of theirs. I'm taking credit for that one.


Is this in reality a German cry for help, disguised as tech talk?

As one of the least digitized countries in Europe, with the digitalization budget recently cut by 99%, it seems like they still need to use paper in their lives, and it's not gonna improve soon.

This feels so incredibly archaic to me as a Norwegian, I would have to print out documents to have anything to fill paperless-ngx with.


You can just use your digital documents directly, and augment it with the few paper receipts that you might (or might not) still have to deal with. The main selling point is really document management (to me, anyway), the 'branding focus' on physical documents is probably a little misleading.


I track, using tags, whether a document is a scan or properly digital. The pendulum is strongly in favor of the latter: I use this tool a ton for natively digital documents as well. Invoices, contracts, tickets etc. all come in as PDFs anyway, luckily. I have all that knowledge at the tip of my fingers. Yes, some of those documents are scans and used to be physical paper, but that’s beside the point.


You can easily use this for digital documents as well. The only difference in my setup is a tag showing whether the document id maps to a physical document in a binder or not.


I use MayanEDMS personally, and have for the past five or so years. It's complex but does what it says on the tin.

https://www.mayan-edms.com/


Mayan EDMS recently moved a lot of basic documentation behind a subscription paywall.


What I don't really understand is, do people really have that many physical documents that they need to keep track of, that such a system is worth it? E.g. to file my taxes (in Belgium), I think I only ever need a few (maybe even only 1 or 2) digital documents. Or is this more a mentality thing? I know my parents have folders and folders, e.g. my father kept all expense notes from his work even after retirement... I throw everything away once it's handled.


Definitely, Germany strongly believes that a document that hasn't been a physical piece of paper at least once can't be real. That makes for folders upon folders of documents and it's actually worse than back in the 20th century because generating and mailing documents has become way easier and cheaper, so things that would have been a one-page typewritten letter back then now are five ten-page ones full of automatically generated crap. One lengthy illness in the family alone filled hundreds of pages and it can be very hard to know what can be thrown away at which point.


Yeah, I'm also in Germany (although not German) and installed Paperless because of this!

I think more than a few of these projects are started and/or maintained by Germans due to the astonishing number of documents received - e.g., paperless-ng appears to have been done by a German, although neither the original Paperless nor Paperless NGX immediately appear to be.


> Definitely, Germany strongly believes that a document that hasn't been a physical piece of paper at least once can't be real.

I'm sorry to tell you that is an oversimplification, and especially for documenting expenses as a company/freelancer it's kind of worse.

Last time I checked if you want to follow the tax law to the word you're not allowed to change the medium:

If an invoice came as a paper copy (e.g. by snail mail), this paper copy is the original. If you scan it the digital version isn't.

If an invoice came as a digital document (e.g. a PDF by email), this digital document is the original - a printed version of that digital document isn't.

So if a tax inspector asks for "originals" it's technically almost impossible to provide them in the sense of the law. Whether a tax inspector would even care is another question.


It has been perfectly legal (and common) for a decade now to scan documents and destroy the paper original as long as you follow some guidelines. Keyword is "ersetzendes Scannen".

And yes, they care about those rules and that you provide "originals" according to that definition - in particular that you didn't modify digital documents in any way. You can (and should) comply with that and there are service providers to help if you are too small to set that up yourself.


Thanks, today I learned about "ersetzendes Scannen". I just checked and it's been exactly a decade (2013) since it was allowed, which coincidentally is the year when I started working as a freelancer (and I have to care about such rules).

I admit that my last paragraph was kind of hyperbole, but I never heard (at least from other freelancers) of a tax inspector who wasn't happy with either everything printed or everything digital. I guess they really start to care if they suspect something fishy.


Another search keyword is "Revisionssicher". If your storage/software has that, you are good to go.


Just a side note to this and the other replies: You can also keep the original documents and add scans to paperless for indexing, etc. Since I switched to paperless I keep my originals in binders just ordered by the paperless id, so I can retrieve the original when required.


Can't speak for physical documents in general, but personally I really appreciate paperless-ngx for its general document indexing/storage. Being able to scan and OCR physical documents (usually using the camera on my mobile phone) is very nice, but I mainly use it with PDFs that paperless automatically fetches, OCRs (if necessary), and tags from my email inbox, or which I copy into a specific local folder which gets synced with paperless.

Getting all my invoices from last year to prepare taxes is now just a simple query in the paperless UI; the result would be about 95% digital and 5% physical documents, probably. Of course I could do all that old-school using filesystem folders, but having all my documents indexed and searchable in a single place was definitely worth the (small) effort of setting it all up and keeping it running.


I don't understand what you mean with prepare taxes.

I just add all purchases/sales right when they happen in my accounting app and attach the invoice PDF. Then when I have to file taxes, I export the correct numbers.

Are you doing your bookkeeping in Excel or something?


This is just for my personal taxes, no accounting involved. I just get all the relevant stuff together once a year. Of course it's not 10s or 100s of documents, but still enough so it would take me some time to get everything together manually.

Also it was just meant as an example, paperless is generally useful (to me) in situations where I need to access somehow related documents, like traveling and such, or searching my documents for some information. As I said, there are other systems and ways to do this, but for me this is the one that stuck.


> I throw everything away once it's handled

That's the Marie Kondo approach. I feel the same way. I'm not sure that digitizing everything really removes clutter. It removes the physical paper, yes. But not the mental overhead of knowing that you have all those documents.

Some people obviously have a need to retain documents for things like business expenses that will be deducted from income. I don't have any of that. I get my W2s and 1099s and do my taxes. I throw all that in one folder and put it in a box in the closet. That's good enough for my purposes; I see no need to expend the time and mental energy necessary to scan and tag (even automatically) every receipt, utility bill, and other statements I receive. Why? I'll never look at them again.


So here's an example where it came in useful to have back documents:

I recently purchased a house. As part of the process, I needed to apply for a mortgage. The bank wanted a statement from my employer about my income from them, along with my last 2 complete years tax documents.

The bank had an inquiry. My employer had said my salary + bonus was X, but in the first of these two years, my tax documents said my income from my employer that year was 2.5X. The extra 1.5X was due to the employer being bought out and some change of control terms in the RSUs causing immediate payout of what would normally have been paid out over 4 years. Since I kept the documents of the RSU terms and the payslips, I could provide these to the bank to clear the matter up.

Notably, had I not kept my own copy of these documents, I could not have gone back to my employer for new copies. Due to the change of control, they had changed payroll vendors, and had eventually terminated the contract with the old vendor, so I could not have gotten a payslip from 1.5 years ago. Similarly, in the move to the new owner's HR system, the company had lost many of their records of agreements with employees, including contracts etc., so it's not clear they would still have the terms of the RSUs, especially since the change of control payout rendered this a "completed" transaction. And later events made it clear that they did not have, e.g., a copy of my employment contract.

Similarly, if I ever had had a dispute over the terms of those contracts - if I hadn't kept a copy of the contract, and the company definitely hadn't kept theirs, any dispute would have been my word against theirs.


Companies are legally required to keep payroll records for multiple years (depends on where you live, though I doubt most places are less than 3-4). This is ok advice, but these systems don't just work like this. If you didn't have the documentation the bank would likely take your approved tax filings as evidence and move on with their day.

In a real contract dispute your copy of a contract from your documents isn't notably different in the eyes of the court than one from your employer. They're both notarized and if there's a dispute between them there are established processes. Aside from some titles or etc., historical filing ownership is typically relegated to the document originators.


In my experience, no, you don't need this. The few things I keep just go in folders named for the year under my Documents folder, and they are given descriptive filenames like paystub-2022-10-15.pdf, or companyA-w-2.pdf. In the rare cases where I need to go back to those (like for a loan application or doing taxes) it's easy enough to find them.
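
For anyone who wants to script that kind of layout, here is a minimal Python sketch. It assumes a `Documents/<year>/` folder per year and filenames of the form `<kind>-<date>.pdf`, as in the paystub example; the `archive_path` helper and its arguments are made up for illustration and are not from any tool mentioned in this thread.

```python
from datetime import date
from pathlib import Path


def archive_path(root: Path, kind: str, when: date, ext: str = "pdf") -> Path:
    """Return <root>/<year>/<kind>-<YYYY-MM-DD>.<ext> (hypothetical layout)."""
    # date.isoformat() gives YYYY-MM-DD, so files sort chronologically by name.
    return root / str(when.year) / f"{kind}-{when.isoformat()}.{ext}"


if __name__ == "__main__":
    # Illustrative values only.
    print(archive_path(Path("Documents"), "paystub", date(2022, 10, 15)))
```

Whether this is enough probably depends on how many documents you keep; it gives you sorting by date but no full-text search.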


Depends on your tax situation. For my private taxes it's maybe 3 or 4 documents and those from the bank etc have all gone PDF anyway.

But when I was a freelancer I used a document scanning system provided by my bookkeeper. It worked similar to this open source thing, scan to PDF, automatic OCR and classification. Needed it because many invoices still arrived on paper, and receipts for restaurants etc I usually took a picture to upload.


You are right, you do not want to look up documents that old, it is a waste of time...

... unless you are a German and the state asks for your time sheets from three years ago because you've received child support and are requested to prove your working hours.

... unless you happen to have an accident and your insurance is fighting with another insurance over who's going to pay, and they ask you about the incident two years later.

... unless you end up in a contract fight with the postal operator, which can take a year of mailing before being settled.

Some correspondences take years and only add a mailing every few months. You would like to have a thread-like view -- as in an electronic mail. That is the strength of document management systems.


In some countries like Germany, the government still communicates with its citizens by snail mail. Important documents are usually physical there. They are one of the least developed countries in Europe with digitalization, they are far behind.


In the Netherlands, government bodies are regularly pushing everything they can to a digital inbox - which I vastly prefer. My simple, single-employer yearly income tax is all pre-calculated. Further, deductions for mortgage interest, healthcare, studies, etc. are all pre-filled as much as possible. I think you only need to upload documents for complicated situations or audits?

Of course, I still quickly download my year-end bank/salary/mortgage statements and cross-verify the tax department's numbers. The whole process takes at most a few hours.

IME Germany has significantly more hard-copy requirements.


You never need to upload the documents in the Netherlands, their software doesn't have such an option.

But technically you're expected to keep the documents at least until you receive the "definitieve aanslag" and if you're nitpicking I think there is a 7 year term for the tax services to come back on your filed taxes and change things or demand proof.

Practically that doesn't happen if you accepted their pre-filled numbers and they match your employer's. But if you're a freelancer or other non-standard case I would keep digital copies for a few years just to be sure.


> You never need to upload the documents in the Netherlands, their software doesn't have such an option.

Ah, interesting. I just assumed my situation never triggered it.


I guess it’s partly a mentality thing for me. I’ve had numerous cases of sadness that I couldn’t produce a necessary document, and gladness that I was able to pull up something presumed long lost. For me, it’s easier to save everything “just in case”. It all adds up to less than 50GB so it’s not an enormous amount of data to store by current standards.

Seriously, a couple cases of “sorry, I don’t have proof to back up that tax deduction” or “hey, here’s the receipt proving that our TV is still covered by warranty!” make it all worthwhile.


Kinda - at the moment I'm receiving _a lot_ of documents, mostly as PDFs via E-Mail (some the original digital version, some scans of physical copies), but some via post as well.

I've only added documents I've received this year (plus a couple of dozen documents going further back), and I've got ~250 in there, with a total of ~2.5m words (although I think word-count is a fuzzy concept in German).

I've posted a top level comment in more detail, but yeah, it's helpful to me.


I have a small business in USA. For federal business taxes I need 6-7 documents. Then that process creates other documents I need for personal taxes, which also requires 6-8 more documents. So, I'm roughly 20 important documents per year for federal taxes. Nexus in 3 states, adds more. And save them all for 7 years.

The other end of the spectrum in the USA is filing with the 1040-EZ, which is like a 3-4 document process.


In the US, the IRS can audit your taxes going back at least seven years. So, you should keep at least seven years' worth of the documents that you might need to produce during an audit.

And there are some important documents you want to keep longer than that -- birth certificates, wedding certificates, death certificates, documentation around the house you buy, documentation around your car, any personal health documents you keep, etc....

So, you'll want to have electronic copies of all that, and ideally backed up to an offsite location, in case you have a fire in your house. If you at least have electronic copies of those documents, even if you lose the original document in a fire, there's a good chance you can use that information to get new certified copies sent to you by the appropriate city/state government in question.

So, yeah, this can become a thing that needs to be carefully monitored and managed over the decades.


I'm quite paranoid about throwing stuff away so for me it's at least partly a mentality thing. I probably save a lot more than I need but it gives me peace of mind to know that it's all there. There are some things that it is very helpful to have easy access to, like utility bills and bank statements (which I occasionally need for KYC stuff) or ID documents.


It's not just for physical documents. I have payslips which may be useful in the future, but would be really hard to recover when I leave the company. Any invoices which come to my email. Any bank documents which exist in a vaguely named "account updates" email. And many other things which could be possible to find in the future, but are much better in paperless with appropriate tags and OCR.

But yeah, then there are for example the bank account contract updates which come by physical mail only.

> expense notes (...) I throw everything away once it's handled.

Don't know about your location, but I need to keep the tax related documents for 5 years in case of an audit.


It's nice to throw papers away without worrying about it. Or to archive instruction manuals for stuff I own - paperless is the first place I look (its search is nice).


I would be in favor of not scanning them, forgetting about them, then throwing them away when I eventually see them again and deciding I did not miss them.


Seriously. People in this thread are describing some setups that momentarily seem cool in theory, but are almost certainly overkill for personal use.


It depends.

We had this ridiculous money-making scheme from the Federal Government a few years back called "Robodebt", which basically amounted to sending demands to pay back welfare money that they claim was incorrectly claimed ... long after the retention period for the paperwork. Conveniently they declined to show their working for the so-called incorrect claims.

Most people didn't keep their timesheets and records, and then couldn't prove they weren't overpaid, and were supposedly liable for the "debt".

Some people committed suicide over it.

If you had your records, you probably wouldn't have lost any sleep over it, much less your life.


Luckily, running paperless-ngx on my NixOS desktop is trivial. And it was also trivial to make it accessible over an avahi name on my local network. So it was kind of a "why not" sort of thing.


Indeed, the NixOS module works well. I just tried it:

    services.paperless.enable = true;

In contrast to Docker, it composes well with other things on the same system (e.g. the Samba server already set up to receive documents from my scanner), and I get automatic security updates for it.


I started using paperless a few weeks ago and thus far I really like it. However, I have a use case that I am not sure can be solved using paperless: having something akin to an ongoing chat/thread with a correspondent.

If I want something done by a third party, e.g. my bank, I'll send them a mail or set up a meeting with a representative. This creates documents that I want to add to paperless. Over the course of the next weeks additional documents will be created that outline the "progress" of the topic. I want all of the documents for this single issue I'm trying to solve to be kind of linked together, so that I always know the current progress and where it all started.

Has anyone here found a good way to solve this issue, or is there even a built-in function in paperless that I did not see?


How do people usually back up their self-hosted docker services using postgres? I have been using docker-volume-backup [0] and just saving the postgres data directory, but I've found it requires a minute of downtime to back up properly.

[0] https://github.com/offen/docker-volume-backup


pg_dump [0] (or pg_dumpall, linked there) sounds like what you want to use. You could docker exec into the postgres container, then copy the dump from the volume to your backup location on the host.

A bit more contrived than copying the volume but you don't need to shut down the server. There's probably some scripts out there for doing this in a structured way but I usually do it more or less manually/use a bash script.

[0]: https://www.postgresql.org/docs/current/app-pgdump.html
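
A rough sketch of that approach as a Python script instead of bash, in case it helps. It runs `pg_dump` through `docker exec` and gzips the output on the host, so the server stays up. The container name, database user, database name and target directory are placeholders you would have to swap for your own, and the `backup` and `pg_dump_command` helpers are just names I picked.

```python
import gzip
import subprocess
from datetime import datetime
from pathlib import Path


def pg_dump_command(container: str, user: str, db: str) -> list[str]:
    """Build the `docker exec ... pg_dump` argv; all three names are placeholders."""
    return ["docker", "exec", container, "pg_dump", "-U", user, db]


def backup(container: str, user: str, db: str, dest_dir: Path) -> Path:
    """Run pg_dump inside the container and gzip its stdout on the host."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{stamp}.{db}.sql.gz"
    # check=True raises CalledProcessError if docker or pg_dump exits non-zero,
    # so a failed dump never leaves a half-written archive behind.
    result = subprocess.run(
        pg_dump_command(container, user, db), check=True, capture_output=True
    )
    with gzip.open(target, "wb") as fh:
        fh.write(result.stdout)
    return target
```

This holds the whole dump in memory before compressing, which should be fine for small home-server databases but is not something I would use for large ones.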


docker-compose --env-file .env exec -T postgres /usr/bin/pg_dump -U postgres "$db_name" | gzip -9 > "$BACKUP_ROOT/postgres/${NOW}.${db_name}.sql.gz"


Specifically in the case of paperless-ngx, I use their export facility from a cron job. The export is plaintext and contains all the information needed to recreate the postgres db and the learned identifiers. In case of a disk failure (and I've had one with my paperless store), I just reimported the previous day's backup from my offline backup of paperless' export.


I used vackup [1], which has been obsoleted but still works for me. However, you still need to turn off the container temporarily.

[1] https://github.com/BretFisher/docker-vackup



For now I have only backed up some databases with a pg_dump one-liner triggered from a cron job on the docker host (via docker exec or docker run --rm). No idea how this scales for big databases. But for your regular home server <10 GB databases this should just work.


restic container with all volumes mounted to /backup/<volumename> (and . to /backup/self - use named volumes, not binds) in my composefile with scale 0 and a backup.sh that's essentially

docker compose down && docker compose run backup && docker compose up -d

The restore procedure is the same, you restore the composefile through restic on the host and then `docker compose run backup restic restore latest --exclude "/data/self/*" --target /`

I find it's fast enough because restic is incremental, but if you can set this up on a filesystem with snapshots that would be a great option too.

Restic takes a bit of fiddling around too. I mount a prepared ssh config, a known hosts file and a private key.


ZFS snapshots


Snapshots are not a backup.


I would pay a monthly subscription for an app on my phone where I can take a picture of the doc and it will just immediately upload to the cloud without any bullshit. And I can go back and re-sort these document later with an intuitive UI.

What I have to do currently is use one of those document scanner apps on my phone to take a picture. It will give me a reasonably good PDF from that picture, but then I have to press multiple buttons to eventually email the doc to myself. Then I have to go into my mailbox and sort the doc into my Google Drive. Ultimately, it works, but it's also very annoying.


I’m using QuickScan on iOS for this, which has customizable favorite export locations, so that with one press after a scan you can upload to specific folders in Google Drive etc.


On iOS, the built-in Files app lets you scan and store to iCloud directly (“Scan Documents” feature).


At least one of "those document scanner apps on my phone" immediately uploads scan-generated PDF to a user-designated folder in Google Drive: Genius Scan (on Android), one-time payment "Enterprise" version.

Source: happy user for 3+ years. No affiliation.


I tried a lot of them and Scanner Pro is the most user friendly and focused one in my opinion. Supports Workflows to automatically upload to Dropbox etc.


This is very interesting to me.

I'd love it if I could also use my mobile devices to bring up paper docs instantly (mobile phone, tablet, kindle).


It’s not free/OSS, and it’s on the Apple ecosystem, but DEVONthink does a fantastic job of this, and supports storing all of your documents in a WebDAV store which you can host yourself. It uses ABBYY FineReader for OCR, which I have found to provide better results than tensorflow based OCR.


I've been using DEVONthink for just this for a few years, and it's very good at it. However, it's macOS only and has far more features than I need (simple searching, tagging, and organization). I tried paperless a year ago and the search and rendering were far too slow, and many docs just gave obscure errors. Perhaps it's time to give it another shot. I'd love to have something on Linux that could handle my large repository of documents.


Easily possible. Paperless-ngx works great on mobile as well. I have WireGuard on my phone and connect that way, then simply use a mobile browser, no app needed.


Nice!


There is even a nice OSS Swift app in the App Store now. It's v1, but it looks nice and is fast and simple.

https://apps.apple.com/app/id6448698521


I made that! Glad you like it!


I'm delighted to discover that your application supports custom HTTP headers, enabling the bypassing of authentication on my Cloudflare proxied/tunneled instance using service tokens
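
For anyone wondering what that looks like outside the app, here is a hedged Python sketch of the same idea: send the Cloudflare Access service token headers along with the Paperless-ngx API token. The hostname and token values are placeholders, the `build_request` helper is my own name, and the request is only built here, not sent.

```python
import urllib.request


def build_request(base_url: str, api_token: str, cf_id: str, cf_secret: str) -> urllib.request.Request:
    """Build a GET request for the document list behind Cloudflare Access.

    All argument values are placeholders for your own instance and tokens.
    """
    headers = {
        # Paperless-ngx API token authentication.
        "Authorization": f"Token {api_token}",
        # Cloudflare Access service token headers.
        "CF-Access-Client-Id": cf_id,
        "CF-Access-Client-Secret": cf_secret,
        "Accept": "application/json",
    }
    url = f"{base_url.rstrip('/')}/api/documents/"
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request("https://paperless.example.com", "API_TOKEN", "CF_ID", "CF_SECRET")
    # Pass `req` to urllib.request.urlopen() to actually send it.
    print(req.full_url)
```

The Access policy on the Cloudflare side still has to allow the service token, so treat this as a sketch of the client half only.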


I use Paperless with Tailscale.

On my mobile I use Firefox, and for me it works great.

I traveled for a month in Europe this summer and never had any issues.


Shameless plug: I recently released a native app for iOS that connects to Paperless-ngx:

https://apps.apple.com/de/app/swift-paperless/id6448698521


It finally happened to me: the very thing I just started researching and testing out showed up simultaneously at the top of the front page.

I’ve found more good information about Paperless right here in the comments than anywhere else so far.


Is there any modern solution that doesn't tie you to the clunky interface of a single web browser client?

While the folder organization criticism in the article is on point (although you could also use tags that many file systems support, but that's not a reliable system to invest time in, or maybe if it's backed up by some app that can restore all the tagging it could be), the range of native tools for viewing/editing various document formats as well as your ability to customize your workflows is unparalleled.


You can set up a simple structure for the files to be organized (I do a directory structure with owner/year/month/...) and then back that up separately as well, so all your PDFs are still with you and somewhat organized even without the tags.


paperwork is local and a standalone app.


I see multi-user support mentioned; how flexible is the user and access control system? E.g., is there an integration with other auth providers (LDAP or OIDC would work for me), and is there group support where I can centrally manage who can access what? I.e., could I use this in a company with different departments as a central document management system?


If anyone is looking for a fully-commercial version, we use something like this — it is called Hubdoc and it is free with any Xero subscription.

I really really appreciate the work that went into paperless, but for us the business risk of self-hosting this is far too high because if we lose our docs we lose our tax proof.


Personally, I consider self-hosted services to potentially have a longer lifespan than relying on a commercial product. Unless a company has a robust escrow contingency, then there's always a possibility that they can go bust and you lose your data and service. Probably more of a concern is vendor lock-in, especially when they decide to change their service levels.

Paperless-ngx is a great example of how open source can work around the issue of lack of maintenance (quite a big problem with commercial services if they don't get much profit from them) as IIRC Paperless-NG was a fork from Paperless and now Paperless-NGX is a fork from Paperless-NG. Even if it becomes completely obsolete, I can choose to keep an instance running for as many decades as I wish.


This is true. I have yet to see an accounting web service go under this hard, though. I am thinking that Xero (and Hubdoc) will be around for decades at least, and probably outlast my own company.


Is Paperless suitable for business use, say, for a smallish company with 25 employees and 1000 customers? I think in my EU country such systems need to fulfill certain requirements like versioning/tracking of changes.


I love seeing more Angular projects in the wild like this.

Angular is an under-appreciated, solid, no-gimmicks framework. Been using it for years rather than React and it seems the pendulum is swinging back toward "this side" now.


My Paperless-ngx listening on a network share + Brother ADS-2800W are key to staying sane. My only complaint is that it is resource hungry. If I allocate less than 2G RAM to the paperless VM it does not work as it should.


I have this exact setup but with the ADS-4300N. I'm new to it and it's still a novelty.

My only complaint is I've had the odd letter get scanned upside down and there's no way to rotate pages in Paperless-ngx.


My main gripe is that you can't use an existing folder structure.


Using paperless for some months now and I really like it. Nice to see the project got some new contributors and frequent releases.


I wonder if people know about Google's Stacks app? I don't know if it's as powerful as Paperless-Ngx, but it lets you organize docs pretty easily and some of it is automatic. I have "stacks" for insurance, id cards, receipts, medical records, etc. Whenever I get paper mail, I snap a photo and immediately toss it. I can then organize it in the Stacks app and easily be able to pull it up later. It's a pretty useful, easy solution IMO.


If it starts with 'google' then at best it's something you try out and then, if you like it, try to find that functionality in an app made by someone else. Google will kill this app just when you get fully invested. All Google apps are traps and foot-guns, especially the ones that work great.


Definitely scary as it's under their incubator area120


Probably right


I can't get it on my, ahem, Google Pixel device running Android 13:

>This app isn't available for your device because it was made for an older version of Android.


Also:

“Stack is only available on Android in the U.S. You can install it through the Google Play store.”


That's weird. I'm using it right now on the Pixel 7 Pro running Android 13.


I usually don't jump on the "Google cancels everything" train, but do keep in mind that Stacks is a project from their Area 120 incubator, which saw heavy layoffs [1]. It's not on the remaining list, so it may have already been cancelled internally and currently in the process of being shut down.

[1] https://techcrunch.com/2023/01/25/google-spares-three-area-1...


Until they cancel it.


True :(((


Does it have annotation capabilities? Quickly adding a checkmark or signature would make managing documents much easier.


It looks like it does, though I've never wanted to use them. I just had a quick look at my instance and you can add text notes alongside the document and also there's some basic editing draw/text tools to add to the document itself.


One thing I wish it had was built-in file encryption. Lock/unlock with user auth.


I tinkered with this a few weeks ago. Pleasantly surprised with its capabilities.


Is it using local storage or cloud?


Yes.

It's a self-hosted application, so it depends on your setup. I suppose it's arguably using local storage on the server you run it on which is often going to be a cloud hosted machine.


I've spun up a copy of this recently (within the last month) and it's already proving helpful.

I've purchased a new-build home in Germany, and I'm currently in the stage between "purchased" and "ready for move-in," and if you've ever purchased a Neubau in Germany you know how much paperwork is involved - I get so many documents over email, many of which are scanned (to preserve the wet signature and stamps), and some of which I need to copy into a translator, that this is incredibly helpful. It checks my email, grabs PDFs, straightens them, OCRs them, adds a correspondent, tags them, and makes them available through a web UI.

I also appreciate the full-text search (for all that it might struggle if I had tens of thousands of documents) as I've had to go and try to find particular documents where the name of the document I've received might be a synonym for what the other person is asking for, but the word they're asking for is at least used in the text.

I'll also set it up to pull documents from my NAS as well, where the scanner writes to, as I also receive a number of documents via mail (that I also occasionally need to translate or copy/paste from).

There are also some limitations that annoy me:

* I really wish the email filters were more flexible - right now, I have to have three filters, one for PDFs, one for JPEGs, and one for PNGs, so I wish I could just set a regex for the attachment name. This one annoys me enough that if I ever have time I'd look at doing a PR for it (assuming the filtering is done locally and not on the IMAP server).

* I'd also like to be able to set up rules to tag documents based on the email domain (e.g., house-builders get tagged as "house-builder, house") without having to manage a gigantic explosion of rules. In theory the ML should handle that, but... I'm mistrustful of ML. We'll see in a few months if I was too hasty in my judgement or not.

* I'd like to retain slightly more information about the correspondent, like both name and email address (there's no consistency about who has their From line as "Name <email>" and who's just "email", even within the same company), both for de-duplication of correspondents and domain-based searching.

* I wish I could share documents more easily than downloading a document and re-uploading it to my email client (or mounting the folders and trying to find the right document, but that has its own set of problems). This is one of those problems that's really easy to state, but potentially quite difficult to actually implement - could a web application add a PDF to the clipboard in such a way that GMail, say, would understand what was happening and add it as an attachment when pasted?
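
To make the first three wishes concrete, here is a small Python sketch of what I have in mind. To be clear, this is not how Paperless-ngx mail rules work today; the regex, the `DOMAIN_TAGS` table, the `house-builder.example` domain and the helper names are all invented for illustration.

```python
import re
from email.utils import parseaddr

# One pattern instead of three separate PDF/JPEG/PNG filters (hypothetical).
ATTACHMENT_RE = re.compile(r".+\.(pdf|jpe?g|png)", re.IGNORECASE)

# Hypothetical domain-to-tags table; the domain is a placeholder.
DOMAIN_TAGS = {"house-builder.example": ["house-builder", "house"]}


def wanted(filename: str) -> bool:
    """True if the attachment name matches the single regex filter."""
    return ATTACHMENT_RE.fullmatch(filename) is not None


def correspondent(from_header: str) -> tuple[str, str, str]:
    """Split a From line into (name, address, domain).

    Handles both "Name <email>" and bare "email" forms; name is "" if absent.
    """
    name, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    return name, address, domain


def tags_for(from_header: str) -> list[str]:
    """Look up tags by sender domain; an empty list means no rule matched."""
    return DOMAIN_TAGS.get(correspondent(from_header)[2], [])


if __name__ == "__main__":
    print(wanted("Invoice.PDF"), tags_for("Jane Doe <jane@house-builder.example>"))
```

Keeping the domain as its own field is what would make both de-duplication and domain-based search straightforward.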

Overall though, I'm pretty happy with it, and finding it useful so quickly was somewhat surprising.


I've created "ODI" (Overengineered Documents Indexer) and presented it recently.

https://clis-everywhere.k8s.best/16

My approach is scanning the documents with airscan1, running OCR on them with a custom OCR server (using the MLKit by Google on an Android phone, which does completely offline OCR scanning) and indexing everything in OpenSearch. I've then created a backend + frontend to see the documents and do full-text search with that.
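
To give a feel for the "OCR text in, full-text search out" part, here is a toy in-memory inverted index in Python. This is explicitly not OpenSearch and not ODI's actual code, just a stand-in to show the idea; the `TinyIndex` class and the sample documents are made up.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: token -> set of document ids (stand-in for a real search engine)."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, ocr_text: str) -> None:
        """Index the OCR text of one document, lowercased and split on word characters."""
        for token in re.findall(r"\w+", ocr_text.lower()):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> set[str]:
        """Return ids of documents containing every query token (AND semantics)."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        hits = [self._postings.get(t, set()) for t in tokens]
        return set.intersection(*hits)


if __name__ == "__main__":
    idx = TinyIndex()
    idx.add("scan-001", "Mietvertrag Wohnung Zürich")
    idx.add("scan-002", "Rechnung Strom")
    print(idx.search("mietvertrag wohnung"))
```

A real engine adds ranking, stemming and fuzzy matching on top, which matters a lot once OCR errors creep in.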

Everything is (going to be) open source with a permissive license.


Your project is very nice. Is the next step browser automation to pay the bills as they arrive, once approved via a push notification? :)

Funny situation, because it's two extremes: we in Switzerland have way too much paper and need to find a solution, especially for tax filing. At the opposite end, people are complaining they'd have nothing to feed it with.


I actually care less about bills and more about properly indexing my documents.

I'm still on my way to have a proper document management solution for my needs, so that I can retrieve any correspondence of the last X years.

Finding an apartment contract or some other important mail is worth more than just being able to pay the bills "automatically" - since you sort of can already do that with eBill or the QR Bills.

OCRing everything is the first step towards this goal :)



