Since we are on the topic, and since your article explicitly says "Normalize your data unless you have a good reason not to", I had to ask. I am trying to build a news aggregator that pulls from many websites, each with a slightly different format. Even though I use feedparser in Python, some feeds still put HTML text inside the content, while others break it out into a separate media XML attribute and keep only a basic textual summary in a summary attribute.

Do you think it makes more sense to store a single RSS item as one column in Postgres, or should it be stored after parsing? I can see upsides and downsides to both approaches. Store it as XML and you keep the ability to change your processing logic down the line for every stored item, but you lose the flexibility of querying metadata and you also have to parse it on the fly every single time. Store it in multiple columns after processing and it may require different logic for different websites, plus changing your overall Python processing logic requires a lot of thinking about how it might affect each source. What do you guys recommend?
With the caveat that you probably shouldn't listen to me (or anyone else on here) since you are the only one who knows how much pain each choice will be ...
I think the key point is that you are not really dealing with structured data: you've said that different sites have different structures, and I assume that even with processing, you may not be able to generate identical metadata structures from every entry.
I think I would go for one column of XML, plus maybe another column that holds a parsed data structure representing the result of your processing (basically a cache holding the post-processed version of each entry). Hopefully that cache can be re-read by whatever language (Python?) you are using for your application. That way you don't have to do the full parsing every time you want to examine an entry: you have something that can quickly give you whatever metadata is associated with it, without tying you to the rigid structure of a table-based database.
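A minimal sketch of that two-column idea, using the stdlib `sqlite3` module as a stand-in for Postgres (in Postgres the cache column would likely be `jsonb` and `raw_xml` would be `text` or `xml`; the table and column names here are made up):

```python
import json
import sqlite3

# sqlite3 stands in for Postgres here; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feed_items (
        guid     TEXT PRIMARY KEY,
        raw_xml  TEXT NOT NULL,   -- untouched <item> markup, reparseable later
        parsed   TEXT             -- cached post-processed structure (JSON)
    )
""")

raw = "<item><title>Hello</title><link>https://example.com/1</link></item>"
cache = {"title": "Hello", "link": "https://example.com/1"}

conn.execute(
    "INSERT INTO feed_items (guid, raw_xml, parsed) VALUES (?, ?, ?)",
    ("https://example.com/1", raw, json.dumps(cache)),
)

# Reads hit the cheap cached structure; the raw XML is still there
# if the processing logic changes later.
row = conn.execute(
    "SELECT parsed FROM feed_items WHERE guid = ?",
    ("https://example.com/1",),
).fetchone()
print(json.loads(row[0])["title"])  # -> Hello
```

If the processing logic changes, you re-run it over `raw_xml` and overwrite `parsed`; nothing is lost.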
Once you know what you are really doing with the data, you can add additional, more rigid metadata columns that can be queried directly in SQL, as you identify patterns that matter for performance.
I am using the feedparser library in Python (https://github.com/kurtmckee/feedparser/), which basically takes an RSS URL and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example, https://beincrypto.com/feed/ has a long description (containing actual HTML) inside each item, while https://www.coindesk.com/arc/outboundfeeds/rss/ cuts the description out completely. I have about 50 such websites and they all have slight variations. So you are saying that in addition to the parsed data I currently store (title, summary, content, author, pubdate, link, guid), I should also add an XML column and store the raw <item></item> from each feed until I get a good handle on how each site differs?
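Capturing the raw `<item>` next to the parsed fields can be done with just the stdlib; a sketch (the feed snippet is made up, and real feeds add namespaced elements like `content:encoded` and `media:*` that this ignores):

```python
import xml.etree.ElementTree as ET

# Made-up two-item feed; note the second item has no <description>,
# mimicking the coindesk-style variation described above.
FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <link>https://example.com/a</link>
      <description>&lt;p&gt;Full HTML body&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/b</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)
items = []
for item in root.iter("item"):
    items.append({
        # Raw markup kept verbatim so it can be re-parsed later
        "raw_xml": ET.tostring(item, encoding="unicode"),
        # Best-effort parsed fields; findtext returns None when absent
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "description": item.findtext("description"),
    })

print(items[1]["description"])  # -> None (this item has no description)
```

The `None` in the parsed fields is exactly the per-site variation you'd otherwise lose track of; the `raw_xml` column lets you revisit it once you know how each site differs.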
Just shooting without knowing your real needs - take this with a grain of salt.
Store some parsed representation that makes it easier for you to work with (probably normalized), and keep an archive of the raw data somewhere: another column, another table, or even an S3 bucket. Don't worry too much about schema changes, but make sure you never lose the original data. Schema migrations have their pitfalls, but the schema should be the representation that works for you _at the moment_; otherwise it'll slow you down.
If you’re going to be parsing it anyways, and there’s the possibility of actually doing something with that parsed info beyond just reprinting it, then storing the post-parse results is probably better. Especially if you’re only going to need a reduced subset of the information and storage matters.
If you’re just reprinting — you’re parsing only for the sake of rendering logic — then storing the parse-result is probably just extra unnecessary work.
Also, if storage isn’t a concern, then I like using the database as intermediate storage for the pipeline. Grab the RSS, stuff it in the DB as-is. Take it out of the DB, parse, store the parse results. Etc. You’ll have to do this anyway if you’re going to end up with a processing queue (and you can use PG as a simple queue via SELECT … FOR UPDATE), but it’s also nice to have if your pipeline is eventually going to change, since you’re able to reprocess old items.
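The SELECT … FOR UPDATE queue pattern might look roughly like this. A sketch only, not run against a live database: the table and column names are hypothetical, `conn` is assumed to be a psycopg2 connection, and SKIP LOCKED (Postgres 9.5+) is added so several workers can claim rows without blocking each other:

```python
# Hypothetical feed_items table with fetched_at / processed_at columns.
CLAIM_SQL = """
    SELECT id, raw_xml
    FROM feed_items
    WHERE processed_at IS NULL
    ORDER BY fetched_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE feed_items SET processed_at = now() WHERE id = %s"

def process_one(conn, parse):
    """Claim one unprocessed row, run the parse step, mark it done.

    `conn` is assumed to be a psycopg2 connection; the transaction
    holds the row lock until commit, so a crashed worker's row is
    simply released and re-claimed later.
    """
    with conn:                       # commit on success, rollback on error
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
            if row is None:
                return False         # queue is empty
            item_id, raw_xml = row
            parse(item_id, raw_xml)  # your processing step
            cur.execute(DONE_SQL, (item_id,))
    return True
```

Running `process_one` in a loop across a few worker processes gives you a crude but serviceable queue with no extra infrastructure.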
> Store it in multiple columns after processing it and it may require different logic for different websites + changing your overall python processing logic requires a lot of thinking on how it might affect some source.
Don’t you need to deal with this problem regardless? Ideally you’ll find a common subset you actually care about, and your app logic will look like:
website -> website handler -> extract data subset -> add to common structure -> render common structure
Even if you don’t use a database at all, your app code will need to figure out the data normalization.