Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

With the caveat that you probably shouldn't listen to me (or anyone else on here) since you are the only one who knows how much pain each choice will be ...

I think that given that you are not really dealing with structured data - you've said that different sites have different structures, and I assume even with processing, you may not be able to generate identical metadata structures from each entry.

I think I would go for one column of XML, plus maybe another column that holds a parsed data structure that represents the result of your processing (basically a cache holding the post-processed version of each site). Hopefully that could be re-evaluated by whatever language (Python?) you are using for your application. That way you don't have to do the full parsing each time you want to examine the entry, but you have access to something that can quickly give you whatever metadata is associated with it, but which doesn't tie you to the rigid structure of a table based database.

Once you know what you are really doing with the data, then you could add additional metadata columns that are more rigid, and which can be queried directly in SQL as you identify patterns that are useful for performance.



i am using the feedparser library in python https://github.com/kurtmckee/feedparser/ which basically takes an RSS url and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example look at how https://beincrypto.com/feed/ has a long description (containing actual HTML) inside but this website https://www.coindesk.com/arc/outboundfeeds/rss/ completely cuts the description out. I have about 50 such websites and they all have slight variations. So you are saying that in addition to storing parsed data (title, summary, content, author, pubdate, link, guid) that I currently store, I should also add an xml column and store the raw <item></item> from each url till I get a good hang of how each site differs?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: