
Regarding cross-shard consistency, there's a doc here[1] on the options. In short, you can send the writes to all shards best-effort, disallow cross-shard transactions (read and write), use two-phase commit to perform a distributed transaction, or use their own hybrid approach, which they set as the default.

[1] https://vitess.io/docs/20.0/user-guides/configuration-advanc...
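
If it helps, here's a rough sketch of what switching between those modes looks like from a client, assuming vtgate speaks the MySQL protocol and honours the transaction_mode session variable described in that doc (mode names are from the doc; exact spellings and the host/schema names here are made up and may vary by Vitess version):

    // Sketch only: choosing a Vitess atomicity mode per session via vtgate.
    import mysql from "mysql2/promise";

    const conn = await mysql.createConnection({
      host: "vtgate.example.internal", // hypothetical vtgate address
      port: 3306,
      user: "app",
      database: "commerce",
    });

    // 'single' -> disallow cross-shard transactions
    // 'multi'  -> best-effort cross-shard writes
    // 'twopc'  -> two-phase commit for atomic cross-shard transactions
    await conn.query("SET transaction_mode = 'twopc'");

    await conn.beginTransaction();
    await conn.query("UPDATE orders SET status = 'paid' WHERE order_id = 42");
    await conn.query("UPDATE payments SET captured = 1 WHERE order_id = 42");
    await conn.commit();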


Is there info anywhere on the structure of the semi-lattice they are using for their CRDT?

Is the map based on a multi-value register or a last-writer-wins register?



Thank you.

From the doc:

> Automerge uses a combination of LWW (last writer wins) and multi-value register. By default, if you read from doc.foo you will get the LWW semantics, but you can also see the conflicts by calling Automerge.getConflicts(doc, 'foo') which has multi-value semantics.

> Note that "last writer wins" here is based on the internal ID of the opeartion [sic], not a wall clock time. The internal ID is a unique operation ID that is the combination of a counter and the actorId that generated it. Conflicts are ordered based on the counter first (using the actorId only to break ties when operations have the same counter value).

Seems like they use LWW with Lamport clocks to order operations and a unique ID for each client as a tie-breaker.
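
Roughly, the ordering rule looks something like this (a toy sketch for illustration, not Automerge's actual implementation; all names are made up):

    // LWW register ordered by (counter, actorId): higher counter wins,
    // actorId only breaks ties. The counter behaves like a Lamport clock.
    type OpId = { counter: number; actorId: string };

    function wins(a: OpId, b: OpId): boolean {
      if (a.counter !== b.counter) return a.counter > b.counter;
      return a.actorId > b.actorId; // deterministic tie-break
    }

    class LwwRegister<T> {
      private value?: T;
      private last?: OpId;
      private counter = 0;

      constructor(private readonly actorId: string) {}

      // Local write: bump the counter and record the op so it can be sent to peers.
      set(value: T): { id: OpId; value: T } {
        this.counter += 1;
        const op = { id: { counter: this.counter, actorId: this.actorId }, value };
        this.apply(op);
        return op;
      }

      // Merge a local or remote op; advancing the counter is the Lamport rule.
      apply(op: { id: OpId; value: T }): void {
        this.counter = Math.max(this.counter, op.id.counter);
        if (this.last === undefined || wins(op.id, this.last)) {
          this.last = op.id;
          this.value = op.value;
        }
      }

      get(): T | undefined {
        return this.value;
      }
    }

Applying the same set of ops in any order converges to the same value, which is what makes it usable as a CRDT; the multi-value "conflicts" view is just all ops whose IDs are maximal for that field.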


I agree with both of you. I think it’s sort of a hybrid and a spectrum of how much you do of each first.

When you test part of the circuit with the scope, you are using prior knowledge to determine which tool to use and where to test. You don’t just take measurements blindly. You could test a totally different part of the system because there might be some crazy coupling but you don’t. In this system it seems like taking the measurement is really cheap and a quick analysis about what to measure is likely to give relevant results.

In a different system it could be that measurements are expensive and it’s easy to measure something irrelevant. So there it’s worth doing more analysis before measurements.

I think both cases fight what I’ve heard called intellectual laziness. It’s sometimes hard to make yourself be intellectually honest and do the proper unbiased analysis and measuring for RCA. It’s also really easy to sit around and conjecture compared to taking the time to measure. It’s really easy for your brain to say “oh it’s always caused by this thing cuz it’s junk” and move on because you want to be done with it. Is this really the cause? Could something else be causing it? Would you investigate this more if other people’s lives depended on it?

I learned about this model of viewing RCA from people who work on safety critical systems. It takes a lot of energy and time to be thorough and your brain will use shortcuts and confirmation bias. I ask myself if I’m being lazy because I want a certain answer. Can I be more thorough? Is there a measurement I know will be annoying so I’m avoiding it?


Which part of the index are you putting in the buffer pool here? The postings list, the doc store or the terms dict?

Is it being cached for future queries or are you just talking about putting it in memory to perform the computation for a query?


I'm primarily looking at document lists and possibly the keyword-documents mapping.

Caching will likely be tuned fairly closely to the operation itself; since it's not a general-purpose DBMS, I can fairly accurately predict which pages are likely to be useful to cache and when read-ahead is likely to be fruitful, based on the operation being performed.

For keyword-document mappings, some LRU cache scheme is likely a good fit. When reading a list of documents, readahead is good (and I can inform the pool of how far to read ahead). When intersecting document lists, I can also generally predict when pages are likely to be re-read or needed in the future based on the position in the tree.
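
Something like this, roughly (a toy sketch of an LRU page cache with a caller-supplied readahead hint; all names are made up, not the actual implementation):

    // Pages are refreshed on hit by re-inserting; Map preserves insertion order,
    // so the first key is the least recently used one.
    type PageId = number;

    class PageCache {
      private pages = new Map<PageId, Uint8Array>();

      constructor(
        private readonly capacity: number,
        private readonly readPage: (id: PageId) => Uint8Array, // backing-store read
      ) {}

      get(id: PageId): Uint8Array {
        const hit = this.pages.get(id);
        if (hit !== undefined) {
          this.pages.delete(id);   // refresh LRU position
          this.pages.set(id, hit);
          return hit;
        }
        const page = this.readPage(id);
        this.insert(id, page);
        return page;
      }

      // Caller knows it will scan forward, e.g. while walking a document list.
      readAhead(start: PageId, count: number): void {
        for (let i = 0; i < count; i++) {
          const id = start + i;
          if (!this.pages.has(id)) this.insert(id, this.readPage(id));
        }
      }

      private insert(id: PageId, page: Uint8Array): void {
        if (this.pages.size >= this.capacity) {
          const lru = this.pages.keys().next().value as PageId; // evict oldest
          this.pages.delete(lru);
        }
        this.pages.set(id, page);
      }
    }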

Will definitely need a fair bit of tuning but overall the problem is greatly simplified by revolving around very specific types of access patterns.


Ah interesting. Is your keyword-document map (aka term dict) too big to keep in memory permanently? My understanding is that at Google they just keep it in memory on every replica.

Edit: I should specify they shard the corpus by document so there isn't a replica with the entire term dict on it.


Could plausibly fit in RAM; it's only ~100 GB in total. We'll see, I'll probably keep it mmapped at first to see what happens. It isn't the target of very many queries (relatively speaking) at any rate, so either way is probably fine.


>It isn't the target of very many queries (relatively speaking)

Wow why is that? Do you use a vector index primarily?


No, I mean that for every query there's the mapping of keywords to trees of documents, and there are dozens if not hundreds of queries against the latter in order to intersect document lists.
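
Roughly like this (simplified two-pointer intersection; the real thing uses skip pointers/galloping, and lookupDocs here is a hypothetical name for fetching one keyword's sorted document-ID list):

    // Each lookupDocs() call is one of the dozens/hundreds of queries mentioned above.
    function intersect(a: number[], b: number[]): number[] {
      const out: number[] = [];
      let i = 0, j = 0;
      while (i < a.length && j < b.length) {
        if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
        else if (a[i] < b[j]) i++;
        else j++;
      }
      return out;
    }

    function search(keywords: string[], lookupDocs: (kw: string) => number[]): number[] {
      if (keywords.length === 0) return [];
      // Intersect the shortest list first to keep intermediate results small.
      const lists = keywords.map(lookupDocs).sort((x, y) => x.length - y.length);
      return lists.reduce(intersect);
    }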


Ah I see. I thought by query you meant "user search query". I'm guessing by query here you mean "read".


It’s more about whether or not the company has taxable profits for that year (importantly, these are not the same as real profits). I would read this article to understand more about how being forced to amortize tax deductions for expenses affects a business’s taxes.

https://news.ycombinator.com/item?id=44180533

more info here too

https://news.ycombinator.com/item?id=44226145


It also classifies software development as R&D, which, together with immediate expensing for R&D, undoes the Section 174 changes, as far as I understand.

“For purposes of this section, any amount paid or incurred in connection with the development of any software shall be treated as a research or experimental expenditure“

Page 303 of bill here https://www.congress.gov/119/bills/hr1/BILLS-119hr1eas.pdf

Original article about Section 174 tax code causing layoffs

https://news.ycombinator.com/item?id=44180533

Post from @dang with more info about Section 174

https://news.ycombinator.com/item?id=44226145


>It also classifies software development as R&D

The TCJA (passed in 2017) already did that (effective 2022). So it sounds like this new bill is keeping that, but changing the deduction rules back to what they were before 2022.

See this previous discussion of the TCJA:

> all "software development" is now an R&E expense.

https://news.ycombinator.com/item?id=34627712

(AIUI, "R&D" (research and development) and "R&E" (research and experimentation) are synonyms.)


Page 301

> there shall allowed as a deduction any domestic research and experimental expenditures which are paid or incurred by the taxpayer in the current taxable year

AFAIK, there was no domestic vs. foreign R&D distinction in section 174 before.


There was a domestic vs foreign distinction in the TCJA, passed in 2017, which took effect in 2022:

> 174 to require taxpayers to amortize specified R&E expenditures ratably over a five-year period for domestic expenditures and a 15-year period for specified R&E expenditures attributed to foreign research

https://www.journalofaccountancy.com/issues/2022/nov/amortiz...


That's nuts


Relevant quote from the article: “Three business tax deductions would be made permanent. That includes the ability to use depreciation and amortization as the basis for interest expensing, the research and development write-off”

From the bill itself, on page 795:

“SOFTWARE DEVELOPMENT- For purposes of this section, any amount paid or incurred in connection with the development of any software shall be treated as a research or experimental expenditure.”

https://www.congress.gov/119/bills/hr1/BILLS-119hr1eh.pdf

Relevant articles

https://news.ycombinator.com/item?id=44180533

@dang if there’s a more relevant article feel free to change the link


Signed and called my representative and senators.

I ask simply: "If I have $1m of revenue and $1m of expenses that are entirely software dev salaries, what do you think my profit is for that year? How much should I be taxed on that?"
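
For reference, under the post-2022 Section 174 rules (domestic software dev costs amortized over 5 years, with a half-year convention in year one), the answer works out roughly to:

    Revenue:                              $1,000,000
    Software dev salaries (capitalized):  $1,000,000
    Year-1 amortization deduction (10%):    $100,000
    Taxable "profit":                       $900,000

So the business owes tax on $900k of paper profit in a year where its cash profit was zero.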

https://www.house.gov/representatives/find-your-representati...

https://www.senate.gov/senators/senators-contact.htm


Some OTs do, some don't. OTs with the TP2 property do not require a central authority to order edits, I believe.

In my experience if you are persisting your edits or document state, you have something that creates an ordering anyways. That thing is commonly an OLTP database. OLTPs are optimized for this kind of write-heavy workload and there's a lot of existing work on how to optimize them further.

But now even S3 has PUT-IF, so you could use that to create an ordering. https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...
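
A rough sketch of what that could look like, assuming the AWS SDK's conditional-write support (IfNoneMatch) and a sequence-numbered log key scheme I'm making up here for illustration:

    // Claim the next slot in a total order by creating a key only if it
    // doesn't exist yet (S3 conditional write / "PUT-IF").
    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({});

    async function appendEdit(bucket: string, seq: number, edit: string): Promise<boolean> {
      try {
        await s3.send(new PutObjectCommand({
          Bucket: bucket,
          Key: `log/${String(seq).padStart(12, "0")}`,
          Body: edit,
          IfNoneMatch: "*", // fail if another writer already took this slot
        }));
        return true; // we own position `seq` in the ordering
      } catch (err: any) {
        if (err?.$metadata?.httpStatusCode === 412) return false; // lost the race; retry at seq + 1
        throw err;
      }
    }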


What I’ve always been curious about is whether you can help the S3 query optimizer* use specialized optimizations. For example, if you indicate the data is immutable[1], does the lack of a write path allow further optimization under the hood? Replicas could in theory serve requests without coordination.

*I’m using “query optimizer” rather broadly here. I know S3 isn’t a DBMS.

[1] https://aws.amazon.com/blogs/storage/protecting-data-with-am...

