
3 posts tagged with "Firebase"


· 6 min read
ShinaBR2

Firestore features

Important factors (not an exhaustive list):

  • Query performance depends on the size of the result set, NOT on the size of the collection. Querying a collection that has millions of records should perform the same as querying a collection that has hundreds of records if the result set is the same.
  • There is a limit on the size of a document (1 MiB).
  • Firestore charges for document read, write, and delete operations.

When choosing between a subcollection and a top-level collection, consider these things:

  • Security Rules
  • How to query: do I usually query among ALL items => top-level collection, or do I usually query items WITHIN a single document => subcollection?

Note: this is opinionated, NOT a strict rule.

Let's review an example of a music site.

Feature

As an end user, I want to:

  • Listen to all my configured audios by default
  • Filter audio by feeling

Understand the data

An audio contains basic information like name, src, created date, etc. A feeling is simple: it just contains a name and a value.

An audio can have multiple feelings, and multiple audios can have the same feeling; it's a many-to-many relationship. For a many-to-many relationship, we will have 4 collections at the conceptual level: audios, feelings, audiosGroupedByFeeling, feelingsGroupedByAudio.

One important factor: thousands of audios are likely to have the same feeling, but one audio usually has just a few feelings.

Design and decide

Obviously, we will have 2 top-level collections: audios and feelings.

Based on my design:

  • I rarely need to filter ALL feelings by an audio. The only time I need this is in the audio detail view, where I may want to see which feelings that audio has.
  • I usually need to filter ALL audios by a feeling.

Important: arrays are evil, so they are ignored by default!

Filtering feelings by audio

Remember, we rarely use this query!

We have some ways:

  • Use current top-level feelings collection, add a map inside each feeling document (key is audio id, value is boolean) for filtering
  • Create a new top-level collection (for example, feelingsGroupedByAudio)
  • Use a map field inside each audio document (key is feeling id, value is boolean)
  • Use a map field inside each audio document (key is feeling id, value is the feeling document or a partial copy of it)
  • Use subcollection inside each audio document

The first approach will make the feeling document size really big. Adding new data that increases the document size just for a rarely used query is not a good trade-off.

The second approach is subjective; it never feels great to me to see another collection in the database that exists just for a group-by mechanism. Another point is that we need an intermediate collection for this approach. The path would be something like /feelingsGroupedByAudio/${audioId}/feelingIntermediateCollection/${feelingId}.

With the third approach, we need additional queries to get all the feeling data once we have the audio id, which is not good in terms of performance.

The fourth approach is fine: the document size gets bigger, but not by much. We can store only the essential information that needs to be displayed on the client side, without additional queries.

The last approach is fine too. Its pro is that the size of the audio document stays minimal, but we will need an additional query to get the feelings when we need them (in the "audio detail view").

So we can consider between the fourth and fifth approaches. I chose the fourth because an audio usually has just a few feelings, so the extra map barely increases the document size, and it lets the audio detail view show the feelings without an additional query.
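Here is a minimal sketch of what an audio document could look like with the fourth approach; the field names (src, createdAt, feelings) and values are illustrative assumptions, not the exact schema:

// An audio document storing a map of feelingId -> partial feeling data,
// enough to render the audio detail view without extra queries.
const audioDoc = {
  name: 'Rainy night',
  src: 'https://example.com/rainy-night.mp3',
  createdAt: 1672531200000,
  feelings: {
    feeling1: { name: 'calm' },
    feeling2: { name: 'melancholy' },
  },
};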

Filtering audios by feeling

Remember, we often use this query!

We have some ways:

  • Use the current audios top-level collection, add a map inside each audio document (key is feeling id, value is boolean) for filtering
  • Create a new top-level collection (for example, audiosGroupedByFeeling)
  • Use a map field inside each feeling document (key is audio id, value is boolean)
  • Use a map field inside each feeling document (key is audio id, value is the audio document or a partial copy of it)
  • Use subcollection inside each feeling document

The first approach is fine: the size of each audio document increases a bit, but we can query easily. A small problem is that we need a field named something like feelingMap inside each audio document, which is somewhat ugly to me. And we need to use the where function to get the data.
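For illustration, the first approach's query could look like this sketch, assuming the Firestore v9 modular SDK and an initialized db instance (feelingMap is a map of feeling id to true):

import { collection, getDocs, query, where } from 'firebase/firestore';
import { db } from './firebase'; // assumed initialized Firestore instance

// First approach: filter the top-level audios collection by a key in the
// feelingMap field (a map of feelingId -> true).
const filterAudiosByFeeling = async (feelingId: string) => {
  const q = query(
    collection(db, 'audios'),
    where(`feelingMap.${feelingId}`, '==', true)
  );
  const snapshot = await getDocs(q);
  return snapshot.docs.map((d) => ({ id: d.id, ...d.data() }));
};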

The second approach has the same problem as the second approach in the previous section: we would have an additional collection just for a group-by mechanism, plus an intermediate collection.

Using a map field in each feeling document will make the document much bigger, since thousands of audios can share one feeling, and an audio document carries much more data than a feeling document. That makes the fourth approach the worst option here.

The fifth approach is basically a group-by mechanism. The good points are that both the feeling and audio document sizes stay minimal, there is no additional ugly feelingMap field, and the query is still straightforward.

I chose the fifth approach in this case.
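A minimal sketch of the query for the fifth approach, again assuming the v9 modular SDK and an initialized db instance:

import { collection, getDocs } from 'firebase/firestore';
import { db } from './firebase'; // assumed initialized Firestore instance

// Fifth approach: each feeling document owns an "audios" subcollection,
// so filtering audios by feeling is a plain read of that subcollection.
const getAudiosByFeeling = async (feelingId: string) => {
  const snapshot = await getDocs(collection(db, 'feelings', feelingId, 'audios'));
  return snapshot.docs.map((d) => ({ id: d.id, ...d.data() }));
};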

Client side vs server side query

From the previous section, I chose:

  • Each audio document has a map field (key is feeling id, value is the feeling document or a partial copy of it)
  • Create a new subcollection inside each feeling document, for example path /feelings/${feelingId}/audios

Here I weighed client-side queries against server-side queries. Some questions popped up:

  • Why do I need to call the server again to query, since I already have the feeling information in each audio document? I could just filter on the client side instead, which would save Firestore read costs.
  • For the "default state" when no feeling is selected, I fetch data from the audios collection, but when I choose a feeling, it reads from another collection (/feelings/${feelingId}/audios). Is that stupid?

Here are some criteria to consider:

  • Security rules. In this case, no problem. But in many cases, we may have different policies for a top-level collection and a subcollection.
  • Pagination. It's a common pattern: you have already loaded the first 20 audios from the top-level audios collection, and then you want to filter audios by feeling, which can lead to unexpected behavior when done on the client side (see the pagination sketch after this list).
  • Filtering on the client side requires having enough information from the beginning. In this case, we cannot filter audios by feeling if the audio document itself does not contain the feeling information.
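To make the pagination point concrete, here is a minimal cursor-based pagination sketch with the v9 modular SDK (db and the createdAt field are assumptions). The client only ever holds the pages it has fetched, so filtering locally would only search the 20 loaded audios:

import { collection, getDocs, limit, orderBy, query, startAfter } from 'firebase/firestore';
import { db } from './firebase'; // assumed initialized Firestore instance

// Load the first page of 20 audios, then the next page using a cursor.
const loadTwoPages = async () => {
  const firstPage = await getDocs(
    query(collection(db, 'audios'), orderBy('createdAt'), limit(20))
  );
  const lastVisible = firstPage.docs[firstPage.docs.length - 1];
  const nextPage = await getDocs(
    query(collection(db, 'audios'), orderBy('createdAt'), startAfter(lastVisible), limit(20))
  );
  return [...firstPage.docs, ...nextPage.docs];
};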

Conclusions

Nothing is perfect, and no solution is ideal for all cases, but at least we have some rules to follow:

  • Let the view and the frequency of queries determine the data model
  • Keep the document size minimum
  • Query performance in Firestore depends on the result set, so there is no need to create a top-level collection just for a group-by mechanism.

· 5 min read
ShinaBR2

Problem

I am a fan of serverless solutions, including Firebase Cloud Functions, but to this day it still does not natively support monorepos and pnpm. This made for a very frustrating development experience. After a few hours of researching, trying, failing, and repeating the cycle, I at least figured out a hack to solve this problem. See the problem here: https://github.com/firebase/firebase-tools/issues/653

Some references that I have read:

Motivation

Thanks to the community, I hope this part makes more sense for future readers so they can choose the right approach for their situation.

The problem I want to solve is deploying Firebase Cloud Functions in a CI environment, since we set up CI only once and the CI server then handles things automatically for us.

Here are some important parts to make it clearer how things work.

The folder structure should look like this:

root
|- apps
|  |- api
|- packages
|  |- core
|- firebase.json
|- pnpm-workspace.yaml

The apps/api/package.json should look like this:

{
  "name": "api",
  "main": "dist/index.js",
  "dependencies": {
    "firebase-functions": "^4.1.1",
    "core": "workspace:*"
  }
}

Explanation of apps/api/package.json:

  • The name field is a MUST since it defines how module resolution works. You may be familiar with pnpm commands, for example pnpm install -D --filter api; the api there is the value of the name field.
  • The main field describes how Node.js resolves your code. Imagine reading the code base: Node.js won't know where to start if you don't tell it. Setting main to dist/index.js means "Hey Node.js, look for the file dist/index.js at the same level as this package.json file and run it." A small sketch of the entry point follows this list.
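For context, a minimal entry point could look like the sketch below; the hello function and the greet import are illustrative assumptions, not part of the actual project:

// apps/api/src/index.ts, compiled to dist/index.js, which the "main"
// field points the Functions runtime at.
import * as functions from 'firebase-functions';
import { greet } from 'core'; // hypothetical export from the workspace package

export const hello = functions.https.onRequest((req, res) => {
  res.send(greet('world'));
});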

Now let's go to the tricky part!

Hacky solution

Solution: https://github.com/Madvinking/pnpm-isolate-workspace

The idea is to build all the dependencies into one single workspace, with some tweaks in the package.json file, since the firebase deploy command does not support the pnpm workspace:* protocol. I tested many times in both my local environment and the CI server: as long as the package.json file contains the workspace:* protocol, the deploy will fail even if the code is already built.

Steps:

  • Build the Cloud Functions locally; the output will be in apps/api/dist
  • Change the firebase.json source field to "source": "apps/api/_isolated_" and remove the predeploy hook. The predeploy hook defines what command runs BEFORE deploying the Cloud Functions (using the firebase deploy command). I remove it because the code base was already built in the first step.
  • Run pnpx pnpm-isolate-workspace api at the root folder; it will create a folder named _isolated_.
  • Copy the build folder into the newly created folder: cp -r apps/api/dist apps/api/_isolated_
  • In apps/api/_isolated_, run mv package.json package-dev.json
  • In apps/api/_isolated_, run mv package-prod.json package.json
  • In apps/api/_isolated_, run sed -i 's/"core": "workspace:\*"/"core": "file:workspaces\/packages\/core"/g' package.json, thanks to this comment
  • Finally, run firebase deploy --only functions at the root folder

Questions?

  • Why do I need to rename the two package.json files in the apps/api/_isolated_ folder? The main reason is to remove the devDependencies and reduce manual work in the next step.
    • The package-prod.json does NOT contain the devDependencies, and we don't need devDependencies for the deployment. Beyond that, the devDependencies may contain packages from my other workspaces.
    • I don't yet know how to make the firebase deploy command use the package-prod.json file instead of package.json.
  • What exactly does the sed command do? Why do I need it?
    • This is the most tricky part. The sed command reads the file and replaces some strings with others, which is very low-level, risky, and not easy for everyone. It only makes sense to do this on the CI server, since the change stays isolated from your code base: you never want to see these changes in your git repository.
  • Why not install firebase-tools as a dependency and then run something like pnpm exec firebase deploy on the CI server?
    • That makes sense if you run the firebase deploy command from your local machine. On the CI server, note that I use w9jds/firebase-action instead.
  • What does w9jds/firebase-action actually do, and WHY do I need it?
    • The most important part is the "authentication process". To deploy Firebase Cloud Functions, "you" need to have the right permissions. For example, on your local machine you need to run firebase login before doing anything, and then grant access. The same thing must happen on the CI server: we need to grant the right permissions to the Google Service Account through the GCP_SA_KEY key. In the CI environment there is no browser to let you sign in; that's the point. So instead of manually running pnpm exec firebase deploy on the CI server, the above w9jds/firebase-action handles things for you.

Other notes

There are some problems with this approach, so please don't think of it as a perfect solution, and make sure you fully understand it, because you will likely have to touch it again in the future, unfortunately.

· 7 min read
ShinaBR2

There are many kinds of NoSQL databases; this article mainly focuses on Firebase's products, the "Firebase Realtime Database" and "Firestore". However, the mindset and theory will be similar for other NoSQL databases.

A little reminder: this article is not a comprehensive guide to the NoSQL world. From here on, whenever I say "NoSQL", I am talking about the above databases; other kinds of NoSQL databases may vary.

Inspired

Must check out:

Mindset

First and foremost, mindset is the key to everything.

The rule of thumb when working with NoSQL is denormalization: the process of duplicating your data into multiple places in your database. If this feels wrong when you come from the MySQL world, that's okay, but it's the first mindset shift you need to make. Otherwise, you cannot go further. Not because you're bad; it's just that if you cannot use a tool the way it's meant to be used, you shouldn't use it.

After we have denormalized our data, the next task is keeping it consistent. To do that, whenever we update the data, we need to update it in multiple places.

"Arrays are evil": an old point, but still valuable.

NoSQL is based on the premise that reads happen much more often than writes.

We should structure data the way our application needs to consume it.

Never assume that what you get from NoSQL is what you expect, especially in the world of mobile apps, since end users may not update to the latest version.

A reminder: no matter what kind of database you are using, the relationships among your data stay the same. Don't use your brain to memorize how you structured the database; understand the relationships in your data instead.

Structure data

This is my personal thinking; it may not be suitable in some cases, and any feedback will be appreciated.

A real-world example: we usually have lots of data living in "1 - 1", "1 - n", and "n - n" relationships, no matter how we store it in the database. The principles of relational databases are still valuable here, regarding primary keys, foreign keys, and junction tables.

For example, we have some entities A, B, and C with the following relationship:

  • A and B: "1 - 1" relationship
  • One "A entity" may have n entities of C, which means a "1 - n" relationship.
  • B and C: "n - n" relationship.

Before considering the relationships, we will create some collections at the top level, A_Collection, B_Collection, and C_Collection, which store all entities of each kind; it's straightforward.

Question: why do we need to have these collections regardless of the relationship?

The answer: because we can get an entity from its primary key. We can also use security rules for these collections, for example, only the admin can read/write all entities, while other users can read/write only their own data.

"1 - 1" relationship

We can store inside each A entity either just the "b_primary_key" or the entire B entity.

Question: what should we store in each A entity?

The answer: it depends on how we read the data. If we want to get the B entity alongside the A entity most of the time, store the entire B entity; otherwise, just store the primary key.

"1 - n" relationship

We will have a "list of primary keys of C entities" inside each A entity to get the references whenever we need them, but DO NOT store it as an array. We can choose to store either only the primary keys of the C entities (with a boolean value like true) or the entire C entities. The reasoning is the same as for the "1 - 1" relationship above. A sketch of both follows.
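A minimal sketch of an A entity following these rules; all names and values are illustrative:

const aEntity = {
  id: 'a1',
  // "1 - 1": either just the primary key...
  b_primary_key: 'b1',
  // ...or the entire B entity, if we read B alongside A most of the time:
  // b: { id: 'b1', name: 'some B entity' },

  // "1 - n": a map of C primary keys (NOT an array), each value just true:
  c_keys: { c1: true, c2: true, c3: true },
};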

"n - n" relationship

For the "n - n" relationship between B and C, this Stackoverflow question is a great answer for it, here is the summary:

  • First approach: create a new collection like B_and_C_Collection, which acts like a junction table in the MySQL world
  • Second approach: we have 4 collections B_Collection, C_Collection, B_to_C_Collection, and C_to_B_Collection.

Question: for the second approach, when should we look in B_to_C_Collection, and when in C_to_B_Collection?

The answer: it depends on what "input" you have; think of them as "groupBy" collections.
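A sketch of the lookup for the second approach, assuming the Firestore v9 modular SDK, an initialized db instance, and that each B_to_C_Collection document is a map of C primary keys to true:

import { doc, getDoc } from 'firebase/firestore';
import { db } from './firebase'; // assumed initialized Firestore instance

// Input: a B primary key. Output: the primary keys of all related C
// entities, read from the "groupBy" document.
const getCKeysForB = async (bId: string): Promise<string[]> => {
  const snapshot = await getDoc(doc(db, 'B_to_C_Collection', bId));
  return Object.keys(snapshot.data() ?? {});
};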

Write the data

At this point, your data lives in multiple places in the database. In order to keep the data consistent regardless of how we read it, we need to write to all of those places at the same time. The "transaction" concept is the key here: batch-write all the data at once and end with either success or failure, making sure NO PARTIAL data is written.

The question here (maybe) is how we can remember where to batch-write the data. Let me remind you again of my words above.

"No matter what kind of your database you are using, the relation among your data still be the same. Don't use your brain to remember how you should structure the database, let's understand the relationship of your data instead".

From my point of view, there are two kinds of batch-write operations. In the first, we don't care about the current data. In the second, we depend on the latest, up-to-date data.

Let's call the first kind simply "batched-write" and the second "transaction".

"Batched-write" is just simply answers these questions:

  • When does the process start and end?
  • What should we do during the process?
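A minimal batched-write sketch with the Firestore v9 modular SDK, reusing the A/B naming from above; db and the document shapes are assumptions:

import { doc, writeBatch } from 'firebase/firestore';
import { db } from './firebase'; // assumed initialized Firestore instance

// Update a B entity and the denormalized copy stored inside an A entity.
// The batch commits as a whole, so no partial data is written.
const renameB = async (aId: string, bId: string, newName: string) => {
  const batch = writeBatch(db);
  batch.update(doc(db, 'B_Collection', bId), { name: newName });
  batch.update(doc(db, 'A_Collection', aId), { 'b.name': newName });
  await batch.commit();
};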

"Transaction" is a bit more complex, here are the steps.

  • Read the latest data to make sure we are working with the up-to-date data
  • Do the logic
  • Tell the database what are we going to change

Behind the scenes, the database double-checks the places where we read and write data; if nothing has changed since the moment we started the transaction, it goes ahead and commits all the changes. Otherwise, it goes back to step one. The process repeats until it either succeeds or fails due to too many retries.

This strategy is known as "optimistic concurrency control": optimize for the happy case (which happens most of the time), and if the worst case happens, just retry the whole process.
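A minimal transaction sketch in the same spirit, with a hypothetical counters collection:

import { doc, runTransaction } from 'firebase/firestore';
import { db } from './firebase'; // assumed initialized Firestore instance

// Read the latest data, do the logic, then tell the database what to
// change. Firestore retries the whole function if the documents we read
// were modified before the commit (optimistic concurrency control).
const incrementCounter = async (counterId: string) => {
  await runTransaction(db, async (tx) => {
    const counterRef = doc(db, 'counters', counterId);
    const snapshot = await tx.get(counterRef);
    const current = (snapshot.data()?.value as number | undefined) ?? 0;
    tx.update(counterRef, { value: current + 1 });
  });
};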

Cloud Functions

Set aside the fact that Cloud Functions is not strictly part of the database world; there is one pattern I usually use to keep all the data consistent: the listener concept of Cloud Functions, which you may be familiar with from working with Firebase's NoSQL databases. The idea is to "listen" to changes on some specific data, then update all the other denormalized data in other places.
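A minimal listener sketch, assuming the first-generation firebase-functions API and the audio/feeling data model from the first post; the collection paths and field names are illustrative assumptions:

import * as admin from 'firebase-admin';
import * as functions from 'firebase-functions';

admin.initializeApp();

// When a feeling's name changes, sync the denormalized copies stored
// inside the audio documents that still carry the old name.
export const onFeelingUpdate = functions.firestore
  .document('feelings/{feelingId}')
  .onUpdate(async (change, context) => {
    const { feelingId } = context.params;
    const before = change.before.data();
    const after = change.after.data();
    if (before.name === after.name) return;

    const audios = await admin
      .firestore()
      .collection('audios')
      .where(`feelings.${feelingId}.name`, '==', before.name)
      .get();

    const batch = admin.firestore().batch();
    audios.docs.forEach((d) =>
      batch.update(d.ref, { [`feelings.${feelingId}.name`]: after.name })
    );
    await batch.commit();
  });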

There is no perfect solution; you can weigh some trade-offs, mainly around user experience (there may be more that I cannot remember now):

  • Does it make sense to let the client side update multiple places in the database? If not, let the client side update one place, then let Cloud Functions sync up the rest.
  • Does the client need the data to be reflected immediately, with offline support? If not, let Cloud Functions do the job.

Performance

The key to query performance is: don't ask for or return more data than you need. The details differ slightly between the "Firebase Realtime Database" and "Firestore", but there are things you need to keep in mind.

For the Firebase Realtime Database only: the number of children DOES matter for query performance. Looking for 10 items in a collection that has 10M items is slower than in a collection that has only 100 items. See this Stackoverflow answer.

One good thing about Firestore is that query speed depends on how many entities you actually fetch, NOT on the total number of entities. In other words, if your collection has 60M items, a query is still as fast as on a collection with only 60 items.

Also, check out these Stackoverflow questions: