GFS FAQ

Q: Why is atomic record append at-least-once, rather than exactly
once?

Section 3.1, Step 7, says that if a write fails at one of the
secondaries, the client re-tries the write. That will cause the data
to be appended more than once at the non-failed replicas. A different
design could probably detect duplicate client requests despite
arbitrary failures (e.g. a primary failure between the original
request and the client’s retry). You’ll implement such a design in Lab
3, at considerable expense in complexity and performance.

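As a rough illustration (not GFS's actual client library), the sketch
below shows why a plain retry loop gives at-least-once semantics: the
client cannot tell whether a failed attempt already reached some
replicas, so retrying may append the same record again. appendOnce is a
hypothetical stand-in for one attempt at the Section 3.1 write protocol.

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
    )

    // appendOnce stands in for one attempt at the Section 3.1 protocol; it
    // fails randomly to simulate a failure at a secondary. Even when it
    // returns an error, some replicas may already hold the record.
    func appendOnce(record []byte) (int64, error) {
        if rand.Intn(2) == 0 {
            return 0, errors.New("write failed at a secondary")
        }
        return rand.Int63n(1 << 20), nil
    }

    // recordAppend retries until the primary reports success, so the same
    // record can appear more than once at the non-failed replicas.
    func recordAppend(record []byte) int64 {
        for {
            off, err := appendOnce(record)
            if err == nil {
                return off
            }
            fmt.Println("retrying after:", err)
        }
    }

    func main() {
        fmt.Println("appended at offset", recordAppend([]byte("hello")))
    }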

Q: How does an application know what sections of a chunk consist of
padding and duplicate records?

A: To detect padding, applications can put a predictable magic number
at the start of a valid record, or include a checksum that will likely
only be valid if the record is valid. The application can detect
duplicates by including unique IDs in records. Then, if it reads a
record that has the same ID as an earlier record, it knows that they
are duplicates of each other. GFS provides a library for applications
that handles these cases.

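The paper does not spell out a record format, so here is a hedged
sketch of what such an application-side library might do: each record
starts with a magic number, carries a unique ID, and is protected by a
CRC32 checksum, so a reader can recognize padding and corrupt
fragments. The field layout and names are made up for illustration.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    const magic uint32 = 0xFEEDFACE // hypothetical "a record starts here" marker

    // encodeRecord prefixes the data with magic, a unique ID, a length,
    // and a CRC32 of the data.
    func encodeRecord(id uint64, data []byte) []byte {
        buf := new(bytes.Buffer)
        binary.Write(buf, binary.LittleEndian, magic)
        binary.Write(buf, binary.LittleEndian, id)
        binary.Write(buf, binary.LittleEndian, uint32(len(data)))
        binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(data))
        buf.Write(data)
        return buf.Bytes()
    }

    // decodeRecord returns (id, data, ok); ok is false for padding or
    // corrupted bytes.
    func decodeRecord(b []byte) (uint64, []byte, bool) {
        if len(b) < 20 || binary.LittleEndian.Uint32(b) != magic {
            return 0, nil, false
        }
        id := binary.LittleEndian.Uint64(b[4:])
        n := binary.LittleEndian.Uint32(b[12:])
        sum := binary.LittleEndian.Uint32(b[16:])
        if int(n) > len(b)-20 {
            return 0, nil, false
        }
        data := b[20 : 20+n]
        if crc32.ChecksumIEEE(data) != sum {
            return 0, nil, false
        }
        return id, data, true
    }

    func main() {
        rec := encodeRecord(42, []byte("crawled URL"))
        id, data, ok := decodeRecord(rec)
        fmt.Println(id, string(data), ok)

        _, _, ok = decodeRecord(make([]byte, 64)) // zero padding is rejected
        fmt.Println("padding valid?", ok)
    }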

Q: How can clients find their data given that atomic record append
writes it at an unpredictable offset in the file?

A: Append (and GFS in general) is mostly intended for applications
that sequentially read entire files. Such applications will scan the
file looking for valid records (see the previous question), so they
don’t need to know the record locations in advance. For example, the
file might contain the set of link URLs encountered by a set of
concurrent web crawlers. The file offset of any given URL doesn’t
matter much; readers just want to be able to read the entire set of
URLs.

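A hedged sketch of the reading side: a scanner that has already
recovered valid records (for example with the format from the previous
question) simply processes them in file order and drops duplicates by
unique ID, so it never needs to know any offsets in advance. The types
below are hypothetical.

    package main

    import "fmt"

    type record struct {
        id  uint64
        url string
    }

    func main() {
        // Records as a sequential scanner might recover them from a crawler
        // output file; ID 7 appears twice because of an at-least-once retry.
        scanned := []record{{7, "http://a"}, {9, "http://b"}, {7, "http://a"}}

        seen := make(map[uint64]bool)
        for _, r := range scanned {
            if seen[r.id] {
                continue // duplicate produced by a retried append
            }
            seen[r.id] = true
            fmt.Println("url:", r.url)
        }
    }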

Q: What’s a checksum?

A: A checksum algorithm takes a block of bytes as input and returns a
single number that’s a function of all the input bytes. For example, a
simple checksum might be the sum of all the bytes in the input (mod
some big number). GFS stores the checksum of each chunk as well as the
chunk. When a chunkserver writes a chunk on its disk, it first
computes the checksum of the new chunk, and saves the checksum on disk
as well as the chunk. When a chunkserver reads a chunk from disk, it
also reads the previously-saved checksum, re-computes a checksum from
the chunk read from disk, and checks that the two checksums match. If
the data was corrupted by the disk, the checksums won’t match, and the
chunkserver will know to return an error. Separately, some GFS
applications stored their own checksums, over application-defined
records, inside GFS files, to distinguish between correct records and
padding. CRC32 is an example of a checksum algorithm.

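A small illustration using Go's standard hash/crc32 package: compute
the checksum when writing, recompute and compare when reading, and
treat a mismatch as corruption. (GFS actually keeps a checksum per
64 KB block of each chunk; this shows only the mechanism, not the real
on-disk layout.)

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    func main() {
        chunk := []byte("some chunk contents")
        stored := crc32.ChecksumIEEE(chunk) // saved on disk alongside the chunk

        // Later, after reading the chunk back from disk:
        if crc32.ChecksumIEEE(chunk) != stored {
            fmt.Println("corruption detected: return an error to the client")
        } else {
            fmt.Println("checksums match:", stored)
        }

        chunk[0] ^= 0xFF // simulate the disk corrupting one byte
        fmt.Println("after corruption, match?", crc32.ChecksumIEEE(chunk) == stored)
    }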

Q: The paper mentions reference counts — what are they?

A: They are part of the implementation of copy-on-write for snapshots.
When GFS creates a snapshot, it doesn’t copy the chunks, but instead
increases the reference counter of each chunk. This makes creating a
snapshot inexpensive. If a client writes a chunk and the master
notices the reference count is greater than one, the master first
makes a copy so that the client can update the copy (instead of the
chunk that is part of the snapshot). You can view this as delaying the
copy until it is absolutely necessary. The hope is that not all chunks
will be modified and one can avoid making some copies.

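A hedged sketch (hypothetical bookkeeping, not the master's real data
structures) of copy-on-write with reference counts: a snapshot only
increments counts, and the copy is deferred until a write hits a chunk
whose count is greater than one.

    package main

    import "fmt"

    type master struct {
        refs     map[string]int // chunk handle -> reference count
        nextCopy int
    }

    // snapshot is cheap: it copies no chunk data, it only bumps counts.
    func (m *master) snapshot(chunks []string) {
        for _, h := range chunks {
            m.refs[h]++
        }
    }

    // write returns the chunk handle the client should actually write to.
    func (m *master) write(h string) string {
        if m.refs[h] <= 1 {
            return h // nothing else references it; write in place
        }
        // Deferred copy: the snapshot keeps h, the live file gets a fresh chunk.
        m.refs[h]--
        m.nextCopy++
        newH := fmt.Sprintf("%s-copy%d", h, m.nextCopy)
        m.refs[newH] = 1
        return newH
    }

    func main() {
        m := &master{refs: map[string]int{"chunkA": 1, "chunkB": 1}}
        m.snapshot([]string{"chunkA", "chunkB"}) // counts become 2, no copying
        fmt.Println(m.write("chunkA"))           // copied now: "chunkA-copy1"
        fmt.Println(m.write("chunkA-copy1"))     // count is 1: written in place
        // chunkB is never written, so it is never copied.
    }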

Q: If an application uses the standard POSIX file APIs, would it need
to be modified in order to use GFS?

A: Yes, but GFS isn’t intended for existing applications. It is
designed for newly-written applications, such as MapReduce programs.

Q: How does GFS determine the location of the nearest replica?

A: The paper hints that GFS does this based on the IP addresses of the
servers storing the available replicas. In 2003, Google must have
assigned IP addresses in such a way that if two IP addresses are close
to each other in IP address space, then they are also close together
in the machine room.

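The paper only hints at the mechanism, so the following is purely
speculative: score "closeness" by how many leading bits two IPv4
addresses share, which approximates physical distance only if addresses
were assigned to mirror the machine-room layout.

    package main

    import (
        "fmt"
        "math/bits"
        "net"
    )

    func ipToUint32(s string) uint32 {
        ip := net.ParseIP(s).To4()
        return uint32(ip[0])<<24 | uint32(ip[1])<<16 | uint32(ip[2])<<8 | uint32(ip[3])
    }

    // commonPrefixLen counts the leading bits shared by two addresses.
    func commonPrefixLen(a, b string) int {
        return bits.LeadingZeros32(ipToUint32(a) ^ ipToUint32(b))
    }

    func nearestReplica(client string, replicas []string) string {
        best, bestLen := replicas[0], -1
        for _, r := range replicas {
            if l := commonPrefixLen(client, r); l > bestLen {
                best, bestLen = r, l
            }
        }
        return best
    }

    func main() {
        replicas := []string{"10.1.7.20", "10.3.2.9", "10.1.7.200"}
        fmt.Println(nearestReplica("10.1.7.33", replicas)) // picks 10.1.7.20
    }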

Q: What’s a lease?

A: For GFS, a lease is a period of time in which a particular
chunkserver is allowed to be the primary for a particular chunk.
Leases are a way to avoid having the primary have to repeatedly ask
the master if it is still primary — it knows it can act as primary
for the next minute (or whatever the lease interval is) without
talking to the master again.

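A minimal sketch of the chunkserver's side of a lease (hypothetical
types, not GFS's real RPCs): once the master grants primaryship for,
say, 60 seconds, the server can act as primary without contacting the
master again, but it must refuse mutations after the lease expires.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    type chunkServer struct {
        leaseExpiry time.Time // set when the master grants or extends the lease
    }

    func (cs *chunkServer) grantLease(d time.Duration) {
        cs.leaseExpiry = time.Now().Add(d)
    }

    func (cs *chunkServer) applyMutation(data []byte) error {
        if time.Now().After(cs.leaseExpiry) {
            return errors.New("not primary: lease expired, ask the master again")
        }
        // ... order the mutation and forward it to the secondaries ...
        return nil
    }

    func main() {
        cs := &chunkServer{}
        cs.grantLease(60 * time.Second)
        fmt.Println(cs.applyMutation([]byte("x"))) // <nil>: still primary
        cs.leaseExpiry = time.Now().Add(-time.Second)
        fmt.Println(cs.applyMutation([]byte("x"))) // refused: lease expired
    }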

Q: Suppose S1 is the primary for a chunk, and the network between the
master and S1 fails. The master will notice and designate some other
server as primary, say S2. Since S1 didn’t actually fail, are there
now two primaries for the same chunk?

A: That would be a disaster, since both primaries might apply
different updates to the same chunk. Luckily GFS’s lease mechanism
prevents this scenario. The master granted S1 a 60-second lease to be
primary. S1 knows to stop being primary before its lease expires. The
master won’t grant a lease to S2 until after the previous lease to S1
expires. So S2 won’t start acting as primary until after S1 stops.

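A companion sketch of the master's side (again with hypothetical
names): the master will not grant S2 a lease for the chunk until S1's
60-second lease has expired, even if it cannot reach S1, which is what
rules out two simultaneous primaries.

    package main

    import (
        "fmt"
        "time"
    )

    type lease struct {
        server string
        expiry time.Time
    }

    type master struct {
        leases map[string]lease // chunk handle -> current lease
    }

    func (m *master) grant(chunk, server string) bool {
        if cur, ok := m.leases[chunk]; ok && time.Now().Before(cur.expiry) {
            // S1 may be unreachable, but it could still be acting as primary
            // until cur.expiry, so S2 must wait.
            return false
        }
        m.leases[chunk] = lease{server: server, expiry: time.Now().Add(60 * time.Second)}
        return true
    }

    func main() {
        m := &master{leases: map[string]lease{}}
        fmt.Println(m.grant("chunk1", "S1")) // true
        fmt.Println(m.grant("chunk1", "S2")) // false until S1's lease runs out
    }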

Q: 64 megabytes sounds awkwardly large for the chunk size!

A: The 64 MB chunk size is the unit of book-keeping in the master, and
the granularity at which files are sharded over chunkservers. Clients
could issue smaller reads and writes — they were not forced to deal
in whole 64 MB chunks. The point of using such a big chunk size is to
reduce the size of the meta-data tables in the master, and to avoid
limiting clients that want to do huge transfers to reduce overhead. On
the other hand, files less than 64 MB in size do not get much
parallelism.

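A back-of-envelope calculation of the metadata argument. The paper says
the master keeps less than 64 bytes of metadata per chunk; the 1 PB
total and the 1 MB alternative chunk size below are illustrative
assumptions.

    package main

    import "fmt"

    func main() {
        const (
            totalData     = 1 << 50 // assume 1 PB of file data
            bytesPerChunk = 64      // assumed master metadata per chunk
        )
        for _, chunkSize := range []int64{64 << 20, 1 << 20} { // 64 MB vs 1 MB
            chunks := totalData / chunkSize
            fmt.Printf("chunk size %4d MB: %10d chunks, ~%d MB of master metadata\n",
                chunkSize>>20, chunks, chunks*bytesPerChunk>>20)
        }
    }

With 64 MB chunks the table fits comfortably in RAM (about 1 GB for a
petabyte of data under these assumptions); with 1 MB chunks it would be
64 times larger.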

Q: Does Google still use GFS?

A: Rumor has it that GFS has been replaced by something called
Colossus, with the same overall goals, but improvements in master
performance and fault-tolerance. In addition, many applications within
Google have switched to more database-like storage systems such as
BigTable and Spanner. However, much of the GFS design lives on in
HDFS, the storage system for the Hadoop open-source MapReduce.

Q: How acceptable is it that GFS trades correctness for performance
and simplicity?

A: This is a recurring theme in distributed systems. Strong consistency
usually requires protocols that are complex and require chit-chat
between machines (as we will see in the next few lectures). By
exploiting ways that specific application classes can tolerate relaxed
consistency, one can design systems that have good performance and
sufficient consistency. For example, GFS optimizes for MapReduce
applications, which need high read performance for large files and are
OK with having holes in files, records showing up several times, and
inconsistent reads. On the other hand, GFS would not be good for
storing account balances at a bank.

Q: What if the master fails?

A: There are replica masters with a full copy of the master state; the
paper’s design requires some outside entity (a human?) to decide to
switch to one of the replicas after a master failure (Section 5.1.3).
We will see later how to build replicated services with automatic
cut-over to a backup, using Raft.

Q: Why 3 replicas?

A: Perhaps this was the line of reasoning: two replicas are not enough
because, after one fails, there may not be enough time to re-replicate
before the remaining replica fails; three makes that scenario much
less likely. With 1000s of disks, low-probability events like multiple
replicas failing in short order occur uncomfortably often. Here is a
study of disk reliability from that era:
https://research.google.com/archive/disk_failures.pdf. You need to
factor in the time it takes to make new copies of all the chunks that
were stored on a failed disk; and perhaps also the frequency of power,
server, network, and software failures. The cost of disks (and
associated power, air conditioning, and rent), and the value of the
data being protected, are also relevant.

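A hedged back-of-envelope calculation (the failure rate and repair time
below are assumptions, not numbers from the paper) of why going from
two replicas to three helps so much: after one replica fails, the chunk
is lost only if every remaining replica also fails before
re-replication completes.

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        const (
            annualFailRate = 0.05 // assumed ~5% annual disk failure rate
            repairHours    = 1.0  // assumed time to re-replicate after a failure
        )
        // Probability that one particular replica fails during one repair window.
        p := annualFailRate * repairHours / (365 * 24)
        for _, replicas := range []int{2, 3} {
            // After the first failure, the chunk is lost only if the other
            // replicas-1 copies also fail before re-replication finishes.
            loss := math.Pow(p, float64(replicas-1))
            fmt.Printf("%d replicas: loss probability per failure event ~%.1e\n", replicas, loss)
        }
    }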

Q: Did having a single master turn out to be a good idea?

A: That idea simplified initial deployment but was not so great in the
long run. This article (GFS: Evolution on Fast Forward,
https://queue.acm.org/detail.cfm?id=1594206) says that as the years
went by and GFS use grew, a few things went wrong. The number of files
grew enough that it wasn’t reasonable to store all files’ metadata in
the RAM of a single master. The number of clients grew enough that a
single master didn’t have enough CPU power to serve them. The fact
that switching from a failed master to one of its backups required
human intervention made recovery slow. Apparently Google’s replacement
for GFS, Colossus, splits the master over multiple servers, and has
more automated master failure recovery.

Q: What is internal fragmentation? Why does lazy allocation help?

A: Internal fragmentation is the space wasted when a system uses an
allocation unit larger than needed for the requested allocation. If
GFS allocated disk space in 64MB units, then a one-byte chunk would
waste almost 64MB of disk. GFS avoids this problem by allocating disk
space lazily. Every chunk is a Linux file, and Linux file systems use
block sizes of a few tens of kilobytes; so when an application creates
a 1-byte GFS file, the file’s chunk consumes only one Linux disk
block, not 64 MB.

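A small Linux-specific demonstration of the lazy-allocation point:
writing one byte to a file allocates only a file-system block or two,
not 64 MB. The path and use of stat here are just for illustration.

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        path := "/tmp/tiny-chunk"
        f, err := os.Create(path)
        if err != nil {
            panic(err)
        }
        f.Write([]byte{1}) // a 1-byte GFS file => a 1-byte Linux file backing its chunk
        f.Close()

        var st syscall.Stat_t
        if err := syscall.Stat(path, &st); err != nil {
            panic(err)
        }
        // st.Blocks is in 512-byte units; expect a few KB at most, not 64 MB.
        fmt.Printf("bytes actually allocated on disk: %d\n", st.Blocks*512)
        os.Remove(path)
    }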

Q: What benefit does GFS obtain from the weakness of its consistency?

A: It’s easier to think about the additional work GFS would have to do
to achieve stronger consistency.

The primary should not let secondaries apply a write unless all
secondaries will be able to do it. This likely requires two rounds of
communication — one to ask all secondaries if they are alive and are
able to promise to do the write if asked, and (if all answer yes) a
second round to tell the secondaries to commit the write.

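A minimal sketch of those two rounds (hypothetical interfaces, nothing
like GFS's real RPCs): the primary tells the secondaries to commit a
write only after every one of them has promised it can apply it.

    package main

    import "fmt"

    type secondary struct {
        name  string
        alive bool
    }

    func (s *secondary) prepare(data []byte) bool { return s.alive } // "can you promise to apply this?"
    func (s *secondary) commit(data []byte)       { fmt.Println(s.name, "committed") }

    func writeStrongly(secondaries []*secondary, data []byte) bool {
        // Round 1: collect promises from every secondary.
        for _, s := range secondaries {
            if !s.prepare(data) {
                return false // abort; nobody commits, so replicas stay identical
            }
        }
        // Round 2: everyone promised, so tell them all to commit.
        for _, s := range secondaries {
            s.commit(data)
        }
        return true
    }

    func main() {
        ok := writeStrongly([]*secondary{{"s1", true}, {"s2", true}}, []byte("x"))
        fmt.Println("write committed:", ok)
        ok = writeStrongly([]*secondary{{"s1", true}, {"s2", false}}, []byte("y"))
        fmt.Println("write committed:", ok)
    }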

If the primary dies, some secondaries may have missed the last few
messages the primary sent. Before resuming operation, a new primary
should ensure that all the secondaries have seen exactly the same
sequence of messages.

Since clients re-send requests if they suspect something has gone
wrong, primaries would need to filter out operations that have already
been executed.

Clients cache chunk locations, and may send reads to a chunkserver
that holds a stale version of a chunk. GFS would need a way to
guarantee that this cannot succeed.

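One hypothetical way to provide such a guarantee (not something GFS
actually does for client reads) would be for the client to pass along
the chunk version number it learned from the master, and for a
chunkserver holding an older version to refuse the read.

    package main

    import (
        "errors"
        "fmt"
    )

    type chunkServer struct {
        version int64 // version of the replica this server holds
    }

    func (cs *chunkServer) read(expectedVersion int64) ([]byte, error) {
        if cs.version < expectedVersion {
            return nil, errors.New("stale replica: refetch chunk locations from the master")
        }
        return []byte("chunk data"), nil
    }

    func main() {
        stale := &chunkServer{version: 3}
        _, err := stale.read(4) // client learned version 4 from the master
        fmt.Println(err)

        fresh := &chunkServer{version: 4}
        data, _ := fresh.read(4)
        fmt.Println(string(data))
    }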

