Is WAFL a Filesystem? - by Dave Hitz

Many people think WAFL is a filesystem. I certainly thought so fifteen years ago when I wrote it, but folks like Kostadis Roussos are now claiming that I was wrong. (A NetApp employee no less!)

To understand why, you have to understand how WAFL is structured. 

WAFL has a “top-half” that deals with files and folders—“My Documents” and “QuarterlyEarnings.ppt”. It keeps track of who created the file, when they created it, who can look at the file, who can modify it, and so on. This certainly sounds like what a filesystem would do. The top-half actually supports multiple filesystem protocols. We started with NFS for UNIX, but right from the beginning we knew we’d be adding more. Originally we expected that Novell Netware would be next, but Windows CIFS gained momentum so quickly that we did that instead and never got around to Netware. 

WAFL also has a “bottom-half” that manages the physical disks in the system. It organizes the disks into separately managed pools, it keeps track of which disks are part of which RAID array, and it arranges data on the disk to maximize read and write performance. The “bottom-half” also handles data management functions like snapshots, remote mirrors, cloning, de-duplication, thin provisioning, and so on. These capabilities would traditionally be part of a volume manager or a block virtualization layer.

One of the things that was unique about WAFL when we shipped it was the way we integrated the two layers. There are many opportunities for new features and clever optimization if you design the layers to work well together.

When we decided to support iSCSI and Fibre Channel SAN, WAFL’s bottom-half data management capabilities were a perfect fit. In fact, one of the things that helped convince me that NetApp should support block-based storage was realizing how valuable our data management features would be in that environment. 

This top-half/bottom-half structure explains the confusion about WAFL. My current view is that WAFL contains a filesystem, multiple filesystems actually, but that’s different from being a filesystem.

EMC SAN Architecture

This document uses the EMC Symmetrix configuration. There are a number of EMC Symmetrix configurations, but they all use the same architecture, as detailed below.
 
A host I/O passes through the following components, in order (example names shown in parentheses):
  • Front End Director Ports (SA-16b:1)
  • Front End Director (SA-16b)
  • Cache
  • Back End Director (DA-02b)
  • Back End Director Ports (DA-02b:c)
  • Disk Devices


Front End Director
A channel director (front end director) is a card that connects a host to the Symmetrix. Each card can have up to four ports.


Cache
Symmetrix cache memory buffers I/O transfers between the director channels and the storage devices. The cache is divided into regions to eliminate contention.


Back End Director
A disk director (back end director) transfers data between disk and cache. Each back-end director can have up to four interfaces (C, D, E and F), and each back-end director interface can handle seven SCSI IDs (0-6).


Disk Devices
The disk devices that are attached to the back-end directors could be either SCSI or FC-AL.



Interconnect
The Direct Matrix interconnect is a matrix of high-speed connections to all components, with bandwidth up to 64 Gb/s.



The core components of the Symmetrix DMX architecture are:
  • Channel Directors - for host communication.
  • Disk Directors - for disk communication.
  • Global Memory Directors - for I/O delivery from hosts to Disk Directors.
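
If you have EMC Solutions Enabler (SYMCLI) installed on an attached host, you can see these components on a live array. A quick sketch only - the SID below is made up and the exact flags can vary between SYMCLI versions:

symcfg list                      # list the Symmetrix arrays visible to this host
symcfg -sid 1234 list -sa all    # show the front-end (channel/SA) directors
symcfg -sid 1234 list -da all    # show the back-end (disk/DA) directors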




SAN Components

There are many components to a SAN architecture. A host can connect to a SAN via a direct connection or via a SAN switch.

Host HBA - Host bus adapter cards are used to access SAN storage systems.

SAN Cables - There are many types of cables and connectors:
  • Types: multimode (<500 m), single mode (>500 m) and copper
  • Connectors: ST, SC (1 Gb), LC (2 Gb)

SAN Switches - The primary function of a switch is to provide a physical connection and logical routing of data frames between the attached devices.
  • Supported protocols: Fibre Channel, iSCSI, FCIP, iFCP
  • Types of switch: workgroup, directors

SAN Zoning - Zoning is used to partition a Fibre Channel switched fabric into subsets of logical devices. Each zone contains a set of members that are permitted to access each other. Members are HBAs, switch ports and SAN ports. (A zoning example is sketched after this list.)
  • Types of zoning: hard, soft and mixed

Zone sets - A zone set is a group of zones that relate to one another; only one zone set can be active at any one time.

Storage arrays - A storage array is where all the disk devices are located.

Volume access control - This is also known as LUN masking. The storage array maintains a database that contains a map of the storage volumes and the WWNs that are allowed to access them. On a Symmetrix, the VCM database contains the LUN masking information.
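
As an example of zoning in practice, here is a minimal soft (WWN-based) zoning sketch using B-series (Brocade) CLI commands - the zone name, configuration name and WWNs are all made up:

zonecreate "host1_array1", "10:00:00:00:c9:12:34:56; 50:06:04:82:bc:01:02:03"   # zone an HBA WWN with an array port WWN
cfgcreate "prod_cfg", "host1_array1"    # create a zone configuration containing the zone
cfgsave                                 # save the defined configuration
cfgenable "prod_cfg"                    # activate it (only one configuration can be active at a time)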

SAN Login

The following documents the processes that occur when a Fibre Channel device is connected to a SAN.
FLOGI (fabric login)
  • What is needed: link initialization, cable, HBA and driver, switch port
  • Information passed: WWN, S_ID, protocol, class, zoning
  • Who does the communication: N_Port to F_Port
  • Where to find the information: Unix - syslog, switch utilities; Windows - Event Viewer, switch viewer

PLOGI (port login)
  • What is needed: FLOGI, zoning, persistent binding, driver setting
  • Information passed: WWN, S_ID, ULP, class, BB Credit
  • Who does the communication: N_Port to N_Port
  • Where to find the information: Unix - syslog, driver utilities; Windows - driver utilities

PRLI (process login)
  • What is needed: PLOGI, device masking (target), device mapping (initiator), driver setting (initiator)
  • Information passed: LUN
  • Who does the communication: ULP (SCSI-3 to SCSI-3)
  • Where to find the information: Unix - syslog, host-based volume management; Windows - driver utilities, host-based volume management, Device Manager
If any one of these logins fails, the host will not be able to access the disks on the SAN.
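
A rough sketch of where to look when one of these logins fails, assuming a B-series (Brocade) switch and a Unix host; the log path is an example and differs between platforms:

# On the switch:
switchshow                       # ports that have completed FLOGI show up as F-Ports with the device WWN
nsshow                           # name server entries for devices logged in to the fabric
# On the host:
grep -i fc /var/adm/messages     # HBA driver messages; FLOGI/PLOGI/PRLI failures are usually logged here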

VCM Database
The Symmetrix Volume Configuration Management (VCM) database stores access configurations that are used to grant host access to logical devices in a Symmetrix storage array.
The VCM database resides on a special system resource logical device, referred to as the VCMDB device, on each Symmetrix storage array.
Information stored in the VCM database includes, but is not limited to:
  • Host and storage World Wide Names
  • SID Lock and Volume Visibility settings
  • Native logical device data, such as the front-end directors and storage ports to which they are mapped
Masking operations performed on Symmetrix storage devices result in modifications to the VCM database in the Symmetrix array. The VCM database can be backed up, restored, initialized and activated. The Symmetrix SDM Agent must be running in order to perform VCM database operations (except deleting backup files).
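
For example, masking changes are made with the symmask command and then refreshed into the VCM database. A sketch, assuming Solutions Enabler is installed; the SID, WWN, device number and director/port are all made up:

symmask -sid 1234 -wwn 10000000c9123456 -dir 16b -p 1 add devs 01A   # allow this HBA WWN to see device 01A on director 16b port 1
symmask -sid 1234 refresh                                            # refresh the change into the VCM database
symmaskdb -sid 1234 list database                                    # display the masking entries held in the VCMDB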
Switches
There are three families of switches: M-series (McData), B-series (Brocade) and MDS-series (Cisco). Each switch offers a web interface and a CLI. The following tasks can be performed on most switches:
  • Configure network params
  • Configure fabric params (BB Credit, R_A_TOV, E_D_TOV, switch PID format, Domain ID)
  • Enable/Disable ports
  • Configure port speeds
  • Configure Zoning
BB Credit - Configures the number of buffers that are available to attached devices for frame receipt. The default is 16; values range from 1 to 16.
R_A_TOV - Resource allocation time-out value. This works with the E_D_TOV to determine switch actions when presented with an error condition.
E_D_TOV - Error detect time-out value. This timer is used to flag a potential error condition when an expected response is not received within the set time.
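
As an example of these tasks on a B-series (Brocade) switch, here is a rough sketch; the port number and speed are made up, and M-series and MDS switches use different commands (zoning itself uses the zonecreate/cfgcreate/cfgenable commands sketched earlier):

switchshow          # switch status, Domain ID and per-port state
portdisable 4       # take port 4 offline
portcfgspeed 4 2    # set port 4 to 2 Gb/s (0 = auto-negotiate)
portenable 4        # bring port 4 back online
switchdisable       # the switch must be disabled before changing fabric parameters
configure           # interactive menu for BB Credit, R_A_TOV, E_D_TOV, PID format, Domain ID
switchenable        # re-enable the switch afterwards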
Host HBA's
The table below outlines which card will work with a particular O/S:
  • Solaris - Emulex PCI (lputil), QLogic
  • HPUX - PCI-X gigabit Fibre Channel and Ethernet card
  • AIX - FC6227/6228/6239 using IBM native drivers
  • Windows - Emulex (HBAnyware or lputilnt)
  • Linux - Emulex PCI (lputil)

How do I add a disk to a VG in an active Metrocluster (CA/XP)?

# These items are as I used them on my system
# Substitute with your own values!!!!!!!!!
# HORCMINST : 200
# dev_group : p01oms
# dev_name : p01oms_00, p01oms_01, p01oms_02 etc...

# Export the HORCM instance number you want to work with

->  export HORCMINST=200

1. Find spare disk
*** Hint: use xpinfo -i and raidscan -p -fx, pvdisplay and/or strings the lvmtab file 

2. pvcreate disk 

3. Add disk to /etc/horcm<inst>.conf on local system (for instance 200 this is /etc/horcm200.conf)
(Just follow the format of the existing entries in the file) 

4. Add disk to /etc/horcm<inst>.conf on remote system 

5. Stop / start raidmanager (local and remote) 
(Note: This is to re-read the config file and does not affect the running packages or cluster) 

-> horcmshutdown.sh 200
-> horcmstart.sh 200 

6. Check using pairdisplay -g xxxxx -fx (disks will show SMPL) 
-> pairdisplay -g p01oms -fx | more 

7. paircreate the new disks 
# Syntax: paircreate -g <dev_group> -d <dev_name> -vl -f <fence level> -c <copy pace>

 -> paircreate -g p01oms -d p01oms_25 -vl -c 15 -f data
 -> paircreate -g p01oms -d p01oms_26 -vl -c 15 -f data 

8. Check the status with pairdisplay
-> pairdisplay -g p01oms -fx    # (will show COPY) 

9. vgextend the VGs on the local host 

10. Do a pairsplit 
-> pairsplit -g p01oms -rw 

11. Create a map file with vgexport -s -p on the local host and rcp it to the remote host 

12. On the remote host, vgexport, then vgimport (see the example commands sketched after step 13) 

13. Do a pairresync
 -> pairresync -g p01oms -c 15
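
Steps 9 to 12 above don't show the actual commands, so here is a rough sketch. The VG name (/dev/vg01), disk device (/dev/dsk/c10t0d1), remote host name and minor number are all made up - substitute your own values:

# Local host:
-> vgextend /dev/vg01 /dev/dsk/c10t0d1         # step 9: add the new PV to the VG
-> vgexport -p -s -m /tmp/vg01.map /dev/vg01   # step 11: preview-export and write a map file
-> rcp /tmp/vg01.map remotehost:/tmp/vg01.map
# Remote host:
-> vgexport /dev/vg01                          # step 12: remove the old VG definition
-> mkdir /dev/vg01
-> mknod /dev/vg01/group c 64 0x010000         # recreate the group file; the minor number is an example only
-> vgimport -s -m /tmp/vg01.map /dev/vg01      # re-import the VG using the map file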

Business Copy Pair Status Conditions

Pair Status 
The following figure illustrates the Business Copy pair status transitions and the relationship between the pair status and the Business Copy operations. 


Figure: Business Copy Pair Status Transitions 
1. If a volume is not assigned to a Business Copy pair, the volume's status is SMPL. 
2. Select the SMPL volumes for P-VOL and S-VOL to create a Business Copy pair. When you create a Business Copy pair, the initial copy operation starts. During the initial copy operation, the status of the P-VOL and S-VOL changes to COPY(PD). 
3. When the initial copy operation is complete, the pair status becomes PAIR. When the initial copy is completed, the differential data between the P-VOL and the S-VOL will be copied by the update copy. 
4. There are two kinds of pair status (PSUS and PSUE) when the pair is not synchronized. 
     • When you split a pair (pairsplit), the pair status changes to PSUS. During the pairsplit process, the pair status becomes COPY(SP). If you specify Quick Split pairsplit, the pair status becomes PSUS(SP) during the process. When the pairsplit operation is complete, the pair status changes to PSUS to enable you to access the split S-VOL. The update copy operation is not performed on pairs that have status PSUS. 
     • If the storage system cannot maintain PAIR status for any reason, or if you suspend the pair (pairsplit -E), the pair status changes to PSUE. 
5. When you start a pairresync operation, the pair status changes to COPY(RS) or COPY(RS-R). When the pairresync operation is complete, the pair status changes to PAIR. 
     • When you specify Reverse Copy or Quick Restore mode for a pairresync operation, the pair status changes to COPY(RS-R). 
6. When you delete a pair (pairsplit -S), the status of the volumes that were used by the pair changes to SMPL. You cannot delete a pair that has status PSUS(SP). 
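
These transitions can be watched from the host with RAID Manager. A quick sketch only - the device group name is made up and the flags may vary slightly between RAID Manager versions:

-> pairdisplay -g bcgroup -fcx            # show the current status (SMPL, COPY, PAIR, PSUS, PSUE, ...) and copy progress
-> pairevtwait -g bcgroup -s psus -t 600  # wait until the group reaches PSUS (e.g. after a pairsplit) or the timeout expires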

Business Copy Pair Status
The following table summarizes the Business Copy pair status conditions described above.

SMPL - The volume is not assigned to a Business Copy pair.
COPY(PD) - The initial copy from the P-VOL to the S-VOL is in progress.
PAIR - The initial copy is complete; differential data is copied from the P-VOL to the S-VOL by update copies.
COPY(SP) - A pairsplit operation is in progress.
PSUS(SP) - A Quick Split pairsplit operation is in progress.
PSUS - The pair is split; the S-VOL can be accessed and update copies are not performed.
PSUE - The pair is suspended because the storage system could not maintain PAIR status, or because of a pairsplit -E.
COPY(RS) - A pairresync operation is in progress.
COPY(RS-R) - A pairresync in Reverse Copy or Quick Restore mode is in progress.


Fibre Channel Internals - Flow Control

The IBM Redbook Introduction to Storage Area Networks is a nice ebook to have. I liked the book particularly for its description of SAN basics and FC internals; here is an excerpt on flow control.

Flow control
Now that we know data is sent in frames, we also need to understand that devices need to temporarily store frames as they arrive until they are assembled in sequence and then delivered to the upper layer protocol. The reason for this is that, due to the high bandwidth that Fibre Channel is capable of, it would be possible to inundate and overwhelm a target device with frames. There needs to be a mechanism to stop this from happening. The ability of a device to accept a frame is called its credit. This credit is usually referred to as the number of buffers (its buffer credit) that a node maintains for accepting incoming data. 

Buffer to buffer
During login, the N_Port and F_Port at either end of a link establish the link's buffer-to-buffer credit (BB_Credit). Each port states the maximum BB_Credit that it can offer, and the lower of the two values is used.
 
End to end
At login all N_Ports establish end to end credit (EE_Credit) with each other. During data transmission, a port should not send more frames than the buffer of the receiving port can handle before getting an indication from the receiving port that it has processed a previously sent frame.
 
Controlling the flow
Two counters are used to accomplish successful flow control: BB_Credit_CNT and EE_Credit_CNT, and both are initialized to 0 during login. Each time a port sends a frame it increments BB_Credit_CNT and EE_Credit_CNT by one. When it receives R_RDY from the adjacent port it decrements BB_Credit_CNT by one, and when it receives ACK from the destination port it decrements EE_Credit_CNT by one. If at any time BB_Credit_CNT becomes equal to the BB_Credit or EE_Credit_CNT equal to the EE_Credit of the receiving port, the transmitting port has to stop sending frames until the respective count is decremented. The previous statements are true for Class 2 service. Class 1 is a dedicated connection, so it does not need to care about BB_Credit and only EE_Credit is used (EE Flow Control). Class 3 on the other hand is an unacknowledged service, so it only uses BB_Credit (BB Flow Control), but the mechanism is the same in all cases.

Performance
Here we can see the importance that the number of buffers has in overall performance. We need enough buffers to make sure the transmitting port can continue sending frames without stopping in order to use the full bandwidth. This is particularly true with distance. At 1 Gbps a frame occupies between about 75 m and 4 km of fiber depending on the size of the data payload. In a 100 km link we could send many frames before the first one reaches its destination. We need an acknowledgement (ACK) back to start replenishing EE_Credit or a receiver ready (R_RDY) indication to replenish BB_Credit. For a moment, let us consider frames with 2 KB of data. These occupy approximately 4 km of fiber. We will be able to send about 25 frames before the first arrives at the far end of our 100 km link. We will be able to send another 25 before the first R_RDY or ACK is received, so we would need at least 50 buffers to allow for nonstop transmission at 100 km distance with frames of this size. If the frame size is reduced, more buffers would be required to allow nonstop transmission.
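
To put numbers on the example above, here is a quick back-of-the-envelope check in the shell. The figures (one ~2 KB frame occupying about 4 km of fibre at 1 Gbps, and a 100 km link) come straight from the text:

# Credits needed for nonstop transmission = frames in flight over the round trip
FRAME_KM=4      # one ~2 KB frame occupies roughly 4 km of fibre at 1 Gbps
LINK_KM=100     # link length in km
echo $(( (2 * LINK_KM) / FRAME_KM ))    # prints 50, matching the "at least 50 buffers" above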
